Time series interpolating with sktime#
Suppose we have a set of time series of different lengths, i.e. with different numbers of time points. Currently, most of sktime’s functionality requires equal-length time series, so to use sktime we first need to convert our data into equal-length time series. In this tutorial, you will learn how to use the TSInterpolator to do so.
[1]:
import random
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
Ordinary situation#
Here is the normal situation, in which all time series have the same length. We load an example data set from sktime and train a classifier.
[2]:
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[2]:
1.0
If time series are of unequal length, sktime’s algorithms may raise an error#
Now we spoil the data set a little by randomly cutting the time series. This leads to unequal-length time series. Consequently, we get an error when we attempt to train a classifier.
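Before training, it can help to confirm whether a data set really contains unequal-length series. The following is a minimal sketch, assuming the nested layout used in this tutorial, where every cell of the DataFrame holds a pd.Series. The helper name `series_lengths` and the toy data are hypothetical and not part of sktime.

```python
import pandas as pd


def series_lengths(nested_df):
    """Return a DataFrame with the length of the series stored in each cell.

    Hypothetical helper for illustration; it is not an sktime function.
    """
    return nested_df.apply(lambda column: column.map(len))


# Toy nested data: two instances, two dimensions, one shortened series.
toy = pd.DataFrame(
    {
        "dim_0": [pd.Series([1.0, 2.0, 3.0]), pd.Series([1.0, 2.0])],
        "dim_1": [pd.Series([4.0, 5.0, 6.0]), pd.Series([4.0, 5.0, 6.0])],
    }
)

lengths = series_lengths(toy)
# The panel is equal-length only if every cell reports the same length.
is_equal_length = lengths.to_numpy().min() == lengths.to_numpy().max()
print(lengths)
print("equal length:", is_equal_length)
```

Applied to X_train after a random cut, a check like this should report differing lengths, which is the condition that makes the classifier fail.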
[3]:
# randomly cut the data series in-place
def random_cut(df):
    for row_i in range(df.shape[0]):
        for dim_i in range(df.shape[1]):
            ts = df.iloc[row_i][f"dim_{dim_i}"]
            df.iloc[row_i][f"dim_{dim_i}"] = pd.Series(
                ts.tolist()[: random.randint(len(ts) - 5, len(ts) - 3)]
            )  # here is a problem
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
for df in [X_train, X_test]:
    random_cut(df)
try:
    steps = [
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=100)),
    ]
    clf = Pipeline(steps)
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
except ValueError as e:
    print(f"ValueError: {e}")
ValueError: Tabularization failed, it's possible that not all series were of equal length
/Users/mloning/.conda/envs/sktime-dev/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
Now the interpolator enters#
Now we use the interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples at the user-defined number of points.
After interpolating the data, the classifier works again.
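To make the idea concrete, here is a rough sketch of resizing a single series by linear interpolation at equidistant points. It is an illustration only, not sktime’s implementation: it uses numpy’s `np.interp` in place of scipy, and the function name `resize_series` is hypothetical.

```python
import numpy as np


def resize_series(values, length):
    """Linearly interpolate a 1D series onto `length` equidistant points.

    Hypothetical helper for illustration; TSInterpolator itself relies on scipy.
    """
    values = np.asarray(values, dtype=float)
    # Place the original and the new samples on the same [0, 1] axis.
    old_positions = np.linspace(0.0, 1.0, num=len(values))
    new_positions = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_positions, old_positions, values)


short = resize_series([0.0, 2.0, 4.0], 5)   # stretched from 3 to 5 points
long = resize_series(np.arange(97.0), 50)   # shrunk from 97 to 50 points
print(short)
print(len(long))
```

Both outputs have exactly the requested length regardless of the input length, which is what lets the downstream steps treat the data as equal-length again.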
[4]:
from sktime.transformations.panel.interpolate import TSInterpolator
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)
for df in [X_train, X_test]:
    random_cut(df)
steps = [
    ("transform", TSInterpolator(50)),
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[4]:
1.0