
Time series interpolation with sktime#

Suppose we have a set of time series with different lengths, i.e., different numbers of time points. Currently, most of sktime’s functionality requires equal-length time series, so to use sktime, we first need to convert our data into equal-length time series. In this tutorial, you will learn how to use the TSInterpolator to do so.
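Before transforming anything, it can help to check whether a panel actually contains unequal-length series. The following snippet is a minimal sketch assuming the nested DataFrame format returned by load_basic_motions (each cell holds a pd.Series); the length check itself is just an illustration, not part of sktime’s API.

from sktime.datasets import load_basic_motions

X, _ = load_basic_motions()

# length of every individual series in the nested DataFrame
lengths = X.applymap(len)

# True if every series in every column has the same length
print(lengths.stack().nunique() == 1)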

[1]:
import random

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator

Ordinary situation#

Here is a normal situation, where all time series have the same length. We load an example data set from sktime and train a classifier.

[2]:
X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[2]:
1.0

If time series are of unequal length, sktime’s algorithms may raise an error#

Now we are going to spoil the data set a little bit by randomly cutting the time series. This leads to unequal-length time series. Consequently, we get an error when attempting to train a classifier.

[3]:
# randomly cut the data series in-place


def random_cut(df):
    """Randomly truncate each series in a nested DataFrame, in place."""
    for row_i in range(df.shape[0]):
        for dim_i in range(df.shape[1]):
            ts = df.iloc[row_i][f"dim_{dim_i}"]
            df.iloc[row_i][f"dim_{dim_i}"] = pd.Series(
                ts.tolist()[: random.randint(len(ts) - 5, len(ts) - 3)]
            )  # truncating to a random length makes the series unequal in length


X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

for df in [X_train, X_test]:
    random_cut(df)

try:
    steps = [
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=100)),
    ]
    clf = Pipeline(steps)
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
except ValueError as e:
    print(f"ValueError: {e}")
ValueError: Tabularization failed, it's possible that not all series were of equal length
/Users/mloning/.conda/envs/sktime-dev/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)

Enter the interpolator#

Now we use the TSInterpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples at the user-defined number of points.
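To see what this resizing means for a single series, here is a minimal sketch of the idea using scipy.interpolate.interp1d directly. It mirrors the description above but is not the actual TSInterpolator implementation, and the series values are made up for illustration.

import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

ts = pd.Series([1.0, 3.0, 2.0, 5.0])  # a toy series of length 4

# linear interpolation over the original index ...
f = interp1d(np.arange(len(ts)), ts.to_numpy(), kind="linear")

# ... sampled at 50 equidistant points
new_index = np.linspace(0, len(ts) - 1, num=50)
ts_resized = pd.Series(f(new_index))

print(len(ts_resized))  # 50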

After interpolating the data, the classifier works again.

[4]:
from sktime.transformations.panel.interpolate import TSInterpolator

X, y = load_basic_motions()
X_train, X_test, y_train, y_test = train_test_split(X, y)

for df in [X_train, X_test]:
    random_cut(df)

steps = [
    ("transform", TSInterpolator(50)),
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
[4]:
1.0
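As a final sanity check, one can also apply the TSInterpolator outside of the pipeline and verify the resulting lengths. This is a small sketch assuming the transformer returns the same nested DataFrame format as its input; the variable X_train reuses the one from the cell above.

from sktime.transformations.panel.interpolate import TSInterpolator

Xt = TSInterpolator(50).fit_transform(X_train)

# every series should now have length 50
print(Xt.applymap(len).stack().unique())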
