Abstract: this notebook gives an introduction to sktime in-memory data containers and data sets, with associated functionality such as in-memory format validation, conversion, and data set loading.

Set-up instructions: on binder, this notebook should run out-of-the-box.

To run this notebook as intended, ensure that sktime with basic dependency requirements is installed in your python environment.

To run this notebook with a local development version of sktime, either uncomment and run the below, or pip install -e a local clone of the sktime main branch.

[1]:
# import sys
# sys.path.append("..")

In-memory data representations and data loading#

sktime provides modules for a number of time series related learning tasks.

These modules use sktime specific in-memory (i.e., python workspace) representations for time series and related objects, most importantly individual time series and time series panels. sktime’s in-memory representations rely on pandas and numpy, with additional conventions on the pandas and numpy objects.

Users of sktime should be aware of these representations, since presenting the data in an sktime compatible representation is usually the first step in using any of the sktime modules.

This notebook introduces the data types used in sktime, related functionality such as converters and validity checkers, and common workflows for loading and conversion:

Section 1 introduces in-memory data container formats used in sktime, with examples.

Section 2 introduces validity checkers and conversion functionality for in-memory data containers.

Section 3 introduces common workflows to load predefined benchmark datasets.

Section 4 showcases common workflows to load from tabular csv formats.

Section 1: in-memory data containers#

This section provides a reference to data containers used for time series and related objects in sktime.

Conceptually, sktime distinguishes:

  • the data scientific abstract data type (scitype for short) of a data container, defined by relational and statistical properties of the data being represented and common operations on it - for instance, an abstract “time series” or an abstract “time series panel”, without specifying a particular machine implementation in python

  • the machine implementation type (mtype for short) of a data container, which, for a defined scitype, specifies the python type and conventions on structure and value of the python in-memory object. For instance, a concrete (mathematical) time series is represented by a concrete pandas.DataFrame in sktime, subject to certain conventions on the pandas.DataFrame. Formally, these conventions form a specific mtype, i.e., a way to represent the (abstract) “time series” scitype.

In sktime, the same scitype can be implemented by multiple mtypes. For instance, sktime allows the user to specify time series as pandas.DataFrame, as pandas.Series, or as a numpy.ndarray. These are different mtypes which are admissible representations of the same scitype, “time series”. Also, not all mtypes are equally rich in metadata - for instance, pandas.DataFrame can store column names, while this is not possible in numpy.ndarray.

Scitypes and mtypes are encoded by strings in sktime, for easy reference.

This section introduces the mtypes for the following scitypes:

  • "Series", the sktime scitype for time series of any kind

  • "Panel", the sktime scitype for time series panels of any kind

  • "Hierarchical", the sktime scitype for hierarchical time series

Section 1.1: Time series - the "Series" scitype#

The major representations of time series in sktime are:

  • "pd.DataFrame" - a uni- or multivariate pandas.DataFrame, with rows = time points, cols = variables

  • "pd.Series" - a (univariate) pandas.Series, with entries corresponding to different time points

  • "np.ndarray" - a 2D numpy.ndarray, with rows = time points, cols = variables

pandas objects must have one of the following pandas index types: Int64Index, RangeIndex, DatetimeIndex, PeriodIndex; if DatetimeIndex, the freq attribute must be set.

numpy.ndarray 2D arrays are interpreted as having a RangeIndex on the rows, and are generally equivalent to the pandas.DataFrame obtained after default coercion using the pandas.DataFrame constructor.
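The coercion described above can be reproduced with pandas and numpy alone. The following is a minimal sketch; the variable names arr and df are chosen here for illustration and do not come from sktime.

```python
import numpy as np
import pandas as pd

# 2D array: rows = time points, columns = variables (here: one variable)
arr = np.array([[1.0], [4.0], [0.5], [-3.0]])

# default coercion via the pandas.DataFrame constructor
df = pd.DataFrame(arr)

# the rows receive a RangeIndex 0, 1, 2, 3, which plays the role of the time index
print(type(df.index).__name__)  # RangeIndex
print(list(df.index))           # [0, 1, 2, 3]
```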

[2]:
# import to retrieve examples
from sktime.datatypes import get_examples

Section 1.1.1: Time series - the "pd.DataFrame" mtype#

In the "pd.DataFrame" mtype, time series are represented by an in-memory container obj: pandas.DataFrame as follows.

  • structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.

  • variables: columns of obj correspond to different variables

  • variable names: column names obj.columns

  • time points: rows of obj correspond to different, distinct time points

  • time index: obj.index is interpreted as a time index.

  • capabilities: can represent multivariate series; can represent unequally spaced series

Example of a univariate series in "pd.DataFrame" representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.

[3]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[0]
[3]:
a
0 1.0
1 4.0
2 0.5
3 -3.0

Example of a bivariate series in "pd.DataFrame" representation. This series has two variables, named "a" and "b". Both are observed at the same four time points 0, 1, 2, 3.

[4]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[1]
[4]:
a b
0 1.0 3.000000
1 4.0 7.000000
2 0.5 2.000000
3 -3.0 -0.428571
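The examples above use a plain integer index. As a complement, here is a minimal sketch of a hand-built bivariate series with a DatetimeIndex, checking by hand two of the conventions stated earlier (monotonic index, freq attribute set). The dates and values are made up for illustration, and the manual checks are not a substitute for sktime's own validity checkers.

```python
import pandas as pd

# time index: four consecutive days; date_range sets the freq attribute
time_index = pd.date_range("2021-01-01", periods=4, freq="D")

# rows = time points, columns = variables "a" and "b"
series_df = pd.DataFrame(
    {"a": [1.0, 4.0, 0.5, -3.0], "b": [3.0, 7.0, 2.0, -0.5]},
    index=time_index,
)

# conventions checked by hand
index_is_sorted = series_df.index.is_monotonic_increasing
freq_is_set = series_df.index.freq is not None

print(index_is_sorted, freq_is_set)  # True True
```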

Section 1.1.2: Time series - the "pd.Series" mtype#

In the "pd.Series" mtype, time series are represented by an in-memory container obj: pandas.Series as follows.

  • structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.

  • variables: there is a single variable, corresponding to the values of obj. Only univariate series can be represented.

  • variable names: by default, there is no column name. If needed, a variable name can be provided as obj.name.

  • time points: entries of obj correspond to different, distinct time points

  • time index: obj.index is interpreted as a time index.

  • capabilities: cannot represent multivariate series; can represent unequally spaced series

Example of a univariate series in "pd.Series" mtype representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.

[5]:
get_examples(mtype="pd.Series", as_scitype="Series")[0]
[5]:
0    1.0
1    4.0
2    0.5
3   -3.0
Name: a, dtype: float64

Section 1.1.3: Time series - the "np.ndarray" mtype#

In the "np.ndarray" mtype, time series are represented by an in-memory container obj: np.ndarray as follows.

  • structure convention: obj must be 2D, i.e., obj.shape must have length 2. This is also true for univariate time series.

  • variables: variables correspond to columns of obj.

  • variable names: the "np.ndarray" mtype cannot represent variable names.

  • time points: the rows of obj correspond to different, distinct time points.

  • time index: The time index is implicit and by-convention. The i-th row (for an integer i) is interpreted as an observation at the time point i.

  • capabilities: can represent multivariate series; cannot represent unequally spaced series

Example of a univariate series in "np.ndarray" mtype representation. There is a single (unnamed) variable, observed at four time points 0, 1, 2, 3.

[6]:
get_examples(mtype="np.ndarray", as_scitype="Series")[0]
[6]:
array([[ 1. ],
       [ 4. ],
       [ 0.5],
       [-3. ]])

Example of a bivariate series in "np.ndarray" mtype representation. There are two (unnamed) variables, both observed at four time points 0, 1, 2, 3.

[7]:
get_examples(mtype="np.ndarray", as_scitype="Series")[1]
[7]:
array([[ 1.        ,  3.        ],
       [ 4.        ,  7.        ],
       [ 0.5       ,  2.        ],
       [-3.        , -0.42857143]])

Section 1.2: Time series panels - the "Panel" scitype#

The major representations of time series panels in sktime are:

  • "pd-multiindex" - a pandas.DataFrame, with row multi-index (instances, time), cols = variables

  • "numpy3D" - a 3D np.ndarray, with axis 0 = instances, axis 1 = variables, axis 2 = time points

  • "df-list" - a list of pandas.DataFrame, with list index = instances, data frame rows = time points, data frame cols = variables

These representations are considered primary representations in sktime and are core to internal computations.

There are further, minor representations of time series panels in sktime:

  • "nested_univ" - a pandas.DataFrame, with pandas.Series in cells. data frame rows = instances, data frame cols = variables, and series axis = time points

  • "numpyflat" - a 2D np.ndarray with rows = instances, and columns indexed by a pair index of (variables, time points). This format can only be converted to, not from, since the number of variables and time points cannot be recovered unambiguously from the flat array.

  • "pd-wide" - a pandas.DataFrame in wide format: has column multi-index (variables, time points), rows = instances; the “variables” index can be omitted for univariate time series

  • "pd-long" - a pandas.DataFrame in long format: has cols instances, timepoints, variable, value; entries in value are indexed by tuples of values in (instances, timepoints, variable).

The minor representations are currently not fully consolidated in-code and are not discussed further below. Contributions are appreciated.
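To illustrate why a flat 2D layout such as "numpyflat" cannot be converted back without extra information, here is a minimal numpy-only sketch. The column order used for flattening (variable-major) is an assumption made for illustration and is not taken from sktime's converter code.

```python
import numpy as np

# a panel in the "numpy3D" layout: 2 instances, 2 variables, 3 time points
X = np.arange(12).reshape(2, 2, 3)

# flatten variables and time points into a single column axis
# (assumption: variable-major column order, chosen here for illustration)
X_flat = X.reshape(X.shape[0], -1)

# the flat array alone does not determine the original shape:
# a panel with 3 variables and 2 time points flattens to the same array
X_other = X_flat.reshape(2, 3, 2)
same_flat = np.array_equal(X_other.reshape(2, -1), X_flat)

print(X_flat.shape)  # (2, 6)
print(same_flat)     # True
```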

Section 1.2.1: Time series panels - the "pd-multiindex" mtype#

In the "pd-multiindex" mtype, time series panels are represented by an in-memory container obj: pandas.DataFrame as follows.

  • structure convention: obj.index must be a pair multi-index of type (Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. obj.index must have two levels (can be named or not).

  • instance index: the first element of pairs in obj.index (0-th level value) is interpreted as an instance index, we call it “instance index” below.

  • instances: rows with the same “instance index” index value correspond to the same instance; rows with different “instance index” values correspond to different instances.

  • time index: the second element of pairs in obj.index (1-st level value) is interpreted as a time index, we call it “time index” below.

  • time points: rows of obj with the same “time index” value correspond to the same time point; rows of obj with different “time index” values correspond to different time points.

  • variables: columns of obj correspond to different variables

  • variable names: column names obj.columns

  • capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; cannot represent panels of series with different sets of variables.

Example of a panel of multivariate series in "pd-multiindex" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.

[8]:
get_examples(mtype="pd-multiindex", as_scitype="Panel")[0]
[8]:
                      var_0  var_1
instances timepoints
0         0               1      4
          1               2      5
          2               3      6
1         0               1      4
          1               2     55
          2               3      6
2         0               1     42
          1               2      5
          2               3      6
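A container with the same layout can also be built by hand with pandas. The sketch below reproduces the values of the example above; the names index, panel and instance_1 are illustrative.

```python
import pandas as pd

# row multi-index with levels (instances, timepoints): 3 instances x 3 time points
index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2]], names=["instances", "timepoints"]
)

# columns = variables
panel = pd.DataFrame(
    {
        "var_0": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6],
    },
    index=index,
)

# selecting one instance by its instance index yields a single multivariate series,
# indexed by time point only
instance_1 = panel.loc[1]
print(instance_1.loc[1, "var_1"])  # 55
```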

Section 1.2.2: Time series panels - the "numpy3D" mtype#

In the "numpy3D" mtype, time series panels are represented by an in-memory container obj: np.ndarray as follows.

  • structure convention: obj must be 3D, i.e., obj.shape must have length 3.

  • instances: instances correspond to axis 0 elements of obj.

  • instance index: the instance index is implicit and by-convention. The i-th element of axis 0 (for an integer i) is interpreted as indicative of observing instance i.

  • variables: variables correspond to axis 1 elements of obj.

  • variable names: the "numpy3D" mtype cannot represent variable names.

  • time points: time points correspond to axis 2 elements of obj.

  • time index: the time index is implicit and by-convention. The i-th element of axis 2 (for an integer i) is interpreted as an observation at the time point i.

  • capabilities: can represent panels of multivariate series; cannot represent unequally spaced series; cannot represent panels of unequally supported series; cannot represent panels of series with different sets of variables.

Example of a panel of multivariate series in "numpy3D" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables (unnamed). All series are observed at three time points 0, 1, 2.

[9]:
get_examples(mtype="numpy3D", as_scitype="Panel")[0]
[9]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 1,  2,  3],
        [ 4, 55,  6]],

       [[ 1,  2,  3],
        [42,  5,  6]]], dtype=int64)
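The axis convention can be checked directly with numpy indexing. The following minimal sketch rebuilds the array above by hand; variable names are illustrative.

```python
import numpy as np

# axis 0 = instances, axis 1 = variables, axis 2 = time points
X = np.array(
    [
        [[1, 2, 3], [4, 5, 6]],
        [[1, 2, 3], [4, 55, 6]],
        [[1, 2, 3], [42, 5, 6]],
    ]
)

n_instances, n_variables, n_timepoints = X.shape
print(n_instances, n_variables, n_timepoints)  # 3 2 3

# value of variable 1 for instance 1 at time point 1
value = X[1, 1, 1]
print(value)  # 55

# one instance as a 2D array with rows = time points, cols = variables,
# i.e., the layout of the "np.ndarray" mtype for a single series
series_1 = X[1].T
print(series_1.shape)  # (3, 2)
```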

Section 1.2.3: Time series panels - the "df-list" mtype#

In the "df-list" mtype, time series panels are represented by an in-memory container obj: List[pandas.DataFrame] as follows.

  • structure convention: obj must be a list of pandas.DataFrames. Individual list elements of obj must follow the "pd.DataFrame" mtype convention for the "Series" scitype.

  • instances: instances correspond to different list elements of obj.

  • instance index: the instance index of an instance is the list index at which it is located in obj. That is, the data at obj[i] correspond to observations of the instance with index i.

  • time points: rows of obj[i] correspond to different, distinct time points, at which instance i is observed.

  • time index: obj[i].index is interpreted as the time index for instance i.

  • variables: columns of obj[i] correspond to different variables available for instance i.

  • variable names: column names obj[i].columns are the names of variables available for instance i.

  • capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; can represent panels of series with different sets of variables.

Example of a panel of multivariate series in "df-list" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.

[10]:
get_examples(mtype="df-list", as_scitype="Panel")[0]
[10]:
[   var_0  var_1
 0      1      4
 1      2      5
 2      3      6,
    var_0  var_1
 0      1      4
 1      2     55
 2      3      6,
    var_0  var_1
 0      1     42
 1      2      5
 2      3      6]
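Since each list element carries its own index, a "df-list" panel can hold series of different lengths. The sketch below builds such a list by hand and stacks it into a two-level row index with plain pandas. This manual stacking is only an illustration of how the two layouts relate, not sktime's converter.

```python
import pandas as pd

# two instances observed at different numbers of time points
df_list = [
    pd.DataFrame({"var_0": [1, 2, 3], "var_1": [4, 5, 6]}),
    pd.DataFrame({"var_0": [1, 2], "var_1": [4, 55]}),
]

# stacking by hand: the list position becomes the instance index
stacked = pd.concat(df_list, keys=list(range(len(df_list))))
stacked.index.names = ["instances", "timepoints"]

print(stacked.shape)                 # (5, 2)
print(stacked.loc[(1, 1), "var_1"])  # 55
```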

Section 1.3: Hierarchical time series - the "Hierarchical" scitype#

There is currently only one representation for hierarchical time series in sktime:

  • "pd_multiindex_hier" - a pandas.DataFrame, with row multi-index, last level interpreted as time, others as hierarchy, cols = variables

Hierarchical time series - the "pd_multiindex_hier" mtype#

  • structure convention: obj.index must be a 3 or more level multi-index of type (Index, ..., Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. We call the last index the “time-like” index.

  • hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combinations correspond to different hierarchy units.

  • hierarchy: the non-time-like indices in obj.index are interpreted as a hierarchy identifying index.

  • time index: the last element of tuples in obj.index is interpreted as a time index.

  • time points: rows of obj with the same "timepoints" index value correspond to the same time point; rows of obj with different "timepoints" index values correspond to different time points.

  • variables: columns of obj correspond to different variables

  • variable names: column names obj.columns

  • capabilities: can represent hierarchical series; can represent unequally spaced series; can represent unequally supported hierarchical series; cannot represent hierarchical series with different sets of variables.

[11]:
get_examples(mtype="pd_multiindex_hier", as_scitype="Hierarchical")[0]
[11]:
                    var_0  var_1
foo bar timepoints
a   0   0               1      4
        1               2      5
        2               3      6
    1   0               1      4
        1               2     55
        2               3      6
    2   0               1     42
        1               2      5
        2               3      6
b   0   0               1      4
        1               2      5
        2               3      6
    1   0               1      4
        1               2     55
        2               3      6
    2   0               1     42
        1               2      5
        2               3      6
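A frame with this index structure can be assembled by hand. The sketch below uses the level names from the example above with simplified, made-up values; it also shows that fixing the top hierarchy level leaves a two-level (instance, time) index, i.e., the layout of a panel.

```python
import pandas as pd

# three index levels: two hierarchy levels ("foo", "bar") and the time-like level
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2], [0, 1, 2]], names=["foo", "bar", "timepoints"]
)

# illustrative values, repeated for each of the 6 hierarchy units
hier = pd.DataFrame(
    {"var_0": [1, 2, 3] * 6, "var_1": [4, 5, 6] * 6},
    index=index,
)

# fixing the top level leaves the remaining levels (bar, timepoints)
panel_a = hier.loc["a"]
print(hier.index.nlevels)         # 3
print(list(panel_a.index.names))  # ['bar', 'timepoints']
```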

Section 2: validity checking and mtype conversion#

sktime’s datatypes module provides users with generic functionality for:

  • checking in-memory containers against mtype conventions, with informative error messages that help move data to the right format

  • converting different mtypes to each other, for a given scitype

In this section, this functionality and intended usage workflows are presented.

Section 2.1: Preparing data, checking in-memory containers for validity#

sktime’s datatypes module provides convenient functionality for users to check the validity of their in-memory data containers, using the check_is_mtype and check_raise functions. Both functions provide generic validity checking: check_is_mtype returns metadata and potential issues as return arguments, while check_raise raises an informative error if a container does not comply with a given mtype.

A recommended notebook workflow to ensure that a given data container is compliant with sktime mtype specification is as follows:

  1. load the data in an in-memory data container

  2. identify the scitype, e.g., is this supposed to be a time series (Series) or a panel of time series (Panel)

  3. select the target mtype (see Section 1 for a list), and attempt to manually reformat the data to comply with the mtype specification if it is not already compliant

  4. run check_raise on the data container, to check whether it complies with the mtype and scitype

  5. if an error is raised, repeat 3 and 4 until no error is raised

Section 2.1.1: validity checking, example 1 (simple mistake)#

Suppose we have the following numpy.ndarray representing a univariate time series:

[12]:
import numpy as np

y = np.array([1, 6, 3, 7, 2])

To check compatibility with sktime:

(instruction: uncomment and run the code to see the informative error message)

[13]:
from sktime.datatypes import check_raise

# check_raise(y, mtype="np.ndarray")

This tells us that sktime uses 2D numpy arrays for time series if the np.ndarray mtype is used. While most methods provide convenience functionality to do this coercion automatically, the “correct” format would be 2D, as follows:

[14]:
check_raise(y.reshape(-1, 1), mtype="np.ndarray")
[14]:
True
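The reshape step itself is plain numpy. As a minimal sketch, the following shows the shapes before and after, and how a second variable (the made-up array z) would be added as a further column.

```python
import numpy as np

y = np.array([1, 6, 3, 7, 2])

# -1 lets numpy infer the number of rows; 1 is the number of variables
y_2d = y.reshape(-1, 1)
print(y.shape)     # (5,)
print(y_2d.shape)  # (5, 1)

# two variables observed at the same time points: stack them as columns
z = np.array([3, 1, 4, 1, 5])
yz_2d = np.column_stack([y, z])
print(yz_2d.shape)  # (5, 2)
```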

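For use in your own code, or to obtain additional metadata, use the check_is_mtype function. With return_metadata=True, it returns a triple: a validity flag, an error message (None if the check passes), and a metadata dictionary. In the output recorded below, the 1D array y is reported as valid, so the message slot is None: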

[15]:
from sktime.datatypes import check_is_mtype

check_is_mtype(y, mtype="np.ndarray", return_metadata=True)
[15]:
(True,
 None,
 {'is_empty': False,
  'is_univariate': True,
  'is_equally_spaced': True,
  'has_nans': False,
  'mtype': 'np.ndarray',
  'scitype': 'Series'})

Metadata is likewise produced for the 2D array, which passes the validity check:

[16]:
check_is_mtype(y.reshape(-1, 1), mtype="np.ndarray", return_metadata=True)
[16]:
(True,
 None,
 {'is_empty': False,
  'is_univariate': True,
  'is_equally_spaced': True,
  'has_nans': False,
  'mtype': 'np.ndarray',
  'scitype': 'Series'})

Note: if the name of the mtype is ambiguous and can refer to multiple scitypes, the additional argument scitype must be provided. This should not be the case for any common in-memory containers; we mention it for completeness.

[17]:
check_is_mtype(y, mtype="np.ndarray", scitype="Series")
[17]:
True

Section 2.1.2: validity checking, example 2 (non-obvious mistake)#

Suppose we have converted our data into a multi-index panel, i.e., we want to have a Panel of mtype pd-multiindex.

[18]:
import pandas as pd

cols = ["instances", "time points"] + [f"var_{i}" for i in range(2)]
X = pd.concat(
    [
        pd.DataFrame([[0, 0, 1, 4], [0, 1, 2, 5], [0, 2, 3, 6]], columns=cols),
        pd.DataFrame([[1, 0, 1, 4], [1, 1, 2, 55], [1, 2, 3, 6]], columns=cols),
        pd.DataFrame([[2, 0, 1, 42], [2, 1, 2, 5], [2, 2, 3, 6]], columns=cols),
    ]
).set_index(["instances", "time points"])

It is not obvious whether X satisfies the pd-multiindex specification, so let’s check:

(instruction: uncomment and run the code to see the informative error message)

[19]:
from sktime.datatypes import check_raise

# check_raise(X, mtype="pd-multiindex")

The informative error message highlights a typo in the name of one of the multi-index levels ("time points" instead of "timepoints"), so we rename the levels:

[20]:
X.index.names = ["instances", "timepoints"]

Now the validity check passes:

[21]:
check_raise(X, mtype="pd-multiindex")
[21]:
True

Section 2.1.3: inferring the mtype#

sktime also provides functionality to infer the mtype of an in-memory data container. This is useful if you are sure that the container is compliant but have forgotten the exact mtype string, or if you would like to know whether a container is already in some supported, compliant format. For this, only the scitype needs to be specified:

[22]:
from sktime.datatypes import mtype

mtype(X, as_scitype="Panel")
[22]:
'pd-multiindex'

Section 2.2: conversion between mtypes#

sktime’s datatypes module also offers unified conversion functionality between mtypes. This is useful for users as well as for method developers.

The convert function requires both the mtype to convert from and the mtype to convert to. The convert_to function requires only the mtype to convert to, and infers the mtype of the input automatically if it can be inferred. convert_to should be used if the input can have multiple mtypes.

Section 2.2.1: simple conversion#

Example: converting a numpy3D panel of time series to pd-multiindex mtype:

[23]:
from sktime.datatypes import get_examples

X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[23]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 1,  2,  3],
        [ 4, 55,  6]],

       [[ 1,  2,  3],
        [42,  5,  6]]], dtype=int64)
[24]:
from sktime.datatypes import convert

convert(X, from_type="numpy3D", to_type="pd-multiindex")
[24]:
                      var_0  var_1
instances timepoints
0         0               1      4
          1               2      5
          2               3      6
1         0               1      4
          1               2     55
          2               3      6
2         0               1     42
          1               2      5
          2               3      6
[25]:
from sktime.datatypes import convert_to

convert_to(X, to_type="pd-multiindex")
[25]:
                      var_0  var_1
instances timepoints
0         0               1      4
          1               2      5
          2               3      6
1         0               1      4
          1               2     55
          2               3      6
2         0               1     42
          1               2      5
          2               3      6

Section 2.2.2: advanced conversion features#

convert_to also allows multiple output types to be specified: the to_type argument can be a list of mtypes. In that case, the input is passed through unchanged if its mtype is on the list; if the mtype of the input is not on the list, it is converted to the mtype that is the first element of the list.

Example: converting a panel of time series to either "pd-multiindex" or "numpy3D". If the input is "numpy3D", it remains unchanged. If the input is "df-list", it is converted to "pd-multiindex".

[26]:
from sktime.datatypes import get_examples

X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[26]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 1,  2,  3],
        [ 4, 55,  6]],

       [[ 1,  2,  3],
        [42,  5,  6]]], dtype=int64)
[27]:
from sktime.datatypes import convert_to

convert_to(X, to_type=["pd-multiindex", "numpy3D"])
[27]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 1,  2,  3],
        [ 4, 55,  6]],

       [[ 1,  2,  3],
        [42,  5,  6]]], dtype=int64)
[28]:
X = get_examples(mtype="df-list", as_scitype="Panel")[0]
X
[28]:
[   var_0  var_1
 0      1      4
 1      2      5
 2      3      6,
    var_0  var_1
 0      1      4
 1      2     55
 2      3      6,
    var_0  var_1
 0      1     42
 1      2      5
 2      3      6]
[29]:
convert_to(X, to_type=["pd-multiindex", "numpy3D"])
[29]:
                      var_0  var_1
instances timepoints
0         0               1      4
          1               2      5
          2               3      6
1         0               1      4
          1               2     55
          2               3      6
2         0               1     42
          1               2      5
          2               3      6
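The selection rule for a list-valued to_type can be written down in a few lines. The helper resolve_target_mtype below is hypothetical and not part of sktime; it only mirrors the rule as described in the text of this section, as a sketch.

```python
def resolve_target_mtype(input_mtype, to_type):
    """Hypothetical helper (not part of sktime): pick the output mtype.

    Mirrors the rule described in the text: pass through if the input mtype
    is on the list, otherwise convert to the first element of the list.
    """
    if isinstance(to_type, str):
        # a single target mtype: always convert to it
        return to_type
    if input_mtype in to_type:
        # input is already in an accepted mtype: leave unchanged
        return input_mtype
    # otherwise, the first list element is the conversion target
    return to_type[0]


targets = ["pd-multiindex", "numpy3D"]
print(resolve_target_mtype("numpy3D", targets))  # numpy3D
print(resolve_target_mtype("df-list", targets))  # pd-multiindex
```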

Section 2.2.3: inspecting implemented conversions#

Currently, conversions are a work in progress, and not all possible conversions are available - contributions are welcome. To see which conversions are currently implemented for a scitype, use the _conversions_defined developer method from the datatypes._convert module. This produces a table with a “1” if conversion from the mtype in the row to the mtype in the column is implemented.

[30]:
from sktime.datatypes._convert import _conversions_defined

_conversions_defined(scitype="Panel")
[30]:
               dask_panel  df-list  nested_univ  numpy3D  numpyflat  pd-long  pd-multiindex  pd-wide
dask_panel              1        1            1        1          1        1              1        0
df-list                 1        1            1        1          1        1              1        0
nested_univ             1        1            1        1          1        1              1        1
numpy3D                 1        1            1        1          1        1              1        0
numpyflat               1        1            1        1          1        1              1        0
pd-long                 1        1            1        1          1        1              1        0
pd-multiindex           1        1            1        1          1        1              1        0
pd-wide                 0        0            1        0          0        0              0        1

Section 3: loading pre-defined data sets#

sktime’s datasets module allows loading datasets for testing and benchmarking. This includes:

  • example data sets that ship directly with sktime

  • downloaders for data sets from common repositories

All data retrieved in this way are in sktime compatible in-memory and/or file formats.

Currently, no systematic tagging and registry retrieval for the available data sets is implemented - contributions to this would be very welcome.

Section 3.1: forecasting data sets#

sktime’s datasets module currently allows loading the following forecasting example data sets:

dataset name                          loader function       properties
Box/Jenkins airline data              load_airline          univariate
Lynx sales data                       load_lynx             univariate
Shampoo sales data                    load_shampoo_sales    univariate
Pharmaceutical Benefit Scheme data    load_PBS_dataset      univariate
Longley US macroeconomic data         load_longley          multivariate
MTS consumption/income data           load_uschange         multivariate

sktime currently has no connectors to forecasting data repositories - contributions are much appreciated.

Forecasting data sets are all of Series scitype; they can be univariate or multivariate.

Loaders for univariate data have no arguments, and always return the data in the "pd.Series" mtype:

[31]:
from sktime.datasets import load_airline

load_airline()
[31]:
Period
1949-01    112.0
1949-02    118.0
1949-03    132.0
1949-04    129.0
1949-05    121.0
           ...
1960-08    606.0
1960-09    508.0
1960-10    461.0
1960-11    390.0
1960-12    432.0
Freq: M, Name: Number of airline passengers, Length: 144, dtype: float64

Loaders for multivariate data can be called in two ways:

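  • without an argument; in the output recorded below, this call returns a pair of series y, X, with the "TOTEMP" variable as y of "pd.Series" mtype and the remaining variables as X of "pd.DataFrame" mtype: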

[32]:
from sktime.datasets import load_longley

load_longley()
[32]:
(Period
 1947    60323.0
 1948    61122.0
 1949    60171.0
 1950    61187.0
 1951    63221.0
 1952    63639.0
 1953    64989.0
 1954    63761.0
 1955    66019.0
 1956    67857.0
 1957    68169.0
 1958    66513.0
 1959    68655.0
 1960    69564.0
 1961    69331.0
 1962    70551.0
 Freq: A-DEC, Name: TOTEMP, dtype: float64,
         GNPDEFL       GNP   UNEMP   ARMED       POP
 Period
 1947       83.0  234289.0  2356.0  1590.0  107608.0
 1948       88.5  259426.0  2325.0  1456.0  108632.0
 1949       88.2  258054.0  3682.0  1616.0  109773.0
 1950       89.5  284599.0  3351.0  1650.0  110929.0
 1951       96.2  328975.0  2099.0  3099.0  112075.0
 1952       98.1  346999.0  1932.0  3594.0  113270.0
 1953       99.0  365385.0  1870.0  3547.0  115094.0
 1954      100.0  363112.0  3578.0  3350.0  116219.0
 1955      101.2  397469.0  2904.0  3048.0  117388.0
 1956      104.6  419180.0  2822.0  2857.0  118734.0
 1957      108.4  442769.0  2936.0  2798.0  120445.0
 1958      110.8  444546.0  4681.0  2637.0  121950.0
 1959      112.6  482704.0  3813.0  2552.0  123366.0
 1960      114.2  502601.0  3931.0  2514.0  125368.0
 1961      115.7  518173.0  4806.0  2572.0  127852.0
 1962      116.9  554894.0  4007.0  2827.0  130081.0)
  • with an argument y_name that must coincide with one of the column/variable names, in which case a pair of series y, X is returned, with y of "pd.Series" mtype, and X of "pd.DataFrame" mtype - this is convenient for univariate forecasting with exogenous variables.

[33]:
y, X = load_longley(y_name="TOTEMP")
[34]:
y
[34]:
Period
1947    60323.0
1948    61122.0
1949    60171.0
1950    61187.0
1951    63221.0
1952    63639.0
1953    64989.0
1954    63761.0
1955    66019.0
1956    67857.0
1957    68169.0
1958    66513.0
1959    68655.0
1960    69564.0
1961    69331.0
1962    70551.0
Freq: A-DEC, Name: TOTEMP, dtype: float64
[35]:
X
[35]:
        GNPDEFL       GNP   UNEMP   ARMED       POP
Period
1947       83.0  234289.0  2356.0  1590.0  107608.0
1948       88.5  259426.0  2325.0  1456.0  108632.0
1949       88.2  258054.0  3682.0  1616.0  109773.0
1950       89.5  284599.0  3351.0  1650.0  110929.0
1951       96.2  328975.0  2099.0  3099.0  112075.0
1952       98.1  346999.0  1932.0  3594.0  113270.0
1953       99.0  365385.0  1870.0  3547.0  115094.0
1954      100.0  363112.0  3578.0  3350.0  116219.0
1955      101.2  397469.0  2904.0  3048.0  117388.0
1956      104.6  419180.0  2822.0  2857.0  118734.0
1957      108.4  442769.0  2936.0  2798.0  120445.0
1958      110.8  444546.0  4681.0  2637.0  121950.0
1959      112.6  482704.0  3813.0  2552.0  123366.0
1960      114.2  502601.0  3931.0  2514.0  125368.0
1961      115.7  518173.0  4806.0  2572.0  127852.0
1962      116.9  554894.0  4007.0  2827.0  130081.0
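Conceptually, the y, X pair is a column split of one multivariate series. The sketch below reproduces such a split with plain pandas on the first three rows and three of the columns shown above; it illustrates the resulting containers and is not the loader's actual implementation.

```python
import pandas as pd

# yearly PeriodIndex, as in the loader output above (first three years only)
index = pd.period_range("1947", periods=3, freq="Y")

data = pd.DataFrame(
    {
        "TOTEMP": [60323.0, 61122.0, 60171.0],
        "GNP": [234289.0, 259426.0, 258054.0],
        "UNEMP": [2356.0, 2325.0, 3682.0],
    },
    index=index,
)

y_name = "TOTEMP"
y = data[y_name]               # univariate target, a pandas.Series
X = data.drop(columns=y_name)  # remaining variables, a pandas.DataFrame

print(type(y).__name__, type(X).__name__)  # Series DataFrame
print(list(X.columns))                     # ['GNP', 'UNEMP']
```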

Section 3.2: time series classification data sets#

sktime’s datasets module currently allows loading the following time series classification example data sets:

dataset name                        loader function            properties
Appliance power consumption data    load_acsf1                 univariate, equal length/index
Arrowhead shape data                load_arrow_head            univariate, equal length/index
Gunpoint motion data                load_gunpoint              univariate, equal length/index
Italy power demand data             load_italy_power_demand    univariate, equal length/index
Japanese vowels data                load_japanese_vowels       univariate, equal length/index
OSUleaf leaf shape data             load_osuleaf               univariate, equal length/index
Basic motions data                  load_basic_motions         multivariate, equal length/index

Currently, there are no unequal length or unequal index time series classification example data directly in sktime.

sktime also provides a full interface to the UCR/UEA time series data set archive, via the load_UCR_UEA_dataset function. The UCR/UEA archive also contains time series classification data sets which are multivariate, or unequal length/index (in either combination).

Section 3.2.2: time series classification data sets in sktime#

Time series classification data sets consist of a panel of time series of Panel scitype, together with classification labels, one per time series.

If a loader is invoked with minimal arguments, the data are returned as "nested_univ" mtype, with labels and series to classify in the same pd.DataFrame. Using the return_X_y=True argument, the data are returned separated into features X and labels y, with X a Panel of nested_univ mtype, and y an sklearn-compatible numpy vector of labels:

[36]:
from sktime.datasets import load_arrow_head

X, y = load_arrow_head(return_X_y=True)
[37]:
X
[37]:
dim_0
0 0 -1.963009 1 -1.957825 2 -1.95614...
1 0 -1.774571 1 -1.774036 2 -1.77658...
2 0 -1.866021 1 -1.841991 2 -1.83502...
3 0 -2.073758 1 -2.073301 2 -2.04460...
4 0 -1.746255 1 -1.741263 2 -1.72274...
... ...
206 0 -1.625142 1 -1.622988 2 -1.62606...
207 0 -1.657757 1 -1.664673 2 -1.63264...
208 0 -1.603279 1 -1.587365 2 -1.57740...
209 0 -1.739020 1 -1.741534 2 -1.73286...
210 0 -1.630727 1 -1.629918 2 -1.62055...

211 rows × 1 columns

[38]:
y
[38]:
array(['0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0',
       '1', '2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '1',
       '2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
       '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
       '2', '2', '2'], dtype='<U1')

The panel can be converted from "nested_univ" mtype to other mtype formats, using datatypes.convert or convert_to (see above):

[39]:
from sktime.datatypes import convert_to

convert_to(X, to_type="pd-multiindex")
[39]:
dim_0
timepoints
0 0 -1.963009
1 -1.957825
2 -1.956145
3 -1.938289
4 -1.896657
... ... ...
210 246 -1.513637
247 -1.550431
248 -1.581576
249 -1.595273
250 -1.620783

52961 rows × 1 columns
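The row count above is the number of instances times the series length (211 × 251 = 52961). As a rough illustration of what this flattening does, here is a minimal pandas-only sketch on toy data. It is not sktime's conversion code, and the dictionary `cells` is a hypothetical stand-in for the series stored in a nested column:

```python
import pandas as pd

# Hypothetical stand-in for a nested column: one pd.Series per instance.
cells = {
    0: pd.Series([1.0, 2.0, 3.0]),
    1: pd.Series([4.0, 5.0, 6.0]),
}

# Stack the per-instance series; the dict keys become the outer (instance)
# index level and each series' own index becomes the inner (time) level.
long = pd.concat(cells, names=["instance", "timepoints"]).to_frame("dim_0")

print(long.shape)  # (6, 1): 2 instances x 3 time points, 1 variable
```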

Data set loaders can be invoked with the split parameter to obtain reproducible training and test sets for comparison across studies. If split="train", a pre-defined training set is retrieved; if split="test", a pre-defined test set is retrieved.

[40]:
X_train, y_train = load_arrow_head(return_X_y=True, split="train")
X_test, y_test = load_arrow_head(return_X_y=True, split="test")
# this retrieves training and test X/y for reproducible use in studies

Section 3.2.3: time series classification data sets from the UCR/UEA time series classification repository#

The load_UCR_UEA_dataset utility will download datasets from the UCR/UEA time series classification repository and make them available as in-memory datasets, with the same syntax as sktime native data set loaders.

Datasets are indexed by unique string identifiers, which can be inspected on the repository itself, or via the register in the datasets.tsc_dataset_names module, by property:

[41]:
from sktime.datasets.tsc_dataset_names import univariate

The registers in this module, such as the one imported above, are all lists of strings which contain the unique string identifiers of datasets with certain properties, as follows:

| register name | uni-/multivariate | equal/unequal length | with/without missing values |
|---|---|---|---|
| univariate | only univariate | both included | both included |
| multivariate | only multivariate | both included | both included |
| univariate_equal_length | only univariate | only equal length | both included |
| univariate_variable_length | only univariate | only unequal length | both included |
| univariate_missing_values | only univariate | both included | only with missing values |
| multivariate_equal_length | only multivariate | only equal length | both included |
| multivariate_unequal_length | only multivariate | only unequal length | both included |

Lookup and retrieval using these lists is, admittedly, a bit inconvenient - contributions to sktime of lookup functions similar to all_estimators or all_tags, based on capability or property tags attached to datasets, would be very much appreciated.
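Until such a lookup exists, a small helper can answer "which registers list this dataset?". The sketch below is purely illustrative: find_registers is a hypothetical function, not part of sktime, and the short lists are stand-ins for the real registers that would be imported from sktime.datasets.tsc_dataset_names.

```python
# Stand-in registers for illustration only; in practice these would be the
# lists imported from sktime.datasets.tsc_dataset_names.
registers = {
    "univariate": ["ArrowHead", "Beef", "Coffee"],
    "univariate_equal_length": ["ArrowHead", "Beef", "Coffee"],
    "multivariate": ["BasicMotions"],
}


def find_registers(name, registers):
    """Return the sorted names of all registers that list the given dataset."""
    return sorted(reg for reg, names in registers.items() if name in names)


found = find_registers("ArrowHead", registers)
print(found)  # ['univariate', 'univariate_equal_length']
```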

An example list is displayed below:

[42]:
univariate
[42]:
['ACSF1',
 'Adiac',
 'AllGestureWiimoteX',
 'AllGestureWiimoteY',
 'AllGestureWiimoteZ',
 'ArrowHead',
 'AsphaltObstacles',
 'Beef',
 'BeetleFly',
 'BirdChicken',
 'BME',
 'Car',
 'CBF',
 'Chinatown',
 'ChlorineConcentration',
 'CinCECGTorso',
 'Coffee',
 'Computers',
 'CricketX',
 'CricketY',
 'CricketZ',
 'Crop',
 'DiatomSizeReduction',
 'DistalPhalanxOutlineCorrect',
 'DistalPhalanxOutlineAgeGroup',
 'DistalPhalanxTW',
 'DodgerLoopDay',
 'DodgerLoopGame',
 'DodgerLoopWeekend',
 'Earthquakes',
 'ECG200',
 'ECG5000',
 'ECGFiveDays',
 'ElectricDevices',
 'EOGHorizontalSignal',
 'EOGVerticalSignal',
 'EthanolLevel',
 'FaceAll',
 'FaceFour',
 'FacesUCR',
 'FiftyWords',
 'Fish',
 'FordA',
 'FordB',
 'FreezerRegularTrain',
 'FreezerSmallTrain',
 'Fungi',
 'GestureMidAirD1',
 'GestureMidAirD2',
 'GestureMidAirD3',
 'GesturePebbleZ1',
 'GesturePebbleZ2',
 'GunPoint',
 'GunPointAgeSpan',
 'GunPointMaleVersusFemale',
 'GunPointOldVersusYoung',
 'Ham',
 'HandOutlines',
 'Haptics',
 'Herring',
 'HouseTwenty',
 'InlineSkate',
 'InsectEPGRegularTrain',
 'InsectEPGSmallTrain',
 'InsectWingbeatSound',
 'ItalyPowerDemand',
 'LargeKitchenAppliances',
 'Lightning2',
 'Lightning7',
 'Mallat',
 'Meat',
 'MedicalImages',
 'MelbournePedestrian',
 'MiddlePhalanxOutlineCorrect',
 'MiddlePhalanxOutlineAgeGroup',
 'MiddlePhalanxTW',
 'MixedShapesRegularTrain',
 'MixedShapesSmallTrain',
 'MoteStrain',
 'NonInvasiveFetalECGThorax1',
 'NonInvasiveFetalECGThorax2',
 'OliveOil',
 'OSULeaf',
 'PhalangesOutlinesCorrect',
 'Phoneme',
 'PickupGestureWiimoteZ',
 'PigAirwayPressure',
 'PigArtPressure',
 'PigCVP',
 'PLAID',
 'Plane',
 'PowerCons',
 'ProximalPhalanxOutlineCorrect',
 'ProximalPhalanxOutlineAgeGroup',
 'ProximalPhalanxTW',
 'RefrigerationDevices',
 'Rock',
 'ScreenType',
 'SemgHandGenderCh2',
 'SemgHandMovementCh2',
 'SemgHandSubjectCh2',
 'ShakeGestureWiimoteZ',
 'ShapeletSim',
 'ShapesAll',
 'SmallKitchenAppliances',
 'SmoothSubspace',
 'SonyAIBORobotSurface1',
 'SonyAIBORobotSurface2',
 'StarLightCurves',
 'Strawberry',
 'SwedishLeaf',
 'Symbols',
 'SyntheticControl',
 'ToeSegmentation1',
 'ToeSegmentation2',
 'Trace',
 'TwoLeadECG',
 'TwoPatterns',
 'UMD',
 'UWaveGestureLibraryAll',
 'UWaveGestureLibraryX',
 'UWaveGestureLibraryY',
 'UWaveGestureLibraryZ',
 'Wafer',
 'Wine',
 'WordSynonyms',
 'Worms',
 'WormsTwoClass',
 'IOError']

The loader function load_UCR_UEA_dataset behaves exactly like the sktime data loaders, with an additional argument name that should be set to one of the unique identifying strings for the UCR/UEA datasets, for instance:

[43]:
from sktime.datasets import load_UCR_UEA_dataset

X, y = load_UCR_UEA_dataset(name="Yoga", return_X_y=True)

This will download the dataset into a local directory (by default: for a local clone, the datasets/data directory in the local repository; for a release install, in the local python environment folder). To change that directory, specify it using the extract_path argument of the load_UCR_UEA_dataset function.

Section 4: loading data from csv files#

This section shows how to load some common tabular csv formats into sktime compatible containers.

We’ll cover:

  • converting series datasets to sktime compatible containers

  • converting panel datasets to sktime compatible containers

We assume that all csv files are in some tabular format.

This means that the csv file either contains columns for the time index (and for the instance index, for panel data), or is in a wide tabular format.

Note: at every step, we could use check_is_mtype to check against the target format. A reader may like to do so.

Section 4.1: simple time series example#

[44]:
import pandas as pd

df_series = pd.read_csv("../sktime/datasets/data/Airline/Airline.csv")
df_series.head()
[44]:
Date Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
[45]:
df_series = df_series.set_index(
    "Date"
).squeeze()  # replace "Date" with the column name of the time index
df_series.index = pd.DatetimeIndex(df_series.index)
[46]:
mtype(df_series, as_scitype="Series")
[46]:
'pd.Series'
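The file path above only resolves inside a local clone of the repository. The same steps can be tried without any file, using an in-memory csv string. This is a sketch on three rows copied from the output above, with the date format "%Y-%m" assumed from how the Date column is displayed:

```python
import io

import pandas as pd

csv_text = "Date,Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

# read the csv, move the time column to the index, reduce to a pd.Series
y = pd.read_csv(io.StringIO(csv_text)).set_index("Date").squeeze("columns")

# parse the year-month strings into a DatetimeIndex
y.index = pd.to_datetime(y.index, format="%Y-%m")

print(type(y.index).__name__)  # DatetimeIndex
```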

Section 4.2: easy panel data example#

[47]:
# mimicking a scenario where we already have a csv file in the right format
from sktime.datasets import load_arrow_head
from sktime.datatypes import convert_to

df_panel = load_arrow_head(split="TRAIN")[0]
df_panel = convert_to(df_panel, "pd-multiindex").reset_index()
# imagine this is the result of df_panel = pd.read_csv
[48]:
df_panel.head()
[48]:
level_0 timepoints dim_0
0 0 0 -1.963009
1 0 1 -1.957825
2 0 2 -1.956145
3 0 3 -1.938289
4 0 4 -1.896657

This is similar to the pd-multiindex format, so we try to convert it to that.

The one thing that we need to change is to set the instance and time columns as the index:

[49]:
df_panel = df_panel.set_index(["level_0", "timepoints"])  # instance level first, then time
type(df_panel.index)
[49]:
pandas.core.indexes.multi.MultiIndex

This is now recognized by sktime as being in pd-multiindex format:

[50]:
mtype(df_panel, as_scitype="Panel")
# in general:
# replace "timepoints" with the time index column name
# replace "level_0" with the higher level column name of your file
[50]:
'pd-multiindex'
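For a csv file that is already in long format, the whole procedure reduces to a read followed by set_index. The sketch below uses a small in-memory csv with hypothetical column names "instance" and "time", and assumes the convention that the instance level comes first in the row index and the time level second:

```python
import io

import pandas as pd

csv_text = (
    "instance,time,dim_0\n"
    "0,0,-1.96\n"
    "0,1,-1.95\n"
    "1,0,-1.77\n"
    "1,1,-1.77\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# instance level first, time level second; sort so rows are grouped by instance
df = df.set_index(["instance", "time"]).sort_index()

print(df.index.names)
```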

Section 4.3: difficult panel data example#

We now try to load panel data from a file whose format is a bit challenging.

In the below file:

  1. the separator is not the default (comma) but tab. For this, we set sep="\t"

  2. there is no header in the file

  3. there is no instance index, so we need to add it

  4. the indexing differs from the sktime convention - the first column holds the variable index, and the remaining columns correspond to the time index

These or similar challenges are common in csv files for panel data, so we show below how to address them.

It is advisable to bring the data into one of the plain pandas or numpy based formats. In the below case, we will bring the data to pd-multiindex format.

[51]:
# 1, 2 - dealing with the separator and header
import pandas as pd

df_panel = pd.read_csv(
    "../sktime/datasets/data/ArrowHead/ArrowHead_TRAIN.tsv",
    sep="\t",
    header=None,
)
df_panel.head()
[51]:
0 1 2 3 4 5 6 7 8 9 ... 242 243 244 245 246 247 248 249 250 251
0 0 -1.963009 -1.957825 -1.956145 -1.938289 -1.896657 -1.869857 -1.838705 -1.812289 -1.736433 ... -1.583857 -1.655329 -1.719153 -1.750881 -1.796273 -1.841345 -1.884289 -1.905393 -1.923905 -1.909153
1 1 -1.774571 -1.774036 -1.776586 -1.730749 -1.696268 -1.657377 -1.636227 -1.609807 -1.543439 ... -1.471688 -1.484666 -1.539972 -1.590150 -1.635663 -1.639989 -1.678683 -1.729227 -1.775670 -1.789324
2 2 -1.866021 -1.841991 -1.835025 -1.811902 -1.764390 -1.707687 -1.648280 -1.582643 -1.531502 ... -1.584132 -1.652337 -1.684565 -1.743972 -1.799117 -1.829069 -1.875828 -1.862512 -1.863368 -1.846493
3 0 -2.073758 -2.073301 -2.044607 -2.038346 -1.959043 -1.874494 -1.805619 -1.731043 -1.712653 ... -1.678942 -1.743732 -1.819801 -1.858136 -1.886146 -1.951247 -2.012927 -2.026963 -2.073405 -2.075292
4 1 -1.746255 -1.741263 -1.722741 -1.698640 -1.677223 -1.630356 -1.579440 -1.551225 -1.473980 ... -1.547111 -1.607101 -1.635137 -1.686346 -1.691274 -1.716886 -1.740726 -1.743442 -1.762729 -1.763428

5 rows × 252 columns

[52]:
# 3 - adding an instance index manually
import numpy as np

df_panel["instance"] = np.repeat(range(len(df_panel) // 3), 3)

Now we bring the data into long format by depivoting:

[53]:
# name the columns so that wide_to_long can find the "value" stubs
df_panel.columns = ["var"] + [f"value{i}" for i in range(251)] + ["instance"]
# 4 - move to long format
df_panel = pd.wide_to_long(df_panel, "value", i=["var", "instance"], j="time")
[54]:
df_panel.head()
[54]:
value
var instance time
0 0 0 -1.963009
1 -1.957825
2 -1.956145
3 -1.938289
4 -1.896657

Now the “var” index is in the rows, but it should be in the columns:

[55]:
# 4 - move variable index to columns
df_panel = df_panel.reset_index("var")
df_panel = df_panel.pivot(columns="var", values="value")
[56]:
df_panel.head()
[56]:
var 0 1 2
instance time
0 0 -1.963009 -1.774571 -1.866021
1 -1.957825 -1.774036 -1.841991
2 -1.956145 -1.776586 -1.835025
3 -1.938289 -1.730749 -1.811902
4 -1.896657 -1.696268 -1.764390

This is now in the pd-multiindex format:

[57]:
mtype(df_panel, as_scitype="Panel")
[57]:
'pd-multiindex'

An alternative route would be to remove the index and use reshapes in numpy.
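A minimal sketch of the numpy route, on toy data rather than the file above: it assumes the same row layout (rows cycle through the variables within each instance, columns are time points) and targets a 3D array indexed as (instance, variable, time), which is how the numpy based panel format is understood here.

```python
import numpy as np

n_instances, n_vars, n_time = 2, 3, 4

# Toy wide table: rows cycle through variables 0, 1, 2 within each instance,
# columns are time points.
wide = np.arange(n_instances * n_vars * n_time, dtype=float).reshape(
    n_instances * n_vars, n_time
)

# A row-major reshape groups each block of n_vars consecutive rows
# into one instance.
X_3d = wide.reshape(n_instances, n_vars, n_time)

print(X_3d.shape)  # (2, 3, 4)
```

Row 3 of the wide table, for example, ends up as variable 0 of instance 1.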

