Abstract: this notebook gives an introduction to sktime in-memory data containers and data sets, with associated functionality such as in-memory format validation, conversion, and data set loading.
Setup instructions: on binder, this notebook should run out-of-the-box.
To run this notebook as intended, ensure that sktime with basic dependency requirements is installed in your python environment.
To run this notebook with a local development version of sktime, either uncomment and run the below, or pip install -e a local clone of the sktime main branch.
[ ]:
# import sys
# sys.path.append("..")
In-memory data representations and data loading#
sktime provides modules for a number of time series related learning tasks.
These modules use sktime-specific in-memory (i.e., python workspace) representations for time series and related objects, most importantly individual time series and time series panels. sktime’s in-memory representations rely on pandas and numpy, with additional conventions on the pandas and numpy objects.
Users of sktime should be aware of these representations, since presenting the data in an sktime-compatible representation is usually the first step in using any of the sktime modules.
This notebook introduces the data types used in sktime, related functionality such as converters and validity checkers, and common workflows for loading and conversion:
* Section 1 introduces in-memory data containers used in sktime, with examples.
* Section 2 introduces validity checkers and conversion functionality for in-memory data containers.
* Section 3 introduces common workflows to load data from file formats.
Section 1: in-memory data containers#
This section provides a reference to data containers used for time series and related objects in sktime.
Conceptually, sktime distinguishes:
* the data scientific abstract data type - or short: scitype - of a data container, defined by relational and statistical properties of the data being represented and common operations on it - for instance, an abstract “time series” or an abstract “time series panel”, without specifying a particular machine implementation in python
* the machine implementation type - or short: mtype - of a data container, which, for a defined scitype, specifies the python type and conventions on structure and value of the python in-memory object. For instance, a concrete (mathematical) time series is represented by a concrete pandas.DataFrame in sktime, subject to certain conventions on the pandas.DataFrame. Formally, these conventions form a specific mtype, i.e., a way to represent the (abstract) “time series” scitype.
In sktime, the same scitype can be implemented by multiple mtypes. For instance, sktime allows the user to specify time series as pandas.DataFrame, as pandas.Series, or as a numpy.ndarray. These are different mtypes which are admissible representations of the same scitype, “time series”. Also, not all mtypes are equally rich in metadata - for instance, pandas.DataFrame can store column names, while this is not possible in numpy.ndarray.
Scitypes and mtypes are encoded by strings in sktime, for easy reference.
This section introduces the mtypes for the following scitypes:
* "Series", the sktime scitype for time series of any kind
* "Panel", the sktime scitype for time series panels of any kind
* "Hierarchical", the sktime scitype for hierarchical time series
Section 1.1: Time series - the "Series" scitype#
The major representations of time series in sktime are:
* "pd.DataFrame" - a uni- or multivariate pandas.DataFrame, with rows = time points, cols = variables
* "pd.Series" - a (univariate) pandas.Series, with entries corresponding to different time points
* "np.ndarray" - a 2D numpy.ndarray, with rows = time points, cols = variables
pandas objects must have one of the following pandas index types: Int64Index, RangeIndex, DatetimeIndex, PeriodIndex; if DatetimeIndex, the freq attribute must be set.
numpy.ndarray 2D arrays are interpreted as having a RangeIndex on the rows, and generally equivalent to the pandas.DataFrame obtained after default coercion using the pandas.DataFrame constructor.
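The equivalence between a 2D numpy.ndarray and its default pandas.DataFrame coercion can be illustrated with plain numpy and pandas. The following is a minimal sketch with made-up values; it does not call any sktime function and only shows that the default constructor attaches a RangeIndex to the rows.

```python
import numpy as np
import pandas as pd

# hypothetical bivariate series: 4 time points (rows), 2 variables (columns)
arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# default coercion via the pandas.DataFrame constructor
df = pd.DataFrame(arr)

print(type(df.index).__name__)  # RangeIndex
print(df.shape)                 # (4, 2)
```

Row i of the array and row i of the data frame then refer to the same time point i.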
[ ]:
# import to retrieve examples
from sktime.datatypes import get_examples
Section 1.1.1: Time series - the "pd.DataFrame" mtype#
In the "pd.DataFrame" mtype, time series are represented by an in-memory container obj: pandas.DataFrame as follows.
* structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.
* variables: columns of obj correspond to different variables
* variable names: column names obj.columns
* time points: rows of obj correspond to different, distinct time points
* time index: obj.index is interpreted as a time index.
* capabilities: can represent multivariate series; can represent unequally spaced series
Example of a univariate series in "pd.DataFrame" representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.
[ ]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[0]
Example of a bivariate series in "pd.DataFrame" representation. This series has two variables, named "a" and "b". Both are observed at the same four time points 0, 1, 2, 3.
[ ]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[1]
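A container in this format can also be built by hand. The sketch below, using only pandas and made-up values, constructs a bivariate daily series and checks two of the conditions from the convention list manually: that the index is monotonic, and that freq is set on the DatetimeIndex. The variable names and dates are assumptions chosen for illustration.

```python
import pandas as pd

# hypothetical daily series with two variables "a" and "b"
index = pd.date_range("2021-01-01", periods=4, freq="D")
obj = pd.DataFrame(
    {"a": [1.0, 4.0, 0.5, -3.0], "b": [3.0, 7.0, 2.0, -0.5]}, index=index
)

# conditions from the convention list, checked by hand
is_monotonic = obj.index.is_monotonic_increasing  # rows are ordered in time
has_freq = obj.index.freq is not None             # freq is set on the DatetimeIndex
variable_names = list(obj.columns)
```

Note that a DatetimeIndex created from a plain list of timestamps does not have freq set by default, whereas pd.date_range sets it.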
Section 1.1.2: Time series - the "pd.Series" mtype#
In the "pd.Series" mtype, time series are represented by an in-memory container obj: pandas.Series as follows.
* structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.
* variables: there is a single variable, corresponding to the values of obj. Only univariate series can be represented.
* variable names: by default, there is no column name. If needed, a variable name can be provided as obj.name.
* time points: entries of obj correspond to different, distinct time points
* time index: obj.index is interpreted as a time index.
* capabilities: cannot represent multivariate series; can represent unequally spaced series
Example of a univariate series in "pd.Series" mtype representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.
[ ]:
get_examples(mtype="pd.Series", as_scitype="Series")[0]
Section 1.1.3: Time series - the "np.ndarray" mtype#
In the "np.ndarray" mtype, time series are represented by an in-memory container obj: np.ndarray as follows.
* structure convention: obj must be 2D, i.e., obj.shape must have length 2. This is also true for univariate time series.
* variables: variables correspond to columns of obj.
* variable names: the "np.ndarray" mtype cannot represent variable names.
* time points: the rows of obj correspond to different, distinct time points.
* time index: the time index is implicit and by-convention. The i-th row (for an integer i) is interpreted as an observation at the time point i.
* capabilities: can represent multivariate series; cannot represent unequally spaced series
Example of a univariate series in "np.ndarray" mtype representation. There is a single (unnamed) variable, which is observed at four time points 0, 1, 2, 3.
[ ]:
get_examples(mtype="np.ndarray", as_scitype="Series")[0]
Example of a bivariate series in "np.ndarray" mtype representation. There are two (unnamed) variables, both observed at four time points 0, 1, 2, 3.
[ ]:
get_examples(mtype="np.ndarray", as_scitype="Series")[1]
Section 1.2: Time series panels - the "Panel" scitype#
The major representations of time series panels in sktime are:
* "pd-multiindex" - a pandas.DataFrame, with row multiindex (instances, time), cols = variables
* "numpy3D" - a 3D np.ndarray, with axis 0 = instances, axis 1 = variables, axis 2 = time points
* "df-list" - a list of pandas.DataFrame, with list index = instances, data frame rows = time points, data frame cols = variables
These representations are considered primary representations in sktime and are core to internal computations.
There are further, minor representations of time series panels in sktime:
* "nested_univ" - a pandas.DataFrame, with pandas.Series in cells. data frame rows = instances, data frame cols = variables, and series axis = time points
* "numpyflat" - a 2D np.ndarray with rows = instances, and columns indexed by a pair index of (variables, time points). This format can only be converted to, not converted from (since the number of variables and time points may be ambiguous).
* "pd-wide" - a pandas.DataFrame in wide format: has column multiindex (variables, time points), rows = instances; the “variables” index can be omitted for univariate time series
* "pd-long" - a pandas.DataFrame in long format: has cols instances, timepoints, variable, value; entries in value are indexed by tuples of values in (instances, timepoints, variable).
The minor representations are currently not fully consolidated in-code and are not discussed further below. Contributions are appreciated.
Section 1.2.1: Time series panels - the "pd-multiindex" mtype#
In the "pd-multiindex" mtype, time series panels are represented by an in-memory container obj: pandas.DataFrame as follows.
* structure convention: obj.index must be a pair multiindex of type (Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. obj.index must have two levels (can be named or not).
* instance index: the first element of pairs in obj.index (0-th level value) is interpreted as an instance index, we call it “instance index” below.
* instances: rows with the same “instance index” value correspond to the same instance; rows with different “instance index” values correspond to different instances.
* time index: the second element of pairs in obj.index (1-st level value) is interpreted as a time index, we call it “time index” below.
* time points: rows of obj with the same “time index” value correspond to the same time point; rows of obj with different “time index” values correspond to different time points.
* variables: columns of obj correspond to different variables
* variable names: column names obj.columns
* capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
Example of a panel of multivariate series in "pd-multiindex" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.
[ ]:
get_examples(mtype="pd-multiindex", as_scitype="Panel")[0]
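A panel in this multiindex layout can be assembled directly with pandas. The sketch below is an illustration with made-up values: it builds the two-level row index with pd.MultiIndex.from_product and then selects all rows of one instance. The level names "instances" and "timepoints" follow the naming used later in this notebook; the values are assumptions.

```python
import pandas as pd

# 3 instances, each observed at time points 0, 1, 2
index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2]], names=["instances", "timepoints"]
)
obj = pd.DataFrame(
    {
        "var_0": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6],
    },
    index=index,
)

# all rows belonging to instance 1, as a single series indexed by time
series_1 = obj.loc[1]
```

Selecting on the first index level returns a data frame indexed by time only, i.e., one individual series of the panel.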
Section 1.2.2: Time series panels - the "numpy3D" mtype#
In the "numpy3D" mtype, time series panels are represented by an in-memory container obj: np.ndarray as follows.
* structure convention: obj must be 3D, i.e., obj.shape must have length 3.
* instances: instances correspond to axis 0 elements of obj.
* instance index: the instance index is implicit and by-convention. The i-th element of axis 0 (for an integer i) is interpreted as indicative of observing instance i.
* variables: variables correspond to axis 1 elements of obj.
* variable names: the "numpy3D" mtype cannot represent variable names.
* time points: time points correspond to axis 2 elements of obj.
* time index: the time index is implicit and by-convention. The i-th element of axis 2 (for an integer i) is interpreted as an observation at the time point i.
* capabilities: can represent panels of multivariate series; cannot represent unequally spaced series; cannot represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
Example of a panel of multivariate series in "numpy3D" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables (unnamed). All series are observed at three time points 0, 1, 2.
[ ]:
get_examples(mtype="numpy3D", as_scitype="Panel")[0]
Section 1.2.3: Time series panels - the "df-list" mtype#
In the "df-list" mtype, time series panels are represented by an in-memory container obj: List[pandas.DataFrame] as follows.
* structure convention: obj must be a list of pandas.DataFrames. Individual list elements of obj must follow the "pd.DataFrame" mtype convention for the "Series" scitype.
* instances: instances correspond to different list elements of obj.
* instance index: the instance index of an instance is the list index at which it is located in obj. That is, the data at obj[i] correspond to observations of the instance with index i.
* time points: rows of obj[i] correspond to different, distinct time points, at which instance i is observed.
* time index: obj[i].index is interpreted as the time index for instance i.
* variables: columns of obj[i] correspond to different variables available for instance i.
* variable names: column names obj[i].columns are the names of variables available for instance i.
* capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; can represent panels of series with different sets of variables.
Example of a panel of multivariate series in "df-list" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.
[ ]:
get_examples(mtype="df-list", as_scitype="Panel")[0]
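Because every instance is a separate data frame, a list-of-data-frames panel can hold series of different lengths. The sketch below uses plain pandas and made-up values to build such a panel, then stacks it into a two-level row index with pd.concat. This is only an illustration of how the two layouts relate, not sktime's own conversion code.

```python
import pandas as pd

# hypothetical panel: instance 0 has 3 time points, instance 1 only 2
obj = [
    pd.DataFrame({"var_0": [1, 2, 3], "var_1": [4, 5, 6]}),
    pd.DataFrame({"var_0": [7, 8], "var_1": [9, 10]}),
]

# the list position is the instance index
lengths = [len(df) for df in obj]

# stacking with pandas: the list position becomes the outer index level
stacked = pd.concat(obj, keys=list(range(len(obj))))
stacked.index.names = ["instances", "timepoints"]
```

The stacked frame has one row per (instance, time point) pair, so unequal lengths simply lead to a different number of rows per instance.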
Section 1.3: Hierarchical time series - the "Hierarchical" scitype#
There is currently only one representation for hierarchical time series in sktime:
* "pd_multiindex_hier" - a pandas.DataFrame, with row multiindex, last level interpreted as time, others as hierarchy, cols = variables
Hierarchical time series - the "pd_multiindex_hier" mtype#
* structure convention: obj.index must be a 3 or more level multiindex of type (Index, ..., Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. We call the last index the “time-like” index.
* hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combinations correspond to different hierarchy units.
* hierarchy: the non-time-like indices in obj.index are interpreted as a hierarchy identifying index.
* time index: the last element of tuples in obj.index is interpreted as a time index.
* time points: rows of obj with the same "timepoints" index correspond to the same time point; rows of obj with different "timepoints" index correspond to different time points.
* variables: columns of obj correspond to different variables
* variable names: column names obj.columns
* capabilities: can represent hierarchical series; can represent unequally spaced series; can represent unequally supported hierarchical series; cannot represent hierarchical series with different sets of variables.
[ ]:
get_examples(mtype="pd_multiindex_hier", as_scitype="Hierarchical")[0]
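The hierarchical layout extends the panel multiindex by further index levels. The following sketch builds a three-level example with pandas only. The level names "region" and "store" and all values are hypothetical, chosen for illustration; the last level is the time-like index.

```python
import pandas as pd

# hypothetical hierarchy: 2 regions x 2 stores, each observed at 3 time points
index = pd.MultiIndex.from_product(
    [["north", "south"], ["store_a", "store_b"], [0, 1, 2]],
    names=["region", "store", "timepoints"],
)
obj = pd.DataFrame({"sales": range(12)}, index=index)

# one series per hierarchy unit: group by all levels except the last (time) level
n_units = obj.groupby(level=["region", "store"]).ngroups

# the series of a single hierarchy unit, indexed by time
series_north_a = obj.loc[("north", "store_a")]
```

Each distinct combination of the non-time-like levels identifies one hierarchy unit, here four units with three time points each.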
Section 2: validity checking and mtype conversion#
sktime’s datatypes module provides users with generic functionality for:
* checking in-memory containers against mtype conventions, with informative error messages that help with moving data to the right format
* converting different mtypes to each other, for a given scitype
In this section, this functionality and intended usage workflows are presented.
Section 2.1: Preparing data, checking in-memory containers for validity#
sktime’s datatypes module provides convenient functionality for users to check validity of their in-memory data containers, using the check_is_mtype and check_raise functions. Both functions provide generic validity checking functionality: check_is_mtype returns metadata and potential issues as return arguments, while check_raise directly produces informative error messages in case a container does not comply with a given mtype.
A recommended notebook workflow to ensure that a given data container is compliant with sktime mtype specification is as follows:
1. load the data in an in-memory data container
2. identify the scitype, e.g., is this supposed to be a time series (Series) or a panel of time series (Panel)
3. select the target mtype (see Section 1 for a list), and attempt to manually reformat the data to comply with the mtype specification if it is not already compliant
4. run check_raise on the data container, to check whether it complies with the mtype and scitype
5. if an error is raised, repeat 3 and 4 until no error is raised
Section 2.1.1: validity checking, example 1 (simple mistake)#
Suppose we have the following numpy.ndarray representing a univariate time series:
[ ]:
import numpy as np
y = np.array([1, 6, 3, 7, 2])
To check compatibility with sktime:
(instruction: uncomment and run the code to see the informative error message)
[ ]:
from sktime.datatypes import check_raise
# check_raise(y, mtype="np.ndarray")
This tells us that sktime uses 2D numpy arrays for time series, if the np.ndarray mtype is used. While most methods provide convenience functionality to do this coercion automatically, the “correct” format would be 2D as follows:
[ ]:
check_raise(y.reshape(-1, 1), mtype="np.ndarray")
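The direction of the reshape matters, since rows are read as time points and columns as variables. The following numpy-only sketch shows the shape of the column layout for a univariate series, next to the row layout, which would be read differently under the "np.ndarray" conventions of Section 1.1.3.

```python
import numpy as np

y = np.array([1, 6, 3, 7, 2])

# -1 lets numpy infer the number of rows: 5 time points, 1 variable
y_2d = y.reshape(-1, 1)
print(y_2d.shape)  # (5, 1)

# by contrast, this would be read as 1 time point with 5 variables
y_wrong = y.reshape(1, -1)
print(y_wrong.shape)  # (1, 5)
```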
For use in your own code, or to obtain additional metadata, the error message can be retrieved using the check_is_mtype function:
[ ]:
from sktime.datatypes import check_is_mtype
check_is_mtype(y, mtype="np.ndarray", return_metadata=True)
and metadata is produced if the argument passes the validity check:
[ ]:
check_is_mtype(y.reshape(-1, 1), mtype="np.ndarray", return_metadata=True)
Note: if the name of the mtype is ambiguous and can refer to multiple scitypes, the additional argument scitype must be provided. This should not be the case for any common in-memory containers; we mention this for completeness.
[ ]:
check_is_mtype(y, mtype="np.ndarray", scitype="Series")
Section 2.1.2: validity checking, example 2 (non-obvious mistake)#
Suppose we have converted our data into a multiindex panel, i.e., we want to have a Panel of mtype pd-multiindex.
[ ]:
import pandas as pd
cols = ["instances", "time points"] + [f"var_{i}" for i in range(2)]
X = pd.concat(
[
pd.DataFrame([[0, 0, 1, 4], [0, 1, 2, 5], [0, 2, 3, 6]], columns=cols),
pd.DataFrame([[1, 0, 1, 4], [1, 1, 2, 55], [1, 2, 3, 6]], columns=cols),
pd.DataFrame([[2, 0, 1, 42], [2, 1, 2, 5], [2, 2, 3, 6]], columns=cols),
]
).set_index(["instances", "time points"])
It is not obvious whether X satisfies the pd-multiindex specification, so let’s check:
(instruction: uncomment and run the code to see the informative error message)
[ ]:
from sktime.datatypes import check_raise
# check_raise(X, mtype="pd-multiindex")
The informative error message highlights a typo in one of the multiindex level names ("time points" instead of "timepoints"), so we rename the index levels:
[ ]:
X.index.names = ["instances", "timepoints"]
Now the validity check passes:
[ ]:
check_raise(X, mtype="pd-multiindex")
Section 2.1.3: inferring the mtype#
sktime also provides functionality to infer the mtype of an in-memory data container. This is useful if one is sure that the container is compliant but has forgotten the exact string, or if one would like to know whether an in-memory container is already in some supported, compliant format. For this, only the scitype needs to be specified:
[ ]:
from sktime.datatypes import mtype
mtype(X, as_scitype="Panel")
Section 2.2: conversion between mtypes#
sktime’s datatypes module also offers unified conversion functionality between mtypes. This is useful for users as well as for method developers.
The convert function requires specifying the mtype to convert from, and the mtype to convert to. The convert_to function only requires specifying the mtype to convert to, automatically inferring the mtype of the input if it can be inferred. convert_to should be used if the input can have multiple mtypes.
Section 2.2.1: simple conversion#
Example: converting a numpy3D panel of time series to pd-multiindex mtype:
[ ]:
from sktime.datatypes import get_examples
X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[ ]:
from sktime.datatypes import convert
convert(X, from_type="numpy3D", to_type="pd-multiindex")
[ ]:
from sktime.datatypes import convert_to
convert_to(X, to_type="pd-multiindex")
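To make the relation between the two layouts concrete, the conversion from a 3D array to a multiindex data frame can be written out by hand with numpy and pandas. The sketch below is an illustration under assumptions, not sktime's implementation: the array values, the column names "var_0", "var_1" and the level names "instances", "timepoints" are chosen for the example.

```python
import numpy as np
import pandas as pd

# hypothetical numpy3D panel: 3 instances, 2 variables, 4 time points
n_instances, n_variables, n_timepoints = 3, 2, 4
X3d = np.arange(n_instances * n_variables * n_timepoints).reshape(
    n_instances, n_variables, n_timepoints
)

# move time to axis 1, then flatten (instances, time) into rows
values = X3d.transpose(0, 2, 1).reshape(-1, n_variables)
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instances", "timepoints"]
)
X_mi = pd.DataFrame(values, index=index, columns=["var_0", "var_1"])
```

The entry X3d[i, j, t] ends up in row (i, t) and column j of the data frame, which matches the axis conventions of Section 1.2.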
Section 2.2.2: advanced conversion features#
convert_to also allows specifying multiple output types. The to_type argument can be a list of mtypes. In that case, the input is passed through unchanged if its mtype is on the list; if the mtype of the input is not on the list, it is converted to the mtype which is the first element of the list.
Example: converting a panel of time series to either "pd-multiindex" or "numpy3D". If the input is "numpy3D", it remains unchanged. If the input is "df-list", it is converted to "pd-multiindex".
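The selection rule for a list-valued to_type can be written out as a few lines of plain python. The helper below is hypothetical and for illustration only; it is not part of sktime and only mirrors the rule as described in the text.

```python
def choose_target_mtype(input_mtype, to_type):
    """Return the mtype the output would have under the rule described above.

    Hypothetical helper for illustration only; not part of sktime.
    """
    if isinstance(to_type, str):
        to_type = [to_type]
    # pass-through: the input is already in an accepted mtype
    if input_mtype in to_type:
        return input_mtype
    # otherwise convert to the first mtype on the list
    return to_type[0]


accepted = ["pd-multiindex", "numpy3D"]
print(choose_target_mtype("numpy3D", accepted))  # numpy3D
print(choose_target_mtype("df-list", accepted))  # pd-multiindex
```

Ordering the list therefore expresses a preference: the first entry is the fallback target, the remaining entries are formats that are accepted as they are.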
[ ]:
from sktime.datatypes import get_examples
X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[ ]:
from sktime.datatypes import convert_to
convert_to(X, to_type=["pd-multiindex", "numpy3D"])
[ ]:
X = get_examples(mtype="df-list", as_scitype="Panel")[0]
X
[ ]:
convert_to(X, to_type=["pd-multiindex", "numpy3D"])
Section 2.2.3: inspecting implemented conversions#
Currently, conversions are work in progress, and not all possible conversions are available - contributions are welcome. To see which conversions are currently implemented for a scitype, use the _conversions_defined developer method from the datatypes._convert module. This produces a table with a “1” if conversion from the mtype in the row to the mtype in the column is implemented.
[ ]:
from sktime.datatypes._convert import _conversions_defined
_conversions_defined(scitype="Panel")
Section 3: loading data sets#
sktime’s datasets module allows loading data sets for testing and benchmarking. This includes:
* example data sets that ship directly with sktime
* downloaders for data sets from common repositories
All data retrieved in this way are in sktime-compatible in-memory and/or file formats.
Currently, no systematic tagging and registry retrieval for the available data sets is implemented - contributions to this would be very welcome.
Section 3.1: forecasting data sets#
sktime’s datasets module currently allows loading the following forecasting example data sets:
dataset name | loader function | properties
Box/Jenkins airline data | load_airline | univariate
Lynx sales data | | univariate
Shampoo sales data | | univariate
Pharmaceutical Benefit Scheme data | | univariate
Longley US macroeconomic data | load_longley | multivariate
MTS consumption/income data | | multivariate
sktime currently has no connectors to forecasting data repositories - contributions are much appreciated.
Forecasting data sets are all of Series scitype; they can be univariate or multivariate.
Loaders for univariate data have no arguments, and always return the data in the "pd.Series" mtype:
[ ]:
from sktime.datasets import load_airline
load_airline()
Loaders for multivariate data can be called in two ways:
* without an argument, in which case a multivariate series of "pd.DataFrame" mtype is returned:
[ ]:
from sktime.datasets import load_longley
load_longley()
* with an argument y_name that must coincide with one of the column/variable names, in which case a pair of series y, X is returned, with y of "pd.Series" mtype, and X of "pd.DataFrame" mtype - this is convenient for univariate forecasting with exogenous variables.
[ ]:
y, X = load_longley(y_name="TOTEMP")
[ ]:
y
[ ]:
X
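Conceptually, the y/X pair corresponds to splitting one column off a multivariate data frame. The sketch below reproduces that split with plain pandas on made-up data; the column names are hypothetical, and this is an illustration of the resulting containers rather than the loader's actual code.

```python
import pandas as pd

# hypothetical multivariate series with three variables
data = pd.DataFrame(
    {"target": [1.0, 2.0, 3.0], "exog_1": [10, 20, 30], "exog_2": [5, 6, 7]}
)

y_name = "target"
y = data[y_name]                 # pandas.Series, i.e., "pd.Series" mtype
X = data.drop(columns=[y_name])  # pandas.DataFrame, i.e., "pd.DataFrame" mtype
```

Both objects share the same time index, so rows of y and X at the same position refer to the same time point.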
Section 3.2: time series classification data sets#
sktime’s datasets module currently allows loading the following time series classification example data sets:
dataset name | loader function | properties
Appliance power consumption data | | univariate, equal length/index
Arrowhead shape data | load_arrow_head | univariate, equal length/index
Gunpoint motion data | | univariate, equal length/index
Italy power demand data | | univariate, equal length/index
Japanese vowels data | | univariate, equal length/index
OSUleaf leaf shape data | | univariate, equal length/index
Basic motions data | | multivariate, equal length/index
Currently, there are no unequal length or unequal index time series classification example data directly in sktime.
sktime also provides a full interface to the UCR/UEA time series data set archive, via the load_UCR_UEA_dataset function. The UCR/UEA archive also contains time series classification data sets which are multivariate, or unequal length/index (in either combination).
Section 3.2.2: time series classification data sets in sktime#
Time series classification data sets consist of a panel of time series of Panel scitype, together with classification labels, one per time series.
If a loader is invoked with minimal arguments, the data are returned as "nested_univ" mtype, with labels and series to classify in the same pd.DataFrame. Using the return_X_y=True argument, the data are returned separated into features X and labels y, with X a Panel of nested_univ mtype, and y a sklearn-compatible numpy vector of labels:
[ ]:
from sktime.datasets import load_arrow_head
X, y = load_arrow_head(return_X_y=True)
[ ]:
X
[ ]:
y
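The nested layout can be hard to picture from the printed output alone. The following pandas-only sketch builds a tiny nested data frame by hand, with a whole series stored in each cell and one label per row. All values, the column name "var_0" and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# two instances, one variable; each cell holds a whole series
X_nested = pd.DataFrame(
    {"var_0": pd.Series([pd.Series([1, 2, 3]), pd.Series([4, 5, 6])])}
)
# one label per instance, as a numpy vector
y_labels = np.array(["a", "b"])

# the series for instance 0, variable "var_0"
first_series = X_nested.iloc[0, 0]
```

Rows of the outer data frame are instances, columns are variables, and the index of each inner series is the time index of that instance.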
The panel can be converted from "nested_univ" mtype to other mtype formats, using datatypes.convert or convert_to (see above):
[ ]:
from sktime.datatypes import convert_to
convert_to(X, to_type="pd-multiindex")
Data set loaders can be invoked with the split parameter to obtain reproducible training and test sets for comparison across studies. If split="train", a pre-defined training set is retrieved; if split="test", a pre-defined test set is retrieved.
[ ]:
X_train, y_train = load_arrow_head(return_X_y=True, split="train")
X_test, y_test = load_arrow_head(return_X_y=True, split="test")
# this retrieves training and test X/y for reproducible use in studies
Section 3.2.3: time series classification data sets from the UCR/UEA time series classification repository#
The load_UCR_UEA_dataset utility will download data sets from the UCR/UEA time series classification repository and make them available as in-memory data sets, with the same syntax as sktime native data set loaders.
Data sets are indexed by unique string identifiers, which can be inspected on the repository itself, or via the register in the datasets.tsc_dataset_names module, by property:
[ ]:
from sktime.datasets.tsc_dataset_names import univariate
The imported variables are all lists of strings which contain the unique string identifiers of data sets with certain properties, as follows:
register name | uni-/multivariate | equal/unequal length | with/without missing values
univariate | only univariate | both included | both included
 | only multivariate | both included | both included
 | only univariate | only equal length | both included
 | only univariate | only unequal length | both included
 | only univariate | both included | only with missing values
 | only multivariate | only equal length | both included
 | only multivariate | only unequal length | both included
Lookup and retrieval using these lists is, admittedly, a bit inconvenient - contributions to sktime to write lookup functions such as all_estimators or all_tags, based on capability or property tags attached to data sets, would be very much appreciated.
An example list is displayed below:
[ ]:
univariate
The loader function load_UCR_UEA_dataset behaves exactly like the sktime data loaders, with an additional argument name that should be set to one of the unique identifying strings for the UCR/UEA data sets, for instance:
[ ]:
from sktime.datasets import load_UCR_UEA_dataset
X, y = load_UCR_UEA_dataset(name="Yoga", return_X_y=True)
This will download the data set into a local directory (by default: for a local clone, the datasets/data directory in the local repository; for a release install, in the local python environment folder). To change that directory, specify it using the extract_path argument of the load_UCR_UEA_dataset function.