Abstract: this notebook gives an introduction to sktime in-memory data containers and data sets, with associated functionality such as in-memory format validation, conversion, and data set loading.
Set-up instructions: on binder, this notebook should run out-of-the-box.
To run this notebook as intended, ensure that sktime with basic dependency requirements is installed in your python environment.
To run this notebook with a local development version of sktime, either uncomment and run the code below, or pip install -e a local clone of the sktime main branch.
[1]:
# import sys
# sys.path.append("..")
In-memory data representations and data loading#
sktime provides modules for a number of time series related learning tasks.
These modules use sktime-specific in-memory (i.e., python workspace) representations for time series and related objects, most importantly individual time series and time series panels. sktime’s in-memory representations rely on pandas and numpy, with additional conventions on the pandas and numpy objects.
Users of sktime should be aware of these representations, since presenting the data in an sktime-compatible representation is usually the first step in using any of the sktime modules.
This notebook introduces the data types used in sktime, related functionality such as converters and validity checkers, and common workflows for loading and conversion:
Section 1 introduces in-memory data container formats used in sktime, with examples.
Section 2 introduces validity checkers and conversion functionality for in-memory data containers.
Section 3 introduces common workflows to load predefined benchmark datasets.
Section 4 showcases common workflows to load from tabular csv formats.
Section 1: in-memory data containers#
This section provides a reference to data containers used for time series and related objects in sktime.
Conceptually, sktime distinguishes:
the data scientific abstract data type - or scitype for short - of a data container, defined by relational and statistical properties of the data being represented and common operations on it - for instance, an abstract “time series” or an abstract “time series panel”, without specifying a particular machine implementation in python
the machine implementation type - or mtype for short - of a data container, which, for a defined scitype, specifies the python type and conventions on structure and value of the python in-memory object. For instance, a concrete (mathematical) time series is represented by a concrete pandas.DataFrame in sktime, subject to certain conventions on the pandas.DataFrame. Formally, these conventions form a specific mtype, i.e., a way to represent the (abstract) “time series” scitype.
In sktime, the same scitype can be implemented by multiple mtypes. For instance, sktime allows the user to specify time series as pandas.DataFrame, as pandas.Series, or as a numpy.ndarray. These are different mtypes which are admissible representations of the same scitype, “time series”. Also, not all mtypes are equally rich in metadata - for instance, pandas.DataFrame can store column names, while this is not possible in numpy.ndarray.
Scitypes and mtypes are encoded by strings in sktime, for easy reference.
This section introduces the mtypes for the following scitypes:
"Series", the sktime scitype for time series of any kind
"Panel", the sktime scitype for time series panels of any kind
"Hierarchical", the sktime scitype for hierarchical time series
Section 1.1: Time series - the "Series" scitype#
The major representations of time series in sktime are:
"pd.DataFrame" - a uni- or multivariate pandas.DataFrame, with rows = time points, cols = variables
"pd.Series" - a (univariate) pandas.Series, with entries corresponding to different time points
"np.ndarray" - a 2D numpy.ndarray, with rows = time points, cols = variables
pandas objects must have one of the following pandas index types: Int64Index, RangeIndex, DatetimeIndex, PeriodIndex; if DatetimeIndex, the freq attribute must be set.
numpy.ndarray 2D arrays are interpreted as having a RangeIndex on the rows, and are generally equivalent to the pandas.DataFrame obtained after default coercion using the pandas.DataFrame constructor.
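The equivalence between a 2D array and its default coercion can be seen with plain pandas and numpy. The snippet below is a minimal sketch that does not call sktime; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# a univariate series with four time points, stored as a 2D array (rows = time points)
arr = np.array([[1.0], [4.0], [0.5], [-3.0]])

# default coercion through the pandas.DataFrame constructor
df = pd.DataFrame(arr)

print(type(df.index).__name__)  # RangeIndex
print(list(df.columns))         # [0]
```

The row index produced by the constructor is a RangeIndex, which is the implicit time index that the array convention assumes.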
[2]:
# import to retrieve examples
from sktime.datatypes import get_examples
Section 1.1.1: Time series - the "pd.DataFrame" mtype#
In the "pd.DataFrame" mtype, time series are represented by an in-memory container obj: pandas.DataFrame as follows.
structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.
variables: columns of obj correspond to different variables
variable names: column names obj.columns
time points: rows of obj correspond to different, distinct time points
time index: obj.index is interpreted as a time index.
capabilities: can represent multivariate series; can represent unequally spaced series
Example of a univariate series in "pd.DataFrame" representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.
[3]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[0]
[3]:
 | a |
---|---|
0 | 1.0 |
1 | 4.0 |
2 | 0.5 |
3 | -3.0 |
Example of a bivariate series in "pd.DataFrame" representation. This series has two variables, named "a" and "b". Both are observed at the same four time points 0, 1, 2, 3.
[4]:
get_examples(mtype="pd.DataFrame", as_scitype="Series")[1]
[4]:
 | a | b |
---|---|---|
0 | 1.0 | 3.000000 |
1 | 4.0 | 7.000000 |
2 | 0.5 | 2.000000 |
3 | -3.0 | -0.428571 |
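A container following the "pd.DataFrame" conventions can also be built by hand. The following is a minimal sketch using only pandas; the monthly index, the values, and the variable names are assumptions made for illustration, and the checks are manual stand-ins rather than sktime's own validation.

```python
import pandas as pd

# hypothetical monthly observations; values and column names are made up
index = pd.period_range(start="2020-01", periods=4, freq="M")
y = pd.DataFrame({"a": [1.0, 4.0, 0.5, -3.0], "b": [3.0, 7.0, 2.0, 1.5]}, index=index)

# conventions from the list above, checked by hand
index_ok = isinstance(y.index, pd.PeriodIndex) and y.index.is_monotonic_increasing

# a DatetimeIndex built with date_range carries a freq attribute
dt_index = pd.date_range(start="2020-01-01", periods=4, freq="D")
freq_is_set = dt_index.freq is not None

print(index_ok, freq_is_set)  # True True
```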
Section 1.1.2: Time series - the "pd.Series" mtype#
In the "pd.Series" mtype, time series are represented by an in-memory container obj: pandas.Series as follows.
structure convention: obj.index must be monotonic, and one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex.
variables: there is a single variable, corresponding to the values of obj. Only univariate series can be represented.
variable names: by default, there is no column name. If needed, a variable name can be provided as obj.name.
time points: entries of obj correspond to different, distinct time points
time index: obj.index is interpreted as a time index.
capabilities: cannot represent multivariate series; can represent unequally spaced series
Example of a univariate series in "pd.Series" mtype representation. The single variable has name "a", and is observed at four time points 0, 1, 2, 3.
[5]:
get_examples(mtype="pd.Series", as_scitype="Series")[0]
[5]:
0 1.0
1 4.0
2 0.5
3 -3.0
Name: a, dtype: float64
Section 1.1.3: Time series - the "np.ndarray" mtype#
In the "np.ndarray" mtype, time series are represented by an in-memory container obj: np.ndarray as follows.
structure convention: obj must be 2D, i.e., obj.shape must have length 2. This is also true for univariate time series.
variables: variables correspond to columns of obj.
variable names: the "np.ndarray" mtype cannot represent variable names.
time points: the rows of obj correspond to different, distinct time points.
time index: the time index is implicit and by-convention. The i-th row (for an integer i) is interpreted as an observation at the time point i.
capabilities: can represent multivariate series; cannot represent unequally spaced series
Example of a univariate series in "np.ndarray" mtype representation. There is a single (unnamed) variable, which is observed at four time points 0, 1, 2, 3.
[6]:
get_examples(mtype="np.ndarray", as_scitype="Series")[0]
[6]:
array([[ 1. ],
[ 4. ],
[ 0.5],
[-3. ]])
Example of a bivariate series in "np.ndarray" mtype representation. There are two (unnamed) variables, both observed at four time points 0, 1, 2, 3.
[7]:
get_examples(mtype="np.ndarray", as_scitype="Series")[1]
[7]:
array([[ 1. , 3. ],
[ 4. , 7. ],
[ 0.5 , 2. ],
[-3. , -0.42857143]])
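The difference in metadata richness between the three "Series" mtypes can be made concrete with plain pandas. The snippet below is a hand-written sketch, not sktime's converter; the monthly index is an assumption chosen so that the loss of the time index is visible.

```python
import pandas as pd

index = pd.period_range(start="2000-01", periods=4, freq="M")
s = pd.Series([1.0, 4.0, 0.5, -3.0], index=index, name="a")  # "pd.Series" layout

df = s.to_frame()    # "pd.DataFrame" layout: one column named "a", same index
arr = df.to_numpy()  # "np.ndarray" layout: 2D values only

print(df.columns.tolist())  # ['a']
print(arr.shape)            # (4, 1)
```

After the last step, neither the variable name nor the monthly index is recoverable from arr, which is why the array representation relies on an implicit integer time index.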
Section 1.2: Time series panels - the "Panel" scitype#
The major representations of time series panels in sktime are:
"pd-multiindex" - a pandas.DataFrame, with row multi-index (instances, time), cols = variables
"numpy3D" - a 3D np.ndarray, with axis 0 = instances, axis 1 = variables, axis 2 = time points
"df-list" - a list of pandas.DataFrame, with list index = instances, data frame rows = time points, data frame cols = variables
These representations are considered primary representations in sktime and are core to internal computations.
There are further, minor representations of time series panels in sktime:
"nested_univ" - a pandas.DataFrame, with pandas.Series in cells. data frame rows = instances, data frame cols = variables, and series axis = time points
"numpyflat" - a 2D np.ndarray with rows = instances, and columns indexed by a pair index of (variables, time points). This format can only be converted to, not converted from (since number of variables and time points may be ambiguous).
"pd-wide" - a pandas.DataFrame in wide format: has column multi-index (variables, time points), rows = instances; the “variables” index can be omitted for univariate time series
"pd-long" - a pandas.DataFrame in long format: has cols instances, timepoints, variable, value; entries in value are indexed by tuples of values in (instances, timepoints, variable).
The minor representations are currently not fully consolidated in-code and are not discussed further below. Contributions are appreciated.
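To make the long layout described above concrete, the following sketch reshapes a small long-format table into a row multi-index layout with plain pandas. The column names follow the description of "pd-long" given above; the data are made up, and the reshaping is a manual illustration rather than sktime's conversion routine.

```python
import pandas as pd

# hypothetical long-format panel: one row per (instance, time point, variable)
long_df = pd.DataFrame(
    {
        "instances": [0, 0, 0, 0, 1, 1, 1, 1],
        "timepoints": [0, 0, 1, 1, 0, 0, 1, 1],
        "variable": ["var_0", "var_1"] * 4,
        "value": [1, 4, 2, 5, 1, 4, 2, 55],
    }
)

# rows = (instances, timepoints), columns = variables
wide = long_df.pivot(index=["instances", "timepoints"], columns="variable", values="value")
wide.columns.name = None

print(wide)
```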
Section 1.2.1: Time series panels - the "pd-multiindex" mtype#
In the "pd-multiindex" mtype, time series panels are represented by an in-memory container obj: pandas.DataFrame as follows.
structure convention: obj.index must be a pair multi-index of type (Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. obj.index must have two levels (can be named or not).
instance index: the first element of pairs in obj.index (0-th level value) is interpreted as an instance index, we call it “instance index” below.
instances: rows with the same “instance index” value correspond to the same instance; rows with different “instance index” values correspond to different instances.
time index: the second element of pairs in obj.index (1-st level value) is interpreted as a time index, we call it “time index” below.
time points: rows of obj with the same “time index” value correspond to the same time point; rows of obj with different “time index” values correspond to different time points.
variables: columns of obj correspond to different variables
variable names: column names obj.columns
capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
Example of a panel of multivariate series in "pd-multiindex" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.
[8]:
get_examples(mtype="pd-multiindex", as_scitype="Panel")[0]
[8]:
instances | timepoints | var_0 | var_1 |
---|---|---|---|
0 | 0 | 1 | 4 |
0 | 1 | 2 | 5 |
0 | 2 | 3 | 6 |
1 | 0 | 1 | 4 |
1 | 1 | 2 | 55 |
1 | 2 | 3 | 6 |
2 | 0 | 1 | 42 |
2 | 1 | 2 | 5 |
2 | 2 | 3 | 6 |
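A panel with the same layout and values as the example above can be assembled by hand. This is a minimal sketch using only pandas; it illustrates the index structure and is not how sktime generates its examples.

```python
import pandas as pd

# two-level row index: level 0 = instances, level 1 = time points
index = pd.MultiIndex.from_product(
    [range(3), range(3)], names=["instances", "timepoints"]
)
X = pd.DataFrame(
    {"var_0": [1, 2, 3] * 3, "var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6]},
    index=index,
)

# selecting one instance yields a single series indexed by time points
X_instance_1 = X.loc[1]
print(X_instance_1)
```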
Section 1.2.2: Time series panels - the "numpy3D" mtype#
In the "numpy3D" mtype, time series panels are represented by an in-memory container obj: np.ndarray as follows.
structure convention: obj must be 3D, i.e., obj.shape must have length 3.
instances: instances correspond to axis 0 elements of obj.
instance index: the instance index is implicit and by-convention. The i-th element of axis 0 (for an integer i) is interpreted as indicative of observing instance i.
variables: variables correspond to axis 1 elements of obj.
variable names: the "numpy3D" mtype cannot represent variable names.
time points: time points correspond to axis 2 elements of obj.
time index: the time index is implicit and by-convention. The i-th element of axis 2 (for an integer i) is interpreted as an observation at the time point i.
capabilities: can represent panels of multivariate series; cannot represent unequally spaced series; cannot represent panels of unequally supported series; cannot represent panels of series with different sets of variables.
Example of a panel of multivariate series in "numpy3D" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables (unnamed). All series are observed at three time points 0, 1, 2.
[9]:
get_examples(mtype="numpy3D", as_scitype="Panel")[0]
[9]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 1, 2, 3],
[ 4, 55, 6]],
[[ 1, 2, 3],
[42, 5, 6]]])
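Since the axis order (instances, variables, time points) is easy to confuse with the more common (instances, time points, variables) layout, the sketch below shows how such an array could be rearranged with numpy. The input layout is an assumption made for illustration; the values reproduce the example above.

```python
import numpy as np

# hypothetical input laid out as (instances, time points, variables)
X_itv = np.array(
    [
        [[1, 4], [2, 5], [3, 6]],
        [[1, 4], [2, 55], [3, 6]],
        [[1, 42], [2, 5], [3, 6]],
    ]
)

# swap the last two axes to obtain (instances, variables, time points)
X = np.transpose(X_itv, (0, 2, 1))

print(X.shape)     # (3, 2, 3)
print(X[1, 1, 1])  # 55: instance 1, variable 1, time point 1
```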
Section 1.2.3: Time series panels - the "df-list" mtype#
In the "df-list" mtype, time series panels are represented by an in-memory container obj: List[pandas.DataFrame] as follows.
structure convention: obj must be a list of pandas.DataFrames. Individual list elements of obj must follow the "pd.DataFrame" mtype convention for the "Series" scitype.
instances: instances correspond to different list elements of obj.
instance index: the instance index of an instance is the list index at which it is located in obj. That is, the data at obj[i] correspond to observations of the instance with index i.
time points: rows of obj[i] correspond to different, distinct time points, at which instance i is observed.
time index: obj[i].index is interpreted as the time index for instance i.
variables: columns of obj[i] correspond to different variables available for instance i.
variable names: column names obj[i].columns are the names of variables available for instance i.
capabilities: can represent panels of multivariate series; can represent unequally spaced series; can represent panels of unequally supported series; can represent panels of series with different sets of variables.
Example of a panel of multivariate series in "df-list" mtype representation. The panel contains three multivariate series, with instance indices 0, 1, 2. All series have two variables with names "var_0", "var_1". All series are observed at three time points 0, 1, 2.
[10]:
get_examples(mtype="df-list", as_scitype="Panel")[0]
[10]:
[ var_0 var_1
0 1 4
1 2 5
2 3 6,
var_0 var_1
0 1 4
1 2 55
2 3 6,
var_0 var_1
0 1 42
1 2 5
2 3 6]
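The example above is regular, so it does not show the extra flexibility listed under capabilities. The following hand-built list is a sketch of a panel with unequal lengths, unequal spacing, and differing variable sets; the data are made up, and only pandas is used.

```python
import pandas as pd

X = [
    # instance 0: two variables, three equally spaced time points
    pd.DataFrame({"var_0": [1, 2, 3], "var_1": [4, 5, 6]}),
    # instance 1: shorter series
    pd.DataFrame({"var_0": [1, 2], "var_1": [4, 55]}),
    # instance 2: unequally spaced time index and only one variable
    pd.DataFrame({"var_0": [1, 2, 3, 4]}, index=[0, 2, 5, 9]),
]

lengths = [len(x) for x in X]
variables = [list(x.columns) for x in X]
print(lengths)    # [3, 2, 4]
print(variables)  # [['var_0', 'var_1'], ['var_0', 'var_1'], ['var_0']]
```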
Section 1.3: Hierarchical time series - the "Hierarchical" scitype#
There is currently only one representation for hierarchical time series in sktime:
"pd_multiindex_hier" - a pandas.DataFrame, with row multi-index, last level interpreted as time, others as hierarchy, cols = variables
Hierarchical time series - the "pd_multiindex_hier" mtype#
structure convention: obj.index must be a 3 or more level multi-index of type (Index, ..., Index, t), where t is one of Int64Index, RangeIndex, DatetimeIndex, PeriodIndex and monotonic. We call the last index the “time-like” index.
hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combinations correspond to different hierarchy units.
hierarchy: the non-time-like indices in obj.index are interpreted as a hierarchy identifying index.
time index: the last element of tuples in obj.index is interpreted as a time index.
time points: rows of obj with the same "timepoints" index correspond to the same time point; rows of obj with different "timepoints" index correspond to different time points.
variables: columns of obj correspond to different variables
variable names: column names obj.columns
capabilities: can represent hierarchical series; can represent unequally spaced series; can represent unequally supported hierarchical series; cannot represent hierarchical series with different sets of variables.
[11]:
get_examples(mtype="pd_multiindex_hier", as_scitype="Hierarchical")[0]
[11]:
foo | bar | timepoints | var_0 | var_1 |
---|---|---|---|---|
a | 0 | 0 | 1 | 4 |
a | 0 | 1 | 2 | 5 |
a | 0 | 2 | 3 | 6 |
a | 1 | 0 | 1 | 4 |
a | 1 | 1 | 2 | 55 |
a | 1 | 2 | 3 | 6 |
a | 2 | 0 | 1 | 42 |
a | 2 | 1 | 2 | 5 |
a | 2 | 2 | 3 | 6 |
b | 0 | 0 | 1 | 4 |
b | 0 | 1 | 2 | 5 |
b | 0 | 2 | 3 | 6 |
b | 1 | 0 | 1 | 4 |
b | 1 | 1 | 2 | 55 |
b | 1 | 2 | 3 | 6 |
b | 2 | 0 | 1 | 42 |
b | 2 | 1 | 2 | 5 |
b | 2 | 2 | 3 | 6 |
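A container with the same three-level structure can be put together with plain pandas. This is a sketch for illustration, reusing the level names and values from the example above; it is not sktime's own example generator.

```python
import pandas as pd

# levels: two hierarchy levels ("foo", "bar") followed by the time-like level
index = pd.MultiIndex.from_product(
    [["a", "b"], range(3), range(3)], names=["foo", "bar", "timepoints"]
)
X = pd.DataFrame(
    {"var_0": [1, 2, 3] * 6, "var_1": [4, 5, 6, 4, 55, 6, 42, 5, 6] * 2},
    index=index,
)

# fixing all non-time-like levels selects one hierarchy unit, indexed by time
unit = X.loc[("b", 1)]
print(unit)
```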
Section 2: validity checking and mtype conversion#
sktime’s datatypes module provides users with generic functionality for:
checking in-memory containers against mtype conventions, with informative error messages that help move data to the right format
converting different mtypes to each other, for a given scitype
In this section, this functionality and intended usage workflows are presented.
Section 2.1: Preparing data, checking in-memory containers for validity#
sktime’s datatypes module provides convenient functionality for users to check validity of their in-memory data containers, using the check_is_mtype and check_raise functions. Both functions provide generic validity checking functionality: check_is_mtype returns metadata and potential issues as return arguments, while check_raise directly produces informative error messages in case a container does not comply with a given mtype.
A recommended notebook workflow to ensure that a given data container is compliant with the sktime mtype specification is as follows:
1. load the data in an in-memory data container
2. identify the scitype, e.g., is this supposed to be a time series (Series) or a panel of time series (Panel)
3. select the target mtype (see Section 1 for a list), and attempt to manually reformat the data to comply with the mtype specification if it is not already compliant
4. run check_raise on the data container, to check whether it complies with the mtype and scitype
5. if an error is raised, repeat 3 and 4 until no error is raised
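The check-and-reformat loop in steps 3 to 5 can be sketched without sktime. In the snippet below, looks_like_np_series is a hypothetical stand-in written for this illustration; it only encodes the 2D rule of the "np.ndarray" convention from Section 1.1.3 and is not part of sktime.

```python
import numpy as np


def looks_like_np_series(obj):
    """Hypothetical stand-in for a validity check; not part of sktime."""
    if not isinstance(obj, np.ndarray):
        return False, f"obj must be a numpy.ndarray, found {type(obj).__name__}"
    if obj.ndim != 2:
        return False, f"obj must be 2D, found {obj.ndim}D"
    return True, None


y = np.array([1, 6, 3, 7, 2])             # step 1: data in memory
valid, msg = looks_like_np_series(y)      # step 4: check against the convention
if not valid:
    y = y.reshape(-1, 1)                  # step 3: reformat to rows = time points
    valid, msg = looks_like_np_series(y)  # step 5: check again

print(valid, y.shape)  # True (5, 1)
```

In an actual workflow, the stand-in would be replaced by check_raise or check_is_mtype, as in the examples that follow.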
Section 2.1.1: validity checking, example 1 (simple mistake)#
Suppose we have the following numpy.ndarray representing a univariate time series:
[12]:
import numpy as np
y = np.array([1, 6, 3, 7, 2])
To check compatibility with sktime:
(instruction: uncomment and run the code to see the informative error message)
[13]:
from sktime.datatypes import check_raise
# check_raise(y, mtype="np.ndarray")
This tells us that sktime uses 2D numpy arrays for time series, if the np.ndarray mtype is used. While most methods provide convenience functionality to do this coercion automatically, the “correct” format would be 2D as follows:
[14]:
check_raise(y.reshape(-1, 1), mtype="np.ndarray")
[14]:
True
For use in your own code, or to obtain additional metadata, the error message can be retrieved using the check_is_mtype function:
[15]:
from sktime.datatypes import check_is_mtype
check_is_mtype(y, mtype="np.ndarray", return_metadata=True)
[15]:
(True,
None,
{'is_empty': False,
'is_univariate': True,
'n_features': 1,
'feature_names': [0],
'dtypekind_dfip': [<DtypeKind.FLOAT: 2>],
'feature_kind': [<DtypeKind.FLOAT: 2>],
'is_equally_spaced': True,
'has_nans': False,
'mtype': 'np.ndarray',
'scitype': 'Series'})
Metadata is produced if the argument passes the validity check:
[16]:
check_is_mtype(y.reshape(-1, 1), mtype="np.ndarray", return_metadata=True)
[16]:
(True,
None,
{'is_empty': False,
'is_univariate': True,
'n_features': 1,
'feature_names': [0],
'dtypekind_dfip': [<DtypeKind.FLOAT: 2>],
'feature_kind': [<DtypeKind.FLOAT: 2>],
'is_equally_spaced': True,
'has_nans': False,
'mtype': 'np.ndarray',
'scitype': 'Series'})
Note: if the name of the mtype is ambiguous and can refer to multiple scitypes, the additional argument scitype must be provided. This should not be the case for any common in-memory containers; we mention this for completeness.
[17]:
check_is_mtype(y, mtype="np.ndarray", scitype="Series")
[17]:
True
Section 2.1.2: validity checking, example 2 (non-obvious mistake)#
Suppose we have converted our data into a multi-index panel, i.e., we want to have a Panel of mtype pd-multiindex.
[18]:
import pandas as pd
cols = ["instances", "time points"] + [f"var_{i}" for i in range(2)]
X = pd.concat(
[
pd.DataFrame([[0, 0, 1, 4], [0, 1, 2, 5], [0, 2, 3, 6]], columns=cols),
pd.DataFrame([[1, 0, 1, 4], [1, 1, 2, 55], [1, 2, 3, 6]], columns=cols),
pd.DataFrame([[2, 0, 1, 42], [2, 1, 2, 5], [2, 2, 3, 6]], columns=cols),
]
).set_index(["instances", "time points"])
It is not obvious whether X satisfies the pd-multiindex specification, so let’s check:
(instruction: uncomment and run the code to see the informative error message)
[19]:
from sktime.datatypes import check_raise
# check_raise(X, mtype="pd-multiindex")
The informative error message highlights a typo in one of the multi-index level names ("time points" instead of "timepoints"), so we rename the levels:
[20]:
X.index.names = ["instances", "timepoints"]
Now the validity check passes:
[21]:
check_raise(X, mtype="pd-multiindex")
[21]:
True
Section 2.1.3: inferring the mtype#
sktime also provides functionality to infer the mtype of an in-memory data container. This is useful when one is sure that the container is compliant but has forgotten the exact string, or when one would like to know whether an in-memory container is already in some supported, compliant format. For this, only the scitype needs to be specified:
[22]:
from sktime.datatypes import mtype
mtype(X, as_scitype="Panel")
[22]:
'pd-multiindex'
Section 2.2: conversion between mtypes#
sktime’s datatypes module also offers unified conversion functionality between mtypes. This is useful for users as well as for method developers.
The convert function requires specifying the mtype to convert from, and the mtype to convert to. The convert_to function only requires specifying the mtype to convert to, automatically inferring the mtype of the input if it can be inferred. convert_to should be used if the input can have multiple mtypes.
Section 2.2.1: simple conversion#
Example: converting a numpy3D panel of time series to pd-multiindex mtype:
[23]:
from sktime.datatypes import get_examples
X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[23]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 1, 2, 3],
[ 4, 55, 6]],
[[ 1, 2, 3],
[42, 5, 6]]])
[24]:
from sktime.datatypes import convert
convert(X, from_type="numpy3D", to_type="pd-multiindex")
[24]:
instances | timepoints | var_0 | var_1 |
---|---|---|---|
0 | 0 | 1 | 4 |
0 | 1 | 2 | 5 |
0 | 2 | 3 | 6 |
1 | 0 | 1 | 4 |
1 | 1 | 2 | 55 |
1 | 2 | 3 | 6 |
2 | 0 | 1 | 42 |
2 | 1 | 2 | 5 |
2 | 2 | 3 | 6 |
[25]:
from sktime.datatypes import convert_to
convert_to(X, to_type="pd-multiindex")
[25]:
instances | timepoints | var_0 | var_1 |
---|---|---|---|
0 | 0 | 1 | 4 |
0 | 1 | 2 | 5 |
0 | 2 | 3 | 6 |
1 | 0 | 1 | 4 |
1 | 1 | 2 | 55 |
1 | 2 | 3 | 6 |
2 | 0 | 1 | 42 |
2 | 1 | 2 | 5 |
2 | 2 | 3 | 6 |
Section 2.2.2: advanced conversion features#
convert_to also allows specifying multiple output types. The to_type argument can be a list of mtypes. In that case, the input is passed through unchanged if its mtype is on the list; if the mtype of the input is not on the list, it is converted to the mtype which is the first element of the list.
Example: converting a panel of time series to either "pd-multiindex" or "numpy3D". If the input is "numpy3D", it remains unchanged. If the input is "df-list", it is converted to "pd-multiindex".
[26]:
from sktime.datatypes import get_examples
X = get_examples(mtype="numpy3D", as_scitype="Panel")[0]
X
[26]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 1, 2, 3],
[ 4, 55, 6]],
[[ 1, 2, 3],
[42, 5, 6]]])
[27]:
from sktime.datatypes import convert_to
convert_to(X, to_type=["pd-multiindex", "numpy3D"])
[27]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 1, 2, 3],
[ 4, 55, 6]],
[[ 1, 2, 3],
[42, 5, 6]]])
[28]:
X = get_examples(mtype="df-list", as_scitype="Panel")[0]
X
[28]:
[ var_0 var_1
0 1 4
1 2 5
2 3 6,
var_0 var_1
0 1 4
1 2 55
2 3 6,
var_0 var_1
0 1 42
1 2 5
2 3 6]
[29]:
convert_to(X, to_type=["pd-multiindex", "numpy3D"])
[29]:
instances | timepoints | var_0 | var_1 |
---|---|---|---|
0 | 0 | 1 | 4 |
0 | 1 | 2 | 5 |
0 | 2 | 3 | 6 |
1 | 0 | 1 | 4 |
1 | 1 | 2 | 55 |
1 | 2 | 3 | 6 |
2 | 0 | 1 | 42 |
2 | 1 | 2 | 5 |
2 | 2 | 3 | 6 |
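The rule that decides the output mtype when to_type is a list can be written down in a few lines. The helper choose_output_mtype below is hypothetical and exists only to restate the rule described in this section; it is not sktime's implementation and performs no conversion itself.

```python
def choose_output_mtype(input_mtype, to_type):
    """Hypothetical helper restating the pass-through rule; not part of sktime."""
    if isinstance(to_type, str):
        # a single target mtype was requested
        return to_type
    if input_mtype in to_type:
        # input is already in one of the admissible mtypes: pass through
        return input_mtype
    # otherwise convert to the first mtype on the list
    return to_type[0]


targets = ["pd-multiindex", "numpy3D"]
print(choose_output_mtype("numpy3D", targets))  # numpy3D
print(choose_output_mtype("df-list", targets))  # pd-multiindex
```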
Section 2.2.3: inspecting implemented conversions#
Currently, conversions are work in progress, and not all possible conversions are available - contributions are welcome. To see which conversions are currently implemented for a scitype, use the _conversions_defined developer method from the datatypes._convert module. This produces a table with a “1” if conversion from the mtype in the row to the mtype in the column is implemented.
[30]:
from sktime.datatypes._convert import _conversions_defined
_conversions_defined(scitype="Panel")
[30]:
 | df-list | gluonts_ListDataset_panel | gluonts_PandasDataset_panel | nested_univ | numpy3D | numpyflat | pd-long | pd-multiindex | pd-wide |
---|---|---|---|---|---|---|---|---|---|
df-list | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
gluonts_ListDataset_panel | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
gluonts_PandasDataset_panel | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
nested_univ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
numpy3D | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
numpyflat | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
pd-long | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
pd-multiindex | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
pd-wide | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
Section 3: loading pre-defined data sets#
sktime’s datasets module allows loading datasets for testing and benchmarking. This includes:
example data sets that ship directly with sktime
downloaders for data sets from common repositories
All data retrieved in this way are in sktime-compatible in-memory and/or file formats.
Currently, no systematic tagging and registry retrieval for the available data sets is implemented - contributions to this would be very welcome.
Section 3.1: forecasting data sets#
sktime’s datasets module currently allows loading the following forecasting example data sets:
dataset name | loader function | properties |
---|---|---|
Box/Jenkins airline data | | univariate |
Lynx sales data | | univariate |
Shampoo sales data | | univariate |
Pharmaceutical Benefit Scheme data | | univariate |
Longley US macroeconomic data | | multivariate |
MTS consumption/income data | | multivariate |
sktime currently has no connectors to forecasting data repositories - contributions are much appreciated.
Forecasting data sets are all of Series scitype; they can be univariate or multivariate.
Loaders for univariate data have no arguments, and always return the data in the "pd.Series" mtype:
[31]:
from sktime.datasets import load_airline
load_airline()
[31]:
Period
1949-01 112.0
1949-02 118.0
1949-03 132.0
1949-04 129.0
1949-05 121.0
...
1960-08 606.0
1960-09 508.0
1960-10 461.0
1960-11 390.0
1960-12 432.0
Freq: M, Name: Number of airline passengers, Length: 144, dtype: float64
Loaders for multivariate data can be called in two ways:
without an argument; for load_longley, as the output below shows, this returns a pair consisting of a "pd.Series" (the variable TOTEMP) and a "pd.DataFrame" (the remaining variables):
[32]:
from sktime.datasets import load_longley
load_longley()
[32]:
(Period
1947 60323.0
1948 61122.0
1949 60171.0
1950 61187.0
1951 63221.0
1952 63639.0
1953 64989.0
1954 63761.0
1955 66019.0
1956 67857.0
1957 68169.0
1958 66513.0
1959 68655.0
1960 69564.0
1961 69331.0
1962 70551.0
Freq: A-DEC, Name: TOTEMP, dtype: float64,
GNPDEFL GNP UNEMP ARMED POP
Period
1947 83.0 234289.0 2356.0 1590.0 107608.0
1948 88.5 259426.0 2325.0 1456.0 108632.0
1949 88.2 258054.0 3682.0 1616.0 109773.0
1950 89.5 284599.0 3351.0 1650.0 110929.0
1951 96.2 328975.0 2099.0 3099.0 112075.0
1952 98.1 346999.0 1932.0 3594.0 113270.0
1953 99.0 365385.0 1870.0 3547.0 115094.0
1954 100.0 363112.0 3578.0 3350.0 116219.0
1955 101.2 397469.0 2904.0 3048.0 117388.0
1956 104.6 419180.0 2822.0 2857.0 118734.0
1957 108.4 442769.0 2936.0 2798.0 120445.0
1958 110.8 444546.0 4681.0 2637.0 121950.0
1959 112.6 482704.0 3813.0 2552.0 123366.0
1960 114.2 502601.0 3931.0 2514.0 125368.0
1961 115.7 518173.0 4806.0 2572.0 127852.0
1962 116.9 554894.0 4007.0 2827.0 130081.0)
with an argument y_name that must coincide with one of the column/variable names, in which case a pair of series y, X is returned, with y of "pd.Series" mtype, and X of "pd.DataFrame" mtype - this is convenient for univariate forecasting with exogenous variables.
[33]:
y, X = load_longley(y_name="TOTEMP")
[34]:
y
[34]:
Period
1947 60323.0
1948 61122.0
1949 60171.0
1950 61187.0
1951 63221.0
1952 63639.0
1953 64989.0
1954 63761.0
1955 66019.0
1956 67857.0
1957 68169.0
1958 66513.0
1959 68655.0
1960 69564.0
1961 69331.0
1962 70551.0
Freq: A-DEC, Name: TOTEMP, dtype: float64
[35]:
X
[35]:
Period | GNPDEFL | GNP | UNEMP | ARMED | POP |
---|---|---|---|---|---|
1947 | 83.0 | 234289.0 | 2356.0 | 1590.0 | 107608.0 |
1948 | 88.5 | 259426.0 | 2325.0 | 1456.0 | 108632.0 |
1949 | 88.2 | 258054.0 | 3682.0 | 1616.0 | 109773.0 |
1950 | 89.5 | 284599.0 | 3351.0 | 1650.0 | 110929.0 |
1951 | 96.2 | 328975.0 | 2099.0 | 3099.0 | 112075.0 |
1952 | 98.1 | 346999.0 | 1932.0 | 3594.0 | 113270.0 |
1953 | 99.0 | 365385.0 | 1870.0 | 3547.0 | 115094.0 |
1954 | 100.0 | 363112.0 | 3578.0 | 3350.0 | 116219.0 |
1955 | 101.2 | 397469.0 | 2904.0 | 3048.0 | 117388.0 |
1956 | 104.6 | 419180.0 | 2822.0 | 2857.0 | 118734.0 |
1957 | 108.4 | 442769.0 | 2936.0 | 2798.0 | 120445.0 |
1958 | 110.8 | 444546.0 | 4681.0 | 2637.0 | 121950.0 |
1959 | 112.6 | 482704.0 | 3813.0 | 2552.0 | 123366.0 |
1960 | 114.2 | 502601.0 | 3931.0 | 2514.0 | 125368.0 |
1961 | 115.7 | 518173.0 | 4806.0 | 2572.0 | 127852.0 |
1962 | 116.9 | 554894.0 | 4007.0 | 2827.0 | 130081.0 |
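The split into a target series y and exogenous variables X can be reproduced on any multivariate "pd.DataFrame" series with plain pandas. The sketch below uses made-up data and column names; it illustrates the resulting container types and is not the loader's implementation.

```python
import pandas as pd

# hypothetical multivariate series; column names and values are made up
data = pd.DataFrame(
    {"sales": [10.0, 12.0, 9.0], "price": [1.0, 0.9, 1.1], "promo": [0, 1, 0]},
    index=pd.period_range(start="2020-01", periods=3, freq="M"),
)

y_name = "sales"
y = data[y_name]                 # target, a pandas.Series named "sales"
X = data.drop(columns=[y_name])  # exogenous variables, a pandas.DataFrame

print(type(y).__name__, type(X).__name__)  # Series DataFrame
print(list(X.columns))                     # ['price', 'promo']
```

Both objects share the same time index, which is what forecasters with exogenous variables typically expect.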
Section 3.2: time series classification data sets#
sktime’s datasets module currently allows loading the following time series classification example data sets:
dataset name | loader function | properties |
---|---|---|
Appliance power consumption data | | univariate, equal length/index |
Arrowhead shape data | | univariate, equal length/index |
Gunpoint motion data | | univariate, equal length/index |
Italy power demand data | | univariate, equal length/index |
Japanese vowels data | | univariate, equal length/index |
OSUleaf leaf shape data | | univariate, equal length/index |
Basic motions data | | multivariate, equal length/index |
Currently, there are no unequal length or unequal index time series classification example data directly in sktime.
sktime also provides a full interface to the UCR/UEA time series data set archive, via the load_UCR_UEA_dataset function. The UCR/UEA archive also contains time series classification data sets which are multivariate, or unequal length/index (in either combination).
Section 3.2.2: time series classification data sets in sktime#
Time series classification data sets consist of a panel of time series of Panel scitype, together with classification labels, one per time series.
If a loader is invoked with minimal arguments, the data are returned as "nested_univ" mtype, with labels and series to classify in the same pd.DataFrame. Using the return_X_y=True argument, the data are returned separated into features X and labels y, with X a Panel of nested_univ mtype, and y an sklearn compatible numpy vector of labels:
[36]:
from sktime.datasets import load_arrow_head
X, y = load_arrow_head(return_X_y=True)
[37]:
X
[37]:
 | dim_0 |
---|---|
0 | 0 -1.963009 1 -1.957825 2 -1.95614... |
1 | 0 -1.774571 1 -1.774036 2 -1.77658... |
2 | 0 -1.866021 1 -1.841991 2 -1.83502... |
3 | 0 -2.073758 1 -2.073301 2 -2.04460... |
4 | 0 -1.746255 1 -1.741263 2 -1.72274... |
... | ... |
206 | 0 -1.625142 1 -1.622988 2 -1.62606... |
207 | 0 -1.657757 1 -1.664673 2 -1.63264... |
208 | 0 -1.603279 1 -1.587365 2 -1.57740... |
209 | 0 -1.739020 1 -1.741534 2 -1.73286... |
210 | 0 -1.630727 1 -1.629918 2 -1.62055... |
211 rows × 1 columns
[38]:
y
[38]:
array(['0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0',
'1', '2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '1',
'2', '0', '1', '2', '0', '1', '2', '0', '1', '2', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
'1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
'1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
'1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
'1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
'2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
'2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
'2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2',
'2', '2', '2'], dtype='<U1')
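Before modelling, it can help to check how many series fall into each class. The following is a minimal sketch in plain numpy; `y_demo` is a short hand-written stand-in for a label vector such as `y`, and its values are illustrative rather than taken from the data set above.

```python
import numpy as np

# stand-in for the label vector returned by a loader (illustrative values only)
y_demo = np.array(["0", "1", "2", "0", "0", "1"])

# unique class labels and the number of series per class
classes, counts = np.unique(y_demo, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))

print(class_counts)
```

The same two lines should work unchanged on the `y` returned by a loader, since it is a one-dimensional numpy array of labels.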
The panel can be converted from "nested_univ"
mtype to other mtype formats, using datatypes.convert
or convert_to
(see above):
[39]:
from sktime.datatypes import convert_to
convert_to(X, to_type="pd-multiindex")
[39]:
dim_0 | ||
---|---|---|
0 | 0 | -1.963009 |
1 | -1.957825 | |
2 | -1.956145 | |
3 | -1.938289 | |
4 | -1.896657 | |
... | ... | ... |
210 | 246 | -1.513637 |
247 | -1.550431 | |
248 | -1.581576 | |
249 | -1.595273 | |
250 | -1.620783 |
52961 rows × 1 columns
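The row count above is the number of series times the number of time points (211 × 251 = 52961), because the "pd-multiindex" layout stores one row per (instance, time point) pair. To make that layout concrete, here is a small sketch that builds such a frame by hand in plain pandas. The level names `instance` and `time` and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# toy panel: 3 instances, 4 time points each, one variable (arbitrary values)
values = np.arange(12, dtype=float).reshape(3, 4)
n_instances, n_timepoints = values.shape

# two-level row index: (instance, time point), with instances varying slowest
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instance", "time"]
)

# one column per variable, one row per (instance, time point) pair
panel_long = pd.DataFrame({"dim_0": values.ravel()}, index=index)

print(panel_long.loc[1])  # all time points of instance 1
```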
Data set loaders can be invoked with the split
parameter to obtain reproducible training and test sets for comparison across studies. If split="train"
, a pre-defined training set is retrieved; if split="test"
, a pre-defined test set is retrieved.
[40]:
X_train, y_train = load_arrow_head(return_X_y=True, split="train")
X_test, y_test = load_arrow_head(return_X_y=True, split="test")
# this retrieves training and test X/y for reproducible use in studies
Section 3.2.3: time series classification data sets from the UCR/UEA time series classification repository#
The load_UCR_UEA_dataset
utility will download datasets from the UCR/UEA time series classification repository and make them available as in-memory datasets, with the same syntax as sktime
native data set loaders.
Datasets are indexed by unique string identifiers, which can be inspected on the repository itself, or via the register in the datasets.tsc_dataset_names
module, by property:
[41]:
from sktime.datasets.tsc_dataset_names import univariate
The imported variables are all lists of strings which contain the unique string identifiers of datasets with certain properties, as follows:
register name | uni-/multivariate | equal/unequal length | with/without missing values |
---|---|---|---|
 | only univariate | both included | both included |
 | only multivariate | both included | both included |
 | only univariate | only equal length | both included |
 | only univariate | only unequal length | both included |
 | only univariate | both included | only with missing values |
 | only multivariate | only equal length | both included |
 | only multivariate | only unequal length | both included |
Lookup and retrieval using these lists is, admittedly, a bit inconvenient. Contributions to sktime
that add lookup functions similar to all_estimators
or all_tags
, based on capability or property tags attached to datasets, would be very much appreciated.
An example list is displayed below:
[42]:
univariate
[42]:
['ACSF1',
'Adiac',
'AllGestureWiimoteX',
'AllGestureWiimoteY',
'AllGestureWiimoteZ',
'ArrowHead',
'AsphaltObstacles',
'Beef',
'BeetleFly',
'BirdChicken',
'BME',
'Car',
'CBF',
'Chinatown',
'ChlorineConcentration',
'CinCECGTorso',
'Coffee',
'Computers',
'CricketX',
'CricketY',
'CricketZ',
'Crop',
'DiatomSizeReduction',
'DistalPhalanxOutlineCorrect',
'DistalPhalanxOutlineAgeGroup',
'DistalPhalanxTW',
'DodgerLoopDay',
'DodgerLoopGame',
'DodgerLoopWeekend',
'Earthquakes',
'ECG200',
'ECG5000',
'ECGFiveDays',
'ElectricDevices',
'EOGHorizontalSignal',
'EOGVerticalSignal',
'EthanolLevel',
'FaceAll',
'FaceFour',
'FacesUCR',
'FiftyWords',
'Fish',
'FordA',
'FordB',
'FreezerRegularTrain',
'FreezerSmallTrain',
'Fungi',
'GestureMidAirD1',
'GestureMidAirD2',
'GestureMidAirD3',
'GesturePebbleZ1',
'GesturePebbleZ2',
'GunPoint',
'GunPointAgeSpan',
'GunPointMaleVersusFemale',
'GunPointOldVersusYoung',
'Ham',
'HandOutlines',
'Haptics',
'Herring',
'HouseTwenty',
'InlineSkate',
'InsectEPGRegularTrain',
'InsectEPGSmallTrain',
'InsectWingbeatSound',
'ItalyPowerDemand',
'LargeKitchenAppliances',
'Lightning2',
'Lightning7',
'Mallat',
'Meat',
'MedicalImages',
'MelbournePedestrian',
'MiddlePhalanxOutlineCorrect',
'MiddlePhalanxOutlineAgeGroup',
'MiddlePhalanxTW',
'MixedShapesRegularTrain',
'MixedShapesSmallTrain',
'MoteStrain',
'NonInvasiveFetalECGThorax1',
'NonInvasiveFetalECGThorax2',
'OliveOil',
'OSULeaf',
'PhalangesOutlinesCorrect',
'Phoneme',
'PickupGestureWiimoteZ',
'PigAirwayPressure',
'PigArtPressure',
'PigCVP',
'PLAID',
'Plane',
'PowerCons',
'ProximalPhalanxOutlineCorrect',
'ProximalPhalanxOutlineAgeGroup',
'ProximalPhalanxTW',
'RefrigerationDevices',
'Rock',
'ScreenType',
'SemgHandGenderCh2',
'SemgHandMovementCh2',
'SemgHandSubjectCh2',
'ShakeGestureWiimoteZ',
'ShapeletSim',
'ShapesAll',
'SmallKitchenAppliances',
'SmoothSubspace',
'SonyAIBORobotSurface1',
'SonyAIBORobotSurface2',
'StarLightCurves',
'Strawberry',
'SwedishLeaf',
'Symbols',
'SyntheticControl',
'ToeSegmentation1',
'ToeSegmentation2',
'Trace',
'TwoLeadECG',
'TwoPatterns',
'UMD',
'UWaveGestureLibraryAll',
'UWaveGestureLibraryX',
'UWaveGestureLibraryY',
'UWaveGestureLibraryZ',
'Wafer',
'Wine',
'WordSynonyms',
'Worms',
'WormsTwoClass']
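Since the registers are plain lists of strings, a property-based lookup can be approximated with set operations. The sketch below is one possible approach, not an existing sktime function: `lookup_datasets` is a hypothetical helper, and the toy registers contain made-up dataset names standing in for the imported lists.

```python
def lookup_datasets(registers, include=(), exclude=()):
    """Return sorted names found in all `include` registers and no `exclude` register."""
    if not include:
        raise ValueError("at least one register name is required in `include`")
    selected = set(registers[include[0]])
    for key in include[1:]:
        selected &= set(registers[key])  # keep names present in every included list
    for key in exclude:
        selected -= set(registers[key])  # drop names present in any excluded list
    return sorted(selected)


# toy registers with made-up dataset names; in practice the values would be
# the lists imported from the register module
toy_registers = {
    "univariate": ["ToyA", "ToyB", "ToyC"],
    "equal_length": ["ToyA", "ToyC", "ToyD"],
    "missing_values": ["ToyC"],
}

result = lookup_datasets(
    toy_registers, include=["univariate", "equal_length"], exclude=["missing_values"]
)
print(result)  # ['ToyA']
```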
The loader function load_UCR_UEA_dataset
behaves exactly like sktime
data loaders, with an additional argument name
that should be set to one of the unique identifying strings for the UCR/UEA datasets, for instance:
[43]:
from sktime.datasets import load_UCR_UEA_dataset
X, y = load_UCR_UEA_dataset(name="ArrowHead", return_X_y=True)
This will download the dataset into a local directory (by default: for a local clone, the datasets/data
directory in the local repository; for a release install, in the local python environment folder). To change that directory, specify it using the extract_path
argument of the load_UCR_UEA_dataset
function.
Section 4: loading data from csv
files#
This section shows how to load some common tabular csv
formats into sktime
compatible containers.
We’ll cover:
converting series datasets to sktime compatible containers
converting panel datasets to sktime compatible containers
We assume that all csv files are in some tabular format.
This means that the csv
file contains columns for the time index, or for the instance index in the case of panel data, or is in a wide tabular format.
Note: at every step, we could use check_is_mtype
to check against the target format. A reader may like to do so.
Section 4.1: simple time series example#
[44]:
import pandas as pd
df_series = pd.read_csv("../sktime/datasets/data/Airline/Airline.csv")
df_series.head()
[44]:
Date | Passengers | |
---|---|---|
0 | 1949-01 | 112 |
1 | 1949-02 | 118 |
2 | 1949-03 | 132 |
3 | 1949-04 | 129 |
4 | 1949-05 | 121 |
[45]:
df_series = df_series.set_index("Date").squeeze()
# replace "Date" with the column name of the time index
df_series.index = pd.DatetimeIndex(df_series.index)
[46]:
mtype(df_series, as_scitype="Series")
[46]:
'pd.Series'
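The same steps can be tried without the bundled file. The sketch below reads a three-row in-memory csv with the same two columns; the explicit `format="%Y-%m"` is an assumption about how the dates are written and would need adjusting for other files.

```python
import io

import pandas as pd

# in-memory stand-in for a csv file with a time column and a value column
csv_text = "Date,Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

df = pd.read_csv(io.StringIO(csv_text))

# use the time column as index and keep the single value column as a pd.Series
series = df.set_index("Date")["Passengers"]

# parse the year-month strings into timestamps (first day of each month)
series.index = pd.to_datetime(series.index, format="%Y-%m")

print(type(series.index).__name__, len(series))
```

Selecting the value column by name, rather than calling `squeeze`, keeps the result a `pd.Series` even when the file has only one row.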
Section 4.2: easy panel data example#
[47]:
# mimicking a scenario where we already have a csv file in the right format
from sktime.datasets import load_arrow_head
df_panel = load_arrow_head(split="TRAIN", return_type="pd-multiindex")[0].reset_index()
# imagine this is the result of df_panel = pd.read_csv
[48]:
df_panel.head()
[48]:
level_0 | level_1 | dim_0 | |
---|---|---|---|
0 | 0 | 0 | -1.963009 |
1 | 0 | 1 | -1.957825 |
2 | 0 | 2 | -1.956145 |
3 | 0 | 3 | -1.938289 |
4 | 0 | 4 | -1.896657 |
This is similar to the pd-multiindex format, so we convert it to that format.
The only change needed is to set the instance and time columns as the index:
[49]:
df_panel = df_panel.set_index(["level_0", "level_1"])
type(df_panel.index)
[49]:
pandas.core.indexes.multi.MultiIndex
This is now recognized by sktime
as being in pd-multiindex format:
[50]:
mtype(df_panel, as_scitype="Panel")
# in general, in the set_index call above:
# replace "level_1" with the time index column name
# replace "level_0" with the instance index column name
[50]:
'pd-multiindex'
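For a file that is already in long format, the whole workflow is a read followed by `set_index`. The sketch below uses an in-memory csv with hypothetical column names `series_id`, `timepoint` and `value`. The `sort_index` call is a precaution, on the assumption that time points should appear in increasing order within each instance.

```python
import io

import pandas as pd

# in-memory stand-in for a long-format panel csv (hypothetical column names)
csv_text = (
    "series_id,timepoint,value\n"
    "0,0,1.0\n0,1,1.5\n0,2,2.0\n"
    "1,0,3.0\n1,1,3.5\n1,2,4.0\n"
)

panel = pd.read_csv(io.StringIO(csv_text))

# instance index first, time index second, then sort rows by (instance, time)
panel = panel.set_index(["series_id", "timepoint"]).sort_index()

print(panel.index.nlevels, panel.index.is_monotonic_increasing)
```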
Section 4.3: difficult panel data example#
We now try to load panel data from a file where the format is a bit challenging.
In the below file:
the separator is not the default (comma) but tab. For this, we set
sep="\t"
there is no header in the file
there is no instance index, so we need to add it
the layout differs from what
sktime
expects: the first column holds the variable index, and the remaining columns are time points
These or similar challenges are common in csv
files for panel data, so we show below how to address them.
It is advised to bring the data to either the plain pandas
or numpy
based format. In the below case, we will bring the data to pd-multiindex
format.
[51]:
# 1, 2 - dealing with the separator and header
import pandas as pd
df_panel = pd.read_csv(
"../sktime/datasets/data/ArrowHead/ArrowHead_TRAIN.tsv",
sep="\t",
header=None,
)
df_panel.head()
[51]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | -1.963009 | -1.957825 | -1.956145 | -1.938289 | -1.896657 | -1.869857 | -1.838705 | -1.812289 | -1.736433 | ... | -1.583857 | -1.655329 | -1.719153 | -1.750881 | -1.796273 | -1.841345 | -1.884289 | -1.905393 | -1.923905 | -1.909153 |
1 | 1 | -1.774571 | -1.774036 | -1.776586 | -1.730749 | -1.696268 | -1.657377 | -1.636227 | -1.609807 | -1.543439 | ... | -1.471688 | -1.484666 | -1.539972 | -1.590150 | -1.635663 | -1.639989 | -1.678683 | -1.729227 | -1.775670 | -1.789324 |
2 | 2 | -1.866021 | -1.841991 | -1.835025 | -1.811902 | -1.764390 | -1.707687 | -1.648280 | -1.582643 | -1.531502 | ... | -1.584132 | -1.652337 | -1.684565 | -1.743972 | -1.799117 | -1.829069 | -1.875828 | -1.862512 | -1.863368 | -1.846493 |
3 | 0 | -2.073758 | -2.073301 | -2.044607 | -2.038346 | -1.959043 | -1.874494 | -1.805619 | -1.731043 | -1.712653 | ... | -1.678942 | -1.743732 | -1.819801 | -1.858136 | -1.886146 | -1.951247 | -2.012927 | -2.026963 | -2.073405 | -2.075292 |
4 | 1 | -1.746255 | -1.741263 | -1.722741 | -1.698640 | -1.677223 | -1.630356 | -1.579440 | -1.551225 | -1.473980 | ... | -1.547111 | -1.607101 | -1.635137 | -1.686346 | -1.691274 | -1.716886 | -1.740726 | -1.743442 | -1.762729 | -1.763428 |
5 rows × 252 columns
[52]:
# 3 - adding an instance index manually, assuming three consecutive rows (one per variable) per instance
import numpy as np
df_panel["instance"] = np.repeat(range(len(df_panel) // 3), 3)
Now we bring the data into long format by depivoting:
[53]:
# 4 - give the time point columns a common "value" prefix
df_panel.columns = ["var"] + [f"value{i}" for i in range(251)] + ["instance"]
# 4 - move to long format
df_panel = pd.wide_to_long(df_panel, "value", i=["var", "instance"], j="time")
[54]:
df_panel.head()
[54]:
value | |||
---|---|---|---|
var | instance | time | |
0 | 0 | 0 | -1.963009 |
1 | -1.957825 | ||
2 | -1.956145 | ||
3 | -1.938289 | ||
4 | -1.896657 |
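The depivoting step is easier to follow on a table small enough to read in full. The sketch below applies `pd.wide_to_long` to a toy frame with two variables, two instances and three time points; the values are made up, and the column naming follows the `value<i>` pattern used above.

```python
import pandas as pd

# toy wide table: one row per (var, instance); value0..value2 are time points
wide = pd.DataFrame(
    {
        "var": [0, 1, 0, 1],
        "instance": [0, 0, 1, 1],
        "value0": [1.0, 10.0, 4.0, 40.0],
        "value1": [2.0, 20.0, 5.0, 50.0],
        "value2": [3.0, 30.0, 6.0, 60.0],
    }
)

# the stub name "value" is stripped from the column names, and the numeric
# suffix becomes a new index level called "time"
long_df = pd.wide_to_long(
    wide, stubnames="value", i=["var", "instance"], j="time"
).sort_index()

print(list(long_df.index.names))  # ['var', 'instance', 'time']
print(long_df.loc[(1, 0, 2), "value"])  # 30.0
```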
Now the “var” index is in the rows, but it should be in the columns:
[55]:
# 4 - move variable index to columns
df_panel = df_panel.reset_index("var")
df_panel = df_panel.pivot(columns="var", values="value")
[56]:
df_panel.head()
[56]:
var | 0 | 1 | 2 | |
---|---|---|---|---|
instance | time | |||
0 | 0 | -1.963009 | -1.774571 | -1.866021 |
1 | -1.957825 | -1.774036 | -1.841991 | |
2 | -1.956145 | -1.776586 | -1.835025 | |
3 | -1.938289 | -1.730749 | -1.811902 | |
4 | -1.896657 | -1.696268 | -1.764390 |
This is now in the pd-multiindex format:
[57]:
mtype(df_panel, as_scitype="Panel")
[57]:
'pd-multiindex'
An alternative route would be to remove the index and use reshapes in numpy
.
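A minimal sketch of the numpy route, under the same row-ordering assumption as above (consecutive rows belong to the same instance, one row per variable). The toy sizes and values are made up. The resulting axis order (instance, variable, time) is intended to match the 3D numpy panel convention, which is worth confirming with the validity checkers from Section 2.

```python
import numpy as np

n_instances, n_vars, n_timepoints = 2, 3, 4

# toy wide array: rows are ordered instance by instance, with one row per
# variable inside each instance; columns are time points
wide = np.arange(n_instances * n_vars * n_timepoints, dtype=float).reshape(
    n_instances * n_vars, n_timepoints
)

# split the row axis into (instance, variable); the time axis stays last
panel_3d = wide.reshape(n_instances, n_vars, n_timepoints)

print(panel_3d.shape)  # (2, 3, 4)
```

Because the reshape only splits the leading axis, no data are copied or reordered: `panel_3d[i, v]` is row `i * n_vars + v` of the wide array.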