Loading data into sktime#

This tutorial outlines time series related file formats and how to load data into sktime.

Users can load or convert data into sktime compatible formats in two main ways:

pathway 1: direct loading from time series formats. Data can be loaded directly from a bespoke time series storage format, for instance .ts (see Representing data with .ts files) or other supported file formats, such as Weka ARFF and .tsv (see other existing data sources).
pathway 2: loading in-memory, then conversion. sktime provides functions to convert between common in-memory representations; see the AA_datatypes_and_datasets tutorial. Hence, data can be loaded via any loader utility (e.g., pandas.read_csv) first, then converted manually into one of the sktime compatible specifications, and then converted between specifications using the sktime convert or convert_to utility.
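As a rough illustration of pathway 2, the sketch below reads a small long-format CSV with pandas and indexes it by instance and time point. The column names (instance, time, dim_0, dim_1) and the toy values are assumptions made for this example only; the two-level (instance, time) index is intended to resemble the multi-indexed panel layout described in the AA_datatypes_and_datasets tutorial, which remains the authoritative reference for the exact specifications.

```python
import io

import pandas as pd

# Toy long-format data: one row per (instance, time point).
# Column names are hypothetical, chosen for this sketch only.
csv_text = """instance,time,dim_0,dim_1
0,0,2,4
0,1,3,3
1,0,13,22
1,1,12,23
"""

# Step 1: load with a generic loader utility.
long_df = pd.read_csv(io.StringIO(csv_text))

# Step 2: convert manually to a DataFrame with a two-level
# (instance, time) row index and one column per dimension.
panel = long_df.set_index(["instance", "time"]).sort_index()

print(panel.index.nlevels)  # number of index levels
print(panel.shape)          # (rows, dimensions)
```

A container prepared this way could then be passed to the convert or convert_to utility to obtain another representation, provided it matches one of the supported specifications.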

The rest of this tutorial provides descriptions of pathway 1, i.e., how to load data from supported file formats.

## The .ts file format

One common use case is to load locally stored data. To make this easy, the .ts file format has been created for representing problems in a standard format for use with sktime.

Representing data with .ts files#

A .ts file includes two main parts:

* header information
* data

The header information is used to facilitate simple representation of the data through including metadata about the structure of the problem. The header contains the following:

@problemName <problem name>
@timeStamps <true/false>
@univariate <true/false>
@classLabel <true/false> <space delimited list of possible class values>
@data
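To make the header layout concrete, here is a minimal sketch that collects the tags into a dictionary. The function name parse_ts_header and the toy header values are hypothetical; this is not the parser sktime uses internally, only an illustration of the tag structure listed above.

```python
def parse_ts_header(lines):
    """Collect '@tag value ...' lines into a dict, stopping at @data.

    Illustrative only: tag names are lower-cased and values are kept
    as lists of strings.
    """
    header = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if not line.startswith("@"):
            break  # anything else means the header is over
        tag, *values = line[1:].split()
        tag = tag.lower()
        if tag == "data":
            break  # the data section starts after this tag
        header[tag] = values
    return header


# Toy header following the layout above (values are made up).
example = [
    "@problemName Toy",
    "@timeStamps false",
    "@univariate false",
    "@classLabel true 0 1 2",
    "@data",
    "2,3,2,4:4,3,2,2:1",
]
header = parse_ts_header(example)
print(header)
```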

The data for the problem should begin after the @data tag. In the simplest case, where @timeStamps is false, values for a series are expressed in a comma-separated list and the index of each value is given by its position in the list (0, 1, …, m). An instance may contain one or more dimensions; instances are line-delimited and dimensions within an instance are colon (:) delimited. For example:

2,3,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2

This example data has 3 instances, corresponding to the three lines shown above. Each instance has 2 dimensions with 4 observations per dimension. For example, the initial instance’s first dimension has the timepoint values of 2, 3, 2, 4 and the second dimension has the values 4, 3, 2, 2.
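The delimiter rules can be sketched in a few lines of plain Python. The helper name parse_dense_instance is hypothetical and the code only illustrates the layout (lines for instances, colons for dimensions, commas for values); it is not sktime's loader.

```python
def parse_dense_instance(line):
    """Split one data line into dimensions (':') and values (',')."""
    return [
        [float(value) for value in dimension.split(",")]
        for dimension in line.strip().split(":")
    ]


lines = [
    "2,3,2,4:4,3,2,2",
    "13,12,32,12:22,23,12,32",
    "4,4,5,4:3,2,3,2",
]
instances = [parse_dense_instance(line) for line in lines]

# instances -> dimensions -> observations
print(len(instances), len(instances[0]), len(instances[0][0]))
print(instances[0][0])  # first dimension of the first instance
```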

Missing readings can be specified using ?. For example,

2,?,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2

would indicate the second timepoint value of the initial instance’s first dimension is missing.
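One way to handle the ? marker when reading such a line by hand is to map it to NaN. The sketch below uses a hypothetical helper, parse_values, and is only an illustration of the idea, not a description of how sktime treats missing values internally.

```python
import math


def parse_values(dimension):
    """Parse a comma-separated dimension, mapping '?' to NaN."""
    return [
        float("nan") if token.strip() == "?" else float(token)
        for token in dimension.split(",")
    ]


first_dim = parse_values("2,?,2,4")
print(first_dim)
print(math.isnan(first_dim[1]))  # the second time point is missing
```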

Alternatively, for sparse datasets, readings can be specified by setting @timeStamps to true in the header and representing the data as (timestamp, value) tuples for the observed values only. For example, the first instance of the first example above could be specified in this representation as:

(0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Similarly, the sparser example

2,5,?,?,?,?,?,5,?,?,?,?,4

could be represented with just the non-missing timestamps as:

(0,2),(1,5),(7,5),(12,4)
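The correspondence between the two representations can be checked with a short sketch. It assumes integer timestamps that act as positions, as in the example above; the names PAIR, parse_timestamped and to_dense are hypothetical helpers written for this illustration.

```python
import re

# Matches one "(timestamp,value)" tuple.
PAIR = re.compile(r"\(\s*([^,()]+)\s*,\s*([^,()]+)\s*\)")


def parse_timestamped(dimension):
    """Return [(timestamp, value), ...] from '(t,v),(t,v),...'."""
    return [(int(t), float(v)) for t, v in PAIR.findall(dimension)]


def to_dense(pairs, length):
    """Place each value at its timestamp; unobserved slots stay None."""
    dense = [None] * length
    for timestamp, value in pairs:
        dense[timestamp] = value
    return dense


pairs = parse_timestamped("(0,2),(1,5),(7,5),(12,4)")
dense = to_dense(pairs, length=13)

# Render back to the comma-separated form, using '?' for gaps.
as_text = ",".join("?" if v is None else format(v, "g") for v in dense)
print(as_text)  # 2,5,?,?,?,?,?,5,?,?,?,?,4
```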

When using the .ts file format to store data for timeseries classification problems, the class label for an instance should be specified in the last dimension, and @classLabel should be set to true in the header, followed by the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1, it would be specified as:

1,4,23,34:1
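Reading such a line by hand amounts to treating the final colon-delimited field as the label. The sketch below assumes @classLabel is true and uses a hypothetical helper, split_label; the label is kept as a string here purely as a choice for this illustration.

```python
def split_label(line):
    """Separate the series dimensions from the trailing class label."""
    *dimensions, label = line.strip().split(":")
    series = [[float(v) for v in dim.split(",")] for dim in dimensions]
    return series, label


series, label = split_label("1,4,23,34:1")
print(series)  # one dimension with four observations
print(label)   # the class value, kept as text
```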

Loading from .ts file to pandas DataFrame#

A dataset can be loaded from a .ts file using the following method in sktime.datasets:

load_from_tsfile(full_file_path_and_name, replace_missing_vals_with='NaN')

This can be demonstrated using the ArrowHead problem that is included in sktime under sktime/datasets/data:

[1]:

import os

import sktime
from sktime.datasets import load_from_tsfile

DATA_PATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")

train_x, train_y = load_from_tsfile(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.ts")
)
test_x, test_y = load_from_tsfile(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TEST.ts")
)

Train and test partitions of the ArrowHead problem have been loaded into nested dataframes (the nested_univ format for panel data), each with an associated array of class values. As an example, below are the first 5 rows of train_x and train_y:

[2]:

train_x.head()

[2]:

	dim_0
0	0 -1.963009 1 -1.957825 2 -1.95614...
1	0 -1.774571 1 -1.774036 2 -1.77658...
2	0 -1.866021 1 -1.841991 2 -1.83502...
3	0 -2.073758 1 -2.073301 2 -2.04460...
4	0 -1.746255 1 -1.741263 2 -1.72274...

[3]:

train_y[0:5]

[3]:

array(['0', '1', '2', '0', '1'], dtype='<U1')

The format of the loaded data can be controlled by the return_data_type argument.

Allowed strings are identifier strings for sktime compatible Panel mtypes, as introduced in the “datatypes and datasets” tutorial (AA_datatypes_and_datasets).

If provided, the loaded in-memory data container will comply with that type specification.

[4]:

train_x, train_y = load_from_tsfile(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.ts"), return_data_type="numpy3d"
)

[5]:

train_x

[5]:

array([[[-1.9630089, -1.9578249, -1.9561449, ..., -1.9053929,
         -1.9239049, -1.9091529]],

       [[-1.7745713, -1.7740359, -1.7765863, ..., -1.7292269,
         -1.7756704, -1.7893245]],

       [[-1.8660211, -1.8419912, -1.8350253, ..., -1.8625124,
         -1.8633682, -1.8464925]],

       ...,

       [[-2.1308119, -2.1044297, -2.0747549, ..., -2.0340977,
         -2.0800313, -2.103448 ]],

       [[-1.8803376, -1.8626622, -1.8496866, ..., -1.8485336,
         -1.8640342, -1.8798851]],

       [[-1.80105  , -1.7989155, -1.7783754, ..., -1.7965491,
         -1.7985443, -1.80105  ]]])

## Loading other file formats

Researchers who have made timeseries data available have used two other common formats:

Weka ARFF files
UCR .tsv files

Loading from Weka ARFF files#

It is also possible to load data from Weka’s attribute-relation file format (ARFF) files. Data for timeseries problems are made available in this format by researchers at the University of East Anglia (among others) at www.timeseriesclassification.com. The load_from_arff_to_dataframe method in sktime.datasets supports reading data for both univariate and multivariate timeseries problems.

The univariate functionality is demonstrated below using data on the ArrowHead problem again (this time loading from ARFF file).

[4]:

from sktime.datasets import load_from_arff_to_dataframe

X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff")
)
X.head()

[4]:

	dim_0
0	0 -1.963009 1 -1.957825 2 -1.95614...
1	0 -1.774571 1 -1.774036 2 -1.77658...
2	0 -1.866021 1 -1.841991 2 -1.83502...
3	0 -2.073758 1 -2.073301 2 -2.04460...
4	0 -1.746255 1 -1.741263 2 -1.72274...

The multivariate BasicMotions problem is used below to illustrate the ability to read multivariate timeseries data from ARFF files into the sktime format.

[5]:

X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "BasicMotions/BasicMotions_TRAIN.arff")
)
X.head()

[5]:

	dim_0	dim_1	dim_2	dim_3	dim_4	dim_5
0	0 0.079106 1 0.079106 2 -0.903497 3...	0 0.394032 1 0.394032 2 -3.666397 3...	0 0.551444 1 0.551444 2 -0.282844 3...	0 0.351565 1 0.351565 2 -0.095881 3...	0 0.023970 1 0.023970 2 -0.319605 3...	0 0.633883 1 0.633883 2 0.972131 3...
1	0 0.377751 1 0.377751 2 2.952965 3...	0 -0.610850 1 -0.610850 2 0.970717 3...	0 -0.147376 1 -0.147376 2 -5.962515 3...	0 -0.103872 1 -0.103872 2 -7.593275 3...	0 -0.109198 1 -0.109198 2 -0.697804 3...	0 -0.037287 1 -0.037287 2 -2.865789 3...
2	0 -0.813905 1 -0.813905 2 -0.424628 3...	0 0.825666 1 0.825666 2 -1.305033 3...	0 0.032712 1 0.032712 2 0.826170 3...	0 0.021307 1 0.021307 2 -0.372872 3...	0 0.122515 1 0.122515 2 -0.045277 3...	0 0.775041 1 0.775041 2 0.383526 3...
3	0 0.289855 1 0.289855 2 -0.669185 3...	0 0.284130 1 0.284130 2 -0.210466 3...	0 0.213680 1 0.213680 2 0.252267 3...	0 -0.314278 1 -0.314278 2 0.018644 3...	0 0.074574 1 0.074574 2 0.007990 3...	0 -0.079901 1 -0.079901 2 0.237040 3...
4	0 -0.123238 1 -0.123238 2 -0.249547 3...	0 0.379341 1 0.379341 2 0.541501 3...	0 -0.286006 1 -0.286006 2 0.208420 3...	0 -0.098545 1 -0.098545 2 -0.023970 3...	0 0.058594 1 0.058594 2 0.175783 3...	0 -0.074574 1 -0.074574 2 0.114525 3...

Loading from UCR .tsv Format Files#

A further option is to load data into sktime from tab-separated value (.tsv) files. Researchers at the University of California, Riverside make a variety of timeseries data available in this format at https://www.cs.ucr.edu/~eamonn/time_series_data_2018.

The load_from_ucr_tsv_to_dataframe method in sktime.datasets supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from the .ts and ARFF file formats.

[6]:

from sktime.datasets import load_from_ucr_tsv_to_dataframe

X, y = load_from_ucr_tsv_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv")
)
X.head()

[6]:

	dim_0
0	0 -1.963009 1 -1.957825 2 -1.95614...
1	0 -1.774571 1 -1.774036 2 -1.77658...
2	0 -1.866021 1 -1.841991 2 -1.83502...
3	0 -2.073758 1 -2.073301 2 -2.04460...
4	0 -1.746255 1 -1.741263 2 -1.72274...
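For readers who want to inspect such a file without sktime, the sketch below reads tab-separated rows with the standard library. It assumes the usual UCR layout of one univariate series per row with the class label in the first column; the two rows are toy values made up for this illustration, not ArrowHead data.

```python
import csv
import io

# Toy stand-in for a UCR-style .tsv file (values are made up).
tsv_text = "0\t1.0\t2.0\t3.0\n1\t4.0\t5.0\t6.0\n"

labels, series = [], []
for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
    labels.append(row[0])                       # first column: class label
    series.append([float(v) for v in row[1:]])  # remaining columns: values

print(labels)
print(series)
```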

## Converting between other NumPy and pandas formats

To convert loaded formats to other sktime internal formats, use the convert or convert_to utilities in the sktime.datatypes module. For more details, see the tutorial AA_datatypes_and_datasets.
