evaluate

evaluate(forecaster, cv, y, X=None, strategy: str = 'refit', scoring: callable | List[callable] | None = None, return_data: bool = False, error_score: str | int | float = nan, backend: str | None = None, compute: bool = True, cv_X=None, **kwargs)

Evaluate forecaster using time series cross-validation.

All-in-one statistical performance benchmarking utility for forecasters which runs a simple backtest experiment and returns a summary pd.DataFrame.

The experiment run is the following:

Denote by \(y_{train, 1}, y_{test, 1}, \dots, y_{train, K}, y_{test, K}\) the train/test folds produced by the generator cv.split_series(y). Denote by \(X_{train, 1}, X_{test, 1}, \dots, X_{train, K}, X_{test, K}\) the train/test folds produced by the generator cv_X.split_series(X) (if X is None, consider these to be None as well).

0. Set i = 1.
1. Fit the forecaster to \(y_{train, 1}\), \(X_{train, 1}\), with a fh to forecast \(y_{test, 1}\).
2. Compute y_pred = forecaster.predict (or predict_proba or predict_quantiles, depending on scoring) with exogenous data \(X_{test, i}\).
3. Compute scoring on y_pred versus \(y_{test, i}\).
4. If i == K, terminate; otherwise continue.
5. Set i = i + 1.
6. Ingest more data \(y_{train, i}\), \(X_{train, i}\). How this is done depends on strategy:
   * if strategy == "refit", reset and fit the forecaster via fit, on \(y_{train, i}\), \(X_{train, i}\), to forecast \(y_{test, i}\)
   * if strategy == "update", update the forecaster via update, on \(y_{train, i}\), \(X_{train, i}\), to forecast \(y_{test, i}\)
   * if strategy == "no-update_params", forward the forecaster via update, with argument update_params=False, to the cutoff of \(y_{train, i}\)
7. Go to step 2.

Results returned by this function are:

* results of scoring calculations, from step 3, in the i-th loop
* runtimes for fitting and/or predicting, from steps 1, 2, 6, in the i-th loop
* cutoff state of forecaster, at step 2, in the i-th loop
* \(y_{train, i}\), \(y_{test, i}\), y_pred (optional)
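The loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sktime's implementation: MeanForecaster, expanding_folds, mean_absolute_error and backtest are hypothetical names, the toy update receives the full training window rather than only new observations, and exogenous data X is omitted.

```python
from statistics import mean


class MeanForecaster:
    """Hypothetical toy forecaster: predicts the mean of the data it was fitted on."""

    def fit(self, y_train):
        self.level_ = mean(y_train)
        self.cutoff_ = len(y_train) - 1
        return self

    def update(self, y_train, update_params=True):
        # Simplification: receives the whole training window, not only new points.
        self.cutoff_ = len(y_train) - 1  # the cutoff always moves forward
        if update_params:
            self.level_ = mean(y_train)  # parameters change only if requested
        return self

    def predict(self, n_steps):
        return [self.level_] * n_steps


def expanding_folds(y, initial_window, step_length, horizon):
    """Yield (train, test) pairs with a growing training window."""
    end = initial_window
    while end + horizon <= len(y):
        yield y[:end], y[end:end + horizon]
        end += step_length


def mean_absolute_error(y_test, y_pred):
    return mean(abs(a - b) for a, b in zip(y_test, y_pred))


def backtest(forecaster, folds, strategy="refit"):
    """Run the fit/predict/score loop and collect one result row per fold."""
    rows = []
    for i, (y_train, y_test) in enumerate(folds):
        if i == 0 or strategy == "refit":
            forecaster.fit(y_train)
        elif strategy == "update":
            forecaster.update(y_train)
        elif strategy == "no-update_params":
            forecaster.update(y_train, update_params=False)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        y_pred = forecaster.predict(len(y_test))
        rows.append({
            "test_score": mean_absolute_error(y_test, y_pred),
            "len_train_window": len(y_train),
            "cutoff": forecaster.cutoff_,
        })
    return rows


folds = list(expanding_folds([1, 2, 3, 4, 5, 6, 7, 8], 4, 2, 2))
refit_rows = backtest(MeanForecaster(), folds, strategy="refit")
frozen_rows = backtest(MeanForecaster(), folds, strategy="no-update_params")
print([r["test_score"] for r in refit_rows])   # [3.0, 4.0]
print([r["test_score"] for r in frozen_rows])  # [3.0, 5.0]
```

In this toy setup the "no-update_params" run scores worse on the second fold because the forecaster keeps the mean learned on the first training window while the series keeps rising.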

A distributed and/or parallel back-end can be chosen via the backend parameter.

Parameters:

forecaster : sktime BaseForecaster descendant (concrete forecaster)
    sktime forecaster to benchmark.
cv : sktime BaseSplitter descendant
    Determines the split of y, and possibly X, into test and train folds.
    y is always split according to cv, see above.
    If cv_X is not passed, X splits are subset to loc equal to y.
    If cv_X is passed, X is split according to cv_X.
y : sktime time series container
    Target (endogenous) time series used in the evaluation experiment.
X : sktime time series container, of same mtype as y
    Exogenous time series used in the evaluation experiment.
strategy : {"refit", "update", "no-update_params"}, optional, default="refit"
    Defines the ingestion mode when the forecaster sees new data as the window expands.
    "refit" = forecaster is refitted to each training window.
    "update" = forecaster is updated with training window data, in the sequence provided.
    "no-update_params" = fit to first training window, re-used without fit or update.
scoring : subclass of sktime.performance_metrics.BaseMetric or list of same, default=None
    Used to get a score function that takes y_pred and y_test arguments and accepts y_train as a keyword argument. If None, uses scoring = MeanAbsolutePercentageError(symmetric=True).
return_data : bool, default=False
    If True, returns three additional columns in the DataFrame. Each cell of these columns contains a pd.Series for y_train, y_pred, y_test.
error_score : "raise" or numeric, default=np.nan
    Value to assign to the score if an exception occurs in estimator fitting. If set to "raise", the exception is raised. If a numeric value is given, FitFailedWarning is raised.
backend : {"dask", "loky", "multiprocessing", "threading"}, default=None
    Runs parallel evaluate if specified and strategy is set to "refit".
    - "loky", "multiprocessing" and "threading": uses joblib Parallel loops
    - "dask": uses dask, requires the dask package in the environment
    Recommendation: use "dask" or "loky" for parallel evaluate. "threading" is unlikely to see speed-ups due to the GIL, and the serialization backend (cloudpickle) for "dask" and "loky" is generally more robust than the standard pickle library used in "multiprocessing".
compute : bool, default=True
    If backend="dask", whether the returned DataFrame is computed. If set to True, returns pd.DataFrame, otherwise dask.dataframe.DataFrame.
cv_X : sktime BaseSplitter descendant, optional
    Determines the split of X into test and train folds. The default is X being split to identical loc indices as y. If passed, must have the same number of splits as cv.
**kwargs : Keyword arguments
    Only relevant if backend is specified. Additional kwargs are passed into dask.distributed.get_client or dask.distributed.Client if backend is set to "dask", otherwise kwargs are passed into joblib.Parallel.
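The error_score behaviour described above ("raise" re-raises, a numeric value is recorded alongside a warning) can be illustrated with a small stand-alone sketch. The helper score_fold is hypothetical and is not part of sktime; it uses a generic RuntimeWarning as a stand-in for the FitFailedWarning named in the parameter description.

```python
import math
import warnings


def score_fold(fit_and_score, error_score=math.nan):
    """Run one fold; on failure either re-raise or return error_score with a warning."""
    try:
        return fit_and_score()
    except Exception as exc:
        if error_score == "raise":
            raise
        # Stand-in for FitFailedWarning: report the failure, keep the run going.
        warnings.warn(
            f"Fold failed, score set to {error_score!r}: {exc!r}",
            RuntimeWarning,
        )
        return error_score


def failing_fold():
    # Hypothetical fold whose fit step fails.
    raise ValueError("fit failed on this fold")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    score = score_fold(failing_fold, error_score=-1.0)

print(score)        # -1.0
print(len(caught))  # 1
```

A numeric error_score keeps a long backtest running when a single fold fails, at the cost of placeholder scores that need to be filtered out before aggregating.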

Returns:

results : pd.DataFrame or dask.dataframe.DataFrame
    DataFrame that contains several columns with information regarding each refit/update and prediction of the forecaster. Row index is the splitter index of the train/test fold in cv. Entries in the i-th row are for the i-th train/test split in cv. Columns are as follows:

    - test_{scoring.name}: (float) Model performance score. If scoring is a list, then there is a column with name test_{scoring.name} for each scorer.
    - fit_time: (float) Time in sec for fit or update on train fold.
    - pred_time: (float) Time in sec to predict from fitted estimator.
    - len_train_window: (int) Length of train window.
    - cutoff: (int, pd.Timestamp, pd.Period) cutoff = last time index in train fold.
    - y_train: (pd.Series) only present if return_data=True. Train fold of the i-th split in cv, used to fit/update the forecaster.
    - y_pred: (pd.Series) only present if return_data=True. Forecasts from the fitted forecaster for the i-th test fold indices of cv.
    - y_test: (pd.Series) only present if return_data=True. Testing fold of the i-th split in cv, used to compute the metric.
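The test_{scoring.name} naming rule can be illustrated with a short sketch of how one result row might be assembled when scoring is a single metric or a list. The Metric class and result_row helper are hypothetical stand-ins and do not reproduce sktime's metric classes; only the naming pattern is taken from the description above.

```python
from statistics import mean


class Metric:
    """Hypothetical stand-in for a metric object exposing a ``name`` attribute."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, y_test, y_pred):
        return self.func(y_test, y_pred)


def result_row(y_test, y_pred, scoring):
    """Build the score part of one row: one test_<name> column per scorer."""
    scorers = scoring if isinstance(scoring, list) else [scoring]
    return {f"test_{m.name}": m(y_test, y_pred) for m in scorers}


mae = Metric(
    "MeanAbsoluteError",
    lambda y, p: mean(abs(a - b) for a, b in zip(y, p)),
)
mse = Metric(
    "MeanSquaredError",
    lambda y, p: mean((a - b) ** 2 for a, b in zip(y, p)),
)

row = result_row([1.0, 3.0], [2.0, 5.0], scoring=[mae, mse])
print(row)  # {'test_MeanAbsoluteError': 1.5, 'test_MeanSquaredError': 2.5}
```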

Examples:

>>> from sktime.datasets import load_airline
>>> from sktime.forecasting.model_evaluation import evaluate
>>> from sktime.forecasting.model_selection import ExpandingWindowSplitter
>>> from sktime.forecasting.naive import NaiveForecaster
>>> y = load_airline()
>>> forecaster = NaiveForecaster(strategy="mean", sp=12)
>>> cv = ExpandingWindowSplitter(initial_window=12, step_length=3,
...     fh=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
>>> results = evaluate(forecaster=forecaster, y=y, cv=cv)

Optionally, users may select other metrics via the scoring argument. These can be forecast metrics of any kind, such as point forecast metrics, interval metrics, or quantile forecast metrics; see https://www.sktime.net/en/stable/api_reference/performance_metrics.html?highlight=metrics for the available metrics. To evaluate estimators using a specific metric, pass it to the scoring argument.

>>> from sktime.performance_metrics.forecasting import MeanAbsoluteError
>>> loss = MeanAbsoluteError()
>>> results = evaluate(forecaster=forecaster, y=y, cv=cv, scoring=loss)

Optionally, users can provide a list of metrics to the scoring argument.

>>> from sktime.performance_metrics.forecasting import MeanSquaredError
>>> results = evaluate(
...     forecaster=forecaster,
...     y=y,
...     cv=cv,
...     scoring=[MeanSquaredError(square_root=True), MeanAbsoluteError()],
... )

An example of an interval metric is the PinballLoss. It can be used with all probabilistic forecasters.

>>> from sktime.forecasting.naive import NaiveVariance
>>> from sktime.performance_metrics.forecasting.probabilistic import PinballLoss
>>> loss = PinballLoss()
>>> forecaster = NaiveForecaster(strategy="drift")
>>> results = evaluate(forecaster=NaiveVariance(forecaster),
...     y=y, cv=cv, scoring=loss)
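For orientation, the pinball (quantile) loss in its standard textbook form can be sketched in plain Python. This is not sktime's PinballLoss: the function names are hypothetical, and sktime's aggregation or scaling conventions may differ from this sketch.

```python
from statistics import mean


def pinball_loss(y_true, q_pred, alpha):
    """Textbook pinball loss for one observation and one alpha-quantile forecast."""
    diff = y_true - q_pred
    # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
    return alpha * diff if diff >= 0 else (alpha - 1) * diff


def mean_pinball_loss(y_true, q_pred, alpha):
    """Average the pointwise loss over a forecast horizon."""
    return mean(pinball_loss(y, q, alpha) for y, q in zip(y_true, q_pred))


# A 0.75-quantile forecast is penalised more for being too low than too high.
print(pinball_loss(10.0, 8.0, alpha=0.75))  # 1.5
print(pinball_loss(8.0, 10.0, alpha=0.75))  # 0.5
```

At alpha = 0.5 this form reduces to half the absolute error, which is why the median is the point forecast that minimises it.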