evaluate
- evaluate(forecaster, cv, y, X=None, strategy: str = 'refit', scoring: callable | List[callable] | None = None, return_data: bool = False, error_score: str | int | float = nan, backend: str | None = None, compute: bool = None, cv_X=None, backend_params: dict | None = None, **kwargs)[source]
Evaluate forecaster using timeseries cross-validation.
All-in-one statistical performance benchmarking utility for forecasters which runs a simple backtest experiment and returns a summary pd.DataFrame.
The experiment run is the following:
Denote by \(y_{train, 1}, y_{test, 1}, \dots, y_{train, K}, y_{test, K}\) the train/test folds produced by the generator cv.split_series(y). Denote by \(X_{train, 1}, X_{test, 1}, \dots, X_{train, K}, X_{test, K}\) the train/test folds produced by the generator cv_X.split_series(X) (if X is None, consider these to be None as well).

1. Set i = 1.
2. Fit the forecaster to \(y_{train, 1}\), \(X_{train, 1}\), with fh set to forecast \(y_{test, 1}\).
3. The forecaster predicts with exogenous data \(X_{test, i}\): y_pred = forecaster.predict (or predict_proba or predict_quantiles, depending on scoring).
4. Compute scoring on y_pred versus \(y_{test, i}\).
5. If i == K, terminate; otherwise:
6. Set i = i + 1.
7. Ingest more data \(y_{train, i}\), \(X_{train, i}\); how depends on strategy:
   - if strategy == "refit", reset and fit forecaster via fit, on \(y_{train, i}\), \(X_{train, i}\) to forecast \(y_{test, i}\)
   - if strategy == "update", update forecaster via update, on \(y_{train, i}\), \(X_{train, i}\) to forecast \(y_{test, i}\)
   - if strategy == "no-update_params", forward forecaster via update, with argument update_params=False, to the cutoff of \(y_{train, i}\)
8. Go to 3.
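Conceptually, for strategy="refit" and a point forecast metric, the loop above corresponds to the following minimal sketch. It is illustrative only: it omits the exogenous data handling, runtime measurement, and error handling that evaluate performs, and the data, splitter settings, and variable names are arbitrary.

>>> from sktime.datasets import load_airline
>>> from sktime.forecasting.base import ForecastingHorizon
>>> from sktime.forecasting.naive import NaiveForecaster
>>> from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError
>>> from sktime.split import ExpandingWindowSplitter
>>> y = load_airline()
>>> cv = ExpandingWindowSplitter(initial_window=24, step_length=12, fh=[1, 2, 3])
>>> forecaster = NaiveForecaster(strategy="last")
>>> scoring = MeanAbsolutePercentageError(symmetric=True)
>>> scores = []
>>> for y_train, y_test in cv.split_series(y):
...     # "refit": a fresh clone is fitted to every training fold
...     fh = ForecastingHorizon(y_test.index, is_relative=False)
...     f = forecaster.clone()
...     f.fit(y_train, fh=fh)
...     y_pred = f.predict()
...     scores.append(scoring(y_test, y_pred))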
The results returned by this function are:

- results of scoring calculations, from step 4, in the i-th loop
- runtimes for fitting and/or predicting, from steps 2, 3, 7, in the i-th loop
- cutoff state of forecaster, at step 3, in the i-th loop
- \(y_{train, i}\), \(y_{test, i}\), y_pred (optional)

A distributed and/or parallel backend can be chosen via the backend parameter.

- Parameters:
- forecaster : sktime BaseForecaster descendant (concrete forecaster)
  sktime forecaster to benchmark
- cv : sktime BaseSplitter descendant
  determines split of y and possibly X into test and train folds:
  - y is always split according to cv, see above
  - if cv_X is not passed, X splits are subset to the loc indices of the y splits
  - if cv_X is passed, X is split according to cv_X
- y : sktime time series container
  Target (endogenous) time series used in the evaluation experiment
- X : sktime time series container, of same mtype as y
  Exogenous time series used in the evaluation experiment
- strategy : {"refit", "update", "no-update_params"}, optional, default="refit"
  defines the ingestion mode when the forecaster sees new data as the window expands (see the illustrative calls after this parameter list):
  - "refit" = forecaster is refitted to each training window
  - "update" = forecaster is updated with training window data, in the sequence provided
  - "no-update_params" = fit to the first training window, re-used without fit or update
- scoring : subclass of sktime.performance_metrics.BaseMetric or list of same, default=None
  Used to get a score function that takes y_pred and y_test arguments and accepts y_train as a keyword argument. If None, then uses scoring = MeanAbsolutePercentageError(symmetric=True).
- return_data : bool, default=False
  Whether to return three additional columns in the DataFrame, by default False. The cells of these columns each contain a pd.Series for y_train, y_pred, y_test.
- error_score : "raise" or numeric, default=np.nan
  Value to assign to the score if an exception occurs in estimator fitting. If set to "raise", the exception is raised. If a numeric value is given, FitFailedWarning is raised.
- backend : {"dask", "loky", "multiprocessing", "threading"}, optional, default=None
  Runs parallel evaluate if specified and strategy is set as "refit".
  - None: executes the loop sequentially, simple list comprehension
  - "loky", "multiprocessing" and "threading": uses joblib.Parallel loops
  - "joblib": custom and 3rd party joblib backends, e.g., spark
  - "dask": uses dask, requires dask package in environment
  - "dask_lazy": same as "dask", but changes the return to a (lazy) dask.dataframe.DataFrame

  Recommendation: use "dask" or "loky" for parallel evaluate. "threading" is unlikely to see speed ups due to the GIL, and the serialization backend (cloudpickle) for "dask" and "loky" is generally more robust than the standard pickle library used in "multiprocessing".
- compute : bool, default=True, deprecated and will be removed in 0.25.0
  If backend="dask", whether the returned DataFrame is computed. If set to True, returns pd.DataFrame, otherwise dask.dataframe.DataFrame.
- cv_X : sktime BaseSplitter descendant, optional
  determines split of X into test and train folds:
  - default is X being split to identical loc indices as y
  - if passed, must have same number of splits as cv
- backend_params : dict, optional
  additional parameters passed to the backend as config. Directly passed to utils.parallel.parallelize. Valid keys depend on the value of backend:
  - None: no additional parameters, backend_params is ignored
  - "loky", "multiprocessing" and "threading": default joblib backends; any valid keys for joblib.Parallel can be passed here, e.g., n_jobs, with the exception of backend, which is directly controlled by the backend parameter. If n_jobs is not passed, it will default to -1; other parameters will default to joblib defaults.
  - "joblib": custom and 3rd party joblib backends, e.g., spark; any valid keys for joblib.Parallel can be passed here, e.g., n_jobs; backend must be passed as a key of backend_params in this case. If n_jobs is not passed, it will default to -1; other parameters will default to joblib defaults.
  - "dask": any valid keys for dask.compute can be passed, e.g., scheduler
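For instance, the ingestion strategy and the parallelization backend can be chosen directly in the call. The snippet below is illustrative only; it reuses the forecaster, y, and cv objects constructed in the Examples section, and the n_jobs value is arbitrary.

>>> # update instead of refit on each new training window
>>> results = evaluate(forecaster=forecaster, y=y, cv=cv, strategy="update")
>>> # parallel backtest over refit folds via joblib's "loky" backend
>>> results = evaluate(
...     forecaster=forecaster,
...     y=y,
...     cv=cv,
...     backend="loky",
...     backend_params={"n_jobs": 2},
... )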
- Returns:
- results : pd.DataFrame or dask.dataframe.DataFrame
DataFrame that contains several columns with information regarding each refit/update and prediction of the forecaster. Row index is splitter index of train/test fold in cv. Entries in the i-th row are for the i-th train/test split in cv. Columns are as follows:
- test_{scoring.name}: (float) Model performance score. If scoring is a list, then there is a column with name test_{scoring.name} for each scorer.
- fit_time: (float) Time in sec for fit or update on train fold.
- pred_time: (float) Time in sec to predict from fitted estimator.
- len_train_window: (int) Length of train window.
- cutoff: (int, pd.Timestamp, pd.Period) cutoff = last time index in train fold.
- y_train: (pd.Series) only present if return_data=True; train fold of the i-th split in cv, used to fit/update the forecaster.
- y_pred: (pd.Series) only present if return_data=True; forecasts from fitted forecaster for the i-th test fold indices of cv.
- y_test: (pd.Series) only present if return_data=True; testing fold of the i-th split in cv, used to compute the metric.
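For example, per-fold scores, runtimes, and (with return_data=True) the stored series can be read off the returned DataFrame. The snippet below is illustrative: it reuses the forecaster, y, and cv objects from the Examples section, and assumes the default scoring, whose metric name renders the score column as test_MeanAbsolutePercentageError.

>>> results = evaluate(forecaster=forecaster, y=y, cv=cv, return_data=True)
>>> mean_score = results["test_MeanAbsolutePercentageError"].mean()
>>> total_fit_time = results["fit_time"].sum()
>>> first_fold_pred = results.loc[0, "y_pred"]  # forecasts for the first test fold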
Examples
The type of evaluation done by evaluate depends on the metrics passed via the scoring parameter. The default is MeanAbsolutePercentageError.
>>> from sktime.datasets import load_airline
>>> from sktime.forecasting.model_evaluation import evaluate
>>> from sktime.split import ExpandingWindowSplitter
>>> from sktime.forecasting.naive import NaiveForecaster
>>> y = load_airline()[:24]
>>> forecaster = NaiveForecaster(strategy="mean", sp=3)
>>> cv = ExpandingWindowSplitter(initial_window=12, step_length=6, fh=[1, 2, 3])
>>> results = evaluate(forecaster=forecaster, y=y, cv=cv)
Optionally, users may select other metrics via the scoring argument. These can be forecast metrics of any kind, i.e., point forecast metrics, interval metrics, or quantile forecast metrics. To evaluate estimators using a specific metric, provide it to the scoring argument.
>>> from sktime.performance_metrics.forecasting import MeanAbsoluteError
>>> loss = MeanAbsoluteError()
>>> results = evaluate(forecaster=forecaster, y=y, cv=cv, scoring=loss)
Optionally, users can provide a list of metrics to the scoring argument.
>>> from sktime.performance_metrics.forecasting import MeanSquaredError
>>> results = evaluate(
...     forecaster=forecaster,
...     y=y,
...     cv=cv,
...     scoring=[MeanSquaredError(square_root=True), MeanAbsoluteError()],
... )
An example of an interval metric is the PinballLoss. It can be used with all probabilistic forecasters.
>>> from sktime.forecasting.naive import NaiveVariance
>>> from sktime.performance_metrics.forecasting.probabilistic import PinballLoss
>>> loss = PinballLoss()
>>> forecaster = NaiveForecaster(strategy="drift")
>>> results = evaluate(forecaster=NaiveVariance(forecaster),
...     y=y, cv=cv, scoring=loss)