load_m5

load_m5(extract_path=None, include_events=False, merged=True, test=False)

Fetch M5 dataset from https://zenodo.org/records/12636070 .

Downloads and extracts the dataset if it has not already been downloaded. The fetched dataset is in standard .csv format and is loaded into an sktime-compatible in-memory format (pd_multiindex_hier). For additional information on the dataset, including its structure and contents, refer to the Notes section.

Parameters:
extract_path : str, optional (default=None)

If provided, the path should use the path separators appropriate for the operating system (e.g., forward slashes ‘/’ for Unix-based systems, backslashes ‘\’ for Windows). If extract_path is provided:

  • Check if the required files are present at the given extract_path.

  • If the files are not found, check whether the directory “m5-forecasting-accuracy” exists within the extract_path. This is useful when the function has already been run with the same path.

  • If the directory does not exist, download and extract the data into the “m5-forecasting-accuracy” folder in the extract_path.

  • If the directory exists, use the path to the existing directory.

If extract_path is None:

  • Check whether the directory “m5-forecasting-accuracy” exists at the module level.

  • If the directory exists, use the path to that directory. This is useful when the function has already been run without a path.

  • If the directory does not exist, download and extract the data into the “m5-forecasting-accuracy” folder at the module level.
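
The lookup order described above can be sketched as a small helper. This is only an illustration of the documented steps, not sktime's implementation: `resolve_m5_dir`, `REQUIRED_FILES`, `DATA_DIR_NAME`, and the `module_dir` argument are hypothetical names, and the list of required files is assumed from the three files named in the Notes section.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed file list (taken from the Notes section) and folder name (from the text above).
REQUIRED_FILES = ("sales_train_validation.csv", "sell_prices.csv", "calendar.csv")
DATA_DIR_NAME = "m5-forecasting-accuracy"


def resolve_m5_dir(extract_path: Optional[str], module_dir: str) -> Tuple[Path, bool]:
    """Hypothetical helper: return (directory to read from, download needed?)."""
    if extract_path is not None:
        base = Path(extract_path)
        # Step 1: the files may already sit directly in extract_path.
        if all((base / name).is_file() for name in REQUIRED_FILES):
            return base, False
    else:
        # No path given: fall back to the module-level location.
        base = Path(module_dir)

    # Step 2: otherwise look for the dataset folder and reuse it if present.
    data_dir = base / DATA_DIR_NAME
    return data_dir, not data_dir.is_dir()
```

When the second element of the returned tuple is True, a caller would download and extract the data into the returned directory before reading it.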

include_events : bool, optional (default=False)

If True, the resulting dataset includes additional event-related columns, which can be used to analyze the impact of events on sales. If False, these columns are excluded, giving a more streamlined version of the data.

merged : bool, optional (default=True)

Determines the format of the output:

  • If True, the function returns a single merged dataset.

  • If False, the function returns three separate datasets: sales_train_validation, sell_prices, and calendar.

test : bool, optional (default=False)

If True, loads a smaller part of the dataset, without events, for testing purposes. This is not intended for standard usage but may be useful for developers running tests.

Returns:
pd.DataFrame or tuple of pd.DataFrame
  • If merged is True:
    data : pd.DataFrame of sktime type pd_multiindex_hier

    The preprocessed dataframe containing the time series.

  • If merged is False, returns a tuple of three dataframes:

    sales_train_validation : pd.DataFrame of sktime type pd_multiindex_hier
    sell_prices : pd.DataFrame
    calendar : pd.DataFrame
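
A minimal usage sketch of the two return shapes follows. It assumes the function is importable as `sktime.datasets.load_m5`; the wrapper `load_m5_both_ways` is a hypothetical name used only to keep the import and the download out of module scope.

```python
def load_m5_both_ways(extract_path=None):
    """Hypothetical wrapper: load M5 once merged and once as three frames."""
    # Imported lazily so that defining this helper does not require sktime.
    # The import path is an assumption, not confirmed by the text above.
    from sktime.datasets import load_m5

    # merged=True (the default) returns a single pd.DataFrame.
    merged = load_m5(extract_path=extract_path, include_events=True, merged=True)

    # merged=False returns a tuple of three frames, unpacked here in the
    # order listed in the Returns section.
    sales, prices, calendar = load_m5(extract_path=extract_path, merged=False)

    return merged, (sales, prices, calendar)
```

Calling the wrapper triggers the download described under extract_path if the data is not already present, so the first call may take a while.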

Notes

The dataset consists of three main files:

  • sales_train_validation.csv: daily sales data for each product and store

  • sell_prices.csv: price data for each product and store

  • calendar.csv: calendar information including events

The dataframe will have a multi-index with the following levels:

  • state_id

  • store_id

  • cat_id

  • dept_id

  • date
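
To show how a frame with these index levels can be handled in pandas, the sketch below builds a tiny synthetic stand-in. Only the level names come from the list above; the `sales` column name, the identifiers, the dates, and all values are made up for illustration and are not taken from the M5 data.

```python
import pandas as pd

# Synthetic stand-in: level names follow the Notes section,
# everything else (column name, ids, dates, values) is invented.
dates = pd.date_range("2011-01-29", periods=3, freq="D")
index = pd.MultiIndex.from_product(
    [["CA"], ["CA_1", "CA_2"], ["FOODS"], ["FOODS_1"], dates],
    names=["state_id", "store_id", "cat_id", "dept_id", "date"],
)
toy = pd.DataFrame({"sales": [3, 0, 1, 2, 2, 5]}, index=index)

# Select the rows of one store with a cross-section on the store_id level.
store_series = toy.xs("CA_1", level="store_id")

# Sum over stores to obtain daily totals per state.
state_totals = toy.groupby(level=["state_id", "date"]).sum()
```

The same `xs` and `groupby(level=...)` calls work on any frame whose index carries these level names, which is a convenient way to move between levels of the hierarchy.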