load_m5
- load_m5(extract_path=None, include_events=False, merged=True, test=False)
Fetch the M5 dataset from https://zenodo.org/records/12636070.
Downloads and extracts the dataset if it has not already been downloaded. The fetched dataset is in standard .csv format and is loaded into an sktime-compatible in-memory format (pd_multiindex_hier). For additional information on the dataset, including its structure and contents, refer to the Notes section.
- Parameters:
- extract_path : str, optional (default=None)
If provided, the path should use the appropriate path separators for the operating system (e.g., forward slashes ‘/’ for Unix-based systems, backslashes ‘\’ for Windows). If extract_path is provided:
Check if the required files are present at the given extract_path.
If the files are not found, check if the directory “m5-forecasting-accuracy” exists within extract_path. This is useful when the function has already been run with the same path.
If the directory does not exist, download and extract the data into the “m5-forecasting-accuracy” folder in extract_path.
If the directory exists, use the path to the existing directory.
- If extract_path is None:
Check if the directory “m5-forecasting-accuracy” exists at the module level.
If the directory exists, use the path to the existing directory. This is useful when the function has already been run without a path.
If the directory does not exist, download and extract the data into the “m5-forecasting-accuracy” folder at the module level.
- include_events : bool, optional (default=False)
If True, the resulting dataset will include additional columns related to events. Including these columns allows for a richer dataset that can be used to analyze the impact of events on sales. If False, the dataset will exclude these columns, providing a more streamlined version of the data.
- merged : bool, optional (default=True)
Determines the format of the output:
- If True, the function returns a single merged dataset.
- If False, the function returns three separate datasets: sales_train_validation, sell_prices, and calendar.
- test : bool, optional (default=False)
If True, loads a smaller part of the dataset, which does not include events, for testing purposes. This is not intended for standard usage but may be useful for developers running tests.
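The lookup order described for extract_path can be sketched with the standard library alone. This is an illustrative sketch, not the library's implementation: the helper name `resolve_m5_dir`, the `module_dir` argument, and the assumption that the "required files" are the three CSV files listed in the Notes section are all hypothetical.

```python
from pathlib import Path

# Folder name taken from the parameter description above.
M5_DIRNAME = "m5-forecasting-accuracy"

# Assumption: the "required files" are the three CSVs named in the Notes.
REQUIRED_FILES = (
    "sales_train_validation.csv",
    "sell_prices.csv",
    "calendar.csv",
)


def resolve_m5_dir(extract_path=None, module_dir="."):
    """Hypothetical helper: return (data_dir, needs_download).

    Mirrors the documented lookup order; it does not download anything.
    """
    if extract_path is not None:
        base = Path(extract_path)
        # Step 1: the files may already sit directly in extract_path.
        if all((base / name).is_file() for name in REQUIRED_FILES):
            return base, False
    else:
        # No path given: fall back to the module-level location.
        base = Path(module_dir)

    # Step 2: look for the dataset folder left behind by a previous run.
    candidate = base / M5_DIRNAME
    # Step 3: a download is needed only if that folder does not exist.
    return candidate, not candidate.is_dir()
```

Returning a flag instead of downloading keeps the sketch free of network access, so the lookup logic can be checked in isolation.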
- Returns:
- pd.DataFrame or tuple of pd.DataFrame
- If merged is True:
- data : pd.DataFrame of sktime type pd_multiindex_hier
The preprocessed dataframe containing the time series.
- If merged is False, returns a tuple of three dataframes:
- sales_train_validation : pd.DataFrame of sktime type pd_multiindex_hier
- sell_prices : pd.DataFrame
- calendar : pd.DataFrame
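The two return shapes can be illustrated with a small wrapper. This is a sketch under stated assumptions: the wrapper name `load_m5_both_formats` and its `loader` argument are hypothetical, the import path `sktime.datasets.load_m5` is assumed rather than stated on this page, and it is assumed that test=True can be combined with merged=False. Calling it with the real loader may trigger a download.

```python
def load_m5_both_formats(loader=None, extract_path=None):
    """Hypothetical wrapper: call the loader once per output format."""
    if loader is None:
        # Assumed import path; imported lazily so the sketch can be
        # read and tested without sktime installed.
        from sktime.datasets import load_m5 as loader

    # merged=True: a single dataframe.
    merged = loader(extract_path=extract_path, merged=True, test=True)

    # merged=False: a tuple of three dataframes, in the documented order.
    sales, prices, calendar = loader(
        extract_path=extract_path, merged=False, test=True
    )

    return {
        "merged": merged,
        "sales_train_validation": sales,
        "sell_prices": prices,
        "calendar": calendar,
    }
```

Passing a stand-in `loader` makes it possible to exercise the unpacking logic without fetching the dataset.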
Notes
The dataset consists of three main files:
- sales_train_validation.csv: daily sales data for each product and store
- sell_prices.csv: price data for each product and store
- calendar.csv: calendar information including events
The dataframe will have a multi-index with the following levels:
- state_id
- store_id
- cat_id
- dept_id
- date