Right now I am participating in the M5 Forecasting - Accuracy competition. The goal of this competition is to predict sales from Walmart's US data.
The dataset covers sales of a set of products over a period of roughly 5 years, in ten stores across three US states. It contains roughly 46 million sales entries.
Sales forecasting usually involves a lot of feature engineering. You will typically use a library such as Python's Pandas to perform the following operations, usually in this order:
- Cleanse the data set (this includes removing incomplete records and statistical outliers).
- Re-format the data (for example, by pivoting columns).
- Merge or enrich the original data with other data sources (e.g. Google Trends or weather data). This step is typically optional, but can help improve your prediction accuracy significantly.
- Decompose the date of the sale and store its components, such as the year, month, day, day of the year and weekday, in separate numerical fields.
- Add new fields based on statistical functions. Typical choices are rolling averages over specific time windows, such as 7 or 30 days, and row shifting (lagged values).
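A few of the steps above can be sketched in Pandas as follows. This is only an illustration on made-up data: the column names (`date`, `item_id`, `units`) and the values are assumptions, not the actual M5 schema, and the optional pivoting and merging steps are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one item, one row per day, random sales counts.
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=60, freq="D"),
    "item_id": "ITEM_1",
    "units": rng.integers(0, 20, size=60).astype(float),
})
sales.loc[59, "units"] = np.nan  # simulate an incomplete record

# Cleanse: drop rows where the sales value is missing.
sales = sales.dropna(subset=["units"]).reset_index(drop=True)

# Decompose the date into separate numerical fields.
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month
sales["day"] = sales["date"].dt.day
sales["day_of_year"] = sales["date"].dt.dayofyear
sales["weekday"] = sales["date"].dt.weekday  # Monday=0 ... Sunday=6

# Row shifting and rolling averages, computed per item.
# Rows are one per day here, so shifting by 7 rows means 7 days.
grouped = sales.groupby("item_id")["units"]
sales["units_lag_7"] = grouped.shift(7)
# Shift by one row first so the rolling mean only sees past days.
sales["units_roll_mean_7"] = grouped.transform(
    lambda s: s.shift(1).rolling(7).mean()
)
```

Shifting before computing the rolling mean is a deliberate choice in this sketch: it keeps the current day's sales out of its own feature, which would otherwise leak the target into the training data.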
Pandas is also a great tool for data analysis. Before you even start with feature engineering, you need to get to know and understand your data. With Pandas you can quickly ingest data from CSV files and then process it using DataFrames, which are Pandas' own tabular data structure. This is how you create a DataFrame with 10 rows using Pandas:
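A minimal sketch of one way to do this, with column names and values made up for illustration:

```python
import pandas as pd

# Hypothetical example data: 10 days of sales for a single item.
df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "item_id": ["ITEM_1"] * 10,
    "units": [3, 0, 5, 2, 7, 1, 0, 4, 6, 2],
})

print(df.shape)  # (10, 3): 10 rows, 3 columns
```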
This is what the first rows of a Pandas DataFrame look like in a Jupyter Notebook:
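To inspect the first rows yourself, you can load a CSV and call `head()`. The sketch below reads from an in-memory string instead of a real file, and its columns and values are assumptions chosen for illustration. In a notebook, leaving `df.head()` as the last expression of a cell renders the rows as a table.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """date,store_id,item_id,units
2016-01-01,CA_1,ITEM_1,3
2016-01-02,CA_1,ITEM_1,0
2016-01-03,CA_1,ITEM_1,5
2016-01-04,CA_1,ITEM_1,2
2016-01-05,CA_1,ITEM_1,7
2016-01-06,CA_1,ITEM_1,1
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

first_rows = df.head()  # returns the first 5 rows by default
print(first_rows)
```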
At the end of the feature engineering process, you should have a training set and a validation set that can be fed into a machine learning library, such as Microsoft's open-source LightGBM, or into a deep learning framework, such as Facebook's PyTorch.
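For time series data, a common way to obtain these two sets is to split by date, so that the validation set lies strictly after the training set. The sketch below uses made-up features and holds out the last 28 days; both the column names and the 28-day window are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: one row per day with two features and a target.
rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", periods=100, freq="D")
data = pd.DataFrame({
    "date": dates,
    "weekday": dates.weekday,
    "units_lag_7": rng.integers(0, 20, size=100),
    "units": rng.integers(0, 20, size=100),  # target column
})

# Hold out the most recent 28 days for validation (assumed window).
cutoff = data["date"].max() - pd.Timedelta(days=28)
train = data[data["date"] <= cutoff]
valid = data[data["date"] > cutoff]

# Separate the features from the target.
feature_cols = ["weekday", "units_lag_7"]
X_train, y_train = train[feature_cols], train["units"]
X_valid, y_valid = valid[feature_cols], valid["units"]
```

The resulting frames could then be passed to a model, for instance LightGBM's scikit-learn style `LGBMRegressor().fit(X_train, y_train, eval_set=[(X_valid, y_valid)])`.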
With these ML libraries, you will be able to train a model which you can later use in your sales forecasts.
Pandas is one of the favourite tools of ML practitioners when dealing with tabular data. It is not an ML tool in its own right, but it lets you build the foundation for successfully training an ML model.