M5 Walmart Forecasting
Forecasting is the process of predicting or estimating a future trend or event. In this project, we forecast the sales of a wide range of products using the sales of those same products in previous years.
What are the advantages of sales forecasting?
Sales forecasting can help increase an organization's profits, because predictions let us maintain an inventory that matches expected demand. For example, a perishable item is procured based on its consumption; if we can predict that consumption, we can stock just enough of the item and avoid the disposal losses that many large companies suffer. Forecasting also reduces the expenses related to inventory maintenance, which further increases profits.
We used the dataset from https://www.kaggle.com/c/m5-forecasting-accuracy
The dataset consists of four .csv files that contain information such as the daily unit sales and the prices of the different items on sale.
The different datasets are:
- calendar.csv: It contains information about the dates on which the products are sold
- sales_train_validation.csv: It contains the historical daily unit sales data per product and store [d_1 – d_1913]
- sell_prices.csv: It contains information about the price of the products per store and date.
- sales_train_evaluation.csv: It includes the sales data from [d_1 – d_1941]
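A minimal loading sketch with pandas, assuming the four competition files have been downloaded and unzipped into the working directory (the paths are an assumption):

```python
import pandas as pd

# Assumed paths: the four competition files in the current working directory
calendar = pd.read_csv('calendar.csv')
sales_train_validation = pd.read_csv('sales_train_validation.csv')
sell_prices = pd.read_csv('sell_prices.csv')
sales_train_evaluation = pd.read_csv('sales_train_evaluation.csv')
```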
Exploratory Data Analysis
We use the sales_train_validation dataset to check the number of days for which we have sales data for the different products. From the calendar dataset we observe that the column 'd' runs up to d_1969, while in sales_train_validation the columns themselves are the days and they only run up to d_1913.
There are three unique categories of products across the stores: FOODS, HOUSEHOLD and HOBBIES. Each category is split into departments: FOODS has 3 departments and the other two have 2 departments each.
We have data from three states: California, Texas and Wisconsin. From California we consider four stores, and from Texas and Wisconsin three stores each, which makes a total of 10 stores.
```python
sell_prices['item_id'].nunique()
sales_train_validation['store_id'].value_counts()
```
The above two lines of code confirm that there are 3049 unique items available at each of the ten stores.
Now, purely for the EDA, we remove the columns with more than 80% missing values: 'event_name_1', 'event_name_2', 'event_type_1' and 'event_type_2'.
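One way to find and drop those columns; a sketch, assuming the mostly-empty event columns live in the calendar dataframe and using a separate calendar_eda copy so the raw data stays intact:

```python
# Fraction of missing values per column in the calendar dataframe
missing_share = calendar.isna().mean()

# Columns that are more than 80% empty (the four event columns)
cols_to_drop = missing_share[missing_share > 0.8].index.tolist()
print(cols_to_drop)

# Drop them only in a copy used for the EDA
calendar_eda = calendar.drop(columns=cols_to_drop)
```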
Now we pick a random item from the dataset and plot its sales over time. We pick 'FOODS_3_090_CA_3_validation', which corresponds to the sales of the item FOODS_3_090 at California's CA_3 store.
Now, to get actual dates on the x-axis, we merge the calendar dataset with sales_train_validation.
```python
df = sales_train_validation
dcols = [c for c in df.columns if c.startswith('d_')]  # the day columns d_1 ... d_1913

example = df.loc[df['id'] == 'FOODS_3_090_CA_3_validation'][dcols].T  # the column name becomes the row number
example = example.rename(columns={8412: 'FOODS_3_090_CA_3'})          # rename it to something readable
example = example.reset_index().rename(columns={'index': 'd'})        # turn the index back into the 'd' column
example = example.merge(calendar, how='left', validate='1:1')
example.head()
```
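The plot discussed below can then be drawn directly from example; a minimal sketch, assuming matplotlib is used for plotting:

```python
import matplotlib.pyplot as plt

# Daily unit sales of the chosen item against the calendar date
example.plot(x='date', y='FOODS_3_090_CA_3', figsize=(15, 5),
             title='FOODS_3_090_CA_3 sales over time', legend=False)
plt.show()
```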
From the above graph, we observe that there was a period during which sales of the product stopped and then resumed after some time. This may indicate that a change was made to the product, after which its sales began again.
We picked two more random products, 'HOBBIES_1_234_CA_3' and 'HOUSEHOLD_1_118_CA_3', and observed their sales over time. We also look at the sales of the items aggregated weekly, monthly and yearly, as sketched below.
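A possible way to build those aggregate views for the first item; a sketch that assumes the merged example frame from above and uses the wday, month and year columns supplied by calendar.csv:

```python
# Average sales of FOODS_3_090_CA_3 by day of week, month and year
fig, axs = plt.subplots(1, 3, figsize=(18, 4))
example.groupby('wday')['FOODS_3_090_CA_3'].mean().plot(kind='bar', ax=axs[0], title='By day of week')
example.groupby('month')['FOODS_3_090_CA_3'].mean().plot(kind='bar', ax=axs[1], title='By month')
example.groupby('year')['FOODS_3_090_CA_3'].mean().plot(kind='bar', ax=axs[2], title='By year')
plt.tight_layout()
plt.show()
```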
[Figures: sales over time, weekly, monthly and yearly, for FOODS_3_090_CA_3, HOBBIES_1_234_CA_3 and HOUSEHOLD_1_118_CA_3]
From the above graphs we observe:
- The sale of 'HOBBIES_1_234_CA_3' peaks on the 4th day of the week, i.e. Tuesday, whereas the sales of the other items are higher on the weekend, i.e. the first and second days
- The sale of 'HOUSEHOLD_1_118_CA_3' dropped to zero after 2014, which indicates that the product may have been discontinued
- The sale of 'HOBBIES_1_234_CA_3' rises continuously up to a certain point, after which it starts dropping
- The weekly sales of 'FOODS_3_090_CA_3' form a continuous curve, which indicates that people buy food mostly on the weekends and not during the weekdays
Now, we pick 20 random items from the dataset and view their sales. This helps us gather valuable insights from the data, for example the continuity of sales, drops in sales, whether drops occur during a specific period, discontinued items, and so on.
```python
from itertools import cycle

colour_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])  # reconstructed colour cycler

twenty_examples = (df.sample(20, random_state=2)
                     .set_index('id')[dcols].T
                     .merge(calendar.set_index('d')['date'],
                            left_index=True, right_index=True, validate='1:1')
                     .set_index('date'))

fig, axs = plt.subplots(10, 2, figsize=(20, 20))
axs = axs.flatten()
ax_idx = 0
for item in twenty_examples.columns:
    twenty_examples[item].plot(title=item, color=next(colour_cycle), ax=axs[ax_idx])
    ax_idx += 1
plt.tight_layout()
plt.show()
```
We merge the two datasets, calendar and sales_train_validation, to form a 'past_sales' dataframe from which we can view the sales of all items over the given time period.
```python
past_sales = (df.set_index('id')[dcols].T
                .merge(calendar.set_index('d')['date'],
                       left_index=True, right_index=True, validate='1:1')
                .set_index('date'))
```
Looking at the sales of the different categories, most of us would assume food to sell more than any other product category. Let's visualize the sales per category and see whether the data tells a different story.
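A minimal sketch of that per-category view, built from the past_sales frame above (the 90-day rolling mean is only there to smooth the curves):

```python
# Total daily unit sales per category, smoothed with a 90-day rolling mean
fig, ax = plt.subplots(figsize=(15, 5))
for cat in ['FOODS', 'HOUSEHOLD', 'HOBBIES']:
    cat_cols = [c for c in past_sales.columns if cat in c]
    past_sales[cat_cols].sum(axis=1).rolling(90).mean().plot(ax=ax, label=cat)
ax.set_title('Rolling 90-day average of unit sales per category')
ax.legend()
plt.show()
```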
The sales of food are indeed the highest of all the categories. There is also a dip in sales around the end of each year, which can be attributed to Christmas.
We now explore the sales at the different stores in each of the three states.
[Figures: store-level sales over time in California, Texas and Wisconsin]
Observations made from the above graphs are as follows:
- CA_3 observed the highest sales in California compared to the other stores and CA_4 was the lowest
- TX_2 observed the highest sales in Texas but, according to the data, it saw a drastic drop in sales at the end of 2013, after which its sales were similar to those of TX_3
- WI_3 observed the highest sales until the end of 2012, after which its sales dropped continuously over time
- WI_2 observed a drastic increase in sales during mid-2012, maintained those numbers until the start of 2016, and then observed another drastic increase.
Modelling
On reading the datasets, we observe sales only up to d_1941, so we add zero sales for the remaining days. Calling .info() on the datasets also shows that they are very large, which can cause a problem, in simpler words a memory error, as we move forward. To prevent this, we change the datatypes of the different columns. When pandas reads a dataframe it assigns int64 and float64 on its own, even when the values would fit into an int8, which makes the dataframe consume far more memory than necessary. So, to keep memory usage to a minimum, we downcast the datasets.
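A sketch of the zero-filling step, assuming the evaluation sales data is held in a dataframe named sales (as in the merging code further below) and that, per the calendar, days run up to d_1969:

```python
# The sales data stops at d_1941; add the remaining days up to d_1969 as zero sales
for d in range(1942, 1970):
    sales['d_' + str(d)] = 0
```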
Downcasting is the process of reducing the memory used by a dataframe, which also speeds up the operations performed on it.
```python
def downcast(df):
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i, t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
        elif 'float' in str(t):
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
        elif t == object:
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df
```
As you can see from the above code, we reduce the width of each column's datatype based on the values present in that column. If we look at the datasets after downcasting, the new size is almost a quarter of the original: the sales data shrank from 454.5 MB to only 97.1 MB. This saves memory and makes subsequent operations faster.
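A short usage sketch, applying downcast to the three dataframes described in the next paragraph and letting pandas report the new memory footprint:

```python
# Apply the downcast to all three dataframes and check the new footprint
sales = downcast(sales)
prices = downcast(prices)
calendar = downcast(calendar)

sales.info(memory_usage='deep')
```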
Currently, the data is in three dataframes: sales, prices and calendar. The sales dataframe contains the daily sales data with the days (d_1 – d_1969) as columns, the prices dataframe contains the items' price details, and calendar contains information about each day d.
So, we will convert the wide dataframe that we have into a long dataframe (wide format → long format). We keep id, item_id, cat_id, dept_id, store_id and state_id as the id variables. There are 30490 unique ids, so the new dataframe will have 30490 x 1969 (number of days) = 60,034,810 rows.
We now merge the datasets:
```python
df = pd.melt(sales, id_vars=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
             var_name='d', value_name='sold').dropna()
df = pd.merge(df, calendar, on='d', how='left')
df = pd.merge(df, prices, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
```
Lag features are the classical way in which time series forecasting problems are transformed into supervised learning problems. We introduce lags of the target variable 'sold'.
```python
lags = [1, 2, 3, 6, 12, 24, 36]
for lag in lags:
    df['sold_lag_' + str(lag)] = (df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
                                             as_index=False)['sold']
                                    .shift(lag).astype(np.float16))
```
Mean encoding represents the mean of the target variable conditional on each of the features (for a binary target this is a probability). We calculate mean encodings based on the following features (a short sketch follows the list):
- item
- state
- store
- category
- department
- category and department
- store and item
- category & item
- department & item
- state & store
- state, store & category
- store, category & department
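A sketch of how a few of these encodings can be computed as group-wise means of the target sold; the column names here are illustrative, not necessarily the project's exact ones:

```python
import numpy as np

# Illustrative mean encodings: group-wise means of the target 'sold'
df['item_sold_avg'] = df.groupby('item_id')['sold'].transform('mean').astype(np.float16)
df['store_item_sold_avg'] = df.groupby(['store_id', 'item_id'])['sold'].transform('mean').astype(np.float16)
df['state_store_cat_sold_avg'] = df.groupby(['state_id', 'store_id', 'cat_id'])['sold'].transform('mean').astype(np.float16)
```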
We introduce a rolling window to capture the weekly sales: the window always spans the same number of days, but it keeps shifting forward.
```python
df['rolling_sold_mean'] = (df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold']
                             .transform(lambda x: x.rolling(window=7).mean())
                             .astype(np.float16))
```
We also introduce an expanding window column to capture the sales as the window size increases. This window does not shift; its size keeps growing.
```python
df['expanding_sold_mean'] = (df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold']
                               .transform(lambda x: x.expanding(2).mean())
                               .astype(np.float16))
```
We also keep seasonal trends in mind, such as the winter holiday season, during which we would expect the sales of all items to be higher than on other days. For this, we add a selling-trend column that takes a positive value whenever the day-wise sales are higher than the average sales over the entire duration.
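A sketch of such a selling-trend feature; the column name and the exact definition (daily sales minus the item's overall average) are assumptions:

```python
import numpy as np

# Assumed 'selling_trend' feature: positive when a day's sales exceed the
# item's average sales over the entire period
df['avg_sold'] = df.groupby('id')['sold'].transform('mean').astype(np.float16)
df['selling_trend'] = (df['sold'] - df['avg_sold']).astype(np.float16)
df.drop('avg_sold', axis=1, inplace=True)
```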
Now, we use an LGBMRegressor to predict the sales of items, building a model per store_id. LightGBM uses a leaf-wise tree growth algorithm, while many other popular tools use depth-wise tree growth. Compared with depth-wise growth, the leaf-wise algorithm can converge much faster; however, leaf-wise growth may over-fit if it is not used with appropriate parameters.
```python
model = LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.3,
    subsample=0.8,
    colsample_bytree=0.8,
    max_depth=8,
    num_leaves=50,
    min_child_weight=300,
)
```
We choose max_depth and num_leaves together: for higher accuracy with less over-fitting we want num_leaves <= 2^max_depth, so for a depth of 8 we use 50 num_leaves. A leaf-wise tree grows much deeper than a depth-wise tree, so we keep the number of leaves relatively small to prevent over-fitting while still benefiting from the higher accuracy of leaf-wise growth.
During the training of the model, we use early stopping and 'rmse' as the evaluation metric.
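A sketch of the training call with early stopping, assuming a recent lightgbm version and hypothetical X_train/y_train and X_valid/y_valid per-store splits where the last days are held out for validation:

```python
import lightgbm as lgb

# X_train/y_train and X_valid/y_valid are assumed per-store splits of the
# feature matrix and the 'sold' target, with the most recent days held out
model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_metric='rmse',
    callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=100)],
)
```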
After training, we look at the feature importances to see which features actually drive the predictions. Here is the graph for the same.
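A sketch of how such a feature-importance plot can be drawn from the fitted model, using the feature_importances_ attribute of LightGBM's scikit-learn API (X_train as in the training sketch above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Importance of each feature as reported by the fitted LightGBM model (split counts by default)
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 10), title='LightGBM feature importance')
plt.tight_layout()
plt.show()
```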