Instacart Market Basket Analysis case study.
This is my first Kaggle competition.
In this article, I will discuss how I achieved a private score of 0.35466. Instacart Market Basket Analysis is a hiring competition organized by Instacart on Kaggle. The goal is to figure out whether a product will be purchased by a customer or not. This is a binary classification problem where we have to predict if a product will be reordered or not. Here is the GitHub link for the case study.
Problem statement
The objective is to predict the products that a user is most likely to order in future orders, based on their prior buying pattern. The dataset that Instacart provided for this competition contains orders of 200,000 users, with each user having between 1 and 100 orders. Each order is assigned to one of the prior, train, or test sets: prior orders were made previously, while the train and test orders are the future data we use to train and evaluate models. Our target variable is reordered; it tells us whether the user reordered the product. A value of 1 indicates that the user reordered the product, whereas 0 indicates that they did not. We will train a binary classification model to predict whether the user reordered the product.
Mapping the real-world problem to a Machine Learning Problem.
Type of problem:
- This is a Classification problem where we have to predict if the product would be reordered or not.
- The key performance criterion is F1 score.
- We want the false positive rate to be as low as possible and the TPR to be high.
- We won’t be relying too much on AUC and accuracy because of class imbalance.
- We will be more interested in optimizing precision and recall.
Data overview.
1. Aisles.csv: {aisle_id, aisle}
This table contains more information about aisles:
aisle_id: unique aisle id.
aisle: name of the aisle for the product.
2. Department.csv: {department_id, department}
This table contains more information about departments:
department_id: unique department id.
department: name of the department.
3. Order.csv: {order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order}
This table contains more information about orders:
order_id: unique order id.
user_id: unique user id.
eval_set: which set an order belongs to (prior, train, test).
order_number: the order's sequence number for the user.
order_dow: day of the week the order was made.
order_hour_of_day: hour of the day the order was made.
days_since_prior_order: number of days since the customer's previous order.
4. order_products_train.csv / order_products_prior.csv: {order_id, product_id, add_to_cart_order, reordered}
The order-product details are split into two tables, order_products_prior and order_products_train. These tables contain product and order details.
order_id: unique order id.
product_id: unique product id.
add_to_cart_order: sequence in which the product was added to the cart.
reordered: 1 if the product was reordered, else 0; this is our target variable.
5. Products.csv: {product_id, product_name, aisle_id, department_id}
This table contains more information about products:
product_id: unique product id.
product_name: name of the product.
aisle_id: unique aisle id.
department_id: unique department id.
Preprocessing:
- No preprocessing is required, as the data is already clean.
Exploratory data analysis:
Checking to see how many unique orders are made.
len(order['order_id'].unique())
- There are 3,421,083 unique order ids.
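As a minimal sketch (assuming the competition CSVs have been downloaded to the working directory; file names may differ slightly from the ones used here), these counts can be reproduced with pandas:

import pandas as pd

# Load the orders table (path/file name assumed).
order = pd.read_csv('orders.csv')

# Number of unique orders.
print(order['order_id'].nunique())        # 3421083

# How orders are split across the prior / train / test sets.
print(order['eval_set'].value_counts())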
Distribution of eval_set:
- There are about 3.2 million prior orders.
- About 131 thousand train orders.
- About 75 thousand test orders.
Distribution of order day of week:
- People seem to order more on Saturday (0) and Sunday (1).
- Wednesday (4) has the least number of orders.
Distribution of order hour of day:
- There are very few orders in the early hours of the day (1–5); people are probably sleeping.
- From 8–11 orders spike; people are probably ordering for their morning needs.
- There is a second spike later in the day; people are probably ordering for their evening or night needs.
- The order count keeps reducing later in the evening.
Distribution of order number:
- Order numbers range from 1–100, i.e. customers have made anywhere from 1 to 100 orders.
- 40% of users have fewer than 10 orders.
- 50% of users have fewer than 15 orders.
- 90% of users have made fewer than 50 orders.
- Users with more than 50 orders are rare.
Distribution of days_since_prior_order:
- On day 7 there is a spike in orders; people may be restocking their weekly needs.
- Second and third spikes can be seen on days 14 and 21.
- There is a spike roughly every 7th day.
- On day 30 we see a huge spike; this may be because days_since_prior_order is capped at 30, so all orders made after 30 days fall into one bin.
Percentage of reorder in train set:
- About 60% of the products in the training dataset are reordered.
- About 40% of the products in the training dataset are not reordered.
Let’s try to find how many products people buy in an order:
- Most people buy 5 products per order on average.
- The majority of people buy anywhere between 1–10 products per transaction.
- Very few people order more than 30 products in one transaction.
Department-wise reorder ratio:
- Produce has the highest number of reorders, followed by eggs; people seem to reorder daily consumables and fresh food the most.
- People also like to reorder snacks, beverages, and frozen items a lot.
Product reorder ratio for the top 20 aisles:
- The milk aisle has the highest reorder ratio, followed by water, fresh fruit, eggs, and veggies.
Top 20 Reordered products:
- Bananas have the highest number of reorders, followed by bags of organic bananas.
- Most of the top twenty products are fresh fruits and veggies.
Let's see how add_to_cart_order affects the reorder ratio:
- 99.9% of orders have fewer than 49 products.
- Items added to the cart first have a higher chance of being reordered.
- The reorder ratio gradually decreases as add_to_cart_order increases.
Let's see how the day of the week affects the reorder ratio:
- The day of the week has no major effect on reordering.
- Reorders on Sunday and Thursday seem to be slightly higher.
Let's see how the hour of the day affects the reorder ratio:
- In the early part of the day the reorder ratio is higher; people seem to order for their daily needs.
- From 6 to 10 there is significantly high reordering.
- At night the reorder ratio is high compared to daytime; people seem to order for their morning needs.
Let's see how the day of the week and hour of the day together affect the reorder ratio:
- Most reordering happens during the early hours of the day, between 5 and 10.
- Sunday between 5 and 10 has the highest reordering.
- On Saturday night people reorder considerably more than on any other day of the week.
Correlation plot between numerical features and the target:
- order_number is slightly positively correlated with reordered; people who tend to order more also tend to reorder more.
- add_to_cart_order is slightly negatively correlated with reordered; as add_to_cart_order increases, the probability of a reorder decreases, so products added early have a higher chance of being reordered.
- days_since_prior_order is slightly negatively correlated with reordered; it tells us how many days it has been since the customer's last order, and as it increases, the probability of a reorder decreases.
- Day of week and hour of day are very slightly negatively correlated with reordered; they do not influence reordering much.
Metric:
- F1 score.
- Confusion matrix.
What is the F1 score?
Before getting to the F1 score, we need to know how classification metrics are measured. Say you have trained a classifier that tells whether a given image is of a dog or a cat.
Confusion matrix:
The confusion matrix is a table used to describe the performance of a model.
Where:
tp = True positive: the model predicted the class as positive and it is actually positive.
tn = True negative: the model predicted the class as negative and it is actually negative.
fp = False positive: the model falsely predicted the class as positive when it is actually negative.
fn = False negative: the model falsely predicted the class as negative when it is actually positive.
Example:
- When we pass a dog image to the model and it predicts the image as a dog, we have a correct prediction; this is also referred to as a true positive.
- When we pass a cat image to the model and it predicts the image as a cat, we have a correct prediction; this is also referred to as a true negative.
- When we pass a dog image to the model and it predicts the image as a cat, we have falsely predicted the dog as a cat, or the positive class as negative. This is also known as a false negative.
- When we pass a cat image to the model and it predicts the image as a dog, we have falsely predicted the cat as a dog, or the negative class as positive. This is also known as a false positive.
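As a quick illustration (the labels below are made up for the dog/cat example, with 1 = dog as the positive class), scikit-learn's confusion_matrix returns these four counts directly:

from sklearn.metrics import confusion_matrix

# Toy labels: 1 = dog (positive class), 0 = cat (negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels {0, 1} the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1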
Now that we have an idea of the confusion matrix let’s talk about the f1 score.
What is F1 score?
The F1 score is the harmonic mean of precision and recall.
Precision: of all the points the model predicted to be positive, what percentage are actually positive.
Recall: of all the points that actually belong to the positive class, how many were detected as positive.
Why F1 score?
The data we are training on is imbalanced. By imbalanced, I mean that the ratio between the positive and negative classes is not 1:1. For every positive sample we have 9 negative samples, so classification metrics like accuracy or ROC-AUC won't give the complete picture of the predictions; both can be high for imbalanced data.
Why is accuracy not a good idea when you have imbalanced data?
Let’s say we have 100 images, 90 belong to cats (majority class) and 10 belong to dogs(minority class). Our business constraint is such that we don’t want our classifier to make mistakes when predicting if a given image is a dog.
A classifier that simply predicts every image as a cat has an accuracy of 90%, yet it misclassifies 100% of the dog class. The classifier performs very poorly on the minority class and we have failed to meet our business objective, so we can't rely on accuracy alone in case of an imbalanced class distribution. For imbalanced data, precision and recall are a better option.
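A small sketch of this example, using a hypothetical always-predict-cat classifier, makes the numbers concrete:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 cat images (label 0) and 10 dog images (label 1).
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate classifier that predicts "cat" for every image.
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.9 -> looks good
print(recall_score(y_true, y_pred))    # 0.0 -> every dog image is missed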
Why is AUC not a good metric for an imbalanced dataset?
AUC is the area under the ROC curve, which plots TPR against FPR.
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
We can see that TPR does not take into consideration the effect of negative class.
Consider +ve as the minority class and -ve as the majority class. Let's say we have 1000 points (10 +ve, 990 -ve) and the classifier gives:
TP = 5, FN = 5, FP = 10, TN = 980
TPR = 5 / (5 + 5) = 0.5
FPR = 10 / (10 + 980) ≈ 0.0101
Because there are far more points in the negative class, the FPR comes out small, as we are dividing by a large denominator.
Precision and recall are better metrics for an imbalanced dataset because they are directly affected by the skew in the data: increasing or decreasing the probability threshold directly affects FP and FN. For the same example:
precision = 5 / (5 + 10) ≈ 0.3333
recall = 5 / (5 + 5) = 0.5
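These numbers can be verified with a few lines of Python:

tp, fn, fp, tn = 5, 5, 10, 980

tpr = tp / (tp + fn)        # 0.5  (this is also the recall)
fpr = fp / (fp + tn)        # ~0.0101, tiny because TN dominates the denominator
precision = tp / (tp + fp)  # ~0.3333
recall = tp / (tp + fn)     # 0.5

print(tpr, fpr, precision, recall)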
Now that we have an idea of why accuracy and AUC are not good choices when the data is imbalanced, let's see why the F1 score makes sense. Consider a case where we have two classifiers.
Classifier 1 has a precision of 70% and a recall of 60%
Classifier 2 has a precision of 80% and a recall of 50%
Which classifier is performing better, we can get this answer by computing the F1 score for both classifiers.
Classifier 1:
F1 = 2 * (0.7 * 0.6) / (0.7 + 0.6) ≈ 0.646
Classifier 2:
F1 = 2 * (0.8 * 0.5) / (0.8 + 0.5) ≈ 0.615
In this case, we can say classifier 1 is performing better; a higher F1 score is better.
Train test split:
The data is split based on the eval_set column in the order table; each data point carries a 'prior', 'train', or 'test' flag. We merge the train and prior sets so that we have more samples to train on.
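A minimal sketch of this split (file names follow the dataset description above and may need adjusting to your local copy):

import pandas as pd

# orders.csv carries the eval_set flag for every order.
orders = pd.read_csv('orders.csv')

prior_orders = orders[orders['eval_set'] == 'prior']
train_orders = orders[orders['eval_set'] == 'train']
test_orders = orders[orders['eval_set'] == 'test']

# Merge the prior and train order-product tables so there are more samples to learn from.
op_prior = pd.read_csv('order_products_prior.csv')
op_train = pd.read_csv('order_products_train.csv')
order_products = pd.concat([op_prior, op_train], ignore_index=True)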
Feature engineering
These are the 12 features that I created to train models.
- ttl_cnt_product_user: total number of orders made by the user.
- Avg_no_prod_perOrder: average number of products bought per order by the user.
- days_since_prior_order: average number of days since the user's prior order.
- max_dow: day of the week on which the user makes the most orders.
- max_hour_of_day: hour of the day at which the user makes the most orders.
- usr_ro_ratio: user reorder ratio; how often the user reorders.
- product_name_length: length of the product name.
- isglutenfree: whether the product is gluten-free.
- is_organic: whether the product is organic.
- ur_pr_count: number of times the user has ordered the product.
- ur_pr_reordered: number of times the user has reordered the product.
- product_is_rare: uses IDF to find rare products; if the IDF score is greater than the 90th percentile, the product is marked as rare.
After feature engineering, I checked whether the created features had introduced any correlation.
- Removed ur_pr_count as it is highly correlated with ur_pr_reordered; correlated features do not add much information.
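For illustration, here is a hedged sketch of how a few of the user-level features above could be computed with pandas; user_orders is an assumed DataFrame obtained by joining the prior order-product rows with the orders table:

import pandas as pd

# user_orders: one row per (order, product) with user_id, order_id, reordered,
# days_since_prior_order, etc. (assumed to be prepared beforehand).
user_features = user_orders.groupby('user_id').agg(
    ttl_cnt_product_user=('order_id', 'nunique'),             # total orders made by the user
    usr_ro_ratio=('reordered', 'mean'),                        # how often the user reorders
    days_since_prior_order=('days_since_prior_order', 'mean')  # average gap between orders
).reset_index()

# Average number of products per order for each user.
avg_prod = (
    user_orders.groupby(['user_id', 'order_id']).size()
    .groupby('user_id').mean()
    .rename('Avg_no_prod_perOrder')
    .reset_index()
)
user_features = user_features.merge(avg_prod, on='user_id', how='left')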
Data Preprocessing
- Standard scaling of the numerical features.
- Target encoding of the categorical features.
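A hedged sketch of these two steps (the column lists and DataFrame names are illustrative; the target encoding shown is a simple mean-of-target mapping learned on the train split only):

from sklearn.preprocessing import StandardScaler

numerical_cols = ['ttl_cnt_product_user', 'Avg_no_prod_perOrder', 'days_since_prior_order']
categorical_cols = ['max_dow', 'max_hour_of_day']

# Standard scaling of the numerical features (fit on train, applied to test).
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Target encoding: replace each category with the mean of the target on the train split.
for col in categorical_cols:
    means = y_train.groupby(X_train[col]).mean()
    X_train[col] = X_train[col].map(means)
    X_test[col] = X_test[col].map(means).fillna(y_train.mean())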
Train, test, and validation data:
- The train set has the shape 8474661 × 13.
- The test set has the shape 4833292 × 12; the target variable is not available for the test set.
- The validation set has the shape 1694933 × 13; it is created from the train dataset, with the validation split being 20% of the train set.
The columns the model is trained on are:
['ur_pr_reordered', 'reordered_last', 'order_number', 'ttl_cnt_product_user',
'Avg_no_prod_perOrder', 'days_since_prior_order', 'max_hour_of_day', 'usr_ro_ratio',
'max_dow', 'product_name_length', 'isglutenfree', 'is_organic', 'product_is_rare']
Machine Learning models
Algorithms I tried:
- Knn.
- SVC.
- Logistic regression.
- Naive Bayes.
- Decision Tree.
- Random forest.
- Xgboost.
- Stacking.
I started with KNN and quickly found out that the model took a lot of time to train. I tried KNN algorithms with lower time complexity, like KD-tree and ball tree, but the training time was still very long (more than 30 hours for a single model), so I used a subset of the data for training due to compute constraints.
KNN model performance:
We can clearly see that KNN is underfitting. The TNR is very high at 99%, the FNR is about 89% (most positive cases are predicted as negative), and the TPR is very low at 10%. One key hyperparameter that can be tuned to reduce underfitting is n_neighbors; as n_neighbors increases, the model tends to underfit more.
Hyperparameter Tuning
In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.
I chose hyperopt for hyperparameter tuning, to find the parameters that maximize the F1 score.
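As a hedged sketch of how such a search can be set up with hyperopt (the KNN search space, data variables, and number of evaluations below are illustrative, not my exact setup):

from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

# Search space that hyperopt samples candidate hyperparameters from.
space = {
    'n_neighbors': hp.choice('n_neighbors', list(range(3, 52, 2))),
    'weights': hp.choice('weights', ['uniform', 'distance']),
}

def objective(params):
    model = KNeighborsClassifier(**params)
    model.fit(X_train, y_train)                # X_train / y_train assumed to exist
    f1 = f1_score(y_val, model.predict(X_val))
    return -f1                                 # hyperopt minimizes, so negate F1

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=trials)
print(best)  # note: hp.choice values come back as indices into the option lists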
I created a custom function that returns the threshold at which the F1 score is maximized.
By default, a threshold of 0.5 is used:
- Prediction < 0.5 = class 0
- Prediction >= 0.5 = class 1
So now the prediction that target = 1 (reordered) depends on the threshold that maximizes the test F1 score.
How to find the best threshold that maximizes the F1 score?
I tuned KNN to find the best hyperparameters and then trained a KNeighborsClassifier with them. I then used the predicted probabilities on the test data, together with the true test labels, to compute precision_recall_curve. This function returns precision, recall, and threshold arrays.
Now using precision and recall I can calculate f1 score:
f1 = (2 * precision * recall) / (precision + recall)
Then I find the index which gives a max f1 score.
index = np.argmax(f1)
Using index we can find a threshold that maximizes the F1 score.
optimal_threshold=threshold[index]
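Putting the pieces together, a minimal sketch of the threshold search (variable names are illustrative):

import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(y_true, y_proba):
    """Return the probability threshold at which the F1 score is maximized."""
    precision, recall, threshold = precision_recall_curve(y_true, y_proba)

    # precision/recall have one more element than threshold; drop the last point.
    precision, recall = precision[:-1], recall[:-1]

    # F1 for every candidate threshold, guarding against division by zero.
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)

    index = np.argmax(f1)
    return threshold[index], f1[index]

# Usage (assumed variables):
# y_proba = model.predict_proba(X_test)[:, 1]
# optimal_threshold, best_f1 = find_optimal_threshold(y_test, y_proba)
# y_pred = (y_proba >= optimal_threshold).astype(int)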
This approach is applied to every machine learning model that I train. For model evaluation, the model that gives the best test F1 score and the best private score is selected.
KNN performance after hyperparameter tuning and thresholding.
Model performance has improved, though not by a huge margin. In the test performance, the TPR has increased from 10% to 27%; we want the TPR to be as high as possible. As the threshold is increased from 0.5 to 0.66, precision increases while recall decreases.
How does threshold affect Precision and recall?
There is a tug of war between precision and recall: if precision increases, recall decreases, and vice versa.
As you increase the threshold for prediction, the model predicts a sample as positive only if it is very confident.
- False positives decrease, therefore precision increases (fewer negative cases are predicted as positive).
- False negatives increase, therefore recall decreases (more positive cases are predicted as negative).
As you decrease the threshold for prediction, the model predicts a sample as positive even if it is not very confident.
- False positives increase, therefore precision decreases (more negative cases are predicted as positive).
- False negatives decrease, therefore recall increases (fewer positive cases are predicted as negative).
Model Selection:
After training all the machine learning models and doing hyperparameter tuning, random forest and Xgboost performance are similar, but Xgboost gives the best leaderboard score, so it is taken as the final model. To push my leaderboard score further, I tried a stacking classifier, but it did not gain much, so I dropped it.
- AUC: Xgboost has the highest AUC.
- TPR/Recall: Xgboost has the highest TPR/recall.
- FPR: KNN has the lowest FPR.
- FNR: Xgboost has the lowest FNR.
- TNR: KNN has the highest TNR.
- Precision: Bernoulli NB has the highest precision.
- F1: Random Forest has the highest F1.
- Private LB: Xgboost has the highest private leaderboard score.
- Public LB: Xgboost has the highest public leaderboard score.
- Overall, Xgboost is performing best.
Xgboost hyperparameters:
best_params = {'colsample_bytree': 0.7736610394413841, 'eta': 0.1, 'gamma': 1, 'max_depth': 5, 'scale_pos_weight': 9, 'n_estimators': 1800, 'subsample': 0.598690048807158, 'tree_method': 'gpu_hist'}
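A hedged sketch of training with these parameters (the train/test variables and optimal_threshold are assumptions carried over from the earlier snippets; 'gpu_hist' requires a GPU, so use 'hist' on a CPU-only machine):

from xgboost import XGBClassifier

# best_params as listed above; swap 'tree_method' to 'hist' when no GPU is available.
model = XGBClassifier(**best_params)
model.fit(X_train, y_train)

# Predicted probabilities, then the F1-maximizing threshold found earlier.
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= optimal_threshold).astype(int)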
How increasing or decreasing parameter values leads to overfitting/underfitting in the case of boosting algorithms:
colsample_bytree: as you increase it, the model tends to overfit.
gamma: a higher value of gamma tends to make the model underfit.
max_depth: increasing max_depth tends to make Xgboost overfit.
n_estimators: increasing the number of estimators tends to make the model overfit.
subsample: as you increase it, the model tends to overfit.
The train and test data show similar performance, so the model is balanced, neither overfitting nor underfitting. Still, a concern is that the FNR is high at 51%; many positive cases are predicted as negative.
Future improvements
- Engineer a few more features so that the model has more information to learn from. The main difference between my solution and the top leaderboard solutions is that they crafted many more features.
References
- https://en.wikipedia.org/wiki/Hyperparameter_optimization
- https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
- http://appliedaicourse.com/