2. Motivation
Create a classification model that can predict
which previously purchased products will be in a
user’s next order, to better understand users’
preferences and optimize the recommendation
system.
5. Features
User features:
User total orders
User avg. cart size
User total products
User avg. days since prior order
User avg. reorder per cart
User avg. days between orders
Product features:
Product total orders
Product avg. add to cart order
Product total reorder
Product reorder probability
Department total order
Aisle total order
Avg. order hour of day
Avg. order day of week
User-Product features:
User-product total orders
User-product latest in cart
User-product avg. add to cart
User-product order frequency
Orders since previous product order
User-product avg. days between orders
User-product order days max
User-product days since last product order
User-product avg. order hour of day
User-product avg. order day of week
User-product days since last order max
6. Results: XGBoost & F1 Optimization
Scores:
Metric      XGBoost: Training Baseline   XGBoost: Test+Optimization
F1          0.206                        0.396
Accuracy    0.906                        0.862
Precision   0.62                         0.456
Recall      0.123                        0.350
7. Threshold Adjustment - F1 Optimization
Decision Threshold: 0.199
Class balance: negative ~90%, positive ~10%
9. Conclusions
● Additional model features could
further improve model scores
● Additional metrics, such as dates,
user location, user personal
information, could further improve
the model
● Pickle your models and results!
Editor's Notes
For this project I worked on a past Kaggle competition, Instacart Market Basket Analysis.
The main goal was to create a classification model that can predict which previously purchased products will be in a user’s next order, to better understand users’ preferences and optimize the recommendation system.
Basically, we want to use each user’s order history to make relevant recommendations that maximize the probability of the user ordering a recommended product.
I used the Instacart dataset from Kaggle, which includes more than 30M rows of data on more than 200k Instacart users.
The dataset includes 5 relational tables (orders, products, product orders, aisles and departments).
Data processing was done using PostgreSQL.
To create and test models for this data I used AWS (64 GiB).
The initial evaluation, training and testing of my models was done on 10% of the data to save time, and then I moved on to the whole dataset.
For my final model I engineered 25 product, user, time and product-user features.
During model training the data was split twice to create a holdout set: first into 80% for training and 20% for testing, and then the training set was split again into 75% for training and 25% for validation (60% / 20% / 20% of the full data).
The best classification model for this data was XGBoost.
The target of my model was the reorder of a product by a user: 1 if a specific product will be reordered by a specific user in the next cart, and 0 if it won’t be.
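The two-step split described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name, the seed and the use of a simple random shuffle are assumptions, not the code used in the project (in practice scikit-learn's train_test_split is the usual tool for this).

```python
import random

def split_train_val_test(rows, seed=0):
    """Hypothetical helper: 80/20 train/test split, then 75/25 train/validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible

    # First split: hold out 20% of all rows for final testing.
    n_test = int(len(shuffled) * 0.20)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    # Second split: 25% of the remaining 80% becomes the validation set.
    n_val = int(len(remaining) * 0.25)
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test

train, val, test = split_train_val_test(range(1000))
print(len(train), len(val), len(test))
```

On 1,000 rows this gives 600 training, 200 validation and 200 test rows, with no row appearing in more than one set.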
During model selection and training I created multiple features using the metrics that were given in the dataset.
These are the final features I used in my classification model.
User features:
1. user_total_orders: how many orders a user has
2. user_avg_cartsize: average number of products they buy in an order
3. user_total_products: how many different products they've bought over time
4. user_avg_days_since_prior_order: how long they typically wait between orders
5. user_avg_reorder_per_cart: average number of reorders per cart
6. avg_days_between_orders: average number of days between a user's orders
7. user_avg_order_hod: average order hour of day
8. user_avg_order_dow: average order day of the week
Product features:
1. product_total_orders: product popularity across all users
2. product_avg_add_to_cart_order: typical priority of the product within an order, obtained by averaging its add_to_cart_order
3. product_total_reorder: number of times the product was reordered
4. product_reorder_probability = product_total_reorder / product_total_orders
5. department_total_order: categorical
6. aisle: categorical
User-Product features:
1. user_product_total_orders: how many times a user ordered a product
2. latest_in_cart: user’s latest cart products
3. user_product_avg_add_to_cart_order: get a sense of how much priority each user places on each product by looking at the typical add_to_cart_order for that user-product combination
4. user_product_order_freq = user_product_total_orders / user_total_orders: share of a user's orders that contain the product
5. orders_since_previous_product_order: how many orders the user has made since the last time they ordered the product
6. user_product_order_days_max: max days of product order per user
7. user_product_days_since_last_product_order: how many days have passed since the user last ordered the product
8. user_product_days_last_order_max = days_since_previous_product_order - user_product_order_days_max
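To make the two ratio features concrete, here is a small sketch that computes user_product_order_freq and product_reorder_probability on a toy order history. The rows and the tuple layout (user_id, order_number, product_id, reordered) are made up for illustration; the project itself built its features in PostgreSQL, so this is not the original code.

```python
from collections import Counter

# Toy order history (hypothetical rows): (user_id, order_number, product_id, reordered)
rows = [
    (1, 1, "milk", 0),
    (1, 1, "eggs", 0),
    (1, 2, "milk", 1),
    (1, 3, "milk", 1),
    (1, 3, "eggs", 1),
    (2, 1, "milk", 0),
    (2, 2, "bread", 0),
]

# user_total_orders: number of distinct orders per user.
orders_by_user = {}
for user, order, _, _ in rows:
    orders_by_user.setdefault(user, set()).add(order)
user_total_orders = {u: len(o) for u, o in orders_by_user.items()}

# user_product_total_orders: how many times each user ordered each product.
user_product_total_orders = Counter((u, p) for u, _, p, _ in rows)

# user_product_order_freq = user_product_total_orders / user_total_orders
user_product_order_freq = {
    (u, p): n / user_total_orders[u]
    for (u, p), n in user_product_total_orders.items()
}

# product_reorder_probability = product_total_reorder / product_total_orders
product_total_orders = Counter(p for _, _, p, _ in rows)
product_total_reorder = Counter()
for _, _, p, reordered in rows:
    product_total_reorder[p] += reordered
product_reorder_probability = {
    p: product_total_reorder[p] / n for p, n in product_total_orders.items()
}

print(user_product_order_freq)
print(product_reorder_probability)
```

In this toy data user 1 bought milk in all three of their orders, so the frequency for (1, "milk") is 1.0, while milk was a reorder in 2 of its 4 purchases, giving a reorder probability of 0.5.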
After evaluating and comparing various scores, the best classification model for this dataset was XGBoost.
In addition, because of target class imbalance (reordered products were only 10% of all targets), I adjusted the decision threshold after model training to optimize the F1 score.
We can see an improvement in the F1 and recall scores (F1 from 0.206 to 0.396, recall from 0.123 to 0.350).
Accuracy = (TP+TN)/(TP+TN+FP+FN): share of all predictions that are classified correctly
Precision = TP/(TP+FP): correctly classified positives out of all positive predictions
Recall (sensitivity) = TP/(TP+FN): correctly classified positives out of all actual positives
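As a quick check on these definitions, the sketch below implements the standard formulas and recomputes F1 from the precision and recall values reported on the results slide. The confusion counts in the second part are hypothetical and only illustrate the formulas.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # F1 is the harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# F1 recomputed from the reported precision and recall.
f1_baseline = f1(0.62, 0.123)    # training baseline column
f1_optimized = f1(0.456, 0.350)  # test + optimization column
print(round(f1_baseline, 3), round(f1_optimized, 3))

# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 35, 860, 40, 65
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

The recomputed values agree with the reported F1 scores of 0.206 and 0.396 to within rounding; the small gap on the baseline is plausibly because precision is reported to only two decimals.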
Adjusting the decision threshold to optimize F1 after model training gave the highest score in this case.
Based on this adjustment the decision threshold is 0.199, so I used this threshold when evaluating the model on the final test set.
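One common way to do this kind of adjustment is to scan candidate thresholds over the predicted probabilities of a validation set and keep the one with the highest F1. The sketch below shows the idea on made-up labels and probabilities; the function names and the candidate grid are assumptions, and it is not necessarily how the 0.199 threshold was obtained.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 when every probability >= threshold is predicted as positive."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    fn = sum(1 for y, p in pairs if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))

# Toy imbalanced validation data (hypothetical): 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.30, 0.25, 0.60]

candidates = [i / 100 for i in range(1, 100)]
best = best_f1_threshold(y_true, y_prob, candidates)
print(best, f1_at_threshold(y_true, y_prob, best), f1_at_threshold(y_true, y_prob, 0.5))
```

With an imbalanced target the default 0.5 threshold misses positives that the model scores well below 0.5, so the best threshold in this toy example also ends up much lower than 0.5.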
I found that the most important features in this model were User-product order frequency, product reorder probability, average reorder per user and user total orders.
user_product_order_freq = user_product_total_orders / user_total_orders: % of times a product occurs across all of a user's orders
product_reorder_probability = product_total_reorder / product_total_orders
user_avg_reorder_per_cart: average number of reorders per cart
user_total_orders: how many orders a user has
To conclude:
Additional model features could further improve model scores
Additional metrics, such as dates, user location, user personal information, could further improve the model
Pickle your models and results!
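A minimal sketch of that last point, using the standard pickle module to save and reload a results dictionary. The file name and dictionary keys are made up for illustration; a trained model object can be saved the same way, though pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Results worth keeping after a long training run (values from the results slide).
results = {
    "model": "XGBoost",
    "decision_threshold": 0.199,
    "test_scores": {"f1": 0.396, "accuracy": 0.862, "precision": 0.456, "recall": 0.350},
}

# Hypothetical file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.pkl")

with open(path, "wb") as f:
    pickle.dump(results, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["decision_threshold"])
```

Saving results this way means scores and thresholds can be reloaded later without retraining on the full dataset.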