Metis Project 3: Predicting Instacart Product Reorder - Kaggle Challenge

2. Motivation Create a classification model that predicts which previously purchased products will be in a user’s next order, to better understand users’ preferences and optimize the recommendation system.

3. Motivation

4. Procedure

5. Features
User features: User total orders; User avg. cart size; User total products; User avg. days since prior order; User avg. reorder per cart; User avg. days between orders
Product features: Product total orders; Product avg. add to cart order; Product total reorder; Product reorder probability; Department total order; Aisle total order; Avg. order hour of day; Avg. order day of week
User-Product features: User-product total orders; User-product latest in cart; User-product avg. add to cart; User-product order frequency; Orders since previous product order; User-product avg. days between orders; User-product order days max; User-product days since last product order; User-product avg. order hour of day; User-product avg. order day of week; User-product days since last order max

6. Results: XGBoost & F1 Optimization

Scores      XGBoost: Training Baseline   XGBoost: Test+Optimization
F1          0.206                        0.396
Accuracy    0.906                        0.862
Precision   0.62                         0.456
Recall      0.123                        0.350

7. Threshold Adjustment - F1 Optimization
Decision Threshold: 0.199
Class balance: negative ~90%, positive ~10%

8. Feature Importance

9. Conclusions
● Additional model features could further improve model scores
● Additional metrics, such as dates, user location, user personal information, could further improve the model
● Pickle your models and results!

Editor's Notes

For this project I worked on a past Kaggle competition, the Instacart Market Basket Analysis.
The main goal was to create a classification model that predicts which previously purchased products will be in a user’s next order, to better understand users’ preferences and optimize the recommendation system.
Basically, we want to use each user’s order history to make relevant recommendations, maximizing the probability that the user orders a recommended product.
I used the Instacart dataset from Kaggle, which includes more than 30M rows of data on more than 200k Instacart users. The dataset consists of 5 relational tables (orders, products, product orders, aisles and departments). Data processing was done in PostgreSQL. To build and test models on this data I used AWS (64 GiB). The initial evaluation, training and testing of my models was done on 10% of the data to save time, and then I moved on to the whole dataset. For my final model I engineered 25 product, user, time and user-product features. During model training the data was split twice to create a holdout set: first 80% for training and 20% for testing, and then the training set was split again, 75% for training and 25% for validation. The best classification model for this data was XGBoost.
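The two-stage split described above can be sketched as follows. This is only an illustration using the standard library: the function name `split_indices`, the seed, and the row count are assumptions, and the notes do not say which tool or shuffling strategy was actually used.

```python
import random


def split_indices(n_rows, seed=42):
    """Shuffle row indices, split 80/20, then split the 80% again 75/25."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # First split: 80% for training (incl. validation), 20% held out for testing.
    test_cut = n_rows * 8 // 10
    train_val, test = indices[:test_cut], indices[test_cut:]

    # Second split: 75% of the training part for training, 25% for validation.
    val_cut = len(train_val) * 3 // 4
    train, val = train_val[:val_cut], train_val[val_cut:]
    return train, val, test


# Hypothetical dataset of 1000 rows.
train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Overall this yields roughly 60% training, 20% validation and 20% test rows.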
The target of my model was whether a user reorders a product: 1 if a specific product will be reordered by a specific user in the next cart, and 0 if it won’t be. While choosing and training models I created multiple features from the metrics given in the dataset. These are the final 25 features I used in my classification model.
User features:
1. user_total_orders: how many orders a user has
2. user_avg_cartsize: average number of products they buy in an order
3. user_total_products: how many different products they've bought over time
4. user_avg_days_since_prior_order: how long they typically wait between orders
5. user_avg_reorder_per_cart: average number of reorders per cart
6. avg_days_between_orders
7. user_avg_order_hod: average order hour of day
8. user_avg_order_dow: average order day of the week
Product features:
1. product_total_orders: product popularity across all users
2. product_avg_add_to_cart_order: typical priority level in an order, found by averaging its add_to_cart_order
3. product_total_reorder: times product was reordered
4. product_reorder_probability = product_total_reorder / product_total_orders
5. department_total_order: categorical
6. aisle: categorical
User-Product features:
1. user_product_total_order: how many times a user ordered a product
2. latest_in_cart: user’s latest cart products
3. user_product_avg_add_to_cart_order: how much priority each user places on each product, based on the typical add_to_cart_order for that user-product combination
4. user_product_order_freq = user_product_total_orders / user_total_orders: % of times a product occurs across all of a user's orders
5. orders_since_previous_product_order: how many orders the user has made since the last time they ordered the product
6. user_product_order_days_max: max days of product order per user
7. user_product_days_since_last_product_order: how many days have passed since the product was last ordered
8. user_product_days_last_order_max = days_since_previous_product_order - user_product_order_days_max
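As an illustration of the two ratio features defined in the list above, here is a minimal sketch on toy data. The rows and the tuple layout are hypothetical; the actual features were built in PostgreSQL, so this only mirrors the arithmetic, not the original queries.

```python
from collections import Counter

# Hypothetical toy rows: (user_id, order_number, product_id, reordered)
order_products = [
    (1, 1, "banana", 0),
    (1, 1, "milk", 0),
    (1, 2, "banana", 1),
    (1, 3, "banana", 1),
    (2, 1, "banana", 0),
    (2, 2, "milk", 0),
]

user_orders = {}                       # user_id -> set of order numbers
user_product_total_orders = Counter()  # (user_id, product_id) -> times ordered
product_total_orders = Counter()       # product_id -> times ordered by anyone
product_total_reorder = Counter()      # product_id -> times it was a reorder

for user_id, order_number, product_id, reordered in order_products:
    user_orders.setdefault(user_id, set()).add(order_number)
    user_product_total_orders[(user_id, product_id)] += 1
    product_total_orders[product_id] += 1
    product_total_reorder[product_id] += reordered

user_total_orders = {u: len(orders) for u, orders in user_orders.items()}

# user_product_order_freq = user_product_total_orders / user_total_orders
user_product_order_freq = {
    (u, p): n / user_total_orders[u]
    for (u, p), n in user_product_total_orders.items()
}

# product_reorder_probability = product_total_reorder / product_total_orders
product_reorder_probability = {
    p: product_total_reorder[p] / product_total_orders[p]
    for p in product_total_orders
}

print(user_product_order_freq)
print(product_reorder_probability)
```

In this toy data, user 1 bought bananas in all three of their orders, so that user-product pair gets an order frequency of 1.0.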
After evaluating and comparing various scores, the best classification model for this dataset was XGBoost. In addition, because of target class imbalance (reordered products were only 10% of all targets), I adjusted the decision threshold after model training to optimize the F1 score. We can see an improvement in the F1 and recall scores.
Accuracy = (TP+TN)/(TP+TN+FP+FN): ratio of all correctly classified samples
Precision = TP/(TP+FP): true positives out of all positive classifications
Recall (sensitivity) = TP/(TP+FN): true positives out of all actual positives
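A small sketch of these metrics, computed from confusion-matrix counts, may help show why accuracy alone is misleading with a 90/10 class split. The function names and the example counts are made up for illustration; only the final check uses numbers reported in the results table.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_scores(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts for an imbalanced problem: accuracy looks high,
# but most true reorders are missed, so recall and F1 stay low.
scores = classification_scores(tp=10, tn=880, fp=10, fn=100)
print(scores)

# Consistency check on the reported test scores (precision 0.456, recall 0.350).
reported_f1 = round(f1_from(0.456, 0.350), 3)
print(reported_f1)
```

The last check reproduces the reported test F1 of 0.396 from the reported precision and recall.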
Adjusting the decision threshold for F1 optimization after model training gave the highest score in this case. Based on the adjustment, the decision threshold is 0.199, so I used this threshold on my final testing set.
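One common way to find such a threshold is a simple sweep over candidate values, keeping the one with the highest F1. The sketch below assumes that approach; the notes do not say how 0.199 was actually found, and the labels, probabilities and candidate grid here are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_at(y_true, y_prob, threshold):
    """Classify as positive when the probability reaches the threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return f1_score(y_true, y_pred)


def best_f1_threshold(y_true, y_prob, candidates):
    """Return the first candidate threshold with the highest F1, and that F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        score = f1_at(y_true, y_prob, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Hypothetical held-out labels (30% positive) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.02, 0.05, 0.10, 0.15, 0.30, 0.08, 0.12, 0.25, 0.45, 0.60]
candidates = [i / 100 for i in range(1, 100)]

threshold, score = best_f1_threshold(y_true, y_prob, candidates)
default_score = f1_at(y_true, y_prob, 0.5)
print(threshold, score, default_score)
```

On imbalanced data the best threshold typically lands well below the default 0.5, which trades some precision for recall, the same pattern visible in the results table.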
I found that the most important features in this model were user-product order frequency, product reorder probability, user average reorder per cart and user total orders:
user_product_order_freq = user_product_total_orders / user_total_orders: % of times a product occurs across all of a user's orders
product_reorder_probability = product_total_reorder / product_total_orders
user_avg_reorder_per_cart: average number of reorders per cart
user_total_orders: how many orders a user has
To conclude:
● Additional model features could further improve model scores
● Additional metrics, such as dates, user location, user personal information, could further improve the model
● Pickle your models and results!
