2nd Place Solution
Instacart Market Basket Analysis
Agenda
• My Background
• Problem Overview
• Main Approach
• Feature Engineering
• Feature Importance
• Important Findings
• F1 maximization
My Background
• Bachelor of Economics
• Programmer in the financial industry
• Consultant in the financial industry
• 2nd place at KDD Cup 2015
• Data Scientist at Yahoo! JAPAN
Problem Overview
• In this competition, we have to predict which previously purchased products a user will reorder in their next order.
• So it is a little different from general recommendation.
• I mean,
Problem Overview
• How hot(user)?
*prior is regarded as train
Problem Overview
• How hot(item)?
*Clipped by 500
Problem Overview
• Evaluation metric is mean F1 score
• Precision and Recall
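For reference, the standard definitions behind this metric can be written out as below. The per-order sets P_o (predicted products) and T_o (products actually reordered) and the order count N are notation introduced here for illustration, not taken from the slides.

```latex
\[
\mathrm{Precision}_o = \frac{|P_o \cap T_o|}{|P_o|}, \qquad
\mathrm{Recall}_o = \frac{|P_o \cap T_o|}{|T_o|}, \qquad
F_{1,o} = \frac{2\,\mathrm{Precision}_o\,\mathrm{Recall}_o}{\mathrm{Precision}_o + \mathrm{Recall}_o}
\]
\[
\text{mean } F_1 = \frac{1}{N}\sum_{o=1}^{N} F_{1,o}
\]
```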
Problem Overview
• Links between the files
Main Approach
• We are given orders.csv
Main Approach
• We are given order_products.csv
Main Approach
• Reorder Prediction
user_id product_id label
Main Approach
• None Prediction
user_id label
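A minimal sketch of how the two training tables above could be assembled, assuming candidates are every (user, product) pair seen in a user's prior orders and that the None label is 1 when none of those candidates appear in the train order. The toy rows and the `build_tables` helper are hypothetical; only the column names `order_id`, `user_id`, `eval_set` and `product_id` come from the competition files.

```python
# Toy stand-ins for orders.csv and order_products.csv (values are made up).
orders = [
    {"order_id": 1, "user_id": 10, "eval_set": "prior"},
    {"order_id": 2, "user_id": 10, "eval_set": "train"},
    {"order_id": 3, "user_id": 20, "eval_set": "prior"},
    {"order_id": 4, "user_id": 20, "eval_set": "train"},
]
order_products = [
    {"order_id": 1, "product_id": 100},
    {"order_id": 1, "product_id": 200},
    {"order_id": 2, "product_id": 100},
    {"order_id": 3, "product_id": 300},
    {"order_id": 4, "product_id": 400},
]


def build_tables(orders, order_products):
    """Return (reorder_rows, none_rows) built from prior and train orders."""
    order_info = {o["order_id"]: o for o in orders}
    prior, target = {}, {}
    for row in order_products:
        order = order_info[row["order_id"]]
        if order["eval_set"] == "prior":
            prior.setdefault(order["user_id"], set()).add(row["product_id"])
        elif order["eval_set"] == "train":
            target.setdefault(order["user_id"], set()).add(row["product_id"])

    reorder_rows, none_rows = [], []
    for user_id in sorted(prior):
        bought_next = target.get(user_id, set())
        labels = []
        for product_id in sorted(prior[user_id]):
            label = int(product_id in bought_next)  # 1 = reordered
            labels.append(label)
            reorder_rows.append((user_id, product_id, label))
        # Assumption: None means no previously bought product was reordered.
        none_rows.append((user_id, int(sum(labels) == 0)))
    return reorder_rows, none_rows


reorder_rows, none_rows = build_tables(orders, order_products)
for row in reorder_rows:
    print("reorder", row)
for row in none_rows:
    print("none", row)
```

In this toy data user 20 only buys a new product in the train order, so every candidate is labelled 0 and the None label becomes 1.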
Feature Engineering
• I made 4 types of features
1. User
• What this user is like
2. Item
• What this item is like
3. User x Item
• How the user feels about the item
4. Datetime
• What this day and hour are like
*For the None model, I can’t use the above features directly, except the user and datetime ones. So I convert the others to per-user stats (min, mean, max, sum, std…).
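The conversion to stats for the None model can be sketched as a per-user aggregation of a user x item feature. This is an illustration under assumptions: the `user_stats` helper and the toy rows are hypothetical, the `feature-stat` naming mirrors names such as `total_buy-max` used on later slides, and population standard deviation is an arbitrary choice since the slides do not say which variant was used.

```python
from statistics import mean, pstdev


def user_stats(rows, feature):
    """Collapse one user x item feature into per-user summary statistics."""
    values_by_user = {}
    for row in rows:
        values_by_user.setdefault(row["user_id"], []).append(row[feature])

    stats = {}
    for user_id, values in values_by_user.items():
        stats[user_id] = {
            f"{feature}-min": min(values),
            f"{feature}-mean": mean(values),
            f"{feature}-max": max(values),
            f"{feature}-sum": sum(values),
            # Assumption: population std; the original may use sample std.
            f"{feature}-std": pstdev(values),
        }
    return stats


# Hypothetical user x item rows: how often each user bought each product.
rows = [
    {"user_id": 1, "product_id": 100, "total_buy": 3},
    {"user_id": 1, "product_id": 200, "total_buy": 1},
    {"user_id": 2, "product_id": 100, "total_buy": 5},
]
stats = user_stats(rows, "total_buy")
print(stats[1])
```

Each user ends up with one row of fixed width, which is what a user-level model such as the None model needs.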
Feature Importance for reorder
Feature Importance for None
Important Findings for reorder - 1
• user_id: 54035
Important Findings for reorder - 2
• days_last_order-max is the difference between days_since_last_order_this_item and useritem_order_days_max
• days_since_last_order_this_item is a user x item feature: how many days have passed since the user last ordered this item
• useritem_order_days_max is also a user x item feature: the maximum span (in days) between the user’s orders of this item
• For more detail, see the next page
Important Findings for reorder - 2
• See index 0: the user bought this item 14 days ago, and the max span is 30 days
• So I think this feature says whether or not the user has gotten bored with that item
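The index 0 example can be reproduced with a small sketch. It assumes that "max span" means the longest gap between consecutive purchases of the item and that the feature is days since last purchase minus that gap; both are readings of the slide, and the helper name and the day offsets are hypothetical.

```python
def days_last_order_minus_max(purchase_days, today):
    """Return (days_since_last, max_span, difference) for one user x item.

    purchase_days: day offsets on which the user bought the item.
    today: day offset of the order being predicted.
    """
    purchase_days = sorted(purchase_days)
    if len(purchase_days) < 2:
        raise ValueError("need at least two purchases to measure a span")
    days_since_last = today - purchase_days[-1]
    gaps = [later - earlier
            for earlier, later in zip(purchase_days, purchase_days[1:])]
    max_span = max(gaps)
    return days_since_last, max_span, days_since_last - max_span


# Made-up history: bought on days 0, 30 and 40; predicting for day 54.
result = days_last_order_minus_max([0, 30, 40], today=54)
print(result)
```

A negative difference means the user is still within their longest previous gap; a large positive one suggests they have waited longer than ever before.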
Important Findings for reorder - 3
• We already know fruits are reordered more frequently than vegetables (3 Million Instacart Orders, Open Sourced)
• I wanted to know how often
• So I made an item_10to1_ratio feature, defined as the reorder ratio after an item is ordered vs. not ordered.
• See the next page for more details
Important Findings for reorder - 3
• Let’s say userA bought itemA at order_number 1 and 4
• And userB bought itemA at order_number 1 and 3
• item_10to1_ratio is 0.5
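One reading that reproduces the 0.5 above: encode each user's history for the item as a 1/0 sequence per order (1 = bought), find every place where a purchase is followed by a skip ("1, 0"), and take the share of those where the item is bought again in the very next order. This is an interpretation of the slide, not the author's confirmed code; the function name and the assumption that userA has 4 orders and userB has 3 are hypothetical.

```python
def item_10to1_ratio(histories):
    """Share of "bought, skipped" patterns that are followed by a purchase.

    histories: one 0/1 list per user, ordered by order_number (1 = bought).
    """
    hits = 0
    total = 0
    for seq in histories:
        for i in range(len(seq) - 2):
            if seq[i] == 1 and seq[i + 1] == 0:
                total += 1
                hits += seq[i + 2]  # 1 if the item came back right away
    return hits / total if total else float("nan")


user_a = [1, 0, 0, 1]  # bought itemA at order_number 1 and 4
user_b = [1, 0, 1]     # bought itemA at order_number 1 and 3
ratio = item_10to1_ratio([user_a, user_b])
print(ratio)
```

userA's "1, 0" is followed by another 0 and userB's by a 1, so one of the two patterns ends in a reorder.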
Important Findings for None - 1
• Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart
that Item B falls into
• Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all
items
• So this feature essentially captures
the average position of an item in a user’s
cart, and we can see that users who
don’t buy many items all at once are
more likely to be None
Important Findings for None - 2
• total_buy is the total number of times a user has ordered an item
• If userA bought itemA 3 times in the past, this would be 3
• So total_buy-max is the max of the above feature per user
• We can see that it predicts whether or not a user will make a reorder
Important Findings for None - 3
• t-1_is_None(User A) is a binary feature that says whether or not the
user’s previous order was None.
• If the previous order is None,
then the next order will also be
None with 30% probability.
F1 maximization
• In this competition, the evaluation metric was an F1 score, which is a way of
capturing both precision and recall in a single metric.
• Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No)
numbers.
• However, in order to perform this conversion, we need to know a threshold. At
first, I used grid search to find a universal threshold of 0.2. But I saw
comments on the Kaggle discussion boards that said different orders should
have different thresholds.
• To understand why, let’s look at an example.
F1 maximization
• In the first example, the threshold is between 0.9 and 0.3
• In the second example, the threshold is lower than 0.2
• As shown, each order should have its own threshold
• But with the above calculation, we would have to prepare every pattern of probabilities in advance
• Thus I needed to come up with another calculation
• See the next page
F1 maximization
• Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then
simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities.
• For example, the simulated labels might look like this.
• I then calculate the expected F1 score for each set of labels,
starting from the highest probability items, and then adding items
(e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score
peaks and then decreases.
• We don’t need to evaluate every pattern (A, B, AB, …),
because if we select itemB, we should select itemA as well
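The procedure on this slide can be sketched with the standard library alone. This is an illustration under assumptions rather than the author's code: items are simulated as independent Bernoulli draws, F1 is taken as 0 when the prediction and the simulated basket share no item, and the helper names and the seed are hypothetical.

```python
import random


def f1(true_items, pred_items):
    """F1 between two sets; 0.0 when they share no item."""
    tp = len(true_items & pred_items)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_items)
    recall = tp / len(true_items)
    return 2 * precision * recall / (precision + recall)


def simulate_labels(probs, n_sim, rng):
    """Draw n_sim baskets, each item included independently with its prob."""
    return [{item for item, p in probs.items() if rng.random() < p}
            for _ in range(n_sim)]


def best_prefix(probs, n_sim=9999, seed=0):
    """Grow the prediction from the most probable item until mean F1 drops."""
    rng = random.Random(seed)
    sims = simulate_labels(probs, n_sim, rng)
    ranked = sorted(probs, key=probs.get, reverse=True)

    best_items, best_score, scores = [], -1.0, {}
    for k in range(1, len(ranked) + 1):
        pred = set(ranked[:k])
        score = sum(f1(truth, pred) for truth in sims) / n_sim
        scores[tuple(ranked[:k])] = score
        if score < best_score:
            break  # past the peak, stop adding items
        best_items, best_score = ranked[:k], score
    return best_items, best_score, scores


probs = {"A": 0.9, "B": 0.3}
best_items, best_score, scores = best_prefix(probs)
print(best_items, scores)
```

Because only the prefixes of the probability-sorted list are scored, the number of candidates grows linearly with the number of items instead of exponentially.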
F1 maximization
• F1score_mean(simulated labels, [A]) -> 0.809747641431
• F1score_mean(simulated labels, [A,B]) -> 0.709004233757
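These two numbers can be checked by hand. Assuming A and B are ordered independently with probabilities 0.9 and 0.3, and that F1 is 0 whenever the prediction and the true basket share no item (both are assumptions, not stated on the slide), the expected F1 of each candidate is:

```latex
\begin{align*}
P(\{A\}) &= 0.9 \times 0.7 = 0.63, & P(\{A,B\}) &= 0.9 \times 0.3 = 0.27,\\
P(\{B\}) &= 0.1 \times 0.3 = 0.03, & P(\varnothing) &= 0.1 \times 0.7 = 0.07,\\[4pt]
\mathbb{E}\,[F_1([A])] &= 0.63 \cdot 1 + 0.27 \cdot \tfrac{2}{3} + 0.03 \cdot 0 + 0.07 \cdot 0 = 0.81,\\
\mathbb{E}\,[F_1([A,B])] &= 0.63 \cdot \tfrac{2}{3} + 0.27 \cdot 1 + 0.03 \cdot \tfrac{2}{3} + 0.07 \cdot 0 = 0.71.
\end{align*}
```

Both agree with the simulated values to within sampling noise, so predicting [A] alone is the better choice for this order.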
F1 maximization - Predicting None
• One way to think about None is as the probability (1 - Item A)
* (1 - Item B) * …
• But another method is to try to predict None as a special
case.
• By using our None model and treating None as just another
item, we can boost the F1 score from 0.400 to 0.407.
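The first way of thinking about None can be written as a one-line product. This sketch assumes the item probabilities are independent, which is what the formula implies; the function name is hypothetical and the input reuses the 0.9 and 0.3 from the earlier example.

```python
import math


def p_none_from_items(item_probs):
    """Probability that no item is reordered, assuming independent items."""
    return math.prod(1.0 - p for p in item_probs)


p_none = p_none_from_items([0.9, 0.3])
print(p_none)
```

A dedicated None model replaces this product with a learned probability, which can then be ranked alongside the item probabilities.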
Appendix
1 month to go…
7 days to go…
2 days to go…
(´-`).。oO(
1 hour to go…
30 minutes to go…
Did I make it?!
Did I make it?!
(I didn’t)
20 minutes to go…
EOP

Kaggle meetup #3 instacart 2nd place solution
