SlideShare a Scribd company logo
1 of 47
2nd Place Solution
Instacart Market Basket Analysis
Agenda
• My Background
• Problem Overview
• Main Approach
• Feature Engineering
• Feature Importance
• Important Findings
• F1 maximization
My Background
• Bachelor of Economics
• Programmer of Financial Industry
• Consultant of Financial Industry
• 2nd Place at KDDCUP2015
• Data Scientist at Yahoo! JAPAN
Problem Overview
• In this competition, we have to predict reorder.
• So, it is little different from general recommendation.
• I mean,
Problem Overview
• How hot(user)?
*prior is regarded as train
Problem Overview
• How hot(item)?
*Clipped by 500
Problem Overview
• Evaluation metric is mean F1 score
• Precision and Recall
Problem Overview
• Links between the files
Main Approach
• We are given orders.csv
Main Approach
• We are given orders.csv
Main Approach
• We are given order_products.csv
Main Approach
• Reorder Prediction
user_id product_id label
Main Approach
• None Prediction
user_id label
Main Approach
Main Approach
Feature Engineering
• I made 4 types of features
1. User
• What this user like
2. Item
• What this item like
3. User x Item
• How do the user feel about the item
4. Datetime
• What this day and hour like
*For None model, I can’t use above features except user and datetime. So I convert those to
stats(min, mean, max, sum, std…).
Feature Importance for reorder
Feature Importance for None
Important Findings for reorder - 1
• user_id: 54035
Important Findings for reorder - 2
• days_last_order-max is difference between days_since_last_order_this_item and
useritem_order_days_max
• days_since_last_order_this_item is a feature belong to user and item. This means how
many days passed since last order
• Also, useritem_order_days_max is a feature belong to user and item. This means max
span(day) of order
• For more detail, see the next page
Important Findings for reorder - 2
• See the index 0, this means
the user bought this item 14 days
ago, and max span is 30 days
• So I think this feature says if the user
is bored or not by that item
Important Findings for reorder - 3
• We already know fruits are reordered more frequently than vegetables(3
Million Instacart Orders, Open Sourced)
• I wanted to know how often
• So I made a item_10to1_ratio feature
that’s defined as the reorder ratio after
an item is ordered vs. not ordered.
• Next page, for more details
Important Findings for reorder - 3
• Let’s say userA bought itemA at order_number 1 and 4
• And userB bought itemA at order_number 1 and 3
• item_10to1_ratio is 0.5
Important Findings for None - 1
• Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart
that Item B falls into
• Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all
items
• So this feature essentially captures
the average position of an item in a user’s
cart, and we can see that users who
don’t buy many items all at once are
more likely to be None
Important Findings for None - 2
• total_buy is number of total order
• If userA bought itemA 3 times
in the past, this would be 3
• So total_buy-max is max of above
feature by user
• We can see that it predicts
whether or not a user will make a reorder
Important Findings for None - 3
• t-1_is_None(User A) is a binary feature that says whether or not the
user’s previous order was None.
• If the previous order is None,
then the next order will also be
None with 30% probability.
F1 maximization
• In this competition, the evaluation metric was an F1 score, which is a way of
capturing both precision and recall in a single metric.
• Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No)
numbers.
• However, in order to perform this conversion, we need to know a threshold. At
first, I used grid search to find a universal threshold of 0.2. But I saw
comments on the Kaggle discussion boards that said different orders should
have different thresholds.
• To understand why, let’s look at an example.
F1 maximization
F1 maximization
• In the first example, threshold is between 0.9 and 0.3
• In the second example, threshold is lower than 0.2
• As I showed, each order should have each threshold
• But using above calculation, we have to prepare all patterns of
probability at first
• Thus I needed to come up with another calculation
• See the next page
F1 maximization
• Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then
simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities.
• For example, the simulated labels might look like this.
• I then calculate the expected F1 score for each set of labels,
starting from the highest probability items, and then adding items
(e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score
peaks and then decreases.
• We don’t need to calculate all of patterns
like A, B, AB…
• Because if we should select itemB, we should
select itemA as well
F1 maximization
• F1score_mean( , [A]) -> 0.809747641431
• F1score_mean( , [A,B]) -> 0.709004233757
F1 maximization - Predicting None
• One way to think about None is as the probability (1 - Item A)
* (1 - Item B) * …
• But another method is to try to predict None as a special
case.
• By using our None model and treating None as just another
item, we can boost the F1 score from 0.400 to 0.407.
Appendix
Appendix
Appendix
1 month to go…
7 days to go…
2 days to go…
(´-`).。oO(
1 hours to go…
30 minutes to go…
やったか?!
やったか?!
(やってない)
20 minutes to go…
EOP

More Related Content

What's hot

LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
RyuichiKanoh
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類
Shintaro Fukushima
 

What's hot (20)

Kaggleのテクニック
KaggleのテクニックKaggleのテクニック
Kaggleのテクニック
 
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
 
CatBoost on GPU のひみつ
CatBoost on GPU のひみつCatBoost on GPU のひみつ
CatBoost on GPU のひみつ
 
深層強化学習と実装例
深層強化学習と実装例深層強化学習と実装例
深層強化学習と実装例
 
テーブル・テキスト・画像の反実仮想説明
テーブル・テキスト・画像の反実仮想説明テーブル・テキスト・画像の反実仮想説明
テーブル・テキスト・画像の反実仮想説明
 
LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
LightGBMを少し改造してみた ~カテゴリ変数の動的エンコード~
 
ChatGPT 人間のフィードバックから強化学習した対話AI
ChatGPT 人間のフィードバックから強化学習した対話AIChatGPT 人間のフィードバックから強化学習した対話AI
ChatGPT 人間のフィードバックから強化学習した対話AI
 
時系列予測にTransformerを使うのは有効か?
時系列予測にTransformerを使うのは有効か?時系列予測にTransformerを使うのは有効か?
時系列予測にTransformerを使うのは有効か?
 
CV分野におけるサーベイ方法
CV分野におけるサーベイ方法CV分野におけるサーベイ方法
CV分野におけるサーベイ方法
 
SSII2020SS: グラフデータでも深層学習 〜 Graph Neural Networks 入門 〜
SSII2020SS: グラフデータでも深層学習 〜 Graph Neural Networks 入門 〜SSII2020SS: グラフデータでも深層学習 〜 Graph Neural Networks 入門 〜
SSII2020SS: グラフデータでも深層学習 〜 Graph Neural Networks 入門 〜
 
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages. Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
 
画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイ画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイ
 
感覚運動随伴性、予測符号化、そして自由エネルギー原理 (Sensory-Motor Contingency, Predictive Coding and ...
感覚運動随伴性、予測符号化、そして自由エネルギー原理 (Sensory-Motor Contingency, Predictive Coding and ...感覚運動随伴性、予測符号化、そして自由エネルギー原理 (Sensory-Motor Contingency, Predictive Coding and ...
感覚運動随伴性、予測符号化、そして自由エネルギー原理 (Sensory-Motor Contingency, Predictive Coding and ...
 
Optimizer入門&最新動向
Optimizer入門&最新動向Optimizer入門&最新動向
Optimizer入門&最新動向
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習
 
【DL輪読会】Toolformer: Language Models Can Teach Themselves to Use Tools
【DL輪読会】Toolformer: Language Models Can Teach Themselves to Use Tools【DL輪読会】Toolformer: Language Models Can Teach Themselves to Use Tools
【DL輪読会】Toolformer: Language Models Can Teach Themselves to Use Tools
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用
 
学習時に使ってはいないデータの混入「リーケージを避ける」
学習時に使ってはいないデータの混入「リーケージを避ける」学習時に使ってはいないデータの混入「リーケージを避ける」
学習時に使ってはいないデータの混入「リーケージを避ける」
 
効果のあるクリエイティブ広告の見つけ方(Contextual Bandit + TS or UCB)
効果のあるクリエイティブ広告の見つけ方(Contextual Bandit + TS or UCB)効果のあるクリエイティブ広告の見つけ方(Contextual Bandit + TS or UCB)
効果のあるクリエイティブ広告の見つけ方(Contextual Bandit + TS or UCB)
 

Viewers also liked

Viewers also liked (20)

Quoraコンペ参加記録
Quoraコンペ参加記録Quoraコンペ参加記録
Quoraコンペ参加記録
 
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
 
Kaggle boschコンペ振り返り
Kaggle boschコンペ振り返りKaggle boschコンペ振り返り
Kaggle boschコンペ振り返り
 
Hyperoptとその周辺について
Hyperoptとその周辺についてHyperoptとその周辺について
Hyperoptとその周辺について
 
機械学習のためのベイズ最適化入門
機械学習のためのベイズ最適化入門機械学習のためのベイズ最適化入門
機械学習のためのベイズ最適化入門
 
[Dl輪読会]AdaNet: Adaptive Structural Learning of Artificial Neural Networks
[Dl輪読会]AdaNet: Adaptive Structural Learning of Artificial Neural Networks[Dl輪読会]AdaNet: Adaptive Structural Learning of Artificial Neural Networks
[Dl輪読会]AdaNet: Adaptive Structural Learning of Artificial Neural Networks
 
Kaggle の Titanic チュートリアルに挑戦した話
Kaggle の Titanic チュートリアルに挑戦した話Kaggle の Titanic チュートリアルに挑戦した話
Kaggle の Titanic チュートリアルに挑戦した話
 
Kaggle – Airbnb New User Bookingsのアプローチについて(Kaggle Tokyo Meetup #1 20160305)
Kaggle – Airbnb New User Bookingsのアプローチについて(Kaggle Tokyo Meetup #1 20160305)Kaggle – Airbnb New User Bookingsのアプローチについて(Kaggle Tokyo Meetup #1 20160305)
Kaggle – Airbnb New User Bookingsのアプローチについて(Kaggle Tokyo Meetup #1 20160305)
 
Webスクレイピング用の言語っぽいものを作ったよ
Webスクレイピング用の言語っぽいものを作ったよWebスクレイピング用の言語っぽいものを作ったよ
Webスクレイピング用の言語っぽいものを作ったよ
 
サイト/ブログから本文抽出する方法
サイト/ブログから本文抽出する方法サイト/ブログから本文抽出する方法
サイト/ブログから本文抽出する方法
 
岩波データサイエンス_Vol.5_勉強会資料02
岩波データサイエンス_Vol.5_勉強会資料02岩波データサイエンス_Vol.5_勉強会資料02
岩波データサイエンス_Vol.5_勉強会資料02
 
岩波データサイエンス_Vol.5_勉強会資料01
岩波データサイエンス_Vol.5_勉強会資料01岩波データサイエンス_Vol.5_勉強会資料01
岩波データサイエンス_Vol.5_勉強会資料01
 
岩波データサイエンス_Vol.5_勉強会資料00
岩波データサイエンス_Vol.5_勉強会資料00岩波データサイエンス_Vol.5_勉強会資料00
岩波データサイエンス_Vol.5_勉強会資料00
 
Rパッケージ“KFAS”を使った時系列データの解析方法
Rパッケージ“KFAS”を使った時系列データの解析方法Rパッケージ“KFAS”を使った時系列データの解析方法
Rパッケージ“KFAS”を使った時系列データの解析方法
 
Python twitter data_150709
Python twitter data_150709Python twitter data_150709
Python twitter data_150709
 
データサイエンティスト協会スキル委員会4thシンポジウム講演資料
データサイエンティスト協会スキル委員会4thシンポジウム講演資料データサイエンティスト協会スキル委員会4thシンポジウム講演資料
データサイエンティスト協会スキル委員会4thシンポジウム講演資料
 
深層学習と確率プログラミングを融合したEdwardについて
深層学習と確率プログラミングを融合したEdwardについて深層学習と確率プログラミングを融合したEdwardについて
深層学習と確率プログラミングを融合したEdwardについて
 
Semantic segmentation2
Semantic segmentation2Semantic segmentation2
Semantic segmentation2
 
Pythonと機械学習によるWebセキュリティの自動化
Pythonと機械学習によるWebセキュリティの自動化Pythonと機械学習によるWebセキュリティの自動化
Pythonと機械学習によるWebセキュリティの自動化
 
Pythonistaデビュー #PyNyumon 2016/5/31
Pythonistaデビュー #PyNyumon 2016/5/31Pythonistaデビュー #PyNyumon 2016/5/31
Pythonistaデビュー #PyNyumon 2016/5/31
 

Similar to Kaggle meetup #3 instacart 2nd place solution

goalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
goalseekandsensitivityanalysis-221112123352-9fe0067e.pptxgoalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
goalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
IrfanRashid36
 
Lecture 08B - Logical-DWH-Model-Pending.pptx
Lecture 08B - Logical-DWH-Model-Pending.pptxLecture 08B - Logical-DWH-Model-Pending.pptx
Lecture 08B - Logical-DWH-Model-Pending.pptx
Asadkhan47384
 
ContentsPhase 1 Design Concepts2Project Description2Use.docx
ContentsPhase 1 Design Concepts2Project Description2Use.docxContentsPhase 1 Design Concepts2Project Description2Use.docx
ContentsPhase 1 Design Concepts2Project Description2Use.docx
maxinesmith73660
 
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmmChapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
talila4
 

Similar to Kaggle meetup #3 instacart 2nd place solution (20)

C++ super market
C++ super marketC++ super market
C++ super market
 
Conjoint Analysis - Part 1/3
Conjoint Analysis - Part 1/3Conjoint Analysis - Part 1/3
Conjoint Analysis - Part 1/3
 
Goal Seek And Sensitivity Analysis.pptx
Goal Seek And Sensitivity Analysis.pptxGoal Seek And Sensitivity Analysis.pptx
Goal Seek And Sensitivity Analysis.pptx
 
goalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
goalseekandsensitivityanalysis-221112123352-9fe0067e.pptxgoalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
goalseekandsensitivityanalysis-221112123352-9fe0067e.pptx
 
Lecture 08B - Logical-DWH-Model-Pending.pptx
Lecture 08B - Logical-DWH-Model-Pending.pptxLecture 08B - Logical-DWH-Model-Pending.pptx
Lecture 08B - Logical-DWH-Model-Pending.pptx
 
KL3083 Lecture Eng Design.ppt
KL3083 Lecture Eng Design.pptKL3083 Lecture Eng Design.ppt
KL3083 Lecture Eng Design.ppt
 
KL3083 Lecture Eng Design.ppt
KL3083 Lecture Eng Design.pptKL3083 Lecture Eng Design.ppt
KL3083 Lecture Eng Design.ppt
 
ContentsPhase 1 Design Concepts2Project Description2Use.docx
ContentsPhase 1 Design Concepts2Project Description2Use.docxContentsPhase 1 Design Concepts2Project Description2Use.docx
ContentsPhase 1 Design Concepts2Project Description2Use.docx
 
Intro to Data warehousing lecture 15
Intro to Data warehousing   lecture 15Intro to Data warehousing   lecture 15
Intro to Data warehousing lecture 15
 
Introduction to Management Science and Linear Programming
 Introduction to Management Science and Linear Programming  Introduction to Management Science and Linear Programming
Introduction to Management Science and Linear Programming
 
Chatter Actions - Short Version
Chatter Actions - Short VersionChatter Actions - Short Version
Chatter Actions - Short Version
 
Lecture 3F.ppt
Lecture 3F.pptLecture 3F.ppt
Lecture 3F.ppt
 
Dwh lecture 13-process dm
Dwh  lecture 13-process dmDwh  lecture 13-process dm
Dwh lecture 13-process dm
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
One day Course On Agile
One day Course On AgileOne day Course On Agile
One day Course On Agile
 
Value analysis and value engineering
Value  analysis and value engineeringValue  analysis and value engineering
Value analysis and value engineering
 
Production Planning and Process Planning
Production Planning and Process PlanningProduction Planning and Process Planning
Production Planning and Process Planning
 
GRO n GO
GRO n GO GRO n GO
GRO n GO
 
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmmChapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
Chapter Two Cost.pptxmmmmmmmmmmmmmmmmmmmmmm
 
DS M1 full - KQB KtuQbank.pdf
DS M1 full - KQB KtuQbank.pdfDS M1 full - KQB KtuQbank.pdf
DS M1 full - KQB KtuQbank.pdf
 

Recently uploaded

原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
pwgnohujw
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
siskavia95
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Stephen266013
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
wsppdmt
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
aqpto5bt
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 

Recently uploaded (20)

Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptxChapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 

Kaggle meetup #3 instacart 2nd place solution

  • 1. 2nd Place Solution Instacart Market Basket Analysis
  • 2. Agenda • My Background • Problem Overview • Main Approach • Feature Engineering • Feature Importance • Important Findings • F1 maximization
  • 3. My Background • Bachelor of Economics • Programmer of Financial Industry • Consultant of Financial Industry • 2nd Place at KDDCUP2015 • Data Scientist at Yahoo! JAPAN
  • 4. Problem Overview • In this competition, we have to predict reorder. • So, it is little different from general recommendation. • I mean,
  • 5. Problem Overview • How hot(user)? *prior is regarded as train
  • 6. Problem Overview • How hot(item)? *Clipped by 500
  • 7. Problem Overview • Evaluation metric is mean F1 score • Precision and Recall
  • 8. Problem Overview • Links between the files
  • 9. Main Approach • We are given orders.csv
  • 10. Main Approach • We are given orders.csv
  • 11. Main Approach • We are given order_products.csv
  • 12. Main Approach • Reorder Prediction user_id product_id label
  • 13. Main Approach • None Prediction user_id label
  • 16. Feature Engineering • I made 4 types of features 1. User • What this user like 2. Item • What this item like 3. User x Item • How do the user feel about the item 4. Datetime • What this day and hour like *For None model, I can’t use above features except user and datetime. So I convert those to stats(min, mean, max, sum, std…).
  • 19. Important Findings for reorder - 1 • user_id: 54035
  • 20. Important Findings for reorder - 2 • days_last_order-max is difference between days_since_last_order_this_item and useritem_order_days_max • days_since_last_order_this_item is a feature belong to user and item. This means how many days passed since last order • Also, useritem_order_days_max is a feature belong to user and item. This means max span(day) of order • For more detail, see the next page
  • 21. Important Findings for reorder - 2 • See the index 0, this means the user bought this item 14 days ago, and max span is 30 days • So I think this feature says if the user is bored or not by that item
  • 22. Important Findings for reorder - 3 • We already know fruits are reordered more frequently than vegetables(3 Million Instacart Orders, Open Sourced) • I wanted to know how often • So I made a item_10to1_ratio feature that’s defined as the reorder ratio after an item is ordered vs. not ordered. • Next page, for more details
  • 23. Important Findings for reorder - 3 • Let’s say userA bought itemA at order_number 1 and 4 • And userB bought itemA at order_number 1 and 3 • item_10to1_ratio is 0.5
  • 24. Important Findings for None - 1 • Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart that Item B falls into • Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all items • So this feature essentially captures the average position of an item in a user’s cart, and we can see that users who don’t buy many items all at once are more likely to be None
  • 25. Important Findings for None - 2 • total_buy is number of total order • If userA bought itemA 3 times in the past, this would be 3 • So total_buy-max is max of above feature by user • We can see that it predicts whether or not a user will make a reorder
  • 26. Important Findings for None - 3 • t-1_is_None(User A) is a binary feature that says whether or not the user’s previous order was None. • If the previous order is None, then the next order will also be None with 30% probability.
  • 27. F1 maximization • In this competition, the evaluation metric was an F1 score, which is a way of capturing both precision and recall in a single metric. • Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No) numbers. • However, in order to perform this conversion, we need to know a threshold. At first, I used grid search to find a universal threshold of 0.2. But I saw comments on the Kaggle discussion boards that said different orders should have different thresholds. • To understand why, let’s look at an example.
  • 29. F1 maximization • In the first example, threshold is between 0.9 and 0.3 • In the second example, threshold is lower than 0.2 • As I showed, each order should have each threshold • But using above calculation, we have to prepare all patterns of probability at first • Thus I needed to come up with another calculation • See the next page
  • 30. F1 maximization • Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities. • For example, the simulated labels might look like this. • I then calculate the expected F1 score for each set of labels, starting from the highest probability items, and then adding items (e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score peaks and then decreases. • We don’t need to calculate all of patterns like A, B, AB… • Because if we should select itemB, we should select itemA as well
  • 31. F1 maximization • F1score_mean( , [A]) -> 0.809747641431 • F1score_mean( , [A,B]) -> 0.709004233757
  • 32. F1 maximization - Predicting None • One way to think about None is as the probability (1 - Item A) * (1 - Item B) * … • But another method is to try to predict None as a special case. • By using our None model and treating None as just another item, we can boost the F1 score from 0.400 to 0.407.
  • 36. 1 month to go…
  • 37.
  • 38. 7 days to go…
  • 39. 2 days to go…
  • 41. 1 hours to go…
  • 42.
  • 43. 30 minutes to go…
  • 46. 20 minutes to go…
  • 47. EOP