My Journey To GrandMaster: Success and Failure
詹金 センキン jinZhan
Agenda
Part 1: Introduction Of My Kaggle Journey
● Before kaggle
● Kaggle Preference
● Competition history
Part 2: Some Success and Failure In Competitions
● Validation
● Pre-Processing
● Feature Engineering
● Feature Selection
● Modeling
● Stacking
● Post-Processing
Before Kaggle
Kaggle Preference
Competition Type: Business Tabular Data, Science Tabular Data, Text Data
Language: Python
Library: Pandas/Numpy/Sklearn/Matplotlib/Keras/Pytorch
Model: Lightgbm/NeuralNetwork/Catboost/Xgboost/Ridge Regression/KNN…
Favorite Part: Finding Killer Feature
2nd Favorite Part: Stacking
Hardware: 32 GB memory and GTX 1080 Ti desktop, Google Cloud
First Stage: From Beginner To Expert
Competition | Public | Private | Shake | Medal
Zillow’s Home Value Prediction (2018-01-11 ended) | 185/3775 | 203/3775 | ⬇️28 | Bronze
Corporación Favorita Grocery Sales Forecasting (2018-01-15 ended) | 42/1674 | 85/1674 | ⬇️43 | Bronze
Expert (tier reached)
Recruit Restaurant Visitor Forecasting (2018-02-06 ended) | 10/2157 | 760/2157 | ⬇️750 | -
Mercari Price Suggestion Challenge (2018-02-21 ended) | 32/2382 | 2318/2382 | ⬇️2286 | -
Toxic Comment Classification Challenge (2018-03-20 ended) | 78/4550 | 82/4550 | ⬇️4 | Silver
TalkingData AdTracking Fraud Detection Challenge (2018-05-07 ended) | 7/3946 | 19/3946 | ⬇️12 | Silver
Second Stage: From Master To Solo Gold
Competition | Public | Private | Shake | Medal
Avito Demand Prediction Challenge (2018-06-27 ended) | 8/1871 | 9/1871 | ⬇️1 | Gold
Master (tier reached)
Home Credit Default Risk (2018-08-29 ended) | 6/7190 | 8/7190 | ⬇️2 | Gold
Google Analytics Customer Revenue Prediction (2019-02-15 ended) | Leak | 85/3611 | - | Silver
Elo Merchant Category Recommendation (2019-02-26 ended) | 3/4127 | 7/4127 | ⬇️4 | Solo Gold
Third Stage: Keep Going To GrandMaster
Competition | Public | Private | Shake | Medal
Santander Customer Transaction Prediction (2019-04-10 ended) | 31/8802 | 24/8802 | ⬆︎7 | Gold
Jigsaw Unintended Bias in Toxicity Classification (2019-06-27 ended) | 30+/3165 | Kernel Failed | - | -
Predicting Molecular Properties (2019-08-28 ended) | 15/2749 | 15/2749 | - | Gold
GM (tier reached)
Validation
Train and test are split by timestamp; public test and private test are split by timestamp too.
Failure Case
Success Case
Predicting the past with future data is a form of data leakage.
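A minimal sketch of a time-ordered holdout that avoids this leak. The `date` column name, the toy data, and the cutoff are assumptions for illustration, not taken from any competition.

```python
import pandas as pd

# Hypothetical toy data: one row per day, with an assumed `date` column.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "y": range(10),
})

def time_split(frame, cutoff, date_col="date"):
    """Train on rows strictly before `cutoff`, validate on rows at or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    valid = frame[frame[date_col] >= cutoff]
    return train, valid

train_part, valid_part = time_split(df, "2018-01-08")
```

Every validation row is later than every training row, which mirrors how train, public test, and private test are ordered.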
Validation
Elo
train['outliers'] = 0
train.loc[train['target'] < -30, 'outliers'] = 1
StratifiedKFold().split(train, train['outliers'])
KFold().split(train['target'])
Outliers make up only 1% of the target
Failure Case
Success Case
Make sure every validation fold has a similar distribution, and that it is similar to the test set
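A small runnable sketch of stratifying folds on the outlier flag. The toy target is synthetic; only the `< -30` threshold and the roughly 1% outlier rate follow the slide.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
# Synthetic target: 10 of 1000 values (1%) are extreme outliers below -30.
target = rng.normal(0.0, 1.0, size=1000)
target[:10] = -33.2192
outliers = (target < -30).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# split() needs X and the labels to stratify on; X is only used for its length.
fold_outlier_counts = [
    int(outliers[valid_idx].sum())
    for _, valid_idx in skf.split(np.zeros(len(target)), outliers)
]
```

Each fold ends up with the same number of outliers, so no fold is scored on a very different target distribution.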
Pre-Processing
Elo
Anonymized Purchase Amount
df_new['purchase_amount_new'] = np.round(df_new['purchase_amount'] / 0.00150265118 + 497.06,2)
De-Anonymized Purchase Amount
Feature engineering makes more sense, and worked better, after de-anonymization
Feature Engineering
Elo
train.csv
Card_id          Feature_1  Feature_2  Feature_3  Target(loyalty)
C_ID_92a2005557  5          2          1          0.392890
transactions.csv
Card_id          Merchant_id      ……  Purchase_amount  Purchase_date
C_ID_92a2005557  M_ID_b0c793002c  ……  5.263790         2018-04-26 14:08:44
C_ID_92a2005557  M_ID_d15eae0468  ……  -2.782712        2018-05-01 13:01:24
merchants.csv
Merchant_id      merchant_group  …  city_id  state_id
M_ID_b0c793002c  8179            …  16       242
Start by understanding the problem and the data
Feature Engineering
Elo
Some strong features I made:
- last_day_purchased (Recency)
- unique_month_purchased (Frequency)
- max_purchase_amount (Monetary)
Get domain knowledge from Kaggle discussions (kernels) and Google
RFM is a method used for analyzing customer value. It is
commonly used in database marketing and direct marketing
and has received particular attention in retail and professional
services industries.
RFM stands for the three dimensions:
• Recency – How recently did the customer purchase?
• Frequency – How often do they purchase?
• Monetary Value – How much do they spend?
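A rough sketch of the three RFM-style features named on this slide, computed with pandas. The toy transactions and the `reference_date` are made up; only the feature names come from the slide.

```python
import pandas as pd

# Toy transactions; column names follow the slides, values are made up.
tx = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2"],
    "purchase_date": pd.to_datetime(
        ["2018-01-05", "2018-02-10", "2018-02-20", "2018-01-15"]),
    "purchase_amount": [10.0, 25.0, 5.0, 40.0],
})
reference_date = pd.Timestamp("2018-03-01")  # assumed "today" for recency

# Integer year-month key, e.g. 201802, used to count distinct active months.
tx["year_month"] = tx["purchase_date"].dt.year * 100 + tx["purchase_date"].dt.month

rfm = tx.groupby("card_id").agg(
    last_purchase=("purchase_date", "max"),
    unique_month_purchased=("year_month", "nunique"),   # Frequency
    max_purchase_amount=("purchase_amount", "max"),     # Monetary
)
# Recency: days between the reference date and the most recent purchase.
rfm["last_day_purchased"] = (reference_date - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")
```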
Feature Engineering
Elo
Raw Data
Card_id  Merchant_id
C1       M1
C1       M2
…        …
C1       M99
C1       M100
Coarse-grained: unique count and total count of one card’s purchased merchants
Card_id  Merchant_Unique  Merchant_count
C1       100              200
Fine-grained: purchase count for each of one card’s purchased merchants
Card_id  M1_Count  M2_Count  …  M99_Count  M100_Count
C1       1         2         …  5          7
Use not only coarse-grained aggregation but also more fine-grained information
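One possible way to build both granularities with pandas, on made-up data. The output column names are illustrative.

```python
import pandas as pd

tx = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2"],
    "merchant_id": ["M1", "M2", "M2", "M1"],
})

# Coarse-grained: one row per card, summary statistics over all merchants.
coarse = tx.groupby("card_id")["merchant_id"].agg(
    merchant_unique="nunique", merchant_count="count")

# Fine-grained: one column per merchant holding that card's purchase count.
fine = pd.crosstab(tx["card_id"], tx["merchant_id"]).add_suffix("_count")
```

With many merchants the fine-grained table becomes very wide and sparse, so a sparse matrix would be a reasonable substitute at competition scale.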
Feature Engineering
Elo
Text-Like Data
Card_id  Purchase Merchant Sequence
C1       M1,M2,M3,M1,M3,……M100
C2       M2,M3,……M100
…        …
C999     M45….M100
C1000    M99
TF-IDF (ngram=1, max_features=None)
Card_id  M1    M2    …  M100
C1       0.67  0.34  …  0.12
C2       0.23  0.45  …  0.66
…        …     …     …  …
C999     0.01  0.43  …  0.72
C1000    0.99  0.89  …  0.35
Singular Value Decomposition (SVD)
Card_id  SVD1  …  SVD5
C1       0.34  …  0.78
C2       0.33  …  0.56
…        …     …  …
C999     0.31  …  0.70
C1000    0.95  …  0.25
Beyond tabular feature engineering, transforming the data into text-like data can yield more features
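A sketch of the TF-IDF then SVD pipeline from this slide using scikit-learn. The sequences are toy data, and the tokenizer settings are my assumption for keeping merchant IDs intact.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Each card's merchant sequence written as a space-separated "sentence".
card_ids = ["C1", "C2", "C3", "C4"]
sequences = ["M1 M2 M3 M1 M3", "M2 M3 M4", "M4 M5", "M5"]

# Unigrams only and no vocabulary cap, as on the slide.
# lowercase=False and a whitespace token pattern keep merchant IDs unchanged.
tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=None,
                        lowercase=False, token_pattern=r"\S+")
tfidf_matrix = tfidf.fit_transform(sequences)   # shape: (cards, merchants)

# The slide shows five SVD components; 2 is used here because the toy
# vocabulary is tiny (n_components cannot exceed the number of columns).
svd = TruncatedSVD(n_components=2, random_state=0)
svd_features = svd.fit_transform(tfidf_matrix)  # shape: (cards, 2)
```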
Feature Engineering
Elo
Word2Vec Of Merchant
[Figure: merchant embedding space with points M1, M2, M50, M51, M99, M100]
Sequence Data
Card_id  Purchase Merchant Sequence
C1       M1,M2,M3,M1,M3,……M100
C2       M2,M3,……M100
…        …
C999     M45….M100
C1000    M99
Aggregation of all the merchants’ embeddings for each card
Card_id  W2V_1_Mean  …  W2V_5_Max
C1       0.34        …  0.78
…        …           …  …
C1000    0.95        …  0.25
A Word2vec model can generate more sequence-related information
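A sketch of the aggregation step only: pooling merchant embeddings into per-card features. The embeddings here are random stand-ins; in practice they would come from a Word2Vec model trained on the merchant sequences, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings (assumption): random 5-dimensional vectors, used only
# to show how per-merchant vectors are pooled into per-card features.
merchants = ["M1", "M2", "M3"]
embedding = {m: rng.normal(size=5) for m in merchants}

sequences = {"C1": ["M1", "M2", "M3", "M1"], "C2": ["M2", "M3"]}

def aggregate(seq):
    """Stack the embeddings of a card's merchants and pool them per dimension."""
    vectors = np.stack([embedding[m] for m in seq])
    # First 5 values: mean per dimension; last 5 values: max per dimension.
    return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])

card_features = {card: aggregate(seq) for card, seq in sequences.items()}
```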
Feature Engineering
Elo
DeepWalk
Graph Data
Node: card_id, merchant_id
Edge: purchased count
[Figure: graph linking cards C1, C2, C3 and merchants M1, M2, M3]
Step 1: Perform random walks on nodes in a graph to generate node sequences
Step 2: Run skip-gram to learn the embedding of each node based on the node sequences generated in step 1
Card_id  DW_Card_1  …  DW_Merchant_1_Max
C1       0.34       …  0.78
…        …          …  …
C1000    0.95       …  0.25
A DeepWalk model can generate more graph-related information
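A standard-library sketch of Step 1, random walks over a card and merchant graph. The edges are made up, and weighting each step by purchase count is my assumption based on the edge definition; plain DeepWalk picks neighbors uniformly. Step 2 (skip-gram) is not shown.

```python
import random

# Toy bipartite purchase graph: edge weight = purchase count (made-up values).
edges = {("C1", "M1"): 3, ("C1", "M2"): 1, ("C2", "M2"): 2,
         ("C2", "M3"): 1, ("C3", "M3"): 1}

# Undirected weighted adjacency list.
graph = {}
for (card, merchant), weight in edges.items():
    graph.setdefault(card, []).append((merchant, weight))
    graph.setdefault(merchant, []).append((card, weight))

def random_walk(start, length, rng):
    """Return a walk of `length` nodes, choosing each next neighbor
    with probability proportional to the edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors, weights = zip(*graph[walk[-1]])
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

rng = random.Random(0)
# Two walks of six nodes starting from every node in the graph.
walks = [random_walk(node, 6, rng) for node in graph for _ in range(2)]
```

The resulting `walks` play the role of sentences for the skip-gram model in Step 2.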
Feature Engineering
Elo
train.csv
Card_id  …  Target
C1       …  0.392890
C2       …  0.589014
transactions.csv (each transaction takes its card’s target)
Card_id  …  Target
C1       …  0.392890
C1       …  0.392890
C2       …  0.589014
C2       …  0.589014
Transaction-level predictions
Card_id  Merchant_id  …  Prediction
C1       M1           …  0.389345
C1       M2           …  0.373495
C2       M99          …  0.689014
C2       M100         …  0.489014
Aggregated per card
Card_id  …  Mean Of Prediction  Max Of Prediction
C1       …  0.378924            0.380056
C2       …  0.509341            0.580085
Giving each card_id’s target to every one of its transactions, then building a transaction-based model to generate meta features, gave a large improvement
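A sketch of the transaction-level meta feature idea on synthetic data. The `amount` feature, the linear model, and the use of out-of-fold predictions grouped by card are my assumptions; the slide does not say how the predictions were generated.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_cards, tx_per_card = 20, 5
cards = pd.DataFrame({"card_id": [f"C{i}" for i in range(n_cards)],
                      "target": rng.normal(size=n_cards)})

# Every transaction inherits the target of its card.
tx = cards.loc[cards.index.repeat(tx_per_card)].reset_index(drop=True)
# Hypothetical transaction-level feature: a noisy view of the card's target.
tx["amount"] = tx["target"] + rng.normal(scale=0.5, size=len(tx))

# Out-of-fold predictions, grouped by card so that a card's own target
# is never seen by the model that predicts its transactions.
tx["prediction"] = 0.0
for fit_idx, pred_idx in GroupKFold(n_splits=5).split(tx, groups=tx["card_id"]):
    model = LinearRegression().fit(tx.loc[fit_idx, ["amount"]],
                                   tx.loc[fit_idx, "target"])
    tx.loc[pred_idx, "prediction"] = model.predict(tx.loc[pred_idx, ["amount"]])

# Aggregate transaction-level predictions back to one row per card.
meta = tx.groupby("card_id")["prediction"].agg(pred_mean="mean", pred_max="max")
```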
Feature Selection
Target Permutation (Null Importance)
Used in: HomeCredit, Elo, Santander
Feature1  Feature2  Feature3  Target
0.34      0.56      0.78      0.1
3.44      1.09      1.23      1.2
5.66      7.88      0.99      2.1
[Diagram labels: Null Importance (run 50~100 times), Actual Importance, Top N]
gain_score = np.log(1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75)))
Shuffle the target, then train many times to get the null gain importance
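A sketch of the null importance procedure on synthetic data. A small random forest's impurity importance stands in for LightGBM gain importance (my substitution), and 10 shuffles are used instead of 50 to 100 to keep it fast. The score formula is the one shown on the slide, applied per feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

def importances(features, target, seed):
    # Stand-in for LightGBM gain importance (assumption).
    model = RandomForestRegressor(n_estimators=30, random_state=seed)
    return model.fit(features, target).feature_importances_

# Actual importance: fit once on the real target.
act_imps_gain = importances(X, y, seed=0)

# Null importance: refit on a shuffled target several times.
null_imps_gain = np.array([
    importances(X, rng.permutation(y), seed=run) for run in range(10)
])

# Compare actual importance with the 75th percentile of the null runs.
gain_score = np.log(
    1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75, axis=0)))
```

Features would then be ranked by `gain_score` and the top N kept.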
Modeling
Competition | Best Single Model | Ensemble Models
Avito (tabular, text, image) | LGB > NN (top teams NN > LGB) | Stage1: 70+ nn lgb xgb catboost ridge rf rgf; Stage2: xgb for stacking; Stage3: quiz blending
Home Credit (financial tabular) | LGB >> NN | Stage1: 10+ lgb nn; Stage2: lgb(linear), random forest for stacking; Stage3: weighted average blending
Elo (financial tabular) | LGB >> NN | Stage1: 12 lgb and 40 dnn; Stage2: lgb, extratree, dnn, linear for stacking; Stage3: weighted average blending
Santander (anonymous tabular) | LGB > NN (top teams NN > LGB) | Stage1: blending of one lgb and one nn
Molecular (chemistry tabular) | GNN >> LGB, DNN | Stage1: 40+ gnn dnn lgb; Stage2: bayesian ridge for stacking
Stacking
Elo, HomeCredit
Single Model, Final 5th
Simple Stacking, Final 3rd
Single Model, Final 5th
Failure Case
Local CV and LB matched poorly, and the weights of the stacking model were unstable
There were many strong lgb models in the first stage, and the second-stage tree models (lgb, extra tree) overfitted heavily; using only nn and linear models in the second stage improves it
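A minimal two-stage stacking sketch with a linear second stage, on synthetic data. The base models and their settings are placeholders, not the models used in the competitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Stage 1: out-of-fold predictions from each base model, so the second
# stage never sees a prediction made on a row the base model trained on.
base_models = [Ridge(alpha=1.0),
               RandomForestRegressor(n_estimators=30, random_state=0)]
oof = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_models])

# Stage 2: a linear model learns how to weight the stage-1 predictions.
stacker = Ridge(alpha=1.0).fit(oof, y)
stack_weights = stacker.coef_
```

Checking how much `stack_weights` change across folds or seeds is one way to spot the unstable stacking weights mentioned in the failure case.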
Stacking
Molecular
Success Case
Feature-rich (tabular, text, image)
Train/Public/Private split well
Local CV and LB matched very well
Molecular: is the atom world clean?
Train/Public/Private split well
Local CV and LB matched very well
Post-Processing
TalkingData
Without postprocessing
Failure Case
Lost a solo gold due to post-processing (shared in the discussion forum): I did not check it against local CV and used it in both final submissions
Post-Processing
Elo
Failure Case
Prediction  Target
-28.4579    -33.2192
-27.1178    -33.2192
-26.6666    -33.2192
① Calibrating continuous predictions to discrete values improved both CV and public LB, but broke on the private LB
② Overriding the Top-N lowest predictions with the outlier value improved both CV and public LB, but broke on the private LB
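For reference, the Top-N override described on this slide can be sketched as below. The helper name and `top_n` are illustrative; the slide reports that this trick failed on the private leaderboard, so it is shown only to make the mechanics concrete.

```python
import numpy as np

OUTLIER_VALUE = -33.2192  # outlier target value shown on the slide

def override_lowest(predictions, top_n):
    """Replace the `top_n` lowest predictions with the outlier value."""
    adjusted = np.asarray(predictions, dtype=float).copy()
    lowest = np.argsort(adjusted)[:top_n]  # indices of the smallest values
    adjusted[lowest] = OUTLIER_VALUE
    return adjusted

preds = np.array([-28.4579, 0.5, -27.1178, -1.2, -26.6666])
adjusted = override_lowest(preds, top_n=3)
```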
Post-Processing
HomeCredit, IEEE-CIS Fraud Detection
Success Case
Train
user   target
user1  1
user2  0
Test
user   target
user1  0.75 -> 1
user2  0.12 -> 0
Identifying the same users in train and test, then overriding the test predictions with the train targets, can give a big improvement
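A sketch of the override with pandas. `user_key` is a hypothetical identifier; how the same user is recognized across train and test is competition-specific and not shown here.

```python
import pandas as pd

# Hypothetical `user_key`: whatever identifier links the same user across files.
train = pd.DataFrame({"user_key": ["u1", "u2"], "target": [1, 0]})
test = pd.DataFrame({"user_key": ["u1", "u2", "u3"],
                     "prediction": [0.75, 0.12, 0.40]})

# Look up the train target for each test user that also appears in train.
# drop_duplicates guards against a user with several train rows.
known = train.drop_duplicates("user_key").set_index("user_key")["target"]
matched = test["user_key"].map(known)

# Use the train target where there is a match, else keep the model prediction.
test["final"] = matched.fillna(test["prediction"])
```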
Summary
● Finding a more stable validation scheme guides you down the right path
● Trying different non-linear transformations in Pre-Processing always helps
● The more knowledge (domain, tech, tricks…) you learn, the better Feature Engineering you can do
● Feature Selection can improve accuracy and prevent overfitting
● Tree models always perform well, but don’t ignore neural networks, linear models, unsupervised methods… sometimes they can change the game
● Stacking is crucial when local CV matches the public leaderboard very well
● Be careful with Post-Processing: even if it improves local CV and the public leaderboard, use it in only one submission
Thank You !

Kaggle days tokyo jin zhan

  • 1. My Journey To GrandMaster: Success and Failure 詹金 センキン jinZhan
  • 2. Agenda Part 1: Introduction Of My Kaggle Journey ● Before kaggle ● Kaggle Preference ● Competition history Part 2: Some Success and Failure In Competitions ● Validation ● Pre-Processing ● Feature Engineering ● Feature Selection ● Modeling ● Stacking ● Post-Processing
  • 4. Kaggle Preference Competition Type: Buisness Tabular Data ,Science Tabular Data , Text Data Language: Python Library: Pandas/Numpy/Sklearn/Matplotlib/Keras/Pyt orch Model: Lightgbm/NeuralNetwork/ Catboost/Xgboost/Ridge Regression/KNN… Favorite Part: Finding Killer Feature 2nd Favorite Part: Stacking Hardware: 32GMem & GTX1080Ti Desktop ,GoogleCloud
  • 5. First Stage : From Beginer To Expert Competition Public Private Shake Medal Zillow’s Home Value Prediction (2018-01-11 ended) 185/3775 203/3775 ⬇️28 Bronze Corporación Favorita Grocery Sales Forecasting (2018-01-15 ended) 42/1674 85/1674 ⬇️43 Bronze Expert Recruit Restaurant Visitor Forecasting (2018-02-06 ended) 10/2157 760/2157 ⬇️750 Mercari Price Suggestion Challenge (2018-02-21 ended) 32/2382 2318/2382 ⬇️2286 Toxic Comment Classification Challenge (2018-03-20 ended) 78/4550 82/4550 ⬇️4 Silver TalkingData AdTracking Fraud Detection Challenge (2018-05-07 ended) 7/3946 19/3946 ⬇️12 Silver
  • 6. Second Stage: From Master To Solo Gold
      Competition (end date) | Public | Private | Shake | Medal | Tier
      Avito Demand Prediction Challenge (2018-06-27) | 8/1871 | 9/1871 | ⬇️1 | Gold | Master
      Home Credit Default Risk (2018-08-29) | 6/7190 | 8/7190 | ⬇️2 | Gold |
      Google Analytics Customer Revenue Prediction (2019-02-15) | Leak | 85/3611 | | Silver |
      Elo Merchant Category Recommendation (2019-02-26) | 3/4127 | 7/4127 | ⬇️4 | Solo Gold |
  • 7. Third Stage: Keep Going To GrandMaster
      Competition (end date) | Public | Private | Shake | Medal | Tier
      Santander Customer Transaction Prediction (2019-04-10) | 31/8802 | 24/8802 | ⬆︎7 | Gold |
      Jigsaw Unintended Bias in Toxicity Classification (2019-06-27) | 30+/3165 | Kernel Failed | | |
      Predicting Molecular Properties (2019-08-28) | 15/2749 | 15/2749 | - | Gold | GM
  • 8. Validation. Train and test are split by timestamp, and the public and private test sets are split by timestamp too. (Figure: Failure Case vs. Success Case.) Predicting the past with future data is a form of data leakage.
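A minimal sketch of a time-based validation split that matches the idea on slide 8: validate only on rows that come after everything in the training part. The data, the column names (`date`, `y`), and the helper `time_split` are hypothetical and only illustrate the principle.

```python
import pandas as pd

# Hypothetical daily data; column names are illustrative only.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "y": range(10),
})

def time_split(frame, date_col, cutoff):
    """Train on rows strictly before `cutoff`, validate on the rest."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    valid = frame[frame[date_col] >= cutoff]
    return train, valid

train_part, valid_part = time_split(df, "date", "2018-01-08")
```

Because the validation rows are all later than the training rows, the model is never asked to predict the past from future data.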
  • 9. Validation (Elo). Outliers make up only 1% of the target: train['outliers'] = 0; train.loc[train['target'] < -30, 'outliers'] = 1. Failure Case: KFold().split(train['target']). Success Case: StratifiedKFold().split(train, train['outliers']). Make sure every validation fold has a similar distribution, and that it is similar to the test set.
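The stratified split on slide 9 can be sketched as follows. The data here is synthetic (1,000 rows with 10 injected outliers, so 1%), and the fold count and seeds are assumptions, not values from the slides.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
train = pd.DataFrame({"target": rng.normal(0, 1, 1000)})
train.loc[:9, "target"] = -33.2192  # 10 synthetic outliers (1% of rows)

# Flag the outliers exactly as on the slide.
train["outliers"] = 0
train.loc[train["target"] < -30, "outliers"] = 1

# Stratify on the outlier flag so every fold gets its share of outliers.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outliers_per_fold = [
    int(train["outliers"].iloc[valid_idx].sum())
    for _, valid_idx in skf.split(train, train["outliers"])
]
```

With a plain `KFold`, the 10 outliers could land unevenly across folds, which makes fold scores noisy for a metric as outlier-sensitive as RMSE.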
  • 10. Pre-Processing (Elo). Anonymized Purchase Amount -> De-Anonymized Purchase Amount: df_new['purchase_amount_new'] = np.round(df_new['purchase_amount'] / 0.00150265118 + 497.06, 2). Feature engineering made more sense and improved after de-anonymization.
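A small runnable version of the de-anonymization formula from slide 10. The two constants come from the slide; the input values and the names `SCALE` and `OFFSET` are made up for illustration.

```python
import numpy as np
import pandas as pd

SCALE, OFFSET = 0.00150265118, 497.06  # constants from the slide

# Hypothetical anonymized amounts, for illustration only.
df_new = pd.DataFrame({"purchase_amount": [-0.5, 0.0, 1.0]})
df_new["purchase_amount_new"] = np.round(
    df_new["purchase_amount"] / SCALE + OFFSET, 2
)
```

The transform is linear and increasing, so it preserves ordering; what changes is that sums, ratios, and rounded amounts become interpretable as money.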
  • 11. Feature Engineering (Elo). Start from understanding the problem and the data.
      train.csv: Card_id | Feature_1 | Feature_2 | Feature_3 | Target (loyalty)
        C_ID_92a2005557 | 5 | 2 | 1 | 0.392890
      transactions.csv: Card_id | Merchant_id | …… | Purchase_amount | Purchase_date
        C_ID_92a2005557 | M_ID_b0c793002c | …… | 5.263790 | 2018-04-26 14:08:44
        C_ID_92a2005557 | M_ID_d15eae0468 | …… | -2.782712 | 2018-05-01 13:01:24
      merchants.csv: Merchant_id | merchant_group | … | city_id | state_id
        M_ID_b0c793002c | 8179 | … | 16 | 242
  • 12. Feature Engineering (Elo). Some strong features I made: last_day_purchased (Recency), unique_month_purchased (Frequency), max_purchase_amount (Monetary). Get domain knowledge from Kaggle discussions (kernels) and Google. RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries. RFM stands for the three dimensions: • Recency: How recently did the customer purchase? • Frequency: How often do they purchase? • Monetary Value: How much do they spend?
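One way the three RFM-style features named on slide 12 could be computed with pandas. The transactions, the lowercase column names, and `reference_date` are assumptions; the slides do not give the exact definitions the author used.

```python
import pandas as pd

# Hypothetical transactions; values are illustrative only.
tx = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2"],
    "purchase_date": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-03-02", "2018-02-10"]),
    "purchase_amount": [10.0, 25.0, 5.0, 40.0],
})
reference_date = pd.Timestamp("2018-04-01")  # assumed "today"

tx["month"] = tx["purchase_date"].dt.strftime("%Y-%m")
rfm = tx.groupby("card_id").agg(
    last_purchase=("purchase_date", "max"),
    unique_month_purchased=("month", "nunique"),      # Frequency
    max_purchase_amount=("purchase_amount", "max"),   # Monetary
)
# Recency: days between the last purchase and the reference date.
rfm["last_day_purchased"] = (reference_date - rfm["last_purchase"]).dt.days
```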
  • 13. Feature Engineering (Elo). Not only coarse-grained aggregation, but also more fine-grained information.
      Raw data: Card_id | Merchant_id
        C1 | M1
        C1 | M2
        … | …
        C1 | M99
        C1 | M100
      Coarse-grained (unique count and total count of one card's purchased merchants): Card_id | Merchant_Unique | Merchant_count
        C1 | 100 | 200
      Fine-grained (count for each of one card's purchased merchants): Card_id | M1_Count | M2_Count | … | M99_Count | M100_Count
        C1 | 1 | 2 | … | 5 | 7
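The coarse-grained and fine-grained aggregations on slide 13 can be sketched with a `groupby` and a `crosstab`. The toy transactions and lowercase column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical card/merchant purchase rows.
tx = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2", "C2"],
    "merchant_id": ["M1", "M2", "M2", "M1", "M3"],
})

# Coarse-grained: one unique count and one total count per card.
coarse = tx.groupby("card_id")["merchant_id"].agg(
    merchant_unique="nunique", merchant_count="count"
)

# Fine-grained: one count column per merchant for each card.
fine = pd.crosstab(tx["card_id"], tx["merchant_id"]).add_suffix("_count")
```

With many merchants the fine-grained table becomes very wide and sparse, which is one reason to follow it with a dimensionality reduction step.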
  • 14. Feature Engineering (Elo). Beyond tabular feature engineering, transforming the data into text-like data can build more features.
      Text-like data: Card_id | Purchase Merchant Sequence
        C1 | M1, M2, M3, M1, M3, ……, M100
        C2 | M2, M3, ……, M100
        … | …
        C999 | M45, …, M100
        C1000 | M99
      TF-IDF (ngram=1, max_features=None): Card_id | M1 | M2 | … | M100
        C1 | 0.67 | 0.34 | … | 0.12
        C2 | 0.23 | 0.45 | … | 0.66
        … | … | … | … | …
        C999 | 0.01 | 0.43 | … | 0.72
        C1000 | 0.99 | 0.89 | … | 0.35
      Singular Value Decomposition (SVD): Card_id | SVD1 | … | SVD5
        C1 | 0.34 | … | 0.78
        C2 | 0.33 | … | 0.56
        … | … | … | …
        C999 | 0.31 | … | 0.70
        C1000 | 0.95 | … | 0.25
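A hedged sketch of the pipeline on slide 14 using scikit-learn: merchant sequences are treated as documents, vectorized with TF-IDF (unigrams, no feature cap, as on the slide), then compressed with truncated SVD. The toy sequences, the whitespace tokenizer pattern, and the choice of 2 components (the slide shows 5) are assumptions made to keep the example tiny.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per card: its merchants in purchase order (toy data).
sequences = pd.Series({
    "C1": "M1 M2 M3 M1 M3",
    "C2": "M2 M3 M4",
    "C3": "M4 M5",
    "C4": "M5",
})

tfidf = TfidfVectorizer(
    ngram_range=(1, 1), max_features=None,
    lowercase=False, token_pattern=r"\S+",  # keep merchant ids as-is
)
X = tfidf.fit_transform(sequences)  # sparse card-by-merchant matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd_feats = pd.DataFrame(
    svd.fit_transform(X), index=sequences.index, columns=["SVD1", "SVD2"]
)
```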
  • 15. Feature Engineering (Elo): Word2Vec of Merchant. A Word2vec model can generate more sequence-related information. (Figure: merchant embedding space with points M1, M2, M50, M51, M99, M100.)
      Sequence data: Card_id | Purchase Merchant Sequence
        C1 | M1, M2, M3, M1, M3, ……, M100
        C2 | M2, M3, ……, M100
        … | …
        C999 | M45, …, M100
        C1000 | M99
      Aggregation of all the merchant embeddings of each card: Card_id | W2V_1_Mean | … | W2V_5_Max
        C1 | 0.34 | … | 0.78
        … | … | … | …
        C1000 | 0.95 | … | 0.25
  • 16. Feature Engineering (Elo): DeepWalk. A DeepWalk model can generate more graph-related information. (Figure: graph of cards C1, C2, C3 and merchants M1, M2, M3.) Graph data: Node: card_id, merchant_id. Edge: purchased count. Step 1: Perform random walks on nodes in a graph to generate node sequences. Step 2: Run skip-gram to learn the embedding of each node based on the node sequences generated in step 1.
      Card_id | DW_Card_1 | … | DW_Merchant_1_Max
        C1 | 0.34 | … | 0.78
        … | … | … | …
        C1000 | 0.95 | … | 0.25
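Step 1 of DeepWalk on slide 16 can be sketched with the standard library alone. The purchase list, walk length, and number of walks per node are hypothetical, and weighting each step by purchase count is one possible reading of "Edge: purchased count". Step 2 (skip-gram over the walks) is not shown here; a Word2Vec implementation would typically be trained on `walks`.

```python
import random
from collections import defaultdict

# Hypothetical (card, merchant, purchase count) edges.
purchases = [("C1", "M1", 3), ("C1", "M2", 1), ("C2", "M2", 2),
             ("C2", "M3", 1), ("C3", "M3", 1)]

# Undirected bipartite graph: node -> list of (neighbour, weight).
graph = defaultdict(list)
for card, merchant, count in purchases:
    graph[card].append((merchant, count))
    graph[merchant].append((card, count))

def random_walk(start, length, rng):
    """Walk `length` nodes, choosing neighbours in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours, weights = zip(*graph[walk[-1]])
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

rng = random.Random(0)
walks = [random_walk(node, 6, rng) for node in sorted(graph) for _ in range(3)]
```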
  • 17. Feature Engineering (Elo). Giving each card_id's target to every one of its transactions and building a transaction-based model to generate meta features improved the score a lot.
      train.csv: Card_id | … | Target
        C1 | … | 0.392890
        C2 | … | 0.589014
      transactions.csv with the card's target attached: Card_id | … | Target
        C1 | … | 0.392890
        C1 | … | 0.392890
        C2 | … | 0.589014
        C2 | … | 0.589014
      Transaction-level predictions: Card_id | Merchant_id | … | Prediction
        C1 | M1 | … | 0.389345
        C1 | M2 | … | 0.373495
        C2 | M99 | … | 0.689014
        C2 | M100 | … | 0.489014
      Meta features per card: Card_id | … | Mean Of Prediction | Max Of Prediction
        C1 | … | 0.378924 | 0.380056
        C2 | … | 0.509341 | 0.580085
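A rough sketch of the transaction-level meta feature from slide 17. The slide does not say how the transaction model was validated; this sketch assumes out-of-fold predictions grouped by card (`GroupKFold`) so that a card's own target never trains the model that predicts its transactions. The data is synthetic and `Ridge` stands in for whatever model the author used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
card_ids = [f"C{i}" for i in range(20)]
train = pd.DataFrame({"card_id": card_ids, "target": rng.normal(size=20)})
tx = pd.DataFrame({
    "card_id": np.repeat(card_ids, 10),          # 10 transactions per card
    "purchase_amount": rng.normal(size=200),
})

# Give each card's target to every one of its transactions.
tx = tx.merge(train, on="card_id", how="left")

# Out-of-fold transaction-level predictions, folds grouped by card.
tx["prediction"] = np.nan
for fit_idx, pred_idx in GroupKFold(n_splits=5).split(tx, groups=tx["card_id"]):
    model = Ridge().fit(tx.loc[fit_idx, ["purchase_amount"]],
                        tx.loc[fit_idx, "target"])
    tx.loc[pred_idx, "prediction"] = model.predict(
        tx.loc[pred_idx, ["purchase_amount"]])

# Aggregate back to one row per card: the meta features.
meta = tx.groupby("card_id")["prediction"].agg(["mean", "max"]).add_prefix("pred_")
```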
  • 18. Feature Selection: Target Permutation (Null Importance), used in HomeCredit, Elo, and Santander. Shuffle the target, then train many times (run 50~100 times) to get the null gain importance, and compare it with the actual importance obtained with the unshuffled target: gain_score = np.log(1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75))). Select the top N features. Example table, shown once for Null Importance and once for Actual Importance: Feature1 | Feature2 | Feature3 | Target: 0.34 | 0.56 | 0.78 | 0.1; 3.44 | 1.09 | 1.23 | 1.2; 5.66 | 7.88 | 0.99 | 2.1.
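A compact sketch of the null-importance procedure from slide 18. Two substitutions are made to keep it light: scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_` stand in for LightGBM gain importance, and the target is shuffled 20 times rather than the 50~100 runs on the slide. The data and the helper `importances` are hypothetical; only the `gain_score` formula comes from the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

def importances(features, target, seed):
    """Fit a small forest and return its per-feature importances."""
    model = RandomForestRegressor(n_estimators=30, random_state=seed)
    return model.fit(features, target).feature_importances_

# Actual importance: trained on the real target.
act_imps_gain = importances(X, y, seed=0)

# Null importance: trained repeatedly on a shuffled target.
null_imps_gain = np.array(
    [importances(X, rng.permutation(y), seed=i) for i in range(20)]
)

# Score from the slide, computed per feature.
gain_score = np.log(
    1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75, axis=0))
)
```

Features whose actual importance is no better than what they get on a shuffled target receive low scores and are candidates for removal.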
  • 19. Modeling
      Competition | Best Single Model | Ensemble Models
      Avito (tabular, text, image) | LGB > NN (top teams: NN > LGB) | Stage 1: 70+ models (nn, lgb, xgb, catboost, ridge, rf, rgf); Stage 2: xgb for stacking; Stage 3: quiz blending
      Home Credit (financial tabular) | LGB >> NN | Stage 1: 10+ models (lgb, nn); Stage 2: lgb (linear), random forest for stacking; Stage 3: weighted average blending
      Elo (financial tabular) | LGB >> NN | Stage 1: 12 lgb and 40 dnn; Stage 2: lgb, extra trees, dnn, linear for stacking; Stage 3: weighted average blending
      Santander (anonymous tabular) | LGB > NN (top teams: NN > LGB) | Stage 1: blending of one lgb and one nn
      Molecular (chemistry tabular) | GNN >> LGB, DNN | Stage 1: 40+ models (gnn, dnn, lgb); Stage 2: Bayesian ridge for stacking
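The two-stage pattern in the table above (many stage-1 models, then a stage-2 model trained on their predictions) can be sketched as follows. This is a generic out-of-fold stacking example on synthetic data with two small scikit-learn models, not a reconstruction of any of the listed ensembles.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Stage 1: out-of-fold predictions from each base model.
stage1 = {
    "ridge": Ridge(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
}
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in stage1.values()]
)

# Stage 2: a simple linear model stacked on the stage-1 predictions.
stage2 = Ridge().fit(oof, y)
stacked = stage2.predict(oof)
```

Using out-of-fold predictions as stage-2 inputs keeps the stacker from seeing predictions that were made on rows the base model was trained on.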
  • 20. Stacking (Elo, HomeCredit). Failure Case. Results shown: single model, final 5th; simple stacking, final 3rd; single model, final 5th. Local CV and LB matched poorly, so the weights of the stacking model were unstable. There were many strong LGB models in the first stage, and the second-stage tree models (LGB, extra trees) overfitted heavily; using only NN and linear models in the second stage would have improved the result.
  • 21. Stacking (Molecular). Success Case. Feature-rich (tabular, text, image): train/public/private were split well; local CV and LB matched very well. Molecular (is the atom world clean?): train/public/private were split well; local CV and LB matched very well.
  • 22. Post-Processing (TalkingData). Failure Case. (Figure: Without post-processing.) Lost a solo gold due to a post-processing trick (shared in the discussion forum) that was not checked against local CV and was used in both of the 2 final submissions.
  • 23. Post-Processing (Elo). Failure Case. Example (Prediction -> Target): -28.4579 -> -33.2192; -27.1178 -> -33.2192; -26.6666 -> -33.2192. ① Calibrating continuous predictions to discrete values improved both CV and LB, but the private LB (PLB) broke. ② Overriding the top-N lowest predictions with the outlier value improved both CV and LB, but the PLB broke.
  • 24. Post-Processing (HomeCredit, IEEE-CIS Fraud Detection). Success Case. Train: user | target: user1 | 1; user2 | 0. Test: user | target: user1 | 0.75 -> 1; user2 | 0.12 -> 0. Identifying the same users in train and test, then overriding the test predictions with the train target, can give a big improvement.
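The override on slide 24 can be sketched with a pandas `map`. How the same user is recognised across train and test is not described on the slide; this sketch simply assumes a shared `user` key already exists. `user3` and its prediction are made-up values showing that unmatched rows keep the model output.

```python
import pandas as pd

train = pd.DataFrame({"user": ["user1", "user2"], "target": [1, 0]})
test = pd.DataFrame({
    "user": ["user1", "user2", "user3"],
    "prediction": [0.75, 0.12, 0.40],
})

# Look up each test user's known target from train (NaN if unseen).
known = test["user"].map(train.set_index("user")["target"])

# Override where a train target exists, otherwise keep the prediction.
test["final"] = known.fillna(test["prediction"])
```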
  • 25. Summary ● Finding a more stable validation scheme guides you onto the right path ● Trying different non-linear transformations in Pre-Processing always helps ● The more knowledge (domain, tech, tricks…) you have learned, the better Feature Engineering you can do ● Feature Selection can improve accuracy and prevent overfitting ● Tree models always perform well, but don’t ignore neural networks, linear models, unsupervised methods…sometimes they can change the game ● Stacking is crucial when local CV matches the public leaderboard very well ● Be careful with Post-Processing: even if it improves local CV and the public leaderboard, use it in only one submission