2. Agenda
Part 1: Introduction of My Kaggle Journey
● Before Kaggle
● Kaggle Preference
● Competition History
Part 2: Some Successes and Failures in Competitions
● Validation
● Pre-Processing
● Feature Engineering
● Feature Selection
● Modeling
● Stacking
● Post-Processing
6. Second Stage: From Master To Solo Gold
Competition | Public | Private | Shake | Medal
Avito Demand Prediction Challenge (2018-06-27 ended) | 8/1871 | 9/1871 | ⬇️1 | Gold (→ Master)
Home Credit Default Risk (2018-08-29 ended) | 6/7190 | 8/7190 | ⬇️2 | Gold
Google Analytics Customer Revenue Prediction (2019-02-15 ended) | Leak | 85/3611 | – | Silver
Elo Merchant Category Recommendation (2019-02-26 ended) | 3/4127 | 7/4127 | ⬇️4 | Solo Gold
7. Third Stage: Keep Going To GrandMaster
Competition | Public | Private | Shake | Medal
Santander Customer Transaction Prediction (2019-04-10 ended) | 31/8802 | 24/8802 | ⬆︎7 | Gold
Jigsaw Unintended Bias in Toxicity Classification (2019-06-27 ended) | 30+/3165 | – | – | Kernel Failed
Predicting Molecular Properties (2019-08-28 ended) | 15/2749 | 15/2749 | – | Gold (→ GM)
8. Validation
Train and Test are split by timestamp, and the Public and Private Test sets are split by timestamp too.
Failure Case: a random validation split — predicting the past with future data is a form of data leakage.
Success Case: split the validation data by timestamp as well, mirroring the train/test split.
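A minimal sketch of the success case, assuming a pandas DataFrame named train with a timestamp column (both names hypothetical), validating on the most recent rows:

import pandas as pd

def time_split(train: pd.DataFrame, valid_frac: float = 0.2):
    # Hold out the most recent rows for validation so the split
    # mirrors the timestamp-based train/test split.
    train = train.sort_values('timestamp')
    cut = int(len(train) * (1 - valid_frac))
    return train.iloc[:cut], train.iloc[cut:]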
9. Validation
Elo: outliers make up only 1% of the target.
Failure Case:
KFold().split(train['target'])
Success Case:
train['outliers'] = 0
train.loc[train['target'] < -30, 'outliers'] = 1
StratifiedKFold().split(train, train['outliers'])
Make sure each fold of your validation data has a similar distribution, both across folds and relative to the test set.
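A runnable sketch of the success case, assuming train is a pandas DataFrame with a continuous target column (the -30 threshold comes from the slide):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

def outlier_stratified_folds(train: pd.DataFrame, n_splits: int = 5):
    # Flag the ~1% of extreme targets so every fold gets the same share.
    train = train.copy()
    train['outliers'] = (train['target'] < -30).astype(int)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    # StratifiedKFold takes X and y; stratify on the outlier flag, not the target.
    return list(skf.split(train, train['outliers']))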
11. Feature Engineering
Elo
train.csv:
Card_id | Feature_1 | Feature_2 | Feature_3 | Target (loyalty)
C_ID_92a2005557 | 5 | 2 | 1 | 0.392890
transactions.csv:
Card_id | Merchant_id | … | Purchase_amount | Purchase_date
C_ID_92a2005557 | M_ID_b0c793002c | 5.263790 | 2018-04-26 14:08:44
C_ID_92a2005557 | M_ID_d15eae0468 | -2.782712 | 2018-05-01 13:01:24
merchants.csv:
Merchant_id | merchant_group | … | city_id | state_id
M_ID_b0c793002c | 8179 | 16 | 242
Start from understanding the problem and the data.
12. Feature Engineering
Elo
Some strong features I made (a sketch of how to compute them follows the list below):
- last_day_purchased (Recency)
- unique_month_purchased (Frequency)
- max_purchase_amount (Monetary)
Get domain knowledge from Kaggle discussions (kernels) & Google:
RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing, and has received particular attention in the retail and professional services industries.
RFM stands for the three dimensions:
• Recency – How recently did the customer purchase?
• Frequency – How often do they purchase?
• Monetary Value – How much do they spend?
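A sketch of how those three features could be computed, assuming Elo-style transactions with card_id, purchase_amount, and purchase_date columns (column names are my assumption):

import pandas as pd

def rfm_features(transactions: pd.DataFrame) -> pd.DataFrame:
    tx = transactions.copy()
    tx['purchase_date'] = pd.to_datetime(tx['purchase_date'])
    tx['purchase_month'] = tx['purchase_date'].dt.to_period('M')
    return tx.groupby('card_id').agg(
        last_day_purchased=('purchase_date', 'max'),           # Recency
        unique_month_purchased=('purchase_month', 'nunique'),  # Frequency
        max_purchase_amount=('purchase_amount', 'max'),        # Monetary
    ).reset_index()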
13. Feature Engineering
Elo
Not only coarse-grained aggregation: extract more fine-grained information too.
Raw Data:
Card_id | Merchant_id
C1 | M1
C1 | M2
… | …
C1 | M99
C1 | M100
Coarse-grained (unique count and total count of one card's purchased merchants):
Card_id | Merchant_Unique | Merchant_Count
C1 | 100 | 200
Fine-grained (count of each of one card's purchased merchants):
Card_id | M1_Count | M2_Count | … | M99_Count | M100_Count
C1 | 1 | 2 | … | 5 | 7
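A pandas sketch of both granularities, assuming the same transactions frame with card_id and merchant_id columns:

import pandas as pd

def merchant_count_features(transactions: pd.DataFrame):
    # Coarse-grained: one row per card with unique / total merchant counts.
    coarse = transactions.groupby('card_id')['merchant_id'].agg(
        merchant_unique='nunique', merchant_count='count').reset_index()
    # Fine-grained: one column per merchant holding that card's purchase count.
    fine = pd.crosstab(transactions['card_id'], transactions['merchant_id'])
    return coarse, fine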
14. Feature Engineering
Elo
Not only tabular feature engineering: transforming the data into text-like sequences lets you build more features.
Text-Like Data:
Card_id | Purchase Merchant Sequence
C1 | M1,M2,M3,M1,M3,……M100
C2 | M2,M3,……M100
… | …
C999 | M45….M100
C1000 | M99
TF-IDF (ngram=1, max_features=None):
Card_id | M1 | M2 | … | M100
C1 | 0.67 | 0.34 | … | 0.12
C2 | 0.23 | 0.45 | … | 0.66
… | … | … | … | …
C999 | 0.01 | 0.43 | … | 0.72
C1000 | 0.99 | 0.89 | … | 0.35
Singular Value Decomposition (SVD):
Card_id | SVD1 | … | SVD5
C1 | 0.34 | … | 0.78
C2 | 0.33 | … | 0.56
… | … | … | …
C999 | 0.31 | … | 0.70
C1000 | 0.95 | … | 0.25
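A scikit-learn sketch of the TF-IDF + SVD pipeline, assuming each card's merchant sequence has been joined into a space-separated string; the 5 components mirror the SVD1…SVD5 columns above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def tfidf_svd_features(merchant_sentences, n_components=5):
    # merchant_sentences: e.g. ["M1 M2 M3 M1 M3", "M2 M3", ...], one per card.
    tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=None)
    matrix = tfidf.fit_transform(merchant_sentences)  # cards x merchants
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    return svd.fit_transform(matrix)  # cards x n_components (SVD1..SVD5)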
15. Feature Engineering
Elo: Word2Vec of Merchants
Sequence Data:
Card_id | Purchase Merchant Sequence
C1 | M1,M2,M3,M1,M3,……M100
C2 | M2,M3,……M100
… | …
C999 | M45….M100
C1000 | M99
[Scatter plot of merchant embeddings: M1, M2, M50, M51, M99, M100]
Aggregating all the merchant embeddings of each card:
Card_id | W2V_1_Mean | … | W2V_5_Max
C1 | 0.34 | … | 0.78
… | … | … | …
C1000 | 0.95 | … | 0.25
A word2vec model can generate more sequence-related information.
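A gensim sketch (4.x API), assuming each card's purchases form an ordered list of merchant ids; the mean/max pooling mirrors the W2V_*_Mean / W2V_*_Max columns:

import numpy as np
from gensim.models import Word2Vec

def w2v_card_features(merchant_sequences, size=5):
    # merchant_sequences: e.g. [["M1", "M2", "M3", "M1"], ["M2", "M3"], ...]
    model = Word2Vec(merchant_sequences, vector_size=size,
                     window=5, min_count=1, seed=42)
    feats = []
    for seq in merchant_sequences:
        vecs = np.array([model.wv[m] for m in seq])
        # Aggregate each card's merchant embeddings with mean and max pooling.
        feats.append(np.concatenate([vecs.mean(axis=0), vecs.max(axis=0)]))
    return np.array(feats)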
16. Feature Engineering
Elo: DeepWalk
Graph Data:
[Diagram: graph of cards C1, C2, C3 and merchants M1, M2, M3 linked by purchases]
Node: card_id, merchant_id
Edge: purchase count
Step 1: Perform random walks on nodes in a graph to generate node sequences.
Step 2: Run skip-gram to learn the embedding of each node based on the node sequences generated in step 1.
Card_id | DW_Card_1 | … | DW_Merchant_1_Max
C1 | 0.34 | … | 0.78
… | … | … | …
C1000 | 0.95 | … | 0.25
A DeepWalk model can generate more graph-related information.
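A minimal DeepWalk sketch following the two steps above: uniform random walks over the card-merchant graph, then skip-gram (sg=1) via gensim. For brevity it ignores the purchase-count edge weights; all names are assumptions:

import random
from gensim.models import Word2Vec

def deepwalk_embeddings(edges, n_walks=10, walk_len=20, size=5):
    # edges: iterable of (card_id, merchant_id) pairs from the transactions.
    graph = {}
    for c, m in edges:
        graph.setdefault(c, []).append(m)
        graph.setdefault(m, []).append(c)
    # Step 1: random walks starting from every node.
    walks = []
    for node in graph:
        for _ in range(n_walks):
            walk, cur = [node], node
            for _ in range(walk_len - 1):
                cur = random.choice(graph[cur])
                walk.append(cur)
            walks.append(walk)
    # Step 2: skip-gram learns an embedding for every card and merchant node.
    return Word2Vec(walks, vector_size=size, window=5,
                    min_count=1, sg=1, seed=42)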
17. Feature Engineering
Elo
train.csv:
Card_id | … | Target
C1 | … | 0.392890
C2 | … | 0.589014
transactions.csv (each transaction inherits its card's target):
Card_id | … | Target
C1 | … | 0.392890
C1 | … | 0.392890
C2 | … | 0.589014
C2 | … | 0.589014
Transaction-level model predictions:
Card_id | Merchant_id | … | Prediction
C1 | M1 | … | 0.389345
C1 | M2 | … | 0.373495
C2 | M99 | … | 0.689014
C2 | M100 | … | 0.489014
Aggregated back to card level:
Card_id | … | Mean of Prediction | Max of Prediction
C1 | … | 0.378924 | 0.380056
C2 | … | 0.509341 | 0.580085
Giving each card_id's target to every one of its transactions, then building a transaction-based model to generate meta features, improved the score very much.
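A sketch of this meta feature, assuming Elo-style train (card_id, target) and a list of transaction-level feature columns. The grouped out-of-fold loop is my addition, so a card's own target never feeds its own meta feature:

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupKFold

def transaction_meta_feature(train, transactions, feature_cols):
    # Give every transaction the target of its card.
    tx = transactions.merge(train[['card_id', 'target']], on='card_id')
    oof = np.zeros(len(tx))
    for tr_idx, va_idx in GroupKFold(n_splits=5).split(tx, groups=tx['card_id']):
        model = lgb.LGBMRegressor(n_estimators=200)
        model.fit(tx.iloc[tr_idx][feature_cols], tx.iloc[tr_idx]['target'])
        oof[va_idx] = model.predict(tx.iloc[va_idx][feature_cols])
    tx['prediction'] = oof
    # Aggregate transaction-level predictions back to card level.
    return tx.groupby('card_id')['prediction'].agg(['mean', 'max']).reset_index()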
18. Feature Selection
Target Permutation (Null Importance)
Used in: HomeCredit, Elo, Santander
Feature1 | Feature2 | Feature3 | Target
0.34 | 0.56 | 0.78 | 0.1
3.44 | 1.09 | 1.23 | 1.2
5.66 | 7.88 | 0.99 | 2.1
Actual Importance: train on the real target.
Null Importance: shuffle the target, then train many times (run 50~100 times) to get the gain importance under the null.
Keep the Top N features ranked by:
gain_score = np.log(1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75)))
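A LightGBM sketch of that loop, ending in the gain_score formula above (model parameters are placeholders):

import lightgbm as lgb
import numpy as np

def null_importance_scores(X, y, n_runs=80):
    def gain_importances(target):
        model = lgb.LGBMRegressor(n_estimators=200, importance_type='gain')
        model.fit(X, target)
        return model.feature_importances_
    # Actual importance: train once on the real target.
    act_imps_gain = gain_importances(y)
    # Null importance: shuffle the target and retrain many times.
    null_imps_gain = np.array(
        [gain_importances(np.random.permutation(y)) for _ in range(n_runs)])
    return np.log(1e-10 + act_imps_gain /
                  (1 + np.percentile(null_imps_gain, 75, axis=0)))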
19. Modeling
Competition | Best Single Model | Ensemble Models
Avito (tabular, text, image) | LGB > NN (top teams: NN > LGB) | Stage 1: 70+ NN, LGB, XGB, CatBoost, Ridge, RF, RGF; Stage 2: XGB for stacking; Stage 3: quiz blending
Home Credit (financial tabular) | LGB >> NN | Stage 1: 10+ LGB, NN; Stage 2: LGB (linear), random forest for stacking; Stage 3: weighted average blending
Elo (financial tabular) | LGB >> NN | Stage 1: 12 LGB and 40 DNN; Stage 2: LGB, ExtraTrees, DNN, linear for stacking; Stage 3: weighted average blending
Santander (anonymous tabular) | LGB > NN (top teams: NN > LGB) | Stage 1: blending of one LGB and one NN
Molecular (chemistry tabular) | GNN >> LGB, DNN | Stage 1: 40+ GNN, DNN, LGB; Stage 2: Bayesian ridge for stacking
20. Stacking
HomeCredit & Elo
Single Model: Final 5th
Simple Stacking: Final 3rd
Single Model: Final 5th
Failure Case:
- Local CV and the LB matched poorly, so the weights of the stacking model were unstable.
- There were many strong LGBs in the first stage, and the second-stage tree models (LGB, ExtraTrees) overfitted badly; using only NN and linear models in the second stage would have improved the result (see the sketch below).
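A sketch of that lesson: keep the second stage linear so it cannot overfit the strong first-stage predictions. oof_preds and test_preds are assumed matrices of out-of-fold and test predictions, one column per first-stage model:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def linear_stack(oof_preds, y, test_preds):
    meta = Ridge(alpha=1.0)  # a linear second stage is hard to overfit
    cv_rmse = -cross_val_score(meta, oof_preds, y,
                               scoring='neg_root_mean_squared_error').mean()
    meta.fit(oof_preds, y)
    return meta.predict(test_preds), cv_rmse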
23. Post-Processing
Elo
Failure Case
Prediction | Target
-28.4579 | -33.2192
-27.1178 | -33.2192
-26.6666 | -33.2192
① Calibrating continuous predictions to discrete values improved both CV and LB, but broke the private LB.
② Overriding the Top-N lowest predictions with the outlier value improved both CV and LB, but broke the private LB.
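For illustration only (it broke the private LB), a sketch of trick ②; n is a placeholder and the override value follows the target column above:

import pandas as pd

def override_top_n_outliers(preds: pd.Series, n: int = 100,
                            outlier_value: float = -33.2192) -> pd.Series:
    preds = preds.copy()
    # Force the n lowest predictions to the exact outlier target value.
    preds.loc[preds.nsmallest(n).index] = outlier_value
    return preds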
24. Post-Processing
HomeCredit, IEEE-CIS Fraud Detection
Success Case
Train:
user | target
user1 | 1
user2 | 0
Test:
user | prediction
user1 | 0.75 -> 1
user2 | 0.12 -> 0
Identifying the same users in train and test, then overriding the test predictions with the train targets, gave a big improvement.
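A sketch of the success case, assuming train and test share a user key identifying the same person and test already carries the model's prediction column:

import pandas as pd

def override_with_train_targets(test: pd.DataFrame,
                                train: pd.DataFrame) -> pd.Series:
    matched = test.merge(train[['user', 'target']], on='user', how='left')
    # Where the same user appears in train, trust the known label
    # over the model's prediction.
    return matched['target'].fillna(matched['prediction'])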
25. Summary
● Finding a more stable Validation guides you down the right path
● Trying different non-linear transformations in Pre-Processing always helps
● The more knowledge (domain, tech, tricks…) you learn, the better Feature Engineering you can do
● Feature Selection can improve accuracy and prevent overfitting
● Tree models always perform well, but don't ignore neural networks, linear models, unsupervised methods… sometimes they can change the game
● Stacking is crucial when local CV matches the public leaderboard very well
● Be careful using Post-Processing: even if it improves local CV and the public leaderboard, use it in only one submission