Kaggle Days Tokyo: Jin Zhan

  1. My Journey To GrandMaster: Success and Failure. 詹金 (センキン, Jin Zhan)
  2. Agenda. Part 1: Introduction of My Kaggle Journey ● Before Kaggle ● Kaggle Preference ● Competition History. Part 2: Some Success and Failure in Competitions ● Validation ● Pre-Processing ● Feature Engineering ● Feature Selection ● Modeling ● Stacking ● Post-Processing
  3. Before Kaggle
  4. Kaggle Preference. Competition type: business tabular data, science tabular data, text data. Language: Python. Libraries: Pandas/NumPy/Sklearn/Matplotlib/Keras/PyTorch. Models: LightGBM/Neural Network/CatBoost/XGBoost/Ridge Regression/KNN… Favorite part: finding killer features. 2nd favorite part: stacking. Hardware: desktop with 32 GB RAM & GTX 1080 Ti, Google Cloud
  5. First Stage: From Beginner to Expert
     Competition (end date) | Public | Private | Shake | Medal
     Zillow's Home Value Prediction (2018-01-11) | 185/3775 | 203/3775 | ⬇️28 | Bronze
     Corporación Favorita Grocery Sales Forecasting (2018-01-15) | 42/1674 | 85/1674 | ⬇️43 | Bronze (reached Expert)
     Recruit Restaurant Visitor Forecasting (2018-02-06) | 10/2157 | 760/2157 | ⬇️750 | -
     Mercari Price Suggestion Challenge (2018-02-21) | 32/2382 | 2318/2382 | ⬇️2286 | -
     Toxic Comment Classification Challenge (2018-03-20) | 78/4550 | 82/4550 | ⬇️4 | Silver
     TalkingData AdTracking Fraud Detection Challenge (2018-05-07) | 7/3946 | 19/3946 | ⬇️12 | Silver
  6. Second Stage: From Master To Solo Gold
     Competition (end date) | Public | Private | Shake | Medal
     Avito Demand Prediction Challenge (2018-06-27) | 8/1871 | 9/1871 | ⬇️1 | Gold (reached Master)
     Home Credit Default Risk (2018-08-29) | 6/7190 | 8/7190 | ⬇️2 | Gold
     Google Analytics Customer Revenue Prediction (2019-02-15) | leak | 85/3611 | - | Silver
     Elo Merchant Category Recommendation (2019-02-26) | 3/4127 | 7/4127 | ⬇️4 | Solo Gold
  7. Third Stage: Keep Going To GrandMaster
     Competition (end date) | Public | Private | Shake | Medal
     Santander Customer Transaction Prediction (2019-04-10) | 31/8802 | 24/8802 | ⬆︎7 | Gold
     Jigsaw Unintended Bias in Toxicity Classification (2019-06-27) | 30+/3165 | kernel failed | - | -
     Predicting Molecular Properties (2019-08-28) | 15/2749 | 15/2749 | - | Gold (reached GrandMaster)
  8. Validation. Train and Test are split by timestamp, and Public Test and Private Test are also split by timestamp. Failure case vs. success case: predicting the past with future data is a form of data leakage, so the validation split should respect time order as well (see the sketch below).
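     A minimal sketch of a time-ordered split in pandas, assuming a timestamp column and an arbitrary 80% cutoff (the actual column names and dates are not on the slide):

        import pandas as pd

        train = pd.read_csv("train.csv", parse_dates=["timestamp"])  # assumed schema

        # Validate on the most recent period so the validation split mirrors the
        # timestamp-based train/test (and public/private) split.
        cutoff = train["timestamp"].sort_values().iloc[int(len(train) * 0.8)]
        trn = train[train["timestamp"] <= cutoff]
        val = train[train["timestamp"] > cutoff]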
  9. Validation (Elo). The outliers are only 1% of the target. Failure case: KFold().split(train) on the raw target. Success case: flag the outliers and stratify on that flag: train['outliers'] = 0; train.loc[train['target'] < -30, 'outliers'] = 1; StratifiedKFold().split(train, train['outliers']). Make sure each fold of your validation data has a similar distribution, and that it is similar to the test set (see the sketch below).
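     A minimal runnable sketch of the two splits, assuming the same train.csv schema as above:

        import pandas as pd
        from sklearn.model_selection import KFold, StratifiedKFold

        train = pd.read_csv("train.csv")  # continuous 'target' with ~1% extreme outliers

        # Failure case: plain KFold can put very different numbers of outliers
        # into each fold, so fold scores become noisy and unreliable.
        kf = KFold(n_splits=5, shuffle=True, random_state=42)

        # Success case: stratify on an outlier flag so every fold contains
        # roughly the same 1% share of outliers.
        train["outliers"] = (train["target"] < -30).astype(int)
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        for fold, (trn_idx, val_idx) in enumerate(skf.split(train, train["outliers"])):
            pass  # fit on train.iloc[trn_idx], evaluate on train.iloc[val_idx]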
  10. Pre-Processing (Elo): de-anonymizing the purchase amount. df_new['purchase_amount_new'] = np.round(df_new['purchase_amount'] / 0.00150265118 + 497.06, 2). Feature engineering makes more sense, and scores improved, after de-anonymization (see the sketch below).
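     The same transformation with its imports, assuming the transactions file from the next slide (the constants are the ones on the slide, reverse-engineered by the community):

        import numpy as np
        import pandas as pd

        transactions = pd.read_csv("transactions.csv")  # assumed file name

        # Map the anonymized amount back onto a (roughly) money-like scale.
        transactions["purchase_amount_new"] = np.round(
            transactions["purchase_amount"] / 0.00150265118 + 497.06, 2
        )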
  11. Feature Engineering (Elo). Start from understanding the problem and the data.
     train.csv: card_id | feature_1 | feature_2 | feature_3 | target (loyalty), e.g. C_ID_92a2005557 | 5 | 2 | 1 | 0.392890
     transactions.csv: card_id | merchant_id | … | purchase_amount | purchase_date, e.g. C_ID_92a2005557 | M_ID_b0c793002c | 5.263790 | 2018-04-26 14:08:44, C_ID_92a2005557 | M_ID_d15eae0468 | -2.782712 | 2018-05-01 13:01:24
     merchants.csv: merchant_id | merchant_group | … | city_id | state_id, e.g. M_ID_b0c793002c | 8179 | 16 | 242
  12. Feature Engineering (Elo). Get domain knowledge from Kaggle discussions/kernels and Google. RFM is a method used for analyzing customer value; it is commonly used in database marketing and direct marketing and has received particular attention in the retail and professional services industries. RFM stands for three dimensions: Recency (how recently did the customer purchase?), Frequency (how often do they purchase?), Monetary value (how much do they spend?). Some strong features I made: last_day_purchased (Recency), unique_month_purchased (Frequency), max_purchase_amount (Monetary); see the sketch below.
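     A sketch of those three aggregations in pandas; the exact definitions in the actual solution may differ:

        import pandas as pd

        transactions = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])

        # RFM-style aggregations, one row per card.
        rfm = transactions.groupby("card_id").agg(
            last_day_purchased=("purchase_date", "max"),                       # Recency
            unique_month_purchased=("purchase_date",
                                    lambda s: s.dt.to_period("M").nunique()),  # Frequency
            max_purchase_amount=("purchase_amount", "max"),                    # Monetary
        ).reset_index()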
  13. Feature Engineering (Elo). Not only coarse-grained aggregation; more fine-grained information helps too (see the sketch below).
     Raw data: card_id | merchant_id rows, e.g. C1 purchased M1, M2, …, M99, M100.
     Coarse-grained: unique count and total count of one card's purchased merchants, e.g. C1 | merchant_unique = 100 | merchant_count = 200.
     Fine-grained: a purchase count for each of one card's merchants, e.g. C1 | M1_count = 1 | M2_count = 2 | … | M99_count = 5 | M100_count = 7.
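     A sketch of both aggregations; in practice the fine-grained matrix would be restricted to the most frequent merchants to keep it manageable:

        import pandas as pd

        transactions = pd.read_csv("transactions.csv")

        # Coarse-grained: one row per card with unique / total merchant counts.
        coarse = transactions.groupby("card_id")["merchant_id"].agg(
            merchant_unique="nunique", merchant_count="count"
        ).reset_index()

        # Fine-grained: one count column per merchant (a card x merchant matrix).
        fine = pd.crosstab(transactions["card_id"], transactions["merchant_id"])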
  14. Feature Engineering (Elo): text-like data. Not only tabular feature engineering; transforming the data into text-like data can build more features. Treat each card's purchase-merchant sequence as a document (e.g. C1 -> "M1 M2 M3 M1 M3 … M100"), run TF-IDF (ngram=1, max_features=None) to get a card x merchant matrix, then compress it with Singular Value Decomposition (SVD) into a few dense columns (SVD1 … SVD5) per card; see the sketch below.
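     A minimal TF-IDF + SVD sketch; 5 components is just the number shown in the slide's example table:

        import pandas as pd
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer

        transactions = pd.read_csv("transactions.csv")

        # One "document" per card: the space-joined sequence of purchased merchants.
        docs = transactions.groupby("card_id")["merchant_id"].apply(" ".join)

        # TF-IDF over merchant tokens (ngram=1, max_features=None as on the slide).
        tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=None, token_pattern=r"\S+")
        card_merchant = tfidf.fit_transform(docs)

        # Compress the sparse card x merchant matrix into a few dense SVD columns.
        svd = TruncatedSVD(n_components=5, random_state=42)
        svd_feats = pd.DataFrame(svd.fit_transform(card_merchant), index=docs.index,
                                 columns=[f"svd_{i + 1}" for i in range(5)])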
  15. Feature Engineering (Elo): Word2Vec over merchants. A Word2Vec model on the same sequence data (each card's purchase-merchant sequence) can generate more sequence-related information: learn an embedding for each merchant, then aggregate the embeddings of all the merchants of each card (mean, max, …) into per-card columns such as W2V_1_Mean … W2V_5_Max; see the sketch below.
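     A sketch assuming gensim 4.x; the embedding size and pooling choices are illustrative:

        import numpy as np
        import pandas as pd
        from gensim.models import Word2Vec

        transactions = pd.read_csv("transactions.csv")

        # One "sentence" per card: the ordered list of purchased merchants.
        sentences = transactions.groupby("card_id")["merchant_id"].apply(list)

        w2v = Word2Vec(sentences.tolist(), vector_size=5, window=5, min_count=1, sg=1, seed=42)

        def pool(merchants):
            # Mean and max pooling over the merchant embeddings of one card.
            vecs = np.vstack([w2v.wv[m] for m in merchants])
            return np.concatenate([vecs.mean(axis=0), vecs.max(axis=0)])

        w2v_feats = pd.DataFrame(sentences.apply(pool).tolist(), index=sentences.index,
                                 columns=[f"w2v_{i + 1}_mean" for i in range(5)]
                                         + [f"w2v_{i + 1}_max" for i in range(5)])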
  16. Feature Engineering (Elo): DeepWalk. A DeepWalk model can generate more graph-related information. Build a graph whose nodes are card_id and merchant_id and whose edges are weighted by purchase count. Step 1: perform random walks on the nodes of the graph to generate node sequences. Step 2: run skip-gram to learn the embedding of each node from the sequences generated in step 1. Aggregate the node embeddings into per-card columns such as DW_Card_1 … DW_Merchant_1_Max; see the sketch below.
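     A simplified DeepWalk sketch (uniform, unweighted walks; walk length and count are illustrative), assuming networkx and gensim:

        import random
        import networkx as nx
        import pandas as pd
        from gensim.models import Word2Vec

        transactions = pd.read_csv("transactions.csv")

        # Bipartite card-merchant graph; edge weight = purchase count.
        edges = transactions.groupby(["card_id", "merchant_id"]).size().reset_index(name="w")
        graph = nx.Graph()
        graph.add_weighted_edges_from(edges.itertuples(index=False, name=None))

        def random_walk(graph, start, length=20):
            # Uniform walk; a weighted walk would sample neighbors by purchase count.
            walk = [start]
            for _ in range(length - 1):
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            return walk

        # Step 1: random walks from every node; Step 2: skip-gram on the walk "sentences".
        walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(10)]
        deepwalk = Word2Vec(walks, vector_size=5, window=5, min_count=1, sg=1, seed=42)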
  17. Feature Engineering (Elo): a transaction-level meta feature. Give each card_id's target to every one of its transactions, build a transaction-based model, then aggregate its predictions back to the card level; this meta feature improved the score very much. Example from the slide: train.csv has C1 | target 0.392890 and C2 | target 0.589014; joining the target onto transactions.csv labels every transaction row; the transaction model then predicts e.g. C1/M1 -> 0.389345, C1/M2 -> 0.373495, C2/M99 -> 0.689014, C2/M100 -> 0.489014; aggregating gives C1 | mean 0.378924 | max 0.380056 and C2 | mean 0.509341 | max 0.580085. A sketch follows below.
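     A sketch of the idea; the out-of-fold scheme (splitting by card so a card never sees its own target) and the GradientBoostingRegressor stand-in are assumptions, not details from the slide:

        import pandas as pd
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import KFold

        train = pd.read_csv("train.csv")                # card_id, ..., target
        transactions = pd.read_csv("transactions.csv")

        # Broadcast each card's target onto all of its transactions.
        trx = transactions.merge(train[["card_id", "target"]], on="card_id")
        features = trx.select_dtypes("number").columns.drop("target").tolist()
        trx["pred"] = 0.0

        cards = train["card_id"].values
        for trn_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(cards):
            trn_mask = trx["card_id"].isin(cards[trn_idx])
            val_mask = trx["card_id"].isin(cards[val_idx])
            model = GradientBoostingRegressor()
            model.fit(trx.loc[trn_mask, features], trx.loc[trn_mask, "target"])
            trx.loc[val_mask, "pred"] = model.predict(trx.loc[val_mask, features])

        # Aggregate the transaction-level predictions back to the card level.
        meta = trx.groupby("card_id")["pred"].agg(["mean", "max"]).add_prefix("pred_")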
  18. Feature Selection: target permutation (null importance), used in HomeCredit, Elo and Santander. Shuffle the target and retrain 50–100 times to collect null gain importances, compare them with the actual importance trained on the true target, then keep the top-N features by gain_score = np.log(1e-10 + act_imps_gain / (1 + np.percentile(null_imps_gain, 75))); see the sketch below.
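     A sketch with LightGBM; the model parameters, the 50 runs and the top-100 cut-off are illustrative (and the loop is slow):

        import numpy as np
        import pandas as pd
        import lightgbm as lgb

        train = pd.read_csv("train.csv")
        X = train.select_dtypes("number").drop(columns=["target"])
        y = train["target"]

        def gain_importances(X, y, seed=0):
            model = lgb.LGBMRegressor(n_estimators=200, random_state=seed)
            model.fit(X, y)
            return pd.Series(model.booster_.feature_importance(importance_type="gain"),
                             index=X.columns)

        # Actual importance on the true target, null importances on shuffled targets.
        actual = gain_importances(X, y)
        null = pd.DataFrame([gain_importances(X, y.sample(frac=1.0, random_state=i).values, seed=i)
                             for i in range(50)])

        # The score from the slide: actual gain vs. the 75th percentile of null gains.
        gain_score = np.log(1e-10 + actual / (1 + null.quantile(0.75)))
        selected = gain_score.sort_values(ascending=False).head(100).index  # top-N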
  19. Modeling
     Competition | Best single model | Ensemble
     Avito (tabular, text, image) | LGB > NN (top teams: NN > LGB) | Stage 1: 70+ NN, LGB, XGB, CatBoost, Ridge, RF, RGF; Stage 2: XGB for stacking; Stage 3: quiz blending
     Home Credit (financial tabular) | LGB >> NN | Stage 1: 10+ LGB, NN; Stage 2: LGB (linear), random forest for stacking; Stage 3: weighted average blending
     Elo (financial tabular) | LGB >> NN | Stage 1: 12 LGB and 40 DNN; Stage 2: LGB, ExtraTrees, DNN, linear for stacking; Stage 3: weighted average blending
     Santander (anonymous tabular) | LGB > NN (top teams: NN > LGB) | Stage 1: blending of one LGB and one NN
     Molecular (chemistry tabular) | GNN >> LGB, DNN | Stage 1: 40+ GNN, DNN, LGB; Stage 2: Bayesian ridge for stacking
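     A minimal sketch of the Stage 1 / Stage 2 pattern in the table: out-of-fold predictions of a few base models become the inputs of a second-level model. The toy data and the particular scikit-learn models are stand-ins, not the ones actually used:

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import KFold

        X, y = make_regression(n_samples=500, n_features=20, random_state=42)  # toy data

        base_models = [GradientBoostingRegressor(random_state=0),
                       RandomForestRegressor(random_state=0),
                       Ridge()]

        # Stage 1: out-of-fold predictions of each base model.
        oof = np.zeros((len(X), len(base_models)))
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        for j, model in enumerate(base_models):
            for trn_idx, val_idx in kf.split(X):
                model.fit(X[trn_idx], y[trn_idx])
                oof[val_idx, j] = model.predict(X[val_idx])

        # Stage 2: a simple second-level model stacked on the out-of-fold predictions.
        stacker = Ridge()
        stacker.fit(oof, y)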
  20. Stacking: failure case (HomeCredit, Elo). (Single model: final 5th; simple stacking: final 3rd; single model: final 5th.) Local CV and the LB did not match well, so the weights of the stacking model were unstable. There were many strong LGB models in the first stage and the second stage's tree models (LGB, ExtraTrees) overfitted badly; using only NN and linear models in the second stage would have improved the result.
  21. Stacking: success cases. Feature-rich data (tabular, text, image): Train/Public/Private split well; local CV and LB matched very well. Molecular (a clean "atom world"?): Train/Public/Private split well; local CV and LB matched very well.
  22. Post-Processing (TalkingData): failure case. Lost a solo gold due to a post-processing trick shared in the discussion forum: it was not checked against local CV and was applied to both of the two final submissions; without the post-processing the result would have been better.
  23. Post-Processing (Elo): failure cases. ① Calibrating the continuous predictions to discrete values (e.g. predictions -28.4579, -27.1178, -26.6666 mapped to the outlier target -33.2192) improved both CV and the public LB, but broke the private LB. ② Overriding the top-N lowest predictions with the outlier value also improved both CV and the public LB, but broke the private LB.
  24. Post-Processing (HomeCredit, IEEE-CIS Fraud Detection): success case. Identify the same users in train and test, then override the test predictions with the train targets (e.g. user1: 0.75 -> 1, user2: 0.12 -> 0); this can give a big improvement. A sketch follows below.
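     A sketch of the override, assuming a hypothetical single user_id key (in practice the "same user" is identified by matching a combination of columns) and that the submission rows are aligned with the test rows:

        import pandas as pd

        train = pd.read_csv("train.csv")            # user id columns and 'target'
        test = pd.read_csv("test.csv")
        sub = pd.read_csv("submission.csv")         # model predictions, one row per test row

        # Look up each test user's known train target, where one exists.
        matched = test.merge(train[["user_id", "target"]], on="user_id", how="left")

        # Override the model prediction with the known train target.
        override = matched["target"].notna()
        sub.loc[override.values, "prediction"] = matched.loc[override, "target"].values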
  25. Summary ● Finding a more stable validation guides you onto the right path ● Trying different non-linear transformations in pre-processing always helps ● The more knowledge (domain, tech, tricks, …) you learn, the better feature engineering you can do ● Feature selection can improve accuracy and prevent overfitting ● Tree models usually perform well, but don't ignore neural networks, linear models, unsupervised methods, … sometimes they can change the game ● Stacking is crucial when local CV matches the public leaderboard very well ● Be careful with post-processing: even if it improves local CV and the public leaderboard, only use it in one of the two final submissions
  26. Thank You!
