Some tips and tricks for winning Kaggle competitive data science competitions


- 1. Winning Kaggle Competitions Hendrik Jacob van Veen - Nubank Brasil
- 2. About Kaggle Biggest platform for competitive data science in the world Currently 500k+ competitors Great platform to learn about the latest techniques and avoid overfitting Great platform to share and meet up with other data freaks
- 3. Approach Get a good score as fast as possible Using versatile libraries Model ensembling
- 4. Get a good score as fast as possible Get the raw data into a universal format like SVMlight or Numpy arrays. Failing fast and failing often / Agile sprint / Iteration Sub-linear debugging: “output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement.” Paul Mineiro
- 5. Using versatile libraries Scikit-learn Vowpal Wabbit XGBoost Keras Other tools get Scikit-learn API wrappers
- 6. Model Ensembling Voting Averaging Bagging Boosting Binning Blending Stacking
- 7. General Strategy Try to create “machine learning”-learning algorithms with optimized pipelines that are: Data agnostic (Sparse, dense, missing values, larger than memory) Problem agnostic (Classification, regression, clustering) Solution agnostic (Production-ready, PoC, latency) Automated (Turn on and go to bed) Memory-friendly (Don’t want to pay for AWS) Robust (Good generalization, concept drift, consistent)
- 8. First Overview I Classification? Regression? Evaluation Metric Description Benchmark code “Predict human activities based on their smartphone usage. Predict if a user is sitting, walking etc.” - Smartphone User Activity Prediction Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content. - Dato Truly Native?
- 9. First Overview II Counts Images Text Categorical Floats Dates 0.28309984, -0.025501173, … , -0.11118051, 0.37447712 <!Doctype html><html><head><meta charset=utf-8> … </html>
- 10. First Overview III Data size? Dimensionality? Number of train samples & test samples? Online or offline learning? Linear problem or non-linear problem? Previous competitions that were similar?
- 11. Branch If: Issues with the data -> Tedious clean-up Join JSON tables, Impute missing values, Curse Kaggle and join another competition Else: Get data into Numpy arrays, we want: X_train, y, X_test
- 12. Local Evaluation Set up local evaluation according to competition metric Create a simple benchmark (useful for exploration and discarding models) 5-fold stratified cross-validation usually does the trick Very important step for fast iteration and saving submissions, yet easy to be lazy and use leaderboard. Area Under the Curve, Multi-Class Classification Accuracy
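The stratified split the slide recommends can be sketched in a few lines of pure Python. This is a minimal illustration of the idea (class proportions preserved in each fold), not scikit-learn's `StratifiedKFold`; the function name and round-robin assignment are choices made here for the example.

```python
import random
from collections import defaultdict

def stratified_kfold(y, n_folds=5, seed=42):
    """Yield (train_idx, valid_idx) pairs whose folds preserve
    the class proportions of y. Minimal sketch, not sklearn's API."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        # deal each class's samples round-robin over the folds
        for j, i in enumerate(idxs):
            folds[j % n_folds].append(i)
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, valid
```

Scoring each fold with the competition metric then gives the local evaluation loop used for fast iteration.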
- 13. Data Exploration Min, Max, Mean, Percentiles, Std, Plotting Can detect: leakage, golden features, feature engineering tricks, data health issues. Caveat: At least one top 50 Kaggler used to not look at the data at all: “It’s called machine learning for a reason.”
- 14. Feature Engineering I Log-transform count features, tf-idf transform text features Unsupervised transforms / dimensionality reduction Manual inspection of data Dates -> day of month, is_holiday, season, etc. Create histograms and cluster similar features Using VW-varinfo or XGBfi to check 2-3-way interactions Row stats: mean, max, min, number of NA’s.
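Two of the tricks above — log-transforming count features and adding row stats (mean, max, min, number of NA's) — can be sketched as follows. This is an illustrative helper, with `None` standing in for a missing value; the function name and output layout are assumptions of the example.

```python
import math

def engineer_rows(rows):
    """Per-row features from count data: log1p-transformed counts
    plus row stats (mean, max, min, NA count)."""
    out = []
    for row in rows:
        present = [v for v in row if v is not None]
        out.append({
            "log_counts": [math.log1p(v) for v in present],
            "mean": sum(present) / len(present),
            "max": max(present),
            "min": min(present),
            "num_na": sum(1 for v in row if v is None),
        })
    return out
```

Row stats like these are cheap to compute and often pick up signal (e.g. how sparse a row is) that individual columns miss.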
- 15. Feature Engineering II Bin numerical features to categorical features Bayesian encoding of categorical features to likelihood Genetic programming Random-swap feature elimination Time binning (customer bought in last week, last month, last year …) Expand data (Coates & Ng, Random Bit Regression) Automate all of this
- 16. Feature Engineering III Categorical features need some special treatment Onehot-encode for linear models (sparsity) Colhot-encode for tree-based models (density) Counthot-encode for large cardinality features Likelihood-encode for experts…
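The three most common of the categorical treatments above can be sketched in pure Python. These are minimal illustrations — the function names, the smoothing constants in the likelihood encoder, and the `prior`/`weight` scheme are choices made for this example, not a fixed recipe.

```python
from collections import Counter, defaultdict

def onehot(values, vocab):
    """One-hot encode (sparse-friendly, for linear models)."""
    return [[1 if v == cat else 0 for cat in vocab] for v in values]

def count_encode(values):
    """Replace each category by its frequency (dense single column,
    useful for large-cardinality features)."""
    counts = Counter(values)
    return [counts[v] for v in values]

def likelihood_encode(values, targets, prior=0.5, weight=10):
    """Smoothed mean-target ('likelihood') encoding: per-category
    target mean shrunk toward a prior to limit leakage on rare
    categories."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [(sums[v] + prior * weight) / (counts[v] + weight)
            for v in values]
```

In practice likelihood encoding must be computed out-of-fold (encode each sample using statistics from the other folds), otherwise it leaks the target.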
- 17. Algorithms I A bias-variance trade-off between simple and complex models
- 18. Algorithms II There is No Free Lunch in statistical inference We show that all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions. – Wolpert & Macready, No free lunch theorems for search Practical Solution for low-bias low-variance models: Use prior knowledge / experience to limit search (Let algo’s play to their known strengths for particular problems) Remove or avoid their weaknesses Combine/Bag their predictions
- 19. Random Forests I A Random Forest is an ensemble of decision trees. "Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. […]" More robust to noise. - ‘Random Forests’, Breiman
- 20. Random Forests II Strengths Fast Easy to tune Easy to inspect Easy to explore data with Good Benchmark Very wide applicability Can introduce randomness / Diversity Weaknesses Memory Hungry Popular Slower for test time
- 21. GBM I A GBM trains weak models on samples that previous models got wrong "A method is described for converting a weak learning algorithm [the learner can produce an hypothesis that performs only slightly better than random guessing] into one that achieves arbitrarily high accuracy." - “The Strength of Weak Learnability" Schapire
- 22. GBM II Strengths Can achieve very good results Can model complex problems Works on wide variety of problems Use custom loss functions No need to scale data Weaknesses Slower to train Easier to overfit than RF Weak learner assumption is broken along the way Tricky to tune Popular
- 23. SVM I Classification and Regression using Support Vectors "Nothing is more practical than a good theory." ‘The Nature of Statistical Learning Theory’, Vapnik
- 24. SVM II Strengths Strong theoretical guarantees Tuning regularization parameter helps prevent overfit Kernel Trick: Use custom kernels, turn linear kernel into non-linear kernel Achieve state-of-the-art on select problems Weaknesses Slower to train Memory heavy Requires a tedious grid-search for best performance Will probably time-out on large datasets
- 25. Nearest Neighbours I Look at the distance to other samples "The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points." ‘Nearest neighbor pattern classification’, Cover & Hart
- 26. Nearest Neighbours II Strengths Simple Unpopular Non-linear Easy to tune Detect near-duplicates Weaknesses Simple Does not work well on average Depending on data size: Slow
- 27. Perceptron I Update weights on a wrong prediction, else do nothing The embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. ‘New York Times’, on Rosenblatt
- 28. Perceptron II Strengths Cool / Street Cred Extremely Simple Fast / Sparse updates Online Learning Works well with text Weaknesses Other linear algo’s usually beat it Does not work well on average No regularization
- 29. Neural Networks I Inspired by biological systems (Connected neurons firing when threshold is reached) Because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. […] for any logical expression satisfying certain conditions, one can find a net behaving in the fashion it describes. ‘A Logical Calculus of the Ideas Immanent in Nervous Activity’, McCulloch & Pitts
- 30. Neural Networks II Strengths The best for images Can model any function End-to-end Training Amortizes feature representation Weaknesses Can be difficult to set up Not very interpretable Requires specialized hardware Underfit / Overfit
- 31. Vowpal Wabbit I Online learning while optimizing a loss function We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. ‘A Reliable Effective Terascale Linear Learning System’, Agarwal et al.
- 32. Vowpal Wabbit II Strengths Fixed memory constraint Extremely fast Feature expansion Difficult to overfit Versatile Weaknesses Different API Manual feature engineering Loses against boosting Requires practice Hashing can obscure
- 33. Others Factorization Machines PCA t-SNE SVD / LSA Ridge Regression GLMNet Genetic Algorithms Bayesian Logistic Regression Quantile Regression AdaBoosting SGD
- 34. Ensembles I Combine models in a way that outperforms individual models. “That’s how almost all ML competitions are won” - ‘Dark Knowledge’ Hinton et al. Ensembles reduce the chance of overﬁt. Bagging / Averaging -> Lower variance, slightly lower bias Blending / Stacking -> Remove biases of base models
- 35. Ensembles II Practical tips: Use diverse models Use diverse feature sets Use many models Do not leak any information
- 36. Stacked Generalization I Train one model on the predictions of another model A scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. - ‘Stacked Generalization’, Wolpert
- 37. Stacked Generalization II Train one model on the predictions of another model
- 38. Stacked Generalization III Using weak base models vs. using strong base models Averaging out-of-fold models for test predictions vs. retraining one model on the full train set One can also stack features when these are not available in the test set. Can share train set predictions based on different folds
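The core mechanic behind the three slides above — building a train set for the second-stage model where every sample's prediction comes from a base model that never saw it — can be sketched like this. The `model_factory` interface (`fit`/`predict`) and the simple striped fold split are assumptions of this example.

```python
def out_of_fold_predictions(model_factory, X, y, n_folds=5):
    """Build stacking features: each sample is predicted by a model
    trained on the other folds, so the meta-model never sees
    leaked train-time predictions."""
    n = len(X)
    oof = [None] * n
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    for valid in folds:
        valid_set = set(valid)
        train = [i for i in range(n) if i not in valid_set]
        model = model_factory()
        model.fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in valid])
        for i, p in zip(valid, preds):
            oof[i] = p
    return oof
```

Stack the out-of-fold columns of several base models side by side and train the second-stage model on them against `y`.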
- 39. StackNet We need to go deeper: Splitting node: x1 > 5? 1 else 0 Decision tree: x1 > 5 AND x2 < 12? Random forest: avg ( x1 > 5 AND x2 < 12?, x3 > 2? ) Stacking-1: avg ( RF1_pred > 0.9?, RF2_pred > 0.92? ) Stacking-2: avg ( S1_pred > 0.93?, S2_pred < 0.77? ) Stacking-3: avg ( SS1_pred > 0.98?, SS2_pred > 0.97? )
- 40. Bagging Predictors I Averaging submissions to reduce variance "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor." - ‘Bagging Predictors’, Breiman
- 41. Bagging Predictors II Train models with: Different data sets Different algorithms Different features subsets Different sample subsets Then average / vote aggregate these
- 42. Bagging Predictors III One can average with: Plain average Geometric mean Rank mean Harmonic mean KazAnova’s brute-force weighted averaging Caruana’s forward greedy model selection
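Four of the averaging schemes listed above can be written as one small helper. This is an illustrative sketch (function name and interface chosen here); rank averaging replaces each model's raw scores by within-model ranks before averaging, which helps when models are calibrated differently.

```python
import math

def average_predictions(preds, method="plain"):
    """Combine per-model prediction lists column-wise using a
    plain, geometric, harmonic, or rank mean."""
    n_models, n = len(preds), len(preds[0])
    if method == "rank":
        ranked = []
        for p in preds:
            order = sorted(range(n), key=lambda i: p[i])
            ranks = [0] * n
            for r, i in enumerate(order):
                ranks[i] = r
            ranked.append(ranks)
        preds = ranked  # then fall through to a plain average of ranks
    cols = list(zip(*preds))
    if method == "geometric":
        return [math.prod(c) ** (1 / n_models) for c in cols]
    if method == "harmonic":
        return [n_models / sum(1 / v for v in c) for c in cols]
    return [sum(c) / n_models for c in cols]
```

Geometric and harmonic means pull blended probabilities toward the lower predictions, which sometimes suits AUC-style metrics better than the plain mean.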
- 43. Brute-Force Weighted Average Create out-of-fold predictions for train set for n models Pick a stepsize s, and set n weights Try every possible weight combination with stepsize s Keep the set of n weights that improves the train set score the most Can do in cross-validation-style manner for extra robustness.
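The brute-force search above can be sketched directly: enumerate all weight vectors on a step grid that sum to one, and keep the best-scoring set. Squared error stands in for the competition metric here, and the function name is an assumption of the example.

```python
import itertools

def brute_force_weights(oof_preds, y, step=0.1):
    """Grid-search ensemble weights (summing to 1) over a step grid,
    scored on out-of-fold predictions against y."""
    n_models = len(oof_preds)
    grid = [round(k * step, 10) for k in range(int(round(1 / step)) + 1)]
    best, best_err = None, float("inf")
    for ws in itertools.product(grid, repeat=n_models):
        if abs(sum(ws) - 1.0) > 1e-9:
            continue  # only convex combinations
        blend = [sum(w * p[i] for w, p in zip(ws, oof_preds))
                 for i in range(len(y))]
        err = sum((b - t) ** 2 for b, t in zip(blend, y))
        if err < best_err:
            best, best_err = ws, err
    return best, best_err
```

The grid grows exponentially in the number of models, which is why coarse steps (0.05-0.1) and small model sets are typical.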
- 44. Greedy forward model selection (Caruana) Create out-of-fold predictions for the train set Start with a base ensemble of the 3 best models Loop: Add every model from the library to the ensemble and keep the addition that gives the best train score Selecting with replacement, models can be picked multiple times (which weights them) Using random subset selection from the library in the loop avoids overfitting to the single best model.
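The greedy loop above can be sketched as follows. For brevity this version seeds the ensemble with the single best model rather than the three best, uses squared error as a stand-in metric, and omits the random-subset trick — all simplifications of Caruana's procedure, not the full method.

```python
def greedy_forward_selection(library, y, n_rounds=10):
    """Caruana-style forward selection: repeatedly add the library
    model (with replacement) whose inclusion most improves the
    ensemble's error on out-of-fold predictions."""
    def err(ensemble):
        n = len(ensemble)
        blend = [sum(p[i] for p in ensemble) / n for i in range(len(y))]
        return sum((b - t) ** 2 for b, t in zip(blend, y))

    ensemble = [min(library, key=lambda p: err([p]))]
    for _ in range(n_rounds):
        candidate = min(library, key=lambda p: err(ensemble + [p]))
        if err(ensemble + [candidate]) >= err(ensemble):
            break  # no addition helps any more
        ensemble.append(candidate)
    return ensemble, err(ensemble)
```

Because models are re-selectable, a model picked twice effectively gets double weight in the final average.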
- 45. Automated Stack ’n Bag I Automatically train 1000s of models and 100s of stackers, then average everything. “Hodor!” - Hodor
- 46. Automated Stack ’n Bag II Generalization Train random models, random parameters, random data set transforms, random feature sets, random sample sets. Stacking Train random models, random parameters, random base models, with and without original features, random feature sets, random sample sets. Bagging Average random selection of Stackers and Generalizers. Either pick best model, or create more random bags and keep averaging, ‘till no increase.
- 47. Automated Stack ’n Bag III Strengths Wins Kaggle competitions Best generalization No tuning No selection No human bias Weaknesses Extremely slow Redundant Inelegant Very complex Bad for environment
- 48. Leakage I “The introduction of information about the data mining target, which should not be legitimately available to mine from.” - ‘Leakage in Data Mining: Formulation, Detection, and Avoidance’, Kaufman et al. “one of the top ten data mining mistakes” - ‘Handbook of Statistical Analysis and Data Mining Applications’, Nisbet et al.
- 49. Leakage II Exploiting Leakage: In predictive modeling competitions: Allowed and beneficial for results In Science and Business: A very big NO NO! In both: Accidental (Complex algo’s find leakage automatically, or KNN finds duplicates)
- 50. Leakage III Dato Truly Native? This task suffered from data collection leakage: Dates and certain keywords (Trump) were indicative, and generalized to the private LB (but would not generalize to future data). Smartphone activity prediction This task did not have enough randomization (the order of samples in train and test set was indicative) Could manually change predictions, because classes were clustered.
- 51. Winning Dato Truly Native? I Invented StackNet “Data science is a team sport”: it helps to join up with #1 Kaggler :) We used basic NLP: Cleaning, lowercasing, stemming, ngrams, chargrams, tf-idf, SVD. Trained a lot of different models on different datasets. Started ensembling in the last 2 weeks. Doing research and fun stuff, while waiting for models to complete. XGBoost the big winner (somewhat rare to use boosting for sparse text)
- 52. Winning Dato Truly Native? II
- 53. Winning Smartphone Activity Prediction I Prototyped Automated Stack ’n Bag (Kaggle Killer). Let computer run for two days Automatically inferred feature types Did not look at the data Beat very stiff competition
- 54. Winning Smartphone Activity Prediction II
- 55. General strategy Being #1 during competition sucks. Team up Go crazy with ensembling Do not worry so much about replication that it freezes progress Check previous competitions Be patient and persistent (don't run out of steam) Automate a lot Stay up-to-date with state-of-the-art algorithms and tools
- 56. Complexity vs. Practicality I Most Kaggle winner models are useless for production. It’s about hyper-optimization. Top 10% probably good enough for business. But what if we could use some Top 1% principles from Kaggle models for business? 1-5% increase in accuracy can matter a lot! Batch jobs allow us to overcome latency constraints Ensembles are technically brittle, but give good generalization. Leave no model behind!
- 57. Complexity vs. Practicality II
- 58. Future Use re-usable holdout set Use contextual bandits for training the ensemble Find more models to add to library Ensemble pruning / compression Interpretable black box models
