Machine Learning/ Data Science: Boosting Predictive Analytics Model Performance

State-of-the-art techniques anyone can use to improve machine learning model performance. Includes several steps on model strategy, feature creation, Kaggle success secrets, and many other tips.

Published in: Data & Analytics

  1. 1. Scott.Clendaniel@MktgSciences.com Machine Learning: Boosting Analytics Model Performance
  2. 2. THE JOB OF DATA SCIENTISTS: Does this sound familiar to anyone?
  3. 3. AGENDA 1. Background: Explaining why boosting performance is relevant. 2. Strategy: How to design a strategy for boosting performance. 3. Features: How to use Feature Engineering to boost model performance. 4. Bonus Round: A collection of free resources for boosting model performance. 5. Questions: Time for questions from the audience.
  4. 4. BOOSTING MODEL PERFORMANCE Section 1: Background
  5. 5. SECTION 1: Background. 1. Background: Explaining why boosting performance is relevant.
  6. 6. TIPS SOURCES: Where do the recommendations originate? 197 Kaggle Winner Interviews: How did they win? 50 In-depth Case Studies: Which factors mattered? 25,000 Head-to-Head Tests: What made the difference?
  7. 7. WHERE HAVE THESE TIPS WORKED? IMPORTANT: All views expressed are solely my own, and should not be taken as being those of current or past employers, clients or others.
  8. 8. TWO CATEGORIES OF TIPS Presentation Focus. Part 1: Model Strategy. The plan, method, series of tactics or stratagems for building your model. Part 2: Data Preparation. The process for identifying, building, developing, standardizing, normalizing and engineering the correct inputs for one or more analytics processes.
  9. 9. BOOSTING MODEL PERFORMANCE Section 2: Model Strategy
  10. 10. SECTION 2: Strategy. 1. Background: Explaining why boosting performance is relevant. 2. Strategy: How to design a strategy for boosting performance.
  11. 11. TIP 1: Leverage Extreme Ensembles. Ensembles of models with non-correlated errors consistently deliver a larger performance boost than single models or smaller ensembles. 2015 Liberty Mutual Contest, Owen Zhang: • 6-layer process • 5 distinct data prep steps • 31 combined feature sets • 2 layers of 3 models each. Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions 2015 KDD CUP, Jeong-Yoon Lee: • 7 feature sets • 64 component models • 15 models in Level 1 Ensemble • 2 models in Level 2 Ensemble. Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
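The claim in Tip 1 about non-correlated errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "models" are simulated as the true value plus independent Gaussian noise, and the blend is a plain average rather than the multi-layer stacked ensembles the cited competition entries used. The function names and the noise level are hypothetical.

```python
import random
import statistics

def rmse(predictions, targets):
    """Root-mean-squared error between two equal-length sequences."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return (squared / len(targets)) ** 0.5

def simulate_models(targets, n_models, noise_sd, rng):
    """Assumption: each model is the truth plus its own independent noise."""
    return [[t + rng.gauss(0.0, noise_sd) for t in targets] for _ in range(n_models)]

def average_ensemble(model_predictions):
    """Blend models by taking the per-record mean of their predictions."""
    return [statistics.fmean(column) for column in zip(*model_predictions)]

rng = random.Random(0)
targets = [rng.uniform(0.0, 1.0) for _ in range(2000)]
models = simulate_models(targets, n_models=16, noise_sd=0.2, rng=rng)

single_rmse = [rmse(m, targets) for m in models]
small_rmse = rmse(average_ensemble(models[:4]), targets)   # 4-model blend
large_rmse = rmse(average_ensemble(models), targets)       # 16-model blend
print(f"best single: {min(single_rmse):.3f}  "
      f"4-model: {small_rmse:.3f}  16-model: {large_rmse:.3f}")
```

Because the simulated errors are independent, averaging k models shrinks the error by roughly the square root of k. Real models trained on the same data have correlated errors, so the gain in practice is smaller and depends on how diverse the component models are.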
  12. 12. TIP 2: Reduce the Decision Space. MARKETING: Eliminate irrelevant populations • Seed lists • Old, unusable lead sources • Discontinued markets. FRAUD: Eliminate “safer” populations • Low dollar thresholds • “Best” customers • Higher authentication transactions • “Standing” transactions • Canceled transfers. GENERAL: Other instances • What do you already know? • What is beyond your influence? • Which problems can be handled separately?
  13. 13. TIP 3: Use Targeted AUC Instead of Total AUC. Match model objective to organizational objective. Example courtesy of ORACLE. TOTAL AUC: Optimizes overall model performance • Traditional approach • Perfect for many Kaggle competitions • Sacrifices accuracy at lower threshold targets for overall accuracy. TARGETED AUC: Optimizes targeted model performance • Less common approach • Perfect for projects with target thresholds such as limited marketing budgets or maximum fraud referral/turndown rates • Sacrifices overall accuracy for accuracy at lower threshold targets.
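The slide does not define "targeted AUC" precisely. One common reading, assumed here, is a partial AUC: the area under the ROC curve restricted to false-positive rates at or below a business threshold. The sketch below uses that reading; `roc_points`, `auc_up_to`, and the toy scores are hypothetical names and data chosen for illustration.

```python
def roc_points(labels, scores):
    """Return the ROC curve as (fpr, tpr) points, sweeping the threshold high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume every record tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc_up_to(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment so it ends exactly at max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.7, 0.6, 0.5, 0.8, 0.4, 0.3, 0.2]  # better overall ranking
scores_b = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.4, 0.3]  # cleaner at the very top

for name, scores in (("A", scores_a), ("B", scores_b)):
    curve = roc_points(labels, scores)
    print(name, "total AUC:", auc_up_to(curve), "AUC up to FPR 0.25:", auc_up_to(curve, 0.25))
```

In this toy data model A has the higher total AUC while model B has the higher area in the low false-positive region, which is the trade-off the slide describes: the preferred model depends on whether the organization acts on the whole ranking or only on the top of it.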
  14. 14. TIP 4: Cross-Validate Everywhere. Reducing overfitting while extracting maximum learning from your data. OUT-OF-SAMPLE VALIDATION: Traditional methodology. CROSS-VALIDATION: Used to reduce both overfitting and outlier influence.
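The difference between a single out-of-sample split and cross-validation in Tip 4 can be sketched with a k-fold loop. This is a minimal illustration, not the presenter's procedure: the helper names are hypothetical, and the "model" is a predict-the-training-mean baseline standing in for any real learner.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; every record is held out exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        valid_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, valid_idx

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean held-out absolute error and the per-fold errors."""
    fold_errors = []
    for train_idx, valid_idx in k_fold_indices(len(ys), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in valid_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors), fold_errors

def fit_mean(train_xs, train_ys):
    """Placeholder learner: remember the training mean of the target."""
    return sum(train_ys) / len(train_ys)

def predict_mean(model, x):
    """Placeholder predictor: ignore the input and return the stored mean."""
    return model

xs = list(range(50))
ys = [float(i % 7) for i in range(50)]
mean_error, fold_errors = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print("mean held-out error:", round(mean_error, 3))
print("per-fold errors:", [round(e, 3) for e in fold_errors])
```

The spread of the per-fold errors is the extra information a single holdout split cannot give: a model whose error varies widely between folds is sensitive to which records it happened to see.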
  15. 15. TIP 5: Algorithm Arsenal. Leverage a diverse modeling arsenal: Bayesian Networks, Gradient Boosting Machines, Random Forests, Logistic Regression, Factorization Machines, Neural Networks, Genetic Algorithms, Support Vector Machines.
  16. 16. BOOSTING MODEL PERFORMANCE Section 3: Features
  17. 17. SECTION 3: Features. 1. Background: Explaining why boosting performance is relevant. 2. Strategy: How to design a strategy for boosting performance. 3. Features: How to use Feature Engineering to boost model performance.
  18. 18. Features. TIP 7: Test Variable Transformation Functions
  19. 19. Features. TIPS 8-13: Univariate Tree Feature Engineering. 1. Derive “Stumps”: Create a derived feature for each input. “Stumps” represent the first split in decision trees, and make powerful “weak learners.” 2. Bin Continuous Inputs: Assign bins for each continuous input. Using trees creates bin “boundaries” directly associated with the dependent variable, rather than a more arbitrary approach. 3. Handle Missing Values: Assigning missing values to a separate, unique category preserves information content and eliminates arbitrary replacement approaches. 4. Derive High-Impact Flags: Calling out tree nodes with uniquely powerful splitting capabilities as derived features leverages the most benefit from single inputs. 5. Normalize Scaling: Each input, regardless of data type, can have consistent, normalized scaling by using something like NORM Sigmoid or Yule’s Q for each terminal node from each univariate tree. 6. Overall Transformation: Re-coding the original input into the values from the terminal nodes makes interpretation much easier.
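Steps 1, 2, 3 and 6 above can be sketched for a single continuous input and a binary target. This is an illustrative sketch under assumptions, not the presenter's implementation: the split criterion is Gini impurity, the tree is one split deep, and each bin is re-coded to its raw target rate rather than the NORM Sigmoid or Yule's Q scaling named in step 5. All function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    rate = sum(labels) / len(labels)
    return 2.0 * rate * (1.0 - rate)

def best_stump_threshold(values, labels):
    """Return the cut point on one input that minimizes weighted Gini impurity."""
    pairs = sorted((v, y) for v, y in zip(values, labels) if v is not None)
    best_cut, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_impurity = impurity
    return best_cut

def stump_bin(value, cut):
    """Assign a record to 'missing', 'low', or 'high'; missing gets its own bin."""
    if value is None:
        return "missing"
    return "low" if cut is None or value <= cut else "high"

def recode_with_bin_rates(values, labels, cut):
    """Replace each raw value with the target rate observed in its bin."""
    totals = {}
    for v, y in zip(values, labels):
        name = stump_bin(v, cut)
        hits, count = totals.get(name, (0, 0))
        totals[name] = (hits + y, count + 1)
    rates = {name: hits / count for name, (hits, count) in totals.items()}
    return [rates[stump_bin(v, cut)] for v in values], rates

values = [1, 2, 3, None, 10, 11, 12, None]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
cut = best_stump_threshold(values, labels)
recoded, rates = recode_with_bin_rates(values, labels, cut)
print("cut:", cut, "bin rates:", rates)
```

Because the bin rates are computed from the target, they should be estimated on training folds only and then applied to held-out records; computing them on the full data set leaks the target into the feature.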
  20. 20. TIP 14: Think “Crafts-person-ship.” Less “Assembly Line,” More “Fine Craftsmanship.” Moving Away From… Moving Toward…
  21. 21. BOOSTING MODEL PERFORMANCE Section 4: Bonus Round
  22. 22. SECTION 4: Bonus Round. 1. Background: Explaining why boosting performance is relevant. 2. Strategy: How to design a strategy for boosting performance. 3. Features: How to use Feature Engineering to boost model performance. 4. Bonus Round: A collection of free resources for boosting model performance.
  23. 23. Features. Bonus Round: Patent-Application IMPACT Features. Patent application approach for transforming and combining model inputs: 1. Univariate Tree Models 2. Create Common Table of Values for Each Node 3. Calculate Z-Score Across Entire Table 4. Assign New Value to New Derived Feature 5. Calculate Avg., High and Low 6. Gradient Boosting
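The slide lists these six steps without detail, so the following is only one possible reading of steps 2 through 5, not the patent-application method itself. It assumes the "value" for each node is the target rate in that node, that the z-score is taken over all nodes of all inputs at once, and that the average, high and low are taken across a record's inputs. The table contents, input names and function names are all hypothetical; steps 1 and 6 (fitting the univariate trees and the gradient boosting model) are not shown.

```python
import statistics

# Hypothetical common table: (input name, terminal node) -> target rate in that node.
node_rates = {
    ("age", "low"): 0.10, ("age", "high"): 0.30,
    ("income", "low"): 0.40, ("income", "high"): 0.20,
    ("tenure", "missing"): 0.50, ("tenure", "present"): 0.15,
}

def z_score_table(table):
    """Standardize every node value against the mean and spread of the whole table."""
    values = list(table.values())
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # assumes the node values are not all equal
    return {key: (value - mean) / spread for key, value in table.items()}

def impact_summary(record_nodes, z_table):
    """Look up each input's node z-score, then summarize across the record's inputs."""
    derived = {name: z_table[(name, node)] for name, node in record_nodes.items()}
    scores = list(derived.values())
    derived.update(z_avg=statistics.fmean(scores), z_high=max(scores), z_low=min(scores))
    return derived

z_table = z_score_table(node_rates)
record = {"age": "high", "income": "low", "tenure": "present"}  # node reached per input
features = impact_summary(record, z_table)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Putting every input on the same z-score scale is what makes the cross-input summaries (average, high, low) meaningful; those summaries and the per-input scores would then be the inputs to the final boosted model.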
  24. 24. AGENDA 1. Background: Explaining why boosting performance is relevant. 2. Strategy: How to design a strategy for boosting performance. 3. Features: How to use Feature Engineering to boost model performance. 4. Bonus Round: A collection of free resources for boosting model performance. 5. Questions: Time for questions from the audience.
  25. 25. Get in Touch. See you soon.... USA 1-443-810-8066 Scott.Clendaniel@MktgSciences.com MktgSciences 3719 Yolando Road Baltimore, MD 21218
  26. 26. MODEL STRATEGY TIP 1: Cross-validate everywhere. Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
  27. 27. MODEL STRATEGY TIP 1: Cross-validate everywhere. Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
  28. 28. THANK YOU...
  29. 29. BOOSTING MODEL PERFORMANCE Appendix
  30. 30. DEFINITIONS: performance (noun): “the manner in which or the efficiency with which something reacts or fulfills its intended purpose.”
  31. 31. PERFORMANCE IS BEING MORE CLOSELY MEASURED. Moving Away From… Moving Toward…
  32. 32. PERFORMANCE WILL DETERMINE COMPENSATION: Like it or not, Data Science compensation will become more closely tied to model performance.