
Before Kaggle : from a business goal to a Machine Learning problem

Many people think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected performance in production.

This is a presentation by Pierre Gutierrez, data scientist at Dataiku.



  1. Before Kaggle: From a business goal to a ML problem. Pierre Gutierrez, @prrgutierrez
  2. What is Kaggle?
• A data science competition platform (there are others: DataScience.net in France)
• 332,000 data scientists
• Today: 192 competitions (18 active), plus 516 "in class" competitions (12 active)
• Prestigious clients: Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
  3. Why should I join?
• Prize pool? $325,000 to be won as of August 31st. Good luck with that! Not a good hourly wage (today: 192 competitions, 18 active).
• Understanding:
  • Lots of datasets on approximately every data science topic
  • Lots of winners' solutions, tips and tricks, etc.
  • Lots of "beat the benchmark" posts for beginners
• I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
  4. What is a Data Science Competition?
Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model on the train features and predict the test set labels.
• Your prediction is evaluated using a specific metric.
• The best prediction wins.
  5. What is a Data Science Competition? (annotated)
The same slide, with the questions Kaggle answers for you:
• Why AUC? F1 score? Log loss? Could that depend on my train/test split?
• Where do the labels come from? Do you always have some?
• Why is the split this way? Random? Time-based?
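To make that loop concrete, here is a minimal sketch with scikit-learn on synthetic data (the model and the AUC metric are illustrative choices, not prescribed by the slides):

```python
# Train on labeled data, predict a held-out test set, evaluate with one metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)  # learn on the train set

# Predict the test labels and score the prediction with a specific metric.
test_scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, test_scores))
```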
  6. What is a Data Science Competition?
What you don't learn on Kaggle (or in class?):
• How to model a business question as a ML problem.
• How to manage/create labels (proxies, missing labels, …).
• How to evaluate a model:
  • How to choose your metric
  • How to design your train/test split
  • How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions:
• How to design your cross-validation scheme (and not overfit)
• How to create relevant features
• Hacks and tricks (leak exploitation :))
  7. Scikit-learn cheat sheet
  8. Christophe Bourguignat's DS cheat sheet (@chris_bour), annotated with today's focus.
  9. Introduction
• Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
  10. Introduction (annotated)
• Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Annotations: "The newcomer disillusion", "The production bad surprise", "The business obfuscation".
  11. Who I am
• Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities, …)
• (More than) occasional Kaggle competitor
• Twitter: @prrgutierrez
  13. Fraud Detection
• Fraud is everywhere: e-business, telco, Medicare, …
• Easily defined as a classification problem
• Is the target well defined?
  • E-business: yes, with a lag
  • Elsewhere: needs checks; labels are expensive
  14. Churn
• Wikipedia: "Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time"
• = customers leaving
  15. Two types of Churn
• Subscription models:
  • Telco
  • E-gaming (WoW)
  • Ex: Coyote: a 1-year subscription, so you know exactly when someone leaves
• Non-subscription models:
  • E-business (Amazon, Price Minister, Vente Privée)
  • E-gaming (Candy Crush, free MMORPGs)
  • You have to approximate when someone leaves: Candy Crush: days/weeks; MMORPG: 2 months (holidays); Price Minister: months
  16. Predictive Maintenance
• Predict whether a vehicle / machine / part is going to fail
• Classification problem: given a future horizon and a failure type, will this happen for a given vehicle? Two parameters describe the target.
• If you vary the target a lot, you will find spurious correlations
• Choose it directly from the exact business need
  17. Recommender System
• Target is "will like" or "will buy"
• Target is often a proxy for real interest (implicit feedback)
  18. Summary on Labels
• Can you model the problem as a ML problem?
  • Ex: predictive maintenance
  • Ask the right question from a business point of view, not the one you already know how to answer.
• Is your target a proxy?
  • Ex: recommender systems; may need bandit algorithms
• Is it easy to get labels?
  • Ex: fraud detection; labels can be expensive
  • Mechanical Turk can be the answer
  19. Train / test split
• Random split: just like in school
• When and why?
  • When each line is independent from the rest (not that common!): image or document classification, sentiment analysis ("but 'aha' is the new 'lol'")
  • When you want to iterate or benchmark quickly: "is it even possible?"
  • When you want to sell something to your boss
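Under that independence assumption, the random split is one call in scikit-learn; a minimal sketch with stand-in data:

```python
# Random shuffled split: only valid when rows are independent of each other.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # stand-in feature matrix
y = rng.integers(0, 2, size=1000)    # stand-in binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,       # keep class proportions equal across train and test
    random_state=42,  # reproducible shuffle
)
```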
  20. Train / test split
• Column- / group-based split. Ex: the Caterpillar challenge:
  • Predict a price for each tube id
  • Tube ids in train and test are different
• Objective: being able to generalize to other tubes!
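One way to reproduce that kind of split is scikit-learn's GroupShuffleSplit; a hedged sketch where `tube_ids` is a hypothetical array holding one group label per row:

```python
# Group-based split: no tube id may appear in both train and test.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)                   # e.g. a price to predict
tube_ids = rng.integers(0, 100, size=1000)  # 100 distinct tubes

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=tube_ids))

# Sanity check: the two sides share no tube id.
assert not set(tube_ids[train_idx]) & set(tube_ids[test_idx])
```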
  21. Train / test split
• Time-based split: simply separate train and test on a time variable
• When and why?
  • When you want a model that "predicts the future"
  • When things evolve with time (most problems!)
  • Examples: ad click prediction, churn prediction, e-business fraud detection, predictive maintenance, …
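A minimal sketch of such a split in pandas; the `timestamp` column and the cutoff date are illustrative assumptions, not from the slides:

```python
# Time-based split: train on the past, test on the "future".
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=1000, freq="h"),
    "feature": range(1000),
    "label": [i % 2 for i in range(1000)],
})

cutoff = pd.Timestamp("2015-02-05")      # everything after this date is "future"
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]
```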
  22. Train / test split: Churn example
• Non-subscription example
• Target: 4 months without buying
• Features?
  23. Ex: Train-and-predict scheme
[Timeline diagram] Data before T - 4 months is used for feature generation; data from T - 4 months to T (present time) is used for target creation (activity during the last 4 months). Train the model using features and target; use the model to predict future churn.
  24. Ex: Train, Evaluation and Predict scheme
[Timeline diagram] Training: data before T - 8 months is used for feature generation; data from T - 8 months to T - 4 months is used for target creation (activity during those 4 months). Validation set: data before T - 4 months is used for feature generation; data from T - 4 months to T is used for target creation. Evaluate on the target of the validation set, then use the model to predict future churn.
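The scheme above can be written as one windowing function applied at different reference dates; a sketch under assumed column names (`customer_id`, `date`, `amount`) and synthetic transactions:

```python
# Build (features, churn target) snapshots anchored at a reference date:
# features come from activity before it, the target from the 4 months after it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tx = pd.DataFrame({
    "customer_id": rng.integers(0, 200, size=5000),
    "date": pd.Timestamp("2015-01-01")
            + pd.to_timedelta(rng.integers(0, 365, size=5000), unit="D"),
    "amount": rng.uniform(5, 100, size=5000),
})

def make_dataset(tx, ref_date):
    past = tx[tx["date"] < ref_date]
    future = tx[(tx["date"] >= ref_date)
                & (tx["date"] < ref_date + pd.DateOffset(months=4))]
    feats = past.groupby("customer_id")["amount"].agg(["count", "sum"])
    # Churn: the customer bought nothing in the 4 months after ref_date.
    feats["churn"] = (~feats.index.isin(future["customer_id"])).astype(int)
    return feats

T = pd.Timestamp("2016-01-01")                         # "present time"
train = make_dataset(tx, T - pd.DateOffset(months=8))  # fit the model here
valid = make_dataset(tx, T - pd.DateOffset(months=4))  # evaluate here
```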
  25. Train / test split
• More complex designs exist:
  • Graph sampling (fraud rings?)
  • Random sampling in a client's / machine's life
  • A mix of column-based and time-based, …
• The rule:
  1) What is the problem?
  2) To what would I like to generalize my model? The future? Other individuals? …
  3) => That gives the train/test split
  26. Ex: PHM Society (a fail example)
• Predictive maintenance problem
• Objective: predict failure in the next 3 days
• Metric is proportional to accuracy (and 0.57 is the best score!)
• Link to data: https://www.phmsociety.org/events/conference/phm/14/data-challenge
  27. Ex PHM Society: Failures
  28. Ex PHM Society: Usage
  29. Ex PHM Society: Part Replacements
  30. Ex PHM Society
• How to design the evaluation scheme?
• What is the probability that an asset fails in the next 3 days from now?
  -> classification problem, time-based split: but how do I create a train and a test set?
• Choose one date and evaluate what happens 3 days later?
  -> problem: not enough failures happening
• Choose several dates for each asset?
  -> beware of asset overfitting
• In the challenge: random selection of (asset, date) pairs in the future, plus oversampling of failures
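A hedged sketch of that sampling idea; the names, day indices, and oversampling factor are all illustrative assumptions:

```python
# Draw random (asset, date) evaluation points; oversample the rare failures.
import random

random.seed(0)
assets = [f"asset_{i}" for i in range(50)]
days = list(range(100))                                 # day indices
failures = {a: random.choice(days) for a in random.sample(assets, 10)}

def label(asset, day, horizon=3):
    """1 if the asset fails within `horizon` days after `day`."""
    fail_day = failures.get(asset)
    return int(fail_day is not None and day < fail_day <= day + horizon)

samples = [(a, d) for a in assets for d in random.sample(days, 5)]
positives = [s for s in samples if label(*s)]
samples += positives * 10                               # crude oversampling
```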
  31. Ex PHM Society: basic feature engineering
  32. Ex PHM Society: random sampling
This is decent! "With some more work I could have a model that beats randomness enough to be useful."
  33. Ex PHM Society: time-based split. Wait, what?
  34. Ex PHM Society: TIME LEAK
  35. Ex PHM Society: TIME LEAK (annotated: tree cuts)
  36. Feature Engineering
• Beware of the distribution of your features!
• Is there a time dependency?
  • Ex: counts, sums, … that only increase with time
  • -> Calculate counts and sums rescaled by time, or in moving windows, instead
  • Found in churn, fraud detection, ad click prediction, …
• A categorical-variable dependency?
  • Ex: email flag in fraud detection
• Is there a network dependency?
  • Ex: fraud / bot detection (network features can be useful but leaky)
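As a concrete illustration of the fix suggested above (rescale by time, or use moving windows), a pandas sketch with assumed column names:

```python
# Replace ever-growing cumulative features with time-safe variants.
import pandas as pd

events = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=365, freq="D"),
    "clicks": range(365),
})

# Leaky: a cumulative count can only increase with time, so a model can use
# it to tell "early" rows from "late" rows.
events["clicks_cumulative"] = events["clicks"].cumsum()

# Safer: the same quantity in a 30-day moving window, or rescaled by age.
events["clicks_30d"] = events["clicks"].rolling(window=30, min_periods=1).sum()
elapsed_days = (events["date"] - events["date"].min()).dt.days + 1
events["clicks_per_day"] = events["clicks_cumulative"] / elapsed_days
```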
  37. Feature Engineering
• Final trick:
  - Stack train and test and add an is_test boolean
  - Try to predict is_test
  - Check whether the model is able to predict it
  - If so:
    - check the feature importances
    - remove / modify the feature and iterate
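This trick is commonly known as adversarial validation; a minimal sketch on synthetic data (the model choice and the drifted `f2` feature are illustrative):

```python
# Stack train and test, add is_test, and see if a classifier can tell them apart.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
train = pd.DataFrame({"f1": rng.normal(0, 1, 1000), "f2": rng.normal(0, 1, 1000)})
test = pd.DataFrame({"f1": rng.normal(0, 1, 500), "f2": rng.normal(3, 1, 500)})

stacked = pd.concat([train.assign(is_test=0), test.assign(is_test=1)])
X, y = stacked.drop(columns="is_test"), stacked["is_test"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("AUC:", cross_val_score(clf, X, y, scoring="roc_auc").mean())  # ~1.0 => leak

# If the AUC is well above 0.5, inspect importances, then remove or modify
# the offending features and iterate.
for name, imp in zip(X.columns, clf.fit(X, y).feature_importances_):
    print(name, round(imp, 3))
```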
  38. Feature Engineering
• Final trick, back to the PHM example: huge time leak!
  39. Evaluation metric: Classification
• "Threshold dependent":
  • Accuracy
  • Precision and Recall
  • F1 score
• "Threshold independent":
  • AUC
  • Log loss
  • Others (mean average precision)…
  40. Evaluation metric: Classification (annotated)
The same list, plus custom metrics, with annotations:
• Accuracy: not good if the target is unbalanced
• F1 score: an accuracy alternative
• AUC: when you have an ordering problem
• Log loss: when you are going stochastic
• Custom metrics: when you need to stick to the business
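The split between the two families is visible directly in scikit-learn: threshold-dependent metrics consume hard predictions, threshold-independent ones consume scores. A small illustrative example:

```python
# Threshold-dependent metrics need y_pred; threshold-independent ones need scores.
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model probabilities
y_pred = [int(s >= 0.5) for s in y_score]   # an arbitrary 0.5 threshold

print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score), log_loss(y_true, y_score))
```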
  41. Evaluation metric: Classification
• Custom metrics: cost-based
• Ex, fraud:
  • Mean loss of $50 per missed fraud (FN)
  • Mean loss of $20 per wrongly cancelled transaction (FP)
• The F1 score is often used in papers; in practice, you often have a business cost
[Diagram: confusion matrix with TP, FP, FN, TN]
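With the slide's numbers, the cost-based metric is a few lines; `business_cost` is a hypothetical helper name:

```python
# Cost-based metric: $50 per missed fraud (FN), $20 per wrongly cancelled
# transaction (FP), using the example costs from the slide.
import numpy as np
from sklearn.metrics import confusion_matrix

def business_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])
print(business_cost(y_true, y_pred))  # 1 FN * $50 + 1 FP * $20 = 70.0
```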
  42. Evaluation metric: Classification
• Custom metrics, fraud example 1:
  • "I have fraudsters on my e-business website"
  • I generate a score for each transaction
  • Transactions scoring higher than a threshold are handled manually
  • One full-time person can deal with 100 transactions per day
  • The rest is automatically accepted
-> AUC is not bad
-> Recall in the top 100 transactions per day
-> Total money blocked in the top 100 transactions per day
In practice AUC is more stable, but the money metric can also be used for communication.
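The "recall in 100 transactions per day" idea can be scored as recall at k; a sketch with a hypothetical `recall_at_k` helper:

```python
# Of all true frauds, how many land in the k highest-scored transactions
# that the analyst has time to review?
import numpy as np

def recall_at_k(y_true, y_score, k=100):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    top_k = np.argsort(y_score)[::-1][:k]   # the k transactions reviewed
    return y_true[top_k].sum() / max(y_true.sum(), 1)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)      # stand-in fraud labels
y_score = rng.random(1000)                  # stand-in model scores
print(recall_at_k(y_true, y_score, k=100))
```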
  43. Evaluation metric: Classification
• Custom metrics, fraud example 2:
  • "I have fraudsters on my e-business website"
  • I generate a score for each transaction
  • All transactions scoring higher than a threshold are automatically blocked
-> AUC is not bad, but it doesn't give the threshold value
-> F1 score?
-> Cost-based is better
  44. Evaluation metric: Classification. My cheat sheet:

Metric        | Optimized by ML models? | Threshold dependent? | Application example
Accuracy      | yes                     | yes                  | image classification, NLP, …
F1 score      | no                      | yes                  | papers?
AUC           | no                      | no                   | fraud detection, churn, healthcare, …
Log loss      | yes                     | no                   | ad click prediction
Custom metric | no                      | ?                    | all?
  45. Conclusion
• The business question dictates the evaluation scheme!
  • test set design
  • evaluation metric
  • indirectly impacts feature engineering
  • indirectly impacts label quality
• Think (not too much) before coding
• Don't try to optimize the wrong problem!
  46. Thank you for your attention!
