Making Machine Learning Work in Practice - StampedeCon 2014

At StampedeCon 2014, Kilian Q. Weinberger (Washington University) presented "Making Machine Learning work in Practice."

In the talk, Kilian goes over common pitfalls and shares tricks for making machine learning work.

Transcript

  • 1. Machine Learning in Practice: common pitfalls and debugging tricks. Kilian Weinberger, Associate Professor (thanks to Rob Schapire, Andrew Ng)
  • 2. What is Machine Learning
  • 3. Traditional Computer Science. Traditional CS: Data + Program → Computer → Output.
  • 4. Machine Learning. Traditional CS: Data + Program → Computer → Output. Machine Learning: Data + Output → Computer → Program.
  • 5. Machine Learning. Traditional CS: Data + Program → Computer → Output. Machine Learning: Data + Output → Computer → Program.
  • 6. Machine Learning. Training: Train Data + Labels → Computer → Program. Testing: Data + Program → Computer → Output.
  • 7. Example: Spam Filter
  • 8. Soon: Autonomous Cars
  • 9. Machine Learning Setup
  • 10. Goal: Idea → Data → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock’n’Roll!
  • 11. 1. Learning Problem What is my relevant data? What am I trying to learn? Can I obtain trustworthy supervision? QUIZ: What would be some answers for email spam filtering?
  • 12. Example: What is my data? Email content / meta data. What am I trying to learn? The user’s spam/ham labels. Can I obtain trustworthy supervision? Employees?
  • 13. 2. Train / Test split. How much data do I need? (More is more.) How do you split into train / test? (Always by time! Otherwise: random.) Training data should be just like test data (i.i.d.)! [Timeline: Train Data | Test Data | Real-World Data ??]
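A minimal sketch of the by-time split described on slide 13, assuming a pandas DataFrame with `timestamp` and `label` columns and an 80/20 cutoff; the toy data, column names, and cutoff are illustrative, not from the talk.

```python
import pandas as pd

# Toy email log; on real data this would come from your own store.
emails = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=10, freq="D"),
    "label": [0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})

# Split by time: train on the past, test on the most recent data,
# which mimics how the model will actually be used after deployment.
emails = emails.sort_values("timestamp")
cutoff = int(0.8 * len(emails))               # assumed 80/20 split
train, test = emails.iloc[:cutoff], emails.iloc[cutoff:]
print(len(train), len(test))
```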
  • 14. Data set overfitting! By evaluating on the same data set over and over, you will overfit. The amount of overfitting is bounded by roughly O(sqrt(log(#trials) / #examples)). Kishore’s rule of thumb: subtract 1% accuracy for every time you have tested on a data set. Ideally: create a second train / test split, use it for the many tuning runs, and touch the final test set only once.
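To get a feel for the bound on slide 14, here is a toy calculation; the 10,000 examples and 50 evaluation runs are made-up numbers.

```python
import math

n_examples, n_trials = 10_000, 50
# O(sqrt(log(#trials) / #examples)) -- how much repeated evaluation can inflate results.
bound = math.sqrt(math.log(n_trials) / n_examples)
print(f"potential accuracy inflation on the order of {bound:.3f}")
```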
  • 15. 3. Data Representation. Each data point (an email) is mapped to a feature vector, e.g. word indicators (“viagra”, “hello”, “cheap”, “$”, “Microsoft”, ...), meta features (sender in address book?, IP known?, sent time in seconds since 1/1/1970, email size, attachment size, ...), and aggregate statistics (percentile in email length, percentile in token likelihood, ...).
  • 16. Data Representation (continued). The feature vector combines bag-of-word features (sparse), meta features (sparse / dense), and aggregate statistics (dense, real-valued). Pitfall #1: Aggregate statistics should not be computed over test data!
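A sketch of Pitfall #1 for the “percentile in email length” feature: the aggregate statistic is computed from the training set only and then merely applied to test emails. The toy arrays and the scipy helper are assumptions for illustration.

```python
import numpy as np
from scipy import stats

train_lengths = np.array([120, 35, 980, 47, 260])   # toy training emails (token counts)
test_lengths = np.array([300, 15])                  # toy test emails

# Correct: rank test emails against the TRAINING distribution only.
test_pct = [stats.percentileofscore(train_lengths, x) for x in test_lengths]

# Wrong (Pitfall #1): pooling train+test lengths before computing percentiles
# would leak information about the test set into the features.
print(test_pct)
```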
  • 17. Pitfall #2: Feature scaling. 1. With linear classifiers / kernels, features should have similar scale (e.g. range [0,1]); rescale each feature as f_i ← (f_i + a_i) · b_i. 2. You must use the same scaling constants a_i, b_i for the test data! (Most likely the test data will not fall in a clean [0,1] interval.) 3. Dense features should be down-weighted when combined with sparse features. (Scale does not matter for decision trees.)
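A sketch of Pitfall #2 using scikit-learn’s MinMaxScaler, which applies exactly a per-feature shift-and-scale of the form f_i ← (f_i + a_i) · b_i; the toy matrices are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [3.0, 900.0], [2.0, 400.0]])  # toy dense features
X_test = np.array([[4.0, 1000.0]])

scaler = MinMaxScaler()                          # learns the per-feature shift/scale
X_train_scaled = scaler.fit_transform(X_train)   # fit on TRAIN only -> [0, 1]

# Apply the SAME constants to the test data -- never refit on test.
# Scaled test values may fall outside [0, 1]; that is expected.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```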
  • 18. Pitfall #3: Over-condensing of features. Features do not need to be semantically meaningful. Just add them: redundancy is (generally) not a problem; let the learning algorithm decide what’s useful! [Diagram: raw data vs. a hand-condensed feature vector.]
  • 19. Example: thought reading from fMRI scans. Nobody knows what the features are, but it works!!! [Mitchell et al. 2008]
  • 20. 4. Training Signal • How reliable is my labeling source? (E.g. in web search editors agree 33% of the time.) • Does the signal have high coverage? • Is the signal derived independently of the features?! • Could the signal shift after deployment?
  • 21. Quiz: Spam filtering. (a) The spammer with IP e.v.i.l has sent 10M spam emails over the last 10 days; use all emails with this IP as spam examples → not diverse, and potentially puts the label in the data. (b) Use users’ spam / not-spam votes as the signal → too noisy. (c) Use WUSTL students’ spam / not-spam votes → low coverage.
  • 22. Example: Spam filtering. [Diagram: incoming email → spam filter → Inbox / Junk; the user gives feedback: SPAM / NOT-SPAM.]
  • 23. Example: Spam filtering. [Diagram: incoming email → old spam filter → Inbox / Junk; the user’s SPAM / NOT-SPAM feedback annotates emails for a new ML spam filter.] QUIZ: What is wrong with this setup?
  • 24. Example: Spam filtering. Problem: users only vote when the classifier is wrong, so the new filter learns to exactly invert the old classifier. Possible solution: occasionally let emails through the filter to avoid this bias.
  • 25. Example: Trusted votes. Goal: classify email votes as trusted / untrusted. Signal conjecture: [plot of “voted good” / “voted bad” votes over time, contrasting an evil spammer with the community].
  • 26. Searching for signal. The good news: we found that exact pattern A LOT!! [Same vote-over-time plot.]
  • 27. Searching for signal. The good news: we found that exact pattern A LOT!! The bad news: we found other patterns just as often. [Vote-over-time plots with different patterns.]
  • 28. Searching for signal. The good news: we found that exact pattern A LOT!! The bad news: we found other patterns just as often. Moral: given enough data you’ll find anything! You need to be very, very careful that you learn the right thing!
  • 29. 5. Learning Method • Classification / Regression / Ranking? • Do you want probabilities? • How sensitive is a model to label noise? • Do you have skewed classes / weighted examples? • Best off-the-shelf: Random Forests, Boosted Trees, SVM • Generally: Try out several algorithms
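A minimal sketch of the “try out several algorithms” advice, using the off-the-shelf learners named on slide 29; the synthetic data and hyper-parameters are placeholders, not recommendations from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for your real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosted trees": GradientBoostingClassifier(random_state=0),
    "svm (rbf)": SVC(),
}
for name, model in models.items():
    # Cross-validated accuracy as a quick, comparable yardstick per method.
    print(name, cross_val_score(model, X, y, cv=5).mean())
```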
  • 30. Method Complexity (KISS). Common pitfall: using too complicated a learning algorithm. ALWAYS try the simplest algorithm first!!! Move to more complex systems after the simple one works. Rule of diminishing returns!! (Scientific papers exaggerate the benefit of complex theory.) QUIZ: What would you use for spam?
  • 31. Ready-Made Packages Weka 3 http://www.cs.waikato.ac.nz/~ml/index.html Vowpal Wabbit (very large scale) http://hunch.net/~vw/ Machine Learning Open Source Software Project http://mloss.org/software MALLET: Machine Learning for Language Toolkit http://mallet.cs.umass.edu/index.php/Main_Page scikit-learn (Python) http://scikit-learn.org/stable/ Large-scale SVM: http://machinelearning.wustl.edu/pmwiki.php/Main/Wusvm SVMlin (very fast linear SVM) http://people.cs.uchicago.edu/~vikass/svmlin.html LIBSVM (powerful SVM implementation) http://www.csie.ntu.edu.tw/~cjlin/libsvm/ SVM Light http://svmlight.joachims.org/svm_struct.html
  • 32. Model Selection (parameter setting with cross validation). Do not trust default hyper-parameters. Use grid search / Bayesian optimization (most importantly: the learning rate!!). Split the training data into Train’ and Val, and pick the parameters that do best on Val. Bayesian optimization is usually better than grid search.
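A sketch of the Train → Train’ + Val loop from slide 32, here doing a simple grid search over the learning rate of an SGD-trained linear classifier; the classifier, grid values, and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
# Outer split (would be by time on real data), then carve Val out of Train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best = None
for lr in [1e-4, 1e-3, 1e-2, 1e-1]:              # grid over the learning rate
    model = SGDClassifier(learning_rate="constant", eta0=lr, random_state=0)
    model.fit(X_tr, y_tr)
    score = model.score(X_val, y_val)            # pick the best parameters on Val
    if best is None or score > best[0]:
        best = (score, lr)
print("best learning rate:", best[1])
```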
  • 33. 6. Experimental Setup. 1. Automate everything (one-button setup): pre-processing / training / testing / evaluation. Lets you reproduce results easily, with fewer errors!! 2. Parallelize your experiments.
  • 34. Quiz. T/F: Condensing features with domain expertise improves learning? FALSE. T/F: Feature scaling is irrelevant for boosted decision trees? TRUE. To avoid data set overfitting, benchmark on a second train/test data set. T/F: Ideally, derive your signal directly from the features? FALSE. T/F: You cannot create a train/test split when your data changes over time? FALSE. T/F: Always compute aggregate statistics over the entire corpus? FALSE.
  • 35. Debugging ML algorithms
  • 36. Debugging: Spam filtering You implemented logistic regression with regularization. Problem: Your test error is too high (12%)! QUIZ: What can you do to fix it?
  • 37. Fixing attempts: 1. Get more training data 2. Get more features 3. Select fewer features 4. Feature engineering (e.g. meta features, header information) 5. Run gradient descent longer 6. Use Newton’s Method for optimization 7. Change regularization 8. Use SVMs instead of logistic regression But: which one should we try out?
  • 38. Possible problems. Diagnostics: 1. Underfitting: training error almost as high as test error. 2. Overfitting: training error much lower than test error. 3. Wrong algorithm: other methods do better. 4. Optimizer: the loss function is not minimized.
  • 39. Underfitting / Overfitting
  • 40. Diagnostics: overfitting. [Learning curve: training and testing error vs. training set size, with the desired error marked.] Symptoms: the test error is still decreasing with more data; there is a large gap between train and test error. Remedies: get more data, do bagging, feature selection.
  • 41. Diagnostics: underfitting. [Learning curve: training and testing error vs. training set size, with the desired error marked.] Symptoms: even the training error is too high; there is only a small gap between train and test error. Remedies: add features, improve features, use a more powerful ML algorithm, (boosting).
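One way to produce the learning curves behind the diagnostics on slides 40-41, using scikit-learn; the random-forest model and synthetic data are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # Large train/test gap with a still-falling test error suggests overfitting;
    # both errors high with a small gap suggests underfitting.
    print(f"n={n:5d}  train error={1 - tr:.3f}  test error={1 - te:.3f}")
```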
  • 42. Problem: You are “too good” on your setup... [Plot: training, testing, desired, and online error vs. iterations.]
  • 43. Possible problems: Is the label included in the data set? Does the training set contain test data? Famous example in 2007: Caltech 101. [Bar chart: Caltech 101 test accuracy, 2005-2007.]
  • 44. [Caltech 101 example results, 2007 vs. 2009.]
  • 45. Problem: Online error > test error. [Plot: training, testing, desired, and online error vs. training set size.]
  • 46. Analytics. Suspicion: the online data is distributed differently. Construct a new binary classification problem: online vs. train+test. If you can learn this (error < 50%), you have a distribution problem!! (You do not need any labels for this!)
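A sketch of the origin-classification trick from slide 46: label each example by where it came from (train+test vs. online) and see whether a classifier can tell them apart. The Gaussian toy data, with a deliberate shift, is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_offline = rng.normal(0.0, 1.0, size=(1000, 10))   # toy train+test features
X_online = rng.normal(0.3, 1.0, size=(1000, 10))    # toy online features, deliberately shifted

# The "label" is simply the origin of each row -- no real labels are needed.
X = np.vstack([X_offline, X_online])
origin = np.array([0] * len(X_offline) + [1] * len(X_online))

acc = cross_val_score(RandomForestClassifier(random_state=0), X, origin, cv=5).mean()
print(f"origin-prediction accuracy: {acc:.2f}")     # well above 0.5 => distribution problem
```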
  • 47. Suspicion: temporal distribution drift. Compare a time-ordered train/test split (e.g. 12% error) against a shuffled split (e.g. 1% error). If E(shuffle) < E(train/test), then you have temporal distribution drift. Cures: retrain frequently / online learning.
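A sketch of the shuffle test from slide 47 on synthetic data whose decision boundary drifts over time; the data generator and logistic-regression model are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def split_error(X, y, shuffle):
    """Test error of a time-ordered (shuffle=False) vs. shuffled (shuffle=True) split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=shuffle,
                                              random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - model.score(X_te, y_te)

# Toy drifting data: the label's dependence rotates from feature 0 to feature 1 over time.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2000)
X = rng.normal(size=(2000, 2))
y = ((1 - t) * X[:, 0] + t * X[:, 1] > 0).astype(int)

print("time-ordered split error:", split_error(X, y, shuffle=False))
print("shuffled split error:    ", split_error(X, y, shuffle=True))
```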
  • 48. Final Quiz. Increasing your training set size increases the training error. Temporal drift can be detected by shuffling the training/test sets. Increasing your feature set size decreases the training error. T/F: More features always decrease the test error? False. T/F: A very low validation error always indicates you are doing well? False. When an algorithm overfits, there is a big gap between train and test error. T/F: Underfitting can be cured with more powerful learners? True. T/F: The test error is (almost) never below the training error? True.
  • 49. Summary. “Machine learning is only sexy when it works.” ML algorithms deserve a careful setup. Debugging ML is just like debugging any other code: 1. carefully rule out possible causes; 2. apply appropriate fixes.
  • 50. Resources Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp, in Orr, G. and Muller K. (Eds), Neural Networks: Tricks of the trade, Springer, 1998. Pattern Recognition and Machine Learning by Christopher M. Bishop Andrew Ng’s ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
