Machine Learning in Practice
Common pitfalls and debugging tricks

Kilian Weinberger, Associate Professor
(thanks to Rob Schapire and Andrew Ng)
What is Machine Learning?

Traditional CS:    Data + Program → Computer → Output
Machine Learning:  Data + Output → Computer → Program

Training: train data and labels go into the computer, which learns a program (the model).
Testing: new data goes into the learned program, which produces the output.
Example: Spam Filter
Soon: Autonomous Cars
Machine Learning Setup

Idea → Goal → Data → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock'n Roll!
1. Learning Problem

What is my relevant data?
What am I trying to learn?
Can I obtain trustworthy supervision?

QUIZ: What would be some answers for email spam filtering?
Example:

What is my data?                         Email content / meta data
What am I trying to learn?               User's spam/ham labels
Can I obtain trustworthy supervision?    Employees?
2. Train / Test Split

How much data do I need? (More is more.)
How do you split into train / test? (Always by time! Otherwise: random.)
Training data should be just like test data!! (i.i.d.)

[Diagram: the data is ordered along the time axis and split into Train Data and Test Data; the test set stands in for the unseen real-world data that follows.]
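As a rough illustration of the "always split by time" advice, here is a minimal pandas sketch; the DataFrame, column names, and data are made up for this example and are not from the slides.

    import pandas as pd

    # Toy email table; in practice this would be your real labeled data.
    emails = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2014-01-03", "2014-01-01", "2014-02-10", "2014-03-05", "2014-02-20"]),
        "text": ["hi", "cheap viagra", "meeting at 3", "win $$$", "lunch?"],
        "label": [0, 1, 0, 1, 0],
    })

    def chronological_split(df, time_col="timestamp", test_frac=0.2):
        """Sort by time and cut, so the test set lies strictly in the future."""
        df = df.sort_values(time_col)
        cut = int(len(df) * (1 - test_frac))
        return df.iloc[:cut], df.iloc[cut:]

    train_df, test_df = chronological_split(emails)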
Data Set Overfitting

By evaluating on the same data set over and over, you will overfit to it.

Overfitting is bounded by roughly:  O( √( log(#trials) / #examples ) )

Kishore's rule of thumb: subtract 1% accuracy for every time you have tested on a data set.

Ideally: create a second train / test split. Use one split for the many exploratory runs and touch the final split only once.

[Diagram: Train Data | Test Data | Real-World Data along the time axis; "many runs" on the development split, "one run!" on the final split.]
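To make the bound concrete, here is a small numeric sketch; the trial and example counts below are invented for illustration.

    import math

    def overfitting_bound(num_trials, num_examples):
        """Order-of-magnitude bound on how much you may have overfit the test
        set by evaluating on it num_trials times."""
        return math.sqrt(math.log(num_trials) / num_examples)

    # e.g. 20 evaluations on a 10,000-example test set:
    print(overfitting_bound(20, 10_000))   # ~0.017, i.e. on the order of 1-2% accuracy

At that scale the bound and Kishore's 1%-per-test rule of thumb land in the same ballpark.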
3. Data Representation

Each data point (email) is turned into a feature vector, e.g.:

    "viagra"                           0
    "hello"                            1
    "cheap"                            0
    "$"                                1
    "Microsoft"                        1
    ...
    Sender in address book?            0
    IP known?                          1
    Sent time in s since 1/1/1970      2342304222342
    Email size                         12323
    Attachment size                    0
    ...
    Percentile in email length         0.232
    Percentile in token likelihood     0.1
    ...
Data Representation (the same vector, grouped by feature type):

    bag-of-word features (sparse):       "viagra" 3, "hello" 1, "cheap" 0, "$" 1, "Microsoft" 1, ...
    meta features (sparse / dense):      Sender in address book? 0, IP known? 1, Sent time in s since 1/1/1970 2342304222342, Email size 12323, Attachment size 0, ...
    aggregate statistics (dense, real):  Percentile in email length 0.232, Percentile in token likelihood 0.1, ...

Pitfall #1: Aggregate statistics should not be computed over the test data!
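A minimal sketch of Pitfall #1: fit the aggregate statistic (here, an email-length percentile) on the training emails only, then apply the same statistic to test emails. The function and variable names are illustrative, not from the slides.

    import numpy as np

    def fit_length_percentiles(train_lengths):
        """Remember the sorted training lengths so new emails can be ranked against them."""
        return np.sort(np.asarray(train_lengths))

    def length_percentile(sorted_train_lengths, length):
        """Fraction of *training* emails shorter than this email (never uses test statistics)."""
        return np.searchsorted(sorted_train_lengths, length) / len(sorted_train_lengths)

    sorted_lengths = fit_length_percentiles([120, 340, 980, 1500, 12323])
    print(length_percentile(sorted_lengths, 2000))   # percentile of a new (test) email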
Pitfall #2: Feature Scaling

1. With linear classifiers / kernels, features should have a similar scale (e.g. range [0,1]).
2. You must use the same scaling constants for the test data!!! (Most likely the test data will not fall into a clean [0,1] interval.)
3. Dense features should be down-weighted when combined with sparse features.

(Scale does not matter for decision trees.)

    f_i ← (f_i + a_i) · b_i
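A minimal sketch of point 2, assuming simple min-max scaling: the constants a_i and b_i are fit on the training set and then re-applied, unchanged, to the test set, so test values may land outside [0,1]; that is expected.

    import numpy as np

    def fit_minmax(X_train):
        """Learn per-feature offsets and spans on the training data only."""
        lo = X_train.min(axis=0)
        span = X_train.max(axis=0) - lo
        span = np.where(span == 0, 1.0, span)   # guard against constant features
        return lo, span                         # a_i = -lo_i, b_i = 1/span_i in the slide's notation

    def apply_minmax(X, lo, span):
        return (X - lo) / span

    X_train = np.array([[10.0, 0.1], [50.0, 0.5], [90.0, 0.9]])
    X_test  = np.array([[120.0, 0.3]])              # outside the training range
    lo, span = fit_minmax(X_train)
    print(apply_minmax(X_train, lo, span))          # mapped into [0, 1]
    print(apply_minmax(X_test, lo, span))           # first feature > 1, as warned above

(scikit-learn's MinMaxScaler does the same bookkeeping via fit on train and transform on test.)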
Pitfall #3: Over-Condensing of Features

Features do not need to be semantically meaningful.
Just add them: redundancy is (generally) not a problem.
Let the learning algorithm decide what's useful!

[Diagram: the long raw feature vector (word counts, meta features, aggregate statistics) versus a short hand-condensed feature vector; keep the raw features rather than condensing them yourself.]
Example: Thought Reading

fMRI scans as input: nobody knows what the features are, but it works!!! [Mitchell et al. 2008]
4. Training Signal

• How reliable is my labeling source? (E.g. in web search, editors agree 33% of the time.)
• Does the signal have high coverage?
• Is the signal derived independently of the features?!
• Could the signal shift after deployment?
Quiz: Spam Filtering (rate these candidate training signals)

1. "The spammer with IP e.v.i.l has sent 10M spam emails over the last 10 days": use all emails with this IP as spam examples.
   → not diverse, and the label is potentially in the data (the IP itself is a feature).
2. Use users' spam / not-spam votes as the signal.
   → too noisy.
3. Use WUSTL students' spam / not-spam votes.
   → low coverage.
Example: Spam Filtering

[Diagram: incoming email → spam filter → Inbox or Junk; the user provides feedback: SPAM / NOT-SPAM.]
Example: Spam Filtering (proposed new setup)

[Diagram: incoming email → old spam filter → Inbox or Junk; the old filter annotates each email, and a new ML spam filter is trained on the user's SPAM / NOT-SPAM feedback.]

QUIZ: What is wrong with this setup?
Example: Spam Filtering (what goes wrong)

[Same diagram: the old filter annotates emails, the new ML filter trains on user feedback.]

Problem: users only vote when the classifier is wrong, so the new filter learns to exactly invert the old classifier.
Possible solution: occasionally let emails through the filter to avoid this bias.
Example: Trusted Votes

Goal: classify email votes as trusted / untrusted.

Signal conjecture: [Plot of votes over time, showing a "voted bad" region followed by a "voted good" region attributed to an evil spammer community.]
Searching for Signal

The good news: we found that exact pattern A LOT!!
The bad news: we found other patterns just as often ("good" then "bad", "bad" then "good" then "bad" again, and so on).

[Plots of votes over time showing the various patterns.]

Moral: Given enough data you'll find anything!
You need to be very, very careful that you learn the right thing!
5. Learning Method
• Classification / Regression / Ranking?
• Do you want probabilities?
• How sensitive is a model to label noise?
• Do you have skewed classes / weighted examples?
• Best off-the-shelf: Random Forests, Boosted Trees, SVM
• Generally: Try out several algorithms
Method Complexity (KISS)

Common pitfall: using too complicated a learning algorithm.
ALWAYS try the simplest algorithm first!!!
Move to more complex systems only after the simple one works.
Rule of diminishing returns!!
(Scientific papers exaggerate the benefit of complex theory.)

QUIZ: What would you use for spam?
Ready-Made Packages
Weka 3
http://www.cs.waikato.ac.nz/~ml/index.html
Vowpal Wabbit (very large scale)
http://hunch.net/~vw/
Machine Learning Open Software Project
http://mloss.org/software
MALLET: Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/index.php/Main_Page
scikit-learn (Python)
http://scikit-learn.org/stable/
Large-scale SVM:
http://machinelearning.wustl.edu/pmwiki.php/Main/Wusvm
SVM Lin (very fast linear SVM)
http://people.cs.uchicago.edu/~vikass/svmlin.html
LIBSVM (powerful SVM implementation)
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM Light
http://svmlight.joachims.org/svm_struct.html
Model Selection
(parameter setting with cross-validation)

Do not trust default hyper-parameters; most importantly, the learning rate!!
Split Train into Train' and Val, and pick the hyper-parameters that do best on Val.
Tune with grid search or Bayesian optimization (B.O. is usually better than grid search).
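A minimal sketch of the Train / Train' / Val idea, using scikit-learn (one of the packages listed above); the synthetic data and the choice of logistic regression with a C-grid are placeholders for your real problem.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic data stands in for real features and labels.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}        # never trust the default hyper-parameters
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
    search.fit(X_trainval, y_trainval)                # cv splits Train into Train' / Val internally

    print("best parameters:", search.best_params_)
    print("test accuracy:  ", search.score(X_test, y_test))   # touch the test split only once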
6. Experimental Setup
1. Automate everything (one button setup)
• pre-processing / training / testing / evaluation
• Lets you reproduce results easily
• Fewer errors!!
2. Parallelize your experiments
Quiz

T/F: Condensing features with domain expertise improves learning. FALSE
T/F: Feature scaling is irrelevant for boosted decision trees. TRUE
To avoid dataset overfitting, benchmark on a second train/test data set.
T/F: Ideally, derive your signal directly from the features. FALSE
T/F: You cannot create a train/test split when your data changes over time. FALSE
T/F: Always compute aggregate statistics over the entire corpus. FALSE
Debugging ML algorithms
Debugging: Spam Filtering

You implemented logistic regression with regularization.
Problem: your test error is too high (12%)!

QUIZ: What can you do to fix it?
Fixing attempts:
1. Get more training data
2. Get more features
3. Select fewer features
4. Feature engineering (e.g. meta features, header information)
5. Run gradient descent longer
6. Use Newton’s Method for optimization
7. Change regularization
8. Use SVMs instead of logistic regression
But: which one should we try out?
Possible Problems

Diagnostics:
1. Underfitting: training error almost as high as test error
2. Overfitting: training error much lower than test error
3. Wrong algorithm: other methods do better
4. Optimizer: the loss function is not minimized
Underfitting / Overfitting

Diagnostics: overfitting
[Learning-curve plot: error vs. training set size, showing training error, testing error and the desired error level.]
• test error still decreasing with more data
• large gap between train and test error

Remedies:
- Get more data
- Do bagging
- Feature selection
Diagnostics: underfitting
[Learning-curve plot: error vs. training set size, showing training error, testing error and the desired error level.]
• even the training error is too high
• small gap between train and test error

Remedies:
- Add features
- Improve features
- Use a more powerful ML algorithm
- (Boosting)
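A sketch of how such learning curves can be computed in practice, again with scikit-learn and synthetic data as stand-ins; real features and a real estimator would replace both.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    train_err = 1 - train_scores.mean(axis=1)
    val_err = 1 - val_scores.mean(axis=1)
    for n, tr, va in zip(sizes, train_err, val_err):
        print(f"n={n:5d}  train error={tr:.3f}  validation error={va:.3f}")
    # A large, shrinking gap suggests overfitting; both errors high suggests underfitting.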
Problem: You are "too good" on your setup ...

[Plot: error vs. iterations; training and testing error drop below the desired error, but the online error stays high.]
Possible Problems

Is the label included in the data set?
Does the training set contain test data?

Famous example in 2007: Caltech 101.
[Chart: reported Caltech 101 test accuracy by year, 2005-2009, on a 0-90% scale.]
Problem: Online Error > Test Error

[Learning-curve plot: training and testing error are low, but the online error stays well above the desired error.]
Analytics:

Suspicion: the online data is distributed differently.
Construct a new binary classification problem: online vs. train+test.
If you can learn this (error < 50%), you have a distribution problem!!
(You do not need any labels for this!)
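A minimal sketch of this check; the two feature matrices are generated synthetically here (with a deliberate shift) and would be your real train+test and online feature matrices in practice.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X_traintest = rng.normal(0.0, 1.0, size=(1000, 10))
    X_online    = rng.normal(0.3, 1.0, size=(1000, 10))   # deliberately shifted

    # Label the source of each example and see whether a classifier can tell them apart.
    X = np.vstack([X_traintest, X_online])
    y = np.concatenate([np.zeros(len(X_traintest)), np.ones(len(X_online))])

    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"distinguishing accuracy: {acc:.2f}")   # clearly above 0.5 means a distribution problem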
Suspicion: Temporal Distribution Drift

Split by time:   Train | Test  → 12% error
Shuffled split:  Train | Test  →  1% error

If E(shuffle) < E(train/test), then you have temporal distribution drift.
Cures: retrain frequently / online learning.
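A sketch of this comparison on a synthetic time-ordered stream whose decision boundary drifts; the data generation is invented purely to illustrate the effect.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    drift = np.linspace(0, 4, n)                       # "time" axis
    x1 = rng.normal(drift, 1.0)                        # feature whose distribution drifts
    X = np.column_stack([x1, rng.normal(0, 1, n)])
    y = (x1 > drift).astype(int)                       # the boundary moves with time

    def split_error(idx):
        cut = int(0.8 * n)
        tr, te = idx[:cut], idx[cut:]
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        return 1 - clf.score(X[te], y[te])

    print("error, split by time:  ", round(split_error(np.arange(n)), 3))
    print("error, shuffled split: ", round(split_error(rng.permutation(n)), 3))
    # A much higher error on the time split indicates temporal drift.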
Final Quiz
Increasing your training set size increases the training error.
Temporal drift can be detected through shuffling the training/test sets.
Increasing your feature set size decreases the training error.
T/F: More features always decrease the test error? False
T/F: Very low validation error always indicates you are doing well. False
When an algorithm overfits there is a big gap between train and test error.
T/F: Underfitting can be cured with more powerful learners. True
T/F: The test error is (almost) never below the training error. True
Summary
“Machine learning is only sexy when it works.”
ML algorithms deserve a careful setup
Debugging is just like any other code
1. Carefully rule out possible causes
2. Apply appropriate fixes
Resources

Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp. In Orr, G. and Muller, K. (Eds), Neural Networks: Tricks of the Trade, Springer, 1998.
Pattern Recognition and Machine Learning by Christopher M. Bishop
Andrew Ng's ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
