Debugging Machine Learning.
Mostly for profit but with a bit of fun too!
Michał Łopuszyński
PyData Warsaw, 19.10.2017
About me
• In my previous professional life, I was a theoretical physicist. I got a PhD in solid state physics
• For the last 5 years, I have worked as a Data Scientist at ICM, University of Warsaw
Agenda
• 4 failure modes of ML systems
• 9 simple hints on what to do
Hey, my ML system does not work at all...
Hint #1
Check your code
AKA it is engineering, stupid!
Write tests
• Do not strive for 100% coverage; partial coverage is infinitely better than none!
• In Python, doctests are your friends
• "Hidden" benefits of tests
  • Better code structure
  • Executable documentation
• Test the fragile parts first
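A minimal sketch of the doctest idea, not taken from the talk: the function `min_max_scale` is a hypothetical helper invented for illustration. The examples in its docstring serve both as documentation and as tests, and the constant-input case stands in for a "fragile part" worth testing first.

```python
import doctest


def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1].

    >>> min_max_scale([1, 2, 3])
    [0.0, 0.5, 1.0]

    A constant column is a fragile case worth pinning down:

    >>> min_max_scale([5, 5])
    [0.0, 0.0]
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero for constant input.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Runs every example embedded in the docstrings of this module.
    print(doctest.testmod())
```

Running the file directly, or `python -m doctest file.py`, executes the docstring examples and reports any mismatch.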
Test your code with Monte Carlo / synthetic data
• Try to generate a trivial case for your ML system
• This tests the whole pipeline (transforming/training) and lets you explore your system's performance under unusual conditions
• If you have a generative model, prepare Monte Carlo data from the assumed distributions
• Perturb the original data by generating the output with a known and learnable prescription
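The synthetic-data idea can be sketched as follows, under loud assumptions: the "ML system" here is just a hand-written least-squares line fit (`fit_line`), and the "known and learnable prescription" is a straight line with a little Gaussian noise. All names and parameter values are hypothetical. If the fit cannot recover the parameters it was fed, the bug is in the code, not in the data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_synthetic(n, slope, intercept, noise_sd, seed=0):
    """Generate data from a known, learnable prescription."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def test_recovers_known_parameters():
    xs, ys = make_synthetic(n=2000, slope=3.0, intercept=-1.0, noise_sd=0.1)
    slope, intercept = fit_line(xs, ys)
    # Tolerances are far looser than the expected statistical error,
    # so a failure points to a bug rather than to bad luck.
    assert abs(slope - 3.0) < 0.05
    assert abs(intercept + 1.0) < 0.05


if __name__ == "__main__":
    test_recovers_known_parameters()
    print("pipeline recovers the known prescription")
```

The same pattern extends to unusual conditions: rerun the test with heavy noise, few samples, or a constant feature, and watch how the system degrades.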
Code style
• Single responsibility principle
• Do not repeat yourself (DRY!)
• Have and apply a coding standard (PEP8)
• Consider using a linter
• Short functions and methods
https://xkcd.com/844/
• pycodestyle checker (formerly pep8)
• yapf formatter
• pylint, pyflakes, pychecker
Naming
• Minimal requirement: be consistent!
• Interesting software development books offer chapters on naming
• Freely available chapter on names
Hint #2
Check your data
Data quality audits are difficult
Happy families are all alike; every unhappy family
is unhappy in its own way.
Leo Tolstoy
Like families, tidy datasets are all alike but every
messy dataset is messy in its own way.
Hadley Wickham
H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10
Images credit - Wikipedia
Data quality
• Beware, your data providers usually overestimate the data quality
• Think of outliers and missing values (and how they are represented)
• Understand your data
• Do exploratory data analysis
• Visualize, visualize, visualize
• Talk to the domain expert
• Is your data correct, complete, coherent, stationary (seasonality!), deduplicated, up-to-date, representative (unbiased)?
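Two of the checks above (how missing values are represented, and deduplication) can be sketched in a few lines. This is an illustration only: the set `MISSING_MARKERS`, the function `audit`, and the sample records are all hypothetical, and a real audit would be tailored to the dataset at hand.

```python
from collections import Counter

# Hypothetical markers that data providers sometimes use for "missing".
MISSING_MARKERS = {None, "", "NA", "N/A", "null", -999}


def audit(rows, key_field):
    """Return simple quality counts for a list of dict records."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value in MISSING_MARKERS:
                missing[field] += 1
    # Keys that occur more than once hint at duplicated records.
    key_counts = Counter(row[key_field] for row in rows)
    duplicated = sorted(k for k, c in key_counts.items() if c > 1)
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicated_keys": duplicated,
    }


rows = [
    {"id": 1, "age": 34, "city": "Warsaw"},
    {"id": 2, "age": -999, "city": "NA"},
    {"id": 2, "age": 41, "city": ""},
    {"id": 3, "age": None, "city": "Krakow"},
]
report = audit(rows, "id")
print(report)
```

Note that the sentinel `-999` would silently distort any mean or model fit if it were treated as a real age, which is why the representation of missing values deserves an explicit check.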
OK, my ML system works, but I think it should perform better...
Hint #3
Examine your features
Features
• Features make a difference!
• Be creative with your features
• Try meaningful transformations, combinations (products/ratios), decorrelation...
• Think of good representations for non-tabular data (text, time-series)
• Make a conscious decision about missing values
• Understand what features are important for your model
• Use ML models offering feature ranking
• Use feature selection methods
[Figure: data table with columns ID, F1, F2, F3]
Hint #4
Examine your data points
Data points
• Find difficult data points (DDPs)!
• DDP = notoriously misclassified (or high-error) cases in your cross-validation loop for a large variety of models
• Examine DDPs, understand them!
• In the easiest case, remove DDPs from the dataset (think outliers, mislabeled examples)
[Figure: data table with rows P1 to P5]
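A toy sketch of the DDP search, under stated assumptions: the "large variety of models" is reduced to k-nearest-neighbour classifiers with three values of k, the cross-validation loop is leave-one-out, and the one-dimensional dataset (with one deliberately mislabeled point at x = 0.25) is invented for illustration. In practice you would plug in genuinely different model families.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def difficult_points(data, ks=(1, 3, 5)):
    """Count, per point, how many models misclassify it in leave-one-out CV."""
    errors = [0] * len(data)
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out point i
        for k in ks:
            if knn_predict(train, x, k) != y:
                errors[i] += 1
    return errors


# Class "a" lives near 0.0-0.4, class "b" near 1.0-1.4.
# The point (0.25, "b") is a deliberately planted label error.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.3, "a"), (0.4, "a"),
        (0.25, "b"),
        (1.0, "b"), (1.1, "b"), (1.2, "b"), (1.3, "b"), (1.4, "b")]

errors = difficult_points(data)
for (x, y), e in zip(data, errors):
    print(f"x={x:4.2f} label={y} misclassified by {e} of 3 models")
```

Points that every model gets wrong are the candidates to examine first. Points that only one model misses (here, the neighbours of the planted error under k = 1) are usually a property of that model rather than of the data.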
Influence functions
Best Paper Award, ICML 2017
Data points
• Get more data!
[Figure: data table with rows P1 to P5]
• Trick 1. Extend your set with artificial data, e.g., data augmentation in image processing, the SMOTE algorithm for imbalanced datasets
• Good performance booster, rarely applicable
• Trick 2. Automatically generate a noisily labeled data set using heuristics, e.g., distant supervision in NLP (requires unlabeled data!)
• Trick 3. Semi-supervised learning methods: self-training and co-training (requires unlabeled data!)
Hint #5
Examine your model
Why does my model predict what it predicts? (philosophical slide)
• How do you answer "why" questions?
• Inspiring homework: watch Richard Feynman, Fun to Imagine, on magnets (YouTube)
Model introspection
• You can answer the "why" question only for very simple models (e.g., linear models, basic decision trees)
• Sometimes it is instructive to run such a simple model on your dataset, even though it does not provide top-level performance
• You can boost your simple model by feeding it more advanced (non-linearly transformed) features
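The last point can be illustrated with a deliberately tiny example, which is an assumption-laden sketch rather than anything shown in the talk: a one-variable least-squares line (the hypothetical helper `fit_line`) is fitted to data generated from y = x². On the raw feature the linear model explains almost nothing; on the transformed feature x² it fits exactly, while remaining just as easy to interpret.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Coefficient of determination of the fitted line on its own data."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [i / 10 for i in range(-20, 21)]  # symmetric grid on [-2, 2]
ys = [x * x for x in xs]               # purely quadratic target

r2_raw = r_squared(xs, ys)                         # linear model on x
r2_transformed = r_squared([x * x for x in xs], ys)  # linear model on x^2

print(f"R^2 with raw feature x:       {r2_raw:.3f}")
print(f"R^2 with transformed feature: {r2_transformed:.3f}")
```

The fitted slope and intercept of the second model can still be read off and explained directly, which is the point of keeping the model simple and moving the complexity into the features.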
Complex model introspection
LIME algorithm = Local Interpretable Model-agnostic Explanations
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
Complex model introspection – practical issues
• The LIME authors provide an open-source Python implementation as the lime package
• Another option is the eli5 package, aimed more generally at model introspection and explanation (includes a LIME implementation): http://eli5.readthedocs.io/en/latest/tutorials
Visualizing models
• Display the model in the data space
• Look at a collection of models at once
• Explore the process of model fitting
The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
So my ML system works on test data, but you tell me it fails in production?
Hint #6
Watch out for overfitting
Overfitting
If you torture the data long enough, it will confess.
Ronald Coase
Hint #7
Watch out for data leakage
Data leakage
• Some time ago, I used to think data leaks were trivial to avoid
• They are not! (Look at the number of Kaggle competitions flawed by data leakage)
• You may introduce them yourself, e.g., meaningful identifiers, past & future separation in time series
• You may receive them in the data from your provider
• Good paper
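The "past & future separation" case can be sketched as below. Everything here is hypothetical (the record layout, the field name `day`, and both helper functions), and the code covers only this one kind of leak. A split that ignores time order lets the model train on records that are later than the ones it is evaluated on; a chronological split prevents that by construction.

```python
def chronological_split(records, time_key, test_fraction=0.2):
    """Split so that every training record precedes every test record."""
    ordered = sorted(records, key=lambda r: r[time_key])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def leaks_future(train, test, time_key):
    """True if some training record is later than some test record."""
    latest_train = max(r[time_key] for r in train)
    earliest_test = min(r[time_key] for r in test)
    return latest_train > earliest_test


records = [{"day": d, "value": d * d} for d in range(10)]

# Naive split that ignores time: every fifth record goes to the test set.
naive_test = records[::5]
naive_train = [r for r in records if r not in naive_test]

train, test = chronological_split(records, "day")

print("naive split leaks future:        ", leaks_future(naive_train, naive_test, "day"))
print("chronological split leaks future:", leaks_future(train, test, "day"))
```

A check like `leaks_future` is cheap enough to keep as an assertion inside the training pipeline.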
Hint #8
Watch out for covariate shift
What is covariate shift?
[Figure, built up over four slides: y plotted against X, showing the training data, the fitted model, the noiseless reality, and the production (test) data]
Covariate shift
• Unlike overfitting and data leakage, it is easier to detect
• Method: try to build a classifier differentiating train from production (test) data. If you succeed, you very likely have a problem
• Basic remedy: reweighting data points. Give production-like data higher impact on your model
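The detection method can be sketched with a deliberately simple classifier. Assumptions, stated loudly: a hand-written nearest-centroid classifier stands in for whatever model you would really use, the data are synthetic two-dimensional Gaussians, and all function names are hypothetical. The sketch covers detection only, not the reweighting remedy. Held-out accuracy near 0.5 means the classifier cannot tell the two sets apart; accuracy well above 0.5 signals a shift.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def domain_classifier_accuracy(train_X, prod_X, seed=0):
    """Held-out accuracy of a nearest-centroid classifier that tries to
    tell training rows (label 0) from production rows (label 1)."""
    rng = random.Random(seed)
    labeled = [(x, 0) for x in train_X] + [(x, 1) for x in prod_X]
    rng.shuffle(labeled)
    half = len(labeled) // 2
    fit, held_out = labeled[:half], labeled[half:]
    c0 = centroid([x for x, d in fit if d == 0])
    c1 = centroid([x for x, d in fit if d == 1])
    correct = sum(
        (1 if sq_dist(x, c1) < sq_dist(x, c0) else 0) == d
        for x, d in held_out
    )
    return correct / len(held_out)


def gaussian_rows(n, mean_first, rng):
    """n rows with two features; only the first feature's mean varies."""
    return [[rng.gauss(mean_first, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


rng = random.Random(1)
train_X = gaussian_rows(1000, 0.0, rng)
prod_same = gaussian_rows(1000, 0.0, rng)     # same distribution as training
prod_shifted = gaussian_rows(1000, 2.0, rng)  # first feature shifted by 2

print("no shift:  ", domain_classifier_accuracy(train_X, prod_same))
print("with shift:", domain_classifier_accuracy(train_X, prod_shifted))
```

A stronger classifier would also catch shifts that do not move the feature means (for example, a change in variance), which a nearest-centroid model cannot see.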
The quality of my super ML system deteriorates with time, really? Really really?
ML system in production
2009: Hooray we can predict flu
epidemics from Google query data!
2014: Hmm... Can we?
ML system in production
NIPS 2015 paper
Hint #9
Remember monitoring & maintenance
AKA it is engineering again, stupid!
Thank you!
Questions?
@lopusz

More Related Content

What's hot

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
Paco Nathan
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
ehtshamelahi
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
joshwills
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
Hakka Labs
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
Justin Basilico
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
Sri Ambati
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly Detection
Databricks
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Sri Ambati
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
Sri Ambati
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
Justin Basilico
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Sri Ambati
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDays Riga
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering
Ivano Malavolta
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Xavier Amatriain
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
DevFest DC
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Aseda Owusua Addai-Deseh
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
David Murgatroyd
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
Turi, Inc.
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
Xavier Amatriain
 

What's hot (20)

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly Detection
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
 

Similar to Debugging machine-learning

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017
fredverheul
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
CCG
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
Donald Miner
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options
Mihai Criveti
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
Donald Miner
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
supportlogic
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
GibDevs
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
Johann Schleier-Smith
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
Juggernaut Liu
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
JosephArevaloLoli
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
snigdhaagrawal11
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
AbderrahmanABID2
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systems
shreenathji26
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
harikaramisetty3
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...
WiMLDSMontreal
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
panagenda
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
Justin Basilico
 

Similar to Debugging machine-learning (20)

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systems
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling Blunders
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 

Recently uploaded

REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 

Recently uploaded (20)

REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 

Debugging machine-learning

  • 1. Debugging Machine Learning. Mostly for profit but with a bit of fun too! Michał Łopuszyński PyData Warsaw, 19.10.2017
  • 2. About me In my previous professional life, I was a theoretical physicist. I got a PhD in solid state physics • For the last 5 years I work as a Data Scientist in ICM, University of Warsaw•
  • 3. Agenda 4 failure modes of ML systems• 9 simple hints what to do•
  • 4. Hey, my ML system does not work at all...
  • 5. Hint #1 Check your code AKA it is engineering, stupid!
  • 6. Write tests Do not strive for 100% coverage, partial coverage is infinitely better then none! • In Python, doctests are your friends• "Hidden" benefits of tests• Better code structure• Executable documentation• Test the fragile parts first•
  • 7. Test your code with Monte Carlo / synthetic data Try to generate a trivial case for your ML system • This tests the whole pipeline (transforming/training) and allows for exploration of your system performance under unusual conditions • If you have generative model, prepare Monte Carlo data from assumed distributions • Perturb the original data, by generating the output with known and learnable prescription •
  • 8. Code style Single responsibility principle• Do not repeat yourself (DRY!)• Have and apply coding standard (PEP8)• Consider using a linter• Short functions and methods• https://xkcd.com/844/ pycodestyle checker (formerly pep8)• yapf formater• pylint, pyflakes, pychecker
  • 9. Naming Minimal requirement - be consistent!• Interesting software development books, offering chapters on naming• Freely available chapter on names
  • 11. Data quality audits are difficult
      • "Happy families are all alike; every unhappy family is unhappy in its own way." (Leo Tolstoy)
      • "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." (Hadley Wickham)
      • H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10
      • Images credit: Wikipedia
  • 12. Data quality
      • Beware, your data providers usually overestimate the data quality
      • Think of outliers and missing values (and how they are represented)
      • Understand your data
      • Do exploratory data analysis
      • Visualize, visualize, visualize
      • Talk to the domain expert
      • Is your data correct, complete, coherent, stationary (seasonality!), deduplicated, up-to-date, representative (unbiased)?
  • 13. OK, my ML system works, but I think it should perform better...
  • 15. Features
      • Features make a difference!
      • Be creative with your features
      • Try meaningful transformations, combinations (products/ratios), decorrelation...
      • Think of good representations for non-tabular data (text, time series)
      • Make a conscious decision about missing values
      • Understand what features are important for your model
      • Use ML models offering feature ranking
      • Use feature selection methods
      • [Figure: table with columns ID, F1, F2, F3]
  • 16. Hint #4 Examine your data points
  • 17. Data points
      • Find difficult data points (DDPs)!
      • DDP = a case that is notoriously misclassified (or has a high error) in your cross-validation loop across a large variety of models
      • [Figure: table with columns ID, P1, P2, P3, P4, P5]
      • Examine DDPs, understand them!
      • In the easiest case, remove DDPs from the dataset (think outliers, mislabeled examples)
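
The DDP search can be sketched as follows, assuming the out-of-fold predictions of several models have already been collected during cross-validation. Reading the slide's columns P1 to P5 as per-model predictions is an interpretation, and `difficult_points` with its `min_fraction` threshold is a hypothetical helper.

```python
def difficult_points(y_true, oof_predictions, min_fraction=0.8):
    """Return indices misclassified by at least min_fraction of the models.

    y_true: list of true labels.
    oof_predictions: dict mapping a model name to its out-of-fold
        predictions (same length and order as y_true).
    """
    n_models = len(oof_predictions)
    miss_counts = [0] * len(y_true)
    for preds in oof_predictions.values():
        for i, (truth, pred) in enumerate(zip(y_true, preds)):
            if pred != truth:
                miss_counts[i] += 1
    return [i for i, c in enumerate(miss_counts) if c / n_models >= min_fraction]


# Toy example: five hypothetical models (P1..P5), six data points.
y_true = [0, 1, 1, 0, 1, 0]
oof = {
    "P1": [0, 1, 0, 0, 1, 1],
    "P2": [0, 1, 0, 0, 1, 0],
    "P3": [0, 0, 0, 0, 1, 0],
    "P4": [0, 1, 0, 1, 1, 0],
    "P5": [0, 1, 0, 0, 1, 0],
}
ddp = difficult_points(y_true, oof)  # only index 2 is missed by every model
```

Points that a single model gets wrong (indices 1, 3 and 5 here) are not flagged; only the point that fails across the whole collection is worth a manual look.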
  • 20. Data points
      • Get more data!
      • [Figure: table with columns ID, P1, P2, P3, P4, P5]
      • Trick 1. Extend your set with artificial data, e.g., data augmentation in image processing, SMOTE algorithm for imbalanced datasets
      • Good performance booster, rarely applicable
      • Trick 2. Automatically generate a noisily labeled dataset using heuristics, e.g., distant supervision in NLP (requires unlabeled data!)
      • Trick 3. Semi-supervised learning methods: self-training and co-training (requires unlabeled data!)
  • 22. Why does my model predict what it predicts? (philosophical slide)
      • How do you answer "why" questions?
      • Inspiring homework: watch Richard Feynman, Fun to Imagine, on magnets (YouTube)
  • 23. Model introspection
      • You can answer the "why" question only for very simple models (e.g., linear models, basic decision trees)
      • Sometimes it is instructive to run such a simple model on your dataset, even though it does not provide top-level performance
      • You can boost your simple model by feeding it more advanced (non-linearly transformed) features
  • 24. Complex model introspection
      • LIME algorithm = Local Interpretable Model-agnostic Explanations
      • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  • 25. Complex model introspection: practical issues
      • The LIME authors provide an open-source Python implementation as the lime package
      • Another option is the eli5 package, aimed more generally at model introspection and explanation (includes a LIME implementation)
      • http://eli5.readthedocs.io/en/latest/tutorials
  • 26. Visualizing models
      • Display model in a data space
      • Look at collection of models at once
      • Explore the process of model fitting
      • The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
  • 27. So my ML system works on test data, but you tell me it fails in production?
  • 28. Hint #6 Watch out for overfitting
  • 29. Overfitting
      • "If you torture the data long enough, it will confess." (Ronald Coase)
  • 30. Hint #7 Watch out for data leakage
  • 31. Data leakage
      • Some time ago, I used to think data leaks are trivial to avoid
      • They are not! (Look at the number of Kaggle competitions flawed by data leakage)
      • You may introduce them yourself, e.g., meaningful identifiers, past & future separation in time series
      • You may receive them in the data from your provider
      • Good paper
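
One way to guard against the past/future leak mentioned on this slide is to split by time instead of at random. Below is a minimal sketch, assuming each record carries a sortable timestamp field. The function `time_ordered_split` and the field names `day` and `value` are hypothetical.

```python
def time_ordered_split(records, time_key, test_fraction=0.2):
    """Split records so that every test record is no older than any train record."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical daily records, deliberately given out of order.
records = [{"day": d, "value": d % 7} for d in (5, 1, 9, 3, 7, 2, 8, 4, 6, 10)]
train, test = time_ordered_split(records, time_key="day")

# No training record may come from after the start of the test period.
assert max(r["day"] for r in train) <= min(r["day"] for r in test)
```

A random shuffle of the same records would mix future rows into the training set, which is exactly the kind of leak that looks fine in validation and fails in production.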
  • 32. Hint #8 Watch out for covariate shift
  • 33. What is covariate shift? [Plot of y against X: training data]
  • 34. What is covariate shift? [Plot of y against X: training data, model]
  • 35. What is covariate shift? [Plot of y against X: training data, model, noiseless reality]
  • 36. What is covariate shift? [Plot of y against X: training data, model, noiseless reality, production (test) data]
  • 37. Covariate shift
      • Unlike overfitting and data leakage, it is easier to detect
      • Method: try to build a classifier differentiating train data from production (test) data. If you succeed, you very likely have a problem
      • Basic remedy: reweight the data points. Give production-like data higher impact on your model
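
The detection method from this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the speaker's implementation: the classifier is a hand-rolled logistic regression, the score is a held-out AUC, and all names (`auc`, `fit_logistic`, `domain_auc`) and the toy Gaussian data are invented for the example. In practice any classifier and any proper validation scheme would do.

```python
import math
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(rows, labels, lr=0.1, epochs=200):
    """Plain batch gradient descent for logistic regression."""
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def domain_auc(train_rows, prod_rows, seed=0):
    """Held-out AUC of a classifier separating train (0) from production (1)."""
    data = [(x, 0) for x in train_rows] + [(x, 1) for x in prod_rows]
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    fit, held = data[:half], data[half:]
    w, b = fit_logistic([x for x, _ in fit], [y for _, y in fit])
    scores = [b + sum(wi * xi for wi, xi in zip(w, x)) for x, _ in held]
    return auc(scores, [y for _, y in held])


# Toy data: one feature, production either matching or shifted.
rng = random.Random(1)
train = [[rng.gauss(0.0, 1.0)] for _ in range(300)]
same = [[rng.gauss(0.0, 1.0)] for _ in range(300)]
shifted = [[rng.gauss(1.5, 1.0)] for _ in range(300)]

auc_same = domain_auc(train, same)        # expected to stay near 0.5
auc_shifted = domain_auc(train, shifted)  # expected to be well above 0.5
```

An AUC near 0.5 means the classifier cannot tell the two sets apart; a value well above 0.5 signals that the feature distributions differ and the reweighting remedy is worth considering.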
  • 38. The quality of my super ML system deteriorates with time, really? Really really?
  • 39. ML system in production
      • 2009: Hooray, we can predict flu epidemics from Google query data!
      • 2014: Hmm... Can we?
  • 40. ML system in production NIPS 2015 paper
  • 41. Hint #9 Remember monitoring & maintenance AKA it is engineering again, stupid!