H2O.ai
Machine Intelligence
Top 10 Data Science Practitioner Pitfalls
Mark Landry
H2O World 2015
1. Train vs Test
Training Set vs. Test Set
• Partition the original data (randomly or stratified) into a
training set and a test set (e.g. 70/30).
Training Error vs. Test Error
• It can be useful to evaluate the training error, but you
should not look at training error alone.
• Training error is not an estimate of generalization error.
Test-set or cross-validated error is, and that is what you
should care about more.
• Training error vs. test error over time is a useful thing to
calculate. It can tell you when you start to overfit your
model, so it is a useful diagnostic in supervised machine
learning.
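A minimal sketch of the two ideas in this section, using only the Python standard library. The function names and the error histories are illustrative assumptions, not part of the original deck: the first function makes a random 70/30 partition, the second finds the iteration after which test error stops improving.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly partition rows into (train, test) without overlap."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def overfitting_onset(test_errors):
    """Return the index of the lowest test error in the history.

    Past this point the test error no longer improves, which is the
    usual sign that further training only fits the training set.
    """
    return min(range(len(test_errors)), key=test_errors.__getitem__)

train, test = train_test_split(range(100), test_fraction=0.3)

# Hypothetical error histories, one value per training iteration.
train_err = [0.40, 0.30, 0.22, 0.15, 0.10, 0.06]
test_err = [0.42, 0.33, 0.27, 0.25, 0.26, 0.29]
best_iter = overfitting_onset(test_err)
```

In this made-up history the training error keeps falling while the test error turns upward after its minimum, which is the pattern to watch for.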
1. Train vs Test Error
[Figure: training error vs. test error. Source: Elements of Statistical Learning]
2. Train vs Test vs Valid
Training Set vs. Validation Set vs. Test Set
• If you have “enough” data and plan to do some model
tuning, you should really partition your data into three
parts: Training, Validation and Test sets.
• There is no general rule for how you should partition the
data, and it will depend on how strong the signal in your
data is, but an example could be: 50% Train, 25%
Validation and 25% Test.
Validation is for Model Tuning
• The validation set is used strictly for model tuning (via
validation of models with different parameters) and the
test set is used to make a final estimate of the
generalization error.
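The split and the tuning step can be sketched as follows. This is an illustration under assumptions: the 50/25/25 fractions come from the slide, while the function names, the tuning parameter ("depth") and its validation errors are hypothetical.

```python
import random

def three_way_split(rows, fractions=(0.5, 0.25, 0.25), seed=7):
    """Shuffle rows and cut them into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return train, valid, test

def pick_best(candidates, valid_error):
    """Choose the candidate setting with the lowest validation error."""
    return min(candidates, key=valid_error)

train, valid, test = three_way_split(range(200))

# Hypothetical validation errors for three values of a tuning parameter.
errors_by_depth = {2: 0.31, 4: 0.24, 8: 0.28}
best_depth = pick_best(errors_by_depth, errors_by_depth.get)
```

Only after `best_depth` is fixed would the test set be scored, once, to estimate generalization error.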
3. Model Performance
Test Error
• Partition the original data (randomly) into a training set
and a test set (e.g. 70/30).
• Train a model using the training set and evaluate
performance (a single time) on the test set.
K-fold Cross-validation
• Train & test K models, each one holding out a different
fold as its test set.
• Average the model performance over the K test sets.
• Report cross-validated metrics.
Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure, Log-loss
• Ranking (Binary Outcome): AUC, Partial AUC
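A small sketch of K-fold cross-validation with RMSE as the metric. The "model" here is deliberately trivial (it predicts the mean of the training folds), and the data and function names are assumptions made for illustration. Folds are assigned by interleaving row indices, which assumes the rows are already in random order.

```python
import math

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    for fold in range(k):
        test_idx = [i for i in range(n_rows) if i % k == fold]
        train_idx = [i for i in range(n_rows) if i % k != fold]
        yield train_idx, test_idx

def rmse(actual, predicted):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validated_rmse(y, k=5):
    """Average hold-out RMSE of a mean-only model over K folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        mean_of_train = sum(y[i] for i in train_idx) / len(train_idx)   # "fit"
        held_out = [y[i] for i in test_idx]
        scores.append(rmse(held_out, [mean_of_train] * len(held_out)))  # "score"
    return sum(scores) / len(scores)

y = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]
cv_rmse = cross_validated_rmse(y, k=5)
```

Swapping the mean-only step for a real model's fit and predict calls keeps the same structure: train on K-1 folds, score on the held-out fold, average the K scores.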
4. Class Imbalance
Imbalanced Response Variable
• A dataset is said to be imbalanced when the binomial or
multinomial response variable has one or more classes
that are underrepresented in the training data, with
respect to the other classes.
Very Common
• This is incredibly common in real-world datasets.
• In practice, balanced datasets are the rarity, unless they
have been artificially created.
• There is no precise definition of what makes a dataset
imbalanced rather than balanced; the term is vague.
• My rule of thumb for binary response: if the minority
class makes up <10% of the data, this can cause issues.
Industries
• Advertising: the probability that someone clicks on an ad is
very low… very very low.
• Healthcare & Medicine: certain diseases or adverse medical
conditions are rare.
• Fraud Detection: insurance or credit fraud is rare.
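The rule of thumb can be checked in a couple of lines. The labels below are hypothetical and the function name is an assumption; the 10% threshold is the one quoted above.

```python
from collections import Counter

def minority_fraction(labels):
    """Share of rows that belong to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)

labels = [1] * 3 + [0] * 97          # hypothetical click / no-click labels
is_imbalanced = minority_fraction(labels) < 0.10   # rule-of-thumb threshold
```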
4. Remedies
Artificial Balance
• You can balance the training set using sampling.
Potential Pitfalls
• Notice that we don’t say to balance the test set. The test
set represents the true data distribution. The only way
to get “honest” model performance on your test set is to
use the original, unbalanced, test set.
• The same goes for the hold-out sets in cross-validation.
For this, you may end up having to write custom code,
depending on what software you use.
• H2O has a “balance_classes” argument that can be used to do
this properly & automatically.
Solutions
• You can manually upsample (or downsample) your minority
(or majority) class(es), either by duplicating (or
subsampling) rows, or by using row weights.
• The SMOTE (Synthetic Minority Oversampling Technique)
algorithm generates simulated training examples from the
minority class instead of upsampling.
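A minimal sketch of manual upsampling by row duplication, applied to the training set only. The data and function name are hypothetical; this is not H2O's `balance_classes` implementation and it is not SMOTE, just plain sampling with replacement.

```python
import random
from collections import Counter

def upsample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap to the majority count.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical (feature, label) rows with 5% positives.
train = [(i, 1) for i in range(5)] + [(i, 0) for i in range(95)]
test = [(i, 1) for i in range(2)] + [(i, 0) for i in range(38)]  # left untouched

balanced_train = upsample_minority(train, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced_train)
```

The test set keeps its original class ratio, so metrics computed on it still reflect the true data distribution.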
5. Categorical Data
Real Data
• Most real-world datasets contain categorical data.
• Problems can arise if you have too many categories.
Too Many Categories
• A lot of ML software will place limits on the number of
categories allowed in a single column (e.g. 1024) so you
may be forced to deal with this whether you like it or
not.
• When there are high-cardinality categorical columns,
often there will be many categories that only occur a
small number of times (not very useful).
Solutions
• If you have some hierarchical knowledge about the data, then
you may be able to reduce the number of categories by using
some sensible higher-level mapping of the categories.
• Example: ICD-9 codes have thousands of unique diagnostic and
procedure codes. You can map each category to a higher-level
super-category to reduce the cardinality.
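One way to sketch the higher-level mapping, assuming the coarser group is simply the part of the code before the dot. That rule, the helper names and the extra step of pooling rare levels into an "OTHER" bucket are illustrative assumptions, not something the slides prescribe.

```python
from collections import Counter

def to_supercategory(code):
    """Map a detailed code to a coarser group (assumed rule: text before the dot)."""
    return code.split(".")[0]

def collapse_rare(values, min_count=2, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

codes = ["250.00", "250.01", "250.02", "401.9", "401.1", "V70.0"]
groups = [to_supercategory(c) for c in codes]
reduced = collapse_rare(groups, min_count=2)
```

Six distinct codes become three levels here; on a real column the right mapping depends on domain knowledge about the hierarchy.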
6. Missing Data
Types of Missing Data
• Unavailable: valid for the observation, but not available
in the data set.
• Removed: an observation quality threshold may not have
been reached, so the data was removed.
• Not applicable: the measurement does not apply to the
particular observation (e.g. number of tires on a boat
observation).
What to Do
• It depends! Some options:
• Ignore the entire observation.
• Create a binary variable for each predictor to indicate
whether the data was missing or not.
• Segment the model based on data availability.
• Use an alternative algorithm: decision trees accept missing
values; linear models typically do not.
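The binary-indicator option can be sketched like this. Filling the gaps with the mean of the observed values is an added assumption (the slides only mention the indicator), and the column is hypothetical.

```python
def add_missing_indicator(values, fill=None):
    """Return (filled_values, was_missing_flags) for one predictor column."""
    present = [v for v in values if v is not None]
    if fill is None:
        # Assumption: fill with the mean of the observed values.
        fill = sum(present) / len(present)
    flags = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

income = [52.0, None, 48.0, None, 50.0]   # hypothetical column, None = missing
filled, was_missing = add_missing_indicator(income)
```

Keeping the flag alongside the filled column lets a model learn from the fact that a value was missing, not only from the substituted number.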
7. Outliers/Extreme Values
Types of Outliers
• Outliers can exist in response or predictors.
• Valid outliers: rare, extreme events.
• Invalid outliers: erroneous measurements.
What Can Happen
• Outlier values can have a disproportionate weight on the
model.
• MSE will focus on handling outlier observations more to
reduce squared error.
• Boosting will spend considerable modeling effort fitting
these observations.
What to Do
• Remove observations.
• Apply a transformation to reduce impact: e.g. log or
bins.
• Choose a loss function that is more robust: e.g. MAE vs
MSE.
• Impose a constraint on data range (cap values).
• Ask questions: understand whether the values are valid
or invalid, to make the most appropriate choice.
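A short illustration of three of the remedies, on made-up data. The cap limits are arbitrary assumptions. The MAE vs. MSE contrast uses a standard fact: the constant that minimizes squared error is the mean, while the constant that minimizes absolute error is the median.

```python
import math
import statistics

def cap(values, lower, upper):
    """Clamp every value into [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

values = [10.0, 12.0, 11.0, 13.0, 500.0]   # hypothetical data, last value is extreme

capped = cap(values, lower=0.0, upper=20.0)
logged = [math.log1p(v) for v in values]   # log(1 + v) compresses the long tail

# Best constant prediction under each loss.
best_constant_mse = statistics.mean(values)     # pulled toward the outlier
best_constant_mae = statistics.median(values)   # stays with the bulk of the data
```

The single extreme value drags the MSE-optimal prediction far from the other four observations, while the MAE-optimal prediction barely moves.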
8. Data Leakage
What Is It
• Leakage is allowing your model to use information that
will not be available in a production setting.
• Obvious example: using the Dow Jones daily gain/loss as
part of a model to predict individual stock performance
for the same day.
What Happens
• The model is overfit.
• Predictions in production will be inconsistent with the
scores you saw when fitting the model (even with a
validation set).
• Insights derived from the model will be incorrect.
What to Do
• Understand the nature of your problem and data.
• Scrutinize model feedback, such as relative influence or
linear coefficients.
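One simple guard, sketched under assumptions: if you can record when each feature's value becomes known, you can drop anything that would not exist yet at prediction time. The feature names, timestamps and function below are all hypothetical.

```python
from datetime import datetime

def usable_features(feature_times, prediction_time):
    """Keep only features whose values exist before the prediction is made."""
    return sorted(
        name for name, available_at in feature_times.items()
        if available_at < prediction_time
    )

# Hypothetical feature catalogue: when each value becomes known.
feature_times = {
    "prior_day_close": datetime(2015, 11, 9, 16, 0),
    "same_day_index_gain": datetime(2015, 11, 10, 16, 0),   # known only after the close
    "sector": datetime(2015, 1, 1, 0, 0),
}
prediction_time = datetime(2015, 11, 10, 9, 30)             # scored before the open
features = usable_features(feature_times, prediction_time)
```

A check like this only catches timing leaks that are recorded in the catalogue; it does not replace understanding the problem and the data.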
9. Useless Models
What is a “Useless” Model?
• Solving the wrong problem.
• Not collecting appropriate data.
• Not structuring data correctly to solve the problem.
• Choosing a target/loss measure that does not optimize
the end use case, e.g. using accuracy when the goal is to
prioritize resources.
• Having a model that is not actionable.
• Using a complicated model that is less accurate than a
simple model.
What To Do
• Understand the problem statement.
• Solving the wrong problem is an issue in all problem-solving
domains, but it is arguably easier to do with the black-box
techniques common in ML.
• Utilize post-processing measures.
• Create simple baseline models to understand the lift of more
complex models.
• Plan on an iterative approach: start quickly, even if on
imperfect data.
• Question your models and attempt to understand them.
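The baseline idea can be sketched with a majority-class predictor. All labels and the "complex model" predictions below are invented for illustration; the point is the comparison, not the numbers.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of positions where the prediction equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(train_labels, n_rows):
    """Predict the most common training label for every row."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_rows

train_labels = [0] * 90 + [1] * 10
test_labels = [0] * 18 + [1] * 2
model_predictions = [0] * 17 + [1] + [1, 0]   # hypothetical output of a complex model

baseline_acc = accuracy(test_labels, majority_baseline(train_labels, len(test_labels)))
model_acc = accuracy(test_labels, model_predictions)
lift = model_acc - baseline_acc
```

In this example a model with 90% accuracy has zero lift over always predicting the majority class, which is why accuracy alone can be a poor measure for the end use case.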
10. No Free Lunch
No Such Thing as a Free Lunch
• No general-purpose algorithm to solve all problems.
• No right answer on optimal data preparation.
• General heuristics are not always true:
• Tree models solve problems equivalently with any
order-preserving transformation.
• Decision trees and neural networks will
automatically find interactions.
• A high number of predictors may be handled, but
can lead to a worse result than fewer key
predictors.
• Models cannot find relative information that spans
multiple observations.
• Model feedback can be misleading: relative
influence, linear coefficients.
What To Do
• Understand how the underlying algorithms operate.
• Try several algorithms and observe relative performance and
the characteristics of your data.
• Feature engineering & feature selection.
• Interpret and react to model feedback.
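As an example of feature engineering for information that spans multiple observations, the sketch below adds each row's change from the same entity's previous row. The row layout `(entity, time, value)` and the function name are assumptions for illustration.

```python
def add_change_from_previous(rows):
    """Add each row's change in value relative to the same entity's previous row.

    rows: list of (entity, time, value) tuples in any order.
    Returns (entity, time, value, delta) tuples sorted by entity and time;
    delta is None for an entity's first row.
    """
    previous = {}
    out = []
    for entity, time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        delta = value - previous[entity] if entity in previous else None
        out.append((entity, time, value, delta))
        previous[entity] = value
    return out

rows = [("a", 2, 13.0), ("a", 1, 10.0), ("b", 1, 5.0), ("a", 3, 12.0), ("b", 2, 9.0)]
with_delta = add_change_from_previous(rows)
```

A model that scores one row at a time cannot derive `delta` on its own; once it is a column, any algorithm can use it.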
Where to learn more about H2O?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com

More Related Content

What's hot

Diaries for data collection
Diaries for data collectionDiaries for data collection
Validity and reliability
Validity and reliabilityValidity and reliability
Validity and reliability
Sefa Soner Bayraktar
 
Quantitative reseach method
Quantitative reseach methodQuantitative reseach method
Quantitative reseach methodmetalkid132
 
WHO labour guide.pdf
WHO labour guide.pdfWHO labour guide.pdf
WHO labour guide.pdf
drmonicaagrawal2
 
Newer Methods of Assessment in Medical Education
Newer Methods of Assessment in  Medical EducationNewer Methods of Assessment in  Medical Education
Newer Methods of Assessment in Medical Education
Swati Deshpande
 
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCYA RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
Anu Manivannan
 
Induction of labour
Induction of labourInduction of labour
Induction of labour
Abino David
 
Blood transfusion in obstetrics
Blood transfusion in obstetricsBlood transfusion in obstetrics
Blood transfusion in obstetrics
Aboubakr Elnashar
 
Critiquing research
Critiquing researchCritiquing research
Critiquing researchNursing Path
 
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
keerthi samuel
 
Abdominal pain in pregnancy
Abdominal pain in pregnancyAbdominal pain in pregnancy
Abdominal pain in pregnancy
Hanifullah Khan
 
Assessment Module 1
Assessment Module 1Assessment Module 1
Assessment Module 1
Anh Le
 
Qualitative Research Method
 Qualitative Research  Method  Qualitative Research  Method
Qualitative Research Method
Kunal Modak
 
Validity &amp; reliability
Validity &amp; reliabilityValidity &amp; reliability
Validity &amp; reliability
Praisy AB Vineesh
 
Writing assessment
Writing assessmentWriting assessment
Writing assessment
Anabel Aqramonte
 
Gynaecological history taking
Gynaecological history takingGynaecological history taking
Gynaecological history taking
Kavya Liyanage
 
Ebonics Emerges (2)
Ebonics Emerges (2)Ebonics Emerges (2)
Ebonics Emerges (2)Aiden Yeh
 
Qual&quantitative research
Qual&quantitative researchQual&quantitative research
Qual&quantitative researchPatel Mahendra
 

What's hot (20)

Diaries for data collection
Diaries for data collectionDiaries for data collection
Diaries for data collection
 
Validity and reliability
Validity and reliabilityValidity and reliability
Validity and reliability
 
Quantitative reseach method
Quantitative reseach methodQuantitative reseach method
Quantitative reseach method
 
WHO labour guide.pdf
WHO labour guide.pdfWHO labour guide.pdf
WHO labour guide.pdf
 
Newer Methods of Assessment in Medical Education
Newer Methods of Assessment in  Medical EducationNewer Methods of Assessment in  Medical Education
Newer Methods of Assessment in Medical Education
 
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCYA RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
A RARE CASE PRESENTATION OF OVARIAN ECTOPIC PREGNANCY
 
Induction of labour
Induction of labourInduction of labour
Induction of labour
 
Blood transfusion in obstetrics
Blood transfusion in obstetricsBlood transfusion in obstetrics
Blood transfusion in obstetrics
 
Critiquing research
Critiquing researchCritiquing research
Critiquing research
 
literature_review
literature_reviewliterature_review
literature_review
 
Molar Pregnancy
Molar PregnancyMolar Pregnancy
Molar Pregnancy
 
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
Inbornerrorsofmetabolism 120429124218-phpapp01 (4) (1)
 
Abdominal pain in pregnancy
Abdominal pain in pregnancyAbdominal pain in pregnancy
Abdominal pain in pregnancy
 
Assessment Module 1
Assessment Module 1Assessment Module 1
Assessment Module 1
 
Qualitative Research Method
 Qualitative Research  Method  Qualitative Research  Method
Qualitative Research Method
 
Validity &amp; reliability
Validity &amp; reliabilityValidity &amp; reliability
Validity &amp; reliability
 
Writing assessment
Writing assessmentWriting assessment
Writing assessment
 
Gynaecological history taking
Gynaecological history takingGynaecological history taking
Gynaecological history taking
 
Ebonics Emerges (2)
Ebonics Emerges (2)Ebonics Emerges (2)
Ebonics Emerges (2)
 
Qual&quantitative research
Qual&quantitative researchQual&quantitative research
Qual&quantitative research
 

Viewers also liked

H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine Udell
Sri Ambati
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
Sri Ambati
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2O
Sri Ambati
 
YSI 5500D 5400 and 5200A aquaculture monitors and controllers
YSI 5500D 5400 and 5200A aquaculture monitors and controllersYSI 5500D 5400 and 5200A aquaculture monitors and controllers
YSI 5500D 5400 and 5200A aquaculture monitors and controllers
Xylem Inc.
 
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
Monica Rogati
 
Sparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public SafetySparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public Safety
Sri Ambati
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
Sri Ambati
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015
Sri Ambati
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
Hisham Arafat
 
H2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno CandelH2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno Candel
Sri Ambati
 
Machine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of ThingsMachine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of Things
Sri Ambati
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
Sri Ambati
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
Sri Ambati
 
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsExploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Ajin Abraham
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
ETCenter
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
Brad Klingenberg
 
BSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident TrackingBSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident Tracking
Judy Nowak, OSCP, GCIH, CISSP
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
Sri Ambati
 
Hacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - WhitepaperHacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - Whitepaper
Ajin Abraham
 

Viewers also liked (20)

H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine Udell
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2O
 
YSI 5500D 5400 and 5200A aquaculture monitors and controllers
YSI 5500D 5400 and 5200A aquaculture monitors and controllersYSI 5500D 5400 and 5200A aquaculture monitors and controllers
YSI 5500D 5400 and 5200A aquaculture monitors and controllers
 
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...
 
Sparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public SafetySparkling Water Meetup: Deep Learning for Public Safety
Sparkling Water Meetup: Deep Learning for Public Safety
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
H2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno CandelH2O World - Welcome to H2O World with Arno Candel
H2O World - Welcome to H2O World with Arno Candel
 
Machine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of ThingsMachine Learning for the Sensored Internet of Things
Machine Learning for the Sensored Internet of Things
 
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFH2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
 
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsExploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
Exploit Research and Development Megaprimer: DEP Bypassing with ROP Chains
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
 
BSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident TrackingBSidesTO 2016 - Incident Tracking
BSidesTO 2016 - Incident Tracking
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Hacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - WhitepaperHacking Tizen: The OS of everything - Whitepaper
Hacking Tizen: The OS of everything - Whitepaper
 

Similar to H2O World - Top 10 Data Science Pitfalls - Mark Landry

Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
Sri Ambati
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
Roger Barga
 
Unit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptxUnit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptx
Chitrachitrap
 
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
DurgaDevi310087
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
Marina Santini
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
Marc Berman
 
Statistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptxStatistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptx
nagarajan740445
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
Eng Teong Cheah
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
Aun Akbar
 
lec1.ppt
lec1.pptlec1.ppt
lec1.ppt
SVasuKrishna1
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Intel® Software
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
David Murgatroyd
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
Kunal Jain
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
Roger Barga
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
Datacademy.ai
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
Databricks
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
SATHVIK MANIKANTAN N U
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
Subrat Panda, PhD
 

Similar to H2O World - Top 10 Data Science Pitfalls - Mark Landry (20)

Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Unit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptxUnit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptx
 
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
MACHINE LEARNING INTRODUCTION DIFFERENCE BETWEEN SUOERVISED , UNSUPERVISED AN...
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
 
Statistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptxStatistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptx
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
lec1.ppt
lec1.pptlec1.ppt
lec1.ppt
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 


H2O World - Top 10 Data Science Pitfalls - Mark Landry

  • 1. H2O.ai Machine Intelligence Top 10 Data Science Practitioner Pitfalls Mark Landry H2O World 2015
  • 2. Train vs Test (1 of 10)
  • 3. Pitfall 1: Train vs Test
      Training Set vs. Test Set:
        • Partition the original data (randomly or stratified) into a training set and a test set (e.g. 70/30).
      Training Error vs. Test Error:
        • It can be useful to evaluate the training error, but you should not look at training error alone.
        • Training error is not an estimate of generalization error (measured on a test set or by cross-validation), which is what you should care more about.
        • Training error vs. test error over time is a useful thing to calculate. It can tell you when you start to overfit your model, so it is a useful metric in supervised machine learning.
  • 4. Pitfall 1: Train vs Test Error (figure; source: Elements of Statistical Learning)
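As a rough illustration of tracking training error against test error over time, the sketch below finds the iteration where test error bottoms out. The error values are made up for the example, and `best_iteration` is a hypothetical helper, not part of H2O or any library.

```python
def best_iteration(test_errors):
    """Return the 0-based iteration index with the lowest test error."""
    return min(range(len(test_errors)), key=lambda i: test_errors[i])


# Hypothetical error curves, one value per training iteration.
train_errors = [0.40, 0.30, 0.22, 0.15, 0.10, 0.06, 0.03]
test_errors = [0.42, 0.33, 0.27, 0.25, 0.26, 0.29, 0.33]

best = best_iteration(test_errors)

# After this point training error keeps falling while test error rises,
# which is the usual sign that the model has started to overfit.
print("lowest test error at iteration", best)
print("train error still falling:", train_errors[-1] < train_errors[best])
```

Looking only at `train_errors` would suggest that the last iteration is the best one; the test curve tells a different story.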
  • 5. Validation Set (2 of 10)
  • 6. Pitfall 2: Train vs Test vs Valid
      Training Set vs. Validation Set vs. Test Set:
        • If you have “enough” data and plan to do some model tuning, you should partition your data into three parts: training, validation, and test sets.
        • There is no general rule for how to partition the data; it depends on how strong the signal in your data is. One example: 50% train, 25% validation, 25% test.
      Validation is for Model Tuning:
        • The validation set is used strictly for model tuning (by validating models with different parameters), and the test set is used to make a final estimate of the generalization error.
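A three-way partition like the 50/25/25 example can be sketched in a few lines of plain Python. The function name, the seed, and the integer "rows" are assumptions for illustration; a real project would more likely use its ML library's own splitting utility.

```python
import random


def three_way_split(rows, train_frac=0.50, valid_frac=0.25, seed=42):
    """Shuffle rows and cut them into train / validation / test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left over
    return train, valid, test


rows = list(range(1000))  # stand-in for real observations
train, valid, test = three_way_split(rows)
print(len(train), len(valid), len(test))
```

Tune on `valid` as often as needed, and touch `test` only once, for the final estimate of generalization error.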
  • 7. Model Performance (3 of 10)
  • 8. Pitfall 3: Model Performance
      Test Error:
        • Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
        • Train a model using the training set and evaluate performance (a single time) on the test set.
      K-fold Cross-validation:
        • Train and test K models as shown.
        • Average the model performance over the K test sets.
        • Report cross-validated metrics.
      Performance Metrics:
        • Regression: R^2, MSE, RMSE
        • Classification: Accuracy, F1, H-measure, Log-loss
        • Ranking (Binary Outcome): AUC, Partial AUC
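The K-fold procedure can be sketched as follows. To stay self-contained, the "model" here is a stand-in that simply predicts the mean of its training folds, and the data are synthetic; both are assumptions for illustration only.

```python
import math
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k nearly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def cross_validate(y, k=5):
    """Cross-validated RMSE of a stand-in model that predicts the training mean."""
    scores = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train_y = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train_y) / len(train_y)  # "train" on the other k-1 folds
        test_y = [y[i] for i in held_out]
        scores.append(rmse(test_y, [prediction] * len(test_y)))
    return sum(scores) / k, scores  # average over the k held-out folds


y = [float(v % 7) for v in range(100)]  # synthetic response values
cv_rmse, fold_scores = cross_validate(y, k=5)
print(round(cv_rmse, 3), [round(s, 3) for s in fold_scores])
```

Each row is held out exactly once, so the averaged score uses all of the data for evaluation without ever scoring a row the model was trained on.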
  • 9. Class Imbalance (4 of 10)
  • 10. Pitfall 4: Class Imbalance
      Imbalanced Response Variable:
        • A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data relative to the other classes.
      Very common:
        • This is incredibly common in real-world datasets.
        • In practice, balanced datasets are the rarity, unless they have been artificially created.
        • There is no precise definition of an imbalanced vs. balanced dataset; the term is vague.
        • My rule of thumb for a binary response: if the minority class makes up <10% of the data, this can cause issues.
      Industries:
        • Advertising: the probability that someone clicks on an ad is very low… very very low.
        • Healthcare & Medicine: certain diseases or adverse medical conditions are rare.
        • Fraud Detection: insurance or credit fraud is rare.
  • 11. Pitfall 4: Remedies
      Artificial Balance:
        • You can balance the training set using sampling.
      Potential Pitfalls:
        • Notice that we don’t say to balance the test set. The test set represents the true data distribution. The only way to get “honest” model performance on your test set is to use the original, unbalanced test set.
        • The same goes for the hold-out sets in cross-validation. For this, you may end up having to write custom code, depending on what software you use.
      Solutions:
        • H2O has a “balance_classes” argument that can be used to do this properly and automatically.
        • You can manually upsample (or downsample) your minority (or majority) class(es) either by duplicating (or sub-sampling) rows, or by using row weights.
        • The SMOTE (Synthetic Minority Oversampling Technique) algorithm generates simulated training examples from the minority class instead of upsampling.
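Manual upsampling by duplicating rows might look like the sketch below. It is a hand-rolled illustration, not H2O's `balance_classes` implementation; the helper name and the synthetic data with a 5% positive rate are assumptions.

```python
import random


def upsample_minority(rows, label_of, seed=0):
    """Duplicate rows of smaller classes (sampling with replacement)
    until every class is as large as the biggest one."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Hypothetical data: (feature, label) pairs with a 5% positive rate.
data = [(i, 1 if i % 20 == 0 else 0) for i in range(1000)]
random.Random(1).shuffle(data)
train, test = data[:700], data[700:]

# Balance the training set only. The test set is left untouched so that it
# still reflects the true class distribution.
balanced_train = upsample_minority(train, label_of=lambda r: r[1])
print(len(train), len(balanced_train), len(test))
```

The split happens before the upsampling on purpose: duplicating rows first would let copies of the same observation land on both sides of the split.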
  • 12. Categorical Data (5 of 10)
  • 13. Pitfall 5: Categorical Data
      Real Data:
        • Most real-world datasets contain categorical data.
        • Problems can arise if you have too many categories.
      Too Many Categories:
        • A lot of ML software will place limits on the number of categories allowed in a single column (e.g. 1024), so you may be forced to deal with this whether you like it or not.
        • When there are high-cardinality categorical columns, there will often be many categories that occur only a small number of times (not very useful).
      Solutions:
        • If you have some hierarchical knowledge about the data, you may be able to reduce the number of categories by using some sensible higher-level mapping of the categories.
        • Example: ICD-9 codes, with thousands of unique diagnostic and procedure codes. You can map each category to a higher-level super-category to reduce the cardinality.
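A higher-level mapping can be as simple as truncating a hierarchical code. The sketch below assumes, purely for illustration, that the text before the decimal point identifies the super-category; the right mapping for a real code set has to come from domain knowledge, and the code strings are only examples.

```python
from collections import Counter


def to_super_category(code):
    """Map a detailed code to its higher-level group
    (assumed here to be the text before the '.')."""
    return code.split(".")[0]


codes = ["250.00", "250.01", "250.02", "401.1", "401.9", "428.0"]
mapped = [to_super_category(c) for c in codes]

print(len(set(codes)), "distinct codes ->", len(set(mapped)), "super-categories")
print(Counter(mapped))
```

Each super-category now has more rows behind it, which is the point: fewer levels, each with enough observations to be useful.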
  • 14. Missing Data (6 of 10)
  • 15. Pitfall 6: Missing Data
      Types of Missing Data:
        • Unavailable: valid for the observation, but not available in the data set.
        • Removed: an observation quality threshold may not have been reached, so the data were removed.
        • Not applicable: the measurement does not apply to the particular observation (e.g. number of tires on a boat observation).
      What to Do:
        • It depends! Some options:
        • Ignore the entire observation.
        • Create a binary variable for each predictor to indicate whether the data was missing or not.
        • Segment the model based on data availability.
        • Use an alternative algorithm: decision trees accept missing values; linear models typically do not.
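The binary-indicator option could be sketched like this, assuming rows are plain dictionaries and that `None` marks a missing value. The column names and the `_missing` suffix are hypothetical.

```python
def add_missing_indicators(rows, columns):
    """Return copies of rows with a 0/1 '<col>_missing' flag for each column."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for col in columns:
            new_row[col + "_missing"] = 1 if row.get(col) is None else 0
        flagged.append(new_row)
    return flagged


rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
]
flagged = add_missing_indicators(rows, ["age", "income"])
for row in flagged:
    print(row)
```

The indicator lets a model learn from the fact that a value was missing, which can itself carry signal.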
  • 16. Outliers (7 of 10)
  • 17. Pitfall 7: Outliers/Extreme Values
      Types of Outliers:
        • Outliers can exist in the response or the predictors.
        • Valid outliers: rare, extreme events.
        • Invalid outliers: erroneous measurements.
      What Can Happen:
        • Outlier values can have a disproportionate weight on the model.
        • MSE will focus more on handling outlier observations in order to reduce squared error.
        • Boosting will spend considerable modeling effort fitting these observations.
      What to Do:
        • Remove observations.
        • Apply a transformation to reduce impact: e.g. log or bins.
        • Choose a loss function that is more robust: e.g. MAE vs MSE.
        • Impose a constraint on the data range (cap values).
        • Ask questions: understand whether the values are valid or invalid, to make the most appropriate choice.
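A small numeric sketch of why the loss function matters: the constant prediction that minimizes MSE is the mean, while the one that minimizes MAE is the median, so a single extreme value pulls the MSE-optimal fit much further. The data and the cap of 50 are made up for illustration.

```python
import math
import statistics


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def cap(values, low, high):
    """Clamp each value into the range [low, high]."""
    return [min(max(v, low), high) for v in values]


y = [10.0, 12.0, 11.0, 9.0, 500.0]  # the last value is an extreme observation

mse_fit = statistics.mean(y)    # best constant under squared error
mae_fit = statistics.median(y)  # best constant under absolute error
print("MSE-optimal constant:", mse_fit)
print("MAE-optimal constant:", mae_fit)

# Two of the other remedies: capping and a log transform.
print("capped:", cap(y, 0.0, 50.0))
print("log1p: ", [round(math.log1p(v), 2) for v in y])
```

Whether to cap, transform, or switch loss depends on whether the extreme value is valid, which is why the slide ends with "ask questions".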
  • 18. Data Leakage (8 of 10)
  • 19. Pitfall 8: Data Leakage
      What Is It:
        • Leakage is allowing your model to use information that will not be available in a production setting.
        • Obvious example: using the Dow Jones daily gain/loss as part of a model to predict individual stock performance.
      What Happens:
        • The model is overfit.
        • It will make predictions inconsistent with those you scored when fitting the model (even with a validation set).
        • Insights derived from the model will be incorrect.
      What to Do:
        • Understand the nature of your problem and data.
        • Scrutinize model feedback, such as relative influence or linear coefficients.
  • 20. Useless Models (9 of 10)
  • 21. Pitfall 9: Useless Models
      What is a “Useless” Model?
        • Solving the wrong problem.
        • Not collecting appropriate data.
        • Not structuring data correctly to solve the problem.
        • Choosing a target/loss measure that does not optimize the end use case: e.g. using accuracy to prioritize resources.
        • Having a model that is not actionable.
        • Using a complicated model that is less accurate than a simple model.
      What To Do:
        • Understand the problem statement.
        • Solving the wrong problem is an issue in all problem-solving domains, but it is arguably easier to do with the black-box techniques common in ML.
        • Utilize post-processing measures.
        • Create simple baseline models to understand the lift of more complex models.
        • Plan on an iterative approach: start quickly, even if on imperfect data.
        • Question your models and attempt to understand them.
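One way to read "create simple baseline models" is to compare any candidate against a trivial majority-class predictor. In the sketch below the labels and the candidate model's predictions are invented for illustration.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_class_baseline(train_labels):
    """Return the most frequent training label; the baseline predicts it for every row."""
    return Counter(train_labels).most_common(1)[0][0]


train_labels = [0] * 95 + [1] * 5
test_labels = [0] * 19 + [1]
model_predictions = [0] * 18 + [1, 1]  # hypothetical output of a more complex model

baseline_label = majority_class_baseline(train_labels)
baseline_acc = accuracy(test_labels, [baseline_label] * len(test_labels))
model_acc = accuracy(test_labels, model_predictions)
print("baseline accuracy:", baseline_acc)
print("model accuracy:   ", model_acc)
```

Here both score 95% accuracy, so on this metric the complex model shows no lift over always predicting the majority class. That is a hint that accuracy may be the wrong measure for the use case, not that the model is good.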
  • 22. No Free Lunch (10 of 10)
  • 23. Pitfall 10: No Free Lunch
      No Such Thing as a Free Lunch:
        • No general-purpose algorithm solves all problems.
        • There is no right answer on optimal data preparation.
        • General heuristics are not always true:
        • Tree models solve problems equivalently with any order-preserving transformation.
        • Decision trees and neural networks will automatically find interactions.
        • A high number of predictors may be handled, but can lead to a less optimal result than fewer key predictors.
        • Models cannot find relative information that spans multiple observations.
        • Model feedback can be misleading: relative influence, linear coefficients.
      What To Do:
        • Understand how the underlying algorithms operate.
        • Try several algorithms and observe relative performance and the characteristics of your data.
        • Feature engineering and feature selection.
        • Interpret and react to model feedback.
  • 24. Where to learn more about H2O?
      • H2O Online Training (free): http://learn.h2o.ai
      • H2O Slidedecks: http://www.slideshare.net/0xdata
      • H2O Video Presentations: https://www.youtube.com/user/0xdata
      • H2O Community Events & Meetups: http://h2o.ai/events
      • Machine Learning & Data Science courses: http://coursebuffet.com