Top 10 Data Science Practitioner Pitfalls
Erin LeDell and Mark Landry
Silicon Valley Big Data Science
September 2015
H2O.ai

H2O Company
• Team: ~35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
• Open Source Software (Apache 2.0 License)

H2O Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
Scientific Advisory Council

Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)

Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap

Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
What is Data Science?

Problem Formulation
• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units

Collect & Process Data
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning — X, Y

Machine Learning
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis

Insights & Action
• Translate results into action items
• Feed results into research pipeline
Machine Learning Task Overview

Regression
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2

Classification
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC

Clustering
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
H2O.ai

Machine Intelligence
Machine Learning Workflow
Source: NLTK
Example of a supervised machine learning workflow.
Train vs Test
1 of 10
Top 10 Data Science Practitioner Pitfalls
1. Train vs Test

Training Set vs. Test Set
• Partition the original data (randomly or stratified) into a training set and a test set (e.g. 70/30).

Training Error vs. Test Error
• It can be useful to evaluate the training error, but you should not look at training error alone.
• Training error is not an estimate of the generalization error (on a test set or cross-validated), which is what you should care more about.
• Tracking training error vs. test error over time is a useful thing to calculate: it can tell you when you start to overfit your model, so it is a useful diagnostic in supervised machine learning.

Data Leakage
• Be careful of data leakage (from the training set into the test set).
• If you are using pooled repeated measures data (vs. iid data), you must ensure that all rows associated with a cluster/individual are either in train or test, but not in both (see the group-aware split in the sketch below).
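Below is a minimal sketch of both splits in Python with pandas and scikit-learn (an illustrative tooling choice, not the deck's own H2O interface), using synthetic data and a hypothetical "subject_id" grouping column in place of a real dataset.

```python
# Minimal sketch: 70/30 train/test split, plus a group-aware split for pooled
# repeated-measures data. Data and the subject identifier are synthetic stand-ins.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.RandomState(42)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"x{i}" for i in range(5)])
y = pd.Series(rng.binomial(1, 0.3, size=1000), name="y")
subjects = pd.Series(rng.randint(0, 100, size=1000), name="subject_id")

# Simple random 70/30 split (stratified on y for a classification task).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Group-aware split: all rows for one subject land in train OR test, never both.
gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=subjects))
X_train_g, X_test_g = X.iloc[train_idx], X.iloc[test_idx]
```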
1. Train vs Test Error
Source: Elements of Statistical Learning
Validation Set
2 of 10
Top 10 Data Science Practitioner Pitfalls
2. Train vs Test vs Valid

Training Set vs. Validation Set vs. Test Set
• If you have “enough” data and plan to do some model tuning, you should really partition your data into three parts — Training, Validation and Test sets.
• There is no general rule for how you should partition the data; it depends on how strong the signal in your data is, but an example could be: 50% Train, 25% Validation and 25% Test (see the sketch below).

Validation is for Model Tuning
• The validation set is used strictly for model tuning (via validation of models with different parameters), and the test set is used to make a final estimate of the generalization error.
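A minimal sketch of the 50/25/25 partition, again with scikit-learn and synthetic stand-in data:

```python
# Minimal sketch: 50/25/25 train/validation/test split via two calls to
# train_test_split. The synthetic data stands in for your own X and y.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# First carve off 50% for training, then split the remainder in half.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.50, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)

# Tune parameters against (X_valid, y_valid); touch (X_test, y_test) only once, at the end.
```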
Model Performance
3 of 10
Top 10 Data Science Practitioner Pitfalls
3. Model Performance

Test Error
• Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
• Train a model using the training set and evaluate performance (a single time) on the test set.

K-fold Cross-validation
• Train & test K models, one per held-out fold.
• Average the model performance over the K test folds.
• Report the cross-validated metrics (see the sketch below).

Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure, Log-loss
• Ranking (Binary Outcome): AUC, Partial AUC
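A minimal sketch of K-fold cross-validation with K = 5; the estimator, metric and synthetic data are illustrative choices, not recommendations.

```python
# Minimal sketch: train and test K models, then report the metric averaged
# over the K held-out folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
model = GradientBoostingClassifier(random_state=42)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", scores.round(3))
print("Cross-validated AUC: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```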
Class Imbalance
4 of 10
Top 10 Data Science Practitioner Pitfalls
4. Class Imbalance

Imbalanced Response Variable
• A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data, with respect to the other classes.
• This is incredibly common in real-world datasets.
• In practice, balanced datasets are the rarity, unless they have been artificially created.
• There is no precise definition of what separates an imbalanced from a balanced dataset — the term is vague.
• My rule of thumb for a binary response: if the minority class makes up less than 10% of the data, this can cause issues.

Very Common Industries
• Advertising — The probability that someone clicks on an ad is very low… very, very low.
• Healthcare & Medicine — Certain diseases or adverse medical conditions are rare.
• Fraud Detection — Insurance or credit fraud is rare.
4. Simple Remedies

Artificial Balance
• You can balance the training set using sampling.

Potential Pitfalls
• Notice that we don’t say to balance the test set. The test set represents the true data distribution, and the only way to get “honest” model performance on your test set is to use the original, unbalanced, test set.
• The same goes for the hold-out sets in cross-validation. For this, you may end up having to write custom code, depending on what software you use.

Solutions
• H2O has a “balance_classes” argument that can be used to do this properly & automatically.
• You can manually upsample (or downsample) your minority (or majority) class(es), either by duplicating (or sub-sampling) rows, or by using row weights (see the sketch below).
• The SMOTE (Synthetic Minority Oversampling Technique) algorithm generates simulated training examples from the minority class instead of duplicating existing rows.
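A minimal sketch of upsampling only the training split, using scikit-learn and synthetic imbalanced data (the “balance_classes” argument mentioned above automates the same idea inside H2O's algorithms):

```python
# Minimal sketch: duplicate minority-class rows in the TRAINING set only,
# leaving the test set at its natural, imbalanced distribution.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

train = pd.DataFrame(X_train)
train["y"] = y_train
majority, minority = train[train["y"] == 0], train[train["y"] == 1]

# Sample minority rows with replacement until the classes are balanced, then shuffle.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
# Fit on train_balanced; evaluate on the untouched, imbalanced X_test / y_test.
```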
4. Advanced Remedies

AUC-Maximizing Algorithms
• There are ways to tackle this issue more directly.
• Use algorithms that optimize a metric that is insensitive to the prior class probabilities — for example, Area Under the ROC Curve (AUC).
• Many algorithms work by optimizing a metric equivalent or similar to accuracy. If your data is imbalanced, this will not produce a good model, since you can have excellent accuracy and poor AUC.

Cost-Sensitive Training
• Use a cost function that penalizes more harshly the types of errors you care about most (see the sketch below).
• Cost Matrix: a table that assigns a different cost to each type of error, e.g. making a false negative more costly than a false positive.
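A minimal sketch of cost-sensitive training via class weights; the 10:1 weighting is an illustrative assumption, and evaluation uses AUC because it is insensitive to the class priors.

```python
# Minimal sketch: penalize errors on the rare positive class 10x more than
# errors on the majority class, then evaluate with AUC rather than accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X_train, y_train)
print("Test AUC: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```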
Categorical Data
5 of 10
Top 10 Data Science Practitioner Pitfalls
5. Categorical Data

Real Data
• Most real-world datasets contain categorical data.

Too Many Categories
• Problems can arise if you have too many categories.
• A lot of ML software will place limits on the number of categories allowed in a single column (e.g. 1024), so you may be forced to deal with this whether you like it or not.
• When there are high-cardinality categorical columns, there will often be many categories that occur only a small number of times (not very useful).

Solutions
• If you have some hierarchical knowledge about the data, you may be able to reduce the number of categories by using a sensible higher-level mapping of the categories.
• Example: ICD-9 codes — thousands of unique diagnostic and procedure codes. You can map each category to a higher-level super-category to reduce the cardinality (see the sketch below).
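A minimal sketch of cardinality reduction in pandas; the codes, the mapping and the rarity threshold are all hypothetical.

```python
# Minimal sketch: map codes to a (hypothetical) higher-level grouping, and
# separately pool rare levels into an "OTHER" category.
import pandas as pd

df = pd.DataFrame({"diag_code": ["250.00", "401.9", "486", "999.9", "401.9"]})

code_to_group = {"250.00": "ENDOCRINE", "401.9": "CIRCULATORY", "486": "RESPIRATORY"}
df["diag_group"] = df["diag_code"].map(code_to_group).fillna("OTHER")

# Alternative: keep only levels that occur at least `min_count` times.
min_count = 2
counts = df["diag_code"].value_counts()
rare = counts[counts < min_count].index
df["diag_code_reduced"] = df["diag_code"].where(~df["diag_code"].isin(rare), "OTHER")
print(df)
```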
5. Missing Categories

Missing Data
• There are many approaches to imputing categorical data. The simplest approach is to impute all missing values with the mode (the category that occurs most often).

Training vs. Test Categories
• When your data is split into training and test sets, there may be categories that are represented in the training set but not in the test set, and vice versa.

New Categories in Test Set
• If you have expanded your categorical variable into a group of binary indicator columns, one per category, then new categories in the test set should not cause any problems. Example: if you expand a categorical variable (cat, dog) into “cat” and “dog” indicator columns and your test set has a “rat” in it, then the value in each of those columns will be 0 — neither cat nor dog (see the sketch below).
• If the algorithm you are using handles the categorical levels directly (e.g. Random Forest), then you may want to add an “Other” category that all new categories will be grouped into.
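A minimal sketch of the indicator-column behavior described above, using scikit-learn's OneHotEncoder with handle_unknown="ignore":

```python
# Minimal sketch: an unseen level such as "rat" is encoded as all zeros.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["cat"], ["dog"], ["cat"]])            # training levels: cat, dog

print(enc.transform([["cat"], ["rat"]]).toarray())
# [[1. 0.]
#  [0. 0.]]  <- "rat" was never seen in training, so both indicators are 0
```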
Missing Data
6 of 10
Top 10 Data Science Practitioner Pitfalls
6. Missing Data

Types of Missing Data
• Unavailable: valid for the observation, but not available in the data set.
• Removed: an observation quality threshold may not have been reached, and the data was removed.
• Not applicable: the measurement does not apply to the particular observation (e.g. number of tires on a boat observation).

What to Do
• It depends! Some options:
• Ignore the entire observation.
• Create a binary variable for each predictor to indicate whether the data was missing or not (see the sketch below).
• Segment the model based on data availability.
• Use an alternative algorithm: decision trees accept missing values; linear models typically do not.
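A minimal sketch of two of the options above (per-predictor missingness indicators plus simple imputation) on a tiny hypothetical table; the column names are illustrative.

```python
# Minimal sketch: record missingness, then impute (mode for a categorical
# column, median for a numeric one).
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "blue", "red"],
                   "age":   [34.0, np.nan, 52.0, 41.0]})

# Binary indicator per predictor recording which values were originally missing.
for col in ["color", "age"]:
    df[col + "_missing"] = df[col].isna().astype(int)

df["color"] = df["color"].fillna(df["color"].mode()[0])   # mode imputation
df["age"] = df["age"].fillna(df["age"].median())          # median imputation
print(df)
```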
Outliers
7 of 10
Top 10 Data Science Practitioner Pitfalls
7. Outliers

Types of Outliers
• Outliers can exist in the response or in the predictors.
• Valid outliers: rare, extreme events.
• Invalid outliers: erroneous measurements.

What Can Happen
• Outlier values can have a disproportionate weight on the model.
• MSE will focus on handling outlier observations more, in order to reduce the squared error.
• Boosting will spend considerable modeling effort fitting these observations.

What to Do
• Remove the observations.
• Apply a transformation to reduce their impact: e.g. log or bins.
• Choose a loss function that is more robust: e.g. MAE vs. MSE.
• Impose a constraint on the data range (cap values).
• Ask questions: understand whether the values are valid or invalid, to make the most appropriate choice (see the sketch below).
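A minimal sketch of three of the remedies above (a log transform, capping at the 1st/99th percentiles, and a loss that is more robust than squared error) on a hypothetical income column:

```python
# Minimal sketch: transform, cap (winsorize), and pick a robust loss.
import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor

rng = np.random.RandomState(42)
df = pd.DataFrame({"income": np.append(rng.lognormal(10, 1, 999), 1e9)})  # one extreme outlier

df["log_income"] = np.log1p(df["income"])                  # transformation

lo, hi = df["income"].quantile([0.01, 0.99])               # cap the data range
df["income_capped"] = df["income"].clip(lower=lo, upper=hi)

model = HuberRegressor()   # Huber loss is far less sensitive to outliers than MSE
```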
Data Leakage
8 of 10
Top 10 Data Science Practitioner Pitfalls
8. Data Leakage

What Is It
• Leakage is allowing your model to use information that will not be available in a production setting.
• Obvious example: using the Dow Jones daily gain/loss as part of a model to predict individual stock performance (even if that symbol is not part of the Dow).

What Happens
• The model is overfit.
• It will make predictions inconsistent with those you scored when fitting the model (even with a validation set).
• Insights derived from the model will be incorrect.

What to Do
• Understand the nature of your problem and data.
• Scrutinize model feedback, such as relative influence or linear coefficients (see the sketch below).
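A minimal sketch of that kind of scrutiny: a deliberately leaky feature is added to synthetic data, and it immediately dominates the variable importances — a single overwhelmingly important feature (or a near-perfect score) is a classic leakage symptom.

```python
# Minimal sketch: a feature that duplicates the response ("leaky") swamps the
# importances of every legitimate predictor.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])
X["leaky"] = y                                    # information unavailable in production

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(pd.Series(model.feature_importances_, index=X.columns)
        .sort_values(ascending=False).head(5))
```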
Useless Models
9 of 10
Top 10 Data Science Practitioner Pitfalls
9. Useless Models

What is a “Useless” Model?
• Solving the wrong problem.
• Not collecting appropriate data.
• Not structuring the data correctly to solve the problem.
• Choosing a target/loss measure that does not optimize the end use case: e.g. using accuracy to prioritize resources.
• Having a model that is not actionable.
• Using a complicated model that is less accurate than a simple model.

What To Do
• Understand the problem statement.
• Solving the wrong problem is an issue in all problem-solving domains, but it is arguably easier to do with the black-box techniques common in ML.
• Utilize post-processing measures.
• Create simple baseline models to understand the lift of more complex models (see the sketch below).
• Plan on an iterative approach: start quickly, even if on imperfect data.
• Question your models and attempt to understand them.
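A minimal sketch of baseline comparison on synthetic data; the estimators and the accuracy metric are illustrative choices, and in practice the metric should match the end use case.

```python
# Minimal sketch: if the complex model barely beats a majority-class predictor
# or a plain logistic regression, it may not be worth the added complexity.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
for name, est in [("majority class", DummyClassifier(strategy="most_frequent")),
                  ("logistic regression", LogisticRegression(max_iter=1000)),
                  ("gradient boosting", GradientBoostingClassifier(random_state=42))]:
    scores = cross_val_score(est, X, y, cv=5, scoring="accuracy")
    print("%-20s accuracy: %.3f" % (name, scores.mean()))
```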
No Free Lunch
10 of 10
Top 10 Data Science Practitioner Pitfalls
10. No Free Lunch

No Such Thing as a Free Lunch
• There is no general-purpose algorithm that solves all problems.
• There is no single right answer for optimal data preparation.
• General heuristics are not always true:
  • Tree models solve problems equivalently under any order-preserving transformation.
  • Decision trees and neural networks will automatically find interactions.
  • A high number of predictors may be handled, but may lead to a less optimal result than fewer key predictors.
• Models cannot find relative information that spans multiple observations.
• Model feedback can be misleading: relative influence, linear coefficients.

What To Do
• Understand how the underlying algorithms operate.
• Try several algorithms and observe relative performance and the characteristics of your data.
• Feature engineering & feature selection (see the sketch below).
• Interpret and react to model feedback.
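A minimal sketch of one concrete form of feature selection; the univariate scoring and the choice of k are illustrative assumptions and should themselves be validated.

```python
# Minimal sketch: keep the k predictors most associated with the response and
# compare downstream performance against the full feature set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=2000, n_features=50, n_informative=5, random_state=42)

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_selected = selector.transform(X)
print("Selected feature indices:", selector.get_support(indices=True))
```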
Where to learn practical tips?
• WinVector Blog (Nina Zumel & John Mount): http://win-vector.com/blog
• Practical Data Science with R (book by Nina Zumel & John Mount): https://www.manning.com/books/practical-data-science-with-r
• The Elements of Statistical Learning (book by Trevor Hastie, Robert Tibshirani & Jerome Friedman): http://statweb.stanford.edu/~tibs/ElemStatLearn
• Machine Learning Mastery Blog (Jason Brownlee): http://machinelearningmastery.com
Where to learn more about H2O?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com
H2O World — Customers, Community, Evangelists
November 9, 10, 11
Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity
Erin LeDell — @ledell on Twitter, GitHub
erin@h2o.ai
http://www.stat.berkeley.edu/~ledell

Mark Landry — @Mark_a_Landry on Twitter
mark@h2o.ai
