Statistics in the age of data science,
issues you can and can not ignore
John Mount
(data scientist, not a statistician)
Win-Vector LLC
http://www.win-vector.com/
Session: http://conf.dato.com/speakers/dr-john-mount/
These slides, all data and code: http://winvector.github.io/DS/
1
This talk
•Our most important data science tools are our theories and methods. Let
us look a bit at their fundamentals.
•Large data gives the apparent luxury of wishing away some common model
evaluation biases (instead of needing to apply the traditional statistical
corrections).
•Conversely, to work agilely data scientists must act as if a few naive
axioms of statistical inference were true (though they are not).
•I will point out some common statistical issues that do and do not remain
critical problems when you are data rich.
•I will concentrate on the simple case of supervised learning.
2
Outline
•(quick) What is data science?
•How can that work?
•An example critical task that gets easier when you have more data.
•What are some of the folk axioms of data science?
•How to design bad data.
3
What is Data Science: my position
•Data science is the continuation of data engineering and predictive analytics.
•More data allows domain naive models to perform well.
•See: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig,
and Fernando Pereira, IEEE Intelligent Systems, vol. 24, no. 2, 2009, pp. 8-12.
•Emphasis on prediction over harder statistical problems such as coefficient
inference.
•Strong preference for easy procedures that become more statistically reliable as
you accumulate more data.
•Reliance on strong black-box tools.
4
Why does data science
work at all?
5
Complicated domain theory doesn’t always
preclude easily observable statistical signals
6
“Maybe the molecule didn’t go to
graduate school.”
(Will Welch defending the success of his approximate
molecular screening algorithm, given he is a computer scientist
and not a chemist.)
Example approximate docking (in this case using SLIDE
approximation, not Welch et al.’s Hammerhead).
“Database Screening for HIV Protease Ligands: The
Influence of Binding-Site Conformation and
Representation on Ligand Selectivity”, Volker Schnecke,
Leslie A. Kuhn, Proceedings of the Seventh International
Conference on Intelligent Systems for Molecular
Biology, Pages 242-251, AAAI Press, 1999.
A lot of deep statistics is about how to work
correctly with small data sets
•From Statistics, 4th edition, David Freedman,
Robert Pisani, Roger Purves, Norton, 2007.
7
What is a good example of a
critical task that becomes easier
when you have more data?
8
Model assessment
•Estimating the performance of your predictive model on future instances
•A critical task
•Gets easier when you have more data:
•Don’t need to rely on statistical adjustments
•Can reserve a single sample of data as a held-out test set (see “The Elements of
Statistical Learning” 2nd edition section 7.2)
•Computationally cheaper than:
•leave k-out cross validation
•k-fold cross validation
9
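The arithmetic behind “reserve a single held-out test set” is easy to demonstrate. A minimal Python sketch (the data generator and threshold model are invented stand-ins for illustration, not from the talk):

```python
import random

random.seed(0)

def make_data(n):
    """Synthetic rows: feature x is uniform; label is a noisy threshold of x."""
    rows = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:   # 10% label noise
            y = 1 - y
        rows.append((x, y))
    return rows

train = make_data(10000)
test = make_data(10000)   # held out: never consulted while fitting

def accuracy(rows, t):
    return sum(int(x > t) == y for x, y in rows) / len(rows)

# "Fit": choose the threshold that maximizes training accuracy.
best_t = max((i / 100 for i in range(100)), key=lambda t: accuracy(train, t))

train_acc = accuracy(train, best_t)
test_acc = accuracy(test, best_t)   # the honest estimate of future performance
```

With this much data the held-out accuracy lands within a fraction of a percent of the true future performance, with no statistical adjustment needed.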
Let’s review these terms
10
Statistical Adjustments
•Attempt to estimate the value of a statistic or the performance of a model
on new data using only training data.
•Examples:
•Sample size adjustment for variance: writing s² = Σᵢ(xᵢ − x̄)²/(n − 1)
instead of Σᵢ(xᵢ − x̄)²/n (Bessel’s correction)
•“adjusted R-squared”, in-sample p-values, “AIC”, “BIC”, …
•Pointless to adjust in training sample quantities when you have enough
data to try and estimate out of sample quantities directly (cross validation
methods, train/test methods, or bootstrap methods).
11
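Bessel’s correction, the sample-size adjustment named above, can be checked by simulation. A short sketch (synthetic normal samples, assumed only for illustration):

```python
import random
import statistics

random.seed(1)

true_var = 1.0          # variance of the standard normal population
n, trials = 5, 20000    # small samples make the bias visible

naive, bessel = [], []
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = statistics.fmean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    naive.append(ss / n)          # biased: underestimates by factor (n-1)/n
    bessel.append(ss / (n - 1))   # Bessel's correction: unbiased

mean_naive = statistics.fmean(naive)     # concentrates near 0.8 = (n-1)/n
mean_bessel = statistics.fmean(bessel)   # concentrates near 1.0
```

Dividing by n − 1 rather than n exactly cancels the underestimate introduced by centering on the sample mean, which is what these in-sample adjustments are trying to do without new data.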
Leave k-out, k-way, and k-fold scoring
12
Train/Test split
Training set
Test set
13
Train/Test split continued
•Statistically inefficient
•Blocking issue for small data sets
•Largely irrelevant for large data sets
•Considered “cross validation done wrong” by some statisticians
•Cross validation techniques in fact estimate the quality of the fitting
procedure, not the quality of the final returned model.
•Test or held-out procedures directly estimate the performance of the
actual model built.
•Preferred by some statisticians and most data scientists.
14
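The distinction between scoring the fitting procedure (cross validation) and scoring the one model you ship (held-out test) can be sketched as follows; the toy through-the-origin regression is hypothetical:

```python
import random
import statistics

random.seed(2)

# Toy regression data: y = 2x + noise (synthetic, for illustration only).
data = []
for _ in range(600):
    x = random.random()
    data.append((x, 2 * x + random.gauss(0, 0.1)))

def fit(rows):
    """Fitting procedure: least-squares slope through the origin."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def mse(rows, slope):
    return statistics.fmean((y - slope * x) ** 2 for x, y in rows)

# k-fold cross validation scores the *procedure*: k different fitted models.
k = 5
folds = [data[i::k] for i in range(k)]
cv_scores = []
for i in range(k):
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    cv_scores.append(mse(folds[i], fit(train)))
cv_estimate = statistics.fmean(cv_scores)

# A train/test split scores the *one model you actually ship*.
train, test = data[:400], data[400:]
holdout_estimate = mse(test, fit(train))
```

Here `cv_estimate` averages the error of five models that are each discarded, while `holdout_estimate` refers to the single returned model; with enough data both land near the noise floor.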
Back to data science
15
Data scientists rush in where
statisticians fear to tread
•Large data sets
•Wide data sets
•Heterogeneous variables
•Collinear variables
•Noisy dependent variables
•Noisy independent variables
16
We have to admit:
data scientists are a flourishing species
•Must be something to be
learned from that.
•What axioms (true or
false) would be needed to
explain their success?
17
Data science axiom wish list
•Just a few:
•Wish more data were always better.
•Wish more variables were always better.
•Wish you could retain some fraction of training
performance on future model application.
18
Is more data always better?
•In theory: yes (you could always simulate having less
data by throwing some away).
•In practice: almost always yes.
•Absolutely for every algorithm every time? no.
19
More data can be bad for a fixed procedure
(artificial example)
•Statistics / machine learning algorithms that depend
on re-sampling to supply diversity can degrade in the
presence of extra data.
•Case in point: random forest over shallow trees can
lose tree diversity (especially when there are
duplicate or near-duplicate variables).
20
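The diversity that re-sampling supplies can be quantified with a quick sketch (pure Python, not from the talk): a bootstrap resample of n rows contains only about 63% of the distinct rows, and the omitted 37% is what makes bagged trees differ from one another.

```python
import random

random.seed(3)

# Each bagged tree trains on a bootstrap resample: n draws with replacement
# from n rows. The rows each tree *misses* are what make the trees differ.
n = 10000
resample = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(resample)) / n   # concentrates near 1 - 1/e ≈ 0.632
```

Duplicate rows or near-duplicate variables shrink this effective diversity: two trees grown on different resamples can still end up asking the same questions, which is the mechanism behind the degradation described on this slide.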
Random forest example
•A data set where a random forest
over shallow trees shows lower
median accuracy on test data as we
increase training data set size.
•(synthetic data set designed to hurt
random forest; a logistic model stays
above 0.5 accuracy)
•All code/data:
http://winvector.github.io/DS/
21
Are more variables always better?
•In theory: yes.
•Consequence of the non-negativity of mutual information.
•Only true for training set performance, not performance on future instances.
•In practice: often.
•In fact: ridiculously easy to break:
•Noise variables
•Collinear variables
•Near constant variables
•Overfit
22
To benefit from more variables
•Need at least a few of the following:
•Enough additional data to falsify additional columns.
•Regularization terms / useful inductive biases.
•Variance reduction / bagging schemes.
•Dependent variable aware pre-processing (variable selection, partial
least squares, word2vec; not principal components projection, which
ignores the dependent variable).
23
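One concrete way regularization buys this: a ridge penalty shrinks coefficients toward zero, so pure-noise variables do less damage. A minimal one-variable sketch (a single noise predictor and an arbitrary penalty λ = 100, both assumed for illustration):

```python
import random

random.seed(4)

# One pure-noise predictor: the true slope relating x to y is 0.
n = 50
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

beta_ols = sxy / sxx              # unregularized fit chases the noise
lam = 100.0                       # ridge penalty strength (arbitrary choice)
beta_ridge = sxy / (sxx + lam)    # shrunk toward 0, the true value
```

With λ = 0 this reduces to ordinary least squares; larger λ trades a little bias for less variance, which is what lets additional (possibly useless) columns in without blowing up overfit.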
Can’t we keep at least some
of our training performance?
•Common situation:
•Near perfect fit on training data.
•Model performs like random guessing on new
instances.
•Extreme overfit.
•One often hopes some regularized, ensemble,
or transformed version of such a model would
have at least some use on new instances.
24
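The “near perfect fit on training, random guessing on new instances” pattern is easy to reproduce. A hypothetical 1-nearest-neighbor sketch on labels generated independently of the features (a stand-in for the talk’s examples, not one of them):

```python
import random

random.seed(5)

def make(n):
    """Features are random; labels are generated independently of them."""
    return [([random.random() for _ in range(5)], random.randrange(2))
            for _ in range(n)]

train, test = make(200), make(200)

def nn_predict(rows, x):
    """1-nearest-neighbor: pure memorization of the training set."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(rows, key=lambda row: dist(row[0], x))[1]

train_acc = sum(nn_predict(train, x) == y for x, y in train) / len(train)
test_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)
# train_acc is exactly 1.0 (each point is its own nearest neighbor);
# test_acc hovers around 0.5, i.e. chance.
```

No amount of post-hoc ensembling or transformation of a pure memorizer recovers signal that was never there.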
25
Not the case
•For at least the following
common popular machine
learning algorithms we can
design a simple data set
where we get arbitrarily high
accuracy on training even
when the dependent variable
is generated completely
independently of all of the
independent variables.
•Decision Trees
•Logistic Regression
•Elastic Net Logistic Regression
•Gradient Boosting
•Naive Bayes
•Random Forest
•Support Vector Machine
(All code/data:
http://winvector.github.io/DS/ )
Lesson
•Can’t just tack on cross validation late in a machine
learning algorithm’s design (or just use it to pick a few
calibration parameters).
•Have to express model quality in terms of out of sample
data throughout.
•And must understand some fraction of the above
measurements will still be chimeric and falsely optimistic
(statistical significance re-enters the picture).
26
27
How did we design the
counterexamples?
•A lot of common machine learning algorithms fail in the
presence of:
•Noise variables
•Duplicate examples
•Serial correlation
•Incompatible scales
•Punchline: all these things are common in typical
under-curated real world data!
The analyst themselves can be a source of
additional exotic “can never happen” biases
•Neighboring duplicate and near-duplicate rows (bad
join/sessionizing logic).
•Features with activation patterns depending on the size of the
training set (opportunistic feature generation/selection).
•Leakage of facts about evaluation set through repeated scoring
(see “wacky boosting” by Moritz Hardt, which gives a reliable
procedure to place high on Kaggle leaderboards without even
looking at the data).
28
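The leaderboard-leakage point can be made concrete with a small simulation in the spirit of Hardt’s “wacky boosting” (a simplified sketch, not his exact procedure; all data here is synthetic):

```python
import random

random.seed(6)

# Hidden evaluation labels the analyst never sees directly.
n = 100
truth = [random.randrange(2) for _ in range(n)]

def leaderboard(pred):
    """The only channel to the held-out labels: an accuracy score."""
    return sum(p == t for p, t in zip(pred, truth)) / n

# Submit many random guesses, keep those the leaderboard likes,
# then take a per-position majority vote of the keepers.
kept = [pred for pred in
        ([random.randrange(2) for _ in range(n)] for _ in range(1000))
        if leaderboard(pred) > 0.5]

vote = [int(sum(col) > len(kept) / 2) for col in zip(*kept)]
boosted = leaderboard(vote)   # well above 0.5, using no data at all
```

Each kept submission leaks a few bits about the hidden labels; the majority vote aggregates those bits into a high score that says nothing about real predictive power, which is why repeated scoring against a fixed evaluation set is itself a bias source.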
Conclusions
•Data scientists, statisticians, and domain experts all see things differently.
•Data science emphasizes procedures that are conceptually easy and become more
correct when scaled to large data. Such procedures can seem overly ambitious and
pandering to domain/business needs.
•Statistics emphasizes procedures that are correct at all data scales, including difficult
small data problems. Such procedures can seem overly doctrinal and insensitive to
original domain/business needs.
•Domain experts/scientists value correctness and foundation over implementability.
•An effective data science team must work agilely, understand statistics, and develop
domain empathy.
•We need a deeper science of structured back-testing.
29
Thank you
30
For more, please check out my book,
or contact me at win-vector.com
Also follow our group on our blog
http://www.win-vector.com/blog/ or on
Twitter @WinVectorLLC

Sri Ambati
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

Statistics in the age of data science, issues you can not ignore

  • 1. Statistics in the age of data science, issues you can and can not ignore John Mount (data scientist, not a statistician) Win-Vector LLC http://www.win-vector.com/ Session: http://conf.dato.com/speakers/dr-john-mount/ These slides, all data and code: http://winvector.github.io/DS/ 1
  • 2. This talk •Our most important data science tools are our theories and methods. Let us look a bit at their fundamentals. •Large data gives the apparent luxury of wishing away some common model evaluation biases (instead of needing to apply the traditional statistical corrections). •Conversely, to work agilely data scientists must act as if a few naive axioms of statistical inference were true (though they are not). •I will point out some common statistical issues that do and do not remain critical problems when you are data rich. •I will concentrate on the simple case of supervised learning. 2
  • 3. Outline •(quick) What is data science? •How can that work? •An example critical task that gets easier when you have more data. •What are some of the folk axioms of data science? •How to design bad data. 3
  • 4. What is Data Science: my position •Data science is the continuation of data engineering and predictive analytics. •More data allows domain-naive models to perform well. •See: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira, IEEE Intelligent Systems 24(2), 2009, pp. 8-12. •Emphasis on prediction over harder statistical problems such as coefficient inference. •Strong preference for easy procedures that become more statistically reliable as you accumulate more data. •Reliance on strong black-box tools. 4
  • 5. Why does data science work at all? 5
  • 6. Complicated domain theory doesn’t always preclude easily observable statistical signals 6 “Maybe the molecule didn’t go to graduate school.” (Will Welch defending the success of his approximate molecular screening algorithm, given he is a computer scientist and not a chemist.) Example approximate docking (in this case using SLIDE approximation, not Welch et al.’s Hammerhead). “Database Screening for HIV Protease Ligands: The Influence of Binding-Site Conformation and Representation on Ligand Selectivity”, Volker Schnecke, Leslie A. Kuhn, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Pages 242-251, AAAI Press, 1999.
  • 7. A lot of deep statistics is about how to work correctly with small data sets •From Statistics 4th edition, David Freedman, Robert Pisani, Roger Purves, Norton, 2007. 7
  • 8. What is a good example of a critical task that becomes easier when you have more data? 8
  • 9. Model assessment •Estimating the performance of your predictive model on future instances •A critical task •Gets easier when you have more data: •Don’t need to rely on statistical adjustments •Can reserve a single sample of data as a held-out test set (see “The Elements of Statistical Learning” 2nd edition section 7.2) •Computationally cheaper than: •leave k-out cross validation •k-fold cross validation 9
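A minimal sketch of the held-out test set idea from this slide, stdlib-only (the data, threshold model, and variable names are illustrative, not from the talk): reserve a single test sample, score the fitted model on it once, and note that with a large test set the binomial standard error of the accuracy estimate is already tight, with no statistical adjustment needed.

```python
import random

random.seed(1)

# Illustrative data: label is 1 exactly when x > 0.5, a learnable concept.
xs = [random.random() for _ in range(1000)]
ys = [1 if x > 0.5 else 0 for x in xs]

# Reserve a single held-out test set; fit only on the training portion.
split = int(0.7 * len(xs))
train_x, train_y = xs[:split], ys[:split]
test_x, test_y = xs[split:], ys[split:]

# Tiny "model": threshold at the training-set mean of x.
threshold = sum(train_x) / len(train_x)
def predict(x):
    return 1 if x > threshold else 0

# Held-out accuracy, plus its binomial standard error: with a large test
# set the estimate is tight without any in-sample correction.
n_test = len(test_x)
acc = sum(predict(x) == y for x, y in zip(test_x, test_y)) / n_test
se = (acc * (1 - acc) / n_test) ** 0.5
```

The standard error shrinks like 1/sqrt(n_test), which is why a single held-out sample suffices once you are data rich.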
  • 11. Statistical Adjustments •Attempt to estimate the value of a statistic or the performance of a model on new data using only training data. •Examples: •Sample size adjustment for variance (Bessel’s correction): writing s² = (1/(n−1)) Σᵢ(xᵢ − x̄)² instead of (1/n) Σᵢ(xᵢ − x̄)² •“adjusted R-squared”, in-sample p-values, “AIC”, “BIC”, … •Pointless to adjust in-training-sample quantities when you have enough data to try and estimate out-of-sample quantities directly (cross validation methods, train/test methods, or bootstrap methods). 11
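The sample-size adjustment for variance mentioned here is Bessel's correction: divide by n−1 instead of n. A small stdlib-only simulation makes the bias visible: averaged over many small samples from a variance-1 distribution, the 1/n estimator comes in around (n−1)/n while the 1/(n−1) estimator comes in around 1.

```python
import random
import statistics

random.seed(0)

TRIALS, N = 20000, 5  # many small samples from a variance-1 distribution

biased, unbiased = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

mean_biased = sum(biased) / TRIALS      # expect about (n-1)/n = 0.8
mean_unbiased = sum(unbiased) / TRIALS  # expect about 1.0
```

The gap between the two estimators is exactly the kind of correction that matters at n = 5 and becomes negligible when n is large.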
  • 12. Leave k-out, k-way, and k-fold scoring 12
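The k-fold scheme named on this slide can be sketched in a few lines of stdlib Python (the function name is illustrative): partition the row indices into k disjoint folds, and let each fold serve as the test set exactly once.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds; yield (train, test) pairs.

    Each fold serves as the test set exactly once; the other k-1 folds
    form the training set, so every row is tested exactly once overall.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 3))
```

One model is fit per fold and then thrown away; only the pooled out-of-fold scores are kept, which is why cross validation scores the fitting procedure rather than any single model.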
  • 14. Train/Test split continued •Statistically inefficient •Blocking issue for small data sets •Largely irrelevant for large data sets •Considered “cross validation done wrong” by some statisticians •Cross validation techniques in fact estimate the quality of the fitting procedure, not the quality of the final returned model. •Test or held-out procedures directly estimate the performance of the actual model built. •Preferred by some statisticians and most data scientists. 14
  • 15. Back to data science 15
  • 16. Data scientists rush in where statisticians fear to tread •Large data sets •Wide data sets •Heterogeneous variables •Colinear variables •Noisy dependent variables •Noisy independent variables 16
  • 17. We have to admit: data scientists are a flourishing species •Must be something to be learned from that. •What axioms (true or false) would be needed to explain their success? 17
  • 18. Data science axiom wish list •Just a few: •Wish more data was always better. •Wish more variables were always better. •Wish you can retain some fraction of training performance on future model application. 18
  • 19. Is more data always better? •In theory: yes (you could always simulate having less data by throwing some away). •In practice: almost always yes. •Absolutely for every algorithm every time? no. 19
  • 20. More data can be bad for a fixed procedure (artificial example) •Statistics / machine learning algorithms that depend on re-sampling to supply diversity can degrade in the presence of extra data. •Case in point: random forest over shallow trees can lose tree diversity (especially when there are duplicate or near-duplicate variables). 20
  • 21. Random forest example •A data set where a random forest over shallow trees shows lower median accuracy on test data as we increase training data set size. •(synthetic data set designed to hurt random forest, logistic model passes 0.5 accuracy) •All code/data: http://winvector.github.io/DS/ 21
  • 22. Are more variables always better? •In theory: yes. •Consequence of the non-negativity of mutual information. •Only true for training set performance, not performance on future instances. •In practice: often. •In fact: ridiculously easy to break: •Noise variables •Collinear variables •Near constant variables •Overfit 22
  • 23. To benefit from more variables •Need at least a few of the following: •Enough additional data to falsify additional columns. •Regularization terms / useful inductive biases. •Variance reduction / bagging schemes. •Dependent variable aware pre-processing (variable selection, partial least squares, word2vec, and not principal components projection). 23
  • 24. Can’t we keep at least some of our training performance? •Common situation: •Near perfect fit on training data. •Model performs like random guessing on new instances. •Extreme overfit. •One often hopes some regularized, ensemble, or transformed version of such a model would have at least some use on new instances. 24
  • 25. Not the case •For at least the following common popular machine learning algorithms we can design a simple data set where we get arbitrarily high accuracy on training even when the dependent variable is generated completely independently of all of the independent variables. •Decision Trees •Logistic Regression •Elastic Net Logistic Regression •Gradient Boosting •Naive Bayes •Random Forest •Support Vector Machine (All code/data: http://winvector.github.io/DS/ ) 25
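The talk's own counterexamples for the listed algorithms are in the linked code; as a minimal stdlib-only illustration of the same phenomenon (using a memorizing 1-nearest-neighbor learner, which is not one of the algorithms listed), labels generated independently of the features still yield perfect training accuracy while test accuracy sits at chance.

```python
import random

random.seed(2)

def make_data(n):
    # Features and labels drawn independently: there is nothing to learn.
    xs = [random.random() for _ in range(n)]
    ys = [random.randint(0, 1) for _ in range(n)]
    return xs, ys

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

def predict_1nn(x):
    # 1-nearest-neighbor: memorize the training set outright.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

train_acc = sum(predict_1nn(x) == y for x, y in zip(train_x, train_y)) / 200
test_acc = sum(predict_1nn(x) == y for x, y in zip(test_x, test_y)) / 200
```

Training accuracy is exactly 1.0 (every training point is its own nearest neighbor), yet no accuracy survives to new instances — the training score carries no information at all.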
  • 26. Lesson •Can’t just tack on cross validation late in a machine learning algorithm’s design (or just use it to pick a few calibration parameters). •Have to express model quality in terms of out of sample data throughout. •And must understand some fraction of the above measurements will still be chimeric and falsely optimistic (statistical significance re-enters the picture). 26
  • 27. How did we design the counterexamples? •A lot of common machine learning algorithms fail in the presence of: •Noise variables •Duplicate examples •Serial correlation •Incompatible scales •Punchline: all these things are common in typical under-curated real-world data! 27
  • 28. The analyst themself can be a source of additional exotic “can never happen” biases •Neighboring duplicate and near-duplicate rows (bad join/sessionizing logic). •Features with activation patterns depending on the size of the training set (opportunistic feature generation/selection). •Leakage of facts about evaluation set through repeated scoring (see “wacky boosting” by Moritz Hardt, which gives a reliable procedure to place high on Kaggle leaderboards without even looking at the data). 28
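The leaderboard-leakage point can be simulated with stdlib Python alone. This is a sketch of the "wacky boosting" idea from the cited Hardt post, under the simplifying assumption that the leaderboard reports exact holdout accuracy: submit random guesses, keep the ones that happen to score above 1/2, and majority-vote the keepers, never looking at any feature data.

```python
import random

random.seed(3)

N = 100       # holdout rows the leaderboard scores against
QUERIES = 600 # number of random submissions

secret = [random.randint(0, 1) for _ in range(N)]  # hidden holdout labels

def leaderboard(pred):
    # The only thing a competitor sees: holdout accuracy of a submission.
    return sum(p == s for p, s in zip(pred, secret)) / N

# "Wacky boosting": keep any random guess that scores above 1/2,
# then majority-vote the keepers.
kept = []
for _ in range(QUERIES):
    guess = [random.randint(0, 1) for _ in range(N)]
    if leaderboard(guess) > 0.5:
        kept.append(guess)

votes = [sum(g[i] for g in kept) for i in range(N)]
final = [1 if v * 2 > len(kept) else 0 for v in votes]
final_score = leaderboard(final)  # well above 0.5, purely from leakage
```

Each leaderboard query leaks roughly one bit about the holdout labels, so with enough queries the "model" climbs the leaderboard while predicting nothing about future data.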
  • 29. Conclusions •Data scientists, statisticians, and domain experts all see things differently. •Data science emphasizes procedures that are conceptually easy and become more correct when scaled to large data. Its procedures can seem overly ambitious, and can seem to pander to domain/business needs. •Statistics emphasizes procedures that are correct at all data scales, including difficult small-data problems. Its procedures can seem overly doctrinal, and insensitive to the original domain/business needs. •Domain experts/scientists value correctness and foundations over implementability. •An effective data science team must work agilely, understand statistics, and develop domain empathy. •We need a deeper science of structured back-testing. 29
  • 30. Thank you 30 For more, please check out my book, or contact me at win-vector.com Also follow our group on our blog http://www.win-vector.com/blog/ or on Twitter @WinVectorLLC

Editor's Notes

  1. http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/35179.pdf
  2. Clearly we are ignoring some important domain science issues and statistical science issues, so how does data science work?
  3. http://www.aaai.org/Papers/ISMB/1999/ISMB99-028.pdf You may not get the whole story, but you may not miss the whole story.
  4. Ch. 26 page 493. Statistical efficiency is a huge worry when you don’t have a lot of data.
  5. “The Elements of Statistical Learning” 2nd edition section 7.2 page 222. https://en.wikipedia.org/wiki/Cross-validation_(statistics)
  6. Does not matter when n is large. Adjustments can actually be quite complicated and require a lot of background to apply correctly. Prefer tools like the PRESS statistic to adjusted R-squared. Can use the training mean against out-of-sample instances, and so on.
  7. The k-fold and leave-k-out cross-validation methods are a procedural alternative. Shown: 3-fold cross validation. We try to simulate the performance of a model on new data by never applying a model to any data used to construct it. Which cross-validation scheme you are using determines the pattern of arrows. Common to all schemes: there are many throw-away models. The larger each training split, the more the fold models behave like a model trained on all of the available data.
  8. Test/train split is an easier alternative that is less statistically efficient and depends on having good tools (that themselves cross-validate) during the training phase. The test set is held secret during model construction, tuning, and even early evaluation. Scoring in the training subset may in fact itself use both cross-validation and train/test sub-split methods. The actual model produced is scored on the test set (though some data scientists re-train on the entire data set as a final “model polish” step).
  9. Splitting your available data into train and test is a way to try and simulate the arrival of future data. Like any simulation, it may fail. Controlled experiments are prospective designs that are somewhat more expensive and somewhat more powerful than this.
  10. Data science is a bit looser than traditional statistical practice and moves a bit faster; what does that look like?
  11. Axioms that are true are true in the extreme.
  12. Random forest is a dominant machine learning algorithm in practice. This is a problem where logistic regression gets 85% accuracy as n increases, and the concept is reachable by the random forest model.
  13. Note: collinear variables, while damaging to prediction, are nowhere near as large a hazard to prediction as they are to coefficient inference. And classic “x alone” methods of dealing with them become problematic in so-called “wide data” situations.
  14. Principal components analysis is an “independent variable only” or “x alone” transform: a good idea over curated homogeneous variables, not good over wild wide datasets. word2vec ( https://code.google.com/p/word2vec/ ) can be considered not “x alone”, as it presumably retains concept clusters from the grouping of its training source (typically GoogleNews or Freebase naming); so it has a “y” (just not your “y”).
  15. I.e. we see an arbitrarily good model on training, even when no model is possible. We have also sometimes seen a reversal: the model is significantly worse than random on the test set. Being worse than random is likely a minor distribution change from training to test. The observed statistical significance is likely due to some process causing dependence between rows in a limited window (like serial correlation or bad sessionizing), and is not evidence of anything usable. These methods should be familiar: they are most of the supervised learning algorithms from everybody’s “top 10 data mining” or “top 10 Kaggle algorithms” lists.
  16. Gradient boosting typically uses cross-validation to pick the number of trees, zero trees being no model.
  17. Some of these problems even break test/train exchangeability, one of the major justifications of machine learning.
  18. http://blog.mrtz.org/2015/03/09/competition.html
  19. It is equally arrogant to completely ignore domain science as it is to believe you can always quickly become a domain expert. http://www.win-vector.com/blog/2014/05/a-bit-of-the-agenda-of-practical-data-science-with-r/