Statistics in the age of data science,
issues you can and can not ignore
John Mount
(data scientist, not a statistician)
Win-Vector LLC
http://www.win-vector.com/
Session: http://conf.dato.com/speakers/dr-john-mount/
These slides, all data and code: http://winvector.github.io/DS/
1
This talk
•Our most important data science tools are our theories and methods. Let
us look a bit at their fundamentals.
•Large data gives the apparent luxury of wishing away some common model
evaluation biases (instead of needing to apply the traditional statistical
corrections).
•Conversely, to work agilely data scientists must act as if a few naive
axioms of statistical inference were true (though they are not).
•I will point out some common statistical issues that do and do not remain
critical problems when you are data rich.
•I will concentrate on the simple case of supervised learning.
2
Outline
•(quick) What is data science?
•How can that work?
•An example critical task that gets easier when you have more data.
•What are some of the folk axioms of data science?
•How to design bad data.
3
What is Data Science: my position
•Data science is the continuation of data engineering and predictive analytics.
•More data allows domain naive models to perform well.
•See: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig,
and Fernando Pereira, IEEE Intelligent Systems, vol. 24, no. 2, 2009, pp. 8-12.
•Emphasis on prediction over harder statistical problems such as coefficient
inference.
•Strong preference for easy procedures that become more statistically reliable as
you accumulate more data.
•Reliance on strong black-box tools.
4
Why does data science
work at all?
5
Complicated domain theory doesn’t always
preclude easily observable statistical signals
6
“Maybe the molecule didn’t go to
graduate school.”
(Will Welch defending the success of his approximate
molecular screening algorithm, given he is a computer scientist
and not a chemist.)
Example approximate docking (in this case using SLIDE
approximation, not Welch et al.’s Hammerhead).
“Database Screening for HIV Protease Ligands: The
Influence of Binding-Site Conformation and
Representation on Ligand Selectivity", Volker Schnecke,
Leslie A. Kuhn, Proceedings of the Seventh International
Conference on Intelligent Systems for Molecular
Biology, Pages 242-251, AAAI Press, 1999.
A lot of deep statistics is about how to work
correctly with small data sets
•From Statistics, 4th edition, David Freedman, Robert Pisani, Roger Purves, Norton, 2007.
7
What is a good example of a
critical task that becomes easier
when you have more data?
8
Model assessment
•Estimating the performance of your predictive model on future instances
•A critical task
•Gets easier when you have more data:
•Don’t need to rely on statistical adjustments
•Can reserve a single sample of data as a held-out test set (see “The Elements of
Statistical Learning” 2nd edition section 7.2)
•Computationally cheaper than:
•leave k-out cross validation
•k-fold cross validation
9
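
The held-out test idea above is simple enough to show in a few lines. The following is a minimal sketch, not the talk's actual code (that lives at http://winvector.github.io/DS/): scikit-learn on synthetic data, where the single train/test split is fit and scored once, while k-fold cross validation refits the model k times.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # Synthetic stand-in for a large data set.
    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    # Single held-out test set: one fit, one score.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))

    # k-fold cross validation: k fits, so roughly k times the compute.
    print("5-fold accuracies:",
          cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))
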
Let’s review these terms
10
Statistical Adjustments
•Attempt to estimate the value of a statistic or the performance of a model
on new data using only training data.
•Examples:
•Sample size adjustment for variance: writing s² = Σ(xᵢ − x̄)² / (n − 1) instead
of Σ(xᵢ − x̄)² / n (Bessel's correction)
•“adjusted R-squared”, in-sample p-values, “AIC”, “BIC”, …
•Pointless to adjust in-sample quantities when you have enough data to estimate
out-of-sample quantities directly (cross validation methods, train/test methods,
or bootstrap methods); a small sketch follows this slide.
11
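
As a hedged illustration of the last point (not from the talk; the synthetic data, scikit-learn, and parameter values are my assumptions): with plenty of rows, the in-sample adjusted R-squared adds little over simply measuring R-squared on held-out rows.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, p = 50_000, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only two columns matter

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    fit = LinearRegression().fit(X_tr, y_tr)

    r2_train = r2_score(y_tr, fit.predict(X_tr))
    n_tr = len(y_tr)
    adj_r2 = 1 - (1 - r2_train) * (n_tr - 1) / (n_tr - p - 1)  # in-sample adjustment
    r2_test = r2_score(y_te, fit.predict(X_te))                # direct out-of-sample estimate
    print(f"train R2 {r2_train:.3f}  adjusted R2 {adj_r2:.3f}  held-out R2 {r2_test:.3f}")
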
Leave k-out, k-way, and k-fold scoring
12
Train/Test split
(diagram: the available data is split into a training set and a held-out test set)
13
Train/Test split continued
•Statistically inefficient
•Blocking issue for small data sets
•Largely irrelevant for large data sets
•Considered “cross validation done wrong” by some statisticians
•Cross validation techniques in fact estimate the quality of the fitting
procedure, not the quality of the final returned model.
•Test or held-out procedures directly estimate the performance of the
actual model built.
•Preferred by some statisticians and most data scientists.
14
Back to data science
15
Data scientists rush in where
statisticians fear to tread
•Large data sets
•Wide data sets
•Heterogeneous variables
•Collinear variables
•Noisy dependent variables
•Noisy independent variables
16
We have to admit:
data scientists are a flourishing species
•Must be something to be
learned from that.
•What axioms (true or
false) would be needed to
explain their success?
17
Data science axiom wish list
•Just a few:
•Wish more data were always better.
•Wish more variables were always better.
•Wish you could retain some fraction of training performance on future model application.
18
Is more data always better?
•In theory: yes (you could always simulate having less
data by throwing some away).
•In practice: almost always yes.
•Absolutely for every algorithm every time? No.
19
More data can be bad for a fixed procedure
(artificial example)
•Statistics / machine learning algorithms that depend
on re-sampling to supply diversity can degrade in the
presence of extra data.
•Case in point: random forest over shallow trees can
lose tree diversity (especially when there are
duplicate or near-duplicate variables).
20
Random forest example
•A data set where a random forest
over shallow trees shows lower
median accuracy on test data as we
increase training data set size.
•(synthetic data set designed to hurt
random forest; a logistic model stays above
0.5 accuracy; a rough sketch of this style of
experiment follows this slide)
•All code/data:
http://winvector.github.io/DS/
21
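
The talk's actual construction is in the linked repository; what follows is only a rough sketch of the shape of such an experiment (the synthetic data, scikit-learn, and parameter values are my assumptions, and whether accuracy actually degrades depends on the details of the construction): a weak signal copied into many near-duplicate columns, a shallow-tree random forest, and a fixed test set scored as the training set grows.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)

    def make_data(n):
        x = rng.normal(size=n)
        y = (x + rng.normal(scale=2.0, size=n) > 0).astype(int)   # weak signal
        # many near-duplicate copies of the one informative variable
        X = np.column_stack([x + rng.normal(scale=0.01, size=n) for _ in range(20)])
        return X, y

    X_test, y_test = make_data(5_000)
    for n_train in (200, 2_000, 20_000):
        X_tr, y_tr = make_data(n_train)
        rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
        rf.fit(X_tr, y_tr)
        print(n_train, "test accuracy:", round(rf.score(X_test, y_test), 3))
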
Are more variables always better?
•In theory: yes.
•Consequence of the non-negativity of mutual information.
•Only true for training set performance, not performance on future instances.
•In practice: often.
•In fact: ridiculously easy to break:
•Noise variables
•Collinear variables
•Near constant variables
•Overfit
22
To benefit from more variables
•Need at least a few of the following:
•Enough additional data to falsify additional columns.
•Regularization terms / useful inductive biases.
•Variance reduction / bagging schemes.
•Dependent variable aware pre-processing (variable selection, partial
least squares, word2vec, and not principal components projection).
23
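
A hedged sketch of the regularization remedy (my own synthetic setup using scikit-learn, not the talk's code): a few informative columns buried in many noise columns easily break a nearly unregularized fit, while a regularized fit of the same family usually recovers much of the lost test accuracy.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n, p_signal, p_noise = 400, 5, 500
    X_signal = rng.normal(size=(n, p_signal))
    y = (X_signal.sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)
    X = np.hstack([X_signal, rng.normal(size=(n, p_noise))])   # append pure noise columns

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    weak = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)    # ~unregularized
    strong = LogisticRegression(C=0.1, max_iter=5000).fit(X_tr, y_tr)  # L2 regularized
    print("near-unregularized test accuracy:", round(weak.score(X_te, y_te), 3))
    print("regularized test accuracy:       ", round(strong.score(X_te, y_te), 3))
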
Can’t we keep at least some
of our training performance?
•Common situation:
•Near perfect fit on training data.
•Model performs like random guessing on new
instances.
•Extreme overfit.
•One often hopes some regularized, ensemble,
or transformed version of such a model would
have at least some use on new instances.
24
25
Not the case
•For at least the following
common popular machine
learning algorithms we can
design a simple data set
where we get arbitrarily high
accuracy on training even
when the dependent variable
is generated completely
independently of all of the
independent variables.
•Decision Trees
•Logistic Regression
•Elastic Net Logistic Regression
•Gradient Boosting
•Naive Bayes
•Random Forest
•Support Vector Machine
(All code/data:
http://winvector.github.io/DS/ )
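
The full constructions for all seven listed methods are in the linked repository; here is only a minimal sketch of the flavor of such a counterexample (my own synthetic setup, using scikit-learn): the outcome is a coin flip, independent of every input, yet training accuracy is essentially perfect.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(4)
    n, p = 200, 400                   # wide data: more columns than rows
    X = rng.normal(size=(n, p))       # independent variables
    y = rng.integers(0, 2, n)         # dependent variable: unrelated coin flips

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    logit = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)   # wide data is separable
    print("decision tree training accuracy:      ", tree.score(X, y))   # 1.0
    print("logistic regression training accuracy:", logit.score(X, y))  # ~1.0
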
Lesson
•Can’t just tack on cross validation late in a machine
learning algorithm’s design (or just use it to pick a few
calibration parameters).
•Have to express model quality in terms of out of sample
data throughout.
•And must understand some fraction of the above
measurements will still be chimeric and falsely optimistic
(statistical significance re-enters the picture).
26
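
On the last point, one simple hedge (my illustration, not from the talk; the counts are hypothetical) is to ask how surprising a held-out score would be under pure chance, here with a binomial test on a balanced binary task.

    from scipy.stats import binomtest

    n_test = 200      # held-out rows scored (hypothetical)
    n_correct = 116   # held-out hits, i.e. 58% accuracy (hypothetical)
    result = binomtest(n_correct, n_test, p=0.5, alternative="greater")
    print("held-out accuracy:", n_correct / n_test)
    print("p-value versus coin flipping:", round(result.pvalue, 4))
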
27
How did we design the counter
examples?
•A lot of common machine learning algorithms fail in the
presence of:
•Noise variables
•Duplicate examples
•Serial correlation
•Incompatible scales
•Punchline: all these things are common in typical under-
curated real world data!
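
One of these failure modes is easy to make concrete. The sketch below (my own synthetic example, using scikit-learn) shows duplicate examples leaking across a random train/test split: a memorizing model scores well above chance on the test set even though the outcome is pure noise.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(5)
    n, p = 500, 20
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, n)            # no real signal at all
    X_dup = np.vstack([X, X])            # every row appears twice
    y_dup = np.concatenate([y, y])

    X_tr, X_te, y_tr, y_te = train_test_split(X_dup, y_dup, test_size=0.5, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("test accuracy with duplicated rows:", round(model.score(X_te, y_te), 3))  # ~0.75
    print("accuracy on genuinely new rows:    ",
          round(model.score(rng.normal(size=(n, p)), rng.integers(0, 2, n)), 3))     # ~0.5
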
The analyst themself can be a source of
additional exotic “can never happen” biases
•Neighboring duplicate and near-duplicate rows (bad
join/sessionizing logic).
•Features with activation patterns depending on the size of the
training set (opportunistic feature generation/selection).
•Leakage of facts about evaluation set through repeated scoring
(see “wacky boosting” by Moritz Hardt, which gives a reliable
procedure to place high on Kaggle leaderboards without even
looking at the data).
28
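
The last item rewards a small simulation. Below is my own toy version in the spirit of Hardt's "wacky boosting" (the parameter values are illustrative): repeatedly score random predictions against a fixed holdout, keep the lucky ones, and majority-vote them. The leaderboard score climbs even though no feature was ever examined, while a fresh holdout stays at chance.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 2_000
    leaderboard_y = rng.integers(0, 2, n)   # the holdout we repeatedly score against
    fresh_y = rng.integers(0, 2, n)         # truly unseen labels

    kept = []
    for _ in range(500):                    # 500 "leaderboard submissions"
        guess = rng.integers(0, 2, n)
        if (guess == leaderboard_y).mean() > 0.5:   # keep only the lucky guesses
            kept.append(guess)

    ensemble = (np.mean(kept, axis=0) > 0.5).astype(int)   # majority vote
    print("leaderboard accuracy:", (ensemble == leaderboard_y).mean())  # noticeably > 0.5
    print("fresh holdout accuracy:", (ensemble == fresh_y).mean())      # ~0.5
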
Conclusions
•Data scientists, statisticians, and domain experts all see things differently.
•Data science emphasizes procedures that are conceptually easy and become more
correct when scaled to large data. Procedures can seem overly ambitious and can appear
to pander to domain/business needs.
•Statistics emphasizes procedures that are correct at all data scales, including difficult
small data problems. Procedures can seem overly doctrinal and insensitive to the
original domain/business needs.
•Domain experts/scientists value correctness and foundation over implementability.
•An effective data science team must work agilely, understand statistics, and develop
domain empathy.
•We need a deeper science of structured back-testing.
29
Thank you
30
For more, please check out my book,
or contact me at win-vector.com
Also follow our group on our blog
http://www.win-vector.com/blog/ or on
Twitter @WinVectorLLC

Editor's Notes

  1. http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/35179.pdf
  2. Clearly we are ignoring some important domain science issues and statistical science issues, so how does data science work?
  3. http://www.aaai.org/Papers/ISMB/1999/ISMB99-028.pdf You may not get the whole story, but you may not miss the whole story.
  4. Ch. 26 page 493. Statistical efficiency is a huge worry when you don’t have a lot of data.
  5. “The Elements of Statistical Learning” 2nd edition section 7.2 page 222. https://en.wikipedia.org/wiki/Cross-validation_(statistics)
  6. Does not matter when n is large. Can actually be quite complicated and require a lot of background to apply correctly. Prefer tools like the PRESS statistic to adjusted R-squared. Can use the training mean against out-of-sample instances, and so on.
  7. k-X cross validation methods are a procedural alternative. Shown: 3-fold cross validation. We try to simulate the performance of a model on new data by never applying a model to any data used to construct it. Which cross validation scheme you are using determines the pattern of arrows. Common to all schemes: there are many throw-away models. The more data each throw-away model is trained on, the more it behaves like a model trained on all of the available data.
  8. Test/Train split is an easier alternative that is less statistically efficient and depends on having good tools (that themselves cross-validate) during the training phase. The test set is held secret during model construction, tuning, and even early evaluation. Scoring within the training subset may in fact itself use both cross-validation and train/test subsplit methods. The actual model produced is scored on the test set (though some data scientists re-train on the entire data set as a final “model polish” step).
  9. Splitting your available data into train and test is a way to try and simulate the arrival of future data. Like any simulation- it may fail. Controlled experiments are prospective designs that are somewhat more expensive and somewhat more powerful than this.
  10. Data science is a bit looser than traditional statistical practice and moves a bit faster; what does that look like?
  11. Axioms that are true are true in the extreme.
  12. Random forest is a dominant machine learning algorithm in practice. This is a problem where logistic regression gets 85% accuracy as n increases, and the concept is reachable by the random forest model.
  13. Note: collinear variables, while damaging, are nowhere near as large a hazard to prediction as they are to coefficient inference. And classic “x alone” methods of dealing with them become problematic in so-called “wide data” situations.
  14. Principal components is an “independent variable only” or “x alone” transform, a good idea over curated homogeneous variables, not good over wild wide datasets. word2vec ( https://code.google.com/p/word2vec/ ) can be considered not “x alone” as it presumably retains concept clusters from the grouping of its training source (typically GoogleNews or Freebase naming); so it has a “y” (just not your “y”).
  15. I.e. we see an arbitrarily good model on training, even when no model is possible. Also have sometimes seen a reversal: the model is significantly worse than random on the test set. Being worse than random is likely a minor distribution change from training to test. The observed statistical significance is likely due to some process causing dependence between rows in a limited window (like serial correlation or bad sessionizing), and not evidence of anything usable. These methods should be familiar: they are most of the supervised learning algorithms from everybody’s “top 10 data mining” or “top 10 Kaggle algorithms” lists.
  16. Gradient boosting typically uses cross-validation to pick the number of trees, zero trees being no model.
  17. Some of these problems even break test/train exchangeability, one of the major justifications of machine learning.
  18. http://blog.mrtz.org/2015/03/09/competition.html
  19. It is equally arrogant to completely ignore domain science as it is to believe you can always quickly become a domain expert. http://www.win-vector.com/blog/2014/05/a-bit-of-the-agenda-of-practical-data-science-with-r/