SlideShare a Scribd company logo
1 of 39
Download to read offline
STATISTICAL MODELING: THE
TWO CULTURES
Based on Leo Breiman's paper

Christoph Molnar
Department of Statistics,
LMU Munich
2014-01-20

.

.
Statistical Modeling: The Two Cultures

STATISTICAL MODELING: THE
TWO CULTURES
Based on Leo Breiman's paper

.

Christoph Molnar
Department of Statistics,
LMU Munich

.

Abstract:
This presentation compares two cultures of statistical
modeling: the data modeling culture, which assumes a
stochastic process that produced the data. This culture is
associated with traditional statistics. The other culture is called
algorithmic modeling culture, which can be reduced to
optimization of a loss function with an algorithm. This culture is
associated with Machine Learning. It is argued to use
algorithmic modeling more often in statistics.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

OUTLINE

1. Statistics: Data Modeling Culture
2. Machine Learning: Algorithmic Modeling Culture
3. Statistical Learning Principles
4. Personal Experience
5. Summary

Content heavily based on: ``Statistical Modeling: The two cultures''
from Leo Breiman [1]
1 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

OUTLINE

1. Statistics: Data Modeling Culture
2. Machine Learning: Algorithmic Modeling Culture
3. Statistical Learning Principles
4. Personal Experience

Outline

5. Summary

.

Content heavily based on: ``Statistical Modeling: The two cultures''
from Leo Breiman [1]

.

This presentation is based on ``Statistical Modeling: The two
cultures'' from Leo Breiman [1].
The first segment introduces the data modeling culture and
analogously the second segment explains the algorithmic
modeling culture together with the presentation of three
algorithms. The part about statistical learning principals
presents aspects which help to compare both of cultures.
Personal experiences in both cultures are addressed. The
conclusion summarizes the message of the paper [1].
DATA MODELING CULTURE
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

WORK OF A DATA ANALYST

→ Predict
→ Reveal associations
→ Munge data, design experiments, visualize data, …

2 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

WORK OF A DATA ANALYST

Data Modeling

→ Predict
→ Reveal associations
→ Munge data, design experiments, visualize data, …

Work of a Data Analyst
.
.

The work of a data analyst is very diverse. This presentation
focuses on the modeling of data, which can be reduced to two
targets: Learn a model to predict the outcome for new
covariates and get a better understanding about the
relationship between covariates and outcome.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

SIMPLIFIED WORLDVIEW

y

.
nature

x

3 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

SIMPLIFIED WORLDVIEW

Data Modeling
y

.
nature

Simplified worldview
.
.

In a strongly simplified world an arbitrary outcome y is
produced by the nature given the covariates x. The knowledge
about the natures true mechanisms range between entirely
unknown and established (scientific) explanations of the
mechanism. One example: Outcome y is the rent for
apartments and covariates x are size, number of bathrooms
and location.

x
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

DATA MODELING CULTURE

y

Logistic
Regression,
.
Cox Model,
GEE,
…

x

Find a stochastic model of the data-generating process:
y = f(x, parameters, random error)

4 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

DATA MODELING CULTURE

Data Modeling

y

Data Modeling Culture

Logistic
Regression,
.
Cox Model,
GEE,
…

x

Find a stochastic model of the data-generating process:
y = f(x, parameters, random error)

.
.

The direct modeling of the mechanism in the ``box'' is labeled
»Data Modeling Culture« by Leo Breiman. In this culture a
stochastic model for the data- generating process is assumed.
A common formulation of these model is: y is a function of x
with corresponding weights and a random error. For example:
Given the covariates size, number of bathrooms and location,
the rent of apartments is normal distributed.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

TYPICAL ASSUMPTIONS AND RESTRICTIONS

→ Specific stochastic model that generated the data
→ Distribution of residuals
→ Linearity (e.g. linear predictor)
→ Manual specification of interactions

5 / 19
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

PROBLEMS

→ Conclusions about model, not about nature
→ Assumptions often violated
→ Often no model evaluation
→ ⇒ can lead to irrelevant theory and questionable
statistical conclusions
→ Focus not on prediction
→ Data models fail in areas like image and speech
recognition

6 / 19
ALGORITHMIC MODELING CULTURE
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

MACHINE LEARNING

y

.
unknown

x

algorithm
Find a function f (X) that minimizes the loss: L(Y, f (X))

7 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

MACHINE LEARNING

y

Algorithmic Modeling
Machine Learning

.
unknown

x

algorithm
Find a function f (X) that minimizes the loss: L(Y, f (X))

.
.

In the »Algorithmic Modeling Culture«, the true mechanism is
treated as unknown. It is not the target to find the true
data-generating mechanism but to use an algorithm that
imitates the mechanism as good as possible. Modeling is
reduced to a mere problem of function optimization: Given the
covariates x, outcome y and a loss function find a function f(x)
which minimizes the loss for the prediction of the outcome. This
culture is lived in the machine learning area.
Summary: The data modeling culture tries to find the true
data-generating mechanism, the algorithmic modeling culture
tries to imitate the true mechanism as good as possible.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

ALGORITHM IN MACHINE LEARNING

→ Boosting
→ Support Vector Machines
→ Artificial neural networks
→ Random Forests
→ Hidden Markov
→ Bayes-Nets
→ …1

1

Details and more algorithms in ``Elements of statistical learning''[2]
8 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

ALGORITHM IN MACHINE LEARNING

→ Boosting

Algorithmic Modeling

→ Support Vector Machines
→ Artificial neural networks
→ Random Forests
→ Hidden Markov
→ Bayes-Nets

Algorithm in Machine Learning

→ …1

.

1

Details and more algorithms in ``Elements of statistical learning''[2]

.

The algorithms used in machine learning are motivated
differently. Three algorithms are presented in short.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

ARTIFICIAL NEURAL NETWORKS

2

2

http://commons.wikimedia.org/wiki/File:
Mouse_cingulate_cortex_neurons.jpg
http://commons.wikimedia.org/wiki/File:Neural_network.svg
9 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

ARTIFICIAL NEURAL NETWORKS

Algorithmic Modeling
Artificial neural networks

2

.

2
http://commons.wikimedia.org/wiki/File:
Mouse_cingulate_cortex_neurons.jpg
http://commons.wikimedia.org/wiki/File:Neural_network.svg

.

Artificial neural networks are used in classification and
regression. They are inspired by the brain, which consists of a
network of brain cells (neurons). Mathematically artificial
neural networks are a concatenation of weighted functions. An
exemplary application is image processing.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

SUPPORT VECTOR MACHINES

3
3

http://commons.wikimedia.org/wiki/File:Svm_10_perceptron.JPG

10 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

SUPPORT VECTOR MACHINES

Algorithmic Modeling
Support Vector Machines
3

.

3

http://commons.wikimedia.org/wiki/File:Svm_10_perceptron.JPG

.

Support Vector Machines (SVM) were originally a classification
method (regression is also possible). SVMs try to draw a
border between two classes in the covariate space. The
distance between the border and the class points is
maximized. They use a mathematical trick to implicitly project
the covariates in a space with higher dimensions (yes, it sounds
a bit crazy) in order to achieve class separation. Text
classification is an exemplary usage.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

RANDOM FORESTS

4

11 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

RANDOM FORESTS

Algorithmic Modeling
Random Forests
.
.

4
4

http://openclipart.org/detail/175304/forest-by-z-175304

Random Forests™(invented by Leo Breiman) are used for
regression and classification. A Random Forest consists of
many decision trees. Two random mechanisms are used to
train different trees on the data. Results from all trees are
averaged for the prediction.
STATISTICAL LEARNING PRINCIPLES
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

RASHOMON EFFECT

(Often) Many different models describe a situation
equally accurate.

12 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

RASHOMON EFFECT

Principles
(Often) Many different models describe a situation
equally accurate.

Rashomon Effect
.
.

Rashomon is a Japanese movie in which four witnesses tell
different versions of a crime. All versions account for the facts
but they contradict each other. In terms of statistical learning
this means, that often different models (e.g. same y but
different covariates) can be equally accurate. Each model has
a different interpretation which makes it difficult to find the
true mechanism in the data modeling cultures. In the
algorithmic modeling culture the Rashomon effect is exploited
by aggregating over many models. Random forests use this
effect by aggregating over many trees. It is also common to
average the predictions of different algorithms.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

DIMENSIONALITY OF THE DATA

→ The higher the dimensionality of the data (#
covariates) the more difficult is the separation of
signal and noise
→ Common practice in data modeling: variable selection
(by expert selection or data driven) and reduction of
dimensionality (e.g. PCA)
→ Common practice in algorithmic modeling:
Engineering of new features (covariates) to increase
predictive accuracy; algorithms robust for many
covariates

13 / 19
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

PREDICTION VS. INTERPRETATION

Predictive accuracy
Interpretability

.

Tree
Logist. Regression
…

Random Forest
Neuronal networks
…

14 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

PREDICTION VS. INTERPRETATION

Principles

Predictive accuracy
Interpretability

.

Tree
Logist. Regression

Prediction vs. Interpretation

…

.
.

There is a trade-off between interpretability and predictive
accuracy: the models that are good in prediction are often
complex and models that are easy to interpret are often bad
predictors. See for example trees and Random Forests: A
single decision tree is very intuitive and easy to read for
non-professionals, but they are unstable and give weak
predictions. A complex aggregation of decision trees (Random
Forest) has an excellent prediction accuracy, but it is
impossible to interpret the model structure.

Random Forest
Neuronal networks
…
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

GOODNESS OF MODEL

→ Data modeling culture: Goodness of fit often based
on model assumptions (e.g. AIC) and calculated on
training data.
→ Algorithmic modeling culture: Evaluation of predictive
accuracy with an extra test set or cross validation.
How good is a statistical model if the predictive accuracy
is weak? Is it legit to interpret parameters and p-values?

15 / 19
PERSONAL EXPERIENCES
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

STATISTICAL CONSULTING

Stereotypical user ...
→ are e.g. veterinarians, linguists, biologists, …
→ crave p-values
→ want interpretability
→ ignore model diagnosis

16 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

STATISTICAL CONSULTING

Experiences

Stereotypical user ...
→ are e.g. veterinarians, linguists, biologists, …
→ crave p-values
→ want interpretability
→ ignore model diagnosis

Statistical Consulting
.
.

From my experience in the statistical consulting unit of our
university, most user want to test their scientific hypothesis
with models. They want models which are easy to interpret
regarding their questions. Thus it is more important to have a
model that gives parameters associated with covariates and
p-values than to have a model that predicts the data well.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

KAGGLE

Algorithms of winners on kaggle, a platform for
prediction challenges:
→ Job Salary Prediction: »I used deep neural networks«
→ Observing Dark Worlds: »Bayesian analysis provided
the winning recipe for solving this problem«
→ Give Me Some Credit: »In the end we only used five
supervised learning methods: a random forest of
classification trees, a random forest of regression
trees, a classification tree boosting algorithm, a
regression tree boosting algorithm, and a neural
network.«

17 / 19
2014-01-20

.

.
Statistical Modeling: The Two Cultures

KAGGLE

Algorithms of winners on kaggle, a platform for
prediction challenges:

Experiences

→ Job Salary Prediction: »I used deep neural networks«
→ Observing Dark Worlds: »Bayesian analysis provided
the winning recipe for solving this problem«
→ Give Me Some Credit: »In the end we only used five
supervised learning methods: a random forest of
classification trees, a random forest of regression
trees, a classification tree boosting algorithm, a
regression tree boosting algorithm, and a neural
network.«

kaggle
.
.

Kaggle (http://www.kaggle.com) is a platform for prediction
challenges. Data are provided and the participant with the best
prediction on a test set wins the challenge. To generate
insights about the mechanisms in the data is secondary
because the prediction is all that counts to win. That's why the
algorithmic modeling culture is superior in this field.
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

CONCLUSION

Data analysts should:
→ Use predictive accuracy to evaluate models
→ Seek the best model
→ Add Machine Learning to their toolbox

18 / 19
Data Modeling

Algorithmic Modeling

Principles

Experiences

Conclusions

FURTHER LITERATURE

L. Breiman
Statistical modeling: The two cultures (with comments
and a rejoinder by the author)
Institute of Mathematical Statistics, 2001.
T. Hastie, R. Tibshirani and J. Friedman
The elements of statistical learning
Springer New York, 2001

19 / 19
»All models are wrong, but some are useful« (G. Box)

More Related Content

What's hot

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
The path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial ServicesThe path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial ServicesHortonworks
 
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...Neo4j
 
Master Data Management
Master Data ManagementMaster Data Management
Master Data ManagementZahra Mansoori
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introductionBasma Gamal
 
Fraud Detection with Amazon SageMaker
Fraud Detection with Amazon SageMakerFraud Detection with Amazon SageMaker
Fraud Detection with Amazon SageMakerAmazon Web Services
 
7 - Model Assessment and Selection
7 - Model Assessment and Selection7 - Model Assessment and Selection
7 - Model Assessment and SelectionNikita Zhiltsov
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AIFlorian Wilhelm
 
POLE Investigations with Neo4j
POLE Investigations with Neo4jPOLE Investigations with Neo4j
POLE Investigations with Neo4jNeo4j
 
Big data 2017 final
Big data 2017   finalBig data 2017   final
Big data 2017 finalAmjid Ali
 
From Data Lakes to the Data Fabric: Our Vision for Digital Strategy
From Data Lakes to the Data Fabric: Our Vision for Digital StrategyFrom Data Lakes to the Data Fabric: Our Vision for Digital Strategy
From Data Lakes to the Data Fabric: Our Vision for Digital StrategyCambridge Semantics
 
Data Analytics and Business Intelligence
Data Analytics and Business IntelligenceData Analytics and Business Intelligence
Data Analytics and Business IntelligenceChris Ortega, MBA
 
Analytics with Descriptive, Predictive and Prescriptive Techniques
Analytics with Descriptive, Predictive and Prescriptive TechniquesAnalytics with Descriptive, Predictive and Prescriptive Techniques
Analytics with Descriptive, Predictive and Prescriptive Techniquesleadershipsoil
 
Accelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsAccelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsRenee Yao
 
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Simplilearn
 
Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
Data Management Maturity Assessment
Data Management Maturity AssessmentData Management Maturity Assessment
Data Management Maturity AssessmentFiras Hamdan
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big DataUmair Shafique
 

What's hot (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
The path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial ServicesThe path to a Modern Data Architecture in Financial Services
The path to a Modern Data Architecture in Financial Services
 
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...
From Target to Product - Accelerating the Drug Lifecycle with Knowledge Graph...
 
Master Data Management
Master Data ManagementMaster Data Management
Master Data Management
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
Fraud Detection with Amazon SageMaker
Fraud Detection with Amazon SageMakerFraud Detection with Amazon SageMaker
Fraud Detection with Amazon SageMaker
 
Big data case study collection
Big data   case study collectionBig data   case study collection
Big data case study collection
 
7 - Model Assessment and Selection
7 - Model Assessment and Selection7 - Model Assessment and Selection
7 - Model Assessment and Selection
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AI
 
POLE Investigations with Neo4j
POLE Investigations with Neo4jPOLE Investigations with Neo4j
POLE Investigations with Neo4j
 
Graph Kernelpdf
Graph KernelpdfGraph Kernelpdf
Graph Kernelpdf
 
Big data 2017 final
Big data 2017   finalBig data 2017   final
Big data 2017 final
 
From Data Lakes to the Data Fabric: Our Vision for Digital Strategy
From Data Lakes to the Data Fabric: Our Vision for Digital StrategyFrom Data Lakes to the Data Fabric: Our Vision for Digital Strategy
From Data Lakes to the Data Fabric: Our Vision for Digital Strategy
 
Data Analytics and Business Intelligence
Data Analytics and Business IntelligenceData Analytics and Business Intelligence
Data Analytics and Business Intelligence
 
Analytics with Descriptive, Predictive and Prescriptive Techniques
Analytics with Descriptive, Predictive and Prescriptive TechniquesAnalytics with Descriptive, Predictive and Prescriptive Techniques
Analytics with Descriptive, Predictive and Prescriptive Techniques
 
Accelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsAccelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANs
 
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
 
Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
Data Management Maturity Assessment
Data Management Maturity AssessmentData Management Maturity Assessment
Data Management Maturity Assessment
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big Data
 

Viewers also liked

Conditional inference trees (CTREEs) in dynamic microsimulation
Conditional inference trees (CTREEs) in dynamic microsimulationConditional inference trees (CTREEs) in dynamic microsimulation
Conditional inference trees (CTREEs) in dynamic microsimulationNiels Erik Kaaber Rasmussen
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesIntuit Inc.
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Andrea Dal Pozzolo
 
Parallel and Perpendicular Slopes lines
Parallel and Perpendicular Slopes linesParallel and Perpendicular Slopes lines
Parallel and Perpendicular Slopes linesswartzje
 
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...WithTheBest
 
Introduction of Deep Learning
Introduction of Deep LearningIntroduction of Deep Learning
Introduction of Deep LearningMyungjin Lee
 
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...Plotly
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part) Marina Santini
 
Machine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and PreparationMachine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and PreparationPier Luca Lanzi
 
Lossless Compression
Lossless CompressionLossless Compression
Lossless CompressionPuchpa Oks
 
Data compression techniques
Data compression techniquesData compression techniques
Data compression techniquesDeep Bhatt
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticskemdoby
 

Viewers also liked (20)

Conditional trees
Conditional treesConditional trees
Conditional trees
 
Conditional inference trees (CTREEs) in dynamic microsimulation
Conditional inference trees (CTREEs) in dynamic microsimulationConditional inference trees (CTREEs) in dynamic microsimulation
Conditional inference trees (CTREEs) in dynamic microsimulation
 
Datajournalistik – en introduktion
Datajournalistik – en introduktionDatajournalistik – en introduktion
Datajournalistik – en introduktion
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
 
Parallel and Perpendicular Slopes lines
Parallel and Perpendicular Slopes linesParallel and Perpendicular Slopes lines
Parallel and Perpendicular Slopes lines
 
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
 
Data mining
Data miningData mining
Data mining
 
Data Mining (Predict The Future)
Data Mining (Predict The Future)Data Mining (Predict The Future)
Data Mining (Predict The Future)
 
Introduction of Deep Learning
Introduction of Deep LearningIntroduction of Deep Learning
Introduction of Deep Learning
 
Statistical Project
Statistical ProjectStatistical Project
Statistical Project
 
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...
PLOTCON NYC: The Architecture of Jupyter: Protocols for Interactive Data Expl...
 
Parallel lines
Parallel linesParallel lines
Parallel lines
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Machine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and PreparationMachine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and Preparation
 
Lossless Compression
Lossless CompressionLossless Compression
Lossless Compression
 
Data compression techniques
Data compression techniquesData compression techniques
Data compression techniques
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 

Similar to Statistical Modeling: The Two Cultures

G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsIstituto nazionale di statistica
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Statistical learning vs. Machine Learning
Statistical learning vs. Machine LearningStatistical learning vs. Machine Learning
Statistical learning vs. Machine LearningAtanu Ray
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxVenkateswaraBabuRavi
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Mathieu DESPRIEE
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Augmented intelligence as a response to the crisis of artificial intelligence
Augmented intelligence as a response to the crisis of artificial intelligenceAugmented intelligence as a response to the crisis of artificial intelligence
Augmented intelligence as a response to the crisis of artificial intelligenceAlexander Ryzhov
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Google Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlowGoogle Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlowHarini Gunabalan
 
Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...Kim Flintoff
 
ML All Chapter PDF.pdf
ML All Chapter PDF.pdfML All Chapter PDF.pdf
ML All Chapter PDF.pdfexample43
 
Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning ParrotAI
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningKrzysztof Kowalczyk
 
Experimental categorization and deep visualization
 Experimental categorization and deep visualization Experimental categorization and deep visualization
Experimental categorization and deep visualizationEverardo Reyes-García
 
mathematical modeling nikki.pptx
mathematical modeling nikki.pptxmathematical modeling nikki.pptx
mathematical modeling nikki.pptxArikB
 

Similar to Statistical Modeling: The Two Cultures (20)

G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statistics
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Statistical learning vs. Machine Learning
Statistical learning vs. Machine LearningStatistical learning vs. Machine Learning
Statistical learning vs. Machine Learning
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdfTop Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
module 6 (1).ppt
module 6 (1).pptmodule 6 (1).ppt
module 6 (1).ppt
 
Augmented intelligence as a response to the crisis of artificial intelligence
Augmented intelligence as a response to the crisis of artificial intelligenceAugmented intelligence as a response to the crisis of artificial intelligence
Augmented intelligence as a response to the crisis of artificial intelligence
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Google Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlowGoogle Developer Groups Talk - TensorFlow
Google Developer Groups Talk - TensorFlow
 
20181212 ibm aot
20181212 ibm aot20181212 ibm aot
20181212 ibm aot
 
Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...
 
ML All Chapter PDF.pdf
ML All Chapter PDF.pdfML All Chapter PDF.pdf
ML All Chapter PDF.pdf
 
Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine Learning
 
Experimental categorization and deep visualization
 Experimental categorization and deep visualization Experimental categorization and deep visualization
Experimental categorization and deep visualization
 
mathematical modeling nikki.pptx
mathematical modeling nikki.pptxmathematical modeling nikki.pptx
mathematical modeling nikki.pptx
 

Recently uploaded

QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonhttgc7rh9c
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111GangaMaiya1
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsNbelano25
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxakanksha16arora
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesSHIVANANDaRV
 

Recently uploaded (20)

QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food Additives
 

Statistical Modeling: The Two Cultures

  • 1. STATISTICAL MODELING: THE TWO CULTURES Based on Leo Breiman's paper Christoph Molnar Department of Statistics, LMU Munich
  • 2. 2014-01-20 . . Statistical Modeling: The Two Cultures STATISTICAL MODELING: THE TWO CULTURES Based on Leo Breiman's paper . Christoph Molnar Department of Statistics, LMU Munich . Abstract: This presentation compares two cultures of statistical modeling: the data modeling culture, which assumes a stochastic process that produced the data. This culture is associated with traditional statistics. The other culture is called algorithmic modeling culture, which can be reduced to optimization of a loss function with an algorithm. This culture is associated with Machine Learning. It is argued to use algorithmic modeling more often in statistics.
  • 3. Data Modeling Algorithmic Modeling Principles Experiences Conclusions OUTLINE 1. Statistics: Data Modeling Culture 2. Machine Learning: Algorithmic Modeling Culture 3. Statistical Learning Principles 4. Personal Experience 5. Summary Content heavily based on: ``Statistical Modeling: The two cultures'' from Leo Breiman [1] 1 / 19
  • 4. 2014-01-20 . . Statistical Modeling: The Two Cultures OUTLINE 1. Statistics: Data Modeling Culture 2. Machine Learning: Algorithmic Modeling Culture 3. Statistical Learning Principles 4. Personal Experience Outline 5. Summary . Content heavily based on: ``Statistical Modeling: The two cultures'' from Leo Breiman [1] . This presentation is based on ``Statistical Modeling: The two cultures'' from Leo Breiman [1]. The first segment introduces the data modeling culture and analogously the second segment explains the algorithmic modeling culture together with the presentation of three algorithms. The part about statistical learning principals presents aspects which help to compare both of cultures. Personal experiences in both cultures are addressed. The conclusion summarizes the message of the paper [1].
  • 6. Data Modeling Algorithmic Modeling Principles Experiences Conclusions WORK OF A DATA ANALYST → Predict → Reveal associations → Munge data, design experiments, visualize data, … 2 / 19
  • 7. 2014-01-20 . . Statistical Modeling: The Two Cultures WORK OF A DATA ANALYST Data Modeling → Predict → Reveal associations → Munge data, design experiments, visualize data, … Work of a Data Analyst . . The work of a data analyst is very diverse. This presentation focuses on the modeling of data, which can be reduced to two targets: Learn a model to predict the outcome for new covariates and get a better understanding about the relationship between covariates and outcome.
  • 9. 2014-01-20 . . Statistical Modeling: The Two Cultures SIMPLIFIED WORLDVIEW Data Modeling y . nature Simplified worldview . . In a strongly simplified world an arbitrary outcome y is produced by the nature given the covariates x. The knowledge about the natures true mechanisms range between entirely unknown and established (scientific) explanations of the mechanism. One example: Outcome y is the rent for apartments and covariates x are size, number of bathrooms and location. x
  • 10. Data Modeling Algorithmic Modeling Principles Experiences Conclusions DATA MODELING CULTURE y Logistic Regression, . Cox Model, GEE, … x Find a stochastic model of the data-generating process: y = f(x, parameters, random error) 4 / 19
  • 11. 2014-01-20 . . Statistical Modeling: The Two Cultures DATA MODELING CULTURE Data Modeling y Data Modeling Culture Logistic Regression, . Cox Model, GEE, … x Find a stochastic model of the data-generating process: y = f(x, parameters, random error) . . The direct modeling of the mechanism in the ``box'' is labeled »Data Modeling Culture« by Leo Breiman. In this culture a stochastic model for the data- generating process is assumed. A common formulation of these model is: y is a function of x with corresponding weights and a random error. For example: Given the covariates size, number of bathrooms and location, the rent of apartments is normal distributed.
  • 12. Data Modeling Algorithmic Modeling Principles Experiences Conclusions TYPICAL ASSUMPTIONS AND RESTRICTIONS → Specific stochastic model that generated the data → Distribution of residuals → Linearity (e.g. linear predictor) → Manual specification of interactions 5 / 19
  • 13. Data Modeling Algorithmic Modeling Principles Experiences Conclusions PROBLEMS → Conclusions about model, not about nature → Assumptions often violated → Often no model evaluation → ⇒ can lead to irrelevant theory and questionable statistical conclusions → Focus not on prediction → Data models fail in areas like image and speech recognition 6 / 19
  • 15. Data Modeling Algorithmic Modeling Principles Experiences Conclusions MACHINE LEARNING y . unknown x algorithm Find a function f (X) that minimizes the loss: L(Y, f (X)) 7 / 19
  • 16. 2014-01-20 . . Statistical Modeling: The Two Cultures MACHINE LEARNING y Algorithmic Modeling Machine Learning . unknown x algorithm Find a function f (X) that minimizes the loss: L(Y, f (X)) . . In the »Algorithmic Modeling Culture«, the true mechanism is treated as unknown. It is not the target to find the true data-generating mechanism but to use an algorithm that imitates the mechanism as good as possible. Modeling is reduced to a mere problem of function optimization: Given the covariates x, outcome y and a loss function find a function f(x) which minimizes the loss for the prediction of the outcome. This culture is lived in the machine learning area. Summary: The data modeling culture tries to find the true data-generating mechanism, the algorithmic modeling culture tries to imitate the true mechanism as good as possible.
  • 17. Data Modeling Algorithmic Modeling Principles Experiences Conclusions ALGORITHM IN MACHINE LEARNING → Boosting → Support Vector Machines → Artificial neural networks → Random Forests → Hidden Markov → Bayes-Nets → …1 1 Details and more algorithms in ``Elements of statistical learning''[2] 8 / 19
  • 18. 2014-01-20 . . Statistical Modeling: The Two Cultures ALGORITHM IN MACHINE LEARNING → Boosting Algorithmic Modeling → Support Vector Machines → Artificial neural networks → Random Forests → Hidden Markov → Bayes-Nets Algorithm in Machine Learning → …1 . 1 Details and more algorithms in ``Elements of statistical learning''[2] . The algorithms used in machine learning are motivated differently. Three algorithms are presented in short.
  • 19. Data Modeling Algorithmic Modeling Principles Experiences Conclusions ARTIFICIAL NEURAL NETWORKS 2 2 http://commons.wikimedia.org/wiki/File: Mouse_cingulate_cortex_neurons.jpg http://commons.wikimedia.org/wiki/File:Neural_network.svg 9 / 19
  • 20. 2014-01-20 . . Statistical Modeling: The Two Cultures ARTIFICIAL NEURAL NETWORKS Algorithmic Modeling Artificial neural networks 2 . 2 http://commons.wikimedia.org/wiki/File: Mouse_cingulate_cortex_neurons.jpg http://commons.wikimedia.org/wiki/File:Neural_network.svg . Artificial neural networks are used in classification and regression. They are inspired by the brain, which consists of a network of brain cells (neurons). Mathematically artificial neural networks are a concatenation of weighted functions. An exemplary application is image processing.
  • 21. Data Modeling Algorithmic Modeling Principles Experiences Conclusions SUPPORT VECTOR MACHINES 3 3 http://commons.wikimedia.org/wiki/File:Svm_10_perceptron.JPG 10 / 19
  • 22. 2014-01-20 . . Statistical Modeling: The Two Cultures SUPPORT VECTOR MACHINES Algorithmic Modeling Support Vector Machines 3 . 3 http://commons.wikimedia.org/wiki/File:Svm_10_perceptron.JPG . Support Vector Machines (SVM) were originally a classification method (regression is also possible). SVMs try to draw a border between two classes in the covariate space. The distance between the border and the class points is maximized. They use a mathematical trick to implicitly project the covariates in a space with higher dimensions (yes, it sounds a bit crazy) in order to achieve class separation. Text classification is an exemplary usage.
  • 24. 2014-01-20 . . Statistical Modeling: The Two Cultures RANDOM FORESTS Algorithmic Modeling Random Forests . . 4 4 http://openclipart.org/detail/175304/forest-by-z-175304 Random Forests™(invented by Leo Breiman) are used for regression and classification. A Random Forest consists of many decision trees. Two random mechanisms are used to train different trees on the data. Results from all trees are averaged for the prediction.
  • 26. Data Modeling Algorithmic Modeling Principles Experiences Conclusions RASHOMON EFFECT (Often) Many different models describe a situation equally accurate. 12 / 19
  • 27. 2014-01-20 . . Statistical Modeling: The Two Cultures RASHOMON EFFECT Principles (Often) Many different models describe a situation equally accurate. Rashomon Effect . . Rashomon is a Japanese movie in which four witnesses tell different versions of a crime. All versions account for the facts but they contradict each other. In terms of statistical learning this means, that often different models (e.g. same y but different covariates) can be equally accurate. Each model has a different interpretation which makes it difficult to find the true mechanism in the data modeling cultures. In the algorithmic modeling culture the Rashomon effect is exploited by aggregating over many models. Random forests use this effect by aggregating over many trees. It is also common to average the predictions of different algorithms.
  • 28. Data Modeling Algorithmic Modeling Principles Experiences Conclusions DIMENSIONALITY OF THE DATA → The higher the dimensionality of the data (# covariates) the more difficult is the separation of signal and noise → Common practice in data modeling: variable selection (by expert selection or data driven) and reduction of dimensionality (e.g. PCA) → Common practice in algorithmic modeling: Engineering of new features (covariates) to increase predictive accuracy; algorithms robust for many covariates 13 / 19
  • 29. Data Modeling Algorithmic Modeling Principles Experiences Conclusions PREDICTION VS. INTERPRETATION Predictive accuracy Interpretability . Tree Logist. Regression … Random Forest Neuronal networks … 14 / 19
  • 30. 2014-01-20 . . Statistical Modeling: The Two Cultures PREDICTION VS. INTERPRETATION Principles Predictive accuracy Interpretability . Tree Logist. Regression Prediction vs. Interpretation … . . There is a trade-off between interpretability and predictive accuracy: the models that are good in prediction are often complex and models that are easy to interpret are often bad predictors. See for example trees and Random Forests: A single decision tree is very intuitive and easy to read for non-professionals, but they are unstable and give weak predictions. A complex aggregation of decision trees (Random Forest) has an excellent prediction accuracy, but it is impossible to interpret the model structure. Random Forest Neuronal networks …
  • 31. Data Modeling Algorithmic Modeling Principles Experiences Conclusions GOODNESS OF MODEL → Data modeling culture: Goodness of fit often based on model assumptions (e.g. AIC) and calculated on training data. → Algorithmic modeling culture: Evaluation of predictive accuracy with an extra test set or cross validation. How good is a statistical model if the predictive accuracy is weak? Is it legit to interpret parameters and p-values? 15 / 19
  • 33. Data Modeling Algorithmic Modeling Principles Experiences Conclusions STATISTICAL CONSULTING Stereotypical user ... → are e.g. veterinarians, linguists, biologists, … → crave p-values → want interpretability → ignore model diagnosis 16 / 19
  • 34. 2014-01-20 . . Statistical Modeling: The Two Cultures STATISTICAL CONSULTING Experiences Stereotypical user ... → are e.g. veterinarians, linguists, biologists, … → crave p-values → want interpretability → ignore model diagnosis Statistical Consulting . . From my experience in the statistical consulting unit of our university, most user want to test their scientific hypothesis with models. They want models which are easy to interpret regarding their questions. Thus it is more important to have a model that gives parameters associated with covariates and p-values than to have a model that predicts the data well.
  • 35. Data Modeling Algorithmic Modeling Principles Experiences Conclusions KAGGLE Algorithms of winners on kaggle, a platform for prediction challenges: → Job Salary Prediction: »I used deep neural networks« → Observing Dark Worlds: »Bayesian analysis provided the winning recipe for solving this problem« → Give Me Some Credit: »In the end we only used five supervised learning methods: a random forest of classification trees, a random forest of regression trees, a classification tree boosting algorithm, a regression tree boosting algorithm, and a neural network.« 17 / 19
  • 36. 2014-01-20 . . Statistical Modeling: The Two Cultures KAGGLE Algorithms of winners on kaggle, a platform for prediction challenges: Experiences → Job Salary Prediction: »I used deep neural networks« → Observing Dark Worlds: »Bayesian analysis provided the winning recipe for solving this problem« → Give Me Some Credit: »In the end we only used five supervised learning methods: a random forest of classification trees, a random forest of regression trees, a classification tree boosting algorithm, a regression tree boosting algorithm, and a neural network.« kaggle . . Kaggle (http://www.kaggle.com) is a platform for prediction challenges. Data are provided and the participant with the best prediction on a test set wins the challenge. To generate insights about the mechanisms in the data is secondary because the prediction is all that counts to win. That's why the algorithmic modeling culture is superior in this field.
  • 37. Data Modeling Algorithmic Modeling Principles Experiences Conclusions CONCLUSION Data analysts should: → Use predictive accuracy to evaluate models → Seek the best model → Add Machine Learning to their toolbox 18 / 19
  • 38. Data Modeling Algorithmic Modeling Principles Experiences Conclusions FURTHER LITERATURE L. Breiman Statistical modeling: The two cultures (with comments and a rejoinder by the author) Institute of Mathematical Statistics, 2001. T. Hastie, R. Tibshirani and J. Friedman The elements of statistical learning Springer New York, 2001 19 / 19
  • 39. »All models are wrong, but some are useful« (G. Box)