The use of machine learning in official statistics
Giulio Barcaroli
Istituto Nazionale di Statistica
Outline
① Data science and Machine Learning (ML)
② Two cultures in statistical modeling: data vs algorithmic modeling, to explain or to predict?
③ Official statistics: what kind of modeling?
④ Why ML in official statistics?
⑤ ML in the traditional statistical information production process
⑥ ML in a multi-source production process
⑦ Some conclusions
Data science and Machine Learning
“Data science is a concept to unify statistics,
data analysis, machine learning and their related
methods in order to understand and analyze actual
phenomena with data.”
“Machine learning is a subset of artificial
intelligence in the field of computer science that
often uses statistical techniques to give
computers the ability to learn (i.e., progressively
improve performance on a specific task) with data,
without being explicitly programmed.”
(Wikipedia)
«Statistical Modeling: The Two Cultures» (Breiman, 2001)
“There are two cultures in the use of statistical modeling to reach
conclusions from data.
o One assumes that the data are generated by a given stochastic data
model.
o The other uses algorithmic models and treats the data mechanism as
unknown.
The statistical community has been committed to the almost exclusive use of
data models.
Algorithmic modeling, both in theory and practice, has developed rapidly in
fields outside statistics. It can be used both on large complex data sets and
as a more accurate and informative alternative to data modeling on smaller
data sets. If our goal as a field is to use data to solve problems, then we
need to move away from exclusive dependence on data models and adopt a
more diverse set of tools.”
From data modeling to algorithmic modeling
“Assuming a stochastic data model for the inside of the black box.
A common data model is that data are generated by independent draws from
response variables = f(predictor variables, random noise, parameters)
Model validation: Yes–no using goodness-of-fit tests and residual examination”
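As an illustration of the data modeling culture described in the quote, the sketch below assumes a simple linear data model, y = a + b*x + noise, fits it by ordinary least squares, and then looks at the residuals and at R² as a goodness-of-fit check. The toy data and the function names (fit_line, r_squared) are made up for this example and do not come from the original slides.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for the assumed data model y = a + b*x + noise."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return a, b

def r_squared(x, y, a, b):
    """Share of the variance of y explained by the fitted line."""
    my = mean(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("intercept:", round(a, 3), "slope:", round(b, 3))
print("residuals:", [round(e, 3) for e in residuals])
print("R^2:", round(r_squared(x, y, a, b), 4))
```

Validation here is internal to the model: the question is whether the assumed form fits the observed data, not how well it predicts unseen data.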
From data modeling to algorithmic modeling
“The analysis in this culture considers the inside of the box complex and unknown. Their approach is to
find a function f(x) — an algorithm that operates on x to predict the responses y. Their black box looks
like this:
Model validation. Measured by predictive accuracy”
Machine Learning
“A core objective of a learner is to generalize from its experience.
Generalization in this context is the ability of a learning machine
to perform accurately on new, unseen examples/tasks after
having experienced a learning data set.
Classification machine learning models can be validated by
accuracy estimation techniques like the holdout method,
which splits the data in a training and test set (conventionally 2/3
training set and 1/3 test set designation) and evaluates the
performance of the training model on the test set.
In comparison, the N-fold-cross-validation method randomly
splits the data in k subsets where the k-1 instances of the data
are used to train the model while the k-th instance is used to test
the predictive ability of the training model. In addition to the
holdout and cross-validation methods, bootstrap, which samples
n instances with replacement from the dataset, can be used to
assess model accuracy.”
(Wikipedia)
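The three validation schemes named in the quote can be sketched as index-splitting routines. This is a minimal sketch using only the Python standard library; the function names and the seed parameter are assumptions made for illustration, and returning the out-of-bag indices for the bootstrap is a common convention that the quoted text does not mention.

```python
import random

def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Shuffle indices 0..n-1 and split them into training and test parts."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_splits(n, k, seed=0):
    """Yield k (train, test) pairs; each fold is used as test set exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; also return the indices never drawn."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

train, test = holdout_split(9)
print("holdout sizes:", len(train), len(test))
for fold_train, fold_test in k_fold_splits(10, 5):
    print("fold test indices:", sorted(fold_test))
in_bag, out_of_bag = bootstrap_sample(10)
print("bootstrap in-bag:", in_bag, "out-of-bag:", out_of_bag)
```

In each case a model would be fitted on the training (or in-bag) indices and its predictive accuracy measured on the remaining ones.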
Primary and secondary data: towards a multisource environment
“NSOs are currently facing unprecedented pressure to evaluate how they operate. Years of declining response rates
to primary data collection efforts and the proliferation of readily accessible data, which has made it easier for private
companies to produce statistics, is putting into question the role of NSOs. In response, many NSOs are looking to
tap into these alternative data sources to supplement, or even replace, data collected by traditional means.”
(UNECE Machine Learning Team)
So, the shift is from primary data (survey data), where the only source is data collected for statistical purposes, to secondary data (administrative or Big Data sources).
The nature of secondary data, in particular their volume and variety, makes the algorithmic approach more convenient than the data modeling approach.
But even in the classical production process based on primary data (described by the Generic Statistical Business Process Model, GSBPM), Machine Learning can be competitive in many phases.
Modeling in Official Statistics: primary data process
Modeling is widely used in Official Statistics.
In the standard statistical information production process, model-based or model-assisted techniques are adopted in:
- Sampling design (stratification)
- Data integration (record linkage and statistical matching)
- Data editing and imputation
- Outlier detection and handling
- Total non-response handling
- Estimation
Use of models in primary data production process
Examples of implicit definition of models:
• stratification in sample design
• donor search for imputation
• population totals for calibration
Examples of explicit definition of models:
• models for imputation
• models for outlier detection
• models for calibration
Example: imputation (donor search)
Given a dataset with a number of missing values on a variable Y, impute them with the hot-deck donor method.
1. Order the dataset by the other variables X1, X2, …, Xp.
2. Scan the dataset starting from the first record. Whenever there is a missing value in the Y variable, impute the value of the previous record.

Id  Y   X1  X2  X3
c   12  1   1   1
b   6   1   1   2
d   8   1   2   3
f   ?   2   1   1
h   21  2   1   2
a   3   2   2   4
i   ?   3   1   1
e   7   3   3   1
g   13  4   1   3

Implicit model:
Y = f(X1, X2, …, Xp)
No parameters, no indications about the quality of the imputation.
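The two steps of the hot-deck procedure can be sketched as follows, using the nine records from the slide. This is a minimal sketch: the function name sequential_hot_deck and the representation of missing values as None are choices made for this illustration, not part of the original material.

```python
COLUMNS = ("Id", "Y", "X1", "X2", "X3")
ROWS = [
    ("c", 12, 1, 1, 1), ("b", 6, 1, 1, 2), ("d", 8, 1, 2, 3),
    ("f", None, 2, 1, 1), ("h", 21, 2, 1, 2), ("a", 3, 2, 2, 4),
    ("i", None, 3, 1, 1), ("e", 7, 3, 3, 1), ("g", 13, 4, 1, 3),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]

def sequential_hot_deck(records, target="Y", sort_keys=("X1", "X2", "X3")):
    """Sort by the auxiliary variables, then fill each missing target value
    with the value carried by the immediately preceding record."""
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in sort_keys))
    imputed = []
    previous_value = None  # value of the preceding record (observed or imputed)
    for rec in ordered:
        rec = dict(rec)  # copy, so the input records are left untouched
        if rec[target] is None:
            rec[target] = previous_value  # stays None if no record precedes it
        previous_value = rec[target]
        imputed.append(rec)
    return imputed

result = {r["Id"]: r["Y"] for r in sequential_hot_deck(records)}
print(result["f"], result["i"])  # 8 3
```

Record f receives the value of d (8) and record i the value of a (3). As the slide points out, nothing in the procedure says how good these imputed values are.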
Example: model based imputation (traditional approach)

Complete data:
Id  Y   X1  X2  X3
c   12  1   1   1
b   6   1   1   2
d   8   1   2   3
h   21  2   1   2
a   3   2   2   4
e   7   3   3   1
g   13  4   1   3

Incomplete data:
Id  Y   X1  X2  X3
f   ?   2   1   1
i   ?   3   1   1

• Fit models using complete data
• Evaluate fitting
• Apply to impute
Example: model based imputation (Machine Learning approach)

Complete data ("ground truth"):
Id  Y   X1  X2  X3
c   12  1   1   1
b   6   1   1   2
d   8   1   2   3
h   21  2   1   2
a   3   2   2   4
e   7   3   3   1
g   13  4   1   3

Training set:
Id  Y   X1  X2  X3
c   12  1   1   1
b   6   1   1   2
d   8   1   2   3
h   21  2   1   2

Validation set:
Id  Y   X1  X2  X3
a   3   2   2   4
e   7   3   3   1
g   13  4   1   3

Incomplete data:
Id  Y   X1  X2  X3
f   ?   2   1   1
i   ?   3   1   1

• Fit models using the training set
• Evaluate performance on the validation set
• Choose the best and apply to impute
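The fit / evaluate / choose workflow of the Machine Learning approach can be sketched on the same toy records. The two candidate models (a constant mean predictor and a one-nearest-neighbour rule with Manhattan distance) and the use of mean absolute error as the performance measure are assumptions chosen only to keep the example small; the slides do not say which models or metrics were used.

```python
from statistics import mean

COLUMNS = ("Id", "Y", "X1", "X2", "X3")

def make(rows):
    return [dict(zip(COLUMNS, r)) for r in rows]

training = make([("c", 12, 1, 1, 1), ("b", 6, 1, 1, 2),
                 ("d", 8, 1, 2, 3), ("h", 21, 2, 1, 2)])
validation = make([("a", 3, 2, 2, 4), ("e", 7, 3, 3, 1), ("g", 13, 4, 1, 3)])
incomplete = make([("f", None, 2, 1, 1), ("i", None, 3, 1, 1)])

def features(rec):
    return (rec["X1"], rec["X2"], rec["X3"])

def fit_mean(train):
    """Candidate 1: always predict the training mean of Y."""
    mu = mean(r["Y"] for r in train)
    return lambda rec: mu

def fit_nearest_neighbour(train):
    """Candidate 2: predict the Y of the closest training record."""
    def predict(rec):
        nearest = min(
            train,
            key=lambda t: sum(abs(a - b) for a, b in zip(features(t), features(rec))),
        )
        return nearest["Y"]
    return predict

def mean_absolute_error(model, data):
    return mean(abs(r["Y"] - model(r)) for r in data)

candidates = {"mean": fit_mean, "nearest_neighbour": fit_nearest_neighbour}

# 1. Fit every candidate on the training set
fitted = {name: fit(training) for name, fit in candidates.items()}
# 2. Evaluate each one on the validation set
scores = {name: mean_absolute_error(m, validation) for name, m in fitted.items()}
# 3. Choose the best and apply it to the incomplete records
best_name = min(scores, key=scores.get)
imputed = {r["Id"]: fitted[best_name](r) for r in incomplete}

print(scores)
print(best_name, imputed)
```

Unlike the traditional approach, the model is selected on records it has not seen during fitting, so the validation error is also an estimate of the quality of the imputed values.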
Use of ML in a multi-source production process
[Diagram; recoverable elements:]
• Population frame (ASIA). Reference population: 184,000 enterprises
• Sample selection: 32,000 enterprises
• Data collection on 19,000 enterprises
• Survey data: e-commerce, e-recruitment, e-tendering, …
• Big Data: Internet as Data Source (websites and social networks): 14,000 URLs, 11,700 websites
• Web scraping + text processing: Document Terms Matrix, predictors
• Machine learning
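The text processing step turns scraped website text into a document-term matrix whose columns can serve as predictors. The sketch below shows one minimal way to build such a matrix with the Python standard library; the tokenization rule (lowercase alphabetic words) and the three example texts are assumptions for illustration, not the processing actually applied to the scraped websites.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts how often
    vocabulary[j] occurs in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Made-up snippets standing in for scraped website text
documents = [
    "Add to cart and pay online",
    "Join our team: open jobs",
    "Cart checkout, pay by card",
]
vocabulary, matrix = document_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row describes one website and each column one term, so a term such as "cart" becomes a candidate predictor for a target variable like e-commerce.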
Use of ML in a multi-source production process
[Diagram; recoverable elements:]
• Population frame (ASIA). Reference population: 184,000 enterprises
• Big Data: Internet as Data Source (websites and social networks): 85,000 websites
• Web scraping + text processing: Document Terms Matrix, predictors
• Predictions: e-commerce, e-recruitment, e-tendering, …
• Survey data
• Estimation: estimates
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
Learners:
1. Logistic Regression
2. Classification Trees
3. Ensembles (Bagging, Boosting, Random Forests)
4. Naïve Bayes
5. Neural Networks
6. Support Vector Machines
7. …
Learner        Accuracy  Recall  Precision  F1-measure
Naïve Bayes    0.84      0.56    0.56       0.56
Logistic       0.84      0.57    0.57       0.57
Decision Tree  0.87      0.64    0.64       0.64
Neural Net     0.88      0.65    0.66       0.66
Bagging        0.88      0.66    0.67       0.67
SVM            0.90      0.62    0.76       0.68
Boosting       0.90      0.71    0.71       0.71
Random Forest  0.90      0.73    0.73       0.73
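The F1-measure column is the harmonic mean of precision and recall, F1 = 2 · P · R / (P + R). The short sketch below recomputes it for the SVM row, the only row where precision and recall differ noticeably; the function name f1_score is chosen for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# SVM row of the table: recall 0.62, precision 0.76
print(round(f1_score(0.76, 0.62), 2))  # 0.68
```

Because the table reports precision and recall already rounded to two decimals, recomputing F1 from them can differ from the reported value in the last digit for some rows.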
Estimator: Design based / model assisted
Formula: $\hat{Y} = \sum_{r} y_k w_k$
Weighting: $\sum_{k=1}^{r} w_k = N_U$
Description: The weights $w_k$ are obtained by a calibration procedure making use of known totals in the population.

Estimator: Model based
Formula: $\hat{Y} = \sum_{U_2} \hat{y}_k w'_k$
Weighting: $\sum_{k=1}^{U_2} w'_k = N_{U_1}$
Description: Count of the predicted values $\hat{y}_k$ for all units whose websites it was possible to reach (population $U_2$), calibrated in order to make them representative of all the population having websites ($U_1$).

Estimator: Combined
Formula: $\hat{Y} = \sum_{U_2} \hat{y}_k + \sum_{r_1} (y_k - \hat{y}_k) w''_k + \sum_{r_2} y_k w'''_k$
Weighting: $\sum_{k=1}^{r_1} w''_k = N_{U_2}$ and $\sum_{k=1}^{r_2} w'''_k = N_{U_1 - U_2}$
Description: Estimates produced by using both survey data and predicted values.
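The three estimators in the table above can be written as short functions. This is a sketch under stated assumptions: the weights are taken as given (the calibration step is not shown), r1 is read as the respondents whose websites were reached and r2 as the respondents with a website that was not reached (an interpretation based on the weight totals N_U2 and N_U1−U2), and all numbers are toy values invented for the example.

```python
def design_based(y_sample, weights):
    """Weighted sum of the observed values y_k over the respondents r."""
    return sum(y * w for y, w in zip(y_sample, weights))

def model_based(y_pred_u2, weights):
    """Weighted sum of the predicted values over the units in U2."""
    return sum(y_hat * w for y_hat, w in zip(y_pred_u2, weights))

def combined(y_pred_u2, r1, r2):
    """Sum of predictions over U2, plus a weighted correction from the
    respondents in r1, plus a weighted sum over the respondents in r2.

    r1: iterable of (observed y, predicted y, weight)
    r2: iterable of (observed y, weight)
    """
    prediction_total = sum(y_pred_u2)
    correction = sum((y - y_hat) * w for y, y_hat, w in r1)
    remainder = sum(y * w for y, w in r2)
    return prediction_total + correction + remainder

# Toy setting: N_U1 = 10 enterprises with a website, of which N_U2 = 4 reached
y_pred_u2 = [1, 0, 1, 1]                 # predictions for the 4 units in U2
r1 = [(1, 1, 2.0), (0, 1, 2.0)]          # weights sum to N_U2 = 4
r2 = [(1, 3.0), (0, 3.0)]                # weights sum to N_U1 - N_U2 = 6

print(design_based([1, 0, 1], [10, 10, 10]))
print(model_based(y_pred_u2, [2.5] * 4))  # weights sum to N_U1 = 10
print(combined(y_pred_u2, r1, r2))
```

In the combined estimator the second term corrects the prediction total using the units for which both an observed and a predicted value are available, and the third term covers the part of the population the predictions cannot reach.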
Some conclusions
1. The adoption of Machine Learning is a real paradigm shift for Official Statistics.
2. Algorithmic modeling is particularly suitable for new data sources.
3. The fundamental principle underlying ML is to prioritize the predictive capability of a model, regardless of its interpretability.
4. Generalizability is the main requirement.
5. Evaluating the accuracy of predictions is the key to choosing the model.
6. The ML approach can/should be adopted in the traditional production process based on primary (survey) data.
7. The ML approach is often the only suitable one in a multi-source production process, where new sources (Big Data) require algorithmic solutions able to handle their volume and variety.
The use of Machine Learning and Official Statistics
Thank you!
barcarol@istat.it

More Related Content

What's hot

IRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence Chain
IRJET Journal
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Intel® Software
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
Real Time Survey_Long Method
Real Time Survey_Long MethodReal Time Survey_Long Method
Real Time Survey_Long Method
Innovation Network
 
Application of Exponential Gamma Distribution in Modeling Queuing Data
Application of Exponential Gamma Distribution in Modeling Queuing DataApplication of Exponential Gamma Distribution in Modeling Queuing Data
Application of Exponential Gamma Distribution in Modeling Queuing Data
ijtsrd
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
snoreen
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
Olivier Jeunen
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
Galit Shmueli
 
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
csandit
 
Machine Learning for Understanding Biomedical Publications
Machine Learning for Understanding Biomedical PublicationsMachine Learning for Understanding Biomedical Publications
Machine Learning for Understanding Biomedical Publications
Grigorios Tsoumakas
 
Matlab Data And Statistics
Matlab Data And StatisticsMatlab Data And Statistics
Matlab Data And Statistics
DataminingTools Inc
 
Single view vs. multiple views scatterplots
Single view vs. multiple views scatterplotsSingle view vs. multiple views scatterplots
Single view vs. multiple views scatterplots
IJECEIAES
 
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Olivier Jeunen
 
Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...
ranjit banshpal
 
A Review of Various Clustering Techniques
A Review of Various Clustering TechniquesA Review of Various Clustering Techniques
A Review of Various Clustering Techniques
IJEACS
 
Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)
Galit Shmueli
 
Predictive data analytics models and their applications
Predictive data analytics models and their applicationsPredictive data analytics models and their applications
Predictive data analytics models and their applications
Bharathi Raja Asoka Chakravarthi
 
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
ertekg
 
Data Mining and Knowledge Management
Data Mining and Knowledge ManagementData Mining and Knowledge Management
Data Mining and Knowledge Management
IRJET Journal
 

What's hot (20)

IRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence ChainIRJET- Missing Data Imputation by Evidence Chain
IRJET- Missing Data Imputation by Evidence Chain
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Real Time Survey_Long Method
Real Time Survey_Long MethodReal Time Survey_Long Method
Real Time Survey_Long Method
 
Application of Exponential Gamma Distribution in Modeling Queuing Data
Application of Exponential Gamma Distribution in Modeling Queuing DataApplication of Exponential Gamma Distribution in Modeling Queuing Data
Application of Exponential Gamma Distribution in Modeling Queuing Data
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
 
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
Turnover Prediction of Shares Using Data Mining Techniques : A Case Study
 
Machine Learning for Understanding Biomedical Publications
Machine Learning for Understanding Biomedical PublicationsMachine Learning for Understanding Biomedical Publications
Machine Learning for Understanding Biomedical Publications
 
Matlab Data And Statistics
Matlab Data And StatisticsMatlab Data And Statistics
Matlab Data And Statistics
 
Single view vs. multiple views scatterplots
Single view vs. multiple views scatterplotsSingle view vs. multiple views scatterplots
Single view vs. multiple views scatterplots
 
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
 
Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...Data mining technique for classification and feature evaluation using stream ...
Data mining technique for classification and feature evaluation using stream ...
 
A Review of Various Clustering Techniques
A Review of Various Clustering TechniquesA Review of Various Clustering Techniques
A Review of Various Clustering Techniques
 
Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)Predictive Model Selection in PLS-PM (SCECR 2015)
Predictive Model Selection in PLS-PM (SCECR 2015)
 
Predictive data analytics models and their applications
Predictive data analytics models and their applicationsPredictive data analytics models and their applications
Predictive data analytics models and their applications
 
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...
 
Data Mining and Knowledge Management
Data Mining and Knowledge ManagementData Mining and Knowledge Management
Data Mining and Knowledge Management
 

Similar to G. Barcaroli, The use of machine learning in official statistics

Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion
antimo musone
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
VenkateswaraBabuRavi
 
Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...
Kim Flintoff
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
Shesha R
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
pradeep kumar
 
Topic_6
Topic_6Topic_6
Topic_6butest
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science
Frank Kienle
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
Dr. Abdul Ahad Abro
 
Machine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdfMachine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdf
Dr.DHANALAKSHMI SENTHILKUMAR
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
Akshay Kanchan
 
Machine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.pptMachine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.ppt
ShivaShiva783981
 
Real life application of statistics in engineering
Real life application of statistics in engineeringReal life application of statistics in engineering
Real life application of statistics in engineering
JannatulFerdous160
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptx
AnanthReddy38
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
Francesca Lazzeri, PhD
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analytics
Akin Osman Kazakci
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
ijcseit
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
IJCSES Journal
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Sri Ambati
 
machine_learning_section1_ebook.pdf
machine_learning_section1_ebook.pdfmachine_learning_section1_ebook.pdf
machine_learning_section1_ebook.pdf
agfi
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
NitinSharma134320
 

Similar to G. Barcaroli, The use of machine learning in official statistics (20)

Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...Introduction to Data and Computation: Essential capabilities for everyone in ...
Introduction to Data and Computation: Essential capabilities for everyone in ...
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Topic_6
Topic_6Topic_6
Topic_6
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
Machine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdfMachine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdf
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Machine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.pptMachine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.ppt
 
Real life application of statistics in engineering
Real life application of statistics in engineeringReal life application of statistics in engineering
Real life application of statistics in engineering
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptx
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analytics
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
 
machine_learning_section1_ebook.pdf
machine_learning_section1_ebook.pdfmachine_learning_section1_ebook.pdf
machine_learning_section1_ebook.pdf
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 

More from Istituto nazionale di statistica

Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
Istituto nazionale di statistica
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Istituto nazionale di statistica
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Istituto nazionale di statistica
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Istituto nazionale di statistica
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica1414a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica14
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
Istituto nazionale di statistica
 

G. Barcaroli, The use of machine learning in official statistics

  • 1. The use of machine learning in official statistics. Giulio Barcaroli, Istituto Nazionale di Statistica
  • 2. Outline: ① Data science and Machine Learning (ML) ② Two cultures in statistical modeling: data vs algorithmic modeling, to explain or to predict? ③ Official statistics: what kind of modeling? ④ Why ML in official statistics? ⑤ ML in the traditional statistical information production process ⑥ ML in a multi-source production process ⑦ Some conclusions
  • 3. Data science and Machine Learning. “Data science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data.” “Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to learn (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.” (Wikipedia)
  • 4. From data modeling to algorithmic modeling. «Statistical Modeling: The Two Cultures» (Breiman, 2001): “There are two cultures in the use of statistical modeling to reach conclusions from data.
      - One assumes that the data are generated by a given stochastic data model.
      - The other uses algorithmic models and treats the data mechanism as unknown.
    The statistical community has been committed to the almost exclusive use of data models. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
  • 5. From data modeling to algorithmic modeling. “Assuming a stochastic data model for the inside of the black box. A common data model is that data are generated by independent draws from: response variables = f(predictor variables, random noise, parameters). Model validation: yes–no using goodness-of-fit tests and residual examination.”
  • 6. From data modeling to algorithmic modeling. “The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this: [diagram]. Model validation: measured by predictive accuracy.”
  • 7. Machine Learning. “A core objective of a learner is to generalize from its experience. Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set. In comparison, the k-fold cross-validation method randomly splits the data in k subsets where k-1 subsets of the data are used to train the model while the k-th subset is used to test the predictive ability of the training model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.” (Wikipedia)
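The three validation schemes named in the quotation can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names (`holdout_split`, `k_fold_splits`, `bootstrap_sample`) and the fixed random seeds are hypothetical and do not come from the slides.

```python
import random

def holdout_split(n, train_fraction=2/3, seed=0):
    """Shuffle indices 0..n-1 and split them into training and test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each of the k folds is the test set once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(drawn))
    return drawn, out_of_bag
```

With 9 records, `holdout_split(9)` returns 6 training and 3 test indices, matching the conventional 2/3 and 1/3 designation; in the bootstrap case the out-of-bag indices can serve as the test set.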
  • 8. Primary and secondary data: towards a multisource environment. “NSOs are currently facing unprecedented pressure to evaluate how they operate. Years of declining response rates to primary data collection efforts and the proliferation of readily accessible data, which has made it easier for private companies to produce statistics, is putting into question the role of NSOs. In response, many NSOs are looking to tap into these alternative data sources to supplement, or even replace, data collected by traditional means.” (UNECE Machine Learning Team) So, the shift is from primary data (survey data), where the only source is data collected for statistical purposes, to secondary data (administrative or Big Data sources). The nature of secondary data, in particular their volume and variety, makes the algorithmic approach more convenient than the data modeling approach. But even in the classical production process based on primary data (described by the Generic Statistical Business Process Model), Machine Learning can be competitive in many phases.
  • 9. Modeling in Official Statistics: primary data process. Modeling is widely used in Official Statistics. In the standard statistical information production process, model-based or model-assisted techniques are adopted in:
      - Sampling design (stratification)
      - Data integration (record linkage and statistical matching)
      - Data editing and imputation
      - Outlier detection and handling
      - Total non-response handling
      - Estimation
  • 10. Use of models in primary data production process.
    Examples of implicit definition of models:
      - stratification in sample design
      - donor search for imputation
      - population totals for calibration
    Examples of explicit definition of models:
      - models for imputation
      - models for outlier detection
      - models for calibration
  • 11. Example: imputation (donor search). Given a dataset with a number of missing values on a variable Y, impute them with the hot-deck donor method.
      1. Order the dataset by other variables X1, X2, …, Xp.
      2. Scan the dataset starting from the first record. Whenever there is a missing value in the Y variable, impute the value of the previous record.
    Implicit model: Y = f(X1, X2, …, Xp). No parameters, no indications about the quality of the imputation.
      Id  Y   X1  X2  X3
      c   12  1   1   1
      b   6   1   1   2
      d   8   1   2   3
      f   ?   2   1   1
      h   21  2   1   2
      a   3   2   2   4
      i   ?   3   1   1
      e   7   3   3   1
      g   13  4   1   3
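The two hot-deck steps on this slide can be written out as a short Python sketch. The data are the nine records shown on the slide; the function name `hot_deck_impute` and the use of `None` for a missing Y are assumptions made for the illustration.

```python
# Records from the slide: (Id, Y, X1, X2, X3); None marks a missing Y.
records = [
    ("c", 12, 1, 1, 1), ("b", 6, 1, 1, 2), ("d", 8, 1, 2, 3),
    ("f", None, 2, 1, 1), ("h", 21, 2, 1, 2), ("a", 3, 2, 2, 4),
    ("i", None, 3, 1, 1), ("e", 7, 3, 3, 1), ("g", 13, 4, 1, 3),
]

def hot_deck_impute(rows):
    """Sequential hot-deck: sort by the X variables, then fill each missing Y
    with the Y value carried by the record just before it."""
    ordered = sorted(rows, key=lambda r: r[2:])   # step 1: order by X1, X2, X3
    imputed = []
    last_y = None                                 # Y of the previous record (the donor)
    for rec_id, y, *xs in ordered:                # step 2: scan from the first record
        if y is None:
            y = last_y                            # stays None if no donor precedes it
        else:
            last_y = y
        imputed.append((rec_id, y, *xs))
    return imputed

result = {rec_id: y for rec_id, y, *_ in hot_deck_impute(records)}
print(result["f"], result["i"])   # f takes 8 from donor d, i takes 3 from donor a
```

The sketch makes the slide's point visible: the imputed values depend only on the sort order, and nothing in the procedure measures how good the imputation is.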
  • 12. Example: model based imputation (traditional approach). Fit models using complete data; apply to impute; evaluate fitting.
    Complete data:
      Id  Y   X1  X2  X3
      c   12  1   1   1
      b   6   1   1   2
      d   8   1   2   3
      h   21  2   1   2
      a   3   2   2   4
      e   7   3   3   1
      g   13  4   1   3
    Incomplete data:
      Id  Y   X1  X2  X3
      f   ?   2   1   1
      i   ?   3   1   1
  • 13. Example: model based imputation (Machine Learning approach). Fit models using training set; evaluate performance on validate set; choose the best and apply to impute.
    Complete data (“ground truth”), split into:
    Training set:
      Id  Y   X1  X2  X3
      c   12  1   1   1
      b   6   1   1   2
      d   8   1   2   3
      h   21  2   1   2
    Validate set:
      Id  Y   X1  X2  X3
      a   3   2   2   4
      e   7   3   3   1
      g   13  4   1   3
    Incomplete data:
      Id  Y   X1  X2  X3
      f   ?   2   1   1
      i   ?   3   1   1
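A minimal sketch of the fit, evaluate, choose and impute loop on the slide's data follows. The slide does not say which models are compared or which error measure is used, so the two candidates (`fit_mean`, `fit_nearest_neighbour`) and the mean absolute error criterion are assumptions chosen only to keep the example self-contained.

```python
# Records are (Id, Y, (X1, X2, X3)), copied from the slide.
training = [("c", 12, (1, 1, 1)), ("b", 6, (1, 1, 2)),
            ("d", 8, (1, 2, 3)), ("h", 21, (2, 1, 2))]
validation = [("a", 3, (2, 2, 4)), ("e", 7, (3, 3, 1)), ("g", 13, (4, 1, 3))]
incomplete = [("f", (2, 1, 1)), ("i", (3, 1, 1))]

def fit_mean(train):
    """Hypothetical candidate 1: always predict the training mean of Y."""
    mean_y = sum(y for _, y, _ in train) / len(train)
    return lambda x: mean_y

def fit_nearest_neighbour(train):
    """Hypothetical candidate 2: predict the Y of the closest training record."""
    def predict(x):
        _, y, _ = min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[2], x)))
        return y
    return predict

def mean_absolute_error(model, rows):
    """Average absolute gap between observed and predicted Y on the given rows."""
    return sum(abs(y - model(x)) for _, y, x in rows) / len(rows)

candidates = {"mean": fit_mean, "nearest neighbour": fit_nearest_neighbour}
fitted = {name: fit(training) for name, fit in candidates.items()}           # fit on training set
errors = {name: mean_absolute_error(m, validation) for name, m in fitted.items()}  # evaluate
best_name = min(errors, key=errors.get)                                      # choose the best
imputed = {rec_id: fitted[best_name](x) for rec_id, x in incomplete}         # apply to impute
```

Unlike the hot-deck procedure, the validation error gives an explicit indication of how well each candidate predicts values it has not seen, which is what drives the choice of model.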
  • 14. Use of ML in a multi-source production process. Diagram labels: survey data (e-commerce, e-recruitment, e-tendering, …); 32,000 enterprises; sample selection; data collection on 19,000 enterprises; machine learning; population frame (ASIA); reference population: 184,000 enterprises; predictors; websites and social networks; Big Data: Internet as Data Source; web scraping + text processing; Document Terms Matrix; 11,700 websites; 14,000 URLs.
  • 15. Use of ML in a multi-source production process. Diagram labels: survey data (e-commerce, e-recruitment, e-tendering, …); reference population: 184,000 enterprises; predictors; websites and social networks; Big Data: Internet as Data Source; web scraping + text processing; Document Terms Matrix; 85,000 websites; population frame (ASIA); predictions; estimation; estimates.
  • 16. Use of ML in a multi-source production process.
      1. Web scraping: (a) URLs retrieval; (b) websites scraping
      2. Text processing
      3. Machine learning: (a) model fitting; (b) model performance evaluation
      4. Estimation: (a) design-based estimators; (b) model-based estimators; (c) combined estimators
      5. Comparative quality evaluation: (a) analytic and resampling methods; (b) simulation studies
    Learners: 1. Logistic Regression; 2. Classification Trees; 3. Ensembles (Bagging, Boosting, Random Forests); 4. Naïve Bayes; 5. Neural Networks; 6. Support Vector Machines; 7. …
  • 17. Use of ML in a multi-source production process. Machine learning: model performance evaluation.
      Learner        Accuracy  Recall  Precision  F1-measure
      Naïve Bayes    0.84      0.56    0.56       0.56
      Logistic       0.84      0.57    0.57       0.57
      Decision Tree  0.87      0.64    0.64       0.64
      Neural Net     0.88      0.65    0.66       0.66
      Bagging        0.88      0.66    0.67       0.67
      SVM            0.90      0.62    0.76       0.68
      Boosting       0.90      0.71    0.71       0.71
      Random Forest  0.90      0.73    0.73       0.73
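The four measures in the table are standard functions of the confusion matrix of a binary classifier. The sketch below shows how they are computed; the function names and the toy label vectors are illustrative assumptions, and only the SVM precision and recall figures are taken from the table.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, precision and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0        # share of true positives found
    precision = tp / (tp + fp) if tp + fp else 0.0     # share of predicted positives that are right
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1_score(precision, recall)}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Consistency check against the SVM row of the table (precision 0.76, recall 0.62).
print(round(f1_score(0.76, 0.62), 2))   # → 0.68
```

Accuracy alone hides the differences between learners here: SVM, Boosting and Random Forest all reach 0.90, while their recall ranges from 0.62 to 0.73.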
  • 18. Use of ML in a multi-source production process. Estimation: estimator, formula, weighting, description.
      - Design based / model assisted: $\hat{Y} = \sum_{r} y_k w_k$, with $\sum_{k=1}^{r} w_k = N_U$. The weights $w_k$ are obtained by a calibration procedure making use of known totals in the population.
      - Model based: $\hat{Y} = \sum_{U_2} \hat{y}_k w'_k$, with $\sum_{k=1}^{U_2} w'_k = N_{U_1}$. Count of the predicted values $\hat{y}_k$ for all units for which it was possible to reach their websites (population $U_2$), calibrated in order to make them representative of all the population having websites ($U_1$).
      - Combined: $\hat{Y} = \sum_{U_2} \hat{y}_k + \sum_{r_1} (y_k - \hat{y}_k) w''_k + \sum_{r_2} y_k w'''_k$, with $\sum_{k=1}^{r_1} w''_k = N_{U_2}$ and $\sum_{k=1}^{r_2} w'''_k = N_{U_1 - U_2}$. Estimates produced by using both survey data and predicted values.
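To make the structure of the three estimators concrete, here is a toy sketch. It assumes that y is a 0/1 indicator, that the combined estimator corrects the predicted total with survey residuals (observed minus predicted), and it replaces the calibration procedure with uniform expansion weights that merely sum to the stated population size. All function names and numbers are hypothetical.

```python
def expansion_weights(n_units, population_size):
    """Uniform weights summing to population_size (a stand-in for calibrated weights)."""
    return [population_size / n_units] * n_units

def weighted_total(values, weights):
    """Sum of values times weights; the form shared by the first two estimators."""
    return sum(v * w for v, w in zip(values, weights))

def combined_estimate(y_hat_u2, y_r1, y_hat_r1, w_r1, y_r2, w_r2):
    """Predicted total on U2, plus weighted survey residuals on r1,
    plus a weighted survey total on r2 (units outside U2)."""
    residuals = [y - y_hat for y, y_hat in zip(y_r1, y_hat_r1)]
    return sum(y_hat_u2) + weighted_total(residuals, w_r1) + weighted_total(y_r2, w_r2)

# Design based: 4 respondents observed, weights summing to N_U = 100.
design_based = weighted_total([1, 0, 1, 1], expansion_weights(4, 100))

# Model based: predictions for 5 scraped units, weights summing to N_U1 = 10.
model_based = weighted_total([1, 0, 1, 1, 0], expansion_weights(5, 10))

# Combined: N_U2 = 5 scraped units, N_(U1-U2) = 6 units not reached.
combined = combined_estimate(
    y_hat_u2=[1, 0, 1, 1, 0],
    y_r1=[1, 0], y_hat_r1=[1, 1], w_r1=expansion_weights(2, 5),
    y_r2=[1, 0, 0], w_r2=expansion_weights(3, 6),
)
```

In practice the weights would come from a calibration on known population totals rather than from a uniform expansion, so the numbers above only illustrate how the terms are assembled.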
  • 19. Use of ML in a multi-source production process. 1. Web scraping: (a) URLs retrieval; (b) websites scraping. 2. Text processing. 3. Machine learning: (a) model fitting; (b) model performance evaluation. 4. Estimation: (a) design-based estimators; (b) model-based estimators; (c) combined estimators. 5. Comparative quality evaluation: (a) analytic and resampling methods; (b) simulation studies.
  • 20. Use of ML in a multi-source production process. 1. Web scraping: (a) URLs retrieval; (b) websites scraping. 2. Text processing. 3. Machine learning: (a) model fitting; (b) model performance evaluation. 4. Estimation: (a) design-based estimators; (b) model-based estimators; (c) combined estimators. 5. Comparative quality evaluation: (a) analytic and resampling methods; (b) simulation studies.
  • 21. Some conclusions
      1. The adoption of Machine Learning is a real paradigm shift for Official Statistics.
      2. Algorithmic modeling is particularly suitable for new data sources.
      3. The fundamental principle underlying ML is to privilege the predictive capability of a model, regardless of its interpretability.
      4. Generalizability is the main requirement.
      5. The evaluation of the accuracy of predictions is the key to choosing the model.
      6. The ML approach can and should be adopted in the traditional production process based on primary (survey) data.
      7. The ML approach is often the only suitable one in a multi-source production process, where new sources (Big Data) require algorithmic solutions able to handle their volume and variety.
  • 22. The use of Machine Learning and Official Statistics. Thank you! barcarol@istat.it