This document discusses the use of machine learning in official statistics. It begins by contrasting the traditional "data modeling" approach used in statistics with the newer "algorithmic modeling" approach used in machine learning. It then discusses how machine learning can be applied both in traditional statistical production processes based on primary survey data and in new multi-source production processes that incorporate alternative data sources such as big data. Specifically, machine learning can be used for tasks like imputation, outlier detection, and estimation. The document concludes that machine learning represents a paradigm shift for official statistics and is particularly well-suited for new data sources, as it prioritizes prediction over interpretability and generalizability.
antimo musone - We will talk about Machine Learning: what it is, what it is for, and what its fields of application are. We will analyze and see in action the various machine learning solutions available in the cloud (IBM's Watson and Microsoft's Azure ML), which allow companies, research centers, and developers to embed machine learning and predictive analytics over huge amounts of data into their applications, in order to offer ever more innovative and intelligent services. We will demo the platforms, revealing their pros and cons according to the needs we want to satisfy.
This presentation is based on "Statistical Modeling: The Two Cultures" by Leo Breiman. It compares the data modeling culture (statistics) with the algorithmic modeling culture (machine learning).
Improved correlation analysis and visualization of industrial alarm data (ISA Interchange)
The problem of multivariate alarm analysis and rationalization is complex and important in the area of smart alarm management due to the interrelationships between variables. Techniques for capturing and visualizing correlation information, especially directly from historical alarm data, are beneficial for further analysis. In this paper, the Gaussian kernel method is applied to generate pseudo-continuous time series from the original binary alarm data. This can reduce the influence of missed, false, and chattering alarms. By taking into account time lags between alarm variables, a correlation color map of the transformed (pseudo) data is used to show clusters of correlated variables, with the alarm tags reordered to better group the correlated alarms. Thereafter, correlation and redundancy information can easily be found and used to improve the alarm settings, and statistical methods such as singular value decomposition can be applied within each cluster to help design multivariate alarm strategies. Industrial case studies are given to illustrate the practicality and efficacy of the proposed method. This improved method is shown to be better than the alarm similarity color map when applied in the analysis of industrial alarm data.
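As a rough illustration of the kernel step described above, a binary alarm sequence can be convolved with a Gaussian kernel to produce a pseudo-continuous series. This is only a sketch: the kernel width, truncation, and toy alarm data below are invented for the example, not taken from the paper.

```python
import math

def gaussian_smooth(alarms, sigma=2.0):
    """Convolve a 0/1 alarm sequence with a Gaussian kernel to
    produce a pseudo-continuous time series."""
    n = len(alarms)
    half = int(3 * sigma)  # truncate the kernel at roughly +/- 3 sigma
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-half, half + 1)]
    norm = sum(kernel)
    smoothed = []
    for t in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            if 0 <= t + k < n:
                acc += alarms[t + k] * kernel[k + half]
        smoothed.append(acc / norm)
    return smoothed

# A chattering alarm (rapid on/off switching) becomes a smooth bump,
# which is why this transformation dampens chattering and false alarms.
series = [0, 1, 0, 1, 1, 0, 1, 0, 0, 0]
pseudo = gaussian_smooth(series, sigma=1.5)
```

Correlation analysis can then be run on the smoothed series instead of the raw binary flags.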
Data warehouses are structures holding large amounts of data collected from heterogeneous sources, to be used in a decision support system. Data warehouse analysis identifies hidden, initially unexpected patterns, but this analysis carries high memory and computation costs. Data reduction methods have been proposed to make it easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GAs), as evolutionary algorithms, and Multiple Correspondence Analysis (MCA), as a factor analysis method, to carry out this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, and aims to find the fact profile that is closest to a reference. GAs identify the candidate subsets, and the χ² formula of MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and n fact profiles extracted from the warehouse.
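The GA-plus-distance interplay described above might be sketched roughly as follows. The profiles, the χ²-style distance, the size penalty, and all GA parameters below are invented for illustration; the paper's actual MCA-based evaluation is richer than this stand-in.

```python
import random

random.seed(0)

def chi2_distance(a, b):
    """Chi-squared-style distance between two profiles."""
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)

def fitness(mask, reference, fact):
    """Distance over the selected dimensions only, with a small
    penalty so larger subsets are not favoured for free."""
    ref = [r for r, m in zip(reference, mask) if m]
    fct = [f for f, m in zip(fact, mask) if m]
    if not ref:
        return float("inf")
    return chi2_distance(ref, fct) + 0.01 * sum(mask)

def ga_reduce(reference, fact, p, pop_size=20, generations=30):
    """Tiny genetic algorithm: evolve bit-masks that select a
    reduced subset of the p dimensions."""
    pop = [[random.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, reference, fact))
        survivors = pop[: pop_size // 2]      # keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, p)      # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(p)           # point mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda m: fitness(m, reference, fact))

reference = [5.0, 1.0, 3.0, 2.0, 8.0, 1.0]
fact      = [5.1, 4.0, 3.0, 2.1, 0.5, 1.0]
best = ga_reduce(reference, fact, p=6)
```

The returned mask keeps the dimensions on which the fact profile agrees with the reference, dropping the others — the p' < p reduction the abstract describes.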
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env... (Intel® Software)
This session explains what the solutions desired by IT, Internet, and Silicon Valley companies can look like, how they may differ from those for more “classical” consumers of machine learning and analytics, and the challenges that current and future HPC development may have to cope with.
Collecting and analyzing data in real time doesn't have to be as stressful or hard as it sounds, especially if you want to collect real-time data using surveys. There is a short way and a long way to collect real-time survey data. The short way is to use software that can collect and analyze survey data when embedded into PowerPoint presentations or webinars. The long way is to use hard copies of surveys to collect the data and Excel to analyze it. This document will show you step by step how to collect and analyze survey data the long way.
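The "long way" tallying described above can also be scripted instead of done by hand in Excel. The question name and CSV layout below are invented for the example; any spreadsheet export with one row per respondent would work the same way.

```python
import csv
import io
from collections import Counter

# Hypothetical raw survey export, one row per respondent.
raw = """respondent,q1_satisfaction
1,Agree
2,Disagree
3,Agree
4,Neutral
5,Agree
"""

reader = csv.DictReader(io.StringIO(raw))
tally = Counter(row["q1_satisfaction"] for row in reader)

# Convert counts into the percentage breakdown you would otherwise
# build with a pivot table.
total = sum(tally.values())
percentages = {answer: 100.0 * n / total for answer, n in tally.items()}
```

With a real file, `io.StringIO(raw)` would be replaced by `open("survey.csv")`.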
Application of Exponential Gamma Distribution in Modeling Queuing Data (ijtsrd)
There are many events in daily life where a queue is formed. Queuing theory is the study of waiting lines, and it is crucial for analyzing queuing processes in everyday life. Queuing theory applies not only to daily life but also to computer program sequencing, networks, the medical field, the banking sector, and more. Researchers have applied many statistical distributions to queuing data. In this study, we apply a new distribution, the Exponential Gamma distribution, to fit data on the waiting time of bank customers before service is rendered. We compared the adequacy and performance of the results with other existing statistical distributions. The results show that the Exponential Gamma distribution is adequate and also performs better than the existing distributions. Ayeni Taiwo Michael | Ogunwale Olukunle Daniel | Adewusi Oluwasesan Adeoye, "Application of Exponential-Gamma Distribution in Modeling Queuing Data", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30097.pdf
Paper URL: https://www.ijtsrd.com/mathemetics/statistics/30097/application-of-exponential-gamma-distribution-in-modeling-queuing-data/ayeni-taiwo-michael
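The general workflow of fitting a waiting-time distribution can be illustrated with a plain gamma distribution and the method of moments; the Exponential-Gamma distribution of the paper is not in any standard library, so this is an illustrative stand-in, and the waiting times below are invented.

```python
import statistics

def gamma_method_of_moments(waits):
    """Fit an ordinary gamma distribution to waiting times by the
    method of moments: shape k = mean^2 / var, scale theta = var / mean."""
    mean = statistics.fmean(waits)
    var = statistics.pvariance(waits)
    k = mean * mean / var
    theta = var / mean
    return k, theta

# Hypothetical waiting times (minutes) of bank customers before service.
waits = [2.1, 3.4, 1.2, 5.6, 4.3, 2.8, 3.9, 6.1, 2.2, 3.0]
k, theta = gamma_method_of_moments(waits)
```

By construction k·θ reproduces the sample mean, so the fitted distribution has the right average waiting time; comparing such fits across candidate distributions (e.g. via a goodness-of-fit statistic) is the adequacy comparison the abstract describes.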
Counterfactual Learning for Recommendation (Olivier Jeunen)
Slides for our presentation at the REVEAL workshop for RecSys '19 in Copenhagen and a Data Science Leuven Meetup, titled "Counterfactual Learning for Recommendation".
Turnover Prediction of Shares Using Data Mining Techniques: A Case Study (csandit)
Predicting the total turnover of a company in the ever-fluctuating stock market has always proved to be a precarious and difficult task. Data mining is a well-known sphere of computer science that aims at extracting meaningful information from large databases. However, despite the existence of many algorithms for predicting future trends, their efficiency is questionable, as their predictions suffer from a high error rate. The objective of this paper is to investigate various existing classification algorithms to predict the turnover of different companies based on the stock price. The authorized dataset for predicting the turnover was taken from www.bsc.com and included the stock market values of various companies over the past 10 years. The algorithms were investigated using the ‘R’ tool. The feature selection algorithm Boruta was run on this dataset to extract the important and influential features for classification. With these extracted features, the total turnover of the company was predicted using algorithms such as Random Forest, Decision Tree, SVM, and Multinomial Regression. This prediction mechanism was implemented to predict the turnover of a company on an everyday basis and hence could help navigate dubious stock market trades. An accuracy rate of 95% was achieved by the above prediction process. Moreover, the importance of the stock market attributes was established as well.
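Boruta itself wraps a random forest, but the general flavor of ranking features by their relationship to the target can be sketched with a simple correlation filter. This is a deliberately simplified stand-in, not the paper's method, and the feature names and daily values below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(features, target):
    """Rank features by the absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical daily stock attributes vs. total turnover.
features = {
    "close_price": [10, 12, 11, 15, 14, 18],
    "num_trades":  [5, 3, 6, 2, 4, 1],
    "day_of_week": [1, 2, 3, 4, 5, 1],
}
turnover = [100, 125, 110, 160, 140, 185]
ranking = rank_features(features, turnover)
```

A proper wrapper method such as Boruta goes further by comparing each feature's importance against shuffled "shadow" copies, but the output is the same kind of ranked, pruned feature list used for the downstream classifiers.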
Machine Learning for Understanding Biomedical Publications (Grigorios Tsoumakas)
My keynote talk on "Machine Learning for Understanding Biomedical Publications" at the 12th Conference of the Hellenic Society for Computational Biology and Bioinformatics, HSCBB17
Single view vs. multiple views scatterplots (IJECEIAES)
Among all the available visualization tools, the scatterplot has been deeply analyzed through the years, and many researchers have investigated how to improve this tool to face new challenges. The scatterplot is considered one of the most functional of the data visual representations, due to its relative simplicity compared to other multivariable visualization techniques. Even so, one of the most significant unsolved challenges in data visualization is effectively displaying datasets with many attributes or dimensions, such as multidimensional or multivariate ones. The focus of this research is to compare the single view and the multiple views visualization paradigms for displaying multivariable datasets using scatterplots. A multivariable scatterplot has been developed as a web application to provide the single view tool, whereas for the multiple views visualization, the ScatterDice web app has been slightly modified and adopted as a traditional, yet interactive, scatterplot matrix. Finally, a taxonomy of tasks for visualization tools has been chosen to define the use case and the tests to compare the two paradigms.
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...) (Olivier Jeunen)
Slides for my Doctoral Symposium presentation at RecSys '19 in Copenhagen, titled "Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems".
Data mining is an interdisciplinary field, combining technologies from databases, machine learning, statistics, pattern recognition, information retrieval, artificial neural networks, knowledge-based systems, and data visualization. In practical terms, data mining is the exploration of large data sets to find hidden relationships and to summarize the data in forms that are both valid and understandable to the data owner. Clustering is an unsupervised procedure that partitions data items into groups so that items in the same group are more similar to one another than to items in other groups, according to some similarity or distance measure. Cluster analysis identifies groups of similar objects and is one of the most widely used methods in practical data mining applications. The objects being grouped can be physical, such as students, or abstract, such as customer behavior or handwriting. Many clustering algorithms have been proposed, falling into different families of clustering methods. The intention of this paper is to provide a classification of some prominent clustering algorithms.
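A minimal version of the clustering idea described above is k-means: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The points, choice of k, and iteration count below are invented for the example, and real implementations add restarts and convergence checks.

```python
import random

random.seed(1)

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples: alternate nearest-centroid assignment
    and centroid recomputation for a fixed number of iterations."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # leave a centroid in place if it lost all points
                centroids[i] = tuple(sum(c) / len(cluster)
                                     for c in zip(*cluster))
    return centroids, clusters

# Two visually obvious groups of 2-D points.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids, clusters = kmeans(points, k=2)
```

The items within each returned cluster are close to one another and far from the other cluster, which is exactly the similarity criterion the abstract states.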
This slide deck discusses predictive data analytics models and their applications in a broader context. It gives simple examples of regression and classification.
Re-Mining Association Mining Results Through Visualization, Data Envelopment ... (ertekg)
Download link: https://ertekprojects.com/gurdal-ertek-publications/blog/re-mining-association-mining-results-through-visualization-data-envelopment-analysis-and-decision-trees/
Re-mining is a general framework which suggests the execution of additional data mining steps based on the results of an original data mining process. This study investigates the multi-faceted re-mining of association mining results, develops and presents a practical methodology, and shows the applicability of the developed methodology through real world data. The methodology suggests re-mining using data visualization, data envelopment analysis, and decision trees. Six hypotheses, regarding how re-mining can be carried out on association mining results, are answered in the case study through empirical analysis.
Introduction to Data and Computation: Essential capabilities for everyone in ... (Kim Flintoff)
An overview seminar about the themes of the Curtin Institute for Computation, and some thoughts on the future role of these capabilities in Learning and Teaching.
Overview of machine learning concepts: overfitting and train/test splits; types of machine learning (supervised, unsupervised, and reinforcement learning); introduction to Bayes' theorem; linear regression (model assumptions; regularization: lasso, ridge, elastic net); classification and regression algorithms (Naïve Bayes, k-nearest neighbors, logistic regression, support vector machines (SVM), decision trees, and random forests); classification errors.
This slide deck gives a brief overview of supervised, unsupervised, and reinforcement learning. Algorithms discussed include Naive Bayes, k-nearest neighbors, SVM, decision trees, and Markov models. It also covers the difference between regression and classification, the difference between supervised and reinforcement learning, the iterative functioning of Markov models, and machine learning applications.
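The train/test split mentioned in the overviews above can be sketched in a few lines. The 80/20 ratio, the seed, and the toy data are illustrative choices, not anything prescribed by the slides.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test parts,
    so the model is evaluated on examples it never saw during training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
```

Holding out a test set like this is the basic defence against the overfitting problem those same overviews list first.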
Main points of this slide presentation:
1. What is statistics?
2. Applications
3. Applications of statistics in computer science and engineering
4. Machine learning's relation to statistics
5. Applications of statistics in data mining
6. Data mining's relation to statistics
7. Outline of applications
8. Details of some of these applications are given below
Thank you
Top 20 Data Science Interview Questions and Answers in 2023 (AnanthReddy38)
Here are the top 20 data science interview questions along with their answers:
What is data science?
Data science is an interdisciplinary field that involves extracting insights and knowledge from data using various scientific methods, algorithms, and tools.
What are the different steps involved in the data science process?
The data science process typically involves the following steps:
a. Problem formulation
b. Data collection
c. Data cleaning and preprocessing
d. Exploratory data analysis
e. Feature engineering
f. Model selection and training
g. Model evaluation and validation
h. Deployment and monitoring
What is the difference between supervised and unsupervised learning?
Supervised learning involves training a model on labeled data, where the target variable is known, to make predictions or classify new instances. Unsupervised learning, on the other hand, deals with unlabeled data and aims to discover patterns, relationships, or structures within the data.
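The contrast in this answer can be made concrete with a toy example: a supervised 1-nearest-neighbour prediction consults the labels, while an unsupervised grouping of the same points never looks at them. The data, labels, and distance threshold below are invented for the illustration.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: predict the label of the closest labelled point."""
    best = min(range(len(train_points)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(train_points[i], query)))
    return train_labels[best]

def group_by_threshold(points, threshold=3.0):
    """Unsupervised: link points closer than a threshold into groups,
    discovering structure without any labels."""
    groups = []
    for p in points:
        for g in groups:
            if any(sum((a - b) ** 2 for a, b in zip(p, q)) < threshold ** 2
                   for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

points = [(1, 1), (2, 1), (9, 9), (10, 9)]
labels = ["low", "low", "high", "high"]
prediction = nearest_neighbor_predict(points, labels, (1.5, 1.2))  # uses labels
groups = group_by_threshold(points)                                # ignores labels
```

Both functions see the same points; only the supervised one can answer "what class is this new point?", while the unsupervised one can only report "these points belong together".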
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns the training data too well, resulting in poor generalization to new, unseen data. To prevent overfitting, techniques like cross-validation, regularization, and early stopping can be employed.
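Cross-validation, named in this answer as a guard against overfitting, amounts to generating k disjoint train/test index splits so every example is tested on exactly once. The fold count and data size below are illustrative.

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n examples."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
```

Averaging a model's score over the k test folds gives a generalization estimate that a single lucky split cannot.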
What is feature engineering?
Feature engineering involves creating new features from the existing data that can improve the performance of machine learning models. It includes techniques like feature extraction, transformation, scaling, and selection.
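Of the techniques this answer lists, scaling is the simplest to show; here is min-max scaling of one numeric feature. The column values are invented for the example.

```python
def min_max_scale(values):
    """Rescale a numeric feature to the [0, 1] range so features
    measured in different units become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column carries no information
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 25, 40, 60]
scaled = min_max_scale(ages)
```

Distance-based models (k-nearest neighbors, SVMs) in particular benefit, since otherwise the feature with the largest raw range dominates the distance.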
The importance of model fairness and interpretability in AI systems (Francesca Lazzeri, PhD)
Machine learning model fairness and interpretability are critical for data scientists, researchers and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them.
In this session, Francesca will go over a few methods and tools that enable you to "unpack" machine learning models, gain insights into how and why they produce specific results, assess your AI systems' fairness, and mitigate any observed fairness issues.
Using open-source fairness and interpretability packages, attendees will learn how to:
- Explain model prediction by generating feature importance values for the entire model and/or individual data points.
- Achieve model interpretability on real-world datasets at scale, during training and inference.
- Use an interactive visualization dashboard to discover patterns in data and explanations at training time.
- Leverage additional interactive visualizations to assess which groups of users might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.
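One common way to generate the feature importance values mentioned in the first bullet is permutation importance: shuffle one feature's column and measure how much a model's accuracy drops. This is a generic sketch, not the API of the open-source packages the session uses, and the toy "model" and data are invented.

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Score each feature by how much shuffling its column
    degrades the model's accuracy on (X, y)."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the feature/target relationship
        permuted = [row[:j] + (v,) + row[j + 1:]
                    for row, v in zip(X, column)]
        importances.append(base - accuracy(permuted))
    return importances

# A toy "model" that only ever looks at feature 0, so shuffling
# feature 1 should cost it nothing.
predict = lambda row: 1 if row[0] > 0.5 else 0
X = [(0.9, 0.1), (0.8, 0.7), (0.2, 0.9), (0.1, 0.3)]
y = [1, 1, 0, 0]
imps = permutation_importance(predict, X, y, n_features=2)
```

Because it only needs a `predict` function, the same recipe works for any black-box model, which is what makes it useful for the model-agnostic explanation dashboards described above.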
Data Science for Business Managers - An intro to ROI for predictive analytics (Akin Osman Kazakci)
This module addresses critical business aspects of launching a predictive analytics project. How to establish the relationship with business KPIs is discussed. A notion of a "data hunt", for planning and acquiring external data to improve predictions, is introduced. Model quality and its role in the ROI of data and prediction tasks are explained. The module concludes with a glimpse of how collaborative data challenges can improve predictive model quality in no time.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (ijcseit)
Prediction is at the heart of modern statistics, where accuracy matters most. Combining algorithms with sound statistical methods yields more accurate predictions from data sets. The widespread use of algorithms has simplified mathematical models and reduced manual calculation. Prediction is the essence of data science and machine learning applications, giving control over situations. Implementing any model requires proper feature extraction, which supports proper model building and, in turn, precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution, using feature engineering techniques to improve the accuracy of different machine learning models.
S. Corradini, L. Martinez, 30 November - 1 December 2021 -
Webinar: Workplace inclusion: the national picture and Istat's experience
Title: The employment situation of people with disabilities
L. Lavecchia, 30 November - 1 December 2021 -
Webinar: The information framework for the Green Deal: developments and information demand on energy issues
Title: Measuring energy poverty in Italy
V. Buratta, 30 November - 1 December 2021 -
Webinar: The data strategy: the European initiative and the national response
Title: Istat's role in the National and European Data Strategy
E. Fornero, 30 November - 1 December 2021 -
Webinar: Gender statistics by default: the paradigm shift in statistics and beyond
Title: Illusions, clichés, and truths in the fight against gender inequality
A. Perrazzelli, 30 November - 1 December 2021 -
Webinar: Gender statistics by default: the paradigm shift in statistics and beyond
Title: Gender quality to support growth
A. Tinto, 30 November - 1 December 2021 -
Webinar: The effects of the pandemic on life satisfaction and well-being: analyses and perspectives
Title: The impact of the pandemic on the subjective component of Equitable and Sustainable Well-being
L. Becchetti, 30 November - 1 December 2021 -
Webinar: The effects of the pandemic on life satisfaction and well-being: analyses and perspectives
Title: The pandemic through subjective indicators at the international level: a paradox?
G. Onder, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: The ISS mortality surveillance system and new perspectives
C. Romano, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: New tools and surveys for relevant information during an emergency
S. Prati, M. Battaglini, G. Corsetti, 30 November - 1 December 2021 -
Webinar: The lesson of the crisis for demographic and social statistics
Title: The challenge for demography: timeliness and quality of information
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
G. Barcaroli, The use of machine learning in official statistics
1. The use of machine
learning in official
statistics
Giulio Barcaroli
Istituto Nazionale di Statistica
0
2. ① Data science and Machine Learning (ML)
② Two cultures in statistical modeling: data vs algorithmic modeling, to explain or to predict?
③ Official statistics: what kind of modeling?
④ Why ML in official statistics?
⑤ ML in the traditional statistical information production process
⑥ ML in a multi-source production process
⑦ Some conclusions
1
Outline
3. 2
2
Data science and
Machine Learning
“Data science is a concept to unify statistics,
data analysis, machine learning and their related
methods in order to understand and analyze actual
phenomena with data.”
“Machine learning is a subset of artificial
intelligence in the field of computer science that
often uses statistical techniques to give
computers the ability to learn (i.e., progressively
improve performance on a specific task) with data,
without being explicitly programmed.”
(Wikipedia)
4. «Statistical Modeling: The Two Cultures» (Breiman, 2001)
“There are two cultures in the use of statistical modeling to reach
conclusions from data.
o One assumes that the data are generated by a given stochastic data
model.
o The other uses algorithmic models and treats the data mechanism as
unknown.
The statistical community has been committed to the almost exclusive use of
data models.
Algorithmic modeling, both in theory and practice, has developed rapidly in
fields outside statistics. It can be used both on large complex data sets and
as a more accurate and informative alternative to data modeling on smaller
data sets. If our goal as a field is to use data to solve problems, then we
need to move away from exclusive dependence on data models and adopt a
more diverse set of tools.”
3
3
From data modeling to algorithmic modeling
5. “Assuming a stochastic data model for the inside of the black box.
A common data model is that data are generated by independent draws from
response variables = f(predictor variables, random noise, parameters)
Model validation: Yes–no using goodness-of-fit tests and residual examination”
4
4
From data modeling to algorithmic modeling
6. “The analysis in this culture considers the inside of the box complex and unknown. Their approach is to
find a function f(x) — an algorithm that operates on x to predict the responses y. Their black box looks
like this:
Model validation. Measured by predictive accuracy”
5
5
From data modeling to algorithmic modeling
7. 6
6
Machine Learning
“A core objective of a learner is to generalize from its experience.
Generalization in this context is the ability of a learning machine
to perform accurately on new, unseen examples/tasks after
having experienced a learning data set.
Classification machine learning models can be validated by
accuracy estimation techniques like the holdout method,
which splits the data in a training and test set (conventionally 2/3
training set and 1/3 test set designation) and evaluates the
performance of the training model on the test set.
In comparison, the N-fold-cross-validation method randomly
splits the data in k subsets where the k-1 instances of the data
are used to train the model while the k-th instance is used to test
the predictive ability of the training model. In addition to the
holdout and cross-validation methods, bootstrap, which samples
n instances with replacement from the dataset, can be used to
assess model accuracy.”
(Wikipedia)
8. 7
7
Primary and secondary data: towards a multisource
environment
“NSOs are currently facing unprecedented pressure to evaluate how they operate. Years of declining response rates
to primary data collection efforts and the proliferation of readily accessible data, which has made it easier for private
companies to produce statistics, is putting into question the role of NSOs. In response, many NSOs are looking to
tap into these alternative data sources to supplement, or even replace, data collected by traditional means.”
(UNECE Machine Learning Team)
So, the shift is from primary data (survey data) where the only source is represented by data are collected for
statistical purposes, to secondary data (administrative or Big Data sources).
The nature of secondary data, in particular the volume and variety of these data, makes the algorithmic
approach more convenient than the data modeling approach.
But even in the classical production process based on primary data (described by the Generic Statistic Business
Process Model) Machine Learning can be competitive in many phases.
9. 8
8
Modeling in Official Statistics: primary data process
Modeling is widely used in Official Statistics.
In the standard statistical information production
process, model based or model assisted
techniques are adopted in
- Sampling design (stratification)
- Data integration (record linkage and statistical matching)
- Data editing and imputation
- Outlier detection and handling
- Total non response handling
- Estimation
10. 9
9
Use of models in primary data production process
Examples of implicit definition of models:
• stratification in sample design
• donor search for imputation
• population totals for calibration
Examples of explicit definition of models:
• models for imputation
• models for outlier detection
• models for calibration
11. 10
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
f ? 2 1 1
h 21 2 1 2
a 3 2 2 4
i ? 3 1 1
e 7 3 3 1
g 13 4 1 3
10
Example: imputation (donor search)
Given a dataset with a number of missing values on a
variable Y, impute them with the hot-deck donor method.
1. Order the dataset by other variables X1, X2, …,Xp
2. Scan the dataset starting from the first record.
Whenever there is a missing value in the Y variable,
impute the value of the previous record.
Implicit model:
Y = f(X1,X2, … Xp)
No parameters, no indications about the quality of the
imputation.
12. 11
Id Y X1 X2 X3
f ? 2 1 1
i ? 3 1 1
11
Example: model based imputation
(traditional approach)
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Complete
data
Incomplete
data
Fit models
using complete
data
Apply to
impute
Evaluate
fitting
13. 12
Id Y X1 X2 X3
f ? 2 1 1
i ? 3 1 1
12
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Id Y X1 X2 X3
c 12 1 1 1
b 6 1 1 2
d 8 1 2 3
h 21 2 1 2
Id Y X1 X2 X3
a 3 2 2 4
e 7 3 3 1
g 13 4 1 3
Complete
data:
“ground
truth”
Training
set
Validate
set
Incomplete
data
Fit models
using training
set
Evaluate
performance
on validate set
Choose the
best and apply
to impute
Example: model based imputation
(Machine Learning approach)
14. 13
13
Use of ML in a multi-source production process
e-commerce
e-recruitment
e-tendering
…
32,000 enterprises
Sample
selection
Data
collection
on 19,000
enterprises
Machine
learning
Population
frame
(ASIA)
Reference population:
184,000 enterprises
Predictors
Survey
data
Websites
and social
networks
Big Data:
Internet as
Data Source
Web scraping +
text processing
Document
Terms
Matrix
11,700 websites
14,000 URLs
15. 14
14
Use of ML in a multi-source production process
e-commerce
e-recruitment
e-tendering
…
Reference population:
184,000 enterprises
Predictors
Websites
and social
networks
Big Data:
Internet as
Data Source
Web scraping +
text processing
Document
Terms
Matrix
85,000 websites
Population
frame
(ASIA)
Predictions
Survey
data
Estimation
Estimates
16. 15
15
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
1. Logistic Regression
2. Classification Trees
3. Ensembles (Bagging, Boosting,
Random Forests)
4. Naïve Bayes
5. Neural Networks
6. Support Vector Machines
7. …
17. 16
16
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
Learner Accuracy Recall Precision F1-measure
Naïve Bayes 0.84 0.56 0.56 0.56
Logistic 0.84 0.57 0.57 0.57
Decision
Tree
0.87 0.64 0.64 0.64
Neural Net 0.88 0.65 0.66 0.66
Bagging 0.88 0.66 0.67 0.67
SVM 0.90 0.62 0.76 0.68
Boosting 0.90 0.71 0.71 0.71
Random
Forest
0.90 0.73 0.73 0.73
18. 17
17
Estimator Formula Weighting Description
Design based /
model assisted
𝑌 = ∑ 𝑟 𝑦 𝑘 𝑤 𝑘
𝑘=1
𝑟
𝑤 𝑘 = 𝑁 𝑈
𝑤 𝑘 weights are obtained by a
calibration procedure making
use of known totals in the
population
Model based
𝑌 = ∑ 𝑈2 𝑦 𝑘 𝑤 𝑘
′
𝑘=1
𝑈2
𝑤 𝑘
′
= 𝑁 𝑈1
Count of the predicted values
𝑦 𝑘 for all units for which it
was possible reach their
websites (population 𝑈2
),
calibrated in order to make
them representative of all
the population having
websites (𝑈1
).
Combined
𝑌 = ∑ 𝑈2 𝑦 𝑘 +
∑ 𝑟1( 𝑦 𝑘 − 𝑦 𝑘)𝑤 𝑘
′′
+
∑ 𝑟2 𝑦 𝑘 𝑤 𝑘
′′′
∑ 𝑘=1
𝑟1
𝑤 𝑘
′′
= 𝑁 𝑈2
and
∑ 𝑘=1
𝑟2
𝑤 𝑘
′′′
= 𝑁 𝑈1−𝑈2
Estimates produced by using
both survey data and
predicted values.
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
19. 18
18
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
20. 19
19
Use of ML in a multi-source production process
1. Web scraping
a. URLs retrieval
b. Websites scraping
2. Text processing
3. Machine learning
a. Models fitting
b. Models performance evaluation
4. Estimation
a. Design based estimators
b. Model based estimators
c. Combined estimators
5. Quality compared evaluation
a. Analytic and resampling methods
b. Simulation studies
21. 20
20
Some conclusions
1. The adoption of Machine Learning is a real paradigm shift for Official Statistics
2. Algorithmic modeling is particularly suitable for new data sources
3. The fundamental principle at the basis of ML is to privilege the prediction capability
of a model, regardless of its interpretability
4. Generalizability is the main requirement
5. The evaluation of the accuracy of predictions is the key to choose the model
6. ML approach can/should be adopted in the traditional production process based
on primary (survey) data
7. ML approach is often the only suitable in a multi-source production process, where
new sources (Big Data) require algorithmic solutions able to handle their volume and
variety
22. 21
21
The use of Machine Learning and Official Statistics
Thank you!
barcarol@istat.it