Big data as a source for official statistics

Edwin de Jonge and Piet Daas
November 12, London

Overview

• Big Data
• Research ‘theme’ at Stat. Netherlands
• Data driven approach
• Visualization as a tool
• Why?
• Examples in our office

• Issues & challenges
• From an official statistical perspective
• Focus on methodological and legal ones

Why Visualization?

October 1st 2013, Statistics Netherlands

Effective Display!
(see Tor Nørretranders, “Bandwidth of our senses”)

Anscombe’s quartet…

     DS1            DS2            DS3            DS4
  x      y       x      y       x      y       x      y
 10   8.04      10   9.14      10   7.46       8   6.58
  8   6.95       8   8.14       8   6.77       8   5.76
 13   7.58      13   8.74      13  12.74       8   7.71
  9   8.81       9   8.77       9   7.11       8   8.84
 11   8.33      11   9.26      11   7.81       8   8.47
 14   9.96      14   8.10      14   8.84       8   7.04
  6   7.24       6   6.13       6   6.08       8   5.25
  4   4.26       4   3.10       4   5.39      19  12.50
 12  10.84      12   9.13      12   8.15       8   5.56
  7   4.82       7   7.26       7   6.42       8   7.91
  5   5.68       5   4.74       5   5.73       8   6.89

Anscombe’s quartet

Property                                     Value
Mean of x1, x2, x3, x4                       All equal: 9
Variance of x1, x2, x3, x4                   All equal: 11
Mean of y1, y2, y3, y4                       All equal: 7.50
Variance of y1, y2, y3, y4                   All equal: 4.1
Correlation for ds1, ds2, ds3, ds4           All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4     All equal: y = 3.00 + 0.500x

Looks the same, right?
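
The summary statistics of the quartet can be reproduced with a short script. The following is a minimal sketch in plain Python; the function and variable names (`summarize`, `QUARTET`, `SUMMARY`) are illustrative and not part of the original slides. It uses the sample variance (division by n - 1).

```python
from math import sqrt

# Anscombe's quartet; DS1-DS3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def summarize(x, y):
    """Mean, variance, correlation and least-squares line for one data set."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / variance(x)
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "corr": cov / sqrt(variance(x) * variance(y)),
        "slope": slope, "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The four printed rows agree to the precision shown in the table, which is exactly why the numbers alone cannot distinguish the data sets and a plot is needed.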

Why visualization?
Tool for data analysis
– Effective display of information
– Summary of data
– Show outliers / patterns
– Helps exploring data
– Helps checking assumptions

Often Maps
Many visualizations are maps
– Positive:
‐ Is familiar
‐ Attractive
But a map only makes sense:
‐ When data is geographically distributed
‐ When locality is meaningful
‐ When data is correctly normalized

Many maps are just population maps!
A better map:
‐ Takes population size into account (e.g. by making figures relative)
‐ May plot the difference w.r.t. an expected value
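
As an illustration of the two suggestions for a better map, here is a minimal sketch that turns raw counts into relative figures and into differences from an expected value. The region names and numbers are hypothetical toy values, and taking "expected" to mean "the count a region would have at the overall rate" is an assumption made for this example.

```python
def relative_figures(counts, population, per=1000):
    """Events per `per` inhabitants for each region."""
    return {region: per * counts[region] / population[region] for region in counts}


def difference_from_expected(counts, population):
    """Observed count minus the count expected at the overall rate."""
    overall_rate = sum(counts.values()) / sum(population[r] for r in counts)
    return {r: counts[r] - overall_rate * population[r] for r in counts}


# Hypothetical toy data: region A dominates the raw counts only because it is large.
counts = {"A": 500, "B": 50, "C": 50}
population = {"A": 100_000, "B": 10_000, "C": 5_000}

RATES = relative_figures(counts, population)
DIFF = difference_from_expected(counts, population)

print(RATES)
print(DIFF)
```

A map of the raw counts would highlight region A; a map of either derived figure highlights region C instead.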

Visualization is not easy
– Creating good visualizations is hard
– “Easy Reading” is not “Easy Writing”
Visualization must be:
– Faithful
– Objective
Thus it must not introduce perceptual bias

Visualization
– Use appropriate chart
– Use appropriate scales
‐ x, y, color, time
– Use appropriate granularity
Research: What works for which data?

Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
• Last traditional census was in 1971

‐ Now by (re-)using existing information
• Linking administrative sources and available sample survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot

Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable
• Age in this example
3. Combine records into 100 groups (170,000 records each)
• Numeric variables
• Calculate average (avg. age)
• Categorical variables
• Ratio between categories present (male vs. female)
4. Plot figure of select number of variables (up to 12)
• Colours used are important
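
The sort-and-combine steps can be sketched as follows. This is a minimal illustration in plain Python, not the actual tableplot implementation used at Statistics Netherlands; the function name `tableplot_bins` and the record layout (a list of dictionaries) are assumptions, and missing values are not handled.

```python
from collections import Counter


def tableplot_bins(records, sort_key, numeric, categorical, n_bins=100):
    """Sort records on sort_key, split them into n_bins groups of (nearly)
    equal size and summarize every variable within each group."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        group = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if not group:  # fewer records than bins
            continue
        summary = {"size": len(group)}
        for var in numeric:  # numeric variable: average per group
            summary[var] = sum(r[var] for r in group) / len(group)
        for var in categorical:  # categorical variable: share of each category
            freq = Counter(r[var] for r in group)
            summary[var] = {cat: c / len(group) for cat, c in freq.items()}
        bins.append(summary)
    return bins


# Synthetic records for demonstration only.
records = [{"age": a, "sex": "male" if a % 2 else "female"} for a in range(200)]
BINS = tableplot_bins(records, "age", numeric=["age"], categorical=["sex"])

print(BINS[0])
```

Each element of `BINS` corresponds to one row of the plot: the averages become bar lengths and the category shares become stacked coloured segments.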

Tableplot of the census test file

Tableplot: Monitor data quality
– All data in the office passes through these stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)

Processing of data
Raw (unedited) data

Edited data

Final data

Example 2: Social Security Register
– Contains all financial data on jobs, benefits and pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
• With a visualisation method: a heat map

Heat map: Age vs. ‘Income’
[Figure: heat map; axes: Age and Income (euro); legend: amount]

Visualization helps with volume of data
– Summarize by “binning”
– Tableplot
– Histogram
– Heatmap (2D histogram)
– Smoothing?
– Detect unexpected patterns

We use it as a tool to check, explore and communicate data
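
Binning in two dimensions, as in the age versus income heat map, amounts to counting records per rectangular cell. The sketch below shows the idea in plain Python; the function name `bin_2d`, the bin widths and the sample points are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count (x, y) points per rectangular cell.

    Each key is the lower-left corner of a cell; each value is the number
    of points in that cell, i.e. the colour intensity in a heat map.
    """
    cells = Counter()
    for x, y in points:
        cell = (x // x_width * x_width, y // y_width * y_width)
        cells[cell] += 1
    return cells


# Hypothetical (age, income) pairs.
points = [(23, 1200), (24, 1900), (27, 2500), (41, 3100), (44, 3900)]
CELLS = bin_2d(points, x_width=5, y_width=1000)

print(dict(CELLS))
```

However many records go in, the output never has more entries than there are cells, which is what makes the approach suitable for very large files.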

Big Data: Issues and challenges

During our exploratory studies we identified
a number of issues & challenges.
Focussing on the methodological and legal ones,
we found that there is a need to:
1) deal with noisy and dirty data
2) deal with selectivity
3) go beyond correlation
4) cope with privacy and security issues
We have only solved some of them (partially)

1) Deal with noisy and dirty data
– Big Data is often
‐ noisy
‐ dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How to extract information from Big Data?
‐ In the best/most efficient way

Noisy and dirty data

Social media sentiment

Traffic loop data

Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, use models (to capture structure), ‘Google approach’ (80/20 rule)
Preferably do NOT use samples!

Noise reduction
Social media: daily sentiment in Dutch messages

Noise reduction
Social media: daily & weekly sentiment in Dutch messages

Noise reduction
Social media: daily, weekly & monthly sentiment in Dutch messages

Noise reduction
Social media: monthly sentiment in Dutch messages
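
Aggregating a daily series to weekly and monthly averages is the simplest form of the noise reduction shown on these slides. The following is a minimal sketch using only the standard library; the helper names and the synthetic alternating series are assumptions, and the slides do not state exactly how the sentiment series was aggregated.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate(daily, period_of):
    """Average daily values within the period returned by period_of(day)."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[period_of(day)].append(value)
    return {period: sum(vals) / len(vals) for period, vals in groups.items()}


def by_week(day):
    iso = day.isocalendar()
    return (iso[0], iso[1])  # ISO year and ISO week number


def by_month(day):
    return (day.year, day.month)


# Synthetic noisy series: the value flips between -1 and +1 every day.
start = date(2013, 1, 7)  # a Monday
DAILY = {start + timedelta(days=i): (1.0 if i % 2 else -1.0) for i in range(28)}

WEEKLY = aggregate(DAILY, by_week)
MONTHLY = aggregate(DAILY, by_month)

print(WEEKLY)
print(MONTHLY)
```

The daily values swing over the full range, while the weekly averages stay close to zero; the coarser the period, the less of the day-to-day noise survives.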

Social media sentiment & Consumer confidence
Social media: monthly sentiment in Dutch messages & Consumer confidence

Corr: 0.88

Dirty data
Total number of vehicles detected by traffic loops during the day
[Figure: x-axis: Time (hour)]

Number of active loops varies during the day
(first 10 min)

Correct for dirty data
Use data from same location from previous/next minute (5 min. window)
Before: Total = ~295 million vehicles
After: Total = ~330 million vehicles (+12%)
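
The correction can be sketched as a simple gap-filling routine per traffic loop. The slides only say that data from the previous/next minute at the same location is used within a 5-minute window; taking the mean of the observed neighbours is an assumption made here, as are the function name `fill_gaps` and the toy counts.

```python
def fill_gaps(counts, window=5):
    """Fill missing per-minute counts (None) for one location.

    A gap is replaced by the mean of the observed values inside a window
    of `window` minutes centred on it; it stays None when that window
    contains no observations at all.
    """
    half = window // 2
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is None:
            lo, hi = max(0, i - half), min(len(counts), i + half + 1)
            neighbours = [c for c in counts[lo:hi] if c is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical vehicle counts per minute for a single loop.
counts = [10, 12, None, 14, 16, None, None, None, None, None, 20]
FILLED = fill_gaps(counts)

print(FILLED)
```

Only original observations are used as neighbours, so a long outage is not filled with values that were themselves imputed.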

2) Deal with selectivity
– Big Data sources are selective (they do NOT cover the entire population considered)
‐ Some probably more than others
– AND: all Big Data sources studied so far contain events!
‐ E.g. social media messages created, calls made and vehicles detected
‐ Events are probably the reason why these sources are so Big
– When there is a need to correct for selectivity
1) Convert events to units
How to identify units?
2) Correct for selectivity of units included
How to cope with units that are truly absent and part of the population under study?

Units / events
– Big Data contains events
‐ Social media messages are generated by usernames
‐ Traffic loops count vehicles (Dutch roads are units)
‐ Call detail records of mobile phone IDs

‐ Convert events to units
• By profiling
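
The first part of converting events to units is grouping the events by whatever identifier the source provides. The sketch below covers only that grouping step, not the profiling itself; the field name `user` and the sample events are hypothetical.

```python
from collections import Counter


def events_to_units(events, unit_id):
    """Collapse a list of events into one entry per unit with its event count."""
    return Counter(unit_id(event) for event in events)


# Hypothetical social media messages (events) written by usernames (units).
events = [
    {"user": "a", "text": "..."},
    {"user": "b", "text": "..."},
    {"user": "a", "text": "..."},
    {"user": "a", "text": "..."},
]
UNITS = events_to_units(events, unit_id=lambda e: e["user"])

print(UNITS)
```

Four events reduce to two units. A username or phone ID is still not the same as a person, which is where the question of how to identify units comes in.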

Travel behaviour of active mobile phones

Mobility of very active mobile phone users
- during a 14-day period

Based on:
- Call- and text-activity multiple times a day
- Location based on phone masts

Clearly selective:
- North and South-west of the country hardly included

3) Go beyond correlation
– You will very likely use correlation to check Big Data findings against those in other (survey) data
– When correlation is high:
1) Try falsifying it first (is it coincidental/spurious?)
correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality)
cointegration, structural time-series approach
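
One simple way to try falsifying a high correlation between two time series is to correlate their period-to-period changes instead of their levels, since two series that merely share a trend will correlate strongly in levels. This is a quick first check and not the cointegration or structural time-series analysis named above; the two series below are synthetic and the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def diff(series):
    """First differences: the change from one period to the next."""
    return [b - a for a, b in zip(series, series[1:])]


# Synthetic series: both trend upwards, but their fluctuations are unrelated.
x = [-0.5, 1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5]
y = [0.5, 2.5, 3.5, 5.5, 8.5, 10.5, 11.5, 13.5]

R_LEVELS = pearson(x, y)
R_CHANGES = pearson(diff(x), diff(y))

print(round(R_LEVELS, 2), round(R_CHANGES, 2))
```

If the correlation largely disappears after differencing, the shared trend is the more likely explanation and the finding needs further scrutiny before it is trusted.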

Use common sense!

An illustrative example
Official unemployment percentage

Number of social media messages including the word “unemployment”

X

Corr: 0.90?

4) Privacy and security issues
– The Dutch privacy and security law allows the study of privacy-sensitive data for scientific and statistical research
– Still, appropriate measures need to be taken
• Prior to new research studies, check privacy sensitivity of data
• In case of privacy-sensitive data:
• Try to anonymize micro data or use aggregates
• Use a secure environment

– Legal issues around enabling the use of Big Data for official statistics production are currently being looked at
‐ No problems for Big Data that can be considered ‘Administrative data’, i.e. Big Data that is managed by a (semi-)governmentally funded organisation

Conclusions
– Big data is a very interesting data source
‐ Also for official statistics
– Visualisation is a great way of getting/creating insight
‐ Not only for data exploration
– A number of fundamental issues need to be resolved
‐ Methodological
‐ Legal
‐ Technical (not discussed here)
– We expect great things in the near future!
