'Data scientist' is an elusive term that pops up more and more. Is it a buzzword, or is something really changing in the world? Piet Daas of Statistics Netherlands (CBS) takes us on a tour of the changes he sees around him.
Data Science
‘and the future of statistics’
Piet Daas (and many colleagues)*
Statistics Netherlands / Centraal Bureau voor de Statistiek
*Martijn Tennekes, Edwin de Jonge, Alex Priem, Bart Buelens, Merijn van Pelt, Paul van den Hurk
Data Science NL, 8 Nov. Utrecht
Layout
• Introduction
• What is Data Science?
• You need data to be one
• Data Scientist skills
• A sexy job with a paradigm shift
• Link with Statistics Netherlands work
• Examples of recent developments
Introduction
“Statistics Netherlands will produce
about 5000 official publications and
tables in 2012”
For this we need DATA
Two types of data
Primary data
• Our own surveys
Secondary data (data from ‘others’)
• Administrative sources
• ‘New’ data sources
• Data, data everywhere!
Statistics & Data science
1) Is the study of ‘the use of secondary data
for statistics’ data science?
2) What is data science?
What is Data Science?
• First used in 1974 by Danish computer
scientist Peter Naur in his book “Concise
Survey of Computer Methods”
• Defined as:
• “The science of dealing with data, once
they have been established”
Established data is data that has been created. If that
was done by someone else, then it is secondary data!
Data scientist/statistician is “the sexiest job of the
21st century”
People able to derive knowledge from large amounts of data!
Data science skills ‘landscape’
[Figure: skills landscape; one axis labelled ‘Programming skills’]
Sexy Skills of Data Geeks
1) Statistics - traditional analysis you're used to
thinking about
2) Data ‘munging’ - parsing, scraping, and
formatting data
3) Visualization - graphs, tools, etc.
Statistics Netherlands law
• “Statistics Netherlands aims to reduce the
administrative burden for companies and the
public as much as possible”
• By (re-)using existing administrative registrations of both
government and government-funded organizations.
• And by studying potential new sources of information
Statistics Netherlands and Data
• Data is generated in increasing amounts and at increasing frequencies:
• From ‘Data scarcity’ (sample survey) to ‘Data abundance’ (administrative
& Big)
• Ever increasing amounts of data need to be checked, processed and
analyzed
• More sources of information become available
• Opportunities to produce statistics faster (‘real-time statistics’)
• Need for new methods and tools
1. Methods to quickly uncover information from the massive amounts of data
available, such as visualisation methods and data-, text- and stream-mining
techniques (‘making Big Data small’), and high-performance computing
2. Methods capable of integrating the information in the statistical process,
e.g. linking at massive scale, macro/meso-integration, estimation methods
suited for large datasets
Examples of new developments
1) New approaches to official statistical inference
a. Algorithmic inference
2) Visualisation methods to quickly obtain insight into
large datasets
b. Virtual Census (17 million records)
c. Social Security Register (20 million records)
3) Research findings on the use of ‘new’ data sources
d. Traffic loop data (80 million records)
e. Mobile phone data (~500 million records)
f. Social media (12 million - 1 billion records)
Example a. Statistical inference
• Inference is traditionally motivated from a
design-based sample perspective
• The model-based approach is being
gradually adopted in specific circumstances
(e.g. administrative data).
• Next step: algorithmic inference methods
• Machine learning, data mining approaches
Simulation results (1000x)
[Figure: simulation results for four approaches: Design, Model, Neural., DisTree]
Shifting paradigms
Example b. Virtual Census
• Every 10 years a Census needs to be conducted
• No longer with surveys in the Netherlands
• Last traditional census was in 1971
• Now by (re-)using existing information
• Linking administrative sources and available sample
survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot
Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
• Age in this example
3. Combine records into 100 groups (170,000 records each)
• Numeric variables
• Calculate average (avg. age)
• Categorical variables
• Ratio between categories present (male vs. female)
4. Plot figure of a select number of variables
• Colours used are important (up to 12)
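The sort, group and summarise steps above can be sketched in a few lines of Python. This is an illustration only, not the implementation used at Statistics Netherlands: the function name `tableplot_bins`, the field names `age` and `sex`, and the toy data are all assumptions, and the plotting step itself is left out.

```python
from collections import Counter
from statistics import mean

def tableplot_bins(records, sort_key, numeric_vars, categorical_vars, n_bins=100):
    """Sort records on sort_key, split them into n_bins groups and summarise each group."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder evenly over the bins.
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        group = ordered[start:stop]
        if not group:
            continue
        summary = {"bin": b, "size": len(group)}
        for var in numeric_vars:
            # Numeric variable: one average per group.
            summary[var] = mean(r[var] for r in group)
        for var in categorical_vars:
            # Categorical variable: share of each category within the group.
            counts = Counter(r[var] for r in group)
            summary[var] = {cat: c / len(group) for cat, c in counts.items()}
        bins.append(summary)
    return bins

# Toy data (hypothetical): 1,000 synthetic records instead of 17 million.
records = [{"age": i % 100, "sex": "male" if i % 2 else "female"} for i in range(1000)]
bins = tableplot_bins(records, "age", ["age"], ["sex"], n_bins=10)
```

Each entry in `bins` corresponds to one row of the tableplot: the averages become bar lengths and the category shares become stacked colour segments.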
[Figure: tableplot of the census test file]
Processing of data
Raw (unedited) data
Edited data
Final data
Example c: Social Security Register
• Contains all financial data on jobs, benefits
and pensions in the Netherlands
• Collected by the Dutch Tax office
• A total of 20 million records each month
• How to obtain insight into so much data?
• With a visualisation method: a heat map
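A heat map reduces millions of records to a small grid of counts. The sketch below shows that reduction step only; the function name `heatmap_counts`, the bin widths and the sample values are assumptions made for illustration, not figures from the register.

```python
from collections import Counter

def heatmap_counts(pairs, age_width=5, income_width=10_000):
    """Count records per (age bin, income bin) cell; each cell becomes one coloured tile."""
    cells = Counter()
    for age, income in pairs:
        # Floor each value to the lower edge of its bin.
        cell = (age // age_width * age_width, income // income_width * income_width)
        cells[cell] += 1
    return cells

# Hypothetical (age, income in euro) records.
sample = [(23, 18_500), (24, 21_000), (37, 42_000),
          (38, 47_500), (39, 41_000), (61, 30_000)]
cells = heatmap_counts(sample)
```

The number of cells depends on the bin widths, not on the number of records, which is why the approach scales to 20 million records a month.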
Heat map: Age vs. ‘Income’
[Figure: heat map with axes Age and Income (euro)]
A 3D heat map: Age vs. Income vs. Amount
[Figure: two 3D heat maps with axes age and amount; one panel labelled After ‘data reduction’]
Example d: Traffic loop detection data
• Traffic ‘loops’
• Every minute (24/7) the number of passing
vehicles is counted by >10,000 road sensors
& cameras in the Netherlands
• Total vehicles, and vehicles in different length classes
• Interesting source to produce traffic and
transport statistics (and more)
• Huge amounts of data, about 80 million
records a day
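Turning minute-level counts into a daily statistic is, at its simplest, an aggregation. The sketch below assumes a hypothetical record layout of (sensor id, minute of day, length class, vehicle count); the actual layout of the loop data is not given here, and no correction for missing minutes is attempted.

```python
from collections import defaultdict

def daily_totals(minute_records):
    """Sum per-minute vehicle counts into one daily total per length class."""
    totals = defaultdict(int)
    for sensor_id, minute, length_class, count in minute_records:
        totals[length_class] += count
    return dict(totals)

# Hypothetical records: (sensor id, minute of day, length class, vehicles counted).
minute_records = [
    ("A", 0, "short", 12),
    ("A", 0, "long", 3),
    ("A", 1, "short", 10),
    ("B", 0, "short", 7),
    ("B", 1, "long", 2),
]
totals = daily_totals(minute_records)
all_vehicles = sum(totals.values())
```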
Locations
Number of detected vehicles on a single day
Total = ~295 million
Traffic loop detection activity (only first 10 min.)
Number of detected vehicles on a single day
12% added
Example e: Mobile phone data
• Nearly every person in the Netherlands has a mobile phone
• Carried with them and almost always switched on!
• An increasing number of people have a smartphone
• Ideal source of information:
• Use the data of mobile phone companies for:
• Travel behaviour (‘Day time’-population)
• Tourism (new phones that register to network)
• Crowd info (for example during events)
• But also as a data collection instrument:
• Questionnaires (with app, text messaging or browser)
• Taking pictures of products, cash receipts and barcodes
• Determine exact GPS location
• Etc.
Travel behaviour of mobile phones
Mobility of very active mobile phone users
- during a 14-day period
- data of a single mob. company
Based on:
- Call and text activity
multiple times a day
- Location based on phone masts
Clearly selective:
- Includes major cities
- But the North and South-east
of the country much less
Example f: Social media
• Dutch are very active on social media platforms
• Almost always carried along and nearly always switched on
• More and more people have a smartphone!
• Potential source of information for:
• Which topics are current:
• Number of messages and the sentiment about them
• Can be used as a measuring instrument for:
• .
Map by Eric Fischer (via Fast Company)
Social media: Dutch messages
• Dutch are very active on social media platforms
• Potential information source for:
• Topics discussed and the sentiment on these topics (quickly
available!) and probably more?
• Investigate it to find out its potential use
Collected Dutch Twitter messages for study: ‘selection’ of 12 million
Social media: Dutch Twitter topics
[Figure: share of the 12 million messages per topic; topic labels not recoverable. Shares shown: 3%, 7%, 3%, 10%, 7%, 3%, 5%, 46%]
Final remarks: Future of statistics
• Preparing large data sources for statistics is a lot of work
• Exploration phase takes a lot of time
• Reduction of information is needed (‘making big data small’)
• Risk: ‘garbage in’, ‘garbage statistics out’
• Traditional approach does not suffice
• Large data sources are definitely not ‘large’ sample surveys
• Often a selective but large part of the population is included
• Sometimes it is just too much detailed data
• With traditional statistical analysis everything will be significant!
• More need for:
• Visualisation methods (to rapidly gain insight)
• Methods specific to large datasets (speedy and ‘robust’) and
non-linear estimation methods (data mining like)
• ‘Computational statistics’ (& dedicated hardware)
• Privacy demands will increase!
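The remark that everything becomes significant can be illustrated with a two-sample z-test: for a fixed difference the test statistic grows with the square root of the number of records. The numbers below (a difference of 0.1 with a standard deviation of 10) are assumptions chosen for illustration, not results from any of the sources discussed.

```python
import math

def two_sample_z(diff, sd, n_per_group):
    """z statistic and two-sided p-value for a difference in means, sd assumed known."""
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    return z, p

# Same tiny difference, survey-sized versus register-sized groups.
z_small, p_small = two_sample_z(diff=0.1, sd=10.0, n_per_group=1_000)
z_large, p_large = two_sample_z(diff=0.1, sd=10.0, n_per_group=10_000_000)
```

With 1,000 records per group the difference is nowhere near significant; with 10 million per group the same difference is overwhelmingly significant, although its practical relevance has not changed.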