SlideShare a Scribd company logo
1 of 30
Big Data as a Data Source for
Official Statistics
Dr. Piet J.H. Daas
Methodologist, Big Data research coördinator
Oct 31, Gatineau
Statistics
Netherlands
Experiences at Statistics Netherlands
Overview
2
• Big Data
• Research theme at Statistics Netherlands
• About the skills needed
• Examples of our experiences
• Lessons learned (so far)
– Data, data everywhere!
X
Two types of data
Primairy data Secondary data
Our ‘own’ surveys
Data from ‘others’
- Administrative sources
- Big Data
4
NSIs
Exploratory Big Data studies
Lot’s of
unknown
areas
Big Data research at Stat. Neth.
– Exploratory, ‘data driven’
‐ Cases: Road sensor data, Mobile phone data, Social media
‐ There is no Big Data methodology (yet)
– Combination of IT, methodology and content (Data Science)
– Important issues for official statistics
‐ Selectivity (representativity)
‐ Data editing at large scale
‐ Data reduction (dimension reduction)
6
Big Data skills
– The new skills needed are:
‐ At the border of IT and methodology
• Data Science (team)
• High Performance Analytics
‐ Obtained from a number of research areas ‘outside’
statistics
• Computational Sciences
• Machine learning (Statistical learning)
• Ability to extract information from a whole (new) range
of sources
• Such as texts and pictures
7
Hardware: High Performance Computing
8
‘Hadoop’
1 chip (CPU)
4-8 cores
1 GPU (graphics card)
500-1000 ‘cores’
Within 1 machine
Vertical scaling
Use multiple machines
Horizontal scaling
* - Program more efficiently
- Use ’lower’ programming languages
* - Distribute data & workload
Parallelization is trickier in case of internal dependencies
Results of case studies
Example 1: Roads sensors
Road sensor (traffic loop) data
‐ Each minute (24/7) the number of passing vehicles is
counted in around 20.000 ‘loops’ in the Netherlands
• Total and in different length classes
‐ Nice data source for transport and traffic statistics
(and more)
• A lot of data, around 230 million records a day
Locations
10
Data options
– Historical database
‐ Request data via web interface
‐ Minute data for all Highways
• 48 variables, around 2.5TB (Jan 2010‐April 2014)
• Data at a higher aggregation level is edited
– Data stream
‐ FTP‐site
‐ Every minute, all data for all active sensors
‐ Has to be continuously collected
11
Road sensor data
12
Large vehicles
All vehicles All vehicles
All vehicles
Minute data of 1 sensor for 196 days
13
‘Afsluitdijk’ (enclosing dike)
14
‘Afsluitdijk’ (enclosing dike) (2)
Cross correlation between sensor pairs
-Used to validate metadata
Trajectory speed vs. point speed
- Average speed is 98 Km/h
Process of road sensor based statistics
– Historic / streaming data
- Select sensors on Dutch highways
– Preprocessing
- Remove non-informative variables
- Remove bad records
- Calculate number of vehicles (per minute)
- Quality indicators for daily data per sensor
– Dimension reduction of daily data
- Exclude bad sensors
- Reduce dimension from 1440 to 3 (or 4?) on same road and region
- Obtain number of vehicles for each road and region
– Calculate traffic index
‐ Per region.
16
Example 2: Mobile phones
Mobile phone activity as a data source
– Nearly every person in the Netherlands has a mobile phone
‐ Usually on them and almost always switched on!
‐ Many people are very active during the day
– Can data of mobile phones be used for statistics?
‐ Travel behaviour (of active phones)
‐ ‘Day time population’ (of active phones)
‐ Tourism (new phones that register to network)
– Data of a single mobile company was used
‐ Hourly aggregates per area (only when > 15 events)
‐ Especially important for roaming data (foreign visitors)
17
‘Day time population’
– Hourly changes of mobile
phone activity
– 7 & 8 May 2013
– Per area distinguished
– Only data for areas with
> 15 events per hour
18
Tourism: Roaming during European league final
Hardly any
Low
Medium
High
Very high
19
Example 3: Social media
– Dutch are very active on social media!
‐ Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan
• Steeds meer mensen hebben een smartphone!
– Mogelijke informatiebron voor:
‐ Welke onderwerpen zijn actueel:
• Aantal berichten en sentiment hierover
‐ Als meetinstrument te gebruiken voor:
• .
Map by Eric Fischer (via Fast Company)
20
21
Sentiment in Dutch social media
– About the data
‐ Dutch firm that continuously collectsALL public social media messages
written in Dutch
‐ Dataset of more than 3.5 billion messages!
• Covering June 2010 till the present
• Between 3-4 million new messages are added per day
– About sentiment determination
‐ ‘Bag of words’ approach
• List of Dutch words with their associated sentiment
• Added social media specific words (‘FAIL’, ‘LOL’, ‘OMG’ etc.)
‐ Use overall score to determine sentiment
• Is either positive, negative or neutral
‐ Average sentiment per period (day / week / month)
• (#positive - #negative)/#total * 100%
Daily, weekly, monthly sentiment
22
Sentiment per platform
(~10%) (~80%)
Table 1. Social media messages properties for various platforms and their correlation with consumer confidence
Correlation coefficient of
Social media platform Number of social Number of messages as monthly sentiment index and
media messages1
percentage of total (%) consumer confidence ( r )2
All platforms combined 3,153,002,327 100 0.75 0.78
Facebook 334,854,088 10.6 0.81* 0.85*
Twitter 2,526,481,479 80.1 0.68 0.70
Hyves 45,182,025 1.4 0.50 0.58
News sites 56,027,686 1.8 0.37 0.26
Blogs 48,600,987 1.5 0.25 0.22
Google+ 644,039 0.02 -0.04 -0.09
Linkedin 565,811 0.02 -0.23 -0.25
Youtube 5,661,274 0.2 -0.37 -0.41
Forums 134,98,938 4.3 -0.45 -0.49
1
period covered June 2010 untill November 2013
2
confirmed by visual inspecting scatterplots and additional checks (see text)
*cointegrated
Platform specific results
24
Schematic overview
25
Vorige maand Maand
Consumer Confidence
Publication date (~20th)
Social media sentiment
Dag 1-7 Dag 8-14 Dag 15-21 Dag 22-28
Previous month Current month
Day 1-7 Day 8-14 Day 15-21 Day 22-28
Sentiment
Results of comparing various periods
26
Consumer Confidence Facebook Facebook Facebook
+ Twitter * Twitter
0.81* 0.84* 0.86*
0.85* 0.87* 0.89*
0.82 0.85 0.87
0.82* 0.85* 0.89*
0.79* 0.82* 0.84*
0.79 0.83 0.84
0.82* 0.86* 0.89*
0.79* 0.83* 0.87*
0.75* 0.80* 0.81*
LOOCV results*cointegration
Overall findings
27
– Correlation and cointegration
‐ 1st ‘week’ of Consumer confidence usually has 70% response
‐ Best correlation and cointegration with 2nd ‘week’ of the month
• Highest correlation 0.93* (all Facebook * specific word filteredTwitter)
– Granger causality
‐ Changes in Consumer confidence precede changes in Social media
sentiment
‐ For all combinations shown!
• Only tried linear models so far
– Prediction
‐ Slightly better than random chance
‐ Best for the 4th ‘week’ of month
Lessons learned (so far)
The most important ones:
1) There are many types of Big Data
2) Know how to access and analyse large amounts of data
3) Find ways to deal with noisy and unstructured data
4) Mind‐set for Big Data ≠ Mind‐set for survey data
5) Need to beyond correlation
6) Need people with the right skills, knowledge and mind‐set
7) Solve privacy and security issues
8) Data management & costs
28
We are getting more and more grip on these topics
The Future
29
The
future
of
statistics
looks
BIG
Thank you for your attention !@pietdaas

More Related Content

What's hot

Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
Human Centered ICT
 
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Paolo Nesi
 
Snaa2015 (1)
Snaa2015 (1)Snaa2015 (1)
Snaa2015 (1)
filippia
 
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
Christoph Trattner
 

What's hot (20)

Useful by Piet Daas
Useful by Piet DaasUseful by Piet Daas
Useful by Piet Daas
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An Introduction
 
Social Data Sentiment Analysis
Social Data Sentiment AnalysisSocial Data Sentiment Analysis
Social Data Sentiment Analysis
 
Expression of Political Opinions in Press
Expression of Political Opinions in PressExpression of Political Opinions in Press
Expression of Political Opinions in Press
 
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAPREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
 
Public Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || ChoenniePublic Safety Mashups to Support Policy Makers || Choennie
Public Safety Mashups to Support Policy Makers || Choennie
 
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
Open Data Day 2016, Km4City, L’universita’ come aggregatore di Open Data del ...
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and Analysis
 
What is "data"?
What is "data"?What is "data"?
What is "data"?
 
What is big data?
What is big data?What is big data?
What is big data?
 
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAPREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
 
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATAPREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
PREDICTING ELECTION OUTCOME FROM SOCIAL MEDIA DATA
 
A framework for real time semantic social media analysis
A framework for real time semantic social media analysis A framework for real time semantic social media analysis
A framework for real time semantic social media analysis
 
Snaa2015 (1)
Snaa2015 (1)Snaa2015 (1)
Snaa2015 (1)
 
Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...
 
Assessment of the usability of Latvia’s open data portal or how close are we ...
Assessment of the usability of Latvia’s open data portal or how close are we ...Assessment of the usability of Latvia’s open data portal or how close are we ...
Assessment of the usability of Latvia’s open data portal or how close are we ...
 
Document(2)
Document(2)Document(2)
Document(2)
 
Evolving social data mining and affective analysis
Evolving social data mining and affective analysis  Evolving social data mining and affective analysis
Evolving social data mining and affective analysis
 
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha...
 
A Primer on Text Mining for Business
A Primer on Text Mining for BusinessA Primer on Text Mining for Business
A Primer on Text Mining for Business
 

Viewers also liked

Viewers also liked (9)

Bi dutch meeting data science
Bi dutch meeting data scienceBi dutch meeting data science
Bi dutch meeting data science
 
Statistiek en grote databestanden
Statistiek en grote databestandenStatistiek en grote databestanden
Statistiek en grote databestanden
 
Data science and the future of statistics
Data science and the future of statisticsData science and the future of statistics
Data science and the future of statistics
 
Big data for the rest of us
Big data for the rest of usBig data for the rest of us
Big data for the rest of us
 
How Financial Services Organizations Use MongoDB
How Financial Services Organizations Use MongoDBHow Financial Services Organizations Use MongoDB
How Financial Services Organizations Use MongoDB
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
NoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsNoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture Patterns
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 

Similar to Big Data presentation for Statistics Canada

Open data for development
Open data for developmentOpen data for development
Open data for development
mlepage
 
La telefonía móvil como fuente de información para el estudio de la movilidad...
La telefonía móvil como fuente de información para el estudio de la movilidad...La telefonía móvil como fuente de información para el estudio de la movilidad...
La telefonía móvil como fuente de información para el estudio de la movilidad...
Esri España
 

Similar to Big Data presentation for Statistics Canada (20)

Big Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in EindhovenBig Data @ CBS for Fontys students in Eindhoven
Big Data @ CBS for Fontys students in Eindhoven
 
Big data experiments
Big data experimentsBig data experiments
Big data experiments
 
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics NetherlandsBig Data, the Future of Statistics: Experiences at Statistics Netherlands
Big Data, the Future of Statistics: Experiences at Statistics Netherlands
 
Big Data presentation Mannheim
Big Data presentation MannheimBig Data presentation Mannheim
Big Data presentation Mannheim
 
Responsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics NetherlandsResponsible Data Science at Statistics Netherlands
Responsible Data Science at Statistics Netherlands
 
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
 
OSFair2017 Workshop | OpenDataMonitor
OSFair2017 Workshop | OpenDataMonitorOSFair2017 Workshop | OpenDataMonitor
OSFair2017 Workshop | OpenDataMonitor
 
ONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational StatisticsONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational Statistics
 
Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
 
Big Data and official statistics with examples of their use
Big Data and official statistics with examples of their useBig Data and official statistics with examples of their use
Big Data and official statistics with examples of their use
 
P. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsP. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European Statistics
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Open data for development
Open data for developmentOpen data for development
Open data for development
 
Social Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research CentreSocial Media Analytics Research at the QUT Digital Media Research Centre
Social Media Analytics Research at the QUT Digital Media Research Centre
 
La telefonía móvil como fuente de información para el estudio de la movilidad...
La telefonía móvil como fuente de información para el estudio de la movilidad...La telefonía móvil como fuente de información para el estudio de la movilidad...
La telefonía móvil como fuente de información para el estudio de la movilidad...
 
SC6 Workshop 1: From your data to data stories - BigDataEurope, SC6 Workshop
SC6 Workshop 1: From your data to data stories - BigDataEurope, SC6 WorkshopSC6 Workshop 1: From your data to data stories - BigDataEurope, SC6 Workshop
SC6 Workshop 1: From your data to data stories - BigDataEurope, SC6 Workshop
 
Odp rwanda-odra-rajiv
Odp rwanda-odra-rajivOdp rwanda-odra-rajiv
Odp rwanda-odra-rajiv
 
Extracting Value from Big Data - Stuart Higgins
Extracting Value from Big Data - Stuart HigginsExtracting Value from Big Data - Stuart Higgins
Extracting Value from Big Data - Stuart Higgins
 
Local Open Data: a perspective from local government in England 2014
Local Open Data: a perspective from local government in England 2014Local Open Data: a perspective from local government in England 2014
Local Open Data: a perspective from local government in England 2014
 
Local Open Data: A perspective from local government in England by Gesche Schmid
Local Open Data: A perspective from local government in England by Gesche SchmidLocal Open Data: A perspective from local government in England by Gesche Schmid
Local Open Data: A perspective from local government in England by Gesche Schmid
 

More from Piet J.H. Daas

More from Piet J.H. Daas (15)

IT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics NetherlandsIT infrastructure for Big Data and Data Science at Statistics Netherlands
IT infrastructure for Big Data and Data Science at Statistics Netherlands
 
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)ESSnet Big Data WP8 Methodology (+ Quality, +IT)
ESSnet Big Data WP8 Methodology (+ Quality, +IT)
 
Use of social media for official statistics
Use of social media for official statisticsUse of social media for official statistics
Use of social media for official statistics
 
CBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONSCBS lecture at the opening of Data Science Campus of ONS
CBS lecture at the opening of Data Science Campus of ONS
 
Ntts2017 presentation 45
Ntts2017 presentation 45Ntts2017 presentation 45
Ntts2017 presentation 45
 
Extracting information from ' messy' social media data
Extracting information from ' messy' social media dataExtracting information from ' messy' social media data
Extracting information from ' messy' social media data
 
Big data cbs_piet_daas
Big data cbs_piet_daasBig data cbs_piet_daas
Big data cbs_piet_daas
 
Gebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiekGebruik van sociale media voor de officiële statistiek
Gebruik van sociale media voor de officiële statistiek
 
Profiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivityProfiling Big Data sources to assess their selectivity
Profiling Big Data sources to assess their selectivity
 
Quality challenges in modernising business statistics
Quality challenges in modernising business statisticsQuality challenges in modernising business statistics
Quality challenges in modernising business statistics
 
Social media sentiment and consumer confidence
Social media sentiment and consumer confidenceSocial media sentiment and consumer confidence
Social media sentiment and consumer confidence
 
Big data @ CBS
Big data @ CBSBig data @ CBS
Big data @ CBS
 
Piet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningenPiet daas big_data_official_statistics_target_groningen
Piet daas big_data_official_statistics_target_groningen
 
Big data en officiële statistiek
Big data en officiële statistiekBig data en officiële statistiek
Big data en officiële statistiek
 
New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.New Data Sources for Statistics, Social media: Twitter.
New Data Sources for Statistics, Social media: Twitter.
 

Recently uploaded

Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
Artificial Intelligence in Philippine Local Governance: Challenges and Opport...Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
CedZabala
 
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our EscortsVIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
sonatiwari757
 
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Recently uploaded (20)

Get Premium Balaji Nagar Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
Get Premium Balaji Nagar Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...Get Premium Balaji Nagar Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
Get Premium Balaji Nagar Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
 
Expressive clarity oral presentation.pptx
Expressive clarity oral presentation.pptxExpressive clarity oral presentation.pptx
Expressive clarity oral presentation.pptx
 
Incident Command System xxxxxxxxxxxxxxxxxxxxxxxxx
Incident Command System xxxxxxxxxxxxxxxxxxxxxxxxxIncident Command System xxxxxxxxxxxxxxxxxxxxxxxxx
Incident Command System xxxxxxxxxxxxxxxxxxxxxxxxx
 
Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
Artificial Intelligence in Philippine Local Governance: Challenges and Opport...Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
Artificial Intelligence in Philippine Local Governance: Challenges and Opport...
 
Top Rated Pune Call Girls Hadapsar ⟟ 6297143586 ⟟ Call Me For Genuine Sex Se...
Top Rated  Pune Call Girls Hadapsar ⟟ 6297143586 ⟟ Call Me For Genuine Sex Se...Top Rated  Pune Call Girls Hadapsar ⟟ 6297143586 ⟟ Call Me For Genuine Sex Se...
Top Rated Pune Call Girls Hadapsar ⟟ 6297143586 ⟟ Call Me For Genuine Sex Se...
 
Top Rated Pune Call Girls Bhosari ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Bhosari ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Bhosari ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Bhosari ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
 
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our EscortsVIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
VIP Call Girl mohali 7001035870 Enjoy Call Girls With Our Escorts
 
WORLD DEVELOPMENT REPORT 2024 - Economic Growth in Middle-Income Countries.
WORLD DEVELOPMENT REPORT 2024 - Economic Growth in Middle-Income Countries.WORLD DEVELOPMENT REPORT 2024 - Economic Growth in Middle-Income Countries.
WORLD DEVELOPMENT REPORT 2024 - Economic Growth in Middle-Income Countries.
 
Human-AI Collaboration for Virtual Capacity in Emergency Operation Centers (E...
Human-AI Collaborationfor Virtual Capacity in Emergency Operation Centers (E...Human-AI Collaborationfor Virtual Capacity in Emergency Operation Centers (E...
Human-AI Collaboration for Virtual Capacity in Emergency Operation Centers (E...
 
Climate change and occupational safety and health.
Climate change and occupational safety and health.Climate change and occupational safety and health.
Climate change and occupational safety and health.
 
Call On 6297143586 Viman Nagar Call Girls In All Pune 24/7 Provide Call With...
Call On 6297143586  Viman Nagar Call Girls In All Pune 24/7 Provide Call With...Call On 6297143586  Viman Nagar Call Girls In All Pune 24/7 Provide Call With...
Call On 6297143586 Viman Nagar Call Girls In All Pune 24/7 Provide Call With...
 
Regional Snapshot Atlanta Aging Trends 2024
Regional Snapshot Atlanta Aging Trends 2024Regional Snapshot Atlanta Aging Trends 2024
Regional Snapshot Atlanta Aging Trends 2024
 
Just Call Vip call girls Wardha Escorts ☎️8617370543 Starting From 5K to 25K ...
Just Call Vip call girls Wardha Escorts ☎️8617370543 Starting From 5K to 25K ...Just Call Vip call girls Wardha Escorts ☎️8617370543 Starting From 5K to 25K ...
Just Call Vip call girls Wardha Escorts ☎️8617370543 Starting From 5K to 25K ...
 
Night 7k to 12k Call Girls Service In Navi Mumbai 👉 BOOK NOW 9833363713 👈 ♀️...
Night 7k to 12k  Call Girls Service In Navi Mumbai 👉 BOOK NOW 9833363713 👈 ♀️...Night 7k to 12k  Call Girls Service In Navi Mumbai 👉 BOOK NOW 9833363713 👈 ♀️...
Night 7k to 12k Call Girls Service In Navi Mumbai 👉 BOOK NOW 9833363713 👈 ♀️...
 
Booking open Available Pune Call Girls Shukrawar Peth 6297143586 Call Hot In...
Booking open Available Pune Call Girls Shukrawar Peth  6297143586 Call Hot In...Booking open Available Pune Call Girls Shukrawar Peth  6297143586 Call Hot In...
Booking open Available Pune Call Girls Shukrawar Peth 6297143586 Call Hot In...
 
Get Premium Budhwar Peth Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
Get Premium Budhwar Peth Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...Get Premium Budhwar Peth Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
Get Premium Budhwar Peth Call Girls (8005736733) 24x7 Rate 15999 with A/c Roo...
 
Top Rated Pune Call Girls Wadgaon Sheri ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated  Pune Call Girls Wadgaon Sheri ⟟ 6297143586 ⟟ Call Me For Genuine S...Top Rated  Pune Call Girls Wadgaon Sheri ⟟ 6297143586 ⟟ Call Me For Genuine S...
Top Rated Pune Call Girls Wadgaon Sheri ⟟ 6297143586 ⟟ Call Me For Genuine S...
 
The Most Attractive Pune Call Girls Handewadi Road 8250192130 Will You Miss T...
The Most Attractive Pune Call Girls Handewadi Road 8250192130 Will You Miss T...The Most Attractive Pune Call Girls Handewadi Road 8250192130 Will You Miss T...
The Most Attractive Pune Call Girls Handewadi Road 8250192130 Will You Miss T...
 
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Bhavnagar 7001035870 Whatsapp Number, 24/07 Booking
 
↑VVIP celebrity ( Pune ) Serampore Call Girls 8250192130 unlimited shot and a...
↑VVIP celebrity ( Pune ) Serampore Call Girls 8250192130 unlimited shot and a...↑VVIP celebrity ( Pune ) Serampore Call Girls 8250192130 unlimited shot and a...
↑VVIP celebrity ( Pune ) Serampore Call Girls 8250192130 unlimited shot and a...
 

Big Data presentation for Statistics Canada

  • 1. Big Data as a Data Source for Official Statistics Dr. Piet J.H. Daas Methodologist, Big Data research coördinator Oct 31, Gatineau Statistics Netherlands Experiences at Statistics Netherlands
  • 2. Overview 2 • Big Data • Research theme at Statistics Netherlands • About the skills needed • Examples of our experiences • Lessons learned (so far)
  • 3. – Data, data everywhere! X
  • 4. Two types of data Primairy data Secondary data Our ‘own’ surveys Data from ‘others’ - Administrative sources - Big Data 4 NSIs
  • 5. Exploratory Big Data studies Lot’s of unknown areas
  • 6. Big Data research at Stat. Neth. – Exploratory, ‘data driven’ ‐ Cases: Road sensor data, Mobile phone data, Social media ‐ There is no Big Data methodology (yet) – Combination of IT, methodology and content (Data Science) – Important issues for official statistics ‐ Selectivity (representativity) ‐ Data editing at large scale ‐ Data reduction (dimension reduction) 6
  • 7. Big Data skills – The new skills needed are: ‐ At the border of IT and methodology • Data Science (team) • High Performance Analytics ‐ Obtained from a number of research areas ‘outside’ statistics • Computational Sciences • Machine learning (Statistical learning) • Ability to extract information from a whole (new) range of sources • Such as texts and pictures 7
  • 8. Hardware: High Performance Computing 8 ‘Hadoop’ 1 chip (CPU) 4-8 cores 1 GPU (graphics card) 500-1000 ‘cores’ Within 1 machine Vertical scaling Use multiple machines Horizontal scaling * - Program more efficiently - Use ’lower’ programming languages * - Distribute data & workload Parallelization is trickier in case of internal dependencies
  • 9. Results of case studies
  • 10. Example 1: Roads sensors Road sensor (traffic loop) data ‐ Each minute (24/7) the number of passing vehicles is counted in around 20.000 ‘loops’ in the Netherlands • Total and in different length classes ‐ Nice data source for transport and traffic statistics (and more) • A lot of data, around 230 million records a day Locations 10
  • 11. Data options – Historical database ‐ Request data via web interface ‐ Minute data for all Highways • 48 variables, around 2.5TB (Jan 2010‐April 2014) • Data at a higher aggregation level is edited – Data stream ‐ FTP‐site ‐ Every minute, all data for all active sensors ‐ Has to be continuously collected 11
  • 12. Road sensor data 12 Large vehicles All vehicles All vehicles All vehicles
  • 13. Minute data of 1 sensor for 196 days 13
  • 15. ‘Afsluitdijk’ (enclosing dike) (2) Cross correlation between sensor pairs -Used to validate metadata Trajectory speed vs. point speed - Average speed is 98 Km/h
  • 16. Process of road sensor based statistics – Historic / streaming data - Select sensors on Dutch highways – Preprocessing - Remove non-informative variables - Remove bad records - Calculate number of vehicles (per minute) - Quality indicators for daily data per sensor – Dimension reduction of daily data - Exclude bad sensors - Reduce dimension from 1440 to 3 (or 4?) on same road and region - Obtain number of vehicles for each road and region – Calculate traffic index ‐ Per region. 16
  • 17. Example 2: Mobile phones Mobile phone activity as a data source – Nearly every person in the Netherlands has a mobile phone ‐ Usually on them and almost always switched on! ‐ Many people are very active during the day – Can data of mobile phones be used for statistics? ‐ Travel behaviour (of active phones) ‐ ‘Day time population’ (of active phones) ‐ Tourism (new phones that register to network) – Data of a single mobile company was used ‐ Hourly aggregates per area (only when > 15 events) ‐ Especially important for roaming data (foreign visitors) 17
  • 18. ‘Day time population’ – Hourly changes of mobile phone activity – 7 & 8 May 2013 – Per area distinguished – Only data for areas with > 15 events per hour 18
  • 19. Tourism: Roaming during European league final Hardly any Low Medium High Very high 19
  • 20. Example 3: Social media – Dutch are very active on social media! ‐ Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan • Steeds meer mensen hebben een smartphone! – Mogelijke informatiebron voor: ‐ Welke onderwerpen zijn actueel: • Aantal berichten en sentiment hierover ‐ Als meetinstrument te gebruiken voor: • . Map by Eric Fischer (via Fast Company) 20
  • 21. 21 Sentiment in Dutch social media – About the data ‐ Dutch firm that continuously collectsALL public social media messages written in Dutch ‐ Dataset of more than 3.5 billion messages! • Covering June 2010 till the present • Between 3-4 million new messages are added per day – About sentiment determination ‐ ‘Bag of words’ approach • List of Dutch words with their associated sentiment • Added social media specific words (‘FAIL’, ‘LOL’, ‘OMG’ etc.) ‐ Use overall score to determine sentiment • Is either positive, negative or neutral ‐ Average sentiment per period (day / week / month) • (#positive - #negative)/#total * 100%
  • 22. Daily, weekly, monthly sentiment 22
  • 24. Table 1. Social media messages properties for various platforms and their correlation with consumer confidence Correlation coefficient of Social media platform Number of social Number of messages as monthly sentiment index and media messages1 percentage of total (%) consumer confidence ( r )2 All platforms combined 3,153,002,327 100 0.75 0.78 Facebook 334,854,088 10.6 0.81* 0.85* Twitter 2,526,481,479 80.1 0.68 0.70 Hyves 45,182,025 1.4 0.50 0.58 News sites 56,027,686 1.8 0.37 0.26 Blogs 48,600,987 1.5 0.25 0.22 Google+ 644,039 0.02 -0.04 -0.09 Linkedin 565,811 0.02 -0.23 -0.25 Youtube 5,661,274 0.2 -0.37 -0.41 Forums 134,98,938 4.3 -0.45 -0.49 1 period covered June 2010 untill November 2013 2 confirmed by visual inspecting scatterplots and additional checks (see text) *cointegrated Platform specific results 24
  • 25. Schematic overview 25 Vorige maand Maand Consumer Confidence Publication date (~20th) Social media sentiment Dag 1-7 Dag 8-14 Dag 15-21 Dag 22-28 Previous month Current month Day 1-7 Day 8-14 Day 15-21 Day 22-28 Sentiment
  • 26. Results of comparing various periods 26 Consumer Confidence Facebook Facebook Facebook + Twitter * Twitter 0.81* 0.84* 0.86* 0.85* 0.87* 0.89* 0.82 0.85 0.87 0.82* 0.85* 0.89* 0.79* 0.82* 0.84* 0.79 0.83 0.84 0.82* 0.86* 0.89* 0.79* 0.83* 0.87* 0.75* 0.80* 0.81* LOOCV results*cointegration
  • 27. Overall findings 27 – Correlation and cointegration ‐ 1st ‘week’ of Consumer confidence usually has 70% response ‐ Best correlation and cointegration with 2nd ‘week’ of the month • Highest correlation 0.93* (all Facebook * specific word filteredTwitter) – Granger causality ‐ Changes in Consumer confidence precede changes in Social media sentiment ‐ For all combinations shown! • Only tried linear models so far – Prediction ‐ Slightly better than random chance ‐ Best for the 4th ‘week’ of month
  • 28. Lessons learned (so far) The most important ones: 1) There are many types of Big Data 2) Know how to access and analyse large amounts of data 3) Find ways to deal with noisy and unstructured data 4) Mind‐set for Big Data ≠ Mind‐set for survey data 5) Need to beyond correlation 6) Need people with the right skills, knowledge and mind‐set 7) Solve privacy and security issues 8) Data management & costs 28 We are getting more and more grip on these topics
  • 30. Thank you for your attention !@pietdaas