Opportunities and methodological challenges of Big Data for official statistics


Presentation for the Eurostat-organized Big Data event in Rome (31 March - 1 April)


Presentation Transcript

  • Opportunities and methodological challenges of Big Data for official statistics
    Dr. Piet J.H. Daas, Methodologist, Big Data research coordinator
    Rome, 31 March
  • Overview
    - Big Data
      - Definition?
      - DGINS: Scheveningen Memorandum
    - Experiences at Statistics Netherlands
      - From ‘New data sources’ to ‘Big Data’
      - Data-driven approach (learning by doing)
    - Opportunities & challenges
      - Methodological & technical challenges
      - Skills, legal and other issues
    - With examples!
  • Data, data everywhere!
  • What is Big Data? Defining Big Data is not easy.
    An attempt: “Data that are difficult to collect, store or process within the conventional systems of statistical organizations. Either their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.” (Virtual sprint paper)
    More technical: “Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.” (Wikipedia)
    A user: “Data sources that are awkward to work with.”
    TIP: Big Data sources are NOT surveys and NOT administrative data
  • DGINS: Scheveningen Memorandum
    1. Big Data represent new opportunities and challenges for official statistics.
    2. Develop an ‘Official Statistics Big Data strategy’ at national and EU level.
    3. Recognize the implications of Big Data for legislation, especially with regard to data protection and personal rights.
    4. Several NSIs are currently initiating or considering different uses of Big Data: momentum to share experiences and to collaborate.
    5. Recognize the capabilities and skills necessary to effectively explore Big Data.
    6. Acknowledge that the multidisciplinary character requires synergies and partnerships.
    7. The use of Big Data in the context of official statistics requires new developments in methodology, quality assessment and IT-related issues.
    8. Agree on adopting an ESS action plan and roadmap by mid-2014.
  • Experiences at Statistics Netherlands
    - Started as ‘New data sources for statistics’ in 2009
    - Several initiatives over the years:
      - Internet as a data source
        - Collecting price data with web robots
        - Studying the use of web job vacancy data
        - ‘Marktplaats’ data (Dutch eBay clone)
      - Alternative means of collecting primary data
        - Use of smartphones
      - Big Data (really large amounts of data)
        - Traffic loop detection data (road sensors)
        - Mobile phone data (location data)
        - Social media data (content and sentiment)
  • Opportunities & challenges
  • What have we learned (so far)? I’ll discuss the most important ones:
    1) Types of ‘data’ in Big Data
    2) How to access and analyse large amounts of data
    3) How to deal with noisy and unstructured data
    4) How to deal with selectivity (and our own bias)
    5) How to go beyond correlation
    6) The need for people with the right skills and mind-set
    7) The need to solve/deal with privacy and security issues
    8) Data management & costs
    We are slowly starting to get a grip on some of these topics.
  • 1) Types of data (diagram: primary versus secondary data)
  • 1) Types of data & events. There are many different Big Data sources. An attempt to classify them (Virtual sprint paper):
    A) Human-sourced information (‘Social Networks’): social media messages, blogs, web searches
    B) Process-mediated data (‘Traditional Business Systems and Websites’): credit card, bank or online transactions, CDRs, product prices, page views
    C) Machine-generated data (‘Automated Systems’): road or climate sensors, satellite images, GPS, AIS
    Essentially, most of the data are event-based, of which some can be directly related to a user (e.g. in the target population).
  • 2) How to access and analyse large amounts of data
    - If you want to analyse Big Data, you need a lot of computer power!!
    - Or you need a lot of time!
    High Performance Computing expertise is essential!
  • Our current equipment and more. We have:
    - Workstations with lots of memory (32-64 GB), fast disk drives (SSD, 512 GB) and a large hard drive (>= 1 TB)
    - A secure environment in which to access the data with those computers
    - A Big Data lab
    - The knowledge to load and analyse all the data in R or Python
    - Followed a High Performance Computing training course
    - Realized that learning by doing is key! (?databases?)
    AND a Big Data source with no privacy and security issues, so we can test all kinds of analyses, soft- and hardware (anyplace, anytime, anywhere):
    - Traffic loop data (road sensors)
  • An example: processing the traffic loop data of 1 day, a total of ~100 million records (25 GB).

    Processing in R                              Time needed   Speed-up
    First R script                               6 hours       -
    Improved code                                30 min        12x
    Faster hardware                              10 min        36x
    Faster hardware (Java code)
      + preprocessed data                        2 min         180x (limited by I/O)

    The I/O limitation can be solved by:
    1) Input part: using a cluster (distributed computing)
    2) Output part: implementing a C++ write routine in R (20% faster)
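
    To give an idea of the kind of processing involved, a minimal sketch in R follows: reading one day of loop records with the fast reader from the data.table package and aggregating per loop and per hour. The file name and column layout are hypothetical; the actual loop files and the scripts used at Statistics Netherlands differ.

        library(data.table)

        # Hypothetical layout: one day of minute-level loop counts,
        # ~100 million rows with columns loop_id, timestamp, vehicle_count.
        dt <- fread("loopdata_day.csv")

        # Total vehicles per loop over the whole day
        per_loop <- dt[, .(vehicles = sum(vehicle_count)), by = loop_id]

        # Total vehicles per hour over all loops
        # (timestamp assumed to look like "2014-03-31 14:05")
        dt[, hour := substr(timestamp, 12, 13)]
        per_hour <- dt[, .(vehicles = sum(vehicle_count)), by = hour][order(hour)]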
  • All Dutch vehicles in September (visualization of the traffic loop data)
  • 3) How to deal with noisy and unstructured data
    - Big Data is often:
      - noisy, dirty
      - redundant
      - unstructured (e.g. texts, images)
    - How to extract information from Big Data in the best/most efficient way?
  • Example of noisy data: road sensors. Traffic loop data:
    - Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands, in total and in different length classes
    - A nice data source for transport and traffic statistics (and more)
    - A lot of data: around 100 million records a day
    (map of sensor locations)
  • Total number of vehicles during the day (chart: vehicle counts by time of day, in hours)
  • Correct for missing data: macro level
    - Impute missing data with a sliding window of 5 minutes (a sketch follows below)
    - Before: total = ~295 million vehicles
    - After: total = ~330 million vehicles (+12%)
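
    One way such a sliding-window imputation can be set up is sketched below: each missing minute is replaced by the mean of the surrounding 5-minute window. This is only an illustration of the idea on toy data; the exact imputation rule used on the loop data may differ.

        library(zoo)

        # counts: vehicles per minute for one loop, NA where the sensor
        # failed to report (toy data for illustration)
        counts <- c(10, 12, NA, 11, 13, NA, NA, 14, 12, 11)

        # Mean of a centred 5-minute window, ignoring NAs inside the window
        window_mean <- rollapply(counts, width = 5, FUN = mean, na.rm = TRUE,
                                 partial = TRUE, align = "center")

        # Keep observed minutes, fill the missing ones from the window mean
        imputed <- ifelse(is.na(counts), round(window_mean), counts)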
  • Correct for missing data: micro level (chart: number of vehicles detected vs. time in minutes). A recursive Bayesian estimator runs in < 1 sec on a GPGPU.
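
    As a rough illustration of a recursive Bayesian estimator on such count series, the sketch below runs a Gamma-Poisson filter with exponential forgetting over minute counts; missing minutes are bridged by the evolving rate estimate. This is an assumed, simplified stand-in, not the estimator actually used on the loop data.

        # Gamma-Poisson filter: the rate lambda has a Gamma(a, b) posterior
        # that is discounted each minute and updated with observed counts.
        recursive_poisson_filter <- function(y, rho = 0.95, a0 = 1, b0 = 1) {
          a <- a0; b <- b0
          lambda_hat <- numeric(length(y))
          for (t in seq_along(y)) {
            a <- rho * a; b <- rho * b      # forget old information
            if (!is.na(y[t])) {             # Poisson update when a count arrived
              a <- a + y[t]; b <- b + 1
            }
            lambda_hat[t] <- a / b          # posterior mean of the rate
          }
          lambda_hat
        }

        # Missing minutes (NA) are bridged by the evolving rate estimate
        est <- recursive_poisson_filter(c(10, 12, NA, 11, 13, NA, NA, 14))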
  • 4) How to deal with selectivity
    - Big Data sources may be selective when:
      - only part of the population contributes to the data set (for example: mobile phone owners)
      - the measurement mechanism is selective, e.g. non-random times or places (for example: the placing of road sensors on Dutch highways is not random)
    - Many Big Data sources contain events
      - Population units may generate widely varying numbers of events
      - Attempt to associate events with units
    - Correcting for selectivity (see the sketch below)
      - Background characteristics, or features, are needed (linking with registers; profiling)
      - Use predictive modelling / machine learning to produce population estimates
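
    A minimal sketch of the model-based route: fit a model on the units observed in the Big Data source, using register-linked background features, and predict for the full register population. All variable names and data here are hypothetical.

        # Units observed in the big data source, linked to register features
        observed <- data.frame(age   = c(23, 35, 47, 52, 61, 29),
                               urban = c(1, 1, 0, 1, 0, 1),
                               y     = c(5.1, 4.2, 3.7, 3.9, 3.1, 4.8))

        # The full register population (features only, no target variable)
        set.seed(42)
        register <- data.frame(age   = round(runif(1000, 18, 80)),
                               urban = rbinom(1000, 1, 0.6))

        fit <- lm(y ~ age + urban, data = observed)   # could be any learner
        pop_estimate <- mean(predict(fit, newdata = register))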
  • Profiling: social media (illustration)
  • Selectivity illustrated. Selectivity of Big Data could potentially be less problematic than the high non-response rates of surveys: there is just more data for your model! The black line shows the relationship between the target and auxiliary variable in the target population. The red lines show the estimated relationship according to each of the three sources (with 95% confidence intervals). Here we assume units with auxiliary variables are available!
  • 5) How to go beyond correlation
    - You will very likely use correlation to check Big Data findings against those in other (survey) data
    - When correlation is high:
      1) try falsifying it first (is it coincidental?); correlation ≠ causation
      2) if this fails, you may have found something interesting!
      3) perform additional analysis and look for causality: cointegration, Granger causality, time-series approaches, etc.
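
    A minimal sketch of such a follow-up test, using the grangertest function from the lmtest package; the two monthly series here are simulated stand-ins, whereas the real analysis was done on the sentiment and consumer confidence series.

        library(lmtest)

        # Simulated stand-ins: sentiment follows confidence with a lag of 1
        set.seed(1)
        confidence <- as.numeric(arima.sim(list(ar = 0.7), n = 40))
        sentiment  <- c(0, head(confidence, -1)) + rnorm(40, sd = 0.3)

        # H0: lagged confidence adds no predictive power for sentiment;
        # a small p-value suggests confidence "Granger-causes" sentiment
        grangertest(sentiment ~ confidence, order = 1)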
  • Example: sentiment in social media (charts: sentiment per day, week and month)
  • Platform-specific sentiment (chart)
  • Platform-specific results

    Table 1. Social media message properties for various platforms and their correlation with consumer confidence

    Social media platform     Number of messages¹   % of total   Correlation of monthly sentiment index
                                                                 with consumer confidence (r)²
    All platforms combined        3,153,002,327     100           0.75    0.78
    Facebook                        334,854,088      10.6         0.81*   0.85*
    Twitter                       2,526,481,479      80.1         0.68    0.70
    Hyves                            45,182,025       1.4         0.50    0.58
    News sites                       56,027,686       1.8         0.37    0.26
    Blogs                            48,600,987       1.5         0.25    0.22
    Google+                             644,039       0.02       -0.04   -0.09
    LinkedIn                            565,811       0.02       -0.23   -0.25
    YouTube                           5,661,274       0.2        -0.37   -0.41
    Forums                           134,98,938       4.3        -0.45   -0.49

    ¹ Period covered: June 2010 until November 2013
    ² Confirmed by visually inspecting scatterplots and additional checks (see text)
    * Cointegrated

    Granger causality reveals that consumer confidence precedes Facebook sentiment! (p-value < 0.001)
  • A schematic view (timeline: consumer confidence for a month is published around the 20th; social media sentiment is available per week, for days 1-7, 8-14, 15-21 and 22-28 of the previous and current month)
  • Platform-specific results (2). More detailed studies revealed a one-week delay between the two: consumer confidence comes first, social media sentiment follows (see Table 1 above).
  • 6) People and skills needed. For Big Data studies you need:
    - People with an open mind-set who do not see all problems a priori in terms of sampling theory
    - People with programming skills and IT affinity
    - People with a data-driven, pragmatic attitude (data explorers, ‘practitioners’)
    You need data scientists!
  • Data science skills ‘landscape’: sexy skills of data geeks
    1) Statistics: the traditional analysis you’re used to thinking about
    2) Data ‘munging’: parsing, scraping, and formatting data
    3) Visualization: graphs, tools, etc.
    4) High Performance Computing knowledge
  • People that think outside the ‘box’
  • 7) Privacy and security issues
    - Dutch privacy and security law allows the study of privacy-sensitive data for scientific and statistical research
    - Of course, appropriate measures always need to be taken:
      - prior to new research studies, check the privacy sensitivity of the data
      - in case of privacy-sensitive data: try to anonymize the micro data or use aggregates, and use a secure environment (workstations in the Big Data lab)
    - Legal issues that would enable the use of Big Data for official statistics production are currently being looked at
      - There is Big Data that can be considered ‘administrative data’, i.e. Big Data that is managed by a (semi-)governmentally funded organisation
  • Example: mobile phones. Mobile phone activity as a data source:
    - Nearly every person in the Netherlands has a mobile phone
      - usually on them and almost always switched on!
      - many people are very active during the day
    - Can data of mobile phones be used for statistics?
      - travel behaviour (of active phones)
      - ‘day time population’ (of active phones)
      - tourism (new phones that register to the network)
    - Data of a single mobile phone company was used
      - hourly aggregates per area (only when > 15 events), as sketched below
      - especially important for roaming data (foreign visitors)
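
    A minimal sketch of that kind of aggregation rule: hourly event counts per area are computed and cells with 15 or fewer events are suppressed before anything leaves the secure environment. Column names and data are hypothetical.

        library(data.table)

        # Hypothetical call detail records: one row per event
        set.seed(7)
        events <- data.table(area = sample(c("A", "B", "C"), 5000, replace = TRUE),
                             hour = sample(0:23, 5000, replace = TRUE))

        # Hourly aggregates per area
        hourly <- events[, .(n = .N), by = .(area, hour)]

        # Suppress small cells: keep a count only when there are > 15 events
        hourly[n <= 15, n := NA_integer_]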
  • ‘Day time population’
    - Hourly changes of mobile phone activity, 7 & 8 May 2013
    - Distinguished per area
    - Only data for areas with > 15 events per hour
  • Tourism: roaming during the European league final (map; legend: hardly any, low, medium, high, very high)
  • 8) Costs and data management
    - Costs
      - In the Netherlands we don’t pay for administrative data. How about Big Data?
      - We currently pay for social media data (access) and mobile phone data (extra processing efforts)
    - Data management
      - Who owns the data? Stability of delivery/source
      - Coping with the huge volume: run queries in the database of the data source holder, collect and process it as a data stream, or bulk processing
  • The Future. The future of statistics looks BIG
  • Thank you for your attention! @pietdaas