Strata Big data presentation

Big data as a source for official statistics. Presentation at Strata Big data in London.

Published in: Education, Technology, Business

Strata Big data presentation

  1. 1. Big Data as a source for Official Statistics Edwin de Jonge and Piet Daas November 12, London
  2. 2. Overview • Big Data • Research ‘theme’ at Stat. Netherlands • Data driven approach • Visualization as a tool • Why? • Examples in our office • Issues & challenges • From an official statistical perspective • Focus on methodological and legal ones
  3. 3. Why Visualization? October 1st 2013, Statistics Netherlands
  4. 4. Effective Display! (see Tor Norretranders, “Band width of our senses”)
  5. 5. Anscombe’s quartet…
        DS1           DS2           DS3           DS4
        x    y        x    y        x    y        x    y
        10   8.04     10   9.14     10   7.46     8    6.58
        8    6.95     8    8.14     8    6.77     8    5.76
        13   7.58     13   8.74     13   12.74    8    7.71
        9    8.81     9    8.77     9    7.11     8    8.84
        11   8.33     11   9.26     11   7.81     8    8.47
        14   9.96     14   8.1      14   8.84     8    7.04
        6    7.24     6    6.13     6    6.08     8    5.25
        4    4.26     4    3.1      4    5.39     19   12.5
        12   10.84    12   9.13     12   8.15     8    5.56
        7    4.82     7    7.26     7    6.42     8    7.91
        5    5.68     5    4.74     5    5.73     8    6.89
  6. 6. Anscombe’s quartet
        Property                                    Value
        Mean of x1, x2, x3, x4                      All equal: 9
        Variance of x1, x2, x3, x4                  All equal: 11
        Mean of y1, y2, y3, y4                      All equal: 7.50
        Variance of y1, y2, y3, y4                  All equal: 4.1
        Correlation for ds1, ds2, ds3, ds4          All equal: 0.816
        Linear regression for ds1, ds2, ds3, ds4    All equal: y = 3.00 + 0.500x
        Looks the same, right?
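
The summary statistics on this slide can be recomputed from the values on the previous slide. The following is a minimal sketch in plain Python; the names `QUARTET`, `summarize` and `SUMMARY` are illustrative and not part of the presentation. Variances are sample variances (divided by n - 1), which is what reproduces the value 11 for x.

```python
from math import sqrt

# Anscombe's quartet, values as listed on the slide.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, correlation and least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": sxx / (n - 1),
        "mean_y": my,
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARY.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

All four data sets print nearly identical numbers, which is the point of the slide: the differences only become visible once the data are plotted.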
  7. 7. Let’s plot!
  8. 8. Assumptions…
  9. 9. Why visualization? Tool for data analysis – Effective display of information – Summary of data – Show outliers / patterns – Helps exploring data – Helps checking assumptions
  10. 10. Often Maps Many visualizations are maps – Positive: ‐ Familiar ‐ Attractive – But: a map only makes sense: ‐ When data is geographically distributed ‐ When locality is meaningful ‐ When data is correctly normalized
  11. 11. Huh, Normalized?
  12. 12. Many maps are just population maps! A better map: ‐ Takes population size into account (e.g. by making figures relative) ‐ May plot the difference w.r.t. an expected value.
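
Both suggestions on this slide (relative figures, and the difference with an expected value) can be sketched in a few lines of Python. The region names, counts and populations below are made up for illustration, and taking the expected value from the overall rate is an assumption, not something the slides specify.

```python
# Hypothetical regions: (observed count, population). Illustrative values only.
REGIONS = {"A": (500, 100_000), "B": (50, 5_000), "C": (200, 50_000)}


def normalize(regions, per=1000):
    """Return the rate per `per` inhabitants and the difference from the expected count."""
    total_count = sum(count for count, _ in regions.values())
    total_pop = sum(pop for _, pop in regions.values())
    overall_rate = total_count / total_pop  # assumption: expected value = overall rate
    result = {}
    for name, (count, pop) in regions.items():
        expected = overall_rate * pop
        result[name] = {
            "rate": per * count / pop,
            "expected": expected,
            "difference": count - expected,
        }
    return result


NORMALIZED = normalize(REGIONS)
for name, row in NORMALIZED.items():
    print(name, {k: round(v, 1) for k, v in row.items()})
```

In this toy data region A has the highest raw count simply because it has the most inhabitants, while region B has the highest rate. A map of raw counts would highlight A; a normalized map would highlight B.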
  13. 13. Visualization is not easy – Creating good visualizations is hard – “Easy Reading” is not “Easy Writing” Visualization must be: – Faithful – Objective Thus it must not introduce perceptual bias
  14. 14. Visualization – Use appropriate chart – Use appropriate scales ‐ x, y, color, time – Use appropriate granularity Research: What works for which data?
  15. 15. Example: Census
  16. 16. Example Virtual Census ‐ Every 10 years a Census needs to be conducted ‐ No longer with surveys in the Netherlands • Last traditional census was in 1971 ‐ Now by (re‐)using existing information • Linking administrative sources and available sample survey data at a large scale • Check result • How? • With a visualisation method: the Tableplot
  17. 17. Making the Tableplot 1. Load file (17 million records) 2. Sort records according to key variable (17 million records) • Age in this example 3. Combine records into 100 groups (170,000 records each) • Numeric variables: calculate average (avg. age) • Categorical variables: ratio between categories present (male vs. female) 4. Plot figure of a select number of variables (up to 12) • Colours used are important
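
Steps 2 and 3 above (sort, then combine records into 100 groups) can be sketched as follows. This is a minimal illustration in plain Python on synthetic records, not the software used to produce the tableplot on the next slide; `tableplot_bins`, `RECORDS` and the field names are hypothetical.

```python
import random
from collections import Counter


def tableplot_bins(records, key, numeric, categorical, n_bins=100):
    """Sort records on `key`, split them into n_bins groups, and summarize each group."""
    ordered = sorted(records, key=lambda r: r[key])
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue
        row = {"n": len(chunk)}
        for var in numeric:
            # Numeric variables: average per group.
            row[var] = sum(r[var] for r in chunk) / len(chunk)
        for var in categorical:
            # Categorical variables: share of each category present in the group.
            counts = Counter(r[var] for r in chunk)
            row[var] = {cat: c / len(chunk) for cat, c in counts.items()}
        bins.append(row)
    return bins


# Synthetic stand-in for the census file (10,000 records instead of 17 million).
random.seed(1)
RECORDS = [
    {"age": random.randint(0, 99), "sex": random.choice(["male", "female"])}
    for _ in range(10_000)
]
BINS = tableplot_bins(RECORDS, key="age", numeric=["age"], categorical=["sex"])
print(BINS[0])
print(BINS[-1])
```

Each of the 100 rows would then be drawn as one horizontal band: a bar for the numeric average and a stacked, coloured bar for the category shares.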
  18. 18. Tableplot of the census test file
  19. 19. Tableplot: Monitor data quality – All data in the office passes through stages: ‐ Raw data (collected) ‐ Preprocessed (technically correct) ‐ Edited (completed data) ‐ Final (removal of outliers etc.)
  20. 20. Processing of data (figure panels: Raw (unedited) data; Edited data; Final data)
  21. 21. Example 2: Social Security Register
  22. 22. – Contains all financial data on jobs, benefits and pensions in the Netherlands ‐ Collected by the Dutch Tax office ‐ A total of 20 million records each month ‐ How to obtain insight into so much data? • With a visualisation method: a heat map
  23. 23. Heat map: Age vs. ‘Income’ (axes: Age, Income (euro))
  24. 24. After ‘data reduction’ (axes: age, amount)
  25. 25. Visualization helps with volume of data – Summarize by “binning” – Tableplot – Histogram – Heatmap (2D histogram) – Smoothing? – Detect unexpected patterns We use it as a tool to check, explore and communicate data
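
A heat map in the sense used here is a 2D histogram: every record is assigned to a cell, and only the count per cell is kept. The sketch below shows that binning step in plain Python on synthetic (age, income) pairs; the cell widths, value ranges and the name `bin_2d` are assumptions made for illustration.

```python
import random
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count points per (x, y) cell; cells are x_width by y_width wide."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Synthetic (age, income) pairs standing in for the register records.
random.seed(2)
POINTS = [(random.uniform(15, 80), random.uniform(0, 100_000)) for _ in range(50_000)]
CELLS = bin_2d(POINTS, x_width=5, y_width=10_000)

print(len(POINTS), "records reduced to", len(CELLS), "cells")
```

However many records come in, the number of cells is bounded by the chosen grid, so the plot stays readable and cheap to draw.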
  26. 26. Big Data: Issues and challenges
  27. 27. Big Data: issues & challenges During our exploratory studies we identified a number of issues & challenges. Focussing on the methodological and legal ones, we found that there is a need to: 1) deal with noisy and dirty data 2) deal with selectivity 3) go beyond correlation 4) cope with privacy and security issues We have only solved some of them (partially)
  28. 28. 1) Deal with noisy and dirty data – Big Data is often ‐ noisy ‐ dirty ‐ redundant ‐ unstructured • e.g. texts, images – How to extract information from Big data? ‐ In the best/most efficient way
  29. 29. Noisy and dirty data – Social media sentiment – Traffic loop data Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, models (capture structure), ‘Google approach’ (80/20 rule) Preferably do NOT use samples!
  30. 30. Noise reduction Social media: daily sentiment in Dutch messages
  31. 31. Noise reduction Social media: daily & weekly sentiment in Dutch messages
  32. 32. Noise reduction Social media: daily, weekly & monthly sentiment in Dutch messages
  33. 33. Noise reduction Social media: monthly sentiment in Dutch messages
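
The effect shown on these four slides, a noisy daily series that becomes smoother when aggregated to weeks and months, can be illustrated with synthetic data. The sketch below is not the authors' sentiment data or method; the series is random noise around a slow trend, and a "month" is approximated by a 28-day block.

```python
import random
from statistics import mean, pstdev


def block_means(values, size):
    """Average consecutive, non-overlapping blocks of `size` values."""
    return [mean(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Synthetic daily "sentiment": a slow trend plus strong day-to-day noise.
random.seed(3)
DAILY = [0.1 * (d / 364) + random.gauss(0, 1.0) for d in range(364)]
WEEKLY = block_means(DAILY, 7)
MONTHLY = block_means(DAILY, 28)  # assumption: a month is taken as 28 days

for label, series in (("daily", DAILY), ("weekly", WEEKLY), ("monthly", MONTHLY)):
    print(label, len(series), "points, spread", round(pstdev(series), 2))
```

Averaging over longer periods shrinks the spread caused by noise, at the cost of fewer data points and slower detection of real changes.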
  34. 34. Social media sentiment & Consumer confidence Social media: monthly sentiment in Dutch messages & Consumer confidence Corr: 0.88
  35. 35. Dirty data Total number of vehicles detected by traffic loops during the day (x-axis: Time (hour))
  36. 36. Loop activity varies during the day (first 10 min)
  37. 37. Correct for dirty data Use data from same location from previous/next minute (5 min. window) Before: Total = ~295 million vehicles After: Total = ~330 million vehicles (+12%)
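
A correction of this kind can be sketched as follows: a missing minute for a loop is filled from the available values of the same loop within a 5-minute window (two minutes before and after). The slide does not say how the neighbouring values are combined, so the use of their mean here is an assumption, and `fill_gaps` is a hypothetical helper name.

```python
def fill_gaps(counts, half_window=2):
    """Fill missing per-minute counts (None) for one loop from nearby minutes.

    A window of 2 * half_window + 1 minutes is used (5 minutes by default).
    Assumption: a gap is replaced by the mean of the available neighbours.
    """
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is not None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        neighbours = [v for v in counts[lo:hi] if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


print(fill_gaps([10, None, 14]))
print(fill_gaps([10, None, None, None, None, None, 20]))
```

Gaps longer than the window stay empty in the middle, so long outages are not papered over with values from far away.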
  38. 38. 2) Deal with selectivity – Big data sources are selective (they do NOT cover the entire population considered) ‐ Some probably more than others – AND: all Big Data sources studied so far contain events! ‐ E.g. social media messages created, calls made and vehicles detected ‐ Events are probably the reason why these sources are so Big – When there is a need to correct for selectivity 1) Convert events to units: how to identify units? 2) Correct for selectivity of units included: how to cope with units that are truly absent and part of the population under study?
  39. 39. Units / events – Big Data contains events ‐ Social media messages are generated by usernames ‐ Traffic loops count vehicles (Dutch roads are units) ‐ Call detail records of mobile phone IDs ‐ Convert events to units • By profiling
  40. 40. Profiling of Big data
  41. 41. Travel behaviour of active mobile phones Mobility of very active mobile phone users - during a 14-day period Based on: - Call- and text-activity multiple times a day - Location based on phone masts Clearly selective: - North and South-west of the country hardly included
  42. 42. 3) Go beyond correlation – You will very likely use correlation to check Big Data findings with those in other (survey) data – When correlation is high: 1) Try falsifying it first (is it coincidental/spurious?): correlation ≠ causation 2) If this fails, you may have found something interesting! 3) Perform additional analysis (look for causality): cointegration, structural time‐series approach – Use common sense!
  43. 43. An illustrative example Official unemployment percentage vs. number of social media messages including the word “unemployment” Corr: 0.90?
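
One simple way to try to falsify a high correlation between two time series is to check whether it survives after removing a shared trend. The sketch below uses two synthetic, unrelated series and compares the correlation of the levels with the correlation of the period-to-period changes. First differencing is used here only as a basic illustration; it is not the cointegration or structural time-series analysis named on the previous slide, and none of the numbers come from the presentation.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def diff(values):
    """Period-to-period changes."""
    return [b - a for a, b in zip(values, values[1:])]


# Two independent synthetic series that merely share an upward trend.
random.seed(4)
N = 200
A = [0.5 * t + random.gauss(0, 3) for t in range(N)]
B = [0.3 * t + random.gauss(0, 3) for t in range(N)]

LEVELS = pearson(A, B)
CHANGES = pearson(diff(A), diff(B))
print("correlation of levels: ", round(LEVELS, 2))
print("correlation of changes:", round(CHANGES, 2))
```

The levels correlate strongly although the series are unrelated by construction, while the changes do not. A correlation that disappears this way is probably driven by the common trend and deserves the question mark on the slide.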
  44. 44. 4) Privacy and security issues – The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research – Still, appropriate measures need to be taken • Prior to new research studies, check privacy sensitivity of data • In case of privacy sensitive data: • Try to anonymize micro data or use aggregates • Use a secure environment – Legal issues that enable the use of Big Data for official statistics production are currently being looked at ‐ No problems for Big Data that can be considered ‘Administrative data’: i.e. Big Data that is managed by a (semi‐)governmentally funded organisation
  45. 45. Conclusions – Big data is a very interesting data source ‐ Also for official statistics – Visualisation is a great way of getting/creating insight ‐ Not only for data exploration – A number of fundamental issues need to be resolved ‐ Methodological ‐ Legal ‐ Technical (not discussed here) – We expect great things in the near future!
  46. 46. The future of statistics?
