Data science and the future of statistics

The elusive 'Data Scientist' is a term that pops up more and more. Is this a buzzword, or is something really changing in the world? Piet Daas of CBS (Statistics Netherlands) takes us on a tour of the changes he sees around him.

Published in: Education

  1. 1. Data Science ‘and the future of statistics’. Piet Daas (and many colleagues*), Statistics Netherlands / Centraal Bureau voor de Statistiek. *Martijn Tennekes, Edwin de Jonge, Alex Priem, Bart Buelens, Merijn van Pelt, Paul van den Hurk. Data Science NL, 8 Nov., Utrecht
  2. 2. Layout
     • Introduction
     • What is Data Science?
       • You need data to be one
     • Data Scientist skills
       • A sexy job with a paradigm shift
     • Link with Statistics Netherlands work
       • Examples of recent developments
  3. 3. Introduction: “Statistics Netherlands will produce about 5000 official publications and tables in 2012.” For this we need DATA.
  4. 4. Two types of data
     • Primary data: our own surveys
     • Secondary data: data from ‘others’
       - Administrative sources
       - ‘New’ data sources
  5. 5. Data, data everywhere!
  6. 6. Statistics & Data science
     1) Is the study of ‘the use of secondary data for statistics’ data science?
     2) What is data science?
  7. 7. What is Data Science?
     • First used in 1974 by Danish computer scientist Peter Naur in the book “Concise Survey of Computer Methods”
     • Defined as: “The science of dealing with data, once they have been established”
     Established data is data that has been created. If that was done by someone else, then it is secondary data!
  8. 8. Data scientist / statistician is “the sexiest job of the 21st Century”: people able to derive knowledge from large amounts of data!
  9. 9. Data science skills ‘landscape’ [figure; axis label: Programming skills]
     Sexy Skills of Data Geeks
     1) Statistics - traditional analysis you're used to thinking about
     2) Data ‘munging’ - parsing, scraping, and formatting data
     3) Visualization - graphs, tools, etc.
  12. 12. Are things changing at the office?
  13. 13. Statistics Netherlands law
      • “Statistics Netherlands aims to reduce the administrative burden for companies and the public as much as possible”
        • By (re-)using existing administrative registrations of both government and government-funded organizations
        • And by studying potential new sources of information
  14. 14. Statistics Netherlands and Data
      • Data is generated in increasing amounts and at increasing frequencies:
        • From ‘data scarcity’ (sample surveys) to ‘data abundance’ (administrative & Big Data)
        • Ever-increasing amounts of data need to be checked, processed and analyzed
        • More sources of information become available
        • Opportunities to produce statistics faster (‘real-time statistics’)
      • Need for new methods and tools:
        1. Methods to quickly uncover information from the massive amounts of data available, such as visualisation methods, data-, text- and stream-mining techniques (‘making Big Data small’), and high-performance computing
        2. Methods capable of integrating the information into the statistical process, e.g. linking at massive scale, macro/meso-integration, and estimation methods suited for large datasets
  15. 15. Examples of new developments
      1) New approaches to official statistical inference
         a. Algorithmic inference
      2) Visualisation methods to quickly obtain insight into large datasets
         b. Virtual Census (17 million records)
         c. Social Security Register (20 million records)
      3) Research findings on the use of ‘new’ data sources
         d. Traffic loop data (80 million records)
         e. Mobile phone data (~500 million records)
         f. Social media (12 million - 1 billion records)
  16. 16. Example a. Statistical inference
      • Inference is traditionally motivated from a design-based sampling perspective
      • The model-based approach is gradually being adopted in specific circumstances (e.g. administrative data)
      • Next step: algorithmic inference methods
        • Machine learning, data mining approaches
  17. 17. Simulation results (1000x): shifting paradigms [figure; panels: Design, Model, Neural., DisTree]
  18. 18. Example b. Virtual Census
      • Every 10 years a Census needs to be conducted
      • No longer with surveys in the Netherlands
        • Last traditional census was in 1971
      • Now by (re-)using existing information
        • Linking administrative sources and available sample survey data at a large scale
      • Check the result
        • How? With a visualisation method: the tableplot
  19. 19. Making the Tableplot
      1. Load file (17 million records)
      2. Sort records according to a key variable (17 million records)
         • Age in this example
      3. Combine records into 100 groups (170,000 records each)
         • Numeric variables: calculate the average (avg. age)
         • Categorical variables: ratio between the categories present (male vs. female)
      4. Plot a figure of a select number of variables (up to 12)
         • Colours used are important
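The sort-and-bin steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used at Statistics Netherlands: the function name `tableplot_bins`, the dictionary-based record layout and the toy data are all hypothetical, and the plotting step is left out.

```python
from collections import Counter
from statistics import mean


def tableplot_bins(records, sort_key, numeric_vars, categorical_vars, n_bins=100):
    """Sort records on sort_key, split them into n_bins groups of
    near-equal size, and summarise every variable per group.
    (Hypothetical helper, for illustration only.)"""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        # Integer arithmetic keeps group sizes within one record of each other.
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        group = ordered[start:stop]
        if not group:
            continue  # fewer records than bins
        summary = {"n": len(group)}
        for var in numeric_vars:
            # Numeric variables: average per group.
            summary[var] = mean(r[var] for r in group)
        for var in categorical_vars:
            # Categorical variables: share of each category present in the group.
            counts = Counter(r[var] for r in group)
            summary[var] = {cat: c / len(group) for cat, c in counts.items()}
        bins.append(summary)
    return bins


# Toy data (made up): 1,000 records instead of 17 million.
demo_records = [{"age": a, "sex": "male" if a % 2 else "female"} for a in range(1000)]
demo_bins = tableplot_bins(demo_records, "age", ["age"], ["sex"], n_bins=100)
```

Each entry of `demo_bins` corresponds to one row of the plot; drawing one column per variable from these summaries would give the figure.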
  20. 20. [Figure: tableplot of the census test file]
  21. 21. Processing of data [figure; panels: raw (unedited) data, edited data, final data]
  22. 22. Example c: Social Security Register
      • Contains all financial data on jobs, benefits and pensions in the Netherlands
        • Collected by the Dutch Tax office
        • A total of 20 million records each month
      • How to obtain insight into so much data?
        • With a visualisation method: a heat map
  23. 23. Heat map: Age vs. ‘Income’ [figure; axes: age, income (euro)]
  24. 24. A 3D heat map: Age vs. Income vs. Amount, after ‘data reduction’ [figure; axes: age, amount]
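One common way to do the kind of ‘data reduction’ behind such a heat map is to count records per cell of a grid, so that millions of records shrink to a small table of counts. The sketch below assumes that approach; the slides do not say how the reduction was actually done, and the function name `heatmap_counts`, the bin widths and the sample values are all made up for illustration.

```python
from collections import Counter


def heatmap_counts(pairs, x_width, y_width):
    """Count (x, y) pairs per grid cell.

    Returns a Counter keyed by (x_bin, y_bin); the count plays the role of
    the 'amount' dimension. (Hypothetical helper, for illustration only.)"""
    counts = Counter()
    for x, y in pairs:
        # Floor division maps each value to the index of its grid cell.
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Toy (age, income) records, invented for this example.
demo_pairs = [(23, 1200), (24, 1900), (27, 2500), (61, 900)]
demo_counts = heatmap_counts(demo_pairs, x_width=5, y_width=1000)
```

With 5-year age bins and 1000-euro income bins, the first two records land in the same cell, so four records reduce to three cells. The number of cells depends on the grid, not on the number of records.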
  25. 25. Example d: Traffic loop detection data
      • Traffic ‘loops’
        • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors and cameras in the Netherlands
        • Total vehicles and vehicles in different length classes
      • Interesting source for producing traffic and transport statistics (and more)
        • Huge amounts of data: about 80 million records a day
      [map: locations]
  26. 26. Number of detected vehicles on a single day: total = ~295 million
  27. 27. Traffic loop detection activity (only first 10 min.)
  28. 28. Number of detected vehicles on a single day (12% added)
  29. 29. Total vehicles during the day (snapshots)
  30. 30. Small, medium & large vehicles
  31. 31. Volatile behaviour at the micro-level
  32. 32. Docks in Rotterdam (51.941, 4.02836)
  33. 33. Example e: Mobile phone data
      • Nearly every person in the Netherlands has a mobile phone
        • Carried with them and almost always switched on!
        • An increasing number of people have a smartphone
      • Ideal source of information:
        • Using the data of mobile phone companies:
          • Travel behaviour (‘daytime’ population)
          • Tourism (new phones that register to the network)
          • Crowd information (for example during events)
        • But also as a data collection instrument:
          • Questionnaires (with an app, text messaging or browser)
          • Taking pictures of products, cash receipts and barcodes
          • Determining the exact GPS location
          • Etc.
  34. 34. Travel behaviour of mobile phones
      Mobility of very active mobile phone users
      - During a 14-day period
      - Data of a single mobile phone company
      Based on:
      - Call and text activity multiple times a day
      - Location based on phone masts
      Clearly selective:
      - Includes major cities
      - But the North and South-east of the country much less
  35. 35. Example f: Social media
      • The Dutch are very active on social media platforms
        • Almost always carried with them and nearly always switched on
        • More and more people have a smartphone!
      • Possible source of information for:
        • Which topics are current:
          • Number of messages and the sentiment about them
        • Usable as a measuring instrument for:
          • .
      Map by Eric Fischer (via Fast Company)
  36. 36. Social media: Dutch messages
      • The Dutch are very active on social media platforms
      • Potential information source for:
        • Topics discussed and the sentiment on these topics (quickly available!)
        • And probably more?
      • Investigate it to obtain an answer on potential use
      Collected Dutch Twitter messages for study: a ‘selection’ of 12 million
  37. 37. Social media: Dutch Twitter topics [chart of 12 million messages; topic shares: 3%, 7%, 3%, 10%, 7%, 3%, 5%, 46%]
  38. 38. Final remarks: Future of statistics
      • Preparing large data sources for statistics is a lot of work
        • The exploration phase takes a lot of time
        • Reduction of information is needed (‘making big data small’)
        • Risk: ‘garbage in’, ‘garbage statistics out’
      • The traditional approach does not suffice
        • Large data sources are definitely not ‘large’ sample surveys
          • Often a selective but large part of the population is included
          • Sometimes it is just too much detailed data
        • With traditional statistical analysis everything will be significant!
      • More need for:
        • Visualisation methods (to rapidly gain insight)
        • Methods specific to large datasets (speedy and ‘robust’) and non-linear estimation methods (data-mining-like)
        • ‘Computational statistics’ (& dedicated hardware)
      • Privacy demands will increase!
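The remark that ‘everything will be significant’ can be illustrated with a small calculation. The sketch below assumes a one-sample z-test with a known standard deviation; the function name `two_sided_p_value` and the chosen effect size are hypothetical, and only the figure of 17 million records is borrowed from the census example. It holds a tiny effect fixed and varies only the number of records.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for a mean difference
    `effect`, assuming a known standard deviation `sd` and n records.
    (Hypothetical helper, for illustration only.)"""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# The same negligible effect (1% of a standard deviation) at two data sizes.
p_survey = two_sided_p_value(effect=0.01, sd=1.0, n=1_000)
p_register = two_sided_p_value(effect=0.01, sd=1.0, n=17_000_000)
```

At survey scale the effect is nowhere near significant (`p_survey` is well above 0.05). At register scale the same effect yields a p-value that is effectively zero. A small p-value then says little about whether the difference matters in practice, and nothing about selectivity.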
  39. 39. The future of Stat Neth?
