
HLG Big Data project and Sandbox


Presentation at IAOS 2014 Conference - Da Nang (Vietnam)

Published in: Technology

  1. HLG Big Data project and Sandbox. Carlo Vaccari (Istat) – IAOS October 2014
  2. This material is distributed under the Creative Commons "Attribution - NonCommercial - Share Alike - 3.0" license, available at
  3. International High Level Group to coordinate groups working on statistical standards: UNECE, OECD, Eurostat, national statistical organizations.
  4. Big Data Project
     May 2013: task team with the aim of defining a project to present to the international statistical community. Three main objectives:
       • To identify the main possibilities and the main strategic and methodological issues that Big Data poses for official statistics
       • To analyze the feasibility of efficiently producing official statistics from Big Data sources, and the possibility of replicating these approaches across different national contexts
       • To facilitate the sharing across organizations of knowledge, expertise, tools and methods for producing statistics from Big Data sources
  5. Big Data Project
     Project presented to HLG and CES
     Task teams composed of people from 13 organisations
     The project is composed of four task teams:
       • Partnership Task Team
       • Privacy Task Team
       • Quality Task Team
       • Sandbox Task Team
  6. Partnership Task Team
     Providers and sources of data (challenges: access to data, managing privacy and confidentiality):
       • Government (administrative records)
       • Private (commercial records)
       • Social media and other Internet sites
     Design (research design and development):
       • Academia
       • Private and/or public research institutes
       • NGOs
       • International organizations
  7. Partnership Task Team
     Technology (tools, data and infrastructure for data processing, data mining, real-time analytics, storage, computing, and data visualization):
       • Private sector (technology providers, IT companies)
       • Data providers themselves
     Analysis (NSOs can provide standards and methodology, whereas others provide analytical capacity and modeling):
       • Academia
       • Private and/or public research institutes
       • NGOs
       • International organizations
  8. Privacy Task Team
     Overview of existing tools for risk management in view of privacy issues:
       • Risks to privacy - privacy software
       • Data access strategies (onsite, remote access, microdata)
       • Overview of database privacy technologies
       • Evaluation of different privacy approaches
       • Big Data characteristics and their implications for data privacy
       • Data access strategies for Big Data
       • Computer science and statistical disclosure approaches
       • Disclosure risk assessment for Big Data
  9. Privacy Task Team
     • Information integration and governance (DB monitoring, security, transport security)
     • Statistical Disclosure Limitation (SDL)
     • Preserving confidentiality
     • Balance between “data utility” and “disclosure risk”
     • SDL methods: data masking
     • Traditional approaches: aggregation, obfuscation, perturbation, data swapping
     • Modern approaches: sampling and simulation
     • Managing potential risk to reputation: ethical practices, controls, communication, dialogue with the public
  10. Quality Task Team
      Input quality framework with indicators:
        • Source: data source, reliability, privacy, availability, costs, procedures, ...
        • Metadata: representativeness, usability, completeness, id, ...
        • Data: collection, coverage, complexity, efficiency, integrability
      Output quality framework with indicators:
        • Metadata: clarity, accessibility, completeness, comprehensiveness
        • Data: relevance, accuracy, timeliness, accessibility, coherence, predictivity, selectivity
      Process quality with indicators:
        • Cleaning: unambiguous, objectivity, granularity, reliability
        • Transformations: compliance, categorization, precision
        • Linking: completeness, selectivity, accuracy, id, time_related
        • Aggregation: quantity, confidentiality, integration, validity, accuracy
  11. Sandbox
      Sandbox: a web-accessible environment where researchers from different institutions explore the tools and methods needed for statistical production and the feasibility of producing Big Data-derived statistics
      List of tools chosen: Hadoop, Hortonworks, Pentaho, RHadoop
      Open list ...
  12. Sandbox
      The Sandbox is hosted at the Irish Centre for High-End Computing (ICHEC), which will assist the task team in testing and evaluating Hadoop workflows and associated data analysis application software
      The mission of ICHEC is to provide High-Performance Computing (HPC) resources, support, education and training for researchers
  13. Sandbox configuration
      • The sandbox system runs on a High Performance Computing Linux cluster hosted at the National University of Ireland (Galway), composed of 30 nodes, each with two quad-core processors, 48 GB of RAM and a 1 TB local disk
      • Each node is connected to two networks: one for accessing the shared Lustre filesystem and one Gigabit Ethernet network for management
      • A 20 TB shared filesystem is available to all nodes
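As a rough sanity check on the configuration above, the per-node figures can be multiplied out to cluster-wide totals. This is only a back-of-the-envelope sketch: the constant and function names are ours, and the results are raw hardware sums, not the capacity actually available to Hadoop jobs.

```python
# Per-node figures as stated on the slide; names are illustrative only.
NODES = 30
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 4       # "quad-core"
RAM_GB_PER_NODE = 48
LOCAL_DISK_TB_PER_NODE = 1


def cluster_totals():
    """Return raw aggregate hardware totals for the whole cluster."""
    return {
        "cores": NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR,
        "ram_gb": NODES * RAM_GB_PER_NODE,
        "local_disk_tb": NODES * LOCAL_DISK_TB_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

The local-disk total is separate from the 20 TB shared filesystem mentioned on the slide.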
  14. Sandbox in 2014
      • Virtual Sprint (March 2014) → first document
      • Workshop in Rome (April 2014)
      • Training in Rome (May 2014)
      • Sandbox installation and verification
      • Workshop in Heerlen (September 2014)
      Testing scenarios for Big Data usage in official statistics:
        • use as auxiliary information to improve an existing survey
        • replacing all or part of an existing survey with Big Data
        • producing a predefined statistical output, either with or without supplementation by survey data
        • producing a statistical output guided by findings from the data
  15. Sandbox partners
      Software:
        • Hortonworks: granted a free enterprise support subscription for the duration of the project
        • Pentaho: free trial of the enterprise platform
      Data:
        • Mobile data from Orange
        • Smart meter data from the Irish power agency
        • Smart meter data from the Canadian power agency
  16. Sandbox experiments
      Organized in task teams, one for each source:
        • Consumer Price Index
        • Mobile phone data
        • Smart meters
        • Traffic loops
        • Social Data
        • Web scraping
        • Job vacancies
  17. Experiment: Consumer Price Index
      Sources:
        • Web scraping from ONS (UK supermarkets)
        • Synthetic scanner data from Istat
      Test the performance of big data technologies applied to computing a simplified consumer price index, based on synthetic data sets modeling scanner data
      A first version of the price generator was successfully tested by generating a sample CSV file with 11 billion rows, which was uploaded to the sandbox
      Comparison between Hadoop ↔ NoSQL ↔ RDBMS
      Visual analysis of data through the Pentaho suite
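The slides do not say which formula the "simplified consumer price index" uses. As one possible illustration, the sketch below computes an unweighted Jevons index (the geometric mean of price relatives) over products priced in both periods. The choice of Jevons, the function name and the dictionary layout are all assumptions made for this example, not the team's method.

```python
import math


def jevons_index(base_prices, current_prices):
    """Unweighted Jevons index (base period = 100).

    Both arguments are hypothetical dicts mapping product id -> price.
    Only products observed in both periods contribute.
    """
    common = sorted(set(base_prices) & set(current_prices))
    if not common:
        raise ValueError("no products in common between the two periods")
    # Average the log price relatives, then exponentiate:
    # this is the geometric mean of current/base price ratios.
    log_sum = sum(math.log(current_prices[p] / base_prices[p]) for p in common)
    return 100.0 * math.exp(log_sum / len(common))


if __name__ == "__main__":
    base = {"milk": 1.00, "bread": 2.00, "eggs": 3.00}
    current = {"milk": 1.10, "bread": 2.00, "eggs": 3.30, "tea": 4.00}
    print(round(jevons_index(base, current), 2))
```

On real scanner data the same calculation would be run per product group and then aggregated with expenditure weights; that step is left out here.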
  18. Experiment: Mobile phone data
      Four datasets from the provider Orange for Ivory Coast:
        • calls and duration for each pair of cells for each hour
        • calls coming from 500k phones, with time and cell
        • calls coming from 500k randomly sampled individuals
        • communication sub-graphs for 5k users
      Experiments:
        • Classification of callers: workers, students, business, not LF, ...
        • Classification of zones (cells): industrial, residential, school/university, farmers, high/low traffic
        • Temporal distribution of calls (day/week/season)
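The "temporal distribution of calls" experiment can be pictured with a minimal sketch that counts calls by hour of day. The record layout (a timestamp string plus a cell id) and the function name are hypothetical, since the slides do not describe the actual Orange file format.

```python
from collections import Counter
from datetime import datetime


def calls_per_hour(records):
    """Return a 24-element list with the number of calls in each hour of day.

    `records` is an iterable of (timestamp, cell_id) pairs, where the
    timestamp is assumed to look like "2014-03-01 08:15:00".
    """
    counts = Counter(
        datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        for timestamp, _cell_id in records
    )
    return [counts.get(hour, 0) for hour in range(24)]


if __name__ == "__main__":
    sample = [
        ("2014-03-01 08:15:00", "cell_A"),
        ("2014-03-01 08:59:59", "cell_B"),
        ("2014-03-02 23:00:00", "cell_A"),
    ]
    print(calls_per_hour(sample))
```

Grouping by weekday or by month instead of by hour would give the week and season views named on the slide.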
  19. Experiment: Mobile phone data
      Parallel experiment on Slovenian and Orange data → exchange of methods, tools, findings
      Searching for other datasets from other providers
  20. Experiment: Smart meters
      Datasets:
        • Smart meter data from Ireland (household level, linked with 2 surveys)
        • Synthetic smart meter data from Canada (household level, covering several years; time-stamped hourly electricity consumption linked with hourly weather data and hourly price data, matched with quarterly survey data)
      Experiment: RHadoop code for visualizing the synthetic Canadian smart meter data, reporting the time elapsed for the following:
        • Hourly consumption (kWh) vs. hourly temperature (C) for all data
        • Hourly consumption (kWh) vs. hourly price (c) for all data
  21. Experiment: Traffic loops
      In the Netherlands, 20,000 traffic loops, each counting the number of vehicles every minute, are located on approximately 3,000 km of motorway. All this data is collected by a central agency, the NDW (National Data Warehouse for traffic). One year of data was loaded for the area of South Limburg, covering about 800 of these traffic loops.
      Experiment:
        • Find out how to deal with multiple files in Hadoop
        • See how the traffic develops during a year
      Deliverables:
        • Code for aggregating the data in Hive and RHadoop
        • A graphical representation of how traffic develops on these roads and in this region
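The team's aggregation code was written in Hive and RHadoop and is not reproduced in the slides. Purely to illustrate the kind of group-by involved, here is a plain-Python sketch that sums per-minute vehicle counts into daily totals per loop. The row layout (loop id, ISO-style timestamp, vehicle count) is an assumption, not the NDW format.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum per-minute vehicle counts into totals keyed by (loop_id, date).

    `rows` is an iterable of (loop_id, timestamp, vehicles) tuples, with
    timestamps assumed to start with "YYYY-MM-DD".
    """
    totals = defaultdict(int)
    for loop_id, timestamp, vehicles in rows:
        day = timestamp[:10]  # keep only the date part of the timestamp
        totals[(loop_id, day)] += vehicles
    return dict(totals)


if __name__ == "__main__":
    sample = [
        ("loop_1", "2014-01-01T00:00", 5),
        ("loop_1", "2014-01-01T00:01", 7),
        ("loop_1", "2014-01-02T00:00", 3),
        ("loop_2", "2014-01-01T00:00", 1),
    ]
    for key, total in sorted(daily_totals(sample).items()):
        print(key, total)
```

In Hadoop the same logic maps naturally onto a map step that emits the (loop, day) key and a reduce step that sums the counts.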
  22. Experiment: Traffic loops
  23. Experiment: Social Data
      Set of tweets generated in Mexico from January to July 2014:
        • Sentiment analysis techniques to obtain indicators of subjective wellbeing (to compare with statistics)
        • Use geo-tagged tweets to analyse people's movements
        • State of origin of tourists visiting "Magic Towns" in Mexico
  24. Experiment: Social Data
      Next steps:
      Geo-located tweet experiments on:
        • Working patterns / commuting from morning to night
        • Weekends / holidays / seasonal movements
        • South – North mobility / commerce at the northern border
      Work on emoticon and media acronym analysis:
        • Develop a small emoticon dictionary / review research papers
        • Count the emoticons in the tweets we have, and how many tweets contain emoticons, to get an idea of how representative they are
      Review of algorithms: work with some MapReduce adaptations, Spark, Scala
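The emoticon-counting step described above could look roughly like the following sketch. The emoticon set is a tiny illustrative dictionary of our own, and splitting on whitespace is a deliberate simplification; a real tweet tokenizer would also need to handle emoji and punctuation attached to words.

```python
# Hypothetical mini-dictionary; the team's actual dictionary is not given.
EMOTICONS = {":)", ":-)", ":(", ":-(", ":D", ";)"}


def emoticon_stats(tweets):
    """Return (total emoticons, tweets with at least one, share of such tweets)."""
    total = 0
    tweets_with_emoticon = 0
    for text in tweets:
        # Whitespace tokenization: an emoticon glued to a word is missed.
        found = sum(1 for token in text.split() if token in EMOTICONS)
        total += found
        if found:
            tweets_with_emoticon += 1
    share = tweets_with_emoticon / len(tweets) if tweets else 0.0
    return total, tweets_with_emoticon, share


if __name__ == "__main__":
    sample = ["good day :)", "no emoticon here", "bad :( very bad :("]
    print(emoticon_stats(sample))
```

The share of tweets containing an emoticon is the quantity the slide refers to when asking how representative emoticon-based indicators would be.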
  25. Experiment: Job vacancies
      The Job-vacancies team works on (historical) job vacancies data, scraped from various sites on the web. Goals:
        • to identify possible data sources, both free and commercial, and their APIs, and illustrate potential use cases
        • to scrape job vacancies data from the biggest national websites (possibly international ones too)
        • to test scraping tools (Irobotsoft and Kimonolabs)
        • to test the statistical process of data manipulation
  26. Experiment: Web scraping
      8,600 Italian websites, indicated by the 19,000 enterprises responding to the 2013 ICT survey, have been scraped and the acquired texts processed. The scraping and processing took about 33 hours on a virtual server in Italy. The goal of this activity is to reproduce the software configuration used and rerun the process in a more powerful environment in order to measure the time taken.
      Experiment:
        • Configure a Nutch job runnable in the Sandbox environment
        • Execute the scraping job to produce the scraped data in HDFS
        • Compare the performance of the sandbox with that of a single server
  27. State of the Project
      • All teams are running experiments and have defined objectives for final deliverables (preliminary results due by the end of November, final results by the end of the year)
      • Outline of final deliverables defined in the September meetings
      • Training material developed, available to all participants and, in future, to the public
      • Effective cooperation and exchange of ideas: all participants requested more time to develop other experiments and look forward to extending the project
  28. Lessons Learned
      • International cooperation can multiply ideas
      • Data acquisition can be a long process (e.g. five months to get the Orange mobile data): the group suggested other possible approaches for the future; “political”/legal sponsorship is needed
      • Setting up the environment took time → difficult to achieve a "stable" configuration
      • Training should address different skills: IT, statistics and algorithms
      • Need for people open to learning new tools, techniques, methods...
  29. Thank you for your attention!