ISI 2017 presentation on Big Data and bias


Big data examples with an emphasis on dealing with bias. The presentation was part of the big data and ethics session at ISI 2017.

Published in: Government & Nonprofit

  1. Big data, selection bias, and ways to correct for it
     Piet Daas, Bart Buelens
     Thanks to: Jan van den Brakel, Marco Puts, Martijn Tennekes, Chang Sun, Jade Cock and Agata Troost
  2. Use of Big Data
     – Statistics Netherlands has been studying the potential application and use of Big Data for a number of years
     – How have we used Big Data so far?
     – Three types of Big Data use
       ‐ 1) Combined with survey (or admin) data
       ‐ 2) Single source, but complete (census-like)
       ‐ 3) Single source, but incomplete (part of population)
     – Important considerations
       ‐ Quality of the data (and metadata)
       ‐ Coverage and 'selectivity' of the population
  3. Type of Big Data use: 1) Survey based, Big Data as additional source
     ‐ Consumer confidence + sentiment in social media
     ‐ CPI: traditional + scanner data + web-collected prices
     ‐ Survey methodology is the basis
     ‐ Methodological considerations:
       ‐ For some Big Data sources, information needs to be extracted first, e.g.
         • Determining the sentiment of social media messages
         • Using pictures to identify products on the web
  4. Type 1: Consumer confidence + social media
     [Figure labels: ~10%, ~80%]
     - The combined sentiment of public Dutch Facebook and Twitter messages per month correlates ~0.9 with the (monthly) Consumer Confidence survey data
     - The raw monthly aggregates of both series are cointegrated
     - Social media sentiment improves the precision of the survey-based Consumer Confidence estimate
     (Van den Brakel et al. (2017), Survey Methodology, forthcoming)
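
The ~0.9 figure on the slide above is a correlation between two monthly series. As a minimal sketch of how such a coefficient could be computed, the snippet below implements a plain Pearson correlation. The function name `pearson` and the numbers in the two lists are made up for illustration; they are not the Statistics Netherlands data, and the cointegration analysis mentioned on the slide is not covered here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values, for illustration only
sentiment = [0.12, 0.15, 0.11, 0.18, 0.21, 0.19]
confidence = [-8.0, -5.0, -9.0, -2.0, 1.0, 0.0]

r = pearson(sentiment, confidence)
print(f"correlation between the two illustrative series: {r:.2f}")
```

A high correlation alone does not show that one series can stand in for the other, which is presumably why the slide also reports cointegration of the raw aggregates.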
  5. Type of Big Data use: 2) Big Data as the main/single source, census approach
     ‐ Road sensor based traffic intensity statistics
     ‐ CPI fully based on web-collected prices
     ‐ Land use statistics based on satellite images
     ‐ AIS data of ships for maritime statistics
     ‐ These Big Data sources have in common that:
       • The target population is completely included (i.e. a census), e.g. roads, products, country, vessels
       • The variable in the source is identical to, very similar to, or can be converted to the one needed!
  6. Type 2: Dutch highways
  7. Type 2: Dutch highways + road sensors
  8. Type 2: Road sensor based intensity estimates
     [Figure: number of vehicles over time (years)]
     - Findings of 5 quality indicators are used to select the (daily) sensor data that is used
     - Missing data is the biggest problem (~40% of the expected data is absent)
     - Vehicle estimates are calculated per road segment with sensor weights
     - Low sensor coverage of highways in the first half of 2010 results in poor estimates
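
The slide above says that vehicle estimates are calculated per road segment with sensor weights and that a large share of the data is missing. The slides do not give the actual weighting scheme, so the following is only a sketch under an assumption: the estimate for a segment is taken as the weighted mean of the sensors that delivered data, with the weights renormalised over those sensors. The function name `segment_estimate` and the example values are hypothetical.

```python
def segment_estimate(counts, weights):
    """Weighted mean of the available sensor counts on one road segment.

    counts  -- daily vehicle counts per sensor, None where data is missing
    weights -- non-negative weight per sensor (hypothetical values)
    Returns None when no sensor on the segment delivered usable data.
    """
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    # Keep only sensors that reported a count and carry a positive weight
    pairs = [(c, w) for c, w in zip(counts, weights) if c is not None and w > 0]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    # Renormalise the weights over the sensors that are actually present
    return sum(c * w for c, w in pairs) / total_weight


# Three sensors on one segment; the second one delivered no data that day
estimate = segment_estimate([100, None, 300], [1, 1, 3])
print(estimate)
```

Returning `None` for a segment without usable sensors keeps the gap visible instead of silently reporting zero vehicles, which matters when a large part of the expected data is absent.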
  9. Type of Big Data use: 3) Big Data as the main source, but population not complete
     ‐ Social tension indicator using social media
     ‐ 'Day time population' using mobile phone data
     ‐ Tourism statistics using mobile phone data
     ‐ Energy statistics using smart meters
     ‐ …
     ‐ Only part of the target population is included
     ‐ Need to find ways to deal with/correct for the missing part
  10. Type of Big Data use: 3) Try to 'deal' with the missing part of the 'population'
     ‐ Social tension monitor using social media
       • Detect relevant messages with keywords
       • The relative number of messages per day is used
     ‐ 'Day time population' using mobile phone data (1 provider)
       • Assume 1/3 of the population uses this provider
       • Use the age distribution of the provider population for correction
       • Future: verify findings with data of another provider
     ‐ Tourism statistics using mobile phone data (1 provider)
       • Not done yet: change of foreign phones accessing the provider's network
     ‐ It's essential to find ways to obtain characteristics of the population included in the Big Data source!
       • This is challenging because directly available background characteristics are sometimes absent
       • Look for features (= measurable properties)
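
The 'day time population' correction described above (a single provider assumed to cover 1/3 of the population, corrected with the provider's age distribution) can be sketched as follows. This is an illustration under stated assumptions, not the method actually used: the age correction is written as simple post-stratification by age group, and the function names, age groups, and all numbers are hypothetical.

```python
def scale_by_market_share(devices_observed, market_share=1 / 3):
    """Naive scaling: assume the provider covers a fixed share of the population."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return devices_observed / market_share


def poststratified_estimate(observed_by_age, subscribers_by_age, population_by_age):
    """Scale observed devices per age group by population / subscribers in that group."""
    total = 0.0
    for age_group, observed in observed_by_age.items():
        subscribers = subscribers_by_age[age_group]
        if subscribers == 0:
            raise ValueError(f"no subscribers in age group {age_group!r}")
        # Each observed subscriber represents this many people of the same age group
        weight = population_by_age[age_group] / subscribers
        total += observed * weight
    return total


# Hypothetical numbers: the provider is over-represented among the young
observed = {"young": 100, "old": 50}           # devices seen in an area
subscribers = {"young": 1000, "old": 500}      # provider customers nationwide
population = {"young": 2000, "old": 2500}      # target population nationwide

print(scale_by_market_share(150))                                  # flat 1/3 assumption
print(poststratified_estimate(observed, subscribers, population))  # age-corrected
```

With these made-up numbers the flat 1/3 scaling and the age-corrected estimate give the same total, but they can diverge whenever the age mix in an area differs from the provider's overall mix, which is the selectivity problem the slide points at.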
  11. Type 3: Selectivity of mobile phone data
     [Figure: number of people in the city of 'Assen', with peaks at the Motor race (TT), 90,000 visitors, and the Truckstar festival, 55,000 visitors]
     Overestimating the number of visitors based on mobile phone data of a single provider
  12. Big Data based statistics
     – It's possible, but depends on the type of use
       ‐ 1) Survey based -> Need to 'link' the Big Data source
       ‐ 2) Big Data census-like -> Coverage (units) and comparability (variable)
       ‐ 3) Big Data incomplete -> Selectivity, coverage and stability of the population in the source
     Type 3 in particular requires more methodological research
     - Find ways to determine coverage and correct for selectivity by extracting and studying 'features'
     - Find other data sources to increase coverage of the target population
  13. Thank you for your attention!
     @pietdaas