
Digital Demography - WWW'17 Tutorial - Part II



Second part of a tutorial on Digital Demography given at WWW'17. Please reference the archival tutorial description when using the material.

Published in: Science


  1. 1. Digital Demography Bogdan State & Ingmar Weber @bogdanstate @ingmarweber
  2. 2. The Next Three and a Half Hours 09:00 – 10:30: Part I: Overview of Traditional Demography (Bogdan) ● Standard models ● Standard data sources 10:30 – 11:00: Coffee Break and Networking Opportunity 11:00 – 12:30: Part II: New Opportunities for Demography with Digital Data (Ingmar) ● Case studies about fertility, mortality and migration ● More about data sources
  3. 3. About Us: Bogdan Sociology PhD (Stanford), focused on computational sociology of social ties. Currently: Graduate Student at Stanford (CS), Data Scientist at Facebook. Long-standing interest in migration research. Articles on measurement of migration with big data, focus on highly-skilled migration and on social networks.
  4. 4. About Us: Ingmar Research Director at QCRI. Started working on demographics of web search at Yahoo Research Barcelona (2009-2012). Collaborating with Emilio Zagheni since 2010, focusing on international migration. Published seven articles on different aspects of WWW and demographics. Serving as ACM Distinguished Speaker; ACM financially supports travel expenses if you want to have me present at your event.
  5. 5. Part II: New Opportunities for Demography with Digital Data
  6. 6. The next 90 minutes • 16 case studies, i.e. published peer-reviewed papers (~65 min) - Breadth over depth - Key idea over methodological details - Organized by topic: fertility, mortality and migration • Not-so-obvious data sets, in particular ad audience estimates (~15 min) - How many Twitter users match criteria X? • Where to from here and discussion (~10 min) - What are you working on? How can we help you?
  7. 7. Fertility
  8. 8. “Forecasting Births Using Google” Francesco C. Billari, Francesco D’Amuri, Juri Marcucci PAA Annual Meeting; 2013
  9. 9. Predict Monthly Fertility Rate Does Google search intensity (GI) for “maternity”, “pregnancy” or “ovulation” predict (with a lag) monthly birth rates? New to Google Trends? Example: Looks somewhat promising Also incorporate external factors
  10. 10. Model Performance Fit an autoregressive–moving-average (ARMA) model Encouraging results, but lots of models were tried. Potential risk of overfitting. Correlation with birth rate. GI1 is the monthly average of the Google index for ‘maternity’, GI2 is the monthly average of the GI for ‘ovulation’, and GI3 is the monthly average of the GI for ‘pregnancy’. Error rates with and without Google Trends data.
  11. 11. “Falsification” Test Lots of things correlate, either by chance or due to a hidden factor Temporal interest in “skiing” correlated with flu activity Important: robust selection of keywords Used Google Correlate with 2004-2006 time series data to find the most correlated term. Turned out to be: “KXMB” KXMB is a local affiliate of CBS (one of the major US commercial broadcast television networks) for central and western North Dakota Tested for predictive power. Got poor results (unlike for their terms).
  12. 12. “Fertility and its Meaning: Evidence from Search Behavior” Jussi Ojala, Emilio Zagheni, Francesco C. Billari, Ingmar Weber ICWSM; 2017
  13. 13. Study Goals (i) detect evidence for different contexts surrounding different types of fertility; Teen, low/high income, (un-)married, … (ii) model regional variation across states for different fertility levels; What distinguishes Alabama from California from New York? (iii) track temporal changes in fertility across time. Train a model across space, predict across time.
  14. 14. Feature Discovery via Google Trends
  15. 15. Different Contexts of Fertility Discover search terms correlated with different fertility rates across US states Remove terms with no conceivable link to sex, pregnancy or maternity
  16. 16. Predicting Spatial Variability Performance of the regression models using leave-one-out cross-validation. SMAPE is in [%], RMSE values are multiplied by 1,000. Use the previous terms to build models predicting state-level fertility rates All these models make predictions based on linear combinations of search intensity Goal: apply these spatial models across time
  17. 17. Learning Across Space, Predicting Across Time Temporal trend when applying the “teen” model across time. Values are rescaled to a maximum of 1.0. Pearson r correlation across 2010-2015 when using the spatial model to predict trends across time.
  18. 18. “Seasonal Variation in Internet Keyword Searches: A Proxy Assessment of Sex Mating Behaviors” Patrick M. Markey, Charlotte N. Markey Archives of Sexual Behavior; 2013
  19. 19. Seasonality of Mating-Related Web Searches Similar temporal patterns for searches about (i) prostitution and (ii) dating sites Births have a (weak) seasonal pattern Can we detect seasonal mating interest?
  20. 20. “Measuring the impact of health policies using Internet search patterns: the case of abortion” Ben Y. Reis, John S. Brownstein BMC Public Health; 2010
  21. 21. Searches for “abortion” vs. Abortion Rates Recent data:
  22. 22. The Impact of Policies on Search Behavior “With regard to the abortion policies available for study, abortion search volume was significantly higher in states having any of the following four restrictions: (i) mandatory waiting period, (ii) mandatory counseling, (iii) mandatory parental notification in the case of minors, and (iv) mandatory parental consent for minors. Examining abortion availability, abortion search volume was significantly higher in states where fewer than 10% of counties have providers.” “These findings are consistent with published evidence that local restrictions on abortion lead individuals to seek abortion services outside of their area.”
  23. 23. “#babyfever: Social and media influences on fertility desires” Lora E. Adair, Gary L. Brase, Karen Akao, Mackenzie Jantsch Personality and Individual Differences; 2014
  24. 24. #babyfever on Twitter
  25. 25. Mortality
  26. 26. “Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population” Michael Fire, Yuval Elovici ACM Transactions on Intelligent Systems and Technology; 2015
  27. 27. A Wiki Approach to Online Genealogy Anonymized version available at:
  28. 28. Lifespan in the US over the Last 350 Years
  29. 29. Goal: Predict Someone’s Lifespan Born in US and >50, predict if >80
  30. 30. “Quantitative analysis of population-scale family trees using millions of relatives” Joanna Kaplanis, Assaf Gordon, Mary Wahl, Michael Gershovits, Barak Markus, Mona Sheikh, Melissa Gymrek, Gaurav Bhatia, Daniel G MacArthur, Alkes Price, Yaniv Erlich bioRxiv; 2017
  31. 31. Online Genealogy Data - Again 13 million people, after cleaning, in a single pedigree Small sample of mitochondria and Y-STR haplotypes (not discussed) Also location information. Cleaned, de-identified data available at:
  32. 32. Geographical Distribution of Data (Place of Birth) Pre 1800 Post 1800
  33. 33. Mortality and City Growth Their model (red) validated against previous models (Oeppen & Vaupel, black)
  34. 34. Mobility Over Time And a lot more! Check out the paper. Median migration distance in North American born individuals as a function of time. Red: mother-offspring, blue: father-offspring, black: marital radius. Dots represent the data before smoothing.
  35. 35. “A New Source of Data for Public Health Surveillance: Facebook Likes” Steven Gittelman, Victor Lange, Carol A. G. Crawford, Catherine A. Okoro, Eugene Lieb, Satvinder S. Dhingra, Elaine Trimarchi Journal of Medical Internet Research; 2015
  36. 36. Zip-Level “Like” Counts for Different Categories Data from Facebook’s advertising API. Details about current API later.
  37. 37. Predict County-Level Life Expectancy Map zip codes to counties Used 214 counties in the continental USA So what are the factors?
  38. 38. What are the Nine Factors? Examples: Factor 2 is good for you Factor 8 is bad for you
  39. 39. “A novel web informatics approach for automated surveillance of cancer mortality trends” Georgia Tourassi, Hong-Jun Yoon, Songhua Xu Journal of Biomedical Informatics; 2016
  40. 40. Crawling Cancer-Related Obituaries Use a web search engine to get seeds for queries such as “breast cancer obituary, New York” Example Then post-filter Then lung vs. breast cancer Then infer age and gender
  41. 41. Cancer Mortality Rates from Online Obituaries Percent of lung cancer deaths per age group based on SEER data and obituaries for both genders. Annual female breast cancer death rates based on obituaries and on National Vital Statistics Report (NVSR) for 2008–2012.
  42. 42. “Online obituaries are a reliable and valid source of mortality data” M. L. Soowamber, J. T. Granton, F. Bavaghar-Zaeimi, S. R. Johnson Journal of Clinical Epidemiology; 2016
  43. 43. Let Me Google if My Patient Died … Discharged patients might die at home without the hospital knowing Leads to underestimates of mortality for procedures and diseases Search patients’ first and last names in online obituaries
  44. 44. Not Covered in this Tutorial: Digital Mourning “"We will never forget you [online]": an empirical investigation of post-mortem myspace comments”; J. R. Brubaker, G. R. Hayes; 2011 “Death and mourning as sources of community participation in online social networks: R.I.P. pages in Facebook”; A. E. Forman, R. Kern, G. Gil-Egui; 2012 “Does the internet change how we die and mourn? Overview and analysis.”; T. Walter, R. Hourizi, W. Moncur, S. Pitsillides; 2012 “Beyond the Grave: Facebook as a Site for the Expansion of Death and Mourning”; J. R. Brubaker, G. R. Hayes, P. Dourish; 2013
  45. 45. Migration (not Mobility) Migration = (i) across countries, and (ii) long-term Lots of work on mobility from Twitter/mobile phone call detail records (CDR)
  46. 46. “You are where you e-mail: using e-mail data to estimate international migration rates” Emilio Zagheni, Ingmar Weber WebSci; 2012
  47. 47. IP Address => Approximate Geolocation Any online service you frequently use knows your coarse-grained mobility pattern We used anonymized data from Yahoo
  48. 48. Data Collection Large sample of anonymized Yahoo email metadata (date, hashed user ID, inferred country), including self-reported birth year and gender Sent email between September 2009 and June 2011, at least once a month 43 million users, half from the US Migration: different modal country for [Sep 2009, Jun 2010] and [Jul 2010, Jun 2011] Also obtained internet penetration for each (country, age, gender) group And migration data for European countries from Eurostat (for calibration)
  49. 49. Internet => Young & Educated => More Mobile Expect a particular type of selection bias: Highly mobile people are early adopters of the internet (and email) Introduce an ad-hoc correction factor (CF) p_gac = internet penetration for gender g, age group a and country c k = factor that controls the strength of the selection bias Find appropriate k using calibration data for European countries
  50. 50. Results for the United States Red line: after applying correction factor. Top of gray area: estimates from raw data. The US doesn’t have good data on outgoing migration flows. Only some data from IRS on stocks of expats.
  51. 51. Sensitivity for Low Internet Penetration Countries Red line: using k=20 for CF. Gray area: Using k between 5 and 35 for CF.
  52. 52. “Studying inter-national mobility through IP geolocation” Bogdan State, Ingmar Weber, Emilio Zagheni WSDM; 2013
  53. 53. Data Collection Anonymized Yahoo log-in information, covering July 2011 to July 2012 Geolocated using IP address, using an average of 100 log-in events per user ~10^8 users, 97% in one country, 3% in two countries, 0.23% in more countries Define migration: 2x 90 days in two countries (223 migrants after cleaning) Use “outdated” (April 2012) self-declared country-of-residence to define the origin Normalize out-edges for a given source country: Given that I’m leaving country X, where do I go?
  54. 54. What Predicts Target of a Migration Event?
  55. 55. Visualization of Conditional Migration Flows Black = origin, red = destination, solid lines = “no return”, dashed = some back-and-forth, dotted = pendular
  56. 56. “Inferring international and internal migration patterns from Twitter data” Emilio Zagheni, Venkata Rama Kiran Garimella, Ingmar Weber, Bogdan State WWW; 2014
  57. 57. Data Collection Used Twitter streaming API filter for geo-tagged tweets from OECD countries Pick 3,000 users per country, get their tweets Estimate out-migration and oversample countries where migration is rare Get data for ~500K users Activity thresholding: 3+ tweets in four-month windows, May 2011 -> April 2013 Left with ~15K users -> Small!
  58. 58. Estimated Out-Migration Rates
  59. 59. Difference-in-Differences Out-migration rates clearly an overestimate Non-representative user set Selection bias is changing over time Focus on between-country differences Also see: “Demographic research with non-representative internet data”, Zagheni & Weber, 2015
  60. 60. Results (Soft) Validation: Ireland out-migration rate grew by 2.2% 2011 -> 2012, more than most countries (Irish Central Statistics Office) Mexico also sees a reduction in out-migration (Pew Research Center)
  61. 61. “Migration of Professionals to the U.S. - Evidence from LinkedIn Data” Bogdan State, Mario Rodriguez, Dirk Helbing, Emilio Zagheni SocInfo; 2014
  62. 62. Data Collection Data for ~200 million LinkedIn Users Complete with education level and city/country of education/job No details about data cleaning/preprocessing included
  63. 63. Results
  64. 64. “From Migration Corridors to Clusters: The Value of Google+ Data for Migration Studies” Johnnatan Messias, Fabricio Benevenuto, Ingmar Weber, Emilio Zagheni ASONAM; 2016
  65. 65. Beyond Origin-Destination Migration Analysis I’m a German citizen living in Qatar. So did I migrate from Germany to Qatar? Yes, according to Qatari border control. But: Germany (78->99), United Kingdom (99->03), Germany (03->07), Switzerland (07->09), Spain (09->12), Qatar (12->now) Use the “places lived” on Google+ In 2012, no “currently”, just a set of places Get tuples of co-lived countries
  66. 66. Flows/Corridors vs. Tuples/Clusters This is what border control can obtain (with directionality) This is what the Google+ “places lived” provides
  67. 67. Expected Cluster Frequencies Lots of migrant flows on (A,B), (A,C) and (B,C) => expect lots on (A,B,C) “Expect” = rank clusters according to: min(freq_AB, freq_AC, freq_BC) * mean(freq_AB, freq_AC, freq_BC) Best performing ranking approximation (Kendall .565, Spearman .754) Look at outliers and try to explain those
  68. 68. Outlier Frequencies Look at “expected rank – actual rank” Middle 20%: “close to expected” Top 20%: “higher than expected” Low 20%: “lower than expected”
  69. 69. Feature Analysis More than expected: (Spain, France, Italy) (UAE, India, Singapore) Less than expected: (Brazil, Mexico, USA) (Canada, China, UK) Most discriminative features for 3-class distinction
  70. 70. Other Digital Mobility Data: Mobile Phone Data Mostly used for studying mobility (within a country) rather than migration (across countries). Also used for socio-economic estimates (such as income estimates). See work by the following authors for examples (alphabetical order). Joshua Blumenstock, Francesco Calabrese, Nathan Eagle, Cesar Hidalgo, Alex ‘Sandy’ Pentland, Andrew Tatem,
  71. 71. More Data Sources Ad Audience Estimates as Digital Census Please consider citing this tutorial if you should use these data sets and tools. See the proceedings for citation details. Stay tuned for forthcoming work using this data.
  72. 72. Targeted Advertising as a Digital Census All the Internet giants make money with targeted advertising It’s in their commercial interest to “understand” their users Rich data on both demographic and behavioral attributes Usually not available for outside researchers, but … Some aggregate “audience estimates” available for advertisers: How many users/impressions match criteria X? Supported by (at least) Facebook, Twitter, and Google
  73. 73. Facebook’s Advertising Reach Estimates Easy-to-Use Python code Created by Matheus Araujo at QCRI Contact me if you want to (i) know about important details, and (ii) know what’s in the pipeline.
  74. 74. Sneak Preview: Estimating Stocks of Migrants Joint work with Emilio Zagheni and Krishna Gummadi. Currently under review.
  75. 75. Twitter’s Advertising Reach Estimates accounts/%3Aaccount_id/reach_estimate
  76. 76. Google’s Advertising Reach Estimates estimator-service
  77. 77. Using Online Ads to Reach Migrants So far only described use as a passive data source. But can be used as an active outreach channel. Examples below. “Migrant Sampling Using Facebook Advertisements: A Case Study of Polish Migrants in Four European Countries”; S. Pötzschke, M. Braun; 2016 “Using Internet to Recruit Immigrants with Language and Culture Barriers for Tobacco and Alcohol Use Screening: A Study Among Brazilians”; B. H. Carlini, L. Safioti, T. C. Rue, L. Miles; 2014 “Reaching and recruiting Turkish migrants for a clinical trial through Facebook: A process evaluation”; B. Ü. Ince, P. Cuijpers, E. van 't Hof, H. Riper; 2014
  78. 78. Google Trends on Steroids Google Trends does not provide demographic information Get DMA-level demographic information (race, income, …) Join with DMA-level Google Trends information Can potentially give “average income of a web search query over time” But often sparsity problems, with data only showing for bigger cities (=> bias) See “The cost of racial animus on a black candidate: Evidence using Google search data”, Seth Stephens-Davidowitz; Journal of Public Economics; 2014 Also: “Demographic information flows”, Ingmar Weber, Alejandro Jaimes; CIKM 2010
  79. 79. Recall: Previously Mentioned Data Sources Online genealogy projects Online obituaries Google Correlate (= upload your own data, discover correlated search terms) Geotagged tweets Others? Baby announcements? Wedding invitations?
  80. 80. Enriching Your Data Demographic Inference 101 Please consider citing this tutorial if you should use these data sets and tools. See the proceedings for citation details.
  81. 81. Demographic Inference – Name Dictionaries First name gender dictionaries: Contact me for dictionary in “International Gender Differences and Gaps in Online Social Networks” Ethnicity Dictionary: Also see “Inferring Nationalities of Twitter Users and Studying Inter-National Linking”
  82. 82. Demographic Inference – Image-Based Inference Face++ Cognitive Services Microsoft Cognitive Services
  83. 83. Demographic Inference – Build Your Training Data FollowerWonk by Moz
  84. 84. Where to From Here?* *Other than lunch Image from user rculwellmins on Pinterest
  85. 85. Where to Go From Here Slides and references, including unused ones, will be posted at: (Annual?) Workshop at ICWSM: Social Media and Demographic Research, Forthcoming special collection on “Social Media and Demographic Research” for Demographic Research, edited by E. Zagheni (http://www.demographic-
  86. 86. Organizations IUSSP “Big Data and Population Processes” Panel, data-and-population-processes  See their events UN Global Pulse, Data-Pop Alliance, Digital Demography email list at UW,
  87. 87. Questions, Comments, Thoughts?
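
Appendix sketch for slide 48: the migration definition there ("different modal country" in two consecutive windows) could be expressed roughly as follows. This is a minimal illustration, not the authors' code. The window dates come from the slide; the event format (a list of `(date, country)` pairs per user) and the function names `modal_country` and `is_migrant` are assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Observation windows taken from slide 48; everything else is illustrative.
WINDOW_1 = (date(2009, 9, 1), date(2010, 6, 30))
WINDOW_2 = (date(2010, 7, 1), date(2011, 6, 30))


def modal_country(events, window):
    """Return the most frequent country among events inside the window, or None."""
    start, end = window
    counts = Counter(country for day, country in events if start <= day <= end)
    if not counts:
        return None
    # Ties are broken by first occurrence; a real pipeline would need a rule here.
    return counts.most_common(1)[0][0]


def is_migrant(events):
    """Flag a user whose modal country differs between the two windows."""
    first = modal_country(events, WINDOW_1)
    second = modal_country(events, WINDOW_2)
    return first is not None and second is not None and first != second


# Hypothetical user: mostly in DE during the first window, mostly in ES afterwards.
mover = [
    (date(2009, 10, 1), "DE"),
    (date(2009, 11, 1), "DE"),
    (date(2010, 1, 5), "FR"),
    (date(2010, 8, 1), "ES"),
    (date(2010, 9, 1), "ES"),
    (date(2011, 1, 1), "DE"),
]
# Hypothetical user who takes one trip abroad but keeps the same modal country.
stayer = [
    (date(2009, 10, 1), "US"),
    (date(2010, 2, 1), "US"),
    (date(2010, 8, 1), "US"),
    (date(2010, 12, 1), "CA"),
    (date(2011, 3, 1), "US"),
]
```

Using the modal country rather than the first or last observed country makes the flag robust to short trips, which is why the single FR and CA events above do not change the outcome.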
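
Appendix sketch for slide 53: normalizing out-edges per source country turns raw migrant counts into conditional shares, i.e. "given that I'm leaving country X, where do I go?". The snippet below is one plausible way to do this, assuming flows are stored as a dictionary keyed by `(origin, destination)`; the function name `conditional_flows` and the toy counts are hypothetical.

```python
from collections import defaultdict


def conditional_flows(flows):
    """Convert raw (origin, destination) counts into shares per origin.

    Each returned value approximates P(destination | origin): the count on an
    out-edge divided by the total count leaving the same origin.
    """
    totals = defaultdict(float)
    for (origin, _destination), count in flows.items():
        totals[origin] += count
    return {
        (origin, destination): count / totals[origin]
        for (origin, destination), count in flows.items()
        if totals[origin] > 0
    }


# Toy counts, invented for illustration only.
raw = {("A", "B"): 30, ("A", "C"): 10, ("B", "A"): 5}
shares = conditional_flows(raw)
```

After normalization, countries with very different numbers of observed users become comparable, because only the distribution over destinations is kept.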
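
Appendix sketch for slide 67: the "expected" ranking of three-country clusters uses the score min × mean over the three pairwise frequencies. A small illustration of that score follows, assuming pairwise frequencies are stored under unordered country pairs; the names `expected_score` and `rank_triples` and the toy frequencies are made up for the example.

```python
from itertools import combinations
from statistics import mean


def expected_score(pair_freq, triple):
    """Score a country triple as min * mean of its three pairwise frequencies."""
    freqs = [pair_freq.get(frozenset(pair), 0) for pair in combinations(triple, 2)]
    return min(freqs) * mean(freqs)


def rank_triples(pair_freq, triples):
    """Order triples from highest to lowest expected score."""
    return sorted(triples, key=lambda t: expected_score(pair_freq, t), reverse=True)


# Toy pairwise frequencies, invented for illustration only.
pair_freq = {
    frozenset({"A", "B"}): 10,
    frozenset({"A", "C"}): 20,
    frozenset({"B", "C"}): 30,
    frozenset({"A", "D"}): 2,
    frozenset({"B", "D"}): 2,
}
ranking = rank_triples(pair_freq, [("A", "B", "D"), ("A", "B", "C")])
```

The `min` term penalizes triples in which one pair is rare, while the `mean` term rewards overall volume. Comparing this expected rank with the observed rank of each triple gives the outliers discussed on slide 68.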