Learning from Web Activity                                              Jake Hofman                                  Joint...
Jake Hofman   (hofman@yahoo-inc.com)   Learning from Web Activity   JSM, 2011.08.01   2 / 42
“... applies social sciences techniques, including                algorithmic game theory, economics, network analysis,   ...
possible to estimate accurately the age of ori-      biodiversification event in the Paleogene (65 to    8. M. Foote, in E...
possible to estimate accurately the age of ori-      biodiversification event in the Paleogene (65 to    8. M. Foote, in E...
The clean real story                   “ We have a habit in writing articles published in                   scientific jour...
Outline              • The clean story              • The real story              • Lessons learnedJake Hofman   (hofman@y...
Demographic diversity on the Web  with Irmak Sirer and Sharad Goel                                       The clean story  ...
Motivation          Previous work is largely survey-based and focuses and group-level                              differen...
Motivation                “As of January 1997, we estimate that 5.2 million                African Americans and 40.8 mill...
Motivation        Chang, et. al., ICWSM (2010)Jake Hofman   (hofman@yahoo-inc.com)   Learning from Web Activity   JSM, 201...
Motivation                                 Focus on activity instead of access                                       How d...
Data              • Representative sample of 265,000 individuals in the US, paid                   via the Nielsen MegaPan...
Data              • Transform all demographic attributes to binary variables                e.g., Age → Over/Under 25, Rac...
Data              • Transform all demographic attributes to binary variables                e.g., Age → Over/Under 25, Rac...
Data              • Transform all demographic attributes to binary variables                e.g., Age → Over/Under 25, Rac...
Data              • Transform all demographic attributes to binary variables                e.g., Age → Over/Under 25, Rac...
Site-level skew                                   How diverse are site audiences?         • For each site and attribute,  ...
Site-level skew              Density                                                                                      ...
Site-level skew              Many sites have skew close the overall mean, but there also                             popul...
Site-level skew          This skew persists even when we restrict attention to the top 10k                                ...
Sites vs. ZIPs               How do diversity of the online and offline worlds compare?                                     ...
Sites vs. ZIPs               How do diversity of the online and offline worlds compare?                                     ...
Sites vs. ZIPs               How do diversity of the online and offline worlds compare?                                     ...
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Learning from Web Activity
Upcoming SlideShare
Loading in...5
×

Learning from Web Activity

1,956

Published on

JSM 2011

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,956
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Learning from Web Activity"

  1. 1. Learning from Web Activity Jake Hofman Joint work with Irmak Sirer & Sharad Goel Yahoo! Research August 1, 2011Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 1 / 42
  2. 2. Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 2 / 42
  3. 3. “... applies social sciences techniques, including algorithmic game theory, economics, network analysis, psychology, ethnography, and mechanism design, to online situations.” http://research.yahoo.comJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 3 / 42
  4. 4. possible to estimate accurately the age of ori- biodiversification event in the Paleogene (65 to 8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al., Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245, gin of almost all extant genera. It is then possi- 23 million years ago) that is perhaps not yet pp. 245–295. ble to plot a backward survivorship curve (8) captured in Alroy et al.’s database (5, 7). The 9. M. D. Spalding et al., Bioscience 57, 573 (2007). for each of the 27 global bivalve provinces (9). jury is still out on what may have caused this 10. S. M. Stanley, Paleobiology 33, 1 (2007). 11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007). On the basis of these curves, Krug et al. find event. But we should not lose sight of the fact that origination rates of marine bivalves in- that the steep rise to prominence of many mod- 10.1126/science.1169410 SOCIAL SCIENCE A field is emerging that leverages the capacity to collect and analyze data at a Computational Social Science scale that may reveal patterns of individual and group behaviors. David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5 Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3 Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11 W“... a computational social science is emerging that e live life in the network. We check ment agencies such as the U.S. National Secur- critiqued or replicated. Neither scenario will our e-mails regularly, make mobile ity Agency. Computational social science could serve the long-term public interest of accumu- phone calls from almost any loca- become the exclusive domain of private com- lating, verifying, and disseminating knowledge. tion, swipe transit cards to use public trans- panies and government agencies. Alternatively, What value might a computational social leverages the capacity to collect and analyze data with an portation, and make purchases with credit cards. Our movements in public places may be there might emerge a privileged set of aca- demic researchers presiding over private data science—based in an open academic environ- ment—offer society, by enhancing understand- unprecedented breadth and depth and scale ...” captured by video cameras, and our medical records stored as digital files. We may post blog from which they produce papers that cannot be ing of individuals and collectives? What are the entries accessible to anyone, or maintain friend- ships through online social networks. Each of these transactions leaves digital traces that can http://sciencemag.org/content/323/5915/721 be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies. The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in eco- nomics, sociology, and political science show little evidence of this field. But computational social science is occurring—in Internet compa- nies such as Google and Yahoo, and in govern-Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 4 / 42
  5. 5. possible to estimate accurately the age of ori- biodiversification event in the Paleogene (65 to 8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al., Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245, gin of almost all extant genera. It is then possi- 23 million years ago) that is perhaps not yet pp. 245–295. ble to plot a backward survivorship curve (8) captured in Alroy et al.’s database (5, 7). The 9. M. D. Spalding et al., Bioscience 57, 573 (2007). for each of the 27 global bivalve provinces (9). jury is still out on what may have caused this 10. S. M. Stanley, Paleobiology 33, 1 (2007). 11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007). On the basis of these curves, Krug et al. find event. But we should not lose sight of the fact that origination rates of marine bivalves in- that the steep rise to prominence of many mod- 10.1126/science.1169410 SOCIAL SCIENCE A field is emerging that leverages the capacity to collect and analyze data at a Computational Social Science scale that may reveal patterns of individual and group behaviors. David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5 Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3 Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11 W“... shares with other nascent interdisciplinary fields e live life in the network. We check ment agencies such as the U.S. National Secur- critiqued or replicated. Neither scenario will our e-mails regularly, make mobile ity Agency. Computational social science could serve the long-term public interest of accumu- phone calls from almost any loca- become the exclusive domain of private com- lating, verifying, and disseminating knowledge. tion, swipe transit cards to use public trans- panies and government agencies. Alternatively, What value might a computational social (e.g., sustainability science) the need to develop a portation, and make purchases with credit cards. Our movements in public places may be there might emerge a privileged set of aca- demic researchers presiding over private data science—based in an open academic environ- ment—offer society, by enhancing understand- paradigm for training new scholars ...” captured by video cameras, and our medical records stored as digital files. We may post blog from which they produce papers that cannot be ing of individuals and collectives? What are the entries accessible to anyone, or maintain friend- ships through online social networks. Each of these transactions leaves digital traces that can http://sciencemag.org/content/323/5915/721 be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies. The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in eco- nomics, sociology, and political science show little evidence of this field. But computational social science is occurring—in Internet compa- nies such as Google and Yahoo, and in govern-Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 4 / 42
  6. 6. The clean real story “ We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...” -Richard Feynman Nobel Lecture 1 , 1965 1 http://bit.ly/feynmannobelJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 5 / 42
  7. 7. Outline • The clean story • The real story • Lessons learnedJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 6 / 42
  8. 8. Demographic diversity on the Web with Irmak Sirer and Sharad Goel The clean story (covering our tracks)Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 7 / 42
  9. 9. Motivation Previous work is largely survey-based and focuses and group-level differences in online accessJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 8 / 42
  10. 10. Motivation “As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.” -Hoffman & Novak (1998)Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 8 / 42
  11. 11. Motivation Chang, et. al., ICWSM (2010)Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 9 / 42
  12. 12. Motivation Focus on activity instead of access How diverse is the Web? To what extent do online experiences vary across demographic groups?Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 10 / 42
  13. 13. Data • Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel2 • Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.) • Detailed individual and household demographic information (age, education, income, race, sex, etc.) 2 Special thanks to Mainak MazumdarJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 11 / 42
  14. 14. Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/MaleJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
  15. 15. Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.comJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
  16. 16. Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
  17. 17. Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors) • Aggregate activity at the site, group, and user levelsJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
  18. 18. Site-level skew How diverse are site audiences? • For each site and attribute, calculate the skew in visitors (e.g., 93% of pageviews on Density foxnews.com are by White users) • For each attribute, plot the distribution of visitor skew across all sites 0.0 0.2 0.4 0.6 0.8 1.0 Proportion White VisitorsJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 13 / 42
  19. 19. Site-level skew Density Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Visitors Proportion White Visitors Proportion College Educated Visitors Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of Visitors With Proportion Adult Visitors Household Incomes Under $50,000Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 14 / 42
  20. 20. Site-level skew Many sites have skew close the overall mean, but there also popular, highly-skewed sites Greater Than 90% Less Than 10% youravon.com coveritlive.com Female collectionsetc.com needlive.com foxnews.com blackplanet.com White wunderground.com mediatakeout.com news.google.com slumz.boxden.com College Educated nytimes.com sythe.com mail.yahoo.com nanowrimo.org Over 25 Years Old apps.facebook.com cbox.ws Household Income scarleteen.com opentable.com Under $50,000 boards.adultswim.com marketwatch.com Table 1: A selection of popular sites that are homogeneous along various demographic dimensions. visually apparent from Figure 5, there are significant differ- College Female 70 Non−White ! Over 25 ences in how groups distribute their time on the web. These ! ! 60 differences—which, as mentioned above, hold for highly fre- r−Capita Pageviews ! quented sites such as Facebook and YouTube—are in some White No College 50 cases even more pronounced for lower traffic sites. For in- 40 Male stance, the gaming site pogo.com accounts for less than 1% 30 of pageviews among both low and high income users, butJake Hofman (hofman@yahoo-inc.com) Under 25 Learning from Web Activityusers spend almost twice as much of their time 15 / 42 low income JSM, 2011.08.01
  21. 21. Site-level skew This skew persists even when we restrict attention to the top 10k or 1k sites 1.0 1.0 1.0 Proportion College Educated Visitors Proportion Female Site Visitors Proportion White Site Visitors 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0.0 0.0 0.0 101 102 103 104 105 101 102 103 104 105 101 102 103 104 105 Site Rank Site Rank Site Rank 1.0 1.0 Household Incomes Under $50,000 Proportion of Adult Site Visitors 0.8 0.8 Proportion of Visitors With 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 101 102 103 104 105 101 102 103 104 105 Site Rank Site RankJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 16 / 42
  22. 22. Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion WhiteJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42
  23. 23. Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion White As expected, neighborhoods are more gender-balanced than sitesJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42
  24. 24. Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion White But sites typically have more racially diverse audiences than neighborhoods have residentsJake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42

×