Computational Social Science 
Jake Hofman 
Microsoft Research 
November 6, 2014 
@jakehofman (Microsoft Research) Computational Social Science November 6, 2014 1 / 62
MSR NYC 
http://research.microsoft.com/en-us/labs/newyork/ 
Questions 
Many long-standing questions in the social sciences are notoriously 
difficult to answer, e.g.: 
• “Who says what to whom in what channel with what effect?”
(Lasswell, 1948)
• How do ideas and technology spread through cultures?
(Rogers, 1962)
• How do new forms of communication affect society?
(Singer, 1970) 
• . . . 
Conventional methods 
Typically difficult to observe the relevant information via 
conventional methods 
(Katz & Lazarsfeld, 1955) 
Large-scale data 
Recently available electronic data provide an unprecedented 
opportunity to address these questions at scale 
[Figure panels: Demographic, Behavioral, Network]
Computational social science
An emerging discipline at the intersection of the social sciences
(motivating questions), statistics (fitting large, potentially sparse
models), and computer science (parallel processing for filtering and
aggregating data)
SOCIAL SCIENCE
Computational Social Science
David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, Marshall Van Alstyne
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
We live life in the network. We check our e-mails regularly, make mobile phone calls from almost any location, swipe transit cards to use public transportation, and make purchases with credit cards. Our movements in public places may be captured by video cameras, and our medical records stored as digital files. We may post blog entries accessible to anyone, or maintain friendships through online social networks. Each of these transactions leaves digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies.
The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in economics, sociology, and political science show little evidence of this field. But computational social science is occurring—in Internet companies such as Google and Yahoo, and in government agencies such as the U.S. National Security Agency. Computational social science could become the exclusive domain of private companies and government agencies. Alternatively, there might emerge a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Neither scenario will serve the long-term public interest of accumulating, verifying, and disseminating knowledge.
What value might a computational social science—based in an open academic environment—offer society, by enhancing understanding of individuals and collectives? What are the [...]
“... a computational social science is emerging that
leverages the capacity to collect and analyze data with an
unprecedented breadth and depth and scale ...”
http://sciencemag.org/content/323/5915/721
“... shares with other nascent interdisciplinary fields 
(e.g., sustainability science) the need to develop a 
paradigm for training new scholars ...” 
http://sciencemag.org/content/323/5915/721 
The clean real story 
“We have a habit in writing articles published in 
scientific journals to make the work as finished as 
possible, to cover all the tracks, to not worry about the 
blind alleys or to describe how you had the wrong idea 
first, and so on. So there isn’t any place to publish, in 
a dignified manner, what you actually did in order to 
get to do the work ...” 
-Richard Feynman, Nobel Lecture, 1965 (http://bit.ly/feynmannobel)
Outline
• Search predictions
• Web diversity
• Information diffusion
Predicting consumer activity with Web search 
with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts
[Figure: Billboard and search rank by week for "Right Round", Mar-09 to Aug-09]
Search predictions 
Motivation 
Does collective search activity 
provide useful predictive signal 
about real-world outcomes?
Search predictions 
Motivation 
Past work mainly focuses on predicting the present (Varian, 2009) and
ignores baseline models trained on publicly available data
[Figure: flu level (percent, 1 to 8) by date, 2004 to 2010, comparing actual values with search-based and autoregressive estimates]
Search predictions 
Motivation 
We predict future sales for movies, video games, and music
[Figure: (a) search volume against time to release (−30 to 30 days) for "Transformers 2"; (b) the same for "Tom Clancy's HAWX"; (c) Billboard and search rank by week for "Right Round", Mar-09 to Aug-09]
Search predictions 
Search models 
For movies and video games, predict opening weekend box office 
and first month sales, respectively: 
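$\log(\mathrm{revenue}) = \omega_0 + \omega_1 \log(\mathrm{search}) + \epsilon$
For music, predict following week’s Billboard Hot 100 rank:
$\mathrm{billboard}_{t+1} = \omega_0 + \omega_1 \mathrm{search}_t + \omega_2 \mathrm{search}_{t-1} + \epsilon$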
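A minimal sketch of fitting the log-log search model for movies or video games by ordinary least squares. The function names and the numbers are illustrative assumptions, not the data or code behind the talk; the toy data is synthetic and follows an exact power law.

```python
import math

def fit_loglog(search, revenue):
    """Least-squares fit of log(revenue) = intercept + slope * log(search)."""
    xs = [math.log(s) for s in search]
    ys = [math.log(r) for r in revenue]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def predict_revenue(intercept, slope, search_volume):
    """Map a search volume back to the revenue scale."""
    return math.exp(intercept + slope * math.log(search_volume))

# Synthetic data following revenue = 50 * search^1.2 exactly (not real figures).
search = [1e3, 1e4, 1e5, 1e6]
revenue = [50 * s ** 1.2 for s in search]
intercept, slope = fit_loglog(search, revenue)
print(slope, math.exp(intercept), predict_revenue(intercept, slope, 2e5))
```

Working on the log scale means the slope reads as an elasticity: a 1% change in search volume goes with roughly a slope-percent change in predicted revenue.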
Search predictions 
Search volume 
Search predictions 
Search models 
Search activity is predictive for movies, video games, and music 
weeks to months in advance 
[Figure: predicted versus actual values for the search models. (a) Movies: revenue in dollars, 10^3 to 10^9. (b) Video games: revenue in dollars, 10^3 to 10^7, with sequels and non-sequels marked separately. (c) Music: Billboard rank, 0 to 100. Panels (d) to (f): model fit (0.4 to 0.9) against time to release (−6 to 0 weeks) for movies, video games, and music.]
Search predictions
Baseline models
For movies, use budget, number of opening screens and Hollywood
Stock Exchange:
$\log(\mathrm{revenue}) = \omega_0 + \omega_1 \log(\mathrm{budget}) + \omega_2 \log(\mathrm{screens}) + \omega_3 \log(\mathrm{hsx}) + \epsilon$
For video games, use critic ratings and predecessor sales (sequels
only):
$\log(\mathrm{revenue}) = \omega_0 + \omega_1 \mathrm{rating} + \omega_2 \log(\mathrm{predecessor}) + \epsilon$
For music, use an autoregressive model with the previously
available rank:
$\mathrm{billboard}_{t+1} = \omega_0 + \omega_1 \mathrm{billboard}_{t-1} + \epsilon$
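A rough sketch of the autoregressive music baseline: each target week is paired with the rank from two weeks earlier, and a line is fit through the pairs. The helper names and the weekly ranks are made up for illustration.

```python
def lagged_pairs(ranks, gap=2):
    """Pair the predictor rank[t-1] with the target rank[t+1] (two weeks apart)."""
    return [(ranks[i], ranks[i + gap]) for i in range(len(ranks) - gap)]

def fit_line(pairs):
    """Least-squares intercept and slope for target = intercept + slope * predictor."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Made-up weekly chart positions for one song (lower is better).
weekly_ranks = [40, 31, 24, 18, 13, 9, 6, 4]
pairs = lagged_pairs(weekly_ranks)
intercept, slope = fit_line(pairs)

# Forecast next week's rank from the rank one week before the latest one.
next_rank = intercept + slope * weekly_ranks[-2]
print(pairs[0], round(next_rank, 1))
```

The two-week gap mirrors the "previously available rank" wording: the most recent week's chart is assumed not to be known yet when the forecast is made.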
Search predictions 
Baseline + combined models 
Baseline models are often surprisingly good 
[Figure: predicted versus actual values for the baseline models (a: movies, b: video games, c: music) and the combined models (d: movies, e: video games, f: music). Movie and video game panels show revenue in dollars; music panels show Billboard rank, 0 to 100. Video game panels mark sequels and non-sequels separately.]
Search predictions
Model comparison
For movies, search is outperformed by the baseline and of little
marginal value
For video games, search helps substantially for non-sequels, less so
for sequels
For music, the addition of search yields a substantially better
combined model
[Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]
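One way to sketch the comparison of baseline, search, and combined models: fit each by least squares and compare a fit score. The slides do not say which metric "model fit" is, so in-sample R² is used here as an assumption; the data is synthetic and the helper names are hypothetical.

```python
def ols(rows, y):
    """Least-squares coefficients for y ~ 1 + features, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(rows, y):
    """In-sample R^2 of the least-squares fit."""
    coef = ols(rows, y)
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic features; the outcome depends on both by construction.
baseline = [1, 2, 3, 4, 5, 6]
search = [2, 1, 4, 3, 6, 5]
outcome = [1 + 2 * b + 3 * s for b, s in zip(baseline, search)]

fit_baseline = r_squared([[b] for b in baseline], outcome)
fit_search = r_squared([[s] for s in search], outcome)
fit_combined = r_squared([[b, s] for b, s in zip(baseline, search)], outcome)
print(fit_baseline, fit_search, fit_combined)
```

In-sample R² can only go up when a feature is added, so a held-out or cross-validated score would be the fairer way to judge the marginal value of search.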
Search predictions 
Summary 
• Relative performance and value of search varies across 
domains 
• Search provides a fast, convenient, and flexible signal across 
domains 
• “Predicting consumer activity with Web search” 
Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010 
Outline
• Search predictions
• Web diversity
• Information diffusion
Demographic diversity on the Web 
with Irmak Sirer and Sharad Goel (ICWSM 2012) 
[Figure: daily per-capita pageviews (0 to 70) by income (under vs. over $25k), race (White vs. Black & Hispanic), education (some college vs. no college), age (under vs. over 65), and sex (female vs. male)]
Motivation
Previous work is largely survey-based and focuses on group-level
differences in online access
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Motivation 
Focus on activity instead of access 
How diverse is the Web? 
To what extent do online experiences vary across demographic 
groups? 
Data 
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel (special thanks to Mainak Mazumdar)
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
Data 
# ls -alh nielsen_megapanel.tar 
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar 
• Normalize pageviews to at most three domain levels, sans www 
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites 
(by unique visitors) 
• Aggregate activity at the site, group, and user levels 
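The first two preprocessing steps above could look roughly like this. It is a sketch under assumptions: the log is taken to be (user id, URL) pairs, and the function names and sample records are invented for illustration.

```python
from collections import defaultdict

def normalize_site(url):
    """Reduce a URL to at most three domain levels, without a leading www."""
    host = url.split("://", 1)[-1].split("/", 1)[0].lower()
    host = host.split(":", 1)[0]  # drop a port number if present
    labels = host.split(".")
    if labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-3:])

def top_sites_by_unique_visitors(pageviews, k):
    """Rank normalized sites by number of distinct users and keep the top k."""
    visitors = defaultdict(set)
    for user_id, url in pageviews:
        visitors[normalize_site(url)].add(user_id)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]

# Made-up records: (user id, URL).
pageviews = [
    ("u1", "www.yahoo.com"),
    ("u2", "us.mg2.mail.yahoo.com/neo/launch"),
    ("u1", "http://mail.yahoo.com/inbox"),
    ("u3", "yahoo.com/news"),
    ("u1", "example.org"),
]
top = top_sites_by_unique_visitors(pageviews, k=2)
print(top)
```

Counting distinct users rather than raw pageviews keeps a handful of heavy users from pushing a niche site into the top list.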
Aggregate usage patterns
How do users distribute their time across different categories?
[Figure: fraction of total pageviews (0.05 to 0.25) for social media, e-mail, games, portals, and search]
All groups spend the majority of their time in the top five most
popular categories
[Figure: fraction of pageviews in category (0.05 to 0.30) against user rank by daily activity (10% to 90%), for social media, e-mail, games, portals, and search]
Highly active users devote nearly twice as much of their time to
social media relative to typical individuals
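A small sketch of the quantity on the vertical axis, the fraction of pageviews falling in each category. It assumes each pageview already carries a category label; the labels and counts are made up.

```python
from collections import Counter

def category_fractions(categories):
    """Share of pageviews in each category, given one label per pageview."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Made-up pageview log for one user.
views = ["social media"] * 5 + ["e-mail"] * 3 + ["search"] * 2
fractions = category_fractions(views)
print(fractions)
```

Computing the same fractions per user, then grouping users by activity rank, would give a curve like the one described for highly active users.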
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (0 to 70) by income (under vs. over $25k), race (White vs. Black & Hispanic), education (some college vs. no college), age (under vs. over 65), and sex (female vs. male)]
Large differences exist even at the aggregate level
(e.g. women on average generate 40% more pageviews than men)
Younger and more educated individuals are both more likely to
access the Web and more active once they do
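A sketch of how daily per-capita pageviews by group, and a ratio between two groups, might be computed. The input shapes (per-user totals and a user-to-group mapping) and all numbers are assumptions chosen for illustration, not panel data.

```python
from collections import defaultdict

def daily_per_capita_pageviews(user_totals, user_groups, n_days):
    """Average pageviews per person per day for each group."""
    views = defaultdict(int)
    members = defaultdict(int)
    for user, group in user_groups.items():
        members[group] += 1
        views[group] += user_totals.get(user, 0)  # inactive users count as zero
    return {g: views[g] / (members[g] * n_days) for g in members}

# Made-up totals over a 7-day window.
user_totals = {"a": 980, "b": 980, "c": 700}
user_groups = {"a": "Female", "b": "Female", "c": "Male"}
rates = daily_per_capita_pageviews(user_totals, user_groups, n_days=7)
ratio = rates["Female"] / rates["Male"]
print(rates, ratio)
```

Dividing by group size before comparing is what makes the figure per-capita: a larger group does not look more active just because it has more members.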
Group-level activity 
All demographic groups spend the majority of their time in the 
same categories 
[Figure: fraction of total pageviews in Social Media, E-mail, Games, Portals, and Search, by Age (5 to 80), Education, Sex, Income, and Race]
Older, more educated, male, wealthier, and Asian Internet users 
spend a smaller fraction of their time on social media 
Lower social media use by these groups is often accompanied by 
higher e-mail volume 
[Figure: female-to-male pageview ratio (axis ticks 0.5, 1, 2) across site categories, listed from Pets, Home & Garden, and Food & Cooking through Sports, Adult, Humor, Personals, and Online Trading]
Revisiting the digital divide 
How does usage of news, health, and reference vary with 
demographics? 
[Figure: average pageviews per month (0 to 12) on News, Health, and Reference sites, by Education, Sex, Income, and Race]
Post-graduates spend three times as much time on health sites
as adults with only some high school education
Asians spend more than 50% more time browsing online news than 
do other race groups 
Even when less educated and less wealthy groups gain access to 
the Web, they utilize these resources relatively infrequently 
[Figure: average pageviews per month (0 to 12) on News, Health, and Reference sites by education (High School Graduate to Post Graduate Degree), for Asian, Black, Hispanic, and White users]
Controlling for other variables, effects of race and gender largely
disappear, while education continues to have a large effect
\[ p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \omega_{jk} x_{ij} x_{ik} + \sum_j \varpi_j x_{ij}^2 + \epsilon_i \]
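The model on this slide combines main effects, pairwise interactions, and squared terms. A minimal sketch of that feature expansion follows, assuming each user is described by a numeric covariate vector. The names `expand_features` and `predict` are hypothetical, and restricting the interactions to j < k is an assumption, since the slide does not give the index range.

```python
from itertools import combinations


def expand_features(x):
    """Return main effects, pairwise interactions (j < k), and squared terms."""
    main = list(x)
    # Assumption: each unordered pair of covariates contributes one interaction.
    interactions = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    squares = [v * v for v in x]
    return main + interactions + squares


def predict(x, weights):
    """Linear prediction: dot product of the expanded features and the weights."""
    feats = expand_features(x)
    if len(feats) != len(weights):
        raise ValueError("expected %d weights" % len(feats))
    return sum(w * f for w, f in zip(weights, feats))
```

With two covariates the expansion has five terms: two main effects, one interaction, and two squares. The noise term of the model is not represented here; fitting the weights is left to whatever regression routine is at hand.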
[Figure: average pageviews per month (0 to 12) on Health sites by education (High School Graduate to Post Graduate Degree), for Female and Male users]
However, women spend considerably more time on health sites 
compared to men 
[Figure: monthly pageviews on health sites (axis 20 to 100), Female vs. Male]
However, women spend considerably more time on health sites 
compared to men, although means can be misleading 
Individual-level prediction 
How well can one predict an individual’s demographics from their 
browsing activity? 
• Represent each user by the set of sites visited 
• Fit linear models⁴ to predict majority/minority for each
attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
⁴ http://bit.ly/svmperf
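The protocol above (represent users by visited sites, then split 80% / 10% / 10%) can be sketched as follows. This is an illustration only: `split_users` and `site_set_features` are hypothetical names, the input is assumed to be a list of (user, site) visit pairs, and fitting the linear model itself is left out.

```python
import random


def split_users(user_ids, seed=0):
    """Shuffle users and split them 80% / 10% / 10% into train, validation, test."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test


def site_set_features(visits):
    """Map each user to the set of sites they visited (a binary bag of sites)."""
    features = {}
    for user, site in visits:
        features.setdefault(user, set()).add(site)
    return features
```

Splitting by user rather than by visit keeps all of one person's browsing on the same side of the train/test boundary, so the held-out estimate is not inflated by seeing the same user twice.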
Reasonable (~70-85%) accuracy and AUC across all attributes
[Figure: accuracy and AUC (0.5 to 1) for each classification task: Over/Under 25 Years Old, Female/Male, White/Non-White, Under/Over $50,000 Household Income, College/No College]
Highly-weighted sites under the fitted models 
Attribute                        Large positive weight           Large negative weight
Female                           winster.com, lancome-usa.com    sports.yahoo.com, espn.go.com
White                            marlboro.com, cmt.com           mediatakeout.com, bet.com
College Educated                 news.yahoo.com, linkedin.com    youtube.com, myspace.com
Over 25 Years Old                evite.com, classmates.com       addictinggames.com, youtube.com
Household Income Under $50,000   eharmony.com, tracfone.com      rownine.com, matrixdirect.com
Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
[Figure: AUC and accuracy (0.5 to 1) for each classification task: Over/Under 25 Years Old, Female/Male, White/Non-White, Under/Over $50,000 Household Income, College/No College]
[...] Figure 7, a measure that effectively re-normalizes the majority
and minority classes to have equal size. Intuitively,
AUC is the probability that a model scores a randomly selected
positive example higher than a randomly selected negative
one (e.g., the probability that the model correctly distinguishes
between a randomly selected female and male).
Though an uninformative rule would correctly discriminate
between such pairs 50% of the time, predictions based on
browsing histories are relatively reliable, ranging from 74% [...]
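The pairwise reading of AUC in the excerpt can be computed directly by comparing every positive example with every negative one. The sketch below assumes labels coded as 1 (positive) and 0 (negative) and counts ties as half a correct pair; `pairwise_auc` is a hypothetical name, not a function from the tooling used in the study.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is O(P * N), which is
    fine for illustration but too slow for large samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```

A model that gives every user the same score gets 0.5 under this definition regardless of class balance, which is the sense in which AUC re-normalizes the majority and minority classes.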
Substantially better performance when restricted to “stereotypical”
users (~80-90%)
[Figure: AUC and accuracy (0.70 to 0.95) vs. fraction of users (0.0 to 1.0) for Age, Sex, Race, Education, and Income]
Similar performance even when restricted to top 1k sites 
[Figure: AUC and accuracy (0.5 to 0.9) vs. number of domains (10^2 to 10^5) for Age, Sex, Race, Education, and Income]
Site-level skew 
[Figure: density of sites by proportion of visitors (0.0 to 1.0) who are Female, White, College Educated, Adult, and from households with incomes under $50,000]
Many sites have skew close to the overall mean, but there are also
popular, highly skewed sites
Individual-level prediction 
Proof of concept browser demo 
http://bit.ly/surfpreds 
Summary 
• Highly active users spend disproportionately more of their 
time on social media and less on e-mail relative to the overall 
population 
• Access to research, news, and healthcare is strongly related to 
education, not as closely to ethnicity 
• User demographics can be inferred from browsing activity with 
reasonable accuracy 
• “Who Does What on the Web”, Goel, Hofman & Sirer,
ICWSM 2012
Outline 
• Search predictions
• Web diversity
• Information diffusion
The structural virality of online diffusion
with Ashton Anderson, Sharad Goel, Duncan Watts (201?) 
“Going Viral”? 
“Therefore we ... wish to proceed with great care as is
proper, and to cut off the advance of this plague and
cancerous disease so it will not spread any further ...”⁵
-Pope Leo X
Exsurge Domine (1520)
⁵ http://www.economist.com/node/21541719
Rogers (1962), Bass (1969) 
Data 
• Examined one year of tweets from July 2011 to July 2012 
• Restricted to 1.4 billion tweets containing links to top news, 
videos, images, and petitions sites 
• Aggregated tweets by URL, resulting in 1 billion distinct 
“events” 
• Crawled friend list of each adopter 
• Inferred “who got what from whom” to construct diffusion
trees
• Characterized size and structure of trees 
The Structural Virality of Online Diffusion
[Figure: adopters A, B, C, D, and E of a single URL, ordered in time, then rearranged by generation]
• Group posts by URL
• Label each friend who previously adopted as a potential parent
• Select each node’s most recent adopting friend as its parent
• Characterize size and structure of trees
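The parent-selection rule above can be sketched for a single URL as follows. This is a minimal illustration, assuming one adoption per user and a `friends` dictionary that maps each user to the set of accounts they follow; `infer_parents` and the input layout are hypothetical, not the authors' pipeline.

```python
def infer_parents(adoptions, friends):
    """Assign each adopter the most recent earlier-adopting friend as its parent.

    adoptions: list of (user, time) pairs for one URL, one entry per user.
    friends: dict mapping each user to the set of accounts they follow.
    Returns a dict user -> parent, with None for adopters that have no
    earlier-adopting friend (these become roots of separate trees).
    """
    parents = {}
    seen = []  # (time, user) for adopters processed so far, in time order
    for user, time in sorted(adoptions, key=lambda a: a[1]):
        followed = friends.get(user, set())
        candidates = [(t, u) for t, u in seen if u in followed and t < time]
        # max() picks the latest adoption time among the potential parents
        parents[user] = max(candidates)[1] if candidates else None
        seen.append((time, user))
    return parents
```

For example, if C follows both A and B, and A posts before B, then B is chosen as C's parent because B's adoption is the more recent one. Adopters with no adopting friend start their own tree.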
Information diffusion
Cascade size distribution
[Figure: CCDF (0.00001% to 10%) vs. cascade size (1 to 10,000)]
Focus on the rare hits that get at least 100 adoptions 
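The cascade size distribution is summarized as a CCDF (complementary cumulative distribution function). A minimal sketch of computing one from a list of cascade sizes follows; it assumes the convention that the CCDF at x is the fraction of cascades of size at least x, and `ccdf` is a hypothetical helper name.

```python
from collections import Counter


def ccdf(sizes):
    """Return sorted (size, fraction of cascades with size >= that value) pairs."""
    counts = Counter(sizes)
    total = len(sizes)
    remaining = total  # number of cascades at least as large as the current size
    points = []
    for size in sorted(counts):
        points.append((size, remaining / total))
        remaining -= counts[size]
    return points
```

Plotting these points on log-log axes shows how rarely a cascade reaches a threshold such as the 100 adoptions used here to define a hit.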
Quantifying structure 
Measure the average distance between all pairs of nodes⁶
\[ \nu(T) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} \]
\[ \nu(T) = \frac{2n}{n-1} \left[ \frac{1}{n} \sum_{S \in \mathcal{S}} |S| - \frac{1}{n^2} \sum_{S \in \mathcal{S}} |S|^2 \right] \]
⁶ Wiener (1947); correlated with other possible metrics
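The two expressions for structural virality can be checked against each other on small trees. The sketch below assumes a single connected tree with at least two nodes, given as a child -> parent dictionary with the root mapped to None, and assumes the second sum runs over the subtrees rooted at each node. Both function names are hypothetical.

```python
from collections import deque


def pairwise_virality(parents):
    """Average distance over ordered pairs of distinct nodes, via BFS from each node."""
    nodes = list(parents)
    n = len(nodes)
    neighbors = {u: [] for u in nodes}
    for child, parent in parents.items():
        if parent is not None:
            neighbors[child].append(parent)
            neighbors[parent].append(child)
    total = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))


def subtree_virality(parents):
    """Same quantity computed from subtree sizes, one subtree per node."""
    n = len(parents)
    size = {u: 1 for u in parents}

    def depth(u):
        d = 0
        while parents[u] is not None:
            u = parents[u]
            d += 1
        return d

    # Fold each node's subtree size into its parent, deepest nodes first.
    for u in sorted(parents, key=depth, reverse=True):
        if parents[u] is not None:
            size[parents[u]] += size[u]
    s1 = sum(size.values())
    s2 = sum(s * s for s in size.values())
    return (2 * n / (n - 1)) * (s1 / n - s2 / n ** 2)
```

The subtree form needs only one pass over the tree, whereas the pairwise form runs a search from every node, which matters for cascades with thousands of adopters. A star (pure broadcast) with three leaves gives 1.5 under both, and the value grows as the tree becomes deeper and more branching.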
Information diffusion
Size and virality by category
Remarkable structural diversity across categories
[Figure: CCDF (0.001% to 100%) of cascade size (100 to 10,000) and of structural virality (3 to 30) for Videos, Pictures, News, and Petitions]
Information diffusion
Structural diversity
Size is a relatively poor predictor of structure
[Figure: structural virality (3 to 30) vs. cascade size (100 to 10,000), in separate panels for Petitions, News, Pictures, and Videos]
Simulations 
Simulate cascades with a simple SIR model⁷,
varying infectivity and degree skew
[Figure⁸: example network with nodes r, s, t, u, v, w, x, y, z shown at four successive steps, panels (a) to (d)]
Figure 21.2: The course of an SIR epidemic in which each node remains infectious for a
number of steps equal to t_I = 1. Starting with nodes y and z initially infected, the epidemic
spreads to some but not all of the remaining nodes. In each step, shaded nodes with dark
borders are in the Infectious (I) state and shaded nodes with thin borders are in the Removed
(R) state.
⁷ Kermack & McKendrick (1927)
⁸ Easley & Kleinberg (2010)
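A minimal sketch of a discrete-time SIR cascade of the kind in the figure caption, with each node infectious for exactly one step, is given below. It assumes the network is an adjacency dictionary and that a single probability `p` plays the role of infectivity; degree skew would enter through the network passed in. `simulate_sir` is a hypothetical name and this is not the simulation code behind the slides.

```python
import random


def simulate_sir(neighbors, seeds, p, rng=None):
    """Discrete-time SIR cascade in which nodes are infectious for one step.

    neighbors: dict node -> iterable of contacts.
    seeds: initially infected nodes.
    p: probability that an infectious node infects each susceptible contact.
    Returns a dict node -> parent recording who infected whom (seeds map to None).
    """
    rng = rng or random.Random()
    parent = {s: None for s in seeds}
    infectious = list(seeds)
    while infectious:
        newly_infected = []
        for u in infectious:
            for v in neighbors.get(u, ()):
                # A node already in `parent` is infectious or removed, not susceptible.
                if v not in parent and rng.random() < p:
                    parent[v] = u
                    newly_infected.append(v)
        infectious = newly_infected  # the previous generation is now Removed
    return parent
```

The returned child -> parent mapping is a diffusion tree, so the size and structural virality of simulated cascades can be measured in the same way as for observed ones.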
This reproduces the observed marginal distributions of size and 
structure 
[Figure: structural virality (3 to 100) vs. cascade size (100 to 100,000) for simulated cascades]
... but fails to account for the variance in structure given size 
Information diffusion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on
average
• Of the hits that do succeed, we observe a wide range of
diverse diffusion structures
• It’s difficult to say how something spread given only its
popularity
• “The structural virality of online diffusion”, Anderson, Goel,
Hofman & Watts (Under review.)
Outline 
• Search predictions
• Web diversity
• Information diffusion
Conclusion 
Lessons learned 
Data jeopardy 
Regardless of scale, it’s difficult to find the right questions to ask 
of the data 
Hacking 
Cleaning and normalizing data is a substantial amount of the work 
Modeling 
Understanding human activity is often useful for detecting 
malicious activity 
Simple methods (e.g., linear models) work surprisingly well, 
especially with lots of (diverse) data 
Thanks. Questions? 
jake@jakehofman.com 
Also, we’re hiring: 
bit.ly/msrnyc_appsci 
bit.ly/msrnyc_eng 

NYC Data Science Meetup: Computational Social Science

  • 1.
    Computational Social Science Jake Hofman Microsoft Research November 6, 2014 @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 1 / 62
  • 2.
    MSR NYC http://research.microsoft.com/en-us/labs/newyork/ @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 2 / 62
  • 3.
    Questions Many long-standingquestions in the social sciences are notoriously difficult to answer, e.g.: • “Who says what to whom in what channel with what e↵ect”? (Laswell, 1948) • How do ideas and technology spread through cultures? (Rogers, 1962) • How do new forms of communication a↵ect society? (Singer, 1970) • . . . @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 3 / 62
  • 4.
    Conventional methods Typicallydifficult to observe the relevant information via conventional methods (Katz & Lazarsfeld, 1955) @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 4 / 62
  • 5.
    Large-scale data Recentlyavailable electronic data provide an unprecedented opportunity to address these questions at scale Demographic Behavioral Network @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 5 / 62
  • 6.
    Computational social science An emerging discipline at the intersection of the social sciences, statistics, and computer science @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 6 / 62
  • 7.
    Computational social science An emerging discipline at the intersection of the social sciences, statistics, and computer science (motivating questions) @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 6 / 62
  • 8.
    Computational social science An emerging discipline at the intersection of the social sciences, statistics, and computer science (fitting large, potentially sparse models) @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 6 / 62
  • 9.
    Computational social science An emerging discipline at the intersection of the social sciences, statistics, and computer science (parallel processing for filtering and aggregating data) @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 6 / 62
  • 10.
    possible to estimateaccurately the age of ori-gin of almost all extant genera. It is then possi-ble to plot a backward survivorship curve (8) for each of the 27 global bivalve provinces (9). On the basis of these curves, Krug et al. find that origination rates of marine bivalves in-creased biodiversification event in the Paleogene (65 to 23 million years ago) that is perhaps not yet captured in Alroy et al.’s database (5, 7). The jury is still out on what may have caused this event. But we should not lose sight of the fact that the steep rise to prominence of many mod-ern 8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al., Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245, pp. 245–295. 9. M. D. Spalding et al., Bioscience 57, 573 (2007). 10. S. M. Stanley, Paleobiology 33, 1 (2007). 11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007). 10.1126/science.1169410 SOCIAL SCIENCE We live life in the network. We check our e-mails regularly, make mobile phone calls from almost any loca-tion, swipe transit cards to use public trans-portation, and make purchases with credit cards. Our movements in public places may be captured by video cameras, and our medical records stored as digital files. We may post blog entries accessible to anyone, or maintain friend-ships through online social networks. Each of these transactions leaves digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies. The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in eco-nomics, sociology, and political science show little evidence of this field. 
But computational social science is occurring—in Internet compa-nies such as Google and Yahoo, and in govern-ment agencies such as the U.S. National Secur-ity Agency. Computational social science could become the exclusive domain of private com-panies and government agencies. Alternatively, there might emerge a privileged set of aca-demic researchers presiding over private data from which they produce papers that cannot be A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors. critiqued or replicated. Neither scenario will serve the long-term public interest of accumu-lating, verifying, and disseminating knowledge. What value might a computational social science—based in an open academic environ-ment— offer society, by enhancing understand-ing of individuals and collectives? What are the Computational Social Science David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5 Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3 Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11 “... a computational social science is emerging that leverages the capacity to collect and analyze data with an unprecedented breadth and depth and scale ...” http://sciencemag.org/content/323/5915/721 @jakehofman (Microsoft Research) Computational Social Science November 6, 2014 7 / 62
  • 11.
    SOCIAL SCIENCE
    Computational Social Science
    David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, Marshall Van Alstyne
    We live life in the network. We check our e-mails regularly, make mobile phone calls from almost any location, swipe transit cards to use public transportation, and make purchases with credit cards. Our movements in public places may be captured by video cameras, and our medical records stored as digital files. We may post blog entries accessible to anyone, or maintain friendships through online social networks. Each of these transactions leaves digital traces that can be compiled into comprehensive pictures of both individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies. The capacity to collect and analyze massive amounts of data has transformed such fields as biology and physics. But the emergence of a data-driven “computational social science” has been much slower. Leading journals in economics, sociology, and political science show little evidence of this field.
    “... shares with other nascent interdisciplinary fields (e.g., sustainability science) the need to develop a paradigm for training new scholars ...”
    http://sciencemag.org/content/323/5915/721
  • 12.
    The clean real story
    “We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
    - Richard Feynman, Nobel Lecture, 1965 (http://bit.ly/feynmannobel)
  • 13.
    Outline
    • Search predictions
    • Web diversity
    • Information diffusion
    [Figures: Billboard vs. search rank by week for "Right Round", Mar-09 to Aug-09; daily per-capita pageviews by income, race, education, age, and sex]
  • 14.
    Predicting consumer activity with Web search
    with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts
    [Figure: Billboard vs. search rank by week for "Right Round", Mar-09 to Aug-09]
  • 15.
    Search predictions: Motivation
    Does collective search activity provide useful predictive signal about real-world outcomes?
    [Figure: Billboard vs. search rank by week for "Right Round", Mar-09 to Aug-09]
  • 16.
    Search predictions: Motivation
    Past work mainly focuses on predicting the present (Varian, 2009) and ignores baseline models trained on publicly available data
    [Figure: flu level (percent) by date, 2004 to 2010: actual, search, and autoregressive]
  • 17.
    Search predictions: Motivation
    We predict future sales for movies, video games, and music
    [Figure: (a) search volume vs. time to release (days) for "Transformers 2"; (b) search volume vs. time to release (days) for "Tom Clancy's HAWX"; (c) Billboard vs. search rank by week for "Right Round", Mar-09 to Aug-09]
  • 18.
    Search predictions: Search models
    For movies and video games, predict opening weekend box office and first month sales, respectively:
    $\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{search}) + \epsilon$
    For music, predict the following week’s Billboard Hot 100 rank:
    $\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{search}_t + \beta_2 \text{search}_{t-1} + \epsilon$
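The log-log search model above can be sketched as an ordinary least-squares fit. This is only an illustration: the function names `fit_loglog` and `predict_revenue` are hypothetical, the data are synthetic, and plain least squares via NumPy is an assumption about the fitting procedure, which the slide does not specify.

```python
import numpy as np

def fit_loglog(search, revenue):
    """Least-squares fit of log(revenue) = b0 + b1 * log(search)."""
    x = np.log(np.asarray(search, dtype=float))
    y = np.log(np.asarray(revenue, dtype=float))
    # Design matrix with an intercept column and log(search).
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # array([b0, b1])

def predict_revenue(coef, search):
    """Map a fitted (b0, b1) back to the revenue scale."""
    return np.exp(coef[0] + coef[1] * np.log(np.asarray(search, dtype=float)))

# Synthetic, noise-free example: revenue = 50 * search^0.8 (invented numbers).
search = np.array([1e3, 1e4, 1e5, 1e6])
revenue = 50.0 * search ** 0.8
coef = fit_loglog(search, revenue)
```

Because both sides are logged, the slope is an elasticity: a 1% increase in search volume corresponds to roughly a `coef[1]` percent increase in predicted revenue.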
  • 19.
    Search predictions: Search volume
  • 20.
    Search predictions: Search models
    Search activity is predictive for movies, video games, and music weeks to months in advance
    [Figure: (a) predicted vs. actual revenue (dollars) for movies; (b) predicted vs. actual revenue (dollars) for video games, sequels and non-sequels; (c) predicted vs. actual Billboard rank for music; (d-f) model fit vs. time to release (weeks, -6 to 0) for movies, video games, and music]
  • 21.
    Search predictions: Baseline models
    For movies, use budget, number of opening screens, and Hollywood Stock Exchange:
    $\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{budget}) + \beta_2 \log(\text{screens}) + \beta_3 \log(\text{hsx}) + \epsilon$
  • 22.
    Search predictions: Baseline models
    For video games, use critic ratings and predecessor sales (sequels only):
    $\log(\text{revenue}) = \beta_0 + \beta_1 \text{rating} + \beta_2 \log(\text{predecessor}) + \epsilon$
  • 23.
    Search predictions: Baseline models
    For music, use an autoregressive model with the previously available rank:
    $\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{billboard}_{t-1} + \epsilon$
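As a rough sketch of how the autoregressive music baseline could be compared with a model that also uses search, the snippet below builds the lagged features and fits both by least squares. The helper names `lagged_design` and `r_squared` are hypothetical, the data are synthetic, in-sample R² is only an assumed stand-in for the "model fit" reported in the talk, and the combined model is assumed to add the two search lags to the baseline feature.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

def lagged_design(billboard, search):
    """Target billboard[t+1]; features billboard[t-1], search[t], search[t-1]."""
    billboard = np.asarray(billboard, dtype=float)
    search = np.asarray(search, dtype=float)
    t = np.arange(1, len(billboard) - 1)
    y = billboard[t + 1]
    baseline = billboard[t - 1][:, None]
    combined = np.column_stack([billboard[t - 1], search[t], search[t - 1]])
    return y, baseline, combined

# Synthetic weekly series in which next week's rank tracks this week's search.
rng = np.random.default_rng(0)
search = rng.uniform(1, 100, size=30)
billboard = np.empty(30)
billboard[0] = 50.0
billboard[1:] = 0.5 * search[:-1] + rng.normal(0, 1, size=29)

y, baseline, combined = lagged_design(billboard, search)
r2_base = r_squared(baseline, y)
r2_comb = r_squared(combined, y)
```

Since the baseline features are a subset of the combined features, the in-sample fit of the combined model can never be worse; a fair comparison of predictive value would use held-out data.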
  • 24.
    Search predictions: Baseline + combined models
    Baseline models are often surprisingly good
    [Figure: predicted vs. actual revenue (dollars) for movies and for video games (sequels and non-sequels), and predicted vs. actual Billboard rank for music; panels a-c show the baseline models, panels d-f the combined models]
  • 25.
    Search predictions: Model comparison
    For movies, search is outperformed by the baseline and of little marginal value
    [Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]
  • 26.
    Search predictions: Model comparison
    For video games, search helps substantially for non-sequels, less so for sequels
    [Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]
  • 27.
    Search predictions: Model comparison
    For music, the addition of search yields a substantially better combined model
    [Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]
  • 28.
    Search predictions: Summary
    • Relative performance and value of search varies across domains
    • Search provides a fast, convenient, and flexible signal across domains
    • “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
  • 29.
    Outline
    • Search predictions
    • Web diversity
    • Information diffusion
    [Figures: Billboard vs. search rank by week for "Right Round", Mar-09 to Aug-09; daily per-capita pageviews by income, race, education, age, and sex]
  • 30.
    Demographic diversity on the Web
    with Irmak Sirer and Sharad Goel (ICWSM 2012)
    [Figure: daily per-capita pageviews (0 to 70) by income (under/over $25k), race (White, Black & Hispanic), education (some college, no college), age (under/over 65), and sex (female, male)]
  • 31.
    Motivation
    Previous work is largely survey-based and focuses on group-level differences in online access
  • 32.
    Motivation
    “As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
    - Hoffman & Novak (1998)
  • 33.
    Motivation
    Focus on activity instead of access
    How diverse is the Web?
    To what extent do online experiences vary across demographic groups?
  • 34.
    Data
    • Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel (special thanks to Mainak Mazumdar)
    • Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
    • Detailed individual and household demographic information (age, education, income, race, sex, etc.)
  • 35.
    Data
    # ls -alh nielsen_megapanel.tar
    -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
  • 36.
    Data
    # ls -alh nielsen_megapanel.tar
    -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
    • Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
  • 37.
    Data
    # ls -alh nielsen_megapanel.tar
    -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
    • Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
    • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
  • 38.
    Data
    # ls -alh nielsen_megapanel.tar
    -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
    • Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
    • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
    • Aggregate activity at the site, group, and user levels
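The normalization and filtering steps above might look roughly like the following. This is a sketch under assumptions: `normalize_site` and `top_sites` are hypothetical names, the input is assumed to be (user, URL) pairs, and the actual pipeline used on the 100G log is not described in the talk.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def normalize_site(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain levels, dropping a leading www."""
    # urlsplit only recognizes the host when a scheme or leading // is present.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])

def top_sites(pageviews, k):
    """Rank normalized sites by number of unique visitors and keep the top k."""
    visitors = defaultdict(set)
    for user, url in pageviews:
        visitors[normalize_site(url)].add(user)
    # Sort by descending unique visitors, breaking ties alphabetically.
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]

# Tiny invented log of (user, url) pairs.
views = [
    ("u1", "www.yahoo.com"),
    ("u2", "yahoo.com/news"),
    ("u1", "us.mg2.mail.yahoo.com/neo/launch"),
    ("u1", "mail.yahoo.com"),
]
```

Counting unique visitors rather than raw pageviews keeps a handful of very heavy users from pushing a niche site into the top list.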
  • 39.
    Aggregate usage patterns
    How do users distribute their time across different categories?
    [Figure: fraction of total pageviews (0.05 to 0.25) for social media, e-mail, games, portals, and search]
    All groups spend the majority of their time in the top five most popular categories
  • 40.
    Aggregate usage patterns
    How do users distribute their time across different categories?
    [Figure: fraction of pageviews in category (social media, e-mail, games, portals, search) vs. user rank by daily activity, 10% to 90%]
    Highly active users devote nearly twice as much of their time to social media relative to typical individuals
  • 41.
    Group-level activity
    How does browsing activity vary at the group level?
    [Figure: daily per-capita pageviews (0 to 70) by income (under/over $25k), race (White, Black & Hispanic), education (some college, no college), age (under/over 65), and sex (female, male)]
    Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)
  • 42.
    Group-level activity
    How does browsing activity vary at the group level?
    [Figure: daily per-capita pageviews (0 to 70) by income (under/over $25k), race (White, Black & Hispanic), education (some college, no college), age (under/over 65), and sex (female, male)]
    Younger and more educated individuals are both more likely to access the Web and more active once they do
  • 43.
    Group-level activity
    All demographic groups spend the majority of their time in the same categories
    [Figure: fraction of total pageviews in social media, e-mail, games, portals, and search, by age (5 to 80), education, sex, income, and race]
  • 44.
    Group-level activity
    Older, more educated, male, wealthier, and Asian Internet users spend a smaller fraction of their time on social media
    [Figure: fraction of total pageviews in social media, e-mail, games, portals, and search, by age (5 to 80), education, sex, income, and race]
  • 45.
    Group-level activity
    Lower social media use by these groups is often accompanied by higher e-mail volume
    [Figure: fraction of total pageviews in social media, e-mail, games, portals, and search, by age (5 to 80), education, sex, income, and race]
  • 46.
    Group-level activity
    [Figure: female-to-male pageview ratio (axis ticks at 2, 1, and 0.5) by site category; the category labels run from Pets, Home & Garden, Food & Cooking, Apparel/Beauty, and Family Resources at one end to Sports, Adult, Humor, Personals, and Online Trading at the other]
  • 47.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: average pageviews per month (0 to 12) on news, health, and reference sites, by education, sex, income, and race]
    Post-graduates spend three times as much time on health sites as adults with only some high school education
  • 48.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: average pageviews per month (0 to 12) on news, health, and reference sites, by education, sex, income, and race]
    Asians spend more than 50% more time browsing online news than do other race groups
  • 49.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: average pageviews per month (0 to 12) on news, health, and reference sites, by education, sex, income, and race]
    Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently
  • 50.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: average pageviews per month (0 to 12) on news, health, and reference sites by education level (high school graduate through post-graduate degree), for Asian, Black, Hispanic, and White users]
    Controlling for other variables, the effects of race and gender largely disappear, while education continues to have a large effect
    $p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$
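One way to read the regression above is as a linear model on an expanded feature set: the original covariates, their pairwise products, and their squares. The sketch below builds that design matrix and fits it by least squares. The name `expand_features` is hypothetical, the interaction sum is assumed to run over distinct pairs j < k, and the data are synthetic; how the covariates were actually encoded is not stated on the slide.

```python
from itertools import combinations

import numpy as np

def expand_features(X):
    """Columns: x_j, then x_j * x_k for j < k, then x_j ** 2."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    pairs = [X[:, j] * X[:, k] for j, k in combinations(range(d), 2)]
    squares = [X[:, j] ** 2 for j in range(d)]
    return np.column_stack([X] + pairs + squares)

# Small check of the column layout for two covariates.
Z = expand_features([[1, 2], [3, 4]])

# Synthetic fit: generate p from known coefficients and recover them.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_coef = np.array([1.0, -2.0, 0.5, 3.0, 0.0])  # alpha_1, alpha_2, beta_12, gamma_1, gamma_2
p = expand_features(X) @ true_coef
coef, *_ = np.linalg.lstsq(expand_features(X), p, rcond=None)
```

With 0/1 indicator covariates the squared columns duplicate the linear ones, so in that case the squared terms would only be meaningful for continuous variables such as age or income.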
  • 51.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: average pageviews per month (0 to 12) on health sites by education level, for female and male users]
    However, women spend considerably more time on health sites compared to men
  • 52.
    Revisiting the digital divide
    How does usage of news, health, and reference vary with demographics?
    [Figure: distribution of monthly pageviews on health sites (20 to 100) for female and male users]
    However, women spend considerably more time on health sites compared to men, although means can be misleading
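To illustrate why a difference in means can mislead, consider a toy example in which a single heavy user drives one group's average. The numbers below are invented for illustration only and are not taken from the Nielsen data.

```python
from statistics import mean, median

# Invented monthly health-site pageviews for two small groups of users.
group_a = [0, 0, 0, 2, 3, 95]   # one very heavy user
group_b = [0, 1, 2, 3, 4, 5]    # evenly spread, no heavy users

mean_a, mean_b = mean(group_a), mean(group_b)
median_a, median_b = median(group_a), median(group_b)
```

Here group A has the far larger mean while group B has the larger median, which is why looking at the full distribution (as in the figure on this slide) is more informative than a single average.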
  • 53.
    Individual-level prediction
    How well can one predict an individual’s demographics from their browsing activity?
    • Represent each user by the set of sites visited
    • Fit linear models (http://bit.ly/svmperf) to predict majority/minority for each attribute on 80% of users
    • Tune model parameters using a 10% validation set
    • Evaluate final performance on held-out 10% test set
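The representation and the 80/10/10 split described above can be sketched as follows. The helper names `split_users` and `site_matrix` are hypothetical, and the sketch stops short of model fitting: the talk points to SVMperf for the linear models, and nothing here reproduces that tool or its options.

```python
import numpy as np

def split_users(n_users, seed=0):
    """Shuffle user indices and split them 80% / 10% / 10%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_users)
    n_train = n_users * 8 // 10
    n_val = n_users // 10
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test

def site_matrix(user_sites, vocabulary):
    """Binary user-by-site matrix: 1 if the user visited the site at all."""
    index = {site: j for j, site in enumerate(vocabulary)}
    X = np.zeros((len(user_sites), len(vocabulary)))
    for i, sites in enumerate(user_sites):
        for site in sites:
            if site in index:  # sites outside the vocabulary are ignored
                X[i, index[site]] = 1.0
    return X

train, val, test = split_users(1000)
X = site_matrix([{"a.com", "b.com"}, {"c.com"}], ["a.com", "b.com"])
```

Keeping the test users untouched until the very end is what makes the reported accuracy an estimate of performance on new users rather than on users the model was tuned against.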
  • 54.
    Individual-level prediction
    Reasonable (~70-85%) accuracy and AUC across all attributes
    [Figure: accuracy and AUC (0.5 to 1) for each task: over/under 25 years old, female/male, white/non-white, under/over $50,000 household income, college/no college]
  • 55.
    Individual-level prediction
    Highly weighted sites under the fitted models

    Attribute                         Large positive weight            Large negative weight
    Female                            winster.com, lancome-usa.com     sports.yahoo.com, espn.go.com
    White                             marlboro.com, cmt.com            mediatakeout.com, bet.com
    College Educated                  news.yahoo.com, linkedin.com     youtube.com, myspace.com
    Over 25 Years Old                 evite.com, classmates.com        addictinggames.com, youtube.com
    Household Income Under $50,000    eharmony.com, tracfone.com       rownine.com, matrixdirect.com

    Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
    [Figure: AUC and accuracy (0.5 to 1) for each task: over/under 25 years old, female/male, white/non-white, under/over $50,000 household income, college/no college]
    “... Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate between such pairs 50% of the time, predictions based on browsing histories are relatively reliable, ranging from 74% ...”
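The pairwise interpretation of AUC quoted above translates directly into code. The function below is a minimal, quadratic-time sketch with a hypothetical name (`pairwise_auc`), counting ties as half a win; it is meant to mirror the definition, not to be an efficient implementation.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    # Compare every positive with every negative; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented scores: positives are 0.9 and 0.3, negatives are 0.8 and 0.1.
example = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the example, three of the four positive/negative pairs are ordered correctly, giving 0.75. Because only pairs across classes are compared, the value does not depend on how imbalanced the two classes are, which is the re-normalization the quoted passage refers to.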
  • 56.
    Individual-level prediction
    Substantially better performance when restricted to “stereotypical” users (~80-90%)
    [Figure: AUC and accuracy (0.70 to 0.95) vs. fraction of users (0.0 to 1.0) for age, sex, race, education, and income]
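One plausible way to produce a curve like this is to rank users by how confident the model is and score only the most confident fraction. The sketch below assumes that "stereotypical" means a large absolute model score and that labels are coded as +1/-1; both are assumptions for illustration, and `accuracy_at_fraction` is a hypothetical name.

```python
import numpy as np

def accuracy_at_fraction(scores, labels, fraction):
    """Accuracy on the `fraction` of users with the largest |score|."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)  # assumed to be +1 / -1
    k = max(1, int(round(fraction * len(scores))))
    # Indices of the k most confidently scored users.
    keep = np.argsort(-np.abs(scores))[:k]
    return float(np.mean(np.sign(scores[keep]) == labels[keep]))

# Invented scores: the two confident predictions are right, the two weak ones wrong.
scores = [3.0, -2.0, 0.5, -0.1]
labels = [1, -1, -1, 1]
top_half = accuracy_at_fraction(scores, labels, 0.5)
everyone = accuracy_at_fraction(scores, labels, 1.0)
```

Sweeping `fraction` from small values up to 1.0 traces out the trade-off between coverage and accuracy shown in the figure.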
  • 57.
    Individual-level prediction
    Similar performance even when restricted to top 1k sites
    [Figure: AUC and accuracy (0.5 to 0.9) vs. number of domains (10^2 to 10^5) for age, sex, race, education, and income]
  • 58.
    Site-level skew
    [Figure: density of sites by proportion of female visitors, white visitors, college-educated visitors, adult visitors, and visitors with household incomes under $50,000 (each from 0.0 to 1.0)]
    Many sites have skew close to the overall mean, but there are also popular, highly skewed sites
  • 59.
    Individual-level prediction
    Proof of concept browser demo
    http://bit.ly/surfpreds
  • 60.
    Summary
    • Highly active users spend disproportionately more of their time on social media and less on e-mail relative to the overall population
    • Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
    • User demographics can be inferred from browsing activity with reasonable accuracy
    • “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
  • 61.
    Outline
    Search predictions [Figure: “Right Round” weekly rank, Billboard vs. Search, Mar-09 to Aug-09]
    Web diversity [Figure: daily per-capita pageviews by income, race, education, age, and sex]
    Information diffusion
  • 62.
    The structural virality of online diffusion
    with Ashton Anderson, Sharad Goel, Duncan Watts (201?)
  • 63.
    “Going Viral”?
  • 65.
    “Going Viral”?
    “Therefore we ... wish to proceed with great care as is proper, and to cut off the advance of this plague and cancerous disease so it will not spread any further ...” [5]
    Pope Leo X, Exsurge Domine (1520)
    [5] http://www.economist.com/node/21541719
  • 66.
    “Going Viral”?
    Rogers (1962), Bass (1969)
  • 67.
    “Going viral”?
  • 69.
    Data
    • Examined one year of tweets from July 2011 to July 2012
    • Restricted to 1.4 billion tweets containing links to top news, video, image, and petition sites
    • Aggregated tweets by URL, resulting in 1 billion distinct “events”
    • Crawled friend list of each adopter
    • Inferred “who got what from whom” to construct diffusion trees
    • Characterized size and structure of trees
  • 75.
    The Structural Virality of Online Diffusion
    [Figure: adopters A-E ordered by time]
    Group posts by URL
  • 76.
    The Structural Virality of Online Diffusion
    [Figure: adopters A-E ordered by time]
    Label each friend who previously adopted as a potential parent
  • 77.
    The Structural Virality of Online Diffusion
    [Figure: adopters A-E ordered by time]
    Select each node’s most recent adopting friend as its parent
  • 78.
    The Structural Virality of Online Diffusion
    [Figure: adopters A-E arranged by generation]
    Characterize size and structure of trees
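The tree-construction steps on these slides (group posts by URL, mark previously adopting friends as potential parents, pick the most recent one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `infer_parents`, the `(user, timestamp)` input format, and the `friends` mapping are all hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple


def infer_parents(
    adoptions: List[Tuple[Hashable, float]],
    friends: Dict[Hashable, Set[Hashable]],
) -> Dict[Hashable, Optional[Hashable]]:
    """Assign each adopter of one URL a parent: its most recently adopting friend.

    adoptions: (user, timestamp) pairs for a single URL (hypothetical format).
    friends:   user -> set of accounts that user follows (hypothetical format).
    Returns user -> parent, with None marking a root (no friend adopted earlier).
    """
    parents: Dict[Hashable, Optional[Hashable]] = {}
    adopted_at: Dict[Hashable, float] = {}
    for user, t in sorted(adoptions, key=lambda pair: pair[1]):
        if user in adopted_at:
            continue  # keep only each user's first adoption
        # Potential parents: friends who adopted strictly earlier.
        candidates = [
            f for f in friends.get(user, set())
            if f in adopted_at and adopted_at[f] < t
        ]
        # Choose the most recent adopter among them, if any.
        parents[user] = max(candidates, key=lambda f: adopted_at[f]) if candidates else None
        adopted_at[user] = t
    return parents


# Toy example with five adopters, loosely following the A-E picture on the slides.
adoptions = [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5)]
friends = {"B": {"A"}, "C": {"A", "B"}, "D": {"B"}, "E": set()}
tree = infer_parents(adoptions, friends)
```

Note that a single URL can yield several roots (here both A and E), so the result is in general a forest rather than one tree.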
  • 79.
    Information diffusion
    Cascade size distribution
    [Figure: CCDF of cascade size, sizes 1 to 10,000 on log-log axes]
    Focus on the rare hits that get at least 100 adoptions
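For readers unfamiliar with the plot type, the empirical CCDF of cascade sizes can be computed as in the sketch below. It uses the convention CCDF(s) = P(size ≥ s); the slide does not say which convention was used, so that choice, like the function name `ccdf`, is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def ccdf(sizes: Iterable[int]) -> Dict[int, float]:
    """Map each observed size s to the fraction of cascades with size >= s."""
    counts = Counter(sizes)
    total = sum(counts.values())
    result: Dict[int, float] = {}
    remaining = total  # number of cascades with size >= current size
    for size in sorted(counts):
        result[size] = remaining / total
        remaining -= counts[size]
    return result


# Toy data (made up): most cascades are tiny, a few are larger.
example = ccdf([1, 1, 1, 2, 5])
```

Filtering to “hits” then amounts to keeping only cascades whose size is at least 100 before further analysis.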
  • 80.
    Quantifying structure
    Measure the average distance between all pairs of nodes [6]
    $$\nu(T) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}$$
    [6] Wiener (1947); correlated with other possible metrics
  • 81.
    Quantifying structure
    Measure the average distance between all pairs of nodes [6]
    $$\nu(T) = \frac{2n}{n-1} \left[ \frac{1}{n} \sum_{S \in \mathcal{S}} |S| - \frac{1}{n^2} \sum_{S \in \mathcal{S}} |S|^2 \right]$$
    [6] Wiener (1947); correlated with other possible metrics
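The two expressions for ν(T) on these slides can be checked against each other numerically. The sketch below computes the average pairwise distance directly with breadth-first search, and again from subtree sizes. It assumes that the sum in the second formula runs over the subtrees rooted at each node of T, which the slide does not spell out; the function names and the child-to-parent input format are likewise hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Optional

Parents = Dict[Hashable, Optional[Hashable]]  # node -> parent, root -> None


def virality_pairwise(parents: Parents) -> float:
    """Average distance over all ordered pairs of distinct nodes (BFS from each node)."""
    n = len(parents)
    if n < 2:
        raise ValueError("need at least two nodes")
    adj = defaultdict(set)
    for child, parent in parents.items():
        if parent is not None:
            adj[child].add(parent)
            adj[parent].add(child)
    total = 0
    for source in parents:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))


def virality_subtrees(parents: Parents) -> float:
    """Same quantity from subtree sizes (assumes a single tree with one root)."""
    n = len(parents)
    if n < 2:
        raise ValueError("need at least two nodes")
    children = defaultdict(list)
    root = None
    for child, parent in parents.items():
        if parent is None:
            root = child
        else:
            children[parent].append(child)
    order, stack = [], [root]
    while stack:  # pre-order traversal; reversing it visits children before parents
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children[u])
    s1 = sum(size.values())
    s2 = sum(s * s for s in size.values())
    return (2 * n / (n - 1)) * (s1 / n - s2 / n ** 2)


star = {"r": None, "a": "r", "b": "r", "c": "r"}   # broadcast-like tree
chain = {"r": None, "a": "r", "b": "a", "c": "b"}  # person-to-person chain
```

For these two four-node trees the star gives 1.5 and the chain about 1.67, so trees of equal size can differ in structural virality. The subtree version needs only one pass over the tree, whereas the pairwise version runs a search from every node.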
  • 82.
    Information diffusion
    Size and virality by category
    Remarkable structural diversity across categories
    [Figure: CCDFs of cascade size and of structural virality for Videos, Pictures, News, and Petitions]
  • 83.
    Information diffusion
    Structural diversity
  • 84.
    Information diffusion
    Structural diversity
    Size is a relatively poor predictor of structure
    [Figure: structural virality vs. cascade size, by category (Petitions, News, Pictures, Videos)]
  • 85.
    Simulations
    Simulate cascades with a simple SIR model [7], varying infectivity and degree skew
    [Figure 21.2 from Easley & Kleinberg [8]: The course of an SIR epidemic in which each node remains infectious for a number of steps equal to t_I = 1. Starting with nodes y and z initially infected, the epidemic spreads to some but not all of the remaining nodes. In each step, shaded nodes with dark borders are in the Infectious (I) state and shaded nodes with thin borders are in the Removed (R) state.]
    [7] Kermack & McKendrick (1927)
    [8] Easley & Kleinberg (2010)
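A discrete-time SIR cascade of the kind described in the figure caption (each node infectious for one step, t_I = 1) can be sketched as below. This is an illustrative toy, not the simulation used in the talk: the function name `simulate_sir`, the adjacency-dict graph format, and the per-contact infection probability `beta` are assumptions. Degree skew is not modeled here; it would enter through the graph that is passed in.

```python
import random
from typing import Dict, Hashable, Iterable, List, Optional


def simulate_sir(
    graph: Dict[Hashable, List[Hashable]],
    seeds: Iterable[Hashable],
    beta: float,
    rng: random.Random,
) -> Dict[Hashable, Optional[Hashable]]:
    """Run an SIR cascade with t_I = 1 and return node -> infector (None for seeds).

    graph: node -> list of neighbors (hypothetical format).
    beta:  probability that an infectious node infects a susceptible neighbor.
    """
    parents: Dict[Hashable, Optional[Hashable]] = dict.fromkeys(seeds)
    infectious = list(parents)
    while infectious:
        newly_infected = []
        for u in infectious:
            for v in graph.get(u, ()):
                # Nodes already in `parents` are Infectious or Removed, never Susceptible.
                if v not in parents and rng.random() < beta:
                    parents[v] = u
                    newly_infected.append(v)
        # After one step the current wave becomes Removed; only new cases stay infectious.
        infectious = newly_infected
    return parents


# A three-node path; with beta = 1 every contact transmits, with beta = 0 none does.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
everyone = simulate_sir(path, ["a"], beta=1.0, rng=random.Random(0))
nobody = simulate_sir(path, ["a"], beta=0.0, rng=random.Random(0))
```

The returned mapping has the same child-to-parent shape as an inferred diffusion tree, so the size and structure of simulated cascades can be measured the same way as observed ones.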
  • 86.
    Simulations
    This reproduces the observed marginal distributions of size and structure ...
    [Figure: structural virality vs. cascade size for simulated cascades]
    ... but fails to account for the variance in structure given size
  • 92.
    Information diffusion
    Summary
    • Most cascades fail, resulting in fewer than two adoptions, on average
    • Of the hits that do succeed, we observe a wide range of diverse diffusion structures
    • It’s difficult to say how something spread given only its popularity
    • “The structural virality of online diffusion”, Anderson, Goel, Hofman & Watts (under review)
  • 93.
    Outline
    Search predictions [Figure: “Right Round” weekly rank, Billboard vs. Search, Mar-09 to Aug-09]
    Web diversity [Figure: daily per-capita pageviews by income, race, education, age, and sex]
    Information diffusion
  • 94.
    Conclusion
  • 95.
    Lessons learned
    Data jeopardy
    Regardless of scale, it’s difficult to find the right questions to ask of the data
  • 96.
    Lessons learned
    Hacking
    Cleaning and normalizing data is a substantial amount of the work
  • 97.
    Lessons learned
    Modeling
    Understanding human activity is often useful for detecting malicious activity
  • 98.
    Lessons learned
    Modeling
    Simple methods (e.g., linear models) work surprisingly well, especially with lots of (diverse) data
  • 99.
    Thanks. Questions?
    jake@jakehofman.com
    Also, we’re hiring: bit.ly/msrnyc_appsci bit.ly/msrnyc_eng