This document provides an introduction and overview for a course on modeling social data. It discusses the following key points:
1. The course will cover obtaining social data from online human interactions, framing questions about the data as mathematical models, and interpreting model results.
2. Topics will include exploratory data analysis, regression, classification, and networks.
3. Examples of social data that will be examined include demographic diversity on the web and modeling how ideas spread through networks.
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Modeling Social Data, Lecture 1: Case Studies
1. Introduction and Overview
APAM E4990
Modeling Social Data
Jake Hofman & Chris Wiggins
Columbia University
January 23, 2013
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 1 / 52
2. Housekeeping
• Who are we:
• Jake: PhD, now at Microsoft Research NYC
• Chris: APMA faculty, also NYT Chief Data Scientist
• E-Dean Fung: Cornell Undergrad, now APAM student
• Grading:
• 3 homeworks, each 20 percent of final grade
• 1 final project, 30 percent of final grade
• Class participation, 10 percent of final grade
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 2 / 52
3. Course overview
Modeling social data requires an understanding of:
1 How to obtain data produced by online human interactions,
2 What questions we typically ask about human-generated data,
3 How to reframe these questions as mathematical models, and
4 How to interpret the results of these models in ways that
address our questions.
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 3 / 52
4. Topics we will cover
Figure : modelingsocialdata.org
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 4 / 52
5. Mise-en-place
(Things you will need in your setup)
1 Bash shell
2 Git
3 Account on Github
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 5 / 52
6. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52
7. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(motivating questions)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52
8. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(fitting large, potentially sparse models)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52
9. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(parallel processing for filtering and aggregating data)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52
10. Questions
Many long-standing questions in the social sciences are notoriously
di cult to answer, e.g.:
• “Who says what to whom in what channel with what e↵ect”?
(Laswell, 1948)
• How do ideas and technology spread through cultures?
(Rogers, 1962)
• How do new forms of communication a↵ect society?
(Singer, 1970)
• . . .
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 7 / 52
11. Large-scale data
Recently available electronic data provide an unprecedented
opportunity to address these questions at scale
Demographic Behavioral Network
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 8 / 52
12. The clean real story
“We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
-Richard Feynman
Nobel Lecture1, 1965
1
http://bit.ly/feynmannobel
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 9 / 52
13. Examples!
Now some examples of social data we’ve worked with and
will be discussing
But first: questions?
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 10 / 52
15. Exploratory Data Analysis
(a.k.a. counting and plotting things)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 12 / 52
16. Demographic diversity on the Web
with Irmak Sirer and Sharad Goel (ICWSM 2012)
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
●
●
●
●
●
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 13 / 52
17. Motivation
Previous work is largely survey-based and focuses and group-level
di↵erences in online access
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 14 / 52
18. Motivation
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
-Ho↵man & Novak (1998)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 14 / 52
19. Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic
groups?
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 15 / 52
20. Data
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel2
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
2
Special thanks to Mainak Mazumdar
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 16 / 52
21. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52
22. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com ! yahoo.com,
us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52
23. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com ! yahoo.com,
us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52
24. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com ! yahoo.com,
us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
• Aggregate activity at the site, group, and user levels
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52
25. Aggregate usage patterns
How do users distribute their time across di↵erent categories?
Fractionoftotalpageviews
0.05
0.10
0.15
0.20
0.25
●
●
●
● ●
SocialM
edia
E−m
ail
G
am
es
Portals
Search
All groups spend the majority of their time in the top five most
popular categories
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 18 / 52
26. Aggregate usage patterns
How do users distribute their time across di↵erent categories?
User Rank by Daily Activity
FractionofPageviewsinCategory
0.05
0.10
0.15
0.20
0.25
0.30
●
● ● ● ●
●
●
●
●
●
10% 30% 50% 70% 90%
● Social Media
E−mail
Games
Portals
Search
Highly active users devote nearly twice as much of their time to
social media relative to typical individuals
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 18 / 52
27. Group-level activity
How does browsing activity vary at the group level?
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
●
●
●
●
●
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Large di↵erences exist even at the aggregate level
(e.g. women on average generate 40% more pageviews than men)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 19 / 52
28. Group-level activity
How does browsing activity vary at the group level?
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
●
●
●
●
●
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Younger and more educated individuals are both more likely to
access the Web and more active once they do
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 19 / 52
29. Group-level activity
All demographic groups spend the majority of their time in the
same categories
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
● Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52
30. Group-level activity
Older, more educated, male, wealthier, and Asian Internet users
spend a smaller fraction of their time on social media
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
● Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52
31. Group-level activity
Lower social media use by these groups is often accompanied by
higher e-mail volume
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
●
●
●
●
●
●
● ●
●
●
●
●
●
●
● ●
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
● Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52
32. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Post-graduates spend three times as much time on health sites
than adults with only some high school education
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52
33. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Asians spend more than 50% more time browsing online news than
do other race groups
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52
34. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Even when less educated and less wealthy groups gain access to
the Web, they utilize these resources relatively infrequently
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52
35. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
News
●
● ●
●
●
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Health
●
● ●
● ●
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Reference
●
● ●
● ●
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Asian
Black
Hispanic
White
Controlling for other variables, e↵ects of race and gender largely
disappear, while education continues to have large e↵ect
pi =
X
j
↵j xij +
X
j
X
k
jkxij xik +
X
j
j x2
ij + ✏i
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 22 / 52
36. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Health
●
● ●
● ●
H
igh
SchoolG
raduate
Som
e
C
ollegeAssociate
D
egreeBachelor's
D
egree
PostG
raduate
D
egree
Female
Male
However, women spend considerably more time on health sites
compared to men
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 23 / 52
37. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Monthly pageviews on health sites
20 40 60 80 100
Female
Male
However, women spend considerably more time on health sites
compared to men, although means can be misleading
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 23 / 52
38. Summary
• Highly active users spend disproportionately more of their
time on social media and less on e-mail relative to the overall
population
• Access to research, news, and healthcare is strongly related to
education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with
reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer,
ICWSM 2012
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 24 / 52
45. Predicting consumer activity with Web search
with Sharad Goel, S´ebastien Lahaie, David Pennock, Duncan Watts
"Right Round"
Week
Rank
40
30
20
10
cccccccccccccccccccccccccccccccccccccccccc
Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09
Billboard
Search
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 27 / 52
46. Search predictions
Summary
• Relative performance and value of search varies across
domains
• Search provides a fast, convenient, and flexible signal across
domains
• “Predicting consumer activity with Web search”
Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 37 / 52
48. The structural virality of online di↵usion
with Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 39 / 52
49. Information di↵usion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on
average
• Of the hits that do succeed, we observe a wide range of
diverse di↵usion structures
• It’s di cult to say how something spread given only its
popularity
• “The structural virality of online di↵usion”, Anderson, Goel,
Hofman & Watts (Under review.)
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 52 / 52