This document provides an overview and introduction to a course on modeling social data. It discusses how computational social science uses large-scale electronic data and mathematical models to address long-standing questions in social sciences about how ideas spread and new forms of communication affect society. The course will cover topics like exploratory data analysis, regression, classification, and networks. It also presents a case study that found search engine query data can predict future sales of movies, video games, and music rankings weeks in advance.
Probability Sampling and Alternative MethodologiesLangerResearch
Gary Langer's Oct. 2012 presentation to the National Science Foundation on the future of survey research. Discusses the limitations of emerging approaches to public opinion research (such as opt-in online panels and social media analysis).
Probability Sampling and Alternative MethodologiesLangerResearch
Gary Langer's Oct. 2012 presentation to the National Science Foundation on the future of survey research. Discusses the limitations of emerging approaches to public opinion research (such as opt-in online panels and social media analysis).
This is a presentation of interview findings which forms part of my PhD project, about how young people engage with their networks and use social media during job search to acquire information. The study was informed by Tom Wilson's model of information seeking (1981), and also included the use of the "name generator" approach to gather ego-net data. This facilitated the creation of ego-net visuals (or job search information networks), which were then analysed to understand the behaviours associated with networking during job search from a young persons perspective. It also provided evidence of how information acquired from people and organisations, both offline and via social media, can improve the quality of job search via access to higher levels of social capital.
Designing Surveys to Determine the Impact of Online Social Networks #naspatech8nicolalritter
As the popularity of online social networking sites continues to grow with incoming students, higher education needs to have a better understanding of online social media tools. An understanding of how to design surveys is needed to determine the impact of online social networks on student’s social and academic interactions. Student affairs professionals spend considerable time analyzing, reporting, and discussing survey results. We often hear during these conversations, how do we ensure this information is accurate? Is this information generalizable? Is this information meaningful? Did we ask about student usage? Should we have asked about student perceptions of online social networks? How does online social networking impact students’ academic and social interactions on campus? Should we have asked about privacy? This presentation will address these questions and others by outlining the steps in survey development, posing areas to explore in online social networks, and engaging attendees in the survey construction process.
IMPORTANT: Be sure to download the actual PowerPoint file, and look through our very detailed speaker notes for extensive commentary and reference citations.
Our perspective on the best way for Bitcoin to achieve mainstream adoption, fully updated for 2017.
We investigate the most beneficial attributes of Bitcoin and the Bitcoin ecosystem, and make a modest proposal for an easy way to bring Bitcoin firmly into mainstream use.
Recomendaciones integrales de política pública para las juventudes en la ar...Jorge Roldán
El objetivo de este documento es aportar al debate sobre la situación actual de las juventudes en la Argentina, con el explícito propósito de esbozar recomendaciones de política pública. Para ello, se parte del análisis de las intervenciones del Poder Ejecutivo Nacional dirigidas a los jóvenes, así como de los proyectos legislativos presentados en 2013 y 2014.
This is a presentation of interview findings which forms part of my PhD project, about how young people engage with their networks and use social media during job search to acquire information. The study was informed by Tom Wilson's model of information seeking (1981), and also included the use of the "name generator" approach to gather ego-net data. This facilitated the creation of ego-net visuals (or job search information networks), which were then analysed to understand the behaviours associated with networking during job search from a young persons perspective. It also provided evidence of how information acquired from people and organisations, both offline and via social media, can improve the quality of job search via access to higher levels of social capital.
Designing Surveys to Determine the Impact of Online Social Networks #naspatech8nicolalritter
As the popularity of online social networking sites continues to grow with incoming students, higher education needs to have a better understanding of online social media tools. An understanding of how to design surveys is needed to determine the impact of online social networks on student’s social and academic interactions. Student affairs professionals spend considerable time analyzing, reporting, and discussing survey results. We often hear during these conversations, how do we ensure this information is accurate? Is this information generalizable? Is this information meaningful? Did we ask about student usage? Should we have asked about student perceptions of online social networks? How does online social networking impact students’ academic and social interactions on campus? Should we have asked about privacy? This presentation will address these questions and others by outlining the steps in survey development, posing areas to explore in online social networks, and engaging attendees in the survey construction process.
IMPORTANT: Be sure to download the actual PowerPoint file, and look through our very detailed speaker notes for extensive commentary and reference citations.
Our perspective on the best way for Bitcoin to achieve mainstream adoption, fully updated for 2017.
We investigate the most beneficial attributes of Bitcoin and the Bitcoin ecosystem, and make a modest proposal for an easy way to bring Bitcoin firmly into mainstream use.
Recomendaciones integrales de política pública para las juventudes en la ar...Jorge Roldán
El objetivo de este documento es aportar al debate sobre la situación actual de las juventudes en la Argentina, con el explícito propósito de esbozar recomendaciones de política pública. Para ello, se parte del análisis de las intervenciones del Poder Ejecutivo Nacional dirigidas a los jóvenes, así como de los proyectos legislativos presentados en 2013 y 2014.
Detailed report of IBM's 30TB Hadoop-DS report showing that IBM InfoSphere BigInsights (SQL-on-Hadoop) is able to execute all 99 TPC-DS queries at scale over native Hadoop data formats. Written by Simon Harris, Abhayan Sundararajan, John Poelman and Matthew Emmerton.
Discussed in detail about how to design and develop custom skills (think custom apps) for Amazon Alexa Voice service.
Discusses how to design voice based experiences in detail.
Panel held at LAK13: 3rd International Conference on Learning Analytics & Knowledge
http://simon.buckinghamshum.net/2013/03/lak13-edu-data-scientists-scarce-breed
Educational Data Scientists: A Scarce Breed
The Educational Data Scientist is currently a poorly understood, rarely sighted breed. Reports vary: some are known to be largely nocturnal, solitary creatures, while others have been reported to display highly social behaviour in broad daylight. What are their primary habits? How do they see the world? What ecological niches do they occupy now, and will predicted seismic shifts transform the landscape in their favour? What survival skills do they need when running into other breeds? Will their numbers grow, and how might they evolve? In this panel, the conference will hear and debate not only broad perspectives on the terrain, but will have been exposed to some real life specimens, and caught glimpses of the future ecosystem.
"data: past, present, and future" day 1 lecture 2020-01-20chris wiggins
What should our future statisticians, senators, and CEOs know about the history and ethics of data? How might understanding that history provide tools and resources to future citizens navigating a future shaped by data empowered algorithms? We've developed a course that introduces students, without prerequisites, to a historical view of our present condition, in which data-empowered algorithms shape our personal, professional, and political realities. The course attempts to integrate critical data studies with functional engagement with data (in Python via Jupyter notebooks), and interleaves an applied view of ethics throughout. The intellectual arc traces from the 18th century to present day, beginning with examples of contemporary technological advances, disquieting ethical debates, and financial success powered by panoptic persuasion architectures.
Presentation developed for NH IEEE groups, and adapted for life long learning audiences. Big Data has reached a tipping point, where applications are multiplying, many using our personal data in ways we do not expect, or necessarily agree with. This is augmented by Artificial Intelligence which can detect far more subtle criteria than a human might. A peek into the future -- for better or worse. Related course syllabus at: https://is.gd/BigDataIssues
ARC211: American Diversity and Design: Jonathan BessetteJonathan Bessette
The following pages document my responses to the online discussion questions in the Spring 2017 version of ARC 211 American Diversity and Design at the University at Buffalo - State University of New York.
Creating Connections: Collaborations Between Museums and SchoolsJ S-C
This presentation was for the 2015 Association of African American Museums Conference. It addresses the collaborative partnership between the National Civil Rights Museum and the Martin Luther King Jr. College Preparatory High School.
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
1. Introduction and Overview
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 20, 2017
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 1 / 53
2. Course overview
Modeling social data requires an understanding of:
1 How to obtain data produced by (online) human interactions,
2 What questions we typically ask about human-generated data,
3 How to reframe these questions as mathematical models, and
4 How to interpret the results of these models in ways that
address our questions.
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 2 / 53
3. Questions
Many long-standing questions in the social sciences are notoriously
difficult to answer, e.g.:
• “Who says what to whom in what channel with what effect”?
(Laswell, 1948)
• How do ideas and technology spread through cultures?
(Rogers, 1962)
• How do new forms of communication affect society?
(Singer, 1970)
• . . .
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 3 / 53
4. Questions
Typically difficult to observe the relevant information via
conventional methods
Moreno, 1933
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 4 / 53
5. Large-scale data
Recently available electronic data provide an unprecedented
opportunity to address these questions at scale
Demographic Behavioral Network
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 5 / 53
6. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53
7. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(motivating questions)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53
8. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(fitting large, potentially sparse models)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53
9. Computational social science
An emerging discipline at the intersection of the social sciences,
statistics, and computer science
(parallel processing for filtering and aggregating data)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 6 / 53
16. The clean real story
“We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
-Richard Feynman
Nobel Lecture1, 1965
1
http://bit.ly/feynmannobel
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 13 / 53
17. Outline
Web demographicsDailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
q
q
q
q
q
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Search predictions"Right Round"
Week
Rank
40
30
20
10
cccccccccccccccccccccccccccccccccccccccccc
Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09
Billboard
Search
Viral hits
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 14 / 53
18. Predicting consumer activity with Web search
with Sharad Goel, S´ebastien Lahaie, David Pennock, Duncan Watts
"Right Round"
Week
Rank
40
30
20
10
cccccccccccccccccccccccccccccccccccccccccc
Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09
Billboard
Search
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 15 / 53
19. Search predictions
Motivation
Does collective search activity
provide useful predictive signal
about real-world outcomes?
"Right Round"
Week
Rank
40
30
20
10
cccccccccccccccccccccccccccccccccccccccccc
Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09
Billboard
Search
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 16 / 53
20. Search predictions
Motivation
Past work mainly focuses on predicting the present2 and ignores
baseline models trained on publicly available data
Date
FluLevel(Percent)
1
2
3
4
5
6
7
8
2004 2005 2006 2007 2008 2009 2010
Actual
Search
Autoregressive
2
Varian, 2009
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 17 / 53
21. Search predictions
Motivation
We predict future sales for movies, video games, and music
"Transformers 2"
Time to Release (Days)
SearchVolume
a
−30 −20 −10 0 10 20 30
"Tom Clancy's HAWX"
Time to Release (Days)
SearchVolume
b
−30 −20 −10 0 10 20 30
"Right Round"
Week
Rank
40
30
20
10
cccccccccccccccccccccccccccccccccccccccccc
Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09
Billboard
Search
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 18 / 53
22. Search predictions
Search models
For movies and video games, predict opening weekend box office
and first month sales, respectively:
log(revenue) = β0 + β1 log(search) +
For music, predict following week’s Billboard Hot 100 rank:
billboardt+1 = β0 + β1searcht + β2searcht−1 +
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 19 / 53
25. Search predictions
Baseline models
For movies, use budget, number of opening screens and Hollywood
Stock Exchange:
log(revenue) = β0 + β1 log(budget) + β2 log(screens) +
β3 log(hsx) +
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53
26. Search predictions
Baseline models
For video games, use critic ratings and predecessor sales (sequels
only):
log(revenue) = β0 + β1rating + β2 log(predecessor) +
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53
27. Search predictions
Baseline models
For music, use an autoregressive model with the previously
available rank:
billboardt+1 = β0 + β1billboardt−1 +
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 22 / 53
29. Search predictions
Model comparison
For movies, search is outperformed by the baseline and of little
marginal value
ModelFit
0.4
0.5
0.6
0.7
0.8
0.9
1.0
CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined
SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch
BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline
N
onsequelG
am
es
SequelG
am
es
M
usic
M
ovies
Flu
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53
30. Search predictions
Model comparison
For video games, search helps substantially for non-sequels, less so
for sequels
ModelFit
0.4
0.5
0.6
0.7
0.8
0.9
1.0
CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined
SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch
BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline
N
onsequelG
am
es
SequelG
am
es
M
usic
M
ovies
Flu
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53
31. Search predictions
Model comparison
For music, the addition of search yields a substantially better
combined model
ModelFit
0.4
0.5
0.6
0.7
0.8
0.9
1.0
CombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombinedCombined
SearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearchSearch
BaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaselineBaseline
N
onsequelG
am
es
SequelG
am
es
M
usic
M
ovies
Flu
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 24 / 53
32. Search predictions
Summary
• Relative performance and value of search varies across
domains
• Search provides a fast, convenient, and flexible signal across
domains
• “Predicting consumer activity with Web search”
Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 25 / 53
33. Demographic diversity on the Web
with Irmak Sirer and Sharad Goel (ICWSM 2012)
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
q
q
q
q
q
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 26 / 53
34. Motivation
Previous work is largely survey-based and focuses and group-level
differences in online access
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 27 / 53
35. Motivation
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 27 / 53
36. Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic
groups?
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 28 / 53
37. Data
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel3
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
3
Special thanks to Mainak Mazumdar
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 29 / 53
38. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53
39. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53
40. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53
41. Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
• Aggregate activity at the site, group, and user levels
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 30 / 53
42. Aggregate usage patterns
How do users distribute their time across different categories?
Fractionoftotalpageviews
0.05
0.10
0.15
0.20
0.25
q
q
q
q q
SocialM
edia
E−m
ail
G
am
es
Portals
Search
All groups spend the majority of their time in the top five most
popular categories
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 31 / 53
43. Aggregate usage patterns
How do users distribute their time across different categories?
User Rank by Daily Activity
FractionofPageviewsinCategory
0.05
0.10
0.15
0.20
0.25
0.30
q
q q q q
q
q
q
q
q
10% 30% 50% 70% 90%
q Social Media
E−mail
Games
Portals
Search
Highly active users devote nearly twice as much of their time to
social media relative to typical individuals
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 31 / 53
44. Group-level activity
How does browsing activity vary at the group level?
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
q
q
q
q
q
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Large differences exist even at the aggregate level
(e.g. women on average generate 40% more pageviews than men)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 32 / 53
45. Group-level activity
How does browsing activity vary at the group level?
DailyPer−CapitaPageviews
0
10
20
30
40
50
60
70
q
q
q
q
q
Over $25k
Under $25k
Black
&
Hispanic
White
No College
Some College
Over 65
Under 65
Female
Male
Income Race Education Age Sex
Younger and more educated individuals are both more likely to
access the Web and more active once they do
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 32 / 53
46. Group-level activity
All demographic groups spend the majority of their time in the
same categories
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
q
q
q
q
q
q
q q
q
q
q
q
q
q
q q
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
q Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53
47. Group-level activity
Older, more educated, male, wealthier, and Asian Internet users
spend a smaller fraction of their time on social media
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
q
q
q
q
q
q
q q
q
q
q
q
q
q
q q
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
q Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53
48. Group-level activity
Lower social media use by these groups is often accompanied by
higher e-mail volume
Age
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
0.5
q
q
q
q
q
q
q q
q
q
q
q
q
q
q q
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
q Social Media
E−mail
Games
Portals
Search
Fractionoftotalpageviews
0.0
0.1
0.2
0.3
0.4
Education
● ●
●
●
●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
●
● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
● ●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 33 / 53
49. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Post-graduates spend three times as much time on health sites
than adults with only some high school education
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53
50. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Asians spend more than 50% more time browsing online news than
do other race groups
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53
51. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Education
●
●
●
● ●
●
●
G
ram
m
arSchool
Som
e
H
ig
h
School
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egree
PostG
raduate
D
egree
Sex
●
●
Fem
ale
M
ale
Income
● ● ●
●
●
●
$0−25k
$25−50k
$50−75k
$75−100k
$100−150k
$150k+
Race
● ●
●
●
●
O
ther
H
is
panic
Bla
ck
W
hite
Asia
n
● News
Health
Reference
Even when less educated and less wealthy groups gain access to
the Web, they utilize these resources relatively infrequently
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 34 / 53
52. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
News
q
q q
q
q
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Health
q
q q
q q
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Reference
q
q q
q q
H
ig
h
SchoolG
raduate
Som
e
C
ollege
Associa
te
D
egree
Bachelo
r's
D
egreePostG
raduate
D
egree
Asian
Black
Hispanic
White
Controlling for other variables, effects of race and gender largely
disappear, while education continues to have large effect
pi =
j
αj xij +
j k
βjkxij xik +
j
γj x2
ij + i
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 35 / 53
53. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Averagepageviewspermonth
0
2
4
6
8
10
12
Health
q
q q
q q
H
igh
SchoolG
raduate
Som
e
C
ollegeAssociate
D
egreeBachelor's
D
egree
PostG
raduate
D
egree
Female
Male
However, women spend considerably more time on health sites
compared to men
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 36 / 53
54. Revisiting the digital divide
How does usage of news, health, and reference vary with
demographics?
Monthly pageviews on health sites
20 40 60 80 100
Female
Male
However, women spend considerably more time on health sites
compared to men, although means can be misleading
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 36 / 53
55. Summary
• Highly active users spend disproportionately more of their
time on social media and less on e-mail relative to the overall
population
• Access to research, news, and healthcare is strongly related to
education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with
reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer,
ICWSM 2012
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 37 / 53
56. The structural virality of online diffusion
with Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 38 / 53
59. “Going Viral”?
“Therefore we ... wish to proceed with great care as is
proper, and to cut off the advance of this plague and
cancerous disease so it will not spread any further ...”4
-Pope Leo X
Exsurge Domine (1520)
4
http://www.economist.com/node/21541719
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 40 / 53
60. “Going Viral”?
Rogers (1962), Bass (1969)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 41 / 53
63. “Going viral”?
How do popular things become popular?
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 43 / 53
64. Data
• Examined one year of tweets from July 2011 to July 2012
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
65. Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news,
videos, images, and petitions sites
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
66. Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news,
videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct
“events”
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
67. Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news,
videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct
“events”
• Crawled friend list of each adopter
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
68. Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news,
videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct
“events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion
trees
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
69. Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news,
videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct
“events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion
trees
• Characterized size and structure of trees
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 44 / 53
70. The Structural Virality of Online Diffusion
A
B
D
C
E
Time
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 45 / 53
71. Information diffusion
Cascade size distribution
0.00001%
0.0001%
0.001%
0.01%
0.1%
1%
10%
1 10 100 1,000 10,000
Cascade Size
CCDF
Focus on the rare hits that get at least 100 adoptions
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 46 / 53
72. Quantifying structure
Measure the average distance between all pairs of nodes5
5
Weiner (1947); correlated with other possible metrics
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 47 / 53
73. Information diffusion
Size and virality by category
Remarkable structural diversity across across categories
0.001%
0.01%
0.1%
1%
10%
100%
100 1,000 10,000
Cascade Size
CCDF
Videos
Pictures
News
Petitions
0.001%
0.01%
0.1%
1%
10%
100%
3 10 30
Structural Virality
CCDF
Videos
Pictures
News
Petitions
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 48 / 53
74. Information diffusion
Structural diversity
0 50 100 150
time
size
0 5 10 15 20
time
size
0 20 40 60 80 100 120 140
time
size
0 20 40 60 80 100 120
time
size
0.0 0.5 1.0 1.5
time
size
0 10 20 30 40 50 60 70
time
size
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 49 / 53
77. Information diffusion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on
average
• Of the hits that do succeed, we observe a wide range of
diverse diffusion structures
• It’s difficult to say how something spread given only its
popularity
• “The structural virality of online diffusion”, Anderson, Goel,
Hofman & Watts (Management Science 2015)
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 52 / 53
78. 1. Ask good questions
There’s nothing interesting in the data without them
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53
79. 2. Think before you code
5 minutes at the whiteboard is worth an hour at the keyboard
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53
80. 3. Keep the answers simple
Exploratory data analysis and linear models go a long way
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 53 / 53