An Automated Snowball Census of the Political Web - JITP 2011

An Automated Snowball Census
of the Political Web

Abe Gong
University of Michigan
JITP 2011

Motivation

The blogosphere is one of the best sources of
political data in all history.

Understanding political bloggers can help us
understand political participation more broadly.

In order to compare “the average blogger” to
“the average citizen,” we need a representative
sample of bloggers.

Wanted: A sampling frame for
all political bloggers

Challenges: scale and sparseness

No complete index of blogs exists,
let alone political blogs
• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political

Previous research

Sample Types
• Convenience

Examples
● Johnson and Kaye, 2004
● Leskovec, Backstrom, and Kleinberg, 2009

Big Data, but no attempt
at representativeness

Previous research

Sample Types
• Convenience
• Prominence

Examples
• McKenna and Pole, 2008
• Wallsten, 2008

Good data, but
only includes
popular sites.

Previous research

Sample Types
• Convenience
• Prominence
• Snowball

Examples
• Hindman, Tsioutsiouliklis, and Johnson, 2003
• Karpf, 2008

Sample properties
unclear

Previous research

Sample Types
• Convenience
• Prominence
• Snowball
• Over-sample

Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study

Representative sample, but
linking to Big Data is hard

Methodology

1. Start from a seed batch of political sites.
2. Download and classify each site in the
batch.
3. For political sites, harvest outbound
hyperlinks and add unvisited links to the
next batch.
4. Repeat from step 2 until no new links are
found.
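
The four steps above can be sketched as a batch-wise loop. This is a minimal illustration, not SnowCrawl's actual code: the function names (`snowball_census`, `toy_fetch`, `toy_is_political`, `toy_extract_links`) and the in-memory `TOY_WEB` are hypothetical stand-ins for real downloading, classification, and link extraction.

```python
def snowball_census(seeds, fetch, is_political, extract_links):
    """Crawl in batches, expanding only through sites classified as political."""
    visited = set()
    political = set()
    batch = set(seeds)                      # step 1: seed batch
    while batch:                            # step 4: stop when no new links are found
        visited |= batch
        next_batch = set()
        for url in sorted(batch):
            page = fetch(url)               # step 2: download ...
            if page is None or not is_political(page):  # ... and classify
                continue
            political.add(url)
            for link in extract_links(page):            # step 3: harvest outbound links
                if link not in visited:
                    next_batch.add(link)
        batch = next_batch
    return political, visited


# Hypothetical five-site "web": url -> (page text, outbound links).
TOY_WEB = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("campaign policy congress", ["a.example", "d.example"]),
    "c.example": ("recipes and gardening", ["e.example"]),
    "d.example": ("senate hearing on policy", []),
    "e.example": ("election night coverage", []),
}
POLITICAL_WORDS = {"election", "senate", "campaign", "policy"}


def toy_fetch(url):
    return TOY_WEB.get(url)


def toy_is_political(page):
    text, _links = page
    return any(word in POLITICAL_WORDS for word in text.split())


def toy_extract_links(page):
    _text, links = page
    return links


political, visited = snowball_census(
    ["a.example"], toy_fetch, toy_is_political, toy_extract_links
)
```

In this toy run, `e.example` is never visited because the only site linking to it is classified as non-political. Links are followed only from political sites, which is what keeps the crawl from spreading across the whole web.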

Bag-of-words logit regression

Prob(political) ≈ logit⁻¹(α + βX) = 1 / (1 + exp(−(α + βX)))
X = Vector of word counts
α = Bias term
β = Word weights
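
As a rough sketch of how such a model scores one page, the snippet below reads the formula as the logistic (inverse-logit) function applied to α plus the weighted word counts. The weights in `BETA`, the bias `ALPHA`, and the whitespace tokenizer are made-up illustrations, not the fitted values or preprocessing from the study.

```python
import math
from collections import Counter


def word_counts(text):
    """X: bag-of-words counts, using a simple lowercase whitespace split."""
    return Counter(text.lower().split())


def prob_political(text, alpha, beta):
    """Logistic function of alpha + sum over words of (weight * count)."""
    counts = word_counts(text)
    z = alpha + sum(beta.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for illustration only.
ALPHA = -1.0
BETA = {"election": 1.5, "senate": 1.2, "recipe": -0.8}

p = prob_political("Senate election results", ALPHA, BETA)
p_empty = prob_political("", ALPHA, BETA)
```

Words without an entry in `BETA` contribute nothing, so an empty page falls back to the probability implied by the bias term alone.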

1. Hand-code a training sample (n=2,000)
2. Calibrate the computer
3. Hand-code a testing sample (n=200)
4. Evaluate the classifier
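
The calibrate-then-evaluate workflow might look like the sketch below. The slides do not say how the model was fitted, so plain batch gradient ascent on the log-likelihood is an assumption here, and the tiny hand-labelled documents are invented stand-ins for the n=2,000 training and n=200 testing samples.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    return Counter(text.lower().split())


def predict_prob(x, alpha, beta):
    z = alpha + sum(beta.get(word, 0.0) * n for word, n in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def calibrate(docs, labels, epochs=200, rate=0.5):
    """Fit alpha and beta by batch gradient ascent (an assumed fitting method)."""
    xs = [featurize(doc) for doc in docs]
    alpha, beta = 0.0, {}
    for _ in range(epochs):
        grad_alpha, grad_beta = 0.0, defaultdict(float)
        for x, y in zip(xs, labels):
            err = y - predict_prob(x, alpha, beta)   # residual for this document
            grad_alpha += err
            for word, n in x.items():
                grad_beta[word] += err * n
        alpha += rate * grad_alpha / len(xs)
        for word, g in grad_beta.items():
            beta[word] = beta.get(word, 0.0) + rate * g / len(xs)
    return alpha, beta


def accuracy(docs, labels, alpha, beta):
    """Share of documents where the thresholded prediction matches the hand code."""
    hits = sum(
        (predict_prob(featurize(doc), alpha, beta) >= 0.5) == bool(y)
        for doc, y in zip(docs, labels)
    )
    return hits / len(docs)


# Invented hand-coded samples: 1 = political, 0 = not political.
train_docs = [
    "election senate vote", "senate policy debate", "campaign election policy",
    "recipe garden soup", "garden soccer match", "score recipe soccer",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["senate election debate", "soup soccer garden",
             "vote on new policy", "match score today"]
test_labels = [1, 0, 1, 0]

alpha, beta = calibrate(train_docs, train_labels)
test_accuracy = accuracy(test_docs, test_labels, alpha, beta)
```

Keeping the testing sample separate from the training sample matters: accuracy measured on the documents used for calibration would overstate how well the classifier generalizes.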

Classifier reliability

Human-human: 80.9%
Human-computer: 81.0%

Krippendorff's Alpha: .733
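
For readers unfamiliar with these reliability measures, here is a small sketch of percent agreement and Krippendorff's alpha for two coders, nominal categories, and no missing codes. The two code lists are invented and do not reproduce the figures above; the slides also do not state which pair of coders or which coding scheme the alpha refers to.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(a, b):
    """Share of units on which the two coders assigned the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(a, b):
    """Alpha = 1 - D_o / D_e for two coders, nominal data, no missing values."""
    n = 2 * len(a)                               # total number of codes assigned
    value_counts = Counter(a) + Counter(b)       # how often each category was used
    # Each disagreeing unit adds two off-diagonal entries to the coincidence matrix.
    d_o = 2 * sum(x != y for x, y in zip(a, b)) / n
    d_e = sum(
        value_counts[c] * value_counts[k]
        for c, k in permutations(value_counts, 2)
    ) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - d_o / d_e


# Invented codes for ten sites: 1 = political, 0 = not political.
coder_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
coder_b = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

agreement = percent_agreement(coder_a, coder_b)
alpha = krippendorff_alpha_nominal(coder_a, coder_b)
```

Alpha is lower than raw agreement because it discounts the agreement two coders would reach by chance given how often each category is used.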

Census Results

Implemented in Python: SnowCrawl
Executes in less than 24 hours
1.8 million sites crawled
800,000 political
42% blogs

http://code.google.com/p/snowcrawl

Comparison by strata

                     Top 500    Top 5,000   Census
Organization
  Owned by orgs      66.1***    53.1        44.4
  Multiple authors   75.2*      66.7        62.2
  M-updates/day      43.4***    19.4***     6.1

Design
  Advertising        67.3**     57.1        51.2
  Blogroll           57.5*      66.3***     45.1
  Video              48.7***    35.7***     18.3

Comparison by strata

                             Top 500    Top 5,000   Census
Polls and public opinion     70.8***    65.3*       52.4
Elections and campaigns      50.4       45.9        51.2
Legislation and law-making   43.4       41.8        43.9
Implementation of policy     38.1       39.8        30.5
Decisions by courts          34.5***    24.5        17.1
Political figures            46.0***    39.8**      24.4
Political parties            38.9***    32.7*       20.7
Philosophical discussion     26.5       29.6        25.6
State and local government   36.3*      38.8**      24.4
Foreign policy               42.5***    38.8***     15.9
International relations      31.9**     33.7**      18.3

Where next?

● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop

Where next ...?

[Diagram labels: ANES, GSS, Roxy...?]

Conclusions

1. Combinations of tools are
much more powerful than
individual tools – share ideas
across disciplines.

2. Sampling matters! With a
little extra effort, we can
sample populations on the
web.

3. Complementary data is the
key for the compSocSci
research agenda.

Conclusions

1. Combinations of tools are
much more powerful than
individual tools – share ideas
across disciplines.

2. Sampling matters! With a little
extra effort, we can sample
populations on the web.

3. Complementary, horizontal,
and offline data is key for the
compSocSci research agenda.

Thank you!

Questions? Comments?

Abe Gong
Public policy, political science, complex systems
University of Michigan
agong@umich.edu
lowlywonk.blogspot.com
www-personal.umich.edu/~agong