Working abstract: This paper solves a persistent methodological problem for social scientists studying the political web: representative sampling. Virtually all existing studies of the political web are based on incomplete samples, and therefore lack generalizability. In this paper, I combine methods from computer science and sampling theory to conduct an automated snowball census of the political web and construct an all-but-complete index of English-language political websites. I check the robustness of this index, use it to generate descriptive statistics for the entire political web, and demonstrate that studies based on ad hoc sampling strategies are likely to be biased in important ways. In future research, this bias can be eliminated by using this index as a sampling universe. In addition, the methods and open-source software presented here can be used to create similar sampling frames for other online content domains.
5. Motivation
The blogosphere is one of the best sources of
political data in all history.
Understanding political bloggers can help us
understand political participation more broadly.
In order to compare “the average blogger” to
“the average citizen,” we need a representative
sample of bloggers.
7. Challenges: scale and sparseness
No complete index of blogs exists,
let alone political blogs
• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political
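To make the sparseness problem concrete, here is a back-of-the-envelope sketch. It assumes, hypothetically, that sites could be drawn uniformly at random (which would itself require the index that does not exist) and uses the 3-in-1,000 figure from this slide. The function name expected_draws is illustrative only.

```python
import math
from fractions import Fraction

# Figure quoted on the slide: about 3 in 1,000 sites are political.
POLITICAL_RATE = Fraction(3, 1000)

def expected_draws(target_political, rate=POLITICAL_RATE):
    """Expected number of uniformly random sites that must be classified
    to find `target_political` political sites (hypothetical helper)."""
    return math.ceil(Fraction(target_political) / rate)

print(expected_draws(1000))  # → 333334
```

Under these assumptions, a sample of 1,000 political sites would require classifying roughly a third of a million sites, almost all of them irrelevant.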
8. Previous research
Sample Types
• Convenience
Examples
● Johnson and Kaye, 2004
● Leskovec, Backstrom and Kleinberg, 2009
Big Data, but no attempt at representativeness
9. Previous research
Sample Types
• Convenience
• Prominence
Examples
• McKenna and Pole, 2008
• Wallsten, 2008
Good data, but only includes popular sites.
11. Previous research
Sample Types
• Convenience
• Prominence
• Snowball
• Over-sample
Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study
Representative sample, but linking to Big Data is hard
12. Methodology
1. Start from a seed batch of political sites.
2. Download and classify each site in the
batch.
3. For political sites, harvest outbound
hyperlinks and add unvisited links to the
next batch.
4. Repeat from step 2 until no new links are
found.
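The four steps above can be sketched as a batch-by-batch loop. This is a minimal illustration, not the SnowCrawl code: the fetch and is_political callables, the toy in-memory "web", and the keyword classifier are all assumptions made so the example runs without network access.

```python
def snowball_census(seeds, fetch, is_political):
    """Batch-wise snowball crawl (illustrative sketch).

    fetch(site) -> (text, outbound_links) and is_political(text) -> bool
    are hypothetical callables supplied by the caller.
    """
    visited, political = set(), set()
    batch = set(seeds)                      # step 1: seed batch
    while batch:                            # step 4: stop when no new links
        visited |= batch
        next_batch = set()
        for site in batch:
            text, links = fetch(site)       # step 2: download ...
            if is_political(text):          # ... and classify
                political.add(site)
                # step 3: harvest links only from political sites
                next_batch.update(l for l in links if l not in visited)
        batch = next_batch
    return political, visited

# Toy stand-in for the web: site -> (page text, outbound links).
TOY_WEB = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("campaign policy congress", ["a.example", "d.example"]),
    "c.example": ("recipes pasta dinner", ["e.example"]),
    "d.example": ("senate bill vote", []),
    "e.example": ("election recipes", []),
}
KEYWORDS = {"election", "senate", "campaign"}  # toy classifier, assumption

def toy_fetch(site):
    return TOY_WEB[site]

def toy_is_political(text):
    return any(word in KEYWORDS for word in text.split())

political, visited = snowball_census(["a.example"], toy_fetch, toy_is_political)
print(sorted(political))  # → ['a.example', 'b.example', 'd.example']
```

In the toy run, e.example is never visited because it is linked only from a non-political site. That is a built-in property of harvesting links from political sites only, and it is one reason the robustness of the resulting index needs checking.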
17. Bag-of-words logit regression
Prob(political) ≈ logit⁻¹(α + βX), where logit⁻¹(z) = 1 / (1 + e^(−z))
X = Vector of word counts
α = Bias term
β = Word weights
1. Hand-code a training sample (n=2,000)
2. Calibrate the classifier on the training sample
3. Hand-code a testing sample (n=200)
4. Evaluate the classifier
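The classifier and the train/test procedure above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the slides do not say how the model was fitted, so plain gradient descent on the log-loss is used here, and the six training texts and two test texts are invented toy data standing in for the hand-coded samples (n=2,000 and n=200).

```python
import math
from collections import Counter

def bag_of_words(text, vocab):
    """X: vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def predict_prob(x, alpha, beta):
    """Prob(political) = inverse logit of (alpha + beta . x)."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, steps=2000, lr=0.1):
    """Fit alpha (bias term) and beta (word weights) by gradient descent.
    The optimizer and its settings are assumptions, not from the slides."""
    alpha, beta = 0.0, [0.0] * len(X[0])
    for _ in range(steps):
        grad_a, grad_b = 0.0, [0.0] * len(beta)
        for x, label in zip(X, y):
            err = predict_prob(x, alpha, beta) - label
            grad_a += err
            for j, xj in enumerate(x):
                grad_b[j] += err * xj
        alpha -= lr * grad_a / len(X)
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad_b)]
    return alpha, beta

# Step 1: toy "hand-coded" training sample (1 = political, 0 = not).
train = [
    ("senate vote election", 1), ("campaign election policy", 1),
    ("congress vote bill", 1), ("pasta recipe dinner", 0),
    ("football score game", 0), ("dinner game recipe", 0),
]
vocab = sorted({w for text, _ in train for w in text.split()})

# Step 2: calibrate the classifier on the training sample.
alpha, beta = fit([bag_of_words(t, vocab) for t, _ in train],
                  [label for _, label in train])

# Steps 3-4: toy held-out test sample and accuracy.
test = [("election vote", 1), ("recipe game", 0)]
hits = sum((predict_prob(bag_of_words(t, vocab), alpha, beta) > 0.5) == bool(label)
           for t, label in test)
accuracy = hits / len(test)
print(accuracy)
```

Keeping the test sample separate from the training sample is what makes the accuracy figure an honest estimate of how the classifier behaves on unseen sites.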
20. Census Results
Implemented in Python: SnowCrawl
Executes in less than 24 hours
1.8 million sites crawled
800,000 political
42% blogs
http://code.google.com/p/snowcrawl
21. Comparison by strata
                      Top 500    Top 5,000   Census
Organization
  Owned by orgs       66.1***    53.1        44.4
  Multiple authors    75.2*      66.7        62.2
  M-updates/day       43.4***    19.4***      6.1
Design
  Advertising         67.3**     57.1        51.2
  Blogroll            57.5*      66.3***     45.1
  Video               48.7***    35.7***     18.3
22. Comparison by strata
                              Top 500    Top 5,000   Census
Polls and public opinion      70.8***    65.3*       52.4
Elections and campaigns       50.4       45.9        51.2
Legislation and law-making    43.4       41.8        43.9
Implementation of policy      38.1       39.8        30.5
Decisions by courts           34.5***    24.5        17.1
Political figures             46.0***    39.8**      24.4
Political parties             38.9***    32.7*       20.7
Philosophical discussion      26.5       29.6        25.6
State and local government    36.3*      38.8**      24.4
Foreign policy                42.5***    38.8***     15.9
International relations       31.9**     33.7**      18.3
23. Where next?
● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop
25. Conclusions
1. Combinations of tools are
much more powerful than
individual tools – share ideas
across disciplines.
2. Sampling matters! With a
little extra effort, we can
sample populations on the
web.
3. Complementary data is the
key for the compSocSci
research agenda.
27. Conclusions
1. Combinations of tools are
much more powerful than
individual tools – share ideas
across disciplines.
2. Sampling matters! With a little
extra effort, we can sample
populations on the web.
3. Complementary, horizontal,
and offline data is key for the
compSocSci research agenda.
28. Thank you!
Questions? Comments?
Abe Gong
Public policy, political science, complex systems
University of Michigan
agong@umich.edu
lowlywonk.blogspot.com
www-personal.umich.edu/~agong