An Automated Snowball Census of the Political Web - JITP 2011

Working abstract: This paper solves a persistent methodological problem for social scientists studying the political web: representative sampling. Virtually all existing studies of the political web are based on incomplete samples and therefore lack generalizability. In this paper, I combine methods from computer science and sampling theory to conduct an automated snowball census of the political web and construct an all-but-complete index of English-language political websites. I check the robustness of this index, use it to generate descriptive statistics for the entire political web, and demonstrate that studies based on ad hoc sampling strategies are likely to be biased in important ways. In future research, this bias can be eliminated by using the index as a sampling universe. In addition, the methods and open-source software presented here can be used to create similar sampling frames for other online content domains.


  1. An Automated Snowball Census of the Political Web
     Abe Gong, University of Michigan
     JITP 2011
  2. Motivation
  3. Motivation
  4. Motivation
  5. Motivation
     The blogosphere is one of the best sources of political data in all history.
     Understanding political bloggers can help us understand political participation more broadly.
     In order to compare “the average blogger” to “the average citizen,” we need a representative sample of bloggers.
  6. Wanted: A sampling frame for all political bloggers
  7. Challenges: scale and sparseness
     No complete index of blogs exists, let alone political blogs
     • 250 million web sites
     • 40 new sites created every minute
     • Only 3 in 1,000 sites are political
  8. Previous research
     Sample types:
     • Convenience
     Examples:
     • Johnson and Kaye, 2004
     • Leskovec, Backstrom, and Kleinberg, 2009
     Big Data, but no attempt at representativeness
  9. Previous research
     Sample types:
     • Convenience
     • Prominence
     Examples:
     • McKenna and Pole, 2008
     • Wallsten, 2008
     Good data, but only includes popular sites
  10. Previous research
      Sample types:
      • Convenience
      • Prominence
      • Snowball
      Examples:
      • Hindman, Tsioutsiouliklis, and Johnson, 2003
      • Karpf, 2008
      Sample properties unclear
  11. Previous research
      Sample types:
      • Convenience
      • Prominence
      • Snowball
      • Over-sample
      Examples:
      • Lenhart and Fox, 2006
      • Schlozman, Verba, and Brady, 2010
      • Lawrence, Sides, and Farrell, 2010
      • Karen's US-IMPACT study
      Representative sample, but linking to Big Data is hard
  12. Methodology
      1. Start from a seed batch of political sites.
      2. Download and classify each site in the batch.
      3. For political sites, harvest outbound hyperlinks and add unvisited links to the next batch.
      4. Repeat from step 2 until no new links are found.
  13. Toy Example
  14. Toy Example
  15. Toy Example
  16. Toy Example
  17. Bag-of-words logit regression
      Prob(political) ≈ logit⁻¹(α + βX)
      X = vector of word counts
      α = bias term
      β = word weights
      1. Hand-code a training sample (n=2,000)
      2. Calibrate the computer
      3. Hand-code a testing sample (n=200)
      4. Evaluate the classifier
  18. Text Classifier Word Cloud
  19. Classifier reliability
      Human-human: 80.9%
      Human-computer: 81.0%
      Krippendorff's alpha: .733
  20. Census Results
      Implemented in Python: SnowCrawl
      • Executes in less than 24 hours
      • 1.8 million sites crawled
      • 800,000 political
      • 42% blogs
      http://code.google.com/p/snowcrawl
  21. Comparison by strata
                            Top 500    Top 5,000   Census
      Organization
        Owned by orgs       66.1***    53.1        44.4
        Multiple authors    75.2*      66.7        62.2
        M-updates/day       43.4***    19.4***     6.1
      Design
        Advertising         67.3**     57.1        51.2
        Blogroll            57.5*      66.3***     45.1
        Video               48.7***    35.7***     18.3
  22. Comparison by strata
                                    Top 500    Top 5,000   Census
      Polls and public opinion      70.8***    65.3*       52.4
      Elections and campaigns       50.4       45.9        51.2
      Legislation and law-making    43.4       41.8        43.9
      Implementation of policy      38.1       39.8        30.5
      Decisions by courts           34.5***    24.5        17.1
      Political figures             46.0***    39.8**      24.4
      Political parties             38.9***    32.7*       20.7
      Philosophical discussion      26.5       29.6        25.6
      State and local government    36.3*      38.8**      24.4
      Foreign policy                42.5***    38.8***     15.9
      International relations       31.9**     33.7**      18.3
  23. Where next?
      • Survey of bloggers
      • Poststratification weighting
      • Network analysis
      • Content analysis of blogs
      • Blog post panel
      • Sentiment analysis/Survey imputation
      • Re-implement in Hadoop
  24. Where next ...? ? ANES ? GSS ? Roxy...?
  25. Conclusions
      1. Combinations of tools are much more powerful than individual tools: share ideas across disciplines.
      2. Sampling matters! With a little extra effort, we can sample populations on the web.
      3. Complementary data is the key for the compSocSci research agenda.
  26. Conclusions
      1. Combinations of tools are much more powerful than individual tools: share ideas across disciplines.
      2. Sampling matters! With a little extra effort, we can sample populations on the web. http://code.google.com/p/snowcrawl
      3. Complementary data is the key for the compSocSci research agenda.
  27. Conclusions
      1. Combinations of tools are much more powerful than individual tools: share ideas across disciplines.
      2. Sampling matters! With a little extra effort, we can sample populations on the web.
      3. Complementary, horizontal, and offline data is key for the compSocSci research agenda.
  28. Thank you! Questions? Comments?
      Abe Gong
      Public policy, political science, complex systems
      University of Michigan
      agong@umich.edu
      lowlywonk.blogspot.com
      Www-personal.umich.edu/~agong
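
The four-step procedure on the Methodology slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the SnowCrawl code: the names `snowball_census`, `fetch`, `is_political`, and `TOY_WEB`, the keyword rule standing in for the text classifier, and the five-site in-memory "web" are all made up for this example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A fetched page is (text, outbound_links). Hypothetical type for this sketch.
Page = Tuple[str, List[str]]


def snowball_census(
    seeds: Iterable[str],
    fetch: Callable[[str], Page],
    is_political: Callable[[str], bool],
) -> Tuple[Set[str], Set[str]]:
    """Batch-wise snowball crawl; returns (political_sites, visited_sites)."""
    visited: Set[str] = set()
    political: Set[str] = set()
    batch: List[str] = list(dict.fromkeys(seeds))  # step 1: de-duplicated seeds
    while batch:  # step 4: stop once a pass yields no new links
        next_batch: List[str] = []
        queued: Set[str] = set()
        for site in batch:
            if site in visited:
                continue
            visited.add(site)
            text, links = fetch(site)  # step 2: download ...
            if not is_political(text):  # ... and classify
                continue
            political.add(site)
            for link in links:  # step 3: harvest links from political sites only
                if link not in visited and link not in queued:
                    queued.add(link)
                    next_batch.append(link)
        batch = next_batch
    return political, visited


# Toy in-memory "web" (assumption): site -> (page text, outbound links).
TOY_WEB: Dict[str, Page] = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("campaign policy vote", ["a.example", "d.example"]),
    "c.example": ("recipes cooking", ["e.example"]),
    "d.example": ("congress legislation", []),
    "e.example": ("election blog", []),
}

# Keyword rule standing in for a trained classifier (assumption).
KEYWORDS = {"election", "senate", "campaign", "congress"}


def toy_is_political(text: str) -> bool:
    return any(word in KEYWORDS for word in text.split())


political, visited = snowball_census(
    ["a.example"], TOY_WEB.__getitem__, toy_is_political
)
print(sorted(political))  # ['a.example', 'b.example', 'd.example']
```

In this toy graph, `e.example` would pass the keyword rule but is linked only from a non-political site, so the crawl never reaches it. A snowball census can only find sites reachable through chains of political sites starting from the seed batch.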

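
The bag-of-words classifier slide describes a logistic model over word counts, trained on one hand-coded sample and evaluated on another. The sketch below shows that workflow in plain Python. It is an illustration only: the slides do not say how the model was fitted, so the stochastic gradient ascent used here, the function names, and the tiny hand-labelled samples are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (page text, label): 1 = political, 0 = not


def word_counts(text: str) -> Counter:
    """X: the vector of word counts for one page."""
    return Counter(text.lower().split())


def predict_prob(alpha: float, beta: Dict[str, float], text: str) -> float:
    """Prob(political) = inverse logit of alpha + beta . X."""
    z = alpha + sum(beta.get(w, 0.0) * n for w, n in word_counts(text).items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(
    examples: List[Example], epochs: int = 200, lr: float = 0.1
) -> Tuple[float, Dict[str, float]]:
    """Fit alpha (bias) and beta (word weights) by stochastic gradient ascent."""
    alpha = 0.0
    beta: Dict[str, float] = {}
    for _ in range(epochs):
        for text, label in examples:
            error = label - predict_prob(alpha, beta, text)
            alpha += lr * error
            for w, n in word_counts(text).items():
                beta[w] = beta.get(w, 0.0) + lr * error * n
    return alpha, beta


def accuracy(alpha: float, beta: Dict[str, float], examples: List[Example]) -> float:
    """Share of examples where the thresholded prediction matches the label."""
    hits = sum(
        (predict_prob(alpha, beta, text) >= 0.5) == bool(label)
        for text, label in examples
    )
    return hits / len(examples)


# Tiny made-up samples standing in for the hand-coded training and testing sets.
TRAIN: List[Example] = [
    ("senate election vote", 1),
    ("campaign policy congress", 1),
    ("cooking recipes dinner", 0),
    ("football scores league", 0),
]
TEST: List[Example] = [
    ("election campaign vote", 1),
    ("dinner recipes league", 0),
]

alpha, beta = train(TRAIN)
print("test accuracy:", accuracy(alpha, beta, TEST))
```

Keeping the testing sample separate from the training sample, as on the slide, means the reported accuracy reflects pages the model has not seen.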