An Automated Snowball Census of the Political Web

Abe Gong
University of Michigan
JITP 2011
Motivation

The blogosphere is one of the best sources of
political data in all history.


Understanding political bloggers can help us
understand political participation more broadly.


In order to compare “the average blogger” to
“the average citizen,” we need a representative
sample of bloggers.
Wanted: A sampling frame for
all political bloggers
Challenges: scale and sparseness

No complete index of blogs exists, let alone political blogs
• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political
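
The arithmetic behind the sparseness problem can be sketched in a few lines. This is only an illustration built from the two figures on the slide (250 million sites, 3 in 1,000 political); the target of 1,000 political sites is a hypothetical number chosen for the example, and simple random sampling of sites is an assumption.

```python
import math
from fractions import Fraction

# Figures taken from the slide.
TOTAL_SITES = 250_000_000
POLITICAL_RATE = Fraction(3, 1000)

# Implied number of political sites on the web.
expected_political_sites = int(TOTAL_SITES * POLITICAL_RATE)


def expected_draws(target_political, rate=POLITICAL_RATE):
    """Expected number of randomly drawn sites needed to find
    `target_political` political ones, assuming simple random sampling."""
    return math.ceil(target_political / rate)


# Hypothetical target: a sample of 1,000 political sites.
draws_for_1000 = expected_draws(1000)

print(expected_political_sites)
print(draws_for_1000)
```

Under these assumptions, a random draw would have to download and classify hundreds of non-political sites for every political one it finds, which is why a link-following approach is attractive.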
Previous research

Sample types
• Convenience

Examples
• Johnson and Kaye, 2004
• Leskovec, Backstrom, and Kleinberg, 2009

Big Data, but no attempt at representativeness
Previous research

Sample types
• Convenience
• Prominence

Examples
• McKenna and Pole, 2008
• Wallsten, 2008

Good data, but only includes popular sites.
Previous research

Sample types
• Convenience
• Prominence
• Snowball

Examples
• Hindman, Tsioutsiouliklis, and Johnson, 2003
• Karpf, 2008

Sample properties unclear
Previous research

Sample types
• Convenience
• Prominence
• Snowball
• Over-sample

Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study

Representative sample, but linking to Big Data is hard
Methodology

1. Start from a seed batch of political sites.
2. Download and classify each site in the batch.
3. For political sites, harvest outbound hyperlinks and add unvisited links to the next batch.
4. Repeat from step 2 until no new links are found.
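
The four steps above can be sketched as a batch-by-batch loop. This is a minimal illustration, not the SnowCrawl code: the in-memory `TOY_WEB`, the keyword-based `is_political` stand-in, and all function names are assumptions made for the example, so that it runs without network access.

```python
# Hypothetical in-memory "web": each site has page text and outbound links.
TOY_WEB = {
    "a.example": {"text": "election senate vote", "links": ["b.example", "c.example"]},
    "b.example": {"text": "congress policy vote", "links": ["a.example", "d.example"]},
    "c.example": {"text": "recipes and gardening", "links": ["e.example"]},
    "d.example": {"text": "campaign election", "links": []},
    "e.example": {"text": "senate election", "links": []},
}

POLITICAL_WORDS = {"election", "senate", "vote", "congress", "campaign", "policy"}


def fetch(url):
    """Stand-in for downloading a site; returns None if it is unreachable."""
    return TOY_WEB.get(url)


def is_political(page):
    """Stand-in classifier: any political keyword marks the page political."""
    return any(word in POLITICAL_WORDS for word in page["text"].split())


def snowball_census(seeds):
    visited, political = set(), set()
    batch = set(seeds)                      # step 1: seed batch
    while batch:                            # step 4: stop when no new links
        visited |= batch
        next_batch = set()
        for url in sorted(batch):           # step 2: download and classify
            page = fetch(url)
            if page is None or not is_political(page):
                continue
            political.add(url)
            for link in page["links"]:      # step 3: harvest unvisited links
                if link not in visited:
                    next_batch.add(link)
        batch = next_batch
    return visited, political


visited, political = snowball_census(["a.example"])
print(sorted(visited))
print(sorted(political))
```

In this toy web, `e.example` is never visited because it is linked only from a non-political site. That is a property of the procedure as described: links are harvested from political sites only.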
Toy Example
Bag-of-words logit regression

Prob(political) ≈ logit⁻¹(α + βX)
  X = Vector of word counts
  α = Bias term
  β = Word weights


1. Hand-code a training sample (n=2,000)
2. Calibrate the computer
3. Hand-code a testing sample (n=200)
4. Evaluate the classifier
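
A bag-of-words classifier of this kind can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the classifier used in the study: the four training documents, the plain gradient-descent fit, and the function names are all hypothetical. The logistic (inverse-logit) function maps α + βX to a probability.

```python
import math
from collections import Counter

# Hypothetical hand-coded training sample: (text, 1 if political else 0).
TRAIN = [
    ("senate vote on election bill", 1),
    ("campaign election poll", 1),
    ("chocolate cake recipe", 0),
    ("garden tomato recipe", 0),
]

# Vocabulary: every word seen in training, in a fixed order.
VOCAB = sorted({w for text, _ in TRAIN for w in text.split()})


def featurize(text):
    """X: vector of word counts over the training vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(alpha, beta, x):
    """Prob(political) for one count vector x."""
    return inv_logit(alpha + sum(b * xi for b, xi in zip(beta, x)))


def fit(X, y, steps=500, lr=0.5):
    """Fit alpha (bias) and beta (word weights) by gradient descent
    on the average log-loss."""
    alpha, beta = 0.0, [0.0] * len(X[0])
    for _ in range(steps):
        grad_a, grad_b = 0.0, [0.0] * len(beta)
        for x, label in zip(X, y):
            err = predict(alpha, beta, x) - label
            grad_a += err
            for j, xj in enumerate(x):
                grad_b[j] += err * xj
        alpha -= lr * grad_a / len(X)
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad_b)]
    return alpha, beta


X = [featurize(text) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
alpha, beta = fit(X, y)

p_political = predict(alpha, beta, featurize("election vote"))
p_other = predict(alpha, beta, featurize("cake recipe"))
print(round(p_political, 3), round(p_other, 3))
```

With a real training sample of the size quoted on the slide, a regularized fit from an established library would be the more sensible choice; the loop here only shows what α, β, and X do.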
Text Classifier Word Cloud
Classifier reliability



    Human-human:         80.9%
    Human-computer: 81.0%


    Krippendorff's Alpha: .733
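
For readers unfamiliar with these reliability measures, the sketch below computes percent agreement and Krippendorff's alpha for two coders and nominal labels with no missing codes. The label lists are made-up illustrations, not the study's data, and the function names are hypothetical; nothing here reproduces the figures on the slide.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(a, b):
    """Share of units on which the two coders gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two coders, nominal data, no missing codes."""
    # Coincidence counts: each unit contributes both ordered pairs of labels.
    coincidences = Counter()
    for x, y in zip(a, b):
        coincidences[(x, y)] += 1
        coincidences[(y, x)] += 1
    # Marginal totals per label and the total number of coded values.
    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical codes: 1 = political, 0 = not political.
human = [1, 1, 0, 0, 1, 0, 1, 0]
computer = [1, 1, 0, 0, 1, 0, 0, 1]

agreement = percent_agreement(human, computer)
alpha_hc = krippendorff_alpha_nominal(human, computer)
print(agreement, alpha_hc)
```

Alpha corrects raw agreement for the agreement expected by chance, so it is lower than percent agreement whenever the coders disagree at all.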
Census Results

Implemented in Python: SnowCrawl
• Executes in less than 24 hours
• 1.8 million sites crawled
• 800,000 political
• 42% blogs

http://code.google.com/p/snowcrawl
Comparison by strata

                   Top 500   Top 5,000   Census
Organization
Owned by orgs      66.1***   53.1        44.4
Multiple authors   75.2*     66.7        62.2
M-updates/day      43.4***   19.4***      6.1

Design
Advertising        67.3**    57.1        51.2
Blogroll           57.5*     66.3***     45.1
Video              48.7***   35.7***     18.3
Comparison by strata


                             Top 500   Top 5,000   Census
Polls and public opinion     70.8***   65.3*       52.4
Elections and campaigns      50.4      45.9        51.2
Legislation and law-making   43.4      41.8        43.9
Implementation of policy     38.1      39.8        30.5
Decisions by courts          34.5***   24.5        17.1
Political figures            46.0***   39.8**      24.4
Political parties            38.9***   32.7*       20.7
Philosophical discussion     26.5      29.6        25.6
State and local government   36.3*     38.8**      24.4
Foreign policy               42.5***   38.8***     15.9
International relations      31.9**    33.7**      18.3
Where next?

● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop
Where next ...?

[Figure: labels "ANES", "GSS", and "Roxy...?" with question marks]
Conclusions

1. Combinations of tools are
   much more powerful than
   individual tools – share ideas
   across disciplines.


2. Sampling matters! With a
   little extra effort, we can
   sample populations on the
   web.


3. Complementary data is the
   key for the compSocSci
   research agenda.
Conclusions

1. Combinations of tools are
   much more powerful than
   individual tools – share ideas
   across disciplines.



2. Sampling matters! With a little
   extra effort, we can sample
   populations on the web.



3. Complementary, horizontal,
   and offline data is key for the
   compSocSci research agenda.
Thank you!

Questions? Comments?

Abe Gong
Public policy, political science, complex systems
University of Michigan
agong@umich.edu
lowlywonk.blogspot.com
www-personal.umich.edu/~agong