MIT 2017 - Measuring Partisan Conflict

How Data Labs measured partisan conflict using machine learning

  1. 1. Measuring Partisan Conflict in Congress Using Unstructured Data and Machine Learning Patrick van Kessel Data Science Associate, Data Labs @pvankessel
  2. 2. Who am I and why am I here? (December 7, 2019, www.pewresearch.org) • Currently at Pew Research Center, doing data- and software-driven research • Previously: • Undergrad @ UT: government, economics; business, philosophy • Grad @ UChicago: social science research methods, content analysis • IT Project Services at NORC: helped introduce and institutionalize data science • Programming as a lifelong hobby
  3. 3. What is Data Labs? A new group at Pew Research Center dedicated to using data science to inform decision makers and the public
  4. 4. What do we study? • Lots of things • First project: political communication • Specifically, elites: members of Congress
  5. 5. Polarization in Congress is increasing (chart: DW-NOMINATE, House)
  6. 6. Public antipathy is too
  7. 7. Is communication the connection? • Many suspect that how members communicate with the public may mediate these two trends (Converse, Campbell, Miller, Stokes, Lenz, etc.) • Potential feedback loop: politician rhetoric; constituent opinions; voting and activism; different politicians / incentives
  8. 8. What Politicians Say Matters • September 2009: “There are also those who claim that our reform effort will insure illegal immigrants. This, too, is false – the reforms I’m proposing would not apply to those who are here illegally.”
  9. 9. What Politicians Say Matters • “YOU LIE!” - Rep. Joe Wilson, R-S.C.
  10. 10. And it can get them attention (especially when critical)
  11. 11. Can we study it? • Can we study congressional communications systematically? • What are politicians saying, and how are they saying it? • What catches the public’s (or media’s) attention? • How does rhetoric align with politician behavior and public opinion?
  12. 12. How we measured Congressional rhetoric • Collected 200,000+ press releases and Facebook posts (114th Congress, Jan 2015 – Apr 2016) • Coded 7,000+ documents for criticism, anger, and bipartisanship • Trained machine learning models to produce document-level predictions and legislator-level estimates • Joined with other data: ideological behavior, district demographics • Measured reactions on Facebook: likes, comments, shares
  17. 17. WHAT WE FOUND (www.pewproject.org)
  18. 18. Politicians aren’t uniformly critical • Presidential out-party more indignant • Leadership more indignant • Charts: average % of members’ communications containing or mentioning…; average % of press releases containing indignant disagreement
  19. 19. Rhetoric aligns with constituencies • Safer seats more indignant • Competitive seats more bipartisan • Charts: average % of press releases containing disagreement; average % of press releases containing bipartisanship
  20. 20. Rhetoric also aligns with behavior / ideology • More ideological/partisan members more indignant • Moderates are more bipartisan • Charts: average % of press releases containing disagreement; average % of press releases containing bipartisanship
  21. 21. Messaging can also vary by medium • Bipartisanship more common in press releases than Facebook posts • Ideological caucuses lean towards Facebook over press releases • Charts: average % of members’ communications containing or mentioning…; average share of outreach consisting of Facebook posts compared with press releases
  22. 22. Criticism gets more attention on Facebook (regardless of members’ popularity) • Chart: estimated Facebook engagement for a post containing…
  23. 23. So what does this mean? • So… people are paying attention to negative messaging on Facebook • What does this attention mean? Should we care? • Does this impact activism or voting behavior? • Is this impacting public polarization? • Does attention translate to other venues?
  24. 24. So what does this mean… if I don’t care about politics? • Major data collection, cleaning, and analysis effort • Lots of challenges that you may encounter in a variety of research domains • Universally useful tools and methods that can help
  25. 25. HOW WE DID IT (AND WHAT WAS DIFFICULT)
  26. 26. GETTING STARTED
  27. 27. Getting Started • Challenge: No existing infrastructure or code • We need: computing environments, databases, file storage, analysis tools, code repositories, dev tools
  28. 28. Getting Started • Challenge: No existing infrastructure or code • We need: computing environments, databases, file storage, analysis tools, code repositories, dev tools • Solution: Cloud computing / open-source
  29. 29. Tip #1: CLOUD COMPUTING IS GREAT (IF YOU’RE STARTING FROM SCRATCH)
  30. 30. Getting Started • Challenge: No existing infrastructure or code • Solution: Cloud computing / open-source • Computing environments / databases / file storage: Amazon EC2, Amazon RDS, Amazon S3
  31. 31. Getting Started • Challenge: No existing infrastructure or code • Solution: Cloud computing / open-source • Analysis tools: Jupyter
  32. 32. Tip #2: R IS GREAT FOR ANALYSIS; PYTHON IS GREAT FOR EVERYTHING
  33. 33. Tip #3: JUPYTER CAN DO BOTH; IT’S GREAT FOR EXPLORATORY ANALYSIS
  34. 34. Getting Started • Challenge: No existing infrastructure or code • Solution: Cloud computing / open-source • Code repositories: GitHub
  35. 35. Tip #4: ALWAYS TRACK YOUR CODE IN GITHUB
  36. 36. Getting Started • Challenge: No existing infrastructure or code • Solution: Cloud computing / open-source • Dev tools: PyCharm, RStudio
  37. 37. Getting Started • Challenge: How do I organize all my data? • Need to link together lots of different data sources • Want some structure, but need to be able to modify it easily • Organize scripts and recurring tasks • Logging, historical records • Might need a user interface
  38. 38. Getting Started • Challenge: How do I organize all my data? • Solution: Big data framework • NoSQL/MongoDB • Hadoop/HDFS/Hive • Proprietary database (Vertica)
  40. 40. Tip #5: YOU PROBABLY DON’T NEED “BIG DATA” TOOLS
  41. 41. Getting Started • Challenge: How do I organize all my data? • Solution: Big data framework • You probably don’t need this stuff • Relational databases are easier to work with, and force you to stay organized
  42. 42. Getting Started • Challenge: How do I organize all my data? • Solution: PostgreSQL + Django • Why Postgres? • Scales beautifully • Faster than MongoDB with support for JSON and more • Super robust
  43. 43. Tip #6: USE A RELATIONAL DATABASE (TRY POSTGRES)
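
To make the relational-database tip concrete, here is a minimal sketch of linking documents to members with a foreign key. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs without a server. The table and column names (member, press_release, member_id) and the sample rows are hypothetical, not the Data Labs schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Postgres (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE member (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    party TEXT NOT NULL
);
CREATE TABLE press_release (
    id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES member(id),
    source TEXT NOT NULL,
    body TEXT NOT NULL
);
""")

# Hypothetical rows for illustration only.
conn.executemany("INSERT INTO member (id, name, party) VALUES (?, ?, ?)",
                 [(1, "Jane Doe", "D"), (2, "John Roe", "R")])
conn.executemany("INSERT INTO press_release (member_id, source, body) VALUES (?, ?, ?)",
                 [(1, "lexisnexis", "Statement on the budget."),
                  (1, "website", "Statement on veterans."),
                  (2, "website", "Statement on trade.")])

# A join answers "how many releases per member?" in one query.
releases_per_member = conn.execute("""
    SELECT m.name, COUNT(p.id)
    FROM member AS m
    LEFT JOIN press_release AS p ON p.member_id = m.id
    GROUP BY m.id
    ORDER BY m.id
""").fetchall()

def insert_orphan_release():
    """Try to insert a release for a member that does not exist."""
    try:
        conn.execute("INSERT INTO press_release (member_id, source, body) "
                     "VALUES (99, 'website', 'Orphan row')")
        return True
    except sqlite3.IntegrityError:
        return False  # the schema rejected the inconsistent row
```

The foreign key is what "forces you to stay organized": a release that points at an unknown member is rejected at insert time instead of surfacing later as an unexplained gap in the analysis.
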
  44. 44. Getting Started • Challenge: How do I organize all my data? • Solution: PostgreSQL + Django • Why Django? • Designed for websites, but great for data science • Define tables and relationships with code • Management commands • TONS of modular plugins • Task scheduling • Historical records • Easy to make a user interface
  45. 45. Getting Started • Challenge: How do I organize all my data? • Solution: PostgreSQL + Django
  46. 46. Tip #7: TRY DJANGO
  47. 47. DATA COLLECTION
  48. 48. Data Collection: CONGRESS MEMBERS
  49. 49. Data Collection • Challenge: Need info on members of Congress • Names • Party affiliation • Terms of office • Committee and caucus memberships • Voting records
  50. 50. Data Collection • Challenge: Need info on members of Congress • Solution: Google it! • @unitedstates GitHub • GovTrack • Sunlight • GPO • VoteView (DW-NOMINATE)
  51. 51. Data Collection: PRESS RELEASES
  52. 52. Data Collection • Challenge: How do we get press releases? • No central repository for political press releases
  53. 53. Data Collection • Challenge: How do we get press releases? • No central repository for political press releases • Solution: LexisNexis API • Web Services Kit (WSK) • SOAP API • XML results • Aggregate wire service feeds • Search query: (“office of” W/s (senator OR sen OR rep OR representative) W/s (released OR issued) W/s following W/s (statement OR release)) OR “U.S. SENATE DOCUMENTS” OR “U.S. HOUSE OF REPRESENTATIVES DOCUMENTS” OR (“PRESS RELEASE” AND “Congressional Press Releases”)
  55. 55. Data Collection • Challenge: How do we link press releases to members of Congress? • LexisNexis doesn’t parse out members of Congress
  56. 56. Data Collection • Challenge: How do we link press releases to members of Congress? • Solution: Fuzzy matching and regular expressions • Manually identify missing nicknames • Regular expressions to find names, e.g. “The office of U.S. Rep. <FIRST_NAME> <LAST_NAME> issued the following statement:” • Fuzzy matching (Levenshtein distance) to match to names in the database • Review samples of press releases to identify misattributions • Always spot-check your data!
  57. 57. Tip #8: FUZZY MATCHING ISN’T PERFECT - SPOT-CHECK IT
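
The regex-plus-fuzzy-matching step can be sketched roughly as follows. This is an illustration under stated assumptions, not the Data Labs code: the pattern is modeled on the attribution line quoted in the slides, the roster names are made up, and the `max_distance` cutoff of 2 is an arbitrary choice that would need tuning against spot-checked samples.

```python
import re

# Pattern modeled on the attribution line quoted on the slide.
ATTRIBUTION = re.compile(
    r"office of (?:U\.S\. )?(?:Rep|Sen|Representative|Senator)\.? "
    r"(?P<name>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+) (?:issued|released)"
)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def match_member(text, roster, max_distance=2):
    """Return the closest roster name, or None if nothing is close enough."""
    found = ATTRIBUTION.search(text)
    if found is None:
        return None
    candidate = found.group("name").lower()
    best = min(roster, key=lambda name: levenshtein(candidate, name.lower()))
    if levenshtein(candidate, best.lower()) > max_distance:
        return None  # leave for manual review rather than guess
    return best

ROSTER = ["Jane Doe", "John Roe"]  # hypothetical names
```

Returning None when no roster name is close enough keeps uncertain cases out of the data and routes them to manual review, which is where the spot-checking advice comes in.
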
  58. 58. Data Collection • Challenge: LexisNexis may not have everything • No clue how the wire services get their releases; they may pick and choose • Our Boolean search may not have grabbed everything • Can we supplement from other sources?
  59. 59. Data Collection • Challenge: LexisNexis may not have everything • Solution: Web scraping • Official websites have “news release” sections • Write web scrapers (Scrapy)
  60. 60. Data Collection • Challenge: All of the websites have different formats • Site organization and pagination can be different • Different HTML tags used for article titles, dates, etc.
  61. 61. Data Collection • Challenge: All of the websites have different formats • Solution: XPath and outsourcing • Use XPath to define how to find press release elements on each website • Hire a freelance programmer on Upwork
  62. 62. Tip #9: XPATH IS GREAT FOR WEB SCRAPING
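
One way to picture "XPath per website" is a small table of rules, one entry per site. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup. The deck names Scrapy as the scraping tool, so treat this as a stand-in. The site keys, rules, and page snippets are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: each site gets its own XPath expressions.
SITE_RULES = {
    "site_a": {"item": ".//div[@class='release']", "title": "h2",
               "date": "span[@class='date']"},
    "site_b": {"item": ".//li[@class='news']", "title": "a", "date": "em"},
}

# Tiny well-formed pages standing in for two differently structured websites.
PAGES = {
    "site_a": "<html><body><div class='release'><h2>Budget statement</h2>"
              "<span class='date'>2015-03-01</span></div></body></html>",
    "site_b": "<html><body><ul><li class='news'><a href='/1'>Trade statement</a>"
              "<em>2016-01-15</em></li></ul></body></html>",
}

def extract_releases(site, markup):
    """Apply one site's XPath rules and return a list of {title, date} dicts."""
    rules = SITE_RULES[site]
    root = ET.fromstring(markup)
    releases = []
    for item in root.findall(rules["item"]):
        releases.append({
            "title": item.findtext(rules["title"]),
            "date": item.findtext(rules["date"]),  # None when the field is missing
        })
    return releases
```

Keeping the rules in data rather than in code means a new website only needs a new dictionary entry, and a missing field shows up as None instead of crashing the scraper.
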
  63. 63. Tip #10: BE A KIND WEB SCRAPER - CHECK ROBOTS.TXT… AND DON’T DDOS YOUR DATA SOURCES
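
A minimal sketch of what checking robots.txt and throttling requests might look like, using the standard library's urllib.robotparser. The robots.txt content, user agent name, and URLs are hypothetical. A real crawler would download the file from each site (for example with `RobotFileParser.set_url` and `read`), and the 1-second fallback delay is an assumption.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-research-bot"  # hypothetical user agent name

def allowed(url):
    """True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)

def polite_delay(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default

def crawl(urls, fetch, sleep=time.sleep):
    """Fetch only permitted URLs, pausing between requests."""
    results = {}
    for url in urls:
        if not allowed(url):
            continue
        results[url] = fetch(url)
        sleep(polite_delay())
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the politeness logic separate from the HTTP client, so it can be exercised without touching the network.
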
  64. 64. Data Collection • Challenge: The programmer makes a ton of mistakes • Mixed up politicians with the same last name • Missing fields (dates) for some websites
  65. 65. Data Collection • Challenge: The programmer makes a ton of mistakes • Solution: Manually fix everything; be more careful next time • Sample and review press releases for each politician • Manually fix mistakes • Check external work more closely next time
  66. 66. Tip #11: ALWAYS SPOT-CHECK OUTSIDE WORK
  67. 67. Data Collection: SOCIAL MEDIA DATA
  68. 68. Data Collection • Challenge: How do we find members’ social media accounts? • Multiple official/unofficial accounts • Can’t rely on their websites • Open-source lists are missing data
  69. 69. Data Collection • Challenge: How do we find members’ social media accounts? • Solution: Open-source lists + Mechanical Turk • Ask 5 people to hunt down the accounts for each politician • Merge with third-party lists
  70. 70. Tip #12: HOORAY, MECHANICAL TURK!
  71. 71. Data Collection • Challenge: How do we remove bad accounts? • Open-source lists also have errors • As with Upworkers, Mechanical Turkers make mistakes
  72. 72. Data Collection • Challenge: How do we remove bad accounts? • Solution: Check the outliers • Manually review and verify one-off accounts
  73. 73. Tip #12: ALWAYS SPOT-CHECK OUTSIDE WORK
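
"Check the outliers" could be implemented along these lines: count how many workers reported each account and send the one-off accounts to manual review. The handles and the `min_votes=2` cutoff are hypothetical. The slides say only that one-off accounts were reviewed by hand.

```python
from collections import Counter

def consolidate_accounts(submissions, min_votes=2):
    """Split worker-submitted handles into accepted and needs-review lists.

    `submissions` is a list of handle lists, one list per worker.
    """
    votes = Counter()
    for handles in submissions:
        # Normalize and de-duplicate within a worker so nobody votes twice.
        votes.update({h.strip().lower().lstrip("@") for h in handles})
    accepted = sorted(h for h, n in votes.items() if n >= min_votes)
    review = sorted(h for h, n in votes.items() if n < min_votes)
    return accepted, review

# Five hypothetical workers searching for one member's accounts.
WORKERS = [
    ["@RepJaneDoe", "JaneDoeForCongress"],
    ["repjanedoe"],
    ["RepJaneDoe", "janedoeforcongress"],
    ["@repjanedoe", "JaneDoeFan"],
    ["RepJaneDoe"],
]
```

Here `consolidate_accounts(WORKERS)` accepts the two handles that several workers agree on and flags the single-vote handle for a human to verify.
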
  74. 74. Data Collection • Challenge: How do we filter down to official accounts? • Politicians have multiple accounts: official office accounts, campaign accounts, general-purpose personal accounts
  75. 75. Data Collection • Challenge: How do we filter down to official accounts? • Solution: Write some custom logic • Cross-reference their official websites • Manually review anyone without an account • Found a few politicians that didn’t list their official account on their website • Results: 96% have an official account; 79% also have an unofficial account; 17% are official only; 4% are unofficial only
  76. 76. Tip #13: DON’T ASSUME YOUR DATA ARE UNIFORMLY COMPARABLE
  77. 77. Data Collection • Challenge: How do we get the posts? • Can’t exactly scrape Twitter or Facebook (legally)
  78. 78. Data Collection • Challenge: How do we get the posts? • Solution: Use the API (for Facebook, at least) • Sometimes you get lucky and don’t need to scrape • Python wrapper for Facebook and Twitter • Facebook: public accounts, no restrictions! • Twitter: only lets you backfill the most recent 3000 posts, so coverage would need to be assessed • Purchasing historical data through Gnip can be expensive • Facebook seemed like a good first step
  79. 79. DATA CLEANING
  80. 80. Data Cleaning • Challenge: Junk content and boilerplate • Scraping isn’t perfect • Some politicians use boilerplate • Different wire services may append boilerplate
  81. 81. Tip #14: DON’T ASSUME YOUR DATA ARE CLEAN
  82. 82. Data Cleaning • Challenge: Junk content and boilerplate • Solution: Regex and machine learning • Sentence tokenizer • Filter with regular expressions where possible • Code a sample of sentences manually (6% of sentences were boilerplate) • Train an algorithm to make up the difference (linear SVC): 86% precision, 88% recall • 6.5% of text removed, on average • Boilerplate examples: • Your browser does not support iframes. • Welcome to the on-line office for Congressman Donald Payne, Jr. • ">’ ); document.write( addy71707 ); document.write( ’</a>’ ); //–>kdcr@dordt.edu’ );//–> • Washington, DC Office 2417 Rayburn HOB Washington, DC 20515 Phone: (202) 225-2331 Fax: (202) 225-6475 Hours: M-F 9AM-5PM EST • He represents California’s 29th Congressional District, which includes the communities of Alhambra, Altadena, Burbank, East Pasadena, East San Gabriel, Glendale, Monterey Park, Pasadena, San Gabriel, South Pasadena and Temple City.
  83. 83. Tip #15: REGULAR EXPRESSIONS ARE YOUR FRIEND
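
The regex half of the boilerplate filter might look something like this sketch. The patterns are hypothetical, written to catch the example sentences quoted in the slides, and the sentence splitter is deliberately naive (it would wrongly split after abbreviations such as "U.S."). The deck used a proper sentence tokenizer plus a trained linear SVC for whatever regex could not catch.

```python
import re

# Hypothetical patterns modeled on the boilerplate examples quoted on the slide.
BOILERPLATE_PATTERNS = [
    re.compile(r"does not support iframes", re.IGNORECASE),
    re.compile(r"document\.write\("),
    re.compile(r"\(\d{3}\) \d{3}-\d{4}"),  # phone or fax numbers
    re.compile(r"^welcome to the on-?line office", re.IGNORECASE),
]

# Naive splitter: break after ., ! or ? followed by whitespace and a capital letter.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]

def strip_boilerplate(text):
    """Return (kept_text, share_of_sentences_removed)."""
    sentences = split_sentences(text)
    kept = [s for s in sentences
            if not any(p.search(s) for p in BOILERPLATE_PATTERNS)]
    removed_share = 1 - len(kept) / len(sentences) if sentences else 0.0
    return " ".join(kept), removed_share
```

Tracking the share removed per document gives a cheap sanity check: a document that loses most of its sentences is worth a manual look.
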
  84. 84. Data Cleaning • Challenge: Press releases may have duplicates • Multiple sources • Computationally expensive to compare all press release combinations
  85. 85. Data Cleaning • Challenge: Press releases may have duplicates • Solution: Custom de-duplication process • Group by politician • Filter pairs of documents down to probable duplicates using TF-IDF cosine similarity • Train a custom de-duplication model (random forest) on dates, length of text, and Levenshtein ratios • 96% precision, 97% recall
  86. 86. Tip #16: INCREASE COMPLEXITY IN STAGES, CONSIDER EFFICIENCY
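
The first two stages of the de-duplication process (group by politician, then filter pairs by TF-IDF cosine similarity) can be sketched in plain Python. The tokenizer, the smoothed IDF formula, the 0.8 threshold, and the sample releases are assumptions for illustration. The random forest stage that the slides describe is not shown.

```python
import math
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed IDF."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_duplicates(releases, threshold=0.8):
    """Return (id_a, id_b) pairs within each politician whose similarity is high.

    `releases` is a list of (release_id, politician, text) tuples.
    """
    by_politician = {}
    for release_id, politician, text in releases:
        by_politician.setdefault(politician, []).append((release_id, text))
    pairs = []
    for items in by_politician.values():
        vectors = tfidf_vectors([text for _, text in items])
        for (i, (id_a, _)), (j, (id_b, _)) in combinations(enumerate(items), 2):
            if cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((id_a, id_b))
    return pairs

# Hypothetical releases: 1 and 2 are near-identical, 4 belongs to someone else.
RELEASES = [
    (1, "doe", "Rep. Doe issued a statement on the federal budget today"),
    (2, "doe", "Rep. Doe issued a statement on the federal budget today."),
    (3, "doe", "Doe visits local farmers to discuss drought relief"),
    (4, "roe", "Rep. Doe issued a statement on the federal budget today"),
]
```

Grouping first is what keeps this affordable: pairs are only compared within one politician's releases, so the expensive model only ever sees the short list of probable duplicates.
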
  87. 87. CONTENT ANALYSIS
  88. 88. Content Analysis • Challenge: Classifying subjective content is hard • How do you define anger? • What constitutes an attack? • Lots of researchers try to categorize releases into MECE (mutually exclusive, collectively exhaustive) ontologies
  89. 89. Content Analysis • Challenge: Classifying subjective content is hard • Solution: Write a good codebook • You don’t always need MECE • Clearly defined, specific targets • Iteratively refined codebook with lots of pilot testing
  90. 90. Tip #17: DON’T FORCE UNNECESSARY STRUCTURE ON YOUR DATA
  91. 91. Tip #18: TEST YOUR CODEBOOKS
  92. 92. Content Analysis • Challenge: Criticism may be rare • Congressional communications cover a lot of ground • Random sample might not capture enough signal
  93. 93. Content Analysis • Challenge: Criticism may be rare • Solution: Keyword oversampling • Use target-focused keywords to sample documents that are more likely to contain partisan attacks
  94. 94. Content Analysis • Challenge: Criticism may be rare • Solution: Keyword oversampling • Use target-focused keywords to sample documents that are more likely to contain partisan attacks • Facebook: target sample of 3,000; oversampling 20% Democrats, 40% Republicans, 20% Obama, 20% random • Press releases: target sample of 2,000; oversampling 20% Democrats, 40% Republicans, 20% Obama, 20% random
  95. 95. Tip #19: KEYWORD OVERSAMPLING CAN HELP BOOST YOUR INCIDENCE RATE
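
Keyword oversampling with the 20/40/20/20 split from the slides might be drawn as in the sketch below. The keyword lists, the substring matching, and the fixed seed are assumptions. The slides give only the strata and their shares.

```python
import random

# Strata and shares follow the slide's table; the keyword lists are hypothetical.
STRATA = [
    ("democrats", 0.20, ("democrat",)),
    ("republicans", 0.40, ("republican",)),
    ("obama", 0.20, ("obama",)),
]

def oversample(documents, target, seed=0):
    """Draw `target` documents: keyword strata first, then a random remainder."""
    rng = random.Random(seed)
    chosen = {}
    for label, share, keywords in STRATA:
        pool = [i for i, text in enumerate(documents)
                if i not in chosen and any(k in text.lower() for k in keywords)]
        for i in rng.sample(pool, min(len(pool), round(share * target))):
            chosen[i] = label
    # Whatever is left of the target (20% in the slide's design) is drawn at random.
    remaining = [i for i in range(len(documents)) if i not in chosen]
    for i in rng.sample(remaining, min(len(remaining), target - len(chosen))):
        chosen[i] = "random"
    return chosen  # maps document index to the stratum it was drawn from
```

Recording which stratum each document came from matters because the resulting sample is deliberately unrepresentative. Keeping the label makes it possible to account for that later.
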
  96. 96. Content Analysis December 7, 2019 www.pewresearch.org 96 Challenge: Need to classify a LOT of documents • Unsupervised learning isn’t good enough (we compared our models to LIWC) • Need thousands of training observations to train a good model
  97. 97. Content Analysis • Custom interface with built-in examples and guidance • 5 coders for every document • Low cost allowed for far larger training corpus December 7, 2019 www.pewresearch.org 97 Challenge: Need to classify a LOT of documents • Unsupervised learning isn’t good enough (we compared our models to LIWC) • Need thousands of training observations to train a good model Solution: Mechanical Turk API
  98. 98. Content Analysis December 7, 2019 www.pewresearch.org 98 Challenge: Need a single measure, and it needs to be a good one • Are Mechanical Turkers as good as we are? Are they consistent? • How do you turn five coders into one?
  99. 99. Content Analysis • Averaged using a threshold selected for maximum agreement with in-house coders December 7, 2019 www.pewresearch.org 99 Challenge: Need a single measure, and it needs to be a good one • Are Mechanical Turkers as good as we are? Are they consistent? • How do you turn five coders into one? Solution: Compare to an in-house subsample with agreement-driven averaging Document Category MTurk threshold Expert-Turk Kappa Expert Kappa Facebook posts Bipartisanship 2/5 0.91 0.76 Benefits 3/5 0.58 0.66 Disagreement 2/5 0.89 0.87 Indig. disagreement 2/5 0.80 0.45 Press releases Bipartisanship 2/5 0.79 0.76 Benefits 3/5 0.50 0.65 Disagreement 2/5 0.92 0.83 Indig. disagreement 2/5 0.71 0.46
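
Turning five coders into one label by picking the vote threshold that agrees best with in-house coders can be sketched as below. Cohen's kappa is computed by hand for binary labels. The vote counts and expert labels are made-up data, and using kappa as the selection criterion is an assumption based on the kappa columns reported in the slides.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def collapse(votes, threshold):
    """Label a document positive when at least `threshold` coders flagged it."""
    return [int(v >= threshold) for v in votes]

def best_threshold(votes, expert, n_coders=5):
    """Pick the vote threshold whose collapsed labels best agree with experts."""
    scores = {t: cohen_kappa(collapse(votes, t), expert)
              for t in range(1, n_coders + 1)}
    return max(scores, key=scores.get), scores

# Hypothetical data: positive votes out of 5 per document, and expert labels.
VOTES = [0, 1, 2, 3, 5, 0, 4, 1, 2, 0]
EXPERT = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
```

On this toy data a threshold of 2 out of 5 reproduces the expert labels exactly, while stricter or looser thresholds agree less. The same search would be repeated for each document type and category.
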
  100. 100. MACHINE LEARNING
  101. 101. Machine Learning • Challenge: How do we turn text into features? • Algorithms need numbers, not words
  102. 102. Machine Learning • Challenge: How do we turn text into features? • Solution: Text cleaning + TF-IDF • Text cleaning • TF-IDF (we tried Word2Vec too) • Regex-targeted sentence subsets in addition to full documents
  103. 103. Tip #20: SMART FEATURES > FANCY FEATURES
  104. 104. Machine Learning • Challenge: Models could be biased • Republicans and Democrats may use language differently; might attack at different rates • Could over/underestimate a particular party • Keywords related to particular politicians could introduce endogeneity
  105. 105. Machine Learning • Challenge: Models could be biased • Solution: Separate models and custom stopwords • Trained models by target and party • Vocabulary could vary by context • Custom stopword list (names, locations, etc.) • RBF-kernel support vector machines with regularization • Tested for perceptual bias by coder party ID
  106. 106. Tip #21: SOME WORDS MAY INTRODUCE ENDOGENEITY; USE CUSTOM STOPWORDS
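
A custom stopword list can be applied at tokenization time, before any features are built. In this sketch both word lists are hypothetical examples of the kinds of terms the slides mention (names, locations). The point is only that the model never sees tokens that identify a person or place rather than a tone.

```python
import re

# Hypothetical custom stopwords: names and places that identify who is speaking
# (or being spoken about) rather than how they are speaking.
CUSTOM_STOPWORDS = {"obama", "pelosi", "boehner", "texas", "ohio"}
GENERIC_STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on"}

def clean_tokens(text, stopwords=CUSTOM_STOPWORDS | GENERIC_STOPWORDS):
    """Lowercase, tokenize, and drop stopwords before building features."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]
```

For example, `clean_tokens("Obama is wrong on the budget")` keeps only the words that carry the criticism, so a classifier cannot simply learn that mentioning a particular politician predicts an attack.
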
  107. 107. Machine Learning
  108. 108. Key Takeaways • The right tools can make your life easier (Django, Postgres, PyCharm) • Always double-check your data, even (well, especially) when it’s “big” • Take it one step at a time: slow and steady • Mechanical Turk is great • Garbage in, garbage out (use good features, and it’s not hard to get good models) • Google often • Have fun with it!
  109. 109. Thank you! Questions? • Patrick van Kessel, Data Science Associate, Data Labs • pvankessel@pewresearch.org
  110. 110. What is Pew Research Center? Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping America and the world. It does not take policy positions. The Center conducts public opinion polling, demographic research, content analysis and other data-driven social science research. It studies U.S. politics and policy; journalism and media; internet, science and technology; religion and public life; Hispanic trends; global attitudes and trends; and U.S. social and demographic trends. All of the Center’s reports are available at www.pewresearch.org. Pew Research Center is a subsidiary of The Pew Charitable Trusts, its primary funder. Follow us on Twitter at: @pewresearch @facttank
