O3S 2018 - Data at Pew Research Center

Data at Pew Research Center
O3S 2018
Patrick van Kessel
Senior Data Scientist
@pvankessel
December 7, 2019 2
About Pew Research Center
Pew Research Center is a nonpartisan fact tank that informs the public
about the issues, attitudes, and trends shaping the world. It does not take
policy positions. The Center conducts public opinion polling, demographic
research, content analysis and other data-driven social science research. It
studies U.S. politics and policy; journalism and media; internet, science and
technology; religion and public life; Hispanic trends; global attitudes and
trends; and U.S. social and demographic trends. All of the Center’s reports
are available at www.pewresearch.org. Pew Research Center is a
subsidiary of The Pew Charitable Trusts, its primary funder.
Our Relationship with Data
Are we data creators or data consumers? We’re both!
From the world to Pew Research Center:
- Organic data
- Open-source tools and methods
From Pew Research Center to the world:
- Survey data
- Augmented organic data
- Analysis and findings
- New methods and tools
What is Data Labs?
• Created to collect, repurpose, and enrich organic data to supplement our surveys
• Data scientists, engineers, and computational social scientists
• Conduct original research and collaborate with other teams
• Promote emerging computational methods and new data sources
Leveraging Open Data
• Social media data (APIs)
  • Facebook
  • Twitter
  • YouTube
  • Google
• Administrative datasets
  • FEC
  • FCC
  • Census / ACS
• Other organic data
  • Online sermons
  • Mechanical Turk listings
  • Google search results
  • Google Maps
  • Congressional press releases
  • News articles
Leveraging Open Data
[Images: examples drawing on FEC data, Twitter data, and FCC data]
Leveraging Open Data
[Images: two examples drawing on Facebook data and one drawing on Reddit data]
Contributing Open Data
• Traditionally, Pew Research Center has been a data producer
• 15+ years of survey research
• We strive to share as much data as we can
Contributing Open Data
• Most of our datasets eventually become available for download
• Free and available to the public
• http://www.pewresearch.org/download-datasets/
• http://www.pewresearch.org/fact-tank/2018/03/09/how-to-access-pew-research-center-survey-data/
Contributing Open Data
• Survey data released as .sav (SPSS) files!?
• A proprietary format, but one that preserves question text and labels
• Can be used in open-source statistical analysis programs like R (using packages like foreign and haven)
• We even have an online guide on how to use these files: https://medium.com/pew-research-center-decoded/how-to-analyze-pew-research-center-survey-data-in-r-f326df360713
Leveraging Open Data
• Some organic online data are becoming more difficult to collect for research:
  • Social media API restrictions (Facebook, Twitter)
  • GDPR
• But we’re working towards finding a balance between the benefits of privacy and social research
• A number of companies are now forging public-private research partnerships and have put out calls for proposals (e.g. https://socialscience.one)
When You Can’t Share Data
• Even when available, organic data can be difficult to share
  • Terms of service / API restrictions
  • Size and complexity
• Survey data can’t always be shared, either
  • Privacy concerns / disclosure risk
  • Especially with panel data: can’t release detailed geographic information
When You Can’t Share Data
• Some emerging solutions show promise
  • Differential privacy
  • Synthetic data
• But these are currently difficult to implement
• So, if you can’t make your data open, how do you still support open scholarship?
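To make the differential privacy bullet concrete, here is a minimal Python sketch of the Laplace mechanism applied to a single counting query. It is an illustration only, not a description of anything Pew Research Center has deployed: the record fields, the `private_count` helper, and the epsilon value are all hypothetical, and a real release would also need to track the privacy budget across every query.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws, each with
    # mean `scale`, follows a Laplace distribution centred on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical respondent-level records (invented for illustration).
respondents = [
    {"id": 1, "uses_twitter": True},
    {"id": 2, "uses_twitter": False},
    {"id": 3, "uses_twitter": True},
]

noisy = private_count(respondents, lambda r: r["uses_twitter"], epsilon=0.5)
print(f"Noisy count of Twitter users: {noisy:.2f}")
```

Smaller values of epsilon add more noise and give stronger privacy, which is part of why these methods are hard to apply to small survey subgroups.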
Share What Data You Can
• Share some of the data, even if you can’t share it all
• Summary stats and aggregations
• We try to make what we can available, even if we can’t release the raw data
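As a rough illustration of sharing summary stats and aggregations in place of raw data, the Python sketch below collapses respondent-level records into group-level counts and means. The field names and the `min_cell_size` option (which drops groups too small to release) are assumptions made for this example, not a description of Pew Research Center's release process.

```python
from collections import defaultdict


def summarize(records, group_key, value_key, min_cell_size=1):
    """Aggregate row-level records into per-group counts and means.

    Groups with fewer than `min_cell_size` rows are left out, a
    hypothetical safeguard against disclosing very small cells.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])

    summary = {}
    for group, values in groups.items():
        if len(values) < min_cell_size:
            continue  # too few respondents to release safely
        summary[group] = {"n": len(values), "mean": sum(values) / len(values)}
    return summary


# Invented records: region plus a 1-5 rating.
raw_rows = [
    {"region": "Northeast", "rating": 4},
    {"region": "Northeast", "rating": 2},
    {"region": "South", "rating": 5},
]

print(summarize(raw_rows, "region", "rating", min_cell_size=2))
```

Only the aggregated dictionary would be published; the row-level list stays private.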
Share the Process
• There’s still opportunity for methodological transparency
• Explain in detail how the data were made
  • How the sampling frame was defined
  • How the data changed at every step (preprocessing, etc.)
  • How algorithms were trained
  • How data were weighted
• Conduct and describe extensive validation
• Provide everything necessary for successful replication, if not reproduction
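One way to document "how data were weighted" is to publish the calculation itself. The Python sketch below assumes the simplest possible case, post-stratification on a single variable, where each respondent's weight is the group's population share divided by its sample share. The group names and shares are invented, and this is not Pew Research Center's actual weighting procedure.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per respondent: population share / sample share."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of `values` using the matching `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Invented example: younger respondents are over-represented in the sample.
groups = ["under_50", "under_50", "under_50", "50_plus"]
answers = [1, 1, 1, 0]  # 1 = agrees with some statement
shares = {"under_50": 0.5, "50_plus": 0.5}  # assumed population shares

weights = poststratification_weights(groups, shares)
print("Unweighted:", sum(answers) / len(answers))
print("Weighted:", round(weighted_mean(answers, weights), 3))  # approximately 0.5
```

Publishing the population targets alongside a calculation like this lets other researchers replicate the weighting even when the raw responses cannot be released.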
Share the Process
• Our team’s methodology appendices tend to be nearly as long as our reports
• We also have a dedicated Methods team that produces reports entirely focused on methodological transparency and innovation
                                               Report     Methodology
Sharing the News in a Polarized Congress       11 pages   9 pages
Taking Sides on Facebook                       21 pages   20 pages
Bots in the Twittersphere                      11 pages   18 pages
Partisan Conflict and Congressional Outreach   38 pages   25 pages
Share the Process
• Also: share the assumptions and limitations!
• Be transparent about:
  • 1) All methodological decisions you make
  • 2) What the data can - and can’t - say
• How?
  • Robustness checks and human validation
  • Control for confounds with regressions where possible
  • Show how the results of an analysis change under different assumptions
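A simple robustness check can show how a headline number moves when one analytic choice changes. The Python sketch below is a hypothetical example: it assumes a classifier has scored each item between 0 and 1, then reports the share of items counted as positive under several cutoffs. The scores and thresholds are invented for illustration.

```python
def share_above(scores, threshold):
    """Share of items whose score is at or above `threshold`."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


def sensitivity_table(scores, thresholds):
    """Map each candidate threshold to the resulting share of positives."""
    return {t: share_above(scores, t) for t in thresholds}


# Invented classifier scores for five items.
scores = [0.2, 0.4, 0.55, 0.7, 0.9]

for threshold, share in sensitivity_table(scores, [0.3, 0.5, 0.7]).items():
    print(f"threshold={threshold}: {share:.0%} classified as positive")
```

Reporting the whole table, and not only the result for the chosen cutoff, lets readers judge how much a finding depends on that decision.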
Share the Process
• New blog to make our methods more transparent and accessible
Share the Tools
• We’re starting to release code publicly on GitHub: http://github.com/pewresearch
Looking Forward
• More survey data releases, including panel data
• More open-sourcing, including tools that we use to analyze survey/organic data
• Rigorous devotion to adhering to (and defining) best practices
Engage with the Community
• Be responsive to questions from other researchers and interested observers
• Present in-progress work at academic and technical conferences
• Engage in a conversation about transparency
• And always try to do more
• To that end:
• What can we share with you?
• What data do you want to see more of?
• What methodologies can we be more transparent about?
• What tools or software would be useful to release?
• Let us know! info@pewresearch.org
Thank You!
Patrick van Kessel
Senior Data Scientist
pvankessel@pewresearch.org