1. Data at Pew Research Center
OS3 2018
Patrick van Kessel
Senior Data Scientist
@pvankessel
2. December 7, 2019 2
About Pew Research Center
Pew Research Center is a nonpartisan fact tank that informs the public
about the issues, attitudes, and trends shaping the world. It does not take
policy positions. The Center conducts public opinion polling, demographic
research, content analysis and other data-driven social science research. It
studies U.S. politics and policy; journalism and media; internet, science and
technology; religion and public life; Hispanic trends; global attitudes and
trends; and U.S. social and demographic trends. All of the Center’s reports
are available at www.pewresearch.org. Pew Research Center is a
subsidiary of The Pew Charitable Trusts, its primary funder.
3. Our Relationship with Data
Are we data creators or data consumers? We’re both!
The World → Pew Research Center:
- Organic data
- Open-source tools and methods
Pew Research Center → The World:
- Survey data
- Augmented organic data
- Analysis and findings
- New methods and tools
4. What is Data Labs?
• Created to collect, repurpose, and enrich organic data to supplement our surveys
• Data scientists, engineers, and computational social scientists
• Conduct original research and collaborate with other teams
• Promote emerging computational methods and new data sources
5. Leveraging Open Data
• Social media data (APIs)
  • Facebook
  • Twitter
  • YouTube
  • Google
• Administrative datasets
  • FEC
  • FCC
  • Census / ACS
• Other organic data
  • Online sermons
  • Mechanical Turk listings
  • Google search results
  • Google Maps
  • Congressional press releases
  • News articles
6. Leveraging Open Data
[Figure panels: FEC data, Twitter data, FCC data]
7. Leveraging Open Data
[Figure panels: Facebook data (two examples), Reddit data]
8. Contributing Open Data
• Traditionally, Pew Research Center has been a data producer
• 15+ years of survey research
• We strive to share as much data as we can
9. Contributing Open Data
• Most of our datasets eventually become available for download
• Free and available to the public
• http://www.pewresearch.org/download-datasets/
• http://www.pewresearch.org/fact-tank/2018/03/09/how-to-access-pew-research-center-survey-data/
10. Contributing Open Data
• Survey data released as .sav files!?
• A proprietary format, but one that preserves question text and labels
• Can be used in open-source statistical analysis programs like R (using packages like foreign and haven)
• We even have an online guide on how to use these files: https://medium.com/pew-research-center-decoded/how-to-analyze-pew-research-center-survey-data-in-r-f326df360713
11. Leveraging Open Data
• Some organic online data is becoming more difficult to collect for research:
  • Social media API restrictions (Facebook, Twitter)
  • GDPR
• But we’re working towards finding a balance between the benefits of privacy and social research
• A number of companies are now forging public-private research partnerships and have put out calls for proposals (e.g. https://socialscience.one)
12. When You Can’t Share Data
• Even when available, organic data can be difficult to share
  • Terms of service / API restrictions
  • Size and complexity
• Survey data can’t always be shared, either
  • Privacy concerns / disclosure risk
  • Especially with panel data: can’t release detailed geographic information
13. When You Can’t Share Data
• Some emerging solutions show promise
  • Differential privacy
  • Synthetic data
• But these are currently difficult to implement
• So, if you can’t make your data open, how do you still support open scholarship?
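To make the differential privacy idea on this slide a little more concrete, here is a minimal sketch of the Laplace mechanism applied to a single count. It is an illustration only, not something the Center is described as using: the function names, the epsilon value, and the example count of 120 are all assumptions made for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise added (hypothetical helper).

    Adding or removing one respondent changes a count by at most 1,
    so the sensitivity of a counting query is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Made-up example: 120 respondents gave a particular answer.
rng = random.Random(0)
releases = [private_count(120, epsilon=0.5, rng=rng) for _ in range(5)]
print([round(value, 1) for value in releases])
```

Smaller values of epsilon add more noise and give stronger privacy. Part of what makes this hard in practice is that every released statistic spends some of a shared privacy budget, which this sketch does not track.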
14. Share What Data You Can
• Share some of the data, even if you can’t share it all
  • Summary stats and aggregations
• We try to make what we can available, even if we can’t release the raw data
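One way to share aggregations without releasing raw records is to publish group-level counts and blank out cells that are too small to release safely. The sketch below assumes a hypothetical minimum cell size of 10 and made-up records; neither the threshold nor the field names come from the talk.

```python
from collections import Counter

MIN_CELL_SIZE = 10  # hypothetical disclosure threshold, chosen for illustration


def aggregate_with_suppression(records, group_key, answer_key, min_cell=MIN_CELL_SIZE):
    """Count records per (group, answer) cell; replace small cells with None."""
    counts = Counter((r[group_key], r[answer_key]) for r in records)
    return {
        cell: (n if n >= min_cell else None)
        for cell, n in sorted(counts.items())
    }


# Made-up respondent-level records that would stay private.
records = (
    [{"region": "Northeast", "answer": "Yes"}] * 14
    + [{"region": "Northeast", "answer": "No"}] * 3
    + [{"region": "South", "answer": "Yes"}] * 22
    + [{"region": "South", "answer": "No"}] * 17
)

summary = aggregate_with_suppression(records, "region", "answer")
for cell, n in summary.items():
    print(cell, "suppressed" if n is None else n)
```

If row or column totals are published alongside the table, a suppressed cell can sometimes be recovered by subtraction, so a real release would need to consider that as well.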
15. Share the Process
• There’s still opportunity for methodological transparency
• Explain in detail how the data were made
  • How the sampling frame was defined
  • How the data changed at every step (preprocessing, etc.)
  • How algorithms were trained
  • How data were weighted
• Conduct and describe extensive validation
• Provide everything necessary for successful replication if not reproduction
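Describing how data were weighted is easier to follow with a small worked example. The sketch below shows simple post-stratification on one variable, where each respondent's weight is the group's population share divided by its sample share. It is not the Center's actual weighting procedure; the age groups, population shares, and answers are made up.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight each respondent by population share / sample share of their group."""
    n = len(sample_groups)
    sample_shares = {g: c / n for g, c in Counter(sample_groups).items()}
    return [population_shares[g] / sample_shares[g] for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up sample: younger respondents are underrepresented (3 of 10).
groups = ["18-49"] * 3 + ["50+"] * 7
answers = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]  # 1 = "yes"
population_shares = {"18-49": 0.5, "50+": 0.5}  # assumed benchmark

weights = poststratification_weights(groups, population_shares)
unweighted = sum(answers) / len(answers)
weighted = weighted_mean(answers, weights)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

Publishing the benchmark shares and the resulting weights alongside an estimate lets other researchers see how far the weighting moved it.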
16. Share the Process
• Our team’s methodology appendices tend to be nearly as long as our reports
• We also have a dedicated Methods team that produces reports entirely focused on methodological transparency and innovation

                                               Report     Methodology
Sharing the News in a Polarized Congress       11 pages   9 pages
Taking Sides on Facebook                       21 pages   20 pages
Bots in the Twittersphere                      11 pages   18 pages
Partisan Conflict and Congressional Outreach   38 pages   25 pages
17. Share the Process
• Also: share the assumptions and limitations!
• Be transparent about:
  • 1) All methodological decisions you make
  • 2) What the data can - and can’t - say
• How?
  • Robustness checks and human validation
  • Control for confounds with regressions where possible
  • Show how the results of an analysis change under different assumptions
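Showing how results change under different assumptions can be as simple as recomputing a headline number across a range of analyst choices. The sketch below assumes a hypothetical classifier that scores each item between 0 and 1 and reports the share flagged at several cutoffs; the scores and thresholds are made up.

```python
def share_flagged(scores, threshold):
    """Share of items whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def sensitivity_table(scores, thresholds):
    """Recompute the headline share under each candidate threshold."""
    return {t: share_flagged(scores, t) for t in thresholds}


# Made-up classifier scores for ten items.
scores = [0.05, 0.12, 0.33, 0.41, 0.48, 0.52, 0.58, 0.67, 0.81, 0.93]

table = sensitivity_table(scores, thresholds=[0.4, 0.5, 0.6])
for threshold, share in table.items():
    print(f"threshold={threshold:.1f} share_flagged={share:.0%}")
```

Reporting the whole table, rather than only the number at the chosen cutoff, lets readers judge how much a finding depends on that one decision.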
19. Share the Process
• New blog to make our methods more transparent and accessible
20. Share the Tools
• We’re starting to release code publicly on GitHub: http://github.com/pewresearch
21. Looking Forward
• More survey data releases, including panel data
• More open-sourcing, including tools that we use to analyze survey/organic data
• Rigorous devotion to adhering to (and defining) best practices
22. Engage with the Community
• Be responsive to questions from other researchers and interested observers
• Present in-progress work at academic and technical conferences
• Engage in a conversation about transparency
• And always try to do more
23. To that end:
• What can we share with you?
• What data do you want to see more of?
• What methodologies can we be more transparent about?
• What tools or software would be useful to release?
• Let us know! info@pewresearch.org
24. Thank You!
Patrick van Kessel
Senior Data Scientist
pvankessel@pewresearch.org