Open science and data sharing: the DataFirst experience/Martin Wittenberg
Open science and data sharing: the DataFirst experience
Martin Wittenberg
DataFirst
26 October 2017
Open Science
Overview
• Introduction
• Data and the research ecosystem
• The problem of measurement in the social sciences
• Difficulties with sharing data
• Why sharing data is essential
• The role of a data platform like DataFirst
Introduction
• I’m an economist trying to understand what has happened to South Africa since the end of apartheid
– Particularly in relation to wages, employment, inequality and service delivery
• Data and data quality are key
• I also direct DataFirst, an organisation based at UCT dedicated to making it easier for researchers to access social science microdata
• www.datafirst.uct.ac.za
• https://sites.google.com/site/martinwwittenberg/home
Data and the research ecosystem
• Data doesn’t just appear
• The value and meaning of data arise from how it emerges within the research ecosystem
Data and the research ecosystem
Theory
• e.g. how markets work
Application
• e.g. the impact of imposing a minimum wage in 2018
Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns
Measurement
• Sometimes for research purposes
• But also incidental to other purposes
– e.g. tax data, satellite “night light” data
• Understand context, rules and procedures used
– Sampling theory
– Measurement instrument (e.g. questionnaire)
– Fieldwork practice
– Post-fieldwork data capture & processing
– Imputations for missing values
Measurement in the social sciences
• Crucial to also understand what you are not seeing
– Non-response
• In the social sciences the subjects of research often have an interest in the outcome
– They can choose what to report
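To make the non-response point concrete, here is a minimal sketch with entirely hypothetical numbers (the earnings levels, group sizes and response rates are assumptions for illustration, not figures from any survey). It shows how a simple average of respondents can understate mean earnings when higher earners respond less often, and how weighting each respondent by the inverse of the response rate would recover the true mean if those rates were known. In practice they usually are not observed, which is exactly why the measurement process needs to be understood.

```python
# Hypothetical population: (earnings, number of people, response rate).
groups = [
    (10.0, 80, 0.50),   # lower earners: half respond
    (100.0, 20, 0.25),  # higher earners: a quarter respond
]


def true_mean(groups):
    """Mean earnings over the whole (hypothetical) population."""
    total = sum(e * n for e, n, _ in groups)
    return total / sum(n for _, n, _ in groups)


def naive_mean(groups):
    """Unweighted mean over respondents only."""
    respondents = [(e, n * r) for e, n, r in groups]
    return sum(e * m for e, m in respondents) / sum(m for _, m in respondents)


def weighted_mean(groups):
    """Respondent mean with inverse-response-rate weights (rates assumed known)."""
    respondents = [(e, n * r, 1.0 / r) for e, n, r in groups]
    numerator = sum(e * m * w for e, m, w in respondents)
    denominator = sum(m * w for _, m, w in respondents)
    return numerator / denominator


print(true_mean(groups), naive_mean(groups), weighted_mean(groups))
```

With these made-up inputs the unweighted respondent mean falls below the population mean, while the weighted mean matches it.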
An example from my research
Compare earnings in tax data and surveys
• Wages of employees
Blog post at http://www.econ3x3.org/
Measurement issues
The picture when looking at earnings from self-employment (business profits)
Why?
• Penalties for not reporting
• But accurate reporting means paying more tax
Data within the research ecosystem
• In summary, data is not useful for research unless
– We know where it has come from
– We know what sort of errors/biases are likely to be involved in the measurement process
• AND
– People who are working on applied questions know that it exists and can be accessed
Difficulties with sharing data
• One of the challenges of sharing data is to provide enough information (metadata) about
– Context
– Measurement process
• Plus the data must be stored in such a way that it is “discoverable”
• All of this costs time and effort
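As a rough illustration of what "enough information" plus "discoverable" could mean in practice, the sketch below defines a small metadata record and a keyword search over a catalogue. The class, field names and entries are all hypothetical placeholders chosen for this example; they are not DataFirst's schema. Real catalogues typically follow an established metadata standard such as DDI rather than an ad hoc structure like this one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; field names are illustrative."""
    title: str
    context: str       # why and by whom the data were collected
    sampling: str      # how units were selected
    instrument: str    # e.g. the questionnaire
    processing: str    # data capture, cleaning, imputations
    keywords: List[str] = field(default_factory=list)


def find(catalogue: List[DatasetRecord], term: str) -> List[str]:
    """Return titles of records whose title or keywords contain `term`."""
    t = term.lower()
    return [
        rec.title
        for rec in catalogue
        if t in rec.title.lower() or any(t in k.lower() for k in rec.keywords)
    ]


# Placeholder entries, not real catalogue records.
catalogue = [
    DatasetRecord(
        title="Example household survey",
        context="placeholder: collected for research purposes",
        sampling="placeholder: sample design",
        instrument="placeholder: questionnaire",
        processing="placeholder: capture and imputation notes",
        keywords=["wages", "employment"],
    ),
    DatasetRecord(
        title="Example administrative tax records",
        context="placeholder: incidental to tax collection",
        sampling="placeholder: all filers",
        instrument="placeholder: tax return form",
        processing="placeholder: anonymisation notes",
        keywords=["earnings", "tax"],
    ),
]

print(find(catalogue, "wages"))
```

The point of the sketch is only that each record carries its context and measurement process alongside searchable terms; filling those fields in properly is where the time and effort go.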
Other difficulties
• Fear of getting scooped with one’s own data
• Fear of someone else finding a path-breaking application of the data that one hadn’t thought of
• Fear of problems/errors in the measurement process being exposed
• Confidentiality/privacy of respondents
– Ethics clearance
How might one deal with these?
• Getting scooped
– Delay public release
• “Important Science” vs “Mere data gathering”
– Underlying issue is really one of skill
– Response is often “data squatting”/rent extraction
– A more creative response is to find ways to set up training programmes around the data
Issues with sharing, cont.
• Exposing problems with the measurement process
– Becomes more critical if these data are the only ones available
– Reality is that there is no 100% clean dataset
– Provided that there is still a detectable “signal” in the data, it can still be used for science
• It becomes easier to “fix” the problems if they are openly acknowledged
Issues with sharing, cont.
• Confidentiality
– “Open science” doesn’t mean that the data has to be available on the web for anyone
– Key issue is that there have to be transparent protocols for access
– e.g. “Secure Labs” as recently established in DataFirst
Why sharing is essential
• Proper science
– Can only be done if results can be replicated
– Errors in analysis/measurement exposed
• New insights
– It is impossible for one team to be on top of all the ways in which a dataset could be used
– Making data available allows some of the best and brightest people in the world to think about your issues/problems
• e.g. many of our insights into the impact and effectiveness of South Africa’s old age pension system came from American academics
– Of course some garbage is likely to be generated in the process too
Why sharing is essential, cont.
• Improvement in skills
– South African quantitative social scientists of my generation learned most of what we know from seeing international economists (notably Nobel Prize winner Angus Deaton) work on our data
• He showed that there are fascinating questions to be answered
• He made his code available
How do we make sharing more successful?
• This is a question not only of incentives for researchers and research organisations
• But also of institutions that can facilitate this process
• Organisations like DataFirst play an important role here
The issue is really how to strengthen the links
Theory
• e.g. how markets work
Application
• e.g. the impact of imposing a minimum wage in 2018
Measurement
• e.g. Quarterly Labour Force Survey
• e.g. tax returns
How can we strengthen these loops?
• These are not “add-ons” – they are an integral part of a successful science infrastructure
– Like libraries, research clouds etc.
– They need to be supported:
• Financially
• Through mandates for sharing data, particularly if public funds have been used in collecting them