Open science and data sharing: the DataFirst experience/Martin Wittenberg

  1. Open Science and Data sharing: the DataFirst experience Martin Wittenberg DataFirst 26 October 2017
  2. Open Science Overview • Introduction • Data and the research ecosystem • The problem of measurement in the social sciences • Difficulties with sharing data • Why sharing data is essential • The role of a data platform like DataFirst
  3. Open Science Introduction • I’m an economist trying to understand what has happened to South Africa since the end of apartheid – Particularly in relation to wages, employment, inequality, service delivery • Data and data quality are key • I also direct DataFirst, which is an organisation based at UCT dedicated to making it easier for researchers to access social science microdata
  4. Open Science Data and the research ecosystem • Data doesn’t just appear • The value and meaning of data arises from how it emerges within the
  5. Open Science Data and the research ecosystem Theory • e.g. how markets work Application • e.g. the impact of imposing a minimum wage in 2018 Measurement • e.g. Quarterly Labour Force Survey • e.g. tax returns
  6. Open Science Measurement • Sometimes for research purposes • But also incidental to other purposes – e.g. tax data, satellite “night light” data • Understand context, rules and procedures used – Sampling theory – Measurement instrument (e.g. questionnaire) – Fieldwork practice – Post-fieldwork data capture & processing – Imputations for missing values
  7. Open Science Measurement in the social sciences • Crucial to also understand what you are not seeing – Non-response • In the social sciences the subjects of research often have an interest in the outcome – They can choose what to report
  8. Open Science An example from my research Compare earnings in tax data and surveys • Wages of employees Blog post at
  9. Open Science Measurement issues The picture when looking at earnings from self-employment (business profits) Why? • Penalties for not reporting • But accurate reporting means paying more tax
  10. Open Science Data within the research ecosystem • In summary, data is not useful for research unless – We know where it has come from – We know what sort of errors/biases are likely to be involved in the measurement process • AND – People who are working on applied questions know that it exists and can be accessed
  11. Open Science Difficulties with sharing data • One of the challenges of sharing data is to provide enough information about – Context – Measurement process (Metadata) • Plus the data must be stored in a way that it is “discoverable” • All of this costs time and effort
  12. Open Science Other difficulties • Fear of getting scooped with one’s own data • Fear of someone else finding a path-breaking application of the data that one hadn’t thought of • Fear of problems/errors in the measurement process being exposed • Confidentiality/privacy of respondents – Ethics clearance
  13. Open Science How might one deal with these? • Getting scooped – Delay public release • “Important Science” vs “Mere data gathering” – Underlying issue is really one of skill – Response is often “data squatting”/rent extraction – A more creative response is to find ways to get training programmes up around the data
  14. Open Science Issues with sharing, cont. • Exposing problems with the measurement process – Becomes more critical if these data are the only ones available – Reality is that there is no 100% clean dataset – Provided that there is still a detectable “signal” in the data, it can still be used for science • It becomes easier to “fix” the problems if they are openly acknowledged
  15. Open Science Issues with sharing, cont. • Confidentiality – “Open science” doesn’t mean that the data has to be available on the web for anyone – Key issue is that there have to be transparent protocols for access – e.g. “Secure Labs” as recently established in DataFirst
  16. Open Science Why sharing is essential • Proper science – Can only be done if results can be replicated – Errors in analysis/measurement exposed • New insights – It is impossible for one team to be on top of all the ways in which a dataset could be used – Making data available allows some of the best and brightest people in the world to think about your issues/problems • e.g. many of our insights into the impact and effectiveness of South Africa’s old age pension system came from American academics – Of course some garbage is likely to be generated in the process too
  17. Open Science Why sharing is essential, cont. • Improvement in skills – South African quantitative social scientists of my generation learned most of what we know from seeing international economists (notably Nobel prize winner Angus Deaton) work on our data • He showed that there are fascinating questions to be answered • He made his code available
  18. Open Science How do we make sharing more successful? • This is a question not only about the incentives for researchers and research organisations • But also about institutions that can facilitate this process • Organisations like DataFirst play an important role here
  19. Open Science The issue is really how to strengthen the links Theory • e.g. how markets work Application • e.g. the impact of imposing a minimum wage in 2018 Measurement • e.g. Quarterly Labour Force Survey • e.g. tax returns
  20. Overview Dissemination Data Producer Skilled user Dissemination Feedback
  21. Overview Replicability of results Data Published Paper Analysis Review/Replication Follow-up Skilled Researcher Reader
  22. Overview Best practice data production Data Producer Methodological Research “Best practice” Practical Issues Feedback
  23. Overview Best practice data analysis
  24. Open Science How can we strengthen these loops? • These are not “add-ons” – they are an integral part of a successful science infrastructure – Like libraries, research clouds etc. – Need to be supported: • Financially • Mandates for sharing data, particularly if public funds have been used in collecting them
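
Slides 6 and 11 list what a shared dataset needs to carry with it: context, sampling, the measurement instrument, fieldwork practice, processing, imputations, and enough descriptive information to be discoverable. As a rough illustration only, the sketch below shows one way such a checklist could be represented and checked in Python. The class `DatasetRecord`, its field names, and the example values are all hypothetical; they are not DataFirst's actual metadata schema or any published standard.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; field names are illustrative only."""
    title: str
    producer: str
    context: str = ""            # why and when the data were collected
    sampling_design: str = ""    # how units were selected
    instrument: str = ""         # e.g. the questionnaire used
    fieldwork_notes: str = ""    # how the fieldwork was conducted
    processing_notes: str = ""   # post-fieldwork capture and cleaning
    imputation_notes: str = ""   # how missing values were handled
    access_protocol: str = ""    # e.g. public download or secure lab
    keywords: List[str] = field(default_factory=list)


def missing_documentation(record: DatasetRecord) -> List[str]:
    """Return the names of metadata fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def is_discoverable(record: DatasetRecord, query: str) -> bool:
    """Case-insensitive match of a search term against title and keywords."""
    q = query.lower()
    return q in record.title.lower() or any(q in k.lower() for k in record.keywords)


# Made-up example record, not a real dataset.
record = DatasetRecord(
    title="Example labour force survey",
    producer="Example statistics agency",
    sampling_design="Stratified two-stage cluster sample",
    instrument="Household questionnaire",
    keywords=["wages", "employment"],
)

print(missing_documentation(record))
print(is_discoverable(record, "Wages"))
```

A check like `missing_documentation` makes the cost mentioned on slide 11 visible: every empty field is documentation work that someone still has to do before the data can be reused with confidence.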