Tim Osborn: Research Integrity: Integrity of the published record


Tim Osborn, Reader, University of East Anglia

Published in: Technology
  • Although I am interested in the issues of research data management, it’s not a primary area for me. Consequently my contribution is rather narrower and more discipline-specific than others, because that’s what I’m familiar with. Nevertheless I hope it is still a useful “case study” that can inform the wider debate.
  • I was asked to talk about why the integrity of the published record has become important for climate research. This might imply that it hasn’t always been important or that it is more important for climate research than for some other disciplines, neither of which is true. Nevertheless, what are the features of the current global warming issue that might pick it out for special attention? It pushes scientific knowledge to its limits (and beyond current limits). The pace of scientific advancement needs to be maximised without harming integrity. The issue is inextricably linked to the development of poorer countries, economic and demographic growth, inequalities, etc. (Say more about the high stakes next.) Intense scrutiny is good; we need more of it, and it will improve scientific knowledge and understanding.
  • It’s worth considering the controversy that followed the hacking of emails and other documents from the Climatic Research Unit and its relevance to the relationship between research data management and the integrity of the published research record. Within the scientific community, the integrity of our research was questioned very little. Overwhelmingly the community was supportive. Nevertheless there was widespread questioning (at least initially) in the mainstream media, and this was also influenced by a relatively small number of vociferous critics, helped by various critics and/or vested interests willing to communicate their viewpoints. It’s important to distinguish between perceptions of integrity and actual integrity. As confirmed by the Muir Russell review, for example, we did not destroy raw data; we did not manipulate data inappropriately or with the intention of obtaining a pre-determined outcome; we recorded, justified and published the few data adjustments that were made so that others could understand and test them; and third parties had access to the data and other materials necessary to reproduce our published research. Yet this isn’t how things were portrayed. My view (shared by others I’ve asked) is that the management of our research data played only a small role. The hacked emails and their interpretation were much more significant.
  • But there isn’t room for complacency. Although issues relating to research data may have played little role, there were areas of valid criticism and this made it harder to defend our integrity – especially in the cross-over of criticism into the mainstream media. Rather than analysing this interaction further, it is better to identify the improvements that can – and should – be made and move on to how best this can be done. These ideas apply, of course, to the whole climate science community, not specifically to the Climatic Research Unit. Overall, various aspects of data management can be improved, supporting more transparency and openness, more re-use of data, and unambiguous links between published findings and the data on which they have been built. We have begun to investigate some of these issues, for example in our JISC-funded project “ACRID”, though I can’t go into the details of that today.
  • Here I've listed important outcomes of improved data management and sharing by the climate science community. First, supporting the reproducibility and thus integrity of published research. Second, exploring beyond published findings to assess their robustness to changes in data, methods, assumptions, and so on. Third, facilitating more re-use of existing data sets. Despite having an estimated 10,000 terabytes of data about the climate system, the system is so complex that we need to maximise our use of available datasets.
  • But there are some challenges to consider. The volume of research is considerable, with an estimated publication rate approaching 15,000 articles related to climate change each year. Publishing and sharing the data and scientific workflows associated with each one is a significant task – is it necessary?
  • The volume of data is also very large, and projected to grow rapidly, with a big expansion of data simulated by climate models and, particularly, associated with remote sensing instruments mounted on satellites and ground radar. The small “in situ/other” bars, although a much smaller volume, are very disparate and therefore present their own considerable challenge.
  • As well as challenges, there are also limitations. Some of these issues are necessary products of the adversarial system in which research (and publication and funding) is carried out. Some data cannot be published openly, some data producers want open publication delayed so they can exploit their data further. Then there are the resources – time and money – needed to improve data sharing and publication.
  • Many datasets are truly open, or are open for non-commercial research. But some are subject to real non-disclosure agreements. For example, most data observed at UK weather stations are subject to this agreement, which is treated very seriously. I've used these data in many publications – even winning a medal from the Royal Meteorological Society for some of this work – but I can't share the raw data that I used. I can't even keep a copy myself to ensure that I can prove my work is repeatable. I could, I suppose, download a new copy in future years to demonstrate that it is repeatable, but can I guarantee that the data in the data centre won't have been changed (quality controlled, updated, etc.)? No!
  • Here is another example. The most widely used basis for analysing changes in global-scale patterns of precipitation is constructed from weather station data that cannot be disclosed. These non-disclosure agreements are real – but need to be phased out somehow.
  • Informal agreements between colleagues sharing data are also genuine, and the consequence of breaching the trust established between colleagues is too.
  • Traditionally, climate data aren't published in their own right, but rather as part of an article that analyses the data and reports the findings. The scientists gain citations and credit, but it can take years in some cases – first amassing sufficient results to represent publishable critical mass, and then negotiating the peer-review system.
  • The main co-benefits are probably not linked to the ability to demonstrate that published work can be replicated. Providing data (and other materials) with a publication, perhaps as supplementary online materials which many journals now allow, is certainly a useful option but in many cases it isn't seen as a sufficient benefit in itself. For how many of the 13,000 climate change articles published per year would this supplementary material actually be looked at and used? Again, it must be “sold” to the scientist by other benefits. There are also limitations in how useful this route of supplementary material is for citing specific datasets and for finding them for re-use. Another concern is the proliferation of multiple copies of a dataset. Are they identical, or subtly different? How to tell? It is better to provide a unique identifier and address for the existing data that were used in the article, rather than a copy of the data.
  • There are new opportunities to publish datasets in their own right, rather than as part of scientific articles. Although meta-data and other accompanying information must also be provided, it is still a smaller task than completing a study and associated paper, and the peer review can also be much lighter touch. The lag from data collection to publication could be reduced. The datasets are citable, allowing due credit and encouraging more scientists to follow this route, and need to be uniquely identifiable over a long period of time.
  • In terms of the location of long-term data publishing and archiving, the publishers of scientific journals do not seem to be ideal. Yes, storing data with the publisher does help provide a very strong link between a published article and the data used. But publishers – especially commercial rather than academic – are not the place to guarantee long-term preservation, due to the commercial realities in which they operate. They cannot guarantee the longevity of a journal and its associated archive of data and materials. Unlike the journal article, are the supplementary materials archived (e.g. by the British Library) and easily located? Publishers – and also individual university archives – may also not provide the functionality that dedicated data centres can provide, such as tools and search engines. Similarly, they aren't ideal for supporting data re-use if the data are scattered across hundreds of institutions and/or journals. Much better are existing data centres, especially if they are dedicated to specific disciplines and have a mandate to support the scientific community by providing long-term archiving.
  • In my experience, the more specific (e.g. subdiscipline) the better. For example the World Data Centre for Paleoclimatology does well by splitting their archives according to subdiscipline – e.g. the International Tree-Ring Data Bank. Why? Well the more generalised these are, the more complex the data and meta-data model becomes, and formats tailored to the needs of specific cases are harder to cater for. There is a steeper barrier, some aspects may appear to be irrelevant, and the objective of encouraging greater data sharing is not met.
  • The stakes are particularly high but the context in which decisions must be made is very difficult, with no precedents to show which route is best. If, as some suggest, it were easy and cheap to reduce greenhouse gas emissions and if the impacts of not doing so were very damaging, the best policy would be obvious. If, as others contend, the situation is reversed, the best policy (“do nothing”) is also easily chosen. In reality, the context is much harder. Taking action is not easy or cheap. The net effects could be very serious. But they might not be. Or they might be serious for some but not for everyone. The net impact of climate change is very uncertain – and the uncertainty range includes some changes that are not just economically damaging but could be beyond what we can adapt to.
  • There is a significant cost involved, and increasing the value of data for re-use means spending more time on publishing them: meeting standards for data and meta-data, and providing other materials. But the “cost” is not simply a matter of funding. Though time = money, if you give an academic more funding to cover a task that doesn’t have an obvious benefit, they will still be reluctant to use the funding for that task. The solution is to focus on the co-benefits of committing this time and these resources.

    1. Climate research data and research integrity. Dr Tim Osborn, Climatic Research Unit, School of Environmental Sciences, University of East Anglia. JISC Research Integrity Conference: the Importance of Good Data Management, 13 September 2011
    2. Integrity of the published research record
        • Why is it important for climate research and why now?
            • (Of course it’s always been important and not just for this discipline)
        • The global warming issue:
            • Scientifically challenging
            • Politically, socially and economically contentious
            • High stakes (economic and non-economic)
            • Under intense scrutiny
    3. Climate change hacked emails controversy
        • The integrity of our research was severely questioned
            • What role did research data issues (management, sharing, etc.) play in this?
                • Need to distinguish research integrity from perceptions of research integrity
            • These issues probably played a rather small role
                • Our research data and the research record were preserved
                • We “created” very little raw data and we have an excellent record in preserving and publishing for re-use our derived data
            • Instead, the perception of doubt arose very much more from the contents of the hacked emails and their interpretation
    4. Climate change hacked emails controversy
        • Improved research data management and sharing would have made little difference to the attacks on our integrity
            • Not to our critics, perhaps a small role in the cross-over to the main-stream media
        • Nevertheless, there are areas where we can improve and we received some criticism in these areas
        • The climate science community as a whole should improve
            • Data sharing for openness, for re-use
            • Improved data management for preserving workflows and linking articles to analysis to data (e.g. JISC ACRID)
    5. Managing and sharing research data: why should we improve?
        • Supports reproducibility (necessary) and repeatability (desirable)
            • Maintains (actual and perceived) integrity of research
            • Essential because high-stake decisions must be informed by sound scientific assessment
        • Supports further exploration of scientific findings
            • Scientific findings that are not clear cut (e.g. in the vicinity of the statistical significance) are more sensitive to variations in data, methodological choices, assumptions, etc.
        • Supports data re-use for other studies
            • We are data poor (despite > 10,000 TB) relative to the complexity of the climate system
    6. Sharing climate data: some challenges
        • Estimated numbers of climate change articles:
            • Total > 100,000
            • Just 2009 > 13,000 which is > 1 / hour
        • Source: Grieneisen & Zhang (2011) doi: 10.1038/nclimate1093
    7. Sharing climate data: some challenges
        • Data volume is already large (> 10,000 TB)
        • Projected to grow tenfold by end of this decade
        • Source: Overpeck et al. (2011) doi: 10.1126/science.1197869
    8. Sharing climate data: some limitations
        • Data with non-disclosure agreements
            • Formal or informal agreements
        • Holding back for future exploitation
            • Controlling use, getting recognition
        • Time and resources
            • Costs may be obvious, benefits may be unrealised
            • Standards, meta-data and software increase the value in re-use, but can increase the time needed
    9. Non-disclosure agreements: real or excuse?
        • Example 1: UK climate data
            • Data sets must not be passed on to third parties under any circumstances... Once the project work using the data has been completed, copies of the datasets held by the end user should be deleted ... The introduction of sanctions against individuals or Departments may be considered if breaches occur.
                • http://badc.nerc.ac.uk/conditions/ukmo_agreement.html
    10. Non-disclosure agreements: real or excuse?
        • Example 2: Global precipitation data
            • One of the most widely used analyses of variations in precipitation across the global land surface is “based on the complete GPCC monthly rainfall station data-base (the largest monthly precipitation station database of the world with data from ca. 85,000 different stations)... Corresponding to international agreement, station data provided by Third Parties are protected.”
                • http://gpcc.dwd.de
    11. Non-disclosure agreements: real or excuse?
        • Informal agreements exist too
            • Especially with newly collected data provided in advance of its formal publication
            • These agreements with colleagues, and the consequences of breaching them, are genuine (regardless of what the ICO might decide if tested under FOI/EIR legislation!)
    12. Holding back data for future exploitation
        • Traditionally, climate data itself aren’t published
        • Instead, a journal article is published reporting findings arising from some analysis of the data
            • Provides a citable outcome for which the scientist gains credit
        • This could take many months to a few years
            • Because publishable findings may only arise from extensive analysis of the data or from a collection of multiple records
            • and it has to go through peer-review system
        • In the meantime, the data may have been shared and used under non-disclosure restrictions
    13. Ways forward…1
        • Providing data (and other materials) with a publication to allow it to be reproduced (or perhaps repeated)
            • E.g. supplementary online materials
            • Seen as a burden for all 13,000 climate change articles per year
                • Co-benefits must be evident to make this worthwhile
                • Citation and data re-use
            • Potential proliferation of identical (or perhaps not!) copies of datasets
                • Better to provide a unique identifier to existing data that have been used, rather than a copy of the data
    14. Ways forward…2
        • Data publication
            • Newly collected (observed, simulated, derived) datasets published in their own right, not as part of scientific paper
            • Meta-data and other accompanying information
                • But could shorten the lag from data collection to data publication, with much lighter-touch peer review
            • Citable (e.g. DOI) allows due credit
            • Identifiable (long-lasting URI) allows unique identification
                • Should be unique: updates or modifications to the data should have a separate unique identifier (how to link between versions is considered in our JISC ACRID project)
    15. Preferred data archives…1
        • Storing data with publisher, linked directly to article
            • Useful (not essential) for a strong link between article and data
            • Not ideal for long term preservation, large datasets, tools for exploring data, searches of databases etc.
            • Not ideal for re-use
        • University archiving possible, but similar disadvantages
        • Discipline-specific, dedicated data centres are preferable
            • E.g. World Data Center system ( http://www.icsu-wds.org/ )
            • WDC-Climate, WDC-Paleoclimate, BADC, BODC, ITRDB, CMIP5
    16. Preferred data archives…2
        • Sub-discipline specific archives superior to broader archives
            • More generalised approaches provide a steeper barrier for submission (e.g. describing all environmental data sets via one standard meta-data model – very large model, much to learn etc.)
            • Approaches tailored to sub-disciplines avoid irrelevant structures, formats, meta-data
            • Sometimes expertise is needed rather than extra meta-data
    17. Summary points
        • Improved data sharing and links to published findings are needed across the climate science community, to increase the pace of knowledge creation and to support the integrity of published work
        • New approaches to publishing newly constructed datasets should be encouraged and adopted where possible
            • Bringing benefits of citations, credit and unique identification
        • Published articles should identify data used, preferably via citation/identification of already published data rather than providing a further copy of the data
        • Subject-specific data archives are preferred, offering better support for data re-use
        • Other issues (non-disclosure agreements, time and resources) need to be considered – benefits must be clear to encourage them to be overcome
    18. Global warming issue: high stakes
        • Easy contexts for decision making:
            • Cost of reducing GHGs low, adverse impact of not doing so is high
            • Cost of reducing GHGs high, adverse impact of not doing so is low
        • Decision making in the actual context is much harder:
            • Significantly reducing GHGs may prove difficult with moderate to high costs
            • Net effects of not reducing GHGs are very uncertain and could range from fairly moderate to very severe adverse impact
    24. Time and resources
        • Must not mistake reluctance to commit time and resources for a desire to avoid disclosure
        • There is a real cost involved
            • Standards, meta-data and software increase the value in re-use, but can increase the time needed
        • The answer is not simply to obtain funding
            • Even with specific funding, unless the benefits of sharing data and meta-data are clear there will be pressure to do things with more obvious benefits
    25. Research Integrity Conference: The importance of good data management. Wellcome Collection Conference Centre, 13 September 2011
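
The "Ways forward" slides raise two practical questions: how to tell whether two copies of a dataset are identical or subtly different, and how to give each version of a dataset its own identifier while still linking it to the version it replaces. The following is a minimal Python sketch of one possible approach. It is not the method developed in the JISC ACRID project, which this transcript does not describe. It assumes that a SHA-256 checksum of the file contents is an acceptable test of identity. The `DatasetVersion` class, the `publish` helper and the `example:` identifiers are hypothetical, and the identifiers are not real DOIs or URIs.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a dataset's raw bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class DatasetVersion:
    identifier: str                 # hypothetical identifier, one per version
    sha256: str                     # fingerprint of the exact bytes published
    previous: Optional[str] = None  # identifier of the version this one replaces


def publish(identifier: str, data: bytes,
            previous: Optional[DatasetVersion] = None) -> DatasetVersion:
    """Record a new version; any change to the data gets a new identifier."""
    prev_id = previous.identifier if previous is not None else None
    return DatasetVersion(identifier, checksum(data), prev_id)


# Two versions of a tiny illustrative dataset (values are made up).
v1 = publish("example:precip-1.0", b"1901,54.2\n1902,61.0\n")
v2 = publish("example:precip-1.1", b"1901,54.2\n1902,60.8\n", previous=v1)

# A copy found elsewhere can be checked against the published fingerprint.
copy_matches_v1 = checksum(b"1901,54.2\n1902,61.0\n") == v1.sha256
versions_identical = v1.sha256 == v2.sha256

print(copy_matches_v1)     # True
print(versions_identical)  # False
print(v2.previous)         # example:precip-1.0
```

In this sketch an article would cite `example:precip-1.1` rather than attach another copy of the data, and a reader holding a copy could confirm which version it corresponds to by recomputing the checksum.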