My data, your data, our data - increasing data value through reuse (Eurocris2014 keynote)
My keynote talk for Eurocris2014, Rome. I make the case for reuse of research data, discuss the barriers and look at ways we are trying to overcome them.

  • This is a keynote delivered at CRIS2014 in Rome, 2014-05-14
  • This is an outline of what I’ll be talking about. I hope to persuade you of the value of data reuse and, having done that, to examine why we sometimes find it difficult. I’ll look at how we can overcome these barriers and what you as research administrators can do. I’ll then return to my opening themes in reverse order. Returning to the themes in this way means that this talk follows sonata form, the classic structure of the first movement of a symphony. I don’t claim my talk will be as beautiful, but I do hope it will be enlightening.
  • I’m from an organisation in the UK called the Digital Curation Centre. We are funded to help UK universities improve their own research data management practices. But that task is international in nature, so we do much of our work in collaboration with others outside the UK.
  • Data curation is an odd term, combining vocabulary from the world of museums and science. It doesn’t always translate well to languages other than English. So this is an attempt to define what it means. It’s about preservation, but it’s more than that. It involves destruction and it involves adding value.
  • There are many such stories of unexpected data reuse; these are a few examples. The last, exemplified in the Old Weather project, is seeing the original data being reused for at least the third time and in doing so is helping both climatologists and family historians through a single piece of transcription work. An impressive result.
  • For an audience such as this, I shouldn’t have to explain why data reuse is important. But just in case, and to explain why some things have happened the way they have, I’ll describe some of the drivers. Ensuring that all research data is discoverable and reusable increases the quality of the research that we do. It can add to the data we collect ourselves and can improve the statistical rigour of our results. Exposing data to scrutiny makes it more straightforward to validate or challenge the findings of others. Making data available also improves the speed with which we can do research. If someone else has already gathered the data we need (perhaps for a different end use), we can move directly to the analysis stage of our work, saving both time and money. And saving money increases the efficiency of research. We hope that the money saved lets us do more research, but even if it doesn’t, society as a whole will gain. There’s evidence behind this that I’ll come to later, but it is an effective counter to those in some universities who feel that increasing funder requirements for data management simply lead to additional costs with no gain. There is a gain in all these areas, and hence every one of the actors – researchers, their employers, their funders, and society – should be motivated to make this happen.
  • Much as I enjoy the JIR, it isn’t the publication most of us aim for. But it brings home one compelling argument for making data available, that of research integrity. Almost all fraud, and other less clear-cut cases of bad research, can be associated with the unavailability of research data. There are real consequences, including human suffering and death – of which more later. And did I mention that making your data available makes it more likely to be cited? Don’t worry, I will again.
  • These are just a few examples, some of outright fraud and others of simply dodgy research, all of which would have been uncovered far more quickly had the data been made routinely available. The Duke case in particular roused the suspicions of many in the field but took many years to get to the bottom of because data was locked away. It is just one example of a set of practices described very clearly by Ben Goldacre in ‘Bad Pharma’. Missing data is the largest section in his book, although he has other justified concerns with research relating to medical treatments. It has led to a global movement to ensure that all clinical trial data is made available. But medicine is by no means the only area affected.
  • Did I mention that making data available increases citations? This is a win all round. If you don’t believe me, here are three studies from three different areas that all show robust, positive correlations. The effect size varies with discipline, but we have enough evidence now that anyone who says that their area is different needs to come up with evidence to show why.
  • Getting better at managing research data isn’t just about keeping more stuff for longer. It’s also about being more selective about what we do keep and documenting the decision-making process that we use. Reports such as this make clear that technological advances mean that the cost of producing data is dropping more rapidly than the cost of retention. Some arguments show that if we attempt to retain everything it won’t be long before we’re spending the entire GDP purely on data storage. That’s an extreme analysis, but the problem is real, as CERN know well. In some disciplines it really is wiser to just generate the data again when it is needed. But for many observational disciplines, that opportunity isn’t open to us.
  • Yet some researchers still aren’t convinced by the rhetoric. Carly Strasser at CDL has listed some of the reasons for not sharing data that she’s encountered – and here are some of my one-line responses. I’m not saying that the concerns aren’t sincere or reasonable but they can all be dealt with and some are positively misguided. The purpose of data centres, for instance, is to make data independently reusable (as stated in the OAIS standard) which relieves researchers of the burden of dealing with questions about it, at the same time as increasing the likelihood that their data will be cited.
  • Medicine does, however, provide some clear reasons why we can’t just stick all research data on the internet for anyone to trawl through. When human subjects are involved there are real concerns about confidentiality. Yet what alltrials.net and other initiatives make clear is that the *existence* of the data should never be hidden. That allows it to be discovered and for negotiations to take place about its use. It avoids costly replication, which can delay scientific discovery and involve human suffering when the replication takes the form of a clinical trial.
  • Many of you may be familiar with this graph from the Hubble Space telescope data archive. It tells the same story in a different way, and also tells a story about the transformation of astronomy as a discipline. In the days of photographic plates, sharing (analogue) astronomical data was difficult. Digital instruments transformed this, and some time around 2000, more research was being done with old data than with new data. I could be more specific about this if the data behind this graph was made available, incidentally!

Presentation Transcript

  • My Data, Our Data, Your Data: data reuse through data management
    Kevin Ashley, Digital Curation Centre
    www.dcc.ac.uk, @kevingashley, Kevin.ashley@ed.ac.uk
    Reusable with attribution: CC-BY
    The DCC is supported by Jisc
  • A summary
    • Why data reuse?
    • What stops us?
    • How data management helps
    • Harmonising the goals of research administration and research
    • Barriers again
    • The case for reuse - again
    2014-05-14 Kevin Ashley – Eurocris2014 - CC-BY
  • My home – the DCC
    • Mission – to increase capability and capacity for research data services in UK institutions
    • Not just a UK problem – an international one
    • Training, shared services, guidance, policy, standards, futures
  • What is data curation?
    • “Maintaining, preserving and adding value to research data throughout its lifecycle”
    • More than preservation:
      – Active management – dealing with change
    • Less than preservation:
      – Lifecycle sometimes involves destruction
  • DCC guidance
  • SWEDEN DENMARK CANADA
  • Data reuse stories
    • The palaeontologist who saved years of work with archaeological data
  • What a palaeontologist looks at
    [Timeline: now to 100 million years ago, marked at 1m, 25m, 50m and 75m]
  • What a palaeontologist looks at
    [Timeline: now to 100 million years ago, with the most recent 1 million years expanded and marked at 100,000, 500,000 and 750,000]
  • What an archaeologist looks at
    [Timeline: the most recent 1 million years, with the most recent 100,000 years expanded and marked at 25,000, 50,000 and 75,000]
  • Data reuse stories
    • The palaeontologist who saved years of work with archaeological data
    • The 19th-century ships’ logs that help us model climate change
  • The Old Weather project
    Data for research, not from research
  • Data reuse stories
    • The palaeontologist who saved years of work with archaeological data
    • The 19th-century ships’ logs that help us model climate change
    • The ‘noise’ from research radar that mapped dust from Eyjafjallajökull
  • Data reuse - messages
    • Often your data tells stories that your publications do not
    • Not all data comes from other researchers
    • One person’s noise is another person’s signal
    • Discipline-bounded data discovery doesn’t give us all we need or want
  • Why care?
    • Data is expensive – an investment
    • Reuse:
      – More research
      – Teaching & Learning
      – Planning
    • Impact – with or without publication
    • Accountability
    • Legal & regulatory requirements
  • Why does this matter?
    • Research quality – How close can we get to the truth?
    • Research speed – How quickly can we get to the truth?
    • Research finance – How much does the truth cost?
    • Improving one or more of these is of interest to all actors:
      – Researchers as data creators
      – Researchers as data reusers
      – Research institutions
      – Funders – hence government and society
  • G8UK - Endorses OA; Open Data Charter; Policy Paper, 18 June 2013
  • Funder requirements
    • UK
    • USA – NSF, NEH, NIH
    • Europe
    • Most place burden on researcher – some on the institution
    http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx
  • RCUK policy - The 1-minute version
    • Research data are a public good – make openly available in timely & responsible way
    • Have policies & plans. Data with long-term value should be preserved & usable
    • Metadata for discovery & reuse. Link publications & data
    • Sometimes law, ethics get in the way. We understand.
    • Limited embargos OK. Recognition is important – always cite data sources
    • OK to use public money to do this. Do it efficiently.
  • EPSRC policy points (compliance expected by 2015)
    • Awareness of regulatory environment
    • Data access statement
    • Policies and processes
    • Data storage
    • Structured metadata descriptions
    • DOIs for data
    • Securely preserved for a minimum of 10 years from last use
  • DCC Policy Summary
    http://www.dcc.ac.uk/resources/policy-and-legal
  • Findable, citable data has value
    • Important to link publications to data (and vice versa)
    • Increases citations – of data & publication
    • Increases reuse (hence value)
    • But effects exist even without publication, if data is:
      – Archived
      – Citable
      – Discoverable
    MORAL: build a data registry
  • What stops data reuse
    • Loss
    • Destruction
    • Pride
    • Gluttony
    • Ineptitude
    • Concealment
    • Bureaucracy
    • Complexity
    • Procrastination
    • Lack of potential
  • “Departments don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss”
    Incremental Project Report, June 2010
    http://www.flickr.com/photos/mattimattila/3003324844/
  • How people talk about data
    • I put my data in figshare and I got a DOI for it
    • Not our data; the university’s data; my funder’s data; the data; the people’s data; your data.
  • Data ownership – it’s messy
    • You need ownership to make data free
    • Governments may assert this
    • Industrial collaborators – understanding role of public funding
    • Research admin tracks the rules
  • ON METADATA
  • Disciplines – current state
    • Typically specialised
    • Focussed on discipline-specific concerns
    • Frequently embedded – hence processing required to expose independently
    • Historic failure to express generic concepts generically
      – Place
      – Time
  • Understanding Data Requirements
    http://www.dcc.ac.uk/
  • Data centres are good value!
    • See Jisc reports on ADS, BADC, UKDA:
    • Returns on investment between 400% and 1200%
  • Integrity
    • Not everyone publishes here
    • Almost all fraud connected to unavailable data
    • People suffer & die due to research fraud
    • When your research is reproducible – it gets cited
  • Integrity – not without data
    • Cyril Burt
      – Twin studies on intelligence.
      – Questioned 1976; now discredited
    • Duke case
      – Data hiding leads to wasted treatments, clinical trials, probable death & huge lawsuits
    • Dutch cases
      – Stapel – 55 publications – “fictitious data”
      – Poldermans – fabricated data or negligence?
    “The case for open data: the Duke Clinical Trials” – blog post, Kevin Ashley, http://www.dcc.ac.uk/news/case-open-data-duke-clinical-trials
    “Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud?” – Doorn, Dillo, van Horik, IJDC 8(1); doi:10.2218/ijdc.v8i1.256
  • Citability
    • Making data available increases citations
    • Everyone – academic, funder, institution – loves citations
    • Want evidence?
      – Alter, Pienta, Lyle – 240%, social sciences *
      – Piwowar, Vision – 9% (microarray data) †
      – Henneken, Accomazzi – 20% (astronomy) #
    * Amy Pienta, George Alter, Jared Lyle, (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. http://hdl.handle.net/2027.42/78307
    † Piwowar H, Vision TJ. (2013) Data reuse & the open data citation advantage. PeerJ PrePrints 1:e1v1 http://dx.doi.org/10.7287/peerj.preprints.1v1
    # Edwin Henneken, Alberto Accomazzi, (2011) Linking to Data - Effect on Citation Rates in Astronomy. http://arxiv.org/abs/1111.3618
  • How to cite data; What data to keep
  • The Data Deluge is upon us
    Sensor’s ability to produce data outstrips IT’s ability to process it
  • Roles and Responsibilities; What data to keep
  • Excuses – and responses
    • “People will ask questions” – So use a data centre or repository
    • “It will be misinterpreted” – Stuff happens. Also, openness encourages correction
    • “It’s not interesting” – Let others be the judge – your noise is my signal
    • “I might get another paper out of it” – Up to a point. We might get more research out of it
    • “I don’t have permission” – A real problem. But solvable at senior level
    • “It’s too bad/complicated” – see above
    • “It’s not a priority” – Unfortunately, funders are making it so. But if you looked at the evidence, it would be your priority as well
    See e.g. Carly Strasser’s blog: http://datapub.cdlib.org/2013/04/24/closed-data-excuses-excuses/
  • Should all data be open?
    • NO
    • Many reasons – most to do with human subjects
    • But data existence should always be open
    • Allows discovery & negotiation on use
    • Avoids pointless replication
  • Some conundrums
    • Releasing genome data is OK when it’s:
      – An identified human subject
      – An anonymous human subject
      – Your pet dog
      – Another mammal
      – An insect
      – A plant
      – A virus
  • It’s amazing what people will share…
  • Data reuse from Hubble
  • Pimp your data – make it findable & reusable 2014-04-25 Kevin Ashley, DCC – SocSciScot14 - CC-BY 49 Gking.harvard.edu/data
  • Data is variable
    • Not always textual
    • Not always tabular
    • Not always fixed – continual change
    • Not always clearly authored – think of archival provenance
    • Not always associated with publication
    • Often with indistinct boundaries
    • Multi-dimensional and non-linear
  • Some messages for you
    • Some things we need to know about data:
      – When/where/what is it about?
      – Who owns it
      – What rights apply
      – What it is derived from & how
      – What software may be associated
      – What data management plan applies
      – How do I gain access?
      – Where is it?
      – When was/will it be destroyed?
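
The questions on this slide map naturally onto a registry record. What follows is only a minimal sketch of that idea in Python: the class name `DatasetRecord`, every field name and the helper `unanswered` are hypothetical illustrations, not taken from the talk or from any metadata standard. The sketch also cannot tell "unknown" apart from "not applicable" (for example, data that is not derived from anything).

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical registry entry; one field per question on the slide."""
    title: str
    about: Optional[str] = None              # what is it about?
    place: Optional[str] = None              # where is it about?
    period: Optional[str] = None             # when is it about?
    owner: Optional[str] = None              # who owns it
    rights: Optional[str] = None             # what rights apply
    derived_from: List[str] = field(default_factory=list)  # source datasets
    derivation_method: Optional[str] = None  # how it was derived
    software: List[str] = field(default_factory=list)      # associated software
    management_plan: Optional[str] = None    # which data management plan applies
    access_procedure: Optional[str] = None   # how do I gain access?
    location: Optional[str] = None           # where is it?
    destruction_date: Optional[date] = None  # when was/will it be destroyed?


def unanswered(record: DatasetRecord) -> List[str]:
    """Return the names of fields that are still empty for this record."""
    empty_values = (None, [], "")
    return [f.name for f in fields(record)
            if getattr(record, f.name) in empty_values]


if __name__ == "__main__":
    # Illustrative values only; none of these describe a real dataset.
    example = DatasetRecord(
        title="Ship log transcriptions",
        owner="Example University",
        rights="CC-BY",
        location="https://repository.example.org/datasets/1",
    )
    print(unanswered(example))
```

A report of unanswered fields like this is one way an administrator could see which of the slide's questions a given record still leaves open.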
  • What about your data?
    • If administrative data isn’t freely available, why not?
    • Expose it in bulk – not just as a web page
    • Gain the value from your overheads!
  • What about collaboration?
    • Collaborate within the university
    • Collaborate with partners
    • Collaborate with regional, national services
    • Not everything can be done well locally
    • Some examples…
  • Up-skilling for data: choice of RDM training materials for librarians
    • http://dataintelligence.3tu.nl/en/home/
    • http://datalib.edina.ac.uk/mantra/libtraining.html
  • My message to researchers
    • The credit belongs to you
    • The data belongs to all of us
    • Share, and we all reap the benefits