Inverting the data pyramid: maximising the value of data reuse (IMCW2014/ICKM2014 keynote)
1. Inverting the Pyramid:
Maximising the value of
research data to society
Kevin Ashley
Digital Curation Centre
www.dcc.ac.uk
@kevingashley
Kevin.ashley@ed.ac.uk
Reusable with attribution: CC-BY
The DCC is supported by Jisc
2. My home – the DCC
• Mission – to
increase capability
and capacity for
research data
services in UK
institutions
• Not just a UK
problem – an
international one
• Training, shared
services, guidance,
policy, standards,
futures
2014-11-25 Kevin Ashley –IMCW/ICKM-2014, Antalya - CC-BY 2
3. DCC networks and partnerships
Original Slide:
Martin Donnelly,
DCC
4. About me
• 35 years ago – a mathematician in medical
research
• Acquired a skill for rescuing old data:
– Lost code books
– Lost programs
– Bad or obsolete media or systems
• It was fun – but it should not have been
necessary
5. My home – the DCC
• Mission – to
increase capability
and capacity for
research data
services in UK
institutions
• Not just a UK
problem – an
international one
• Training, shared
services, guidance,
policy, standards,
futures
6. Generic science data lifecycle
PLAN COLLECT INTEGRATE/
TRANSFORM
PUBLISH DISCOVER ARCHIVE/
DISCARD
Adapted from: “Harnessing the Power of Digital Data: Taking the Next Step.”
Scientific Data Management (SDM) for Government Agencies:
Report from the Workshop to Improve SDM.
10. Herve L’Hour’s analysis
• Data lifecycles are linear, cyclical or spiral
(sometimes all three)
• See more at
http://www.dcc.ac.uk/events/research-data-management-forum-rdmf/rdmf11 - workflows & research data management
• Linear cycles are project-based or repository-based
12. But in research…
"DIKW-diagram" by RobOnKnowledge - Own work. Licensed under
Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:DIKW-diagram.png#mediaviewer/File:DIKW-diagram.png
13. I ♥ your data!
I don’t ♥ what you said
about it.
14. LIDAR & RADAR images of ice cloud –
H. Russchenberg
15. The Old Weather project
Data for research, not from research
16. Data reuse stories
• The palaeontologist who saved years of work
with archaeological data
• The 19th-century ships logs that help us model
climate change
• The ‘noise’ from research radar that mapped
dust from Eyjafjallajökull
17. Data reuse - messages
Often your data tells
stories that your
publications do not
Not all data comes from
other researchers
Discipline-bounded data
discovery doesn’t give us
all we need or want
One person’s noise is
another person’s signal
18. Understanding Biodiversity
• We don’t understand what drives it
• What helps, hinders speciation
• No one project or data source is enough
• Biology, geology, climate science, chemistry…
• Big and small problems
• Reanalysis & gap analysis
19. Research on Biodiversity…
• Requires many different data sources
• Not all will be published
• Not all publications are for similar research
reasons, so…
• Citing the publication is irrelevant
• Some is research data, other government or
reference data
20. Why care?
• Data is expensive – an investment
• Reuse:
– More research
– Teaching & Learning
– Planning
• Impact – with or without publication
• Accountability
• Legal & regulatory requirements
21. Why does this matter?
• Research quality
– How close can we get to
the truth?
• Research speed
– How quickly can we get
to the truth?
• Research finance
– How much does the
truth cost?
• Improving one or more
of these is of interest to
all actors:
• Researchers as data
creators
• Researchers as data
reusers
• Research institutions
• Funders – hence
government and society
22. Creative data reuse
• http://vimeo.com/38402965
23. Integrity – not without data
• Cyril Burt
– Twin studies on intelligence.
– Questioned 1976; now discredited
• Duke case
– Data hiding leads to wasted treatments, clinical
trials, probable death & huge lawsuits
• Dutch cases
– Stapel – 55 publications – “fictitious data”
– Poldermans – fabricated data or negligence?
“The case for open data: the Duke Clinical Trials” – blog post, Kevin Ashley, http://www.dcc.ac.uk/news/case-open-data-duke-clinical-trials
“Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud?” – Doorn, Dillo, van Horik, IJDC 8(1); doi:10.2218/ijdc.v8i1.256
2014-01-08 Kevin Ashley – ESIP Winter 2014 - CC-BY 23
24. Without data reuse:
• We can waste billions
• People suffer & die
25. Data reuse from Hubble
26. Data reuse is already
happening – and
researchers can change
27. Where can it happen
Global, international
Nationally
By Subject Institution
Research Group
29. Research data centres are good value!
• See Jisc reports on ADS, BADC, UKDA:
• Returns on investment between 400% and
1200%
• Unfortunately – many research domains have
no relevant data centres
http://www.jisc.ac.uk/whatwedo/programmes/di_directions/strategicdirections/badc.aspx
30. “Provision for data management, for
curation and long-term preservation, and
for the sharing and re-use of data, varies
wildly between subject areas.”
“The data management needs of many
researchers are little considered or catered for.”
If greater provision is to be
made, a shortfall in
infrastructure (both technical
and human) must be
overcome.
Policy makers are aware that
in many areas of enquiry,
researchers’ access to well-managed,
open and reusable
data opens up significant
opportunities.
All from JISC MRD 2
call, 2010
33. The library as custodian
• Increasing role for library to provide access to
institutional assets
• See Lorcan Dempsey’s thoughts on the inside-out
library vs outside-in library
– http://www.slideshare.net/lisld/the-inside-out-library
• Build on library strengths – preservation,
access, curation, selection
34. G8UK - Endorses
OA
Open Data
Charter
Policy Paper
18 June 2013
35. Funder requirements
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx
UK - RCUK
Canada
UK - RCUK
USA – NSF, NEH, etc
Denmark
USA – non-government
funders (Sloan,
Gates,…)
Europe
36. RCUK policy - The 1-minute version
• Research data are a public good – make openly
available in timely & responsible way
• Have policies & plans. Data with long-term value
should be preserved & usable
• Metadata for discovery & reuse. Link publications &
data
• Sometimes law, ethics get in the way. We understand.
• Limited embargos OK. Recognition is important –
always cite data sources
• OK to use public money to do this. Do it efficiently.
37. EPSRC policy points
• Awareness of regulatory environment
• Data access statement
• Policies and processes
• Data storage
Compliance expected by 2015
• Structured metadata descriptions
• DOIs for data
• Securely preserved for a minimum of 10 years
from last use
38. DCC Policy Summary
http://www.dcc.ac.uk/resources/policy-and-legal
39. Helping make data reuse possible –
experience from the DCC
40. Some lessons – a summary
• Data reuse is rarely as simple as people think it is
• It is already happening
• It is good for research, for researchers, for funders, for
universities
• Without senior management attention and researcher
involvement, your initiative will fail
• Research data management services cannot involve the
library alone
• Researchers need to know your services exist
• Training for young researchers in good data practice is
valuable
41. DCC ‘institutional engagement’
Assess
needs
Make the case
Develop
support and
services
RDM policy
development
Customised Data
Management Plans
DAF & CARDIO
assessments
Guidance and
training
Workflow
assessment
DCC
support
team
Advocacy with senior
management
Institutional
data catalogues
Pilot RDM
tools
Original Slide:
Graham Pryor,
DCC
…and support policy implementation
42. Some institutional roles
• Leadership – coordinate action
• Audit – who has what, where does it go?
• Advice on access – data, wherever it is
• Preservation – permanence
• Citability
• Data/publication linking
• Promoting data in teaching
• Selection
• Education – early career researchers
43. Who (in the UK) is leading RDM work?
RESEARCHERS
Library
IT
Research
Office
45. Some example services
• Storage – persistent, shareable
• Permanent, citeable identifiers
• Database as a service (e.g. Oxford ORDS)
• Embed tools in Excel – Dataup, others
• Workflow management – Taverna
• Training for early career researchers
46. Make data creation easier
47. Make data citable
• Making data available increases citations
• Everyone – academic, funder, institution –
loves citations
• Want evidence?
– Alter, Pienta, Lyle – 240%, social sciences *
– Piwowar, Vision – 9% (microarray data)†
– Henneken, Accomazzi – 20% (astronomy) #
# Edwin Henneken, Alberto Accomazzi, (2011) Linking to Data - Effect on Citation Rates in Astronomy. http://arxiv.org/abs/1111.3618
* Amy Pienta, George Alter, Jared Lyle, (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data.
http://hdl.handle.net/2027.42/78307
† Piwowar H, Vision TJ. (2013) Data reuse & the open data citation advantage. PeerJ PrePrints 1:e1v1
http://dx.doi.org/10.7287/peerj.preprints.1v1
48. Make data discoverable
• Data must be discoverable to be reused
• Alone, or in conjunction with publication
• Services include:
– Institutional catalogues
– national data registries
– Repository registries – databib, re3data
49. Dataverse –
helping
researchers
make data
findable &
reusable
Gking.harvard.edu/data
51. Up-skilling for data
Choice of RDM training materials for librarians
http://dataintelligence.3tu.nl/en/home/
http://datalib.edina.ac.uk/mantra/libtraining.html
52. What data to keep
2014-05-14 Kevin Ashley – Eurocris2014 - CC-BY 52
53. The Data Deluge is upon us
Sensors’ ability
to produce data
outstrips IT’s
ability to
process it
54. Roles and
Responsibilities
What data to keep
55. IDCC15 – London, Feb 9-12 2015
The 10th
International
Digital
Curation
Conference
http://www.dcc.ac.uk/events/idcc15
56. My message to researchers
• The credit belongs to you
• The data belongs to all of us
• Share, and we all reap the
benefits
• The story doesn’t end with a
publication
Editor's Notes
I’m Kevin Ashley; I run an organisation called the Digital Curation Centre (DCC) in the UK, and I’ve been invited here today to talk about research data management.
My home – the DCC – is a national service whose role is to increase the capability and capacity of UK research institutions – mainly universities – to run their own research data services. Where it makes sense, we also run some national services which those universities use. Because this is not just a national problem, we work alongside many partners and colleagues around the world.
This collection of logos illustrates many, though not all, of the partnerships we have been or are involved with.
But there’s something about me you should know as well. 35 years ago I began my first job, a mathematician supporting clinical researchers in a large research institution. I acquired many skills there, learning from older researchers and other staff who were happy to pass their knowledge on to me. In particular, I got a reputation as someone who was good at rescuing old data that might otherwise be lost. Lost because coding systems had been forgotten, because the programs required had been lost or no longer worked, or because the media or systems involved were now obsolete. It was great fun – technical detective work. But even then I knew something was wrong. Much of this data was irreplaceable; some involved experiments on human subjects that had involved considerable suffering. Data like that should be capable of being used more than once.
So, via many other jobs involving HPC, networks and digital archives, I ended up running the DCC.
We provide training, shared services, guidance, policy, we develop standards and we look at future possible directions in this area.
But the strapline – the phrase at the heading of our web pages – bears closer examination. “Because good research needs good data” is behind all of what we do and much of what I’ll say today. The data we use isn’t always ours, but it always needs to be good.
There are many possible views of how research takes place. This generic one is typical – a plan leads to data collection, some type of transformation, publication, discovery by others of your results, and a later decision to keep or discard the data. That last step rarely appears – it is only in this diagram because it was written by people whose main concern is taking care of scientific data.
Some views are more complex. This, from an e-science curation report in 2003, is not quite as linear, but it still involves the same basic processes. It is notable for its description of the eventual consumers – not just other researchers, but also industry and the public.
And some are just too complicated to be of any use at all.
This is from my own organisation. You will notice that it is a loop – where data, once produced and retained, can go on to inform other research.
Herve L’Hour analysed many research lifecycles and gave a presentation earlier this year about them – I encourage you to read it to learn more. He found that lifecycles are linear, cyclical (like the DCC’s) or spiral, or sometimes a hybrid of all three. Linear cycles tend to be produced by those with a project view or a repository view – where the repository is the final step in acquisition of data from a researcher. Yet it isn’t the final step – we’ll see later why not.
We have similar linear views in the world of knowledge management. The pyramids have different levels, but they all view data as a large, raw, underlying substrate which is successively refined to produce insight of some form. Again, these diagrams encourage the view that data is used once to produce a few shining nuggets of wisdom. Yet knowledge managers know better. Your skills are directed at getting many insights out of large data collections – but diagrams such as this don’t reflect that. I worry because these diagrams can affect our thinking about problems and strategies.
In research we don’t want our insight to be an end point. We hope that it leads others to new investigations – if nothing else it will give us citations! But it isn’t always our publications, where insights are recorded, that are the target of reuse. Sometimes it is our data.
You see, sometimes our knowledge is irrelevant. What someone else wants is our data, and information about how we collected it. They will learn something completely new from it. What we knew does not help them.
Here are some examples of data reuse. The last comes from researchers at TU Delft in the Netherlands. They use a technology called LIDAR to measure raindrops and ice crystals in clouds. They are interested in knowing more about how rain, snow and hail form. Their detectors produce many gigabytes of data an hour and most of it is not of interest to them. They filter the data to discard the noise and retain what they see as signal. But in early 2010 a volcano in Iceland caused severe disruption to air travel in Europe. One problem was that no one could be really sure where the dust from the volcano had travelled to. These researchers realised that the data they were discarding might be able to give answers – and they were right. The raw data was rescued just before it was due to be deleted, 2 weeks after collection.
The Old Weather project is using crowdsourcing to capture data from digitised ships’ logs from the 19th century and earlier. Because the logs contain precise recordings of temperature, wind speed and so on as well as accurate location information, this provides us with a comprehensive picture of weather around the world in the 19th century, information which we can use to calibrate today’s climate models. Before this, the logs were used by economic historians studying trade patterns. And finally, the crowd-sourcers are also capturing data about the names of people who left and joined the ships, data that is of interest to family historians. That is 3 separate uses for this original material, which wasn’t even created for research purposes.
And there are many other tales, including the palaeontologist – someone looking at dinosaur bones – who found what he needed to know in an archaeological data archive.
So there are four lessons I would draw from these examples. Our data can tell stories that our publications do not – think of the archaeological data, which also tells us that discipline-bounded data discovery – chemical data for chemists, or even neurochemical data for neurochemists – does not always meet our needs. The Old Weather project reminds us that not all data used by researchers was produced for researchers, and the final example tells us that signal and noise are just two views of the same thing.
There are bigger questions that data reuse can answer. Biodiversity is a big area of research. Although we believe it is a good thing, we don’t understand enough about what causes it. Simplistic views – such as the idea that hot, wet environments promote biodiversity – are known not to be universally true. Trying to understand biodiversity can’t be done by the data from one project or one discipline. We need data from many sources to investigate potential theories and data from projects that are large and small, from examinations of species in a single pond to studies on a global scale. When we integrate all this data, we may find gaps – reasons to collect data to fill them.
Not all of that data will have led to publications. Even where it did, the publication may have little to do with our interest in biodiversity. We need to be able to cite this data as data, not cite the (possibly non-existent) publication it originally produced. Not all of that data will come from other researchers – some will be government data, some reference data, some from a commercial environment such as agriculture and forestry.
Kevin Ashley, DCC, UKSG Glasgow. CC-BY
For an audience such as this, I shouldn’t have to explain why data reuse is important. But just in case, and to explain why some things have happened the way they have, I’ll describe some of the drivers.
Ensuring that all research data is discoverable and reusable increases the quality of the research that we do. It can add to the data we collect ourselves and can improve the statistical rigour of our results. Exposing data to scrutiny makes it more straightforward to validate or challenge the findings of others.
Making data available also improves the speed with which we can do research. If someone else has already gathered the data we need (perhaps for a different end use), we can move directly to the analysis stage of our work, saving both time and money.
And saving money increases the efficiency of research. We hope that the money saved lets us do more research, but even if it doesn’t, society as a whole will gain. There’s evidence behind this that I’ll come to later, but it is an effective counter to those in some universities who feel that increasing funder requirements for data management simply leads to additional costs with no gain. There is a gain in all these areas, and hence every one of the actors – researchers, their employers, their funders, and society – should be motivated to make this happen.
For an example of creative data re-use in a teaching context, see the work of globe4D. A simple device allows us to visualise data about the earth, asking what-if questions about changing sea-levels and temperatures. But we can move time back and forth as well, looking at the continents as they were 200 million years ago and asking the same what-if questions then. When we’re bored with the Earth, we can do the same things with Mars or Venus – what would Mars look like with oceans of the same volume as Earth? Again, this requires integrating data from many open sources with some simple technology (and some very good visualisation). It creates a tool which allows us to ask deep questions easily and quickly see the answers; from my own experience, it is capable of turning a group of adults into children with ease. This is a good thing – we rediscover curiosity and enthusiasm. But it’s also a great teaching tool for children, if the adults get out of the way for long enough!
These are just a few examples, some of outright fraud and others of simply dodgy research, all of which would have been uncovered far more quickly had the data been made routinely available. The Duke case in particular roused the suspicions of many in the field but took many years to get to the bottom of because data was locked away. It is just one example of a set of practices described very clearly by Ben Goldacre in ‘Bad Pharma’. Missing data is the largest section in his book, although he has other justified concerns with research relating to medical treatments. It has led to a global movement to ensure that all clinical trial data is made available.
But medicine is by no means the only area affected.
So there’s one powerful argument for exposing the existence of data and enabling re-use – without doing so, we waste billions on ineffective treatments, and people suffer and die. I could just stop here, but I won’t. Other arguments are available.
Many of you may be familiar with this graph from the Hubble Space telescope data archive. It tells the same story in a different way, and also tells a story about the transformation of astronomy as a discipline. In the days of photographic plates, sharing (analogue) astronomical data was difficult. Digital instruments transformed this, and some time around 2000, more research was being done with old data than with new data.
Which leads to our second lesson. Some people say data sharing and reuse is a difficult change for researchers. In some disciplines, it is. But many have been doing it for some time, and those that have changed have benefited as a result.
Research happens at many different scales – internationally, nationally, in small groups and many scales in between. Taking care of data, curating it, needs to reflect all these scales. We have existing examples at every one of them.
We know that these data centres are good value – this study by Jisc shows that the return on investment they generate is between 400% and 1200% - rates of return that would make them very valuable in the commercial world. But the benefit generated is for society as a whole, not a set of shareholders. And worse, many areas of research don’t have data centres to cater for them.
So we have a position where some of the infrastructure exists to enable data sharing, but not all. It is good in some domains of research and not others; good in some countries, some universities, and not others. This was part of the motivation for Jisc’s Managing Research Data programme in 2010 onwards – some selective quotes from the call are here. We see recognition from policy makers of the value of data reuse; that provision is unequal across subject areas; that many researchers are poorly catered for; and that infrastructure needs to be created. That infrastructure is not just technical. The human element – training, skills, changing attitudes – is equally important.
… and that means that, whether you think a library should look like this….
… or like this….
… there is a role for the library to play, in providing access to and caring for institutional assets of all types, including research data. This fits with Lorcan Dempsey’s view of the inside-out library – a shift from a library whose role is to acquire material from outside for the benefit of those inside, to one that showcases what is produced inside to achieve impact outside. Providing services around research data builds on the traditional strengths of libraries and librarians – preservation, access, curation, selection, as well as good researcher relationships.
Governments around the world recognise this, along with the value of public data. This statement suddenly made RDM something that government ministers cared about – not something I thought I would see in my lifetime.
But this is happening in many other countries. The USA was another early adopter, and in Europe Horizon2020 has increasingly strict rules about data sharing. Denmark, Canada and others are also acting. Most policies place the burden of compliance on the researcher – some on the organisation where the research takes place. Typically we are seeing policies about open access to data come a few years after policies about open access to publications, though the gap is narrowing. And large non-government funders are also beginning to act, with the Wellcome Trust leading the way in medical research.
RCUK is an umbrella body for government research funders in the UK. It has a set of general principles, summarised here, about data from research that it funds. They are not onerous, and read like common sense.
EPSRC, which funds engineering and physical sciences, interprets these requirements and chooses to make requirements of the university, not the individual researcher. It sees the duty of the university as being to assist the researcher to share data – by being able to preserve it securely, to expose metadata about it, to provide permanent identifiers for data, and so on.
I give these examples as illustrations of how funders approach this, as it affects how universities and researchers respond. If you are interested in more details about policy, the DCC has a series of web pages describing research data policies from around the world.
Enough about the case for data reuse. Enough people in the UK were convinced by it that the DCC received additional funding to work with universities to accelerate the development of research data services. I’ll summarise some lessons that we learnt from this experience.
They are summarised here. Some I’ve covered already; some are worth noting before you begin to do something in your own university. Senior management commitment, from more than one individual, is necessary to sustain change. So is the involvement of your researchers from the outset. Don’t develop policy or design services without them.
Although the library is a key player in research data curation, it cannot act alone – other service providers within the university must be involved. Whatever you do develop, awareness-raising is key and needs to be repeated regularly. Researchers need to know what is available to them to make use of it.
The DCC is now 10 years old, but the lessons I will speak of are informed by work we began in 2011 to put much of our guidance into practice. We worked closely initially with 20 universities, to help them establish research data management (RDM) services. We transferred what we learnt from doing that to other universities in order to build up capacity nationally in the UK. Naturally, we learnt from others as well and we hope to pass on that knowledge outside the UK also.
We called this work ‘institutional engagement’. We behaved much like consultants, and the work we did depended on what was most needed in a particular organisation. This diagram, produced by my past colleague Graham Pryor, illustrates the range of activities involved, beginning with advocacy – helping to make the case for doing work – to establishing particular services and developing policy.
Different universities will choose to organise RDM services in different ways. But there are some common roles which can be identified. It is useful to look at this list and decide which of these roles could be taken by your library. This list is not exhaustive and I won’t have time to cover all these roles today. I mention here two tools which help with two of these activities. CARDIO helps with needs assessment, and deciding what actions to take next to establish RDM services. Where are your existing strengths & weaknesses? CARDIO helps to answer these questions, and helps you assess progress in future years. DAF helps to answer the simple questions “What have we got already? Where is it? Who is responsible for it?”
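The three DAF questions can be pictured as a very small inventory. What follows is only an illustrative sketch in Python, not the DAF methodology or any DCC tool: the `DataAsset` fields, the `audit_gaps` helper and the sample entries are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataAsset:
    """One entry in a hypothetical institutional data inventory."""
    name: str                 # What have we got?
    location: Optional[str]   # Where is it? (None if nobody knows)
    owner: Optional[str]      # Who is responsible for it? (None if nobody is)


def audit_gaps(assets: List[DataAsset]) -> List[str]:
    """Return the names of assets whose location or owner is not recorded."""
    return [a.name for a in assets if not a.location or not a.owner]


# Invented sample entries, for illustration only.
inventory = [
    DataAsset("Cloud radar raw output", "departmental file server", "Atmospheric group"),
    DataAsset("Clinical trial code books", None, "Medical statistics unit"),
    DataAsset("Field survey spreadsheets", "personal laptop", None),
]

for name in audit_gaps(inventory):
    print(f"Needs attention: {name}")
```

Even a list this simple makes the gaps visible: any asset with no known location or no responsible person is a candidate for the kind of data rescue described earlier.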
But we can begin with leadership. This diagram shows who is taking the lead role in defining RDM services within UK universities.
The library is leading in most cases and is involved regardless of who’s championing the cause.
Research offices are often the lead partner – seemingly for strategic reasons of senior buy-in and financial commitment.
IT are only leading in 2 out of the 20 cases and are disengaged / absent in a few others. Researchers are always involved, but are never the lead.
So what services can be provided?
These are some examples – I will speak about a few of them.
DataUp is a plugin for Microsoft Excel, developed by the California Digital Library in collaboration with Microsoft. Many people say serious researchers should not use Excel for data analysis. Others recognise that researchers will do it whatever we say, so the best thing to do is to make Excel a better tool.
Making data citable is a simple service that brings great benefits. Here are links to three studies that show very positive effects that arise from making data available, citable and connected to papers. You can use a repository to provide identifiers such as Handles or work with an organisation like DataCite, which can help you provide DOIs.
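To make the idea concrete, here is a minimal Python sketch of how a repository might turn a few descriptive fields and a DOI into a citation string. It assumes a common creator, year, title, publisher, identifier pattern; the `DatasetRecord` class, the `format_citation` function and the sample record are hypothetical, and the DOI shown is a placeholder rather than a registered identifier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Minimal descriptive metadata for one dataset (illustrative fields only)."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Build a citation: Creators (Year): Title. Publisher. DOI link."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Hypothetical record; the DOI is a placeholder, not a real identifier.
example = DatasetRecord(
    creators=["Researcher, A.", "Colleague, B."],
    year=2014,
    title="Cloud radar observations",
    publisher="Example University Data Repository",
    doi="10.1234/example.5678",
)

print(format_citation(example))
```

The point of the sketch is that the citation is generated from the same metadata used for discovery, so the data can be cited as data even when no publication exists.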
But as we saw at the beginning of my talk, the data must be discoverable – we cannot assume that people will find our data via publications. An institutional repository can help; experience in Australia shows that a national service which aggregates metadata about datasets can have a much greater impact. Hence we are copying their approach in the UK. We are aware of similar initiatives in a few other countries.
Harvard, with NSF funding, developed a service called Dataverse, which is now the basis of a national service in the Netherlands. It makes it easy for researchers to upload and describe their data and to update the description over time. This example has one interesting feature which shows how researcher behaviour can change – this page shows that the data was made available before the associated paper was published. The author thus got reaction and publicity for their work in advance of publication.
We used our guidance documents extensively in this work, and produced case studies and new guidance as a result. It includes material you might find useful, such as training materials on RDM for librarians. I have a small number of copies of our documents with me today, but they are all freely available from our website, or you can purchase print copies. You can also adapt them, translate them – all use Creative Commons Attribution licences.
One service area I mentioned is training for early career researchers – here are some examples of freely-available training materials and online courses, accompanied by a course aimed at librarians. Why not work with others to adapt and improve these training materials for use in Spain?
Some guidance is aimed directly at researchers – here are two examples on data citation and writing data management plans. We have also produced a freely-available tool for writing such plans (DMPonline) but the guide doesn’t assume that you are using it. Other tools exist – DMPTool from a USA-based consortium is the best known.
Getting better at managing research data isn’t just about keeping more stuff for longer. It’s also about being more selective about what we do keep and documenting the decision-making process that we use. Reports such as this make clear that technological advances mean that the cost of producing data is dropping more rapidly than the cost of retention. Some arguments show that if we attempt to retain everything it won’t be long before we’re spending the entire GDP purely on data storage. That’s an extreme analysis, but the problem is real, as CERN know well. In some disciplines it really is wiser to just generate the data again when it is needed. But for many observational disciplines, that opportunity isn’t open to us.
This guide – now accompanied by a simple checklist – is particularly relevant. It is about what archivists call appraisal and what librarians often call selection.
I’ll pause for a brief advert for our conference next year. If you want the chance to take part in far more in-depth discussions about these issues, do register to attend.
And I end with the message I give to researchers about their data – accompanied by a similar message from the 3TU Datacentrum in the Netherlands, a collaboration between 3 universities. The credit for your data belongs to you, the researcher – but the data belongs to all of us and should be shared.