Use and reuse: research data locally & globally #esipfed
1. USE AND REUSE
Research data locally and globally
Kevin Ashley
Digital Curation Centre
www.dcc.ac.uk
@kevingashley
kevin.ashley@ed.ac.uk
Reusable with attribution: CC-BY
The DCC is supported by Jisc & FP7
2. Why does this matter?
• Research quality – how close can we get to the truth?
• Research speed – how quickly can we get to the truth?
• Research finance – how much does the truth cost?
• Improving one or more of these is of interest to all actors:
– Researchers as data creators
– Researchers as data reusers
– Research institutions
– Funders – hence government and society
2014-01-08
Kevin Ashley – ESIP Winter 2014 - CC-BY
3. The Data Deluge is upon us
Sensors' ability to produce data outstrips IT's ability to process it
4. Funders are making demands
5. EPSRC expects all those institutions it funds to develop a roadmap that aligns … with EPSRC's expectations by 1st May 2012; and to be fully compliant … by 1st May 2015.
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
6. EPSRC expectations include:
• Awareness of regulatory environment
• Data access statement
• Policies and processes
• Data storage
• Structured metadata descriptions
• DOIs for data
• Securely preserved for a minimum of 10 years from last use
7. Where are funders making demands?
• USA – NSF, NEH, some philanthropic funders
• UK
• Germany – DFG
• Europe – European Commission (H2020)
Often tied to requirements on open access to research publications – though data requirements are not yet as common.
8. To universities, that looks like a problem
• Funder requirements exist for a reason: that data is valuable
– Value to funder and society from reuse
– Value to the institution as well
BIS business case: a £1.5m investment in research data services pays back 2.5 times after 5 years
9. Research Data Centres – the solution!
Many areas of research have no data centre to serve them
10. Data centres deliver value
Want a 400% to 1200% return on your investment? Try BADC!
http://www.jisc.ac.uk/whatwedo/programmes/di_directions/strategicdirections/badc.aspx
11. Data reuse from Hubble
14. Cloud – sorted!
• Sorry, but it isn't.
• High-use datasets and the long tail present different economic and technical challenges
• See David Rosenthal's analysis of the economics of Amazon for preservation: "Distributed digital preservation in the cloud", IJDC 8(1), 2013, doi:10.2218/ijdc.v8i1.248
17. National responses – supporting universities
• USA – NSF initiatives (DataONE, SEAD, Data Conservancy et al.)
• Australia – ANDS, RDSI
• UK – DCC, Jisc 'Managing Research Data' programmes
• Netherlands – Research Data Netherlands
• Canada – Research Data Canada
• Also grassroots or funder-led work in Finland, Denmark, Germany
18. UK- Jisc acts through DCC to help
19. DCC 'institutional engagement'
The DCC support team helps institutions to:
• Assess needs – DAF & CARDIO assessments, workflow assessment
• Make the case – advocacy with senior management
• Develop support and services – guidance and training, pilot RDM tools, institutional data catalogues, RDM policy development, customised Data Management Plans
…and support policy implementation
21. Australian National Data Service
National service, backed with university-level initiatives
22. Excuses – and responses
• “People will ask questions”
– So use a data centre or repository
• “It will be misinterpreted”
– Stuff happens. Also, openness encourages correction
• “It's not interesting”
– Let others be the judge – your noise is my signal
• “I might get another paper out of it”
– Up to a point. We might get more research out of it
• “I don't have permission”
– A real problem. But solvable at senior level
• “It's too bad/complicated”
– See above
• “It's not a priority”
– Unfortunately, funders are making it so. But if you looked at the evidence, it would be your priority as well
See e.g. Carly Strasser's blog:
http://datapub.cdlib.org/2013/04/24/closed-data-excuses-excuses/
23. These excuses bear a strong resemblance to those used by politicians and civil servants who argue against the release of government records.
This is not a group you want to be compared with.
24. Integrity
• Not everyone publishes here
• Almost all fraud connected to unavailable data
• People suffer & die due to research fraud
• When your research is reproducible – it gets cited
25. Integrity – not without data
• Cyril Burt
– Twin studies on intelligence. Questioned 1976; now discredited
• Duke case
– Data hiding leads to wasted treatments, clinical trials, probable death & huge lawsuits
• Dutch cases
– Stapel – 55 publications – "fictitious data"
– Poldermans – fabricated data or negligence?
"The case for open data: the Duke Clinical Trials" – blog post, Kevin Ashley, http://www.dcc.ac.uk/news/case-open-data-duke-clinical-trials
"Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud?" – Doorn, Dillo, van Horik, IJDC 8(1); doi:10.2218/ijdc.v8i1.256
26. Should all data be open?
• NO
• Many reasons – most to do with human subjects
• But data existence should always be open
• Allows discovery & negotiation on use
• Avoids pointless replication
27. Gentleman’s data centres
• Some data centres have club-like behaviour
– Barriers to access
– Only for contributors
– Territorial
• Not without value, but barriers to progress
28. Citability
• Making data available increases citations
• Everyone – academic, funder, institution – loves citations
• Want evidence?
– Alter, Pienta, Lyle – 240%, social sciences *
– Piwowar, Vision – 9% (microarray data) †
– Henneken, Accomazzi – 20% (astronomy) #
* Amy Pienta, George Alter, Jared Lyle (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. http://hdl.handle.net/2027.42/78307
† Heather Piwowar, Todd Vision (2013) Data Reuse & the Open Data Citation Advantage. PeerJ PrePrints 1:e1v1. http://dx.doi.org/10.7287/peerj.preprints.1v1
# Edwin Henneken, Alberto Accomazzi (2011) Linking to Data – Effect on Citation Rates in Astronomy. http://arxiv.org/abs/1111.3618
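Since citable DOIs for data come up repeatedly in this talk, here is a minimal sketch of assembling a dataset citation in the DataCite-recommended form, Creator (PublicationYear): Title. Publisher. Identifier. The function and field names are illustrative, not any real library's API.

```python
# Minimal sketch: build a data citation string in the DataCite-recommended
# form "Creator (PublicationYear): Title. Publisher. Identifier".
# Helper name and parameters are illustrative, not a real API.

def format_data_citation(creators, year, title, publisher, doi):
    """Return a citation string for a dataset identified by a DOI."""
    names = "; ".join(creators)
    # DOIs are conventionally rendered as resolvable https://doi.org/ links
    return f"{names} ({year}): {title}. {publisher}. https://doi.org/{doi}"

citation = format_data_citation(
    creators=["Piwowar, H.", "Vision, T. J."],
    year=2013,
    title="Data reuse and the open data citation advantage",
    publisher="PeerJ",
    doi="10.7287/peerj.preprints.1v1",
)
print(citation)
```

The point of the fixed form is that a machine-actionable identifier travels with every citation, which is what makes the citation counts in the studies above measurable at all.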
29. Can we find it?
• Data must be discoverable to be reused
• Alone, or in conjunction with a publication
• Institutional catalogues, national data registries, national and international domain-specific services
30. Data discovery around the world
• Research Data Australia
• UK data registry pilot & Gateway2Research
• Research Data Netherlands
• World Data System
• re3data.org & databib.org – discovering repositories
35. Other global work of note
• Domain initiatives such as the Belmont Forum
• International generic groups – RDA, CODATA
• Problem-specific services – DataCite, EZID, …
39. Data reuse stories
• The palaeontologist who saved years of work with archaeological data
• The 'noise' from research radar that mapped dust from Eyjafjallajökull
• The 19th-century logs and photographs that help us model climate change
Often your data tells stories that your publications do not.
41. Thanks for your attention
kevin.ashley@ed.ac.uk
www.dcc.ac.uk
@kevingashley
Editor's Notes
I'm from the Digital Curation Centre in the UK – for those of you who haven't heard of us I'll be explaining a little of what we do later on in this talk. I'm here today to talk about one of the things that is central to ESIP's existence, the effective management and reuse of research data. I'm not going to talk about the earth sciences specifically, though, but look more generally at what is happening at local, national and global level to ensure that research data is used and reused effectively.
For an audience such as this, I shouldn't have to explain why data reuse is important. But just in case, and to explain why some things have happened the way they have, I'll describe some of the drivers. Ensuring that all research data is discoverable and reusable increases the quality of the research that we do. It can add to the data we collect ourselves and can improve the statistical rigour of our results. Exposing data to scrutiny makes it more straightforward to validate or challenge the findings of others. Making data available also improves the speed with which we can do research. If someone else has already gathered the data we need (perhaps for a different end use), we can move directly to the analysis stage of our work, saving both time and money. And saving money increases the efficiency of research. We hope that the money saved lets us do more research, but even if it doesn't society as a whole will gain. There's evidence behind this that I'll come to later, but it is an effective counter to those in some universities who feel that increasing funder requirements for data management simply lead to additional costs with no gain. There is a gain in all these areas, and hence every one of the actors – researchers, their employers, their funders, and society – should be motivated to make this happen.
Getting better at managing research data isn't just about keeping more stuff for longer. It's also about being more selective about what we do keep and documenting the decision-making process that we use. Reports such as this make clear that technological advances mean that the cost of producing data is dropping more rapidly than the cost of retention. Some arguments show that if we attempt to retain everything it won't be long before we're spending the entire GDP purely on data storage. That's an extreme analysis, but the problem is real, as CERN know well. In some disciplines it really is wiser to just generate the data again when it is needed. But for many observational disciplines, that opportunity isn't open to us.
Funders are aware of this and making increasingly stringent requirements about what researchers do with data and how and when they document it. The UK’s NERC runs a network of data centres to capture much of the data from research which it funds as well as providing data from the larger instruments it is responsible for. It also requires data management plans – an outline with the proposal and a more fully worked-up plan if the proposal is successful. The details differ with other funders but all are moving or have moved in this direction.
Some, such as EPSRC in the UK, have taken a slightly different tack. They place the burden of compliance on the university rather than the researcher. They expect universities to provide appropriate services to researchers to enable them to do the Right Thing, whatever that is. Looming deadlines in 2012 and in 2015 got the attention of senior university management. EPSRC are the biggest research funder in the UK. No one wants to put that funding stream at risk.
The expectations that universities need to sign up are listed here – their roadmaps need to demonstrate how they are going to deliver on these expectations by 2015. They include a commitment to keep data for 10 years after its last use – note, not just after the project ends. Some worry that this means they need to keep data for 100 years. I say that if your data is still being used (and cited) 100 years later you should break out the champagne, not worry about paying for it.
Those are UK examples and many of you will be familiar with parallel requirements in the USA. In Germany, DFG is making similar requirements and the European Commission’s new Horizon 2020 programme has also included requirements about research data for the first time. Government funders are often driven by the same ideas that are pushing increasing openness with data from all areas of public activity, much of it administrative. The G8 have issued strongly-worded statements in this area. But health funders in particular are often charitable rather than government funded. They often began with requirements about open access to publications arising from publicly-funded research. Data is an obvious next step.
They all have similar motivations, but the key one is value. Data costs money – it's an asset. And we want to sweat that asset to the greatest extent possible. The business case for some of the DCC's activity, accepted by the UK Treasury, foresaw a return on a modest investment of 2.5 times after 5 years – which then continues indefinitely.
Some people felt that disciplinary data centres were the answer to all this. There are lots of them after all – these are just a few UK examples. But many disciplines don’t have them and they aren’t easy to create. There is therefore an ongoing role for someone else to have custodial responsibility for much research data and universities and other research institutions are the natural home for much of it. National libraries in some countries also see a role for themselves.
Some recent studies have used rigorous methodologies to examine the cost-effectiveness of disciplinary data archives or repositories. This most recent one shows impressive returns on the amount spent on BADC – rates of return that would be highly attractive were this a commercial venture. But it isn't, of course. The financial benefit flows to the community as a whole, not to the data centre, which is simply a cost we bear to save overall. This observation, incidentally, is equally applicable no matter how you choose to spell 'center'.
Many of you may be familiar with this graph from the Hubble Space telescope data archive. It tells the same story in a different way, and also tells a story about the transformation of astronomy as a discipline. In the days of photographic plates, sharing (analogue) astronomical data was difficult. Digital instruments transformed this, and some time around 2000, more research was being done with old data than with new data. I could be more specific about this if the data behind this graph was made available, incidentally!
But we do need to beware of dependence on a single custodian for any set of data. This recent news story contains speculation that political motives are behind the loss of much material from research libraries in Canada, which includes much pre-digital data. The story isn't without controversy, but it is only one example from many around the world.
Commercial actors are also entering the scene, either to provide services to universities or research groups or direct to researchers. Arkivum falls into the former group; figshare began with the latter but is also now moving into an institutional offering. Digital Science, the people behind figshare, themselves owned by NPG, clearly believe they can extract value from the data they will end up with.
Those worried about where we'll store all this data sometimes point to cloud solutions as a panacea. They do have a useful role to play, and I see we'll be hearing about some of the success stories later on in this meeting. But David Rosenthal's analysis shows clearly that the cloud isn't effective for the long-term storage of even little-used data.
I urge you to read David’s blog and his article in IJDC to get a better understanding of his arguments. For the moment you’ll have to take it on trust. This graph compares costs of storing data for 100 years in either Amazon S3 or local systems, using different values of Kryder’s law which describes the change in unit storage cost over time. S3 loses out by a very large margin for all values, yet also has exit costs that make it almost impossible to get out of it cost effectively once you have opted in.
It’s still true even for Amazon Glacier, the low-cost option supposedly aimed at long-term preservation. The gap is smaller, but still there. Worse, every use of the data dramatically impacts the cost. This graph assumes that the only access is for periodic verification of the data, perhaps only once every 2 years.
Some countries have mounted national efforts to support universities to deal with the issues more effectively. In the US, this has primarily been through NSF programmes and projects such as DataONE. You'll be hearing more about these so I'll say no more myself. In Australia, the parallel initiatives of ANDS and RDSI provide national infrastructure backed with funded action within universities. In the UK, the DCC performs a similar role to ANDS and Jisc's MRD programmes fund the university-level action, often with partners such as publishers and international groups such as CODATA or CASRAI. The Netherlands and Canada both have similarly-named national initiatives, although that in the Netherlands is already delivering based on a grassroots joint model between a data centre and universities. There is also activity in Finland, Denmark and Germany – and possibly elsewhere.
In the UK, the DCC provides a mixture of guidance, events, current awareness, online services for tasks like data management planning, and embedded work in universities.
The embedded work contains multiple components which help with everything from initially making the case for action through training support staff and researchers and designing and delivering services to researchers that work with national and international infrastructure.
Funded work in universities, subject to competitive bidding, complements this work. These are examples of training programmes developed for research disciplines and for library staff in effective data management.
The Australian National Data Service has a similar remit, but substantially more funding. It has a clear goal to increase data reuse in Australia and of Australian research and a simple vision of the change it intends to bring about. I’ll say more about some of its services later.
Yet some researchers still aren’t convinced by the rhetoric. Carly Strasser at CDL has listed some of the reasons for not sharing data that she’s encountered – and here are some of my one-line responses. I’m not saying that the concerns aren’t sincere or reasonable but they can all be dealt with and some are positively misguided. The purpose of data centres, for instance, is to make data independently reusable (as stated in the OAIS standard) which relieves researchers of the burden of dealing with questions about it, at the same time as increasing the likelihood that their data will be cited.
It's unfortunate that I find many of these excuses familiar from the time that I ran services for the UK national archive dealing with government data. They are nearly all the same – although it's true that politicians rarely argue that they want to get one more paper out of the data before it is released. Either way, these people aren't company that you want to be in.
Much as I enjoy the JIR, it isn't the publication most of us aim for. But it brings home one compelling argument for making data available, that of research integrity. Almost all fraud, and other less clear-cut cases of bad research, can be associated with the unavailability of research data. There are real consequences, including human suffering and death – of which more later. And did I mention that making your data available makes it more likely to be cited? Don't worry, I will again.
These are just a few examples, some of outright fraud and others of simply dodgy research, all of which would have been uncovered far more quickly had the data been made routinely available. The Duke case in particular roused the suspicions of many in the field but took many years to get to the bottom of because the data was locked away. It is just one example of a set of practices described very clearly by Ben Goldacre in 'Bad Pharma'. Missing data is the largest section in his book, although he has other justified concerns with research relating to medical treatments. It has led to a global movement to ensure that all clinical trial data is made available. But medicine is by no means the only area affected.
Medicine does, however, provide some clear reasons why we can’t just stick all research data on the internet for anyone to trawl through. When human subjects are involved there are real concerns about confidentiality. Yet what alltrials.net and other initiatives make clear is that the *existence* of the data should never be hidden. That allows it to be discovered and for negotiations to take place about its use. It avoids costly replication, which can delay scientific discovery and involve human suffering when the replication takes the form of a clinical trial.
There are other concerns that I have with the way some data centres behaved historically. Some feel like gentleman's clubs – only available to members and with significant and sometimes abstruse barriers to membership. Many are moving away from this model but there is still some way to go.
Did I mention that making data available increases citations? This is a win all round. If you don’t believe me, here are three studies from three different areas that all show robust, positive correlations. The effect size varies with discipline, but we have enough evidence now that anyone who says that their area is different needs to come up with evidence to show why.
It’s not enough that data is preserved – it must be findable, both from the publication that describes it and as an entity in itself. Not all data goes with a publication. Services at many different levels have a role to play, particularly to ensure that data reuse can happen in a cross-disciplinary way.
ANDS were the first to do something at national level and in a generic way with Research Data Australia. We in the UK are following much of their model. Meanwhile the government funders are merging what were funder-level discovery services for all their research outputs. There are broad multi-disciplinary services such as the World Data System and two services at present tackling a related problem – finding an appropriate place to put your data. This is a real problem for many.
Re3data is a DFG-funded project trying to tackle this, building in part on work done by the DCC along with BioMed Central and … The brief descriptions have handy icons expressing things like usage conditions in a compact way, but there's lots more detail available as these screenshots show. Links to the terms and conditions of use and the standards employed are particularly valuable.
Here in the USA the databib project is undertaking a similar initiative – this is a record for the same archive, the Archaeology Data Service in York, UK. It's briefer, but tells us much of what we need to know. Of note is the fact that all this data is available as RDF, making it easy for others to build services on top of this registry. That's going to be important. I hope re3data will do something similar.
Services like this make it easy when we want to locate two datasets, perhaps from two sub-disciplines, to combine – a common enough requirement.
But increasingly we want to undertake combinations of hundreds or even thousands of individual datasets and to do so in a relatively automated way. In general, we don’t yet have services that make this straightforward.
There’s more taking place at global level, far more than I have time to discuss here. Many of you will be familiar with the work of the Belmont forum on climate data. There are many other groups working in large domains such as this. There are also more generic intitatives, some of them long-lived such as CODATA and others much newer such as the RDA. Both are working particularly hard to identify generic solutions to many problems that relate to research data management in ways that individual disciplines will find hard to do. Both achieve much more through global coordination than any national or even continental initiative can achieve. One generic issue is that of providing permanent, citable identifiers for data. EZID, from the California Digital Library, is one solution. Datacite is another, backed by national libraries in many countries.
All this work is aimed at making data reuse simpler. We’re all familiar with research lifecycle models such as this one, where ideas lead to projects, data and publications.
We know that these lifecycles can connect, with one group building on the work of another
But research that collects data doesn’t always provide publishable results in a useful timeframe, or at all. Much of this work is aimed at making sure that we can still benefit from the data of others even when parts of the lifecycle are broken. And those others will benefit from the citations we provide to their source data.
There are many such stories of unexpected data reuse; these are a few examples. The last, exemplified in the Old Weather project, is seeing the original data being reused for at least the third time and in doing so is helping both climatologists and family historians through a single piece of transcription work. An impressive result.
The message from colleagues at 3TU in the Netherlands is one I would like to leave you with. See your data as a treasure, but one you only gain value from when it is shared.