Small Data: Bridging the Gap Between Generic and Specific Repositories
Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories
April 11, 2013
Anita de Waard, VP Research Data Collaborations
firstname.lastname@example.org
http://researchdata.elsevier.com/
There are many efforts to enhance data storing and sharing...
• Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)
• Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever, etc.)
• Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)
• Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)
• You can make a living out of this ;-)! (and many of us do…)
…but this is what scientists do:
Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of this, and writes a paper. End of story.
Why save research data?
A. Data Preservation:
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Data Use:
– Use results obtained by others
– Do better science!
– Improve interdisciplinary work
Where the data goes now:
[Figure labels: > 50 My papers; 2 M scientists; 2 M papers/year]
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories: PDB: 88.3 k; PetDB: 1.5 k; SedDB: 0.6 k; MiRB: 25 k; TAIR: 72.1 k
• Some data (8%?) is stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M
• The majority of data (90%?) is stored on local hard drives
So this needs to happen: INCREASE DATA PRESERVATION
[Figure labels: > 50 My papers; 2 M scientists; 2 M papers/year]
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories: PDB: 88.3 k; PetDB: 1.5 k; SedDB: 0.6 k; MiRB: 25 k; TAIR: 72.1 k
• Some data (8%?) is stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M
• The majority of data (90%?) is stored on local hard drives
Data Preservation Issues:
Objection: “Our lab notebooks are all on paper – it’s how we do things”
Response: Graft tools closely onto scientists’ daily practice
Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks
Data Preservation Issues:
Objection: “I need to see a direct benefit of any effort I put in.”
Response: Create tools that give better insight into one’s own and others’ results.
Example: ‘PI-Dashboard’: allow immediate access to and analysis of shared data: new science!
Data Use Issues:
Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”
Response: Create social networking context; allow data owner to provide granular access control.
Example:
• In Urban Lab app, data stored by researcher name.
• PI decides who gets to see which data
• Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
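The access model in the example above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Urban Lab app's actual design: the `LabStore` and `Dataset` names, the in-memory storage, and the rule that the PI and the producing researcher can always read a dataset are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str  # researcher who produced the data
    readers: set[str] = field(default_factory=set)  # people granted access


class LabStore:
    """Hypothetical in-memory store: data is filed under the researcher's name."""

    def __init__(self, pi: str) -> None:
        self.pi = pi
        self.datasets: dict[str, Dataset] = {}

    def add(self, name: str, owner: str) -> None:
        self.datasets[name] = Dataset(name=name, owner=owner)

    def grant(self, actor: str, name: str, reader: str) -> None:
        # Only the PI decides who gets to see which data.
        if actor != self.pi:
            raise PermissionError(f"{actor} may not change access to {name}")
        self.datasets[name].readers.add(reader)

    def can_read(self, user: str, name: str) -> bool:
        # Assumption: the PI and the producing researcher always have access.
        ds = self.datasets[name]
        return user == self.pi or user == ds.owner or user in ds.readers

    def visible_to(self, user: str) -> list[str]:
        # Names of all datasets this user may see, in sorted order.
        return sorted(n for n in self.datasets if self.can_read(user, n))
```

Access is granted per dataset rather than per lab, which is what makes the control "granular": sharing one experiment with a collaborator exposes nothing else.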
Data Use Issues:
• Objection: “I am afraid other people might scoop my discoveries”
• Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)
• Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.
The 2013 International Data Rescue Award in the Geosciences, organised by IEDA and Elsevier Research Data Services
http://researchdata.elsevier.com/datachallenge
Data Preservation and Annotation: Fine, I’ll do it – but where the hell do I put it?
• Domain of study: Domain-Specific Data Repository
• Collaborators: Local Data Repository
• Funding Agency: Generic Data Repository
• University: Institutional Data Repository
AND THEY ALL WANT DIFFERENT METADATA!!!!
Comparing Repository Types:
Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements.
Institutional repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators
[Axis labels beside the table: “Habit, Ease, Privacy, Control”; “Effort, Reuse, Credit, Compliance”; “MORE ANNOTATION”]
Conclusions for data annotation:
“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have” (Deep Thoughts by Jack Handey)
• Let’s use the data standards we already have – and agree on using the same ones
• Work with existing data repositories in a field to come to a lowest common denominator of metadata
• Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
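One way to picture the "lowest common denominator of metadata" is as the set of fields that every repository in a field already asks for. The sketch below assumes each repository's requirements can be reduced to a flat set of field names; the repository labels and field names are invented for illustration and do not describe any real repository's schema.

```python
# Hypothetical required-field sets for three repositories in one field.
EXAMPLE_SCHEMAS: dict[str, set[str]] = {
    "domain_repo": {"title", "creator", "date", "sample_id", "method"},
    "institutional_repo": {"title", "creator", "date", "department"},
    "generic_repo": {"title", "creator", "date"},
}


def common_metadata_fields(schemas: dict[str, set[str]]) -> set[str]:
    """Fields required by every repository: the lowest common denominator."""
    if not schemas:
        return set()
    return set.intersection(*schemas.values())


def fields_still_needed(
    schemas: dict[str, set[str]], common: set[str]
) -> dict[str, set[str]]:
    """Per repository, the extra fields a scientist must add beyond the core."""
    return {repo: fields - common for repo, fields in schemas.items()}
```

Capturing the shared core once and asking for the per-repository extras only on deposit follows the last bullet: add as little as you have to, as few times as you can.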
Summary:
• Data Preservation:
– Tailor tools to fit scientists’ workflow – follow the experiment!
– We are creating repositories of shared experiments: enable demonstrably better science!
• Data Use:
– Allow owner full control over who sees which data - create social networking context
– Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges
• How annotation can help reuse:
– Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.
Questions?
Anita de Waard, VP Research Data Collaborations
email@example.com
http://researchdata.elsevier.com/
Elsevier Research Data Services Goals:
1. Increase Data Preservation: Help increase the amount and quality of data preserved and shared
2. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability
3. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.
Guiding Principles of RDS:
• In principle, all open data stays open and URLs, front end, etc. stay where they are (i.e. with the repository)
• Collaboration is tailored to data repositories’ unique needs/interests, ‘service-model’ type:
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2-3 people) department; immediate communication; instant deployment of ideas.
“But aren’t you guys in it for the money?”
• Yes, we are - like most businesses…
• Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’
• The OA debate focuses on three issues:
– IPR and access issues. E.g. BY-NC-SA? Github? ..?
– Opaque business models. E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
– Lack of perceived added value. We offer a service: only use it if it’s any good!