Whither Small Data?


My talk for the BRDI meeting in Washington, DC: http://sites.nationalacademies.org/PGA/brdi/PGA_080945

  • Are current modes of publication and excessive reliance on essentially only one medium (articles and books) serving scholarship or limiting it?

    1. Whither Small Data? Some Thoughts on Managing Research Data
       February 26, 2013
       Anita de Waard, VP Research Data Collaborations, Elsevier RDS
       a.dewaard@elsevier.com
    2. Why should data be saved?
       A. Hold scientists accountable: Data Preservation
          – Preserve record of scientific process, provenance
          – Enable reproducible research
       B. Do better science: Data Use
          – Use results obtained by others!
          – Improve interdisciplinary work
       C. Enable long-term access: Sustainable Models
          – Use for technology transfer; societal/industrial development
          – Reward scientists for data creation (credit/attribution)
          – Allow public/others insight/use of results
    3. Where The Data Goes Now:
       Scale: > 50 My papers; 2 M scientists; 2 My papers/year
       • A small portion of data (1-2%?) stored in small, topic-focused data repositories:
          – PDB: 88.3 k
          – PetDB: 1.5 k
          – SedDB: 0.6 k
          – MiRB: 25 k
          – TAIR: 72.1 k
       • Some data (8%?) stored in large, generic data repositories:
          – Dryad: 7,631 files
          – Dataverse: 0.6 My
          – Datacite: 1.5 My
       • Majority of data (90%?) is stored on local hard drives
    4. Key Needs (labels overlaid on the diagram from slide 3):
       • DEVELOP SUSTAINABLE MODELS
       • INCREASE DATA PRESERVATION
    5. A. Data Preservation:
       • Issues:
          – Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
          – Often not in electronic form (maps, images)
          – No metadata: when, where, by whom, WHY was this data collected?
       • Needs:
          – Tools to make data export/storage simple and unavoidable
          – Policies that make data sharing mandatory and simple
          – Systems that reward data sharing/digitisation
    6. B. Data Use:
       • Issues:
          – In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
          – It’s expensive to make data useable!
          – Domain-specific data stores are not cross-searchable across discipline/national borders
       • Needs:
          – Standardised metadata systems across systems/repositories and tools to apply them easily
          – Integration layers to enable cross-repository queries
          – A funding model to enable long-term preservation
    7. C. Sustainable Models:
       • Issues:
          – Many successful domain-specific data repositories are running out of funding
          – Is adding metadata something you want to keep paying PhD+ scientists to do?
          – Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?
       • Needs:
          – Attribution models for rewarding scientists
          – Policies to improve cross-domain and cross-national collaborations
          – Funding models to sustain databases long-term
    8. Linking papers to research data:
       Database            Object Linked           Displayed
       Pangaea             Google Maps Location    Map with location
       Protein Databank    PDB Protein             3d Protein Visualisation
       Genbank             Gene Name               NCBI Gene Viewer
       Exoplanets +        Exoplanet name          Rich information on extrasolar planets
       Species +           Species name            Rich information on species
    9. Towards ‘wrapping papers around data’
       1. Store metadata on all materials
       2. Track the methods while doing them
       3. Write papers that ‘wrap around’ this metadata
       4. Don’t ‘send’ your papers – just expose them to the outside world
       5. Invite reviews; open data to trusted parties, at trusted time
       6. Allow apps/tools to integrate
       [Figure: materials tagged with metadata feed a sample paper: “Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-”. Figure labels: Calculate, coordinate…; Review; Revise; Compile, comment, compare…; Edit]
    10. Research Data Services:
        A. Increase Data Preservation: Help increase the amount and quality of data preserved and shared
        B. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability
        C. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms
    11. Guiding Principles of RDS:
        • In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
        • Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
           – Aspects where collaboration is needed are discussed
           – A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
        • Transparent business model
        • Very small (2/3 people) department; immediate communication; instant deployment of ideas
    12. Three pilots:
        1. Carnegie Mellon Electrophysiology Lab:
           A. Data Input: Develop a suite of tools to enable simple data capturing on a handheld device, add metadata during experiment, store with raw traces and create dashboard for viewing
           B. Data Use: Integrate with NIF and eagle-I ontologies, enable access through NIF; combine with other sources
        2. ImageVault, with Duke CIVM:
           A. Data Input: Get 3D image data into common format, resolution, annotated to allow comparison
           B. Data Use: View other image data sets & do image analytics
           C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics
    13. 3. IEDA Data Rescue Process Study
        Data Rescue:
           – Identify 3-5 data sets that need to be ‘rescued’
           – Work with investigators to identify data sources, formats
           – Work with IEDA to define metadata standards, quality checks etc.
        Data Rescue Process:
           – A group of data wranglers perform ‘electrification’ and annotation
           – (Open source) software is developed where needed, to help this process
           – We help develop common standards, if needed
    14. 3. IEDA Data Rescue Process Study
        Data Rescue Process Study:
        Jointly publish a report on a ‘gap analysis’ comparing where we are now vs. where we need to be, including:
           – What we did (data imported, processes/standards created/described; software built; user tests, outcomes)
           – Effort involved (time, software, equipment, skills, etc.)
           – How easy it would be to scale up; what part of data out there could be done this way
           – Recommendations for tools and skills that are needed, if we want to scale up this process
    15. Summary:
        • Three key issues:
           A. Data Preservation
           B. Data Use
           C. Sustainable Models
        • Elsevier’s approach:
           – Linking data to papers
           – Wrap papers around data
           – Explore role in the research data space
        • Elsevier RDS:
           – Three pilots (CMU, Duke, IEDA) to investigate issues
           – We’ll report back in about a year!
    16. Questions?
        Anita de Waard, VP Research Data Collaborations, Elsevier
        a.dewaard@elsevier.com
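
Illustration (not part of the talk): slide 5 notes that data often lacks metadata on when, where, by whom, and why it was collected. The Python sketch below shows one way a capture tool might record those four answers and flag the ones still missing. The class name, field names, and the example values are all hypothetical; the slides do not specify any schema or tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record for the four questions on slide 5."""
    collected_when: str = ""   # e.g. an ISO 8601 date
    collected_where: str = ""  # site, lab, or coordinates
    collected_by: str = ""     # person or group
    collected_why: str = ""    # purpose of the collection


def missing_fields(record: DatasetMetadata) -> List[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Made-up example: only two of the four questions are answered.
example = DatasetMetadata(collected_when="2013-02-26", collected_by="Example Lab")
print(missing_fields(example))
```

A tool built along these lines could refuse to export a data set until `missing_fields` returns an empty list, which is one reading of "simple and unavoidable" on slide 5.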