The blessing and the curse: handshaking between general and specialist data repositories

This document discusses the challenges of depositing data in both generalist and specialist repositories. It notes that while specialized repositories are best for standardized data, many datasets fall into the "long tail" of less common types. Generalist repositories can accommodate long-tail data but require redundant metadata. The document explores how to link data and publications between repositories and assess data quality. It concludes that promoting standards for interoperability between repositories and rallying the research community around those standards could help address these issues.

The blessing and the curse: handshaking between general and specialist data repositories
Hilmar Lapp (NESCent), Todd Vision (UNC Chapel Hill)
GSC 15 Conference, Bethesda, MD
April 22-24, 2013

Which data goes where?
Which is required?

Addressing the long tail of orphan data
[Figure, after Heidorn (2008) http://hdl.handle.net/2142/9127: data volume against rank frequency of datatype, with regions labelled "Specialized repositories (e.g. GenBank, GBIF)" and "Orphan data"]
Many datasets belong to the long tail. Though less standardized, they can be rich in information content and have unique value.

General purpose repositories cater to long-tail data

And that’s aside from the proverbial Babel of data formats.

[Screenshots of deposit forms, each asking for the publication again: "Enter Publication:", "Please enter your publication:", "Publication:", "Enter Publication:"]
Metadata has to be provisioned redundantly
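
To picture the redundancy, consider the same publication metadata being retyped into each repository's form. The following is a minimal Python sketch, assuming two hypothetical deposit forms with invented field names (none of this is a real repository API), of entering the record once and renaming its fields for each form:

```python
# Hypothetical field names for two deposit forms; these are illustrative
# assumptions, not the schema of any real repository.
GENERAL_FORM = {"title": "article_title", "doi": "article_doi", "authors": "creators"}
SPECIALIST_FORM = {"title": "study_name", "doi": "citation_doi", "authors": "submitters"}


def to_form(record: dict, field_map: dict) -> dict:
    """Rename the keys of one publication record to a repository's field names."""
    return {field_map[key]: value for key, value in record.items() if key in field_map}


# A made-up publication record, entered once by the author.
publication = {
    "title": "An example article",
    "doi": "10.1234/example",
    "authors": ["A. Author", "B. Author"],
}

general_deposit = to_form(publication, GENERAL_FORM)
specialist_deposit = to_form(publication, SPECIALIST_FORM)
```

In this sketch the author supplies the record once and each repository receives it under its own field names. In practice the mapping would have to be agreed between the repositories involved.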

How to concisely link to the supporting data?

Given the article, how do I find the data?

Given a data record, how do I find related data?

How do I assess quality and fitness for purpose?

• The End
   - To make data archiving and reuse a standard part of scholarly communication.
• The Means
   - Integrate data archiving with the process of publication.
   - Make archiving easy and low burden for both authors and journals.
   - Give researchers incentives to archive their data.
   - Promote responsible data reuse.
   - Empower journals, societies & publishers in shared governance.
   - Ensure sustainability and long-term preservation.
   - Work with and support trusted, specialized disciplinary repositories.
• The Scope
   - Research data in sciences and medicine (early focus on evolution and ecology).
   - Content must be complementary to existing disciplinary repositories.
   - Data must be associated with a vetted publication (article, thesis, book chapter, etc.).
   - Associated non-data content (e.g. software scripts, figures) where appropriate.

Lessons learnt
• Different priorities on deposit versus metadata richness may void benefits
• Advantages of one-stop deposition and when to use it are not obvious to users
• Custom-building handshaking protocols is not robust, doesn’t scale

How to promote
• Minimum metadata reporting standards?
• Uptake of community specialist repositories?
• Archival of all long-tail data?
• Linking between repositories?
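
As a rough illustration of the first and last questions, a minimum metadata standard can be thought of as a list of required fields, and linking as identifiers that point to records in other repositories. The Python sketch below works under those assumptions. The field names and the record are invented for illustration and do not come from any published standard.

```python
# Assumed minimum fields; a real reporting standard would define its own list.
REQUIRED_FIELDS = ("title", "creators", "publication_doi", "related_records")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a data record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Made-up record in a general repository that should link to a specialist one.
record = {
    "title": "Example dataset",
    "creators": ["A. Author"],
    "publication_doi": "10.1234/example",
    "related_records": [],  # no link to the specialist repository yet
}

# The empty link list is reported as a missing field.
print(missing_fields(record))
```

A check like this could run at deposit time, so that a record without links to related records is flagged before it is published.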

Standards for repository & web of data interoperability

Promoting community rallying around standards?

Repo: http://datadryad.org
Blog: http://blog.datadryad.org
Wiki: http://datadryad.org/wiki
Code: http://code.google.com/p/dryad
List: dryad-users@nescent.org
@datadryad
Dryad


The blessing and the curse: handshaking between general and specialist data repositories

1. The blessing and the curse: handshaking between general and specialist data repositories Hilmar Lapp (NESCent), Todd Vision (UNC Chapel Hill) GSC 15 Conference, Bethesda, MD April 22-24, 2013

2. > 180 for biological sciences alone

3. Which data goes where? Which is required?

4. Addressing the long tail of orphan data Volume Rank frequency of datatype Specialized repositories (e.g. GenBank, GBIF) Orphan data After Heidorn (2008) http://hdl.handle.net/2142/9127 Many datasets belong to the long tail. Though less standardized, they can be rich in information content and have unique value

5. General purpose repositories cater to long-tail data

6. General purpose repositories cater to long-tail data

7. And that’s aside from the proverbial Babel of data formats.

8. Where does this leave the user?

9. Where to deposit what, and how?

10. Enter Publication: Please enter your publication: Publication: Enter Publication: Metadata has to be provisioned redundantly

11. How to concisely link to the supporting data?

12. Given the article, how do I find the data?

13.

14. Given a data record, how do I find related data?

15. How do I assess quality and fitness for purpose?

16. Lessons from Dryad/TreeBASE handshaking

17. • The End - To make data archiving and reuse a standard part of scholarly communication. • The Means - Integrate data archiving with the process of publication. - Make archiving easy and low burden for both authors and journals. - Give researchers incentives to archive their data. - Promote responsible data reuse. - Empower journals, societies & publishers in shared governance. - Ensure sustainability and long-term preservation. - Work with and support trusted, specialized disciplinary repositories. • The Scope - Research data in sciences and medicine (early focus on evolution and ecology). - Content must be complementary to existing disciplinary repositories. - Data must be associated with a vetted publication (article, thesis, book chapter, etc.). - Associated non-data content (e.g. software scripts, figures) where appropriate.

18.

19.

20.

21.

22. Lessons learnt • Different priorities on deposit versus metadata richness may void benefits • Advantages of one-stop deposition and when to use it are not obvious to users • Custom-building handshaking protocols is not robust, doesn’t scale

23. How to promote • Minimum metadata reporting standards? • Uptake of community specialist repositories? • Archival of all long-tail data? • Linking between repositories?

24. Data Metadata Links Data Metadata Links

25.

26. Standards for repository & web of data interoperability

27. Standards for repository & web of data interoperability

28. Promoting community rallying around standards ?

29. Promoting community rallying around standards ?

30. Repo: http://datadryad.org Blog: http://blog.datadryad.org Wiki: http://datadryad.org/wiki Code: http://code.google.com/p/dryad List: dryad-users@nescent.org @datadryad Dryad

Editor's Notes

Specialized repository infrastructure exists for certain data types, e.g. DNA sequences and species occurrence data. But vast quantities of valuable and irreplaceable data comprise the long tail, much of it in idiosyncratically formatted spreadsheets and other nonstandardized files. An archive is needed not to replace existing repositories, but to provide a home for orphan data and enable ALL the data underlying a publication to be archived.
Dryad was developed to fill the infrastructure gap for journals that wished to sincerely promote data archiving: a repository that could be used not only by authors producing certain types of data, or only by the authors most motivated to share, but by all the authors to whom the journal’s data policy would apply.
