Leveraging publication metadata to help overcome the data ingest bottleneck

A talk on Dryad given at the ORCID Participant Meeting in Boston, 5/18/2011

Published in: Technology, Education
Comments
  • The idea is that an ORCID (or, more likely, a cluster of ORCIDs) can be equated to a researcher identity. This would allow a repository to recognize different records to which the same researcher has contributed using an exact match between ORCIDs rather than approximate matching of names, institutions, and other identifiable information.
  • The relationship between ORCIDs and identities is confusing to me. But this is only a slide deck -- more research req'd.
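
The exact-match idea in the first comment can be illustrated with a minimal sketch. It assumes each record carries a list of authors with an optional `orcid` field; the record layout, the package names, and the identifier strings are hypothetical placeholders, not Dryad's actual schema or real ORCIDs.

```python
from collections import defaultdict


def group_by_orcid(records):
    """Map each ORCID to the set of record ids that list it as an author."""
    by_orcid = defaultdict(set)
    for record in records:
        for author in record["authors"]:
            orcid = author.get("orcid")
            if orcid:  # authors without an ORCID cannot be matched exactly
                by_orcid[orcid].add(record["id"])
    return dict(by_orcid)


# Hypothetical records: one researcher appears under two different name strings.
records = [
    {"id": "package-A",
     "authors": [{"name": "Smith J", "orcid": "0000-0000-0000-0001"}]},
    {"id": "package-B",
     "authors": [{"name": "Jane Q. Smith", "orcid": "0000-0000-0000-0001"},
                 {"name": "Smith J", "orcid": "0000-0000-0000-0002"}]},
]

shared = group_by_orcid(records)
```

Here the two packages are linked through the shared identifier even though the name strings differ, while the second "Smith J" stays separate because the identifier differs.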

Slide notes
  • Demand from the user community (i.e. biologists) has led to a distributed network of sometimes inadequate, often unsustainable, generally non-interoperable solutions, from personal websites to publisher-hosted supplementary materials. Dryad, by contrast, is designed to be a self-sustaining service that rationalizes the space and is responsive to multiple stakeholders: journals, publishers, societies, funders, authors, and data users.
  • There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Dryad is designed specifically to enable archiving and reuse of this long tail of orphan data. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
  • A classic example of orphan data is Bumpus’ (1898) sparrows: “... on February 1 of the present year, when, after an uncommonly severe storm of snow, rain, and sleet, a number of English [house] sparrows were brought to the Anatomical Laboratory of Brown University. Seventy-two of these birds revived; sixty-four perished; ...” He published pages of data on the measurements of the birds that died versus those that revived, and the data have been used ever since to test statistical methods of measuring natural selection on multivariate traits, and for teaching evolutionary biology. This is not high-throughput biology. This is a single clever, opportunistic, low-tech and idiosyncratic study by an individual investigator who, by virtue of having published his data, enhanced the value of his science, and is still being read to this day. These data are nearly meaningless out of context. But even without elaborate machine-readable metadata, these data have been used in hundreds of papers and by many thousands of students, and are still being reused over a century later. The other lesson to take away from here is that publication-related data is only the tip of the iceberg of the long tail, but it is the low-hanging fruit. It is the most consistently valuable, the most consistently reusable, and the easiest to archive in a systematic way, because there is a vast socio-technical-economic infrastructure of publication that can be leveraged.
  • Two aspects of the way Dryad works merit more detailed explanation; both lower the burden of data submission.
  • Integrated manuscript submission is described in more detail, since it is more relevant to how Dryad data records get populated with ORCIDs.
  • Canonical handshaking workflow: data files are exchanged in a BagIt tarball (files + manifest), and completed submissions are harvested via OAI-PMH updates. There are future plans to use OAI-ORE for the manifest. The mechanism could be extended to allow deposit from research software (e.g. R, digital lab notebooks) or institutional repositories (e.g. with SWORD).
  • An example of an email sent to Dryad from an integrated journal (Molecular Ecology) with author names highlighted.
  • One can see these benefits in action with a 2009 data package that compiled wood anatomy data from 8412 plant species. It has already been downloaded over 600 times! While some of these downloads may lead to citations, there is probably a good deal of data reuse for educational purposes, and for exploration of analytical methods on this unique dataset. The inset from the corresponding Ecology Letters article shows the geographical distribution of wood density in North and South America. Each data point is the mean wood density value of all unique species occurrences in that cell. Wood density clearly varies in a very predictable way with temperature, precipitation, and seasonality. Dryad contains the data underlying this figure, but without Dryad, researchers would be unable to reconstruct the original data from this image for testing new hypotheses.
  • Dryad is one of many member nodes in the DataONE network, an NSF-funded DataNet that includes federal labs, research stations, earth observatory networks, citizen science data archives, etc., and covers both earth science and life science data. Data is replicated across member nodes, while metadata and a layer of services are provided by a smaller number of redundant coordinating nodes. Dryad is the only member node that focuses on publication data, although data from many of the others are used in the published literature. One of the technical goals is to support distributed authentication, so that CRUD rights can be propagated from node to node for individuals, groups and organizations. DataONE has adopted InCommon, a single sign-on technology for US research and education, which is based on SAML authentication and authorization (e.g. Shibboleth). Specifically, DataONE uses InCommon Silver, which uses verified profiles allowing a high degree of trust.
  • Dryad Metadata Application Profile 3.0
  • Leveraging publication metadata to help overcome the data ingest bottleneck
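
The handshaking note above mentions exchanging data files as a BagIt tarball (files plus manifest). The following is a minimal sketch of that packaging idea, assuming the standard BagIt layout (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-<algorithm>.txt` of checksums). The function names, the bag name, and the example file are hypothetical; the talk does not describe the actual Dryad exchange format in this detail.

```python
import hashlib
import io
import tarfile


def build_bag(files, bag_name="bag"):
    """Return gzipped tarball bytes of a minimal BagIt-style bag.

    files: dict mapping a relative file name to its bytes payload.
    """
    members = {}
    manifest_lines = []
    for name, payload in sorted(files.items()):
        path = f"data/{name}"  # BagIt keeps payload files under data/
        digest = hashlib.sha256(payload).hexdigest()
        manifest_lines.append(f"{digest}  {path}")
        members[path] = payload
    members["manifest-sha256.txt"] = ("\n".join(manifest_lines) + "\n").encode("utf-8")
    members["bagit.txt"] = b"BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, payload in members.items():
            info = tarfile.TarInfo(name=f"{bag_name}/{path}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def verify_bag(tar_bytes, bag_name="bag"):
    """Recompute each payload checksum and compare it with the manifest."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        manifest = tar.extractfile(f"{bag_name}/manifest-sha256.txt").read().decode("utf-8")
        for line in manifest.splitlines():
            digest, path = line.split("  ", 1)
            payload = tar.extractfile(f"{bag_name}/{path}").read()
            if hashlib.sha256(payload).hexdigest() != digest:
                return False
    return True


# Hypothetical payload standing in for a deposited data file.
bag = build_bag({"measurements.csv": b"species,density\nA,0.5\n"})
bag_ok = verify_bag(bag)
```

The receiving repository can run the same verification before accepting the files, which is what makes the manifest useful in a handshake between systems.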

    1. Leveraging publication metadata to help overcome the data ingest bottleneck
       Todd J. Vision
       National Evolutionary Synthesis Center
       Department of Biology
       University of North Carolina at Chapel Hill
       ORCID Participant Meeting, Harvard, May 2011
    2. The End
       To make data archiving integral to scientific publishing.
       The Scope
       Data underlying findings in the peer-reviewed biological literature.
       The Means
       Integrated submission of data with the manuscript
       Low barrier to submission (at the datafile level)
       Free reuse of data (free as in both speech & beer)
       Journals share responsibility for governance and sustainability
    3. The long tail of orphan data in “small science” (after B. Heidorn)
       Figure labels: Specialized repositories (e.g. GenBank, PDB); Volume; Orphan data; Rank frequency of datatype
    4. The long tail of orphan data in “small science” (after B. Heidorn)
       Figure labels: Specialized repositories (e.g. GenBank, PDB); Volume; Orphan data; Rank frequency of datatype
       Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.
    5. A publication package
    6. A publication package
       1. Integrated manuscript and data submission
    7. A publication package
       1. Integrated manuscript and data submission
       2. Handshaking with specialized repositories
    8. Integrated: Submit manuscript
    9. Integrated: Submit manuscript; Manuscript metadata
    10. Integrated: Submit manuscript; Submit data; Manuscript metadata
    11. Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review
    12. Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production
    13. Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation
    14. Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article; Publication; Data publication
    15.
    16. Non-integrated and Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Submit data; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article; Publication; Data publication
    17. Non-integrated and Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Submit data; Acceptance notification; Curation; Data DOI; Production; Author adds DOI; Data DOI; Article metadata; Curation; Article publication; Article; Publication; Article metadata harvested; Data publication
    18. Article
       Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
       Dryad data package
       Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
    19. Integrated submission
       Currently integrated or in process: 20
       All journals with Dryad content: >70
       A minority require data prior to review
       Journals published by a variety of organizations
       Traditional (incl. Oxford University Press, Wiley-Blackwell)
       Open Access (incl. BMC, BMJ Open)
       Society publishers (e.g. with Allen Press, or independent)
    20. Dryad vs. Supplementary Online Materials
    21. 612 downloads
    22. Member nodes
       Dryad, ORNL DAAC, Knowledge Network for Biocomplexity, etc.
       Coordinating nodes
       Investigator toolkit
    23. Why Dryad yearns for ORCIDs
       Replace name strings with identities
       Disambiguation of like names
       Clustering of synonymous names
       Confidently recognizing different data packages that share an author
       Enabling
       Accurate author searches
       Internal and external author hyperlinks
       Aggregation of author contributions
       Inclusion of data records in the profiles of coauthors
       Propagation of ORCIDs with Dryad metadata
       Manual curation of names not feasible
       Only ~20% of Dryad authors in Library of Congress name auth. file
       Manual control would explode curation costs
    24. How to get ORCIDs into Dryad
       Ideally sent to Dryad by integrated journals
       Pre-review/Pre-production: allows coauthors to edit data packages
       Post-production: works for all other uses
       Non-integrated journals
       Lookup API based on article or affiliation data
       To be avoided
       Authors required to enter ORCIDs during submission
       Authors required to register during submission
    25. What do we know about authors?
       Names
       Often abbreviated except for corresponding or submitting author
       At least one article they have written
       Title, journal, volume, pages, DOI, abstract
       Other identifiable information
       An email for submitting authors
       Sometimes: institutional affiliation and contact information for corresponding authors
    26. Some requirements
       Recognizing ORCIDs for authenticated users
       Mapping to InCommon Silver profiles
       ORCIDs for organizations (e.g. consortia)
       DSpace support
       Curator interface for ORCID lookup/verification
       Lookup/registration option from submission interface
       Allowing metadata relationships (e.g. of an ORCID with a name)
       Mechanisms for curator to
       Flag duplicates and errors
       Register provisional ORCIDs
       Map to other profiles (e.g. InCommon)
    27. Business model issues
       Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
       With a not-for-profit budget
       Feasibility requires wide adoption by publishers
       And manuscript-submission system developers!
       Favored model
       Pay for use of automated lookup services, with costs scaled by usage level
       Credit for curator contributions
    28. "Cherish old knowledge that you may acquire new"
       The Analects of Confucius
       Special thanks to
       Elena Feinstein
       Jane Greenberg
       Ryan Scherle
       For more information:
       http://datadryad.org
       http://blog.datadryad.org
       http://datadryad.org/wiki
       http://code.google.com/p/dryad
       dryad-users@nescent.org
       Facebook: Dryad
       Twitter: @datadryad
    29. Dryad Metadata Profile (v3.0)
       Article
         • dc.identifier = doi of article
         • bibo.status = article publication status
         • dc.creator = authors of article
         • dc.issued = article publication date
         • dc.title = title of article
         • bibo.journal = journal title
         • bibo.issn and bibo.eissn
         • bibo.volume
         • bibo.issue
         • bibo.pageStart and bibo.pageEnd
         • dc.abstract = article abstract
         • dc.isReferencedBy = data package doi
       Data Package
         • dc.identifier = doi of data package
         • dc.relation.hasPart = dois of data files
         • dc.references = handle of article description record
         • dc.title = title of data package
         • dc.description (not article abstract, optional)
         • dc.creator = authors of data package
         • dc.date (with refinements: dates associated with submission to Dryad and archiving in the repository)
         • dryad.external = GenBank accession number, TreeBASE identifier
         • dc.relation = URL of related resource
         • dc.subject = general keywords
         • DarwinCore.ScientificName = taxon keywords
         • dc.spatial = geographic keywords
         • dc.temporal = timespan keywords
         • dryad.curatorNote
       Datafile
         • dc.identifier = doi of data file
         • dc.relation.isPartOf = doi of data package
         • file-specific description: keywords, authors, format, size, checksum, etc.
         • embargo information (type, end date)
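
The metadata profile on the final slide links three record types through DOIs: the article points to the data package (`dc.isReferencedBy`), the package lists its files (`dc.relation.hasPart`), and each file points back to its package (`dc.relation.isPartOf`). A minimal sketch of those links as plain dictionaries follows, with a check that they agree. The article and package DOIs are the ones cited on slide 18; the data file DOI and the `links_are_consistent` helper are hypothetical and only illustrate the structure.

```python
def links_are_consistent(article, package, files):
    """Check that article, data package, and data files reference each other."""
    package_doi = package["dc.identifier"]
    # The article record must point at this data package.
    if article["dc.isReferencedBy"] != package_doi:
        return False
    # The package's list of parts must match the file records exactly.
    file_dois = {f["dc.identifier"] for f in files}
    if set(package["dc.relation.hasPart"]) != file_dois:
        return False
    # Every file must point back at the package.
    return all(f["dc.relation.isPartOf"] == package_doi for f in files)


article = {
    "dc.identifier": "doi:10.1371/journal.pone.0018011",
    "dc.isReferencedBy": "doi:10.5061/dryad.8384",
}
package = {
    "dc.identifier": "doi:10.5061/dryad.8384",
    "dc.relation.hasPart": ["doi:10.5061/dryad.8384/1"],  # hypothetical file DOI
}
files = [
    {
        "dc.identifier": "doi:10.5061/dryad.8384/1",  # hypothetical file DOI
        "dc.relation.isPartOf": "doi:10.5061/dryad.8384",
    }
]

consistent = links_are_consistent(article, package, files)
```

A check of this kind is one way a curator tool could catch a data file that was attached to the wrong package before publication.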