Leveraging publication metadata to help overcome the data ingest bottleneck

A talk on Dryad given at the ORCID Participant Meeting in Boston, 5/18/2011

Published in: Technology, Education

  1. Leveraging publication metadata to help overcome the data ingest bottleneck
     Todd J. Vision
     National Evolutionary Synthesis Center
     Department of Biology
     University of North Carolina at Chapel Hill
     ORCID Participant Meeting, Harvard, May 2011
  2. The End
     To make data archiving integral to scientific publishing.
     The Scope
     Data underlying findings in the peer-reviewed biological literature.
     The Means
     - Integrated submission of data with the manuscript
     - Low barrier to submission (at the datafile level)
     - Free reuse of data (free as in both speech & beer)
     - Journals share responsibility for governance and sustainability
  3. The long tail of orphan data in “small science” (after B. Heidorn)
     [Figure labels: Volume; Rank frequency of datatype; Specialized repositories (e.g. GenBank, PDB); Orphan data]
  4. The long tail of orphan data in “small science” (after B. Heidorn)
     [Figure labels: Volume; Rank frequency of datatype; Specialized repositories (e.g. GenBank, PDB); Orphan data]
     Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.
  5. A publication package
  6. A publication package
     1. Integrated manuscript and data submission
  7. A publication package
     1. Integrated manuscript and data submission
     2. Handshaking with specialized repositories
  8. Integrated (workflow diagram labels): Submit manuscript
  9. Integrated (workflow diagram labels): Submit manuscript; Manuscript metadata
  10. Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata
  11. Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review
  12. Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production
  13. Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation
  14. Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article; Publication; Data publication
  15. [no text on slide]
  16. Non-integrated / Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Submit data; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article; Publication; Data publication
  17. Non-integrated / Integrated (workflow diagram labels): Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Submit data; Acceptance notification; Curation; Data DOI; Production; Author adds DOI; Data DOI; Article metadata; Curation; Article publication; Article; Publication; Article metadata harvested; Data publication
  18. Article
      Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
      Dryad data package
      Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
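
The two citations on slide 18 differ only in a "Data from:" title prefix, the repository name, and the DOI. A minimal sketch of that pattern follows; the `Article` class and `data_package_citation` function are hypothetical names introduced for illustration, not part of any Dryad software.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Minimal article metadata needed to derive a data package citation."""
    authors: str
    year: int
    title: str


def data_package_citation(article: Article, data_doi: str) -> str:
    """Format a data package citation following the pattern shown on the slide."""
    return (
        f"{article.authors} ({article.year}) Data from: {article.title}. "
        f"Dryad Digital Repository. doi:{data_doi}"
    )


# Values copied from the slide.
article = Article(
    authors="Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA",
    year=2011,
    title=(
        "Stalking the fourth domain in metagenomic data: searching for, "
        "discovering, and interpreting novel, deep branches in phylogenetic "
        "trees of phylogenetic marker genes"
    ),
)

citation = data_package_citation(article, "10.5061/dryad.8384")
print(citation)
```

Because the citation is derived entirely from article metadata plus the data DOI, no extra input from the author is needed to produce it.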
  19. Integrated submission
      Currently integrated or in process: 20
      All journals with Dryad content: >70
      A minority require data prior to review
      Journals published by a variety of organizations
      - Traditional (incl. Oxford University Press, Wiley-Blackwell)
      - Open Access (incl. BMC, BMJ Open)
      - Society publishers (e.g. with Allen Press, or independent)
  20. Dryad vs. Supplementary Online Materials
  21. 612 downloads
  22. Member nodes
      - Dryad, ORNL DAAC, Knowledge Network for Biocomplexity, etc.
      Coordinating nodes
      Investigator toolkit
  23. Why Dryad yearns for ORCIDs
      Replace name strings with identities
      - Disambiguation of like names
      - Clustering of synonymous names
      - Confidently recognizing different data packages that share an author
      Enabling
      - Accurate author searches
      - Internal and external author hyperlinks
      - Aggregation of author contributions
      - Inclusion of data records in the profiles of coauthors
      - Propagation of ORCIDs with Dryad metadata
      Manual curation of names not feasible
      - Only ~20% of Dryad authors in Library of Congress name authority file
      - Manual control would explode curation costs
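
The difference between name strings and identities on slide 23 can be illustrated with a small sketch. All records below are hypothetical, and the `id-0001` style identifiers are placeholders rather than real ORCID iDs.

```python
from collections import defaultdict

# Hypothetical records: (name string as printed, placeholder identifier, data package)
records = [
    ("Wu D", "id-0001", "package A"),
    ("Wu D", "id-0002", "package B"),   # a different person with a like name
    ("D. Wu", "id-0001", "package C"),  # same person as the first row, synonymous name
]


def group_by(rows, key_index):
    """Group data package labels by the chosen column (0 = name, 1 = identifier)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return dict(groups)


by_name = group_by(records, 0)  # merges two people, splits one person
by_id = group_by(records, 1)    # one group per person

print(by_name)
print(by_id)
```

Grouping by name string merges packages A and B (like names) and separates package C (synonymous name); grouping by identifier handles both cases without manual curation.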
  24. How to get ORCIDs into Dryad
      Ideally sent to Dryad by integrated journals
      - Pre-review/pre-production: allows coauthors to edit data packages
      - Post-production: works for all other uses
      Non-integrated journals
      - Lookup API based on article or affiliation data
      To be avoided
      - Authors required to enter ORCIDs during submission
      - Authors required to register during submission
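
A lookup based on article data, as proposed on slide 24 for non-integrated journals, might behave roughly like the sketch below. It is not an actual ORCID or Dryad API: `REGISTRY`, `name_key`, and `lookup_identifier` are hypothetical, the registry is an in-memory stand-in for a remote service, and the identifiers are placeholders.

```python
from typing import Optional

# Hypothetical registry: article DOI -> list of (author name, placeholder identifier)
REGISTRY = {
    "10.1371/journal.pone.0018011": [
        ("Wu D", "id-0001"),
        ("Wu M", "id-0003"),
        ("Eisen JA", "id-0004"),
    ],
}


def name_key(name: str) -> tuple:
    """Reduce 'Surname Initials' to (surname, first initial), lowercased."""
    parts = name.split()
    surname = parts[0].lower()
    initial = parts[1][0].lower() if len(parts) > 1 else ""
    return surname, initial


def lookup_identifier(article_doi: str, author_name: str) -> Optional[str]:
    """Return an identifier only when exactly one author of the article matches."""
    candidates = [
        ident
        for name, ident in REGISTRY.get(article_doi, [])
        if name_key(name) == name_key(author_name)
    ]
    return candidates[0] if len(candidates) == 1 else None


found = lookup_identifier("10.1371/journal.pone.0018011", "Wu D")
unknown = lookup_identifier("10.0000/not-registered", "Wu D")
print(found, unknown)
```

Returning `None` for missing or ambiguous matches leaves those cases to a curator instead of requiring authors to enter or register identifiers during submission.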
  25. What do we know about authors?
      Names
      - Often abbreviated except for corresponding or submitting author
      At least one article they have written
      - Title, journal, volume, pages, DOI, abstract
      Other identifiable information
      - An email for submitting authors
      - Sometimes: institutional affiliation and contact information for corresponding authors
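
Abbreviated names, as noted on slide 25, discard information that would otherwise distinguish authors. The sketch below assumes names written as given names followed by a single-word surname; `abbreviate` is a hypothetical helper, and "Tina J. Vision" is an invented name used only to show a collision.

```python
def abbreviate(full_name: str) -> str:
    """Convert 'Given M. Surname' to 'Surname GM' (single-word surname assumed)."""
    parts = full_name.replace(".", " ").split()
    surname = parts[-1]
    initials = "".join(p[0].upper() for p in parts[:-1])
    return f"{surname} {initials}"


real = abbreviate("Todd J. Vision")      # speaker named on slide 1
invented = abbreviate("Tina J. Vision")  # hypothetical second author

print(real, invented, real == invented)
```

Two distinct full names collapse to the same abbreviated string, which is why the other known facts (an article, an email, an affiliation) matter for matching an author to an identity.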
  26. Some requirements
      Recognizing ORCIDs for authenticated users
      - Mapping to InCommon Silver profiles
      ORCIDs for organizations (e.g. consortia)
      DSpace support
      - Curator interface for ORCID lookup/verification
      - Lookup/registration option from submission interface
      - Allowing metadata relationships (e.g. of an ORCID with a name)
      Mechanisms for curator to
      - Flag duplicates and errors
      - Register provisional ORCIDs
      - Map to other profiles (e.g. InCommon)
  27. Business model issues
      Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
      - With a not-for-profit budget
      Feasibility requires wide adoption by publishers
      - And manuscript-submission system developers!
      Favored model
      - Pay for use of automated lookup services, with costs scaled by usage level
      - Credit for curator contributions
  28. "Cherish old knowledge that you may acquire new"
      The Analects of Confucius
      Special thanks to
      - Elena Feinstein
      - Jane Greenberg
      - Ryan Scherle
      For more information:
      - http://datadryad.org
      - http://blog.datadryad.org
      - http://datadryad.org/wiki
      - http://code.google.com/p/dryad
      - dryad-users@nescent.org
      - Facebook: Dryad
      - Twitter: @datadryad
  29. Dryad Metadata Profile (v3.0)
      Article
      - dc.identifier = DOI of article
      - bibo.status = article publication status
      - dc.creator = authors of article
      - dc.issued = article publication date
      - dc.title = title of article
      - bibo.journal = journal title
      - bibo.issn and bibo.eissn
      - bibo.volume
      - bibo.issue
      - bibo.pageStart and bibo.pageEnd
      - dc.abstract = article abstract
      - dc.isReferencedBy = data package DOI
      Data Package
      - dc.identifier = DOI of data package
      - dc.relation.hasPart = DOIs of data files
      - dc.references = handle of article description record
      - dc.title = title of data package
      - dc.description (not article abstract, optional)
      - dc.creator = authors of data package
      - dc.date (with refinements: dates associated with submission to Dryad and archiving in the repository)
      - dryad.external = GenBank accession number, TreeBASE identifier
      - dc.relation = URL of related resource
      - dc.subject = general keywords
      - DarwinCore.ScientificName = taxon keywords
      - dc.spatial = geographic keywords
      - dc.temporal = timespan keywords
      - dryad.curatorNote
      Datafile
      - dc.identifier = DOI of data file
      - dc.relation.isPartOf = DOI of data package
      - file-specific description: keywords, authors, format, size, checksum, etc.
      - embargo information (type, end date)
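
The profile links a data package to its files in both directions (`dc.relation.hasPart` and `dc.relation.isPartOf`). The sketch below shows how such records could be checked for consistency. The dictionary layout and the `check_links` helper are assumptions for illustration, not Dryad's storage format; the package DOI comes from slide 18, while the data file DOI is hypothetical.

```python
# Data package record using a few field names from the profile.
package = {
    "dc.identifier": "doi:10.5061/dryad.8384",
    "dc.title": (
        "Data from: Stalking the fourth domain in metagenomic data: searching "
        "for, discovering, and interpreting novel, deep branches in "
        "phylogenetic trees of phylogenetic marker genes"
    ),
    # Hypothetical data file DOI, for illustration only.
    "dc.relation.hasPart": ["doi:10.5061/dryad.8384/1"],
}

files = [
    {
        "dc.identifier": "doi:10.5061/dryad.8384/1",
        "dc.relation.isPartOf": "doi:10.5061/dryad.8384",
    },
]


def check_links(pkg: dict, file_records: list) -> list:
    """Return a list of problems in the package/file cross-references."""
    problems = []
    file_ids = {f["dc.identifier"] for f in file_records}
    # Every file listed by the package should have a file record.
    for part in pkg["dc.relation.hasPart"]:
        if part not in file_ids:
            problems.append(f"missing file record: {part}")
    # Every file record should point back to this package.
    for f in file_records:
        if f["dc.relation.isPartOf"] != pkg["dc.identifier"]:
            problems.append(f"wrong parent: {f['dc.identifier']}")
    return problems


print(check_links(package, files))
```

An empty list means the cross-references agree in both directions.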
