Preparing Data for (Open) Publication

Talk given at CODATA Workshop, Bangalore, India, March 7 2015.

Published in: Science

  1. Preparing data for (open) publication
     Brian Hole, Founder and CEO
     ISI CODATA Workshop, Bangalore, 9th March 2015
     brian.hole@ubiquitypress.com / www.ubiquitypress.com / @ubiquitypress
  2. Overview
     • Think open
     • Plan early and well
     • Examples
     • Follow best practices
  3. Think open
  5. Plan early & well
  6. Data management plans
     • Encourage researchers to follow best practices
     • Provide a structured approach to data throughout its lifecycle
     • Now mandated by many funders:
       • US: National Science Foundation (NSF); National Endowment for the Humanities (NEH); National Aeronautics and Space Administration (NASA); National Oceanic and Atmospheric Administration (NOAA); Institute of Museum and Library Services (IMLS); Agency for Healthcare Research and Quality (AHRQ); Gordon & Betty Moore Foundation; Alfred P. Sloan Foundation
       • UK: Economic and Social Research Council (ESRC)
       • Europe: Horizon 2020
       • Other international mandates: http://www.sherpa.ac.uk
  7. DMP structure
     1. Administrative data
     2. Data collection
     3. Documentation and metadata
     4. Ethics and legal compliance
     5. Storage and backup
     6. Selection and preservation
     7. Data sharing
     8. Responsibilities and resources
     Source: DCC. (2013). Checklist for a Data Management Plan. v.4.0. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/data-management-plans
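The eight-section structure above can be kept as a simple checklist while a plan is being drafted. The following is a minimal sketch, not part of the DCC checklist itself: the function name `missing_sections` and the example draft are hypothetical, and only the section titles come from the slide.

```python
# Section titles follow the DCC checklist order listed on the slide.
DMP_SECTIONS = [
    "Administrative data",
    "Data collection",
    "Documentation and metadata",
    "Ethics and legal compliance",
    "Storage and backup",
    "Selection and preservation",
    "Data sharing",
    "Responsibilities and resources",
]


def missing_sections(plan: dict[str, str]) -> list[str]:
    """Return the checklist sections that are absent or empty in `plan`."""
    return [name for name in DMP_SECTIONS if not plan.get(name, "").strip()]


# Hypothetical draft with only the first two sections filled in.
draft = {
    "Administrative data": "Project title, PI name, grant reference.",
    "Data collection": "Survey responses exported as CSV.",
}

for name in missing_sections(draft):
    print("Still to write:", name)
```

Running a check like this before submitting a grant application is one way to confirm that no section has been left blank.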
  8. 1. Administrative data
     Here you should record basic information to identify and contextualise your plan. Identifiers may help to link your DMP with information held in other systems. You should include:
     • Basic information, e.g. project title, your name, contact details, reference numbers / IDs
     • A summary of the research to explain the purpose for which data are being collected
     • Details of related policies and procedures, e.g. institutional data policy or departmental guidelines
     Source: XKCD, http://xkcd.com/97/
  9. 2. Data collection
     Here you should consider what data you will collect and how.
     • How will you structure and name your folders and files?
     • What quality assurance processes will you adopt?
     • What standards or methodologies will you use to create data?
     • Do your chosen formats and software enable sharing and long-term access to the data?
     • Are there any existing data that you can reuse?
     Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1849
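One way to answer the folder and file naming question above is to generate names from a fixed pattern rather than typing them by hand. The sketch below assumes a hypothetical convention of `project_description_YYYYMMDD_vNN.ext`; the slide does not prescribe any particular scheme, and the function name `data_filename` is invented for illustration.

```python
import re
from datetime import date


def data_filename(project: str, description: str, collected: date,
                  version: int, ext: str) -> str:
    """Build a file name under an assumed project_description_date_version pattern."""

    def slug(text: str) -> str:
        # Lower-case and collapse anything that is not a letter or digit into "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    extension = ext.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{collected:%Y%m%d}_v{version:02d}.{extension}"


print(data_filename("Soil Survey", "pH readings (site A)", date(2015, 3, 9), 2, ".CSV"))
# prints: soil-survey_ph-readings-site-a_20150309_v02.csv
```

Putting the date in year-month-day order means that an alphabetical listing is also chronological, and the zero-padded version number keeps `v10` after `v09`.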
  10. 3. Documentation and metadata
      Here you should consider what information is needed for the data to be read and interpreted in the future. Estimate how much time and effort will be needed to create this supporting documentation and ensure that you allow for sufficient resource.
      • What documentation and metadata will accompany the data?
      • How will you capture / create this documentation and metadata?
      • What metadata standards will you use and why?
      Source: Gary Larson
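Metadata is easier to capture from the beginning if every dataset is saved with a small machine-readable description. The following is a minimal sketch, assuming a JSON "sidecar" file whose field names are borrowed from the Dublin Core element set; which fields are required is an assumption made for illustration, and the right standard will depend on the discipline and the target repository.

```python
import json

# Assumed minimum set of fields; names follow Dublin Core elements.
REQUIRED_FIELDS = ("title", "creator", "date", "description", "format", "rights")


def build_metadata(**fields: str) -> str:
    """Return a JSON metadata record, refusing to proceed if a required field is empty."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return json.dumps(fields, indent=2, sort_keys=True, ensure_ascii=False)


# Hypothetical example record.
record = build_metadata(
    title="Soil pH readings, site A",
    creator="A. Researcher",
    date="2015-03-09",
    description="Weekly pH measurements taken at 10 cm depth.",
    format="text/csv",
    rights="CC0",
)
print(record)
```

The resulting text can be written next to the data file, for example as `readings.csv.json`, so that the description travels with the data.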
  11. 4. Ethics and legal compliance
      Here you should consider any ethical or legal issues, particularly in terms of restrictions they may place on data sharing.
      • Have you gained consent for data sharing and preservation?
      • How will you protect the identity of participants if required? e.g. via anonymisation
      • How will the data be licensed for reuse?
      • Will data sharing be postponed / restricted? e.g. to publish or seek patents
      Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1957
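As one illustration of protecting participant identity, direct identifiers can be replaced with keyed hashes before data are shared. This is a sketch under stated assumptions: it performs pseudonymisation rather than full anonymisation, the function name `pseudonymise` and the example IDs are hypothetical, and indirect identifiers (dates of birth, postcodes, rare conditions) would still need separate treatment.

```python
import hashlib
import hmac


def pseudonymise(participant_id: str, secret_key: bytes, length: int = 12) -> str:
    """Map a participant ID to a stable code using HMAC-SHA256.

    The same ID and key always give the same code, so records can still be
    linked, but the code cannot be reversed without the key.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# The key below is a placeholder; a real key should be random and stored
# separately from the shared data.
key = b"replace-with-a-randomly-generated-secret"
print(pseudonymise("P-001", key))
print(pseudonymise("P-002", key))
```

A keyed hash is used instead of a plain hash because short, predictable IDs could otherwise be recovered by hashing every possible value.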
  12. 5. Storage and backup
      Here you should consider where the data will be stored and any implications this has for backup, access and security.
      • Do you have sufficient storage or will you need to include charges for additional services?
      • Who will be responsible for backup and recovery?
      • What are the risks to data security and how will these be managed?
      • How will you ensure that collaborators can access your data securely?
      Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=2237
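A backup is only useful if it can be shown to match the original. One common approach, sketched here with hypothetical function names, is to record a SHA-256 checksum for every file and compare the manifests of the working copy and the backup.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """List files that are missing from the backup or whose checksum differs."""
    return sorted(name for name, checksum in original.items()
                  if backup.get(name) != checksum)
```

Calling `changed_files(build_manifest(data_dir), build_manifest(backup_dir))` on two directory paths returns an empty list when every original file is present and identical in the backup.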
  13. 6. Selection and preservation
      Here you should determine which data are of long-term value and should be preserved. Decide how best to preserve those data, for example by depositing in repositories.
      • Which data must be retained or destroyed for contractual, legal, or regulatory purposes?
      • Which data should be preserved and potentially shared?
      • What are the foreseeable research uses for your data?
      • What is the long-term preservation plan for the dataset?
      • Have you costed in the time and effort required to prepare the data for preservation and sharing?
      Source: XKCD, http://xkcd.com/309/
  14. 7. Data sharing
      Here you should consider which data you will share and how. The methods used will depend on a number of factors such as the type, size, complexity and sensitivity of the data. Also consider how people might acknowledge the reuse of your data (e.g. via citations) so you gain impact.
      • How will potential users find out about your data?
      • With whom will you share the data, and under what conditions?
      • When will you make the data available?
      • Are any restrictions on data sharing required?
      • What action will you take to overcome or minimise restrictions?
      Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=100
  15. 8. Responsibilities and resources
      Here you should assign roles and responsibilities for all data management activities. Also carefully consider any resources needed to deliver your plan. These costs can usually be written into grant applications but need to be clearly outlined and justified.
      • Who is responsible for implementing the DMP, and ensuring it is reviewed and revised?
      • How will responsibilities be split across partner sites in collaborative research projects?
      • What resources will you require to deliver your plan?
      • Is additional specialist expertise or equipment required?
      Source: SMBC, http://smbc-comics.com/index.php?db=comics&id=1893
  16. Online tools
      • DMPOnline (UK Digital Curation Centre): https://dmponline.dcc.ac.uk/
      • DMPTool (California Digital Library): https://dmp.cdlib.org/
  17. Examples
  18. Repositories
      Modified from: XKCD
  19. GenBank
      • Two upload tools: BankIt for short sequences, Sequin for complex or multiple sequences
      • Sequence data uploaded as a FASTA file
      • Immediate or future release instruction
      • Citation of a reference paper
      • Names of source organisms and any related descriptive data
      • Sequence features (e.g. CDS, gene, rRNA, tRNA, with nucleotide intervals and product names) and topology
      • Organism name, applicable source modifiers, location:
        • Genus and species names (if not previously provided in FASTA file)
        • If name is new or unrecognized, provide best known taxonomic lineage
        • If genus and/or species names are not known, provide most specific name known (for example: Bacillus sp., Uncultured bacterium, Uncultured archaeon)
        • Most complete name for any synthetic vector (for example: Cloning vector pAB234, Transfer vector p789Abc)
        • Source modifiers include: strain, clone, isolate, specimen-voucher, isolation-source, country
        • Location: organelle (mitochondrion, chloroplast, etc.); map and/or chromosome
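To make the FASTA requirement concrete, the sketch below writes one record with the organism name and source modifiers on the definition line. It assumes the bracketed `[modifier=value]` style that NCBI's submission tools have accepted on FASTA definition lines; the sequence ID, strain name, line width and the function name `fasta_record` are all hypothetical, and the current NCBI submission guidance should be checked before preparing a real upload.

```python
def fasta_record(seq_id: str, sequence: str, modifiers: dict[str, str],
                 width: int = 70) -> str:
    """Format one FASTA record with source modifiers on the definition line."""
    tags = " ".join(f"[{name}={value}]" for name, value in modifiers.items())
    header = f">{seq_id} {tags}".rstrip()
    # Remove any whitespace, upper-case, and wrap the sequence at `width` characters.
    clean = "".join(sequence.split()).upper()
    lines = [clean[i:i + width] for i in range(0, len(clean), width)]
    return "\n".join([header, *lines]) + "\n"


# Hypothetical sequence and modifier values.
print(fasta_record(
    "Seq1",
    "atgcgtacgttagc" * 10,
    {"organism": "Bacillus sp.", "strain": "AB12", "country": "India"},
))
```

Keeping the modifiers in a dictionary makes it straightforward to fill them from the same spreadsheet that records strain, isolate and collection details.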
  20. ClinicalTrials.gov
      • Section 801 of the US Food and Drug Administration Amendments Act of 2007 requires clinical trial registration and the submission of results
      • Web-based data entry system called the Protocol Registration and Results System (PRS)
      • Standard format:
        • Study Type: ‘Observational’ or ‘Interventional’
        • Outcome Measures: the Primary and Secondary Outcome Measure Titles and Descriptions
        • Outcome Measure Time Frame
        • Conditions or Focus of the Study
        • Intervention Information: each intervention is entered separately using the Intervention Type, Name, and Description data elements
        • Eligibility: list of key inclusion and exclusion criteria
        • Locations
  21. FlyBase
      • FlyBase contains a complete annotation of the Drosophila melanogaster genome
      • It also includes a searchable bibliography of research on Drosophila genetics
      • Detail which genes feature in your paper, and FlyBase will link your paper to those genes for the next release cycle
      • Provide additional information about your publication during the submission process to help the curators speed up curation
      • The whole process takes about 5 minutes
  22. Figshare
      • Accepts all types and formats of data, no restrictions
      • Concise metadata, takes around 5 minutes
      • Multiple licenses, no cost
  23. Dryad
      Typical process:
      • Authors submit their manuscripts to the journal for consideration.
      • Journal provides information about manuscripts to Dryad through automated notices from the manuscript processing system, which creates a provisional Dryad record for the data.
      • Journal invites authors to archive data in Dryad, through a custom submission link that brings the author to the provisional record.
      • Authors upload their files to Dryad through the submission link supplied by the journal; no redundant information need be entered and the article details are correct.
      • Dryad curators process and approve the data files and register the Digital Object Identifier (DOI), a permanent identifier that allows the data to be cited and tracked; curators convey the DOI to the journal.
      • Journal and publisher add the Dryad DOI to all forms of the final article, enabling readers of the article to access the data.
      • Dryad can also provide links to data in other repositories, including sequences in GenBank and phylogenetic trees in TreeBASE.
      License: CC0
      Cost: $80 / ₹5,000
  25. Nature Scientific Data
      • Papers are called “data descriptors”
      • Fill out and submit a paper template
      • Requires an ISA-tab metadata file
      • Quality of data a major focus
      • License: CC-BY/NC
      • APC: £890 / ₹84,000
  26. GigaScience
      • Data submitted to public databases, complemented by its citable form in GigaDB
      • License: CC0
      • APC: £650 / ₹61,000
  27. Ubiquity Press
      1. The paper contents
         a. The methods section of the paper must provide sufficient detail that a reader can understand how the resource was created.
         b. The resource must be correctly described.
         c. The reuse section must provide concrete and useful suggestions for reuse of the resource.
      2. The deposited resource
         a. The repository must be suitable for the resource and have a sustainability model.
         b. Open license permits unrestricted access (e.g. CC0), or access guaranteed if criteria met (must qualify).
         c. A version in an open, non-proprietary format.
         d. Labeled in such a way that a 3rd party can make sense of it.
         e. Must be actionable.
  28. The basics of the model
      1) Low barrier data publication
         • Data papers are short
         • Peer review is quick and objective
         • Low APC: £100 / ₹1,000
         • No-questions-asked waivers
      2) Online authoring
         • Lower cost (straight to XML)
         • Encourages shorter form
      3) Open access only (CC-BY)
      4) The publisher is not the repository
  29. Follow best practices
  30. 11 best practices for data publication
      • Record your methodology well, with reproducibility in mind
      • Make sure you can export to open formats
      • Record your selection and QA processes well
      • Choose appropriate metadata standards, and record metadata from the beginning
      • Ensure you obtain proper consent, and that it allows for open publication if possible
      • Consider the timing of data publication, e.g. to coincide with research papers
      • Consider potential reuse scenarios from the start
      • Choose an appropriate repository
      • Think about possible restrictions and access conditions early; justify them and seek to minimise them
      • Plan to publish with maximum dissemination, for example as a data paper
      • Allocate time and funding for data publication in grant proposals
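The advice on exporting to open formats can be checked early by writing a small export routine while the project is still running. The sketch below assumes tabular data held as a list of dictionaries and writes it as CSV using only the Python standard library; the column names and the function name `to_csv` are hypothetical.

```python
import csv
import io


def to_csv(records: list[dict], columns: list[str]) -> str:
    """Serialise records as CSV text, keeping only the named columns in order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical measurements; the "note" column is deliberately left out of the export.
readings = [
    {"site": "A", "ph": 6.5, "note": "after rain"},
    {"site": "B", "ph": 7.1, "note": ""},
]
print(to_csv(readings, ["site", "ph"]))
```

Listing the columns explicitly fixes their order and makes it a deliberate decision which fields are published.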
  31. Any questions? Or please feel free to contact brian.hole@ubiquitypress.com
