Data Stewardship for SPATIAL/IsoCamp 2014


Overview of open science and best practices for data management for 2014 SPATIAL and IsoCamp, University of Utah. 13 June 2014.

Published in: Science, Technology

  1. 1. Data Stewardship: Tips & Tools. Carly Strasser, California Digital Library. SPATIAL / IsoCamp, June 2014
  2. 2. I am not a librarian. But I do work at a library.
  3. 3. Enable data sharing • Encourage new incentives • Think about code sharing • Work with libraries, publishers and researchers • Explore new tools to help change system • Build tools
  4. 4. Why are you here? Science: you’re (probably) doing it wrong
  5. 5. Back in the day… Da Vinci Curie Newton Darwin
  6. 6. Research has changed Better
  7. 7. From wikimedia Such Internet! So many tools! From Flickr by John Jobby So much data!
  8. 8. Research has changed Worse
  9. 9. Digital data. From Flickr by Flickmor; From Flickr by US Army Environmental Command; From Flickr by DW0825; C. Strasser; Courtesy of WHOI; From Flickr by deltaMike
  10. 10. Digital data + Complex workflows
  11. 11. Scientists are bad at data management.
  12. 12. An embarrassing example… From Flickr by lincolnblues
  13. 13. ?
  14. 14. From Flickr by ransomtech Didn’t share the data Didn’t document the data (metadata) Didn’t document provenance/workflow
  15. 15. From Flickr by ransomtech Reproducibility Transparency Reuse NO
  16. 16. From Flickr by johntrainor Why should I care?
  17. 17. Because reproducibility* is one of the fundamental tenets of science. *Reproducibility here means being able to go from data to figures/results, not independent verification by following the same techniques.
  18. 18. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible.
  19. 19. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible. Because Fox News, creationism, and the war on science.
  20. 20. “Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.” Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology
  21. 21. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible. Because Fox News, creationism, and the war on science Because it means faster progress.
  22. 22. Because you are a good person.
  23. 23. From Flickr by Redden-McAllister; From Flickr by Ken Cowell; From Flickr by Brandi Jordan
  24. 24. Open Science: Making data, research, and dissemination available to all
  25. 25. Map of Scientific Collaborations
  26. 26. Because you have to.
  27. 27. Journals Institutions Funders From Flickr by Eva Rinaldi Celebrity and Live Music Photographer
  28. 28. … “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.” Feb 2013
  29. 29. 1.  Maximize free public access 2.  Ensure researchers create data management plans 3.  Allow costs for data preservation and access in proposal budgets 4.  Ensure evaluation of data management plan merits 5.  Ensure researchers comply with their data management plans 6.  Promote data deposition into public repositories 7.  Develop approaches for identification and attribution of datasets 8.  Educate folks about data stewardship From Flickr by Joe Crimmings Photography
  30. 30. From Flickr by Michael Tinkler
  31. 31. Data management best practices. From Flickr by BigSwedeGuy
  32. 32. From Flickr by Mark Sardella Plan before data collection
  33. 33. Planning: Design sample naming scheme • Create a key (data dictionary) • Make sure names are unique • Define codes. From Flickr by zebbie
  34. 34. Planning Design file naming scheme
  35. 35. Planning: Design file naming scheme. Use descriptive file names • Unique • Reflect contents. Bad: Mydata.xls, 2001_data.csv, best version.txt. Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)*. *Not for everyone. From R Cook, ESA Best Practices Workshop 2010
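The naming pattern on this slide (study organism, site name, year, what was measured) can be generated by a small helper so that every file in a project follows the same scheme. The following is only a sketch: the function name `build_file_name` and the choice to strip all punctuation are assumptions made for illustration, not part of the original talk.

```python
import re


def build_file_name(organism, site, year, measurement, extension="csv"):
    """Join the four descriptive parts into one file name.

    Hypothetical helper: the order of the parts follows the
    'Better' example on the slide.
    """
    parts = [organism, site, str(year), measurement]
    # Remove everything except letters and digits so the name contains
    # no spaces, slashes, or punctuation.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the file name must be non-empty")
    return "_".join(cleaned) + "." + extension


print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", "xls"))
# prints: Eaffinis_nanaimo_2010_counts.xls
```

Generating names from one function, rather than typing them by hand, is one way to keep them unique and consistent across a project.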
  36. 36. Planning: Design file organization. Folder hierarchy: Biodiversity > Lake > Experiments (Biodiv_H20_heatExp_2005to2008.csv, Biodiv_H20_predatorExp_2001to2003.csv, …); Biodiversity > Lake > Field work (Biodiv_H20_PlanktonCount_2001toActive.csv, Biodiv_H20_ChlAprofiles_2003.csv, …); Biodiversity > Grassland. Consider… • Dependencies? • File formats? • Time of collection? • Order of analysis? From S. Hampton
  37. 37. Planning: Design your spreadsheet • Constrain entries • Atomize • Break down spreadsheets
  38. 38. Planning: Consider a database. A relational database is: • A set of tables • Relationships among the tables • A language to specify & query the tables. An RDB provides: • Scalability: millions+ records • Features for sub-setting, querying, sorting • Reduced redundancy & entry errors. From Mark Schildhauer
  39. 39. Planning: Consider a database. You should invest time in learning databases if: • your data sets are large or complex. Consider investing time in learning databases if: • your data are small and humble • you ever intend to share your data • you are < 30 years old. From Mark Schildhauer
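The three ingredients named on the database slides (a set of tables, relationships among them, and a query language) fit in a few lines using Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the table names, column names, and all values are made up for illustration and do not come from the talk.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# A set of tables, plus a relationship: each count row points at a site.
conn.executescript(
    """
    CREATE TABLE site (
        site_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE plankton_count (
        count_id INTEGER PRIMARY KEY,
        site_id  INTEGER NOT NULL REFERENCES site(site_id),
        year     INTEGER NOT NULL,
        n        INTEGER NOT NULL CHECK (n >= 0)  -- constrains entry errors
    );
    """
)

# Made-up rows: the site name is stored once, not repeated per count.
conn.executemany(
    "INSERT INTO site (site_id, name) VALUES (?, ?)",
    [(1, "nanaimo"), (2, "lake_a")],
)
conn.executemany(
    "INSERT INTO plankton_count (site_id, year, n) VALUES (?, ?, ?)",
    [(1, 2010, 42), (1, 2011, 57), (2, 2010, 13)],
)

# A language to query the tables: total count per site.
rows = conn.execute(
    """
    SELECT site.name, SUM(plankton_count.n)
    FROM plankton_count
    JOIN site ON site.site_id = plankton_count.site_id
    GROUP BY site.name
    ORDER BY site.name
    """
).fetchall()
print(rows)  # prints: [('lake_a', 13), ('nanaimo', 99)]
```

The `CHECK` and `UNIQUE` constraints illustrate the "reduced redundancy & entry errors" point: the database itself rejects a negative count or a duplicated site name.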
  40. 40. Store your data in a repository Institutional archive Discipline/specialty archive Pick a data repository From Flickr by torkildr Ask a librarian Repos of repos: Planning
  41. 41. Planning: Decide on preservation/backup. What software? What hardware? What personnel? How often? Set up reminders! Test system. From Flickr by sepasynod; From Flickr by taberandrew; From Flickr by withassociates
  42. 42. Planning: Write a data management plan! …document that describes what you will do with your data throughout the research project. From Flickr by Barbies Land
  43. 43. DMP components But they all have different requirements and express them in different ways •  What will be collected •  Methods •  Standards •  Metadata •  Sharing/access •  Long-term storage Planning From Flickr by Barbies Land
  44. 44. Step-by-step wizard for generating DMP create | edit | re-use | share Free & open to community Planning
  45. 45. During Data Collection & Entry From Flickr by Julia Manzerova
  46. 46. During collection: Keep raw data raw. Realistically: • Archive .csv version of raw data • Make a “raw” tab in working data file • Do all work on other tabs
  47. 47. During collection: Keep raw data raw. Ideally: • Use scripts to process data • Save them with data. (Raw data as .csv; R script for processing & analysis.)
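The "ideal" case can be sketched as a script that reads the raw .csv, writes its result to a separate file, and confirms that the raw file was left untouched. The slide mentions an R script; Python is used here only for illustration. The column name `temperature_c` and the rule "drop rows with a missing temperature" are assumptions, not the author's processing steps.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path):
    """Checksum used to confirm that a file has not changed."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def process_raw(raw_path, clean_path):
    """Copy rows with a temperature value from raw_path to clean_path.

    Returns the number of rows kept. The raw file is only ever read.
    """
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    checksum_before = sha256_of(raw_path)

    kept = 0
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning rule: skip rows with no temperature.
            if row["temperature_c"].strip() == "":
                continue
            writer.writerow(row)
            kept += 1

    if sha256_of(raw_path) != checksum_before:
        raise RuntimeError("raw file changed during processing")
    return kept
```

Saving a script like this next to the data means the cleaned file can always be regenerated from the raw one.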
  48. 48. During collection: Document your workflow. Workflow: how you get from the raw data to the final products of your research. Simple workflow: flow chart. [Flow chart: Temperature data + Salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production]
  49. 49. During collection: Document your workflow. Workflow: how you get from the raw data to the final products of your research. Simple workflow: commented script • R, SAS, MATLAB… • Well-documented code is: easier to review, easier to share, easier to use for repeat analysis
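A commented script of the kind described here might look like the sketch below, which covers the quality-control and "mean, SD" steps of the simple workflow. The slide names R, SAS, and MATLAB; Python is used only as an illustration. The readings, the `-999.0` error code, and the plausible range are invented for the example.

```python
import statistics


def quality_control(values, low, high):
    """Step 1: keep only readings inside the plausible range [low, high]."""
    return [v for v in values if low <= v <= high]


def summarize(values):
    """Step 2: return the mean and the sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up temperature readings; -999.0 stands in for a sensor error code.
temperature_c = [12.0, 14.0, 13.0, -999.0]

# Quality control & data cleaning: drop physically implausible values.
clean_temperature_c = quality_control(temperature_c, low=-2.0, high=40.0)

# Analysis: mean, SD.
mean_c, sd_c = summarize(clean_temperature_c)
print(f"n={len(clean_temperature_c)} mean={mean_c:.1f} sd={sd_c:.1f}")
# prints: n=3 mean=13.0 sd=1.0
```

Each comment names the workflow step it implements, so the script doubles as documentation of how the summary statistics were produced.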
  50. 50. Fancy schmancy workflows Resulting output During collection Document your workflow
  51. 51. Workflows enable •  Reproducibility •  Transparency •  Reuse From Flickr by merlinprincesse During collection Document your workflow
  52. 52. Constrain data entries •  Excel lists •  Data validation •  Google docs forms Modified from K. Vanderbilt During collection
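Outside of Excel lists and form tools, the same idea can be applied in a script that checks each entry against an allowed vocabulary before it is accepted. This is a sketch only: the field names `site` and `count` and the set of allowed sites are hypothetical.

```python
# Hypothetical controlled vocabulary for the "site" field.
ALLOWED_SITES = {"nanaimo", "lake_a"}


def validate_entry(entry):
    """Return a list of problems found in one entry; empty means acceptable."""
    problems = []

    # Constrain categorical values to a fixed list, like an Excel drop-down.
    site = entry.get("site")
    if site not in ALLOWED_SITES:
        problems.append(f"unknown site: {site!r}")

    # Constrain numeric values to whole, non-negative numbers.
    raw_count = entry.get("count", "")
    try:
        if int(raw_count) < 0:
            problems.append("count must not be negative")
    except (TypeError, ValueError):
        problems.append(f"count is not a whole number: {raw_count!r}")

    return problems


print(validate_entry({"site": "nanaimo", "count": "12"}))  # prints: []
print(len(validate_entry({"site": "Nanaimo ", "count": "twelve"})))  # prints: 2
```

Returning every problem at once, rather than stopping at the first, lets the person entering data fix a row in one pass.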
  53. 53. Atomize During collection One piece of information per cell
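As one illustration of atomizing, a cell that mixes a number with its unit (for example `12.5 C`) can be split into two cells. The helper name and the assumption that value and unit are separated by whitespace are made up for this sketch.

```python
def atomize_measurement(cell):
    """Split a combined 'value unit' cell into (value, unit).

    Assumes exactly two whitespace-separated pieces; anything else is
    rejected rather than guessed at.
    """
    parts = cell.split()
    if len(parts) != 2:
        raise ValueError(f"cannot split {cell!r} into value and unit")
    value_text, unit = parts
    # float() raises ValueError if the first piece is not a number.
    return float(value_text), unit


print(atomize_measurement("12.5 C"))  # prints: (12.5, 'C')
```

Once value and unit live in separate columns, the values can be sorted, averaged, and validated as numbers.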
  54. 54. During collection: Break down spreadsheets. Fake a relational database • Create a site table • Create parameter table. From doi:10.3334/ORNLDAAC/777; From R Cook, ESA Best Practices Workshop 2010
  55. 55. Why are you promoting Excel? During collection Create metadata
  56. 56. During collection: Create metadata. Metadata: data reporting • WHO created the data? • WHAT is the content of the data set? • WHEN was it created? • WHERE was it collected? • HOW was it developed? • WHY was it developed? From Flickr by //ichael Patric|{
  57. 57. Digital context •  Name of the data set •  The name(s) of the data file(s) in the data set •  Date the data set was last modified •  Example data file records for each data type file •  Pertinent companion files •  List of related or ancillary data sets •  Software (including version number) used to prepare/read the data set •  Data processing that was performed Personnel & stakeholders •  Who collected •  Who to contact with questions •  Funders Scientific context •  Scientific reason why the data were collected •  What data were collected •  What instruments (including model & serial number) were used •  Environmental conditions during collection •  Temporal & spatial resolution •  Standards or calibrations used Information about parameters •  How each was measured or produced •  Units of measure •  Format used in the data set •  Precision & accuracy if known Information about data •  Definitions of codes used •  Quality assurance & control measures •  Known problems that limit data use (e.g. uncertainty, sampling problems) During collection Create metadata
  58. 58. During collection: Create metadata. What is metadata? Metadata standards… • Provide structure to describe data: common terms | definitions | language | structure • Come in many flavors: EML, FGDC, ISO19115, DarwinCore,… • Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
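Before adopting a formal standard, the WHO/WHAT/WHEN/WHERE/HOW/WHY questions can at least be recorded in a small file stored next to the data. The sketch below writes such a "sidecar" file as JSON. It is an illustration only and not an implementation of EML, FGDC, or any other standard; the field names and the `.metadata.json` suffix are assumptions.

```python
import json
from pathlib import Path

# The six questions from the metadata slide, used as required keys.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def write_metadata(data_path, metadata):
    """Write metadata as JSON next to the data file and return its path."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("metadata is missing: " + ", ".join(missing))

    data_path = Path(data_path)
    # Hypothetical naming convention: <data file name>.metadata.json
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    sidecar.write_text(
        json.dumps(metadata, indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    return sidecar
```

Refusing to write incomplete metadata is a deliberate choice here: it turns "I will document it later" into an error at the moment the data file is saved.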
  59. 59. During collection: Back up daily. Copies: Original, Near, Far. From Flickr by lippo; From Flickr by see phar
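The "Original, Near, Far" idea can be sketched as a script that copies a file to two other locations and verifies each copy against the original. The function name and the use of plain directories for the "near" and "far" locations are assumptions; in practice the far copy would sit on a different machine or service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Checksum used to confirm that a copy matches the original."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, near_dir, far_dir):
    """Copy `original` into both directories and verify each copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for directory in (Path(near_dir), Path(far_dir)):
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / original.name
        shutil.copy2(original, copy)  # copy2 also preserves timestamps
        # Testing the backup matters as much as making it.
        if sha256_of(copy) != expected:
            raise OSError(f"backup at {copy} does not match the original")
        copies.append(copy)
    return copies
```

A script like this can be run from a daily scheduled job, which covers the "set up reminders" and "test system" points from the planning slides.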
  60. 60. During collection From Flickr by Barbies Land Remember that data management plan? Revisit Review Revise
  61. 61. During collection Schedule a time each week or month Revisit Review Revise From Flickr by purplemattfish
  62. 62. Where to start? From Flickr by celikins
  63. 63. From Flickr by Andy Graulund Make a resolution • Triage on current projects • Get advisor, lab mates, collaborators on board • Do better next time
  64. 64. Start working online. From Flickr by karindalziel
  65. 65. Reproducibility, E-notebooks, Online science
  66. 66. Write a DMP: dmptool.org. Step-by-step wizard for generating DMP • create | edit | re-use | share • Free & open to community
  67. 67. Where should I put my data? Find a repository
  68. 68. Get help. From Flickr by thewmatt
  69. 69. Get help from your library. From Flickr by North Carolina Digital Heritage Center; From Flickr by Madison Guy
  70. 70. Learn new skills: Software Carpentry
  71. 71. From Flickr by Micah Taylor Other Fun Stuff
  72. 72. Altmetrics? Impact Factors + Citation Counts Credit in academia…
  73. 73. Altmetrics Article-level metrics Altmetrics for alt-products Data Code Slides Blogs Downloads Tweets Mentions Views From Flickr by Skakerman
  74. 74. Altmetrics Article-level metrics Altmetrics for alt-products
  75. 75. Researcher Identification
  76. 76. BIG initiatives…
  77. 77. NSF funded DataNet Project Office of Cyberinfrastructure
  78. 78. New partners…
  79. 79. Better methods…
  80. 80. Better methods…
  81. 81. Science is changing. Embrace it.
  82. 82. From Flickr by dotpolka Manage & share your data!
  83. 83. Website Email Twitter Slides @carlystrasser