Coping with Data for WHOI JP Students

Data management best practices presentation for JP Students at Woods Hole Oceanographic Institution, 12 April 2014.

Published in: Education

  1. Coping With Your Data: Tips & Tools. Carly Strasser, California Digital Library, carlystrasser@gmail.com. WHOI, 10 April 2014.
  2. Roadmap: 1. Background; 2. Best practices; 3. Toolbox.
  3. C. Strasser
  4. Scientists are bad at data management. (From Flickr by robertpaulyoung)
  5. Many tables
  6. Embedded figures
  7. my spreadsheet: No headings
  8. my spreadsheet
  9. my spreadsheet
  10. ?
  11. Didn’t share the data. Didn’t document the data (metadata). Didn’t document provenance/workflow. (From Flickr by ransomtech)
  12. Reproducibility, transparency, reuse: NO. (From Flickr by ransomtech)
  13. Why should I care? (From Flickr by johntrainor)
  14. Because they care: (From Flickr by Redden-McAllister)
  15. You need to know about: data management, metadata, data repositories, data sharing. (“the Truth”, from sandierpastures.com)
  16. … “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.” (Feb 2013)
  17. (From Flickr by Joe Crimmings Photography)
      1. Maximize free public access
      2. Ensure researchers create data management plans
      3. Allow costs for data preservation and access in proposal budgets
      4. Ensure evaluation of data management plan merits
      5. Ensure researchers comply with their data management plans
      6. Promote data deposition into public repositories
      7. Develop approaches for identification and attribution of datasets
      8. Educate folks about data stewardship
  18. Data management: Best Practices (From Flickr by BigSwedeGuy)
  19. Plan before data collection (From Flickr by Mark Sardella)
  20. Planning: Design sample naming scheme. • Create a key (data dictionary) • Make sure names are unique • Define codes (From Flickr by zebbie)
  21. Planning: Design file naming scheme (PhDcomics.com)
  22. Planning: Design file naming scheme. Use descriptive file names: • Unique • Reflect contents. Bad: Mydata.xls, 2001_data.csv, best version.txt. Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured). *Not for everyone* (From R Cook, ESA Best Practices Workshop 2010)
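The naming pattern on the slide above (study organism, site name, year, what was measured) can be generated by a small helper so that every file follows it. This is only a sketch in Python: the function name `make_data_filename` and the rule of stripping everything except letters and digits are assumptions made for illustration, not part of the presentation.

```python
import re


def make_data_filename(organism, site, year, measurement, ext="csv"):
    """Build a name of the form organism_site_year_measurement.ext."""
    parts = [organism, site, str(year), measurement]
    cleaned = []
    for part in parts:
        # Assumption: keep only letters and digits so the name is safe
        # on any file system and underscores stay unambiguous separators.
        token = re.sub(r"[^A-Za-z0-9]+", "", part)
        if not token:
            raise ValueError(f"empty name component: {part!r}")
        cleaned.append(token)
    return "_".join(cleaned) + "." + ext.lstrip(".")


# Reproduces the "Better" example from the slide.
print(make_data_filename("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
```

Building names from one function keeps the order of the components consistent, which is what makes the files sortable and searchable later.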
  23. Planning: Design file organization (From S. Hampton)
  24. Planning: Design file organization. Folders: Biodiversity, Lake, Experiments, Field work, Grassland. Files: Biodiv_H20_heatExp_2005to2008.csv, Biodiv_H20_predatorExp_2001to2003.csv, …, Biodiv_H20_PlanktonCount_2001toActive.csv, Biodiv_H20_ChlAprofiles_2003.csv, … Consider… • Dependencies? • File formats? • Time of collection? • Order of analysis? Workflows! (From S. Hampton)
  25. Planning: Design your spreadsheet. Constrain entries; atomize; break down spreadsheets. (From Flickr by Ulleskelf)
  26. Planning: Consider a database. A relational database is: a set of tables; relationships among the tables; a language to specify & query the tables. An RDB provides: scalability (millions+ records); features for sub-setting, querying, sorting; reduced redundancy & entry errors. (From Mark Schildhauer)
  27. Planning: Consider a database. You should invest time in learning databases if your data sets are large or complex. Consider investing time in learning databases if: your data are small and humble; you ever intend to share your data; you are < 30 years old. (From Mark Schildhauer)
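To make "a set of tables, relationships among the tables, and a language to query them" concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names, column names, and all values are invented for illustration; the slides do not prescribe a particular database or schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the relationship
    conn.executescript("""
        CREATE TABLE site (
            site_id   TEXT PRIMARY KEY,
            latitude  REAL NOT NULL,
            longitude REAL NOT NULL
        );
        CREATE TABLE sample (
            sample_id     INTEGER PRIMARY KEY,
            site_id       TEXT NOT NULL REFERENCES site(site_id),
            collected_on  TEXT NOT NULL,
            temperature_c REAL
        );
    """)
    # Hypothetical records: site details are stored once, not per sample.
    conn.executemany(
        "INSERT INTO site VALUES (?, ?, ?)",
        [("A", 41.52, -70.67), ("B", 41.55, -70.61)],
    )
    conn.executemany(
        "INSERT INTO sample (site_id, collected_on, temperature_c) "
        "VALUES (?, ?, ?)",
        [("A", "2014-04-01", 10.0),
         ("A", "2014-04-02", 12.0),
         ("B", "2014-04-01", 8.0)],
    )
    conn.commit()
    return conn


def mean_temperature_by_site(conn):
    """Join the two tables and summarize with one query."""
    rows = conn.execute("""
        SELECT s.site_id, AVG(m.temperature_c)
        FROM site AS s
        JOIN sample AS m ON m.site_id = s.site_id
        GROUP BY s.site_id
        ORDER BY s.site_id
    """).fetchall()
    return dict(rows)
```

Because `sample.site_id` must match a row in `site`, a mistyped site code is rejected at entry time, which is one form of the "reduced redundancy & entry errors" benefit listed on the slide.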
  28. Planning: Pick a data repository. Store your data in a repository: institutional archive; discipline/specialty archive. Ask a librarian. Repos of repos: databib.org, re3data.org. (From Flickr by torkildr)
  29. Planning: Decide on preservation/backup. What software? What hardware? What personnel? How often? Set up reminders! Test system. (From Flickr by sepasynod; from Flickr by taberandrew; from Flickr by withassociates)
  30. Planning: Write a data management plan! …document that describes what you will do with your data throughout the research project (From Flickr by Barbies Land)
  31. Planning: DMP components. • What will be collected • Methods • Standards • Metadata • Sharing/access • Long-term storage. But they all have different requirements and express them in different ways. (From Flickr by Barbies Land)
  32. Planning: dmptool.org. Step-by-step wizard for generating DMP: create | edit | re-use | share. Free & open to community.
  33. During Data Collection & Entry (From Flickr by Julia Manzerova)
  34. During collection: Keep raw data raw. Realistically: • Archive .csv version of raw data • Make a “raw” tab in working data file • Do all work on other tabs
  35. During collection: Keep raw data raw. Ideally: • Use scripts to process data • Save them with data. (Raw data as .csv; R script for processing & analysis)
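The "ideal" case above, a raw .csv plus a script that does all the processing, might look like the following sketch. The slide shows an R script; Python is used here only for illustration. The column name `temperature_c` and the missing-value code `-999` are assumptions, not details from the presentation.

```python
import csv
from pathlib import Path


def clean_temperature_file(raw_path, clean_path, missing_code="-999"):
    """Read the raw CSV (never modified) and write a cleaned copy.

    Returns the number of rows kept.
    """
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    # Guard: the output must never replace the raw file.
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    kept = 0
    with raw_path.open(newline="") as src, \
            clean_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Quality-control step: drop rows carrying the
            # (assumed) missing-value code.
            if row["temperature_c"].strip() == missing_code:
                continue
            writer.writerow(row)
            kept += 1
    return kept
```

Saved next to the data, a script like this doubles as a record of exactly which rows were removed and why, so the cleaned file can always be regenerated from the raw one.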
  36. During collection: Document your workflow. Workflow: how you get from the raw data to the final products of your research. Simple workflow: flow chart. Flow chart boxes: Temperature data; Salinity data; Data import into Excel; Data in spreadsheet; Quality control & data cleaning; “Clean” T & S data; Analysis: mean, SD; Summary statistics; Graph production.
  37. During collection: Document your workflow. Workflow: how you get from the raw data to the final products of your research. Simple workflow: commented script. • R, SAS, MATLAB… • Well-documented code is: easier to review; easier to share; easier to use for repeat analysis. (# % $ &)
  38. During collection: Document your workflow. Fancy schmancy workflows: https://kepler-project.org. Resulting output.
  39. During collection: Document your workflow. Workflows enable: • Reproducibility • Transparency • Reuse (From Flickr by merlinprincesse)
  40. During collection: Constrain data entries. • Excel lists • Data validation • Google docs forms (Modified from K. Vanderbilt)
  41. During collection: Atomize. One piece of information per cell.
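The two ideas above, constraining entries and keeping one piece of information per cell, can also be checked in a script when the data do not live in Excel or a Google form. The sketch below is a Python illustration only; the species codes, the field names, and the combined "site, depth" cell format are all hypothetical.

```python
# Hypothetical controlled vocabulary, playing the role of an Excel list.
ALLOWED_SPECIES = {"Eaffinis", "Atonsa"}


def validate_entry(row):
    """Return a list of problems in one data-entry row (empty if valid)."""
    problems = []
    if row.get("species") not in ALLOWED_SPECIES:
        problems.append(f"unknown species code: {row.get('species')!r}")
    try:
        if int(row.get("count", "")) < 0:
            problems.append("count must not be negative")
    except (TypeError, ValueError):
        problems.append(f"count is not an integer: {row.get('count')!r}")
    return problems


def atomize_location(cell):
    """Split a combined cell such as 'Nanaimo, 5 m' into separate fields."""
    site, depth = [part.strip() for part in cell.split(",", 1)]
    depth_value, unit = depth.split()
    return {"site": site, "depth": float(depth_value), "depth_unit": unit}
```

For example, `atomize_location("Nanaimo, 5 m")` yields separate site, depth, and unit fields, each of which can then be sorted, filtered, or validated on its own.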
  42. During collection: Break down spreadsheets. Fake a relational database: create a site table; create a parameter table. (From doi:10.3334/ORNLDAAC/777; from R Cook, ESA Best Practices Workshop 2010)
  43. During collection: Create metadata. Why are you promoting Excel?
  44. During collection: Create metadata. Metadata: data reporting. WHO created the data? WHAT is the content of the data set? WHEN was it created? WHERE was it collected? HOW was it developed? WHY was it developed? (From Flickr by //ichaelPatric|{)
  45. During collection: Create metadata
      Digital context
        • Name of the data set
        • The name(s) of the data file(s) in the data set
        • Date the data set was last modified
        • Example data file records for each data type file
        • Pertinent companion files
        • List of related or ancillary data sets
        • Software (including version number) used to prepare/read the data set
        • Data processing that was performed
      Personnel & stakeholders
        • Who collected
        • Who to contact with questions
        • Funders
      Scientific context
        • Scientific reason why the data were collected
        • What data were collected
        • What instruments (including model & serial number) were used
        • Environmental conditions during collection
        • Temporal & spatial resolution
        • Standards or calibrations used
      Information about parameters
        • How each was measured or produced
        • Units of measure
        • Format used in the data set
        • Precision & accuracy if known
      Information about data
        • Definitions of codes used
        • Quality assurance & control measures
        • Known problems that limit data use (e.g. uncertainty, sampling problems)
  46. During collection: Create metadata. What is metadata? Metadata standards… • Provide structure to describe data: common terms | definitions | language | structure • Come in many flavors: EML, FGDC, ISO19115, DarwinCore,… • Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)
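Before adopting a formal standard, the who/what/when/where/how/why questions above can at least be recorded in a small machine-readable file stored next to the data. The sketch below writes a JSON "sidecar" file in Python. The `.metadata.json` suffix and the six field names are assumptions for illustration, and the result is not EML, FGDC, or ISO 19115 compliant; tools such as Morpho are the route to standard-conformant metadata.

```python
import json
from pathlib import Path

# Assumed minimal field set, taken from the six questions on the slide.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def write_metadata(data_path, metadata):
    """Write metadata to '<data file>.metadata.json' beside the data file."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    sidecar = Path(str(data_path) + ".metadata.json")
    sidecar.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sidecar
```

Refusing to write an incomplete record is a deliberate choice here: it turns "document the data" from an intention into a step that cannot be skipped.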
  47. During collection: Back up daily. Original, Near, Far. (From Flickr by lippo; from Flickr by see phar)
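The "original, near, far" idea above can be scripted so that each copy is also verified. This is a minimal Python sketch, assuming the near and far locations are simply folders that are already reachable (for example an external drive and a mounted network share); the function names are hypothetical, and scheduling the daily run is left to the operating system.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, destinations):
    """Copy a file into each destination folder and verify every copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / original.name
        shutil.copy2(original, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise OSError(f"backup copy is corrupt: {target}")
        copies.append(target)
    return copies
```

Comparing checksums after copying is a cheap way to "test system", as an earlier slide advises, rather than discovering a bad backup only when it is needed.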
  48. During collection: Remember that data management plan? Revisit, review, revise. (From Flickr by Barbies Land)
  49. During collection: Revisit, review, revise. Schedule a time each week or month. (From Flickr by purplemattfish)
  50. Toolbox (From Flickr by dipster1)
  51. Write a DMP: dmptool.org. Step-by-step wizard for generating DMP: create | edit | re-use | share. Free & open to community.
  52. Find a repository: Where should I put my data? databib.org
  53. Get help (From Flickr by thewmatt)
  54. Toolbox: Get help. DCXL blog: dcxl.cdlib.org
  55. Get help from your library (From Flickr by North Carolina Digital Heritage Center; from Flickr by Madison Guy)
  56. Get help from me: carlystrasser@gmail.com
  57. Make a resolution: • Triage on current projects • Get advisor, lab mates, collaborators on board • Do better next time (From Flickr by Andy Graulund)
  58. Culture Shift Ahead (From Flickr by twm1340)
  59. science source notebook content access data government knowledge (From Flickr by cdsessums)
  60. Doing science is a privilege. Data hoarding is science malpractice. Manage & share your data! (From Flickr by dotpolka)
  61. Website: carlystrasser.net; Email: carlystrasser@gmail.com; Twitter: @carlystrasser; Slides: slideshare.net/carlystrasser
