Data Matters for AGU Early Career Conference

Presentation for 2014 Early Career Conference at AGU in San Francisco, 14 December. Covers data management, data sharing, open science.

Published in: Technology

  1. 1. Data Matters Tips & Tools for Better Research Carly Strasser, California Digital Library carlystrasser@gmail.com AGU Student & Early Career Scientist Conference 14 Dec 2014 From Flickr by Lachlan Donald
  2. 2. Why are you here? Science: you’re (probably) doing it wrong
  3. 3. From Wikimedia Commons Back in the day… From ahswhg.wikispaces.com
  4. 4. Back in the day… Da Vinci Curie Newton classicalschool.blogspot.com Darwin
  5. 5. Research has changed Better
  6. 6. From wikimedia Such Internet! So many tools! From Flickr by John Jobby So much data!
  7. 7. Research has changed Worse
  8. 8. Digital data From Flickr by Flickmor From Flickr by DW0825 From Flickr by US Army Environmental Command C. Strasser Courtesy of WHOI From Flickr by deltaMike
  9. 9. Digital data + Complex workflows
  10. 10. i.telegraph.co.uk
  11. 11. Scientists are bad at data management.
  12. 12. An embarrassing example… From Flickr by lincolnblues
  13. 13. ?
  14. 14. From Flickr by ransomtech Didn’t share the data Didn’t document the data (metadata) Didn’t document provenance/workflow
  15. 15. Why should I care? From Flickr by johntrainor
  16. 16. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible.
  17. 17. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible. Because Fox News, creationism, and the war on science.
  18. 18. “Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.” Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology
  19. 19. Because reproducibility is one of the fundamental tenets of science. Because we need to be credible. Because Fox News, creationism, and the war on science. Because it means faster progress.
  20. 20. Because you are a good person.
  21. 21. From Flickr by Redden-McAllister From Flickr by Ken Cowell From Flickr Brandi Jordan
  22. 22. Map of Scientific Collaborations flowingdata.com
  23. 23. Because you have to.
  24. 24. Journals Institutions Funders From Flickr by Eva Rinaldi Celebrity and Live Music Photographer
  25. 25. Feb 2013 … “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”
  26. 26. From Flickr by Michael Tinkler
  27. 27. From Flickr by Big Swede Guy data management Best Practices
  28. 28. From Flickr by Mark Sardella Plan before data collection
  29. 29. Planning: Design sample naming scheme • Create a key (data dictionary) • Make sure names are unique • Define codes From Flickr by zebbie
  30. 30. Planning: Design file naming scheme Use descriptive file names • Unique • Reflect contents Bad: Mydata.xls, 2001_data.csv, best version.txt Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured) *Not for everyone* From R Cook, ESA Best Practices Workshop 2010
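A minimal Python sketch of the naming scheme on this slide, assuming the four parts shown in the "Better" example (study organism, site name, year, what was measured). The function name `build_file_name` and its parameters are hypothetical, chosen only for illustration.

```python
import re

def build_file_name(organism: str, site: str, year: int, measurement: str, ext: str = "csv") -> str:
    """Join descriptive parts into one file name: organism_site_year_measurement.ext."""
    parts = [organism, site, str(year), measurement]
    # Drop spaces and punctuation so each part stays a single token
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the file name must contain a letter or digit")
    return "_".join(cleaned) + "." + ext

name = build_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls")
print(name)  # Eaffinis_nanaimo_2010_counts.xls
```

Generating names from one function, rather than typing them by hand, is one way to keep them consistent across a project.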
  31. 31. Planning: Design file organization Biodiversity Lake Experiments Field work Grassland Biodiv_H20_heatExp_2005to2008.csv Biodiv_H20_predatorExp_2001to2003.csv … Biodiv_H20_PlanktonCount_2001toActive.csv Biodiv_H20_ChlAprofiles_2003.csv … Consider… • Dependencies? • File formats? • Time of collection? • Order of analysis? From S. Hampton
  32. 32. Planning: Design your spreadsheet • Constrain entries • Atomize • Break down spreadsheets From Flickr by Ulleskelf
  33. 33. Planning: Consider a database A relational database is: • A set of tables • Relationships among the tables • A language to specify & query the tables An RDB provides: • Scalability: millions+ records • Features for sub-setting, querying, sorting • Reduced redundancy & entry errors From Mark Schildhauer
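To make the three ingredients on this slide concrete (tables, relationships among them, and a query language), here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and all values are invented for illustration and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.executescript("""
    CREATE TABLE site (
        site_id TEXT PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE count_obs (
        obs_id  INTEGER PRIMARY KEY,
        site_id TEXT NOT NULL REFERENCES site(site_id),  -- relationship to the site table
        year    INTEGER NOT NULL,
        n       INTEGER NOT NULL CHECK (n >= 0)           -- constraint that catches entry errors
    );
""")
conn.executemany("INSERT INTO site VALUES (?, ?)", [("NAN", "Nanaimo"), ("VIC", "Victoria")])
conn.executemany(
    "INSERT INTO count_obs (site_id, year, n) VALUES (?, ?, ?)",
    [("NAN", 2010, 12), ("NAN", 2011, 7), ("VIC", 2010, 3)],
)

# Sub-setting, joining, and sorting in one query
rows = conn.execute("""
    SELECT s.name, SUM(c.n)
    FROM count_obs AS c JOIN site AS s ON s.site_id = c.site_id
    WHERE c.year = 2010
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)  # [('Nanaimo', 12), ('Victoria', 3)]
```

The site name is stored once and referenced by ID, which is where the reduced redundancy comes from.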
  34. 34. Planning: Pick a data repository Store your data in a repository • Institutional archive • Discipline/specialty archive From Flickr by torkildr
  35. 35. Planning: Pick a data repository Store your data in a repository • Institutional archive • Discipline/specialty archive Ask a librarian From Flickr by torkildr
  36. 36. Planning: Pick a data repository Store your data in a repository • Institutional archive • Discipline/specialty archive Ask a librarian Repos of repos: databib.org, re3data.org From Flickr by torkildr
  37. 37. Planning: Decide on preservation/backup From Flickr by sepa synod From Flickr by taberandrew From Flickr by withassociates
  38. 38. Planning: Decide on preservation/backup • What software? • What hardware? • What personnel? • How often? Set up reminders! • Test system From Flickr by sepa synod From Flickr by taberandrew From Flickr by withassociates
  39. 39. Planning: Write a data management plan! …document that describes what you will do with your data throughout the research project From Flickr by Barbies Land
  40. 40. Planning: DMP components • What will be collected • Methods • Standards • Metadata • Sharing/access • Long-term storage But they all have different requirements and express them in different ways From Flickr by Barbies Land
  41. 41. Planning: dmptool.org Step-by-step wizard for generating DMP create | edit | re-use | share Free & open to community
  42. 42. During Data Collection & Entry From Flickr by Julia Manzerova
  43. 43. During collection: Keep raw data raw Realistically: • Archive .csv version of raw data • Make a “raw” tab in working data file • Do all work on other tabs
  44. 44. During collection: Keep raw data raw Ideally: • Use scripts to process data • Save them with data Raw data as .csv + R script for processing & analysis
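The slide pairs a raw .csv with an R script; the sketch below shows the same idea in Python as a stand-in. The raw file is only ever read, and the cleaned result goes to a separate file. The file names, the `temp_c` column, and the sample rows are assumptions made up for the example.

```python
import csv
import tempfile
from pathlib import Path

def process(raw_path: Path, out_path: Path) -> int:
    """Read the raw file (never written to) and save cleaned rows to a separate file."""
    kept = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["temp_c"].strip() == "":  # skip rows with a missing reading
                continue
            writer.writerow(row)
            kept += 1
    return kept

# Demo in a temporary folder with invented data
work = Path(tempfile.mkdtemp())
raw = work / "raw.csv"
raw.write_text("station,temp_c\nA,11.2\nB,\nC,9.8\n")
raw_before = raw.read_text()

n_kept = process(raw, work / "processed.csv")
print(n_kept, raw.read_text() == raw_before)  # 2 True
```

Because every change lives in the script, the processed file can be regenerated from the raw file at any time.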
  45. 45. During collection: Document your workflow Workflow: how you get from the raw data to the final products of your research Simple workflow: flow chart Temperature data + Salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
  46. 46. During collection: Document your workflow Workflow: how you get from the raw data to the final products of your research Commented script • R, SAS, MATLAB… • Well-documented code is: easier to review, easier to share, easier to use for repeat analysis
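As an illustration of a commented script, here is a short Python sketch of the temperature and salinity workflow from the previous slide: quality control, then mean and SD. The slides name R, SAS, and MATLAB; Python is used here only as a stand-in. The readings, the plausible-range limits, and the function names are all invented for the example.

```python
import statistics

# Step 1: raw input (hypothetical readings; None marks a missing value)
raw = [
    {"temp": 11.2, "sal": 33.1},
    {"temp": None, "sal": 33.4},
    {"temp": 99.9, "sal": 33.0},  # implausible value, e.g. a sensor glitch
    {"temp": 10.8, "sal": 32.9},
    {"temp": 11.5, "sal": 33.3},
]

def quality_control(rows, temp_range=(-2.0, 40.0)):
    """Step 2: keep rows with both values present and temperature in a plausible range."""
    low, high = temp_range
    return [
        row for row in rows
        if row["temp"] is not None and row["sal"] is not None and low <= row["temp"] <= high
    ]

def summarize(rows, key):
    """Step 3: mean and sample standard deviation of one variable."""
    values = [row[key] for row in rows]
    return statistics.mean(values), statistics.stdev(values)

clean = quality_control(raw)
temp_mean, temp_sd = summarize(clean, "temp")
print(len(clean), round(temp_mean, 2), round(temp_sd, 2))
```

Each step in the flow chart maps to one named, documented function, so a reviewer can follow the path from raw data to summary statistics.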
  47. 47. During collection: Constrain data entries • Excel lists • Data validation • Google docs forms Modified from K. Vanderbilt
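The slide lists spreadsheet and form tools for constraining entries. The same idea can be sketched in a script: check each entry against a fixed list of allowed codes and a numeric rule before accepting it. The code list, field names, and rules below are hypothetical.

```python
ALLOWED_SITES = {"nanaimo", "victoria"}  # hypothetical code list, like an Excel drop-down

def validate_entry(entry: dict) -> list:
    """Return a list of problems with one data entry; an empty list means it passes."""
    errors = []
    if entry.get("site") not in ALLOWED_SITES:
        errors.append("site must be one of " + ", ".join(sorted(ALLOWED_SITES)))
    count = entry.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        errors.append("count must be a non-negative integer")
    return errors

print(validate_entry({"site": "nanaimo", "count": 12}))   # []
print(validate_entry({"site": "Nanaimo ", "count": -1}))  # two problems reported
```

Note that "Nanaimo " fails because of the capital letter and trailing space, the kind of inconsistency that free-text entry tends to produce.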
  48. 48. During collection: Atomize One piece of information per cell
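A small sketch of what atomizing can mean in practice, assuming a cell that mixes a number and its unit (for example "12.5 C"). The function name `atomize` and the cell format are assumptions for illustration.

```python
import re

def atomize(cell: str) -> dict:
    """Split a combined cell such as '12.5 C' into a numeric value and a unit."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*", cell)
    if match is None:
        raise ValueError(f"cannot split {cell!r} into value and unit")
    return {"value": float(match.group(1)), "unit": match.group(2)}

print(atomize("12.5 C"))  # {'value': 12.5, 'unit': 'C'}
```

With the value and unit in separate columns, the numbers can be sorted, filtered, and averaged without further parsing.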
  49. 49. During collection: Break down spreadsheets Fake a relational database • Create a parameter table • Create a site table From doi:10.3334/ORNLDAAC/777 From R Cook, ESA Best Practices Workshop 2010
  50. 50. During collection: Create metadata Metadata: data reporting WHO created the data? WHAT is the content of the data set? WHEN was it created? WHERE was it collected? HOW was it developed? WHY was it developed? From Flickr by //ichael Patric|{
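One lightweight way to act on the six questions above is to store the answers in a small machine-readable file next to the data. The sketch below is not a formal metadata standard; the field names mirror the slide, the function name `make_metadata` is hypothetical, and every value is an invented placeholder.

```python
import json

# The six questions from the slide; each one must be answered
REQUIRED = ("who", "what", "when", "where", "how", "why")

def make_metadata(**fields) -> str:
    """Return the metadata as JSON text, refusing records with unanswered questions."""
    missing = [key for key in REQUIRED if not fields.get(key)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return json.dumps(fields, indent=2, sort_keys=True)

# All values below are invented placeholders
record = make_metadata(
    who="A. Researcher",
    what="Copepod counts per net tow",
    when="2010-06-01 to 2010-08-31",
    where="Nanaimo",
    how="Vertical net tows, counted under a dissecting microscope",
    why="Track seasonal change in abundance",
)
print(record)
```

Saving `record` to a file alongside the data set keeps the description and the data together.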
  51. 51. During collection: Create metadata Digital context • Name of the data set • The name(s) of the data file(s) in the data set • Date the data set was last modified • Example data file records for each data type file • Pertinent companion files • List of related or ancillary data sets • Software (including version number) used to prepare/read the data set • Data processing that was performed Personnel & stakeholders • Who collected • Who to contact with questions • Funders Scientific context • Scientific reason why the data were collected • What data were collected • What instruments (including model & serial number) were used • Environmental conditions during collection • Temporal & spatial resolution • Standards or calibrations used Information about parameters • How each was measured or produced • Units of measure • Format used in the data set • Precision & accuracy if known Information about data • Definitions of codes used • Quality assurance & control measures • Known problems that limit data use (e.g. uncertainty, sampling problems)
  52. 52. During collection: Create metadata What is a metadata standard? Metadata standards… • Provide structure to describe data: common terms | definitions | language | structure • Come in many flavors: EML, FGDC, ISO19115, DarwinCore,… • Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
  53. 53. During collection: Back up daily Original | Near | Far From Flickr by lippo From Flickr by see phar
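A minimal sketch of the original / near / far idea: copy a file to two other locations and confirm with a checksum that each copy matches the original. Here two temporary folders stand in for a nearby drive and an off-site location; the function names and paths are hypothetical, and a real setup would point at actual storage and run on a schedule.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum used to confirm that a copy is identical to the original."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def back_up(original: Path, destinations: list) -> list:
    """Copy the original into each destination folder and verify every copy."""
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder / original.name))
        if sha256(copy) != sha256(original):
            raise OSError(f"backup at {copy} does not match the original")
        copies.append(copy)
    return copies

# Demo with temporary folders standing in for "near" and "far" storage
work = Path(tempfile.mkdtemp())
original = work / "original" / "counts.csv"
original.parent.mkdir()
original.write_text("site,count\nnanaimo,12\n")

copies = back_up(original, [work / "near", work / "far"])
print([copy.parent.name for copy in copies])  # ['near', 'far']
```

Verifying the checksum is a simple way to test that the backup system works, rather than assuming it does.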
  54. 54. During collection From Flickr by Barbies Land Remember that data management plan? Revisit Review Revise
  55. 55. During collection Schedule a time each week or month Revisit Review Revise From Flickr by purplemattfish
  56. 56. From Flickr by celikins Where to start?
  57. 57. Make a resolution • Triage on current projects • Get advisor, lab mates, collaborators on board • Do better next time From Flickr by Andy Graulund
  58. 58. From Flickr by karindalziel Start working online
  59. 59. Open notebooks http://datapub.cdlib.org
  60. 60. Write a DMP dmptool.org Step-by-step wizard for generating DMP create | edit | re-use | share Free & open to community
  61. 61. databib.org Find a repository Where should I put my data?
  62. 62. Learn new skills software carpentry www.software-carpentry.org
  63. 63. Other Fun Stuff From Flickr by Micah Taylor
  64. 64. Credit in academia… Altmetrics? Impact Factors + Citation Counts
  65. 65. Altmetrics Article-level metrics Altmetrics for alt-products Data Code Slides Blogs Downloads Tweets Mentions Views From Flickr by Skakerman
  66. 66. Altmetrics Article-level metrics Altmetrics for alt-products
  67. 67. Researcher Identification
  68. 68. BIG initiatives…
  69. 69. NSF funded DataNet Project Office of Cyberinfrastructure www.dataone.org
  70. 70. New partners…
  71. 71. Better methods…
  72. 72. Better methods…
  73. 73. From Flickr by dotpolka Manage & share your data!
  74. 74. Website Email Twitter Slides carlystrasser.net carlystrasser@gmail.com @carlystrasser slideshare.net/carlystrasser
