Data Management: Scientist Perspective - UC3 Data Curation Workshop

405 views
358 views

Published on

Presentation on data management: the current landscape, barriers to management, and data types. For UC3-CDL data curation for practitioners workshop, 8 Nov 2012 in Oakland CA.

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
405
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Data Management: Scientist Perspective - UC3 Data Curation Workshop

  1. 1. From  Calisphere  via  California  State  University  Libraries,    ark:/13030/c818356g Data A  Scientist’s    @carlystrasser Carly  Strasser Perspective Management
  2. 2. Roadmap 5.  Landscape 4.  Solutions 3.  Barriers 2.  How  Bad  Is  It? 1.  A  Brief  History C.  Strasser
  3. 3. A BriefFrom  Calisphere  via  Santa  Clara  University,     History of Dataark:/13030/kt696nc7j2 Collection Or…  how  scientists  came  to  be  so   bad  at  data  management
  4. 4. The  lab/field  notebook Curie Newton Darwin Da  Vinci classicalschool.blogspot.com
  5. 5. From  Flickr  by    DW0825 From  Flickr  by  Flickmor From  Flickr  by    deltaMike The  lab/field  notebook…? www.woodrow.org C.  Strasser Courtesey  of  WHOI From  Flickr  by  US  Army  Environmental  Command
  6. 6. Digital  data +   Complex  workflows
  7. 7. Data Models Maximum   Likelihood   estimation Matrix   Models Images Tables Paper
  8. 8. From  Flickr  by  stevecadman The Wide World of Data
  9. 9. Data  Types
  10. 10. Dimensions of Data Datum Data  file Metadata Dataset Data  collection Data  repository
  11. 11. Data DiversityTemporal   File  size Units File  structure organization File  type Documentatio Spatial n  extent structure Metadata Collection   Codes Project  intent practices Analysis
  12. 12. Big DataOSTP  March  2012 Big  Data  Effort  Launched
  13. 13. The   LiDle   Guys? From  Flickr  by  jason  tinder
  14. 14. The Long Tail Size  of  dataset or grant  ($) #  datasets or  #  researchers or  #  grants
  15. 15. From  Flickr  by  Old  Shoe  Woman The Fallout
  16. 16. UGLY TRUTH Many  (most?)   researchers…   5shortessays.blogspot.com are  not  taught  data  management don’t  know  what  metadata  are can’t  name  data  centers  or  repositories don’t  share  data  publicly  or  store  it  in  an  archive aren’t  convinced  they  should  share  data
  17. 17. Information  Entropy   Fig.  1  of  Michener  et  al.  1997
  18. 18. A Real Life Example
  19. 19. 2  tables Random  notes C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 23.78 1.17 From  Stephanie  Hampton
  20. 20. Wash  Cres  Lake  Dec  15  Dont_Use.xls C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 23.78 1.17 From  Stephanie  Hampton
  21. 21. Random  stats  output C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c SUMMARY OUTPUT B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c Regression Statistics B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c Multiple R 0.283158 B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 R Square 0.080178 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 Adjusted R Square -0.022024 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 Standard Error 1.906378 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 Observations 11 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 ANOVA C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c df SS MS F Significance F C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 Regression 1 2.851116 2.851116 0.784507 0.398813 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 Residual 9 32.7085 3.634278 23.78 1.17 Total 10 35.55962 Coefficients Standard Error t Stat P-value Lower 95%Upper 95%Lower 95.0% Upper 95.0% Intercept -4.297428 4.671099 -0.920003 0.381568 -14.8642 6.269341 -14.8642 6.269341 X Variable 1-0.158022 0.17841 -0.885724 0.398813 -0.561612 0.245569 -0.561612 0.245569 From  Stephanie  Hampton
  22. 22. C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c SUMMARY OUTPUT B2 ALG02 3 4.51 SampleID -22.68 -22.22 ALG03 0.34 ALG05 4.31 3.66 ALG07 25376 ALG06 ALG04 ALG02 ALG01 ALG03 ALG07 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c Regression Statistics B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c Multiple R 0.283158 B5 ALG07 2.9 33.58 Weight (mg) -29.44 -28.98 2.91 1.74 0.62 2.91 -0.03 25382 3.04 2.95 Square 0.080178 R 3.01 3 2.99 2.92 2.9 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 Adjusted R Square -0.022024 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 Standard Error 1.906378 B8 Lk Outlet Alg 3.04 31.43 -29.69 %C-29.23 6.85 1.07 0.95 35.560.30 25388 33.49 41.17 Observations43.74 11 4.51 1.59 4.37 33.58 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 delta 13C -21.85 -21.11 0.45 4.72 -28.054.07 25392 -29.56 -27.32 ANOVA -27.50 -22.68 -24.58 -21.06 -29.44 C1 ALG04 2.98 37.90 delta 13C_ca -27.42 -26.96 -20.65 1.36 1.21 -27.590.56 25394 -29.10 c -26.86 -27.04 df SS -22.22 MS F -24.12 Significance F -20.60 -28.98 C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 Regression 1 2.851116 2.851116 0.784507 0.398813 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 Residual 9 32.7085 3.634278 23.78 %N 0.48 1.17 2.30 1.68 1.97 Total 1.3610 35.55962 0.34 0.15 0.34 1.74 delta 15N -0.97 0.59 0.79 2.71 0.99 4.31 -1.69 -1.52 0.62 Coefficients Standard Error t Stat P-value Lower 95%Upper 95%Lower 95.0% Upper 95.0% delta 15N_ca -1.62 -0.06 0.14 2.06 Intercept -4.297428 4.671099 3.66 0.34 -2.34 -2.17 -0.920003 0.381568 -14.8642 6.269341 -14.8642 6.269341 -0.03 X Variable 1-0.158022 0.17841 -0.885724 0.398813 -0.561612 0.245569 -0.561612 0.245569 4.00 3.00 2.00 1.00 Series1 0.00 -35.00 -30.00 -25.00 -20.00 -15.00 -10.00 -5.00 0.00 -1.00 -2.00 -3.00
  23. 23. The  Fallout Data   Reuse Data   Sharing Data   Management
  24. 24. From  Flickr  by  iowa_spirit_walker Stewardship Why? Hurdles to Data
  25. 25. From  Flickr  by  indigoprime Barriers: Cost From  Flickr  by  kobiz7 C.  Strasser
  26. 26. Barriers: Sociocultural From  Flickr  by  freefotouk Not  the  norm
  27. 27. Barriers: Sociocultural Not  the  norm Lack  of  /  too  many  standards
  28. 28. Barriers: Sociocultural Not  the  norm Lack  of  /  too   From  Flickr  by  toucanradio many  standards Disparate  data From  Flickr  by  Chris  Campbell
  29. 29. Barriers: Sociocultural From  Flickr  by  uniinnsbruck Not  the  norm Lack  of  /  too  many  standards Disparate  data Lack  of  training
  30. 30. From  Flickr  by  Christina  Ann   VanMeter Missed   opportunities Loss  of  rights  or  benefits From  Flickr  by  pnh Barriers: Sociocultural Conflict From  Flickr  by  tymesynk Misuse
  31. 31. Barriers: Sociocultural Lack  of  incentives Time   consuming  &   expensive No  From  Flickr  by  bthomso requirements Reward   structure
  32. 32. From  Flickr  by    MarqueYe  University generation?But what about the next
  33. 33. Are Undergrads Learning AboutData Management? Survey  of  Undergraduate  Ecology  courses 38  Research  focused 48  schools 10  Education  focused Schools  selected… •  One  of  top  grad  schools  for  Ecology •  Most  recipients  of  NSF  GRFP
  34. 34. Are Undergrads Learning AboutData Management?Strasser  and  Hampton,  In  press  at  Ecosphere Metadata  generation Software  choice File  naming QAQC Backing  up   Workflows Data  sharing Data  re-­‐‑use Meta-­‐‑analysis Reproducibility Notebook  protocols Databases  
  35. 35. Are Undergrads Learning AboutData Management? Quality control & Yes assurance No Naming computer files Types of files & software to use Metadata generation Workflows Protecting data Databases & data archiving Data re-use Meta-analysis Data sharing Reproducibility Notebook protocols 0 20 40 60 80 100 Percent
  36. 36. Are Undergrads Learning AboutData Management? 4.5 4.0 Importance for Undergraduates 3.5 3.0 2.5 2.0 r = 0.306 1.5 p < 0.05 1.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Value as a Researcher
  37. 37. Are Undergrads Learning About Data Management? 70 Time 60 Not  appropriate   at  this  level 50 Covered  in  lab Percent Citing Barrier 40 Students  not   Class  too   prepared large Instructor   30 Funding/ not  prepared resources 20 Covered  in   10 other  courses 0 e el ab ed es e ed es Tim lev in l par urc larg par urs his ed re eso too re r co at t ver ot p g/ r ss ot p the te co sn in Cla or n o pria be ent und uct d in pro oul d Stu d F Ins tr re t ap Sh C ove No
  38. 38. Solutions
  39. 39. Preach the BenefitsShort  term •  Spend  less  time  on  DM  &  more  on  research •  Easier  to  use  data •  Collaborators  can  understand  &  use  data Long  term •  Scientists  outside  project  can  find,  understand  &   use  data •  Credit  for  data  products  &  their  use •  Funders  protect  their  investment  
  40. 40. EducateDataONE  Best  Practices  Lessons What  is  the  Data  Life  Cycle?   Data  management  planning   Data  collection,  entry  &  manipulation   Data  quality  control  &  quality  assurance   Data  protection:  Data  backups   Data  preservation,  curation  &  archiving   Data  documentation:   What  is  metadata? Value  of  metadata   How  to  write  good  metadata   Data  management:  Data  sharing   Data  citation  practices   Collaboration/communication  technologies   Data  analysis:  Workflows  and  other  tools  
  41. 41. Primer
  42. 42. Use New TechnologyE-­‐‑notebooks •  NoteBook •  ORNL  eNote   •  Evernote •  Google  Docs •  Blogs •  wikis •  TheLabNotebook.com •  NoteBookMaker Workflows TheLabNotebook.com!•  Kepler •  Taverna
  43. 43. Use New Technology
  44. 44. C.  Strasser The Current Landscape
  45. 45. Data are being recognized asfirst class products of research From  Flickr  by  Richard  Moross
  46. 46. Publishing DataFrom  Flickr  by  Richard  Moross
  47. 47. From  Flickr  by  Richard  Moross Citing DataExample: Sidlauskas,  B.  2007.  Data  from:  Testing  for  unequal  rates  of  morphological  diversification  in  the  absence  of  a  detailed  phylogeny:  a  case  study  from  characiform  fishes.  Dryad  Digital  Repository.  doi:10.5061/dryad.20
  48. 48. From  Flickr  by  Richard  Moross Data Management RequirementsJournal  publishers
  49. 49. From  Flickr  by  Richard  Moross Data Management Requirements Journal  publishers Funders
  50. 50. Plan Analyze Collect Integrate Assure From  Flickr  by  darkuncle   Discover Describe Preserve
  51. 51. What  is  a   Will  I  get   data   credit  for  my   work? Plan management   plan? Analyze Collect What  tools  do   Are  there   I  use? standards? Integrate Assure What  is   How  much   metadata? Who  can  help   will  it  cost? From  Flickr  by  darkuncle   me? Discover Describe Where  do  I   Preserve How  do  I   preserve  my   preserve  my   data? data?
  52. 52. B A C
  53. 53. NSF  funded  DataNet  Project Office  of  Cyberinfrastructure Community   Cyberinfrastructure Engagement  &   Outreach Courtesy  of  DataONE
  54. 54. My  website carlystrasser.net Email  me carlystrasser@gmail.com Tweet  me @carlystrasser   My  slides slideshare.net/carlystrasser CDL  Blog datapub.cdlib.org

×