Data Herding for Scientists - UC Davis OA Week
Presentation for UC Davis Open Access Week. Covers the current status of data management in the sciences, best practices for data management, data management planning, and tools for researchers.


Transcript

  • 1. Data Herding for Scientists: Organizing | Managing | Analyzing Unruly Digital Data. Carly Strasser, PhD, California Digital Library, UC Office of the President. UC Davis Open Access Week, October 2012. carly.strasser@ucop.edu. From Flickr by freefotouk
  • 2. Roadmap: 1. Background 2. Data management landscape 3. Best practices 4. Planning 5. Toolbox
  • 3. NSF funded DataNet Project, Office of Cyberinfrastructure (more later). What role can libraries play in data education? What barriers to sharing can we eliminate? Why don’t people share data? Is data management being taught? Do attitudes about sharing differ among disciplines? How can we promote storing data in repositories?
  • 4. What role can libraries play in data education? What barriers to sharing can we eliminate? Why don’t people share data? Is data management being taught? Do attitudes about sharing differ among disciplines? How can we promote storing data in repositories?
  • 5. A Brief History of Data Collection. Or… how scientists came to be so bad at data management. From Calisphere via Santa Clara University, ark:/13030/kt696nc7j2
  • 6. The lab/field notebook: Curie, Newton, Darwin, Da Vinci. Image: classicalschool.blogspot.com
  • 7. Digital data. From Flickr by DW0825, Flickmor, deltaMike, and US Army Environmental Command; www.woodrow.org; C. Strasser; courtesy of WHOI
  • 8. Digital data + Complex workflows
  • 9. [Screenshot of an Excel sheet, "Wash Cres Lake Dec 15 Dont_Use.xls", Sheet1: a stable isotope data sheet (Sampling Site / Identifier: Wash Cresc Lake; Peters lab; "Dont use - old data"; Sample Type: Algal Washed Rocks; Date: Dec. 16; Tray 004) with columns Position, SampleID, Weight (mg), %C, delta 13C, delta 13C_ca, %N, delta 15N, delta 15N_ca, Spec. No.] Callouts: 2 tables; random notes. From Stephanie Hampton (2010), ESA Workshop on Best Practices
  • 10. [Same spreadsheet screenshot.] Callout on the file name: Wash Cres Lake Dec 15 Dont_Use.xls. From Stephanie Hampton (2010), ESA Workshop on Best Practices
  • 11. [Same spreadsheet screenshot, now with a regression SUMMARY OUTPUT block (regression statistics, ANOVA, coefficients) pasted beside the data.] Callout: random stats output
  • 12. [Same spreadsheet screenshot, now also with a transposed copy of the sample table and an embedded scatter plot (Series1).]
  • 13. UGLY TRUTH: Many (most?) researchers… • are not taught data management • don’t know what metadata are • can’t name data centers or repositories • don’t share data publicly or store it in an archive • aren’t convinced they should share data. Image: btrgroup.com
  • 14. Hurdles to Data Stewardship • Cost • Confusion about standards • Disparate datasets • Lack of training • Fear of lost rights or benefits • No incentives. From Flickr by iowa_spirit_walker
  • 15. Data Reuse | Data Sharing | Data Management. From Flickr by AJC1 and Redden-McAllister
  • 16. The Current Landscape. Image: C. Strasser
  • 17. Data are being recognized as first class products of research. From Flickr by Richard Moross
  • 18. Data Management Requirements: Journal publishers. From Flickr by Richard Moross
  • 19. Data Management Requirements: Journal publishers, Funders. From Flickr by Richard Moross
  • 20. Publishing Data. From Flickr by Richard Moross
  • 21. Citing Data. Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20. From Flickr by Richard Moross
  • 22. What should you be doing? From Flickr by spanaut
  • 23. Best Practices. From Flickr by P1r
  • 24. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 25. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 26. 2. Data collection & organization: Create unique identifiers • Decide on naming scheme early • Create a key • Different for each sample. From Flickr by zebbie and sjbresnahan
  • 27. 2. Data collection & organization: Standardize • Consistent within columns – only numbers, dates, or text • Consistent names, codes, formats. Modified from K. Vanderbilt. Image from Pink Floyd, The Wall; themurkyfringe.com
  • 28. 2. Data collection & organization: Standardize • Reduce possibility of manual error by constraining entry choices: Excel lists | Data validation | Google Docs Forms. Modified from K. Vanderbilt
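The "consistent within columns" and "constrain entry choices" advice on slides 27 and 28 can be illustrated with a short check. This is only a sketch in Python, not something from the deck: the column names (`site`, `date`, `count`) and the list of allowed sites are assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical controlled vocabulary for a "site" column.
ALLOWED_SITES = {"nanaimo", "sooke", "tofino"}

def validate_row(row):
    """Return a list of problems found in one data-entry row (a dict of strings)."""
    problems = []
    # Names and codes must come from the agreed list, spelled exactly.
    if row["site"] not in ALLOWED_SITES:
        problems.append(f"unknown site: {row['site']!r}")
    # A date column holds only dates, in one format.
    try:
        datetime.strptime(row["date"], "%Y-%m-%d")
    except ValueError:
        problems.append(f"date not in YYYY-MM-DD format: {row['date']!r}")
    # A numeric column holds only numbers.
    try:
        float(row["count"])
    except ValueError:
        problems.append(f"count is not a number: {row['count']!r}")
    return problems

good = {"site": "nanaimo", "date": "2010-06-01", "count": "42"}
bad = {"site": "Nanaimo ", "date": "June 1", "count": "n/a"}
print(validate_row(good))  # empty list: nothing to fix
print(validate_row(bad))   # one message per inconsistent column
```

A drop-down list in a spreadsheet or a form does the same job at entry time; a check like this one is a fallback for data that has already been typed in.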
  • 29. 2. Data collection & organization: Create parameter table. Create a site table. Examples from doi:10.3334/ORNLDAAC/777. From R Cook, ESA Best Practices Workshop 2010
  • 30. 2. Data collection & organization: What about databases? A relational database is: a set of tables; relationships among the tables; a language to specify & query the tables. An RDB provides: scalability (millions+ records); features for sub-setting, querying, sorting; reduced redundancy & entry errors. From Mark Schildhauer
  • 31. 2. Data collection & organization: You should invest time in learning databases if your data sets are large or complex. Consider investing time in learning databases if: your data are small and humble; you ever intend to share your data; you are < 30 years old. From Mark Schildhauer
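To make "a set of tables, relationships among the tables, and a language to query them" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, site names, and numbers are invented for illustration and are not taken from the slides or from the ORNL DAAC example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (
        site_id   TEXT PRIMARY KEY,
        latitude  REAL,
        longitude REAL
    );
    CREATE TABLE measurement (
        sample_id   TEXT PRIMARY KEY,
        site_id     TEXT REFERENCES site(site_id),  -- the relationship between the tables
        temperature REAL
    );
""")

# Site details are stored once, not repeated on every measurement row.
conn.executemany("INSERT INTO site VALUES (?, ?, ?)",
                 [("lakeA", 47.5, -121.8), ("lakeB", 48.1, -122.3)])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                 [("s001", "lakeA", 12.5), ("s002", "lakeA", 13.5), ("s003", "lakeB", 9.0)])

# SQL is the language that specifies and queries the tables.
rows = conn.execute("""
    SELECT m.site_id, s.latitude, AVG(m.temperature)
    FROM measurement AS m
    JOIN site AS s ON m.site_id = s.site_id
    GROUP BY m.site_id
    ORDER BY m.site_id
""").fetchall()
print(rows)
```

Because each site's coordinates live in one row of the site table, correcting them is a single edit, which is the "reduced redundancy & entry errors" point from slide 30.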
  • 32. 2. Data collection & organization: Use descriptive file names* • Unique • Reflect contents. Bad: Mydata.xls, 2001_data.csv, best version.txt. Better: Eaffinis_nanaimo_2010_counts.xls (study organism | site name | year | what was measured). *Not for everyone. From R Cook, ESA Best Practices Workshop 2010
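One way to keep file names consistent is to build them from their parts rather than typing them by hand. The helper below is a hypothetical sketch (the function name and the letters-and-digits rule are assumptions); it reproduces the "Better" example from the slide.

```python
import re

def data_file_name(organism, site, year, measured, ext="csv"):
    """Join descriptive parts into one file name; reject parts that would break it."""
    parts = [organism, site, str(year), measured]
    for part in parts:
        # Spaces and punctuation inside a part would blur the underscore-separated fields.
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"use letters and digits only: {part!r}")
    return "_".join(parts) + "." + ext

print(data_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# Eaffinis_nanaimo_2010_counts.xls
```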
  • 33. 2. Data collection & organization: Organize files logically. Biodiversity: Lake (Experiments: Biodiv_H20_heatExp_2005to2008.csv, Biodiv_H20_predatorExp_2001to2003.csv, …; Field work: Biodiv_H20_PlanktonCount_2001toActive.csv, Biodiv_H20_ChlAprofiles_2003.csv, …); Grassland. From S. Hampton
  • 34. 2. Data collection & organization: Preserve information • Keep raw data raw • Use scripts to process data & save them with data. Example: raw data as .csv; R script for processing & analysis
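"Keep raw data raw" can be sketched as a script that only reads the raw file and writes its cleaned output somewhere else. The slide shows an R script; Python is used here as a stand-in, and the file names, columns, and cleaning rules are assumptions for the example.

```python
import csv
import tempfile
from pathlib import Path

def process(raw_path, out_path):
    """Read the raw file and write a cleaned copy; the raw file is never modified."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if row["count"].strip() == "":       # drop rows with no measurement
                continue
            row["site"] = row["site"].strip().lower()  # standardize site codes
            writer.writerow(row)

# Demonstration in a temporary directory with a tiny made-up raw file.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "counts_raw.csv"
raw.write_text("site,count\nNanaimo ,12\nnanaimo,\nSooke,7\n")
clean = workdir / "counts_clean.csv"
process(raw, clean)
print(clean.read_text())
```

Saving this script next to the data records exactly how the cleaned file was produced, so the step can be repeated or audited later.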
  • 35. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 36. 3. Quality control and quality assurance: Before data collection • Define & enforce standards • Assign responsibility for data quality. From Flickr by StacieBee
  • 37. 3. Quality control and quality assurance: During data collection/entry • Minimize manual entry • Use double entry • Use a database • Document changes. From Flickr by schock
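Double entry means the same sheet is typed in twice, ideally by two people, and the copies are compared. A minimal sketch of the comparison step follows; the row layout is an assumption, and the sample IDs and weights are borrowed from the spreadsheet shown earlier in the deck, with one digit swap introduced on purpose.

```python
def compare_entries(first, second):
    """Return (row, column, value1, value2) for every cell where two entries disagree."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for i, (row1, row2) in enumerate(zip(first, second)):
        for column in row1:
            if row1[column] != row2.get(column):
                mismatches.append((i, column, row1[column], row2.get(column)))
    return mismatches

entry_a = [{"sample": "ALG01", "weight_mg": "3.05"},
           {"sample": "ALG03", "weight_mg": "2.91"}]
entry_b = [{"sample": "ALG01", "weight_mg": "3.05"},
           {"sample": "ALG03", "weight_mg": "2.19"}]  # transposed digits

print(compare_entries(entry_a, entry_b))
# [(1, 'weight_mg', '2.91', '2.19')]
```

Each mismatch is then resolved against the original paper record rather than by guessing which copy is right.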
  • 38. 3. Quality control and quality assurance: After data entry • Check for missing, impossible, anomalous values • Perform statistical summaries • Look for outliers: normal probability plots, regression, scatter plots, maps. [Example scatter plot]
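The after-entry checks can be sketched for a single numeric column. Everything specific below is an assumption for illustration: the plausible range, the z-score cutoff of 2, the use of `None` for missing values, and the made-up temperature readings. The z-score rule is only one simple way to flag anomalous values; the plots listed on the slide are the better tool for judging them.

```python
import statistics

def qc_report(values, low, high, z_cut=2.0):
    """Flag missing, impossible, and anomalous entries in one numeric column.

    Assumes at least two plausible values so a standard deviation exists.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    # "Impossible" here means outside the physically plausible range.
    impossible = [i for i, v in present if not (low <= v <= high)]
    plausible = [(i, v) for i, v in present if low <= v <= high]
    nums = [v for _, v in plausible]
    mean = statistics.mean(nums)
    sd = statistics.stdev(nums)
    # "Anomalous" here means far from the mean in standard-deviation units.
    outliers = [i for i, v in plausible if sd > 0 and abs(v - mean) / sd > z_cut]
    return {"missing": missing, "impossible": impossible,
            "outliers": outliers, "mean": mean, "sd": sd}

# Made-up water temperatures in degrees C; -999.0 is a typical sentinel value.
temps = [12.1, 11.8, None, 12.4, -999.0, 12.0, 25.0, 11.9, 12.2, 12.3]
report = qc_report(temps, low=-5.0, high=40.0)
print(report["missing"], report["impossible"], report["outliers"])
# [2] [4] [6]
```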
  • 39. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 40. 4. Metadata basics: “What is metadata?” “Why are you promoting Excel?”
  • 41. 4. Metadata basics: Metadata = Data reporting. WHO created the data? WHAT is the content of the data set? WHEN was it created? WHERE was it collected? HOW was it developed? WHY was it developed? From Flickr by //ichael Patric|{
  • 42. 4. Metadata basics
      • Digital context
          • Name of the data set
          • The name(s) of the data file(s) in the data set
          • Date the data set was last modified
          • Example data file records for each data type file
          • Pertinent companion files
          • List of related or ancillary data sets
          • Software (including version number) used to prepare/read the data set
          • Data processing that was performed
      • Personnel & stakeholders
          • Who collected
          • Who to contact with questions
          • Funders
      • Scientific context
          • Scientific reason why the data were collected
          • What data were collected
          • What instruments (including model & serial number) were used
          • Environmental conditions during collection
          • Where collected & spatial resolution
          • When collected & temporal resolution
          • Standards or calibrations used
      • Information about parameters
          • How each was measured or produced
          • Units of measure
          • Format used in the data set
          • Precision & accuracy if known
      • Information about data
          • Definitions of codes used
          • Quality assurance & control measures
          • Known problems that limit data use (e.g. uncertainty, sampling problems)
          • How to cite the data set
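The who/what/when/where/how/why questions can be captured in a small structured record stored beside the data file. The sketch below is not EML, FGDC, or any other formal standard; the key names and all values are hypothetical placeholders.

```python
import json

# The six questions from the metadata slide; the key names are this sketch's own choice.
REQUIRED_FIELDS = ["who", "what", "when", "where", "how", "why"]

def missing_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    return [name for name in REQUIRED_FIELDS
            if not str(record.get(name, "")).strip()]

# Hypothetical record with placeholder values, not a real data set.
record = {
    "who": "A. Researcher, Example Lab",
    "what": "Plankton counts, one row per sample",
    "when": "2010-06-01 to 2010-08-31",
    "where": "",  # left blank on purpose to show the check
    "how": "Net tows counted under a dissecting microscope",
    "why": "Track seasonal change in abundance",
}

print(missing_fields(record))        # ['where']
print(json.dumps(record, indent=2))  # plain-text form that can sit next to the data file
```

A record like this is a starting point; a community metadata standard adds the shared terms and structure that let repositories and other researchers interpret it.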
  • 43. 4. Metadata basics: “What is metadata?” Select the appropriate metadata standard • Provides structure to describe data: common terms | definitions | language | structure • Lots of different standards: EML, FGDC, ISO19115, DarwinCore, … • Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
  • 44. 4. Metadata basics: “What does a metadata standard look like?”
  • 45. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 46. 5. Workflows. Workflow: how you get from the raw data to the final products of your research. Simple workflows: flow charts. Example: Temperature data, Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
  • 47. 5. Workflows. Workflow: how you get from the raw data to the final products of your research. Simple workflows: commented scripts • R, SAS, MATLAB • Well-documented code is… easier to review, easier to share, easier to repeat analysis
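A commented script can mirror the flow chart from slide 46 step by step. The sketch below uses Python in place of the R, SAS, or MATLAB named on the slide; the inline readings, the column names, and the rule that negative salinity is invalid are assumptions for the example, and graph production is left out.

```python
import csv
import io
import statistics

# Step 1: data import. Inline text stands in for the temperature and salinity files.
RAW = """temperature,salinity
12.1,33.5
11.9,33.7
,33.6
12.3,-1
12.0,33.4
"""

def import_data(text):
    """Parse the raw text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

# Step 2: quality control and cleaning.
def clean(rows):
    """Keep rows where both values are numbers and salinity is not negative."""
    cleaned = []
    for row in rows:
        try:
            t, s = float(row["temperature"]), float(row["salinity"])
        except ValueError:      # blank or non-numeric cell
            continue
        if s < 0:               # assumed rule: negative salinity is an entry error
            continue
        cleaned.append((t, s))
    return cleaned

# Step 3: analysis (mean and standard deviation).
def summarize(cleaned):
    temps = [t for t, _ in cleaned]
    sals = [s for _, s in cleaned]
    return {
        "temperature": (statistics.mean(temps), statistics.stdev(temps)),
        "salinity": (statistics.mean(sals), statistics.stdev(sals)),
    }

rows = import_data(RAW)
cleaned = clean(rows)
summary = summarize(cleaned)
for name, (mean, sd) in summary.items():
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}")
```

Each function corresponds to one box in the flow chart, so a reviewer can follow the path from raw data to summary statistics without asking the author.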
  • 48. 5. Workflows: Fancy Schmancy workflows: Kepler. Resulting output. https://kepler-project.org
  • 49. 5. Workflows: Workflows enable • Reproducibility: can someone independently validate findings? • Transparency: others can understand how you arrived at your results • Executability: others can re-run or re-use your analysis. From Flickr by merlinprincesse
  • 50. 5. Workflows: Minimally: document your analysis (commented code; simple flow-chart). Emerging workflow applications will… − Link software for executable end-to-end analysis − Provide detailed info about data & analysis − Facilitate re-use & refinement of complex, multi-step analyses − Enable efficient swapping of alternative models & algorithms − Help automate tedious tasks. Image: www.littlebytesoflife.com
  • 51. Best Practices for Data Management 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse
  • 52. 6. Data stewardship & reuse: The 20-Year Rule: The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991). From Flickr by greensambaman
  • 53. 6. Data stewardship & reuse • Use stable formats: csv, txt, tiff • Create back-up copies: original, near, far • Periodically test ability to restore information. Modified from R. Cook
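"Periodically test ability to restore information" can be partly automated by comparing checksums of the original and a back-up copy. This is a sketch under stated assumptions: the file names are invented, a temporary directory stands in for the near and far storage locations, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_restorable(original, backup):
    """True when the back-up copy is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)

# Demonstration with a tiny made-up file.
workdir = Path(tempfile.mkdtemp())
original = workdir / "counts_raw.csv"
original.write_text("site,count\nnanaimo,12\n")
near = workdir / "near_copy.csv"
shutil.copy2(original, near)
print(backup_is_restorable(original, near))   # True

near.write_text("site,count\nnanaimo,21\n")   # simulate a corrupted copy
print(backup_is_restorable(original, near))   # False
```

A matching checksum shows the copy is intact; it does not replace actually opening a restored file in the software that needs to read it.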
  • 54. 6. Data stewardship & reuse: Store your data in a repository • Institutional archive • Discipline/specialty archive. From Flickr by torkildr
  • 55. 6. Data stewardship & reuse: Practice Data Citation • Allows readers to find data products • Get credit for data and publications • Promotes reproducibility • Better measure of research impact. Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20. Learn more at www.datacite.org. Modified from R. Cook
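The example citation follows an author, year, title, repository, identifier pattern. A small helper can assemble citations in that pattern; the function name and argument order are assumptions for this sketch, and individual repositories or journals may ask for a different style.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation in the author-year-title-repository-DOI pattern."""
    return f"{author}. {year}. {title}. {repository}. doi:{doi}"

citation = format_data_citation(
    author="Sidlauskas, B",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological diversification "
           "in the absence of a detailed phylogeny: a case study from characiform fishes"),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
print(citation)
```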
  • 56. Planning. From Flickr by Global X
  • 57. What is a data management plan? A document that describes what you will do with your data during your research and after you complete your research
  • 58. Why should I prepare a DMP? • Saves time • Increases efficiency • Easier to use data • Others can understand & use data • Credit for data products • Funders require it
  • 59. NSF DMP Requirements. From Grant Proposal Guidelines: DMP supplement may include: 1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project 2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies) 3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements 4. policies and provisions for re-use, re-distribution, and the production of derivatives 5. plans for archiving data, samples, and other research products, and for preservation of access to them
  • 60. 1. Types of data & other information • Types of data produced • Relationship to existing data • How/when/where will the data be captured or created? • How will the data be processed? • Quality assurance & quality control measures • Security: version control, backing up • Who will be responsible for data management during/after project? Images: C. Strasser; biology.kenyon.edu; From Flickr by Lazurite
  • 61. 2. Data & metadata standards • What metadata are needed to make the data meaningful? • How will you create or capture these metadata? • Why have you chosen particular standards and approaches for metadata? Image: Wired.com
  • 62. 3. Policies for access & sharing; 4. Policies for re-use & re-distribution • Are you under any obligation to share data? • How, when, & where will you make the data available? • What is the process for gaining access to the data? • Who owns the copyright and/or intellectual property? • Will you retain rights before opening data to wider use? How long? • Are permission restrictions necessary? • Embargo periods for political/commercial/patent reasons? • Ethical and privacy issues? • Who are the foreseeable data users? • How should your data be cited?
  • 63. 5. Plans for archiving & preservation • What data will be preserved for the long term? For how long? • Where will data be preserved? • What data transformations need to occur before preservation? • What metadata will be submitted alongside the datasets? • Who will be responsible for preparing data for preservation? Who will be the main contact person for the archived data? From Flickr by theManWhoSurfedTooMuch
  • 64. Don’t forget: Budget • Costs of data preparation & documentation: hardware, software; personnel; archive fees • How costs will be paid: Request funding! Image: dorrvs.com
  • 65. NSF’s Vision* • DMPs and their evaluation will grow & change over time (similar to broader impacts) • Peer review will determine next steps • Community-driven guidelines – Different disciplines have different definitions of acceptable data sharing – Flexibility at the directorate and division levels – Tailor implementation of DMP requirement • Evaluation will vary with directorate, division, & program officer. *Unofficially. Help from Jennifer Schopf, NSF
  • 66. Toolbox. From Flickr by dipster1
  • 67. E-notebooks & online science • NoteBook • ORNL eNote • Evernote • Google Docs • Blogs • wikis • TheLabNotebook.com • NoteBookMaker
  • 68. dmp.cdlib.org | dmponline.dcc.ac.uk. Step-by-step wizard for generating DMP. Create | edit | re-use | share | save | generate. Open to community. Links to institutional resources. Directorate information & updates
  • 69. Where should I put my data? List of repositories: databib.org
  • 70. B A C
  • 71. NSF funded DataNet Project, Office of Cyberinfrastructure: Community Engagement & Outreach; Cyberinfrastructure. Courtesy of DataONE
  • 72. www.dataone.org • Data Education Tutorials • Database of best practices & software tools • Primer on data management • Investigator Toolkit
  • 73. Intercept researchers where they already work
  • 74. Open Source Tool. Add-in & Web Application. Earth, environmental, ecological researchers
  • 75. Features: Best practices check | Generate metadata | Generate citation | Post data to repository
  • 76. Data Repository for Anyone | Anywhere
  • 77. Main site: dataup.cdlib.org
  • 78. CDL’s Data Pub Blog: datapub.cdlib.org
  • 79. Resources: carlystrasser.net. Slideshare link: this presentation
  • 80. Handy References • Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Hook, Santhana Vannan, Beaty, Cook, & Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf • Some Simple Guidelines for Effective Data Management. Borer, Seabloom, Jones, & Schildhauer. Bull Ecol Soc Amer, April 2009: 205-214.
  • 81. dataup.cdlib.org | @DataUpCDL | facebook.com/DataUpCDL | carlystrasser.net | @carlystrasser | carlystrasser@gmail.com
