Cal Poly - Data Management for Researchers
October 17, 2013 @ 1 Robert E. Kennedy Library, Data Studio, California Polytechnic State University.
Researchers rarely learn good data management practices. Instead, we develop our own systems, which are often unintelligible to others. In this talk, Carly Strasser, PhD, will focus on the common mistakes that scientists make and how to avoid them. She will provide best practices for data management, which will facilitate data sharing and reuse, and introduce tools you can use.


Usage Rights

CC Attribution License

Cal Poly - Data Management for Researchers: Presentation Transcript

  • Data Management for Researchers: Tips, Tools, & Why You Should Care. Carly Strasser, PhD, California Digital Library. @carlystrasser, carly.strasser@ucop.edu. Cal Poly, Oct 2013. [Images: from Calisphere, courtesy of UC Riverside, California Museum of Photography; from Calisphere, courtesy of Thousand Oaks Library]
  • Roadmap: 1. Background 2. Why you should care 3. Best practices 4. Toolbox
  • NSF-funded DataNet project (Office of Cyberinfrastructure). Questions: What role can libraries play in data education? Why don’t people share data? Do attitudes about sharing differ among disciplines? What barriers to sharing can we eliminate? Is data management being taught? How can we promote storing data in repositories?
  • A Brief History of Data Collection. Or… how scientists came to be so bad at data management. [Image from Calisphere via Santa Clara University, ark:/13030/kt696nc7j2]
  • Back in the day… Curie, Newton, Da Vinci, Darwin [image from classicalschool.blogspot.com]
  • Digital data [Images: from Flickr by US Army Environmental Command; from Flickr by deltaMike; from Flickr by DW0825; from Flickr by Flickmor; courtesy of WHOI; C. Strasser]
  • Digital  data   +     Complex   workflows  
  • Data  management   Documentation   Reproducibility   From  Flickr  by  ~Minnea~  
  • •  Cost   •  Confusion  about   standards   •  Lack  of  training   •  Fear  of  lost  rights  or   benefits   •  No  incentives   From  Flickr  by  iowa_spirit_walker  
  • The truth: you need to know about data management, metadata, data repositories, and data sharing. [Image from sandierpastures.com]
  • Why  you   should  care   From  Flickr  by  johntrainor  
  • Because  they  care:   From  Flickr  by  Redden-­‐McAllister  
  • Because  they  care:   All  data  must  be  in  a   public  archive.   You  can’t  hoard  it.  If  it’s  not   available  you  can’t  cite  it.   Include  a  data  section  with   how  to  find  datasets.  
  • Data Management: Who Knew    Could be a Hot Topic? Carly Strasser, PhD, California Digital Library. @carlystrasser. Cal Poly, Oct 2013. [Overlay: “Later!”] [Image from Flickr by Velo Steve]
  • What should you NOT be doing? [Image from Flickr by whatthefeed]
  • [Spreadsheet screenshot: a stable isotope data sheet used as an example of what not to do. Annotations: 2 tables; random notes.] From Stephanie Hampton (2010), ESA Workshop on Best Practices
  • [Same spreadsheet. Annotation: the file name, “Wash Cres Lake Dec 15 Dont_Use.xls”.] From Stephanie Hampton (2010), ESA Workshop on Best Practices
  • [Same spreadsheet. Annotation: random stats output (a regression summary pasted into the data sheet).] From Stephanie Hampton
  • [Same spreadsheet, now with an embedded chart and a second copy of the data overlapping the original table.] From Stephanie Hampton
  • What  should   you  be  doing?   From  Flickr  by  whatthefeed  
  • Best Practices: data management. 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse [Image from Flickr by Big Swede Guy]
  • 2.  Data  collection  &  organization   Create  unique  identifiers   From  Flickr  by  zebbie   •  Decide  on  naming  scheme  early   •  Create  a  key   •  Different  for  each  sample   From  Flickr  by  sjbresnahan  
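A minimal sketch of the advice on this slide, assuming a made-up naming scheme of site code, year, and a zero-padded counter. The function name `make_sample_ids`, the code `NAN`, and the key fields are illustrative only and do not come from the talk.

```python
def make_sample_ids(site_code, year, n, start=1):
    """Return n unique IDs such as 'NAN-2010-0001' (hypothetical scheme)."""
    return [f"{site_code}-{year}-{i:04d}" for i in range(start, start + n)]


# Generate IDs once, early in the project, and reuse them everywhere.
ids = make_sample_ids("NAN", 2010, 3)

# The key records what each ID stands for, so the scheme stays decodable.
key = {sample_id: {"site": "Nanaimo", "year": 2010} for sample_id in ids}

# Every sample must get a different identifier.
assert len(set(ids)) == len(ids)
```

Storing the key next to the data (for example as its own table) means a reader never has to guess what an identifier encodes.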
  • 2.  Data  collection  &  organization   Standardize   •  Consistent  within  columns   – only  numbers,  dates,  or  text   •  Consistent  names,  codes,  formats   Modified  from  K.  Vanderbilt     From  Pink  Floyd,  The  Wall      themurkyfringe.com  
  • 2. Data collection & organization: Standardize • Reduce possibility of manual error by constraining entry choices: Excel lists and data validation; Google Docs Forms. Modified from K. Vanderbilt
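The same idea of constrained entry can be applied in a script that checks rows before they are accepted. This is a sketch under stated assumptions: the column names (`sample_type`, `date`, `weight_mg`) and the allowed code list are hypothetical, and the date rule assumes ISO `YYYY-MM-DD` dates.

```python
from datetime import date

# Hypothetical controlled vocabulary for one column.
ALLOWED_SAMPLE_TYPES = {"algal", "sediment", "water"}


def validate_row(row):
    """Return a list of problems found in one data-entry row (a dict)."""
    problems = []
    if row.get("sample_type") not in ALLOWED_SAMPLE_TYPES:
        problems.append(f"unknown sample_type: {row.get('sample_type')!r}")
    try:
        date.fromisoformat(str(row.get("date", "")))
    except ValueError:
        problems.append(f"date not in YYYY-MM-DD form: {row.get('date')!r}")
    try:
        float(str(row.get("weight_mg", "")))
    except ValueError:
        problems.append(f"weight_mg is not a number: {row.get('weight_mg')!r}")
    return problems


good = {"sample_type": "algal", "date": "2010-12-16", "weight_mg": "0.98"}
bad = {"sample_type": "Algal?", "date": "Dec. 16", "weight_mg": "ref"}
print(validate_row(good))
print(validate_row(bad))
```

Each column holds only one kind of value (a code, a date, or a number), which is the consistency rule from the previous slide expressed as a check.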
  • 2. Data collection & organization: Create a parameter table. Create a site table. [Example tables from doi:10.3334/ORNLDAAC/777] From R. Cook, ESA Best Practices Workshop 2010
  • 2. Data collection & organization: What about databases? A relational database is: a set of tables; relationships among the tables; a language to specify & query the tables. An RDB provides: scalability (millions+ records); features for sub-setting, querying, sorting; reduced redundancy & entry errors. From Mark Schildhauer
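To make the three parts of that definition concrete, here is a small sketch using Python's built-in `sqlite3` module. The table layout, site codes, and values are invented for illustration and are not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# A set of tables, with a relationship between them (sample.site_id -> site).
conn.executescript("""
    CREATE TABLE site (
        site_id TEXT PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE sample (
        sample_id TEXT PRIMARY KEY,
        site_id   TEXT NOT NULL REFERENCES site(site_id),
        weight_mg REAL
    );
""")
conn.executemany("INSERT INTO site VALUES (?, ?)",
                 [("L1", "Lake One"), ("L2", "Lake Two")])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                 [("S1", "L1", 1.0), ("S2", "L1", 3.0), ("S3", "L2", 2.5)])

# A language (SQL) to query the tables: sample count and mean weight per site.
rows = conn.execute("""
    SELECT site.name, COUNT(*), ROUND(AVG(sample.weight_mg), 2)
    FROM sample JOIN site USING (site_id)
    GROUP BY site.name
    ORDER BY site.name
""").fetchall()
print(rows)
```

The site name is stored once in `site` and referenced by ID from `sample`, which is where the reduced redundancy and fewer entry errors come from.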
  • 2.  Data  collection  &  organization   You  should  invest  time  in  learning  databases  if      your  data  sets  are  large  or  complex     Consider  investing  time  in  learning  databases  if    your  data  are  small  and  humble    you  ever  intend  to  share  your  data    you  are  <  30  years  old   From  Mark  Schildhauer  
  • 2. Data collection & organization: Use descriptive file names* • Unique • Reflect contents. Bad: Mydata.xls, 2001_data.csv, best version.txt. Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured). *Not for everyone. From R. Cook, ESA Best Practices Workshop 2010
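One way to keep such names consistent is to build and parse them with a small helper. This is only a sketch: the functions `build_file_name` and `parse_file_name` are hypothetical, and the rule that parts may not contain underscores or spaces is an assumption added so that names can be split reliably.

```python
def build_file_name(organism, site, year, measured, ext="csv"):
    """Join the four descriptive parts with underscores."""
    parts = [organism, site, str(year), measured]
    for part in parts:
        if "_" in part or " " in part:
            raise ValueError(f"part may not contain '_' or spaces: {part!r}")
    return "_".join(parts) + "." + ext


def parse_file_name(name):
    """Split a name produced by build_file_name back into its parts."""
    stem, _, ext = name.rpartition(".")
    organism, site, year, measured = stem.split("_")
    return {"organism": organism, "site": site, "year": int(year),
            "measured": measured, "ext": ext}


name = build_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls")
print(name)
print(parse_file_name(name))
```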
  • 2. Data collection & organization: Organize files logically. Example hierarchy: Biodiversity > Lake > Experiments (Biodiv_H20_heatExp_2005to2008.csv, Biodiv_H20_predatorExp_2001to2003.csv, …); Biodiversity > Lake > Field work (Biodiv_H20_PlanktonCount_2001toActive.csv, Biodiv_H20_ChlAprofiles_2003.csv, …); Biodiversity > Grassland. From S. Hampton
  • 2.  Data  collection  &  organization    Preserve  information   •  Keep  raw  data  raw   •  Use  scripts  to  process  data    &  save  them  with  data   Raw  data  as  .csv   R  script  for  processing  &   analysis    
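The slide suggests an R script; the sketch below shows the same pattern in Python, assuming a hypothetical `raw/` and `processed/` folder layout and made-up column names. The point is that the script only reads the raw file and writes its results somewhere else.

```python
import csv
import tempfile
from pathlib import Path

# Set up a throwaway project folder with one raw file (illustrative data).
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "raw" / "counts_raw.csv"
raw_path.parent.mkdir()
raw_path.write_text("sample_id,weight_mg\nS1,0.98\nS2,ref\nS3,3.05\n")
raw_before = raw_path.read_bytes()


def process(raw_file, out_file):
    """Copy rows with a numeric weight to out_file; never writes to raw_file."""
    out_file.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with open(raw_file, newline="") as src, open(out_file, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            try:
                float(row["weight_mg"])
            except ValueError:
                continue  # skip non-numeric entries such as 'ref'
            writer.writerow(row)
            kept += 1
    return kept


kept = process(raw_path, workdir / "processed" / "counts_clean.csv")

# The raw file is byte-for-byte unchanged after processing.
assert raw_path.read_bytes() == raw_before
```

Because every change lives in the script rather than in hand edits, the processed file can be regenerated from the raw data at any time.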
  • 3.  Quality  control  and  quality  assurance   Before  data  collection   From  Flickr  by  StacieBee   •  Define  &  enforce  standards   •  Assign  responsibility  for  data  quality  
  • 3. Quality control and quality assurance: After data entry • Check for missing, impossible, anomalous values • Perform statistical summaries • Look for outliers [Figure: example plot used to look for outliers]
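These three checks can be sketched with the standard library. The assumptions here are mine, not the talk's: values are percentages, so anything outside 0 to 100 is treated as impossible, and "anomalous" is defined by the common 1.5 × IQR rule. The sample values are invented.

```python
import statistics


def qc_report(values, low, high):
    """Return indices of missing, impossible, and outlying values."""
    missing = [i for i, v in enumerate(values) if v is None]
    impossible = [i for i, v in enumerate(values)
                  if v is not None and not (low <= v <= high)]
    valid = [(i, v) for i, v in enumerate(values)
             if v is not None and low <= v <= high]

    # Outliers among the plausible values, using the 1.5 * IQR fences.
    q1, _, q3 = statistics.quantiles([v for _, v in valid], n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [i for i, v in valid if v < q1 - spread or v > q3 + spread]

    return {"missing": missing, "impossible": impossible,
            "outliers": outliers,
            "mean": statistics.mean(v for _, v in valid)}


percent_carbon = [38.3, 39.8, 40.4, 42.2, None, 41.2, 140.0, 39.0, -5.0, 1.9]
report = qc_report(percent_carbon, low=0.0, high=100.0)
print(report)
```

A flagged value is a prompt to go back to the source record, not an instruction to delete it.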
  • 4. Metadata basics: “What is metadata?” “Why are you promoting Excel?”
  • 4.  Metadata  basics   From  Flickr  by    //ichael  Patric|{        Metadata  =  Data  reporting     WHO  created  the  data?   WHAT  is  the  content      of  the  data  set?   WHEN  was  it  created?   WHERE  was  it  collected?   HOW  was  it  developed?   WHY  was  it  developed?  
  • 4. Metadata basics • Scientific context: scientific reason why the data were collected; what data were collected; what instruments (including model & serial number) were used; environmental conditions during collection; where collected & spatial resolution; when collected & temporal resolution; standards or calibrations used • Information about parameters: how each was measured or produced; units of measure; format used in the data set; precision & accuracy if known • Information about data: definitions of codes used; quality assurance & control measures; known problems that limit data use (e.g. uncertainty, sampling problems) • Digital context: name of the data set; the name(s) of the data file(s) in the data set; date the data set was last modified; example data file records for each data type file; pertinent companion files; list of related or ancillary data sets; software (including version number) used to prepare/read the data set; data processing that was performed • Personnel & stakeholders: who collected; who to contact with questions; funders • How to cite the data set
  • 4. Metadata basics: What is metadata? Select the appropriate standard • Provides structure to describe data: common terms | definitions | language | structure • Lots of different standards: EML, FGDC, ISO 19115, Darwin Core, … • Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
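Before adopting a formal standard, the who/what/when/where/how/why questions above can be captured in a plain sidecar file. The sketch below writes such a record as JSON. It is not EML, FGDC, or any other standard, and every field value is a placeholder.

```python
import json

# The six reporting questions from the slides, used as required fields.
REQUIRED = ("who", "what", "when", "where", "how", "why")


def missing_fields(metadata):
    """Return the required questions that are absent or left blank."""
    return [f for f in REQUIRED if not str(metadata.get(f, "")).strip()]


# Placeholder answers; replace with the details of the real data set.
record = {
    "who": "A. Researcher (hypothetical)",
    "what": "Zooplankton counts per sample",
    "when": "2010-06-01 to 2010-08-31",
    "where": "Example Lake, shoreline stations",
    "how": "Net tows counted under a dissecting microscope",
    "why": "Baseline survey for a long-term monitoring project",
    "units": {"count": "individuals per litre"},
}

assert missing_fields(record) == []
text = json.dumps(record, indent=2)  # save this next to the data file
print(text)
```

A record like this can later be transferred into a standard format with one of the tools listed on the slide.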
  • 5. Workflows. Workflow: how you get from the raw data to the final products of your research. Simple workflows: flow charts. Example: Temperature data + Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics; Graph production
  • 5. Workflows. Workflow: how you get from the raw data to the final products of your research. Simple workflows: commented scripts • R, SAS, MATLAB • Well-documented code is… easier to review, easier to share, easier to repeat analysis
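As an illustration of a commented script, here is the temperature and salinity flow chart written as three documented steps. It is in Python rather than the R, SAS, or MATLAB named on the slide, and the inline data, the column names, and the `-999` missing-value sentinel are all assumptions made for the example.

```python
import csv
import io
import statistics

# Illustrative raw data; a real script would read this from a file.
RAW = """time,temperature_c,salinity_psu
2013-10-01T00:00,14.2,33.5
2013-10-01T01:00,14.0,33.6
2013-10-01T02:00,-999,33.4
2013-10-01T03:00,13.8,
"""


def load(text):
    """Step 1: import the raw data as a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows, sentinel=-999.0):
    """Step 2: quality control. Drop rows with blank or sentinel values."""
    cleaned = []
    for row in rows:
        try:
            t = float(row["temperature_c"])
            s = float(row["salinity_psu"])
        except ValueError:
            continue  # blank or non-numeric entry
        if t == sentinel or s == sentinel:
            continue  # instrument's missing-value code (assumed)
        cleaned.append({"time": row["time"], "temperature_c": t,
                        "salinity_psu": s})
    return cleaned


def summarize(rows, column):
    """Step 3: analysis. Mean and sample standard deviation of one column."""
    values = [r[column] for r in rows]
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


cleaned = clean(load(RAW))
summary = summarize(cleaned, "temperature_c")
print(len(cleaned), summary)
```

Each function corresponds to one box in the flow chart, so the script doubles as documentation of the workflow.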
  • 5.  Workflows   Fancy  Schmancy  workflows:  Kepler   Resulting  output   https://kepler-­‐project.org  
  • 5.  Workflows   Workflows  enable   Reproducibility   Transparency     Executability       From  Flickr  by  merlinprincesse  
  • 5. Workflows. Minimally: document your analysis (commented code; simple flow chart). Emerging workflow applications will… − Link software for executable end-to-end analysis − Provide detailed info about data & analysis − Facilitate re-use & refinement of complex, multi-step analyses − Enable efficient swapping of alternative models & algorithms − Help automate tedious tasks. [Overlay: “Coming Soon: workflow sharing requirements!”] [Image from www.littlebytesoflife.com]
  • 6. Data stewardship & reuse: The 20-Year Rule. The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991). [Image from Flickr by greensambaman]
  • 6.  Data  stewardship  &  reuse   Use  stable  formats      csv,  txt,  tiff   Create  back-­‐up  copies     original,  near,  far   Periodically  test  ability  to  restore  information   Modified from R. Cook  
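One simple way to test a back-up is to compare checksums of the original and the copy. The sketch below uses SHA-256 from Python's `hashlib`. The file names are placeholders, and the talk does not prescribe this particular method.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True when the back-up copy is byte-identical to the original."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration in a throwaway folder.
workdir = Path(tempfile.mkdtemp())
original = workdir / "data.csv"
original.write_text("sample_id,weight_mg\nS1,0.98\n")
backup = workdir / "data_backup.csv"
shutil.copy2(original, backup)

ok = backup_matches(original, backup)       # intact copy
backup.write_text("corrupted")
damaged = backup_matches(original, backup)  # copy no longer matches
print(ok, damaged)
```

Recording the digest alongside the data also lets a far-away copy be checked without access to the original.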
  • 6.  Data  stewardship  &  reuse   Store  your  data  in  a  repository   Institutional  archive   Ask  a  librarian   Discipline/specialty  archive         Repos  of  repos:   databib.org   re3data.org   From  Flickr  by  torkildr  
  • 6.  Data  stewardship  &  reuse   Practice  Data  Citation   Example:   Sidlauskas,  B.  2007.  Data  from:  Testing  for  unequal  rates  of   morphological  diversification  in  the  absence  of  a  detailed   phylogeny:  a  case  study  from  characiform  fishes.  Dryad  Digital   Repository.  doi:10.5061/dryad.20   Persistent  Unique   Identifier   Allows  readers  to  find  data  products   Get  credit  for  data  and  publications   Promotes  reproducibility   Better  measure  of  research  impact  
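The citation on this slide follows a simple author, year, title, repository, identifier pattern, which can be sketched as a small formatter. The function names are hypothetical and the pattern is inferred from this single example. The `https://doi.org/` prefix is the standard DOI resolver address.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation in the pattern shown on the slide."""
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


def doi_url(doi):
    """Turn a DOI into a resolvable link."""
    return "https://doi.org/" + doi


citation = format_data_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological "
           "diversification in the absence of a detailed phylogeny: "
           "a case study from characiform fishes"),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
print(citation)
print(doi_url("10.5061/dryad.20"))
```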
  • Best Practices: data management. 1. Planning 2. Data collection & organization 3. Quality control & assurance 4. Metadata 5. Workflows 6. Data stewardship & reuse [Image from Flickr by Big Swede Guy]
  • What  is  a  data   management  plan?   A  document  that   describes  what  you  will   do  with  your  data   throughout     the  research  project   From Flickr by Barbies Land
  • DMP for funders: A short plan submitted alongside grant applications. An outline of: what will be collected; methods; standards; metadata; sharing/access; long-term storage. Includes how and why. But they all have different requirements and express them in different ways. [Image from Flickr by 401(K) 2013]
  • NSF  DMP  Requirements   From  Grant  Proposal  Guidelines:    DMP  supplement  may  include:   1.  the  types  of  data,  samples,  physical  collections,  software,  curriculum   materials,  and  other  materials  to  be  produced  in  the  course  of  the  project   2.   the  standards  to  be  used  for  data  and  metadata  format  and  content  (where   existing  standards  are  absent  or  deemed  inadequate,  this  should  be   documented  along  with  any  proposed  solutions  or  remedies)   3.   policies  for  access  and  sharing  including  provisions  for  appropriate   protection  of  privacy,  confidentiality,  security,  intellectual  property,  or  other   rights  or  requirements   4.   policies  and  provisions  for  re-­‐use,  re-­‐distribution,  and  the  production  of   derivatives   5.   plans  for  archiving  data,  samples,  and  other  research  products,  and  for   preservation  of  access  to  them  
  • From  Flickr  by  OZinOH   The  Data  Management   Planning  Tool   DMPTool       Carly  Strasser      |  @carlystrasser   California  Digital  Library     5  August  2013   ESA  2013  SS  2  
  • From  Flickr  by  dipster1   Toolbox  
  • Write  a  DMP   dmptool.org                     Step-­‐by-­‐step  wizard  for  generating  DMP   create  |  edit  |  re-­‐use  |  share   Free  &  open  to  community    
  • Find  a  repository   Where   should  I  put   my  data?   databib.org  
  • From Flickr by thewmatt Get  help  
  • From  Flickr  by  North  Carolina  Digital   Heritage  Center   Get  help  from  your  library   From  Flickr  by  Madison  Guy  
  • Get  help   Toolbox:    DCXL  blog:  dcxl.cdlib.org  
  • From  Flickr  by  dotpolka   Doing  science  is  a   privilege  –  not  a  right  
  • From  Flickr  by  Michael  Tinkler  
  • From  Flickr  by  mikerosebery    There  is  a  social  contract  of  science:  we   have  an  obligation  to  ensure  dissemination,   validation,  &  advancement.   To  not  do  so  is  science  malpractice.       –  Brian  Hole,  Ubiquity  Press  at  UCL  
  • My  website   Email  me   Tweet  me   My  slides   carlystrasser.net   carlystrasser@gmail.com   @carlystrasser     slideshare.net/carlystrasser