Data Management: The Current Landscape
Presentation for IASSIST 2012 Meeting in Washington, DC.

Published in: Technology, Business
Transcript of "Data Management: The Current Landscape"

  1. Data Management: The Current Landscape. Carly Strasser, California Digital Library, University of California Curation Center. 2012 IASSIST Conference, June 2012.
  2. Digital data. Image credits: Flickr users DW0825, Flickmor, deltaMike, and US Army Environmental Command; www.woodrow.org; C. Strasser; courtesy of WHOI.
  3. Digital data + complex analyses
  4. Data | Models | Maximum likelihood estimation | Matrix models | Images | Tables | Paper
  5. UGLY TRUTH: Most Earth | Environmental | Ecological scientists… (5shortessays.blogspot.com)
     • are not taught data management
     • don't know what metadata are
     • can't name data centers or repositories
     • don't share data publicly or store it in an archive
     • aren't convinced they should share data
  6. Where data end up. Figure labels: www, Data, Metadata (recreated from Klump et al. 2006). Image credits: Flickr users diylibrarian and csessums; blog.order2disorder.com.
  7. Who cares? Image credits: Flickr users Redden-McAllister and AJC1; www.rba.gov.au.
  8. Where data end up. Figure labels: www, Data, Metadata (recreated from Klump et al. 2006). Image credits: Flickr users diylibrarian and torkildr.
  9. Data reuse | Data sharing | Data management
  10. Trends in Data Archiving
      • Journal publishers: Joint Data Archiving Agreement
      • Data papers etc.: Ecological Archives, Beyond the PDF
      • Funders: data management requirements
  11. What is a data management plan? A document that describes what you will do with your data during your research and after you complete your research.
  12. Why should a scientist prepare a DMP?
      • Saves time
      • Increases efficiency
      • Easier to use data
      • Others can understand & use data
      • Credit for data products
      • Funders require it
  13. NSF DMP Requirements. From Grant Proposal Guidelines: DMP supplement may include:
      1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
      2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
      3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
      4. policies and provisions for re-use, re-distribution, and the production of derivatives
      5. plans for archiving data, samples, and other research products, and for preservation of access to them
  14. NSF’s Vision*
      • DMPs and their evaluation will grow & change over time (similar to broader impacts)
      • Peer review will determine next steps
      • Community-driven guidelines
        – Different disciplines have different definitions of acceptable data sharing
        – Flexibility at the directorate and division levels
        – Tailor implementation of DMP requirement
      • Evaluation will vary with directorate, division, & program officer
      *Unofficially. Help from Jennifer Schopf, NSF
  15. dmp.cdlib.org | dmponline.dcc.ac.uk
  16. now called DataUp
      • Open source add-in & web application
      • Facilitate data management, sharing, archiving for scientists
      • Focus on atmospheric, ecological, hydrological, and oceanographic data
      • Collecting requirements for add-in from scientists, data centers, libraries
      Funders: Gordon and Betty Moore Foundation, Microsoft Research
  17. www.dataone.org
      • Data Education Tutorials
      • Database of best practices & software tools
      • Primer on data management
      • Investigator Toolkit
      now called DataUp
  18. Data Management: Best Practices. Carly Strasser, California Digital Library, University of California Curation Center. 2012 IASSIST Conference, June 2012.
  19. Best Practices for Data Management
      1. Planning
      2. Data collection & organization
      3. Quality control & assurance
      4. Metadata
      5. Workflows
      6. Data stewardship & reuse
  22. 2. Data collection & organization: Create unique identifiers
      • Decide on naming scheme early
      • Create a key
      • Different for each sample
      Image credits: Flickr users zebbie and sjbresnahan
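
The identifier advice on this slide can be sketched in a few lines of Python. This is only an illustration: the scheme (site code, year, zero-padded sequence number) and the names `make_sample_id` and `build_key` are hypothetical, not something the slides prescribe.

```python
def make_sample_id(site: str, year: int, seq: int) -> str:
    """Build an ID such as 'NAN-2010-0007' (hypothetical naming scheme)."""
    return f"{site.upper()}-{year}-{seq:04d}"


def build_key(samples):
    """Return a key mapping each ID to its description; reject duplicate IDs."""
    key = {}
    for site, year, seq, description in samples:
        sample_id = make_sample_id(site, year, seq)
        if sample_id in key:
            raise ValueError(f"duplicate sample ID: {sample_id}")
        key[sample_id] = description
    return key


# Made-up samples, used only to show the shape of the key.
key = build_key([
    ("nan", 2010, 1, "surface tow, station A"),
    ("nan", 2010, 2, "surface tow, station B"),
])
```

Deciding the scheme in one function early means every later sample gets an ID in the same shape, and the key doubles as the lookup table the slide asks for.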
  23. 2. Data collection & organization: Standardize
      • Consistent within columns
        – only numbers, dates, or text
      • Consistent names, codes, formats
      Modified from K. Vanderbilt. Image: from Pink Floyd, The Wall (themurkyfringe.com)
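
One way to check the "consistent within columns" rule is to classify every cell and report columns that mix kinds. The sketch below assumes dates are written as YYYY-MM-DD; the helper names `kind_of` and `mixed_columns` and the sample table are invented for illustration.

```python
import csv
import io
from datetime import datetime


def kind_of(value: str) -> str:
    """Classify a cell as 'number', 'date' (YYYY-MM-DD assumed), or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def mixed_columns(rows):
    """Return {column: kinds} for every column holding more than one kind."""
    kinds = {}
    for row in rows:
        for column, value in row.items():
            if value != "":  # empty cells are a separate (missing data) issue
                kinds.setdefault(column, set()).add(kind_of(value))
    return {column: found for column, found in kinds.items() if len(found) > 1}


# Made-up table: the 'count' column mixes a number with free text.
table = io.StringIO("site,date,count\nA,2010-06-01,12\nB,2010-06-02,about 20\n")
problems = mixed_columns(csv.DictReader(table))
```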
  24. 2. Data collection & organization: Use descriptive file names (PhDcomics.com)
  25. 2. Data collection & organization: Use descriptive file names*
      • Unique
      • Reflect contents
      Bad: Mydata.xls, 2001_data.csv, best version.txt
      Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
      *Not for everyone
      From R. Cook, ESA Best Practices Workshop 2010
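
The "Better" file name on this slide follows a pattern that is easy to generate rather than type by hand. A minimal sketch, assuming the four parts shown on the slide (organism, site, year, measurement); the function name `descriptive_name` is hypothetical.

```python
import re


def descriptive_name(organism: str, site: str, year: int, measured: str, ext: str = "csv") -> str:
    """Join the parts with underscores after stripping spaces and punctuation."""
    parts = [organism, site, str(year), measured]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the file name must be non-empty")
    return "_".join(cleaned) + "." + ext


name = descriptive_name("E. affinis", "nanaimo", 2010, "counts", ext="xls")
```

With these inputs the function reproduces the slide's example, `Eaffinis_nanaimo_2010_counts.xls`.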
  26. 2. Data collection & organization: Preserve information
      • Keep raw data raw
      • Use scripts to process data & save them with data
      Figure labels: Raw data as .csv; R script for processing & analysis
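
The slide's figure shows an R script; the sketch below illustrates the same idea in Python instead. The file names, the `count` column, and the function `process_raw` are assumptions made for the example. The point is that the script reads the raw file, writes its result to a different file, and refuses to overwrite the raw data.

```python
import csv
import tempfile
from pathlib import Path


def process_raw(raw_path: Path, out_path: Path) -> int:
    """Copy rows that have a recorded count to out_path; return how many were kept."""
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    kept = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["count"].strip():  # drop rows with no recorded count
                writer.writerow(row)
                kept += 1
    return kept


# Demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_counts.csv"
    raw.write_text("site,count\nA,12\nB,\nC,7\n")
    raw_before = raw.read_text()
    processed = Path(tmp) / "processed_counts.csv"
    kept = process_raw(raw, processed)
    raw_unchanged = raw.read_text() == raw_before
    processed_lines = processed.read_text().splitlines()
```

Saving a script like this next to the data records exactly how the processed file was derived from the raw one.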
  28. 3. Quality control and quality assurance: Before data collection
      • Define & enforce standards
      • Assign responsibility for data quality
      Image credit: Flickr user StacieBee
  29. 3. Quality control and quality assurance: During data collection/entry
      • Minimize manual entry
      • Use double entry
      • Use text-to-speech program to read data back
      • Use a database
      • Document changes
      Image credit: Flickr user schock
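
Double entry means the same data are typed twice, independently, and the two versions are compared. A minimal sketch of the comparison step follows; the function name `compare_entries` and the sample rows are invented for illustration.

```python
def compare_entries(first, second):
    """Return (row index, column, value in first, value in second) for each mismatch."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for index, (row_a, row_b) in enumerate(zip(first, second)):
        for column in sorted(set(row_a) | set(row_b)):
            if row_a.get(column) != row_b.get(column):
                mismatches.append((index, column, row_a.get(column), row_b.get(column)))
    return mismatches


# Two typed versions of the same (made-up) field sheet; one count was mistyped.
entry_one = [{"site": "A", "count": "12"}, {"site": "B", "count": "70"}]
entry_two = [{"site": "A", "count": "12"}, {"site": "B", "count": "7"}]
differences = compare_entries(entry_one, entry_two)
```

Each reported mismatch points back to a row and column that should be checked against the original field sheet.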
  30. 3. Quality control and quality assurance: After data entry
      • Check for missing, impossible, anomalous values
      • Perform statistical summaries
      • Look for outliers
        – Normal probability plots
        – Regression
        – Scatter plots
        – Maps
      [Figure: example plot]
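
The checks listed on this slide can be partly automated. The sketch below counts missing values, flags values outside a plausible range, and flags outliers using a median absolute deviation (MAD) rule. The MAD rule, the cutoff `k`, the range limits, and the sample values are all assumptions chosen for the example; the slides do not name a specific outlier test.

```python
import statistics


def qc_report(values, low, high, k=5.0):
    """Summarize missing, impossible, and outlying values in one column."""
    present = [v for v in values if v is not None]
    impossible = [v for v in present if not (low <= v <= high)]
    plausible = [v for v in present if low <= v <= high]
    median = statistics.median(plausible)
    mad = statistics.median([abs(v - median) for v in plausible])
    # Flag values far from the median relative to the typical deviation.
    outliers = [v for v in plausible if mad > 0 and abs(v - median) > k * mad]
    return {
        "missing": len(values) - len(present),
        "impossible": impossible,
        "median": median,
        "outliers": outliers,
    }


# Made-up water temperatures (°C); None marks a missing value, -999.0 a sentinel code.
temperatures = [12.1, 11.8, None, 12.4, -999.0, 12.0, 30.5]
report = qc_report(temperatures, low=-2.0, high=40.0)
```

Flagged values are candidates for review, not automatic deletions; the plots named on the slide remain the better tool for judging them.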
  32. 4. Metadata: Metadata = data reporting
      • WHO created the data?
      • WHAT is the content of the data set?
      • WHEN was it created?
      • WHERE was it collected?
      • HOW was it developed?
      • WHY was it developed?
      Image credit: Flickr user //ichael Patric|{
  33. 4. Metadata
      • Scientific context
        – Scientific reason why the data were collected
        – What data were collected
        – What instruments (including model & serial number) were used
        – Environmental conditions during collection
        – Where collected & spatial resolution
        – When collected & temporal resolution
        – Standards or calibrations used
      • Information about parameters
        – How each was measured or produced
        – Units of measure
        – Format used in the data set
        – Precision & accuracy if known
      • Information about data
        – Definitions of codes used
        – Quality assurance & control measures
        – Known problems that limit data use (e.g. uncertainty, sampling problems)
      • Digital context
        – Name of the data set
        – The name(s) of the data file(s) in the data set
        – Date the data set was last modified
        – Example data file records for each data type file
        – Pertinent companion files
        – List of related or ancillary data sets
        – Software (including version number) used to prepare/read the data set
        – Data processing that was performed
      • Personnel & stakeholders
        – Who collected
        – Who to contact with questions
        – Funders
      • How to cite the data set
  34. 4. Metadata: What is metadata? Select the appropriate metadata standard
      • Provides structure to describe data: common terms | definitions | language | structure
      • Lots of different standards: EML, FGDC, ISO19115, DarwinCore, …
      • Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)
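
Before committing to a formal standard, the WHO/WHAT/WHEN/WHERE/HOW/WHY questions from the metadata slides can be captured in a small machine-readable record. The sketch below is not EML, FGDC, or any other standard; the field names, the function `missing_fields`, and every value in the record are placeholders invented for illustration.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record: dict) -> list:
    """List the required questions that the record leaves unanswered."""
    gaps = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            gaps.append(field)
    return gaps


# Placeholder record; 'where' is deliberately left blank to show the check.
record = {
    "who": "A. Researcher (placeholder)",
    "what": "Zooplankton counts from net tows",
    "when": "2010-06-01 to 2010-08-31",
    "where": "",
    "how": "Net tows, counted under a dissecting microscope",
    "why": "Seasonal abundance survey",
}
gaps = missing_fields(record)
record_json = json.dumps(record, indent=2)
```

A record like this is a starting point only; tools such as those named on the slide produce files that follow a full standard.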
  36. 5. Workflows. Workflow: how you get from the raw data to the final products of your research
      Simple workflows: flow charts
      [Flow chart: Temperature data and salinity data → Data import into R → Data in R format → Quality control & data cleaning → "Clean" T & S data → Analysis: mean, SD → Summary statistics → Graph production]
  37. 5. Workflows. Workflow: how you get from the raw data to the final products of your research
      Simple workflows: commented scripts
      • R, SAS, MATLAB
      • Well-documented code is…
        – Easier to review
        – Easier to share
        – Easier to repeat analysis
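
The temperature and salinity flow chart can be written as a short commented script. The slides name R, SAS, and MATLAB; Python is used here only as a stand-in. The inline data, the column names, and the rule that salinity cannot be negative are assumptions for the example, and the graph production step is left out.

```python
import csv
import io
import statistics

# Made-up raw data: one row has a missing temperature, one an impossible salinity.
RAW_CSV = """temperature,salinity
12.1,33.5
11.8,33.7
,33.6
12.4,-1
12.0,33.9
"""


def import_data(text):
    """Step 1, data import: empty cells become None, other cells become floats."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({name: (float(cell) if cell.strip() else None)
                     for name, cell in record.items()})
    return rows


def clean(rows):
    """Step 2, quality control: drop rows with missing values or negative salinity."""
    return [row for row in rows
            if None not in row.values() and row["salinity"] >= 0]


def summarize(rows):
    """Step 3, analysis: mean and sample standard deviation of each variable."""
    summary = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        summary[name] = {"mean": statistics.mean(column),
                         "sd": statistics.stdev(column)}
    return summary


imported = import_data(RAW_CSV)
cleaned = clean(imported)
summary = summarize(cleaned)
```

Because each step is a named, commented function, a reviewer can follow the path from raw data to summary statistics and re-run it unchanged.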
  38. 5. Workflows: Fancy schmancy workflows: Kepler (https://kepler-project.org). Figure label: Resulting output
  39. 5. Workflows: Workflows enable
      • Reproducibility: can someone independently validate findings?
      • Transparency: others can understand how you arrived at your results
      • Executability: others can re-run or re-use your analysis
      Image credit: Flickr user merlinprincesse
  41. 6. Data stewardship & reuse: The 20-Year Rule
      The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991)
      Document, document, document
      Image credit: Flickr user greensambaman
  42. 6. Data stewardship & reuse
      • Use stable formats: csv, txt, tiff
      • Create back-up copies: original, near, far
      • Periodically test back-ups
      Modified from R. Cook
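
One simple way to test a back-up is to compare its checksum with that of the original file. The sketch below uses SHA-256 from Python's standard library; the file names and the helper names `sha256_of` and `backup_is_intact` are assumptions for the example, and the "damage" is simulated by rewriting the back-up.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(original: Path, backup: Path) -> bool:
    """True when the back-up is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "counts.csv"
    original.write_text("site,count\nA,12\n")
    backup = Path(tmp) / "counts_backup.csv"
    shutil.copy2(original, backup)
    ok_after_copy = backup_is_intact(original, backup)
    backup.write_text("site,count\nA,21\n")  # simulate a damaged back-up
    ok_after_damage = backup_is_intact(original, backup)
```

Storing the digest alongside the data also lets a far-away copy be checked without access to the original.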
  43. 6. Data stewardship & reuse: Store data in a repository
      • Institutional archive
      • Discipline/specialty archive
      Image credit: Flickr user torkildr
  44. 6. Data stewardship & reuse: Data Citation
      • Allows readers to find data products
      • Get credit for data and publications
      • Promotes reproducibility
      • Better measure of research impact
      Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
      Learn more at www.datacite.org
      Modified from R. Cook
  45. Check out the blog dcxl.cdlib.org or my website www.carlystrasser.net | Email me: carlystrasser@gmail.com | Tweet me: @carlystrasser, @dcxlCDL | DCXL on FB: DCXLatCDL
