Bren - UCSB - Spooky spreadsheets

Talk for Jim Frew's grad class at Bren School, UC Santa Barbara. Oct 31, 2013. All about things you can do wrong (and right) with spreadsheets.

  1. Spooky Spreadsheets. Carly Strasser | California Digital Library. UCSB/Bren, Oct 2013. (From Flickr by Jeff Golden)
  2. Roadmap: 1. Background; 2. Best practices; 3. Toolbox
  3. Scientists are bad at data management. (From Flickr by robertpaulyoung)
  4. Many tables
  5. Embedded figures
  6. my spreadsheet: No headings
  7. my spreadsheet
  8. my spreadsheet
  9. ?
  10. NO reproducibility, transparency, or reuse (www.petshaming.net)
      • Didn’t share the data
      • Didn’t document the data (metadata)
      • Didn’t document provenance/workflow
  11. Why should I care? (From Flickr by johntrainor)
  12. Because they care: (From Flickr by Redden-McAllister)
  13. Best Practices: data management (From Flickr by Big Swede Guy)
  14. Plan before data collection (From Flickr by Mark Sardella)
  15. Planning: Design sample naming scheme (From Flickr by zebbie)
      • Create a key (data dictionary)
      • Make sure names are unique
      • Define codes
  16. Planning: Design file naming scheme (PhDcomics.com)
  17. Planning: Design file naming scheme. Use descriptive file names*
      • Unique
      • Reflect contents
      Bad: Mydata.xls, 2001_data.csv, best version.txt
      Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
      *Not for everyone
      (From R Cook, ESA Best Practices Workshop 2010)
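The naming pattern on this slide (study organism, site name, year, what was measured) can be assembled by a small helper. This is only a sketch: the function name `build_file_name` and the rule of stripping everything except letters and digits are assumptions for illustration, not part of the talk.

```python
import re


def build_file_name(organism, site, year, measured, extension="csv"):
    """Join the four components named on the slide into one descriptive file name."""
    parts = [organism, site, str(year), measured]
    # Assumed rule: strip anything other than letters and digits so names stay portable.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every component must contain at least one letter or digit")
    return "_".join(cleaned) + "." + extension


# Rebuilds the "Better" example from the slide.
print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", "xls"))
```

Generating names from components rather than typing them by hand keeps them unique and consistent across a project.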
  18. Planning: Design file organization (From S. Hampton)
  19. Planning: Design file organization (From S. Hampton)
      Biodiversity
        Lake
          Experiments
            Biodiv_H20_heatExp_2005to2008.csv
            Biodiv_H20_predatorExp_2001to2003.csv
            …
          Field work
            Biodiv_H20_PlanktonCount_2001toActive.csv
            Biodiv_H20_ChlAprofiles_2003.csv
            …
        Grassland
      Consider…
      • Dependencies?
      • File formats?
      • Time of collection?
      • Order of analysis?
      Workflows!
  20. Planning: Design your spreadsheet. Constrain entries; atomize; break down spreadsheets (From Flickr by Ulleskelf)
  21. Planning: Consider a database (From Mark Schildhauer)
      A relational database is
      • A set of tables
      • Relationships among the tables
      • A language to specify & query the tables
      An RDB provides
      • Scalability: millions+ records
      • Features for sub-setting, querying, sorting
      • Reduced redundancy & entry errors
  22. Planning: Consider a database (From Mark Schildhauer)
      You should invest time in learning databases if
      • your data sets are large or complex
      Consider investing time in learning databases if
      • your data are small and humble
      • you ever intend to share your data
      • you are < 30 years old
  23. Planning: Pick a data repository (From Flickr by torkildr)
      Store your data in a repository
      • Institutional archive
      • Ask a librarian
      • Discipline/specialty archive
      Repos of repos: databib.org, re3data.org
  24. Planning: Decide on preservation/backup (From Flickr by withassociates; From Flickr by sepa synod; From Flickr by taberandrew)
      • What software?
      • What hardware?
      • What personnel?
      • How often?
      • Set up reminders!
      • Test system
  25. Planning: Write a data management plan! …document that describes what you will do with your data throughout the research project (From Flickr by Barbies Land)
  26. Planning: DMP components (From Flickr by Barbies Land)
      • What will be collected
      • Methods
      • Standards
      • Metadata
      • Sharing/access
      • Long-term storage
      But they all have different requirements and express them in different ways
  27. Planning: dmptool.org. Step-by-step wizard for generating DMP: create | edit | re-use | share. Free & open to community.
  28. During Data Collection & Entry (From Flickr by Julia Manzerova)
  29. During collection: Keep raw data raw
      Realistically:
      • Archive .csv version of raw data
      • Make a “raw” tab in working data file
      • Do all work on other tabs
  30. During collection: Keep raw data raw
      Ideally:
      • Use scripts to process data
      • Save them with data
      Raw data as .csv; R script for processing & analysis
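A minimal sketch of the "ideally" case might look like the following. The slide names an R script; Python is used here only for illustration. The column name `temp_c`, the file names, and the cleaning rule (drop rows with a missing temperature) are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def process_raw(raw_path, processed_path):
    """Write a cleaned copy of the raw file; the raw file is only ever opened for reading."""
    raw_path, processed_path = Path(raw_path), Path(processed_path)
    if raw_path.resolve() == processed_path.resolve():
        raise ValueError("refusing to overwrite the raw data file")
    kept = 0
    with raw_path.open(newline="") as src, processed_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning rule: skip rows whose temperature is missing.
            if row["temp_c"].strip() == "":
                continue
            writer.writerow(row)
            kept += 1
    return kept


# Demo in a temporary directory with made-up readings.
with tempfile.TemporaryDirectory() as folder:
    raw = Path(folder) / "raw_temperature.csv"
    raw.write_text("site,temp_c\nA,12.5\nB,\nC,14.0\n")
    before = raw.read_text()
    kept_rows = process_raw(raw, Path(folder) / "clean_temperature.csv")
    assert raw.read_text() == before  # the raw file is unchanged
    print(kept_rows, "rows kept")
```

Saving a script like this next to the archived .csv means the cleaned file can always be regenerated from the raw one.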
  31. During collection: Document your workflow
      Workflow: how you get from the raw data to the final products of your research
      Simple workflow: flow chart
      Temperature data, Salinity data -> Data import into Excel -> Data in spreadsheet -> Quality control & data cleaning -> “Clean” T & S data -> Analysis: mean, SD -> Summary statistics; Graph production
  32. During collection: Document your workflow
      Workflow: how you get from the raw data to the final products of your research
      Simple workflow: commented script
      • R, SAS, MATLAB…
      • Well-documented code is easier to review, easier to share, and easier to use for repeat analysis
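As an illustration of a commented script, the sketch below follows the flow-chart steps from the previous slide (import, quality control, mean and SD). The readings and the missing-value code `-999.0` are invented for the example, and Python stands in for the R, SAS, or MATLAB scripts the slide mentions.

```python
import statistics

# Assumption: the sensor writes -999.0 when a reading is missing.
MISSING_VALUE = -999.0


def summarize(readings, missing_value=MISSING_VALUE):
    """Return (n, mean, sample SD) after dropping missing-value codes."""
    # Step 1, quality control: remove the missing-value code.
    clean = [value for value in readings if value != missing_value]
    # Step 2, analysis: mean and sample standard deviation of the clean data.
    return len(clean), statistics.mean(clean), statistics.stdev(clean)


# Step 0, data import: hypothetical temperatures in place of a real raw .csv.
raw_temperature_c = [12.1, 12.4, -999.0, 12.9, 13.2]

n, mean_c, sd_c = summarize(raw_temperature_c)
print(f"n={n} mean={mean_c:.2f} sd={sd_c:.2f}")
```

Each comment names the workflow step it implements, so the script doubles as the workflow documentation.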
  33. During collection: Document your workflow. Fancy schmancy workflows; resulting output. https://kepler-project.org
  34. During collection: Document your workflow (From Flickr by merlinprincesse)
      Workflows enable
      • Reproducibility
      • Transparency
      • Reuse
  35. During collection: Constrain data entries (Modified from K. Vanderbilt)
      • Excel lists
      • Data validation
      • Google docs forms
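The same idea can be sketched outside a spreadsheet: check each entry against a fixed list and a numeric rule before accepting it. The site names in `ALLOWED_SITES` and the rule that counts are non-negative integers are assumptions made up for this example.

```python
# Hypothetical controlled vocabulary, playing the role of an Excel drop-down list.
ALLOWED_SITES = {"nanaimo", "lake", "grassland"}


def validate_entry(site, count):
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = []
    if site not in ALLOWED_SITES:
        problems.append(f"unknown site: {site!r}")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        problems.append(f"count must be a non-negative integer: {count!r}")
    return problems


print(validate_entry("lake", 3))
print(validate_entry("Lake ", -1))
```

Rejecting "Lake " (capitalized, trailing space) at entry time avoids having to reconcile spelling variants later.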
  36. During collection: Atomize. One piece of information per cell.
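One way to picture atomizing: a cell that mixes a number and a unit is split into two cells. The example cell contents and the helper name `atomize_measurement` are hypothetical; the slide itself gives no specific example.

```python
def atomize_measurement(cell):
    """Split a combined cell such as '12.5 C' into a numeric value and a unit."""
    # Raises ValueError if the cell does not contain both a value and a unit.
    value_text, unit = cell.strip().split(maxsplit=1)
    return float(value_text), unit


print(atomize_measurement("12.5 C"))
```

With the value and unit in separate columns, the numbers can be sorted and averaged without text parsing.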
  37. During collection: Break down spreadsheets. Fake a relational database
      • Create parameter table (From doi:10.3334/ORNLDAAC/777)
      • Create a site table (From doi:10.3334/ORNLDAAC/777)
      (From R Cook, ESA Best Practices Workshop 2010)
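To show what the parameter and site tables buy you, here is a small sketch using Python's built-in `sqlite3` module in place of separate spreadsheet tabs. The table layouts, column names, and values are invented for illustration and are not taken from the cited data set.

```python
import sqlite3

# In-memory database; all tables and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (site_id TEXT PRIMARY KEY, name TEXT, latitude REAL);
CREATE TABLE parameter (param_id TEXT PRIMARY KEY, description TEXT, units TEXT);
CREATE TABLE measurement (
    site_id TEXT REFERENCES site(site_id),
    param_id TEXT REFERENCES parameter(param_id),
    value REAL
);
INSERT INTO site VALUES ('S1', 'Lake', 49.2);
INSERT INTO parameter VALUES ('T', 'Water temperature', 'degC');
INSERT INTO measurement VALUES ('S1', 'T', 12.5);
""")

# Site and parameter details are stored once and joined in only when needed.
rows = conn.execute("""
SELECT site.name, parameter.description, measurement.value, parameter.units
FROM measurement
JOIN site ON site.site_id = measurement.site_id
JOIN parameter ON parameter.param_id = measurement.param_id
""").fetchall()
print(rows)
```

Each measurement row carries only short keys, so a site name or unit is typed once instead of on every row.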
  38. During collection: Create metadata. Why are you promoting Excel?
  39. During collection: Create metadata (From Flickr by //ichael Patric|{)
      Metadata: data reporting
      • WHO created the data?
      • WHAT is the content of the data set?
      • WHEN was it created?
      • WHERE was it collected?
      • HOW was it developed?
      • WHY was it developed?
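The six questions can be captured in a small machine-readable record kept next to the data file. This is a sketch, not a metadata standard: the field names, the JSON format, and the example answers are all assumptions.

```python
import json

# The six questions from the slide, used as required field names (an assumption).
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def build_metadata(**fields):
    """Return a JSON string with the six answers; refuse to build an incomplete record."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return json.dumps({name: fields[name] for name in REQUIRED_FIELDS}, indent=2)


# Hypothetical answers for a made-up data set.
record = build_metadata(
    who="A. Researcher",
    what="Plankton counts",
    when="2010",
    where="Nanaimo",
    how="Net tows counted under a microscope",
    why="Thesis chapter on seasonal abundance",
)
print(record)
```

Failing loudly on a missing answer is the point: the record is only useful if all six questions are covered.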
  40. During collection: Create metadata
      Digital context
      • Name of the data set
      • The name(s) of the data file(s) in the data set
      • Date the data set was last modified
      • Example data file records for each data type file
      • Pertinent companion files
      • List of related or ancillary data sets
      • Software (including version number) used to prepare/read the data set
      • Data processing that was performed
      Personnel & stakeholders
      • Who collected
      • Who to contact with questions
      • Funders
      Scientific context
      • Scientific reason why the data were collected
      • What data were collected
      • What instruments (including model & serial number) were used
      • Environmental conditions during collection
      • Temporal & spatial resolution
      • Standards or calibrations used
      Information about parameters
      • How each was measured or produced
      • Units of measure
      • Format used in the data set
      • Precision & accuracy if known
      Information about data
      • Definitions of codes used
      • Quality assurance & control measures
      • Known problems that limit data use (e.g. uncertainty, sampling problems)
  41. During collection: Create metadata. Standard. What is metadata?
      Metadata standards…
      • Provide structure to describe data: common terms | definitions | language | structure
      • Come in many flavors: EML, FGDC, ISO19115, DarwinCore,…
      • Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)
  42. During collection: Back up daily. Original, Near, Far (From Flickr by see phar; From Flickr by lippo)
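A backup is only useful if the copy matches the original, so a sketch of a daily backup step might copy the file and then compare checksums. The function names, the folder name "near", and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, backup_dir):
    """Copy a file into backup_dir and verify that the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / Path(original).name
    shutil.copy2(original, copy_path)
    if sha256_of(original) != sha256_of(copy_path):
        raise OSError("backup does not match the original")
    return copy_path


# Demo with a made-up file; "near" stands in for a nearby backup location.
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "counts.csv"
    source.write_text("site,count\nA,3\n")
    print(back_up(source, Path(folder) / "near").name)
```

The same function could be pointed at a second, remote location for the "far" copy.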
  43. During collection: Remember that data management plan? Revisit, Review, Revise (From Flickr by Barbies Land)
  44. During collection: Revisit, Review, Revise. Schedule a time each week or month (From Flickr by purplemattfish)
  45. Toolbox (From Flickr by dipster1)
  46. Write a DMP: dmptool.org. Step-by-step wizard for generating DMP: create | edit | re-use | share. Free & open to community.
  47. Find a repository: Where should I put my data? databib.org
  48. Manage & share
      • Help researchers manage, describe, and share tabular data
      • Free
      • Add-in for Excel & web application
  49. Manage & share: Features
      1. Best practices check
      2. Generate metadata
      3. Get identifier & citation
      4. Post data to repository
  50. Create metadata
  51. Create metadata
  52. Clean data: Open Refine = Google Refine
      • Open source desktop application
      • Used for data cleanup and transformation to other formats
      • Works with spreadsheets but behaves like a database
      • User can filter the rows to display using facets that define filtering criteria
  53. Open Refine = Google Refine
      • Open source desktop application
      • Used for data cleanup and transformation to other formats
      • Works with spreadsheets but behaves like a database
      • User can filter the rows to display using facets that define filtering criteria
  54. Get help. Toolbox: DCXL blog: dcxl.cdlib.org
  55. Culture Shift Ahead (From Flickr by twm1340)
  56. science, source, notebook, content, access, data, government, knowledge (From Flickr by cdsessums)
  57. Make a resolution (From Flickr by Andy Graulund)
      • Triage on current projects
      • Get advisor, lab mates, collaborators on board
      • Do better next time
  58. Website: carlystrasser.net
      Email: carlystrasser@gmail.com
      Twitter: @carlystrasser
      Slides: slideshare.net/carlystrasser
