
Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

Published in: Technology

  1. Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading. Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
  2. A set of tools for Cariceae informatics. Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
  3. [Diagram] Identify gaps in our knowledge and sampling; Formulate sampling plan; New collections; DNA sequences; DNA matrices; Multiple alignments; Species tree estimates; Revised classification; A central database for specimen-level data
  4. What tools do we need?
     • An easily updated hierarchical checklist to visualize sampling progress across labs, extractions, and sequences;
     • A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
     • A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
     • A platform for collaboration: a virtual research environment to bring together researchers worldwide
  5. I. A hierarchical checklist and sampling progress reports
  6. In 2011
     • A flat checklist exported from WCM
     • A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
     • A vague idea of what trips are needed
     Today
     • A hierarchical checklist by subgenus, section
     • A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
     • A concrete sampling plan with trips and taxa identified*
     * Okay, we’re working on this one!
  7. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s). We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.
  8. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s)
  9. Spring 2012: Hierarchical checklist. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s) !
  10. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s) !
  11. [Diagram] Specimen Record; Tissue; Extraction; DNA seq.; Metadata flow; DNA seq.; DNA seq.
  12. A centralized workflow
      • Spreadsheets imported into a single Excel file
      • Names cleaned (variable)
      • DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
      • Names matched to our Scratchpads checklist
      • All files exported to CSV
      • Sample sheets and SP checklist imported to R
      • DNA records added to checklist as nodes that are children to their taxa
      • Hierarchical checklist exported in text format, with unsampled taxa marked for searching
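The last two steps of this workflow can be sketched as follows. The authors' tools are written in R; this Python version is only an illustration, and the section names, taxon names, lab labels, and the `[UNSAMPLED]` tag are all hypothetical placeholders rather than the project's actual data or format.

```python
from collections import defaultdict

# Placeholder checklist: (section, taxon) pairs. Real exports have more columns.
checklist = [
    ("Section A", "Carex sp. 1"),
    ("Section A", "Carex sp. 2"),
    ("Section B", "Carex sp. 3"),
]
# Placeholder DNA records: (taxon, free-text summary of what a lab holds).
dna_records = [
    ("Carex sp. 1", "Lab A: extraction 12, ITS + ETS"),
    ("Carex sp. 1", "Lab B: extraction 7, matK"),
    ("Carex sp. 3", "Lab A: extraction 40, ITS"),
]

UNSAMPLED_TAG = "[UNSAMPLED]"  # arbitrary searchable marker (assumption)


def build_report(checklist, dna_records):
    """Return the hierarchical checklist as indented text lines."""
    records_by_taxon = defaultdict(list)
    for taxon, label in dna_records:
        records_by_taxon[taxon].append(label)

    taxa_by_section = defaultdict(list)
    for section, taxon in checklist:
        taxa_by_section[section].append(taxon)

    lines = []
    for section in sorted(taxa_by_section):
        lines.append(section)
        for taxon in sorted(taxa_by_section[section]):
            records = records_by_taxon.get(taxon, [])
            if records:
                lines.append(f"  {taxon}")
                # DNA records hang below their taxon as child nodes.
                lines.extend(f"    {label}" for label in records)
            else:
                # Unsampled taxa carry a tag that a text search can find.
                lines.append(f"  {taxon} {UNSAMPLED_TAG}")
    return lines


report = build_report(checklist, dna_records)
print("\n".join(report))
```

Searching the exported text for the tag then lists every taxon that no collaborating lab has sampled yet.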
  13. ← Section name; ← Sampled taxon with its DNA vouchers and summaries; ← Unsampled taxon
  14. Because Kew has coded geography using TDWG standards, we can export geographic hit-lists
  15. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s) ! ! ! ?
  16. II. A specimen-level phylogenetic pipeline
  17. NCBI is a morass of data. Geneious:
      • Query the nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
      • Export as
        – FASTA
        – TAB-Delim
        – XML: the only export that maintains all information in NCBI, and necessary to obtain data that can be used to connect a sequence to a specimen
  18. Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.
  19. Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.
  20. A workflow for specimen-level multigene datasets from NCBI
      • Download from NCBI [we used Geneious, but any bulk download is fine]
      • Parse out collector name, collector number, isolate number, geography
      • Manually clean collector names (3 days for >6500 records)
      • Identify specimens by unique combinations of collector name, collector number, isolate
      • Toss out “accessions” having more than one scientific name
      • Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
      • Export datasets to MUSCLE and align; export log file
      • Manually check alignments and code logfile (D, RC; variable)
      • Rerun MUSCLE and export RAxML batchfile
      • Analyze
      • Screen for non-monophyly; concatenate and continue!
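The specimen-identification and filtering steps in this workflow can be sketched as below. This is a Python illustration under stated assumptions, not the authors' R code: the record fields (`accession`, `name`, `collector`, `number`, `isolate`, `gene`), the accession strings, and the taxon names are hypothetical, and collector names are assumed to have been cleaned already.

```python
from collections import defaultdict

# Hypothetical parsed records; field names and values are placeholders.
records = [
    {"accession": "ACC001", "name": "Carex sp. 1", "collector": "Smith",
     "number": "101", "isolate": "", "gene": "ITS"},
    {"accession": "ACC002", "name": "Carex sp. 1", "collector": "Smith",
     "number": "101", "isolate": "", "gene": "ETS"},
    {"accession": "ACC003", "name": "Carex sp. 2", "collector": "Jones",
     "number": "55", "isolate": "A", "gene": "ITS"},
    {"accession": "ACC004", "name": "Carex sp. 3", "collector": "Jones",
     "number": "55", "isolate": "A", "gene": "matK"},
]


def specimen_key(rec):
    """Collector name + collector number + isolate identify one specimen."""
    return (rec["collector"].strip().lower(),
            rec["number"].strip(),
            rec["isolate"].strip())


def group_specimens(records):
    """Split specimens into those with one scientific name and the rest."""
    by_specimen = defaultdict(list)
    for rec in records:
        by_specimen[specimen_key(rec)].append(rec)

    kept, discarded = {}, {}
    for key, recs in by_specimen.items():
        names = {r["name"] for r in recs}
        # A specimen carrying more than one scientific name is tossed out.
        target = kept if len(names) == 1 else discarded
        target[key] = recs
    return kept, discarded


def gene_matrix(kept):
    """Map gene region -> {specimen key: accession} for the kept specimens."""
    matrix = defaultdict(dict)
    for key, recs in kept.items():
        for rec in recs:
            # If a specimen has several sequences of one region, the last wins.
            matrix[rec["gene"]][key] = rec["accession"]
    return dict(matrix)


kept, discarded = group_specimens(records)
matrix = gene_matrix(kept)
print(sorted(matrix), len(discarded))
```

Each entry of `matrix` corresponds to one per-region dataset that could then be written out for alignment. Records with no collector name or number would all collapse into one key here, so a real pipeline would need to set those aside first.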
  21. 6692 sequence records in Cariceae
  22. Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens! However, some NCBI records do contain this data. How do we access it?
  23. NCBI Specimen Record. The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen (for example, some records contain the qualifier specimen_voucher). To get this additional information, we need to export the data as an XML file and parse the data out into a usable tab-delimited file. Other good information to export
  24. We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced. To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
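A minimal sketch of pulling source qualifiers out of XML into a tab-delimited table follows. It assumes NCBI's INSDSeq XML layout (`INSDSeq`, `INSDQualifier_name`, `INSDQualifier_value`), which differs from the Geneious export with its `<qualifiers1>` fields described on the slide; the sample record, accession, and voucher string are invented for illustration, and the code is Python rather than the authors' R.

```python
import xml.etree.ElementTree as ET

# Invented sample in the INSDSeq XML layout (assumption: not the Geneious export).
SAMPLE_XML = """<INSDSet>
  <INSDSeq>
    <INSDSeq_primary-accession>ACC001</INSDSeq_primary-accession>
    <INSDSeq_organism>Carex sp. 1</INSDSeq_organism>
    <INSDSeq_feature-table>
      <INSDFeature>
        <INSDFeature_key>source</INSDFeature_key>
        <INSDFeature_quals>
          <INSDQualifier>
            <INSDQualifier_name>specimen_voucher</INSDQualifier_name>
            <INSDQualifier_value>Smith 101 (XYZ)</INSDQualifier_value>
          </INSDQualifier>
          <INSDQualifier>
            <INSDQualifier_name>country</INSDQualifier_name>
            <INSDQualifier_value>Chile</INSDQualifier_value>
          </INSDQualifier>
        </INSDFeature_quals>
      </INSDFeature>
    </INSDSeq_feature-table>
  </INSDSeq>
</INSDSet>"""

# Source qualifiers that help tie a sequence to a specimen.
WANTED = ("specimen_voucher", "isolate", "country", "lat_lon",
          "collection_date", "collected_by")


def parse_records(xml_text):
    """Return one dict per sequence record with the wanted qualifiers."""
    rows = []
    root = ET.fromstring(xml_text)
    for seq in root.iter("INSDSeq"):
        row = {
            "accession": seq.findtext("INSDSeq_primary-accession", default=""),
            "organism": seq.findtext("INSDSeq_organism", default=""),
        }
        for qual in seq.iter("INSDQualifier"):
            name = qual.findtext("INSDQualifier_name", default="")
            # Keep the first value seen for each wanted qualifier.
            if name in WANTED and name not in row:
                row[name] = qual.findtext("INSDQualifier_value", default="")
        rows.append(row)
    return rows


def to_tsv(rows):
    """Render rows as tab-delimited text; missing qualifiers become empty."""
    header = ("accession", "organism") + WANTED
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(row.get(col, "") for col in header))
    return "\n".join(lines)


rows = parse_records(SAMPLE_XML)
tsv = to_tsv(rows)
print(tsv)
```

Records lacking a `specimen_voucher` qualifier simply get an empty column, which makes it easy to see how many sequences cannot be linked to a specimen.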
  25. 6692 sequence records → 3004 individuals, 54 genes, 5846 sequences
  26. ITS, ETS, matK, trnL-trnF
      • 3,370 DNA sequences
      • 2,196 individuals
      • 723 spp
      • 397 spp > 1 individual
      • 31.7% of those spp monophyletic
  27. [Diagram] Identify gaps in our knowledge and sampling; Formulate sampling plan; New collections; DNA sequences; DNA matrices; Multiple alignments; Species tree estimates; Revised classification; A central database for specimen-level data
  31. III. Generating maps from specimen data
  32. Carex macloviana D’Urv; GBIF map, 2013-07-06
  33. Mapping GBIF Data
      • Generate species list to extract GBIF data (i.e. accepted names in World Checklist)
      • Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data
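The idea of wrapping the download so that errors and empty results are logged rather than halting the run can be sketched generically. The authors wrap `dismo::gbif` in R; the Python below does not call GBIF at all. `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in for whatever function performs the real download.

```python
import logging

logger = logging.getLogger("gbif_download")


def download_all(species_names, fetch):
    """Call fetch(name) per species; return (results, failures).

    `fetch` is any callable returning a list of records. Errors and empty
    results are logged and recorded instead of stopping the loop.
    """
    results, failures = {}, {}
    for name in species_names:
        try:
            records = fetch(name)
        except Exception as err:  # keep going whatever the download raised
            logger.warning("download failed for %s: %s", name, err)
            failures[name] = f"error: {err}"
            continue
        if not records:
            logger.warning("no records returned for %s", name)
            failures[name] = "no records"
            continue
        results[name] = records
    return results, failures


def fake_fetch(name):
    """Stand-in for a real download call (assumption, for illustration)."""
    canned = {
        "Carex sp. 1": [{"lat": 1.0, "lon": 2.0}],
        "Carex sp. 2": [],
    }
    if name not in canned:
        raise ConnectionError("simulated timeout")
    return canned[name]


results, failures = download_all(
    ["Carex sp. 1", "Carex sp. 2", "Carex sp. 3"], fake_fetch
)
print(sorted(results), failures)
```

Keeping `failures` alongside `results` gives a list of names to retry or to check by hand against the species list.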
  34. Clean up downloaded GBIF data
      • Flag duplicate specimen datasets
        – Flags specimens within the same species that have identical coordinates.
        – This should be expanded to include specimens that have identical locality descriptions.
      • Flag imprecise location data
        – Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
        – This threshold could be adjusted, but is tailored to the Worldclim database we are using (2.5 arc minutes).
      • Create a delimited file for each species containing specimen data with flagged columns (a reference file of which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
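The two flags described here can be sketched in a few lines. This is a Python illustration, not the authors' `clean_gbif` R code; the field names and sample coordinates are hypothetical, coordinates are kept as strings so that their written precision can be counted, and flagging only the second and later copies of a duplicate is an assumption about how the flag is applied.

```python
def decimal_places(text):
    """Number of digits after the decimal point in a coordinate string."""
    text = text.strip()
    return len(text.split(".", 1)[1]) if "." in text else 0


def flag_records(records, min_decimals=2):
    """Add 'duplicate' and 'imprecise' flags to copies of the records.

    duplicate: same species and identical coordinates as an earlier record.
    imprecise: latitude written with fewer than `min_decimals` decimals,
               i.e. precise only to the degree or to a tenth of a degree.
    """
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], float(rec["lat"]), float(rec["lon"]))
        out = dict(rec)  # leave the input untouched
        out["duplicate"] = key in seen
        seen.add(key)
        out["imprecise"] = decimal_places(rec["lat"]) < min_decimals
        flagged.append(out)
    return flagged


# Hypothetical records for illustration.
records = [
    {"species": "Carex sp. 1", "lat": "45.1234", "lon": "-93.5678"},
    {"species": "Carex sp. 1", "lat": "45.1234", "lon": "-93.5678"},
    {"species": "Carex sp. 1", "lat": "46.5", "lon": "-94.25"},
    {"species": "Carex sp. 2", "lat": "45.1234", "lon": "-93.5678"},
    {"species": "Carex sp. 1", "lat": "47", "lon": "-95"},
]

flagged = flag_records(records)
for rec in flagged:
    print(rec["species"], rec["lat"], rec["duplicate"], rec["imprecise"])
```

Because records are flagged rather than dropped, the per-species file still documents which rows were used or excluded in mapping. Counting written decimals cannot distinguish a genuinely precise value that happens to end in zeros from a rounded one.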
  35. Example of a file generated from clean_gbif
  36. Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)
      • Maps need to be manually checked for accuracy and completeness
      • We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon
      • Map reviewing is conducted in a dedicated SP2 forum
  37. There are bugs to work out, though. Some taxa are missing data. Example: Carex humilis
      • Map of 2331 specimen records from R code download
      • Individual species download from the website
        – Filtered for specimens with coordinate data (= 7209 records)
        – Missing records include some from France, Japan, & South Korea
  38. Some maps will need adjustments: in next iterations, it should be possible to automate some of this
      • Carex alata specimen is missing a “-” in longitude column
      • Carex lanceolata has specimens where the latitude and longitude are switched
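One way such checks might be automated is sketched below in Python. It is a heuristic under stated assumptions, not the authors' method: a switch is only detectable when the latitude falls outside ±90, and the sign check assumes most of a species' records lie in the western hemisphere, so species spanning both hemispheres would produce false flags. The coordinates are invented.

```python
def suggest_fixes(records):
    """Return (index, note) pairs for suspicious (lat, lon) records.

    `records` holds (latitude, longitude) floats for a single species.
    """
    # Longitudes of records whose latitude is at least plausible.
    lons = [lon for lat, lon in records if abs(lat) <= 90]
    negative_majority = sum(1 for lon in lons if lon < 0) > len(lons) / 2

    notes = []
    for i, (lat, lon) in enumerate(records):
        if abs(lat) > 90 and abs(lon) <= 90:
            notes.append((i, "latitude out of range; latitude and longitude "
                             "may be switched"))
        elif negative_majority and lon > 0:
            notes.append((i, "longitude sign differs from most records; "
                             "a '-' may be missing"))
    return notes


# Invented coordinates for one species.
records = [
    (44.9, -93.2),
    (45.1, -92.8),
    (43.0, -89.4),
    (44.5, 93.1),   # same place as its neighbours if the sign is restored
    (-93.0, 45.2),  # impossible latitude
]

notes = suggest_fixes(records)
for index, note in notes:
    print(index, note)
```

The output is a list of suggestions for a reviewer, in keeping with the manual map check, rather than an automatic correction.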
  39. In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*. * See Thursday talk for exciting findings in subgenus Vignea!
  40. We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation. Code is available at: https://mor-systematics.googlecode.com/svn/trunk/cariceae
  41. [Diagram] Identify gaps in our knowledge and sampling; Formulate sampling plan; New collections; DNA sequences; DNA matrices; Multiple alignments; Species tree estimates; Revised classification; A central database for specimen-level data
  42. If there is time, I’ll take questions!
