Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading

  1. Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading. Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
  2. A set of tools for Cariceae informatics. Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
  3. [Workflow diagram] Identify gaps in our knowledge and sampling; Formulate sampling plan; New collections; DNA sequences; DNA matrices; Multiple alignments; Species tree estimates; Revised classification; A central database for specimen-level data
  4. What tools do we need?
     • An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
     • A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
     • A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
     • A platform for collaboration – a virtual research environment to bring together researchers worldwide
  5. I. A hierarchical checklist and sampling progress reports
  6. In 2011:
     • A flat checklist exported from WCM
     • A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
     • A vague idea of what trips are needed
     Today:
     • A hierarchical checklist by subgenus, section
     • A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
     • A concrete sampling plan with trips and taxa identified*
     * Okay, we’re working on this one!
  7. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s). We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.
  8. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s)
  9. Spring 2012: Hierarchical checklist. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s); !
  10. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s); !
  11. [Diagram] Specimen Record; Tissue; Extraction; DNA seq.; Metadata flow; DNA seq.; DNA seq.
  12. A centralized workflow
     • Spreadsheets imported into a single Excel file
     • Names cleaned (variable)
     • DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
     • Names matched to our Scratchpads checklist
     • All files exported to CSV
     • Sample sheets and SP checklist imported to R
     • DNA records added to checklist as nodes that are children to their taxa
     • Hierarchical checklist exported in text format, with unsampled taxa marked for searching
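The last two steps of this workflow (attaching DNA records to their taxa as child nodes, then exporting an indented text checklist with unsampled taxa marked) can be sketched roughly as follows. This is a minimal Python illustration, not the authors' R code: the function name `build_checklist`, the `[UNSAMPLED]` marker, the section labels, and the voucher strings are all assumptions made up for the example.

```python
from collections import defaultdict


def build_checklist(taxa, dna_records):
    """Return an indented text checklist.

    taxa        : list of (section, taxon) pairs (hypothetical input layout)
    dna_records : list of (taxon, voucher_label) pairs from the lab sample sheets
    """
    # Group DNA records under the taxon they belong to (child nodes).
    vouchers = defaultdict(list)
    for taxon, voucher in dna_records:
        vouchers[taxon].append(voucher)

    # Group taxa under their section.
    sections = defaultdict(list)
    for section, taxon in taxa:
        sections[section].append(taxon)

    lines = []
    for section in sorted(sections):
        lines.append(section)
        for taxon in sorted(sections[section]):
            records = vouchers.get(taxon, [])
            if records:
                lines.append(f"  {taxon} [{len(records)} DNA record(s)]")
                for voucher in sorted(records):
                    lines.append(f"    DNA: {voucher}")
            else:
                # A fixed marker makes unsampled taxa easy to search for.
                lines.append(f"  {taxon} [UNSAMPLED]")
    return "\n".join(lines)


# Illustrative data only: section labels and vouchers are invented.
example_taxa = [("Section A", "Carex alata"), ("Section A", "Carex humilis")]
example_dna = [("Carex alata", "Lab1 voucher 101"), ("Carex alata", "Lab2 voucher 7")]
checklist_text = build_checklist(example_taxa, example_dna)
print(checklist_text)
```

Searching the exported text for the marker then lists every taxon that still needs sampling.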
  13. [Annotated checklist screenshot] ← Section name; ← Sampled taxon with its DNA vouchers and summaries; ← Unsampled taxon
  14. Because Kew has coded geography using TDWG standards, we can export geographic hit-lists
  15. [Diagram] Taxonomy; Specimen(s); DNA extraction(s); Sequence(s); Trace file(s) / contig(s); ! ! ! ?
  16. II. A specimen-level phylogenetic pipeline
  17. NCBI is a morass of data. Geneious:
     • Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
     • Export as
       – FASTA
       – TAB-Delim
       – XML: the only export that maintains all information in NCBI; necessary to obtain data that can be used to connect a sequence to a specimen
  18. Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.
  19. Hinchliff and Roalson. 2013. Systematic Biology 62: 205–219.
  20. A workflow for specimen-level multigene datasets from NCBI
     • Download from NCBI [we used Geneious, but any bulk download is fine]
     • Parse out collector name, collector number, isolate number, geography
     • Manually clean collector names (3 days for >6500 records)
     • Identify specimens by unique combinations of collector name, collector number, isolate
     • Toss out “accessions” having more than one scientific name
     • Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
     • Export datasets to MUSCLE and align; export log file
     • Manually check alignments and code logfile (D, RC; variable)
     • Rerun MUSCLE and export RAxML batchfile
     • Analyze
     • Screen for non-monophyly; concatenate and continue!
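Two of the steps above (identifying specimens by unique combinations of collector name, collector number and isolate, then tossing out specimens that carry more than one scientific name) can be sketched as below. This is a hedged Python illustration rather than the authors' R implementation; the record layout, the field names `collector`, `number`, `isolate`, `organism`, and the sample records are assumptions.

```python
from collections import defaultdict


def specimen_key(rec):
    """Unique specimen key: cleaned collector name, collector number, isolate."""
    return (
        rec["collector"].strip().lower(),
        rec["number"].strip(),
        rec.get("isolate", "").strip(),
    )


def group_by_specimen(records):
    """Group sequence records by specimen; separate out multi-name specimens."""
    groups = defaultdict(list)
    for rec in records:
        groups[specimen_key(rec)].append(rec)

    kept, discarded = {}, {}
    for key, recs in groups.items():
        names = {r["organism"] for r in recs}
        # A specimen whose sequences carry more than one scientific name is ambiguous.
        if len(names) == 1:
            kept[key] = recs
        else:
            discarded[key] = recs
    return kept, discarded


# Invented records for illustration only.
example_records = [
    {"collector": "Smith", "number": "101", "isolate": "", "organism": "Carex alata", "gene": "ITS"},
    {"collector": "smith ", "number": "101", "isolate": "", "organism": "Carex alata", "gene": "ETS"},
    {"collector": "Jones", "number": "7", "isolate": "a", "organism": "Carex humilis", "gene": "ITS"},
    {"collector": "Jones", "number": "7", "isolate": "a", "organism": "Carex lanceolata", "gene": "matK"},
]
kept_specimens, discarded_specimens = group_by_specimen(example_records)
print(len(kept_specimens), "specimens kept;", len(discarded_specimens), "discarded")
```

Lower-casing and trimming stands in here for the manual cleaning of collector names, which in practice took far more judgment than a one-line normalization.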
  21. 6692 sequence records in Cariceae
  22. Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens! However, some NCBI records do contain this data. How do we access it?
  23. NCBI Specimen Record. The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen (for example, some records contain the qualifier specimen_voucher). To get this additional information, we need to export the data as an XML file, and parse the data out into a usable tab-delimited file. Other good information to export
  24. We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced. To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs). For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
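The parsing step described here can be sketched in Python with the standard library. The XML layout below (`<record>`, `<organism>`, `<qualifier name="...">`) is a simplified stand-in invented for the example, not the actual NCBI or Geneious export schema; the accession numbers, collector names, and the `make_voucher_id` rule (last name, first run of digits in the voucher string, isolate) are likewise assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; the real export nests qualifiers differently.
SAMPLE_XML = """<records>
  <record accession="XX000001">
    <organism>Carex alata</organism>
    <qualifier name="specimen_voucher">A. Collector 1234 (HERB)</qualifier>
    <qualifier name="collected_by">A. Collector</qualifier>
    <qualifier name="isolate">iso7</qualifier>
    <qualifier name="country">USA</qualifier>
  </record>
  <record accession="XX000002">
    <organism>Carex humilis</organism>
    <qualifier name="specimen_voucher">B. Example 56</qualifier>
    <qualifier name="collected_by">B. Example</qualifier>
  </record>
</records>"""


def make_voucher_id(quals):
    """Build a voucher ID from last name, collection number and DNA isolate."""
    name_parts = quals.get("collected_by", "").split()
    last_name = name_parts[-1].lower() if name_parts else "unknown"
    match = re.search(r"\d+", quals.get("specimen_voucher", ""))
    number = match.group(0) if match else "s.n."
    isolate = quals.get("isolate", "")
    return "_".join(part for part in (last_name, number, isolate) if part)


def parse_records(xml_text):
    """Flatten each record's qualifiers into one row per sequence."""
    rows = []
    for rec in ET.fromstring(xml_text).iter("record"):
        quals = {q.get("name"): (q.text or "").strip() for q in rec.iter("qualifier")}
        rows.append({
            "accession": rec.get("accession", ""),
            "organism": (rec.findtext("organism") or "").strip(),
            "voucher_id": make_voucher_id(quals),
            "country": quals.get("country", ""),
        })
    return rows


def to_tab_delimited(rows):
    """Render rows as a tab-delimited table with a header line."""
    cols = ["accession", "organism", "voucher_id", "country"]
    lines = ["\t".join(cols)]
    lines += ["\t".join(row[c] for c in cols) for row in rows]
    return "\n".join(lines)


parsed_rows = parse_records(SAMPLE_XML)
print(to_tab_delimited(parsed_rows))
```

Records that lack both a collector and a collection number fall through to a placeholder ID, which is one way to surface the sequences that cannot be tied to a voucher.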
  25. 6692 sequence records → 3004 individuals, 54 genes, 5846 sequences
  26. ITS, ETS, matK, trnL-trnF: 3,370 DNA sequences; 2,196 individuals; 723 spp; 397 spp > 1 individual; 31.7% of those spp monophyletic
  27. [Workflow diagram] Identify gaps in our knowledge and sampling; Formulate sampling plan; New collections; DNA sequences; DNA matrices; Multiple alignments; Species tree estimates; Revised classification; A central database for specimen-level data
  28. [Workflow diagram, repeated]
  29. [Workflow diagram, repeated]
  30. [Workflow diagram, repeated]
  31. III. Generating maps from specimen data
  32. Carex macloviana D’Urv. GBIF map, 2013-07-06
  33. Mapping GBIF Data
     • Generate species list to extract GBIF data (i.e. accepted names in World Checklist)
     • Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
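The idea of wrapping the download call so that errors and missing data are captured and logged, rather than halting a long batch, can be sketched as follows. This is a generic Python illustration, not the authors' R wrapper around dismo::gbif: `download_all` is a hypothetical name, and `fetch` is any caller-supplied function that returns a list of records for a species name (the `fake_fetch` below only simulates one).

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("gbif_download")


def download_all(species_list, fetch):
    """Call fetch(name) per species; record failures instead of stopping."""
    results, failures = {}, {}
    for name in species_list:
        try:
            records = fetch(name)
        except Exception as exc:  # keep the batch running on any download error
            failures[name] = f"error: {exc}"
            log.warning("download failed for %s: %s", name, exc)
            continue
        if not records:
            failures[name] = "no records"
            log.warning("no records returned for %s", name)
            continue
        results[name] = records
    return results, failures


def fake_fetch(name):
    """Stand-in for a real download call, for demonstration only."""
    if name == "Carex alata":
        return [{"lat": "41.8123", "lon": "-88.0712"}]
    if name == "Carex humilis":
        return []
    raise ConnectionError("simulated timeout")


downloaded, failed = download_all(
    ["Carex alata", "Carex humilis", "Carex lanceolata"], fake_fetch
)
print(sorted(downloaded), failed)
```

Keeping the failure log alongside the results makes it possible to retry only the species that failed.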
  34. Clean up downloaded GBIF data
     • Flag duplicate specimen datasets
       – Flags specimens within the same species that have identical coordinates.
       – This should be expanded to include specimens that have identical locality descriptions.
     • Flag imprecise location data
       – Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
       – This threshold could be adjusted, but is tailored to the WorldClim database we are using (2.5 arc minutes).
     • Create a delimited file for each species containing specimen data with flagged columns (a reference file of which data are utilized or excluded in the mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
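The two flags described above can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' `clean_gbif` R code: coordinates are kept as text so that the number of decimal places written in the record is preserved, the first record at a given coordinate is left unflagged while later ones are flagged as duplicates, and the names `flag_records`, `dup_flag`, and `imprecise_flag` are invented. For scale, 2.5 arc minutes is about 0.042 degrees, so a latitude given only to a tenth of a degree (6 arc minutes) is coarser than the climate grid.

```python
def decimal_places(coord_text):
    """Number of digits written after the decimal point in a coordinate string."""
    _, _, fraction = coord_text.strip().partition(".")
    return len(fraction)


def flag_records(records, min_decimals=2):
    """Add dup_flag and imprecise_flag columns to copies of the input records."""
    seen = set()
    flagged = []
    for rec in records:
        key = (rec["species"], rec["lat"], rec["lon"])
        row = dict(rec)
        # Duplicate: same species with identical coordinates already seen.
        row["dup_flag"] = key in seen
        seen.add(key)
        # Imprecise: latitude written to the degree or a tenth of a degree only.
        row["imprecise_flag"] = decimal_places(rec["lat"]) < min_decimals
        flagged.append(row)
    return flagged


# Invented coordinates for illustration only.
example_gbif = [
    {"species": "Carex alata", "lat": "41.8", "lon": "-88.07"},
    {"species": "Carex alata", "lat": "41.8123", "lon": "-88.0712"},
    {"species": "Carex alata", "lat": "41.8123", "lon": "-88.0712"},
    {"species": "Carex humilis", "lat": "48", "lon": "11.5"},
]
flagged_rows = flag_records(example_gbif)
for row in flagged_rows:
    print(row["species"], row["lat"], row["lon"], row["dup_flag"], row["imprecise_flag"])
```

Because flags are added as columns rather than rows being deleted, the output can serve as the archived reference file of what was used or excluded in mapping.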
  35. Example of a file generated from clean_gbif
  36. Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)
     • Maps need to be manually checked for accuracy and completeness
     • We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon
     • Map reviewing is conducted in a dedicated SP2 forum
  37. There are bugs to work out, though. Some taxa are missing data. Example: Carex humilis
     • Map of 2331 specimen records from R code download
     • Website individual species download
       – Filtered for specimens with coordinate data (= 7209 records)
       – Missing records include some from France, Japan, & South Korea
  38. Some maps will need adjustments: in next iterations, it should be possible to automate some of this.
     • Carex alata specimen is missing a “-” in longitude column
     • Carex lanceolata has specimens where the latitude and longitude are switched.
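One possible way to automate part of this checking is sketched below in Python. Both heuristics are assumptions offered for illustration, not the authors' method: `looks_swapped` only catches swaps that push the latitude outside the valid range of -90 to 90, and `flag_sign_outliers` flags longitudes whose sign disagrees with the majority of a species' records, which will also flag legitimate records for species that span both hemispheres. Flagged records would still need manual review.

```python
def looks_swapped(lat, lon):
    """True if latitude is out of range but would be valid as a longitude swap."""
    return abs(lat) > 90 and abs(lon) <= 90


def flag_sign_outliers(longitudes):
    """Flag longitudes whose sign differs from the majority sign in the list."""
    negative = sum(1 for x in longitudes if x < 0)
    positive = sum(1 for x in longitudes if x > 0)
    if negative == positive:
        # No clear majority, so nothing can be flagged with this heuristic.
        return [False] * len(longitudes)
    if negative > positive:
        return [x > 0 for x in longitudes]
    return [x < 0 for x in longitudes]


# Invented coordinates for illustration only.
print(looks_swapped(135.2, 34.7))                        # latitude out of range
print(flag_sign_outliers([-88.1, -87.9, 88.0, -90.2]))   # third value lacks the "-"
```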
  39. In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*.
     * See Thursday talk for exciting findings in subgenus Vignea!
  40. We’ve been writing these tools in R, for the simple reason that that’s what we know. Bits could easily be ported to PHP for integration into Scratchpads, or Python for web implementation. Code is available at: https://mor-systematics.googlecode.com/svn/trunk/cariceae
  41. [Workflow diagram, repeated]
  42. If there is time, I’ll take questions!
