iPhy tools for collation and analysis of phylogenomic data. M Blaxter



  1. Phyloinformatics Workshop Edinburgh 2007. iPhy tools for collation and analysis of phylogenomic data. Martin Jones and Mark Blaxter
  2. [Figure: rooted tree of eukaryotes; legible group labels include cercozoa, alveolates, foraminiferans, diatoms, oomycetes, cryptophytes, haptophytes, oxymonads, discicristates, opisthokonts and excavates]
  3. 1: Forests of trees, and loads of kindling
      2: Organising principles
      3: iPhy design
      4: iPhy deployment
      5: Nameless taxa & endless forms
  4. 1: Forests of trees, and loads of kindling. Phylogenetics is a growth area. The raw materials (sequences) are being added at a startling rate. Tree databases are also growing (both in number and size). So how does a lab worker bee keep up?
  5. (10/05/2006) Metazoan Phyla: Sequences per phylum [Bar chart, log scale from 1 to 100,000,000; one bar per phylum: Porifera, Placozoa, Buddenbrockia, Myxozoa, Mesozoa, Ctenophora, Cnidaria, Micrognathozoa, Cycliophora, Acoelomorpha, Gnathostomulida, Seisonidea, Rotifera, Gastrotricha, Sipuncula, Nemertea, Mollusca, Entoprocta, Bryozoa, Brachiopoda, Pogonophora, Echiura, Annelida, Platyhelminthes, Nematomorpha, Nematoda, Kinorhyncha, Acanthocephala, Priapulida, Tardigrada, Onychophora, Arthropoda, Xenoturbellida, Enteropneusta, Hemichordata, Echinodermata, Chordata, Chaetognatha]
  6. (10/05/2006) Metazoan Phyla: Species per phylum [Bar chart, log scale from 1 to 10,000,000; one bar per phylum: Porifera, Placozoa, Buddenbrockia, Myxozoa, Mesozoa, Ctenophora, Cnidaria, Micrognathozoa, Cycliophora, Acoelomorpha, Gnathostomulida, Seisonidea, Rotifera, Gastrotricha, Sipuncula, Nemertea, Mollusca, Entoprocta, Bryozoa, Brachiopoda, Pogonophora, Echiura, Annelida, Platyhelminthes, Nematomorpha, Nematoda, Kinorhyncha, Acanthocephala, Priapulida, Tardigrada, Onychophora, Arthropoda, Xenoturbellida, Enteropneusta, Hemichordata, Echinodermata, Chordata, Chaetognatha]
  7. (10/05/2006) Metazoan Phyla: Sequences per species [Bar chart, log scale from 0.1 to 1000; one bar per phylum: Porifera, Placozoa, Buddenbrockia, Myxozoa, Mesozoa, Ctenophora, Cnidaria, Micrognathozoa, Cycliophora, Acoelomorpha, Gnathostomulida, Seisonidea, Rotifera, Gastrotricha, Sipuncula, Nemertea, Mollusca, Entoprocta, Bryozoa, Brachiopoda, Pogonophora, Echiura, Annelida, Platyhelminthes, Nematomorpha, Nematoda, Kinorhyncha, Acanthocephala, Priapulida, Tardigrada, Onychophora, Arthropoda, Xenoturbellida, Enteropneusta, Hemichordata, Echinodermata, Chordata, Chaetognatha]
  8. 1: Forests of trees, and loads of kindling. Phylogenetics is a growth area. The raw materials (sequences) are being added at a startling rate. Tree databases are also growing (both in number and size). So how does a lab worker bee keep up?
  9. From Rod Page, “Towards a Taxonomically Intelligent Phylogenetic Database” [Line chart: cumulative number of molecular phylogenies and of TreeBASE studies by year, 1980 to 2000; y-axis 0 to 7000]
  10. Two modes of data acquisition
      (a) wet lab - compute lab synergy
      • explicitly source the sequences needed
      • preformed ideas of the best taxa to sample and the best genes to sample
      [this is the source of most phylogenetic data]
  11. Two modes of data acquisition
      (a) wet lab - compute lab synergy
      (b) magpie surfing / tree surgery: using phyloinformatic tools to discover the set of available genes AND taxa to address a particular problem
  12. 2: Organising principles. On average …
      • more data are better: more taxa, more genes
      • multiple methods are better
  13. 2: Organising principles
      • assess all relevant taxa
      • assess all relevant sequence
  14. While the NCBI taxonomy isn’t the best in the world, at least every sequence is attached to a taxon, and TAX_IDs are unique.
  15. The Edinburgh EST analysis Pipeline
      • trace2dbest: process raw sequence traces; trim off vector & low quality
      • CLOBB: cluster into putative gene objects; predict consensus sequence
      • prot4EST: predict translation reading frame; generate protein translation
      • annot8r: annotate using BLAST, GOtcha, PSort, Pfam, SigPep, KEGG
      • PartiGene: collate information in relational database
  16. NEMBASE3 http://www.nematodes.org/ The web portal to NEMBASE3. Mark Blaxter, James Wasmuth, Ann Hedley & Ralf Schmid, University of Edinburgh, Institute of Evolutionary Biology, Edinburgh UK EH9 3JT. mark.blaxter@ed.ac.uk
  17. NEMBASE3 http://www.nematodes.org/ Collectors’ curve of nematode protein families [Plot: number of families (0 to 50000) against total number of proteins (0 to 150000); species labelled along the curve: Trichinella spiralis, Brugia malayi, Meloidogyne incognita, Strongyloides stercoralis, Ancylostoma caninum, Caenorhabditis elegans; curves marked A, B, C]
  18. NEMBASE3 http://www.nematodes.org/ Earliest origins of nematode protein families [Tree figure of NEMATODA with counts on branches and tips. Dorylaimia (Clade I): Dorylaimida (1610), Trichinellida (2571). Spirurina (Clade III): Ascaridomorpha (3695), Spiruromorpha (5188). Tylenchina (Clade IV): Panagrolaimomorpha (2678), Tylenchomorpha, Cephalobomorpha. Rhabditina (Clade V): Strongyloidea (6120), Rhabditoidea (3674), Diplogasteromorpha (1356). Clades III to V are grouped as Rhabditida.]
  19. 2: Organising principles
      • assess all relevant taxa
      • assess all relevant sequence
      • store aligned sequences locally
      • output ‘slices’ of data in analysis-ready formats
  20. Many taxa, missing data [Matrix figure: genes a to i against taxa 1 to 9, with gaps where data are missing]
  21. Generating a slice that
      • maximises taxonomic coverage
      • maximises present data/minimises missing data
      [Matrix figure: the slice keeps genes a, b, e, f, g, i and taxa 1, 3, 7, 9]
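
The slice idea on slides 20 and 21 can be sketched in a few lines. This is a minimal illustration, not the iPhy implementation: the presence data below are invented so that the answer matches the slide, the function name `best_complete_slice` is hypothetical, and the search is brute force over gene subsets, so it only suits small gene counts. It looks for the largest block of genes and taxa with no missing cells (in graph terms, a biclique of the taxon-gene presence graph).

```python
from itertools import combinations

# Hypothetical presence data: taxon -> set of genes sequenced for that taxon.
presence = {
    1: set("abefgi"),
    2: set("acd"),
    3: set("abefghi"),
    4: set("bc"),
    5: set("dh"),
    6: set("ag"),
    7: set("abdefgi"),
    8: set("ci"),
    9: set("abcefgi"),
}


def best_complete_slice(presence, min_genes=2, min_taxa=2):
    """Return (genes, taxa) for the largest block with no missing cells.

    Size is measured as genes x taxa. Every gene subset is tried, which is
    exponential in the number of genes.
    """
    all_genes = sorted(set().union(*presence.values()))
    best = (frozenset(), frozenset())
    best_cells = 0
    for k in range(min_genes, len(all_genes) + 1):
        for genes in combinations(all_genes, k):
            wanted = set(genes)
            # Taxa that have every gene in this subset.
            taxa = [t for t, have in presence.items() if wanted <= have]
            if len(taxa) < min_taxa:
                continue
            cells = len(taxa) * k
            if cells > best_cells:
                best_cells = cells
                best = (frozenset(genes), frozenset(taxa))
    return best


genes, taxa = best_complete_slice(presence)
print(sorted(genes), sorted(taxa))
```

A real dataset would normally tolerate some missing cells rather than insist on a complete block; that trade-off is left out here to keep the sketch short.
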
  22. 2: Organising principles
      • assess all relevant taxa
      • assess all relevant sequence
      • store aligned sequences locally
      • output ‘slices’ of data in analysis-ready formats
      • store trees locally
      • store alternative taxonomic systems
  23. [Figure: tree of animal phyla and outgroups under two column headings, “Complete genome sequences (Philippe et al.)” and “Including neglected taxa ESTs”; taxa: Platyhelminthes, Annelida, Mollusca, Tardigrada, Nematoda, Arthropoda, Vertebrata, Urochordata, Cephalochordata, Echinodermata, Ctenophora, Cnidaria, Choanoflagellata, Fungi; clade labels L, P, E, C, D]
  24. 3: iPhy design
  25. [Diagram: iPhy inputs and their processing. Input types shown: sequence, alignment, TreeFam, TreeBASE, user tree, systematic. Three processing routes:
      • identify relevant sequences and store locally; associate sequences and taxa
      • identify relevant sequences and store locally; capture tree data; reconcile tree nodes with existing systems
      • capture tree data; reconcile tree nodes with existing systems]
  26. [Diagram: inputs and processing routes as on slide 25, now feeding the iPhy database, which is coupled to an Alignment Cycle (POA, tranAlign)]
  27. [Diagram: inputs and processing routes as on slide 25, feeding the iPhy database; the database is coupled to an Alignment Cycle (POA, tranAlign) and an Orthologue Inference Engine (TreeFam, Ortho-MCL)]
  28. [Diagram: the iPhy database with its Alignment Cycle (POA, tranAlign) and Orthologue Inference Engine (TreeFam, Ortho-MCL), plus Dataset Exploration Tools: Slice Selector (maximal bicliques), Phylogenetics Cycle (PhyML, MrBayes, PAUP), Tree Comparer, ...]
  29. [Diagram: as on slide 28 (iPhy database, Alignment Cycle, Orthologue Inference Engine, Dataset Exploration Tools), with trees & alignments now passed on to Publication Quality Analyses]
  30. 4: iPhy deployment, version 0.1: ‘TaxMan’
  31. BMC Bioinformatics (BioMed Central), Software, Open Access. TaxMan: a taxonomic database manager. Martin Jones* and Mark Blaxter. Address: Institute of Evolutionary Biology, King's Buildings, Ashworth Laboratories, West Mains Road, Edinburgh EH9 3JT, UK. Email: Martin Jones* - martin.jones@ed.ac.uk; Mark Blaxter - mark.blaxter@ed.ac.uk. * Corresponding author. Published: 18 December 2006. Received: 11 October 2006. Accepted: 18 December 2006. BMC Bioinformatics 2006, 7:536. doi:10.1186/1471-2105-7-536. This article is available from: http://www.biomedcentral.com/1471-2105/7/536 © 2006 Jones and Blaxter; licensee BioMed Central Ltd.
  32. 4: iPhy deployment, version 0.1: ‘TaxMan’
      • TaxMan automates assembly of large sequence datasets for chosen taxa
      • TaxMan automates generation of aligned sequence sets for chosen genes
  33. 4: iPhy deployment, version 0.1: ‘TaxMan’. TaxMan simplifies selection of taxa for analysis
      • e.g. given a gene set, choosing one species per family (choosing the species with the least missing data)
      • e.g. given a taxon set, choosing the genes (choosing genes with less than a given % missing data)
      • e.g. generating custom-defined alignments
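
The two selection rules on slide 33 can be illustrated as follows. This is a sketch under stated assumptions, not TaxMan's code: the species and family names, the record layout and both function names are hypothetical, and the gene names are borrowed from the gene list on slide 36.

```python
# Hypothetical records: (species, family, genes present for that species).
records = [
    ("Species A1", "Family A", {"18S", "28S", "actin"}),
    ("Species A2", "Family A", {"18S"}),
    ("Species B1", "Family B", {"18S", "H3"}),
    ("Species B2", "Family B", {"18S", "28S", "H3", "EF1A"}),
    ("Species C1", "Family C", {"28S", "EF1A"}),
]
gene_set = ["18S", "28S", "actin", "H3", "EF1A"]


def one_species_per_family(records, gene_set):
    """For each family, pick the species missing the fewest genes in gene_set."""
    best = {}
    for species, family, present in records:
        missing = sum(1 for g in gene_set if g not in present)
        if family not in best or missing < best[family][1]:
            best[family] = (species, missing)
    return {family: species for family, (species, _) in best.items()}


def genes_below_missing(records, taxon_set, gene_set, max_missing_pct):
    """Keep genes whose percentage of taxa lacking them is below the threshold."""
    present_by_species = {s: p for s, _, p in records}
    kept = []
    for g in gene_set:
        missing = sum(1 for s in taxon_set if g not in present_by_species[s])
        if 100.0 * missing / len(taxon_set) < max_missing_pct:
            kept.append(g)
    return kept


chosen = one_species_per_family(records, gene_set)
taxon_set = sorted(chosen.values())
kept_genes = genes_below_missing(records, taxon_set, gene_set, max_missing_pct=50)
print(chosen)
print(kept_genes)
```

Ties between species in the same family are broken here by keeping the first one encountered; a real tool would need an explicit rule.
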
  34. 4: iPhy deployment, version 0.1: ‘TaxMan’. TaxMan simplifies analysis by exporting formatted alignments (NEXUS)
      • of nucleotides (with codon positions and genes as defined partitions)
      • of amino acids (with genes as defined partitions)
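
To make "codon positions and genes as defined partitions" concrete, the sketch below writes a NEXUS `sets` block with one `charset` per gene and one per codon position. It is an illustration only: the function name, gene names and lengths are hypothetical, and it assumes each gene starts in frame at codon position 1 in a concatenated nucleotide alignment. The slides do not show the block TaxMan itself emits.

```python
def nexus_sets_block(gene_lengths, coding=True):
    """Build a NEXUS 'sets' block from an ordered list of (gene, length) pairs.

    Each gene becomes a charset. When coding is True, three further charsets
    per gene select codon positions 1, 2 and 3 using the '\\3' stride.
    """
    lines = ["begin sets;"]
    start = 1
    for gene, length in gene_lengths:
        end = start + length - 1
        lines.append(f"  charset {gene} = {start}-{end};")
        if coding:
            for pos in range(3):
                lines.append(
                    f"  charset {gene}_pos{pos + 1} = {start + pos}-{end}\\3;"
                )
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


# Toy lengths, far shorter than real genes, to keep the output readable.
block = nexus_sets_block([("actin", 9), ("H3", 6)])
print(block)
```

For an amino acid alignment the same function can be called with `coding=False`, which leaves only the per-gene charsets.
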
  35. 4: iPhy deployment, version 0.1: ‘TaxMan’. TaxMan simplifies post-phylogenetic analysis by
      • saving trees (with links to the original data)
      • saving analytical metadata (algorithm, parameters, settings)
      • saving tree statistics (bootstraps, branch lengths)
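
One way to picture what slide 35 lists is a single record that keeps a tree together with its provenance. The sketch below is an assumption about shape only: the class name, field names, identifier format and parameter values are all hypothetical, and TaxMan's actual schema is not shown in the slides. Branch lengths and bootstrap support travel inside the Newick string.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TreeRecord:
    """One stored tree plus the context needed to interpret or reproduce it."""

    newick: str        # topology with branch lengths and support values
    alignment_id: str  # link back to the source alignment (hypothetical key)
    algorithm: str     # program used to build the tree
    parameters: dict = field(default_factory=dict)  # settings for the run


record = TreeRecord(
    newick="((A:0.1,B:0.2)95:0.05,C:0.3);",
    alignment_id="aln-0001",
    algorithm="PhyML",
    parameters={"model": "GTR", "bootstrap_replicates": 100},
)

# Round-trip through JSON to show the record survives storage unchanged.
serialised = json.dumps(asdict(record), sort_keys=True)
restored = TreeRecord(**json.loads(serialised))
print(serialised)
```
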
  36. Lophotrochozoa
      • 70,000 annotated sequences
      • 630,000 EST sequences
      • 21 genes (mt + 18S 28S actin H3 WG EF1A)
      • 53,000 sequences extracted
      • 17,000 aligned consensus sequences
      • 8,700 species represented
      • One day for data collection, one for alignment
  37. Molecular Phylogenetics and Evolution 43 (2007) 583–595, www.elsevier.com/locate/ympev. The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions. Martin Jones (a,*), Benjamin Gantenbein (b), Victor Fet (c), Mark Blaxter (a). (a) Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK. (b) AO Research Institute, Clavadelerstrasse 8, Davos Platz CH-7270, Switzerland. (c) Department of Biological Sciences, Marshall University, Huntington, WV 25755-2510, USA. Received 25 April 2006; revised 14 November 2006; accepted 14 November 2006. Available online 29 November 2006.
  38. 5: Nameless taxa & endless forms
  39. "... endless forms most beautiful and most wonderful have been, and are being, evolved" (Darwin 1859)
  40. http://www.nematodes.org/NeglectedGenomes/ARTHROPODA/Chelicerata.html
  41. Metazoan species per phylum [Bar chart, log scale from 1 to 100,000,000; one bar per group: Choanoflagellida, Porifera, Placozoa, Cnidaria, Ctenophora, Acoela, Mesozoa, Myxozoa, Nematoda, Nematomorpha, Loricifera, Kinorhyncha, Priapulida, Onychophora, Arthropoda, Tardigrada, Gastrotricha, Nemertea, Myzostomida, Gnathostomulida, Cycliophora, Platyhelminthes, Acanthocephala, Rotifera, Chaetognatha, Sipunculida, Bryozoa, Brachiopoda, Entoprocta, Annelida, Pogonophora, Echiura, Mollusca, Hemichordata, Echinodermata, Chordata]
  42. Organism-size curve, Eukaryotes [Plot: number of individuals (log scale: few, lots, squillions) against size of organism (log scale: minuscule, tiny, just visible, small, big); regions labelled POSSIBLE PREDATORS and FOOD ITEMS]
  43. Sourhope farm: NERC "Soil Biodiversity and Ecosystem Function" Programme Study Site. 120 m x 75 m of raw Scottish upland grass. 13 000 000 000 nematodes.
  44. MAN IS BVT A WORM
  45. Marine Nematode Barcodes [Tree figure: specimens coded 1000ED to 1179ED, each labelled by collection site (Loch Fyne as Fyne1 and Fyne2, Gullane, Orkney); scale bar: 5 changes]
  46. 5: Nameless taxa & endless forms. MOTU: Molecular Operational Taxonomic Units
  47. motu
      1. to cut; to snap off. motu-á te hau, the fishing line snapped off
      2. to engrave, to inscribe letters or pictures in stone or in wood, like the motu mo rogorogo, inscriptions for recitation in lines called kohau
      3. islet. Some names of islets: Motu Motiro Hiva, Motu Nui, Motu Iti, Motu Kaokao, Motu Tapu, Motu Marotiri, Motu Kau, Motu Tavake, Motu Tautara, Motu Ko Hepa Ko Maihori, Motu Hava
  48. 5: Nameless taxa & endless forms. MOTU
      • specimen-based surveys: CBoL Barcode of Life (CO1)
      • anonymous, specimen-free surveys: environmental sampling, bulk community DNA, millions of sequences
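
The slides do not say how MOTU are delimited. Purely as an assumption for illustration, one simple approach is to group aligned barcode sequences that differ at no more than a chosen number of positions, linking groups transitively (single linkage). The specimen names, sequences, cutoff and function names below are all hypothetical.

```python
def differences(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_motu(seqs, cutoff):
    """Single-linkage clusters: link sequences differing at <= cutoff positions."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if differences(seqs[a], seqs[b]) <= cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy aligned barcodes, far shorter than real ones.
barcodes = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "ACGTACGAAT",
    "s4": "TTGTCCGTGG",
}
motus = cluster_motu(barcodes, cutoff=1)
print(motus)
```

Note that s1 and s3 differ at two positions yet share a cluster because both are within one change of s2; that chaining is a property of single linkage, and the all-pairs comparison would not scale to the millions of sequences the slide mentions.
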
  49. 5: Nameless taxa & endless forms. ~1.2 million described species; ~10-100 million species in reality. Thus, most ‘species’ will never be formally named.
  50. 5: Nameless taxa & endless forms. How do we incorporate these myriad ‘nameless taxa’ into our systems?
  51. Credits
      • TaxMan, iPhy & chelicerate evolution: Martin Jones
      • MOTU and barcoding: Robin Floyd & Jenna Mann
      • PartiGene & EST analysis: Ralf Schmid, James Wasmuth & Ann Hedley
