Data Mining GenBank for Phylogenetic inference - T. Vision


Published in: Technology, Education

  1. Prospects for enabling phylogenetically informed comparative biology on the web
     Todd Vision (1,2) & Hilmar Lapp (1)
     (1) U.S. National Evolutionary Synthesis Center
     (2) Dept. of Biology, University of North Carolina at Chapel Hill

     Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
     • If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
     • Are these databases complete?
     • Are they infallible?
     • Are they still useful?

     Why are these data useful?
     • You needn’t have mastery of the specialist literature before the search
     • A match connects you to a vast interconnected world of information
     • Why not worry about completeness?
       - A negative result is not expensive
       - Many broadly useful records are already present
     • Why not worry about fallibility?
       - The user can weigh the evidence once a match is found
       - Assertions should be exposed to scrutiny
  2. Some observations
     • This infrastructure is designed to disseminate data to non-specialists
     • The relevant data may be derived from multiple “studies”, not all of which are published
     • Data is hoarded neither by the researcher nor by the domain database
     • The search service is as widely disseminated as the data
     • Semantic-level machine-to-machine communication facilitates human comprehension

     The case of phylogenetic data
     • There is a broad audience for phylogenetic data
       - Organismal phylogeny (e.g. Encyclopedia of Life)
       - Gene/protein trees
     • Many of the available resources are geared toward specialist researchers & students
     • Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
     • Few know where to find gene/protein trees at all

     TreeBase [screenshot]

     Tree of Life Web Project [screenshot]
  3. The NCBI taxonomy
     • Provides
       - A hierarchy for all species represented by DNA sequences in GenBank
       - Names and IDs for internal nodes
       - An FTP dump
     • But does NOT
       - Include unsequenced species
       - Report confidence in topology or monophyly
       - Capture taxonomic nuance (though it does have synonyms & common names)

     Node-oriented web services from the Tree of Life Web Project
     • Name
     • Description
     • Authority
     • Date
     • Other names
     • Completeness of children
     • Extinction status
     • Confidence of position
     • Monophyly

     What if the NCBI taxonomy…
     • Listed all taxa, including fossils?
     • Allowed one to assess where there are conflicting topologies?
     • Reported support values for clades?
     • Reported divergence time estimates for nodes (e.g. from TimeTree)?
     • Reported the provenance of the data?
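     To make the "hierarchy plus names and IDs, available as an FTP dump" point concrete, here is a minimal sketch of walking a lineage from taxdump-style rows. It assumes the nodes.dmp / names.dmp layout (fields separated by a tab, pipe, tab sequence; each row ending in tab, pipe). The sample rows are hand-made and heavily abbreviated, and the helper names (`parse_dmp`, `load_taxonomy`, `lineage`) are illustrative, not part of any NCBI tool.

```python
# Hand-made, abbreviated rows in the taxdump layout (fields separated by "\t|\t",
# rows terminated by "\t|"). Real nodes.dmp rows carry more columns, and the real
# lineage has many intermediate nodes; ranks and parent links here are simplified.
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "7742\t|\t1\t|\tno rank\t|\n"
    "7898\t|\t7742\t|\tclass\t|\n"
    "7955\t|\t7898\t|\tspecies\t|\n"
)
SAMPLE_NAMES = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "7742\t|\tVertebrata\t|\t\t|\tscientific name\t|\n"
    "7898\t|\tActinopterygii\t|\t\t|\tscientific name\t|\n"
    "7955\t|\tDanio rerio\t|\t\t|\tscientific name\t|\n"
    "7955\t|\tzebrafish\t|\t\t|\tgenbank common name\t|\n"
)


def parse_dmp(text):
    """Split taxdump-style text into lists of stripped fields, one list per row."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Drop the trailing "\t|" terminator, then split on the field separator.
        if line.endswith("\t|"):
            line = line[:-2]
        rows.append([field.strip() for field in line.split("\t|\t")])
    return rows


def load_taxonomy(nodes_text, names_text):
    """Return (child -> parent) and (tax_id -> scientific name) dictionaries."""
    parent_of = {row[0]: row[1] for row in parse_dmp(nodes_text)}
    name_of = {
        row[0]: row[1]
        for row in parse_dmp(names_text)
        if row[3] == "scientific name"  # skip synonyms and common names
    }
    return parent_of, name_of


def lineage(tax_id, parent_of):
    """Walk from a node up to the root, which is recorded as its own parent."""
    path = [tax_id]
    while parent_of[path[-1]] != path[-1]:
        path.append(parent_of[path[-1]])
    return path


parent_of, name_of = load_taxonomy(SAMPLE_NODES, SAMPLE_NAMES)
print(" > ".join(name_of[t] for t in reversed(lineage("7955", parent_of))))
```

     Note what such a walk cannot return: there is no field for support values, conflicting topologies, divergence times, or provenance, which is the gap the "What if" slide points at.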
  4. Further barriers to dissemination of phylogenetic information
     • Technical obstacles
       - Technology for storing and querying trees
       - Difficulties with exchange standards
       - Inference of consensus trees and supertrees
       - Taxonomic intelligence
       - Globally unique identifiers
     • Social obstacles
       - Reluctance to provide incomplete or fallible information

     Outline
     • Informatics @ NESCent
     • An example of a phylogenetically-informed semantic web application for phenotype data
     • Promoting interoperability and closing technical gaps in phyloinformatics through open development

     NESCent sponsored science
     • Catalysis Meetings (large, one-time events)
       - To foster new collaborations and synthetic research
     • Working Groups
       - Smaller, focused, multiple meetings
     • Sabbatical Scholars
     • Postdoctoral fellows
     • Short-term visitor program
       - 2 weeks to 3 months
       - Encourage collaborative projects
     • Application info:
  5. NESCent Informatics
     • Support for sponsored science and scientists
       - Facilitating electronic collaboration
       - Software/database development
       - Providing HPC and other IT infrastructure
     • Cyberinfrastructure for synthetic science
       - Data sharing
       - Software interoperability
       - Training
       - In partnership with major national and international efforts
     • Phylogenetic cyberinfrastructure to enable comparative biology

     Evolutionary Informatics WG
     • Organizers: Arlin Stoltzfus and Rutger Vos
     • Selected goals:
       - XML serialization of NEXUS
       - Formal grammar for validation and interconversion of NEXUS & other formats
       - A transition model language for evolutionary models used in statistical inference
       - An ontology for evolutionary comparative data analysis

     GeoPhyloBuilder
     “Putting the geography into phylogeography”
     David Kidd & Xianhua Liu
     • Extension for ArcGIS
     • Software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
     • 3D visualizations are possible through ArcSCENE.

     • Two traditions in the recording of phenotype data
       - Natural language descriptions and character matrices
       - Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
     • NESCent WG on morphological evolution in fish
       - Organized by Paula Mabee and Monte Westerfield
       - Led to a larger project
     • Aim is to integrate
       - Mutant phenotype data for zebrafish
       - Comparative morphology data for the Ostariophysi
  6. Describing phenotypes using ontologies
     • Entity-Quality system (EQ)
     • Entity term from an anatomy ontology
       - zebrafish anatomy, cell ontology, etc.
     • Quality term from Phenotype and Trait Ontology (PATO)
     • e.g. Entity=dorsal fin, Shape=round

     Ontologies
     • Defined terms with defined relationships
       - e.g. Gene Ontology, Cell Ontology
     [Diagram relating the terms cell, membrane, cell projection, axolemma, and axon through is_a and part_of relationships]

     Phenotype and Trait Ontology (PATO)
     [Diagram: excerpt of the PATO hierarchy with the terms quality, physical quality, optical quality, chromatic property, color (blue, green, bright blue, dark blue), buoyancy, and amplitude]

     Evolutionary character matrices
     • Common phenotypic data format in evolutionary biology (e.g. NEXUS)
     • Characters + character states, similar to EQ

                       dorsal fin shape    character 2
       Species one     round               state
       Species two     pointed             state
       Species three   undulate            state
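     The claim that characters plus character states are "similar to EQ" can be sketched as a small conversion. This is an illustration only: the `EQ` record, the `matrix_to_eq` helper, and the splitting of each character into an (entity, attribute) pair are assumptions made here, not the project's actual data model. The values are the dorsal fin example from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQ:
    """One Entity-Quality statement about a taxon (hypothetical record layout)."""
    taxon: str
    entity: str     # would be a term from an anatomy ontology
    attribute: str  # would be a PATO attribute, e.g. "shape"
    value: str      # would be a PATO value, e.g. "round"


def matrix_to_eq(characters, matrix):
    """Turn a character matrix into EQ statements.

    characters: list of (entity, attribute) pairs, one per matrix column
    matrix:     {taxon: [state, ...]} with None marking missing data
    """
    statements = []
    for taxon, states in matrix.items():
        for (entity, attribute), state in zip(characters, states):
            if state is None:
                continue  # nothing to assert for a missing cell
            statements.append(EQ(taxon, entity, attribute, state))
    return statements


# The single fully specified character from the slide's example table.
characters = [("dorsal fin", "shape")]
matrix = {
    "Species one": ["round"],
    "Species two": ["pointed"],
    "Species three": ["undulate"],
}

for statement in matrix_to_eq(characters, matrix):
    print(statement)
```

     In a real pipeline the free-text strings would be replaced by ontology term identifiers, which is what makes the statements comparable across studies.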
  7. Character Matrix vs. EQ
       Character matrix:  Character = dorsal fin shape;  Character state = round
       EQ:                Entity (from AO) = dorsal fin;  Quality (from PATO) = Attribute (shape) + Value (round)

     A scenario
     • A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
     • She asks: is this bone variable in number among species in nature?
     • She could query the evolutionary phenotype database using:
       - Entity = Branchiostegal ray (from TAO)
       - Qualities pertaining to attribute ‘count’ (from PATO)

     • She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
     • She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
       - solenostomids and syngnathids (ghost pipefishes and pipefishes)
       - giganturids
       - saccopharyngoid (gulper and swallower) eels

     • By examining additional changes on these same branches, she sees several parallelisms:
       - loss of the swimbladder, pelvic fins, and scales
       - elongation of the mandibular or hyoid arches
       - reduction or loss of the opercle in syngnathids and saccopharyngoids
       - a variety of other bones and soft tissues are lost or greatly modified
     • She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
     • She can select appropriate species from these lineages to follow up experimentally.
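     The geneticist's query (a fixed entity, plus any quality that falls under the attribute ‘count’) can be sketched as a lookup with ontology subsumption. Everything below is a toy: the is_a links, the quality names, the taxa, and the annotations are invented placeholders and do not reproduce PATO, TAO, or any real phenotype database.

```python
# Hypothetical is_a links among quality terms (child -> parent).
IS_A = {
    "decreased count": "count",
    "increased count": "count",
    "round": "shape",
    "pointed": "shape",
}

# Invented (taxon, entity, quality) annotations with placeholder taxon names.
ANNOTATIONS = [
    ("Taxon A", "branchiostegal ray", "decreased count"),
    ("Taxon B", "branchiostegal ray", "increased count"),
    ("Taxon B", "dorsal fin", "round"),
    ("Taxon C", "branchiostegal ray", "pointed"),
    ("Taxon D", "opercle", "decreased count"),
]


def falls_under(term, ancestor, is_a):
    """True if term equals ancestor or reaches it by following is_a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = is_a.get(term)
    return False


def query(annotations, entity, attribute, is_a):
    """Select annotations on one entity whose quality falls under an attribute."""
    return [
        (taxon, quality)
        for taxon, ann_entity, quality in annotations
        if ann_entity == entity and falls_under(quality, attribute, is_a)
    ]


hits = query(ANNOTATIONS, "branchiostegal ray", "count", IS_A)
for taxon, quality in hits:
    print(taxon, quality)
```

     The point of routing the match through `falls_under` is that the user asks about ‘count’ once and still retrieves annotations recorded with more specific terms; mapping the hits onto a tree is the separate visualization step described in the scenario.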
  8. Some anatomical ontologies
     • Amphibia
     • C. elegans
     • Fish (zebrafish, medaka, teleosts)
     • Insects (Drosophila, Mosquito, Hymenoptera)
     • Mammals (mouse, human)
     • Plants (Arabidopsis, cereals, maize, all plants)

     What data are needed to enable this scenario?
     • Anatomy and trait ontologies
     • Phenotypes in EQ syntax for
       - Zebrafish mutants (already exist)
       - Species/clades of Ostariophysi
     • Phylogenetic relationships among the Ostariophysi
       - Taxonomy ontology

     Phenotype Ontologies for Evolutionary Biology
     [Project organization diagram. Participants: NESCent (Vision, Lapp, software developers); U. Oregon (Westerfield); USD (Mabee, data curator); Tulane U. (Rios, ontology curator); NCBO; OBO (host of TAO, PATO, taxonomy ontology); collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden); ichthyology community (DeepFin, Fishbase). Components: working groups; workshops; curator interface; usability testing; EQSYTE database, public interface, and contents; zebrafish phenotypic & genetic data; Ostariophysan phenotypic data; applications (Phenote, OBO-Edit); ontologies (taxonomy, TAO, PATO, homology); liaisons to ZFIN, NCBO, and CToL.]

     Preserving published data for future integration efforts
     • Sequence alignments (e.g. Treebase)
     • Long-term population records (e.g. pedigrees)
     • 2D and 3D images
     • Collection and locality information
     • Morphology data
     • Behavioral observations
     • Numerical tables
     • Etc.
     • Most of these data are lost upon publication
     • These are the stuff of comparative biology
  9. Dryad: A digital repository for published data
     Journals and societies involved in evolutionary biology so far
     • American Naturalist (ASN)
     • Evolution (SSE)
     • Journal of Evolutionary Biology (ESEB)
     • Integrative and Comparative Biology (SICB)
     • Molecular Biology and Evolution (SMBE)
     • Molecular Ecology
     • Molecular Phylogenetics and Evolution
     • Systematic Biology (SSB)
     NCSU Digital Library Initiative

     Open development
     • Open source refers only to the licensing of the software code
     • At NESCent, we have been experimenting with practices in open development
       - Community contributes to a shared code base
       - Higher barrier to entry
       - Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
       - Surprisingly rare in academia

     2006 Phyloinformatics Hackathon
     ATV, NCL, NESCent, HyPhy, PAUP*, CIPRES, GARLI, TreeBase, Bio::CDAT, Biojava, BioSQL, JEBL, Bioruby, BioPerl, Biopython
  10. Hackathon mechanics
     • Before the meeting
       - Participants and users suggested integrative workflows
     • At the meeting
       - Gaps in existing toolkits were identified
       - Subgroups collaborated on high priority targets
       - Followed a “use case” model
       - Subgroups and targets were allowed to be fluid
       - Users were on hand to provide datasets, test code, provide their perspective
       - Dedicated participants tasked with documentation
     • All code is open-source and deposited in established repositories

     Accomplishments
     • Reconciling trees
       - BioPerl: Support for NJTree
       - Biopython: Wrapper for Softparsmap
       - BioRuby: Model for phylogenetic trees and networks with graph algorithms
       - BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
     • Sequence family evolution
       - BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
       - BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
       - Biojava: Parser for Phylip alignment format
       - BioRuby: Support for T-Coffee, MAFFT, and Phylip
  11. • Phylogenetic inference on non-molecular characters
       - BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
       - BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
     • Phylogenetic footprinting
       - BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window
     • Estimation of divergence times
       - BioPerl: Draft design of r8s wrapper
     • NEXUS compliance
       - Biojava: Interoperability between Biojava and JEBL
       - Biojava & BioRuby: Level II-compliant NEXUS parsers
       - All:
         Evaluated major APIs
         Proposed compliance levels
         Gathered test files exposing common errors
         Fixed compliance issues in NCL and Bio::NEXUS reference implementations
         Worked on integrating those into GARLI and BioPerl, respectively

     • Student internships in open-source software development
       - Students work with any of a large number of established OS projects
       - Students and mentors work & communicate remotely
     • NESCent recruited mentors and oversaw student progress
       - Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods

     Next hackathon
     • Comparative Phylogenetic Methods in R
     • December 10-14, 2007
     • Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
     • Have an idea for a future event? Submit a whitepaper!
  12. NEXML
     • Student: Jason Caravas
     • Mentor: Rutger Vos
     • Flexible serialization of phylogenetic objects
     • Perl Bio::Phylo module tools for NEXML parsing and serialization

     Command-line BioSQL
     • Student: Jamie Estill
     • Mentor: Hilmar Lapp
     • Commands for
       - Database initialization
       - Bio::TreeIO import
       - Bio::TreeIO export
       - Tree query
       - Tree optimization
       - Tree manipulation

     Conservation of phylogenetic diversity
     • Student: Klaas Hartmann
     • Mentor: Tobias Thierer
     • Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
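     As a rough illustration of the budget-allocation idea in the last project, the sketch below scores a set of species by phylogenetic diversity (the total branch length connecting them to the root) and then searches exhaustively for the best affordable set. It is a deliberate simplification and not the student project's algorithm: each species is either fully funded or not, and the tree, branch lengths, and costs are invented.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length). "root" has no entry.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}
# Invented cost of conserving each species.
COST = {"A": 2, "B": 1, "C": 2}


def phylogenetic_diversity(species, tree):
    """Sum the lengths of all branches on the paths from the species to the root."""
    branches = set()
    for node in species:
        while node in tree:          # stop once the root is reached
            branches.add(node)       # a branch is identified by its child node
            node = tree[node][0]
    return sum(tree[child][1] for child in branches)


def best_subset(tree, cost, budget):
    """Exhaustively find the affordable species set with the largest diversity."""
    best, best_pd = (), 0.0
    for size in range(1, len(cost) + 1):
        for subset in combinations(sorted(cost), size):
            if sum(cost[s] for s in subset) > budget:
                continue
            pd = phylogenetic_diversity(subset, tree)
            if pd > best_pd:
                best, best_pd = subset, pd
    return best, best_pd


print(best_subset(TREE, COST, budget=3))
```

     Exhaustive search is only workable for a handful of species; it is used here because it is easy to verify, and shared branches (such as the one above A and B) are counted once, which is why picking two close relatives adds less diversity than picking two distant ones.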
  13. Bayesian calibration of divergence times
     • Student: Michael Nowak
     • Mentor: Derrick Zwickl
     • Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g. BEAST

     Phyloinformatics Summer Course
     • Teaching advanced programming skills to phylogenetic methods developers
     • Focus is on software technologies rather than methodology
     • First year
       - 10 days in July 2007
       - Organized by Bill Piel of TreeBASE
       - 8 co-instructors
       - 23 students (11 female) in the first year

     Conclusions
     • The future of web-enabled comparative biology is beginning to become clearer.
       - For a preview, see genomics!
     • The facile exchange of phylogenetic data is what will enable it.
     • Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
     • Also expect a shift toward open development.
       - This will necessitate new modes of training for academic phyloinformaticists.

     Additional acknowledgements
     • Hackathon participants
     • GSoC mentors and students
     • Summer course instructors
     • Phenotype evolution project
       - Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
     • Data depository:
       - Ryan Scherle, Jane Greenberg