This document discusses making phylogenetic data more accessible to non-specialists. It describes current barriers, such as technical obstacles in data standards and the social obstacle of data hoarding. The National Evolutionary Synthesis Center (NESCent) aims to address these issues through various initiatives, including the development of ontologies, databases, and software to integrate phylogenetic and phenotypic data, as well as the promotion of open development practices.
This document discusses linking data on the semantic web. It questions when data providers will value links enough to consistently create and maintain them between resources. It also questions how to link data in the absence of persistent identifiers. Specifically, it raises challenges around making persistent links without identifiers that remain constant over time.
Working with Trees in the Phyloinformatic Age. WH Piel, Roderic Page
The document discusses various methods for querying and searching phylogenetic tree databases, including:
- Storing tree data in different formats like path enumerations, nested sets, and adjacency lists
- Using techniques like transitive closure and shortest path algorithms to find relationships between nodes
- Implementing tree queries using SQL against a BioSQL database with a phylo extension
- Developing more complex queries to find trees or subtrees that match certain criteria based on included and excluded nodes
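The storage schemes listed above can be sketched in a few lines. Below is an illustrative Python example (not the actual BioSQL phylo schema): an adjacency list for a toy tree, and nested-set labels derived from it, which turn "find all descendants" queries into simple interval tests.

```python
# Minimal sketch of two of the tree encodings mentioned above
# (illustrative only, not the BioSQL phylo extension itself).

def nested_sets(children, root):
    """Assign (left, right) interval labels via a depth-first walk."""
    labels, counter = {}, [0]
    def walk(node):
        counter[0] += 1
        left = counter[0]
        for child in children.get(node, []):
            walk(child)
        counter[0] += 1
        labels[node] = (left, counter[0])
    walk(root)
    return labels

# Adjacency list for a toy tree:  root -> (A, B), B -> (C, D)
children = {"root": ["A", "B"], "B": ["C", "D"]}
labels = nested_sets(children, "root")

def descendants(labels, node):
    """X descends from Y iff Y.left < X.left and X.right < Y.right."""
    l, r = labels[node]
    return {n for n, (nl, nr) in labels.items() if l < nl and nr < r}

print(descendants(labels, "B"))  # {'C', 'D'}
```

In SQL, the same descendant query becomes a single range predicate over the (left, right) columns, which is why nested sets are popular for read-heavy tree databases despite being expensive to update.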
Phyloinformatics is the linking of biodiversity data together, such as information on Apomys specimens, to integrate various data sources and learn more about organisms than what is known from any single source. Key aspects of phyloinformatics include linking taxonomy, geography, phylogeny, and extracting information through data and text mining to build a more complete understanding of organisms.
The document discusses a Russian man named Nikolay who wants to be rich and healthy. He buys an iPad and installs an app called "VHS-machine for iPhone&iPad" that allows him to watch physical and spiritual exercise videos by Dr. Volz from a VHS tape. By watching the videos on his iPad, Nikolay becomes both rich and healthy.
Phyloinformatics in the age of Wikipedia (warning, do not view if easily offe... - Roderic Page
Isthmohyla rivularis is a rare species of frog found in Costa Rica and Panama. It lives along fast-moving streams in rainforests. The species was thought to be extinct in the 1980s but was rediscovered in 2007 in Costa Rica and spotted again in 2008.
This document discusses the importance of making data "sticky" by using shared identifiers that create links between data sources. It notes that identifiers should be globally unique, resolvable by both humans and machines, and widely used. Examples of shared identifiers that enable discovery and metrics include DOIs, specimen identifiers from databases like GBIF, and citations. The value, it argues, comes from the links between nodes, not just the nodes themselves.
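The "sticky data" idea above can be made concrete with a tiny join: two datasets can only be linked because both use the same globally unique identifier. This is a hedged sketch with invented records and an invented DOI.

```python
# Hedged sketch: two toy datasets become "sticky" once they share an
# identifier (here a DOI). All records and DOIs below are invented.

papers = {
    "10.1000/xyz123": {"title": "A revision of Apomys"},
}
citations = [
    {"doi": "10.1000/xyz123", "cited_by": "10.1000/abc456"},
]

# The join is only possible because both sources use the same key.
linked = [
    {"title": papers[c["doi"]]["title"], "cited_by": c["cited_by"]}
    for c in citations
    if c["doi"] in papers
]
print(linked[0]["title"])  # A revision of Apomys
```

Without a shared, stable identifier the two sources would have to be matched on fuzzy fields like titles, which is exactly the fragility the talk argues against.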
A traction map lets you find the bottlenecks in your business where intervention yields the maximum effect for minimal effort, and bring your startup to the point where it is ready to scale.
Traction maps, HADI cycles, bottlenecks and the business improvement cycle, problem interviews, and customer segmentation. Video of this lecture: http://startupmagic.ru
This document discusses the need for annotation of genomic data given the deluge of information from next generation sequencing. It outlines that clinical-grade annotation is important for application. Many sources of annotation are discussed, including databases, literature, testing labs, and crowdsourcing. However, it emphasizes that specialized human curation remains essential for high quality annotation.
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database - Hilmar Lapp
Jane's lab uses a freely available open source software package called PhyloDOM to create a phylogenetic database of their molecular data, resolving the phylogeny of endangered frog species. This allows their results to be easily shared and integrated by other researchers through web interfaces, data aggregators, and visualization tools that take advantage of standardized metadata.
Keynote at the AI in Medicine Conference (AIME 2005), giving an overview of the work in Ontology Mapping to people in Medical Informatics (which includes explaining the what and why of ontologies in general).
This document discusses teaching biology at the Institute of Biology at the University of the Philippines, Diliman. It covers several topics:
1. The challenges of teaching the broad topic of biology and helping students retain key concepts rather than just facts.
2. Examples of how biological concepts like bioinformatics and biogeography are taught, including using databases, molecular phylogeny, and geographic patterns of species distribution.
3. Different student reactions to concepts like evolution and how the teacher addresses these concepts by laying out facts and asking students to critically analyze the evidence.
Biological databases are collections of experimental and theoretical biological data that are organized so their contents can be easily accessed, managed, updated, and retrieved. The activity of preparing a database can be divided into collecting data in an accessible form and making it available to a multi-user system. Two important biological databases are GenBank, which contains publicly available nucleotide and protein sequences, and the Protein Data Bank, which houses 3D structures of proteins, nucleic acids, and carbohydrates.
The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
This document discusses opportunities and constraints related to DNA sequencing and analysis. It describes how DNA sequencing is used in academic research, oncology, gene therapy, developing genetically modified organisms, clinical diagnosis, forensics, and pedigree analysis. It also outlines some of the agencies and databases involved and how the capability and cost of sequencing has grown exponentially over time. Finally, it discusses some of the practical constraints in analyzing large DNA sequence data, including reading frames, exons/introns, errors, and the significance of non-coding DNA.
The Sanger Mouse Resources Portal - A Testbed for Collaborative Data Integration - Darren Oakley
The document describes the Sanger Mouse Resources Portal, an attempt at a federated approach to creating a collaborative data portal for mouse genomic data. The portal aggregates data from 5 sources using a search engine and data services that allow each group to host their own data and expose it via defined interfaces. This avoids any single group having total control while allowing new data to be easily added. However, it also risks redundancy and lacks centralized curation of the whole collection.
Global Biodiversity Information Facility (GBIF) - 2012 - Dag Endresen
Presentation of the Global Biodiversity Information Facility (GBIF) and GBIF Norway for the Department of Technical and Scientific Conservation (CONSERV) at the Natural History Museum, University of Oslo. Tøyen, Oslo, 7 November 2012.
This document outlines the goals and activities for a classroom unit on DNA, heredity, and genetics. The unit includes lectures, discussions, and hands-on labs. Students will build DNA models, analyze family trees, and conduct virtual genetics labs. They will also run plasmid and gel electrophoresis labs to visualize and analyze DNA. The unit aims to help students understand how genes influence traits and to think critically about biotechnology and its impact on society. A final project challenges students to design a GMO and debate the ethics of genetic engineering.
Ontologies for the Real World by Deborah L. McGuinness. Invited talk for the 2011 Future Worlds Microsoft Faculty Summit in the Semantic Knowledge for Commodity Computing.
The BioMANTA Network project aims to develop a Semantic Web infrastructure for computational modelling and analysis of large-scale protein-protein interaction and compound activity networks. This will involve creating a Semantic Interactome Model using an OWL-based ontology to represent public interaction data. The goals are to perform network inference and knowledge discovery through network meta-analysis and global network inference. Results will be visualized using COBALT software and high quality ontologies and data sets will be produced.
The document discusses gene tree reconciliation, which involves projecting gene trees onto a species tree to account for evolutionary events like gene duplications, losses, and horizontal transfer. It outlines existing cyberinfrastructure for generating and visualizing reconciliations, and proposes ways to extend this, such as allowing users to submit their own gene trees and alignments for reconciliation, integrating visualization tools, and storing multiple reconciliations per gene tree. A goal is to "make tree reconciliation phylotastic" by building components to allow users more flexibility in generating reconciliations from their own data.
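The projection of gene trees onto a species tree described above is usually done by LCA mapping: each gene-tree node is mapped to the lowest common ancestor (in the species tree) of its children's mappings, and a duplication is inferred when a node maps to the same species-tree node as one of its children. A minimal sketch with toy trees (all names invented):

```python
# LCA-based gene tree reconciliation, one classic technique behind the
# reconciliations described above. Trees and names are toy data.

species_parent = {"human": "HC", "chimp": "HC", "HC": "rootS", "mouse": "rootS"}

def ancestors(node, parent):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(a, b, parent):
    anc_a = ancestors(a, parent)
    for n in ancestors(b, parent):
        if n in anc_a:
            return n
    raise ValueError("no common ancestor")

# Gene tree ((g_human, g_chimp)n1, g_mouse)n2 as child lists + leaf map.
gene_children = {"n1": ["g_human", "g_chimp"], "n2": ["n1", "g_mouse"]}
leaf_species = {"g_human": "human", "g_chimp": "chimp", "g_mouse": "mouse"}

mapping = {}
duplications = set()

def reconcile(node):
    """Map a gene-tree node to a species-tree node; record duplications."""
    if node in leaf_species:
        mapping[node] = leaf_species[node]
        return mapping[node]
    kids = [reconcile(c) for c in gene_children[node]]
    m = kids[0]
    for k in kids[1:]:
        m = lca(m, k, species_parent)
    mapping[node] = m
    if m in kids:  # maps to the same node as a child: a duplication
        duplications.add(node)
    return m

reconcile("n2")
print(mapping["n1"], mapping["n2"], sorted(duplications))  # HC rootS []
```

Here both internal nodes map to distinct species-tree ancestors, so no duplications are inferred; a gene tree with two human copies under one node would flag that node as a duplication.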
Unison: Enabling easy, rapid, and comprehensive proteomic mining - Reece Hart
Unison is an online database and data integration platform that aggregates proteomic and genomic data from multiple sources and provides over 200 million precomputed predictions on protein sequences, domains, structures, and more. It aims to enable easy, rapid, and comprehensive proteomic mining through semantic integration of distinct data types and automated querying of predictions. Custom data mining projects using Unison have led to discoveries about proteins like Bcl-2 that regulate apoptosis.
ESI Supplemental Webinar 2 - DataONE presentation slides - DuraSpace
This document provides an overview of a webinar on DataONE, a project that aims to provide tools and approaches for supporting the data life cycle. The webinar covered three key challenges in data management: preservation and planning, discovery, and innovation. It discussed how DataONE is working to address these challenges through its coordinated network of member nodes that allow for data preservation, sharing and discovery. The webinar also demonstrated some of DataONE's tools like the DMPTool for data management planning and the Investigator Toolkit for data analysis and visualization.
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes - Monica Munoz-Torres
Precise elucidation of the many different biological features encoded in a genome requires a careful curation process that involves reviewing all available evidence to allow researchers to resolve discrepancies and validate automated gene models, protein alignments, and other biological elements. Genome annotation is an inherently collaborative task; researchers only rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families.
The i5k initiative seeks to sequence the genomes of 5,000 insect and related arthropod species. The selected species are known to be important to worldwide agriculture, food safety, medicine, and energy production as well as many used as models in biology, those most abundant in world ecosystems, and representatives in every branch of the insect phylogeny in an effort to better understand arthropod evolution and phylogeny. Because computational genome analysis remains an imperfect art, each of these new genomes sequenced will require visualization and curation.
Apollo is an instantaneous, collaborative, genome annotation editor, and the new JavaScript based version allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. The i5K is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process and Apollo is serving as the platform to empower this community. Here we offer details about this collaboration.
The hippocampus receives input from the entorhinal cortex and sends projections to multiple targets in the brain. Its main outputs are to the subiculum, which projects to regions like the nucleus accumbens, amygdala, and medial prefrontal cortex. The hippocampus plays an important role in memory formation and spatial navigation.
DNA profiling uses regions of non-coding DNA that vary between individuals, called short tandem repeats (STRs), to identify a person from their DNA. STR analysis has replaced older DNA fingerprinting techniques. It can analyze small amounts of DNA quickly and is used in forensics, parentage testing, and other applications. Parentage testing involves collecting samples from a child and putative parents, analyzing STR regions through PCR and electrophoresis, and using statistical analysis to calculate probability of paternity.
Presented during the 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'12). Part of the workshop 'New Models and Modes for Data Sharing: Experiences from Neuroscience'. Presented by Jeffrey S. Grethe, Ph.D. from the Center for Research in Biological Systems at the University of California, San Diego.
This workshop featured several large scale efforts to establish data sharing platforms, standards and tools to promote data intensive analysis in the neurosciences. As we head into the second decade of the 21st century, many scientists realize that current methods for publishing and accessing data are outmoded and inefficient. Neuroscience, with its large diverse and highly competitive community, has been slow to adopt more open sharing of data and has lacked effective tools to do so. There has been a significant investment in databases and tools for biological science, and frequent calls for more of them, but few calls to the biological community to adopt practices and frameworks for making their resources more easily discoverable and data more accessible. Data are contained within diverse sources, from web pages, databases, literature to personal lab systems, making for a haphazard mechanism for data and tool discovery. Although these mechanisms are effective for small communities, they are parochial for the totality of resources available, leading to fragmentation in the resource ecosystem. Neuroscience, with its diverse subdisciplines, complex data types and broad domain, presents the perfect exemplar of the current practices, bottlenecks and issues surrounding open access to data. This situation is changing, however, as groups have started to work together to define new models and tools for sharing and analyzing neuroscience data on an international scale. In this workshop, we bring together experts from national and international projects to discuss issues of data access and progress towards establishing platforms and best practices for effective sharing of neuroscience data in support of basic and clinical neuroscience.
Similar to Data Mining GenBank for Phylogenetic inference - T. Vision
This document describes ALEC (A List of Everything Cool), a project that aims to create a "Bibliography of Life" containing information on every taxonomic paper, taxonomist, and species in Wikidata. It provides an overview of the current status and capabilities of ALEC, including live queries to Wikidata and examples of information that can be viewed. It also discusses the major tools used, challenges faced populating Wikidata with bibliographic data, and plans for future improvements to the interface and expanding the data in Wikidata.
Wikidata and the Biodiversity Knowledge Graph - Roderic Page
Wikidata could serve several roles in building a biodiversity knowledge graph:
1) Wikidata could provide identifiers, labels, and a way to query relationships (nodes and edges) between biodiversity items to form the basis of a knowledge graph.
2) Individual biodiversity databases could link their data to relevant Wikidata items while maintaining their own specialized knowledge graphs.
3) Wikidata could serve as a centralized source of both core biodiversity information and related contextual information like people, organizations, and locations, essentially becoming the biodiversity knowledge graph itself due to its large community and scope.
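The "nodes and edges" role sketched above boils down to querying triples. The following is a hedged, in-memory miniature of that idea, loosely in the spirit of SPARQL pattern matching against Wikidata; the item and property names are invented, not real Wikidata identifiers.

```python
# Tiny in-memory triple store queried by pattern matching, illustrating
# the nodes-and-edges querying described above. All IDs are invented.

triples = [
    ("Q_frog", "instance_of", "taxon"),
    ("Q_frog", "taxon_name", "Isthmohyla rivularis"),
    ("Q_frog", "endemic_to", "Q_costa_rica"),
    ("Q_costa_rica", "label", "Costa Rica"),
]

def match(pattern):
    """Return triples matching a (subject, predicate, object) pattern,
    where None acts as a wildcard (like a SPARQL variable)."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which item is endemic to Q_costa_rica?"
hits = match((None, "endemic_to", "Q_costa_rica"))
print(hits[0][0])  # Q_frog
```

A real query would go through the Wikidata Query Service with SPARQL variables in place of the `None` wildcards, but the matching semantics are the same.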
This document discusses challenges with accessing and processing biodiversity literature and describes efforts to make literature from Biodiversity Heritage Library (BHL) more accessible and machine-readable. BioStor extracts text and figures from over 200,000 BHL articles and makes them searchable online. Efforts are underway to link BHL content to databases and Wikidata to connect literature to species and other entities. The goal is to extract more articles from BHL, enable geographic searches of content, break articles into smaller pieces, and embed literature within a biodiversity knowledge graph.
Ozymandias - from an atlas to a knowledge graph of living Australia - Roderic Page
This document discusses building a knowledge graph called Ozymandias to link biodiversity data for Australian fauna. It describes linking specimens, publications, researchers and more using identifiers. Demos are proposed to link the Australian Fauna Database to literature to add images and information about taxonomists. Knowledge graphs integrate diverse data sources and make more connections and information visible than traditional databases. They rely on unique identifiers for people, publications and other objects.
SLiDInG6 talk on biodiversity knowledge graph - Roderic Page
This document discusses building knowledge graphs by connecting various data sources using identifiers and vocabularies. It identifies technical and social obstacles to doing so, including the need for globally unique identifiers, agreed-upon vocabularies, and standards for transmitting the graph. It notes that some of these obstacles are being overcome with solutions like DOIs, JSON-LD, and Wikidata, but that measuring progress on connectivity is more difficult than linear growth. It concludes that knowledge graphs already exist but are not evenly distributed, and that any combined graph should be free, open, and used for good.
Wild idea for TDWG17: Bitcoins, biodiversity and micropayments - Roderic Page
Open data is freely available, with attribution as its currency, but funding the work and demonstrating its value can be difficult. Closed data requires paid access through subscriptions; development is funded and the value is financially obvious. The document proposes bitcoin micropayments for access to raw or cleaned biodiversity data, with the community mining bitcoins to fund the processing of raw data into cleaned datasets.
Towards a biodiversity knowledge graph - Roderic Page
The document discusses obstacles to and progress in building knowledge graphs. It outlines technical obstacles like needing globally unique identifiers and agreed-upon vocabularies. Social obstacles include economic issues. Identifiers are key to connecting knowledge graphs. Metrics are needed to measure graph connectivity rather than just growth. Network effects are important to make graphs truly useful. Wikidata and Google's knowledge graph are examples of existing large knowledge graphs.
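One way to make "measure connectivity rather than just growth" concrete is the fraction of nodes in the largest connected component, which rewards linking over mere accumulation. This is an assumed metric for illustration, not one proposed in the talk; the toy graph is invented.

```python
# Sketch of a possible connectivity metric for a knowledge graph
# (an assumption for illustration): fraction of nodes in the largest
# connected component, computed with a small union-find.

from collections import Counter

def largest_component_fraction(edges, nodes):
    """Union-find over undirected edges; |largest component| / |nodes|."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)

nodes = ["paper", "specimen", "sequence", "taxon", "orphan"]
edges = [("paper", "specimen"), ("specimen", "sequence"), ("sequence", "taxon")]
print(largest_component_fraction(edges, nodes))  # 0.8
```

Adding nodes without links leaves this score flat or lowers it, whereas adding cross-dataset links raises it, which matches the talk's point that linear growth is not the same as progress.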
The document discusses several topics related to taxonomy and biodiversity data. It summarizes that sequencing may be focused more on phylogeny than taxonomy. It also discusses visualizing data using tools like OneZoom and web maps. Another topic is rethinking how publications are linked and referenced, comparing current one-way links to the potential of two-way links. It proposes building a Biodiversity Knowledge Graph by linking data, descriptions, specimens, and publications. Finally, it suggests treating biodiversity data like source code by allowing edits and tracking changes.
In praise of grumpy old men: Open versus closed data and the challenge of cre... - Roderic Page
The document discusses the history and evolution of linking in documents from printed floras to Xanadu, the World Wide Web, and modern annotations. It advocates for open data and open access to scientific literature to enable the creation of a 21st century digital flora that can embed multimedia content like descriptions, specimens, and data. Making scientific names and literature openly accessible remains a challenge.
This document summarizes the capabilities and potential of the Biodiversity Heritage Library (BHL) and BioStor resources. It discusses how BHL provides open access to digitized biodiversity literature but does not easily surface the journal articles. However, the data is downloadable and has APIs, allowing the articles within to be programmatically extracted. It proposes linking specimens to articles for improved findability and creating a "PubMed Central for biodiversity" to better organize and surface taxonomic facts and documents.
Digitisation involves making physical specimens, literature, and metadata available in digital form, including DNA barcodes. This allows joining up different types of biological data that were previously separate, such as specimens, published literature, and genetic sequences. Digitising collections allows broader access and use of biodiversity data through databases like GBIF, which has over 500 million occurrence records available online. However, different intellectual property policies can apply to literature, specimens, and genetic data. Efforts are underway to better link related biological records and publications through identifiers like ORCID for people, DOIs for publications, and LSIDs for plant names. The goal is to build a comprehensive knowledge graph that integrates all these different types of digital biological data.
Built in the 19th century, rebuilt for the 21st - Roderic Page
The document discusses several topics related to digitization of biological data collections:
1) It describes how databases like GenBank rely on both experimentalist and natural history traditions by collecting and comparing natural facts from experiments.
2) Debates around creating GenBank in 1982 illuminated different moral economies regarding collecting/sharing data and attributing credit.
Data Mining GenBank for Phylogenetic inference - T. Vision
1. Prospects for enabling phylogenetically informed comparative biology on the web

Todd Vision (1,2) & Hilmar Lapp (1)
1 U.S. National Evolutionary Synthesis Center
2 Dept. of Biology, University of North Carolina at Chapel Hill

Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
• Are these databases complete?
• Are they infallible?
• Are they still useful?

Why are these data useful?
• You needn't have mastery of the specialist literature before the search
• A match connects you to a vast interconnected world of information
• Why not worry about completeness?
  ! A negative result is not expensive
  ! Many broadly useful records are already present
• Why not worry about fallibility?
  ! The user can weigh the evidence once a match is found
  ! Assertions should be exposed to scrutiny
2. Some observations
• This infrastructure is designed to disseminate data to non-specialists
• The relevant data may be derived from multiple "studies", not all of which are published
• Data is hoarded neither by the researcher nor by the domain database
• The search service is as widely disseminated as the data
• Semantic-level machine-to-machine communication facilitates human comprehension

The case of phylogenetic data
• There is a broad audience for phylogenetic data
  ! Organismal phylogeny (e.g. Encyclopedia of Life)
  ! Gene/protein trees
• Many of the available resources are geared toward specialist researchers & students
• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
• Few know where to find gene/protein trees at all

TreeBase / Tree of Life Web Project
[screenshots]
3. The NCBI taxonomy
• Provides
  ! A hierarchy for all species represented by DNA sequences in Genbank
  ! Names and IDs for internal nodes
  ! An FTP dump
• But does NOT
  ! Include unsequenced species
  ! Report confidence in topology or monophyly
  ! Offer taxonomic nuance (though it has synonyms & common names)

What if the NCBI taxonomy…
• Listed all taxa, including fossils?
• Allowed one to assess where there are conflicting topologies?
• Reported support values for clades?
• Reported divergence time estimates for nodes (e.g. from TimeTree)?
• Reported the provenance of the data?

Node-oriented web services from the Tree of Life Web Project
• Name
• Description
• Authority
• Date
• Other names
• Completeness of children
• Extinction status
• Confidence of position
• Monophyly
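The FTP dump mentioned above is a set of flat files whose rows are separated by "\t|\t" and terminated by "\t|". A minimal sketch of walking the hierarchy from those files (the sample rows are a tiny stand-in, not real dump contents beyond the well-known tax IDs):

```python
def parse_dmp_line(line):
    """Split one row of an NCBI taxonomy dump file.

    Rows are '\t|\t'-separated and terminated by '\t|'.
    """
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

def load_nodes(lines):
    """Build a child -> parent map from nodes.dmp rows.

    Assumes the first two fields are tax_id and parent tax_id.
    """
    return {int(f[0]): int(f[1]) for f in map(parse_dmp_line, lines)}

def lineage(tax_id, parent):
    """Walk from a node up to the root, which is its own parent."""
    path = [tax_id]
    while parent.get(tax_id, tax_id) != tax_id:
        tax_id = parent[tax_id]
        path.append(tax_id)
    return path

# tiny stand-in for a few rows of nodes.dmp
sample = [
    "1\t|\t1\t|\tno rank\t|\n",
    "9605\t|\t1\t|\tgenus\t|\n",
    "9606\t|\t9605\t|\tspecies\t|\n",
]
parent = load_nodes(sample)
```

In practice one would pass `open("nodes.dmp")` instead of `sample`; names and synonyms live in the parallel names.dmp file with the same row format.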
4. Outline
• Informatics @ NESCent
• An example of a phylogenetically-informed semantic web application for phenotype data
• Promoting interoperability and closing technical gaps in phyloinformatics through open development

Further barriers to dissemination of phylogenetic information
• Technical obstacles
  ! Technology for storing and querying trees
  ! Difficulties with exchange standards
  ! Inference of consensus trees and supertrees
  ! Taxonomic intelligence
  ! Globally unique identifiers
• Social obstacles
  ! Reluctance to provide incomplete or fallible information

NESCent sponsored science
• Catalysis Meetings (large, one-time events)
  ! To foster new collaborations and synthetic research
• Working Groups
  ! Smaller, focused, multiple meetings
• Sabbatical Scholars
• Postdoctoral fellows
• Short-term visitor program
  ! 2 weeks to 3 months
  ! Encourage collaborative projects
• Application info: http://www.nescent.org
5. NESCent Informatics
• Support for sponsored science and scientists
  ! Facilitating electronic collaboration
  ! Software/database development
  ! Providing HPC and other IT infrastructure
• Cyberinfrastructure for synthetic science
  ! Data sharing
  ! Software interoperability
  ! Training
  ! In partnership with major national and international efforts

Evolutionary Informatics WG
• Organizers: Arlin Stoltzfus and Rutger Vos
• Selected goals:
  ! XML serialization of NEXUS
  ! Formal grammar for validation and interconversion of NEXUS & other formats
  ! A transition model language for evolutionary models used in statistical inference
  ! An ontology for evolutionary comparative data analysis
• http://www.nescent.org/wg_evoinfo

GeoPhyloBuilder: "Putting the geography into phylogeography"
David Kidd & Xianhua Liu
• Extension for ArcGIS software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
• 3D visualizations are possible through ArcSCENE.
• http://www.nescent.org/informatics/software.php

Phylogenetic cyberinfrastructure to enable comparative biology
• Two traditions in the recording of phenotype data
  ! Natural language descriptions and character matrices
  ! Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
• NESCent WG on morphological evolution in fish
  ! Organized by Paula Mabee and Monte Westerfield
  ! Led to a larger project
• Aim is to integrate
  ! Mutant phenotype data for zebrafish
  ! Comparative morphology data for the Ostariophysi
6. Ontologies
• Defined terms with defined relationships
  ! e.g. Gene Ontology, Cell Ontology
[diagram: a fragment of the Cell Ontology, showing e.g. "cell membrane" and "cell projection" as part_of "cell", "axon" is_a "cell projection", and "axolemma" part_of "axon"]

Describing phenotypes using ontologies
• Entity-Quality system (EQ)
• Entity term from an anatomy ontology
  ! zebrafish anatomy, cell ontology, etc.
• Quality term from Phenotype and Trait Ontology (PATO)
• e.g. Entity=dorsal fin, Shape=round

Phenotype and Trait Ontology (PATO)
[diagram: a fragment of the PATO hierarchy, e.g. physical quality > optical quality > chromatic property > color > blue > bright blue / dark blue, alongside qualities such as buoyancy and amplitude]

Evolutionary character matrices
• Common phenotypic data format in evolutionary biology (e.g. NEXUS)
• Characters + character states, similar to EQ
[example matrix: character "dorsal fin shape" with states round (Species one), pointed (Species two), undulate (Species three)]
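The parallel between character matrices and EQ statements can be made concrete. A minimal sketch, in which the `EQ` class and the toy matrix are illustrative only (the real project would use TAO and PATO term IDs, not free-text labels):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EQ:
    entity: str   # term from an anatomy ontology (e.g. TAO)
    quality: str  # term from PATO

# Toy matrix keyed by (entity, attribute) pairs rather than free-text
# character names, so each cell maps directly onto an EQ statement.
matrix = {
    "Species one":   {("dorsal fin", "shape"): "round"},
    "Species two":   {("dorsal fin", "shape"): "pointed"},
    "Species three": {("dorsal fin", "shape"): "undulate"},
}

def matrix_to_eq(matrix):
    """Convert character/state cells into per-taxon EQ annotations."""
    return {
        taxon: [EQ(entity, state) for (entity, _attr), state in cells.items()]
        for taxon, cells in matrix.items()
    }
```

The point of the keying convention is exactly the slide's claim: a character is an entity plus an attribute, and a character state is a quality, so the conversion is mechanical once both sides use ontology terms.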
7. Character Matrix vs. EQ
[diagram: a character ("dorsal fin shape") with a character state ("round") maps onto an Entity/Attribute/Value triple, with the Entity ("dorsal fin") drawn from an anatomy ontology (AO) and the Attribute ("shape") and Value ("round") from PATO; together these form an Entity + Quality statement]

A scenario
• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
• She asks: is this bone variable in number among species in nature?
• She could query the evolutionary phenotype database using:
  ! Entity = Branchiostegal ray (from TAO)
  ! Qualities pertaining to attribute 'count' (from PATO)
• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
  ! solenostomids and syngnathids (ghost pipefishes and pipefishes)
  ! giganturids
  ! saccopharyngoid (gulper and swallower) eels
• By examining additional changes on these same branches, she sees several parallelisms:
  ! loss of the swimbladder, pelvic fins, and scales
  ! elongation of the mandibular or hyoid arches
  ! reduction or loss of the opercle in syngnathids and saccopharyngoids
  ! a variety of other bones and soft tissues are lost or greatly modified
• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
• She can select appropriate species from these lineages to follow up experimentally.
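The query step in this scenario can be sketched against a toy annotation store. All taxon names and values below are illustrative placeholders; the real system would resolve TAO and PATO term IDs and reason over ontology subsumption rather than string equality:

```python
# taxon -> list of (entity, attribute, value) annotations (illustrative data)
annotations = {
    "most Ostariophysi": [("branchiostegal ray", "count", "3")],
    "solenostomids":     [("branchiostegal ray", "count", "reduced")],
    "giganturids":       [("branchiostegal ray", "count", "reduced")],
    "saccopharyngoids":  [("branchiostegal ray", "count", "reduced")],
}

def query(annotations, entity, attribute):
    """Return taxa annotated with any quality for the given entity/attribute."""
    return {
        taxon: value
        for taxon, anns in annotations.items()
        for e, a, value in anns
        if e == entity and a == attribute
    }
```

The result of `query(annotations, "branchiostegal ray", "count")` is exactly the set of taxa the geneticist would then map onto the phylogeny.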
8. What data are needed to enable this scenario?
• Anatomy and trait ontologies
• Phenotypes in EQ syntax for
  ! Zebrafish mutants (already exist)
  ! Species/clades of Ostariophysi
• Phylogenetic relationships among the Ostariophysi
  ! Taxonomy ontology

Some anatomical ontologies
• Amphibia
• C. elegans
• Fish (zebrafish, medaka, teleosts)
• Insects (Drosophila, Mosquito, Hymenoptera)
• Mammals (mouse, human)
• Plants (Arabidopsis, cereals, maize, all plants)

Phenotype Ontologies for Evolutionary Biology
[organizational diagram of the project: NESCent (Vision, Lapp, software developers) hosts the EQSYTE database, its curator and public interfaces, usability testing, working groups, and liaisons to ZFIN and NCBO; U. Oregon (Westerfield); USD (Mabee, data curator) curates the EQSYTE contents (zebrafish phenotypic & genetic data, Ostariophysan phenotypic data) with morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden); NCBO provides applications (Phenote, OBO-Edit) and ontologies (taxonomy, TAO, PATO, homology); OBO hosts TAO, PATO, and the taxonomy ontology; Tulane U. (Rios, ontology curator); the ichthyology community (DeepFin, Fishbase) is engaged via liaison to CToL workshops]

Preserving published data for future integration efforts
• Sequence alignments (e.g. Treebase)
• Long-term population records (e.g. pedigrees)
• 2D and 3D images
• Collection and locality information
• Behavioral observations
• Numerical tables
• Etc.
• Most of these data are lost upon publication
• These are the stuff of comparative biology
9. Dryad: A digital repository for published data
NCSU Digital Library Initiative

Journals and societies involved in evolutionary biology so far
• American Naturalist (ASN)
• Evolution (SSE)
• Journal of Evolutionary Biology (ESEB)
• Integrative and Comparative Biology (SICB)
• Molecular Biology and Evolution (SMBE)
• Molecular Ecology
• Molecular Phylogenetics and Evolution
• Systematic Biology (SSB)

2006 Phyloinformatics Hackathon
[diagram of participating software, with NESCent as the hub: ATV, NCL, HyPhy, PAUP*, CIPRES, GARLI, TreeBase; toolkits Bio::CDAT, Biojava, BioSQL, JEBL, Bioruby, BioPerl, Biopython]

Open development
• Open source refers only to the licensing of the software code
• At NESCent, we have been experimenting with practices in open development
  ! Community contributes to a shared code base
  ! Higher barrier to entry
  ! Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
  ! Surprisingly rare in academia
10. Hackathon mechanics
• Before the meeting
  ! Participants and users suggested integrative workflows
• At the meeting
  ! Gaps in existing toolkits were identified
  ! Subgroups collaborated on high priority targets
  ! Followed a "use case" model
  ! Subgroups and targets were allowed to be fluid
  ! Users were on hand to provide datasets, test code, provide their perspective
  ! Dedicated participants tasked with documentation
• All code is open-source and deposited in established repositories

Accomplishments
• Sequence family evolution
  ! BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
  ! BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
  ! Biojava: Parser for Phylip alignment format
  ! BioRuby: Support for T-Coffee, MAFFT, and Phylip
• Reconciling trees
  ! BioPerl: Support for NJTree
  ! Biopython: Wrapper for Softparsmap
  ! BioRuby: Model for phylogenetic trees and networks with graph algorithms
  ! BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
11. Accomplishments (continued)
• Phylogenetic inference on non-molecular characters
  ! BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
  ! BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
• Phylogenetic footprinting
  ! BioPerl: Support for Footprinter, PhastCons, and ClustalW over a sliding window
• Estimation of divergence times
  ! BioPerl: Draft design of r8s wrapper
• NEXUS compliance
  ! Biojava: Interoperability between Biojava and JEBL
  ! Biojava & BioRuby: Level II-compliant NEXUS parsers
  ! All:
    Evaluated major APIs
    Proposed compliance levels
    Gathered test files exposing common errors
    Fixed compliance issues in NCL and Bio::NEXUS reference implementations
    Worked on integrating those into GARLI and BioPerl, respectively

Next hackathon
• Comparative Phylogenetic Methods in R
• December 10-14, 2007
• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
• http://hackathon.nescent.org/R_Hackathon_1
• Have an idea for a future event? Submit a whitepaper!

Google Summer of Code
• Student internships in open-source software development
  ! Students work with any of a large number of established OS projects
  ! Students and mentors work & communicate remotely
• NESCent recruited mentors and oversaw student progress
  ! Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods
12. NEXML
• Student: Jason Caravas
• Mentor: Rutger Vos
• Flexible serialization of phylogenetic objects
• Perl Bio::Phylo module tools for NEXML parsing and serialization

Command-line BioSQL
• Student: Jamie Estill
• Mentor: Hilmar Lapp
• Commands for
  ! Database initialization
  ! Bio::TreeIO import
  ! Bio::TreeIO export
  ! Tree query
  ! Tree optimization
  ! Tree manipulation
Conservation of phylogenetic diversity
• Student: Klaas Hartmann
• Mentor: Tobias Thierer
• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
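Budgeted conservation of phylogenetic diversity is a variant of the "Noah's Ark" problem: choose species so that the total branch length retained on the tree is maximized without exceeding the budget. The slide does not say which algorithm was implemented, so the sketch below is only a greedy illustration of the objective, with each species represented by the set of branches on its path to the root:

```python
def greedy_pd(species_branches, branch_lengths, costs, budget):
    """Greedily add affordable species with the best marginal
    phylogenetic-diversity (PD) gain per unit cost."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for sp, branches in species_branches.items():
            if sp in chosen or spent + costs[sp] > budget:
                continue
            # PD gain = total length of branches not yet retained
            gain = sum(branch_lengths[b] for b in branches - covered)
            if costs[sp] > 0 and gain / costs[sp] > best_ratio:
                best, best_ratio = sp, gain / costs[sp]
        if best is None:
            break
        chosen.append(best)
        covered |= species_branches[best]
        spent += costs[best]
    return chosen, sum(branch_lengths[b] for b in covered)

# Toy tree: root -> A via branch "a" (length 2);
# root -> internal node via "b" (1); internal -> B via "c" (1), -> C via "d" (1)
species = {"A": {"a"}, "B": {"b", "c"}, "C": {"b", "d"}}
lengths = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.0}
costs = {"A": 1.0, "B": 1.0, "C": 1.0}
```

Note that greedy selection is a heuristic, not the optimal allocation the project aimed for; it simply makes the objective concrete (here, a budget of 2 retains branches a, b, and c for a PD of 4).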
13. Bayesian calibration of divergence times
• Student: Michael Nowak
• Mentor: Derrick Zwickl
• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g., BEAST

Phyloinformatics Summer Course
• Teaching advanced programming skills to phylogenetic methods developers
• Focus is on software technologies rather than methodology
• First year
  ! 10 days in July 2007
  ! Organized by Bill Piel of TreeBASE
  ! 8 co-instructors
  ! 23 students (11 female)
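The idea of a fossil-calibrated prior can be illustrated with a lognormal density offset by the fossil's minimum age, so that no prior mass falls on node ages younger than the oldest known fossil. The parameter values here are arbitrary placeholders for illustration, not BEAST defaults:

```python
import math

def fossil_calibrated_prior(age, min_age, mu=1.0, sigma=0.5):
    """Lognormal prior density on a node age (e.g. in Myr), offset so
    that ages at or below the fossil minimum age have zero density."""
    if age <= min_age:
        return 0.0
    x = age - min_age  # time elapsed beyond the fossil minimum
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
        x * sigma * math.sqrt(2 * math.pi)
    )
```

The hard lower bound encodes the fossil's logical constraint (the clade must be at least that old), while the lognormal tail expresses uncertainty about how much older the divergence actually is.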
Conclusions
• The future of web-enabled comparative biology is beginning to become clearer.
  ! For a preview, see genomics!
• The facile exchange of phylogenetic data is what will enable it.
• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
• Also expect a shift toward open development.
  ! This will necessitate new modes of training for academic phyloinformaticists.

Additional acknowledgements
• Hackathon participants
• GSoC mentors and students
• Summer course instructors
• Phenotype evolution project
  ! Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
• Data depository:
  ! Ryan Scherle, Jane Greenberg