Quest for Orthologs (QfO) is a volunteer effort to jointly create a firmer foundation for comparative biology research. Extrapolating knowledge from the handful of organisms for which experimental data are available to other living organisms is the heart of comparative biology, and orthology assignments are its foundation. Whether the aim is to find the gene in a model organism that corresponds to a human disease gene, to infer the function of a newly sequenced gene from experimental assays of its orthologs, or to infer species phylogenies by tracing the evolution of orthologous groups, reliable ortholog predictions are needed. Any improvement to this essential base will offer enormous benefit to the entire biological community. Currently more than 30 different phylogenomic databases provide their ortholog analysis results to the scientific community, and they differ in many ways: number of species, taxonomic range, sampling density, applied methodology, and more. In addition, phylogenomic databases differ in their concepts, which makes direct benchmarking problematic and presents major obstacles to users looking for relationships between genes.
One of the prerequisites for the QfO consortium is a shared species phylogeny, which is used for the final determination of gene relationships during construction of sequence-based gene trees. A cladogram was therefore constructed for the 147 species of the reference proteomes (http://www.ebi.ac.uk/reference_proteomes/) based on information from various resources (http://wiki.isb-sib.ch/swisstree/Species_tree_for_Quest_for_Orthologs_reference_proteomes_2013). This talk presents the current progress of the QfO consortium and ideally will lead to a discussion of common needs and open questions.
Quest for Orthologs: anchoring comparative biology research (TDWG 2013)
1. Quest for Orthologs*
anchoring comparative biology research
Sharing and delivery of reusable phylogenetic knowledge
TDWG 2013 Annual Conference - October 2013 Florence, Italy
*http://questfororthologs.org/
Wednesday, October 23, 13
2. Evolutionary conservation allows knowledge
transfer from well-characterized model
organisms to human & other organisms and is
the basis for comparative genomic studies
3. The barriers
More than 30 phylogenomic databases provide their analysis
results to the scientific community.
The content of these databases differs
The concepts of these databases also differ
Complex/slow pipelines
Unavailability as stand-alone programs
Different output formats
Lack of benchmarking data sets
Consequently, comparing and choosing is difficult
4. Who we are
SWISS INSTITUTE OF BIOINFORMATICS
EUROPEAN BIOINFORMATICS INSTITUTE
STOCKHOLMS UNIVERSITET
EIDGENÖSSISCHE TECHNISCHE HOCHSCHULE ZÜRICH
INSTITUT DE GÉNÉTIQUE ET MICROBIOLOGIE
NATIONAL INSTITUTE FOR BASIC BIOLOGY, JAPAN
SANGER INSTITUTE
EUROPEAN MOLECULAR BIOLOGY LABORATORY
INSTITUT NATIONAL DE LA RECHERCHE AGRONOMIQUE
UNIVERSIDAD DE MURCIA
JOINT GENOME INSTITUTE
UNIVERSITY OF LAUSANNE
SYNGENTA
UNIVERSITÄT BONN
UNIVERSITY OF CAMBRIDGE
CENTRE INTERNATIONALE POUR LA RECHERCHE AGRONOMIQUE POUR LE DÉVELOPPEMENT
JACKSON LABORATORY
UNIVERSITÉ DE LYON
UNIVERSITÉ DE GENEVE
PRINCETON UNIVERSITY
CENTRE DE REGULACIÓ GENOMICS, BARCELONA
UNIVERSITY OF PENNSYLVANIA
5. Quest for Orthologs’
objectives
A collaboration of phylogenomic databases
Use shared reference datasets (proteomes and
species trees)
Benchmark orthology predictions
Use an agreed format
Evaluate emerging new methods
6. QfO - proteomes
Criterion 1: include the major experimental model
organisms
Criterion 2: include a broad taxonomic range of genomes
Common dataset: QfO Reference Proteome: http://www.ebi.ac.uk/reference_proteomes
Currently 147 species; the proteomes are publicly available and are
generated using UniProtKB, Ensembl and Ensembl Genomes.
Additional species on request, annual release in April
7. QfO - Format
Common format: OrthoXML: http://seqxml.org
Designed for representing the orthology relationships
that are generated as output
Sign ups: Ensembl Compara*, HCOP, InParanoid*,
MBGD Microbial Genome Database, OMA*,
OrthoInspector, OrthoMCL, Panther, PHOG, PhyloFacts,
PhylomeDB, ProGMap, Roundup*
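To illustrate what a shared output format makes possible, here is a minimal sketch that reads ortholog groups from an OrthoXML document using only the Python standard library. The element and attribute names (orthoXML, species, gene, orthologGroup, paralogGroup, geneRef, protId) and the namespace follow my understanding of OrthoXML 0.3 and should be checked against the published schema; the gene identifiers, group id and database version are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from the OrthoXML 0.3 schema; verify against the spec.
NS = {"ox": "http://orthoXML.org/2011/"}

# Hypothetical toy document: one human gene, two mouse in-paralogs.
EXAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3"
          origin="example" originVersion="1">
  <species name="Homo sapiens" NCBITaxId="9606">
    <database name="UniProtKB" version="hypothetical">
      <genes><gene id="1" protId="HUMAN_A"/></genes>
    </database>
  </species>
  <species name="Mus musculus" NCBITaxId="10090">
    <database name="UniProtKB" version="hypothetical">
      <genes>
        <gene id="2" protId="MOUSE_A"/>
        <gene id="3" protId="MOUSE_B"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="g1">
      <geneRef id="1"/>
      <paralogGroup>
        <geneRef id="2"/>
        <geneRef id="3"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>"""


def ortholog_groups(xml_text):
    """Map each top-level orthologGroup id to the sorted protein ids it contains."""
    root = ET.fromstring(xml_text)
    # Internal gene id -> protein identifier, across all species.
    prot = {g.get("id"): g.get("protId") for g in root.iterfind(".//ox:gene", NS)}
    groups = {}
    for grp in root.iterfind("ox:groups/ox:orthologGroup", NS):
        refs = grp.iterfind(".//ox:geneRef", NS)  # includes nested subgroups
        groups[grp.get("id")] = sorted(prot[r.get("id")] for r in refs)
    return groups


GROUPS = ortholog_groups(EXAMPLE)
```

Because every provider would emit the same structure, the same few lines could in principle read the output of any database that has adopted the format.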
8. QfO - Benchmark
Compared: Ensembl Compara, InParanoid (Full, core),
MetaPhOrs (Missing 3 genomes), OMA (Pairs, Groups,
HOGs), Orthoinspector 1.30, PANTHER 8.0 (LDO only, all),
PhylomeDB, RSD 0.8 1e-5 (RoundUp)
OrthoBench: http://orthology.benchmarkservice.org
Battery of approaches: species-tree discordance test; gold
standard gene trees; gold standard (hierarchical)
orthologous groups
Minimum standard and sanity check (already useful):
Minimum Information for an Orthology Prediction Algorithm?
Guide to improve algorithms
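As a rough illustration of benchmarking against a gold standard, the sketch below scores a set of predicted ortholog pairs by precision and recall. This is one common way to compare pair predictions and is not necessarily what the QfO benchmark service computes; the gene names and both pair sets are invented.

```python
def pair(a, b):
    """Unordered gene pair, so (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def score(predicted, gold):
    """Return (precision, recall) of predicted pairs against gold-standard pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and prediction for three human/mouse gene pairs.
GOLD = {pair("H1", "M1"), pair("H2", "M2"), pair("H3", "M3")}
PREDICTED = {pair("M1", "H1"), pair("H2", "M3")}

PRECISION, RECALL = score(PREDICTED, GOLD)
```

Here one of the two predicted pairs is in the gold standard, and one of the three gold-standard pairs is recovered.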
9. A species tree is key
A reliable species phylogeny enhances prediction of gene relationships
The current cladogram comprises 147 species from the reference
datasets and is based on information from various resources
Newick format
3 identifiers: UniProtKB species code, scientific name, NCBI taxid
Relevant publications for speciation nodes possible
For QfO benchmarking, it only needs to cover currently accepted models of
species evolution
A time-tree would be desirable to define rules for the introduction of
multi-furcating nodes for benchmarking purposes
Ortholog DB providers use it for gene/species tree reconciliation
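The following sketch shows, under simplifying assumptions, how a species tree in Newick format can be used to label gene-tree nodes as speciation or duplication events by mapping each node to the smallest species clade that covers it (LCA mapping). The parser handles only plain labels without branch lengths, quoting or whitespace, and the gene naming scheme (species code, underscore, suffix) is a hypothetical convention chosen for the example, not the QfO one.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples; leaves are strings."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            # Skip an optional internal node label.
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Set of leaf labels below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca_clade(species_tree, species):
    """Leaf set of the smallest species-tree clade containing all given species."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if species <= leaves(child):
                node = child
                break
        else:
            break  # no single child covers the set: this node is the LCA
    return frozenset(leaves(node))


def label_events(gene_tree, species_tree, species_of):
    """Post-order list of (event, sorted gene leaves) for internal gene-tree nodes."""
    events = []

    def visit(node):
        if isinstance(node, str):
            return lca_clade(species_tree, {species_of(node)})
        child_clades = [visit(child) for child in node]
        clade = lca_clade(species_tree, set().union(*child_clades))
        # Duplication if the node maps to the same clade as one of its children.
        kind = "duplication" if clade in child_clades else "speciation"
        events.append((kind, sorted(leaves(node))))
        return clade

    visit(gene_tree)
    return events


SPECIES_TREE = parse_newick("((HUMAN,MOUSE),DROME);")
GENE_TREE = parse_newick("((HUMAN_a,MOUSE_a),(HUMAN_b,MOUSE_b));")
EVENTS = label_events(GENE_TREE, SPECIES_TREE, lambda gene: gene.split("_")[0])
```

In this toy case the two inner gene-tree nodes are labelled as speciations and the root as a duplication, so HUMAN_a and MOUSE_a would be called orthologs while HUMAN_a and MOUSE_b would be paralogs. A wrong species tree would change these labels, which is why a shared, reliable phylogeny matters.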
11. Resources
Quest for Orthologs: http://questfororthologs.org/
Alan Wilter Sousa da Silva: http://www.ebi.ac.uk/reference_proteomes
Erik Sonnhammer & Matthieu Muffato: http://seqxml.org
Adrian Altenhoff & Christophe Dessimoz: http://orthology.benchmarkservice.org
Brigitte Boeckmann: wiki.isb-sib.ch/swisstree
12. Our questions
Are model differences among the different Trees of Life (ToLs) documented and,
if so, is this information public or can it be made available to QfO?
Which tree format is best to use for data comparison and update?
Are confidence values for internal nodes in the ToLs made
available?
Are ToLs available for download (formats, update frequencies,
release identifiers, ...)?
Are species identifiers of ToL projects in sync with the NCBI TaxIds?
Is there a point of contact and communication? How can we
productively engage with the ToL & taxonomy community on a
cooperative effort?
13. Special thanks to
Brigitte Boeckmann, SWISS INSTITUTE OF BIOINFORMATICS
Vincent Daubin, UNIVERSITÉ DE LYON
Kristoffer Forslund, EUROPEAN MOLECULAR BIOLOGY LABORATORY
Toni Gabaldon, CENTRE DE REGULACIÓ GENOMICS, BARCELONA
Matthieu Muffato, EUROPEAN BIOINFORMATICS INSTITUTE
Fabian Schreiber, SANGER INSTITUTE