Good afternoon, and thank you for attending this presentation. I am going to talk about my humble efforts to develop new comparative genomics tools that the common man or woman can use.
The driving force behind this incursion into the world of application development was a question posed to me by our distinguished colleagues Paula and José Paulo: would it be possible to tell which genes from S. cerevisiae are absent in S. kudryavzevii? As both genomes were sequenced and publicly available, the answer would be a mere matter of work. A synteny analysis, followed by an assessment of the annotation for the putative missing sequences, would do the trick.
Unfortunately, the available sequence for the genome of S. kudryavzevii was, to say the least, patchy, and the accompanying annotation was also found wanting. The low coverage of the S. kudryavzevii genome is exemplified by this MEGABLAST matching of its contigs along the length of S. cerevisiae chromosome 3. This made synteny analysis futile and led instead to single ORF/CDS homology analysis. Although this cannot overcome the limitations of the S. kudryavzevii genome, it can at least corroborate the available annotation, allowing the consolidation of potentially fragmented ORFs and the writing off of spurious ones.
When considering evolution among sequences, the following types of homology can be distinguished according to the canonical criteria of similarity, conjunction and congruence. Elapsed time and the context of occurrence are paramount to separate xenolog and paralog sequences from ortholog ones. Xenologs are the result of lateral transfer, and their occurrence conflicts with the phylogeny of the host organism. Paralogs occur in the same genome and lineage, and they share a common ancestor sequence within that same phylogenetic lineage. Orthologs diverged along with their own genomes, that is, through speciation.
The problem of looking for orthologs lies mainly in how to distinguish them from paralogs. Even though this should often be possible simply by comparing similarity, less clear-cut situations require explicit phylogenetic analysis.
Those situations may be associated with what we can perceive in a more granular analysis of paralogy: namely, whether the sequence duplication happened before or after speciation, or even whether ohnology and whole-genome duplications are involved.
In brief, a consensus approach to homolog sequence detection and classification would need a three-stage process. First, putative homolog sequences would be listed by similarity search. These sequences would also be classified through their rank in the reciprocal search hit list. This kind of search is typically carried out with BLAST, and the bi-directional best hits are named Reciprocal Best Hits (RBH). RBH would define putative orthologs, and simple hits would do the same for paralogs. What follows is the confirmation of the similarity results by functional inference analysis of their products. This is usually done by determining the domain architecture of the protein, and several competing methods exist. The more efficient ones are based on hidden Markov models, aka HMMs, although regular expressions are more diverse and easier to use. Synteny analysis of the sequence neighborhood, if conserved, would be the last test of orthology.
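To make the first stage a little more concrete, here is a minimal Python sketch of how Reciprocal Best Hits could be picked out of two similarity searches. It is only an illustration under stated assumptions, not the tool presented here: the function names (best_hits, classify_hits), the (query, subject, score) tuple layout and the sequence identifiers are all hypothetical, and the scores are made up for the example rather than taken from any real search.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed hit layout: (query_id, subject_id, score), e.g. parsed from a
# tabular similarity search report. Higher score means a better hit.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first one wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def classify_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Split the A-versus-B hits into reciprocal best hits and simple hits."""
    a_vs_b = list(a_vs_b)  # the forward hits are traversed twice
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    # A pair is reciprocal when each sequence is the other's best hit.
    rbh = {(q, s) for q, s in forward.items() if reverse.get(s) == q}
    # Every other forward hit is a simple (non-reciprocal) hit.
    simple = {(q, s) for q, s, _ in a_vs_b if (q, s) not in rbh}
    return rbh, simple


# Hypothetical identifiers and scores, for illustration only.
cer_vs_kud = [("cerA", "kudX", 250.0), ("cerA", "kudY", 90.0), ("cerB", "kudY", 180.0)]
kud_vs_cer = [("kudX", "cerA", 245.0), ("kudY", "cerB", 175.0), ("kudY", "cerA", 88.0)]

rbh_pairs, simple_pairs = classify_hits(cer_vs_kud, kud_vs_cer)
print(sorted(rbh_pairs))
print(sorted(simple_pairs))
```

In this toy case the pairs cerA/kudX and cerB/kudY come out as reciprocal best hits, so they would be the putative orthologs, while cerA/kudY remains a simple hit and a paralog candidate. Such calls would still need the domain architecture and synteny checks before being trusted.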
COMPARATIVE GENOMICS - Tool Development