BiDiBlast Tool Presentation
Talk for the benefit of my fellow researchers at CReM (Univ. Nova de Lisboa)

  • Good afternoon, and thank you for attending this presentation. I am going to talk about my humble efforts to develop new tools for comparative genomics by the common man or woman.
  • The driving force behind this incursion into the world of application development was a question posed to me by our distinguished colleagues Paula and José Paulo: would it be possible to tell which genes from S. cerevisiae are absent in S. kudryavzevii? As both genomes had been sequenced and were publicly available, the answer should have been merely a matter of work. A synteny analysis, followed by an assessment of the annotation for the putative missing sequences, would do the trick.
  • Unfortunately, the available sequence for the genome of S. kudryavzevii was, to say the least, patchy, and the accompanying annotation was also found wanting. The low coverage of the S. kudryavzevii genome is exemplified by this MEGABLAST matching of its contigs along the length of S. cerevisiae chromosome 3. This made synteny analysis futile and led instead to homology analysis of single ORFs/CDSs. Although this cannot overcome the limitations of the S. kudryavzevii genome, it can at least corroborate the available annotation, allowing potentially fragmented ORFs to be consolidated and spurious ones to be written off.
  • When considering evolution among sequences, the following types of homology can be distinguished according to the canonical criteria of similarity, conjunction and congruence. Elapsed time and the context of occurrence are paramount for separating xenologous and paralogous sequences from orthologous ones. Xenologs are the result of lateral transfer, and their occurrence conflicts with the phylogeny of the host organism. Paralogs occur in the same genome and lineage, and share a common ancestor sequence along that same phylogenetic lineage. Orthologs diverged along with their own genomes.
  • The problem of looking for orthologs lies mainly in how to distinguish them from paralogs. Although this should be possible simply by comparing similarity, less clear-cut situations require explicit phylogenetic analysis.
  • Those situations may be associated with what we perceive in a more granular analysis of paralogy: namely, whether the sequence duplication happened before or after speciation, or whether ohnology and whole-genome duplications are involved.
  • In brief, a consensus approach to the detection and classification of homologous sequences would need a three-stage process. First, putative homologous sequences would be listed by similarity search. These sequences would also be classified through their rank in the reciprocal search hit list. This kind of search is typically carried out with BLAST, and the bi-directional best hits are named Reciprocal Best Hits (RBH). RBH would define putative orthologs, and simple hits would do the same for paralogs. What follows is the confirmation of the similarity results by functional inference analysis of their products. This is usually done by determining the domain architecture of the protein, and several competing methods exist. The most efficient are based on hidden Markov models (HMMs), although regular expressions are more diverse and easier to use. Synteny analysis of the sequence neighborhood, if conserved, would be the last test of orthology.
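
The first stage described in the notes above, classifying hits by whether they are reciprocal, can be sketched in Java as follows. This is only a minimal illustration and not BiDiBlast's actual code: the class ReciprocalBestHits, its classify method and the sequence identifiers are hypothetical, and the sketch assumes that the best hit for each query has already been extracted from the BLAST output in both directions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not taken from BiDiBlast): labels each best hit as
// bi-directional (putative ortholog) or uni-directional (putative paralog).
final class ReciprocalBestHits {

    /**
     * @param bestAtoB best hit in genome B for each query sequence of genome A
     * @param bestBtoA best hit in genome A for each query sequence of genome B
     * @return one label per genome A query that had a hit
     */
    static Map<String, String> classify(Map<String, String> bestAtoB,
                                        Map<String, String> bestBtoA) {
        Map<String, String> labels = new LinkedHashMap<>();
        for (Map.Entry<String, String> hit : bestAtoB.entrySet()) {
            // Look up what the hit sequence itself finds when used as the query.
            String backHit = bestBtoA.get(hit.getValue());
            boolean reciprocal = hit.getKey().equals(backHit);
            labels.put(hit.getKey(), reciprocal ? "bi-directional" : "uni-directional");
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up identifiers, for illustration only.
        Map<String, String> aToB = new LinkedHashMap<>();
        aToB.put("orfA1", "orfB1");
        aToB.put("orfA2", "orfB1");
        Map<String, String> bToA = new LinkedHashMap<>();
        bToA.put("orfB1", "orfA1");
        System.out.println(classify(aToB, bToA));
    }
}
```

In this toy input, orfA1 and orfB1 pick each other, while orfA2 hits orfB1 without being picked back, so the two queries receive different labels.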

Transcript

  • 1. COMPARATIVE GENOMICS - Tool Development -
  • 2.
    • Driving Cause
      • Problem
        • How many genes are absent?
    • Saccharomyces cerevisiae Saccharomyces kudryavzevii
      • Synteny analysis?
  • 3.
    • Driving Cause
      • Problem
        • How many genes are absent?
    • Saccharomyces cerevisiae Saccharomyces kudryavzevii
        • Homolog ORF detection to assess differences.
    Coverage gap S. kudryavzevii contigs
  • 4.
    • Homology
      • Types of homology
        • Origin versus time
  • 5.
    • Homology
      • Types of homology
        • Orthology versus Paralogy versus Speciation
  • 6.
    • Homology
      • Types of homology
        • Orthology versus Paralogy versus Speciation
          • A complex picture
          • Many available detection strategies – none is perfect
    1. Merkeev I., P. Novichkov, and A. Mironov. 2006. PHOG: a database of supergenomes built from proteome complements. BMC Evolutionary Biology 6:52.
    [Figure labels: out-paralogy, ohnology, in-paralogy]
  • 7.
    • Homology
      • Detecting Homolog Sequences
        • Increasing levels of stringency approach
          • Sequence similarity
            • Similarity search
            • Best reciprocal hit (BRH)
            • Bi-Directional BLAST
          • Similar product function
            • Similar domain architecture
            • Regular expressions (e.g. PROSITE patterns)
            • PSSM (e.g. NCBI CDD)
            • HMM (e.g. Pfam)
            • Common (protein) family
          • Similar syntenic neighbourhood
            • No easy solution
  • 8.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • No Windows tool/server readily available
            • Adapt existing PERL script
            • Refactor from UNIX to Windows environment
            • Lack of experience => effort needed?
            • Migrate to UNIX environment
            • Same problems
            • Develop simple JAVA app
            • Existing experience => smoother path
            • New useful tricks to learn
            • Interface command line applications
            • Multithreading = multitasking
            • In the end unanticipated problems emerged
            • Coding problems
            • Library insufficient documentation
            • GUI development
  • 9.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • Implementation as data pipeline
            • Several (thousands of) code lines
            • Collection of 15 JAVA classes – 3 Packages
            • General routines - bidiblastsup
            • Data structures – bidiblastsup.objects
            • User interface – bidiblastsup.ui
            • Uses 3 third-party libraries
            • BioJava 1.4 – mainly translation tasks
            • DB4O 5.0 – data management and …
            • NeoBio – scoring schemes including ambiguity codes
            • Integrates 4 command line tools
            • NCBI BLAST (blastall -p blastn)
            • align0 (FASTA) – ORF alignment
            • stretcher (EMBOSS) – protein alignment
            • yn00 (PAML) – dN/dS calculation
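
Interfacing a command-line tool such as BLAST from Java can be sketched with ProcessBuilder from the standard library. This is an illustration under stated assumptions rather than BiDiBlast's own code: the class name, file names and E-value are hypothetical placeholders, and the options used (-p, -d, -i, -o, -e) are those of the legacy NCBI blastall program.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper (not taken from BiDiBlast) around the blastall program.
final class BlastCommand {

    /** Builds the argument list for a nucleotide search with legacy blastall. */
    static List<String> build(String database, String queryFasta,
                              String outputFile, String eValue) {
        return Arrays.asList(
                "blastall",
                "-p", "blastn",    // program, as named on the slide
                "-d", database,    // database built beforehand
                "-i", queryFasta,  // query sequences in FASTA format
                "-o", outputFile,  // report file
                "-e", eValue);     // expectation value cutoff (assumed parameter)
    }

    /** Runs the command and returns its exit code; blastall must be on the PATH. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()       // forward the tool's messages to the console
                .start();
        return process.waitFor();
    }
}
```

Keeping the argument list separate from the process launch makes the command easy to log or inspect, and each search could be started from its own thread if several are to run at once.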
  • 10.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • Implementation as data pipeline
            • Swing graphical user interface (GUI)
            • Control over the program run
            • Parameter entry
            • BLAST database building
            • Result dumping
  • 11.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • IS NOT an orthologous gene finding tool!
            • Performs the RBH detection between pools of DNA sequences
            • Customised BLAST / TBLASTX parameters
            • Store indicator values about the results
            • Stores every first BLAST hit
            • Bi-directional – putative orthologs
            • Uni-directional – putative paralogs
            • Aligns the resulting hit sequences by careful global alignment
            • Measures the real length of the aligned regions
            • Proper sequence similarity
            • Translates and aligns the ORF products
            • Global alignment using a given substitution matrix
            • A codon-wise global alignment of the ORF as a by-product
            • Several statistics stored
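
Measuring the real length of the aligned region and a similarity value from a global alignment can be sketched as below. The slides do not say which statistics BiDiBlast stores, so this is an assumption: the class AlignmentStats is hypothetical, the two inputs are taken to be equal-length gapped strings with '-' as the gap character, and similarity is simplified to plain identity over the columns where both sequences have a residue.

```java
// Hypothetical helper (not taken from BiDiBlast): simple statistics
// computed from a pairwise global alignment given as two gapped strings.
final class AlignmentStats {

    /** Number of columns in which neither sequence has a gap. */
    static int alignedColumns(String a, String b) {
        checkSameLength(a, b);
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != '-' && b.charAt(i) != '-') {
                count++;
            }
        }
        return count;
    }

    /** Fraction of gap-free columns holding identical residues (0 if there are none). */
    static double identity(String a, String b) {
        int aligned = alignedColumns(a, b);
        if (aligned == 0) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            if (x != '-' && x == b.charAt(i)) {
                identical++;
            }
        }
        return (double) identical / aligned;
    }

    private static void checkSameLength(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
    }
}
```

Dividing by the gap-free columns rather than by the full sequence length is one possible choice; it keeps long terminal gaps from diluting the value.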
  • 12.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • IS NOT an orthologous gene finding tool!
            • Calculates evolution rates for every hit (pair of sequences)
            • Based on the codon-wise global alignment of the ORFs
            • Dumps the results in delimited text files
            • Follow on processing and analysis
            • Results should be imported into relational database
            • Spreadsheets accepted but not favoured
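
A common way to obtain a codon-wise alignment from a protein alignment is to thread each ORF onto its gapped protein sequence, writing one codon per residue and three gap characters per protein gap. Whether BiDiBlast proceeds exactly like this is not stated in the slides, so the sketch below is an assumption, and the class CodonThreader is hypothetical.

```java
// Hypothetical helper (not taken from BiDiBlast): derives a codon-wise
// DNA alignment row from a gapped protein row and the ungapped ORF.
final class CodonThreader {

    /**
     * @param gappedProtein protein row of the alignment, '-' marking gaps
     * @param orfDna        ungapped coding sequence, in frame from its first base
     * @return DNA row with one codon per residue and "---" per protein gap
     */
    static String thread(String gappedProtein, String orfDna) {
        StringBuilder row = new StringBuilder();
        int position = 0; // next unread base of the ORF
        for (int i = 0; i < gappedProtein.length(); i++) {
            if (gappedProtein.charAt(i) == '-') {
                row.append("---");
            } else {
                if (position + 3 > orfDna.length()) {
                    throw new IllegalArgumentException("ORF is shorter than its protein");
                }
                row.append(orfDna, position, position + 3);
                position += 3;
            }
        }
        // Any bases left over (for example a terminal stop codon) are ignored.
        return row.toString();
    }
}
```

Threading both ORFs of a hit against their respective protein rows gives two DNA rows of equal length, which is the kind of input a codon-based rate estimator needs.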
  • 13.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • IS NOT an orthologous gene finding tool!
            • Result filtering is left to the end user
            • Sequence length mismatches (e.g. 80% to 120%)
            • Similarity threshold
            • Intervening STOP codons as ORF quality control
            • Usage scope
            • Comparative genomics
            • Annotation of ORF from newly sequenced genomes
            • Estimation of evolution rates for sets of sequences
            • etc…
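
The filtering criteria listed above can be sketched as simple predicates. This is a hedged illustration of what an end user might apply to the exported results, not part of BiDiBlast: the class HitFilter and its parameters are hypothetical, the length ratio is assumed to be hit length over query length, the similarity cutoff is left to the caller because the slide gives no value, and stop codons are those of the standard genetic code.

```java
// Hypothetical post-processing filter (not taken from BiDiBlast).
final class HitFilter {

    /** True when hit length / query length lies within [min, max], e.g. 0.8 to 1.2. */
    static boolean lengthRatioOk(int queryLength, int hitLength, double min, double max) {
        double ratio = (double) hitLength / queryLength;
        return ratio >= min && ratio <= max;
    }

    /** True when an in-frame stop codon occurs before the final codon of the ORF. */
    static boolean hasInternalStop(String orfDna) {
        String dna = orfDna.toUpperCase();
        // Stop before the last codon, which is allowed to be a stop.
        for (int i = 0; i + 3 <= dna.length() - 3; i += 3) {
            String codon = dna.substring(i, i + 3);
            if (codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA")) {
                return true;
            }
        }
        return false;
    }

    /** Combines the three criteria from the slide, using the 80% to 120% length window. */
    static boolean keep(int queryLength, int hitLength, double similarity,
                        double minSimilarity, String hitOrfDna) {
        return lengthRatioOk(queryLength, hitLength, 0.8, 1.2)
                && similarity >= minSimilarity
                && !hasInternalStop(hitOrfDna);
    }
}
```

The same three tests translate directly into WHERE clauses once the delimited result files have been imported into a relational database.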
  • 14.
    • Homology
      • Detecting Homolog Sequences
        • Bi-Directional BLAST
          • Future developments
            • Domain architecture detection in products
            • Integration problems
            • What kind of formalism?
            • Downstream or upstream?
            • Other assorted improvements
            • User interface
            • Result management inside the application
            • Synteny integration
            • Matching against whole genome / chromosome / contig