bai2

    • Whole-Genome Prokaryote Phylogeny without Sequence Alignment
      • Bailin HAO and Ji QI
      • T-Life Research Center, Fudan University, Shanghai 200433, China
      • Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China
      • http://www.itp.ac.cn/~hao/
    • Classification of Prokaryotes: A Long-Standing Problem
      • Traditional taxonomy: too few features
          • Morphology: spherical, helical, rod-shaped……
          • Metabolism: photosynthesis, N-fixing, desulfurization……
          • Gram staining: positive and negative
      • SSU rRNA Tree (Carl Woese et al., 1977):
        • 16S rRNA: ancient conserved sequences of about 1500 nucleotides (1.5 kb)
        • Discovery of the three domains of life: Archaea, Bacteria and Eucarya
        • Endosymbiont origin of mitochondria and chloroplasts
    • The SSU rRNA Tree of Life: major progress in the molecular phylogeny of prokaryotes, as evidenced by the history of Bergey’s Manual
    • Bergey’s Manual Trust: Bergey’s Manual
      • 1st Ed. “Determinative Bacteriology”: 1923
      • 8th Ed. “Determinative Bacteriology”: 1974
      • 1st Ed. “Systematic Bacteriology”: 1984-1989, 4 volumes
      • 9th Ed. “Determinative Bacteriology”: 1994
      • 2nd Ed. “Systematic Bacteriology”: 2001-200?, 5 volumes planned; on-line “Taxonomic Outline of the Procaryotes” by Garrity et al. (October 2003)
      • (26 phyla: A1-A2, B1-B24)
    • Our Final Result
      • 132 organisms (16A + 110B + 6E)
      • Input: genome data
      • Output: phylogenetic tree
      • No selection of genes, no alignment of sequences, no fine adjustment whatsoever
      • See the tree first. Story follows.
    • Protein Tree for 145 Organisms from 82 Genera (K=5)
      • 16 Archaea (11 genera, 16 species)
      • 123 Bacteria (65 genera, 98 species)
      • 6 Eukaryotes
    • Complete Bacterial Genomes Appeared since 1995 Early Expectations:
      • More support to the SSU rRNA Tree of Life
      • Add details to the classification (branchings and groupings)
      • More hints on taxonomic revisions
      • Confusion brought by the hyperthermophiles
        • Aquifex aeolicus (Aquae) 1998: 1551335 bp
        • Thermotoga maritima (Thema) 1999: 1860725 bp
        • “Genome Data Shake tree of life”
            • Science 280 (1 May 1998) 672
        • “Is it time to uproot the tree of life?”
            • Science 284 (21 May 1999) 130
        • “Uprooting the tree of life”
            • W. Ford Doolittle, Scientific American (February 2000) 90
    • Debate on Lateral Gene Transfer
      • Extreme estimate: 17% in E. coli
      • Limitations of the above approach
      • B. Wang, J. Mol. Evol. 53 (2001) 244
      • “Phase transition” and “crystallization” of species (C. Woese 1998)
      • Lateral transfer within smaller gene pools as an innovative agent
      • Composition vector may incorporate LGT within small gene pools
      • Alignment-Based Molecular Phylogeny
          • TCAGACGC
          • TCGGAGT
            • T C A G A C G C
            • T C G G A - G T
            • Scoring scheme
            • Gap penalty
            • 16S rRNA tree was based on sequence alignment
        • Problem: sequence alignment cannot be readily applied to complete genomes
        • Homology -> alignment
        • Different genome size, gene content and gene order
        • (Figure: genes A, B, C of the 1st species matched against A′, B′, ? of the 2nd species)
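The role of the scoring scheme and gap penalty in the TCAGACGC / TCGGA-GT example can be illustrated with a minimal sketch. The function name and the particular match, mismatch and gap values are assumptions chosen for illustration; the slides do not give a scoring scheme.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two already-aligned strings ('-' marks a gap).

    The default scores are illustrative assumptions, not values from the talk.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap          # one of the sequences has a gap in this column
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# The aligned pair from the slide: 5 matches, 2 mismatches, 1 gap.
score = score_alignment("TCAGACGC", "TCGGA-GT")
print(score)
```

Changing the gap penalty changes which alignment scores best, which is one reason alignment-based trees depend on parameter choices.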
    • Our Motivations:
      • Develop a molecular phylogeny method that makes use of complete genomes – no selection of particular genes
      • Avoid sequence alignment
      • Try to reach higher resolution to provide an independent comparison with other approaches such as SSU rRNA trees
      • Compare with bacteriologists’ systematics as reflected in Bergey’s Manual (2001, 2002)
      • Our paper has been accepted by the Journal of Molecular Evolution
    • Other Whole-Genome Approaches
      • Gene content
      • Presence or absence of COGs
      • Conserved Gene Pairs
      • “Information” distances
      • Domain order in proteins (Ken Nishikawa’s talk at InCoB2003)
    • Comparison of Complete Genomes/Proteomes
      • Compositional vectors
        • Nucleotides: a, t, c, g
        • Example sequence: aatcgcgcttaagtc
        • Di-nucleotide (K=2) distribution:
      • {aa at ac ag ta tt tc tg ca ct cc cg ga gt gc gg}
      • { 2, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 2, 0}
      • K-strings make a composition vector
          • DNA sequence → vector of dimension 4^K
          • Protein sequence → vector of dimension 20^K
          • Given a genomic or protein sequence → a unique composition vector
          • The converse: a vector → one or more sequences?
          • K big enough → uniqueness
          • Connection with the number of Eulerian loops in a graph (a separate study available as a preprint at ArXiv:physics/0103028 and from Hao’s webpage)
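The K-string counting described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names are assumptions, and the alphabet order "atcg" is chosen so that the entries follow the order aa, at, ac, ag, ta, … used on the slide.

```python
from collections import Counter
from itertools import product


def kstring_counts(seq, k):
    """Count every overlapping K-string in seq with a sliding window of width k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_vector(seq, k, alphabet):
    """Return the counts for all len(alphabet)**k possible K-strings, in a fixed order."""
    counts = kstring_counts(seq, k)
    return [counts["".join(word)] for word in product(alphabet, repeat=k)]


seq = "aatcgcgcttaagtc"          # the example sequence from the slide
vec = composition_vector(seq, 2, "atcg")
print(len(vec), sum(vec))        # 16 possible dinucleotides, 14 overlapping windows
```

For proteins the same functions apply with a 20-letter alphabet, although at K=5 or K=6 a sparse dictionary of observed strings is more practical than a dense list.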
    • A Key Improvement: Subtraction of Random Background
      • Mutations take place randomly at the molecular level
      • Selection shapes the direction of evolution
      • Many neutral mutations remain as random background
      • At the single amino acid level, protein sequences are quite close to random
      • Highlight the role of selection by subtracting a random background
    • Frequency and Probability
      • A sequence of length L
      • A K-string a_1 a_2 … a_K
      • Frequency of appearance f(a_1 a_2 … a_K)
      • Probability p(a_1 a_2 … a_K) = f(a_1 a_2 … a_K) / (L − K + 1)
    • Predicting #(K-strings) from that of lengths (K-1) and (K-2) strings
      • Joint probability vs. conditional probability
      • Making the weakest Markov assumption:
      • Another joint probability:
    • (K-2)-th Order Markov Model
      • Change to frequencies:
      • Normalization factor may be ignored when L>>K
      • Construct compositional vectors using these modified string counts:
        • For the i-th string type of species A we use
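The formulas for these slides are not present in the transcript. The block below is a reconstruction under stated assumptions: it follows the steps the slides name (joint versus conditional probability, the weakest Markov assumption, the change to frequencies) and agrees with the worked GKSTL and HAMSC examples later in the talk. The symbols f, p, the superscript 0 for the Markov prediction, and a_i for a vector component are assumed notation.

```latex
% Probability of a K-string in a sequence of length L (there are L-K+1 windows)
p(a_1 a_2 \cdots a_K) = \frac{f(a_1 a_2 \cdots a_K)}{L-K+1}

% Joint vs. conditional probability, then the weakest Markov assumption:
% the last letter depends only on the preceding K-2 letters
p(a_1 \cdots a_K) = p(a_K \mid a_1 \cdots a_{K-1})\, p(a_1 \cdots a_{K-1})
                  \approx p(a_K \mid a_2 \cdots a_{K-1})\, p(a_1 \cdots a_{K-1})

% Another joint probability fixes the conditional one
p(a_2 \cdots a_K) = p(a_K \mid a_2 \cdots a_{K-1})\, p(a_2 \cdots a_{K-1})

% (K-2)-th order Markov prediction
p^0(a_1 \cdots a_K) =
  \frac{p(a_1 \cdots a_{K-1})\, p(a_2 \cdots a_K)}{p(a_2 \cdots a_{K-1})}

% Change to frequencies; the last factor tends to 1 when L >> K
f^0(a_1 \cdots a_K) =
  \frac{f(a_1 \cdots a_{K-1})\, f(a_2 \cdots a_K)}{f(a_2 \cdots a_{K-1})}
  \cdot \frac{(L-K+1)(L-K+3)}{(L-K+2)^2}

% Component of the composition vector for the i-th string type
a_i = \frac{f_i - f_i^0}{f_i^0} \quad (\text{taken as } 0 \text{ when } f_i^0 = 0)
```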
    • Composition Distance
      • Define the correlation between two compositional vectors as the cosine of the angle between them
        • From two complete proteomes:
          • A: {a_1, a_2, ……, a_n}, n = 20^5 = 3 200 000
          • B: {b_1, b_2, ……, b_n}
          • C(A,B) ∈ [-1,1]
      • Distance
        • D(A,B)∈[0,1]
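A minimal sketch of the correlation and distance follows. The cosine is as stated on the slide; the mapping D = (1 − C)/2 is an assumption, chosen because it is the simplest linear map that sends C ∈ [-1,1] to D ∈ [0,1] as the slide requires. Vectors are stored as sparse dictionaries because most of the 20^5 components are zero; the example values are hypothetical.

```python
import math


def correlation(a, b):
    """Cosine of the angle between two sparse vectors given as {k_string: value}."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for an empty vector (an assumption of this sketch)
    return dot / (norm_a * norm_b)


def distance(a, b):
    """Map the correlation in [-1, 1] to a distance in [0, 1] (assumed form)."""
    return (1.0 - correlation(a, b)) / 2.0


# Hypothetical component values, for illustration only
species_a = {"GKSTL": 0.6, "HAMSC": 2.0}
species_b = {"GKSTL": 0.5, "AAAAA": -0.3}
print(distance(species_a, species_b))
```

Identical vectors give distance 0, orthogonal vectors 0.5, and exactly opposite vectors 1.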
    • Materials
      • Genomes from NCBI (ftp.ncbi.nih.gov/genomes/Bacteria/), not the original GenBank files
      • 6 eukaryote genomes were included for reference
      • Tree construction: Neighbor-Joining in Phylip
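Handing a distance matrix to Phylip for Neighbor-Joining requires a plain-text file: the number of taxa on the first line, then one row per taxon with the name occupying the first ten characters. The sketch below writes that square format; the function name, taxon labels and distances are hypothetical, and how the authors actually prepared their input is not stated in the talk.

```python
def phylip_distance_matrix(names, dist):
    """Format a square distance matrix as PHYLIP text.

    names: list of taxon labels (truncated or padded to 10 characters)
    dist:  list of rows, dist[i][j] = distance between taxon i and taxon j
    """
    lines = [f"{len(names):5d}"]
    for i, name in enumerate(names):
        label = name[:10].ljust(10)
        row = " ".join(f"{dist[i][j]:.6f}" for j in range(len(names)))
        lines.append(f"{label} {row}")
    return "\n".join(lines) + "\n"


# Hypothetical labels and distances, for illustration only
names = ["Ecoli", "Aquae", "Thema"]
dist = [
    [0.00, 0.42, 0.40],
    [0.42, 0.00, 0.35],
    [0.40, 0.35, 0.00],
]
text = phylip_distance_matrix(names, dist)
print(text)
```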
    • Protein Tree for 132 species (K=5)
      • 16 Archaea (11 genera, 16 species)
      • 110 Bacteria (57 genera, 88 species)
      • 6 Eukaryotes
    • Protein Tree for 132 species (K=6)
      • 16 Archaea (11 genera, 16 species)
      • 110 Bacteria (57 genera, 88 species)
      • 6 Eukaryotes
    • Protein Class vs. Whole Proteome
      • Trees based on a collection of ribosomal proteins (SSU + LSU): ribosomal proteins are interwoven with rRNA to form a functioning complex; results are consistent with SSU rRNA trees
      • Trees based on a collection of aminoacyl-tRNA synthetases (AARS): trees based on a single AARS were not good; trees based on all 20 AARSs were much better, but not as good as those based on ribosomal proteins
    • Genus Tree based on Ribosomal Proteins
    • A Genus Tree based on Aminoacyl tRNA synthetases
    • Chloroplast Tree
      • Sequences of about 100 000 bp
      • Tree of the endosymbiont partners
      • Paper accepted by Molecular Biology and Evolution on 12 August 2003
    • Chloroplast tree
    • Coronaviruses including Human SARS-CoV
      • Sequences of tens of kilobases
      • SARS sequence: about 29730 bases
      • Paper published in Chinese Science Bulletin on 26 June 2003
    • Coronavirus tree
    • Understanding the Subtraction Procedure: Analysis of Extreme Cases in E. coli
      • There are 1 343 887 5-strings belonging to 841832 different types.
      • Maximal count before subtraction: 58 for the 5-peptide GKSTL; 58 reduces to 0.646 after subtraction.
      • Maximal component after subtraction: 197 for the 5-peptide HAMSC. The number 197 came from a single count 1 before the subtraction.
    • GKSTL: how 58 reduces to 0.646?
      • #(GKST)=113
      • #(KSTL)=77
      • #(KST)=247
      • Markov prediction: 113*77/247=35.23
      • Final result: (58-35.23)/35.23=0.646
    • HAMSC: how 1 grows to 197?
      • #(HAMS)=1
      • #(AMSC)=1
      • #(AMS)=198
      • Markov prediction: 1*1/198=1/198
      • Final result: (1-1/198)/(1/198)=197
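The two worked examples above can be reproduced with a short sketch. The function names are assumptions, and the length-dependent normalization factor is omitted, which the talk says is permissible when L >> K.

```python
def markov_prediction(f_prefix, f_suffix, f_middle):
    """Predicted count of a K-string from the counts of its (K-1)-prefix,
    (K-1)-suffix and shared (K-2)-middle string."""
    return f_prefix * f_suffix / f_middle


def subtracted_component(f_observed, f_predicted):
    """Relative deviation of the observed count from the Markov prediction."""
    return (f_observed - f_predicted) / f_predicted


# GKSTL: #(GKST)=113, #(KSTL)=77, #(KST)=247, observed 58 times
gkstl_pred = markov_prediction(113, 77, 247)
gkstl = subtracted_component(58, gkstl_pred)
print(round(gkstl_pred, 2), round(gkstl, 3))   # 35.23 0.646

# HAMSC: #(HAMS)=1, #(AMSC)=1, #(AMS)=198, observed once
hamsc_pred = markov_prediction(1, 1, 198)
hamsc = subtracted_component(1, hamsc_pred)
print(round(hamsc))                            # 197
```

A frequent string that is fully explained by its substrings is suppressed, while a rare string whose substrings predict almost nothing is amplified.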
    • 6121 Exact Matches of GKSTL in PIR Rel.1.26 with >1.2 Million Proteins
      • These 6121 matches come from a diverse taxonomic assortment, from viruses to bacteria to fungi to plants and animals, including humans
      • In the parlance of classic cladistics, GKSTL contributes to plesiomorphic characters that should be eliminated in a strict phylogeny
      • The subtraction procedure did the job.
    • 15 Exact Matches of HAMSC in PIR Rel.1.26 with >1.2 Million Proteins
      • 1 match from a eukaryotic protein
      • 4 matches (the same protein) from viruses
      • 10 matches from prokaryotes, among which
      • 3 from Shigella and E. coli (HAMSCAPDKE)
      • 3 from Salmonella (HAMSCAPERD)
      • HAMSC is characteristic for prokaryotes
      • HAMSCA is specific for enterobacteria
    • Stable Topology of the Tree
      • K=1: makes some sense!
      • K=2,3,4: topology gradually converges
      • K=5 and K=6: present calculation
      • K=7 and more: too high resolution; star-tree or bush expected
    • Statistical Test of the Tree
      • Bootstrap versus jackknife
      • Bootstrap in sequence alignments
      • “Bootstrap” by random selections from the AA-sequence pool
      • A time-consuming job
      • 180 bootstraps for 72 species
    • About 70% of the genes of every species were selected in one bootstrap
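One resampling round of the kind described here might look like the sketch below. It assumes that "random selection" means drawing about 70% of a species' protein sequences without replacement; the talk does not specify the sampling scheme, and the function name and placeholder proteome are hypothetical.

```python
import random


def resample_proteome(proteins, fraction=0.7, rng=None):
    """Draw a random subset of a species' proteins for one resampling round.

    Sampling is without replacement, which is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    n_selected = round(fraction * len(proteins))
    return rng.sample(proteins, n_selected)


# Hypothetical proteome of 10 placeholder sequences
proteome = [f"protein_{i}" for i in range(10)]
subset = resample_proteome(proteome, rng=random.Random(0))
print(len(subset))
```

Each round would rebuild the composition vectors from the subsets, recompute the distance matrix, and construct a new tree; branch support is then the share of rounds in which a given grouping reappears.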
    • “ K-string Picture” of Evolution
      • K=5 → 3 200 000 points in the space of 5-strings
      • K=6 → 64 000 000 points
      • In the primordial soup: short polypeptides of a limited assortment
      • Evolution by growth, fusion, mutation leads to diffusion in the string space
      • String space not saturated yet
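The sizes quoted above, and the sense in which the string space is "not saturated", can be checked with a little arithmetic. The sketch uses the figure of 841832 distinct 5-string types reported for E. coli earlier in the talk; the function name is an assumption.

```python
def string_space_size(k, alphabet_size=20):
    """Number of distinct K-strings over an alphabet of the given size."""
    return alphabet_size ** k


observed_types = 841832   # distinct 5-strings in E. coli, as quoted in the talk
occupancy = observed_types / string_space_size(5)

print(string_space_size(5), string_space_size(6))
print(round(occupancy, 3))
```

Roughly a quarter of the possible 5-strings occur in the E. coli proteome, so most of the space is still empty.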
    • The Problem of Higher Taxa
      • 1974: Bacteria as a separate kingdom
      • 1994: Archaea and Bacteria as two domains
      • The relation of higher taxa?
      • Summary
      • Composition vectors do not depend on genome size and gene content, so the use of whole-genome data is straightforward
      • Data independent of 16S rRNA
      • Method different from that based on SSU rRNA
      • Results agree with SSU rRNA trees and Bergey’s Manual
      • Hint on groupings of higher taxa
      • A method without “free parameters”: data in, tree out
      • Possibility of an automatic and objective classification tool for prokaryotes
    • Conclusion: The Tree of Life is saved! There is phylogenetic information in the prokaryotic proteomes. Time to work on molecular definition of taxa. Thank you!
    • A Failed Attempt Using Avoidance Signatures
    • Comparison with the Bergey’s Manual
      • Tree Construction
        • phylip package of J. Felsenstein (Neighbor-Joining)
        • The Fitch method is not feasible here, nor are non-distance-matrix methods (MP, ML, etc.)
      • Material
        • ftp://ncbi.nlm.nih.gov/genomes/Bacteria/
                  Phyla  Classes  Orders  Families  Genera  Species  Strains
        Archaea       2        7       9         9       9       13       13
        Bacteria      9       14      23        28      37       46       57
        Total        11       21      32        37      46       59       70
    • Early expectation from genome data
      • Was there intensive lateral gene transfer?
      • Gene tree cannot be equated to the real tree of life
      • Genome data: 10^6 to 10^7
      • Difficult to align whole genome data
      • Prokaryote and Eukaryote
      • Three Kingdoms (Carl Woese, 16S rRNA)
        • Archaea
        • Eubacteria
        • Eukarya
      • Five Kingdoms (Lynn Margulis)
        • Bacteria (Archaea, Eubacteria)
        • Protoctista
        • Animalia
        • Fungi
        • Plantae
      • Common features of Archaea and Eubacteria: small cells, no nuclear membrane, circular DNA, no cap at the 5′ end of mRNA, presence of Shine-Dalgarno (S-D) segments
      • Many proteins associated with replication, transcription, and translation are common to Archaea and Eukaryotes
      • Features of Archaea: lack of some enzymes, insensitive to some antibiotics
      • “Compositional Representation of Protein Sequences and the Number of Eulerian Loops”
      • by Bailin Hao, Huimin Xie, Shuyu Zhang
        • K=5: 76.7% of proteins have a unique reconstruction
        • K=6: 94.0%
        • K=10: >99%
          • Checked 2820 AA-seqs from pdb.seq, a special selection of SWISS-PROT
          • See Los Alamos National Lab E-Archive: physics/0103028
    • Subtraction of Random Background
      • Using a (K-2)-order Markov Model
      • K=2: genomic signature by Karlin and Burge
      • May be justified by the Maximum Entropy Principle with appropriate constraints (Hu & Wang, 2001)
    • What to do next
      • Detailed comparison with traditional taxonomy
      • Add more eukaryotes
      • Elucidation of the foundation and limitations of the compositional approach
      • Software and web interface
      • Problem of lateral gene transfer
      • Viruses ?
      • Confusion brought by the hyperthermophiles
        • Aquifex aeolicus (Aqua) 1998: 1551335 bp
        • Thermotoga maritima (Tmar) 1999: 1860725 bp
        • “Genome Data Shake tree of life”
            • Science 280 (1 May 1998) 672
        • “Is it time to uproot the tree of life?”
            • Science 284 (21 May 1999) 130
        • “Uprooting the tree of life”
          • Sci. Amer. (February 2000) 90
          • Problem of Lateral Gene Transfer (LGT): tree or network
          • Problem of higher taxa