3. TIGRTIGR
Topics of Discussion
• Introduction to evolution
• Introduction to phylogenomics
• Phylogenomic examples
– Species evolution
– Uncultured organisms
– Functional predictions
– Gene duplication
– Genome rearrangements
4. TIGRTIGR
Eisen Genome Projects
• Extremophiles, DNA repair, models: Deinococcus radiodurans, Haloferax volcanii, Tetrahymena thermophila
• Novel phylogenetic groups: Tree of Life
• Endosymbionts: Wolbachia, Baumannia, chemosynthetic symbionts, Prochloron
• Evolution of C1 metabolism: Carboxydothermus, Methylococcus, Chlorobium, Acidithiobacillus
5. TIGRTIGR
Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
6. TIGRTIGR
Comparative Genomics
• Comparison of genomes between species
• Identify differences
– SNPs, Indels
– Rearrangements
– Presence/absence of genes, pathways, features
• Correlating with phenotypic differences
• Can be used to improve on every step in
genome analysis
11. TIGRTIGR
Evolutionary Perspective and
Comparative Biology
• Comparative biology is the analysis of differences
and similarities between species.
• An evolutionary perspective is useful in such studies
because this allows one to focus not just on the
levels and degrees of similarity or difference but on
how and why similarities and differences came to
be.
12. TIGRTIGR
Phylogenomics
• Genome sequences contain a record of the
evolution of a species and all its genes
• Evolutionary analysis is the key to interpreting
genome sequences and making the most use out of
them
• There is a feedback loop between evolutionary
and genome analysis such that they should be
done together.
14. TIGRTIGR
Why Completeness is Important
• Improves characterization of genome features
• Better comparative genomics
• Presence/absence is less subjective
• Missing sequence might be important (e.g.,
centromere)
• Allows researchers to focus on biology not
sequencing
• Facilitates large scale correlation studies
• Controls for contamination
16. TIGRTIGR
• Analysis of S. pombe genome by Wood et al. 2002
• Compared the predicted proteomes of all
completed genomes of eukaryotes to those of
prokaryotes
• Asked: “Are there genes found in all eukaryotes
with no obvious homologs in any prokaryote?”
Eukaryotes vs. Prokaryotes
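The question on this slide reduces to a set operation once homolog searches have been summarized as a presence table. The sketch below is only an illustration of that logic, not the pipeline Wood et al. used; the family names, genome lists, and the `eukaryote_specific` function are hypothetical.

```python
# Hypothetical toy data: gene family -> genomes in which a homolog was detected.
families = {
    "tubulin": {"S. pombe", "S. cerevisiae", "H. sapiens", "A. thaliana"},
    "recA_like": {"S. pombe", "S. cerevisiae", "H. sapiens", "A. thaliana", "E. coli"},
    "fungal_only": {"S. pombe", "S. cerevisiae"},
}
eukaryotes = {"S. pombe", "S. cerevisiae", "H. sapiens", "A. thaliana"}
prokaryotes = {"E. coli", "M. jannaschii"}


def eukaryote_specific(families, eukaryotes, prokaryotes):
    """Families present in every eukaryote and absent from every prokaryote."""
    return sorted(
        name
        for name, genomes in families.items()
        if eukaryotes <= genomes and not (genomes & prokaryotes)
    )


specific = eukaryote_specific(families, eukaryotes, prokaryotes)
print(specific)
```

The result depends entirely on which genomes are in the two sets, which is one reason complete genomes matter for presence/absence claims.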
18. TIGRTIGR
Eukaryotic Specific Genes
• >200 genes found including:
– Cytoskeleton components: tubulin, ankyrin,
myosin
– Protein degradation: ubiquitin, proteases
– Chromatin and DNA packaging
• Of the >200, many had no known function and could
encode novel eukaryote-wide processes
19. TIGRTIGR
Multi- vs. Single-Cellular Eukaryotes
• Further analysis of S. pombe genome
• Compared multi-cellular vs. single-cellular
eukaryotes (animals and plants vs. yeast)
• “Are there genes in all multi-cellular and not in
any single-cellular?”
• Found only 3
• Concluded that the genetic basis of multi-
cellularity was likely to be gene regulation and not
invention of new genes
22. TIGRTIGR
Selecting Genome Projects
• Economic importance
• Relevance to human disease
• Biochemical or physiological novelty
• Ecological importance
• Phylogenetic position
23. TIGRTIGR
Selecting Genome Projects
• If all else is (roughly) equal, select the most
experimentally tractable organisms
– Deinococcus radiodurans
– Chlorobium tepidum
– Tetrahymena thermophila
• Genome sequences are powerful tools for
launching experimental studies for those
organisms
• However, not all important organisms work nicely
in the lab
29. TIGRTIGR
Selection Apparently Inefficient in wMel
• Likely not due to higher mutation rate
– Full suite of DNA repair genes
• Likely not due to low amounts of homologous
recombination
– RecA present
– Population studies suggest homologous recombination occurs
• Wolbachia has multiple types of bottlenecks
– Maternal transmission like obligate mutualists
– Infectious sweeps of cytoplasmic incompatibility like pathogens
Wu et al., 2004
30. TIGRTIGR
Glassy-winged Sharpshooter
• Sap-feeding insects
• Carriers of Xylella fastidiosa, which causes Pierce’s disease of grapevines
• There are >20,000 sharpshooter species, within which intracellular symbiotic bacteria are widespread
Baumannia cicadellinicola:
1° symbionts of the Glassy-winged Sharpshooter
35. TIGRTIGR
9359 clones that are not included in the final assembly
Run_TA
7152 assembles (400 have been assembled)
<1kb 6996
1kb-2kb 125
2kb-3kb 18
3kb-4kb 6
4kb-5kb 3
5kb-6kb 2
6kb-7kb 1
7kb-8kb 1
Sequences from Another Symbiont
150 Bacteroides/Chlorobi (njtree/blast)
38. TIGRTIGR
Beyond rRNA II: Metagenomics
• Isolate, by filtration, all microbes in a sample
• Extract total DNA in very large pieces
• Clone those pieces as BACs into E. coli to obtain enough DNA
• Identify which BAC contains phylogenetic marker of interest
• Sequence the BACs like a bacterial genome.
[Workflow: Sample → Filter, concentrate → Extract DNA → Clone into BACs → Sequence → Gene list]
39. TIGRTIGR
Using an rRNA anchor allowed the identification of a new
form of phototrophy: Proteorhodopsin
Beja et al. 2000
43. TIGRTIGR
Beyond rRNA III:
Shotgun Environmental Sequencing
shotgun
sequence
Warner Brothers, Inc.
44. TIGRTIGR
Sargasso Sea
• High microbial diversity
• Most of the abundant rRNA phylotypes have
never been cultured
• Physiological processes of microbes largely
unknown
• Well studied in terms of oceanographic parameters
49. TIGRTIGR
Phylogenomics and Species Evolution II:
Biased Sample of Genomes
• Of 40 bacterial phyla
most genome
sequences come from
only 3 groups
Hugenholtz 2002
50. TIGRTIGR
# of Bacterial Phyla Sequenced
[Chart: Total # of Bacterial Phyla with a Genome Sequenced (y-axis, 0 to 25) by Year (x-axis, 1995 to 2005)]
51. TIGRTIGR
# of Bacterial Phyla Sequenced
[Chart: Total # of Bacterial Phyla with a Genome Sequenced (y-axis, 0 to 25) by Year (x-axis, 1995 to 2005)]
56. TIGRTIGR
Predicting Function
• Identification of motifs
– Short regions of sequence similarity that are indicative of
general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other
sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs
and multiple genes, not all with the same function
58. TIGRTIGR
Blast Search of H. pylori “MutS”
Score E
Sequences producing significant alignments: (bits) Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN 117 3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN 69 1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN 64 3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN 62 2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN 57 4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN 57 4e-07
• Blast search pulls up Syn. sp MutS#2 with a much
higher score (much lower E-value) than other MutS homologs
Eisen, 1997
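To make the ranking in this table concrete, the sketch below re-sorts the hits shown on the slide by bit score and by E-value. The scores and E-values are copied from the table; the `hits` list and variable names are only for illustration. It shows that a higher bit score goes with a lower E-value, and that the top hit stands well apart from the rest. It does not by itself say which hit is the ortholog.

```python
# (SwissProt ID, bit score, E-value) as listed on the slide.
hits = [
    ("MUTS_SYNY3", 117, 3e-25),
    ("MUTS_THEMA", 69, 1e-10),
    ("MUTS_HAEIN", 64, 3e-09),
    ("MUTS_SALTY", 62, 2e-08),
    ("MUTS_AQUAE", 57, 4e-07),
    ("MUTS_ECOLI", 57, 4e-07),
]

# Best hit by E-value: smaller is more significant.
best = min(hits, key=lambda h: h[2])

# Sorting by descending score and by ascending E-value gives the same order.
by_score = sorted(hits, key=lambda h: -h[1])
by_evalue = sorted(hits, key=lambda h: h[2])

# Gap in bit score between the top hit and the runner-up.
score_gap = by_score[0][1] - by_score[1][1]
print(best[0], score_gap)
```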
60. TIGRTIGR
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins; tip labels are abbreviated species names (e.g., Ecoli, Synsp, Helpy, Yeast, Human); branching structure not recoverable from the extracted text]
69. TIGRTIGR
DNA Ligase Tree
[Figure: phylogenetic tree of DNA ligases from eukaryotes, viruses, archaea, and bacteria; labeled clades: Ligase IV, Viral ligases, Ligase I, Archaeal Ligase]
70. TIGRTIGR
Problems with Similarity Based
Functional Prediction
• Prone to database error propagation.
• Cannot identify orthologous groups reliably.
• Perform poorly in cases of evolutionary rate
variation and non-hierarchical trees (similarity will
not reflect evolutionary relationships in these cases)
• May be misled by modular proteins or large
insertion/deletion events.
• Are not set up to deal with expanding data sets.
73. TIGRTIGR
Domains: AlkA Domain (O6-Me-G glycosylase); Ogt Domain (O6-Me-G alkyltransferase); Ada Domain (transcription regulator)
Proteins shown: Ada E. coli, Ada H. infl, Ogt E. coli, Ogt H. infl, Ogt Gram+, Ogt D. radio, AlkA Gram+, AlkA E. coli, MGMT Euks
Alkylation Repair Genes
75. TIGRTIGR
Types of Molecular Homology
• Homologs: genes that are descended from a common
ancestor (e.g., all globins)
• Orthologs: homologs that have diverged after speciation
events (e.g., human and chimp β-globins)
• Paralogs: homologs that have diverged after gene
duplication events (e.g., α and β globin).
• Xenologs: homologs that have diverged after lateral
transfer events
• Positional homology: common ancestry of specific amino
acid or nucleotide positions in different genes
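The ortholog/paralog distinction above can be read directly off a gene tree whose internal nodes are labeled with the event they represent: two genes are orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. The sketch below is a minimal illustration under that assumption; the nested-tuple tree encoding, the gene names, and the `relationship` function are hypothetical, and it ignores lateral transfer (xenologs).

```python
def leaves(node):
    """Return the set of gene names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, a, b):
    """Classify two distinct genes in the tree by the event at their last common ancestor."""
    event, left, right = tree
    for child in (left, right):
        members = leaves(child)
        if a in members and b in members:
            # Both genes sit below this child, so their common ancestor is deeper.
            return relationship(child, a, b)
    return "orthologs" if event == "speciation" else "paralogs"


# Toy globin tree: an old duplication (alpha vs. beta) followed by speciation.
globins = (
    "duplication",
    ("speciation", "human_alpha", "chimp_alpha"),
    ("speciation", "human_beta", "chimp_beta"),
)

print(relationship(globins, "human_beta", "chimp_beta"))
print(relationship(globins, "human_alpha", "human_beta"))
```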
77. TIGRTIGR
DNA Repair Genes in D.
radiodurans Complete Genome
Process: Genes in D. radiodurans
• Nucleotide Excision Repair: UvrABCD, UvrA2
• Base Excision Repair: AlkA, Ung, Ung2, GT, MutM, MutY-Nths, MPG
• AP Endonuclease: Xth
• Mismatch Excision Repair: MutS, MutL
• Recombination
– Initiation: RecFJNRQ, SbcCD, RecD
– Recombinase: RecA
– Migration and resolution: RuvABC, RecG
• Replication: PolA, PolC, PolX, phage Pol
• Ligation: DnlJ
• dNTP pools, cleanup: MutTs, RRase
• Other: LexA, RadA, HepA, UVDE, MutS2
78. TIGRTIGR
Problem:
The list of DNA repair gene homologs
in the D. radiodurans genome is not
significantly different from that of other
bacterial genomes of similar size
83. TIGRTIGR
Non-Homology Prediction:
Phylogenetic Profiles
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene
found in each other species
• Cluster genes by distribution
patterns (profiles)
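The steps above can be sketched in a few lines: turn the search results into a yes/no vector per gene, then group genes whose vectors match. This is a minimal sketch, assuming the searches have already been reduced to a presence table; the gene and genome names and the two functions are hypothetical, and grouping identical profiles is the simplest stand-in for the clustering step (real analyses typically cluster on a distance between profiles).

```python
from collections import defaultdict


def build_profiles(hits, genomes):
    """hits: gene -> set of genomes with a homolog. Returns gene -> tuple of 0/1."""
    return {
        gene: tuple(int(genome in found) for genome in genomes)
        for gene, found in hits.items()
    }


def cluster_identical(profiles):
    """Group genes that share exactly the same presence/absence profile."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[profile].append(gene)
    return {profile: sorted(genes) for profile, genes in groups.items()}


genomes = ["sp1", "sp2", "sp3"]
hits = {
    "geneA": {"sp1", "sp3"},
    "geneB": {"sp1", "sp3"},
    "geneC": {"sp2"},
}

profiles = build_profiles(hits, genomes)
clusters = cluster_identical(profiles)
print(clusters)
```

Genes that land in the same group (here `geneA` and `geneB`) become candidates for a shared pathway or process, which is how the profile slides that follow are meant to be read.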
85. TIGRTIGR
Chlorobium tepidum Strain TLS
C. tepidum mat in highly sulfidic
“Travelodge Stream”,
Rotorua, New Zealand
(from Castenholz and Pierson, 1995)
Phase contrast photomicrograph
of a 48-hour culture and electron
micrograph of a thin cell section
(from Wahlund et al., 1991)
86. TIGRTIGR
Phylogenetic Profile -
C. tepidum Chlorophyll
Synthesis
Wu and Eisen, unpublished
5002_cobalamin biosynthesis protein CbiG/precorrin-4 C11-methyltransferase
3939_precorrin-3B C17-methyltransferase/precorrin-8X methylmutase cbiJH
882_cobyric acid synthase cbiP
3160_dsrN protein dsrN
862_cobyrinic acid a,c-diamide synthase cbiA-1
4010_cobN protein, putative
2641_magnesium-protoporphyrin methyltransferase bchH-3
1498_magnesium-protoporphyrin methyltransferase bchH-1
4003_cobN protein, putative
2636_magnesium-protoporphyrin methyltransferase bchH-2
4008_magnesium-chelatase, subunit I chlI-2
4007_magnesium-chelatase, subunit D/I family
1504_magnesium-chelatase, subunit I chlI-1
88. TIGRTIGR
PG Profile of C. tepidum RbcL
suggests link with Sulfur Metabolism
from Eisen et al., 2002
and see Hanson and Tabita 2001
CT1893 sulfhydrogenase, delta subunit hydD
CT1681 ABC transporter, permease protein
CT2206 polysaccharide efflux transporter, putative
CT1271 glycosyl transferase
CT1965 conserved hypothetical protein
CT2256 geranylgeranyl hydrogenase bchP
CT0011 deoxyhypusine synthase, putative
CT1772 ribulose bisphosphate carboxylase, large subunit rbcL
CT1894 sulfhydrogenase, alpha subunit hydA
CT0472 conserved hypothetical protein
CT0274 carbon-nitrogen hydrolase family protein
CT1891 sulfhydrogenase, beta subunit hydB-1
CT1892 sulfhydrogenase, gamma subunit hydG-1
CT1250 sulfhydrogenase, gamma subunit hydG-2
CT1249 sulfhydrogenase, beta subunit hydB-2
89. TIGRTIGR
Carboxydothermus hydrogenoformans
• Isolated in Yellowstone
• Thermophile (grows at 80°C)
• Anaerobic
• Grows on CO (Carbon Monoxide)
• Produces hydrogen gas
• Low GC Gram-positive species
• Many Archaeal-like genes
93. TIGRTIGR
Why Duplications Are Useful to Identify
• Allows division into orthologs and paralogs
• Improves functional predictions
• Helps identify mechanisms of duplication
• Can be used to study mutation processes in different
parts of a genome
• Lineage-specific duplications may be indicative of
species-specific adaptations
94. TIGRTIGR
C. pneumoniae - All Paralogs
[Dot plot: Query ORF position (x-axis) vs. Subject ORF position (y-axis); both axes 0 to 1,250,000]
95. TIGRTIGR
C. pneumoniae Lineage-Specific Paralogs
[Dot plot: Query ORF position (x-axis) vs. Subject ORF position (y-axis); both axes 0 to 1,250,000]
96. TIGRTIGR
E. coli Paralogs - All
[Dot plot: Query ORF coordinates (x-axis) vs. Match coordinates (y-axis); both axes 0 to 4,500,000]
97. TIGRTIGR
E. coli Paralogs - Top
[Dot plot: Query ORF coordinates (x-axis) vs. Match coordinates (y-axis); both axes 0 to 4,500,000]
100. TIGRTIGR
B. anthracis lineage specific duplications
ORF04205 molybdopterin biosynthesis protein MoeA (moeA)
ORF05907 molybdopterin biosynthesis protein MoeA (moeA)
ORF02636 molybdopterin biosynthesis protein MoeA (moeA)
ORF04204 molybdopterin biosynthesis protein MoeB, putative
ORF05908 molybdopterin biosynthesis protein MoeB, putative
ORF02634 molybdopterin biosynthesis protein MoeB, putative
ORF05904 molybdopterin converting factor, subunit 1 (moaD)
ORF02639 molybdopterin converting factor, subunit 1 (moaD)
ORF04206 molybdopterin converting factor, subunit 2 (moaE)
ORF05905 molybdopterin converting factor, subunit 2 (moaE)
ORF02638 molybdopterin converting factor, subunit 2 (moaE)
101. TIGRTIGR
S. aureus Lineage Specific Duplications
ORF02715 4-diphosphocytidyl-2C-methyl-D-erythritol synthase, putative
ORF02712 alcohol dehydrogenase, zinc-containing
ORF00701 alpha-hemolysin precursor (2X)
ORF00717 antibacterial protein
ORF02597 capsular polysaccharide biosynthesis proteins CapABC (2X)
ORF00804 cell wall hydrolase (3X)
ORF00657 cell wall surface anchor family protein (2X)
ORF00358 clumping factor (2X)
ORF01758 deoxyribose-phosphate aldolase (deoC)
ORF02579 purine nucleoside phosphorylase (deoD)
ORF01031 drug transporter, putative
ORF00805 endopeptidase resistance gene (eprH)
ORF00706 exotoxin 1,3,4,5, unknown (2X)
ORF02184 fibronectin(2X)
ORF00097 glycosyl transferase, group 1 family protein (3X)
ORF02086 IgG-binding protein (2X)
ORF02431 integrase/recombinase, core domain family (3X)
Analysis done with S. Gill
102. TIGRTIGR
S. aureus Lineage Specific Duplications
ORF00137 conserved hypothetical protein
ORF00138 conserved hypothetical protein
ORF00139 conserved hypothetical protein
ORF00140 conserved hypothetical protein
ORF00141 conserved hypothetical protein
ORF00142 conserved hypothetical protein
ORF00143 conserved hypothetical protein
ORF00144 conserved hypothetical protein
ORF00145 conserved hypothetical protein
ORF00146 conserved hypothetical protein
ORF00148 conserved hypothetical protein
ORF00667 conserved hypothetical protein
ORF01251 conserved hypothetical protein
ORF02160 conserved hypothetical protein
ORF02166 conserved hypothetical protein
ORF02170 conserved hypothetical protein
ORF02171 conserved hypothetical protein
ORF02507 conserved hypothetical protein
ORF02745 conserved hypothetical protein
ORF02760 conserved hypothetical protein
ORF02762 conserved hypothetical protein
ORF02763 conserved hypothetical protein
ORF02766 conserved hypothetical protein
ORF02768 conserved hypothetical protein
ORF02769 conserved hypothetical protein
ORF02770 conserved hypothetical protein
ORF02771 conserved hypothetical protein
ORF02772 conserved hypothetical protein
ORF02773 conserved hypothetical protein
ORF02774 conserved hypothetical protein
ORF02896 conserved hypothetical protein
ORF02974 conserved hypothetical protein
ORF02711 conserved hypothetical protein UPF0007
ORF02614 conserved hypothetical protein, authentic frameshift
ORF00286 hypothetical protein
ORF00338 hypothetical protein
ORF00361 hypothetical protein
ORF00412 hypothetical protein
ORF00415 hypothetical protein
ORF00614 hypothetical protein
ORF00697 hypothetical protein
ORF00703 hypothetical protein
ORF00705 hypothetical protein
ORF00875 hypothetical protein
ORF00876 hypothetical protein
ORF00877 hypothetical protein
ORF00879 hypothetical protein
ORF00888 hypothetical protein
ORF00889 hypothetical protein
ORF01024 hypothetical protein
ORF01041 hypothetical protein
ORF01089 hypothetical protein
ORF01091 hypothetical protein
ORF01092 hypothetical protein
ORF01093 hypothetical protein
ORF01095 hypothetical protein
ORF01446 hypothetical protein
ORF01462 hypothetical protein
ORF01918 hypothetical protein
ORF02099 hypothetical protein
ORF02102 hypothetical protein
ORF02158 hypothetical protein
ORF02159 hypothetical protein
ORF02172 hypothetical protein
ORF02430 hypothetical protein
ORF02434 hypothetical protein
ORF02530 hypothetical protein
ORF02531 hypothetical protein
ORF02532 hypothetical protein
ORF02533 hypothetical protein
ORF02534 hypothetical protein
Analysis done with S. Gill
103. TIGRTIGR
Lineage Specific Duplications in Wolbachia wMel
Annotation
ankyrin repeat domain protein
ankyrin repeat domain protein
ankyrin repeat domain protein
ankyrin repeat domain protein
ankyrin repeat domain protein
ankyrin repeat domain protein
ankyrin repeat domain protein
conserved domain protein
conserved domain protein
conserved domain protein
conserved domain protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein
conserved hypothetical protein FRAMESHIFT
conserved hypothetical protein POINT MUTATION
conserved hypothetical protein, degenerate
conserved hypothetical protein, FRAMESHIFT
conserved hypothetical protein, FRAMESHIFT
conserved hypothetical protein, FRAMESHIFT
conserved hypothetical protein, FRAMESHIFT
conserved hypothetical protein, interruption-C
conserved hypothetical protein, POINT MUTATION
conserved hypothetical protein, POINT MUTATION
conserved hypothetical protein, truncated
conserved hypothetical protein, truncation
DNA mismatch repair protein MutL (mutL)
DNA repair protein RadC, putative
DNA repair protein RadC, putative, truncation
DNA repair protein RadC, truncation
DnaJ domain protein
DnaJ domain protein
exopolysaccharide synthesis protein ExoD-related protein
exopolysaccharide synthesis protein ExoD-related protein
HNH endonuclease family protein
HNH endonuclease family protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
hypothetical protein
major facilitator family transporter
major facilitator family transporter
major facilitator family transporter
membrane protein, putative
membrane protein, putative
membrane protein, putative
MutL family protein
Na+/H+ antiporter family protein
Na+/H+ antiporter, putative
permease, putative
portal protein, FRAMESHIFT
portal protein, FRAMESHIFT
prophage LambdaW1, DNA methylase
prophage LambdaW1, terminase large subunit, putative
prophage LambdaW2, ankyrin repeat domain protein
prophage LambdaW2, ankyrin repeat domain protein
prophage LambdaW2, baseplate assembly protein J, putative
prophage LambdaW2, baseplate assembly protein V, putative FRAMESHIFT
prophage LambdaW2, baseplate assembly protein V, putative FRAMESHIFT
prophage LambdaW2, baseplate assembly protein W, putative
prophage LambdaW2, minor tail protein Z, putative, FRAMESHIFT
prophage LambdaW2, site-specific recombinase, resolvase family
prophage LambdaW4, ankyrin repeat domain protein
prophage LambdaW4, DNA methylase
prophage LambdaW4, portal protein, FRAMESHIFT
prophage LambdaW4, portal protein, FRAMESHIFT
prophage LambdaW4, terminase large subunit, putative
prophage LambdaW5, ankyrin repeat domain protein
prophage LambdaW5, ankyrin repeat domain protein
prophage LambdaW5, ankyrin repeat domain protein
prophage LambdaW5, baseplate assembly protein J, putative, FRAMESHIFT
prophage LambdaW5, baseplate assembly protein V, putative
prophage LambdaW5, baseplate assembly protein W, putative
prophage LambdaW5, minor tail protein Z, putative, degenerate, FRAMESHIFT
prophage LambdaW5, site-specific recombinase, resolvase family
regulatory protein RepA, putative
regulatory protein RepA, putative
reverse transcriptase, putative
reverse transcriptase, putative
reverse transcriptase, putative
sodium/alanine symporter family protein
sodium/alanine symporter family protein
TenA/THI-4 family protein
transcriptional regulator
transcriptional regulator
transcriptional regulator
transcriptional regulator
transcriptional regulator
transcriptional regulator
transcriptional regulator, putative
translation elongation factor Tu (tuf)
translation elongation factor Tu (tuf)
transposase, degenerate
transposase, IS4 family
transposase, IS4 family
transposase, IS4 family
transposase, IS5 family, interruption-N
transposase, IS5 family, truncation
transposase, putative, degenerate
transposase, putative, degenerate
transposase, putative, degenerate
type IV secretion system protein VirB4, putative
UDP-N-acetylglucosamine pyrophosphorylase-related protein
104. TIGRTIGR
MutL Duplication in Wolbachia wMel
ORF01096 DNA mismatch repair protein MutL (mutL)
ORF00446 MutL family protein
108. TIGRTIGR
X-files
Eisen et al. 2000. Genome Biology 1(6): 11.1-11.9
Also see Tillier and Collins. 2000. Nature Genetics
26(2):195-7 and Suyama and Bork. 2001. Trends Genetics
17: 10-13.
110. TIGRTIGR
V. cholerae vs. E. coli All
[Dot plot: V. cholerae coordinates (x-axis, 0 to 3,000,000) vs. E. coli coordinates (y-axis, 0 to 5,000,000)]
111. TIGRTIGR
V. cholerae vs. E. coli Best
[Dot plot: V. cholerae coordinates (x-axis, 0 to 3,000,000) vs. E. coli coordinates (y-axis, 0 to 5,000,000)]
112. TIGRTIGR
V. cholerae vs. E. coli: if Top
[Dot plot: V. cholerae coordinates (x-axis, 0 to 3,000,000) vs. E. coli coordinates (y-axis, 0 to 5,000,000)]
113. TIGRTIGR
V. cholerae vs. E. coli: Top, Rotated
[Dot plot: V. cholerae ORF coordinates (x-axis, 0 to 3,000,000) vs. E. coli ORF coordinates (y-axis, 0 to 5,000,000)]
114. TIGRTIGR
Duplication and Gene Loss Model
[Diagram: an ancestral gene order A, B, C, D, E, F is duplicated to give two copies, A to F and A’ to F’; subsequent gene loss leaves different subsets of the two copies in E. coli and in V. cholerae]
116. TIGRTIGR
V. cholerae vs. E. coli: Top
[Dot plot: V. cholerae ORF coordinates (x-axis, 0 to 3,000,000) vs. E. coli ORF coordinates (y-axis, 0 to 5,000,000)]
117. TIGRTIGR
[Figure: C. trachomatis MoPn vs. C. pneumoniae AR39, with Origin and Terminus marked]
C. trachomatis vs C. pneumoniae
118. TIGRTIGR
M. leprae vs. M. tuberculosis
[Dot plot: Mycobacterium leprae (x-axis, 0 to 3,000,000) vs. Mycobacterium tuberculosis (y-axis, 0 to 4,000,000)]
119. TIGRTIGR
B. subtilis vs. S. aureus
[Plot: axis ranges 0 to 3000 and 2632200 to 2636700; axis labels not recoverable]
analysis w/ S. Gill
120. TIGRTIGR
P. putida vs. P.aeruginosa Orthologs
[Plot: axis ranges 0 to 8000 and 9945700 to 9951700; axis labels not recoverable]
analysis w/ K. Nelson
123. TIGRTIGR
Why are Inversions Symmetrical
Around Origin?
• Genetic studies in Salmonella and E. coli
suggest that there may be strong selection
against other inversions
• See:
– Mahan, Segall, Schmid and Roth
– Liu and Sanderson
– Rebollo, Francois, and Louarn
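To see why inversions that are symmetric around the origin produce an X-shaped pattern in a whole-genome dot plot, a toy simulation helps. The sketch below is a simplified model, not the analysis from the cited papers: it assumes a circular genome with the origin at position 0, treats genes as points, and uses the hypothetical function `symmetric_inversion`. An inversion centered on the origin sends a gene at position p to L - p, so after any number of such inversions every gene sits either on the diagonal (y = x) or on the anti-diagonal (y = L - x).

```python
def symmetric_inversion(positions, genome_len, half_width):
    """Invert the segment within half_width of the origin (position 0) on a circular genome."""
    new_positions = []
    for p in positions:
        distance_from_origin = min(p, genome_len - p)
        if distance_from_origin <= half_width:
            # Genes inside the inverted segment are mirrored across the origin.
            new_positions.append((genome_len - p) % genome_len)
        else:
            new_positions.append(p)
    return new_positions


GENOME_LEN = 1000
ancestral = [100, 300, 500, 900]

# Two nested inversions of different sizes, both centered on the origin.
derived = symmetric_inversion(ancestral, GENOME_LEN, 200)
derived = symmetric_inversion(derived, GENOME_LEN, 400)

# (ancestral, derived) pairs are the points of the dot plot.
points = list(zip(ancestral, derived))
print(points)
```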
124. TIGRTIGR
TIGR
Other people
Mom and Dad
H. Ochman
W. Martin
F. Robb
J. Battista
E. Orias
D. Bryant
S. O’Neill
M. Eisen
N. Moran
R. Myers
C. M. Cavanaugh
P. Hanawalt
NSF
J. Heidelberg
T. Read
N. Ward
M-I Benito
J. C. Venter
C. Fraser
S. Salzberg
O. White
I. Paulsen
$$$
ONR
DOE
NIH
H. Tettelin
Eisen Group
Martin Wu
Dongying Wu
James Sakwa
Jonathan Badger
127. TIGRTIGR
Why Gene Transfers Are Useful to Identify
• Laterally transferred genes frequently involved in
environmental adaptations and/or pathogenicity
• Identification of vectors of gene transfer (e.g., transposons,
integrons, phage)
• Identify species associations in the environment (e.g.,
Thermotoga and Archaea, Nelson et al.)
• Identify organelle-derived genes in eukaryotic genomes
• Important for understanding of evolution
128. TIGRTIGR
Examples of Horizontal
Transfers
• Antibiotic resistance genes on plasmids
• Toxin resistance genes on plasmids
• Insertion sequences
• Agrobacterium Ti plasmid
• Virus and phage gene acquisition and
transfer
• Organelle to nucleus transfers
129. TIGRTIGR
Steps in Lateral Gene Transfer (LGT)
A B C D
1 Gene acquires host features
2
Transfer
6 Amelioration
3-5 Integration, selection, spread
130. TIGRTIGR
Inference of Gene Transfer Involves
Identifying Unusual Genes
• Unusual distribution patterns
• High sequence similarity to supposedly
distantly related species
• Unusual nucleotide composition
• Unusual patterns of evolutionary
relatedness (gene vs. species)
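Of the signals listed above, unusual nucleotide composition is the easiest to sketch. The example below flags genes whose GC content differs from the genome-wide median by more than a chosen threshold. It is a minimal sketch under stated assumptions: the gene names, sequences, the 0.15 threshold, and the function names are hypothetical, and real analyses also consider codon usage and must allow for amelioration, which erodes this signal over time.

```python
from statistics import median


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_unusual_gc(genes, threshold=0.15):
    """Return names of genes whose GC content deviates from the median by more than threshold."""
    gc_by_gene = {name: gc_content(seq) for name, seq in genes.items()}
    typical = median(gc_by_gene.values())
    return sorted(
        name for name, gc in gc_by_gene.items() if abs(gc - typical) > threshold
    )


# Toy genome: three genes at 40% GC and one GC-rich candidate.
genes = {
    "gene1": "ATGCATGCAT",
    "gene2": "ATGCTTGCAA",
    "gene3": "TTGACCATGA",
    "candidate": "GCGCGGCCGC",
}

flagged = flag_unusual_gc(genes)
print(flagged)
```

A flagged gene is only a candidate for transfer; the other criteria on this slide, especially a gene tree that conflicts with the species tree, are needed to support the inference.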
132. TIGRTIGR
“Hundreds of human genes appear
likely to have resulted from horizontal
transfer from bacteria at some point in
the vertebrate lineage.”
133. TIGRTIGR
IHGSC 2001
• Claim:
– Lateral transfer from bacteria to vertebrates
• Evidence
– Genes match bacteria but not non-vertebrate
eukaryotes
– Or, genes have stronger match to bacteria than to
non-vertebrates
– A set of ~120 of these genes are found in many
bacterial species
139. TIGRTIGR
Number of pBVTs Depends
on # of Genomes Analyzed
[Bar chart: y-axis Number of protein sets (0 to 1800); x-axis categories 1, 2, 3, 4, 5, Other; labels: Fruit fly, C. elegans, Arabidopsis, Yeast, Parasites]
Salzberg et al. 2001
142. TIGRTIGR
Alternative explanations
• Gene loss from non-vertebrate eukaryotes
• Rapid divergence in non-vertebrate
eukaryotes
• Some non-vertebrate genomes are
incomplete
• Bad annotation/gene finding
• Contamination
• Blast ≠ evolution
147. TIGRTIGR
A. thaliana T1E2.8 is a
Chloroplast Derived HSP60
[Figure: HSP60 phylogenetic tree including ARATH T1E2.8 (marked **********); clades labeled Eukarya, Archaea, Bacteria, Cyano/Cpst]
The population geneticist Dobzhansky, in saying this, basically meant that one needs to go beyond documenting the similarities and differences between species to determining how and why these similarities and differences came into being.
The best example of this involves convergent evolution. In order to understand the differences and similarities in flight between birds and bats, it is helpful to know that they evolved their flight systems separately.
Because evolutionary and genomic analyses have so many interactions, I have argued that a composite approach, which I refer to as phylogenomics, is needed.
In the paper by Wood et al., they ran BLAST searches of all the predicted proteins in all the eukaryotic genomes against each other and came up with a list of proteins for which homologs could be identified in all eukaryotes. They then searched these against all proteins from all prokaryotes and asked which of the eukaryotic-conserved proteins did not have homologs in ANY prokaryote.
This tree shows a tree of life with Archaea closer to Eukaryotes; in blue is the evolutionary branch on which genes that are found in all eukaryotes but not in prokaryotes should have arisen.
Thus the S. pombe analysis was good in that it was searching for genes invented early in eukaryotic evolution and then kept in all eukaryotic lineages.
The key here is that there were 50+ genes found in all eukaryotes and not in any prokaryote for which nobody had any functional information. These could represent major eukaryotic processes yet to be discovered.
The origins of multicellularity for the species analyzed are shown in green.
Since plants and animals evolved multicellularity separately, it is not surprising that multicellular plants did not evolve the same genes as multicellular animals.
Therefore their analysis was fundamentally flawed.
This is described in a News and Views I wrote about the S. pombe genome
Functional prediction using a gene tree is just like predicting the biology of a species using a species tree
Shotgun genome sequencing works by breaking a genome apart into millions of little pieces and then sequencing those pieces randomly
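The random-sampling idea behind shotgun sequencing can be sketched with a toy simulation. This is an illustration only: the function names, read length, and seed are hypothetical choices, reads are error-free substrings, and real projects sequence cloned fragments and then assemble overlapping reads, which this sketch does not attempt.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample n_reads substrings of length read_len from random start positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


def coverage(n_reads, read_len, genome_len):
    """Average number of times each base is expected to be sequenced."""
    return n_reads * read_len / genome_len


toy_genome = "ACGTTGCA" * 25  # 200 bases
reads = shotgun_reads(toy_genome, n_reads=40, read_len=20)
print(len(reads), coverage(40, 20, len(toy_genome)))
```

Because start positions are random, some regions are sampled many times and others not at all, which is why projects aim for several-fold average coverage.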
Metagenomics involves cloning large DNA fragments from environmental samples
This is a tree of an rRNA gene that was found on a large DNA fragment isolated from Monterey Bay. This rRNA gene groups in a tree with genes from members of the gamma Proteobacteria, a group that includes E. coli as well as many environmental bacteria. This rRNA phylotype has been found to be dominant in many ocean ecosystems.
clone from the Sargasso Sea. This shows that this
Samples are unbiased
Phil Hugenholtz wrote an excellent review paper in Genome Biology tracing which Phyla had genome sequences available. Proteobacteria have the most with Firmicutes next (these are the low GC gram positives like B. subtilis) and Actinobacteria (the high GC gram positives like M. tuberculosis) third.
In Red are the Phyla of bacteria with cultured species but no genome sequences that we are sequencing as part of an NSF Tree of Life project.
NOTE **** In White are the Phyla with no cultured representatives.
ALSO NOTE **** This project is contingent upon having a good tree of bacteria.