EVE 161: Microbial Phylogenomics

Lecture #10: Era III: Genome Sequencing

UC Davis, Winter 2014
Instructor: Jonathan Eisen
UC Davis EVE161 Lecture 10 by @phylogenomics
  1. Lecture 10: EVE 161: Microbial Phylogenomics
     Lecture #10: Era III: Genome Sequencing
     UC Davis, Winter 2014
     Instructor: Jonathan Eisen
     Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
  2. Where we are going and where we have been
     • Previous lecture: 9: rRNA Case Study - Built Environment
     • Current lecture: 10: Genome Sequencing
     • Next lecture: 11: Genome Sequencing II
  3. 1st Genome Sequence
     [Figure; credit line truncated in the source: Fleischma...]
  4. Figure 1: Diagram depicting the steps in a whole-genome shotgun sequencing project.
     1. Library construction
        (i) Isolate DNA
        (ii) Fragment DNA
        (iii) Clone DNA
     2. Random sequencing phase
        (i) Sequence DNA (15,000 sequences per Mb)
     3. Closure phase
        (i) Assemble sequences
        (ii) Close gaps
        (iii) Edit
        (iv) Annotation
     4. Complete genome sequence
     Article text visible around the figure (two fragments):
     – "...analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria [13,14]. This led to the suggestion of possible extensive gene exchanges between these species and archaeal..."
     – "...be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes [21,22] and another finding that some types of gene might be more prone to gene transfer than others [23]. It seems therefore likely that horizontal gene..."
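One way to read the "15,000 sequences per Mb" figure is as sequencing depth. The following is a rough sketch only: the 500 bp average read length is a hypothetical value (the slide gives none), and the uncovered-fraction formula is the idealized Lander-Waterman style approximation in which reads land uniformly at random.

```python
import math

def coverage(num_reads, read_length, genome_length):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_length

def expected_uncovered_fraction(depth):
    """Fraction of bases expected to be hit by no read (Poisson approximation)."""
    return math.exp(-depth)

# 15,000 reads per Mb comes from the slide; 500 bp per read is an assumption.
depth = coverage(15_000, 500, 1_000_000)
print(depth, expected_uncovered_fraction(depth))
```

Under these assumptions a small but nonzero fraction of the genome is still missed by random sequencing, which is consistent with the figure having a separate closure phase.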
  5. Complete Genome/Chromosome Progress
  6. From http://genomesonline.org
  7. TIGR
  8. Why Completeness is Important
     • Improves characterization of genome features
     • Gene order, replication origins
     • Better comparative genomics
     • Genome duplications, inversions
     • Presence and absence of particular genes can be very important
     • Missing sequence might be important (e.g., centromere)
     • Allows researchers to focus on biology not sequencing
     • Facilitates large scale correlation studies
  9. General Steps in Analysis of Complete Genomes
     • Identification/prediction of genes
     • Characterization of gene features
     • Characterization of genome features
     • Prediction of gene function
     • Prediction of pathways
     • Integration with known biological data
     • Comparative genomics
  10. General Steps in Analysis of Complete Genomes
      • Structural Annotation
        – Identification/prediction of genes
        – Characterization of gene features
        – Characterization of genome features
      • Functional Annotation
        – Prediction of gene function
        – Prediction of pathways
        – Integration with known biological data
      • Evolutionary Annotation
        – Comparative genomics
  11. Structural Annotation I: Genes in Genomes
      • Protein coding genes
        – In long open reading frames
        – ORFs interrupted by introns in eukaryotes
        – Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome
      • RNA-only genes
        – Transfer RNA
        – Ribosomal RNA
        – snoRNAs (guide ribosomal and transfer RNA maturation)
        – Intron splicing
        – Guiding mRNAs to the membrane for translation
        – Gene regulation (this is a growing list)
  12. Structural Annotation II: Other Features to Find
      • Gene control sequences
        – Promoters
        – Regulatory elements
      • Transposable elements, both active and defective
        – DNA transposons and retrotransposons
        – Many types and sizes
      • Other repeated sequences
        – Centromeres and telomeres
        – Many with unknown (or no) function
      • Unique sequences that have no obvious function
  13. How to Find ncRNAs
      • The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.
      • One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily.
      • Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.
      • Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.
      • The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted.
      • Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.
      • This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
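The idea that a stem stays paired while its bases change can be checked directly on an alignment. The sketch below is a toy: the function name `covarying_pair` and the two aligned sequences are made up for illustration, and a real analysis would also weigh how often such a pattern arises by chance.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covarying_pair(alignment, i, j):
    """True if columns i and j can pair in every sequence
    AND the pair itself differs between at least two sequences."""
    observed = {(seq[i], seq[j]) for seq in alignment}
    return all(p in PAIRS for p in observed) and len(observed) > 1

# Hypothetical hairpins: a 3 bp stem (columns 0-2 with 9-7) around a UUCG loop.
alignment = [
    "GACUUCGGUC",  # column 1 / column 8 is A-U
    "GGCUUCGGCC",  # column 1 / column 8 is G-C
]
print(covarying_pair(alignment, 1, 8))  # pair kept, bases changed
print(covarying_pair(alignment, 0, 9))  # pair kept, bases identical
```

Columns that merely stay identical do not count here, since only a change on both sides of the stem is evidence that the pairing itself is under selection.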
  14. RNA Structure
      • RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
      • The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: computationally intense and not subject to the normal situation that RNA base pairing follows a nested pattern
        – But pseudoknots seem to be fairly rare.
      • Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
        – There is an element of dynamic programming here as well.
        – And, “stochastic context-free grammars”, something I really don’t want to approach right now!
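To make the "start with short sequences, then build to larger ones" idea concrete, here is a minimal dynamic-programming sketch in the style of the Nussinov algorithm. It is a deliberate simplification: it maximizes the number of nested base pairs rather than minimizing free energy, it cannot represent pseudoknots, and the `min_loop=3` hairpin constraint is an assumption chosen for the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = max nested pairs within seq[i..j]; short spans stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_pairs("GGGAAAUCC"))  # a small hairpin with one G-U wobble pair
```

Each table entry is built only from entries for shorter spans, which is the dynamic programming element the slide mentions. Real folding programs replace the "+1 per pair" with experimentally derived energy terms.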
  15. Finding tRNAs
      • tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases.
      • Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
      • Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass. In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
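The "series of steps plus a running score" idea can be sketched as a cascade of checks. Everything below is a toy under stated assumptions: the three checks, their cutoffs (length 70-95, a 7 bp acceptor stem with one unpaired 3' base, a GTTC T-loop motif in the second half), and the test sequence are illustrative choices, not the rules of any real tRNA finder.

```python
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(s):
    return "".join(COMP[b] for b in reversed(s))

def check_length(seq):
    return 1.0 if 70 <= len(seq) <= 95 else 0.0

def check_acceptor_stem(seq, stem=7, tail=1):
    """Score pairing between the 5' end and the bases just before the 3' tail."""
    five = seq[:stem]
    three = seq[len(seq) - tail - stem:len(seq) - tail]
    matches = sum(a == b for a, b in zip(five, revcomp(three)))
    return matches / stem if matches >= stem - 1 else 0.0  # allow one mismatch

def check_t_loop(seq):
    return 1.0 if "GTTC" in seq[len(seq) // 2:] else 0.0

STEPS = [check_length, check_acceptor_stem, check_t_loop]

def screen(seq):
    """Return a total score, or None if the candidate fails any step."""
    total = 0.0
    for step in STEPS:
        s = step(seq)
        if s == 0.0:
            return None
        total += s
    return total

# Hypothetical candidate built to satisfy all three toy checks.
candidate = "GCGGATT" + "A" * 40 + "GTTC" + "A" * 16 + "AATCCGC" + "A"
print(screen(candidate), screen("A" * 50))
```

Keeping the score rather than a plain pass/fail is what allows the later, more stringent filtering that the slide describes.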
  16. Bacteria / Archaeal Protein Coding Genes
      • Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.
        – Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.
      • The stop codons are the same as in eukaryotes: TGA, TAA, TAG
        – Stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
      • Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough so that you can’t just eliminate overlaps as impossible.
      • Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.
        – But, a significant minority of genes (say 20%) are unique to a given species.
      • Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon
        – However, some aren’t recognizable
        – Genes in operons don’t always have a separate ribosome binding site for each gene
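The start and stop codon rules above translate into a simple ORF scan. This is a minimal sketch, not a gene finder: it looks at the forward strand only, the `min_codons` cutoff is an arbitrary assumption, and it reports the most upstream start in each ORF, which (as noted above) may not be the real one.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TGA", "TAA", "TAG"}

def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the first in-frame start codon after the previous stop;
    end is the index just past the stop codon.
    """
    orfs = []
    for frame in range(3):
        first_start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon in STOPS:
                if first_start is not None and (pos - first_start) // 3 >= min_codons:
                    orfs.append((first_start, pos + 3, frame))
                first_start = None  # a stop always closes the current ORF
            elif codon in STARTS and first_start is None:
                first_start = pos
    return orfs

# Toy sequence: ATG, five GCT codons, then TAA, in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

Scanning the reverse complement the same way would cover the other strand. Choosing among the alternative starts inside an ORF is where ribosome binding sites and homology come in.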
  17. Composition Methods
      • The frequency of various codons is different in coding regions as compared to non-coding regions.
        – This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used
        – Well documented experimentally.
      • The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present
        – Horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.
        – A second group with unusual compositions are highly expressed genes.
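The composition measures mentioned above are easy to compute. A minimal sketch follows; the function names are made up for illustration, and a real gene finder would compare such counts against frequencies trained on known coding and non-coding sequence rather than use them raw.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def dicodon_counts(seq, frame=0):
    """Count in-frame dicodons (6-base words), advancing one codon at a time."""
    return Counter(seq[i:i + 6] for i in range(frame, len(seq) - 5, 3))

print(gc_content("ATGGCTGCTGCT"))
print(dicodon_counts("ATGGCTGCTGCT"))
```

Because the window advances by one codon, the counts depend on the reading frame, which is what lets dicodon statistics help pick out the coding frame.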
  18. Eukaryotic Genes Harder to Find
      • Some fundamental differences between prokaryotes and eukaryotes:
      • There is lots of non-coding DNA in eukaryotes.
        – First step: find repeated sequences and RNA genes
        – Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.
      • Most eukaryotic genes are split into exons and introns.
      • Only 1 gene per transcript in eukaryotes.
      • No ribosome binding sites: translation starts at the first ATG in the mRNA
        – Thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.
      • Many fewer eukaryotic genomes have been sequenced
  19. Exons
      • Exon sequences can often be identified by sequence conservation, at least roughly.
      • Dicodon statistics, as used for prokaryotes, are also useful
        – Eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.
      • The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.
      • Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision tree, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.
        – In general, sites are more likely to be correct if predicted by multiple methods
        – Experimental data from ESTs can be very helpful here.
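The first step in splice junction prediction, listing every candidate intron, can be sketched in a few lines. This assumes the canonical pattern in which an intron begins with GT and ends with AG; the `min_len` cutoff and the toy sequence are arbitrary choices for the example, and no scoring is attempted.

```python
def candidate_introns(seq, min_len=20):
    """Return (start, end) for every GT...AG span of at least min_len bases.

    seq[start:end] begins with GT and ends with AG.
    """
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

toy = "AAGTAAAAAAAGCC"
print(candidate_introns(toy, min_len=5))
```

Even short sequences produce many GT/AG combinations, which is why the surrounding sequence context has to be scored by the statistical methods listed above.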
  20. Functional Annotation
  21. Functional Classification I: GO
      • The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
      • Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
      • There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (Ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
        – For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
      • The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
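The difference between a tree and a directed acyclic graph shows up as soon as a term has two parents. In the sketch below the term names `root`, `A`, `B`, and `C` are placeholders, not real GO terms; only the GO:nnnnnnn identifier format and the example ID GO:0005102 come from the slide.

```python
import re

# Identifier format from the slide: "GO:" followed by exactly 7 digits.
GO_ID = re.compile(r"^GO:\d{7}$")

# Hypothetical mini-ontology: each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents: allowed in a DAG, impossible in a tree
}

def ancestors(term, parents=PARENTS):
    """All less specific terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

print(sorted(ancestors("C")))
print(bool(GO_ID.match("GO:0005102")))
```

Annotating a gene product with a specific term implicitly annotates it with every ancestor, so a traversal like this is the usual way to compare annotations made at different levels of detail.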
  22. Functional Classification II: Enzyme Nomenclature
      • Enzyme functions: which reactants are converted to which products
      • Enzyme functions are given unique numbers by the Enzyme Commission.
        – Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.
        – Often, two or more gene products in a genome will have the same E.C. number.
        – E.C. numbers are four integers separated by dots. The left-most number is the least specific
        – For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:
          • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
          • EC 3.4 are hydrolases that act on peptide bonds
          • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
          • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
      • Top level E.C. numbers:
        – E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
        – E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.
        – E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
        – E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
        – E.C. 5: isomerases: rearrangements of atoms within a molecule
        – E.C. 6: ligases: joining two molecules using energy from ATP
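Because each E.C. number reads from least to most specific, its prefixes give the enclosing groups directly. A minimal sketch, with hypothetical helper names `ec_levels` and `ec_class`, and the six top-level classes taken from the list above:

```python
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}

def ec_levels(ec):
    """Split 'EC 3.4.11.4' into its nested groups, least specific first."""
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-part E.C. number: {ec!r}")
    return [".".join(parts[:k]) for k in range(1, 5)]

def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    return EC_CLASSES[int(ec_levels(ec)[0])]

print(ec_levels("EC 3.4.11.4"), ec_class("EC 3.4.11.4"))
```

Two gene products can then be compared at whatever level of detail is needed, for example by asking whether they share the first three components.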
  23. Functional Prediction
      • BLAST searches
      • HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).
      • Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
      • Cellular location predictions, especially for transmembrane proteins.
      • Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
      • Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.
        – Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
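The pathway argument ("most genes present, so the rest are probably there too") amounts to a completeness check. The sketch below is illustrative only: the step names are placeholders, the 0.75 threshold is an arbitrary assumption, and the list of required steps is assumed to be non-empty.

```python
def pathway_gaps(required, found, threshold=0.75):
    """Return (fraction of steps found, steps worth searching for).

    Missing steps are reported only when the pathway is mostly complete.
    """
    required = set(required)
    present = required & set(found)
    fraction = len(present) / len(required)
    missing = sorted(required - present)
    return fraction, (missing if fraction >= threshold else [])

# Hypothetical four-step pathway with one step not yet annotated.
print(pathway_gaps({"step1", "step2", "step3", "step4"}, {"step1", "step2", "step3"}))
```

A reported gap is a prompt to look harder, for instance with a more sensitive search for that specific function, rather than proof that the gene exists.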
  24. Functional Prediction II: Membrane Spanning
      • Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
        – There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions
        – There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.
      • There are 2 main protein structures that cross membranes.
        – Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.
        – Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which form channels that let water and small solutes cross the membrane.
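Membrane-spanning alpha helices are hydrophobic over their whole length, so a sliding-window hydropathy average is a common first screen. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and 1.6 cutoff are conventional choices treated here as assumptions, and the test protein is artificial. Beta barrels are not detected by this approach.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_windows(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[a] for a in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(protein) - window + 1)]

def predicted_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to be membrane-spanning."""
    return [i for i, s in enumerate(hydropathy_windows(protein, window))
            if s >= threshold]

# Artificial protein: 20 leucines flanked by charged aspartates.
toy = "D" * 10 + "L" * 20 + "D" * 10
print(predicted_tm_windows(toy))
```

Adjacent passing windows would normally be merged into one predicted segment, and the count of segments gives a rough estimate of how many times the protein crosses the membrane.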
  25. Functional Prediction by Phylogeny
      • Key step in genome projects
      • More accurate predictions help guide experimental and computational analyses
      • Many diverse approaches
      • All improved by “phylogenomic” type analyses that integrate both evolutionary reconstructions and understanding of how new functions evolve
  26. Functional Prediction
      • Identification of motifs
        – Short regions of sequence similarity that are indicative of general activity
        – e.g., ATP binding
      • Homology/similarity based methods
        – Gene sequence is searched against a database of other sequences
        – If significantly similar genes are found, their functional information is used
      • Problem
        – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
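For the ATP binding example, one widely used motif is the P-loop (Walker A) pattern, written in PROSITE style as [AG]-x(4)-G-K-[ST]. The sketch below expresses it as a regular expression; the function name and the test peptide are made up for illustration.

```python
import re

# P-loop / Walker A pattern: [AG], any 4 residues, G, K, then S or T.
WALKER_A = re.compile(r"[AG].{4}GK[ST]")

def find_motif(protein, pattern=WALKER_A):
    """Return (position, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]

print(find_motif("MSGAPGSGKSTLL"))
```

A hit like this suggests only a general activity (nucleotide binding), which illustrates the problem noted above: short motifs match many proteins that do not share a specific function.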
  27. Helicobacter pylori
  28. H. pylori genome - 1997
      “The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”
  29. MutL ??
      From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
  30. Phylogenetic Tree of MutS Family
      [Figure: phylogenetic tree. Leaf labels in extraction order: Yeast, Human, Celeg, Aquae, Strpy, Bacsu, Synsp, Deira, Helpy, Borbu, Metth, mSaco, Yeast, Human, Mouse, Arath, Arath, Human, Mouse, Spombe, Yeast, Yeast, Spombe, Yeast, Celeg, Human, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Aquae, Trepa, Chltr, Deira, Theaq, Bacsu, Borbu, Thema, Synsp, Strpy, Ecoli, Neigo]
      Based on Eisen, 1998 Nucl Acids