FBW 27-11-2012 Wim Van Criekinge
Compensatory substitutions that maintain the structure
[Figure: RNA stem-loop showing compensatory base substitutions]
Evolutionary conservation of RNA molecules can be revealed by identification of compensatory substitutions
• Manual annotation of 60,770 full-length mouse complementary DNA sequences, clustered into 33,409 ‘transcriptional units’,...
Function on ncRNAs
ncRNAs & RNAi
Therapeutic Applications
• Shooting millions of tiny RNA molecules into a mouse’s bloodstream ca...
Published in: Education
Bioinformatica t8-go-hmm

  1. 1. FBW 27-11-2012 Wim Van Criekinge
  2. 2. Course contents: Bioinformatics. NO CLASS
  3. 3. NO CLASS ON 4 DECEMBER
  4. 4. Gene Prediction, HMM & ncRNA
      • What to do with an unknown sequence?
      • Gene Ontologies
      • Gene Prediction
      • Composite Gene Prediction
      • Non-coding RNA
      • HMM
  5. 5. UNKNOWN PROTEIN SEQUENCE. LOOK FOR:
      • Similar sequences in databases ((PSI-)BLAST)
      • Distinctive patterns/domains associated with function
      • Functionally important residues
      • Secondary and tertiary structure
      • Physical properties (hydrophobicity, IEP, etc.)
  6. 6. BASIC INFORMATION COMES FROM SEQUENCE
      • One sequence: can get some information, e.g. amino acid properties
      • More than one sequence: get more information on conserved residues, fold and function
      • Multiple alignments of related sequences: can build up consensus sequences of known families, domains, motifs or sites
      • Sequence alignments can give information on loops, families and function from conserved regions
  7. 7. Additional analysis of protein sequences
      • transmembrane regions
      • signal sequences
      • localisation signals
      • targeting sequences
      • GPI anchors
      • glycosylation sites
      • hydrophobicity
      • amino acid composition
      • molecular weight
      • solvent accessibility
      • antigenicity
  8. 8. FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES
      • Pattern: short, simplest, but limited
      • Motif: conserved element of a sequence alignment, usually predictive of a structural or functional region
      To get more information across the whole alignment:
      • Profile
      • HMM
  9. 9. PATTERNS
      • Small, highly conserved regions
      • Shown as regular expressions
      Example: [AG]-x-V-x(2)-x-{YW}
        – [] shows either amino acid
        – x is any amino acid
        – x(2) is any amino acid in the next 2 positions
        – {} shows any amino acid except these
      BUT: limited to near-exact matches in a small region
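
The pattern notation on the PATTERNS slide maps almost one-to-one onto ordinary regular expressions. The sketch below is one possible translation, assuming only the four constructs listed on the slide (`[]`, `x`, `x(n)`, `{}`); the function name `prosite_to_regex` and the test peptides are made up for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a pattern such as [AG]-x-V-x(2)-x-{YW} into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional repeat count, e.g. x(2)
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = m.group(1), m.group(2)
        if core.lower() == "x":          # x: any amino acid
            core = "."
        elif core.startswith("{"):       # {YW}: any amino acid except Y or W
            core = "[^" + core[1:-1] + "]"
        # [AG] and single letters are already valid regex syntax
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
hit = re.search(regex, "MAKVLLSC")    # hypothetical peptide that fits the pattern
miss = re.search(regex, "MAKVLLSY")   # same peptide, but ending in an excluded residue
```

Here `hit` matches the substring `AKVLLSC`, while `miss` is `None` because the final position is Y, which is the "near exact match" limitation the slide mentions.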
  10. 10. PROFILES
      • Table or matrix containing comparison information for aligned sequences
      • Used to find sequences similar to an alignment rather than to one sequence
      • Contains the same number of rows as positions in the sequences
      • Each row contains the score for aligning that position with each residue
  11. 11. HIDDEN MARKOV MODELS (HMM)
      • An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities
      • Package used: HMMER (http://hmmer.wusd.edu/)
      • Start with one sequence or alignment: build with hmmbuild, calibrate with hmmcalibrate, then search the database with the HMM
      • E-value: number of false matches expected with a certain score
      • Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the noise curve
  12. 12. Sequence
  13. 13. Gene Prediction, HMM & ncRNA
      • What to do with an unknown sequence?
      • Gene Ontologies
      • Gene Prediction
      • HMM
      • Composite Gene Prediction
      • Non-coding RNA
  14. 14. What is an ontology?
      • An ontology is an explicit specification of a conceptualization.
      • A conceptualization is an abstract, simplified view of the world that we want to represent.
      • If the specification medium is a formal representation, the ontology defines the vocabulary.
  15. 15. Why Create Ontologies?
      • to enable data exchange among programs
      • to simplify unification (or translation) of disparate representations
      • to employ knowledge-based services
      • to embody the representation of a theory
      • to facilitate communication among people
  16. 16. Summary
      • Ontologies are what they do: artifacts to help people and their programs communicate, coordinate and collaborate.
      • Ontologies are essential elements in the technological infrastructure of the Knowledge Age
      • http://www.geneontology.org/
  17. 17. The Three Ontologies
      • Molecular Function: elemental activity or task (nuclease, DNA binding, transcription factor)
      • Biological Process: broad objective or goal (mitosis, signal transduction, metabolism)
      • Cellular Component: location or complex (nucleus, ribosome, origin recognition complex)
  18. 18. DAG Structure
      Directed acyclic graph: each child may have one or more parents
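
Because a term in a DAG can have several parents, "all ancestors of a term" has to be collected by walking every parent link, not by following a single chain. The following is a minimal sketch; the `PARENTS` map is a toy example with illustrative edges, not real GO records.

```python
# Toy child -> parents map. Terms and edges are illustrative assumptions, not real GO data.
PARENTS = {
    "cellular_component": [],
    "nucleus": ["cellular_component"],
    "protein_complex": ["cellular_component"],
    "origin_recognition_complex": ["nucleus", "protein_complex"],  # two parents
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:       # a node can be reached along several paths
            seen.add(node)
            stack.extend(parents[node])
    return seen

orc_ancestors = ancestors("origin_recognition_complex")
```

A gene annotated with the most specific term is implicitly annotated with every term in `orc_ancestors`, which is why overrepresentation analyses usually propagate annotations up the graph first.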
  19. 19. Example - Molecular Function
  20. 20. Example - Biological Process
  21. 21. Example - Cellular Location
  22. 22. AmiGO browser
  23. 23. GO: Applications
      • E.g. chip-data analysis: overrepresented items can provide functional clues
      • Overrepresentation check: contingency table
        – Chi-square test (or Fisher's exact test if a frequency is < 5)
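
The overrepresentation check can be sketched with the standard library alone. The code below computes a one-sided Fisher exact p-value via the hypergeometric distribution; the function name and the argument convention (`k`, `n`, `K`, `N`) are assumptions made for this example, and no multiple-testing correction is included.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """One-sided Fisher exact test: P(X >= k).

    N: genes on the chip, K: genes on the chip annotated with the GO term,
    n: genes in the selected list, k: selected genes annotated with the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Tiny worked case: 10 genes, 3 carry the term, and all 3 selected genes carry it.
p_small = enrichment_p_value(k=3, n=3, K=3, N=10)   # 1 / C(10, 3) = 1/120
```

In practice the test would be repeated for every GO term, so the resulting p-values need some correction for multiple testing before they are interpreted.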
  24. 24. Gene Prediction, HMM & ncRNA
      • What to do with an unknown sequence?
      • Web applications
      • Gene Ontologies
      • Gene Prediction
      • HMM
      • Composite Gene Prediction
      • Non-coding RNA
  25. 25. Computational Gene Finding
      Problem: given a very long DNA sequence, identify coding regions (including intron splice sites) and their predicted protein sequences
  26. 26. Computational Gene Finding: Eukaryotic gene structure
  27. 27. Computational Gene Finding
      • There is no (yet known) perfect method for finding genes. All approaches rely on combining various “weak signals”
      • Find elements of a gene
        – coding sequences (exons)
        – promoters and start signals
        – poly-A tails and downstream signals
      • Assemble into a consistent gene model
  28. 28. Genefinder
  29. 29. GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP
      This gene structure corresponds to the position on the physical map
  30. 30. GENE STRUCTURE INFORMATION - ACTIVE ZONE
      This gene structure shows the Active Zone
      • The Active Zone limits the extent of analysis, genefinder & fasta dumps
      • A blue line within the yellow box indicates regions outside of the active zone
      • The active zone is set by entering coordinates in the active zone (yellow box)
  31. 31. GENE STRUCTURE INFORMATION - POSITION
      This gene structure relates to the Position
      • Change the origin of this scale by entering a number in the green origin box
  32. 32. GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE
      This gene structure relates to the predicted gene structures
      • Boxes are exons, thin lines (or springs) are introns
  33. 33. Find the open reading frames
      The triplet, non-punctuated nature of the genetic code helps us out
      • 64 potential codons: 61 true codons, 3 stop codons (TGA, TAA, TAG)
      • With a random distribution, approx. 1 in 21 codons will be a stop
      • Any sequence has 3 potential reading frames (+1, +2, +3)
      • Its complement also has three potential reading frames (-1, -2, -3)
      • 6 possible reading frames
      GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT
      +1: E K A P A Q S E M V S L S F H R
      +2: K K L L P N L K W L A Y L S T
      +3: K S S C P I * N G * P I F P P
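
The three forward translations on the open reading frame slide can be reproduced with a short six-frame translation routine. This is a sketch using the standard genetic code; the helper names (`translate`, `reverse_complement`, `six_frames`) are chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3 ('*' marks a stop)."""
    rc = reverse_complement(dna)
    frames = {f"+{f + 1}": translate(dna, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(rc, f) for f in range(3)})
    return frames

seq = "GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT"
frames = six_frames(seq)
n_stops = sum(1 for aa in CODON_TABLE.values() if aa == "*")  # 3 of 64 codons
```

With 3 stop codons out of 64, a random sequence hits a stop about once every 21 codons, so a long stretch without `*` in one frame (as in `+1` and `+2` here) is weak evidence for an open reading frame.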
  34. 34. GENE STRUCTURE INFORMATION - OPEN READING FRAMES
      This gene structure relates to Open Reading Frames
      • There is one column for each frame
      • Small horizontal lines represent stop codons
  35. 35. GENE STRUCTURE INFORMATION - START CODONS
      This gene structure represents Start Codons
      • They have one column for each frame
      • The size indicates the relative score for the particular start site
  36. 36. Computational Gene Finding: Hexanucleotide frequencies
      • Amino acid distributions are biased, e.g. p(A) > p(C)
      • Pairwise distributions are also biased, e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)]
      • Nucleotides that code for preferred amino acids (and AA pairs) occur more frequently in coding regions than in non-coding regions.
      • Codon biases (per amino acid)
      • Hexanucleotide distributions that reflect those biases indicate coding regions.
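
One simple way to turn hexanucleotide bias into a score is a log-odds sum over all hexamers in a window. The sketch below assumes two training sets (coding and non-coding) are available as plain strings; the function names, the pseudocount, and the toy training sequences are assumptions for illustration, not the method of any particular gene finder.

```python
from collections import Counter
from math import log

def hexamer_counts(seqs, k=6):
    """Count all overlapping k-mers in a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts

def coding_score(seq, coding, noncoding, k=6, pseudo=1.0):
    """Sum of log-odds (coding vs non-coding) over the hexamers of seq."""
    n_c = sum(coding.values()) + pseudo * 4 ** k      # pseudocount avoids log(0)
    n_n = sum(noncoding.values()) + pseudo * 4 ** k
    score = 0.0
    for i in range(len(seq) - k + 1):
        h = seq[i:i + k]
        score += log((coding[h] + pseudo) / n_c) - log((noncoding[h] + pseudo) / n_n)
    return score

# Toy training data, purely illustrative
coding = hexamer_counts(["ATGGCGATGGCGATGGCG"])
noncoding = hexamer_counts(["TTTTTTATTTTTTATTTT"])
score_coding_like = coding_score("ATGGCGATGGCG", coding, noncoding)
score_noncoding_like = coding_score("TTTTTTTTTTTT", coding, noncoding)
```

A positive score means the window looks more like the coding training set; with real data the counts would come from thousands of annotated regions rather than one toy string each.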
  37. 37. Gene prediction
      Generation of datasets (Ensmart@Ensembl):
      • Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA)
      • Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions
      Distance array: for every base, calculate all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp)
      • Do you see a difference in this “distance array” between coding and non-coding sequence?
      • Could it be used to predict genes?
      • Write a program to predict genes in the following genomic sequence (http://biobix.ugent.be/txt/genomic.txt)
      • What else could help in finding genes in raw genomic sequences?
  38. 38. GENE STRUCTURE INFORMATION - CODING POTENTIAL
      This gene structure corresponds to the Coding Potential
      • The grey boxes indicate regions where the codon frequencies match those of known C. elegans genes
      • The larger the grey box, the more this region resembles a C. elegans coding element
  39. 39. blastn (EST)
      • For raw DNA sequence analysis blastx is extremely useful
      • It will probe your DNA sequence against the protein database
      • A match (homolog) gives you some ideas regarding function
      • One problem is all of the genome sequences: you will get matches to genome databases that are strictly identified by sequence homology; often you need some experimental evidence
  40. 40. GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY
      This feature shows protein sequence similarity
      • The blue boxes indicate regions of sequence which, when translated, have similarity to previously characterised proteins
      • To view the alignment, select the right mouse button whilst over the blue box
  41. 41. GENE STRUCTURE INFORMATION - EST MATCHES
      This gene structure relates to EST Matches
      • The yellow boxes represent DNA matches (Blast) to C. elegans Expressed Sequence Tags (ESTs)
      • To view the alignment, use the right mouse button whilst over the yellow box to invoke Blixem
  42. 42. New generation of programs to predict gene coding sequences based on a non-random repeat pattern (e.g. Glimmer, GeneMark); actually pretty good
      Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed.), pp. 11-34
  43. 43. Computational Gene Finding
      • CpG islands are regions of sequence that have a high proportion of CG dinucleotide pairs (the p is the phosphodiester bond linking them)
        – CpG islands are present in the promoter and exonic regions of approximately 40% of mammalian genes
        – Other regions of the mammalian genome contain few CpG dinucleotides and these are largely methylated
      • Definition: sequences of >500 bp with
        – G+C > 55%
        – Observed(CpG)/Expected(CpG) > 0.65
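
The CpG island definition can be checked directly on a sequence window. The slide does not say how Expected(CpG) is computed; the sketch below assumes the commonly used estimate (number of C × number of G) / window length, and the function names are chosen for this example.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for one window."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")            # "CG" cannot overlap itself
    expected = c * g / n                  # assumption: (#C * #G) / length
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio

def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    """Apply the three thresholds from the slide to a single window."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc > min_gc and ratio > min_ratio

island_like = looks_like_cpg_island("CG" * 300)   # artificial 600-bp CG repeat
at_rich = looks_like_cpg_island("AT" * 300)       # artificial AT repeat
```

A real scan would slide such a window along the genome and merge overlapping hits; this only shows the per-window test.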
  44. 44. GENE STRUCTURE INFORMATION - REPEAT FAMILIES
      This gene structure corresponds to Repeat Families
      • This column shows matches to members of a number of repeat families
      • Currently a hidden Markov model is used to detect these
  45. 45. GENE STRUCTURE INFORMATION - REPEATS
      This gene structure relates to Repeats
      • This column shows regions of localised repeats, both tandem and inverted
      • Clicking on the boxes will show the complete repeat information in the blue line at the top end of the screen
  46. 46. Exon/intron boundaries
  47. 47. Computational Gene Finding: Splice junctions
      • Most eukaryotic introns have a consensus splice signal: GU at the beginning (“donor”), AG at the end (“acceptor”).
      • Variation does occur in the splice sites
      • Many AGs and GTs are not splice sites.
      • Database of experimentally validated human splice sites: http://www.ebi.ac.uk/~thanaraj/splice.html
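
A naive scan for the GT...AG consensus shows why the dinucleotides alone are a weak signal. This sketch works on DNA (GT instead of GU); the function name, the `min_len` cut-off and the toy sequence are assumptions for illustration.

```python
def candidate_introns(dna, min_len=4):
    """List (start, end) pairs where dna[start:end] begins with GT and ends with AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    # Every donor/acceptor pair far enough apart is a candidate, hence many false positives
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

toy = "AAGTCCCCAGTT"
introns = candidate_introns(toy)
```

Even this 12-base toy sequence contains two GTs and two AGs but only one consistent pair, and on genomic DNA the number of candidates grows roughly quadratically, which is why gene finders score the surrounding sequence context as well.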
  48. 48. GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES
      This gene structure shows putative splice sites
      • The splice sites are shown hooked
      • The hook points in the direction of splicing, therefore 3’ splice sites point up and 5’ splice sites point down
      • The colour of the splice site indicates the position at which it interrupts the codon
      • The height of the splice site is proportional to its Genefinder score
  49. 49. Gene Prediction, HMM & ncRNA
      • What to do with an unknown sequence?
      • Web applications
      • Gene Ontologies
      • Gene Prediction
      • HMM
      • Composite Gene Prediction
      • Non-coding RNA
  50. 50. Towards profiles (PSSM) with indels (insertions and/or deletions)
      • Recall that profiles are matrices that identify the probability of seeing an amino acid at a particular location in a motif.
      • What about motifs that allow insertions or deletions (together, called indels)?
      • Patterns and regular expressions can handle these easily, but profiles are more flexible.
      • Can indels be integrated into profiles?
  51. 51. Hidden Markov Models: Graphical models of sequences
      • Need a representation that allows specification of the probability of introducing (and/or extending) a gap in the profile.
      [Figure: three profile columns separated by “Gap” states, with arcs labelled “continue” and “delete”]
             Column 1   Column 2   Column 3
        A    .1         .04        .2
        C    .05        .1         .01
        D    .2         .01        .05
        E    .08        .2         .1
        F    .01        .02        .06
  52. 52. Hidden Markov Chain
      • A sequence is said to be Markovian if the probability of the occurrence of an element in a particular position depends only on the previous elements in the sequence.
      • The order of a Markov chain depends on how many previous elements influence the probability
        – 0th order: the same probabilities at every position, independent of previous elements
        – 1st order: the probability depends only on the immediately previous position
      • 1st order Markov chains are good for proteins.
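
A first-order chain is fully described by a table of transition probabilities P(current | previous), which can be estimated by counting adjacent pairs. The sketch below does this for DNA; the function names, the pseudocount and the uniform start probability are assumptions made for the example.

```python
from math import log

def train_first_order(seqs, alphabet="ACGT", pseudo=1.0):
    """Estimate P(cur | prev) from adjacent pairs, with a pseudocount per cell."""
    counts = {a: {b: pseudo for b in alphabet} for a in alphabet}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
            for a in alphabet}

def log_prob(seq, trans, start=0.25):
    """Log-probability of seq: uniform start (assumed), then one transition per step."""
    lp = log(start)
    for prev, cur in zip(seq, seq[1:]):
        lp += log(trans[prev][cur])
    return lp

trans = train_first_order(["ACACACAC"])   # toy training sequence
lp_alternating = log_prob("ACAC", trans)
lp_homopolymer = log_prob("AAAA", trans)
```

After training on the alternating toy sequence, `ACAC` scores higher than `AAAA`, because A is mostly followed by C in the training data.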
  53. 53. Markov Chain for DNA
  54. 54. Markov chain with begin and end
  55. 55. Markov Models: Graphical models of sequences
      • Consist of states (boxes) and transitions (arcs) labeled with probabilities
      • States have probability(s) of “emitting” an element of a sequence (or nothing).
      • Arcs have a probability of moving from one state to another.
        – Sum of probabilities of all out arcs must be 1
        – Self-loops (e.g. gap extend) are OK.
  56. 56. Markov Models
      • Simplest example: each state emits (or, equivalently, recognizes) a particular element with probability 1, and each transition is equally likely.
      [Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4, End]
      Example sequences: 1234, 234, 14, 121214, 2123334
  57. 57. Hidden Markov Models: Probabilistic Markov Models
      • Now, add probabilities to each transition (let emission remain a single element)
      [Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4, End; arc labels 0.5, 0.9, 1.0, 0.25, 0.1, 0.8, 0.5, 0.75, 0.2]
      • We can calculate the probability of any sequence given this model by multiplying the transition probabilities along its path:
        p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03
        p(14) = 0.5 * 0.9 = 0.45
        p(2334) = 0.5 * 0.75 * 0.2 * 0.8 = 0.06
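
The three worked probabilities can be reproduced by multiplying arc probabilities along the path Begin → ... → End. The arc table below is an assumption: it is read off the flattened figure so that it agrees with the worked examples and with out-arcs summing to 1 (Begin→1 and Begin→2 at 0.5 each, 1→4 0.9, 1→2 0.1, 2→3 0.75, 2→1 0.25, 3→3 0.2, 3→4 0.8, 4→End 1.0).

```python
# Transition probabilities, reconstructed from the slide's numbers (assumed topology)
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"4": 0.9, "2": 0.1},
    "2": {"3": 0.75, "1": 0.25},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}

def sequence_probability(symbols):
    """Each state emits exactly one symbol, so the sequence fixes the path."""
    path = ["Begin", *symbols, "End"]
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= TRANSITIONS[a].get(b, 0.0)   # missing arc means probability 0
    return p

p_1234 = sequence_probability("1234")
p_14 = sequence_probability("14")
p_2334 = sequence_probability("2334")
```

Because every state emits a single fixed symbol, nothing is hidden yet: the observed sequence and the state path are the same thing.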
  58. 58. Hidden Markov Models: Probabilistic Emission
      • If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence
      [Figure: states Begin, A (0.8)/B (0.2), C (0.1)/D (0.9), B (0.7)/C (0.3), C (0.6)/A (0.4), End; arc labels 0.5, 0.9, 1.0, 0.25, 0.1, 0.8, 0.5, 0.75, 0.2]
      BCCD or BCCD? (the same sequence can be produced by different state paths)
  59. 59. Hidden Markov Models
      • Emission uncertainty means the sequence doesn't identify a unique path. The states are “hidden”
      [Figure: states Begin, A (0.8)/B (0.2), C (0.1)/D (0.9), B (0.7)/C (0.3), C (0.6)/A (0.4), End; arc labels 0.5, 0.9, 1.0, 0.25, 0.1, 0.8, 0.5, 0.75, 0.2]
      • Probability of a sequence is the sum over all paths that can produce it:
        p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9
                + 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9
                = 0.000972 + 0.013608
                = 0.01458
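
Summing over all paths by hand only works for tiny models; the forward algorithm does the same sum one symbol at a time. The sketch below uses state names S1 to S4 (labels invented here) and an arc layout reconstructed from the two products in the worked example, so the topology is an assumption.

```python
# Emission and transition tables (state names and topology are assumptions)
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.6, "A": 0.4},
    "S4": {"C": 0.1, "D": 0.9},
}
TRANS = {
    "Begin": {"S1": 0.5, "S2": 0.5},
    "S1": {"S4": 0.9, "S2": 0.1},
    "S2": {"S3": 0.75, "S1": 0.25},
    "S3": {"S3": 0.2, "S4": 0.8},
    "S4": {"End": 1.0},
}

def forward(seq):
    """P(seq | model): sum over all state paths via the forward recursion."""
    # f[s] = probability of emitting the prefix so far and being in state s
    f = {s: TRANS["Begin"].get(s, 0.0) * EMIT[s].get(seq[0], 0.0) for s in EMIT}
    for sym in seq[1:]:
        f = {s: EMIT[s].get(sym, 0.0) * sum(f[r] * TRANS[r].get(s, 0.0) for r in EMIT)
             for s in EMIT}
    return sum(f[s] * TRANS[s].get("End", 0.0) for s in EMIT)

p_bccd = forward("BCCD")
```

For `BCCD` only two paths have non-zero probability (S1-S2-S3-S4 and S2-S3-S3-S4), so the recursion returns the same 0.01458 as the hand calculation.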
  60. 60. Hidden Markov Models
  61. 61. Hidden Markov Models: The occasionally dishonest casino
  62. 62. Hidden Markov Models: The occasionally dishonest casino
  63. 63. Use of Hidden Markov Models
      • The HMM must first be “trained” using a training set
        – E.g. a database of known genes.
        – Consensus sequences for all signal sensors are needed.
        – Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors.
      • Transition probabilities between all connected states must be estimated.
      • Estimate the probability of sequence s, given model m, P(s|m)
        – Multiply probabilities along the most likely path (or add logs: less numeric error)
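
"Multiply probabilities along the most likely path (or add logs)" is what the Viterbi algorithm does. The sketch below is a generic log-space Viterbi, demonstrated on a small four-state model whose state names and numbers are assumptions chosen to mirror the emission example earlier in the deck.

```python
from math import log, exp, inf

def viterbi(seq, states, start, trans, emit, end):
    """Return (most likely state path, its log-probability), working in log space."""
    lg = lambda p: log(p) if p > 0 else -inf
    v = {s: lg(start.get(s, 0)) + lg(emit[s].get(seq[0], 0)) for s in states}
    back = []
    for sym in seq[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] + lg(trans[r].get(s, 0)))
            ptr[s] = best
            nxt[s] = v[best] + lg(trans[best].get(s, 0)) + lg(emit[s].get(sym, 0))
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda s: v[s] + lg(end.get(s, 0)))
    score = v[last] + lg(end.get(last, 0))
    path = [last]
    for ptr in reversed(back):          # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], score

# Illustrative model (assumed topology and labels)
STATES = ["S1", "S2", "S3", "S4"]
START = {"S1": 0.5, "S2": 0.5}
END = {"S4": 1.0}
TRANS = {"S1": {"S4": 0.9, "S2": 0.1}, "S2": {"S3": 0.75, "S1": 0.25},
         "S3": {"S3": 0.2, "S4": 0.8}, "S4": {}}
EMIT = {"S1": {"A": 0.8, "B": 0.2}, "S2": {"B": 0.7, "C": 0.3},
        "S3": {"C": 0.6, "A": 0.4}, "S4": {"C": 0.1, "D": 0.9}}

best_path, best_log_p = viterbi("BCCD", STATES, START, TRANS, EMIT, END)
```

Adding logs instead of multiplying keeps the numbers in a safe range: on sequences thousands of symbols long, the raw product would underflow to zero long before the end.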
  64. 64. Applications of Hidden Markov Models
      • HMMs are effectively profiles with gaps, and have applications throughout bioinformatics
      • Protein sequence applications:
        – MSAs and identifying distant homologs, e.g. Pfam uses HMMs to define its MSAs
        – Domain definitions
        – Used for fold recognition in protein structure prediction
      • Nucleotide sequence applications:
        – Models of exons, genes, etc. for gene recognition.
  65. 65. Hidden Markov Models Resources
      • UC Santa Cruz (David Haussler group)
        – SAM-02 server. Returns alignments, secondary structure predictions, HMM parameters, etc.
        – SAM HMM building program (requires free academic license)
      • Washington U. St. Louis (Sean Eddy group)
        – Pfam. Large database of precomputed HMM-based alignments of proteins
        – HMMer, program for building HMMs
      • Gene finders and other HMMs (more later)
  66. 66. Example TMHMM: Beyond Kyte-Doolittle …
  67. 67. HMM in protein analysis
      • http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
  68. 68. Hidden Markov model for gene structure
      Contents (red arcs):
      • 5’ UTR (J5’)
      • initial exon (EI)
      • exon (E)
      • intron (I)
      • final exon (EF)
      • single exon (ES)
      • 3’ UTR (J3’)
      Signals (blue nodes):
      • begin sequence (B)
      • start translation (S)
      • donor splice site (D)
      • acceptor splice site (A)
      • stop translation (T)
      • end sequence (F)
      Notes:
      • A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene.
      • A candidate gene structure is created by tracing a path from B to F.
      • A hidden Markov model (or hidden semi-Markov model) is defined by attaching stochastic models to each of the arcs and nodes.
  69. 69. Classic Programs for gene finding
      Some of the best programs are HMM based:
      • GenScan: http://genes.mit.edu/GENSCAN.html
      • GeneMark: http://opal.biology.gatech.edu/GeneMark/
      Other programs:
      • AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
  70. 70. Hidden Markov Models: Gene Finding Software
      • GENSCAN: a semi-Markov model (not to be confused with GeneScan, a commercial product)
        – Explicit model of how long to stay in a state (rather than just self-loops, which must be exponentially decaying)
      • Tracks “phase” of exon or intron (0 coincides with codon boundary, or 1 or 2)
      • Tracks strand (and direction)
  71. 71. Conservation of Gene Features
      [Figure: aligning identity (axis from 50% to 100%) across gene regions]
      Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
  72. 72. Composite Approaches
      • Use EST info to constrain HMMs (Genie)
      • Use protein homology info on top of HMMs (fgenesh++, GenomeScan)
      • Use cross-species genomic alignments on top of HMMs (twinscan, fgenesh2, SLAM, SGP)
  73. 73. Gene Prediction: more complex …
      1. Species specific
      2. Splicing enhancers found in coding regions
      3. Trans-splicing
      4. …
  74. 74. Length preference: 5’ ss, intcomp, branch, 3’ ss
  75. 75. RNA genes
      Besides the 6000 protein-coding genes, there are:
      • 140 ribosomal RNA genes
      • 275 transfer RNA genes
      • 40 small nuclear RNA genes
      • >100 small nucleolar genes
      • ?
      • pRNA in the phi29 rotary packaging motor (Simpson et al. Nature 408:745-750, 2000)
      • Cartilage-hair hypoplasia mapped to an RNA (Ridanpää et al. Cell 104:195-203, 2001)
      • The human Prader-Willi critical region (Cavaillé et al. PNAS 97:14035-7, 2000)
  76. 76. miRNA genes
      RNA genes can be hard to detect
      UGAGGUAGUAGGUUGUAUAGU
      C. elegans let-7; 21 nt (Pasquinelli et al. Nature 408:86-89, 2000)
      • Often small
      • Sometimes multicopy and redundant
      • Often not polyadenylated (not represented in ESTs)
      • Immune to frameshift and nonsense mutations
      • No open reading frame, no codon bias
      • Often evolving rapidly in primary sequence
  77. 77. Lin-4
      • Lin-4 was identified in a screen for mutations that affect the timing and sequence of postembryonic development in C. elegans. Mutants re-iterate L1 instead of later stages of development
      • Gene positionally cloned by isolating a 693-bp DNA fragment that can rescue the phenotype of mutant animals
      • No protein found, but a 61-nucleotide precursor RNA with stem-loop structure, which is processed to a 22-mer ncRNA
      • Genetically, lin-4 acts as a negative regulator of lin-14 and lin-28
      • The 3’ UTRs of the target genes have short stretches of complementarity to lin-4
      • Deletion of these lin-4 target sequences causes an unregulated gain-of-function (gof) phenotype
      • Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins although the target mRNA
  78. 78. Let-7 (Pasquinelli et al. Nature 408:86-89, 2000)
      • Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-nucleotide product
      • The small let-7 RNA is also thought to be a post-transcriptional negative regulator for lin-41 and lin-42
      • 100% conserved in all bilaterally symmetrical animals (not jellyfish and sponges)
      • Sometimes called stRNAs, small temporal RNAs
  79. 79. Two computational analysis problems
      • Similarity search (e.g. BLAST): I give you a query, you find sequences in a database that look like the query (note: SW/Blat)
        – For RNA, you want to take the secondary structure of the query into account
      • Genefinding: based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence
        – For RNA, with no open reading frame and no codon bias, what do you look for?
  80. 80. Context-free grammars
      Basic CFG “production rules”:
        S -> aS
        S -> Sa
        S -> aSu
        S -> SS
      A CFG “derivation”:
        S -> aS
  89. 89. Context-free grammars • Basic CFG "production rules": S -> aS; S -> Sa; S -> aSu; S -> SS • A CFG "derivation": S -> aS -> aaS -> aaSS -> aagScuS -> aagaSucugSc -> aagaSaucuggScc -> aagacSgaucuggcgSccc -> aagacuSgaucuggcgSccc -> aagacuuSgaucuggcgaSccc -> aagacuucSgaucuggcgacSccc -> aagacuucgSgaucuggcgacaSccc -> aagacuucggaucuggcgacaccc • [Figure: secondary structure of the derived sequence, with base pairs marked *]
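The derivation on this slide can be replayed mechanically. The sketch below rewrites every S in the current string, left to right, with one replacement per step and reproduces each intermediate form from the slide. Two things are assumptions read off the derivation rather than stated in the rule list: the four listed rules are treated as schematic (the derivation also uses other nucleotides and pairs, such as gSc and cSg), and the final step erases S (S -> empty). The names `expand`, `derive` and `STEPS` are hypothetical.

```python
def expand(form, replacements):
    """Rewrite each 'S' in `form`, left to right, with the matching replacement."""
    parts = form.split("S")
    if len(parts) - 1 != len(replacements):
        raise ValueError("need exactly one replacement per S")
    out = parts[0]
    for rep, tail in zip(replacements, parts[1:]):
        out += rep + tail
    return out


# One entry per derivation step; each entry lists the replacement for every S
# present at that step (use "S" to leave a nonterminal unchanged).
STEPS = [
    ["aS"],            # S -> aS          (unpaired base on the left)
    ["aS"],
    ["SS"],            # S -> SS          (bifurcation into two stems)
    ["gSc", "uS"],     # pair rule, schematic form S -> aSu
    ["aSu", "gSc"],
    ["Sa", "gSc"],     # S -> Sa          (unpaired base on the right)
    ["cSg", "cgSc"],   # the slide applies cS and gSc in one step here
    ["uS", "S"],
    ["uS", "aS"],
    ["cS", "cS"],
    ["gS", "aS"],
    ["", ""],          # assumed S -> empty, ending the derivation
]


def derive(steps, start="S"):
    """Return the list of sentential forms, starting from `start`."""
    forms = [start]
    for reps in steps:
        forms.append(expand(forms[-1], reps))
    return forms


forms = derive(STEPS)
for form in forms:
    print(form)
```

Each application of a pair rule such as gSc adds one base pair to a stem, which is how the grammar encodes the nested secondary structure that a plain sequence profile cannot express.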
  90. 90. The power of comparative analysis• Comparative genome analysis is an indispensable means of inferring whether a locus produces a ncRNA as opposed to encoding a protein.• For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species.• It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions.• With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.
  91. 91. Compensatory substitutions that maintain the structure [Figure: RNA stem-loop, 5' to 3']
  92. 92. Evolutionary conservation of RNA molecules can be revealed by identification of compensatory substitutions
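A minimal sketch of how compensatory substitutions might be detected, assuming two already-aligned, gap-free sequences and a shared secondary structure in dot-bracket notation (both assumptions; the slides do not specify an input format). A paired position counts as compensatory when both bases differ between the two sequences but each sequence still forms a valid pair there. Accepting G-U wobble pairs is also an assumption, and the example sequences are invented for illustration.

```python
# Watson-Crick pairs plus G-U wobble (the wobble pairs are an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_positions(dot_bracket):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def compensatory_substitutions(seq_a, seq_b, dot_bracket):
    """List paired positions where both bases changed but pairing is kept."""
    hits = []
    for i, j in paired_positions(dot_bracket):
        a = (seq_a[i], seq_a[j])
        b = (seq_b[i], seq_b[j])
        if a in PAIRS and b in PAIRS and a[0] != b[0] and a[1] != b[1]:
            hits.append((i, j, a, b))
    return hits


# Invented toy hairpin: the A-U pair at positions (2, 9) becomes G-C.
structure = "((((....))))"
species_1 = "GGACUUCGGUCC"
species_2 = "GGGCUUCGGCCC"
print(compensatory_substitutions(species_1, species_2, structure))
# [(2, 9, ('A', 'U'), ('G', 'C'))]
```

Two changes at the same time in positions that pair with each other are unlikely under neutral drift of unstructured sequence, which is why such hits support a conserved RNA structure.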
  93. 93. …………
  94. 94. • Manual annotation of 60,770 full-length mouse complementary DNA sequences, clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. • Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome.
  95. 95. Function of ncRNAs
  96. 96. ncRNAs & RNAi
  97. 97. Therapeutic Applications • Shooting millions of tiny RNA molecules into a mouse's bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver's self-destructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar) • In a series of experiments published online this week by Nature Medicine, Lieberman's team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.
