Kyle Jensen MIT Ph.D. Thesis Defense

Transcript

  • 1. Motif discovery in sequential data. Kyle Jensen, Thesis Offense, Department of Chemical Engineering, Massachusetts Institute of Technology. Thesis committee: Greg Stephanopoulos (ChE, MIT), William Green (ChE, MIT), Robert Berwick (EECS, MIT), Isidore Rigoutsos (IBM)
  • 2. Sequencing throughput, like processor power, is growing exponentially
  • 3. As a result, Genbank is overflowing
  • 4. Anatomics, Biomics, Chromosomics, Cytomics, Enviromics, Epigenomics, Fluxomics, Glycomics, Glycoproteomics, Immunogen., Immunomics, Immunoproteomics, Integromics, Interactomics, Ionomics, Lipidomics, Metabolomics, Metabonomics, Metagenomics, Metallomics, Metalloproteomics, Methylomics, Mitogenomics, Neuromics, Neuropeptido., Oncogenomics, Peptidomics, Phenomics, Phospho-prot., Phosphoproteomics, Physiomics, Physionomics, Post–genomics, Postgenomics, Pre–genomics, Rnomics, Secretomics, Subproteomics, Surfaceomics, Syndromics, Transcriptomics. And the “ome-ome” keeps growing
  • 5. Together, these data form a rich network of information
  • 6. This data glut motivates the need for automated methods of discovery and analysis. Here, I focus on motif discovery in sequential data using a linguistic metaphor
    [Slide background: a long raw DNA sequence]
  • 7. A grammar is a mathematical system for describing the structure of a language
    S -> NP VP
    NP -> D NP1 | PN
    NP1 -> ADJ NP1 | N
    VP -> V NP
    D -> a | the
    PN -> peter | paul | mary
    ADJ -> large | black
    N -> dog | cat | horse
    V -> is | likes | hates
  • 8. GRAMMAR
    S -> NP VP
    NP -> D NP1 | PN
    NP1 -> ADJ NP1 | N
    VP -> V NP
    D -> a | the
    PN -> peter | paul | mary
    ADJ -> large | black
    N -> dog | cat | horse
    V -> is | likes | hates
    Derivation 1: S => NP VP => PN VP => mary VP => mary V NP => mary hates NP => mary hates D NP1 => mary hates the NP1 => mary hates the N => mary hates the dog
    Derivation 2: S => NP VP => NP V NP => NP V D NP1 => NP V a NP1 => NP V a ADJ NP1 => NP is a ADJ NP1 => NP is a ADJ ADJ NP1 => NP is a large ADJ NP1 => NP is a large ADJ N => NP is a large black N => NP is a large black cat => PN is a large black cat => peter is a large black cat
  • 9. Grammars can describe biological phenomena in the same manner as natural languages
    • Two examples
      • Example: a declarative sentence in English
      • Example: eukaryotic gene structure
    [Figure: parse tree with nodes S, NP, D, N, V, A, P for the sentences “the boy is upset over the girl” and “the advisor is pleased with the research”, using the rules S -> NP V A P NP and NP -> { D N, N }]
    [Figure: eukaryotic gene diagram over an example DNA sequence, labeled: gene, upstream, TATA box, start codon, exon, intron, exon, stop codon, primary transcript]
  • 11. Grammars are suitable for describing any complex arrangement of sequential data
    • The grammar of biological sequences
    [Table columns: language, grammar, linguistic example, biological example, complexity]
  • 12.  
  • 13.  
  • 14. Simple, regular grammars are compactly written as regular expressions
    Example: [LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
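    A minimal sketch of how the pattern on this slide could be run as an ordinary regular expression. It assumes that ".(9,20)" means "between 9 and 20 arbitrary residues", which Python writes as ".{9,20}". The helper name to_python_regex and the toy sequence are illustrative and not from the talk.

```python
import re

# Pattern copied verbatim from the slide.
SLIDE_PATTERN = "[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]"


def to_python_regex(pattern: str) -> str:
    """Rewrite '.(m,n)' gaps (assumed notation) as Python's '.{m,n}'."""
    return re.sub(r"\.\((\d+),(\d+)\)", r".{\1,\2}", pattern)


regex = re.compile(to_python_regex(SLIDE_PATTERN))

# Hand-built toy sequence (not a real protein): a run of L residues can serve
# as [LIVF], the fixed gap, and [LIV]; then R, a 10-residue variable gap,
# the WS.WS core, and a run of F residues for the final gap and [FYW].
toy = "L" * 15 + "R" + "G" * 10 + "WSAWS" + "F" * 6

match = regex.search(toy)
print(match is not None)
print(regex.search("ACDEFGHIK") is not None)
```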
  • 15. Motif discovery is the inverse problem: given the sentences, find the grammar
    [Slide background: a long raw DNA sequence]
  • 16. Part 1: Rational design of antimicrobial peptides using linguistic methods
  • 17. Antimicrobial peptides are small proteins that attack and kill bacteria
    • Functional characteristics:
      • Part of innate immune system
        • all multicellular eukaryotes
      • Attack bacterial membrane
        • electrostatic attraction
        • effective at µg/mL concentrations
    • Applications of AmPs:
      • Novel class of antibiotics
        • low bacterial resistance
        • activity against “MDR” pathogens
        • currently topical: acne, etc.
      • Other clinical applications
        • AIDS, certain cancers, biodefense
    [Figure: AmPs (+) and the bacterial membrane (-), illustrating electrostatic attraction]
  • 21.  
  • 22. AmP sequences contain many repeated motifs, suggesting a linguistic model
    • AmP amino acid sequences
      • ~1000 natural AmP sequences
        • from many different species
      • Numerous conserved motifs
        • suggest “rules” for building AmPs
        • similar to grammar of languages
    [Figure: aligned cecropins with the cecropin motif highlighted]
    • The “language” of AmP sequences
      • Can we find the underlying grammar of this language?
        • Will this grammar capture the sequence/function relationships?
      • Knowing the grammar, can we build novel AmPs?
  • 24. The AmP sequences were modeled using simple regular grammars
    • Given a language, is there a regular grammar?
      • Example: the cecropin sub-sequences
    • Automated grammar induction: Teiresias
      • Regular grammars of the form
        • R: V_i -> σ V_j, where σ ∈ Σ (type A, amino acid) or σ = {Σ} (type B, wildcard)
        • Find all G for which a/b > w and a + b > L
        • Subject to maximal |R| and maximal occurrences of G
    G = (V, Σ, R, S), where V = non-terminal symbols, Σ = amino acids, R = set of replacement rules, S = starting amino acid
    What grammar describes these sequences?
      seq1: QSEAGWLKKLGK
      seq2: QSEAGWLRKAAK
      seq3: QTEAGGLKKFGK
      cecropin motif: Q.EAG.L.K..K
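    As a small illustration of how the cecropin motif relates to the three sub-sequences, the sketch below takes a column-wise consensus of the pre-aligned, equal-length sequences and writes a wildcard wherever they disagree. This is not Teiresias, which searches unaligned data; the function name consensus_pattern is an assumption made for the example.

```python
def consensus_pattern(seqs, wildcard="."):
    """Keep a residue where all aligned sequences agree, else emit a wildcard."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else wildcard
        for column in zip(*seqs)
    )


# The three cecropin sub-sequences shown on the slide.
cecropins = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]
print(consensus_pattern(cecropins))  # Q.EAG.L.K..K
```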
  • 27. Our goal was to use this linguistic model to design novel AmPs
    • Protein design space is combinatorially large
      • 20^N possible sequences of N amino acids
        • N = 18, number of stars in universe
        • N = 50, number of atoms in Earth
        • N = 100, number of electrons in universe
    • Why design novel AmPs?
      • Concern over RamPs
        • Cross-resistance
    • Other approaches
      • Folding & thermodynamics
      • Combinatorial libraries
    [Figure labels: sequence space, grammatical space, natural AmPs, “true” AmPs]
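    The size of the design space can be computed directly. The sketch below only evaluates 20^N for the three lengths on the slide; the comparisons to stars, atoms, and electrons are the slide's and are not checked here. The function name design_space is an assumption.

```python
def design_space(n_residues: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** n_residues


for n in (18, 50, 100):
    print(f"N = {n}: 20^{n} ≈ {design_space(n):.2e}")
```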
  • 31. We used Teiresias to discover ~700 grammars defining the “language of AmPs”
    [Figure: a query sequence matched against grammar 1 and grammar 2]
      • These grammars were used to design novel AmPs
        • No designed sequence shares more than 5 consecutive amino acids with a natural AmP
        • 12 million “grammatical” sequences
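    One way to enforce the "no more than 5 in a row" constraint is a k-mer lookup: reject any candidate that shares a run of 6 or more residues with a natural sequence. This is a sketch under that reading of the slide; the function name shares_run and the toy candidates are assumptions, and the natural sequence is the cecropin sub-sequence shown earlier in the talk.

```python
def shares_run(candidate: str, naturals, k: int = 6) -> bool:
    """True if candidate has any run of k residues in common with a natural sequence."""
    natural_kmers = {
        nat[i:i + k] for nat in naturals for i in range(len(nat) - k + 1)
    }
    return any(
        candidate[i:i + k] in natural_kmers
        for i in range(len(candidate) - k + 1)
    )


naturals = ["QSEAGWLKKLGK"]
print(shares_run("QSEAGWAAAA", naturals))  # shares the 6-mer QSEAGW
print(shares_run("QSEAGAAAAA", naturals))  # longest shared run is 5
```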
  • 33. 40 novel AmPs were chosen for experimental validation
    • Tested against B. subtilis & E. coli
    [Figure: serial dilutions, with replicates]
    Group    Peptides          Expect activity?
    Test     42 motif-based    Y
    Test     42 shuffled       N
    Control  9 natural AmPs    Y
    Control  9 non-AmPs        N
  • 34. Our results show significant enrichment for activity in the designed set
    Group    Peptides          Expected activity?  Active
    Test     42 motif-based    Y                   18 / 42
    Test     42 shuffled       N                   2 / 42
    Control  9 natural AmPs    Y                   6 / 9
    Control  9 non-AmPs        N                   0 / 9
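    The slide does not say which statistical test backs the word "significant". As one plausible check, the sketch below runs a one-sided Fisher exact test (a hypergeometric tail) on the motif-based versus shuffled counts. The function name enrichment_p_value is an assumption.

```python
from math import comb


def enrichment_p_value(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """One-sided Fisher exact test: P(group A gets >= hits_a of the pooled hits)."""
    total = n_a + n_b
    total_hits = hits_a + hits_b
    upper = min(total_hits, n_a)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_a - k)
        for k in range(hits_a, upper + 1)
    )
    return tail / comb(total, n_a)


# 18 of 42 motif-based designs were active versus 2 of 42 shuffled sequences.
p = enrichment_p_value(18, 42, 2, 42)
print(f"one-sided p = {p:.2e}")
```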
  • 35. Optimized leads showed strong activity against anthrax and staph
  • 36.  
  • 37. Part 2: A generic motif discovery algorithm for diverse biomolecular data
  • 38. Motif discovery is the automated search for similar regions in streams of data
    • Non-sequential data
      • No “ordering”
    • Sequential data
      • A natural ordering of the data
        • Nucleotide and amino acid sequences
        • Stock prices, protein structures
    [Slide background: a long amino acid sequence]
    A motif is just a collection of mutually similar regions in the data stream
  • 40. There are two classes of motif discovery tools commonly used for sequence analysis
    • “Exhaustive” regular-expression based tools
    • “Descriptive” position weight matrix-based tools
    Example sites:
      TGCTGTATATACTCACAGCA
      AACTGTATATACACCCAGGG
      TACTGTATGAGCATACAGTA
      ACCTGAATGAATATACAGTA
      TACTGTACATCCATACAGTA
      TACTGTATATTCATTCAGGT
      AACTGTTTTTTTATCCAGTA
      ATCTGTATATATACCCAGCT
      TACTGTATATAAAAACAGTA
    Regular expression: CT[AT].[GT]....A..CAG
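    To make the "descriptive" class concrete, the sketch below builds a position count matrix and a frequency matrix from the nine sites on this slide. It is a bare-bones illustration without pseudocounts or background correction, and the function names are assumptions rather than any particular tool's API.

```python
from collections import Counter

# The nine aligned sites shown on the slide.
SITES = [
    "TGCTGTATATACTCACAGCA",
    "AACTGTATATACACCCAGGG",
    "TACTGTATGAGCATACAGTA",
    "ACCTGAATGAATATACAGTA",
    "TACTGTACATCCATACAGTA",
    "TACTGTATATTCATTCAGGT",
    "AACTGTTTTTTTATCCAGTA",
    "ATCTGTATATATACCCAGCT",
    "TACTGTATATAAAAACAGTA",
]


def count_matrix(seqs):
    """One Counter of base counts per alignment column."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sites must be aligned and of equal length")
    return [Counter(column) for column in zip(*seqs)]


def frequency_matrix(seqs, alphabet="ACGT"):
    """Per-column base frequencies (no pseudocounts)."""
    n = len(seqs)
    return [{base: col[base] / n for base in alphabet} for col in count_matrix(seqs)]


counts = count_matrix(SITES)
freqs = frequency_matrix(SITES)

# Columns where every site agrees are the fully conserved positions.
conserved = [i for i, col in enumerate(counts) if len(col) == 1]
print(conserved)
```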
  • 44. “Gemoda” was designed to be exhaustive and have descriptive power
    • Gemoda exhaustively returns maximal motifs
      • Uses the convolution approach from Teiresias
        • A way of “stitching together” smaller patterns combinatorially
    • Gets descriptiveness from similarity metric
      • Generic, context dependent definition of similarity
    [Slide background: a long amino acid sequence]
    Example comparison functions: F(w1, w2) = squared error; F(w1, w2) = amino acid scoring matrix
  • 45. Gemoda proceeds in three steps: comparison, clustering, and convolution
  • 46. The comparison stage is used to map the pairwise similarities between all windows in the data streams
    • Creates a distance matrix
      • Does an all-by-all comparison of windows in the data
      • Comparison function is context-specific
    F(w1, w2)
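    A minimal sketch of the comparison stage as described above: slide a fixed-width window over every sequence, apply a comparison function F to every pair of windows, and record the pairs that pass a threshold. The identity-count F used here is a stand-in for a context-specific function such as a scoring matrix, and all names are assumptions rather than Gemoda's actual interface.

```python
def windows(seq, width):
    """All contiguous windows of the given width."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def identity(w1, w2):
    """Stand-in comparison function: number of matching positions."""
    return sum(a == b for a, b in zip(w1, w2))


def similar_pairs(seqs, width, F, threshold):
    """All-by-all comparison; returns pairs of (sequence index, offset) keys."""
    indexed = [
        ((s, i), win)
        for s, seq in enumerate(seqs)
        for i, win in enumerate(windows(seq, width))
    ]
    pairs = set()
    for a in range(len(indexed)):
        for b in range(a + 1, len(indexed)):
            if F(indexed[a][1], indexed[b][1]) >= threshold:
                pairs.add((indexed[a][0], indexed[b][0]))
    return pairs


# Toy input: both sequences contain QSEAG.
pairs = similar_pairs(["AAQSEAG", "QSEAGTT"], width=5, F=identity, threshold=5)
print(pairs)
```

    The all-by-all loop is quadratic in the number of windows, which is the price of an exhaustive comparison.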
  • 48. The clustering phase is used to find groups of mutually similar windows
    • Different clustering functions have different uses
      • Clique-finding is provably exhaustive
      • K-means and other methods are faster
    • Output clusters become “elementary motifs” which are convolved to make longer, maximal motifs
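    The clique-finding option can be sketched with the textbook Bron–Kerbosch algorithm, which enumerates every maximal clique of the similarity graph. This is a generic illustration, not Gemoda's own code; the adjacency-dictionary input and the min_support filter are assumptions.

```python
def maximal_cliques(adjacency, min_support=2):
    """Enumerate maximal cliques (Bron–Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            if len(current) >= min_support:
                cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Toy similarity graph: windows a, b, c are mutually similar; c is also similar to d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
found = maximal_cliques(graph)
print(found)
```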
  • 50. The convolution phase is used to “stitch” together the clusters into maximal motifs
    • The motifs should be as long as possible, without decreasing the support
    [Figure: elementary motifs (clusters) arranged by window ordering]
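    A simplified sketch of the stitching idea: a cluster of windows can be extended by one position when every member, shifted right by one, still falls inside some cluster, so the support never drops. The real convolution is more general; the data layout (clusters as sets of (sequence, offset) pairs) and the function name convolve are assumptions.

```python
def convolve(clusters, window=1):
    """Stitch clusters of (sequence, offset) windows into longer motifs.

    Returns (start positions, motif length) pairs. Support is preserved
    because a motif is only extended when all of its occurrences extend.
    """
    clusters = [frozenset(c) for c in clusters]

    def in_some_cluster(members):
        return any(members <= c for c in clusters)

    motifs = []
    for starts in clusters:
        # A cluster that continues an earlier one is not left-maximal: skip it.
        if in_some_cluster(frozenset((s, p - 1) for s, p in starts)):
            continue
        n_windows = 1
        while in_some_cluster(frozenset((s, p + n_windows) for s, p in starts)):
            n_windows += 1
        motifs.append((starts, n_windows + window - 1))
    return motifs


# Three consecutive clusters chain into one motif; the fourth stands alone.
elementary = [
    {(0, 1), (1, 5)},
    {(0, 2), (1, 6)},
    {(0, 3), (1, 7)},
    {(0, 10), (1, 20)},
]
motifs = convolve(elementary, window=5)
print(motifs)
```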
  • 51. Here we show a few representative ways in which Gemoda can be used Motif discovery in...
    • Protein sequences
      • (ppGpp)ase enzymes & finding known domains
    • DNA sequences
      • The LD-motif challenge problem
    • Protein structures
      • Conserved structures without conserved sequences
  • 52. Gemoda can be applied to amino acid sequences as well
    • Example: (ppGpp)ase family from ENZYME database
      • Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes
        • EC 3.1.7.2
        • Ave. length ~700 amino acids
        • 8 sequences from 8 species
      • Searched using Gemoda
        • Minimum length = 50 amino acids
        • Minimum Blosum62 bit score = 50 bits
        • Minimum support = 100% (8/8 sequences)
        • Clustering method = clique finding
    Can Gemoda find this known motif? How sensitive is Gemoda to “noise”?
  • 58. (ppGpp)ase example: the comparison phase shows many regions of local similarity
    • Dots indicate 50 aa windows that are pairwise similar
    • Streaks indicate regions that will probably be convolved into a maximal motif
  • 59. (ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences
  • 60. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database
    • Maximal motif (one of three, ~100 aa in length)
    • This particular cluster represents the first set of eight 50 aa windows in the above motif
    • Results are insensitive to “noise”
  • 61. The LD-motif problem models the subtle binding site discovery problem
    [Figure: the motif GACTCGATAGCGACG appears, with some bases changed, at a different position in each of Sequence #1 through Sequence #m]
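    The verification half of this problem is easy to state in code: given a candidate motif, report every window within d mismatches of it. The hard part, finding the motif when it is unknown, is what motif discovery tools address. The sketch below uses the motif from the slide and a made-up sequence carrying a copy with two substitutions; the function names and d = 2 are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurrences(motif: str, seq: str, d: int):
    """All (offset, window) pairs within d mismatches of the motif."""
    length = len(motif)
    return [
        (i, seq[i:i + length])
        for i in range(len(seq) - length + 1)
        if hamming(motif, seq[i:i + length]) <= d
    ]


MOTIF = "GACTCGATAGCGACG"  # motif shown on the slide
# Made-up sequence: a copy of the motif with two substitutions, plus flanks.
toy_seq = "TTTT" + "GAGTCGATTGCGACG" + "CCCC"
hits = occurrences(MOTIF, toy_seq, d=2)
print(hits)
```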
  • 62. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    • Total motif length?
    [Figure: the same sequence set, with the motif GACTCGATAGCGACG shown flanked by extra bases (GG ... CCG)]
  • 63. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    • All sequences?
    [Figure: the same sequence set, with Sequence #2 marked X and showing no motif occurrence]
  • 64. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    • Number of mutations?
    [Figure: the same sequence set as the previous slide]
  • 65. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    • Number of unique motifs?
    [Figure: the same sequence set, with additional highlighted regions in Sequence #4 and Sequence #m]
  • 66. Gemoda can also be applied to protein structures
    • Treat protein structure as alpha-carbon trace
      • Series of x,y,z coordinates
    • Use a comparison function that compares x,y,z windows
      • Root mean square deviation (RMSD)
      • unit-RMSD
    Input: (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), ..., (xM, yM, zM)
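    A minimal sketch of RMSD as a window comparison function over alpha-carbon coordinates. It compares the coordinates as given, without first superimposing the two windows, which a real structural comparison would normally do; unit-RMSD is not shown. The function name rmsd is an assumption.

```python
from math import sqrt


def rmsd(window1, window2):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(window1) != len(window2) or not window1:
        raise ValueError("windows must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(window1, window2)
        for a, b in zip(p, q)
    )
    return sqrt(squared / len(window1))


# Toy windows: the second is the first translated by (0, 3, 4), a distance of 5.
w1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
w2 = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0), (2.0, 3.0, 4.0)]
print(rmsd(w1, w2))  # 5.0
```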
  • 68. Protein structure example: human FIT vs. uridylyltransferase
  • 69. fin