Gemoda


  1. A generic motif discovery algorithm for diverse biomolecular data. Kyle Jensen, Gregory Stephanopoulos. Department of Chemical Engineering, Massachusetts Institute of Technology.
  2. Motif discovery is the automated search for similar regions in streams of data <ul><li>Non-sequential data </li><ul><li>No “ordering” </li></ul><li>Sequential data </li><ul><li>A natural ordering of the data </li><ul><li>Nucleotide and amino acid sequences </li><li>Stock prices, protein structures </li></ul></ul></ul>MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA A motif is just a collection of mutually similar regions in the data stream.
  4. There are two classes of motif discovery tools commonly used for sequence analysis <ul><li>“Exhaustive” regular-expression-based tools </li><ul><li>Teiresias </li><li>Pratt </li></ul><li>“Descriptive” position weight matrix-based tools </li><ul><li>Gibbs sampler </li><li>MEME </li><li>Consensus </li></ul></ul>TGCTGTATATACTCACAGCA AACTGTATATACACCCAGGG TACTGTATGAGCATACAGTA ACCTGAATGAATATACAGTA TACTGTACATCCATACAGTA TACTGTATATTCATTCAGGT AACTGTTTTTTTATCCAGTA ATCTGTATATATACCCAGCT TACTGTATATAAAAACAGTA CT[AT].[GT]....A..CAG
  8. “Gemoda” was designed to be both exhaustive and descriptive <ul><li>Gemoda exhaustively returns maximal motifs </li><ul><li>Uses the convolution step of Teiresias </li><ul><li>A way of “stitching together” smaller patterns combinatorially </li></ul></ul><li>Gets descriptiveness from the similarity metric </li><ul><li>Generic, context-dependent definition of similarity </li></ul></ul>MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA Example comparison functions: F(w1, w2) = square error; F(w1, w2) = amino acid scoring matrix.
  9. Gemoda proceeds in three steps: comparison, clustering, and convolution. Jensen, K., Styczynski, M., Rigoutsos, I. and Stephanopoulos, G. (2005) A generic motif discovery algorithm for sequential data. Bioinformatics, in press.
  10. The comparison stage maps the pairwise similarities between all windows in the data streams <ul><li>Creates a distance matrix </li><ul><li>Does an all-by-all comparison of windows in the data </li><li>Comparison function is context-specific </li></ul></ul>F(w1, w2)
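The comparison stage can be sketched as follows. This is a minimal illustration, not Gemoda's implementation: it assumes numeric streams, uses the square-error function mentioned earlier as F(w1, w2), and the names `windows`, `square_error`, `similar_pairs`, and `max_error` are hypothetical.

```python
from itertools import combinations

def windows(stream, length):
    """Return every contiguous window of the given length, in order."""
    return [tuple(stream[i:i + length]) for i in range(len(stream) - length + 1)]

def square_error(w1, w2):
    """F(w1, w2): sum of squared differences between two equal-length windows."""
    return sum((a - b) ** 2 for a, b in zip(w1, w2))

def similar_pairs(streams, length, max_error):
    """All-by-all comparison; return the pairs of windows within max_error."""
    # Label each window with (stream index, offset) so its origin is kept.
    labelled = [((s, i), w)
                for s, stream in enumerate(streams)
                for i, w in enumerate(windows(stream, length))]
    pairs = []
    for (id1, w1), (id2, w2) in combinations(labelled, 2):
        if square_error(w1, w2) <= max_error:
            pairs.append((id1, id2))
    return pairs

# Two toy streams (made-up values) that agree closely on their first window.
streams = [[1.0, 2.0, 3.0, 9.0], [1.1, 2.1, 2.9, 0.0]]
pairs = similar_pairs(streams, length=3, max_error=0.1)
print(pairs)
```

Swapping `square_error` for a scoring-matrix lookup would adapt the same loop to amino acid windows; only the comparison function changes.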
  12. The clustering phase finds groups of mutually similar windows <ul><li>Different clustering functions have different uses </li><ul><li>Clique-finding is provably exhaustive </li><li>K-means and other methods are faster </li></ul><li>Output clusters become “elementary motifs”, which are convolved to make longer, maximal motifs </li></ul>
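Clique-finding on a similarity graph can be sketched with the basic Bron-Kerbosch recursion. This is an illustrative stand-in, not necessarily the clique finder Gemoda uses; the edge list is made up, and clique size is used here as a simple proxy for support.

```python
def maximal_cliques(edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when nothing can extend it and nothing
        # already processed could have extended it.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in sorted(candidates):
            expand(current | {v},
                   candidates & adjacency[v],
                   excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)

# Hypothetical similarity graph: nodes are window numbers, edges are similar pairs.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
clusters = [c for c in maximal_cliques(edges) if len(c) >= 2]
print(clusters)
```

Because every maximal clique is enumerated, no group of mutually similar windows is missed, which is the sense in which clique-finding is exhaustive; the cost is a worst-case running time that grows exponentially with graph size.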
  14. The convolution phase “stitches” together the clusters into maximal motifs <ul><li>The motifs should be as long as possible, without decreasing the support </li></ul>[Figure labels: elementary motifs (clusters); window ordering]
  15. Here we show a few representative ways in which Gemoda can be used. Motif discovery in: <ul><li>Protein sequences </li><ul><li>(ppGpp)ase enzymes and finding known domains </li></ul><li>DNA sequences </li><ul><li>The LD-motif challenge problem </li></ul><li>Protein structures </li><ul><li>Conserved structures without conserved sequences </li></ul></ul>
  16. Gemoda can be applied to amino acid sequences <ul><li>Example: (ppGpp)ase family from the ENZYME database </li><ul><li>Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes </li><ul><li>EC 3.1.7.2 </li><li>Average length ~700 amino acids </li><li>8 sequences from 8 species </li></ul><li>Searched using Gemoda </li><ul><li>Minimum length = 50 amino acids </li><li>Minimum BLOSUM62 bit score = 50 bits </li><li>Minimum support = 100% (8/8 sequences) </li><li>Clustering method = clique finding </li></ul></ul></ul>Can Gemoda find this known motif? How sensitive is Gemoda to “noise”?
  22. (ppGpp)ase example: the comparison phase shows many regions of local similarity. Dots indicate 50 aa windows that are pairwise similar. Streaks indicate regions that will probably be convolved into a maximal motif.
  23. (ppGpp)ase example: the clustering phase shows elementary motifs conserved among all 8 enzyme sequences
  24. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's Conserved Domain Database. Maximal motif (one of three, ~100 aa in length). This particular cluster represents the first set of eight 50 aa windows in the above motif. Results are insensitive to “noise”.
  25. The LD-motif problem models the subtle binding site discovery problem. Motif: GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Pevzner & Sze, Proc. ISMB, 2000
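A test instance of this kind can be generated with a short script. The sketch below is only an illustration: the motif is the one shown on the slide, but the parameters (4 mutations per occurrence, 20 sequences of 600 nucleotides) are assumptions not stated here, and the helper names `mutate` and `plant` are hypothetical.

```python
import random

def mutate(motif, d, rng, alphabet="ACGT"):
    """Return a copy of motif with exactly d positions changed."""
    chars = list(motif)
    for pos in rng.sample(range(len(chars)), d):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)

def plant(motif, d, n_seqs, seq_len, seed=0):
    """Hide one mutated copy of motif at a random position in each sequence."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_seqs):
        background = [rng.choice("ACGT") for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(motif) + 1)
        instance = mutate(motif, d, rng)
        background[start:start + len(motif)] = instance
        result.append(("".join(background), start, instance))
    return result

MOTIF = "GACTCGATAGCGACG"
# Assumed parameters: 4 mutations per occurrence, 20 sequences of length 600.
dataset = plant(MOTIF, d=4, n_seqs=20, seq_len=600)
sequence, start, instance = dataset[0]
print(start, instance)
```

Each planted occurrence differs from the motif at exactly `d` positions, so any two occurrences can differ from each other at up to `2 * d` positions, which is what makes the signal subtle for pairwise comparison.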
  26. Gemoda can solve both the LD-motif problem and a more general version of it. GG GACTCGATAGCGACG CCG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Total motif length? Styczynski, M., Jensen, K., Rigoutsos, I. and Stephanopoulos, G. (2004) An extension and novel solution to the Motif Challenge Problem. Genome Informatics, 15 (2).
  27. Gemoda can solve both the LD-motif problem and a more general version of it. GACTCGATAGCGACG X All sequences? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
  28. Gemoda can solve both the LD-motif problem and a more general version of it. GACTCGATAGCGACG Number of mutations? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
  29. Gemoda can solve both the LD-motif problem and a more general version of it. GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTA TATCTGGTTCGACTT AGCTATCTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTAC TATCTTATTCGACTG AGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATGACTAGTGACT... Number of unique motifs?
  30. Gemoda can also be applied to protein structures <ul><li>Treat protein structure as alpha-carbon trace </li><ul><li>Series of x,y,z coordinates </li></ul></ul><ul><li>Use a comparison function that compares x,y,z windows </li><ul><li>Root mean square deviation (RMSD) </li><li>unit-RMSD </li></ul></ul>(x1, y1, z1), (x2, y2, z2), (x3, y3, z3), ..., (xM, yM, zM)
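Comparing two windows of alpha-carbon coordinates by RMSD can be sketched as below. This is a simplified illustration: it assumes the two windows are already in a common frame and skips the optimal superposition that a structural comparison would normally perform first; the function name and the coordinates are made up.

```python
import math

def rmsd(window_a, window_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    No rotation or translation is applied, so this is only meaningful
    for windows that are already superimposed.
    """
    if len(window_a) != len(window_b):
        raise ValueError("windows must contain the same number of points")
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(window_a, window_b))
    return math.sqrt(total / len(window_a))

# Hypothetical two-residue windows that differ only in one z coordinate.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(a, b))
```

Two windows would then count as similar when their RMSD falls below a chosen threshold, which plays the same role as the score cutoff in the sequence examples.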
  32. Protein structure example: human FIT vs. uridylyltransferase
  33. Questions?
  34. The Gemoda algorithm guarantees maximality and exhaustiveness <ul><li>Maximality </li><ul><li>Motifs are as long as possible </li><li>Motifs are as specific as possible </li><li>Motifs are not missing any occurrences </li></ul></ul><ul><li>Exhaustiveness </li><ul><li>All maximal motifs are found </li><li>No non-maximal motifs are found </li></ul></ul>[Figure legend: motif1, motif2]
  38. Gemoda may be used with nucleotide sequences to find regulatory motifs <ul><li>The LD-motif problem: an example for the board </li></ul>Given: >AACTG >AATTA >AATTG. Look for motifs of at least 3 nucleotides in which any two windows of length 3 have a Hamming distance of 1 or less. We get the following windows: 1 = AAC, 2 = ACT, 3 = CTG, 4 = AAT, 5 = ATT, 6 = TTA, 7 = AAT, 8 = ATT, 9 = TTG
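The window list above, and the pairs that pass the Hamming-distance test, can be reproduced with a few lines. This is a plain all-by-all sketch for the board example; it also compares windows from the same sequence, which a real run might treat differently.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

sequences = ["AACTG", "AATTA", "AATTG"]
L = 3

# Windows are numbered from 1 in sequence order, as on the slide.
windows = [seq[i:i + L] for seq in sequences for i in range(len(seq) - L + 1)]

similar = [(i + 1, j + 1)
           for (i, w1), (j, w2) in combinations(enumerate(windows), 2)
           if hamming(w1, w2) <= 1]

for number, window in enumerate(windows, start=1):
    print(number, window)
print(similar)
```

The `similar` list is the edge list of the similarity graph that the clustering phase would take as input.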
  39. A simple natural language example <ul><li>Choosing a window length of L=4 gives 7 unique windows in the three sequences </li></ul>Seq 1: motif, Seq 2: motor, Seq 3: potion
  40. Here we show the comparison phase using two different similarity metrics <ul><li>X's and dotted lines </li><ul><li>Identity matrix: ¾ </li></ul><li>O's and solid lines </li><ul><li>Consonant/vowel matrix: ¾ </li></ul></ul>Input sequences: Seq 1: motif, Seq 2: motor, Seq 3: potion. Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion. [Figure: similarity graph]
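The two similarity graphs can be recomputed from the windows. The sketch below assumes that “¾” means at least 3 of the 4 positions must agree, and that the consonant/vowel matrix scores a position as a match when both letters are vowels or both are consonants; both readings are assumptions, and the function names are hypothetical.

```python
from itertools import combinations

VOWELS = set("aeiou")

def identity_score(w1, w2):
    """Count positions where the two windows have the same letter."""
    return sum(a == b for a, b in zip(w1, w2))

def class_score(w1, w2):
    """Count positions where both letters are vowels or both are consonants."""
    return sum((a in VOWELS) == (b in VOWELS) for a, b in zip(w1, w2))

def edges(windows, score, threshold):
    """Pairs of window numbers (1-based) scoring at least threshold."""
    return [(i + 1, j + 1)
            for (i, a), (j, b) in combinations(enumerate(windows), 2)
            if score(a, b) >= threshold]

sequences = ["motif", "motor", "potion"]
L = 4
windows = [s[i:i + L] for s in sequences for i in range(len(s) - L + 1)]

identity_edges = edges(windows, identity_score, 3)   # dotted lines
class_edges = edges(windows, class_score, 3)         # solid lines
print(identity_edges)
print(class_edges)
```

Under these assumptions every identity edge is also a consonant/vowel edge, since identical letters always fall in the same class; the looser metric simply adds more edges.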
  41. The clustered windows (elementary motifs) differ depending on the similarity function <ul><li>Clustering phase </li><ul><li>Clique-finding </li><li>Support ≥ 2 </li></ul></ul>Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion. Solid lines (vowel/cons): Cluster 1: 1: moti, 3: moto, 5: poti; Cluster 2: 2: otif, 4: otor, 6: otio. Dotted lines (identity): Cluster 1: 1: moti, 3: moto; Cluster 2: 1: moti, 5: poti; Cluster 3: 2: otif, 6: otio.
  43. Likewise, the final, convolved motifs depend on the choice of similarity function. Input sequences: Seq 1: motif, Seq 2: motor, Seq 3: potion. Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion. Vowel/cons: Cluster 1: 1: moti, 3: moto, 5: poti; Cluster 2: 2: otif, 4: otor, 6: otio; convolved into Motif 1: motif, motor, potio. Identity: Cluster 1: 1: moti, 3: moto; Cluster 2: 1: moti, 5: poti; Cluster 3: 2: otif, 6: otio; convolved into Motif 1: motif, potio and Motif 2: moti, moto.
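The convolved motifs above can be reproduced with a simplified convolution. This is only a sketch under stated assumptions, not the Gemoda or Teiresias convolution: each window is written as a hypothetical (sequence index, offset) pair, a cluster is extended by one position only when a single cluster contains every shifted window (so no occurrence is lost), and motifs lying entirely inside a longer motif are dropped.

```python
def covered(motif, other):
    """True if every occurrence of motif lies inside an occurrence of a longer motif."""
    length, occ = motif
    other_len, other_occ = other
    if other_len <= length:
        return False
    return all(any(s == s2 and o2 <= o and o + length <= o2 + other_len
                   for s2, o2 in other_occ)
               for s, o in occ)

def convolve(clusters, window_len):
    """Extend each cluster to the right while all of its occurrences survive."""
    cluster_sets = [frozenset(c) for c in clusters]
    motifs = []
    for start in cluster_sets:
        tail, length = start, window_len
        while True:
            shifted = frozenset((s, o + 1) for s, o in tail)
            if any(shifted <= c for c in cluster_sets):
                tail, length = shifted, length + 1
            else:
                break
        shift = length - window_len
        motifs.append((length, sorted((s, o - shift) for s, o in tail)))
    return [m for m in motifs if not any(covered(m, other) for other in motifs)]

def spell(motifs, sequences):
    """Turn (length, occurrences) motifs back into substrings."""
    return [[sequences[s][o:o + length] for s, o in occ] for length, occ in motifs]

sequences = ["motif", "motor", "potion"]
# Clusters from the slide, written as (sequence index, offset) windows.
class_clusters = [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]]
identity_clusters = [[(0, 0), (1, 0)], [(0, 0), (2, 0)], [(0, 1), (2, 1)]]

class_motifs = spell(convolve(class_clusters, 4), sequences)
identity_motifs = spell(convolve(identity_clusters, 4), sequences)
print(class_motifs)
print(identity_motifs)
```

The subset test is stricter than a general convolution, which would also consider partial overlaps between clusters; it is enough for this small example but should not be read as the full algorithm.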
