Kyle Jensen MIT Ph.D. Thesis Defense
1. Motif discovery in sequential data
    Kyle Jensen
    Thesis Offense
    Department of Chemical Engineering, Massachusetts Institute of Technology
    Thesis committee: Greg Stephanopoulos (ChE, MIT), William Green (ChE, MIT), Robert Berwick (EECS, MIT), Isidore Rigoutsos (IBM)
7. A grammar is a mathematical system for describing the structure of a language
    S   -> NP VP
    NP  -> D NP1 | PN
    NP1 -> ADJ NP1 | N
    VP  -> V NP
    D   -> a | the
    PN  -> peter | paul | mary
    ADJ -> large | black
    N   -> dog | cat | horse
    V   -> is | likes | hates
8. GRAMMAR
    S   -> NP VP
    NP  -> D NP1 | PN
    NP1 -> ADJ NP1 | N
    VP  -> V NP
    D   -> a | the
    PN  -> peter | paul | mary
    ADJ -> large | black
    N   -> dog | cat | horse
    V   -> is | likes | hates

    S => NP VP
      => PN VP
      => mary VP
      => mary V NP
      => mary hates NP
      => mary hates D NP1
      => mary hates the NP1
      => mary hates the N
      => mary hates the dog

    S => NP VP
      => NP V NP
      => NP V D NP1
      => NP V a NP1
      => NP V a ADJ NP1
      => NP is a ADJ NP1
      => NP is a ADJ ADJ NP1
      => NP is a large ADJ NP1
      => NP is a large ADJ N
      => NP is a large black N
      => NP is a large black cat
      => PN is a large black cat
      => peter is a large black cat
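The derivations above can be checked mechanically. The following is a minimal sketch of a top-down recognizer for this toy grammar, written in Python. The dictionary layout and the function name `derives` are illustrative choices, not part of the slides, and the non-terminal `NP1` follows the naming used in the derivation steps.

```python
# Toy grammar from the slide; keys are non-terminals, values are alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP1"], ["PN"]],
    "NP1": [["ADJ", "NP1"], ["N"]],
    "VP": [["V", "NP"]],
    "D": [["a"], ["the"]],
    "PN": [["peter"], ["paul"], ["mary"]],
    "ADJ": [["large"], ["black"]],
    "N": [["dog"], ["cat"], ["horse"]],
    "V": [["is"], ["likes"], ["hates"]],
}


def derives(symbols, words):
    """Return True if the symbol list can be rewritten into exactly `words`."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        # Terminal symbol: it must match the next word.
        return bool(words) and words[0] == head and derives(rest, words[1:])
    # Non-terminal: try each replacement rule in turn (backtracking).
    return any(derives(list(rhs) + rest, words) for rhs in GRAMMAR[head])


def in_language(sentence):
    return derives(["S"], sentence.split())


print(in_language("mary hates the dog"))
print(in_language("peter is a large black cat"))
print(in_language("dog the hates mary"))
```

The recognizer terminates because the only recursive rule, NP1 -> ADJ NP1, consumes a word before recursing.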
10. Example: eukaryotic gene structure
    Parse-tree labels: S, NP, D, N, V, A, P
    Example sentences: "the boy is upset over the girl"; "the advisor is pleased with the research"
    S  -> NP V A P NP
    NP -> D N | N
    Gene diagram labels: upstream, TATA box, gene, start codon, primary transcript, exon, intron, exon, stop codon
    ATGACTGACTGATCGATCGATCGATCGATGATCGTACGATCGATGCATCGATCGATCGATCGATCGA
14. Simple, regular grammars are compactly written as regular expressions [LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
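As an illustration of how such a pattern can be applied, here is a small Python sketch. It assumes that `.` stands for any single residue and that `.(9,20)` denotes a gap of 9 to 20 arbitrary residues. That reading is an assumption based on PROSITE-style notation, and the helper name `to_python_regex` and the test peptide are hypothetical.

```python
import re

SLIDE_PATTERN = "[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]"


def to_python_regex(pattern):
    """Rewrite '.(m,n)' gaps into Python's '.{m,n}' quantifier syntax."""
    return re.sub(r"\.\((\d+),(\d+)\)", r".{\1,\2}", pattern)


regex = re.compile(to_python_regex(SLIDE_PATTERN))

# Synthetic peptide built to satisfy the pattern (not a real protein).
peptide = "L" + "A" * 9 + "I" + "R" + "A" * 9 + "WS" + "A" + "WS" + "AAAA" + "F"

print(regex.pattern)
print(bool(regex.fullmatch(peptide)))
```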
15. Motif discovery is the inverse problem: given the sentences, find the grammar CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC
16. Part 1: Rational design of antimicrobial peptides using linguistic methods
26. What grammar describes these sequences?
    seq1: QSEAGWLKKLGK
    seq2: QSEAGWLRKAAK
    seq3: QTEAGGLKKFGK
    G = (V, Σ, R, S), where
        V = non-terminal symbols
        Σ = amino acids
        R = set of replacement rules
        S = starting amino acid
    Subject to maximal |R| and maximal occurrences of G
    cecropin motif: Q.EAG.L.K..K
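For these three equal-length sequences, the cecropin motif can be reproduced by a column-wise comparison. The sketch below keeps a residue where all sequences agree and writes `.` elsewhere. It only illustrates what the motif string means and is not the grammar-inference procedure used in the thesis; the function name `consensus_motif` is hypothetical.

```python
def consensus_motif(sequences):
    """Keep a residue where every sequence agrees; use '.' as a wildcard otherwise."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be pre-aligned and of equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*sequences)
    )


seqs = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]
print(consensus_motif(seqs))  # Q.EAG.L.K..K
```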
34. Our results show significant enrichment for activity in the designed set
             Expected activity: Y          Expected activity: N
    Test     42 motif-based: 18 / 42       42 shuffled: 2 / 42
    Control  9 natural AmPs: 6 / 9         9 non-AmPs: 0 / 9
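One way to put a number on "significant enrichment" for these counts is a one-sided Fisher exact test. The slide does not say which statistical test was used, so the choice of test here is an assumption, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the hypergeometric null."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(total, row1)


# Test set: 18 of 42 motif-based peptides active vs 2 of 42 shuffled peptides.
p_test = fisher_one_sided(18, 42 - 18, 2, 42 - 2)
# Control set: 6 of 9 natural AmPs active vs 0 of 9 non-AmPs.
p_control = fisher_one_sided(6, 9 - 6, 0, 9 - 0)

print(f"test set p = {p_test:.2e}")
print(f"control set p = {p_control:.2e}")
```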
37. Part 2: A generic motif discovery algorithm for diverse biomolecular data
39. Stock prices, protein structures MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA A motif is just a collection of mutually similar regions in the data stream
57. Clustering method = clique finding
    Can Gemoda find this known motif?
    How sensitive is Gemoda to “noise”?
58. (ppGpp)ase example: the comparison phase shows many regions of local similarity
    Dots indicate 50aa windows that are pairwise similar.
    Streaks indicate regions that will probably be convolved into a maximal motif.
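A comparison phase of this kind can be sketched as an all-against-all scan of fixed-length windows. The version below is a simplified illustration and not Gemoda's implementation: it scores windows by the number of identical positions (a stand-in for whatever similarity function is actually used), compares only windows from different sequences, and uses short toy peptides with 5-residue windows instead of 50aa windows. All names are hypothetical.

```python
def similar_window_pairs(sequences, window, min_identity):
    """List pairs of windows from different sequences that share enough identical positions.

    Each window is identified as (sequence index, start offset).
    """
    windows = [
        (i, j, seq[j:j + window])
        for i, seq in enumerate(sequences)
        for j in range(len(seq) - window + 1)
    ]
    pairs = []
    for a in range(len(windows)):
        for b in range(a + 1, len(windows)):
            i, j, wa = windows[a]
            k, l, wb = windows[b]
            if i == k:
                continue  # simplification: skip windows within the same sequence
            identity = sum(x == y for x, y in zip(wa, wb))
            if identity >= min_identity:
                pairs.append(((i, j), (k, l)))
    return pairs


toy = ["QSEAGWLKKLGK", "QSEAGWLRKAAK"]
pairs = similar_window_pairs(toy, window=5, min_identity=5)
print(pairs)
```

In this toy run the matching windows sit at consecutive offsets in both sequences, which is the kind of diagonal streak the slide describes.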
59. (ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences
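Clustering by clique finding can be illustrated with the classic Bron-Kerbosch algorithm applied to a similarity graph whose nodes are windows and whose edges join pairwise-similar windows. This is a minimal sketch of maximal-clique enumeration in general; the slides do not state which clique algorithm Gemoda uses, and the toy graph below is invented for illustration.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    `adjacency` maps each node to the set of its neighbours.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical similarity graph: nodes are (sequence index, window offset).
graph = {
    (0, 0): {(1, 0), (2, 0)},
    (1, 0): {(0, 0), (2, 0)},
    (2, 0): {(0, 0), (1, 0), (3, 7)},
    (3, 7): {(2, 0)},
}
found = set(maximal_cliques(graph))
print(found)
```

Here the three mutually similar windows form one clique, and the extra edge to window (3, 7) forms a second, smaller one.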
60. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database
    Maximal motif (one of three, ~100 aa in length)
    This particular cluster represents the first set of 8 50aa windows in the above motif.
    Results are insensitive to “noise”.
61. The LD-motif problem models the subtle binding site discovery problem
    GACTCGATAGCGACG
    Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
    Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA...
    Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
    Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
    ...
    Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
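To make the problem concrete, the sketch below scans a sequence for windows that lie within a given number of substitutions of a known consensus. This is only a verification step for a planted motif, not a discovery algorithm and not Gemoda itself, since in the real problem the consensus is unknown. The example string is the start of Sequence #1 with the extraction spaces removed, and the mutation budget of 4 is an assumption chosen for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, consensus, max_mutations):
    """Return (offset, window, distance) for windows within max_mutations of the consensus."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        distance = hamming(window, consensus)
        if distance <= max_mutations:
            hits.append((start, window, distance))
    return hits


consensus = "GACTCGATAGCGACG"
# Start of Sequence #1 from the slide, spaces removed.
sequence_1 = "ATGATGAGTCTATTGCGCCGCGATCAGCTAGC"

hits = find_instances(sequence_1, consensus, max_mutations=4)
print(hits)
```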
62. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    GG GACTCGATAGCGACG CCG
    Total motif length?
    Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
    Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
    Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
    Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
    ...
    Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
63. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    GACTCGATAGCGACG
    X All sequences?
    Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
    Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
    Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
    Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
    ...
    Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
64. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    GACTCGATAGCGACG
    Number of mutations?
    Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
    Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
    Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
    Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
    ...
    Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
65. Gemoda can solve both the LD-motif problem and a more generalized version of the same
    GACTCGATAGCGACG
    Number of unique motifs?
    Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
    Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
    Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
    Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTA TATCTGGTTCGACTT AGCTATCTATTCGAC GACTCG TGG GCG G CG ...
    ...
    Sequence #m: ATGCTAC TATCTTATTCGACTG AGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATGACTAGTGACT...
67. unit-RMSD
    \begin{bmatrix}
    x_1 & y_1 & z_1 \\
    x_2 & y_2 & z_2 \\
    x_3 & y_3 & z_3 \\
    \vdots & \vdots & \vdots \\
    x_M & y_M & z_M
    \end{bmatrix}
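The slide does not define "unit-RMSD", so the following is only a sketch of the plain root-mean-square deviation between two windows of M three-dimensional points, each stored as rows (x, y, z). It assumes the two windows are already superimposed and applies no normalization; any scaling implied by "unit" is not reproduced here. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(window_a, window_b):
    """Plain RMSD between two equal-length lists of (x, y, z) points, without superposition."""
    if len(window_a) != len(window_b) or not window_a:
        raise ValueError("windows must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(window_a, window_b)
    )
    return math.sqrt(squared / len(window_a))


window = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in window]

print(rmsd(window, window))
print(rmsd(window, shifted))
```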