Motif discovery in sequential data
Kyle Jensen
Thesis Offense
Department of Chemical Engineering, Massachusetts Institute of Technology
Thesis committee: Greg Stephanopoulos (ChE, MIT), William Green (ChE, MIT), Robert Berwick (EECS, MIT), Isidore Rigoutsos (IBM)
Sequencing throughput, like processor power, is growing exponentially
As a result, Genbank is overflowing
Anatomics  Biomics  Chromosomics Cytomics Enviromics  Epigenomics  Fluxomics  Glycomics Glycoproteomics Immunogen.  Immunomics  Immunoproteomics Integromics  Interactomics  Ionomics  Lipidomics Metabolomics  Metabonomics  Metagenomics  Metallomics Metalloproteomics Methylomics  Mitogenomics  Neuromics Neuropeptido.  Oncogenomics Peptidomics Phenomics Phospho-prot.  Phosphoproteomics Physiomics  Physionomics Post–genomics Postgenomics  Pre–genomics  Rnomics Secretomics  Subproteomics Surfaceomics Syndromics Transcriptomics And the “ome-ome” keeps growing
Together, these data form a rich network of information
CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC This data glut motivates the need for automated methods of discovery and analysis Here, I focus on motif discovery in sequential data using a linguistic metaphor
A grammar is a mathematical system for describing the structure of a language
S -> NP VP
NP 1 -> D NP | PN
NP 2 -> ADJ NP | N
VP -> V NP
D -> a | the
PN -> peter | paul | mary
ADJ -> large | black
N -> dog | cat | horse
V -> is | likes | hates
GRAMMAR
S -> NP VP
NP -> D NP1 | PN
NP1 -> ADJ NP1 | N
VP -> V NP
D -> a | the
PN -> peter | paul | mary
ADJ -> large | black
N -> dog | cat | horse
V -> is | likes | hates
S => NP VP => PN VP => mary VP => mary V NP => mary hates NP => mary hates D NP1 => mary hates the NP1 => mary hates the N => mary hates the dog
S => NP VP => NP V NP => NP V D NP1 => NP V a NP1 => NP V a ADJ NP1 => NP is a ADJ NP1 => NP is a ADJ ADJ NP1 => NP is a large ADJ NP1 => NP is a large ADJ N => NP is a large black N => NP is a large black cat => PN is a large black cat => peter is a large black cat
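The first derivation on this slide can be replayed mechanically. The sketch below is only an illustration, not code from the thesis: it assumes the "NP 1" in the derivation is a second non-terminal (written `NP1` here), and the function name `leftmost_derivation` and the index-based `choices` argument are hypothetical.

```python
# Toy grammar from the slide; NP1 is assumed to be the second noun-phrase non-terminal.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP1"], ["PN"]],
    "NP1": [["ADJ", "NP1"], ["N"]],
    "VP": [["V", "NP"]],
    "D": [["a"], ["the"]],
    "PN": [["peter"], ["paul"], ["mary"]],
    "ADJ": [["large"], ["black"]],
    "N": [["dog"], ["cat"], ["horse"]],
    "V": [["is"], ["likes"], ["hates"]],
}


def leftmost_derivation(start, choices):
    """Rewrite the leftmost non-terminal at each step.

    choices[k] is the index of the alternative used at step k.
    Returns every sentential form, from the start symbol to the sentence.
    """
    form = [start]
    steps = [" ".join(form)]
    picks = iter(choices)
    while True:
        # Position of the leftmost symbol that still has replacement rules.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            return steps  # only terminals remain
        alternative = GRAMMAR[form[idx]][next(picks)]
        form = form[:idx] + alternative + form[idx + 1:]
        steps.append(" ".join(form))


# S, NP->PN, PN->mary, VP->V NP, V->hates, NP->D NP1, D->the, NP1->N, N->dog
steps = leftmost_derivation("S", [0, 1, 2, 0, 2, 0, 1, 1, 0])
print(" => ".join(steps))
```

Motif discovery, introduced later in the talk, runs in the opposite direction: the sentences are given and the rules have to be found.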
Grammars can describe biological phenomena in the same manner as natural languages
Example: eukaryotic gene structure
Linguistic example: S -> NP V A P NP, NP -> D N | N
the boy is upset over the girl
the advisor is pleased with the research
[Parse tree labels: S, NP, V, A, P, NP, D, N]
Biological example: gene, upstream, TATA box, start codon, primary transcript, exon, intron, exon, stop codon
ATGACTGACTGATCGATCGATCGATCGATGATCGTACGATCGATGCATCGATCGATCGATCGATCGA
Grammars are suitable for describing any complex arrangement of sequential data
[Table labels: language, grammar, linguistic example, biological example, complexity]
 
 
Simple, regular grammars are compactly written as regular expressions [LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
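As a rough illustration of how such a pattern can be used, the sketch below converts it to a Python regular expression and scans two made-up sequences. It assumes that `.(9,20)` means "any residue, repeated 9 to 20 times" (the slide does not define the notation); the helper `pattern_to_regex` and the test sequences `hit` and `miss` are hypothetical and not from the thesis.

```python
import re


def pattern_to_regex(pattern):
    """Convert a gap written as (m,n) into the regex quantifier {m,n}."""
    converted = re.sub(r"\((\d+),(\d+)\)", r"{\1,\2}", pattern)
    return re.compile(converted)


MOTIF = "[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]"
regex = pattern_to_regex(MOTIF)

# Synthetic sequences built only to exercise the pattern (not real proteins).
hit = "L" + "A" * 9 + "IR" + "A" + "G" * 9 + "WS" + "A" + "WS" + "AAAA" + "F"
miss = "L" + "A" * 9 + "IR" + "A" + "G" * 9 + "WA" + "A" + "WS" + "AAAA" + "F"

print(regex.pattern)
print(bool(regex.search(hit)), bool(regex.search(miss)))
```

Bracketed classes and single-character wildcards need no translation, because the slide's notation already agrees with regular-expression syntax for them.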
Motif discovery is the inverse problem: given the sentences, find the grammar CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC
Part 1: Rational design of antimicrobial peptides using linguistic methods
Antimicrobial peptides are small proteins that attack and kill bacteria
effective at µg/mL concentrations
activity against “MDR” pathogens
currently topical: acne, etc.
[Figure: AmPs (+) at the bacterial membrane (-)]
 
AmP sequences contain many repeated motifs, suggesting a linguistic model
similar to grammar of languages
[Figure labels: cecropins, cecropin motif]
The AmP sequences were modeled using simple regular grammars
Find all G for which a/b > w, and a+b > L
Subject to maximal |R| and maximal occurrences of G
G = (V, Σ, R, S) where
V = non-terminal symbols
Σ = amino acids
R = set of replacement rules
S = starting amino acid
What grammar describes these sequences?
seq1: QSEAGWLKKLGK
seq2: QSEAGWLRKAAK
seq3: QTEAGGLKKFGK
cecropin motif: Q.EAG.L.K..K
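For these three equal-length sequences, the cecropin motif can be recovered with a simple column-by-column comparison. This is a minimal sketch for the aligned toy example only; it is not how Teiresias works (Teiresias searches unaligned sequences), and the function name `column_consensus` is hypothetical.

```python
import re


def column_consensus(seqs):
    """Keep a residue where all sequences agree; write '.' elsewhere."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*seqs)
    )


seqs = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]
motif = column_consensus(seqs)
print(motif)

# The motif doubles as a regular expression that accepts every input sequence.
print(all(re.fullmatch(motif, s) for s in seqs))
```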
Our goal was to use this linguistic model to design novel AmPs
N = 50, number of atoms in Earth
N = 100, number of electrons in universe
[Figure labels: combinatorial libraries, sequence space, grammatical space, natural AmPs, “true” AmPs]
We used Teiresias to discover ~700 grammars defining the “language of AmPs”
[Figure labels: query, grammar 1, grammar 2]
12 million “grammatical” sequences
40 novel AmPs were chosen for experimental validation
[Assay figure labels: serial dilutions, replicates]
Group    Peptides          Expect activity?
Test     42 motif-based    Y
Test     42 shuffled       N
Control  9 natural AmPs    Y
Control  9 non-AmPs        N
Our results show significant enrichment for activity in the designed set
Group    Peptides          Expected activity?   Active
Test     42 motif-based    Y                    18 / 42
Test     42 shuffled       N                    2 / 42
Control  9 natural AmPs    Y                    6 / 9
Control  9 non-AmPs        N                    0 / 9
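One way to check the enrichment claim from the reported counts is a one-sided Fisher exact test. The slide does not say which statistical test was used, so this choice is an assumption, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are peptide sets, columns are active / inactive. Returns the
    probability of seeing at least `a` actives in the first row when the
    row and column totals are held fixed (hypergeometric tail).
    """
    row1 = a + b
    actives = a + c
    total = a + b + c + d
    upper = min(row1, actives)
    tail = sum(
        comb(row1, k) * comb(total - row1, actives - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, actives)


# Motif-based: 18 active, 24 inactive; shuffled: 2 active, 40 inactive.
p_test = fisher_one_sided(18, 24, 2, 40)
# Natural AmPs: 6 active, 3 inactive; non-AmPs: 0 active, 9 inactive.
p_control = fisher_one_sided(6, 3, 0, 9)
print(p_test, p_control)
```

A small p-value means that a split as lopsided as 18 versus 2 would be unlikely if motif-based and shuffled peptides were equally likely to be active.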
Optimized leads showed strong activity against anthrax and staph
 
Part 2: A generic motif discovery algorithm for diverse biomolecular data
Motif discovery is the automated search for similar regions in streams of data
Stock prices, protein structures MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA A motif is just a collection of mutually similar regions in the data stream
There are two classes of motif discovery tools commonly used for sequence analysis
Pratt
MEME
Consensus
TGCTGTATATACTCACAGCA
AACTGTATATACACCCAGGG
TACTGTATGAGCATACAGTA
ACCTGAATGAATATACAGTA
TACTGTACATCCATACAGTA
TACTGTATATTCATTCAGGT
AACTGTTTTTTTATCCAGTA
ATCTGTATATATACCCAGCT
TACTGTATATAAAAACAGTA
CT[AT].[GT]....A..CAG
“Gemoda” was designed to be exhaustive and have descriptive power
[Background: the example protein sequence shown earlier]
F(w1, w2) = square error
F(w1, w2) = aa scoring matrix
Gemoda proceeds in three steps: comparison, clustering, and convolution
The comparison stage is used to map the pairwise similarities between all windows in the data streams
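The comparison stage can be pictured as scoring every pair of fixed-length windows with a similarity function F(w1, w2) and keeping the pairs that pass a threshold. The sketch below is an illustration under stated assumptions, not Gemoda's actual implementation: the function names, the `threshold` parameter, and the identity-count score (a stand-in for an amino-acid scoring matrix) are all hypothetical.

```python
from itertools import combinations


def windows(stream, length):
    """All contiguous windows of the given length, in order of position."""
    return [stream[i:i + length] for i in range(len(stream) - length + 1)]


def identity_score(w1, w2):
    """Number of matching positions; a crude stand-in for a scoring matrix."""
    return sum(a == b for a, b in zip(w1, w2))


def neg_squared_error(w1, w2):
    """Negative squared error, so that larger values mean more similar."""
    return -sum((a - b) ** 2 for a, b in zip(w1, w2))


def compare(streams, length, score, threshold):
    """Return pairs of window ids ((stream, offset), (stream, offset))
    whose similarity score is at least the threshold."""
    labelled = [
        ((s, i), w)
        for s, stream in enumerate(streams)
        for i, w in enumerate(windows(stream, length))
    ]
    similar = set()
    for (id1, w1), (id2, w2) in combinations(labelled, 2):
        if score(w1, w2) >= threshold:
            similar.add((id1, id2))
    return similar


# Sequence data: windows of 4 residues that agree in at least 3 positions.
peptide_pairs = compare(["QSEAGWLKKLGK", "QTEAGGLKKFGK"], 4, identity_score, 3)

# Real-valued data: windows of 3 points with a small squared error.
series_pairs = compare(
    [[1.0, 2.0, 3.0, 9.0], [1.0, 2.0, 3.1, 0.0]], 3, neg_squared_error, -0.05
)
print(sorted(peptide_pairs))
print(sorted(series_pairs))
```

Because the similarity function is a parameter, the same loop serves both sequences and numeric streams, which is the point of the two F(w1, w2) examples on the preceding slide. Comparing all pairs is quadratic in the number of windows, so this form is only practical for small inputs.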

Kyle Jensen MIT Ph.D. Thesis Defense

  • 1. Motif discovery in sequential data Kyle Jensen Thesis Offense Department of Chemical Engineering Massachusetts Institute of Technology Thesis committee: Greg Stephanopoulos William Green Robert Berwick Isidore Rigoutsos ChE, MIT ChE, MIT EECS, MIT IBM
  • 2. Sequencing throughput, like processor power, is growing exponentially
  • 3. As a result, Genbank is overflowing
  • 4. Anatomics Biomics Chromosomics Cytomics Enviromics Epigenomics Fluxomics Glycomics Glycoproteomics Immunogen. Immunomics Immunoproteomics Integromics Interactomics Ionomics Lipidomics Metabolomics Metabonomics Metagenomics Metallomics Metalloproteomics Methylomics Mitogenomics Neuromics Neuropeptido. Oncogenomics Peptidomics Phenomics Phospho-prot. Phosphoproteomics Physiomics Physionomics Post–genomics Postgenomics Pre–genomics Rnomics Secretomics Subproteomics Surfaceomics Syndromics Transcriptomics And the “ome-ome” keeps growing
  • 5. Together, these data form a rich network of information
  • 6. CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC This data glut motivates the need for automated methods of discovery and analysis Here, I focus on motif discovery in sequential data using a linguistic metaphor
  • 7. A grammar is a mathematical system for describing the structure of a language S -> NP VP NP 1 -> D NP | PN NP 2 -> ADJ NP | N VP -> V NP D -> a | the PN -> peter | paul | mary ADJ -> large | black N -> dog | cat | horse V -> is | likes | hates
  • 8. GRAMMAR
      S -> NP VP
      NP -> D NP1 | PN
      NP1 -> ADJ NP1 | N
      VP -> V NP
      D -> a | the
      PN -> peter | paul | mary
      ADJ -> large | black
      N -> dog | cat | horse
      V -> is | likes | hates
      Derivation 1: S => NP VP => PN VP => mary VP => mary V NP => mary hates NP => mary hates D NP1 => mary hates the NP1 => mary hates the N => mary hates the dog
      Derivation 2: S => NP VP => NP V NP => NP V D NP1 => NP V a NP1 => NP V a ADJ NP1 => NP is a ADJ NP1 => NP is a ADJ ADJ NP1 => NP is a large ADJ NP1 => NP is a large ADJ N => NP is a large black N => NP is a large black cat => PN is a large black cat => peter is a large black cat
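The toy grammar on this slide can be checked mechanically. The following is a minimal sketch, not something shown in the talk: it encodes the rules as a Python dictionary and tests whether a word sequence can be derived from S. It assumes the second noun-phrase symbol is `NP1`, as the derivations suggest; the names `GRAMMAR`, `ends`, and `is_sentence` are illustrative.

```python
# Production rules of the toy grammar; keys are non-terminals.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP1"], ["PN"]],
    "NP1": [["ADJ", "NP1"], ["N"]],
    "VP": [["V", "NP"]],
    "D": [["a"], ["the"]],
    "PN": [["peter"], ["paul"], ["mary"]],
    "ADJ": [["large"], ["black"]],
    "N": [["dog"], ["cat"], ["horse"]],
    "V": [["is"], ["likes"], ["hates"]],
}


def ends(symbol, tokens, start):
    """Yield every index at which `symbol` can end when it begins at `start`."""
    if symbol not in GRAMMAR:  # terminal: must match the next token exactly
        if start < len(tokens) and tokens[start] == symbol:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:  # thread end positions through each part
            positions = {e for p in positions for e in ends(part, tokens, p)}
        yield from positions


def is_sentence(text):
    """True if the whole word sequence can be derived from S."""
    tokens = text.split()
    return len(tokens) in set(ends("S", tokens, 0))


print(is_sentence("mary hates the dog"))
print(is_sentence("peter is a large black cat"))
print(is_sentence("dog the hates mary"))
```

The recursion terminates because every recursive rule in this grammar consumes at least one word before recursing; a left-recursive grammar would need a different parsing strategy.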
  • 9.
  • 10. Example: eukaryotic gene structure
      Parse tree labels: S, D, N, NP, V, A, P, NP, D, N
      the boy is upset over the girl
      the advisor is pleased with the research
      S -> NP V A P NP
      NP -> D N | N
      Gene structure labels: gene, start codon, upstream, primary transcript, TATA box, exon, intron, exon, stop codon
      ATGACTGACTGATCGATCGATCGATCGATGATCGTACGATCGATGCATCGATCGATCGATCGATCGA
  • 11.
  • 12.  
  • 13.  
  • 14. Simple, regular grammars are compactly written as regular expressions:
      [LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
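To make the pattern on this slide concrete, here is a small sketch that translates it into Python's `re` syntax and runs it on made-up strings. It assumes that `.(9,20)` means "9 to 20 arbitrary residues" and that each run of dots stands for that many arbitrary residues; the test strings are invented for illustration and are not real protein sequences.

```python
import re

# Slide pattern:  [LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
# Assumed translation: nine dots -> .{9}, .(9,20) -> .{9,20}, four dots -> .{4}
PATTERN = re.compile(r"[LIVF].{9}[LIV][RK].{9,20}WS.WS.{4}[FYW]")

# Hypothetical sequence with a 10-residue variable gap (inside the 9-20 range).
hit = "L" + "A" * 9 + "IR" + "A" * 10 + "WSAWS" + "AAAA" + "F"
# Same layout but with a 5-residue gap, which is too short to match.
miss = "L" + "A" * 9 + "IR" + "A" * 5 + "WSAWS" + "AAAA" + "F"

print(bool(PATTERN.search(hit)))
print(bool(PATTERN.search(miss)))
```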
  • 15. Motif discovery is the inverse problem: given the sentences, find the grammar. [Slide background: the same block of raw DNA sequence]
  • 16. Part 1: Rational design of antimicrobial peptides using linguistic methods
  • 17.
  • 18.
  • 20.
  • 21.  
  • 22.
  • 23.
  • 24.
  • 25. Find all G for which a/b > w and a + b > L
  • 26. What grammar describes these sequences?
      seq1: QSEAGWLKKLGK
      seq2: QSEAGWLRKAAK
      seq3: QTEAGGLKKFGK
      cecropin motif: Q.EAG.L.K..K
      G = (V, Σ, R, S), where
        V = non-terminal symbols
        Σ = amino acids
        R = set of replacement rules
        S = starting amino acid
      Subject to maximal |R| and maximal occurrences of G
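As a quick illustration of how the cecropin motif relates to the three example sequences, the sketch below builds a column-wise consensus: a residue is kept where all sequences agree and replaced by `.` elsewhere. This is only a toy for pre-aligned, equal-length sequences, not the grammar-discovery procedure used in the thesis; `column_consensus` is a hypothetical helper name.

```python
import re


def column_consensus(seqs):
    """Keep a residue where every sequence agrees; write '.' elsewhere."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be pre-aligned and of equal length")
    return "".join(col[0] if len(set(col)) == 1 else "." for col in zip(*seqs))


seqs = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]
motif = column_consensus(seqs)
print(motif)  # Q.EAG.L.K..K

# Read as a regular expression, the motif matches each input sequence.
print(all(re.fullmatch(motif, s) for s in seqs))
```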
  • 27.
  • 28. N = 50, number of atoms in Earth
  • 29.
  • 30. Figure labels: combinatorial libraries, sequence space, grammatical space, natural AmPs, “true” AmPs
  • 31.
  • 33.
  • 34. Our results show significant enrichment for activity in the designed set
      Expected activity? Y / N
      Group    | Peptides        | Active
      Test     | 42 motif-based  | 18 / 42
      Test     | 42 shuffled     | 2 / 42
      Control  | 9 natural AmPs  | 6 / 9
      Control  | 9 non-AmPs      | 0 / 9
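One way to put a number on "significant enrichment" is a one-sided Fisher's exact test on the counts from this slide. The slides do not say which statistical test was actually used, so the choice of test here is an assumption, and `fisher_one_sided` is an illustrative helper written with the standard library only.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(at least `a` actives land in group 1) under the hypergeometric null.

    Group 1 has a active and b inactive peptides;
    group 2 has c active and d inactive peptides.
    """
    n1, n2, k = a + b, c + d, a + c  # group sizes and total number of actives
    total = comb(n1 + n2, k)
    tail = sum(comb(n1, x) * comb(n2, k - x) for x in range(a, min(n1, k) + 1))
    return tail / total


# Test sets: 18 of 42 motif-based peptides active versus 2 of 42 shuffled.
p_test = fisher_one_sided(18, 24, 2, 40)
# Controls: 6 of 9 natural AmPs active versus 0 of 9 non-AmPs.
p_control = fisher_one_sided(6, 3, 0, 9)
print(p_test, p_control)
```

For the control row the tail has a single term, so the value is exactly 84 / 18564, roughly 0.0045; the motif-based versus shuffled comparison comes out far smaller.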
  • 35. Optimized leads showed strong activity against anthrax and staph
  • 36.  
  • 37. Part 2: A generic motif discovery algorithm for diverse biomolecular data
  • 38.
  • 39. A motif is just a collection of mutually similar regions in the data stream
      Stock prices, protein structures
      [Slide background: a block of raw protein sequence]
  • 40.
  • 41.
  • 42. MEME
  • 43. Consensus
      TGCTGTATATACTCACAGCA
      AACTGTATATACACCCAGGG
      TACTGTATGAGCATACAGTA
      ACCTGAATGAATATACAGTA
      TACTGTACATCCATACAGTA
      TACTGTATATTCATTCAGGT
      AACTGTTTTTTTATCCAGTA
      ATCTGTATATATACCCAGCT
      TACTGTATATAAAAACAGTA
      CT[AT].[GT]....A..CAG
  • 44.
  • 45. Gemoda proceeds in three steps: comparison, clustering, and convolution
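The three steps can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the Gemoda implementation: the comparison function is plain Hamming distance, clustering uses a basic Bron-Kerbosch clique search, and convolution simply extends a motif by one position whenever the next window of enough of its instances falls in a common clique. All function names and the example sequences are hypothetical.

```python
from itertools import combinations


def hamming(w1, w2):
    """Toy comparison function F(w1, w2): number of mismatched positions."""
    return sum(a != b for a, b in zip(w1, w2))


def support(starts):
    """Number of distinct sequences contributing a window."""
    return len({s for s, _ in starts})


def comparison(seqs, L, max_dist):
    """Step 1: link similar length-L windows taken from different sequences."""
    windows = [(s, i) for s, seq in enumerate(seqs) for i in range(len(seq) - L + 1)]
    adj = {w: set() for w in windows}
    for a, b in combinations(windows, 2):
        (s1, i1), (s2, i2) = a, b
        if s1 != s2 and hamming(seqs[s1][i1:i1 + L], seqs[s2][i2:i2 + L]) <= max_dist:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering(adj, min_support):
    """Step 2: maximal cliques of windows that span enough sequences."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    return [c for c in expand(set(), set(adj), set()) if support(c) >= min_support]


def convolution(cliques, L, min_support):
    """Step 3: stitch overlapping elementary motifs into longer motifs."""
    finished = set()
    stack = [(c, L) for c in cliques]
    while stack:
        starts, length = stack.pop()
        grown = False
        for c in cliques:
            # Keep instances whose next length-L window (one step right) is in c.
            kept = frozenset((s, i) for s, i in starts if (s, i + length - L + 1) in c)
            if support(kept) >= min_support:
                stack.append((kept, length + 1))
                grown = True
        if not grown:
            finished.add((starts, length))

    def inside(small, big):
        (st, ln), (bst, bln) = small, big
        return all(any(s == t and j <= i and i + ln <= j + bln for t, j in bst)
                   for s, i in st)

    # Report only motifs that are not contained in another reported motif.
    return [m for m in finished if not any(m != o and inside(m, o) for o in finished)]


def find_motifs(seqs, L, max_dist, min_support):
    cliques = clustering(comparison(seqs, L, max_dist), min_support)
    motifs = convolution(cliques, L, min_support)
    return [sorted(seqs[s][i:i + n] for s, i in starts) for starts, n in motifs]


# Three made-up sequences sharing the planted word ACGTTGCA.
seqs = ["TTTTACGTTGCACCCC", "GGAGACGTTGCATATA", "CACAACGTTGCAGAGG"]
print(find_motifs(seqs, L=4, max_dist=0, min_support=3))
```

Swapping `hamming` for another pairwise score, and the threshold test accordingly, is the only change needed to adapt the comparison step to a different data type, which is the sense in which the comparison function is context-specific.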
  • 46.
  • 47. Comparison function is context-specific: F(w_1, w_2)
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53. Ave. length ~700 amino acids
  • 54.
  • 55. Minimum Blosum62 bit score = 50 bits
  • 56. Minimum support = 100% (8/8 sequences)
  • 57. Clustering method = clique finding. Can Gemoda find this known motif? How sensitive is Gemoda to “noise”?
  • 58. (ppGpp)ase example: the comparison phase shows many regions of local similarity. Dots indicate 50 aa windows that are pairwise similar. Streaks indicate regions that will probably be convolved into a maximal motif.
  • 59. (ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences
  • 60. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database. Maximal motif (one of three, ~100 aa in length). This particular cluster represents the first set of eight 50 aa windows in the above motif. Results are insensitive to “noise”.
  • 61. The LD-motif problem models the subtle binding site discovery problem
      Motif: GACTCGATAGCGACG
      Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
      Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA...
      Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
      Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
      ...
      Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
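A benchmark of this kind can be generated in a few lines. The sketch below plants one mutated copy of the slide's 15-letter motif at a random position in each of m random background sequences. The slide does not give the parameter values, so d = 4 substitutions per copy, m = 5 sequences of length 100, and the rule of exactly d substitutions are all assumptions; `mutate` and `plant` are illustrative names.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(motif, d, rng, alphabet="ACGT"):
    """Return a copy of `motif` with exactly d positions substituted."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def plant(motif, d, m, n, seed=0):
    """Build m random length-n sequences, each hiding one mutated motif copy."""
    rng = random.Random(seed)
    seqs, sites = [], []
    for _ in range(m):
        background = "".join(rng.choice("ACGT") for _ in range(n))
        site = rng.randrange(n - len(motif) + 1)
        instance = mutate(motif, d, rng)
        seqs.append(background[:site] + instance + background[site + len(motif):])
        sites.append(site)
    return seqs, sites


MOTIF = "GACTCGATAGCGACG"
seqs, sites = plant(MOTIF, d=4, m=5, n=100)
for seq, site in zip(seqs, sites):
    print(site, seq[site:site + len(MOTIF)])
```

Because each copy carries its own substitutions, two planted copies can differ from each other at up to 2d positions, which is part of what makes the binding site "subtle".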
  • 62. Gemoda can solve both the LD-motif problem and a more generalized version of the same. Motif: GG GACTCGATAGCGACG CCG. Callout: Total motif length? [Example sequences #1 through #m shown on the slide]
  • 63. Gemoda can solve both the LD-motif problem and a more generalized version of the same. Motif: GACTCGATAGCGACG. Callout (marked with an X): All sequences? [Example sequences #1 through #m shown on the slide]
  • 64. Gemoda can solve both the LD-motif problem and a more generalized version of the same. Motif: GACTCGATAGCGACG. Callout: Number of mutations? [Example sequences #1 through #m shown on the slide]
  • 65. Gemoda can solve both the LD-motif problem and a more generalized version of the same. Motif: GACTCGATAGCGACG. Callout: Number of unique motifs? [Example sequences #1 through #m shown on the slide]
  • 66.
  • 67. unit-RMSD
      x_1 y_1 z_1
      x_2 y_2 z_2
      x_3 y_3 z_3
      ...
      x_M y_M z_M
  • 68. Protein structure example: human FIT vs. uridylyltransferase
  • 69. fin