Dynamic Programming for Gene Structure Prediction for DNA
Marilyn B. Arceo and Craig Reinhart, Ph.D
Department of Computer Science, California Lutheran University, Thousand Oaks, CA 91360
INTRODUCTION
Predicting gene structure from DNA sequence data has proven difficult.
Predicting the structure of a gene from its native DNA sequence, also known
as gene parsing, remains problematic. Gene structures consist of specific
sets of exon (coding) segments that alternate with intron (noncoding)
segments. Predicting a gene structure therefore requires finding the best
locations of both the exon and the intron segments, which is challenging. In
this study, dynamic programming is used to predict gene structures from
genomic data such as DNA. Dynamic programming algorithms analyze fragments
of gene structures to find the optimal gene structures under the scoring
functions. Segment-based dynamic programming has proved useful for gene
structure prediction because segments of DNA are analyzed and can be scored
in-frame, which saves time. In this study, we use Hidden Markov Models
(HMMs) to predict the missing segments of each given DNA sequence.
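As an illustration of segment-based dynamic programming, the sketch below solves a textbook-style exon chaining problem: given candidate exon segments with scores, it picks the non-overlapping set with the highest total score. This is not the scoring scheme used in this study. The function name best_exon_chain, the (start, end, score) tuples, and the example scores are all hypothetical.

```python
from bisect import bisect_left

def best_exon_chain(candidates):
    """Return (best_score, chain) over non-overlapping candidate exons.

    candidates: list of (start, end, score) tuples with inclusive
    coordinates. Scores are hypothetical stand-ins for a scoring function.
    """
    # Sort by end coordinate so each subproblem is a prefix of the list.
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] = best total score using only the first i exons.
    best = [0.0] * (len(exons) + 1)
    # choice[i] = prefix length to jump back to if exon i is taken.
    choice = [None] * (len(exons) + 1)
    for i, (start, end, score) in enumerate(exons, start=1):
        # p = number of earlier exons that end strictly before this start.
        p = bisect_left(ends, start, 0, i - 1)
        take = best[p] + score
        if take > best[i - 1]:
            best[i] = take
            choice[i] = p
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen exons.
    chain = []
    i = len(exons)
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            chain.append(exons[i - 1])
            i = choice[i]
    chain.reverse()
    return best[-1], chain

# Hypothetical candidate exons: (start, end, score).
example = [(1, 5, 5.0), (2, 3, 3.0), (4, 8, 6.0), (9, 10, 1.0), (6, 12, 10.0)]
score, chain = best_exon_chain(example)
```

Each candidate is examined once after sorting, so the table is filled in O(n log n) time rather than by enumerating every subset of segments.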
HIDDEN MARKOV MODELS
1. Set S of N states, $S = \{S_1, S_2, \ldots, S_N\}$.
2. Set V of M observation symbols, the output alphabet, $V = \{v_1, v_2, \ldots, v_M\}$.
3. Set A of state transition probabilities, $A = \{a_{ij}\}$, where $a_{ij}$ is the
probability of moving from state i to state j.
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$
4. Set B of observation symbol probabilities at state j, $B = \{b_j(k)\}$, where
$b_j(k)$ is the probability of emitting symbol $v_k$ at state j.
$b_j(k) = P(v_k \mid q_t = S_j), \quad 1 \le j \le N, \ 1 \le k \le M$
5. Set $\pi$, the initial state distribution, $\pi = \{\pi_i\}$, where $\pi_i$ is the
probability that the start state is state i.
$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N$
Given the definitions above, the notation of a model is $\lambda = (A, B, \pi)$.
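To make the five elements above concrete, the sketch below stores a model (A, B, and the initial distribution) in plain dictionaries and decodes the most probable state path with the Viterbi algorithm, the standard dynamic programming method for HMMs. The poster does not state which decoding algorithm was used, so Viterbi is an assumption here. The two states ("exon", "intron") and every probability are hypothetical values chosen only for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs under the model (A, B, pi).

    Works in log space to avoid underflow; assumes every probability
    used is strictly positive and obs is non-empty.
    """
    # delta[t][s] = log-probability of the best path ending in s at time t.
    delta = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            # Best predecessor r for state s at time t.
            prev, score = max(
                ((r, delta[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            delta[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical two-state model; none of these values come from the study.
STATES = ("exon", "intron")
PI = {"exon": 0.5, "intron": 0.5}
A = {"exon": {"exon": 0.9, "intron": 0.1},
     "intron": {"exon": 0.1, "intron": 0.9}}
B = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path = viterbi("GCGCGCGCATATATAT", STATES, PI, A, B)
```

With these made-up parameters the GC-rich prefix is labeled as exon and the AT-rich suffix as intron, because eight favorable emissions outweigh the cost of one state switch.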
FUTURE WORK
In the future, we would like to use a more complex HMM architecture that
would allow us to find more combinations of nucleotides. With a more complex
table of probabilities, gene prediction can become more accurate, since a
larger pool of observations is available for predicting the possible gaps in
the sequence.
RESULTS & DISCUSSION
HMM MODEL ARCHITECTURE
ACKNOWLEDGEMENTS
I want to give special thanks to Dr. Craig Reinhart, Dr. Christopher Brown,
and Dr. Dennis Revie for all their help, expertise, and support in this
capstone project.
HIDDEN MARKOV MODELS
Fig. 1. An example of a simple Hidden Markov Model, where X represents the
possible states, Y represents the possible variables (or parameters), A
represents the state transition probabilities, and B represents the output
probabilities for each state.
Fig. 2. The Hidden Markov Model used in this study. The nucleotides A, C, G,
and T are connected with one another, and each connection carries a
probability of occurrence that is set from the training data given to the
algorithm.
Table 1. Sample probabilities of the nucleotides (observations) that could
occur in a sequence. The probabilities change with the sequence chosen to
train the HMM.
Presented at the 8th Annual Festival of Scholars, California Lutheran University, October 2010, Thousand Oaks, CA
Hidden Markov Models (HMMs) are a machine learning technique that uses
training data to derive insights about parameters that are often hidden
within the problem set. An HMM is a computational structure that can
describe the patterns defining families of homologous sequences. It gives
the probabilities of the patterns that could occur in the data set, which in
this study are amino acids and/or nucleotides. Using a scoring system
together with this computational model, we are able to predict the correct
locations of the corresponding missing genes, which is the problem we face.
        A        C        G        T
A     2.98%    2.98%   13.43%    1.49%
C     2.98%   14.92%    5.97%    5.97%
G    11.94%    5.97%   10.44%    4.47%
T     1.49%    7.46%    4.47%    2.98%
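The sixteen entries of the table sum to roughly 100%, which suggests they are frequencies of adjacent nucleotide pairs counted over a training sequence. That reading is an assumption, as is the question of which axis holds the first nucleotide of each pair. Under it, a table of this kind could be estimated as sketched below. The function names pair_frequencies and transition_matrix and the training string are hypothetical. The second function row-normalises the counts, which is one common way to turn pair counts into the transition probabilities a_ij.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def pair_frequencies(seq):
    """Joint frequency of each adjacent pair (x, y) in seq.

    Assumes seq contains only A, C, G, T and has at least two symbols.
    The 16 returned values sum to 1.
    """
    pairs = Counter(zip(seq, seq[1:]))
    total = sum(pairs.values())
    if total == 0:
        raise ValueError("sequence must contain at least two nucleotides")
    return {x: {y: pairs[(x, y)] / total for y in NUCLEOTIDES}
            for x in NUCLEOTIDES}

def transition_matrix(seq):
    """Row-normalised pair counts: an estimate of P(next = y | current = x).

    Rows for nucleotides that never occur before another symbol are left
    at 0.0 rather than smoothed.
    """
    pairs = Counter(zip(seq, seq[1:]))
    matrix = {}
    for x in NUCLEOTIDES:
        row_total = sum(pairs[(x, y)] for y in NUCLEOTIDES)
        matrix[x] = {y: (pairs[(x, y)] / row_total if row_total else 0.0)
                     for y in NUCLEOTIDES}
    return matrix

# Hypothetical training sequence, not data from the study.
training = "ACGTACGA"
joint = pair_frequencies(training)
trans = transition_matrix(training)
```

Retraining on a different sequence changes both tables, which matches the caption's note that the probabilities depend on the training sequence.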
