Dynamic Programming Algorithm for the Prediction of Gene Structure
Marilyn B. Arceo and Craig Reinhart, Ph.D
Department of Computer Science, California Lutheran University, Thousand Oaks, CA 91360
Predicting the structure of a gene from its native DNA sequence, a task
also known as gene parsing, has proven difficult. Gene structures consist
of specific sets of exon (coding) segments that alternate with intron
(noncoding) segments, so predicting a gene structure requires finding the
best locations of both the exon and intron segments. In this study,
dynamic programming is used to predict gene structures from genomic data
such as DNA: dynamic programming algorithms analyze fragments of gene
structures to find the optimal gene structure under a scoring function.
Segment-based dynamic programming has proven useful for gene structure
prediction, since segments of DNA can be analyzed and scored in-frame,
saving time. In this study, we also use Hidden Markov Models (HMMs) to
predict the missing segments of each given DNA sequence.
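The segment-based idea can be illustrated with a short sketch. This is not the study's actual implementation: the candidate segments, their scores, and the `best_parse` helper below are simplified assumptions, and the scoring function itself is left abstract.

```python
# Illustrative sketch only (not this study's implementation): choose the
# highest-scoring chain of candidate exon/intron segments that tiles the
# sequence and alternates exon <-> intron. Candidate segments and their
# scores are assumed to come from some external scoring function.

def best_parse(seq_len, candidates):
    # best[(pos, kind)] = (score, parse): the best parse of positions
    # [0, pos) whose last segment has the given kind; kind None marks
    # the empty parse at position 0.
    best = {(0, None): (0.0, [])}
    for end in range(1, seq_len + 1):
        for start, stop, kind, score in candidates:
            if stop != end:
                continue
            prev_kind = 'intron' if kind == 'exon' else 'exon'
            for pk in (prev_kind, None):       # None only matches start 0
                if (start, pk) in best:
                    prev_score, prev_parse = best[(start, pk)]
                    total = prev_score + score
                    key = (end, kind)
                    if key not in best or total > best[key][0]:
                        best[key] = (total, prev_parse + [(start, stop, kind)])
    finals = [best[k] for k in best if k[0] == seq_len]
    return max(finals) if finals else None

# Toy candidates: (start, end, kind, score); the scores are made up.
parse = best_parse(6, [(0, 3, 'exon', 2.0), (3, 6, 'intron', 1.0),
                       (0, 6, 'exon', 2.5)])
```

On this toy input the best parse is the exon (0, 3) followed by the intron (3, 6), with total score 3.0, beating the single long exon.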
INTRODUCTION
1. A set S of N states, S = {S1, S2, …, SN}.
2. A set V of M observation symbols (the output alphabet), V = {v1, v2, …, vM}.
3. A set A of state transition probabilities, A = {aij}, where aij is the
probability of moving from state i to state j:
aij = P(qt+1 = Sj | qt = Si), 1 ≤ i, j ≤ N
4. A set B of observation symbol probabilities at state j, B = {bj(k)}, where
bj(k) is the probability of emitting symbol vk at state j:
bj(k) = P(vk | qt = Sj), 1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution π = {πi}, where πi is the probability that
the start state is state i:
πi = P(q1 = Si), 1 ≤ i ≤ N
Given the definitions above, the compact notation for a model is λ = (A, B, π).
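As a concrete illustration of this notation, the likelihood P(O | λ) can be computed with the standard forward algorithm. The sketch below uses a made-up two-state, two-symbol model; the matrices and names are assumptions for illustration, not values from this study.

```python
# Forward algorithm: compute P(O | λ) for λ = (A, B, π) as defined above.
# The two-state, two-symbol model below is a made-up toy example.

def forward(O, A, B, pi):
    """O is a list of observation-symbol indices into the alphabet V."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]   # α_1(i) = π_i · b_i(O_1)
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]                  # induction step
    return sum(alpha)                                # sum over final states

A  = [[0.7, 0.3], [0.4, 0.6]]   # aij: state transition probabilities
B  = [[0.9, 0.1], [0.2, 0.8]]   # bj(k): emission probabilities
pi = [0.5, 0.5]                 # πi: initial state distribution

p = forward([0, 1, 0], A, B, pi)   # likelihood of observing symbols 0, 1, 0
```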
HIDDEN MARKOV MODELS
In the future, we would like to use a more complex HMM architecture that
would allow us to find more combinations of nucleotides. With a more
complex table of probabilities, gene prediction can become more accurate,
since there is a larger pool of observations that can be used to predict
the possible gaps in the sequence.
FUTURE WORK
RESULTS & DISCUSSION
HMM MODEL ARCHITECTURE
I want to give special thanks to Dr. Craig Reinhart, Dr. Christopher Brown,
and Dr. Dennis Revie for all their help, expertise and support in this capstone
project.
ACKNOWLEDGEMENTS
1. Lesk, Arthur M. "Alignments and Phylogenetic Trees." Introduction to
Bioinformatics. Oxford: Oxford UP, 2002. 261-80. Print.
2. Jones, Neil C., and Pavel Pevzner. An Introduction to Bioinformatics
Algorithms. Cambridge, MA: MIT Press, 2004. Print.
3. Rabiner, Lawrence R. "A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition." Proceedings of the IEEE,
Vol. 77, No. 2, February 1989, 257-286.
REFERENCES
Fig. 1. An example of a simple Hidden Markov Model, where X
represents the possible states, Y represents the possible observations,
A represents the state transition probabilities, and B represents the
output probabilities for each state.
Fig. 2. The Hidden Markov Model used in this study. The nucleotides
A, C, G, and T are connected to one another, with each connection
showing the probability of that occurrence. Each connection's
probability is set by the training data given to the algorithm.
Table 1. Sample probabilities of possible nucleotides (observations) that
could occur in a sequence. Depending on the sequence used to train the
HMM, the probabilities of the observations will change accordingly.
Presented at the 8th Annual Festival of Scholars, California Lutheran University, October 2010, Thousand Oaks, CA
A Hidden Markov Model (HMM) is a machine learning technique that uses
training data to infer important parameters that are often hidden within
the problem set. HMMs are computational structures able to describe the
patterns that define families of homologous sequences. They can predict
the probabilities of the possible patterns that could occur in a data
set, in this study's case possible amino acids and/or nucleotides. Using
a scoring system together with this computational model, we can predict
the correct gene locations for the corresponding missing genes, the
problem we face in this study.
       A        C        G        T
A   2.98%    2.98%   13.43%    1.49%
C   2.98%   14.92%    5.97%    5.97%
G  11.94%    5.97%   10.44%    4.47%
T   1.49%    7.46%    4.47%    2.98%
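A table of this kind could, for instance, be estimated by counting adjacent nucleotide pairs in a training sequence and normalizing the counts into percentages. The sketch below is illustrative only; the training sequence is made up and does not reproduce Table 1's values.

```python
# Illustrative sketch: estimate pairwise nucleotide percentages (as in
# Table 1) by counting adjacent pairs in a training sequence. The
# sequence here is made up and does not reproduce Table 1's values.
from collections import Counter

def transition_table(seq):
    pairs = Counter(zip(seq, seq[1:]))   # adjacent nucleotide pairs
    total = sum(pairs.values())
    return {pair: 100.0 * n / total for pair, n in pairs.items()}

table = transition_table("ACGTGGCA")   # 7 pairs, each observed once
```

Because the cells are normalized by the total number of pairs, the whole table sums to 100%, matching the joint layout of Table 1.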