Title of Research Proposal: Computational methods for gene prediction in Eukaryotes
Name of Researcher: Umar Rasheed
Session: 2019-2023
Registration No: 2019-bce-019
Proposed Supervisor: Prof. Abdul Rehman
Institute of Biochemistry and Biotechnology
University of Veterinary and Animal Sciences,
Lahore
Table of contents:
Introduction
Aims
Literature review
Methodology
References
Introduction:
Functionally, a eukaryotic gene can be defined as being composed of a transcribed region and of regions
that cis-regulate gene expression, such as the promoter region, which controls both the site and the
extent of transcription. The region between two transcribed regions is called intergenic. The promoter is
in the intergenic region, immediately upstream of the gene and not overlapping with it. Genes
sometimes overlap other genes, creating a single transcriptional unit, but these events are rare and
complex enough that, to my knowledge, no gene predictor has been able to model them explicitly.
For the automatic analysis and annotation of genomic sequences, computational gene prediction is
becoming increasingly important. Gene identification aims to predict the complete gene structure,
particularly the accurate exon-intron structure of a gene in a eukaryotic genomic DNA sequence. After
sequencing, finding the genes is one of the first and most significant steps in understanding the genome
of a species. Gene finding usually refers to the field of computational biology concerned with
algorithmically recognizing the stretches of sequence, generally genomic DNA, that are biologically
functional. This involves not only protein-coding genes but may also include additional functional
elements, for instance RNA genes and regulatory regions as described above.
In general, most currently existing programs use two types of content sensors: one for coding sequences
and one for non-coding sequences, i.e. introns, UTRs (untranslated regions) and intergenic
regions.
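To make the idea of a content sensor concrete, the following is a minimal sketch, not the sensor of any particular gene finder. It assumes uppercase A/C/G/T input, uses trinucleotides (k = 3) rather than the hexamers common in practice, and trains on two tiny made-up sequences; the function names and the toy training data are illustrative assumptions only.

```python
import math
from collections import Counter
from itertools import product


def kmer_freqs(seqs, k=3, pseudocount=1.0):
    """Relative k-mer frequencies, smoothed with a pseudocount over all 4**k k-mers."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def content_score(seq, coding, noncoding, k=3):
    """Sum of log-odds over all k-mers; a positive value looks more coding-like."""
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        score += math.log(coding[kmer] / noncoding[kmer])
    return score


# Toy training data (hypothetical, far too small for real use).
coding_model = kmer_freqs(["ATGGCCGCCGGCGCCATGGCC"])
noncoding_model = kmer_freqs(["ATATATTTTAAATATATTTAAA"])

gc_rich = content_score("GCCGGCGCC", coding_model, noncoding_model)
at_rich = content_score("TTTAAATAT", coding_model, noncoding_model)
```

With these toy models the GC-rich window scores above zero and the AT-rich window below zero. A real sensor would be trained on annotated coding and non-coding sequence and would usually be frame-aware.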
Aims:
The main aim of gene prediction is to correctly label each element of a DNA sequence
as belonging to a protein-coding region, an RNA-coding region, or a noncoding or intergenic region.
Intergenic regions are the regions of DNA between genes. Genes in eukaryotes consist of coding
segments interrupted by long noncoding segments. The coding segments are termed exons and
the noncoding segments introns. In humans, only about 3% of the DNA sequence is coding.
Literature Review:
Computational algorithms widely used in the literature for gene prediction include: Support
Vector Machines, e.g. Kim et al., which rely on SVMs to predict the targets of a
transcription factor by recognizing subtle relationships between their expression profiles; Hidden
Markov Models, e.g. Van Baren et al., which can model the statistical dependencies between adjacent
bases; software programs, e.g. Issac et al., which can be trained for a specific domain of organisms or
for similar genes to achieve greater accuracy; machine learning, e.g. Hoff et al., where context-free
grammars and other machine learning techniques are explored; Digital Signal Processing, e.g. Mabrouk
et al., where Discrete Fourier Transforms (DFT) and filter-based techniques are used; and
neural-network-based approaches, e.g. Mahony et al. The list is by no means exhaustive.
Methodology:
There are three major phases involved in the tool:
1. Database search via sequence alignment
The first phase is to go through the genome and annotate genes that are high-similarity matches to
already known eukaryotic genes. The entire list of currently known human genes has been compiled
and is referred to as RefSeq by the Otto gene prediction tool. I would use RefSeq if my tool is used only
for human gene prediction. Otherwise I would go with the available databases or, if possible, form a
database of my own based on the restricted set of organisms for my tool. For Otto, the cutoff for
annotating a gene when comparing to RefSeq is that the genomic sequence has to match at least 50% of
its length to the RefSeq entry, and the sequence identity must be greater than 92%. For my tool, the
threshold will be determined heuristically once the tool is run on benchmarks of known genes. For this
phase, the SeqAlignFPGA tool, which I am currently speeding up under my SURA, would be used. It uses a
prefix trie model and a backward search algorithm to match the query sequence against the reference
genome. The computational time is linear in the length of the query sequence. It is more accurate and
several to tens of times faster than BLAST.
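The Otto-style cutoff described above can be sketched as a simple filter over alignment hits. This is an illustration only: the `Hit` record and its field names are hypothetical and do not correspond to the output format of Otto, SeqAlignFPGA, or BLAST, and "coverage" is assumed here to mean the fraction of the reference entry covered by the alignment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene_id: str
    ref_length: int      # length of the reference (RefSeq-like) entry
    aligned_length: int  # reference positions covered by the alignment
    matches: int         # identical positions within the alignment


def passes_cutoff(hit, min_coverage=0.50, min_identity=0.92):
    """Keep a hit if coverage >= 50% and identity > 92% (assumed definitions)."""
    if hit.ref_length <= 0 or hit.aligned_length <= 0:
        return False
    coverage = hit.aligned_length / hit.ref_length
    identity = hit.matches / hit.aligned_length
    return coverage >= min_coverage and identity > min_identity


hits = [
    Hit("g1", ref_length=1000, aligned_length=600, matches=570),  # 60% cov, 95% id
    Hit("g2", ref_length=1000, aligned_length=400, matches=400),  # 40% cov
    Hit("g3", ref_length=1000, aligned_length=800, matches=720),  # 90% id
]
annotated = [h.gene_id for h in hits if passes_cutoff(h)]
```

Because both thresholds are parameters, the heuristic tuning on benchmark genes amounts to re-running the filter with different values of `min_coverage` and `min_identity`.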
2. Alternative splice site prediction
The second phase involves alternative splice site prediction. Here I plan to use a customized variation of
MaxEntScan. The tool is based on the maximum entropy principle. Although it is the most accurate and
sensitive tool to date, it lacks specificity compared to other good tools. Unlike the usual method of
training it on generalized data, I plan to build a model based on the results of the first phase.
This means that only families of DNA which have a significant match with the query sequence (above the
threshold) shall be used as the so-called "prior data". This would improve the low specificity of
the tool and further increase its sensitivity.
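The idea of training a splice site model only on sequences related to the query can be illustrated with a much simpler stand-in. The sketch below is not MaxEntScan and does not use a maximum entropy model; it is a plain position weight matrix that assumes independent positions, a uniform background, and aligned, equal-length, uppercase training sites. The four donor-site strings are made-up examples, not data from phase 1.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Per-position base probabilities from aligned, equal-length example sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm


def score_site(candidate, pwm, background=0.25):
    """Log2-odds of the candidate under the PWM versus a uniform background."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(candidate))


# Hypothetical donor sites (3 exonic + 6 intronic bases around the GT dinucleotide).
training_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = build_pwm(training_sites)

consensus_score = score_site("CAGGTAAGT", pwm)
disrupted_score = score_site("CAGCCAAGT", pwm)  # GT dinucleotide replaced
```

Restricting `training_sites` to families that matched the query in phase 1 is the point where the proposed customization would enter; the scoring model itself would be replaced by the maximum entropy model in the actual tool.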
3. Ab initio gene prediction based on a standard HMM
The second phase divides the gene into exons and introns. The third phase uses a standard
HMM for ab initio gene prediction. It is well known from the literature that using different content
sensors, and thus different models, for coding and non-coding regions is generally a good way of
proceeding with gene prediction. In the case of neural-network-based approaches, and specifically
standard HMMs, this is even more critical since high sequence similarity is needed. The training set
generated from the results of phase 1 will thus be further enhanced for the construction of two exclusive
standard HMMs. Moreover, the reason for my choice of a standard HMM instead of a generalized HMM is
that it is an integrated model. Most of the content sensors would be covered here.
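Decoding with a standard HMM is usually done with the Viterbi algorithm. The sketch below shows it on a toy model with one coding state (C) and one non-coding state (N). This is a simplification for illustration: it is a single two-state HMM rather than the two separate HMMs proposed above, and every probability in it is an invented placeholder, not a trained value.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for seq under a standard HMM (log-space)."""
    if not seq:
        return []
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Placeholder parameters: state C favours G/C, state N favours A/T.
states = ("C", "N")
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

labels = "".join(viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p))
```

Here `labels` is the per-base state string; with these placeholder parameters the GC-rich half is labelled C and the AT-rich half N. Working in log-space avoids numerical underflow on sequences of realistic length.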
Researchers in the field of machine learning have developed a Fast Two-Level HMM Decoding Algorithm
for Vocabulary Handwriting Recognition (FTLDA). This algorithm breaks up the computation of words
into two levels: the state (or word) level and the character (or letter) level. An analogous decoding
algorithm for a standard HMM for gene prediction is also possible by using patterns (hexamers,
dinucleotides) for the first level and the individual nucleotide bases for the second level. This would
speed up the decoding process by as much as 15 times for 240,000 nt, the size of a large human gene.
Finally, similar to Otto, the result will be validated by comparison with EST, protein, and genomic
sequence databases using SeqAlignFPGA (or BLAST, since it is widely used), and also using phylogenetic
programs like PAUP or PHYLIP (after running SeqAlignFPGA) to identify homology and paralogy. For
example, in the case of protein comparison, the sequence will be translated and matched against a
protein database. Similarly, the genomic sequence will be matched against ESTs. The reduction in
computational time in the above phases will also enable this stage to complete faster.
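The translation step of the protein comparison can be sketched as follows. This is a minimal illustration that uses the standard genetic code, translates only the forward strand in frame 0, and stops at the first stop codon. The exact-substring "database" lookup is a deliberately naive placeholder for a real alignment search with SeqAlignFPGA or BLAST, and the protein entries are invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 0 of an uppercase DNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical protein entries; a real run would query a protein database.
protein_db = {"P_example_1": "MKTMAWLLV", "P_example_2": "MGGSSHHH"}

predicted = translate("ATGGCCTGGTAA")
matches = [name for name, seq in protein_db.items() if predicted in seq]
```

A fuller version would also translate the other two forward frames and the reverse complement, since the predicted coding strand and frame may be wrong.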
References:
1. "Gene Prediction". Online. [Available] http://www.wikipedia.org
2. Wang, Chen and Li, "A brief review of computational gene prediction methods", Genomics Proteomics
Bioinformatics, Vol. 2, No. 4, pp. 216-221, 2004
3. Rabindra Ku. Jena, Musbah M. Aqel, Pankaj Srivastava, and Prabhat K. Mahanti, "Soft Computing
Methodologies in Bioinformatics", European Journal of Scientific Research, Vol. 26, No. 2, pp. 189-203,
2009
4. D. Sundar, "BEL418, Bioinformatics", Spring 2012. Indian Institute of Technology Delhi, India
5. Sung-Kyu Kim, Jin-Wu Nam, Je-Keun Rhee, Wha-Jin Lee and Byoung-Tak Zhang, "miTarget: microRNA
target gene prediction using a support vector machine", BMC Bioinformatics, Vol. 7, No. 411, pp. 1-14,
2006
6. Marijke J. van Baren and Michael R. Brent, "Iterative gene prediction and pseudogene removal
improves genome annotation", Genome Research, Vol. 16, pp. 678-685, 2006
