1. Title of Research Proposal
Computational methods for gene prediction in Eukaryotes
Name of Researcher Umar Rasheed
Session 2019-2023
Registration No 2019-bce-019
Proposed Supervisor Prof. Abdul Rehman
Institute of Biochemistry and Biotechnology
University of Veterinary and Animal Sciences,
Lahore
3. Introduction:
Functionally, a eukaryotic gene can be defined as being composed of a transcribed region and of regions
that cis-regulate the gene expression, such as the promoter region, which controls both the site and the
extent of transcription. The region between two transcribed regions is called intergenic. The promoter is
in the intergenic region, immediately upstream of the gene and not overlapping with it. Genes
sometimes overlap other genes, creating a single transcriptional unit, but these events are rare and
complex enough that, to my knowledge, no gene predictor has been able to explicitly model them. For the
automatic analysis and annotation of genomic sequences, computational gene prediction is becoming
increasingly important. Gene identification aims to predict the complete gene structure, particularly
the accurate exon-intron structure of a gene in a eukaryotic genomic DNA sequence. After sequencing,
finding the genes is one of the first and most significant steps in understanding the genome of a species.
Gene finding usually refers to the field of computational biology concerned with algorithmically
recognizing the stretches of sequence, generally genomic DNA, that are biologically functional. This
involves not only protein-coding genes but may also include additional functional elements, for
instance RNA genes and regulatory regions as described above.
In general, most currently existing programs use two types of content sensors: one for coding sequences
and one for non-coding sequences, i.e. introns, UTRs (untranslated regions) and intergenic
regions.
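
As a minimal sketch of what a content sensor does, the Python example below scores a sequence window with k-mer log-odds between a coding model and a non-coding model. The function names, the choice of k = 3, the pseudocount, and the toy training sequences are all assumptions made for illustration; a real sensor would typically use hexamers and be trained on large annotated data sets.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(sequences, k=3, pseudocount=1.0):
    """Estimate smoothed k-mer frequencies from a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def content_score(window, coding_freq, noncoding_freq, k=3):
    """Sum of k-mer log-odds; positive values favour the coding model."""
    score = 0.0
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if kmer in coding_freq:  # skip k-mers containing N or other symbols
            score += math.log(coding_freq[kmer] / noncoding_freq[kmer])
    return score


# Toy training data (hypothetical, for illustration only).
coding = ["ATGGCCGCCGGCGCCATGGCCGGC", "ATGGGCGCCGCCGGCGGCGCC"]
noncoding = ["ATATTTTAAATATATTTAAAATAT", "TTTTATATAAATTTATATATAA"]
coding_freq = kmer_frequencies(coding)
noncoding_freq = kmer_frequencies(noncoding)
```

Calling `content_score(window, coding_freq, noncoding_freq)` on a sliding window gives a per-region signal that a gene finder can combine with other evidence.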
Aims:
The main aim of gene prediction is to correctly label each element of a DNA sequence
as belonging to a protein-coding region, an RNA-coding region, or a noncoding or intergenic region.
Intergenic regions are the regions of DNA between genes. Genes in eukaryotes consist of coding
segments interrupted by long noncoding segments. The coding segments are termed exons and the
noncoding segments introns. In humans, only about 3% of the DNA sequence is coding.
Literature Review:
Different computational algorithms widely used in the literature for gene prediction include: Support
Vector Machines, e.g. Kim et al., which depends upon SVMs for predicting the targets of a
transcription factor by recognizing subtle relationships between their expression profiles; Hidden
Markov Models, e.g. Van Baren et al., which can model the statistical dependencies between adjacent
bases; software programs, Issac et al., which can be trained for a specific domain of organisms or similar
genes for greater accuracy; machine learning, Hoff et al., where context-free grammars and other
machine learning techniques are explored; digital signal processing, Mabrouk et al., where discrete
Fourier transforms (DFT) and filter-based techniques are used; and neural-network-based
approaches, e.g. Mahony et al. The list is by no means exhaustive.
Methodology:
There are three major phases involved in the tool:
1. Database search via sequence alignment
The first phase is to go through the genome and annotate genes that are high-similarity matches to
already known eukaryotic genes. The entire list of currently known human genes has been compiled
and is referred to as RefSeq by the Otto gene prediction tool. I would use RefSeq if my tool is used only
for human gene prediction. Otherwise I would go with the available databases or, if possible, form a
database of my own based on the restricted set of organisms for my tool. For Otto, the cutoff for annotation
of a gene when comparing to RefSeq is that the genomic sequence has to match at least 50% of its
length to the RefSeq sequence. The sequence identity must be greater than 92%. For my tool, the threshold
will be heuristically determined once the tool is run on benchmarks of known genes. For this phase, the
SeqAlignFPGA tool, which I am currently speeding up under my SURA, would be used. It uses a prefix trie
model and a backward search algorithm to match the query sequence with the reference genome. The
computational time is linear in the length of the query sequence. It is more accurate and several to
tens of times faster than BLAST.
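
To illustrate the backward search idea mentioned above, the following Python sketch builds a naive FM-index (Burrows-Wheeler transform plus count tables) over a short reference and finds exact matches of a query in a number of steps proportional to the query length. It is not the SeqAlignFPGA implementation: the function names are hypothetical, the index is built by sorting suffixes (suitable only for small inputs), and only exact matching is shown.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a naive FM-index (suffix array, C table, Occ table)."""
    text = reference + "$"  # '$' is a sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[ch]: number of characters in the text that sort strictly before ch
    counts = Counter(text)
    c_table, running = {}, 0
    for ch in sorted(counts):
        c_table[ch] = running
        running += counts[ch]
    # occ[ch][i]: occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return suffix_array, c_table, occ


def backward_search(query, index):
    """Return sorted start positions of exact matches of query."""
    suffix_array, c_table, occ = index
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(query):  # one constant-time update per query character
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("ACGTACGTTACG")
```

With this toy reference, `backward_search("ACG", index)` reports every position where the query occurs. Extending the search to tolerate mismatches and gaps, as an aligner must, is outside the scope of this sketch.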
2. Alternative splice site prediction
The second phase involves alternative splice site prediction. Here I plan to use a customized variation of
MaxEntScan. The tool is based on the maximum entropy principle. Although it is the most accurate and
sensitive tool to date, it lacks specificity compared with other good tools. Unlike the usual method of
training it with generalized data, I plan to build a model based on the results of the first phase.
This means that only DNA families that match the query sequence significantly (above the
threshold) will be considered for the so-called "prior data". This should raise the low specificity of
the tool and further increase its sensitivity.
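
The idea of training a splice site model only on sequences retained from the first phase can be sketched with a much simpler stand-in for MaxEntScan: a position weight matrix (PWM) of log-odds scores. This is explicitly not MaxEntScan's maximum entropy model, which also captures dependencies between positions. The donor sites below (3 exonic plus 6 intronic bases), the uniform background, and all function names are hypothetical and chosen only for illustration.

```python
import math


def train_pwm(sites, pseudocount=1.0):
    """Per-position log-odds matrix from equal-length aligned splice sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            # log-odds against a uniform background of 0.25 per base (assumption)
            base: math.log(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def score_site(candidate, pwm):
    """Sum the log-odds of each base at its position."""
    return sum(pwm[i][base] for i, base in enumerate(candidate))


def scan_donors(sequence, pwm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if set(window) <= set("ACGT"):  # ignore windows with ambiguous bases
            score = score_site(window, pwm)
            if score >= threshold:
                hits.append((i, score))
    return hits


# Hypothetical donor sites, standing in for sites taken from phase 1 matches.
DONOR_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
PWM = train_pwm(DONOR_SITES)
hits = scan_donors("ACGTCAGGTAAGTTTTT", PWM, threshold=3.0)
```

Restricting `DONOR_SITES` to families that passed the phase 1 threshold is what would make the model organism- or family-specific; the threshold value used here is arbitrary.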
3. Ab initio gene prediction based on a standard HMM
Now the second phase would divide the gene into exons and introns. The third phase uses a standard
HMM for ab initio gene prediction. It is well known from the literature that using different content sensors,
and thus different models, for coding and non-coding regions is a good way of proceeding with
gene prediction. In the case of neural-network-based approaches, and specifically standard HMMs, this is even
more critical since high sequence similarity is needed. The training set generated from the results of
phase 1 will thus be further enhanced for the construction of two exclusive standard HMMs. Moreover, the
reason for my choice of a standard HMM instead of a generalized HMM is that it is an integrated model.
Most of the content sensors would be covered here.
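
As a rough illustration of HMM decoding for this phase, the Python sketch below runs the Viterbi algorithm on a single two-state HMM (coding versus non-coding). It is deliberately simpler than the two separate models described above, and every probability in it is a made-up placeholder; in practice the parameters would be estimated from the training set produced in phase 1. The sequence is assumed to be non-empty and to contain only A, C, G and T.

```python
import math

STATES = ("coding", "noncoding")
# Hypothetical parameters for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}


def viterbi(sequence):
    """Most probable state path for a DNA string, computed in log space."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][sequence[0]])
               for s in STATES}]
    back = []
    for base in sequence[1:]:
        row, pointers = {}, {}
        for s in STATES:
            # best predecessor state for landing in state s at this position
            prev = max(STATES,
                       key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = (scores[-1][prev] + math.log(TRANS[prev][s])
                      + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # trace back from the best final state
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATATAT" + "GCGCGCGCGC")
```

With these placeholder parameters the AT-rich prefix is labeled non-coding and the GC-rich suffix coding. A realistic gene model would add states for exons in each reading frame, introns, and splice signals.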
Researchers in the field of machine learning have developed a Fast Two-Level HMM Decoding Algorithm
for Vocabulary Handwriting Recognition (FTLDA). This algorithm breaks up the computation of words
into two levels: the state (or word) level and the character (or letter) level. An analogous decoding algorithm
for a standard HMM for gene prediction is also possible by using patterns (hexamers, dinucleotides) for the
first level and the individual nucleotide bases for the second level. This would speed up the decoding
process by as much as 15 times for 240,000 nt, the size of a large human gene.
Finally, similar to Otto, the result will be validated by comparison with EST, protein, and genomic
sequence databases using SeqAlignFPGA (or BLAST, since it is widely used) and also using phylogenetic
programs such as PAUP or PHYLIP (after running SeqAlignFPGA) to identify homology and paralogy. For
example, in the case of protein comparison, the sequence will be translated and matched against a
protein database. Similarly, the genomic sequence will be matched against ESTs. The reduction in
computational time in the above phases will also enable this stage to complete faster.
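
The translation step used before protein comparison can be sketched as follows, assuming the standard genetic code. The helper names are hypothetical; codons containing ambiguous bases are reported as "X", and the database search itself is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks stop codons, 'X' unknown codons."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Translate all three frames on both strands."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [translate(strand, frame)
            for strand in (forward, reverse) for frame in range(3)]
```

For a predicted gene, only the frame implied by the predicted exons needs translating; the six-frame variant is useful when the strand or frame is uncertain.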
References:
1. "Gene Prediction". Online. [Available] http://www.wikipedia.org
2. Wang, Chen and Li, "A brief review of computational gene prediction methods", Genomics Proteomics
Bioinformatics, Vol. 2, No. 4, pp. 216-221, 2004
3. Rabindra Ku. Jena, Musbah M. Aqel, Pankaj Srivastava, and Prabhat K. Mahanti, "Soft Computing
Methodologies in Bioinformatics", European Journal of Scientific Research, Vol. 26, No. 2, pp. 189-203,
2009
4. D. Sundar, "BEL418, Bioinformatics", Spring 2012. Indian Institute of Technology Delhi, India
5. Sung-Kyu Kim, Jin-Wu Nam, Je-Keun Rhee, Wha-Jin Lee and Byoung-Tak Zhang, "miTarget: microRNA
target gene prediction using a support vector machine", BMC Bioinformatics, Vol. 7, No. 411, pp. 1-14,
2006
6. Marijke J. van Baren and Michael R. Brent, "Iterative gene prediction and pseudogene removal
improves genome annotation", Genome Research, Vol. 16, pp. 678-685, 2006