SlideShare a Scribd company logo
1 of 32
Gene Prediction
Gene Prediction
• Genome – sequenced and assembled
• First ensuring task is to locate all the protein coding
genes hidden within the genome.
• This help in understanding the functional content of
the genome.
• Computational gene prediction is a prerequisite for
detailed functional annotation of genes and
genomes.
• The ultimate goal is to describe all the genes
computationally with near 100% accuracy.
• The ability to accurately predict genes can significantly
reduce the amount of experimental verification work
required.
Gene Organization
• Prokaryote
– Have relatively small genome sizes
– Have high gene density (>90% of coding)
– Few repeats
– Single ORF, found adjacent to one another
– Less complex
Gene Organization
• Eukaryote
– Nuclear genomes are much larger than prokaryotic ones
– Have a very low gene density
– The space between genes is often very large
– Rich in repetitive sequences and transposable elements
– Eukaryotic genomes are characterized by a mosaic
organization in which a gene is split into pieces (called
exons) by intervening noncoding sequences (called
introns)
– The nascent transcript from a eukaryotic gene is modified
in three different ways (Capping, Splicing and
Polyadenylation) before becoming a mature mRNA for
protein translation.
– More complex
Prediction Methods
• Location of genes is achieved through one
or combination of following methods.
• Intrinsic Method or Template Method
– Searching for Signal
– Searching by Content
• Extrinsic Method or Lookup Method
– Homology based prediction
– Comparative gene prediction
Searching for Signals
• Basic Signals like
– The start/stop codon
– The 5’/3’ splice site
– Ribosomal binding site
– Transcription factor binding site
– polyadenylation (poly-A) sites
• Sequence signals are detected through use of
Position Weight Matrices [(PWM) which are
calculated from a set of known functional
signals] across a sequence of interest.
• Most vertebrate genes use ATG as the translation start codon and
have a uniquely conserved flanking sequence call a Kozak sequence
(CCGCCATGG).
• In bacteria, the majority of genes have a start codon ATG which codes
for methionine.
• Occasionally, GTG and TTG are used as alternative start codons, but
methionine is still the actual amino acid inserted at the first position.
• Because there may be multiple ATG, GTG, or TGT codons in a frame,
the presence of these codons at the beginning of the frame does not
necessarily give a clear indication of the translation initiation site.
• Instead, to help identify this initiation codon, other features associated
with translation are used.
• One such feature is the ribosomal binding site, also called the
Shine-Dalgarno sequence, has a consensus motif of AGGAGGT.
• There are three possible stop codons, identification of which is
straightforward.
• The terminator sequence has a distinct stem-loop secondary structure
followed by a string of Ts.
Transcription Start
RBS
Translation Start Stop
Transcription Termination
TTTTT
Searching for Content
• Gene content refers to coding statistics, which includes
nonrandom nucleotide distribution, amino acid
distribution, synonymous codon usage, and hexamer
frequencies.
• Among these features, the hexamer frequencies appear to
be most discriminative for coding potentials.
• Protein coding regions are known to exhibit characteristic
compositional bias when compared to non-coding regions
• Statistical analyses have shown that pairs of codons (or
amino acids at the protein level) tend to correlate.
• The frequency of six unique nucleotides appearing together
in a coding region is much higher than by random chance.
• Accurate exon are predicted by content based features
• A number of other content based measures can also be
used to discriminate protein coding regions from non-coding
regions
• Content measures are also referred as coding statistics
• The main issue in prediction of eukaryotic genes is the
identification of exons, introns, and splicing sites.
• The good news is that there are still some conserved
sequence features in eukaryotic genes that allow
computational prediction.
• For example, the splice junctions of introns and exons
follow the GT–AG rule in which an intron at the 5’ splice
junction has a consensus motif of GTAAGT; and at the 3’
splice junction is a consensus motif of (Py)12 NCAG
• In addition, most of these genes have a high density of
CG dinucleotides near the transcription start site.
• This region is referred to as a CpG island (p refers to
the phosphodiester bond connecting the two
nucleotides), which helps to identify the transcription
initiation site of a eukaryotic gene.
• The poly-A signal can also help locate the final coding
sequence.
• Eg tools: GeneScan, Grail2
Similarity based prediction
• The homology-based method makes predictions
based on significant matches of the query
sequence with sequences of known genes.
• For instance, if a translated DNA sequence is
found to be similar to a known protein or protein
family from a database search, this can be strong
evidence that the region codes for a protein.
• This works well for prokaryotes.
• Alternatively, when possible exons of a genomic
DNA region match a sequenced cDNA, this also
provides experimental evidence for the existence
of a coding region.
• A combination of abinitio and this method can be
done
• Eg tools: GeneID, GenomeScan, ORF Finder
• Prokaryotic DNA is first subject to conceptual
translation in all six possible frames, three frames
forward and three frames reverse.
• Because a stop codon occurs in about every twenty
codons by chance in a noncoding region, a frame
longer than thirty codons without interruption by stop
codons is suggestive of a gene coding region, although
the threshold for an ORF is normally set even higher at
fifty or sixty codons.
• The putative frame is further manually confirmed by the
presence of other signals such as a start codon and
Shine–Dalgarno sequence.
• Furthermore, the putative ORF can be translated into a
protein sequence, which is then used to search against a
protein database.
• Detection of homologs from this search is probably the
strongest indicator of a protein-coding frame.
Comparative gene prediction
• Large number of complete genomes are
available
• Comparisons made with genomes
• Rational: Functional regions tend to be
more conserved than non-protein coding
regions
• Eg tools: Twinscan, SLAM
Performance Evaluation
• The accuracy of a prediction program can be evaluated
using parameters such as Sensitivity and Specificity.
• To describe the concept of sensitivity and specificity
accurately, four features are used: True Positive (TP),
which is a correctly predicted feature; False Positive
(FP), which is an incorrectly predicted feature; False
Negative (FN), which is a missed feature; and True
Negative (TN), which is the correctly predicted absence
of a feature.
• Using these four terms, sensitivity (Sn) and specificity
(Sp) can be described by the following formulas:
• Sn = TP/(TP + FN)
• Sp = TP/(TP + FP)
• According to these formulas, sensitivity is the proportion
of true signals predicted among all possible true
signals.
• It can be considered as the ability to include correct
predictions.
• In contrast, specificity is the proportion of true signals
among all signals that are predicted.
• It represents the ability to exclude incorrect predictions.
• A program is considered accurate if both sensitivity and
specificity are simultaneously high and approach a value of
1.
• In a case in which sensitivity is high but specificity is low,
the program is said to have a tendency to over predict.
• On the other hand, if the sensitivity is low but specificity
high, the program is too conservative and lacks
predictive power.
• Because neither sensitivity nor specificity alone can fully describe
accuracy, it is desirable to use a single value to summarize both of
them.
• In the field of gene finding, a single parameter known as the
correlation coefficient (CC) is often used, which is defined by the
following formula:
• CC =
• The value of the CC provides an overall measure of accuracy, which
ranges from −1 to +1, with +1 meaning always correct prediction
and −1 meaning always incorrect prediction
TP • TN − F P • FN
√(TP + F P)(TN + FN)(F P + TN)
Promoter and Regulatory
Element Prediction
Promoters and Regulatory
elements
• Promoters are DNA elements located in the vicinity of
gene start sites (which should not be confused with the
translation start sites) and serve as binding sites for
the gene transcription machinery, consisting of RNA
polymerases and transcription factors.
• These DNA elements directly regulate gene expression.
• Promoters and regulatory elements are traditionally
determined by experimental analysis.
• The process is extremely time consuming and laborious.
• Computational prediction of promoters and regulatory
elements is especially promising because it has the
potential to replace a great deal of extensive
experimental analysis.
Identification of promoters and
regulatory elements - a difficult task
• First, promoters and regulatory elements are not
clearly defined and are highly diverse. Each gene
seems to have a unique combination of sets of
regulatory motifs that determine its unique temporal and
spatial expression. There is currently a lack of sufficient
understanding of all the necessary regulatory elements
for transcription.
• Second, the promoters and regulatory elements
cannot be translated into protein sequences to
increase the sensitivity for their detection.
• Third, promoter and regulatory sites to be predicted are
normally short (six to eight nucleotides) and can be
found in essentially any sequence by random chance,
thus resulting in high rates of false positives associated
with theoretical predictions.
Promotor and Regulatory elements
in Prokaryotes
• In bacteria, transcription is initiated by RNA
polymerase, which is a multi-subunit enzyme.
• The RNA polymerase recognizes specific sequences
located 35 and 10 base pairs (bp) upstream from the
transcription start site upstream of a gene and binds.
• The –35 box has a consensus sequence of TTGACA
and the –10 box has a consensus of TATAAT.
• In addition to the RNA polymerase, there are also a
number of DNA-binding proteins called transcription
factors which bind to specific DNA sequences referred
to as regulatory elements.
• They bind to a site several hundred bases away from the
promoter.
• In eukaryotes, gene expression is also regulated by
a protein complex formed between transcription
factors and RNA polymerase.
• Three different types of RNA polymerase
complexes, namely RNA polymerases I, II, and III
where RNA polymerases I and III are responsible
for the transcription of ribosomal RNAs and
tRNAs, respectively and RNA polymerase II is
exclusively responsible for transcribing mRNAs.
• The core of many eukaryotic promoters is a so-
called TATA box, located 30 bps upstream from
the transcription start site, having a consensus
motif TATA(A/T)A(A/T).
Promotor and Regulatory elements
in Eukaryotes
• However, not all eukaryotic promoters contain the
TATA box.
• In addition, many genes have a unique initiator
sequence (Inr), which is a pyrimidine rich
sequence with a consensus (C/T)(C/T)CA(C/T)
(C/T).
• Some regulatory sites can be found tens of
thousands base pairs away from the gene start site.
• Occasionally, regulatory elements are located
downstream instead of upstream of the transcription
start site.
Promotor and Regulatory elements
in Eukaryotes
NF1- nuclear factor 1
SP1-specificity protein 1
PREDICTION ALGORITHMS
• Current algorithms for predicting
promoters and regulatory elements can be
categorized as either
– ab initio based - which make de novo
predictions by scanning individual sequences;
– phylogenetic footprinting - Similarity based
- which make predictions based on alignment
of homologous sequences;
ab initio algorithm
• This type of algorithm predicts prokaryotic and
eukaryotic promoters and regulatory elements based
on characteristic sequences patterns for promoters
and regulatory elements.
• Some ab initio programs are signal based, relying
on characteristic promoter sequences such as the
TATA box, whereas others rely on content
information such as hexamer frequencies.
• The advantage of the ab initio method is that the
sequence can be applied as such without having to
obtain experimental information.
• The limitation is the need for training, which makes
the prediction programs species specific.
• In addition, this type of method has a difficulty in
discovering new, unknown motifs.
• The conventional approach to detecting a promoter or
regulatory site is through
– matching a consensus sequence pattern represented by regular
expressions or
– matching a position-specific scoring matrix (PSSM) constructed
from well-characterized binding sites.
• In either case, the consensus sequences or the matrices
are relatively short, covering 6 to 10 bases.
• This simple approach, however, often has difficulty
differentiating true promoters from random sequence
matches and generates high rates of false positives as
a result.
ab initio algorithm
Prokaryotic Promoter Prediction
• One of the unique aspects in prokaryotic promoter
prediction is the determination of operon structures,
because genes within an operon share a common
promoter located upstream of the first gene of the
operon.
• Thus, operon prediction is the key in prokaryotic
promoter prediction.
• Once an operon structure is known, only the first gene is
predicted for the presence of a promoter and regulatory
elements, whereas other genes in the operon do not
possess such DNA elements.
• This method relies on two kinds of information: gene
orientation and intergenic distances of a pair of genes
of interest and conserved linkage of the genes based on
comparative genomic analysis.
Predicting Eukaryotic Promoters
• The ab initio method for predicting eukaryotic promoters and
regulatory elements also relies on searching the input
sequences for matching of consensus patterns of known
promoters and regulatory elements.
• However, this approach tends to generate very high rate of
false positives owing to nonspecific matches with the short
sequence patterns.
• To increase the specificity of prediction, a unique feature
called CpG islands are detected where many vertebrate
genes are characterized by a high density of CG
dinucleotides near the promoter region overlapping the
transcription start site.
• By identifying the CpG islands, promoters can be traced on
the immediate upstream region from the islands.
• The promoter regions tend to contain a high density of
protein-binding sites. Thus, finding a cluster of transcription
factor binding sites often enhances the probability of
individual binding site prediction.
Tool for BPP
• BPROM is a web-based program for prediction
of bacterial promoters.
• It uses a linear discriminant function combined
with signal and content information such as
consensus promoter sequence and
oligonucleotide composition of the promoter
sites.
• This program first predicts a given sequence for
bacterial operon structures by using an
intergenic distance of 100 bp as basis for
distinguishing genes to be in an operon.
Tool for EPP
• McPromoter (http://genes.mit.edu/McPromoter.html) is
a web-based program that uses a neural network to
make promoter predictions.
• It has a unique promoter model containing six scoring
segments.
• The program scans a window of 300 bases for the
likelihoods of being in each of the coding, noncoding,
and promoter regions.
• The input for the neural network includes parameters for
sequence physical properties, such as DNA bendability,
plus signals such as the TATA box, initiator box, and
CpG islands.
phylogenetic footprinting
• It has been observed that promoter and regulatory
elements from closely related organisms such as human
and mouse are highly conserved.
• The conservation is both at the sequence level and at
the level of organization of the elements.
• Therefore, it is possible to obtain such promoter
sequences for a particular gene through comparative
analysis.
• The identification of conserved noncoding DNA elements
that serve crucial functional roles is referred to as
phylogenetic footprinting; the elements are called
phylogenetic footprints.
• This type of method can apply to both prokaryotic
and eukaryotic sequences.
• The selection of organisms for comparison is an
important consideration in this type of analysis.
• If the pair of organisms selected are too closely
related, such as human and chimpanzee, the
sequence difference between them may not be
sufficient to filter out functional elements. On the
other hand, if the organisms’ evolutionary distances
are too long, such as between human and fish, long
evolutionary divergence may render promoter and
other elements undetectable.
• Another caveat of phylogenetic footprinting is to
extract noncoding sequences upstream of
corresponding genes and focus the comparison to
this region only, which helps to prevent false
positives.
Tool
• ConSite (http://mordor.cgb.ki.se/cgi-
bin/CONSITE/consite) is a web server that finds putative
promoter elements by comparing two orthologous
sequences.
• The user provides two individual sequences which are
aligned by ConSite using a global alignment algorithm.
• Alternatively, the program accepts precomputed
alignment.
• Conserved regions are identified by calculating identity
scores, which are then used to compare against a motif
database of regulatory sites (TRANSFAC).
• High-scoring sequence segments upstream of genes are
returned as putative regulatory elements.

More Related Content

What's hot

SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)talhakhat
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSHEETHUMOLKS
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicshemantbreeder
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicsprateek kumar
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programsMugdhaSharma11
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomicsajay301
 
An Introduction to Genomics
An Introduction to GenomicsAn Introduction to Genomics
An Introduction to GenomicsDr NEETHU ASOKAN
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Vijay Hemmadi
 
Genomics and bioinformatics
Genomics and bioinformatics Genomics and bioinformatics
Genomics and bioinformatics Senthil Natesan
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure predictionSiva Dharshini R
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methodsratanvishwas
 

What's hot (20)

SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)SAGE (Serial analysis of Gene Expression)
SAGE (Serial analysis of Gene Expression)
 
Express sequence tags
Express sequence tagsExpress sequence tags
Express sequence tags
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
 
Gene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptxGene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptx
 
Genomics types
Genomics typesGenomics types
Genomics types
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
An Introduction to Genomics
An Introduction to GenomicsAn Introduction to Genomics
An Introduction to Genomics
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
 
Genomics and bioinformatics
Genomics and bioinformatics Genomics and bioinformatics
Genomics and bioinformatics
 
Cath
CathCath
Cath
 
Scop database
Scop databaseScop database
Scop database
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure prediction
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methods
 
artificial neural network-gene prediction
artificial neural network-gene predictionartificial neural network-gene prediction
artificial neural network-gene prediction
 

Similar to Bioinformatics

Modern concept of gene.pdf
Modern concept of gene.pdfModern concept of gene.pdf
Modern concept of gene.pdfMeeraTaraSuresh
 
B.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionB.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionRai University
 
B.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionB.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionRai University
 
Bioinfomatics in mutation studies
Bioinfomatics in mutation studies Bioinfomatics in mutation studies
Bioinfomatics in mutation studies Dariyus Kabraji
 
organization of genome both full ppt.pptx
organization of genome both full ppt.pptxorganization of genome both full ppt.pptx
organization of genome both full ppt.pptxrameshparihar764
 
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxChijiokeNsofor
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical NotebookNaima Tahsin
 
Arnab kumar de
Arnab kumar deArnab kumar de
Arnab kumar deArnab De
 
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema
 
Real time pcr, reverse transcripta pcr, gene expression methods and microarray
Real time pcr, reverse transcripta pcr, gene expression methods and microarrayReal time pcr, reverse transcripta pcr, gene expression methods and microarray
Real time pcr, reverse transcripta pcr, gene expression methods and microarrayharrisonjoshua
 
Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Monica Munoz-Torres
 
Transcriptomics,techniqes, applications.pdf
Transcriptomics,techniqes, applications.pdfTranscriptomics,techniqes, applications.pdf
Transcriptomics,techniqes, applications.pdfshinycthomas
 
Mapping and quantifying transcripts.pdf
Mapping and quantifying transcripts.pdfMapping and quantifying transcripts.pdf
Mapping and quantifying transcripts.pdfKristu Jayanti College
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_predictionBas van Breukelen
 

Similar to Bioinformatics (20)

genomeannotation-160822182432.pdf
genomeannotation-160822182432.pdfgenomeannotation-160822182432.pdf
genomeannotation-160822182432.pdf
 
Lecture 4.ppt
Lecture 4.pptLecture 4.ppt
Lecture 4.ppt
 
Modern concept of gene.pdf
Modern concept of gene.pdfModern concept of gene.pdf
Modern concept of gene.pdf
 
B.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionB.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene prediction
 
B.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene predictionB.sc biochem i bobi u 4 gene prediction
B.sc biochem i bobi u 4 gene prediction
 
Bioinfomatics in mutation studies
Bioinfomatics in mutation studies Bioinfomatics in mutation studies
Bioinfomatics in mutation studies
 
organization of genome both full ppt.pptx
organization of genome both full ppt.pptxorganization of genome both full ppt.pptx
organization of genome both full ppt.pptx
 
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
 
Arnab kumar de
Arnab kumar deArnab kumar de
Arnab kumar de
 
gene expression traditional methods
gene expression traditional methods gene expression traditional methods
gene expression traditional methods
 
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
 
Real time pcr, reverse transcripta pcr, gene expression methods and microarray
Real time pcr, reverse transcripta pcr, gene expression methods and microarrayReal time pcr, reverse transcripta pcr, gene expression methods and microarray
Real time pcr, reverse transcripta pcr, gene expression methods and microarray
 
Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07Apollo Introduction for i5K Groups 2015-10-07
Apollo Introduction for i5K Groups 2015-10-07
 
Transcriptomics,techniqes, applications.pdf
Transcriptomics,techniqes, applications.pdfTranscriptomics,techniqes, applications.pdf
Transcriptomics,techniqes, applications.pdf
 
Finding genes
Finding genesFinding genes
Finding genes
 
Mapping and quantifying transcripts.pdf
Mapping and quantifying transcripts.pdfMapping and quantifying transcripts.pdf
Mapping and quantifying transcripts.pdf
 
08_Annotation_2022.pdf
08_Annotation_2022.pdf08_Annotation_2022.pdf
08_Annotation_2022.pdf
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_prediction
 
Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014
 

More from Afra Fathima

Radio active labeling
Radio active labelingRadio active labeling
Radio active labelingAfra Fathima
 
Micriobiology & Microbes
Micriobiology & MicrobesMicriobiology & Microbes
Micriobiology & MicrobesAfra Fathima
 
Identification of proteins by 2D gel
Identification of proteins by 2D gelIdentification of proteins by 2D gel
Identification of proteins by 2D gelAfra Fathima
 
HIGH PERFORMANCE LIQUID CHROMATOGRAPHY
HIGH PERFORMANCE LIQUID CHROMATOGRAPHYHIGH PERFORMANCE LIQUID CHROMATOGRAPHY
HIGH PERFORMANCE LIQUID CHROMATOGRAPHYAfra Fathima
 
Ammonium sulphate precipitation
Ammonium sulphate precipitationAmmonium sulphate precipitation
Ammonium sulphate precipitationAfra Fathima
 
Collection of blood, serum & plasma
Collection of blood, serum & plasmaCollection of blood, serum & plasma
Collection of blood, serum & plasmaAfra Fathima
 
Sub cellular fractionation
Sub cellular fractionationSub cellular fractionation
Sub cellular fractionationAfra Fathima
 
Cells of the immune system
Cells of the immune systemCells of the immune system
Cells of the immune systemAfra Fathima
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system Afra Fathima
 
Dinitro salicylic acid (DNSA) method
Dinitro salicylic acid (DNSA) methodDinitro salicylic acid (DNSA) method
Dinitro salicylic acid (DNSA) methodAfra Fathima
 
Cells of the immune system
Cells of the immune system Cells of the immune system
Cells of the immune system Afra Fathima
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system Afra Fathima
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system Afra Fathima
 
Cells of the immune system
Cells  of the immune systemCells  of the immune system
Cells of the immune systemAfra Fathima
 

More from Afra Fathima (20)

Time management
Time managementTime management
Time management
 
Radio active labeling
Radio active labelingRadio active labeling
Radio active labeling
 
Micriobiology & Microbes
Micriobiology & MicrobesMicriobiology & Microbes
Micriobiology & Microbes
 
Identification of proteins by 2D gel
Identification of proteins by 2D gelIdentification of proteins by 2D gel
Identification of proteins by 2D gel
 
SDS PAGE
SDS PAGESDS PAGE
SDS PAGE
 
HIGH PERFORMANCE LIQUID CHROMATOGRAPHY
HIGH PERFORMANCE LIQUID CHROMATOGRAPHYHIGH PERFORMANCE LIQUID CHROMATOGRAPHY
HIGH PERFORMANCE LIQUID CHROMATOGRAPHY
 
Dialysis
DialysisDialysis
Dialysis
 
Lowry's method
Lowry's  methodLowry's  method
Lowry's method
 
Bradford's method
Bradford's methodBradford's method
Bradford's method
 
Ammonium sulphate precipitation
Ammonium sulphate precipitationAmmonium sulphate precipitation
Ammonium sulphate precipitation
 
Collection of blood, serum & plasma
Collection of blood, serum & plasmaCollection of blood, serum & plasma
Collection of blood, serum & plasma
 
Sub cellular fractionation
Sub cellular fractionationSub cellular fractionation
Sub cellular fractionation
 
Cells of the immune system
Cells of the immune systemCells of the immune system
Cells of the immune system
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system
 
Dinitro salicylic acid (DNSA) method
Dinitro salicylic acid (DNSA) methodDinitro salicylic acid (DNSA) method
Dinitro salicylic acid (DNSA) method
 
Cells of the immune system
Cells of the immune system Cells of the immune system
Cells of the immune system
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system
 
Cells of the immune system
Cells  of the immune system Cells  of the immune system
Cells of the immune system
 
Cells of the immune system
Cells  of the immune systemCells  of the immune system
Cells of the immune system
 
Sulphadiazine
SulphadiazineSulphadiazine
Sulphadiazine
 

Recently uploaded

Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 

Recently uploaded (20)

Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 

Bioinformatics

  • 2. Gene Prediction • Genome – sequenced and assembled • First ensuring task is to locate all the protein coding genes hidden within the genome. • This help in understanding the functional content of the genome. • Computational gene prediction is a prerequisite for detailed functional annotation of genes and genomes. • The ultimate goal is to describe all the genes computationally with near 100% accuracy. • The ability to accurately predict genes can significantly reduce the amount of experimental verification work required.
  • 3. Gene Organization • Prokaryote – Have relatively small genome sizes – Have high gene density (>90% of coding) – Few repeats – Single ORF, found adjacent to one another – Less complex
  • 4. Gene Organization • Eukaryote – Nuclear genomes are much larger than prokaryotic ones – Have a very low gene density – The space between genes is often very large – Rich in repetitive sequences and transposable elements – Eukaryotic genomes are characterized by a mosaic organization in which a gene is split into pieces (called exons) by intervening noncoding sequences (called introns) – The nascent transcript from a eukaryotic gene is modified in three different ways (Capping, Splicing and Polyadenylation) before becoming a mature mRNA for protein translation. – More complex
  • 5. Prediction Methods • Location of genes is achieved through one or combination of following methods. • Intrinsic Method or Template Method – Searching for Signal – Searching by Content • Extrinsic Method or Lookup Method – Homology based prediction – Comparative gene prediction
  • 6. Searching for Signals • Basic Signals like – The start/stop codon – The 5’/3’ splice site – Ribosomal binding site – Transcription factor binding site – polyadenylation (poly-A) sites • Sequence signals are detected through use of Position Weight Matrices [(PWM) which are calculated from a set of known functional signals] across a sequence of interest.
  • 7. • Most vertebrate genes use ATG as the translation start codon and have a uniquely conserved flanking sequence call a Kozak sequence (CCGCCATGG). • In bacteria, the majority of genes have a start codon ATG which codes for methionine. • Occasionally, GTG and TTG are used as alternative start codons, but methionine is still the actual amino acid inserted at the first position. • Because there may be multiple ATG, GTG, or TGT codons in a frame, the presence of these codons at the beginning of the frame does not necessarily give a clear indication of the translation initiation site. • Instead, to help identify this initiation codon, other features associated with translation are used. • One such feature is the ribosomal binding site, also called the Shine-Dalgarno sequence, has a consensus motif of AGGAGGT. • There are three possible stop codons, identification of which is straightforward. • The terminator sequence has a distinct stem-loop secondary structure followed by a string of Ts. Transcription Start RBS Translation Start Stop Transcription Termination TTTTT
  • 8. Searching for Content • Gene content refers to coding statistics, which includes nonrandom nucleotide distribution, amino acid distribution, synonymous codon usage, and hexamer frequencies. • Among these features, the hexamer frequencies appear to be most discriminative for coding potentials. • Protein coding regions are known to exhibit characteristic compositional bias when compared to non-coding regions • Statistical analyses have shown that pairs of codons (or amino acids at the protein level) tend to correlate. • The frequency of six unique nucleotides appearing together in a coding region is much higher than by random chance. • Accurate exon are predicted by content based features • A number of other content based measures can also be used to discriminate protein coding regions from non-coding regions • Content measures are also referred as coding statistics
  • 9. • The main issue in prediction of eukaryotic genes is the identification of exons, introns, and splicing sites. • The good news is that there are still some conserved sequence features in eukaryotic genes that allow computational prediction. • For example, the splice junctions of introns and exons follow the GT–AG rule in which an intron at the 5’ splice junction has a consensus motif of GTAAGT; and at the 3’ splice junction is a consensus motif of (Py)12 NCAG • In addition, most of these genes have a high density of CG dinucleotides near the transcription start site. • This region is referred to as a CpG island (p refers to the phosphodiester bond connecting the two nucleotides), which helps to identify the transcription initiation site of a eukaryotic gene. • The poly-A signal can also help locate the final coding sequence. • Eg tools: GeneScan, Grail2
  • 10. Similarity based prediction • The homology-based method makes predictions based on significant matches of the query sequence with sequences of known genes. • For instance, if a translated DNA sequence is found to be similar to a known protein or protein family from a database search, this can be strong evidence that the region codes for a protein. • This works well for prokaryotes. • Alternatively, when possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region. • A combination of abinitio and this method can be done • Eg tools: GeneID, GenomeScan, ORF Finder
  • 11. • Prokaryotic DNA is first subject to conceptual translation in all six possible frames, three frames forward and three frames reverse. • Because a stop codon occurs in about every twenty codons by chance in a noncoding region, a frame longer than thirty codons without interruption by stop codons is suggestive of a gene coding region, although the threshold for an ORF is normally set even higher at fifty or sixty codons. • The putative frame is further manually confirmed by the presence of other signals such as a start codon and Shine–Dalgarno sequence. • Furthermore, the putative ORF can be translated into a protein sequence, which is then used to search against a protein database. • Detection of homologs from this search is probably the strongest indicator of a protein-coding frame.
  • 12. Comparative gene prediction • Large number of complete genomes are available • Comparisons made with genomes • Rational: Functional regions tend to be more conserved than non-protein coding regions • Eg tools: Twinscan, SLAM
  • 13. Performance Evaluation • The accuracy of a prediction program can be evaluated using parameters such as Sensitivity and Specificity. • To describe the concept of sensitivity and specificity accurately, four features are used: True Positive (TP), which is a correctly predicted feature; False Positive (FP), which is an incorrectly predicted feature; False Negative (FN), which is a missed feature; and True Negative (TN), which is the correctly predicted absence of a feature. • Using these four terms, sensitivity (Sn) and specificity (Sp) can be described by the following formulas: • Sn = TP/(TP + FN) • Sp = TP/(TP + FP)
  • 14. • According to these formulas, sensitivity is the proportion of true signals predicted among all possible true signals. • It can be considered as the ability to include correct predictions. • In contrast, specificity is the proportion of true signals among all signals that are predicted. • It represents the ability to exclude incorrect predictions. • A program is considered accurate if both sensitivity and specificity are simultaneously high and approach a value of 1. • In a case in which sensitivity is high but specificity is low, the program is said to have a tendency to over predict. • On the other hand, if the sensitivity is low but specificity high, the program is too conservative and lacks predictive power.
  • 15. • Because neither sensitivity nor specificity alone can fully describe accuracy, it is desirable to use a single value to summarize both of them. • In the field of gene finding, a single parameter known as the correlation coefficient (CC) is often used, which is defined by the following formula: • CC = • The value of the CC provides an overall measure of accuracy, which ranges from −1 to +1, with +1 meaning always correct prediction and −1 meaning always incorrect prediction TP • TN − F P • FN √(TP + F P)(TN + FN)(F P + TN)
  • 17. Promoters and Regulatory elements • Promoters are DNA elements located in the vicinity of gene start sites (which should not be confused with the translation start sites) and serve as binding sites for the gene transcription machinery, consisting of RNA polymerases and transcription factors. • These DNA elements directly regulate gene expression. • Promoters and regulatory elements are traditionally determined by experimental analysis. • The process is extremely time consuming and laborious. • Computational prediction of promoters and regulatory elements is especially promising because it has the potential to replace a great deal of extensive experimental analysis.
  • 18. Identification of promoters and regulatory elements - a difficult task • First, promoters and regulatory elements are not clearly defined and are highly diverse. Each gene seems to have a unique combination of sets of regulatory motifs that determine its unique temporal and spatial expression. There is currently a lack of sufficient understanding of all the necessary regulatory elements for transcription. • Second, the promoters and regulatory elements cannot be translated into protein sequences to increase the sensitivity for their detection. • Third, promoter and regulatory sites to be predicted are normally short (six to eight nucleotides) and can be found in essentially any sequence by random chance, thus resulting in high rates of false positives associated with theoretical predictions.
  • 19. Promotor and Regulatory elements in Prokaryotes • In bacteria, transcription is initiated by RNA polymerase, which is a multi-subunit enzyme. • The RNA polymerase recognizes specific sequences located 35 and 10 base pairs (bp) upstream from the transcription start site upstream of a gene and binds. • The –35 box has a consensus sequence of TTGACA and the –10 box has a consensus of TATAAT. • In addition to the RNA polymerase, there are also a number of DNA-binding proteins called transcription factors which bind to specific DNA sequences referred to as regulatory elements. • They bind to a site several hundred bases away from the promoter.
  • 20. • In eukaryotes, gene expression is also regulated by a protein complex formed between transcription factors and RNA polymerase. • Three different types of RNA polymerase complexes, namely RNA polymerases I, II, and III where RNA polymerases I and III are responsible for the transcription of ribosomal RNAs and tRNAs, respectively and RNA polymerase II is exclusively responsible for transcribing mRNAs. • The core of many eukaryotic promoters is a so- called TATA box, located 30 bps upstream from the transcription start site, having a consensus motif TATA(A/T)A(A/T). Promotor and Regulatory elements in Eukaryotes
  • 21. • However, not all eukaryotic promoters contain the TATA box. • In addition, many genes have a unique initiator sequence (Inr), which is a pyrimidine rich sequence with a consensus (C/T)(C/T)CA(C/T) (C/T). • Some regulatory sites can be found tens of thousands base pairs away from the gene start site. • Occasionally, regulatory elements are located downstream instead of upstream of the transcription start site. Promotor and Regulatory elements in Eukaryotes
  • 22. NF1- nuclear factor 1 SP1-specificity protein 1
  • 23. PREDICTION ALGORITHMS • Current algorithms for predicting promoters and regulatory elements can be categorized as either – ab initio based - which make de novo predictions by scanning individual sequences; – phylogenetic footprinting - Similarity based - which make predictions based on alignment of homologous sequences;
  • 24. ab initio algorithm • This type of algorithm predicts prokaryotic and eukaryotic promoters and regulatory elements based on characteristic sequences patterns for promoters and regulatory elements. • Some ab initio programs are signal based, relying on characteristic promoter sequences such as the TATA box, whereas others rely on content information such as hexamer frequencies. • The advantage of the ab initio method is that the sequence can be applied as such without having to obtain experimental information. • The limitation is the need for training, which makes the prediction programs species specific. • In addition, this type of method has a difficulty in discovering new, unknown motifs.
  • 25. • The conventional approach to detecting a promoter or regulatory site is through – matching a consensus sequence pattern represented by regular expressions or – matching a position-specific scoring matrix (PSSM) constructed from well-characterized binding sites. • In either case, the consensus sequences or the matrices are relatively short, covering 6 to 10 bases. • This simple approach, however, often has difficulty differentiating true promoters from random sequence matches and generates high rates of false positives as a result. ab initio algorithm
  • 26. Prokaryotic Promoter Prediction • One of the unique aspects in prokaryotic promoter prediction is the determination of operon structures, because genes within an operon share a common promoter located upstream of the first gene of the operon. • Thus, operon prediction is the key in prokaryotic promoter prediction. • Once an operon structure is known, only the first gene is predicted for the presence of a promoter and regulatory elements, whereas other genes in the operon do not possess such DNA elements. • This method relies on two kinds of information: gene orientation and intergenic distances of a pair of genes of interest and conserved linkage of the genes based on comparative genomic analysis.
  • 27. Predicting Eukaryotic Promoters • The ab initio method for predicting eukaryotic promoters and regulatory elements also relies on searching the input sequences for matching of consensus patterns of known promoters and regulatory elements. • However, this approach tends to generate very high rate of false positives owing to nonspecific matches with the short sequence patterns. • To increase the specificity of prediction, a unique feature called CpG islands are detected where many vertebrate genes are characterized by a high density of CG dinucleotides near the promoter region overlapping the transcription start site. • By identifying the CpG islands, promoters can be traced on the immediate upstream region from the islands. • The promoter regions tend to contain a high density of protein-binding sites. Thus, finding a cluster of transcription factor binding sites often enhances the probability of individual binding site prediction.
  • 28. Tool for BPP • BPROM is a web-based program for prediction of bacterial promoters. • It uses a linear discriminant function combined with signal and content information such as consensus promoter sequence and oligonucleotide composition of the promoter sites. • This program first predicts a given sequence for bacterial operon structures by using an intergenic distance of 100 bp as basis for distinguishing genes to be in an operon.
  • 29. Tool for EPP • McPromoter (http://genes.mit.edu/McPromoter.html) is a web-based program that uses a neural network to make promoter predictions. • It has a unique promoter model containing six scoring segments. • The program scans a window of 300 bases for the likelihoods of being in each of the coding, noncoding, and promoter regions. • The input for the neural network includes parameters for sequence physical properties, such as DNA bendability, plus signals such as the TATA box, initiator box, and CpG islands.
  • 30. phylogenetic footprinting • It has been observed that promoter and regulatory elements from closely related organisms such as human and mouse are highly conserved. • The conservation is both at the sequence level and at the level of organization of the elements. • Therefore, it is possible to obtain such promoter sequences for a particular gene through comparative analysis. • The identification of conserved noncoding DNA elements that serve crucial functional roles is referred to as phylogenetic footprinting; the elements are called phylogenetic footprints.
  • 31. • This type of method can apply to both prokaryotic and eukaryotic sequences. • The selection of organisms for comparison is an important consideration in this type of analysis. • If the pair of organisms selected are too closely related, such as human and chimpanzee, the sequence difference between them may not be sufficient to filter out functional elements. On the other hand, if the organisms’ evolutionary distances are too long, such as between human and fish, long evolutionary divergence may render promoter and other elements undetectable. • Another caveat of phylogenetic footprinting is to extract noncoding sequences upstream of corresponding genes and focus the comparison to this region only, which helps to prevent false positives.
  • 32. Tool • ConSite (http://mordor.cgb.ki.se/cgi- bin/CONSITE/consite) is a web server that finds putative promoter elements by comparing two orthologous sequences. • The user provides two individual sequences which are aligned by ConSite using a global alignment algorithm. • Alternatively, the program accepts precomputed alignment. • Conserved regions are identified by calculating identity scores, which are then used to compare against a motif database of regulatory sites (TRANSFAC). • High-scoring sequence segments upstream of genes are returned as putative regulatory elements.