Genome Annotation
Definition:
It is the process of taking the raw DNA sequence
produced by the genome-sequencing projects and
adding the layers of analysis and interpretation
necessary to extract its biological significance and
place it into the context of our understanding of
biological processes.
• Today, the public international sequence databases contain
more than nine billion nucleotides, and the flow of new
sequences is increasing dramatically. For scientists, the
challenge is to exploit this huge volume of sequence data.
• The main objective of genome annotation is to extract
biological knowledge from anonymous genomic sequences.
• Extensive use of computer tools is needed to minimize
slow and costly human intervention. This is the reason
why annotation is often synonymous with prediction.
• The annotation work is divided into two steps: structural
annotation, which consists mainly of localizing gene elements;
and functional annotation, which aims at assigning a
biochemical function to the deduced gene products.
Structural annotation
The prediction of gene elements is a complex problem, and
getting it right is of primary importance because every
subsequent analysis depends on it.
• Eukaryotic genes with their mosaic structure are more difficult
to find than prokaryotic ones which are simple open reading
frames. The presence of introns complicates the problem,
although the binding sites of the spliceosome may be used to
predict the exact position of the exon borders.
• Depending on the prediction tool, the output describes
the splice sites, the exons or the whole gene (gene
modelling software).
Gene prediction in Prokaryotes
• Prokaryotes have relatively small genomes with sizes ranging
from 0.5 to 10 Mbp.
• But gene density in the genomes is high with more than 90%
of a genome sequence containing coding sequence.
• In bacteria, the majority of genes start with ATG, which codes
for methionine. Occasionally, GTG and TTG are used as
alternative start codons. These codons do not necessarily give
a clear indication of the translation initiation site. This is
overcome by the presence of the Shine-Dalgarno sequence, a
purine-rich stretch just upstream of the start codon that is
complementary to the 3′ end of the 16S rRNA in the ribosome.
• Many genes are transcribed together as one operon.
The end of the operon is marked by a transcription
termination signal called a rho-independent terminator.
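The start-codon rule above can be sketched as a simple open reading frame scan. This is a minimal illustration, not how any particular gene finder works: it looks only at the forward strand, ignores the Shine-Dalgarno sequence, and the `min_codons` cutoff is a hypothetical parameter that does not come from the slides.

```python
# Start codons named in the slides; stop codons are the standard three.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons is a hypothetical length cutoff chosen for illustration.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Because the scan takes the first in-frame start codon it meets, it tends to report the longest candidate ORF; choosing the true initiation site would need extra evidence such as an upstream Shine-Dalgarno match.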
Conventional determination of ORFs
• One method is based on the nucleotide composition at the
third position of codons. It has been observed that this
position shows a preference for G or C over A or T.
• By plotting the GC composition at this position, regions with
values significantly above the random level can be identified,
which are indicative of the presence of ORFs.
• There is a similar method called TESTCODE that exploits the
fact that the third codon nucleotides in a coding region tend
to repeat themselves.
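The third-position GC method can be sketched as a sliding-window calculation. The function name, the window size and the choice of a single reading frame are assumptions made for illustration; the slides do not specify them, and this sketch does not implement TESTCODE.

```python
def gc3_profile(seq, frame=0, window_codons=20):
    """Fraction of G/C at third codon positions in sliding windows.

    Returns (window_start, gc3_fraction) pairs for one reading frame.
    window_codons is a hypothetical window size chosen for illustration.
    """
    seq = seq.upper()
    # Third base of every complete codon in the chosen frame.
    thirds = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    profile = []
    for k in range(len(thirds) - window_codons + 1):
        window = thirds[k:k + window_codons]
        gc = sum(base in "GC" for base in window)
        profile.append((frame + 3 * k, gc / window_codons))
    return profile
```

Plotting the second element of each pair against the first, for all three frames, gives the kind of curve the slide describes: windows whose value sits well above the background level are candidate coding regions.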
Performance evaluation
• Accuracy can be described by two parameters, sensitivity and
specificity. Both are defined in terms of four counts: true
positives (TP), false positives (FP), false negatives (FN) and
true negatives (TN).
• TP: correctly predicted feature
• FP: incorrectly predicted feature
• FN: missed feature
• TN: correctly predicted absence of a feature
• Sensitivity (Sn) is the proportion of true signals that are
predicted, among all true signals.
• Specificity (Sp) is the proportion of true signals among all
signals that are predicted. (Outside gene prediction this
quantity is usually called precision.)
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
• Correlation coefficient (CC): provides an overall measure of
accuracy, ranging from -1 to +1.
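The three measures can be computed together from the four counts. Sn and Sp follow the formulas on the slide. The formula for CC did not survive in the slide text, so the sketch assumes the Matthews correlation coefficient, which is the form usually used in gene-prediction benchmarks and which also ranges from -1 to +1.

```python
import math


def prediction_accuracy(tp, fp, fn, tn):
    """Return (Sn, Sp, CC) for a set of predictions.

    Sn = TP/(TP+FN) and Sp = TP/(TP+FP), as defined on the slide.
    CC is ASSUMED to be the Matthews correlation coefficient; the
    slide's own formula is not available.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Example counts (made up): 80 features found, 20 false calls, 20 missed.
sn, sp, cc = prediction_accuracy(tp=80, fp=20, fn=20, tn=880)
```

A single number such as CC is convenient for ranking programs, but Sn and Sp are still worth reporting separately because a program can trade one for the other.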
Gene prediction in eukaryotes
• Eukaryotic nuclear genomes are much larger than prokaryotic
ones, with size ranging from 10 Mbp to 670 Gbp.
• They tend to have a very low gene density. For example, in
humans only 3% of the genome codes for genes, with about 1
gene per 100 kbp on average.
• The nascent mRNA undergoes post-transcriptional
modification before becoming a mature mRNA for protein
translation.
• The main issue in prediction of eukaryotic genes is the
identification of exons, introns and splicing sites.
Prediction can be made on the basis of:
• Presence of conserved sequences: splice junctions of introns and
exons follow the GT-AG rule (introns begin with GT and end with AG).
Most vertebrate genes use ATG as the translation start codon, set
within a conserved context called the Kozak sequence (CCGCCATGG).
• Statistical patterns: nucleotide composition and codon bias in the
coding regions of eukaryotes differ from those of the non-coding
regions.
• Presence of CpG islands: most of these genes have a high density of
CG dinucleotides near the transcription start site. Here ‘p’ refers to
the phosphodiester bond between the two nucleotides.
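Two of the signals listed above are easy to express in code. The sketch below checks the GT-AG rule on a candidate intron and measures CpG density with the observed/expected ratio, (number of CpG × length) / (number of C × number of G). That ratio is a commonly used measure and is an assumption here; the slides only say that CG dinucleotides are dense near the transcription start site. Function names are hypothetical.

```python
def follows_gt_ag(intron):
    """True if a candidate intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def cpg_stats(seq):
    """Return (gc_fraction, cpg_observed_over_expected) for a sequence.

    The observed/expected ratio is an assumed measure of CpG density,
    not one defined in the slides.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp
```

Applied in a sliding window upstream of a candidate gene, a run of windows with high GC fraction and a high ratio would suggest a CpG island; any cutoff values would have to be chosen by the user.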
Gene prediction programs
Ab initio-based programs:
• These programs discriminate exons from non-coding sequences
and then join the exons together in the correct order.
• They rely on two features: gene signals and gene content.
• In addition to HMMs, discriminant analysis and neural-network-
based algorithms are also used in gene prediction.
• Neural networks:
A neural network is a statistical model with a special
architecture for pattern recognition and classification. It is
built from multiple layers: input, output and hidden layers.
The output is the probability of the exon structure. GRAIL is
a program based on a neural network algorithm.
Fig: Architecture of a neural network for eukaryotic gene
prediction
• Prediction using HMMs:
- GENSCAN is a web-based program built on fifth-order HMMs.
- HMMgene is also a web-based program. It uses a criterion
called conditional maximum likelihood to discriminate coding
from non-coding features.
• Prediction using Discriminant Analysis:
- Some gene prediction algorithms rely on discriminant
analysis, either linear discriminant analysis (LDA) or
quadratic discriminant analysis (QDA).
- LDA works by plotting a 2-D graph of coding signals versus all
potential 3′ splice site positions and drawing a diagonal line
that best separates coding signals from non-coding signals,
based on knowledge learned from training data sets of known
gene structures.
- QDA draws a curved line based on a quadratic function to
separate coding and non-coding features.
Programs used are: FGENES, FGENESH, FGENESH_C,
FGENESH+, MZEF
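The idea of drawing a straight line that best separates coding from non-coding points can be sketched with a two-feature Fisher linear discriminant. This is a generic textbook version under stated assumptions, not the implementation used by any of the programs named above: the two features are hypothetical scores, both classes are assumed equally likely, and the boundary is placed midway between the projected class means.

```python
def fit_lda(coding, noncoding):
    """Fit a two-feature Fisher linear discriminant.

    coding / noncoding are lists of (x, y) feature pairs from training data.
    Returns (w, threshold); a point p is called coding when w.p > threshold.
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        return sxx, sxy, syy

    m1, m0 = mean(coding), mean(noncoding)
    a1, b1, d1 = scatter(coding, m1)
    a0, b0, d0 = scatter(noncoding, m0)
    a, b, d = a1 + a0, b1 + b0, d1 + d0  # pooled within-class scatter
    det = a * d - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((d * dx - b * dy) / det, (a * dy - b * dx) / det)
    # Boundary midway between the projected class means (equal priors assumed).
    threshold = 0.5 * (w[0] * (m1[0] + m0[0]) + w[1] * (m1[1] + m0[1]))
    return w, threshold


def is_coding(point, w, threshold):
    """Classify one (x, y) feature pair against the fitted line."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

QDA differs in that each class keeps its own scatter matrix, which makes the boundary a quadratic curve rather than a line.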
Homology-based programs:
- These are based on the fact that exon structures and exon
sequences of related species are highly conserved.
- When coding frames in a query sequence are translated and
aligned with the closest protein homologs found in a
database, nearly perfectly matched regions can be used to
reveal the exon boundaries in the query.
- Programs used are:
GenomeScan, EST2Genome, SGP-1, TwinScan
Consensus-based programs:
These programs combine the results of multiple prediction
programs and keep the predictions on which they agree.
However, this may lead to lowered sensitivity and missed
predictions.
Examples of consensus-based programs are: GeneComber, DIGIT
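The consensus idea can be illustrated with a per-base vote. This is a toy sketch and not the algorithm of GeneComber or DIGIT; the program names in the example and the `min_votes` parameter are hypothetical.

```python
def consensus_positions(predictions, min_votes):
    """Return sorted positions called coding by at least min_votes programs.

    predictions maps a program name to a list of (start, end) exon
    intervals, 0-based and end-exclusive.
    """
    votes = {}
    for intervals in predictions.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        for pos in covered:  # one vote per program per position
            votes[pos] = votes.get(pos, 0) + 1
    return sorted(pos for pos, n in votes.items() if n >= min_votes)


# Hypothetical exon calls from three programs on the same sequence.
calls = {
    "prog_a": [(0, 10)],
    "prog_b": [(5, 15)],
    "prog_c": [(8, 12)],
}
strict = consensus_positions(calls, min_votes=3)
relaxed = consensus_positions(calls, min_votes=2)
```

Raising `min_votes` shrinks the set of positions that survive, which is the lowered sensitivity mentioned above: anything found by only one program is discarded, whether or not it was correct.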
Functional annotation
• At present, functional genome annotation is based on the
idea that sequence similarity detected between two proteins
means that they are homologs, i.e. they come from the same
ancestor and share the same biochemical function.
• Therefore, for each predicted gene, the protein is deduced
from the coding region and compared against the protein
databases using BLASTP.
• If the similarities detected are considered relevant, the name
(function) of the putative homologous protein is assigned
to the prediction.
• Naming nevertheless tends to follow this convention:
when a predicted gene product is 100% identical to an
already characterized protein, it receives the same name,
whereas sequences with stringent similarity to known
proteins are called ‘putative’ proteins of the same name.
• Sequences for which only similarities to ESTs are detected
are named ‘unknown’ proteins.
• Finally, genes without similar sequences and, hence, only
deduced from intrinsic prediction programs are labelled
‘hypothetical’.
• Some annotators confirm and complete the BLAST results with
full-length alignments between the query protein and the
closest homologue detected, and by looking for motifs and
family signatures.
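The naming convention described above amounts to a small decision rule. The sketch below is one way to write it down; the function name, its arguments and the `min_identity` cutoff standing in for "stringent similarity" are all hypothetical, since the slides give no numeric threshold.

```python
def annotation_label(protein_name=None, identity=0.0, has_est_hit=False,
                     min_identity=40.0):
    """Choose a product name following the convention on the slide.

    protein_name: name of the best characterized protein hit, if any.
    identity: percent identity to that hit.
    has_est_hit: True if the gene matches ESTs only.
    min_identity: HYPOTHETICAL cutoff for "stringent similarity".
    """
    if protein_name is not None and identity >= 100.0:
        return protein_name
    if protein_name is not None and identity >= min_identity:
        return "putative " + protein_name
    if has_est_hit:
        return "unknown protein"
    return "hypothetical protein"
```

In practice the cutoff would also depend on alignment length and coverage, which is why some annotators follow up with the full-length alignments and motif searches mentioned above.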
Automatic genome annotation pipelines
• The primary goal of the pipeline process is to deliver highly
accurate and reliable genome annotations, using the widest
possible range of evidence from available databases.
• As pipelines have evolved, the trend has been to move away
from single algorithm methods and towards consensus-based
approaches.
• Pipelines integrate suites of bioinformatics software tools
with multiple databases to automatically manage the analysis
and storage of genomic sequence.
• Genomic sequences pass through several successive levels of
algorithms. Each layer of processing provides further
refinement of annotation detail.
Fig: The generic structure of an automatic genome annotation pipeline
and delivery system.
Genomic pipelines:
Several genomic pipelines exist worldwide. Publicly funded
projects include
• Ensembl at the European Bioinformatics Institute (EBI)/Sanger
Institute,
• NCBI Analysis Pipeline,
• Oak Ridge National Laboratory (ORNL) Genome Channel.
THANK YOU
