Genome Annotation
Definition:
It is the process of taking the raw DNA sequence
produced by the genome-sequencing projects and
adding the layers of analysis and interpretation
necessary to extract its biological significance and
place it into the context of our understanding of
biological processes.
• Today, the public international sequence databases contain
more than nine billion nucleotides, and the flow of new
sequences is increasing dramatically. For scientists, the
challenge is to exploit this huge volume of sequence data.
• To extract biological knowledge from anonymous genomic
sequences is the main objective of genome annotation.
• The extensive use of computer tools is needed to minimize
the slow and costly human interventions. This is the reason
why annotation is often synonymous with prediction.
• The annotation work is divided into two steps: structural
annotation, which consists mainly of localizing gene elements;
and functional annotation, which aims at assigning a
biochemical function to the deduced gene products.
Structural annotation
The prediction of gene elements is a complex problem,
and its outcome is of primary importance because it affects all
the analyses that follow.
• Eukaryotic genes with their mosaic structure are more difficult
to find than prokaryotic ones which are simple open reading
frames. The presence of introns complicates the problem,
although the binding sites of the spliceosome may be used to
predict the exact position of the exon borders.
• Depending on the prediction tool, the output covers
the splice sites, the exons or the whole gene (gene
modelling software).
Gene prediction in Prokaryotes
• Prokaryotes have relatively small genomes with sizes ranging
from 0.5 to 10 Mbp.
• Gene density in these genomes is high, however, with more than 90%
of the genome sequence consisting of coding sequence.
• In bacteria, the majority of genes start with ATG, which codes for
methionine. Occasionally, GTG and TTG are used as
alternative start codons. These codons do not necessarily give a
clear indication of the translation initiation site. This is
resolved by the presence of the Shine-Dalgarno sequence,
a purine-rich stretch upstream of the start codon that is
complementary to the 3' end of the 16S rRNA in the ribosome.
• Many genes are transcribed together as one operon.
The end of the operon is marked by a transcription
termination signal called a rho-independent terminator.
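The start-codon rule above can be sketched as a naive open reading frame scan. This is only an illustration under simplifying assumptions: it reads the forward strand only, treats the first ATG, GTG or TTG in a frame as the start, and does not look for a Shine-Dalgarno sequence. The function name `find_orfs` and the `min_codons` cutoff are hypothetical, not taken from any particular program.

```python
# Illustrative sketch: naive forward-strand ORF scan (assumptions noted in comments).
START_CODONS = {"ATG", "GTG", "TTG"}  # ATG is the usual start; GTG/TTG are occasional alternatives
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the start codon; end is exclusive and
    includes the stop codon. min_codons is a hypothetical length cutoff
    counting codons from the start codon up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon seen in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in the same frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAGTT"))
```

A real prokaryotic gene finder would also scan the reverse complement and score the region upstream of each candidate start for a ribosome binding site.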
Conventional determination of ORFs
• One method is based on the nucleotide composition of the
third position of a codon. It has been observed that, in coding
regions, this position shows a preference for G or C over A or T.
• By plotting the GC composition at this position, regions with
values significantly above the random level can be identified,
which are indicative of the presence of ORFs.
• There is a similar method called TESTCODE that exploits the
fact that the third codon nucleotides in a coding region tend
to repeat themselves.
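The third-position GC method can be sketched as a sliding-window calculation. This is a minimal illustration, not the TESTCODE algorithm; the window and step sizes are arbitrary assumptions, and the function names are hypothetical.

```python
# Illustrative sketch: GC content at third codon positions (GC3) in sliding windows.

def gc3_fraction(seq, frame=0):
    """Fraction of G or C at third codon positions, reading seq in the given frame."""
    thirds = seq.upper()[frame + 2::3]  # every third base, starting at codon position 3
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc3_windows(seq, window=300, step=30, frame=0):
    """Return (window_start, GC3) pairs; window and step must be multiples of 3
    so that every window stays in the same reading frame."""
    if window % 3 or step % 3:
        raise ValueError("window and step must be multiples of 3")
    return [(s, gc3_fraction(seq[s:s + window]))
            for s in range(frame, len(seq) - window + 1, step)]


if __name__ == "__main__":
    for start, value in gc3_windows("ATGAAGCCCATAAATCCT", window=9, step=9):
        print(start, value)
```

Windows whose GC3 value sits well above the genome-wide background would be the candidate ORF regions; what counts as "well above" has to be chosen per genome.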
Performance evaluation
• Accuracy can be described by two parameters,
sensitivity and specificity. To define them, four
quantities are used: true positive (TP), false positive (FP), false
negative (FN), true negative (TN).
• TP: correctly predicted feature
• FP: incorrectly predicted feature
• FN: missed feature
• TN: correctly predicted absence of a feature
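• Sensitivity is the proportion of true signals that are predicted among
all true signals.
• Specificity is the proportion of true signals among all signals
that are predicted. (With this definition, common in gene prediction,
specificity corresponds to what is elsewhere called precision.)
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)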
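• Correlation coefficient (CC):
CC = (TP×TN - FP×FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
• The value of CC provides an overall measure of accuracy and
ranges from -1 to +1, where +1 is a perfect prediction.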
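As a worked example, the three measures can be computed from the four counts. The sketch assumes the usual form of the correlation coefficient used in gene prediction benchmarks (the Matthews correlation coefficient); the counts in the usage line are made up for illustration.

```python
# Illustrative sketch: Sn, Sp and CC from TP/FP/FN/TN counts.
import math


def sensitivity(tp, fn):
    """Sn = TP / (TP + FN)"""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined in the gene prediction literature."""
    return tp / (tp + fp)


def correlation_coefficient(tp, fp, fn, tn):
    """CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up counts: 80 correct, 20 spurious, 20 missed, 880 correctly absent.
    print(sensitivity(80, 20), specificity(80, 20),
          correlation_coefficient(80, 20, 20, 880))
```

With these counts, Sn and Sp are both 0.8 and CC is 70000/90000, about 0.78.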
Gene prediction in eukaryotes
• Eukaryotic nuclear genomes are much larger than prokaryotic
ones, with size ranging from 10 Mbp to 670 Gbp.
• They tend to have a very low gene density. For example, in
humans only about 3% of the genome codes for genes, with about 1
gene per 100 kbp on average.
• The nascent mRNA undergoes post-transcriptional
modification (capping, splicing and polyadenylation) before
becoming a mature mRNA for protein translation.
• The main issue in prediction of eukaryotic genes is the
identification of exons, introns and splicing sites.
Prediction can be made on the basis of:
• Presence of conserved sequences - Splice junctions of introns and
exons follow the GT-AG rule: introns begin with GT and end with AG.
Most vertebrate genes use ATG as the translation start codon, flanked
by a conserved sequence called the Kozak sequence (CCGCCATGG).
• Statistical patterns - Nucleotide composition and codon bias in
coding regions of eukaryotes differ from those of non-coding
regions.
• Presence of CpG islands - Most of these genes have a high density of
CG dinucleotides near the transcription start site. Here ‘p’ refers to
the phosphodiester bond between the two nucleotides.
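The CpG island signal can be sketched as a simple dinucleotide count. The thresholds used below (length of at least 200 bp, GC content above 0.5, observed/expected CpG ratio above 0.6) are commonly cited conventions and an assumption here; they are not stated in the slides. The function names are hypothetical.

```python
# Illustrative sketch: GC content and observed/expected CpG ratio of a sequence.

def cpg_stats(seq):
    """Return (gc_content, observed_over_expected_cpg) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")   # CG cannot overlap itself, so count() is exact
    expected = c * g / n         # expected CpG count if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply assumed, conventional thresholds; tune them for real data."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > min_gc and ratio > min_ratio


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```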
Gene prediction programs
Ab initio-based programs:
• These programs discriminate exons from non-coding sequences and
subsequently join them together in the correct order.
• They rely on two kinds of features: gene signals and gene content.
• In addition to HMMs, discriminant analysis and neural network-based
algorithms are also used in gene prediction.
• Neural networks:
A neural network is a statistical model with a special architecture
for pattern recognition and classification. Multiple layers are
constructed: input, hidden and output layers. The output is the
probability of the exon structure. GRAIL is a program based on a
neural network algorithm.
Fig: Architecture of a neural network for eukaryotic gene prediction
• Prediction using HMMs:
- GENSCAN is a web-based program built on fifth-order HMMs.
- HMMgene is also a web-based program. It uses a criterion called
conditional maximum likelihood to discriminate coding
from non-coding features.
• Prediction using Discriminant Analysis:
- Some gene prediction algorithms rely on discriminant analysis,
either linear discriminant analysis (LDA) or quadratic discriminant
analysis (QDA).
- LDA works by plotting a 2-D graph of coding signals versus all
potential 3’ splice site positions and drawing a diagonal line
that best separates coding signals from non-coding signals
based on knowledge learned from training data sets of known
gene structures.
- QDA draws a curved line based on a quadratic function to
separate coding and non-coding features.
Programs used are: FGENES, FGENESH, FGENESH_C,
FGENESH+, MZEF
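The idea of a separating line in LDA can be sketched in two dimensions. This is a toy Fisher discriminant in pure Python, not the method used by any of the programs listed; the two features per candidate (for example a coding-potential score and a splice-site score), the training points and the equal-prior midpoint threshold are all assumptions for illustration.

```python
# Illustrative sketch: two-class linear discriminant on 2-D feature points.

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_lda(coding, noncoding):
    """Return (w, threshold); a point x is called coding if w.x > threshold."""
    m1, m0 = mean2(coding), mean2(noncoding)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for pts, m in ((coding, m1), (noncoding, m0)):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("scatter matrix is singular")
    dmx, dmy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dmx - sxy * dmy) / det, (-sxy * dmx + sxx * dmy) / det)
    # Threshold at the projected midpoint of the class means (assumes equal priors).
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def is_coding(point, w, threshold):
    return w[0] * point[0] + w[1] * point[1] > threshold


if __name__ == "__main__":
    coding = [(0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (1.0, 1.0)]      # made-up training data
    noncoding = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.0, 0.2)]   # made-up training data
    w, t = fit_lda(coding, noncoding)
    print(is_coding((0.9, 0.9), w, t), is_coding((0.1, 0.1), w, t))
```

QDA differs in that each class gets its own covariance estimate, which turns the straight boundary into a curved one.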
Homology-based programs:
- These are based on the fact that exon structures and exon
sequences of related species are highly conserved.
- When coding frames in a query sequence are translated and
aligned with the closest protein homologs found in a
database, nearly perfectly matched regions can be used to
reveal the exon boundaries in the query.
- Programs used are:
GenomeScan, EST2Genome, SGP-1, TwinScan
• Consensus-based programs:
These programs combine the results of multiple prediction
programs and retain the predictions on which the programs
agree. However, this may lead to lowered sensitivity
and missed predictions.
Examples of consensus-based programs are: GeneComber, DIGIT
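The trade-off described for consensus methods can be sketched with a simple vote over exon coordinates. This is a hypothetical scheme for illustration only and does not describe how GeneComber or DIGIT work internally; exons are assumed to be (start, end) tuples that must match exactly to count as the same prediction.

```python
# Illustrative sketch: keep exons predicted by at least min_votes programs.
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """predictions is a list with one collection of (start, end) exons per program."""
    votes = Counter(exon for exons in predictions for exon in set(exons))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


if __name__ == "__main__":
    program_a = [(100, 200), (300, 400), (500, 600)]  # made-up predictions
    program_b = [(100, 200), (300, 410)]
    program_c = [(100, 200), (300, 400)]
    runs = [program_a, program_b, program_c]
    print(consensus_exons(runs, min_votes=2))
    print(consensus_exons(runs, min_votes=3))
```

Raising `min_votes` removes predictions that only one program supports, but it also drops exons such as (300, 400) once full agreement is required, which is the loss of sensitivity the slide refers to.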
Functional annotation
• At the present time, the functional genome annotation is
based on the idea that some sequence similarities detected
between two proteins mean that they are homologs i.e. they
come from the same ancestor and share the same
biochemical function.
• Therefore, for each predicted gene, the protein is deduced
from the coding region and compared against the protein
databases using BLASTP.
• If the similarities detected are considered relevant, the name
(function) of the putative homologue protein is associated
with the prediction.
• The tendency is nevertheless the following:
when a predicted gene product is 100 % identical to an
already characterized protein, it receives the same name,
whereas sequences with stringent similarity to known
proteins are called ‘putative’ proteins of the same name.
• Sequences for which only similarities to ESTs are detected
are named ‘unknown’ proteins.
• Finally, genes without similar sequences and, hence, only
deduced from intrinsic prediction programs are labelled
‘hypothetical’.
• Some annotators confirm and complete the Blast results by
full-length alignments between the query protein and the
closest homologue detected, and by looking for motifs and
family signatures.
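The naming conventions above can be written as a small decision rule. This is a sketch of the slide's description only; the function name, its parameters and the way "significant similarity" is supplied as a precomputed flag are assumptions, since the slides give no numeric cutoff.

```python
# Illustrative sketch: assign a product name from similarity evidence.

def annotate_name(best_hit_name=None, identity=0.0, significant=False, est_hit=False):
    """Return a product name following the conventions described above.

    best_hit_name: name of the closest characterized protein, if any
    identity:      percent identity to that protein (0-100)
    significant:   True if the similarity passed the annotator's own cutoff
    est_hit:       True if only EST similarity was found
    """
    if best_hit_name and identity >= 100.0:
        return best_hit_name                   # identical to a characterized protein
    if best_hit_name and significant:
        return f"putative {best_hit_name}"     # strong but not complete similarity
    if est_hit:
        return "unknown protein"               # expressed, but no protein homolog
    return "hypothetical protein"              # supported by ab initio prediction only


if __name__ == "__main__":
    print(annotate_name("alcohol dehydrogenase", identity=100.0, significant=True))
    print(annotate_name("alcohol dehydrogenase", identity=62.5, significant=True))
    print(annotate_name(est_hit=True))
    print(annotate_name())
```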
Automatic genome annotation pipelines
• The primary goal of the pipeline process is to deliver highly
accurate and reliable genome annotations, using the widest
possible range of evidence from available databases.
• As pipelines have evolved, the trend has been to move away
from single algorithm methods and towards consensus-based
approaches.
• Pipelines are the integration of suites of bioinformatics
software tools with multiple databases, to manage
automatically the analysis and storage of genomic sequence.
• Genomic sequences pass through several successive levels of
algorithms. Each layer of processing provides further
refinement of annotation detail.
Fig: The generic structure of an automatic genome annotation pipeline
and delivery system.
Genomic pipelines:
Several genomic pipelines exist worldwide. Publicly funded
projects include
• Ensembl at the European Bioinformatics Institute (EBI)/Sanger
Institute,
• NCBI Analysis Pipeline,
• Oak Ridge National Laboratory (ORNL) Genome Channel.
THANK YOU
