Structural annotation................pptx

STRUCTURAL ANNOTATION
• Structural annotation of genome by computer analysis of sequence data and experimental
techniques

INTRODUCTION
• The scope of genome annotation has expanded since the first complete annotation
of the Haemophilus influenzae genome in 1995 (Fleischmann et al., 1995).
• Once a DNA sequence has been obtained, whether it is the sequence of a single
cloned fragment or of an entire chromosome, then various methods can be
employed to locate the genes that are present.
• These methods can be divided into those that involve simply inspecting the
sequence, by eye or more frequently by computer, to look for the special sequence
features associated with genes, and those methods that locate genes by
experimental analysis of the DNA sequence. The computer methods form part of the
methodology called bioinformatics.
• The first software used to analyze sequencing reads was the ‘Staden Package’,
created by Rodger Staden in 1977 (Staden, 1977).

STRUCTURAL ANNOTATION
• Finding features of DNA (exons, introns, promoters, transposons, etc.) is known as
structural annotation. Structural annotation attempts to find genes in a genomic sequence.
• A gene can be defined as "a sequence region necessary for generating functional
products". Functional products of genes are proteins and RNAs. Genes that lead to the
production of proteins are called protein-coding genes.
• Other genes that do not code for proteins, but instead for functional RNA molecules, are called
noncoding genes. Noncoding RNA genes include genes for ribosomal RNA (rRNA),
transfer RNA (tRNA), microRNA (miRNA), small nuclear RNA and small nucleolar RNA (snRNA
and snoRNA, respectively), and long noncoding RNA (lncRNA).
• Structural annotations also identify pseudogenes. They were initially considered to be
functionless and evolutionary dead-ends. We now know that they sometimes participate in
gene regulation. Hence, their prediction improves our understanding of genomes.

GENE LOCATION BY SEQUENCE INSPECTION
• Sequence inspection can be used to locate genes because genes are not random series of nucleotides but
instead have distinctive features.
• At present we do not fully understand the nature of all of these specific features, and sequence inspection is
therefore not a foolproof way of locating genes, but it is still a powerful tool and is usually the first method that
is applied to analysis of a new genome sequence.

The coding regions of genes are open reading frames
• Genes that code for proteins comprise open reading frames
(ORFs) consisting of a series of codons that specify the amino
acid sequence of the protein that the gene codes for.
• The ORF begins with an initiation codon, usually (but not
always) ATG, and ends with a termination codon: TAA, TAG, or
TGA. Searching a DNA sequence for ORFs that begin with an
ATG and end with a termination triplet is therefore one way of
looking for genes.
• The analysis is complicated by the fact that each DNA
sequence has six reading frames, three in one direction and
three in the reverse direction on the complementary strand, but
computers are quite capable of scanning all six reading frames
for ORFs.
A PROTEIN-CODING GENE IS AN OPEN READING FRAME
OF TRIPLET CODONS
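
The six-frame ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function names, the `min_codons` threshold, and the choice to count the stop codon in the ORF length are assumptions made for the example, and only ATG is treated as an initiation codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned (for "-" they
    are positions on the reverse complement). The length threshold counts
    the stop codon as one codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the open ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGGCTGCTGCTGCTTAAG", min_codons=5)` reports a single ORF on the forward strand in frame 2. An ORF that runs off the end of the sequence without a stop codon is deliberately not reported.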

• With bacterial genomes, simple ORF scanning is an effective way of locating most of the genes in a
DNA sequence.
• With bacteria the analysis is further simplified by the fact that the genes are very closely spaced and
hence there is relatively little intergenic DNA in the genome (only 11% for E. coli).
• If we assume that the real genes do not overlap, which is true for most bacterial genes, then it is only
in the intergenic regions that there is a possibility of mistaking a short, spurious ORF for a real gene.
• So if the intergenic component of a genome is small, then there is a reduced chance of making
mistakes in interpreting the results of a simple ORF scan.
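
Why short ORFs are the ones likely to be spurious can be made concrete with a small calculation. The sketch below assumes random DNA with equal base frequencies, so that 3 of the 64 codons are stop codons; real genomes deviate from this, so the numbers are only indicative. The function name is hypothetical.

```python
def prob_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random codons contain no stop codon.

    Assumes independent codons and equal base frequencies (3 stops out of 64).
    """
    return (1.0 - p_stop) ** n_codons


for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under these assumptions, a run of 100 codons without a stop occurs by chance with a probability of roughly 0.8%, which is why a minimum ORF length is commonly applied when scanning.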

Repeats
• The first step in structural annotation involves repeat masking. DNA repeats occur in both
prokaryotic and eukaryotic organisms.
• The repeats account for 0% to over 42% of the prokaryotic genome. Similarly, eukaryotic
genomes can harbor millions of repeats.
• For instance, repeats account for two-thirds of the human genome. Repeat sequences can be
localized in tandem, i.e., adjacent to one another, and are typically found in the centromere.
Alternatively, they can be interspersed in different forms of transposable elements, e.g., in long
and short interspersed nuclear elements (LINEs and SINEs), DNA transposons, etc.
• Repeat masking tools rely on databases with lists of already identified repeats. RepeatMasker is
a good example of such a tool.
• Aligning transcript and protein evidence after masking is the second step of structural annotation
before gene identification, although it is not mandatory. BLAST or BLAT can be used to align the
transcript and protein evidence.
• Further, RNA-seq evidence can be aligned using TopHat or HISAT.
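
The idea of masking against a list of known repeats can be illustrated with a toy example. This is not how RepeatMasker works internally (it uses sensitive alignment against curated repeat libraries); the sketch only lowercases exact matches of hypothetical library sequences, which is the "soft-masking" convention of marking repeats in lower case.

```python
def soft_mask(genome, repeat_library):
    """Lowercase every exact occurrence of a library repeat in `genome`.

    `repeat_library` is a hypothetical list of known repeat sequences.
    Overlapping occurrences are all masked.
    """
    genome = genome.upper()
    masked = list(genome)
    for repeat in repeat_library:
        repeat = repeat.upper()
        if not repeat:
            continue
        pos = genome.find(repeat)
        while pos != -1:
            for i in range(pos, pos + len(repeat)):
                masked[i] = masked[i].lower()
            pos = genome.find(repeat, pos + 1)
    return "".join(masked)
```

Downstream tools can then skip or down-weight the lower-case regions while the original sequence remains recoverable.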

Predictions of Gene and Different Features
• Identifying protein-coding genes and other regulatory elements takes center stage in gene annotation. Gene
prediction is a complex process, especially for eukaryotic DNA.
• The varying sizes of introns (noncoding sequences) between exons, together with alternative splice variants, make
gene structure prediction difficult.
• Many gene prediction programs exist. They can be categorized into three groups: ab initio methods,
homology-based methods, and combined methods.
• Approaches for gene prediction based on nucleotide sequence are called ab initio methods. Ab initio
approaches rely on statistical models, such as the hidden Markov model (HMM), to identify promoters, coding
or noncoding regions, and intron–exon junctions in the genome sequence.
• The second approach aligns the sequence with expressed sequence tags (EST), complementary DNA (cDNA),
or protein evidence, and uses detected similarities for gene prediction.

• The other group comprises programs that combine ab initio and evidence- or homology-based
approaches for gene prediction.
• In addition, gene prediction programs should be able to predict alternative splicing sites because
alternative splicing is a major factor in the regulation of gene expression and in transcriptome and proteome
diversity.
• Accordingly, gene prediction programs use various models to predict splice sites. Since approximately
99% of the introns in sequenced genomes begin with GT and end with AG, these features are denoted as
mandatory by most gene prediction systems for splice site detection.
• In addition, incorporation of a strong splice donor consensus, such as the GC–AG splice site, improves
the accuracy of gene prediction programs.
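
The GT...AG rule mentioned above can be turned into a simple candidate filter. The sketch below only enumerates spans that start with GT and end with AG within assumed length limits; real gene predictors score the surrounding consensus as well, so this is a first-pass illustration with hypothetical parameter defaults.

```python
def candidate_introns(seq, min_len=20, max_len=10000):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive. The length limits are
    illustrative assumptions, not values from any specific program.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns
```

Because GT and AG dinucleotides are very common, this produces many false candidates on real sequence, which is why the text notes that predictors combine the rule with statistical splice-site models.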

Commonly used gene prediction programs and their classification, based on the above discussion

https://services.healthtech.dtu.dk/service.php?EasyGene-1.2
http://www.softberry.com/berry.phtml?topic=fgenes&group=programs&subgroup=gfind
http://opal.biology.gatech.edu/GeneMark/
http://www.genezilla.org/
http://hollywood.mit.edu/GENSCAN.html

http://ccb.jhu.edu/software/glimmerhmm/
https://services.healthtech.dtu.dk/service.php?HMMgene-1.1
https://galaxy.inf.ethz.ch/tool_runner?tool_id=mgenepredict
https://services.healthtech.dtu.dk/service.php?NetGene2-2.42
http://www.cbs.dtu.dk/services/RNAmmer/
https://github.com/KorfLab/SNAP
http://lowelab.ucsc.edu/tRNAscan-SE/

http://galaxy.informatik.uni-halle.de/
http://genomethreader.org/
https://mblab.wustl.edu/software.html
http://www.pseudogene.org/pseudopipe/
https://mblab.wustl.edu/software.html
http://bioinf.uni-greifswald.de/augustus/
http://www.cbcb.umd.edu/software/jigsaw/

Databases for Structural Annotation
• Annotations require supporting data that can be used or presented as evidence of predicted
assignments. Currently, homology-based methods play a central role in genome annotation because
of the huge amount of EST and cDNA sequences available.
• Homology-based methods depend on DNA, RNA, or protein sequence alignment data, which can
easily be retrieved from biological databases. Ab initio annotations, on the other hand, identify genes
and their structures using mathematical models.
• Nonetheless, the ab initio gene predictors have to be trained using high-quality gene models or
organism-specific genome traits, such as codon frequency and intron–exon length distribution.
Further, ab initio models require ESTs, RNA-seq data, and proteins to improve prediction accuracy.
• Databases readily provide such data. Nucleotide and protein sequences or structures can easily be
found in comprehensive public-domain databases, e.g., GenBank, the European Nucleotide Archive
(ENA), and the DNA Databank of Japan (DDBJ). UniProt, which is a protein sequence database that
combines UniProtKB/Swiss-Prot (over 560,000 manually curated sequences) and
UniProtKB/TrEMBL (180 million automatically annotated sequences), provides the scientific
community with high-quality and freely accessible protein sequences with the associated functional
information.

Comparative Annotation Methods
• Genome annotation achieved by comparison of genes and genomes across species can be a reliable
information source for understanding genome evolution. Comparative annotation allows annotations of a
well-studied genome to be projected onto an evolutionarily close species. It often focuses on the coding
genes.
• Valuable information for comparative annotation can be found from genome alignment. A well-aligned
genome will yield sound data for comparative annotation.
• Approaches to comparative annotation of genomes can be categorized into ab initio methods and
homology-based methods, depending on the input information used for annotation, i.e., either a statistical
model of genes, or protein sequence, EST, and cDNA, respectively.
• Ab initio approaches are preferred for genes that are weakly represented or absent in the RNA-seq library,
have insufficient similarity to any known protein, and lack other evidence.

• Related species have genomes that share similarities inherited from their common ancestor,
overlaid with species-specific differences that have arisen since the species began to evolve
independently. Because of natural selection, the sequence similarities between related genomes
are greatest within the genes and lowest in the intergenic regions.
• Therefore, when related genomes are compared, homologous genes are easily identified
because they have high sequence similarity, and any ORF that does not have a clear homolog in
the second genome can be discounted as almost certainly being a chance sequence and not a
genuine gene. This type of analysis is called comparative genomics.

Homology-Based Annotation
• Homology-based annotation predicts and annotates genes by identifying significant matches to a
well-annotated genome sequence, using alignment tools such as BLAST.
• Homology-based annotations use the coding sequences (CDS), usually protein sequences and
sometimes transcripts in the form of mRNA, cDNA, or EST to predict genes, assuming similar sequence
regions encode homologous proteins.
• Tools like Exonerate and DIALIGN can be used for sequence alignment; GenomeThreader and
AGenDA are used for gene predictions. Increased evolutionary distance between the input protein and
the target protein reduces the accuracy of homology-based gene finding. This happens because of
heavy reliance on the alignment and information derived from the already known genes, which creates a
challenge in identifying genes whose properties are different from those of referenced genes.
• However, newer comparative approaches solve this issue by relying to a greater degree on sequence
conservation, which enables them to identify genes with new features and different statistical
composition.
• TWINSCAN and SGP2 are examples of tools in which gene prediction uses the analysis of sequence
conservation patterns between genomic sequences of evolutionarily related organisms.
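
Homology-based tools ultimately rest on scoring how well two sequences align. As a small illustration, the sketch below computes a Smith-Waterman local alignment score with a linear gap penalty. This is a textbook algorithm swapped in for illustration; BLAST and the other tools named above use faster heuristics, and the scoring values here are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory use is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A high score between a query protein or transcript and a genomic region is the kind of "significant match" on which homology-based prediction relies; as the text notes, the signal weakens as evolutionary distance grows.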

Ab Initio Annotation
• Ab initio annotation relies on ab initio gene predictors, which in turn rely on training data to construct an
algorithm or model.
• Prediction is done based on the genomic sequence in question, using statistical analysis and other gene
signals such as k-mer statistics and frame length.
• Some popular ab initio gene predictors are discussed below. AUGUSTUS defines the probability distributions
for eukaryotic genome sequences based on a generalized hidden Markov model (GHMM). AUGUSTUS is
re-trainable and can predict alternative splicing and the 5' UTR and 3' UTR, including introns. AUGUSTUS is
one of the most accurate ab initio gene prediction programs for the species it has been trained for.
• FGENESH is an HMM-based, very fast, and accurate ab initio gene structure prediction program for
humans, Drosophila, plants, yeasts, and nematodes. It is reported to be the fastest among HMM-based
gene-finding programs.
• GENSCAN is another HMM-based ab initio tool for predicting locations and exon–intron structures of genes
in genomic sequences of a variety of organisms.

• Vertebrate and invertebrate versions of GENSCAN are available. The accuracy of the latter is lower
because the original tool was primarily designed for the detection of genes in human and vertebrate
genomic sequences.
• It is becoming common practice to use ab initio annotation methods in combination with
transcriptome information such as that provided by RNA-seq.
• This can be viewed as an evidence-based or extrinsic approach. For example, a newer version of
AUGUSTUS can incorporate information from EST and protein alignments.
• In addition, a variant of FGENESH called FGENESH-C uses HMM and cDNA for predictions, while
GenomeScan (an extension of GENSCAN) uses extrinsic information of protein BLAST alignments for
gene structure prediction.
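
The role of training data and k-mer statistics in ab initio prediction can be illustrated with the simplest possible case: codon (3-mer) frequencies learned from known coding sequences and used as a log-odds score. This is a sketch of the general idea only; AUGUSTUS, FGENESH, and GENSCAN use much richer HMM-based models. All names and the pseudocount are assumptions for the example.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate in-frame codon frequencies from training sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    # Only the 64 standard codons are counted; the pseudocount avoids zeros.
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(seq, coding_freq, background_freq):
    """Log-odds that `seq`, read in frame 0, looks coding rather than background."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:  # skip codons containing ambiguous bases
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score
```

A positive score means the sequence resembles the training genes more than the background. This also shows why a predictor trained on one species can perform poorly on another with a different codon usage.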

Ab initio and Homology based annotation tools summary.

Annotation Pipelines
• Analysis of the large amounts of data generated by sequencing requires multiple computationally intensive
steps. Sets of algorithms that process sequence data and are executed in a predefined order are called
bioinformatic pipelines.
• Pipelines process massive amounts of sequence data and the associated metadata using multiple
software components, databases, and environments.
• They are comprehensive, holistic packages that try to exploit relevant information provided by both ab
initio and similarity-based gene predictors.

Structural Pipelines
• MAKER2 is a multi-threaded, parallelized genome annotation and data management application, which
builds on MAKER.
• Ab initio gene prediction tools SNAP, AUGUSTUS, and GeneMark-ES are integrated in MAKER2. Novel
genomes with limited training data available can be annotated with MAKER2. The tool can also be used
to improve annotation quality by integrating mRNA-seq data.
• NCBI Eukaryotic Annotation Pipeline is an automated pipeline for eukaryotes, in which coding and
noncoding genes, transcripts, and proteins in both finished and draft genomes can be annotated. This
pipeline uses Splign and ProSplign for alignment. It also has its own gene prediction tool called
GNOMON which combines HMM-based ab initio models and homology search information extracted
from experimental evidence.
• Comparative Annotation Toolkit (CAT) is a fully open-source software toolkit for end-to-end annotation.
CAT uses Progressive Cactus for multiple alignments. Its output, together with previously annotated
genomes, is used to project annotations using transMap.

• CAT uses AUGUSTUS for gene prediction both from transMap projections and for ab initio gene
prediction.
• CAT was developed by GENCODE and was utilized for the annotation of genomes of laboratory
mouse strains and great apes.
• BRAKER1 is a fully automated and highly accurate unsupervised RNA-seq–based genome
annotation pipeline for eukaryotic genomes.

Annotation Visualization
• File Formats
Most bioinformatic tools use the FASTA format as a standard for sequence data sharing. The FASTA
format is used for searching sequence databases, evaluating similarity scores, and identification of
periodic similarity scores.
Other standard file formats exist that can accommodate additional information, can be used by
different programs, and can be interpreted by human users. Such formats store genomic features in a
standard text file format.
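
As an illustration of how simple the FASTA format is, the sketch below parses FASTA text into a dictionary. It relies only on the basic convention that each record starts with a header line beginning with ">" followed by one or more sequence lines; using the first word of the header as the key is an assumption of this example. Established libraries are preferable for real work.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record id is taken as the first whitespace-delimited word of the
    header line. Sequence lines before the first header are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}
```

For example, `parse_fasta(">seq1 test\nACGT\nTTGA\n")` returns `{"seq1": "ACGTTTGA"}`.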
• Genome Browsers
• Researchers and users utilize genome browsers to integrate various types of information, as well as
analyze and visualize data related to annotation.
• Genome browsers are usually used to efficiently and conveniently browse, search, retrieve, and
examine genomic sequence and annotation data, via a graphical interface. The UCSC Genome
Browser is the most commonly used genome browser; many visualization tools are modeled based on
this tool.
• The Ensembl genome browser is another widely used genome browser for vertebrate genomes, which
supports comparative genomics, sequence variation analysis, and transcriptional regulation analysis.
• Generic Model Organism Database (GMOD) is a collection of interconnected open-source software
tools and databases for managing, visualizing, storing, and sharing genetic and genomic information.

Re-Annotation
• We have seen that as a result of the increasing volume of data from genome sequencing projects,
computational analysis methods have become a considerable element of genome annotations. However, this
has led to high levels of misannotation in public databases.
• Re-annotation benefits the end-user by providing the latest resources. Updating a previously annotated
genome can be seen as re-annotation. Automated annotations save time and resources, but manual
annotations, although time-consuming, are generally more accurate than automated ones.
• Re-annotation can be used to create large complete genomes, and indeed, there are tools that can be used
for this purpose. Restauro-G is rapid bacterial genome re-annotation software that utilizes a BLAST-like
alignment tool for re-annotation.
• MAKER2 incorporates an external annotation pass-through mechanism that accepts pre-existing genome
annotations.
• Wiki-based sites have been proven successful in providing accurate, useful, and updated information,
despite the fear of being filled with unreliable and inaccurate data. Currently, new information emerges from
different corners of bioinformatic fields, which impacts gene annotation, rendering re-annotation a never-
ending process, to some degree.

EXPERIMENTAL TECHNIQUES FOR GENE LOCATION
• Most experimental methods for gene location are not based on direct examination of
DNA molecules but instead rely on detection of the RNA molecules that are
transcribed from genes.
• All genes are transcribed into RNA, and if the gene is discontinuous then the primary
transcript is subsequently processed to remove the introns and link up the exons.
• Techniques that map the positions of transcribed sequences in a DNA fragment can
therefore be used to locate exons and entire genes. The only problem to be kept in
mind is that the transcript is usually longer than the coding part of the gene because it
begins several tens of nucleotides upstream of the initiation codon and continues
several tens or hundreds of nucleotides downstream of the termination codon.

Hybridization tests can determine if a fragment contains transcribed sequences
• The simplest procedures for studying transcribed sequences are based on hybridization analysis.
• RNA molecules can be separated by specialized forms of agarose gel electrophoresis, transferred to a
nitrocellulose or nylon membrane, and examined by the process called northern hybridization.
• This differs from Southern hybridization only in the precise conditions under which the transfer is carried
out, and the fact that it was not invented by a Dr Northern and so does not have a capital "N."
• If a northern blot of cellular RNA is probed with a labeled fragment of the genome, then RNAs
transcribed from genes within that fragment will be detected. Northern hybridization is therefore,
theoretically, a means of determining the number of genes present in a DNA fragment and the size of
each coding region.

• An RNA extract is electrophoresed under denaturing
conditions in an agarose gel.
• After ethidium bromide staining, two bands are seen.
These are the two largest rRNA molecules, which are
abundant in most cells.
• The smaller rRNAs, which are also abundant, are not
seen because they are so short that they run out the
bottom of the gel and, in most cells, none of the
mRNAs are abundant enough to form a band visible after
ethidium bromide staining.
• The gel is blotted onto a nylon membrane and, in this
example, probed with a radioactively labeled DNA
fragment.
• A single band is visible on the autoradiograph, showing
that the DNA fragment used as the probe contains part
or all of one transcribed sequence.
Northern hybridization.

Zoo-blotting
• A second type of hybridization analysis avoids the problems with
poorly expressed and tissue-specific genes by searching not for
RNAs but for related sequences in the DNAs of other organisms.
• This approach, like homology searching, is based on the fact that
homologous genes in related organisms have similar sequences,
whereas the intergenic DNA is usually quite different. If a DNA
fragment from one species is used to probe a Southern transfer of DNAs
from related species, and one or more hybridization signals are
obtained, then it is likely that the probe contains one or more
genes. This is called zoo-blotting.

cDNA sequencing enables genes to be mapped within DNA fragments
• Northern hybridization and zoo-blotting enable the presence or absence of genes in a DNA fragment to be
determined, but give no positional information relating to the location of those genes in the DNA sequence.
The easiest way to obtain this information is to sequence the relevant cDNAs.
• A cDNA is a copy of an mRNA and so corresponds to the coding region of a gene, plus any leader or trailer
sequences that are also transcribed. Comparing a cDNA sequence with a genomic DNA sequence
therefore delineates the position of the relevant gene and reveals the exon-intron boundaries.
• In order to obtain an individual cDNA, a cDNA library must first be prepared from all of the mRNA in the
tissue being studied. Once the library has been prepared, the success of cDNA sequencing as a means of
gene location depends on two factors.
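
How comparing a cDNA with genomic DNA reveals exon-intron boundaries can be sketched with a greedy exact-match mapper. This is a simplified illustration that assumes the cDNA matches the genome perfectly and that exons appear in order on the forward strand; the function name and the `seed` length are hypothetical. Spliced-alignment tools handle mismatches, short exons, and splice-site consensus, which this sketch does not.

```python
def map_cdna_to_genome(genome, cdna, seed=8):
    """Place a cDNA on a genomic sequence as a list of exon blocks.

    Returns (start, end) genomic coordinates, 0-based and end-exclusive.
    Each block is found by locating a short seed and extending it while
    the bases keep matching; the gaps between blocks are the introns.
    """
    genome, cdna = genome.upper(), cdna.upper()
    exons = []
    g = 0  # genomic position from which the next exon must start
    c = 0  # number of cDNA bases already placed
    while c < len(cdna):
        k = min(seed, len(cdna) - c)
        start = genome.find(cdna[c:c + k], g)
        if start == -1:
            raise ValueError(f"cDNA position {c} could not be placed")
        end = start + k
        c += k
        # Extend the block base by base until the genome and cDNA diverge.
        while c < len(cdna) and end < len(genome) and genome[end] == cdna[c]:
            end += 1
            c += 1
        exons.append((start, end))
        g = end
    return exons
```

One known limitation: if the first base of an intron happens to equal the first base of the next exon, the greedy extension places the boundary one or more bases too far downstream. Real tools resolve this ambiguity by preferring boundaries that give GT...AG introns.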

• The first concerns the frequency of the desired cDNAs in the library. As with northern hybridization,
the problem relates to the different expression levels of different genes. If the DNA fragment being
studied contains one or more poorly expressed genes, then the relevant cDNAs will be rare in the
library and it might be necessary to screen many clones before the desired one is identified.
• To get around this problem, various methods of cDNA capture or cDNA selection have been
devised, in which the DNA fragment being studied is repeatedly hybridized to the pool of cDNAs in
order to enrich the pool for the desired clones.
• Because the cDNA pool contains so many different sequences, it is generally not possible to
discard all the irrelevant clones by these repeated hybridizations, but it is possible to increase
significantly the frequency of those clones that specifically hybridize to the DNA fragment. This
reduces the size of the library that must subsequently be screened under stringent conditions to
identify the desired clones.

• A second factor that determines success or failure is the completeness of the individual cDNA
molecules. Usually, cDNAs are made by copying RNA molecules into single-stranded DNA with
reverse transcriptase and then converting the single-stranded DNA into double-stranded DNA with a
DNA polymerase.
• There is always a chance that one or other of the strand synthesis reactions will not proceed to
completion, resulting in a truncated cDNA. The presence of intramolecular base pairs in the RNA can
also lead to incomplete copying. Truncated cDNAs may lack some of the information needed to locate
the start and end points of a gene and all its exon-intron boundaries.

Methods are available for precise mapping of the ends of transcripts – (RACE)
• The problems with incomplete cDNAs mean that more robust methods are needed for locating the
precise start and end points of gene transcripts.
• One possibility is a special type of PCR that uses RNA rather than DNA as the starting material. The
first step in this type of PCR is to convert the RNA into cDNA with reverse transcriptase, after which
the cDNA is amplified with Taq polymerase in the same way as in a normal PCR. These methods go
under the collective name of reverse transcriptase PCR (RT-PCR) but the particular version that
interests us at present is rapid amplification of cDNA ends (RACE).

• In the simplest form of this method, one of the primers is specific for an internal region close to the
beginning of the gene being studied. This primer attaches to the mRNA for the gene and directs the
first reverse transcriptase-catalyzed stage of the process, during which a cDNA corresponding to the
start of the mRNA is made.
• Because only a small segment of the mRNA is being copied, the expectation is that the cDNA
synthesis will not terminate prematurely, so one end of the cDNA will correspond exactly with the start
of the mRNA.
• Once the cDNA has been made, a short poly(A) tail is attached to its 3' end. The second primer
anneals to this poly(A) sequence and, during the first round of the normal PCR, converts the single-
stranded cDNA into a double-stranded molecule, which is subsequently amplified as the PCR
proceeds. The sequence of this amplified molecule will reveal the precise position of the start of the
transcript.

RACE – Rapid Amplification of cDNA Ends.
• The RNA being studied is converted into a partial cDNA by
extension of a DNA primer that anneals at an internal
position not too distant from the 5' end of the molecule.
• The 3' end of the cDNA is further extended by treatment with
terminal deoxynucleotidyl transferase in the presence of
dATP, which results in a series of As being added to the
cDNA.
• This series of As acts as the annealing site for the anchor
primer. Extension of the anchor primer leads to a double-
stranded DNA molecule, which can now be amplified by a
standard PCR.
• This is 5'-RACE, so called because it results in amplification
of the 5' end of the starting RNA. A similar method, 3'-RACE,
can be used if the 3'-end sequence is desired.

• Other methods for precise transcript mapping involve heteroduplex analysis. If the DNA region being
studied is cloned as a restriction fragment in an M13 vector then it can be obtained as single-stranded
DNA. When mixed with an appropriate RNA preparation, the transcribed sequence in the cloned DNA
hybridizes with the equivalent mRNA, forming a double-stranded heteroduplex.
• The start of this mRNA lies within the cloned restriction fragment, so some of the cloned fragment
participates in the heteroduplex, but the rest does not. The single-stranded regions can be digested
by treatment with a single-strand-specific nuclease such as S1.
• The size of the heteroduplex is determined by degrading the RNA component with alkali and
electrophoresing the resulting single-stranded DNA in an agarose gel.
• This size measurement is then used to position the start of the transcript relative to the restriction site
at the end of the cloned fragment. Heteroduplex analysis can also be used to locate exon-intron
boundaries.
Heteroduplex analysis.

Exon-intron boundaries can also be located with precision
• A second method for finding exons in a genome sequence is called exon trapping.
• This requires a special type of vector that contains a minigene consisting of two exons flanking an
intron sequence, the first exon being preceded by the sequence signals needed to initiate transcription
in a eukaryotic cell.
• To use the vector, the piece of DNA to be studied is inserted into a restriction site located within the
vector's intron region.
• The vector is then introduced into a suitable eukaryotic cell line, where it is transcribed and the RNA
produced from it is spliced.
• The result is that any exon contained in the genomic fragment becomes attached between the
upstream and downstream exons from the minigene.
• RT-PCR with primers annealing within the two minigene exons is now used to amplify a DNA fragment,
which is sequenced. As the minigene sequence is already known, the nucleotide positions at which
the inserted exon starts and ends can be determined, precisely delineating this exon.
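
The final step of exon trapping, reading the trapped exon out of the sequenced RT-PCR product, amounts to finding what lies between the two known minigene exon sequences. The sketch below assumes the product contains the end of the upstream minigene exon and the start of the downstream one exactly as supplied; the function and parameter names are hypothetical.

```python
def trapped_exon(product, upstream_exon_end, downstream_exon_start):
    """Return the sequence spliced between the two minigene exons.

    Returns "" if the minigene exons are joined directly (nothing was
    trapped) and None if either flanking sequence cannot be found.
    """
    product = product.upper()
    up = upstream_exon_end.upper()
    down = downstream_exon_start.upper()
    i = product.find(up)
    if i == -1:
        return None
    start = i + len(up)
    j = product.find(down, start)
    if j == -1:
        return None
    return product[start:j]
```

Searching the genomic fragment for the returned sequence then gives the positions at which the exon starts and ends.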
