CURRENT TRENDS IN
PSEDUOGENE DETECTION
AND CHARACTERIZATION
PSEUDOGENES:
 Pseudogenes are homologous relatives of known genes that have lost their
ability due to the presence of disruptive mutations such as frame shift and
premature stop codons.
 Three types of pseudogenes
Processed or retro transposed pseudogene
Duplicated pseudogenes
Unitary pseudogenes
 The ability to detect and differentiate pseudogenes from functional genes
are difficult task.
INTRODUCTION:
 Bioinformatics and computational biology approaches have aided in
understanding the mechanism of gene regulation and expression.
 Scientist are interested in understanding functional gene evolution. One source of
information is pseudogenes.
 Pseudogenes are homologous sequences arising from currently or evolutionarily
active genes that have lost their ability to function as a result of disrupted
transcription or translation.
 These genes may contain stop codons, repetitive elements, have frame shift or
lack of transcription. But they may retain gene like features, such as promoter ,
CpG islands and splice sites.
 Biologist have particular interest in pseudogenes since they can interfere with
gene centric studies.
TYPES OF PSEUDOGENES:
Processed Pseudogenes:
 They are also known as retro transposed pseudogenes, arise out of retro
transposable events from mRNA.
 A portion of the mRNA transcript of a gene is spontaneously reverse transcribed
back into DNA and inserted into chromosomal DNA.
 Because they are derived from mature mRNA product- lack the upstream
promoters of normal gene, intronic regions.
 They found abundantly in mammalian genomes but rare in other genomes.
 Processed pseudogenes represent the vast majority of inactivated gene
sequences found in the human genome.
DUPLICATED PSUDOGENES:
 In the process of duplication, genetic sequence undergoes various
modifications, including mutation, frame shifting, insertions or deletions.
 If there is any modification during transcription or translational stage that
could result in loss of genetic function.
 If the duplication of a gene is incomplete, the new sequence could be a
pseudogene. So these genes are called as unprocessed or duplicated
pseudogenes.
 They arise when a cell is replicating its won DNA and insert an extra copy of a
gene into a genome in a new location.
 Unprocessed pseudogenes may have introns and some time they contain an
extra copy of the gene. This unnecessary copy could accumulate mutations
without harming the organism.
UNITARY PSEUDOGENES:
 Various mutations can stop a gene from being successfully transcribed or
translated, and a gene may become non-functional or deactivated.
 If the mutations become fixed in a population, it is possible that the gene become
non-functional in a mechanism similar to unprocessed pseudogenes.
 The difficulty in identifying the unitary pseudogenes is that they could be a
ancient duplicate pseudogenes that have sufficiently diverged from the functional
homolog in such a way that they can be no longer detected as similar.
 In such cases they require more of an evolutionary approach incorporating
multiple organisms.
FUNCTIONAL OR NON FUNCTIONAL?
 The traditional view of pseudogenes as genetic elements similar to functional
genes.
 Pseudogenes do not produce protein products.
 Major challenge for this theory arises from recent evidence that a number of
pseudogenes are actively transcribed.
 Harrison et al. identified a set of 233 transcribed processed pseudogenes within
the mouse and human genome using a source expressed sequences in the form of
mRNAs.
 They are hypothesized to have roles in gene regulation by use of their sequence
complementary to the homologous functional gene.
 Transcribed processed pseudogenes have an additional effect that they
themselves can become duplicated, resulting in “duplicated-processed”
pseudogenes
 Many characterized pseudogenes have been implicated in regulation of gene
expression, gene regulation, and provide genetic diversity through
recombination with functional genes or exon shuffling.
AUTOMATED DETECTION OF PSEUDOGENES:
HOMOLOGY-BASED APPROACHES:
 This approach takes genome/chromosome of interest, and compare it with
known set of protein sequences such as ENSEMBL, UniProt.
 This comparison involves nucleotide or amino acid sequences.
 This is accomplished by using fast heuristic algorithm such as tblastn.
 tfastx method is used to for genomic regions which are not corresponding
to a known gene’s origin are then aligned to the corresponding gene
sequence.
 This tfastx alignments are particularly used for studying processed
pseudogenes that do not contain intronic regions.
 For duplicated pseudogenes, an algorithm such as sim4, spidey,
est2genome or exalign are used. That makes into account intron/exon splice
site information.
 Once the genome-to-protein alignments are filtered, they are searched for
disablements.
HARRISON’S APPROACH:
 They used Caenorhabditis elegans genome to test their process.
 Downloaded annotated pseudogenes sequence form Sanger centre and analysed
those sequence.
 Alignment of protein sequence to the genome sequence.
 Results were filtered for overlapping matches in genomic sequences.
 To avoid over-counting for worm protein hits, initial matches were further
filtered for adjacent matches.
 The potential pseudogenes were filtered for overlap with any other annotations
in the Sanger centre file.
 Refined the data for additional repeat elements any match to proteins.
 Using this methodology, in 2002 they found 77 processed pseudogenes on
chromosome 21 and 112 on chromosome 22.
SAKAI’S APPROACH:
 Method to detect processed pseudogenes based on cDNA mapping to the human
genome.
 All cDNA sequences aligned against the human genome using blastn.
 They extracted corresponding genomic regions for each query seqen tblastn ce.
 Set up cut values of >95% identity and >80% coverage to filter those sequence
hits based on est2genome.
 Second stage is pseudogene detection. All the candidate sequence collected
form the first stage were classified into 6 classes according to the way in which
introns are processed (completed intron loss, one/more intron loss, no intron
loss)
 They found 6,549 processed pseudogenes using 113,196 human cDNA. They
believed their approach will help aid the accuracy of gene prediction.
PSEUDOPIPE:
 A homology-based automated pseudogene identification pipeline known as
PseudoPipe was introduced.
 Aligning protein sequences to genomic sequences using tblastn.
 Sequences overlapping with the genes annotated in the ENSEMBL database are
then removed.
 The remaining sequences are further aligned using the Smith- Waterman
approach of tfasty.
 false positive repeats, low complexity sequences and potential functional gene
candidates were removed.
 Finally, sequences are separated into different classes - retrotransposed
pseudogenes, duplicated pseudogenes, or pseudogeneic fragments based upon
exon structure.
PSEUDOGENE FINDER:
 PSF, was proposed in order to detect pseudogenes within 44 ENCODE
sequences.
 PSF begins by searching for protein alignments not corresponding to the gene
location as determined by the program Prot-map.
 Potential pseudogenes are filtered from this resulting dataset by looking at
alignments.
 There are further filtered to ensure that at least one disruption event occurs.
 Based on these criteria, PSF was able to detect 81% of annotated pseudogenes
for the corresponding ENCODE regions.
PSEUDOGENE DATABASE:
HOPPSIGEN:
HOPPSIGEN [82] is a database of homologous processed pseudogenes shared
between the mouse and human genomes that contains information pertaining to the
genomic location, potential function, and gene structure of processed pseudogenes.
PSEUDOGENE QUEST:
 PseudoGeneQuest is a web-based system for identifying novel human
pseudogenes based on a user provided protein sequence.
 PseudoGeneQuest begins by comparing the query sequence against the human
genome using tblastn.
1) already known genes; 2) known pseudogenes; 3) real gene or exon; 4)
pseudogene fragment; 5) pseudogene; 6) miscellaneous; 7) putative new gene; 8)
duplicated pseudogene; 9) interrupted processed pseudogene.
PSEUDOGENE.ORG:
 Pseudogene.org is a MySQL-based repository for pseudogenes identified
using various methods, including over 100,000 pseudogenes from 75
genomes.
 Pseudogenes within Pseudogene.org are classified into the four categories
of 1) processed pseudogenes; 2) duplicated pseudogenes; 3) other non-
processed pseudogenes (such as unitary pseudogenes); and 4) unclassified
(due to signal degradation or structural ambiguity).
 Pseudogene.org classifies pseudogene detection methods according to an
entity attribute value (EAV) that includes the Ka/Ks ratio; CpG content;
distance from the query protein; proximity to CpG islands; and relevant
PCR tiling microarray results
DISCUSSION:
INCONSISTENTLY DETECTED PSEUDOGENES:
 The main difficulties with computational detection of pseudogenes is that many
of the current methods in place only share a small percentage of detected
pseudogenes.
 Performance analysis of pseudogene prediction approaches is an area in which
great improvement is needed.
 Use of these as a training set causes validation issues, since in actuality it will
only measure whether or not two separate approaches can detect the same set
types of pseudogenes.
 What is desperately needed for analyzing the performance of pseudogenes is a
“gold standard” test set similar to those found for use in analyzing gene
prediction programs
DIFFICULT CASES:
 Pavlicek et al. [21] show that only 28% of derived processed pseudogenes are
full length while in silico methods report around 95% of all processed
pseudogenes as full length.
 Pseudogenes that truly become non-functional present a particular problem in
pseudogene detection.
 One method for overcoming this issue is to detect partially conserved protein
motifs in intergenic regions called pseudomotifs.
 Duplication of gene sequences causes a particular problem. Just because two
sequences have sufficiently diverged, it does not necessarily mean that one is a
pseudogene.
 Nearly all computational detection methods currently rely on comparing a
genomic region to a known protein sequence in order to find novel
pseudogenes. However, this approach does not allow for the detection of
unitary pseudogenes where a reference gene is absent.
 limitations of sequence alignment approaches make it hard to detect ancient
and decayed pseudogenes.
FUTURE DIRECTIONS:
 As more genomes of higher level organisms become sequenced, the ability
to do comparative genomics studies of pseudogenes will become more
plausible.
 Incorporationof comparative data will likely increase both the sensitivity
and specificity of pseudogene detection approaches, at an increased price in
computation.
 As more knowledge is obtained regarding biological mechanisms at the
genome level (such as the discovery of miRNA), it is possible that truly
nonfunctional regions of the genome will be discovered to be rare events.
 As a result, the fine line between nonfunctional pseudogenes and functional
components in the genome will continue to be of interest for years to come.

Current trends in pseduogene detection and characterization

  • 1.
    CURRENT TRENDS IN PSEDUOGENEDETECTION AND CHARACTERIZATION
  • 2.
    PSEUDOGENES:  Pseudogenes arehomologous relatives of known genes that have lost their ability due to the presence of disruptive mutations such as frame shift and premature stop codons.  Three types of pseudogenes Processed or retro transposed pseudogene Duplicated pseudogenes Unitary pseudogenes  The ability to detect and differentiate pseudogenes from functional genes are difficult task.
  • 3.
    INTRODUCTION:  Bioinformatics andcomputational biology approaches have aided in understanding the mechanism of gene regulation and expression.  Scientist are interested in understanding functional gene evolution. One source of information is pseudogenes.  Pseudogenes are homologous sequences arising from currently or evolutionarily active genes that have lost their ability to function as a result of disrupted transcription or translation.  These genes may contain stop codons, repetitive elements, have frame shift or lack of transcription. But they may retain gene like features, such as promoter , CpG islands and splice sites.  Biologist have particular interest in pseudogenes since they can interfere with gene centric studies.
  • 4.
    TYPES OF PSEUDOGENES: ProcessedPseudogenes:  They are also known as retro transposed pseudogenes, arise out of retro transposable events from mRNA.  A portion of the mRNA transcript of a gene is spontaneously reverse transcribed back into DNA and inserted into chromosomal DNA.  Because they are derived from mature mRNA product- lack the upstream promoters of normal gene, intronic regions.  They found abundantly in mammalian genomes but rare in other genomes.  Processed pseudogenes represent the vast majority of inactivated gene sequences found in the human genome.
  • 5.
    DUPLICATED PSUDOGENES:  Inthe process of duplication, genetic sequence undergoes various modifications, including mutation, frame shifting, insertions or deletions.  If there is any modification during transcription or translational stage that could result in loss of genetic function.  If the duplication of a gene is incomplete, the new sequence could be a pseudogene. So these genes are called as unprocessed or duplicated pseudogenes.  They arise when a cell is replicating its won DNA and insert an extra copy of a gene into a genome in a new location.  Unprocessed pseudogenes may have introns and some time they contain an extra copy of the gene. This unnecessary copy could accumulate mutations without harming the organism.
  • 6.
    UNITARY PSEUDOGENES:  Variousmutations can stop a gene from being successfully transcribed or translated, and a gene may become non-functional or deactivated.  If the mutations become fixed in a population, it is possible that the gene become non-functional in a mechanism similar to unprocessed pseudogenes.  The difficulty in identifying the unitary pseudogenes is that they could be a ancient duplicate pseudogenes that have sufficiently diverged from the functional homolog in such a way that they can be no longer detected as similar.  In such cases they require more of an evolutionary approach incorporating multiple organisms.
  • 8.
    FUNCTIONAL OR NONFUNCTIONAL?  The traditional view of pseudogenes as genetic elements similar to functional genes.  Pseudogenes do not produce protein products.  Major challenge for this theory arises from recent evidence that a number of pseudogenes are actively transcribed.  Harrison et al. identified a set of 233 transcribed processed pseudogenes within the mouse and human genome using a source expressed sequences in the form of mRNAs.  They are hypothesized to have roles in gene regulation by use of their sequence complementary to the homologous functional gene.  Transcribed processed pseudogenes have an additional effect that they themselves can become duplicated, resulting in “duplicated-processed” pseudogenes
  • 9.
     Many characterizedpseudogenes have been implicated in regulation of gene expression, gene regulation, and provide genetic diversity through recombination with functional genes or exon shuffling.
  • 10.
    AUTOMATED DETECTION OFPSEUDOGENES: HOMOLOGY-BASED APPROACHES:  This approach takes genome/chromosome of interest, and compare it with known set of protein sequences such as ENSEMBL, UniProt.  This comparison involves nucleotide or amino acid sequences.  This is accomplished by using fast heuristic algorithm such as tblastn.  tfastx method is used to for genomic regions which are not corresponding to a known gene’s origin are then aligned to the corresponding gene sequence.  This tfastx alignments are particularly used for studying processed pseudogenes that do not contain intronic regions.
  • 11.
     For duplicatedpseudogenes, an algorithm such as sim4, spidey, est2genome or exalign are used. That makes into account intron/exon splice site information.  Once the genome-to-protein alignments are filtered, they are searched for disablements.
  • 12.
    HARRISON’S APPROACH:  Theyused Caenorhabditis elegans genome to test their process.  Downloaded annotated pseudogenes sequence form Sanger centre and analysed those sequence.  Alignment of protein sequence to the genome sequence.  Results were filtered for overlapping matches in genomic sequences.  To avoid over-counting for worm protein hits, initial matches were further filtered for adjacent matches.  The potential pseudogenes were filtered for overlap with any other annotations in the Sanger centre file.  Refined the data for additional repeat elements any match to proteins.  Using this methodology, in 2002 they found 77 processed pseudogenes on chromosome 21 and 112 on chromosome 22.
  • 13.
    SAKAI’S APPROACH:  Methodto detect processed pseudogenes based on cDNA mapping to the human genome.  All cDNA sequences aligned against the human genome using blastn.  They extracted corresponding genomic regions for each query seqen tblastn ce.  Set up cut values of >95% identity and >80% coverage to filter those sequence hits based on est2genome.  Second stage is pseudogene detection. All the candidate sequence collected form the first stage were classified into 6 classes according to the way in which introns are processed (completed intron loss, one/more intron loss, no intron loss)  They found 6,549 processed pseudogenes using 113,196 human cDNA. They believed their approach will help aid the accuracy of gene prediction.
  • 14.
    PSEUDOPIPE:  A homology-basedautomated pseudogene identification pipeline known as PseudoPipe was introduced.  Aligning protein sequences to genomic sequences using tblastn.  Sequences overlapping with the genes annotated in the ENSEMBL database are then removed.  The remaining sequences are further aligned using the Smith- Waterman approach of tfasty.  false positive repeats, low complexity sequences and potential functional gene candidates were removed.  Finally, sequences are separated into different classes - retrotransposed pseudogenes, duplicated pseudogenes, or pseudogeneic fragments based upon exon structure.
  • 15.
    PSEUDOGENE FINDER:  PSF,was proposed in order to detect pseudogenes within 44 ENCODE sequences.  PSF begins by searching for protein alignments not corresponding to the gene location as determined by the program Prot-map.  Potential pseudogenes are filtered from this resulting dataset by looking at alignments.  There are further filtered to ensure that at least one disruption event occurs.  Based on these criteria, PSF was able to detect 81% of annotated pseudogenes for the corresponding ENCODE regions.
  • 16.
    PSEUDOGENE DATABASE: HOPPSIGEN: HOPPSIGEN [82]is a database of homologous processed pseudogenes shared between the mouse and human genomes that contains information pertaining to the genomic location, potential function, and gene structure of processed pseudogenes. PSEUDOGENE QUEST:  PseudoGeneQuest is a web-based system for identifying novel human pseudogenes based on a user provided protein sequence.  PseudoGeneQuest begins by comparing the query sequence against the human genome using tblastn. 1) already known genes; 2) known pseudogenes; 3) real gene or exon; 4) pseudogene fragment; 5) pseudogene; 6) miscellaneous; 7) putative new gene; 8) duplicated pseudogene; 9) interrupted processed pseudogene.
  • 17.
    PSEUDOGENE.ORG:  Pseudogene.org isa MySQL-based repository for pseudogenes identified using various methods, including over 100,000 pseudogenes from 75 genomes.  Pseudogenes within Pseudogene.org are classified into the four categories of 1) processed pseudogenes; 2) duplicated pseudogenes; 3) other non- processed pseudogenes (such as unitary pseudogenes); and 4) unclassified (due to signal degradation or structural ambiguity).  Pseudogene.org classifies pseudogene detection methods according to an entity attribute value (EAV) that includes the Ka/Ks ratio; CpG content; distance from the query protein; proximity to CpG islands; and relevant PCR tiling microarray results
  • 18.
    DISCUSSION: INCONSISTENTLY DETECTED PSEUDOGENES: The main difficulties with computational detection of pseudogenes is that many of the current methods in place only share a small percentage of detected pseudogenes.  Performance analysis of pseudogene prediction approaches is an area in which great improvement is needed.  Use of these as a training set causes validation issues, since in actuality it will only measure whether or not two separate approaches can detect the same set types of pseudogenes.  What is desperately needed for analyzing the performance of pseudogenes is a “gold standard” test set similar to those found for use in analyzing gene prediction programs
  • 19.
    DIFFICULT CASES:  Pavliceket al. [21] show that only 28% of derived processed pseudogenes are full length while in silico methods report around 95% of all processed pseudogenes as full length.  Pseudogenes that truly become non-functional present a particular problem in pseudogene detection.  One method for overcoming this issue is to detect partially conserved protein motifs in intergenic regions called pseudomotifs.  Duplication of gene sequences causes a particular problem. Just because two sequences have sufficiently diverged, it does not necessarily mean that one is a pseudogene.
  • 20.
     Nearly allcomputational detection methods currently rely on comparing a genomic region to a known protein sequence in order to find novel pseudogenes. However, this approach does not allow for the detection of unitary pseudogenes where a reference gene is absent.  limitations of sequence alignment approaches make it hard to detect ancient and decayed pseudogenes. FUTURE DIRECTIONS:  As more genomes of higher level organisms become sequenced, the ability to do comparative genomics studies of pseudogenes will become more plausible.  Incorporationof comparative data will likely increase both the sensitivity and specificity of pseudogene detection approaches, at an increased price in computation.
  • 21.
     As moreknowledge is obtained regarding biological mechanisms at the genome level (such as the discovery of miRNA), it is possible that truly nonfunctional regions of the genome will be discovered to be rare events.  As a result, the fine line between nonfunctional pseudogenes and functional components in the genome will continue to be of interest for years to come.