EST Clustering
Medhavi Vashisth*#, Indu Bisla*#, Taruna Anand*, Neeru Redhu#,
Shikha Yashveer#, Nitin Virmani*, B.C. Bera*, Rajesh K. Vaid*
*National Centre for Veterinary Type Cultures, ICAR- National Research Centre on Equines, Hisar, India
#CCS Haryana Agricultural University, Hisar, India
Expressed Sequence Tags (EST)
• What are ESTs?
• Quality problem (single pass)
• Cleaning (vector clipping, contamination
filtering, repeat masking)
• Clustering
• Assembly into contigs
• Gene indices
• Databases
Expressed Sequence Tags (EST)
• ESTs represent partial sequences of cDNA
clones (average ~ 360 bp).
• Single-pass reads from the 5’ and/or 3’ ends of
cDNA clones.
What is an EST cluster?
• A cluster is fragmented EST data and (if known) gene sequence data that has been
consolidated, placed in correct context, and indexed by gene, such that all
expressed data concerning a single gene is in a single index class, and each
index class contains the information for only one gene (Burke et al., 1999).
Owing to the fragmented nature of EST reads, it is worthwhile to attempt
to organize the reads into assemblies that provide a consensus view of the
sampled transcripts.
• Recent studies indicate that several forms of transcript can be generated
from a single locus and that antisense RNAs and possibly other forms of
RNAs revealed in recent tiling path studies can overlap their expression
over the same sets of genomic nucleotides
Interest of ESTs
• The aim of EST clustering is simply to incorporate all ESTs that share a transcript
or gene parent to the same cluster. There is usually a requirement to assemble the
clustered ESTs into one or more consensus sequences (contigs) that reflect the
transcript diversity, and to provide these contigs in such a manner that the
information they contain most truly reflects the sampled biology.
• ESTs represent the most extensive available survey of the transcribed portion of
genomes.
• ESTs are indispensable for gene structure prediction, gene discovery and genomic
mapping.
• Characterization of splice variants and alternative polyadenylation.
• In silico differential display and gene expression studies (specific tissue expression,
normal/disease states).
• SNP data mining.
• High-volume and high-throughput data production at low cost.
Low quality data of ESTs
• High error rates (~ 1/100) because the
sequences are single-pass reads.
• Sequence compression and frame-shift errors,
also due to single-pass reading.
• A single EST represents only a partial gene
sequence.
• Not a defined gene/protein product.
• Not curated in a highly annotated form.
• High redundancy in the data -> huge number
of sequences to analyze.
Improving ESTs: Clustering, Assembling and Gene indices
• The value of ESTs is greatly enhanced by clustering and assembling.
– solving redundancy can help to correct errors;
– longer and better annotated sequences;
– easier association to mRNAs and proteins;
– detection of splice variants;
– fewer sequences to analyze.
• Gene indices: All expressed sequences (as ESTs) concerning a single gene are
grouped in a single index class, and each index class contains the information
for only one gene.
• Different clustering/assembly procedures have been proposed with
associated resulting databases (gene indices):
– UniGene (http://www.ncbi.nlm.nih.gov/UniGene)
– TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml)
– STACK (http://www.sanbi.ac.za/Dbases.html)
EST clustering pipeline
Pre-processing: data source
• The data sources for clustering can be in-house, proprietary, public databases or a
hybrid of these (chromatograms and/or sequence files).
• Each EST must have the following information:
– A sequence AC/ID (ex. sequence-run ID);
– Location with respect to the poly(A) tail (3’ or 5’);
– The CLONE ID from which the EST has been generated;
– Organism;
– Tissue and/or conditions;
– The sequence.
• The EST can be stored in FASTA format:
>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5’
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATATC
TTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAA
ATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCT
TTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCAT
GTACAGGATCCACAGGGGTGGTGTCAAATGCTATTGAAATTNTGTTGAATTGTATACTTT
TTCACTTTTTGATAATTAACCATGTAAAAAATGAACGCTACTACTATAGTAGAATTGAT
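A record like the one above can be read with a few lines of code. The following is a minimal sketch, not a full FASTA implementation: the function names `read_fasta` and `_make_record` are hypothetical, and the split of the header into an identifier and a free-text description is an assumption about how the header line is laid out.

```python
from typing import Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    """Split a header into (identifier, description) and join the sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def read_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record in text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield _make_record(header, chunks)


example = """>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5'
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATATC
TTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAA
"""
records = list(read_fasta(example))
```

Here `records[0]` holds the accession `T27784`, the remaining header text, and the concatenated sequence. The `N` characters (uncalled bases) are kept as they are.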
Pre-processing: Essential steps
• EST pre-processing consists of a number of essential steps to minimize the chance
of clustering unrelated sequences.
– Screening out low quality regions:
• Low quality sequence readings are error prone.
• Programs such as Phred (Ewing et al., 1998) read chromatograms (base calling) and assign a quality
value to each nucleotide.
– Screening out contaminations (tRNA, rRNA, mitoDNA).
– Screening out vector sequences (vector clipping).
– Screening out repeat sequences (repeats masking).
– Screening out low complexity sequences.
• Dedicated software is available for these tasks:
– RepeatMasker (Smit and Green,
http://ftp.genome.washington.edu/RM/RepeatMasker.html);
– VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen);
– Lucy (Chou and Holmes, 2001);
– ...
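The quality-screening step can be illustrated with a small sketch. A Phred quality value Q corresponds to an error probability of 10^(-Q/10). The trimming rule below (strip bases from both ends while their quality is under a threshold) is a simplification chosen for illustration, not the rule used by Phred, Lucy, or any other tool listed above. The function names and the default threshold of 20 are assumptions.

```python
from typing import List, Tuple


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality value to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(
    seq: str, quals: List[int], min_q: int = 20
) -> Tuple[str, List[int]]:
    """Remove bases from both ends while their quality is below min_q."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


trimmed_seq, trimmed_quals = trim_low_quality_ends(
    "ACGTACGT", [5, 10, 30, 30, 12, 30, 8, 3]
)
```

Note that a low-quality base inside the retained region (quality 12 in the example) is kept, since only the ends are trimmed.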
Pre-processing: vector clipping
• Vector-clipping
– Vector sequences can skew clustering even if a small vector fragment remains in each
read.
– Delete 5’ and 3’ regions corresponding to the vector used for cloning.
– Detection of vector sequences is not a trivial task, because they normally lie in the low
quality region of the sequence.
– UniVec is a non-redundant vector database available from NCBI:
• http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
• Contaminations
– Find and delete:
• bacterial DNA, yeast DNA, and other contaminations;
• Standard pairwise alignment programs are used for the detection of vector and
other contaminants (for example cross-match, BLASTN, FASTA). They are
reasonably fast and accurate.
Pre-processing: repeat masking
• Some repetitive elements found in the human genome:
Pre-processing: repeat masking
• Repeated elements:
– They represent a big part of the mammalian genome.
– They are found in a number of genomes (plants, ...)
– They induce errors in clustering and assembling.
– They should be masked, not deleted, to avoid false sequence assembling.
– … but also interesting elements for evolutionary studies.
– SSRs important for mapping of diseases.
• Tools to find repeats:
– RepeatMasker has been developed to find repetitive elements and low-complexity sequences.
RepeatMasker uses the cross-match program for the pairwise alignments
• http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker
– MaskerAid improves the speed of RepeatMasker ~ 30-fold by using WU-BLAST instead of
cross-match
• http://sapiens.wustl.edu/maskeraid
– RepBase is a database of prototypic sequences representing repetitive DNA from different eukaryotic
species:
• http://www.girinst.org/Repbase_Update.html
Pre-processing: low complexity
regions
• Low complexity sequences contain a strong bias in their
nucleotide composition (poly A tracts, AT repeats, etc.).
• Low complexity regions can provide an artifactual basis for cluster
membership.
• Clustering strategies employing alignable similarity in their first pass
are very sensitive to low complexity sequences.
• Some clustering strategies are insensitive to low complexity
sequences, because they weight sequences with respect to their
information content (ex. d2-cluster).
• Programs such as DUST (NCBI) can be used to mask low complexity regions.
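To make the idea of masking concrete, here is a minimal sketch that replaces compositionally biased windows with `N`. It uses the Shannon entropy of the base composition in a sliding window, which is a stand-in chosen for illustration and is not the scoring scheme used by DUST. The window size of 12 and the entropy cutoff of 1.0 bit are assumed values.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy (in bits) of the base composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 1.0) -> str:
    """Replace every window whose composition entropy is below min_entropy with N."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


mixed = "ACGTACGTACGT" + "A" * 12 + "TGCATGCATGCA"
masked = mask_low_complexity(mixed)
```

The sequence length is preserved, so coordinates stay valid after masking. This is the reason for masking rather than deleting.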
Pre-processing: summary
EST Clustering
• The goal of the clustering process is to incorporate overlapping ESTs which tag the
same transcript of the same gene in a single cluster.
• For clustering, we measure the similarity (distance) between any 2 sequences. The
distance is then reduced to a simple binary value: accept or reject two sequences
in the same cluster.
• Similarity can be measured using different algorithms:
– Pairwise alignment algorithms:
• Smith-Waterman is the most sensitive, but time-consuming (ex. cross-match);
• Heuristic algorithms, such as BLAST and FASTA, trade some sensitivity for speed.
– Non-alignment based scoring methods:
• d2 cluster algorithm: based on word comparison and composition (word identity and
multiplicity) (Burke et al., 1999). No alignments are performed -> fast.
– Pre-indexing methods.
– Purpose-built alignments based clustering methods.
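The "measure similarity, then reduce it to accept or reject" idea can be sketched as follows. The similarity measure here is the number of shared k-mers, which is an assumption made for brevity and is not the measure used by any of the tools named above. Accepted pairs are merged transitively with a union-find structure, so an EST joins a cluster as soon as it is accepted against any member. The names `kmers` and `cluster_ests`, and the defaults `k=8` and `min_shared=5`, are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: Dict[str, str], k: int = 8, min_shared: int = 5) -> List[List[str]]:
    """Group ESTs; two ESTs are accepted together if they share >= min_shared k-mers."""
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    words = {name: kmers(seq, k) for name, seq in ests.items()}
    for a, b in combinations(ests, 2):
        # Binary decision: accept or reject the pair.
        if len(words[a] & words[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in ests:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


base = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCA"
ests = {
    "est1": base[:30],   # 5' part of the made-up transcript
    "est2": base[15:],   # overlaps est1 by 15 bases
    "est3": "TTTTCCCCGGGGAAAATTTTCCCCGGGGAAAA",  # unrelated
}
clusters = cluster_ests(ests)
```

Raising `min_shared` makes the clustering more stringent, and lowering it makes it looser.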
Loose and stringent clustering
• Stringent clustering:
– Greater initial fidelity;
– One pass;
– Lower coverage of expressed gene data;
– Lower cluster inclusion of expressed gene forms;
– Shorter consensus sequences.
• Loose clustering:
– Lower initial fidelity;
– Multi-pass;
– Greater coverage of expressed gene data;
– Greater cluster inclusion of alternate expressed forms.
– Longer consensus sequences;
– Risk of including paralogs in the same gene index.
Supervised and unsupervised EST clustering
• Supervised clustering
– ESTs are classified with respect to known reference sequences or ”seeds” (full length
mRNAs, exon constructs from genomic sequences, previously assembled EST cluster
consensus). The number of cluster classes is defined before clustering starts.
• Unsupervised clustering
– ESTs are classified without any prior knowledge. No classes are defined.
• The three major gene indices use different EST clustering methods:
– TIGR Gene Index uses a stringent and supervised clustering method, which generates
shorter consensus sequences and separates splice variants.
– STACK uses a loose and unsupervised clustering method, producing longer consensus
sequences and including splice variants in the same index.
– UniGene uses a combination of supervised and unsupervised methods with variable levels of
stringency. No consensus sequences are produced.
Assembling and processing
• A multiple alignment for each cluster can be generated (assembly) and
consensus sequences generated (processing).
• A number of programs are available for assembly and processing:
– PHRAP
(http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm);
– TIGR ASSEMBLER (Sutton et al., 1995);
– CRAW (Burke et al., 1998);
– ...
• Assembly and processing result in the production of consensus
sequences and singletons (helpful to visualize splice variants).
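As an illustration of the processing step, the sketch below derives a consensus from reads that are already aligned, using a simple per-column majority vote. This is a deliberately simplified stand-in: the assemblers listed above use more elaborate schemes, and the function name `consensus`, the gap character `-`, and the handling of `N` are assumptions.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str], gap: str = "-") -> str:
    """Majority-vote consensus of equal-length aligned rows; gaps and N do not vote."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all rows must have the same aligned length")
    out = []
    for column in zip(*aligned):
        votes = Counter(base for base in column if base != gap and base != "N")
        # A column covered only by gaps or N stays unresolved.
        out.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(out)


aligned_reads = [
    "ACGTAC----",
    "--GTACGGA-",
    "ACGAACGGAT",
]
cons = consensus(aligned_reads)
```

In the fourth column the single `A` is outvoted by two `T` calls, which shows how redundancy among ESTs can correct isolated read errors.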
Cluster joining
• All ESTs generated from the same cDNA clone correspond to a single gene.
• Generally the original cDNA clone information is available (~ 90%).
• Using the cDNA clone information and the 5’ and 3’ reads information, clusters can
be joined.
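Cluster joining by clone can be sketched as a merge of all clusters that contain reads from the same cDNA clone. The data layout (a list of sets of EST names plus a mapping from EST name to clone ID) and the function name `join_clusters_by_clone` are assumptions made for this example.

```python
from typing import Dict, List, Set


def join_clusters_by_clone(
    clusters: List[Set[str]], clone_of: Dict[str, str]
) -> List[Set[str]]:
    """Merge clusters that contain ESTs derived from the same cDNA clone."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_cluster_for_clone: Dict[str, int] = {}
    for idx, members in enumerate(clusters):
        for est in members:
            clone = clone_of.get(est)
            if clone is None:
                continue  # clone information is not available for every EST
            if clone in first_cluster_for_clone:
                parent[find(idx)] = find(first_cluster_for_clone[clone])
            else:
                first_cluster_for_clone[clone] = idx

    merged: Dict[int, Set[str]] = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(members)
    return list(merged.values())


clusters_in = [{"est1_5p", "est2"}, {"est1_3p"}, {"est3"}]
clone_of = {"est1_5p": "cloneA", "est1_3p": "cloneA", "est3": "cloneB"}
joined = join_clusters_by_clone(clusters_in, clone_of)
```

The 5’ and 3’ reads of `cloneA` do not overlap in sequence here, yet their clusters are joined because they come from the same clone.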
UniGene
• UniGene Gene Indices available for a number of organisms.
• UniGene clusters are produced with a supervised procedure: ESTs are
clustered using GenBank CDSs and mRNAs data as ”seed” sequences.
• No attempt to produce contigs or consensus sequences.
• UniGene uses pairwise sequence comparison at various levels of
stringency to group related sequences, placing closely related and
alternatively spliced transcripts into one cluster.
• UniGene web site: http://www.ncbi.nlm.nih.gov/UniGene.
UniGene procedure
• Screen for contaminants, repeats, and low-complexity regions in GenBank.
– Low-complexity regions are detected using Dust.
– Contaminants (vector, linker, bacterial, mitochondrial, ribosomal sequences) are
detected using pairwise alignment programs.
– Repeat masking of repeated regions (RepeatMasker).
– Only sequences with at least 100 informative bases are accepted.
• Clustering procedure.
– Build clusters of genes and mRNAs (GenBank).
– Add ESTs to previous clusters (megablast).
– ESTs that join two clusters of genes/mRNAs are discarded.
– Any resulting cluster without a polyadenylation signal or at least two 3’ ESTs is discarded.
– The resulting clusters are called anchored clusters since their 3’ end is supposedly
known.
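The acceptance rule from the screening step above (at least 100 informative bases) can be expressed as a simple filter. The sketch assumes that masked or uncalled positions appear as `N`, `X`, or lower-case letters, so that only upper-case A, C, G, T count as informative. That representation and the function names are assumptions, not a description of the UniGene code.

```python
def informative_bases(seq: str) -> int:
    """Count unmasked, unambiguous bases (upper-case A, C, G, T)."""
    return sum(1 for base in seq if base in "ACGT")


def passes_length_filter(seq: str, minimum: int = 100) -> bool:
    """Accept a sequence only if it has at least `minimum` informative bases."""
    return informative_bases(seq) >= minimum


long_enough = passes_length_filter("ACGT" * 25)
too_masked = passes_length_filter("ACGT" * 24 + "NNNN")
```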
UniGene procedure (2)
• Ensures that 5’ and 3’ ESTs from the same cDNA clone belong to the same
cluster.
• ESTs that have not been clustered are reprocessed with a lower level of
stringency. ESTs added during this step are called guest members.
• Clusters of size 1 (containing a single sequence) are compared against
the rest of the clusters with a lower level of stringency and merged
with the cluster containing the most similar sequence.
• For each build of the database, cluster IDs change if clusters are split
or merged.
TIGR Gene Indices
• TIGR produces Gene Indices for a number of organisms
(http://www.tigr.org/tdb/tgi).
• TIGR Gene Indices are produced using strict supervised clustering methods.
• Clusters are assembled in consensus sequences, called tentative consensus (TC)
sequences, that represent the underlying mRNA transcripts.
• The TIGR Gene Indices building method tightly groups highly related sequences and
discards under-represented, divergent, or noisy sequences.
• TIGR Gene Indices characteristics:
– separate closely related genes into distinct consensus sequences;
– separate splice variants into separate clusters;
– low level of contamination.
• TC sequences can be used for genome annotation, genome mapping, and
identification of orthologous/paralogous genes.
TIGR Gene Indices procedure
• EST sequences are retrieved from dbEST
(http://www.ncbi.nlm.nih.gov/dbEST);
• Sequences are trimmed to remove:
– Vectors and adaptor sequences
– polyA/T tails
– bacterial sequences
• Get expressed transcripts (ETs) from EGAD
(http://www.tigr.org/tdb/egad/egad.shtml):
– EGAD (Expressed Gene Anatomy Database) is based on mRNA and CDS
(coding sequences) from GenBank.
• Get Tentative consensus and singletons from previous database build.
TIGR Gene Indices procedure (2)
• The built TCs are loaded in the TIGR Gene Indices
database and annotated using information from
GenBank and/or protein homology.
• Track of the old TC IDs is maintained through a
relational database.
• References:
– Quackenbush et al. (2000) Nucleic Acids Research, 28,
141-145.
– Quackenbush et al. (2001) Nucleic Acids Research, 29,
159-164.
STACK The Sequence Tag Alignment and Consensus Knowledgebase
• STACK concentrates on human data.
• Based on ”loose” unsupervised clustering, followed by strict assembly
procedure and analysis to identify and characterize sequence
divergence (alternative splicing, etc).
• The ”loose” clustering approach, d2 cluster, is not based on
alignments, but performs comparisons via non-contextual assessment
of the composition and multiplicity of words within each sequence.
• Because of the ”loose” clustering, STACK produces longer consensus
sequences than TIGR Gene Indices.
• STACK also integrates ~ 30% more sequences than UniGene, due to the
”loose” clustering approach
STACK procedure
• Sub-partitioning.
– Select human ESTs from GenBank;
– Sequences are grouped in tissue-based categories (”bins”). This will allow
further specific tissue transcription exploration.
– A ”bin” is also created for sequences derived from disease-related tissues.
• Masking.
– Sequences are masked for repeats and contaminants using cross-match:
• Human repeat sequences (RepBase);
• Vector sequences;
• Ribosomal and mitochondrial DNA, other contaminants.
STACK procedure (2)
• ”Loose” clustering using d2 cluster.
– The algorithm looks for the co-occurrence of n-length words (n = 6) in a window of size
150 bases having at least 96% identity.
– Sequences shorter than 50 bases are excluded from the clustering process.
– Clusters highly related sequences.
– Clusters also sequences related by rearrangements or alternative splicing.
– Because d2 cluster weighs sequences according to their information content, masking of
low complexity regions is not required.
• Assembly.
– The assembly step is performed using Phrap.
– STACK does not use the quality information available from chromatograms (although version 2.2
of stackPACK does).
– The lack of trace information is largely compensated by the redundancy of the ESTs data.
– Sequences that cannot be aligned with Phrap are extracted from the clusters (singletons)
and processed later.
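The word-composition comparison behind the "loose" clustering step can be sketched as follows. The d2 measure is taken here as the sum, over all words of length n, of the squared difference in word counts between two windows, minimized over window pairs. This is a brute-force illustration and not the d2 cluster implementation: the mapping from the 96% identity criterion to a distance cutoff is omitted, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def word_counts(seq: str, n: int = 6) -> Counter:
    """Count every overlapping word of length n in seq (identity and multiplicity)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def d2_distance(a: str, b: str, n: int = 6) -> int:
    """Sum of squared differences between the word counts of a and b."""
    ca, cb = word_counts(a, n), word_counts(b, n)
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb))


def min_window_d2(a: str, b: str, n: int = 6, window: int = 150, step: int = 1) -> int:
    """Smallest d2 distance over all pairs of windows taken from a and b."""

    def windows(s: str) -> List[str]:
        if len(s) <= window:
            return [s]
        return [s[i:i + window] for i in range(0, len(s) - window + 1, step)]

    return min(d2_distance(x, y, n) for x in windows(a) for y in windows(b))


read = "GATTACAGATTACA"
longer = "CCCC" + read + "TTTT"
best = min_window_d2(read, longer, n=6, window=len(read))
```

Because only word counts are compared, no alignment is computed, and two sequences that share a window of identical word content get a distance of 0 regardless of where that window lies.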
STACK procedure (3)
• Alignment analysis.
– The CRAW program is used in the first part of the alignment analysis.
– CRAW generates consensus sequence with maximized length.
– CRAW partitions a cluster into sub-ensembles if >= 50% of a 100-base window
differs from the rest of the sequences of the cluster.
– Rank the sub-ensembles according to the number of assigned sequences and
number of called bases for each sub-ensemble (CONTIGPROC).
– Annotate polymorphic regions and alternative splicing.
• Linking.
– Joins clusters containing ESTs with shared clone ID.
– Add singletons produced by Phrap according to their clone ID.
STACK procedure (4)
• STACK update.
– New ESTs are searched against existing consensus and singletons using cross-match.
– Matching sequences are added to extend existing clusters and consensus.
– Non-matching sequences are processed using d2 cluster against the entire database, and
the newly produced clusters are renamed -> Gene Index IDs change.
• STACK outputs.
– Primary consensus for each cluster in FASTA format.
– Alignments from Phrap in GDE (Genetic Data Environment) format.
– Sequence variations and sub-consensus (from CRAW processing).
• References.
– Miller et al. (1999) Genome Research,9, 1143-1155.
– Christoffels et al. (2001) Nucleic Acids Research, 29, 234-238.
– http://www.sanbi.ac.za/Dbases.html
trEST (see also trGEN / tromer)
• trEST is an attempt to produce contigs from clusters of ESTs and to
translate them into proteins.
• trEST uses UniGene clusters and clusters produced from in-house
software.
• To assemble clusters trEST uses Phrap and CAP3 algorithms.
• Contigs produced by the assembling step are translated into protein
sequences using the ESTscan program, which corrects most of the
frame-shift errors and predicts transcripts with a position error of a few
amino acids.
• You can access trEST via the HITS database (http://hits.isb-sib.ch).
EST clustering procedures
Mapping EST to genome
• sim4 is an algorithm that maps ESTs, cDNAs, mRNAs to genomic sequences.
(http://pbil.univ-lyon1.fr/sim4.html)
• sim4 algorithm finds matching blocks representing the "exon cores".
• The algorithm used by sim4 is similar to the BLAST algorithm:
– Determine high-scoring segment pairs (HSPs).
• High scoring gap-free regions.
• Selects exact matches of length 12.
• Extend matches in both directions with a score of 1 for a match and -5 for a
mismatch until no increase of the score.
– Select HSPs that could represent a gene.
• Use a dynamic programming algorithm to find a chain of HSPs with the following
constraints:
– 1. Their starting positions are in increasing order.
– 2. The diagonals of consecutive HSPs are nearly the same ("exon cores") or differ enough
to be a plausible intron.
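The seed-and-extend step described above (exact 12-base matches, extended without gaps using +1 for a match and -5 for a mismatch) can be sketched as follows. This is an illustration of the idea, not the sim4 source: the stopping rule uses a hypothetical `max_drop` parameter (the extension stops once the running score falls more than `max_drop` below the best score seen), and the chaining of HSPs by dynamic programming is not shown.

```python
from typing import Dict, List, Tuple

MATCH, MISMATCH = 1, -5


def extend(est: str, genome: str, i: int, j: int, step: int, max_drop: int) -> Tuple[int, int]:
    """Gap-free extension from (i, j) in direction step; return (best gain, its length)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(est) and 0 <= j < len(genome):
        score += MATCH if est[i] == genome[j] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:
            break
        i += step
        j += step
    return best, best_len


def find_hsps(est: str, genome: str, word: int = 12, max_drop: int = 10) -> List[Tuple[int, int, int, int]]:
    """Return sorted (est_start, genome_start, length, score) tuples."""
    index: Dict[str, List[int]] = {}
    for j in range(len(genome) - word + 1):
        index.setdefault(genome[j:j + word], []).append(j)
    hsps = set()
    for i in range(len(est) - word + 1):
        for j in index.get(est[i:i + word], []):
            right_gain, right_len = extend(est, genome, i + word, j + word, 1, max_drop)
            left_gain, left_len = extend(est, genome, i - 1, j - 1, -1, max_drop)
            hsps.add((i - left_len, j - left_len,
                      word + left_len + right_len,
                      word * MATCH + left_gain + right_gain))
    return sorted(hsps)


# Made-up example: two exons separated by an intron in the genomic sequence.
exon1 = "ATGGCGTACGTTAGCCTAGGATCC"
exon2 = "TTACGGCTAAGCTTGCAACGTGAC"
intron = "GTAAGT" + "C" * 16 + "AG"
hsps = find_hsps(exon1 + exon2, exon1 + intron + exon2)
```

In this example the two HSPs lie on diagonals that differ by the intron length, which is the situation the chaining constraints are meant to recognize.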
Mapping EST to genome
– Find exon boundaries.
• If "exon cores" overlap, the ends are trimmed to
find boundary sequences (GT..AG or CT..AC).
• If "exon cores" don't overlap, they are extended using a "greedy" method. Then the
ends are trimmed to find boundary sequences.
• If this last step fails, the region between two adjacent exon cores is searched for
HSPs at a reduced stringency.
– Determine alignments.
• Found exons with anchored boundaries are realigned by a method to align very
similar DNA sequences (Chao et al., 1997).
• Other similar tools:
– Spidey (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html)
– est2genome (EMBOSS package)
THANK YOU

aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
SSR02
 

Recently uploaded (20)

SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 

EST Clustering.ppt

  • 1. EST Clustering Medhavi Vashisth*#, Indu Bisla*#, Taruna Anand*, Neeru Redhu#, Shikha Yashveer#, Nitin Virmani*, B.C. Bera*, Rajesh K. Vaid* *National Centre for Veterinary Type Cultures, ICAR- National Research Centre on Equines, Hisar, India #CCS Haryana Agricultural University, Hisar, India
  • 2. Expressed Sequence Tags (EST) • What are ESTs? • Quality problem (single pass) • Cleaning (vector clipping, contamination filtering, repeat masking) • Clustering • Assembly into contigs • Gene indices • Databases
  • 3. Expressed Sequence Tags (EST) • ESTs represent partial sequences of cDNA clones (average ~ 360 bp). • Single-pass reads from the 5’ and/or 3’ ends of cDNA clones.
  • 4. What is an EST cluster? • A cluster is fragmented, EST data and (if known) gene sequence data, consolidated, placed in correct context, and indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene (Burke et al., 1999). Owing to the fragmented nature of EST reads, it is worthwhile to attempt to organize the reads into assemblies that provide a consensus view of the sampled transcripts. • Recent studies indicate that several forms of transcript can be generated from a single locus and that antisense RNAs and possibly other forms of RNAs revealed in recent tiling path studies can overlap their expression over the same sets of genomic nucleotides
  • 5. Interest of ESTs • The aim of EST clustering is simply to incorporate all ESTs that share a transcript or gene parent to the same cluster. There is usually a requirement to assemble the clustered ESTs into one or more consensus sequences (contigs) that reflect the transcript diversity, and to provide these contigs in such a manner that the information they contain most truly reflects the sampled biology. • ESTs represent the most extensive available survey of the transcribed portion of genomes. • ESTs are indispensable for gene structure prediction, gene discovery and genomic mapping. • Characterization of splice variants and alternative polyadenylation. • In silico differential display and gene expression studies (specific tissue expression, normal/disease states). • SNP data mining. • High-volume and high-throughput data production at low cost.
  • 6. Low quality data of ESTs • High error rates (~ 1/100) because the sequences are single-pass reads. • Sequence compression and frame-shift errors, also a consequence of single-pass sequencing. • A single EST represents only a partial gene sequence. • Not a defined gene/protein product. • Not curated in a highly annotated form. • High redundancy in the data -> huge number of sequences to analyze.
  • 7. Improving ESTs: Clustering, Assembling and Gene indices • The value of ESTs is greatly enhanced by clustering and assembling. – resolving redundancy can help to correct errors; – longer and better annotated sequences; – easier association to mRNAs and proteins; – detection of splice variants; – fewer sequences to analyze. • Gene indices: All expressed sequences (as ESTs) concerning a single gene are grouped in a single index class, and each index class contains the information for only one gene. • Different clustering/assembly procedures have been proposed with associated resulting databases (gene indices): – UniGene (http://www.ncbi.nlm.nih.gov/UniGene) – TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) – STACK (http://www.sanbi.ac.za/Dbases.html)
  • 9. Pre-processing: data source • The data sources for clustering can be in-house, proprietary, public databases or a hybrid of these (chromatograms and/or sequence files). • Each EST must have the following information: – A sequence AC/ID (e.g. sequence-run ID); – Location with respect to the poly A (3’ or 5’); – The CLONE ID from which the EST has been generated; – Organism; – Tissue and/or conditions; – The sequence. • The EST can be stored in FASTA format:
>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5’
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATATC
TTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAA
ATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCT
TTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCAT
GTACAGGATCCACAGGGGTGGTGTCAAATGCTATTGAAATTNTGTTGAATTGTATACTTT
TTCACTTTTTGATAATTAACCATGTAAAAAATGAACGCTACTACTATAGTAGAATTGAT
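
As an illustration of the FASTA layout on this slide, here is a minimal Python sketch of a reader that yields (header, sequence) pairs. The function names `read_fasta` and `accession` are hypothetical, and the assumption that the accession is the first whitespace-separated token of the header is taken from the example record only, not from a formal specification.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def accession(header: str) -> str:
    """Assumed convention: the accession is the first token of the header."""
    return header.split()[0]
```

Passing an open file handle to `read_fasta` works as well as a list of strings, since both are iterables of lines.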
  • 10. Pre-processing: Essential steps • EST pre-processing consists of a number of essential steps that minimize the chance of clustering unrelated sequences. – Screening out low quality regions: • Low quality sequence readings are error prone. • Programs such as Phred (Ewing et al., 98) read chromatograms (base-calling) and assign a quality value to each nucleotide. – Screening out contaminants (tRNA, rRNA, mitoDNA). – Screening out vector sequences (vector clipping). – Screening out repeat sequences (repeat masking). – Screening out low complexity sequences. • Dedicated software is available for these tasks: – RepeatMasker (Smit and Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html); – VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen); – Lucy (Chou and Holmes, 01); – ...
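
To make the idea of screening out low quality regions concrete, the sketch below keeps the longest stretch of bases whose quality values stay at or above a cut-off. It relies on the standard Phred definition Q = -10 log10(p), where p is the estimated error probability of a base call. The function names and the cut-off of 20 are illustrative assumptions; tools such as Lucy use more refined trimming rules than this.

```python
from typing import Sequence, Tuple


def error_probability(q: float) -> float:
    """Convert a Phred quality value to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def longest_high_quality_run(quals: Sequence[int], min_q: int = 20) -> Tuple[int, int]:
    """Return (start, end) of the longest run with every quality >= min_q.

    The interval is half-open, so seq[start:end] is the retained region.
    min_q = 20 (1 error in 100) is an assumed, illustrative threshold.
    """
    best_start = best_end = 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_end
```

For a read `seq` with per-base qualities `quals`, the trimmed read would be `seq[start:end]` with `start, end = longest_high_quality_run(quals)`.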
  • 11. Pre-processing: vector clipping • Vector-clipping – Vector sequences can skew clustering even if only a small vector fragment remains in each read. – Delete 5’ and 3’ regions corresponding to the vector used for cloning. – Detection of vector sequences is not a trivial task, because they normally lie in the low quality region of the sequence. – UniVec is a non-redundant vector database available from NCBI: • http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html • Contaminants – Find and delete: • bacterial DNA, yeast DNA, and other contaminants; • Standard pairwise alignment programs are used for the detection of vector and other contaminants (for example cross-match, BLASTN, FASTA). They are reasonably fast and accurate.
  • 12. Pre-processing: repeat masking • Some repetitive elements found in the human genome:
  • 13. Pre-processing: repeat masking • Repeated elements: – They represent a big part of the mammalian genome. – They are found in a number of genomes (plants, ...) – They induce errors in clustering and assembling. – They should be masked, not deleted, to avoid false sequence assembling. – … but they are also interesting elements for evolutionary studies. – SSRs are important for mapping of diseases. • Tools to find repeats: – RepeatMasker has been developed to find repetitive elements and low-complexity sequences. RepeatMasker uses the cross-match program for the pairwise alignments • http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker – MaskerAid improves the speed of RepeatMasker ~ 30-fold by using WU-BLAST instead of cross-match • http://sapiens.wustl.edu/maskeraid – RepBase is a database of prototypic sequences representing repetitive DNA from different eukaryotic species: • http://www.girinst.org/Repbase_Update.html
  • 14. Pre-processing: low complexity regions • Low complexity sequences contain a strong bias in their nucleotide composition (poly A tracts, AT repeats, etc.). • Low complexity regions can provide an artifactual basis for cluster membership. • Clustering strategies employing alignable similarity in their first pass are very sensitive to low complexity sequences. • Some clustering strategies are insensitive to low complexity sequences, because they weight sequences with respect to their information content (e.g. d2-cluster). • Programs such as DUST (NCBI) can be used to mask low complexity regions.
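
The following sketch shows one simple way to mask compositionally biased regions: slide a window along the read and replace it with N wherever the Shannon entropy of its base composition is low. This is not the DUST algorithm; it is a stand-in chosen for brevity, and the window size and entropy cut-off are assumed values.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy (in bits) of the base composition of seq; 0 for a homopolymer."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def mask_low_complexity(seq: str, window: int = 20, min_entropy: float = 1.0) -> str:
    """Replace every window whose entropy is below min_entropy with N.

    window and min_entropy are illustrative assumptions, not published defaults.
    Masking (rather than deleting) keeps the coordinates of the read intact.
    """
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "N" * window
    return "".join(masked)
```

Reads shorter than the window are returned unchanged, which is a limitation of this sketch rather than a recommendation.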
  • 16. EST Clustering • The goal of the clustering process is to incorporate overlapping ESTs which tag the same transcript of the same gene in a single cluster. • For clustering, we measure the similarity (distance) between any 2 sequences. The distance is then reduced to a simple binary value: accept or reject two sequences in the same cluster. • Similarity can be measured using different algorithms: – Pairwise alignment algorithms: • Smith-Waterman is the most sensitive, but time consuming (e.g. cross-match); • Heuristic algorithms, such as BLAST and FASTA, trade some sensitivity for speed – Non-alignment based scoring methods: • d2 cluster algorithm: based on word comparison and composition (word identity and multiplicity) (Burke et al., 99). No alignments are performed -> fast. – Pre-indexing methods. – Purpose-built alignment-based clustering methods.
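
The accept/reject decision described on this slide turns clustering into finding connected groups of sequences. The sketch below illustrates that with a union-find structure and a deliberately crude similarity test (number of shared k-mers). The test `similar`, its parameters, and the all-against-all comparison are assumptions for illustration; it ignores reverse-complement matches and would be far too slow for a real EST collection.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def kmers(seq: str, k: int = 8) -> set:
    """Set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a: str, b: str, k: int = 8, min_shared: int = 3) -> bool:
    """Binary accept/reject: accept if the reads share enough k-mers (assumed rule)."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


def cluster(seqs: Sequence[str], accept: Callable[[str, str], bool] = similar) -> List[List[int]]:
    """Group sequence indices so that accepted pairs end up in the same cluster."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if accept(seqs[i], seqs[j]):
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because membership is transitive, two reads that never overlap directly can still share a cluster through a third read that overlaps both, which is also how a single spurious match can merge unrelated clusters.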
  • 17. Loose and stringent clustering • Stringent clustering: – Greater initial fidelity; – One pass; – Lower coverage of expressed gene data; – Lower cluster inclusion of expressed gene forms; – Shorter consensus sequences. • Loose clustering: – Lower initial fidelity; – Multi-pass; – Greater coverage of expressed gene data; – Greater cluster inclusion of alternate expressed forms; – Longer consensus sequences; – Risk of including paralogs in the same gene index.
  • 18. Supervised and unsupervised EST clustering • Supervised clustering – ESTs are classified with respect to known reference sequences or "seeds" (full length mRNAs, exon constructs from genomic sequences, previously assembled EST cluster consensus). The number of cluster classes is defined before clustering starts. • Unsupervised clustering – ESTs are classified without any prior knowledge. No classes are defined. • The three major gene indices use different EST clustering methods: – TIGR Gene Index uses a stringent and supervised clustering method, which generates shorter consensus sequences and separates splice variants. – STACK uses a loose and unsupervised clustering method, producing longer consensus sequences and including splice variants in the same index. – UniGene uses a combination of supervised and unsupervised methods with variable levels of stringency. No consensus sequences are produced.
  • 19. Assembling and processing • A multiple alignment for each cluster can be generated (assembly) and consensus sequences generated (processing). • A number of programs are available for assembly and processing: – PHRAP (http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm); – TIGR ASSEMBLER (Sutton et al., 95); – CRAW (Burke et al., 98); – ... • Assembly and processing result in the production of consensus sequences and singletons (helpful to visualize splice variants).
  • 20. Cluster joining • All ESTs generated from the same cDNA clone correspond to a single gene. • Generally the original cDNA clone information is available (~ 90%). • Using the cDNA clone information and the 5’ and 3’ reads information, clusters can be joined.
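
Cluster joining can be sketched as a second merging pass driven by clone annotation instead of sequence similarity. The function below takes two hypothetical dictionaries (EST to cluster ID, and EST to cDNA clone ID where known) and merges any clusters that contain reads from the same clone. The data layout and names are assumptions for illustration only.

```python
from typing import Dict


def join_clusters_by_clone(est_cluster: Dict[str, str],
                           est_clone: Dict[str, str]) -> Dict[str, str]:
    """Merge clusters that contain ESTs read from the same cDNA clone.

    est_cluster maps EST ID -> cluster ID; est_clone maps EST ID -> clone ID
    (ESTs without clone information are simply absent from est_clone).
    Returns a new EST ID -> cluster ID mapping after joining.
    """
    parent = {c: c for c in est_cluster.values()}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    seen = {}  # clone ID -> a cluster already known to contain that clone
    for est, cluster_id in est_cluster.items():
        clone = est_clone.get(est)
        if clone is None:
            continue
        if clone in seen:
            parent[find(cluster_id)] = find(seen[clone])  # same clone, so same gene
        else:
            seen[clone] = cluster_id
    return {est: find(c) for est, c in est_cluster.items()}
```

A 5’ read and a 3’ read of a long transcript often do not overlap, so this is the step that brings the two ends of the same gene together.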
  • 21. UniGene • UniGene Gene Indices are available for a number of organisms. • UniGene clusters are produced with a supervised procedure: ESTs are clustered using GenBank CDSs and mRNAs data as "seed" sequences. • No attempt to produce contigs or consensus sequences. • UniGene uses pairwise sequence comparison at various levels of stringency to group related sequences, placing closely related and alternatively spliced transcripts into one cluster. • UniGene web site: http://www.ncbi.nlm.nih.gov/UniGene.
  • 22. UniGene procedure • Screen for contaminants, repeats, and low-complexity regions in GenBank. – Low-complexity regions are detected using Dust. – Contaminants (vector, linker, bacterial, mitochondrial, ribosomal sequences) are detected using pairwise alignment programs. – Repeat masking of repeated regions (RepeatMasker). – Only sequences with at least 100 informative bases are accepted. • Clustering procedure. – Build clusters of genes and mRNAs (GenBank). – Add ESTs to previous clusters (megablast). – ESTs that join two clusters of genes/mRNAs are discarded. – Any resulting cluster without a polyadenylation signal or at least two 3’ ESTs is discarded. – The resulting clusters are called anchored clusters since their 3’ end is supposedly known.
  • 23. UniGene procedure (2) • Ensures that 5’ and 3’ ESTs from the same cDNA clone belong to the same cluster. • ESTs that have not been clustered are reprocessed with a lower level of stringency. ESTs added during this step are called guest members. • Clusters of size 1 (containing a single sequence) are compared against the rest of the clusters with a lower level of stringency and merged with the cluster containing the most similar sequence. • For each build of the database, cluster IDs change if clusters are split or merged.
  • 24. TIGR Gene Indices • TIGR produces Gene Indices for a number of organisms (http://www.tigr.org/tdb/tgi). • TIGR Gene Indices are produced using strict supervised clustering methods. • Clusters are assembled into consensus sequences, called tentative consensus (TC) sequences, that represent the underlying mRNA transcripts. • The TIGR Gene Indices building method tightly groups highly related sequences and discards under-represented, divergent, or noisy sequences. • TIGR Gene Indices characteristics: – separate closely related genes into distinct consensus sequences; – separate splice variants into separate clusters; – low level of contamination. • TC sequences can be used for genome annotation, genome mapping, and identification of orthologous/paralogous genes.
  • 25. TIGR Gene Indices procedure • EST sequences are retrieved from dbEST (http://www.ncbi.nlm.nih.gov/dbEST); • Sequences are trimmed to remove: – Vectors and adaptor sequences – polyA/T tails – bacterial sequences • Get expressed transcripts (ETs) from EGAD (http://www.tigr.org/tdb/egad/egad.shtml): – EGAD (Expressed Gene Anatomy Database) is based on mRNA and CDS (coding sequences) from GenBank. • Get tentative consensus sequences and singletons from the previous database build.
  • 26. TIGR Gene Indices procedure • Assembled TCs are loaded into the TIGR Gene Indices database and annotated using information from GenBank and/or protein homology. • Track of the old TC IDs is maintained through a relational database. • References: – Quackenbush et al. (2000) Nucleic Acids Research, 28, 141-145. – Quackenbush et al. (2001) Nucleic Acids Research, 29, 159-164.
  • 27. STACK: The Sequence Tag Alignment and Consensus Knowledgebase • STACK concentrates on human data. • Based on "loose" unsupervised clustering, followed by strict assembly procedure and analysis to identify and characterize sequence divergence (alternative splicing, etc). • The "loose" clustering approach, d2 cluster, is not based on alignments, but performs comparisons via non-contextual assessment of the composition and multiplicity of words within each sequence. • Because of the "loose" clustering, STACK produces longer consensus sequences than TIGR Gene Indices. • STACK also integrates ~ 30% more sequences than UniGene, due to the "loose" clustering approach.
  • 28. STACK procedure • Sub-partitioning. – Select human ESTs from GenBank; – Sequences are grouped in tissue-based categories ("bins"). This will allow further specific tissue transcription exploration. – A "bin" is also created for sequences derived from disease-related tissues. • Masking. – Sequences are masked for repeats and contaminants using cross-match: • Human repeat sequences (RepBase); • Vector sequences; • Ribosomal and mitochondrial DNA, other contaminants.
  • 29. STACK procedure (2) • "Loose" clustering using d2 cluster. – The algorithm looks for the co-occurrence of n-length words (n = 6) in a window of size 150 bases having at least 96% identity. – Sequences shorter than 50 bases are excluded from the clustering process. – Clusters highly related sequences. – Also clusters sequences related by rearrangements or alternative splicing. – Because d2 cluster weighs sequences according to their information content, masking of low complexity regions is not required. • Assembly. – The assembly step is performed using Phrap. – STACK does not use the quality information available from chromatograms (version 2.2 of stackPACK does use it). – The lack of trace information is largely compensated by the redundancy of the EST data. – Sequences that cannot be aligned with Phrap are extracted from the clusters (singletons) and processed later.
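
The word-based comparison behind d2 cluster can be illustrated with the usual d2 measure: the sum, over all words of length n, of the squared difference between the word counts in two windows. The sketch below computes that value and takes the minimum over pairs of windows from the two reads. It is a simplified illustration, not the d2_cluster implementation: the window step is an assumed parameter, reverse complements are ignored, and the conversion of the score into the 96% identity criterion is not reproduced.

```python
from collections import Counter
from typing import List


def word_counts(seq: str, n: int = 6) -> Counter:
    """Multiplicity of every word of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def d2(a: str, b: str, n: int = 6) -> int:
    """Squared Euclidean distance between the word-count vectors of a and b."""
    ca, cb = word_counts(a, n), word_counts(b, n)
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


def windows(seq: str, size: int, step: int) -> List[str]:
    """Windows of the given size; a short sequence yields itself as one window."""
    last = max(len(seq) - size, 0)
    return [seq[i:i + size] for i in range(0, last + 1, step)]


def min_window_d2(x: str, y: str, n: int = 6, size: int = 150, step: int = 25) -> int:
    """Smallest d2 over all pairs of windows (step is an assumed speed-up)."""
    return min(d2(a, b, n)
               for a in windows(x, size, step)
               for b in windows(y, size, step))
```

Two reads would be placed in the same cluster when this minimum falls below a chosen threshold; no alignment is computed at any point, which is where the speed of the method comes from.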
  • 30. STACK procedure (3) • Alignment analysis. – The CRAW program is used in the first part of the alignment analysis. – CRAW generates consensus sequence with maximized length. – CRAW partitions a cluster in sub-ensembles if >= 50% of a 100 bases window differ from the rest of the sequences of the cluster. – Rank the sub-ensembles according to the number of assigned sequences and number of called bases for each sub-ensemble (CONTIGPROC). – Annotate polymorphic regions and alternative splicing. • Linking. – Joins clusters containing ESTs with shared clone ID. – Add singletons produced by Phrap according to their clone ID.
  • 31. STACK procedure (4) • STACK update. – New ESTs are searched against existing consensus and singletons using cross-match. – Matching sequences are added to extend existing clusters and consensus. – Non-matching sequences are processed using d2 cluster against the entire database, and the newly produced clusters are renamed, so Gene Index IDs change. • STACK outputs. – Primary consensus for each cluster in FASTA format. – Alignments from Phrap in GDE (Genetic Data Environment) format. – Sequence variations and sub-consensus (from CRAW processing). • References. – Miller et al. (1999) Genome Research, 9, 1143-1155. – Christoffels et al. (2001) Nucleic Acids Research, 29, 234-238. – http://www.sanbi.ac.za/Dbases.html
  • 32. trEST (see also trGEN / tromer) • trEST is an attempt to produce contigs from clusters of ESTs and to translate them into proteins. • trEST uses UniGene clusters and clusters produced from in-house software. • To assemble clusters trEST uses the Phrap and CAP3 algorithms. • Contigs produced by the assembling step are translated into protein sequences using the ESTscan program, which corrects most of the frame-shift errors and predicts transcripts with a position error of a few amino acids. • You can access trEST via the HITS database (http://hits.isb-sib.ch).
  • 34. Mapping EST to genome • sim4 is an algorithm that maps ESTs, cDNAs, mRNAs to genomic sequences. (http://pbil.univ-lyon1.fr/sim4.html) • The sim4 algorithm finds matching blocks representing the "exon cores". • The algorithm used by sim4 is similar to the BLAST algorithm: – Determine high-scoring segment pairs (HSPs). • High scoring gap-free regions. • Selects exact matches of length 12. • Extend matches in both directions with a score of 1 for a match and -5 for a mismatch until no increase of the score. – Select HSPs that could represent a gene. • Use a dynamic programming algorithm to find a chain of HSPs with the following constraints: – 1. Their starting positions are in increasing order. – 2. The diagonals of consecutive HSPs are nearly the same ("exon cores") or differ enough to be a plausible intron.
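
The first step on this slide (exact seeds of length 12, then gap-free extension with +1 per match and -5 per mismatch) can be sketched as follows. This is not the sim4 source code: the stopping rule is an assumption (extension is abandoned once the running score drops more than `max_drop` below the best score seen, in the style of BLAST), and the chaining of HSPs by dynamic programming is left out.

```python
from typing import List, Tuple


def extend(est: str, genome: str, i: int, j: int, step: int,
           match: int = 1, mismatch: int = -5, max_drop: int = 10) -> int:
    """Gap-free extension from est[i]/genome[j] in direction step (+1 or -1).

    Returns the number of bases to add for the best-scoring extension.
    max_drop is an assumed cut-off, not a documented sim4 parameter.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(est) and 0 <= j < len(genome):
        score += match if est[i] == genome[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:
            break
        i += step
        j += step
    return best_len


def find_hsps(est: str, genome: str, w: int = 12) -> List[Tuple[int, int, int]]:
    """Return sorted (est_start, genome_start, length) gap-free segment pairs."""
    index = {}
    for j in range(len(genome) - w + 1):
        index.setdefault(genome[j:j + w], []).append(j)  # exact w-mer lookup table

    hsps = set()
    for i in range(len(est) - w + 1):
        for j in index.get(est[i:i + w], []):
            right = extend(est, genome, i + w, j + w, +1)
            left = extend(est, genome, i - 1, j - 1, -1)
            hsps.add((i - left, j - left, left + w + right))
    return sorted(hsps)
```

For each segment pair the diagonal is `genome_start - est_start`; two consecutive pairs whose diagonals differ by a plausible intron length are candidates for adjacent exon cores in the chaining step.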
  • 35. Mapping EST to genome – Find exon boundaries. • If "exon cores" overlap, the ends are trimmed to find boundary sequences (GT..AG or CT..AC). • If "exon cores" don't overlap, they are extended using a "greedy" method. Then the ends are trimmed to find boundary sequences. • If this last step fails, the region between two adjacent exon cores is searched for HSPs at a reduced stringency. – Determine alignments. • Found exons with anchored boundaries are realigned by a method to align very similar DNA sequences (Chao et al., 1997). • Other similar tools: – Spidey (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html) – est2genome (EMBOSS package)
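
The boundary signals mentioned above can be checked directly on the genomic sequence: an intron on the forward strand starts with GT and ends with AG, and the same intron read from the opposite strand appears as CT..AC. The sketch below tests a candidate intron and, if the signals are missing, tries small shifts of both boundaries. The function names and the `max_shift` value are hypothetical, and a real tool would also confirm that the shifted exon ends still match the EST.

```python
from typing import Optional, Tuple


def intron_strand(genome: str, start: int, end: int) -> Optional[str]:
    """Return '+' for GT..AG, '-' for CT..AC, or None for genome[start:end]."""
    intron = genome[start:end]
    if intron.startswith("GT") and intron.endswith("AG"):
        return "+"
    if intron.startswith("CT") and intron.endswith("AC"):
        return "-"
    return None


def adjust_boundary(genome: str, start: int, end: int,
                    max_shift: int = 3) -> Optional[Tuple[int, int, str]]:
    """Slide both intron ends by the same small offset until signals are found.

    Shifting both ends together keeps the total exon length unchanged.
    Offsets are tried in order of increasing absolute size (0, -1, 1, -2, ...).
    """
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, e = start + shift, end + shift
        if s < 0 or e > len(genome):
            continue  # shifted interval would fall outside the sequence
        strand = intron_strand(genome, s, e)
        if strand is not None:
            return s, e, strand
    return None
```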