Comparison of Genomic DNA to cDNA Alignment Methods
1. Comparison of Genomic DNA to cDNA Alignment Methods
Miguel Galves and Zanoni Dias
Institute of Computing – Unicamp – Campinas – SP – Brazil
{miguel.galves,zanoni}@ic.unicamp.br
Scylla Bioinformatics – Campinas – SP – Brazil
{miguel,zanoni}@scylla.com.br
2. Agenda
Introduction
Problem
Aligners
Data set
Subsets
Evaluation Methods
Results: Exact Alignments
Results: EST Alignments
Running Time Comparison
Conclusions
3. Introduction
Identifying genes in uncharacterized DNA sequences is one of the greatest challenges in genomics
EST-to-DNA alignment is one of the most common methods for gene identification
ESTs are key to understanding the inner workings of an organism
– Humans have between 30000 and 35000 genes
– Alternative splicing plays an important role in diversity
5. Problem: How to solve it?
Classic algorithms
– Dynamic programming
Heuristic-based algorithms
– Multi-step
– Based on other tools such as BLAST and local alignments
6. Aligners
Java versions of the global and semi-global aligners
– Affine gap penalty function
– Linear space
– Global algorithm by Miller and Myers (1988)
– Semi-global based on global algorithm
Heuristic-based algorithms
– sim4, Spidey and est_genome
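A minimal sketch of semi-global scoring with an affine gap penalty, written in Python rather than the Java used for the classic aligners above. It is not the Miller and Myers algorithm: it keeps only two rows of the dynamic programming table, so it runs in linear space but returns only the score, not the alignment itself. The function name, the default parameter values, the choice to leave the genomic flanks unpenalized, and the convention that a gap of length k costs gap_open + k * gap_extend are all assumptions made for illustration.

```python
NEG_INF = float("-inf")


def semiglobal_score(cdna, genomic, match=1, mismatch=-2,
                     gap_open=-1, gap_extend=0):
    """Best score aligning all of `cdna` against any region of `genomic`.

    Unaligned genomic bases before and after the aligned region are free;
    every cDNA base must be aligned or placed in a penalized gap.
    """
    n = len(genomic)
    # Row 0: starting anywhere in the genomic sequence costs nothing.
    prev_h = [0] * (n + 1)
    prev_f = [NEG_INF] * (n + 1)  # best score ending in a gap in the genomic row
    for i in range(1, len(cdna) + 1):
        # Column 0: the first i cDNA bases sit in one gap before the genomic start.
        cur_h = [gap_open + i * gap_extend] + [NEG_INF] * n
        cur_f = [cur_h[0]] + [NEG_INF] * n
        e = NEG_INF  # best score ending in a gap in the cDNA row (intron-like)
        for j in range(1, n + 1):
            e = max(cur_h[j - 1] + gap_open + gap_extend, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open + gap_extend,
                           prev_f[j] + gap_extend)
            sub = match if cdna[i - 1] == genomic[j - 1] else mismatch
            cur_h[j] = max(prev_h[j - 1] + sub, e, cur_f[j])
        prev_h, prev_f = cur_h, cur_f
    # Last row: ending anywhere in the genomic sequence costs nothing.
    return max(prev_h)


if __name__ == "__main__":
    # Toy genomic sequence: two short "exons" separated by a run of T bases.
    toy_genomic = "GGACG" + "TTTTTTTT" + "TACGG"
    print(semiglobal_score("ACGTAC", toy_genomic))
    print(semiglobal_score("ACGTAC", toy_genomic, gap_open=-10))
```

With a gap extension of 0, a gap of any length costs only its opening penalty, which is one way to let a long intron be skipped cheaply. A full aligner would also need traceback to recover the exon boundaries.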
7. Data Set
Human genome database
– Based on FASTA and GenBank flat-format files from the NCBI repository
Filtering criteria
– Genes, mRNAs and CDS with /pseudo tag
– mRNAs without any CDS
– Genes without any mRNA
– CDS matching wrong patterns
23124 genes and 27448 mRNAs stored in the database
8. Subsets
Subset 1: 66 genes from chromosome Y with fewer than 100000 bases
Subset 2: 50 complete genes from chromosome Y with fewer than 100000 bases
Subset 3: 8056 complete genes from all chromosomes with fewer than 100000 bases
Subset 4: 493 artificial ESTs based on complete genes from chromosome 6 with fewer than 100000 bases
9. Evaluation methods
Number of gaps introduced in the aligned gene sequence
Delta exons
Base similarity percentage
Mismatch percentage
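As a rough illustration of three of these measures, the sketch below computes them from a pair of aligned rows. The slides do not give exact definitions, so the ones used here are assumptions: a "gap" is a maximal run of gap characters in the gene row, and both percentages are taken relative to the number of cDNA bases. The function name and the returned keys are hypothetical, and delta exons is left out because it needs exon annotations that are not described here.

```python
import re


def alignment_metrics(aligned_gene, aligned_cdna, gap="-"):
    """Gap count and similarity/mismatch percentages for one alignment.

    Both rows must have the same length and use `gap` for gap columns.
    """
    if len(aligned_gene) != len(aligned_cdna):
        raise ValueError("aligned rows must have the same length")

    # Assumed definition: one gap = one maximal run of gap characters.
    gaps_in_gene = len(re.findall(re.escape(gap) + "+", aligned_gene))

    # Columns where a gene base faces a cDNA base (no gap on either side).
    pairs = [(g, c) for g, c in zip(aligned_gene, aligned_cdna)
             if g != gap and c != gap]
    matches = sum(1 for g, c in pairs if g == c)
    mismatches = len(pairs) - matches

    # Assumed denominator: the number of cDNA bases in the alignment.
    cdna_bases = sum(1 for c in aligned_cdna if c != gap)
    if cdna_bases == 0:
        return {"gaps_in_gene": gaps_in_gene,
                "similarity_pct": 0.0, "mismatch_pct": 0.0}
    return {
        "gaps_in_gene": gaps_in_gene,
        "similarity_pct": 100.0 * matches / cdna_bases,
        "mismatch_pct": 100.0 * mismatches / cdna_bases,
    }


if __name__ == "__main__":
    print(alignment_metrics("ACGT-ACGT", "ACGTTAC-A"))
```

Other denominators, such as the total number of alignment columns, are equally plausible and would change the percentages.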
10. Experimental method
Two score systems (out of 15 previously defined) and an alignment strategy were chosen using subsets 1 and 2:
– Semi-global aligner
– (1,-2,-1,0) and (1,-2,-10,0) score systems
The classic semi-global aligner was compared to sim4, Spidey and est_genome on both subsets 3 and 4
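To make the difference between the two score systems concrete, the sketch below scores a fixed toy alignment under each of them. It assumes the four numbers are (match, mismatch, gap opening, gap extension), which the slides do not state, and that each gap pays the opening penalty once plus the extension penalty per gap column. The function name and the toy sequences are hypothetical.

```python
def score_alignment(row_a, row_b, system, gap="-"):
    """Score two aligned rows under an assumed
    (match, mismatch, gap_open, gap_extend) score system."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    match, mismatch, gap_open, gap_extend = system
    score = 0
    in_gap_a = in_gap_b = False  # is a gap currently open in row a / row b?
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot have a gap in both rows")
        if a == gap:
            # Pay the opening penalty only on the first column of the gap.
            score += gap_extend + (0 if in_gap_a else gap_open)
            in_gap_a, in_gap_b = True, False
        elif b == gap:
            score += gap_extend + (0 if in_gap_b else gap_open)
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


if __name__ == "__main__":
    # Toy example: six matching bases around one eight-column gap in the cDNA row.
    gene_row = "ACGT" + "TTTTTTTT" + "AC"
    cdna_row = "ACGT" + "--------" + "AC"
    for system in [(1, -2, -1, 0), (1, -2, -10, 0)]:
        print(system, score_alignment(gene_row, cdna_row, system))
```

Because the last value in both systems is 0, the length of a gap does not affect the score under this reading; only the number of gaps does, and the second system penalizes each one ten times more heavily.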
18. Conclusions
The classic semi-global algorithm produces good results
– Running time is a problem, although it can be improved
sim4 produces the best results among the external software tested