SlideShare a Scribd company logo
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
DOI:10.5121/ijcsa.2015.5407 73
SBVRLDNACOMP:AN EFFECTIVE DNA
SEQUENCE COMPRESSION ALGORITHM
Subhankar Roy1
,Akash Bhagot2
,Kumari Annapurna Sharma2
and Sunirmal
Khatua3
1
Department of Computer Science and Engineering Academy of Technology, G. T. Road,
Aedconagar, Hooghly-712121, W.B., India
2
Master of Computer Application, Academy of Technology, G. T. Road, Aedconagar,
Hooghly-712121, W.B., India
3
Department of Computer Science and Engineer, University of Calcutta, 92 A.P.C. Road,
Kolkata-700009, India
ABSTRACT
There are plenty specific types of data which are needed to compress for easy storage and to reduce overall
retrieval times. Moreover, compressed sequence can be used to understand similarities between biological
sequences. DNA data compression challenge has become a major task for many researchers for the last
few years as a result of exponential increase of produced sequences in gene databases. In this research
paper we have attempt to develop an algorithm by self-reference bases; namely Single Base Variable
Repeat Length DNA Compression (SBVRLDNAComp). There are a number of reference based compression
methods but they are not satisfactory for forthcoming new species. SBVRLDNAComp is an optimal solution
of the result obtained from small to long, uniform identical and non-identical string of nucleotides checked
in four different ways. Both exact repetitive and non-repetitive bases are compressed by SBVRLDNAComp.
The sound part of it is without any reference database SBVRLDNAComp achieves 1.70 to 1.73 compression
ratio α after testing on ten benchmark DNA sequences. The compressed file can be further compressed with
standard tools (such as WinZip or WinRar) but even without this SBVRLDNAComp outperforms many
standard DNA compression algorithms.
KEYWORDS
DNA; Redundancy; Reference Base; Optimized Exact Repeat; Non-Repetition; LZ77; and Compression
Ratio.
1.INTRODUCTION
The size of the genome database is increasing annually with a great speed. Each day several
thousand nucleotides are sequenced in the labs. From 1982 to the present, the number of bases in
GenBank has doubled approximately every 18 months. It is found that in Dec 1982, the number
of bases and the number of sequence records were 680338 and 606 respectively for GenBank and
none for WGS (Whole Genome Shotgun) and with Release 129 in Apr 2002, the number of bases
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
74
and the number of sequence records were 19072679701 and 16769983 respectively for GenBank
and the WGS had 692266338 number of bases and 172768 number of sequences. Again in the
Release 206 in Feb 2015, the number of bases and the number of sequence records were
187893826750 and 181336445 respectively for GenBank and the WGS had 873281414087
numbers of bases and 205465046 numbers of sequences.
High-throughput sequencing technologies [1] make it possible to rapidly acquire large numbers of
individual genomes, which, for a given organism, vary slightly from one to another. Such
repetitive and large sequence collections are a unique challenge for compression. Compressed
data reduce the communication cost and storage cost. Furthermore, compressed sequence can be
used to get the similarities within sequences.
The highly repetitive DNA sequences own some motivating properties [2], [3] which can be
utilize to compress it. As DNA sequences consists of four nucleotides bases A, C, G and T called
exon (i.e. coding regions or protein synthesis) or introns (i.e. non-coding regions or no protein
synthesis) a, c, g and t in frequent cases, two bits is enough to store each base, in spite of this fact,
the standard compression algorithm like “COMPRESS”, “GZIP”, “BZIP2”, “WinZip” or
“WinRar” uses more than 2 bits per base [4]. Even both static and adaptive Huffman’s code fails
badly on DNA sequences because the probabilities of occurrence of these symbols are not very
different. In this paper we focus on compression of this particular data. DNA is a double stranded
molecule with neighboring strands connected through hydrogen bonding between the bases. This
hydrogen bonding is quite specific with Thymine (T/t) on one strand pairing with Adenine (A/a)
on the other strand and Guanine (G/g) on one strand pairing with Cytosine (C/c) on the other and
vice versa. All compression algorithms compress only one strand.
These behaviors are primary to substantial expansion in the size of DNA data sets, and are
providing opportunities for novel compression techniques that take advantage of the
characteristics of these new data. Our aspiration is to discover mechanisms for detecting this
redundancy and use it in compression by searching optimal level of similarity within individual’s
sequence.
The majority approach of compressing genomic data is to interpret the difference between the
newly data that should be compressed with a reference sequence and then find out the differences
[5], [6], [7]. This will be competent and possible when dealing with species that have a valid
reference but is less satisfactory when approaching new species data due to the lack of a standard
reference genome; for example, RNA sequencing data should be aligned against the entire
transcriptome and is not feasible to account each possible substitute transcript splicing.
DNA sequences in higher eukaryotes contain many repetitive nucleotides and have several copies
and also the Genes duplicate themselves for evolutionary purposes. To analyze, these sequences
have to be stored. So, these facts conclude that DNA sequences should be compressed. Human
DNA almost has 3 billion bases and among then more than 99% are the same in all human [8],
[9]. Data compression reveals certain theoretical ideas such as entropy, mutual information and
complexity between sequences of different genomes.
The most common and is very simple form of DNA compression is just binary encode the DNA
sequence bases using 2 bit for each nucleotides i.e. by replacing A/a, C/c, G/g and T/t with “00”,
“01”, “10” and “11” chosen abruptly. But in practically just by binary encoding the sequence, we
can cut the file size near about 75% with slide more than 2 bit per base [10]. The advantage of
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
75
this method is that the file can be easily parsed without needing complicated compression
algorithms, but it is not satisfactory because it does not use any property within a sequence.
Traditionally, DNA data compression methods usually compress based on the different properties
such as complementary string, complementary palindrome string or reverse complements, cross-
chromosomal similarity, approximate repeat, direct repeat etc. [11] of DNA sequence. However a
slight change in properties may give worst result. The preliminary summit of SBVRLDNAComp
based on repeat checked in four different ways, which have been applied to each DNA sequence.
SBVRLDNAComp encode both exact repetitive and non-repetitive parts without using exact
Lempel-Ziv based compression algorithms or order-2 arithmetic encoding for former and later
one respectively which are very common. So a sequence which is not so redundant also gives
good compression ratio. SBVRLDNAComp permits independent compression and
decompression of individual sequences.
The study is organized as follows: The description of a number of other specialized DNA
compression algorithms (Section 2), the SBVRLDNAComp algorithm (Section 3), and
experimental results (Section 4) are followed by concluding remarks of methods and their affect
on compression (Section 5).
2.RELATED WORK
All Genome compression algorithms utilize redundancy within the sequence, but vary greatly in
the way they do so. In general, compression algorithms can be classified into Naive Bit
Manipulation, Dictionary-based, Statistical and Referential Algorithms.
Two best known lossless compression algorithms are LZ77 [12] and PPM [13]. PPM predicts the
probability distribution of the next symbol based on all previously observed symbols. Both
approaches follow sequential processing to encode a string. They test to see if the current
substring has been seen formerly and, if so, encode it by reference to the earlier happening. A
sliding window is maintained for newly observed text. If the text exceeds this window size, it
cannot be used in compression. The application of LZ77 algorithm are gzip, 7-zip etc.
Compression algorithm using exact repeats are begin with BioCompress [14], BioCompress-2
[15], Cfact [16], Off-Line [17], DNASC [18] and B2DNR [19] use the common characteristic of
a sequence.
Initial DNA compression algorithm based on exact repeat proposed by S. Grumbach and F. Tahi
BioCompress search the regularities, such as the presence of palindromes. Although obtained
result is not satisfactory but better than the existing general purpose compression technique.
Extended version of BioCompress is BioCompress-2; later one is based on LZ77. It searches for
longest exact repeats or longest palindrome or reverse complement in already encoded sequences,
then encodes that repeats by repeat length l and the first position p of preceding repeat appeared
i.e. a pair of integers (l, p); when no repetition is found it uses order-2 arithmetic coding. The
difference between Biocompress and Biocompress-2 is the addition of order-2 arithmetic coding
in the later one.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
76
Cfact is a two phase algorithm executes sequentially. The parsing phase obtains the longest
repeated factors using a suffix tree data structure. The encoding phase compresses first
occurrences of repetitive segments and all non repetitive segments using 2-bit method. The
repeated segment is replaced by a pointer in the form of (pos, len) tuples. But the suffix tree
formation for large data set is not possible in memory.
Off-Line compression algorithm approach is quite similar to Cfact. It uses a suffix tree to find
out the exact repeated substring. But unlike Cfact it use augmented suffix tree which reduces the
time and space complexities to O(n log2
n) and O(n log n) from both O(n2
), where n is the number
of bases in a sequence. It encodes most frequent non-overlapping substrings of a sequence. The
bpc of Off-line is 1.97, which is not better than any DNA specific compression algorithm but it is
a general purpose compression algorithm.
Most of the DNA compression techniques considered frequent occurred of bases i.e. A, C, G and
T. But DNASC have taken one of the infrequent occasion nucleotides N; which can be either A,
C, G or T with equal probability. It is used to compress both DNA and RNA the former one can
be converted to later just by replacing T with U. Bases are compressed by first horizontally and
then vertically. In vertical process it follows LZ style representation of nucleotides with window
size 128 bases i.e. 1024 bits and block size of 6 digits as a combination of 2 i.e. 21 bits using
extended LZ style. Compression of the next block with respect to the current block is done by one
of the 22 ways of redundancy.
Two algorithms B2 and B2DNR considering frequent and all infrequent bases. The rare
characters are {K, M, R, S, W, Y, B, D, H, V, N}. They have formed nucleic acid sequences
fragments of {A, C, G, T} and {K, M, R, S, W, Y, B, D, H, V, N} into 152
=255 combinations and
then converting them into 255 ASCII characters out of 256. For repeat count in the later method
they have used 9 characters from digit ‘1’ to ‘9’. If repeat is greater than 9; then it recounts the
repeats.
Algorithms using approximate repeat detection, start with GenCompress [20], followed by
DNACompress [21], DNAPack [22], GeNML [23] and DNAEcompress [24] show that even
improved results can be achieved by exploiting the approximate nature of the repeated regions in
DNA sequences.
GenCompress based on LZ77. GenCompress uses both approximate repeats and reverse
complements and also uses reverse complements that contain errors. It considers three standard
edit operations Replace, Insert and Delete for approximate matching. There are two versions of
GenCompress: GenCompress-1 uses the Hamming distance i.e. searches for approximate repeats
with replacement or substitution operations only, and GenCompress-2 uses edit distance that
searches for approximate repeats based on the operation insert and delete. This algorithm is able
to detect more properties in DNAsequences, such as mutation and crossover. In addition, they use
arithmetic coding of order 2 if no significant repetition is found i.e. for non-profit encoding by
these properties.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
77
DNACompress algorithm compresses a sequence in two phases. In the first phase it finds all
approximate repeats with highest score including complemented palindromes using a software
tool called Pattern Hunter [25] and in the second it encode approximate repeat regions and non-
repeat regions. It encodes those approximate repeats that give profits on overall compression.
DNAPack compression algorithm compresses both the repeat segments and the non-repeat
segments. It detects the long approximate repeats and the approximate complementary
palindrome repeats using dynamic programming. Both GenCompress and DNACompress use the
greedy approach for selection of the repeat segments. DNAPack used Hamming distance, i.e. the
approximation is only done by substitution. The non-copied regions are encoded by the best
choice from an Order-2 Arithmetic Coding, Context Tree Weighting Coding (CTW) and naive 2
bits per symbol methods.
The GeNML algorithm split the sequence into fixed size blocks. It encodes the block by
maximum likelihood model. GeNML combines both substitution and statistical styles. An inexact
repeat is encoded using a pointer to an earlier instance of the subsequence followed by
substitution, insertion or deletion operation. In compare with the above three algorithms; it
produced better compression results than using approximate repeat.
DNAEcompress compression algorithm for DNA sequence uses three standard edit operations;
replacement, insertion and delete which are extended to five operations. These are
Complementary replace define by crep (pos), Insert string represented by inss (pos; str), Delete
string i.e. dels (poslen), Exchange expressed as exch (pos1; pos2) and Inversion define by inv
(poslen) respectively. The matched patterns both exact and inexact are encoded by LZ algorithm
and unmatched pattern are order-2 arithmetic encoding. So it is like GenCompress algorithm.
Sequentially lossless compression algorithm such as PPM and the other key family of this
category are the basis of the DNA compression algorithms CDNA [26], CTW+LZ [27], and XM
[28].
The first compression algorithms based on statistical method by detecting the approximate repeat
within DNA sequences is CDNA. It predicts the probability distribution of each nucleotide by
using partial matching of the current context to earlier seen substrings. To measure the inexact
similarity CDNA use Hamming distance.
CTW+LZ is a non-greedy algorithm which searches for exact and approximate repeats; exact and
approximate reverse complements or complementary palindrome using hash table and dynamic
programming. It follows time consuming greedy search to get the longer repeat. LZ77 algorithm
is used to compress long exact or approximate repeats. Short repeats are encoded by order-32
Context Tree Weighting (CTW) and edit operations are encoded by arithmetic coding. It uses
PPM a statistical compression algorithm to predict the probability of next symbol using preceding
symbols.
The best compression algorithm compared to other two recent above algorithms is expert model
(XM). It estimates the probability of recent bases with multiple “experts” but based on PPM. An
example of an expert is an order-k Markov model where k>0. Based on the k preceding symbols
it estimates the probability of recent nucleotides. Once the symbol’s probability distribution is
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
78
determined, it is compressed by using a primary compression algorithm such as arithmetic
coding. Another is a copy expert that gives a probability based on whether the next symbol is part
of a copied region from a particular offset.
There is one important feature that has not observed by all of the above algorithms based on
exact repeat. They have not checked all promising types of exact repeats from very small to
maximum possible and uniform of particular size. Our algorithm overcomes that.
In the following section we clarify SBVRLDNAComp algorithm in details and all the associated
components methods that SBVRLDNAComp invoke; and then experimental results. Comparison
with other algorithm is also enlightened in the result.
3.PROPOSED METHODS
The algorithm SBVRLDNAComp is designed for the compression of DNA. It can also used for
RNA sequence but not for proteins. It encodes a text of characters ∑ = {A, C, G, T or U}. This
algorithm is an optimal solution of four proposed methods compressed any sequence by searching
the repeats in four different ways. It is a sequential compression algorithm. After getting the bits
sequence form Method 1 and Method 2 it compares the bit length and chooses the optimal one
dynamically before going to the subsequent method. All methods allows for access to individual
sequences in the compressed data.
Therefore, Nopt.= Min(N',N'',N''',N''''), Where N', N'', N''' and N'''' are the number of bits by four
proposed method and Nopt., the optimal number of bits before mapping to character. The character
mapped intermediate compressed file is finally compressed by LZ77 [12]. The Fig. 1 shows the
general structure of the SBVRLDNAComp algorithm. Following sub-sections discussed the
components method of SBVRLDNAComp and general algorithm.
Sequence
N', N'', N''' & N''''
Opt. Method Nopt.
Compressed Compression
Sequence (S') Ratio (α)
Fig. 1. SBVRLDNAComp structure
Optimal Method
Searching
DNA Sequence
Apply M1 -> M2 ->
M3 -> M4 Sequentially
Compressed by Optimal Components
Method and LZ77
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
79
3.1.Components methods of SBVRLDNAComp
The compression process by all methods is given below. Decompression methods are just reverse
of compression process. All methods will use a self dynamic reference variable R which stores
the current character i.e. A, C, G or T, a temporary variable T and scanning the sequence
horizontally from left to right and top to bottom. Each module search repeated regions from
different aspect then encode them to their individual logic, however, the non-repeated region and
R are coded just by straightforward 2bit coding rule assuming A/a = 00, C/c = 01, G/g = 10 and
T/t or U/u =11 respectively. Result of each method is a binary stream and the optimal one
mapped to ASCII character from fixed window of size 8. Method 1 (M1) gives profitable
encoding when the segments length is form 4 to 9. Whereas Method 2 (M2) for longest repetition.
Method 3 (M3) act well for 2 successive identical bases throughout the sequence. For uniform
repeats of segment size 3, Method 4 (M4) performed well. In the following sub-sections the
details of each method have been discussed.
3.1.1.Method 1
This method finds the repeated optimal segments length of 1 to 8 characters with respect to the
current reference base R. Control bit 0 i.e. denoted by B0 for repetitive R and bit 1 denoted by B1
for non-repetitive R to distinguish between two, followed by the 2-bits representation of R.
Repeated variable length segment of characters (2 =< length <= 9) are encodes by the relation:
{Sr = r, r = 1, 2, 3, 4, 5, 6, 7, 8}. Three bits coding rules form S1 to S8; for segments of length 2 to
9 in step of 1, are represented by {000, 001, 010, 011, 100, 101, 110 and 111}. The codeword is
B0RS1...r for repetitive segments and B1R for non-repetitive portion.
The total number of bits required to compressed any sequence by M1 is obtained by the following
equation,
N' = n * c0 + ݊′ * c1
Where:
c0 = Number of bits for repetitive bases = (1 + 1 * 2 + 3) = 6,
c1 = Number of bits for non-repetitive bases = (1 + 1 * 2) = 3,
n = Total number of repetition,
݊′= Total number of non-repetition.
Then, N' = 6 * n + 3 *	݊′ = 3(2n+	݊′)
3.1.2 Method 2
For each repetition with respect to R the assigned bit is 1 and for end of repetitive base or bases
bit is 0. It will search for maximum possible repetition. It uses dynamic programming approach.
So if 4 consecutive bases appeared then the codeword is RB1B1B1B1B0 and for non-repetitive R it
is RB0, where R is encoded by 2-bit coding. Then the following equation determines the number
of bits desired by M2,
N'' = n * c0 + ݊′ * c1
Where:
c0 = (1 * 2 + ri + 1) = 3 + ri,
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
80
ri = The number of base repetitions with respect to ith
reference base,
c1 = (1 * 2 + 1) = 3.
Therefore, ܰ′′
= 	݊ ∗ 3 + ෍ ‫ݎ‬௜
௡
௜ୀଵ
+ ݊′ ∗ 3
																													= 	3݊ + ෍ ‫ݎ‬௜
௡
௜ୀଵ
+ 3݊′
																													= 3݊ + ෍ ‫ݎ‬௜
௡
௜ୀଵ
+ 3݊′
																													= ෍ ‫ݎ‬௜
௡
௜ୀଵ
+ 3(݊ + ݊′)
3.1.3.Method 3
Here the assigned bit for each individual repetitive base is 0 and bit 1 for each non-repetitive
base. It follows greedy algorithm. The codeword for identical part is RB0 and RB1 for non-
identical characters. The bits length by M3 follows the equation below,
N''' = n * c0 + ݊′	* c1
Where:
c0 = (1 * 2 + 1) = 3,
c1 = (1 * 2 + 1) = 3,
Hence, N''' = 3n + 3n' = 3(n + n')
3.1.4.Method 4
Divide each sequence into disjoint segments size 3 using divide-and-conquer algorithm. Bit flag 0
for exact three repetitive bases and bit flag 1 for any unmatched base. The codeword of
indistinguishable segments is B0R and for distinguishable segments B1RRR. For the last segments
of length < 3; if any the coding is Rk, where 0 < k < 3. The total number of bits is obtained by the
equation given below,
N'''' = n * c0 + n' * c1+ Rk
Where:
Rk = n'' * 2
n'' = Number of bases (0 < Length < 3),
c0 = (1 * 2 + 1) = 3,
c1 = (3 * 2 + 1) = 7.
So, N'''' = n * c0 + n' * c1+ n'' * 2
= 3n + 7n' + 2n''
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
81
3.2.Combined Method
The basic versions of the each method have been discussed in the previous section. The following
section will discuss the combined edition i.e. SBVRLDNAComp algorithm. The space
complexity is O(n), where n is the number of bases in a sequence. SBVRLDNAComp first
calculates the number of bits needed by each component methods sequentially; then compares
between the results obtained N', N'', N''' and N''''; chooses the optimal one and the corresponding
optimal method suited for a particular sequence. It uses substitutional method; form the optimal
generated bit pattern intermediate file. The number of repeated bases or segments r varies from
method to method. For first component r = 8, but second module r is unpredictable, third one
does not have r dependency and for last part r = 3 respectively. An outline is shown in Fig. 2.
Input:
1: A DNA sequence S
2: Three bits coding rules: {S1, S2, S3, S4, S5, S6, S7 andS8}
3: Flag variable v for repeat B0 and non-repeat B1 segments
4: Two bits coding rule (A – 00, C – 01, G -10 and T - 11)
5: Count bits c
Output:
1: Compressed sequence S'
2: ‘α’ bits/bases (bpb)
Algorithm:
1: Store the current character of remaining S into R
2: Check the next character relative to R, store it to T
3: c ← 0
4: while R != null do
5: if R = T then
6: v = B0
7: else
8: v = B1
9: end if
10: The compressed codes are (B0RS1…r) or (B1R); (RB1…rB0) or (RB0); (RB1) or (RB0); (B0R) or
(B1R1…r) or (R1… n'')
11: Update c according to coding Splice
12: end while
13: The total number of bits c and optimal method for a specific sequence is obtained
14: Compressed by optimal method
15: Use LZ77 for final stage compression
Fig. 2. Outline of the SBVRLDNAComp algorithm.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
82
4. Results
This section evaluates the performance of the SBVRLDNAComp by applying on ten standards
DNA sequences [14], details of these sequences are summarized in Table 1. Although
SBVRLDNAComp performance has been test on 10 standard DNA sequence of small size; this
algorithm can be applied on any DNA or RNA sequence of any size.
In the following Table 2 other concerns of data compression such as compression ratio has been
discussed of different sequences. In best case it takes 1.7009 bpb. The definition of compression
ratio α is the number of characters after compression l divided by the same before compression n.
α = l / n
= l * 8 / n bpb
= (Nopt. / n) bpb
Where,
Nopt. = l * 8
The average bits per base of the proposed algorithm and other existing methods on DNA
sequences are illustrated in Fig. 3, it shows that the SBVRLDNAComp achieves the best
compression ratio among all other algorithms.
The algorithm is implemented by Java6, on a Core 2 Due processor with a 2 GB of RAM and OS
is Fedora 19. The compression ratio considered to be excellent to the best of author’s knowledge.
But it can vary slightly depends on the machine hardware and software.
Tab1e 1. Information of ten standard benchmark DNA sequences.
Sequences name Source File size(KB)
chmpxx Chloroplasts 118.1875
chntxx 152.1914
humdystrop Human 37.8613
humghcsa 64.9365
humhdabcd 57.4844
humhprtb 55.4072
mpomtcg Mitochondria 182.2344
mtpacg 97.9629
hehcmvcg Viruses 223.9785
vaccg 187.2431
Tab1e 2. The compression ratio (bpb) for the ten benchmark DNA sequences after compression.
Sequences
name
Number of
characters before
compression (N)
Number of
characters after
compression (L)
Nopt. Compression
ratio (α)
chmpxx 121024 25746 205971 1.7019
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
83
chntxx 155,844 33549 268395 1.7222
humdystrop 38,770 8277 66215 1.7079
humghcsa 66,495 14367 114937 1.7285
humhdabcd 58,864 12604 100828 1.7129
humhprtb 56,737 12063 96504 1.7009
mpomtcg 186,608 40221 321768 1.7243
mtpacg 100,314 21694 173553 1.7301
hehcmvcg 229,354 49036 392287 1.7104
vaccg 191,737 40993 327947 1.7104
Fig. 3. Average bits per base for DNA compression algorithms
5. Discussions
It is not likely that exactly one compression strategy will be optimal for diverse DNA sequence.
Different experimental results are going to show various bases distributions whereby one
compression strategy can be more efficient than another. We have proposed four new
compressions methods specialized on searching redundant substrings on highly repetitive
sequences. The first one is particularly significant for medium repeat value of r = 9; whereas the
second one is relevant for large r values r > 9, third one is applicable for small r value r = 2 and
the final one stands out on extremely uniform repetitive collections with the small segments of
size r = 3. Depending on the repetition of bases one of the four modules gives extremely good
quality result. So for any type of exact base repeat SBVRLDNABase surpass the other standard
techniques.
2.092
1.784
1.743 1.739 1.725 1.715 1.714 1.714
1.5
1.6
1.7
1.8
1.9
2
2.1
2.2
Average
Compression
Ratio Per…
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
84
Acknowledgements
This work is supported in part by the Bioinformatics Centre of Bose Institute, Computer Science
and Engineering Department of University of Calcutta and Academy of Technology under
WBUT. We thank Dr. Zumur Ghosh and Arijita Sarkar.
References
[1] Kircher M and Kelso J, 2010, High-throughput DNA sequencing – concepts and limitations,
Bioessays, Wiley Online Library, 32, 6, 524–536.
[2] Paula WCP, 2009, An Approach to Multiple DNA Sequences Compression-A thesis submitted in
partial fulfillment of the requirements for the Degree of Master of Philosophy, The Hong Kong
Polytechnic University, Hong Kong.
[3] Shiu HJ, Ng KL, Fang JF, Lee RCT and Huang CH, 2010, Data hiding methods based upon DNA
sequences, Information Sciences, Elsevier, 180, 2196–2208.
[4] Mridula TV and Samuel P, 2011 , Lossless segment based DNA compression, Proceedings of the 3rd
International Conference on Electronics Computer Technology, IEEE Xplore Press, 298- 302.
[5] Kozanitis C, Saunders C, Kruglyak S, Bafna V and Varghese G, 2010, Compressing Genomic
Sequence Fragments Using Slimgene, Research in Comp. Mol. Bio, 6044, 310-324.
[6] Daily K, Rigor P, Christley S, Xie X, and Baldi P, 2010, Data Structures and Compression
Algorithms for High-Throughput Sequencing Technologies, BMC Bioinformatics, 11, 1, article 514.
[7] Fritz MHY, Leinonen R, Cochrane G, and Birney E, 2011, Efficient Storage of High Throughput
DNA Sequencing Data Using Reference-Based Compression, Genome Research, 21, 5, 734-740.
[8] Meyer S, 2010, Signature in the Cell: DNA and the Evidence for Intelligent Design, 1st Edn.,
HarperOne, ISBN-10: 0061472794, 55.
[9] Aly W, Yousuf Band and Zohdy B, 2013, A Deoxyribonucleic acid compression algorithm using
auto-regression and swarm intelligence, Journal of Computer Science, 9, 6, 690-698.
[10] Roy S, Khatua S, Roy S and Bandyopadhyay SK, 2012 , An Efficient Biological Sequence
Compression Technique Using LUT and Repeat in the Sequence, IOSRJCE, 6, 1, 42-50.
[11] Roy S and Khatua S, 2014, DNA DATA COMPRESSION ALGORITHMS BASED ON
REDUNDANCY, IJFCST, 4, 6, 49-58.
[12] Ziv J and Lempel A, 1977, An Universal Algorithm for Sequential Data Compression, IEEE Trans.
Info. Theory, IT-23, 3, 337-343.
[13] Cleary J and Witten I, 1984, Data Compression Using Adaptive Coding and Partial String Matching,
IEEE Trans. Comm., COM-32, 4, 396-402.
[14] Grumbach S and Tahi F, 1993, Compression of DNA sequences, IEEE Symp. on the Data
Compression Conf., DCC-93, Snowbird, UT, 340–350.
[15] Grumbach S and Tahi F, 1994, A new challenge for compression algorithms: genetic sequences, Info.
Process. & Manage, Elsevier, 875-866.
[16] Rivals E, Delahaye J, Dauchet M and Delgrange O, 1996, A Guaranteed Compression Scheme for
Repetitive DNA Sequences, DCC ’96: Proc. Data Compression Conf., 453.
[17] Apostolico A and Lonardi S, 2000, Compression of Biological Sequences by Greedy Off-Line
Textual Substitution, DCC ’00: Proc. Data Compression Conf., pp. 143-152.
[18] Mishra KN, Aaggarwal A, Abdelhadi E and Srivastava PC, 2010, An Efficient Horizontal and
Vertical Method for Online DNA Sequence Compression, IJCA, 3, 1, 39-46.
[19] Roy S and Khatua S, 2013, Compression Algorithm for all Specified Bases in Nucleic Acid
Sequences, IJCA, 75, 4, 29-34.
[20] Chen X, Kwong S and Li M, 2001, A Compression Algorithm for DNA Sequences, Using
Approximate Matching for Better Compression Ratio to Reveal the True Characteristics of DNA,
IEEE Engg. in Med. and Bio., 61-66.
International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015
85
[21] Chen X, Li M, Ma B, and Tromp J, 2002, DNACompress: Fast and Effective DNA Sequence
Compression, Bioinformatics, 18, 1696-1698.
[22] Behzadi B and Fessant FL, 2005, DNA Compression Challenge Revisited: A Dynamic Programming
Approach, CPM ’05: Proc. 16th Ann. Symp. Comb. Pattern Matching, 190- 200.
[23] Korodi G and Tabus I, 2005, An Efficient Normalized Maximum Likelihood Algorithm for DNA
Sequence
Compression”, ACM Trans. Information Systems, 23, 1, 3-34.
[24] TAN L, SUN J and XIONG W, 2012, A Compression Algorithm for DNA Sequence Using Extended
Operations, Journal of Computational Information Systems, 8,18, 7685–7691.
[25] Ma B, Tromp J and Li M, 2002, PatternHunter—faster and more sensitive homology search,
Bioinformatics, 18, 440–445.
[26] Loewenstern D and Yianilos P, 1997, Significantly Lower Entropy Estimates for Natural DNA
Sequences, DCC ’97: Proc. Data Comp. Conf., 151.
[27] Matsumoto T, Sadakane K and Imai H, 2000, Biological Sequence Compression Algorithms, Genome
Informatics, 43–52.
[28] Cao MD, Dix T, Allison L and Mears C, 2007, A Simple Statistical Algorithm for Biological
Sequence Compression, DCC ’07: Proc. Data Comp. Conf., 43-52.
Authors
Subhankar Roy is currently an Assistant Professor in the Department of Computer
Science & Engineering, Academy of Technology, India. He has received his B. Tech and
M. Tech degree in Computer Science and Engineering both from University of Calcutta,
India in 2010 and 2012 respectively. His research interests are in the areas of
Bioinformatics and compression techniques. He is a member of IEEE.
Akash Bhagat is presently working as Faculty at Mahendra Educational Pvt. Ltd.,
Asansol, India. He has received his MCA degree from Academy of Technology, India in
2015. His research interests include Bioinformatics and DNA data compression
techniques.
Kumari Annapurna Sharma is presently working as Project Engineer at WIPRO-Project
SHELL, India. He has received his MCA degree from Academy of Technology, India in
2015. His research interests include Bioinformatics and DNA data compression
techniques.
Sunirmal Khatua is currently an Assistant Professor in the Department of Computer
Science and Engineering, University of Calcutta, India. He has received the M.E. degree in
Computer Science and Engineering from Jadavpur University, India in 2006. He is also
pursuing his PhD in Cloud Computing from Jadavpur University. His research interests
are in the areas of Distributed Computing, Cloud Computing, Bioinformatics, and Sensor
Networks. He is a member of IEEE.

More Related Content

What's hot

2015 12-09 nmdd
2015 12-09 nmdd2015 12-09 nmdd
2015 12-09 nmdd
Karin Lagesen
 
CRISPR-Cas9 Review: A potential tool for genome editing
CRISPR-Cas9 Review: A potential tool for genome editingCRISPR-Cas9 Review: A potential tool for genome editing
CRISPR-Cas9 Review: A potential tool for genome editing
Davient Bala
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goal
Joachim Jacob
 
Gene editing application for cancer therapeutics
Gene editing application for cancer therapeuticsGene editing application for cancer therapeutics
Gene editing application for cancer therapeutics
Nur Farrah Dini
 
presentation
presentationpresentation
presentation
Debit Ahmed
 
A new revisited compression technique through innovative partition group binary
A new revisited compression technique through innovative partition group binaryA new revisited compression technique through innovative partition group binary
A new revisited compression technique through innovative partition group binary
IAEME Publication
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
mikaelhuss
 
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Candy Smellie
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
Manjappa Ganiger
 
Genome editing
Genome editingGenome editing
Genome editing
Satrupa Das
 
The Efficiency and Ethics of the CRISPR System in Human Embryos
The Efficiency and Ethics of the CRISPR System in Human EmbryosThe Efficiency and Ethics of the CRISPR System in Human Embryos
The Efficiency and Ethics of the CRISPR System in Human Embryos
Stephen Cranwell
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 
Smoc2 overexpression on fibrosis
Smoc2 overexpression on fibrosisSmoc2 overexpression on fibrosis
Smoc2 overexpression on fibrosis
Md Tabassum Hossain Emon
 
RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
Jennifer Shelton
 
Translating Genomes | Personalizing Medicine
Translating Genomes | Personalizing MedicineTranslating Genomes | Personalizing Medicine
Translating Genomes | Personalizing Medicine
Candy Smellie
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
BITS
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
Gunnar Rätsch
 
CRISPR presentation extended Mouse Modeling
CRISPR presentation extended Mouse ModelingCRISPR presentation extended Mouse Modeling
CRISPR presentation extended Mouse Modeling
Tristan Kempston
 
CRISPR: Gene editing for everyone
CRISPR: Gene editing for everyoneCRISPR: Gene editing for everyone
CRISPR: Gene editing for everyone
Candy Smellie
 

What's hot (19)

2015 12-09 nmdd
2015 12-09 nmdd2015 12-09 nmdd
2015 12-09 nmdd
 
CRISPR-Cas9 Review: A potential tool for genome editing
CRISPR-Cas9 Review: A potential tool for genome editingCRISPR-Cas9 Review: A potential tool for genome editing
CRISPR-Cas9 Review: A potential tool for genome editing
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goal
 
Gene editing application for cancer therapeutics
Gene editing application for cancer therapeuticsGene editing application for cancer therapeutics
Gene editing application for cancer therapeutics
 
presentation
presentationpresentation
presentation
 
A new revisited compression technique through innovative partition group binary
A new revisited compression technique through innovative partition group binaryA new revisited compression technique through innovative partition group binary
A new revisited compression technique through innovative partition group binary
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
 
Genome editing
Genome editingGenome editing
Genome editing
 
The Efficiency and Ethics of the CRISPR System in Human Embryos
The Efficiency and Ethics of the CRISPR System in Human EmbryosThe Efficiency and Ethics of the CRISPR System in Human Embryos
The Efficiency and Ethics of the CRISPR System in Human Embryos
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Smoc2 overexpression on fibrosis
Smoc2 overexpression on fibrosisSmoc2 overexpression on fibrosis
Smoc2 overexpression on fibrosis
 
RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
 
Translating Genomes | Personalizing Medicine
Translating Genomes | Personalizing MedicineTranslating Genomes | Personalizing Medicine
Translating Genomes | Personalizing Medicine
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
 
CRISPR presentation extended Mouse Modeling
CRISPR presentation extended Mouse ModelingCRISPR presentation extended Mouse Modeling
CRISPR presentation extended Mouse Modeling
 
CRISPR: Gene editing for everyone
CRISPR: Gene editing for everyoneCRISPR: Gene editing for everyone
CRISPR: Gene editing for everyone
 

Viewers also liked

An approach to decrease dimentions of logical
An approach to decrease dimentions of logicalAn approach to decrease dimentions of logical
An approach to decrease dimentions of logical
ijcsa
 
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
ijcsa
 
Effects of missing observations on
Effects of missing observations onEffects of missing observations on
Effects of missing observations on
ijcsa
 
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
ijcsa
 
Dart 2
Dart 2Dart 2
Dart 2
srohr
 
The Boathouse
The  BoathouseThe  Boathouse
The Boathouse
Tom Aikins
 
Politique communale liégeoise en matière de radiation des registres de la pop...
Politique communale liégeoise en matière de radiation des registres de la pop...Politique communale liégeoise en matière de radiation des registres de la pop...
Politique communale liégeoise en matière de radiation des registres de la pop...
Michel Péters
 
Process
ProcessProcess
Process
zafirahmed
 
Edwarsiella
EdwarsiellaEdwarsiella
Dart 1
Dart 1Dart 1
Dart 1
srohr
 
Galleria Village
Galleria VillageGalleria Village
Galleria Village
srohr
 
Omni Orlando
Omni OrlandoOmni Orlando
Omni Orlando
srohr
 
Urban Reserve
Urban ReserveUrban Reserve
Urban Reserve
srohr
 
Omaha Hilton
Omaha HiltonOmaha Hilton
Omaha Hilton
srohr
 
Salman Profile
Salman ProfileSalman Profile
Salman Profile
Carachi
 
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
ijcsa
 

Viewers also liked (17)

An approach to decrease dimentions of logical
An approach to decrease dimentions of logicalAn approach to decrease dimentions of logical
An approach to decrease dimentions of logical
 
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
SUCCESSIVE LINEARIZATION SOLUTION OF A BOUNDARY LAYER CONVECTIVE HEAT TRANSFE...
 
Effects of missing observations on
Effects of missing observations onEffects of missing observations on
Effects of missing observations on
 
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
COVERAGE OPTIMIZED AND TIME EFFICIENT LOCAL SEARCH BETWEENNESS ROUTING FOR HE...
 
Dart 2
Dart 2Dart 2
Dart 2
 
The Boathouse
The  BoathouseThe  Boathouse
The Boathouse
 
Politique communale liégeoise en matière de radiation des registres de la pop...
Politique communale liégeoise en matière de radiation des registres de la pop...Politique communale liégeoise en matière de radiation des registres de la pop...
Politique communale liégeoise en matière de radiation des registres de la pop...
 
Process
ProcessProcess
Process
 
alat tradisi
alat tradisialat tradisi
alat tradisi
 
Edwarsiella
EdwarsiellaEdwarsiella
Edwarsiella
 
Dart 1
Dart 1Dart 1
Dart 1
 
Galleria Village
Galleria VillageGalleria Village
Galleria Village
 
Omni Orlando
Omni OrlandoOmni Orlando
Omni Orlando
 
Urban Reserve
Urban ReserveUrban Reserve
Urban Reserve
 
Omaha Hilton
Omaha HiltonOmaha Hilton
Omaha Hilton
 
Salman Profile
Salman ProfileSalman Profile
Salman Profile
 
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA...
 

Similar to SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM

Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
ijfcstjournal
 
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELODE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
bioejjournal
 
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELODE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
bioejjournal
 
Analysis of Genomic and Proteomic Sequence Using Fir Filter
Analysis of Genomic and Proteomic Sequence Using Fir FilterAnalysis of Genomic and Proteomic Sequence Using Fir Filter
Analysis of Genomic and Proteomic Sequence Using Fir Filter
IJMER
 
Final doc of dna
Final  doc of dnaFinal  doc of dna
Final doc of dna
syed Farhan Rizvi
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing Research
Tanmay Ghai
 
DNA sequencing
DNA sequencing  DNA sequencing
DNA sequencing
Nawfal Aldujaily
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene Expression
Aashish Patel
 
50320130403003 2
50320130403003 250320130403003 2
50320130403003 2
IAEME Publication
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
Long Pei
 
Crypt Sequence DNA
Crypt Sequence DNACrypt Sequence DNA
Crypt Sequence DNA
IOSR Journals
 
CRISPR and cardiovascular diseases.pdf
CRISPR and cardiovascular diseases.pdfCRISPR and cardiovascular diseases.pdf
CRISPR and cardiovascular diseases.pdf
Ramachandra Barik
 
3302 3305
3302 33053302 3305
New generation Sequencing
New generation Sequencing New generation Sequencing
New generation Sequencing
Vijay Raj Yanamala
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
RapportHicham
RapportHichamRapportHicham
RapportHicham
Hicham Janati
 
Nanobiology mid term exam (mesele)
Nanobiology mid term exam (mesele)Nanobiology mid term exam (mesele)
Nanobiology mid term exam (mesele)
Mesele Tilahun
 
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencingGene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
mahimachoudhary0807
 
Illumina sequencing introduction
Illumina sequencing introductionIllumina sequencing introduction
Illumina sequencing introduction
University of Allahabad
 
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema
 

Similar to SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM (20)

Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
 
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELODE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
 
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELODE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
 
Analysis of Genomic and Proteomic Sequence Using Fir Filter
Analysis of Genomic and Proteomic Sequence Using Fir FilterAnalysis of Genomic and Proteomic Sequence Using Fir Filter
Analysis of Genomic and Proteomic Sequence Using Fir Filter
 
Final doc of dna
Final  doc of dnaFinal  doc of dna
Final doc of dna
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing Research
 
DNA sequencing
DNA sequencing  DNA sequencing
DNA sequencing
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene Expression
 
50320130403003 2
50320130403003 250320130403003 2
50320130403003 2
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Crypt Sequence DNA
Crypt Sequence DNACrypt Sequence DNA
Crypt Sequence DNA
 
CRISPR and cardiovascular diseases.pdf
CRISPR and cardiovascular diseases.pdfCRISPR and cardiovascular diseases.pdf
CRISPR and cardiovascular diseases.pdf
 
3302 3305
3302 33053302 3305
3302 3305
 
New generation Sequencing
New generation Sequencing New generation Sequencing
New generation Sequencing
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
RapportHicham
RapportHichamRapportHicham
RapportHicham
 
Nanobiology mid term exam (mesele)
Nanobiology mid term exam (mesele)Nanobiology mid term exam (mesele)
Nanobiology mid term exam (mesele)
 
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencingGene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
 
Illumina sequencing introduction
Illumina sequencing introductionIllumina sequencing introduction
Illumina sequencing introduction
 
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
 

Recently uploaded

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 

Recently uploaded (20)

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 

SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM

  • 1. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 DOI:10.5121/ijcsa.2015.5407 73 SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM Subhankar Roy1 ,Akash Bhagot2 ,Kumari Annapurna Sharma2 and Sunirmal Khatua3 1 Department of Computer Science and Engineering Academy of Technology, G. T. Road, Aedconagar, Hooghly-712121, W.B., India 2 Master of Computer Application, Academy of Technology, G. T. Road, Aedconagar, Hooghly-712121, W.B., India 3 Department of Computer Science and Engineer, University of Calcutta, 92 A.P.C. Road, Kolkata-700009, India ABSTRACT There are plenty specific types of data which are needed to compress for easy storage and to reduce overall retrieval times. Moreover, compressed sequence can be used to understand similarities between biological sequences. DNA data compression challenge has become a major task for many researchers for the last few years as a result of exponential increase of produced sequences in gene databases. In this research paper we have attempt to develop an algorithm by self-reference bases; namely Single Base Variable Repeat Length DNA Compression (SBVRLDNAComp). There are a number of reference based compression methods but they are not satisfactory for forthcoming new species. SBVRLDNAComp is an optimal solution of the result obtained from small to long, uniform identical and non-identical string of nucleotides checked in four different ways. Both exact repetitive and non-repetitive bases are compressed by SBVRLDNAComp. The sound part of it is without any reference database SBVRLDNAComp achieves 1.70 to 1.73 compression ratio α after testing on ten benchmark DNA sequences. The compressed file can be further compressed with standard tools (such as WinZip or WinRar) but even without this SBVRLDNAComp outperforms many standard DNA compression algorithms. KEYWORDS DNA; Redundancy; Reference Base; Optimized Exact Repeat; Non-Repetition; LZ77; and Compression Ratio. 1.INTRODUCTION The size of the genome database is increasing annually with a great speed. Each day several thousand nucleotides are sequenced in the labs. From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months. It is found that in Dec 1982, the number of bases and the number of sequence records were 680338 and 606 respectively for GenBank and none for WGS (Whole Genome Shotgun) and with Release 129 in Apr 2002, the number of bases
  • 2. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 74 and the number of sequence records were 19072679701 and 16769983 respectively for GenBank and the WGS had 692266338 number of bases and 172768 number of sequences. Again in the Release 206 in Feb 2015, the number of bases and the number of sequence records were 187893826750 and 181336445 respectively for GenBank and the WGS had 873281414087 numbers of bases and 205465046 numbers of sequences. High-throughput sequencing technologies [1] make it possible to rapidly acquire large numbers of individual genomes, which, for a given organism, vary slightly from one to another. Such repetitive and large sequence collections are a unique challenge for compression. Compressed data reduce the communication cost and storage cost. Furthermore, compressed sequence can be used to get the similarities within sequences. The highly repetitive DNA sequences own some motivating properties [2], [3] which can be utilize to compress it. As DNA sequences consists of four nucleotides bases A, C, G and T called exon (i.e. coding regions or protein synthesis) or introns (i.e. non-coding regions or no protein synthesis) a, c, g and t in frequent cases, two bits is enough to store each base, in spite of this fact, the standard compression algorithm like “COMPRESS”, “GZIP”, “BZIP2”, “WinZip” or “WinRar” uses more than 2 bits per base [4]. Even both static and adaptive Huffman’s code fails badly on DNA sequences because the probabilities of occurrence of these symbols are not very different. In this paper we focus on compression of this particular data. DNA is a double stranded molecule with neighboring strands connected through hydrogen bonding between the bases. This hydrogen bonding is quite specific with Thymine (T/t) on one strand pairing with Adenine (A/a) on the other strand and Guanine (G/g) on one strand pairing with Cytosine (C/c) on the other and vice versa. All compression algorithms compress only one strand. These behaviors are primary to substantial expansion in the size of DNA data sets, and are providing opportunities for novel compression techniques that take advantage of the characteristics of these new data. Our aspiration is to discover mechanisms for detecting this redundancy and use it in compression by searching optimal level of similarity within individual’s sequence. The majority approach of compressing genomic data is to interpret the difference between the newly data that should be compressed with a reference sequence and then find out the differences [5], [6], [7]. This will be competent and possible when dealing with species that have a valid reference but is less satisfactory when approaching new species data due to the lack of a standard reference genome; for example, RNA sequencing data should be aligned against the entire transcriptome and is not feasible to account each possible substitute transcript splicing. DNA sequences in higher eukaryotes contain many repetitive nucleotides and have several copies and also the Genes duplicate themselves for evolutionary purposes. To analyze, these sequences have to be stored. So, these facts conclude that DNA sequences should be compressed. Human DNA almost has 3 billion bases and among then more than 99% are the same in all human [8], [9]. Data compression reveals certain theoretical ideas such as entropy, mutual information and complexity between sequences of different genomes. The most common and is very simple form of DNA compression is just binary encode the DNA sequence bases using 2 bit for each nucleotides i.e. by replacing A/a, C/c, G/g and T/t with “00”, “01”, “10” and “11” chosen abruptly. But in practically just by binary encoding the sequence, we can cut the file size near about 75% with slide more than 2 bit per base [10]. The advantage of
  • 3. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 75 this method is that the file can be easily parsed without needing complicated compression algorithms, but it is not satisfactory because it does not use any property within a sequence. Traditionally, DNA data compression methods usually compress based on the different properties such as complementary string, complementary palindrome string or reverse complements, cross- chromosomal similarity, approximate repeat, direct repeat etc. [11] of DNA sequence. However a slight change in properties may give worst result. The preliminary summit of SBVRLDNAComp based on repeat checked in four different ways, which have been applied to each DNA sequence. SBVRLDNAComp encode both exact repetitive and non-repetitive parts without using exact Lempel-Ziv based compression algorithms or order-2 arithmetic encoding for former and later one respectively which are very common. So a sequence which is not so redundant also gives good compression ratio. SBVRLDNAComp permits independent compression and decompression of individual sequences. The study is organized as follows: The description of a number of other specialized DNA compression algorithms (Section 2), the SBVRLDNAComp algorithm (Section 3), and experimental results (Section 4) are followed by concluding remarks of methods and their affect on compression (Section 5). 2.RELATED WORK All Genome compression algorithms utilize redundancy within the sequence, but vary greatly in the way they do so. In general, compression algorithms can be classified into Naive Bit Manipulation, Dictionary-based, Statistical and Referential Algorithms. Two best known lossless compression algorithms are LZ77 [12] and PPM [13]. PPM predicts the probability distribution of the next symbol based on all previously observed symbols. Both approaches follow sequential processing to encode a string. They test to see if the current substring has been seen formerly and, if so, encode it by reference to the earlier happening. A sliding window is maintained for newly observed text. If the text exceeds this window size, it cannot be used in compression. The application of LZ77 algorithm are gzip, 7-zip etc. Compression algorithm using exact repeats are begin with BioCompress [14], BioCompress-2 [15], Cfact [16], Off-Line [17], DNASC [18] and B2DNR [19] use the common characteristic of a sequence. Initial DNA compression algorithm based on exact repeat proposed by S. Grumbach and F. Tahi BioCompress search the regularities, such as the presence of palindromes. Although obtained result is not satisfactory but better than the existing general purpose compression technique. Extended version of BioCompress is BioCompress-2; later one is based on LZ77. It searches for longest exact repeats or longest palindrome or reverse complement in already encoded sequences, then encodes that repeats by repeat length l and the first position p of preceding repeat appeared i.e. a pair of integers (l, p); when no repetition is found it uses order-2 arithmetic coding. The difference between Biocompress and Biocompress-2 is the addition of order-2 arithmetic coding in the later one.
  • 4. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 76 Cfact is a two phase algorithm executes sequentially. The parsing phase obtains the longest repeated factors using a suffix tree data structure. The encoding phase compresses first occurrences of repetitive segments and all non repetitive segments using 2-bit method. The repeated segment is replaced by a pointer in the form of (pos, len) tuples. But the suffix tree formation for large data set is not possible in memory. Off-Line compression algorithm approach is quite similar to Cfact. It uses a suffix tree to find out the exact repeated substring. But unlike Cfact it use augmented suffix tree which reduces the time and space complexities to O(n log2 n) and O(n log n) from both O(n2 ), where n is the number of bases in a sequence. It encodes most frequent non-overlapping substrings of a sequence. The bpc of Off-line is 1.97, which is not better than any DNA specific compression algorithm but it is a general purpose compression algorithm. Most of the DNA compression techniques considered frequent occurred of bases i.e. A, C, G and T. But DNASC have taken one of the infrequent occasion nucleotides N; which can be either A, C, G or T with equal probability. It is used to compress both DNA and RNA the former one can be converted to later just by replacing T with U. Bases are compressed by first horizontally and then vertically. In vertical process it follows LZ style representation of nucleotides with window size 128 bases i.e. 1024 bits and block size of 6 digits as a combination of 2 i.e. 21 bits using extended LZ style. Compression of the next block with respect to the current block is done by one of the 22 ways of redundancy. Two algorithms B2 and B2DNR considering frequent and all infrequent bases. The rare characters are {K, M, R, S, W, Y, B, D, H, V, N}. They have formed nucleic acid sequences fragments of {A, C, G, T} and {K, M, R, S, W, Y, B, D, H, V, N} into 152 =255 combinations and then converting them into 255 ASCII characters out of 256. For repeat count in the later method they have used 9 characters from digit ‘1’ to ‘9’. If repeat is greater than 9; then it recounts the repeats. Algorithms using approximate repeat detection, start with GenCompress [20], followed by DNACompress [21], DNAPack [22], GeNML [23] and DNAEcompress [24] show that even improved results can be achieved by exploiting the approximate nature of the repeated regions in DNA sequences. GenCompress based on LZ77. GenCompress uses both approximate repeats and reverse complements and also uses reverse complements that contain errors. It considers three standard edit operations Replace, Insert and Delete for approximate matching. There are two versions of GenCompress: GenCompress-1 uses the Hamming distance i.e. searches for approximate repeats with replacement or substitution operations only, and GenCompress-2 uses edit distance that searches for approximate repeats based on the operation insert and delete. This algorithm is able to detect more properties in DNAsequences, such as mutation and crossover. In addition, they use arithmetic coding of order 2 if no significant repetition is found i.e. for non-profit encoding by these properties.
  • 5. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 77 DNACompress algorithm compresses a sequence in two phases. In the first phase it finds all approximate repeats with highest score including complemented palindromes using a software tool called Pattern Hunter [25] and in the second it encode approximate repeat regions and non- repeat regions. It encodes those approximate repeats that give profits on overall compression. DNAPack compression algorithm compresses both the repeat segments and the non-repeat segments. It detects the long approximate repeats and the approximate complementary palindrome repeats using dynamic programming. Both GenCompress and DNACompress use the greedy approach for selection of the repeat segments. DNAPack used Hamming distance, i.e. the approximation is only done by substitution. The non-copied regions are encoded by the best choice from an Order-2 Arithmetic Coding, Context Tree Weighting Coding (CTW) and naive 2 bits per symbol methods. The GeNML algorithm split the sequence into fixed size blocks. It encodes the block by maximum likelihood model. GeNML combines both substitution and statistical styles. An inexact repeat is encoded using a pointer to an earlier instance of the subsequence followed by substitution, insertion or deletion operation. In compare with the above three algorithms; it produced better compression results than using approximate repeat. DNAEcompress compression algorithm for DNA sequence uses three standard edit operations; replacement, insertion and delete which are extended to five operations. These are Complementary replace define by crep (pos), Insert string represented by inss (pos; str), Delete string i.e. dels (poslen), Exchange expressed as exch (pos1; pos2) and Inversion define by inv (poslen) respectively. The matched patterns both exact and inexact are encoded by LZ algorithm and unmatched pattern are order-2 arithmetic encoding. So it is like GenCompress algorithm. Sequentially lossless compression algorithm such as PPM and the other key family of this category are the basis of the DNA compression algorithms CDNA [26], CTW+LZ [27], and XM [28]. The first compression algorithms based on statistical method by detecting the approximate repeat within DNA sequences is CDNA. It predicts the probability distribution of each nucleotide by using partial matching of the current context to earlier seen substrings. To measure the inexact similarity CDNA use Hamming distance. CTW+LZ is a non-greedy algorithm which searches for exact and approximate repeats; exact and approximate reverse complements or complementary palindrome using hash table and dynamic programming. It follows time consuming greedy search to get the longer repeat. LZ77 algorithm is used to compress long exact or approximate repeats. Short repeats are encoded by order-32 Context Tree Weighting (CTW) and edit operations are encoded by arithmetic coding. It uses PPM a statistical compression algorithm to predict the probability of next symbol using preceding symbols. The best compression algorithm compared to other two recent above algorithms is expert model (XM). It estimates the probability of recent bases with multiple “experts” but based on PPM. An example of an expert is an order-k Markov model where k>0. Based on the k preceding symbols it estimates the probability of recent nucleotides. Once the symbol’s probability distribution is
  • 6. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 78 determined, it is compressed by using a primary compression algorithm such as arithmetic coding. Another is a copy expert that gives a probability based on whether the next symbol is part of a copied region from a particular offset. There is one important feature that has not observed by all of the above algorithms based on exact repeat. They have not checked all promising types of exact repeats from very small to maximum possible and uniform of particular size. Our algorithm overcomes that. In the following section we clarify SBVRLDNAComp algorithm in details and all the associated components methods that SBVRLDNAComp invoke; and then experimental results. Comparison with other algorithm is also enlightened in the result. 3.PROPOSED METHODS The algorithm SBVRLDNAComp is designed for the compression of DNA. It can also used for RNA sequence but not for proteins. It encodes a text of characters ∑ = {A, C, G, T or U}. This algorithm is an optimal solution of four proposed methods compressed any sequence by searching the repeats in four different ways. It is a sequential compression algorithm. After getting the bits sequence form Method 1 and Method 2 it compares the bit length and chooses the optimal one dynamically before going to the subsequent method. All methods allows for access to individual sequences in the compressed data. Therefore, Nopt.= Min(N',N'',N''',N''''), Where N', N'', N''' and N'''' are the number of bits by four proposed method and Nopt., the optimal number of bits before mapping to character. The character mapped intermediate compressed file is finally compressed by LZ77 [12]. The Fig. 1 shows the general structure of the SBVRLDNAComp algorithm. Following sub-sections discussed the components method of SBVRLDNAComp and general algorithm. Sequence N', N'', N''' & N'''' Opt. Method Nopt. Compressed Compression Sequence (S') Ratio (α) Fig. 1. SBVRLDNAComp structure Optimal Method Searching DNA Sequence Apply M1 -> M2 -> M3 -> M4 Sequentially Compressed by Optimal Components Method and LZ77
  • 7. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 79 3.1.Components methods of SBVRLDNAComp The compression process by all methods is given below. Decompression methods are just reverse of compression process. All methods will use a self dynamic reference variable R which stores the current character i.e. A, C, G or T, a temporary variable T and scanning the sequence horizontally from left to right and top to bottom. Each module search repeated regions from different aspect then encode them to their individual logic, however, the non-repeated region and R are coded just by straightforward 2bit coding rule assuming A/a = 00, C/c = 01, G/g = 10 and T/t or U/u =11 respectively. Result of each method is a binary stream and the optimal one mapped to ASCII character from fixed window of size 8. Method 1 (M1) gives profitable encoding when the segments length is form 4 to 9. Whereas Method 2 (M2) for longest repetition. Method 3 (M3) act well for 2 successive identical bases throughout the sequence. For uniform repeats of segment size 3, Method 4 (M4) performed well. In the following sub-sections the details of each method have been discussed. 3.1.1.Method 1 This method finds the repeated optimal segments length of 1 to 8 characters with respect to the current reference base R. Control bit 0 i.e. denoted by B0 for repetitive R and bit 1 denoted by B1 for non-repetitive R to distinguish between two, followed by the 2-bits representation of R. Repeated variable length segment of characters (2 =< length <= 9) are encodes by the relation: {Sr = r, r = 1, 2, 3, 4, 5, 6, 7, 8}. Three bits coding rules form S1 to S8; for segments of length 2 to 9 in step of 1, are represented by {000, 001, 010, 011, 100, 101, 110 and 111}. The codeword is B0RS1...r for repetitive segments and B1R for non-repetitive portion. The total number of bits required to compressed any sequence by M1 is obtained by the following equation, N' = n * c0 + ݊′ * c1 Where: c0 = Number of bits for repetitive bases = (1 + 1 * 2 + 3) = 6, c1 = Number of bits for non-repetitive bases = (1 + 1 * 2) = 3, n = Total number of repetition, ݊′= Total number of non-repetition. Then, N' = 6 * n + 3 * ݊′ = 3(2n+ ݊′) 3.1.2 Method 2 For each repetition with respect to R the assigned bit is 1 and for end of repetitive base or bases bit is 0. It will search for maximum possible repetition. It uses dynamic programming approach. So if 4 consecutive bases appeared then the codeword is RB1B1B1B1B0 and for non-repetitive R it is RB0, where R is encoded by 2-bit coding. Then the following equation determines the number of bits desired by M2, N'' = n * c0 + ݊′ * c1 Where: c0 = (1 * 2 + ri + 1) = 3 + ri,
  • 8. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 80 ri = The number of base repetitions with respect to ith reference base, c1 = (1 * 2 + 1) = 3. Therefore, ܰ′′ = ݊ ∗ 3 + ෍ ‫ݎ‬௜ ௡ ௜ୀଵ + ݊′ ∗ 3 = 3݊ + ෍ ‫ݎ‬௜ ௡ ௜ୀଵ + 3݊′ = 3݊ + ෍ ‫ݎ‬௜ ௡ ௜ୀଵ + 3݊′ = ෍ ‫ݎ‬௜ ௡ ௜ୀଵ + 3(݊ + ݊′) 3.1.3.Method 3 Here the assigned bit for each individual repetitive base is 0 and bit 1 for each non-repetitive base. It follows greedy algorithm. The codeword for identical part is RB0 and RB1 for non- identical characters. The bits length by M3 follows the equation below, N''' = n * c0 + ݊′ * c1 Where: c0 = (1 * 2 + 1) = 3, c1 = (1 * 2 + 1) = 3, Hence, N''' = 3n + 3n' = 3(n + n') 3.1.4.Method 4 Divide each sequence into disjoint segments size 3 using divide-and-conquer algorithm. Bit flag 0 for exact three repetitive bases and bit flag 1 for any unmatched base. The codeword of indistinguishable segments is B0R and for distinguishable segments B1RRR. For the last segments of length < 3; if any the coding is Rk, where 0 < k < 3. The total number of bits is obtained by the equation given below, N'''' = n * c0 + n' * c1+ Rk Where: Rk = n'' * 2 n'' = Number of bases (0 < Length < 3), c0 = (1 * 2 + 1) = 3, c1 = (3 * 2 + 1) = 7. So, N'''' = n * c0 + n' * c1+ n'' * 2 = 3n + 7n' + 2n''
  • 9. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 81 3.2.Combined Method The basic versions of the each method have been discussed in the previous section. The following section will discuss the combined edition i.e. SBVRLDNAComp algorithm. The space complexity is O(n), where n is the number of bases in a sequence. SBVRLDNAComp first calculates the number of bits needed by each component methods sequentially; then compares between the results obtained N', N'', N''' and N''''; chooses the optimal one and the corresponding optimal method suited for a particular sequence. It uses substitutional method; form the optimal generated bit pattern intermediate file. The number of repeated bases or segments r varies from method to method. For first component r = 8, but second module r is unpredictable, third one does not have r dependency and for last part r = 3 respectively. An outline is shown in Fig. 2. Input: 1: A DNA sequence S 2: Three bits coding rules: {S1, S2, S3, S4, S5, S6, S7 andS8} 3: Flag variable v for repeat B0 and non-repeat B1 segments 4: Two bits coding rule (A – 00, C – 01, G -10 and T - 11) 5: Count bits c Output: 1: Compressed sequence S' 2: ‘α’ bits/bases (bpb) Algorithm: 1: Store the current character of remaining S into R 2: Check the next character relative to R, store it to T 3: c ← 0 4: while R != null do 5: if R = T then 6: v = B0 7: else 8: v = B1 9: end if 10: The compressed codes are (B0RS1…r) or (B1R); (RB1…rB0) or (RB0); (RB1) or (RB0); (B0R) or (B1R1…r) or (R1… n'') 11: Update c according to coding Splice 12: end while 13: The total number of bits c and optimal method for a specific sequence is obtained 14: Compressed by optimal method 15: Use LZ77 for final stage compression Fig. 2. Outline of the SBVRLDNAComp algorithm.
  • 10. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 82 4. Results This section evaluates the performance of the SBVRLDNAComp by applying on ten standards DNA sequences [14], details of these sequences are summarized in Table 1. Although SBVRLDNAComp performance has been test on 10 standard DNA sequence of small size; this algorithm can be applied on any DNA or RNA sequence of any size. In the following Table 2 other concerns of data compression such as compression ratio has been discussed of different sequences. In best case it takes 1.7009 bpb. The definition of compression ratio α is the number of characters after compression l divided by the same before compression n. α = l / n = l * 8 / n bpb = (Nopt. / n) bpb Where, Nopt. = l * 8 The average bits per base of the proposed algorithm and other existing methods on DNA sequences are illustrated in Fig. 3, it shows that the SBVRLDNAComp achieves the best compression ratio among all other algorithms. The algorithm is implemented by Java6, on a Core 2 Due processor with a 2 GB of RAM and OS is Fedora 19. The compression ratio considered to be excellent to the best of author’s knowledge. But it can vary slightly depends on the machine hardware and software. Tab1e 1. Information of ten standard benchmark DNA sequences. Sequences name Source File size(KB) chmpxx Chloroplasts 118.1875 chntxx 152.1914 humdystrop Human 37.8613 humghcsa 64.9365 humhdabcd 57.4844 humhprtb 55.4072 mpomtcg Mitochondria 182.2344 mtpacg 97.9629 hehcmvcg Viruses 223.9785 vaccg 187.2431 Tab1e 2. The compression ratio (bpb) for the ten benchmark DNA sequences after compression. Sequences name Number of characters before compression (N) Number of characters after compression (L) Nopt. Compression ratio (α) chmpxx 121024 25746 205971 1.7019
  • 11. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 83 chntxx 155,844 33549 268395 1.7222 humdystrop 38,770 8277 66215 1.7079 humghcsa 66,495 14367 114937 1.7285 humhdabcd 58,864 12604 100828 1.7129 humhprtb 56,737 12063 96504 1.7009 mpomtcg 186,608 40221 321768 1.7243 mtpacg 100,314 21694 173553 1.7301 hehcmvcg 229,354 49036 392287 1.7104 vaccg 191,737 40993 327947 1.7104 Fig. 3. Average bits per base for DNA compression algorithms 5. Discussions It is not likely that exactly one compression strategy will be optimal for diverse DNA sequence. Different experimental results are going to show various bases distributions whereby one compression strategy can be more efficient than another. We have proposed four new compressions methods specialized on searching redundant substrings on highly repetitive sequences. The first one is particularly significant for medium repeat value of r = 9; whereas the second one is relevant for large r values r > 9, third one is applicable for small r value r = 2 and the final one stands out on extremely uniform repetitive collections with the small segments of size r = 3. Depending on the repetition of bases one of the four modules gives extremely good quality result. So for any type of exact base repeat SBVRLDNABase surpass the other standard techniques. 2.092 1.784 1.743 1.739 1.725 1.715 1.714 1.714 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 Average Compression Ratio Per…
  • 12. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 84 Acknowledgements This work is supported in part by the Bioinformatics Centre of Bose Institute, Computer Science and Engineering Department of University of Calcutta and Academy of Technology under WBUT. We thank Dr. Zumur Ghosh and Arijita Sarkar. References [1] Kircher M and Kelso J, 2010, High-throughput DNA sequencing – concepts and limitations, Bioessays, Wiley Online Library, 32, 6, 524–536. [2] Paula WCP, 2009, An Approach to Multiple DNA Sequences Compression-A thesis submitted in partial fulfillment of the requirements for the Degree of Master of Philosophy, The Hong Kong Polytechnic University, Hong Kong. [3] Shiu HJ, Ng KL, Fang JF, Lee RCT and Huang CH, 2010, Data hiding methods based upon DNA sequences, Information Sciences, Elsevier, 180, 2196–2208. [4] Mridula TV and Samuel P, 2011 , Lossless segment based DNA compression, Proceedings of the 3rd International Conference on Electronics Computer Technology, IEEE Xplore Press, 298- 302. [5] Kozanitis C, Saunders C, Kruglyak S, Bafna V and Varghese G, 2010, Compressing Genomic Sequence Fragments Using Slimgene, Research in Comp. Mol. Bio, 6044, 310-324. [6] Daily K, Rigor P, Christley S, Xie X, and Baldi P, 2010, Data Structures and Compression Algorithms for High-Throughput Sequencing Technologies, BMC Bioinformatics, 11, 1, article 514. [7] Fritz MHY, Leinonen R, Cochrane G, and Birney E, 2011, Efficient Storage of High Throughput DNA Sequencing Data Using Reference-Based Compression, Genome Research, 21, 5, 734-740. [8] Meyer S, 2010, Signature in the Cell: DNA and the Evidence for Intelligent Design, 1st Edn., HarperOne, ISBN-10: 0061472794, 55. [9] Aly W, Yousuf Band and Zohdy B, 2013, A Deoxyribonucleic acid compression algorithm using auto-regression and swarm intelligence, Journal of Computer Science, 9, 6, 690-698. [10] Roy S, Khatua S, Roy S and Bandyopadhyay SK, 2012 , An Efficient Biological Sequence Compression Technique Using LUT and Repeat in the Sequence, IOSRJCE, 6, 1, 42-50. [11] Roy S and Khatua S, 2014, DNA DATA COMPRESSION ALGORITHMS BASED ON REDUNDANCY, IJFCST, 4, 6, 49-58. [12] Ziv J and Lempel A, 1977, An Universal Algorithm for Sequential Data Compression, IEEE Trans. Info. Theory, IT-23, 3, 337-343. [13] Cleary J and Witten I, 1984, Data Compression Using Adaptive Coding and Partial String Matching, IEEE Trans. Comm., COM-32, 4, 396-402. [14] Grumbach S and Tahi F, 1993, Compression of DNA sequences, IEEE Symp. on the Data Compression Conf., DCC-93, Snowbird, UT, 340–350. [15] Grumbach S and Tahi F, 1994, A new challenge for compression algorithms: genetic sequences, Info. Process. & Manage, Elsevier, 875-866. [16] Rivals E, Delahaye J, Dauchet M and Delgrange O, 1996, A Guaranteed Compression Scheme for Repetitive DNA Sequences, DCC ’96: Proc. Data Compression Conf., 453. [17] Apostolico A and Lonardi S, 2000, Compression of Biological Sequences by Greedy Off-Line Textual Substitution, DCC ’00: Proc. Data Compression Conf., pp. 143-152. [18] Mishra KN, Aaggarwal A, Abdelhadi E and Srivastava PC, 2010, An Efficient Horizontal and Vertical Method for Online DNA Sequence Compression, IJCA, 3, 1, 39-46. [19] Roy S and Khatua S, 2013, Compression Algorithm for all Specified Bases in Nucleic Acid Sequences, IJCA, 75, 4, 29-34. [20] Chen X, Kwong S and Li M, 2001, A Compression Algorithm for DNA Sequences, Using Approximate Matching for Better Compression Ratio to Reveal the True Characteristics of DNA, IEEE Engg. in Med. and Bio., 61-66.
  • 13. International Journal on Computational Science & Applications (IJCSA) Vol.5, No.4, August 2015 85 [21] Chen X, Li M, Ma B, and Tromp J, 2002, DNACompress: Fast and Effective DNA Sequence Compression, Bioinformatics, 18, 1696-1698. [22] Behzadi B and Fessant FL, 2005, DNA Compression Challenge Revisited: A Dynamic Programming Approach, CPM ’05: Proc. 16th Ann. Symp. Comb. Pattern Matching, 190- 200. [23] Korodi G and Tabus I, 2005, An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression”, ACM Trans. Information Systems, 23, 1, 3-34. [24] TAN L, SUN J and XIONG W, 2012, A Compression Algorithm for DNA Sequence Using Extended Operations, Journal of Computational Information Systems, 8,18, 7685–7691. [25] Ma B, Tromp J and Li M, 2002, PatternHunter—faster and more sensitive homology search, Bioinformatics, 18, 440–445. [26] Loewenstern D and Yianilos P, 1997, Significantly Lower Entropy Estimates for Natural DNA Sequences, DCC ’97: Proc. Data Comp. Conf., 151. [27] Matsumoto T, Sadakane K and Imai H, 2000, Biological Sequence Compression Algorithms, Genome Informatics, 43–52. [28] Cao MD, Dix T, Allison L and Mears C, 2007, A Simple Statistical Algorithm for Biological Sequence Compression, DCC ’07: Proc. Data Comp. Conf., 43-52. Authors Subhankar Roy is currently an Assistant Professor in the Department of Computer Science & Engineering, Academy of Technology, India. He has received his B. Tech and M. Tech degree in Computer Science and Engineering both from University of Calcutta, India in 2010 and 2012 respectively. His research interests are in the areas of Bioinformatics and compression techniques. He is a member of IEEE. Akash Bhagat is presently working as Faculty at Mahendra Educational Pvt. Ltd., Asansol, India. He has received his MCA degree from Academy of Technology, India in 2015. His research interests include Bioinformatics and DNA data compression techniques. Kumari Annapurna Sharma is presently working as Project Engineer at WIPRO-Project SHELL, India. He has received his MCA degree from Academy of Technology, India in 2015. His research interests include Bioinformatics and DNA data compression techniques. Sunirmal Khatua is currently an Assistant Professor in the Department of Computer Science and Engineering, University of Calcutta, India. He has received the M.E. degree in Computer Science and Engineering from Jadavpur University, India in 2006. He is also pursuing his PhD in Cloud Computing from Jadavpur University. His research interests are in the areas of Distributed Computing, Cloud Computing, Bioinformatics, and Sensor Networks. He is a member of IEEE.