International Journal in Foundations of Computer Science & Technology (IJFCST), Vol.4, No.6, November 2014 
DNA DATA COMPRESSION ALGORITHMS BASED ON 
REDUNDANCY 
Subhankar Roy1 and Sunirmal Khatua2 
1Department of Computer Science and Engineering, Academy of Technology, G. T. 
Road, Aedconagar, Hooghly-712121, W.B., India 
2Department of Computer Science and Engineering, University of Calcutta, 92, A.P.C. Road, 
Kolkata-700009, India 
ABSTRACT 
Carl Jung spoke of a 'collective unconscious', i.e. we are all connected to each other in some way or
another via our DNA. In most cases a DNA sequence contains four bases: a (Adenine), c (Cytosine), g
(Guanine) and t (Thymine). Each of these bases can be represented by two bits, since 2^2 = 4, e.g. a – 00,
c – 01, g – 11 and t – 10, although this assignment is arbitrary. With so small an alphabet, redundancy
within a sequence is likely to exist. In this paper we therefore explore different types of repeat to
compress DNA: direct repeats, palindromes or reverse direct repeats, inverted exact repeats (complementary
palindromes or exact reverse complements), inverted approximate repeats (approximate complementary
palindromes or approximate reverse complements), interspersed or dispersed repeats, and flanking or
terminal repeats. Better compression reduces transmission time over a network and saves storage space.
KEYWORDS 
Direct repeats, Inverted repeats, Complementary palindrome, Interspersed repeats and Flanking repeats. 
1. INTRODUCTION 
Today, more and more DNA sequences are becoming available [1]. Information about DNA
sequences is stored in molecular biology databases. The size and importance of these databases
grow with time; therefore this information must be stored and communicated efficiently
[2]. Furthermore, sequence compression can be used to measure similarity between biological
sequences. A practical way to handle this growth is to encode these sequences compactly. The
compression ratio achieved depends on the DNA sequences themselves [3].
However, standard text compression software such as GZIP [4] cannot compress DNA sequences: it
uses more than two bits per symbol, which is worse than the trivial two-bit encoding. There are
reasons for this. Such software is designed mainly for English text, while
the regularities in DNA sequences are much subtler, and these tools do not make use of the special
characteristics of a DNA sequence. For example, it is well known that DNA sequences
have very long-term correlation [5], in that subsequences in different regions of a DNA
sequence, of the same genome or of different genomes, are quite similar to each other. State-of-the-art
DNA encoding schemes rely on exploiting this long-term correlation. In particular, similarity
within the DNA sequence is searched for so that similar subsequences in different fragments can be
encoded with reference to an earlier occurrence.
Thus DNA sequences pose an important new challenge for the study of compression algorithms.
DOI:10.5121/ijfcst.2014.4605
Eclipse uses Cp1252 as its default encoding, whereas UTF-8 can represent every character in
Unicode (2^16 = 65,536 code points in its 16-bit Basic Multilingual Plane alone) and so covers a much
larger set of recognized symbols. The console in Eclipse uses the default encoding of the OS, i.e.
the one inherited from the container (Cp1252). The size of each
character in a 16-bit Unicode encoding is two bytes, whereas an ASCII character occupies one byte. When
compression is done on a DNA sequence or on plain text, each line break in a file, i.e. a carriage
return ('\r') followed by a line feed ('\n'), takes four bytes in the 16-bit encoding. So if a text file
contains many lines, a compressed file written with two-byte characters gives a poor compression factor. In a
one-byte ASCII-based code (2^8 = 256 values), each character is one byte, so the same line break
takes two bytes. 
If the number of lines is 1000, line breaks alone take 1000*4 = 4000 bytes with two-byte characters
and 1000*2 = 2000 bytes with ASCII, i.e. half the size. 
That is why most efficient compression algorithms are implemented in C rather than in Java:
Java's char type is a two-byte Unicode (UTF-16) unit, whereas C's char is a single byte. 
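The byte counts above can be checked with a short Python sketch. It is only an illustration: the function name line_break_overhead is hypothetical, and "utf-16-le" is assumed here as a stand-in for the two-byte character representation discussed in the text.

```python
# Byte cost of a carriage return + line feed pair under two encodings.
LINE_BREAK = "\r\n"

def line_break_overhead(num_lines: int, encoding: str) -> int:
    """Bytes spent on line breaks alone for num_lines lines."""
    return num_lines * len(LINE_BREAK.encode(encoding))

print(line_break_overhead(1000, "ascii"))      # one byte per character
print(line_break_overhead(1000, "utf-16-le"))  # two bytes per character
```

For 1000 lines the two calls reproduce the 2000-byte and 4000-byte figures of the text.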
2. LITERATURE REVIEW 
It is true that compressing DNA sequences is a difficult task for general-purpose compression
algorithms, but at the same time, from the viewpoint of compression theory, it is an interesting
subject for understanding the properties of various compression algorithms. 
Dictionary-based methods [6-7] generally use a window of small, fixed width. 
Small windows are efficient on plain text, whose redundancy is limited and local. 
In DNA sequences, however, redundancies may occur at very long distances and
repeated factors can be very long. 
Shannon-Fano [8-9] and Huffman coding [8-9] do not give a good compression factor for DNA
data. Huffman's code fails badly on DNA sequences in both the static and adaptive
models [8-9], because there are only four kinds of symbol in DNA sequences and their probabilities of
occurrence are close. Both methods use statistical modeling to create variable-length
codes for the bases in a DNA sequence. A base with a higher probability of occurrence gets
a shorter binary code, and every code has an integral number of bits. Code assignment is done by
building a binary tree called the Huffman tree. The most frequent bases in a DNA sequence are four in
number: a, c, g and t. Let us consider part of a sequence,
“actggtcgatgtacacacataggaaccaacccccaaaaa … accat”; Table 1 shows the base frequencies in that
sequence. 
Table 1. Bases and their frequencies in a sequence 
Nucleotides Frequencies 
a 100 
c 100 
g 50 
t 50 
If we build the Huffman tree from these symbols, the binary codes for the four symbols
are as shown in Table 2. After g and t are merged, three subtrees tie at weight 100, so two trees are
equally valid; both codes are listed.
Table 2. The Huffman code table of nucleotides 
Nucleotides   Codes (tree 1 / tree 2)
a             0 / 00
c             10 / 01
g             110 / 10
t             111 / 11
With the first code, the total number of bits required is
100*1 + 100*2 + 50*3 + 50*3 = 100 + 200 + 150 + 150 = 600 bits,
and with the second code it is
100*2 + 100*2 + 50*2 + 50*2 = 200 + 200 + 100 + 100 = 600 bits.
The simplest two-bit encoding requires the same number of bits. Because Huffman coding
does not consider any properties within a sequence, such as repeats or similarity, it
cannot give a better result. Applying the Shannon-Fano technique again needs 600 bits to
compress the above sequence. 
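The 600-bit total can be reproduced with a small Python sketch of Huffman code-length construction on the frequencies of Table 1. This is a generic textbook construction, not code from the cited works; the function name and the tie-breaking counter are illustrative choices, and a different tie-break may yield the other tree of Table 2 with the same total.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"a": 100, "c": 100, "g": 50, "t": 50}  # frequencies from Table 1
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total_bits)
```

Whichever way the ties are broken, the weighted total is 600 bits, the same as a fixed two-bit code for these 300 bases.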
Concerning compression ratio, Prediction by Partial Matching (PPM) [10], developed by Cleary and
Witten, is one of the best compression algorithms in practice. It is capable of very high
compression rates, encoding English text in as little as 2.2 bits/character. It predicts the next
symbol according to the most recent symbols seen. However, it cannot compress DNA sequences to less
than two bits per symbol either. 
The Context Tree Weighting (CTW) method [11], introduced by Willems, Shtarkov and Tjalkens, can
compress DNA sequences to less than two bits per symbol. It compresses the data by weighting the
predictions of context-tree models. But this algorithm does not use the special structures of
biological sequences. 
DNA sequences are highly redundant because only four bases (a, c, g and t) are frequent, leaving
aside the eleven rare bases. The similarity between subsequences within a
particular sequence is therefore high. If, instead of a single sequence, a number
of sequences of a particular genome are considered, the similarity between those sequences is
higher still. So it is advantageous to compress different chromosomes
together, to take into account both self-chromosomal and cross-chromosomal
similarity [12]. A statistical study reveals that the similarity within a
sequence is 4.5% of the total sequence length, whereas similarity across two chromosomes is 10%, of
which 8% comes from cross-chromosomal similarity and 2% from self-chromosomal
similarity. Among 16 chromosomes it is 18%, of which cross-similarity is 16.8% and the rest is
self-similarity. 
A software tool called PatternHunter is used to search for both self- and cross-chromosomal
similarity. It can find exact repeats, approximate repeats and complementary
palindromes. A compression algorithm using self- and cross-chromosomal similarity is local in
nature, because only chromosomes of a particular genome are considered. It resembles a
dictionary-based encoding algorithm, because each chromosome is compared with the others using
PatternHunter. Some chromosomes serve as references with which target chromosomes are
compared, and the differences are compressed. Pointers record the position
and length of each common region in the reference chromosomes. At decompression
time, the compressed differences and the reference sequences are used to recover the original
sequence. 
Instead of compressing each genome sequence of an organism individually, the entire collection of
genome sequences of that organism is compressed [13-14]. This gives a better saving percentage owing to
the redundant nature of DNA sequences. Each genome sequence of an organism is compared
with another genome sequence of that organism, and only the differences are compressed. The common
regions need not be compressed again. This approach resembles a dictionary-based encoding algorithm,
because each genome sequence is compared with a specific reference genome sequence. 
The compressed data are therefore composed of the reference sequence, the difference sequence and the
locations of the differences, instead of each DNA sequence stored individually. The quality of the
result depends on the similarity between the sequences. A mitochondria
dataset is reported to give a 195-fold compression rate. 
Original sequence = Reference sequence + The differences within the sequence
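The relation above can be illustrated with a deliberately simplified Python sketch. It assumes that reference and target have equal length and differ only by substitutions; the cited methods [13-14] also handle other kinds of difference, and the function names here are hypothetical.

```python
def encode_against_reference(reference: str, target: str):
    """List the (position, base) pairs where target differs from reference."""
    if len(reference) != len(target):
        raise ValueError("this sketch handles substitutions only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

def decode_from_reference(reference: str, differences):
    """Rebuild the target by applying the stored differences to the reference."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)

reference = "attcgtgt"
target = "atttgtgt"
differences = encode_against_reference(reference, target)
print(differences)
print(decode_from_reference(reference, differences) == target)
```

Only the list of differences has to be stored alongside the shared reference, which is why the saving grows with the similarity between sequences.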
The GenBit compression algorithm [15] handles both repetitive and non-repetitive DNA
sequences. The scheme enumerates all possible combinations of a block. The input sequence is
divided into blocks of four bases each. Since each base takes one of four values, a block has
4^4 = 2^8 = 256 possible combinations. Hence every DNA fragment of four bases is
replaced by an 8-bit binary number, e.g. “01010101”. If consecutive fragments are the same,
a “1” is added as a 9th bit; if consecutive fragments are different, a
“0” is added as the 9th bit to the 8-bit number. GenBit Compress is a simple
algorithm that does not use dynamic programming. It takes a DNA sequence of
length n as input and divides it into n/4 fragments. Each leftover base (in a final fragment of
length < 4) is assigned one of four unique 2-bit codes (a = “00”, g = “01”, c = “10”, t = “11”). 
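The block step can be sketched in Python as follows. The text does not say how the 8-bit number of a block is chosen, so this sketch assumes it is the concatenation of the four 2-bit base codes quoted above; the 9th repeat bit is left out because its exact rule is not pinned down here. The function name is hypothetical and this is not the GenBit tool itself.

```python
# 2-bit codes quoted in the text for leftover bases; reused for full blocks
# in this sketch (an assumption, not a documented GenBit rule).
CODE = {"a": "00", "g": "01", "c": "10", "t": "11"}

def encode_blocks(seq: str):
    """Return (8-bit strings for full 4-base blocks, 2-bit strings for leftovers)."""
    n_full = len(seq) // 4 * 4
    blocks = ["".join(CODE[b] for b in seq[i:i + 4]) for i in range(0, n_full, 4)]
    leftovers = [CODE[b] for b in seq[n_full:]]
    return blocks, leftovers

print(encode_blocks("acgtac"))
```

A sequence of length n gives n // 4 eight-bit blocks plus at most three leftover 2-bit codes.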
3. PROPOSED METHODOLOGY 
The general encoding methodology is as follows. Before encoding the next symbol in a DNA sequence
by the mapping technique, the algorithms search for direct repeats, reverse direct repeats and
complementary palindromes, which are explained in detail below. If one is found with the desired
length, the algorithm counts the number of occurrences and represents the repeat by the
mapped symbol followed by the repeat count. These algorithms differ
from existing ones in that they try several compression techniques
and output the result with the fewest bits per base (bpb). 
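The idea of a symbol followed by its repeat count can be sketched in Python as a run-length pass over fixed-size fragments. This is only an illustration under the assumption of fragments of four bases; it is not the FS4DR implementation, and the function name is hypothetical.

```python
def fragment_runs(seq: str, size: int = 4):
    """Collapse consecutive identical fragments of `size` bases into (fragment, count) pairs."""
    runs = []
    for i in range(0, len(seq), size):
        frag = seq[i:i + size]
        if runs and runs[-1][0] == frag:
            runs[-1][1] += 1        # same fragment again: extend the current run
        else:
            runs.append([frag, 1])  # new fragment: start a new run
    return [tuple(r) for r in runs]

print(fragment_runs("atgaatgaatgaatgaatgaatga"))
```

On the 24-base sequence used in Figure 1, the six copies of “atga” collapse into a single pair.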
3.1 Direct exact or approximate repeats in DNA sequence 
Algorithm FS4DR uses this property. 
A repeat in a DNA sequence may be an exact or an approximate match. Approximate matches are
obtained by operations such as substitution (replacement), insertion and deletion. 
An exact match means that two substrings of a sequence consist of identical
bases along their whole length. 
For example, in the subsequence “attcgtgtattcgtgt”, the last eight bases, “attcgtgt”, are an
exact copy of the first eight bases, “attcgtgt”. 
An exact repeat whose copies are adjacent is also called a tandem repeat: a pattern of two or more
nucleotides is repeated and the repetitions are adjacent to each other, either directly or inverted. 
An example is: 
atcgatcgatcgatcg, in which the sequence “atcg” is repeated four times. 
When the number of copies is not known, variable, or irrelevant, the repeat is sometimes called a
variable number tandem repeat (VNTR). The term minisatellite has been used interchangeably with
VNTR. 
When between 10 and 60 nucleotides are repeated, it is called a minisatellite. These occur at more
than 1,000 locations in the human genome. Repeats with shorter units are known as microsatellites or
short tandem repeats (STRs). STRs are also repeated sequences, but their units are usually 2–10
nucleotides long. 
When exactly two nucleotides are repeated, it is called a dinucleotide repeat (for example:
tctctctc…). The microsatellite instability in hereditary nonpolyposis colon cancer most commonly
affects such regions. 
When three nucleotides are repeated, it is called a trinucleotide repeat (for example:
tagtagtagtag…), and abnormalities in such regions can give rise to trinucleotide repeat disorders. 
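The repeating unit of an exact tandem repeat can be found with a short Python sketch. It assumes the whole input is made of adjacent copies of one unit; the function name is hypothetical.

```python
def tandem_unit(seq: str):
    """Return (unit, copies) for the shortest unit whose repetition equals seq."""
    n = len(seq)
    for size in range(1, n + 1):
        # A unit of this size must divide the length and rebuild seq exactly.
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size], n // size

for example in ("atcgatcgatcgatcg", "tctctctc", "tagtagtagtag"):
    print(example, tandem_unit(example))
```

The three inputs are the tandem, dinucleotide and trinucleotide examples given above.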
An approximate match means that two substrings of a sequence consist of
identical bases after one of the following operations is applied along the chosen subsequence. 
Substitution: consider the subsequence “atttgtgtattcgtgt”, whose first eight bases are “atttgtgt”
and whose last eight bases are “attcgtgt”. The second subsequence is obtained from the first
if the 4th base “t” of the first subsequence is substituted (replaced) by “c”. 
Insertion: in the DNA sequence “attcgttattcgtgt”, the second subsequence “attcgtgt” is obtained
from the first one, “attcgtt”, by inserting “g” between its 6th and 7th bases. This is
called an approximate match with insertion. 
Deletion: in the DNA sequence “attcgtgtattcgtt”, the second subsequence “attcgtt” is obtained
from the first one, “attcgtgt”, by deleting its 7th base, “g”. 
A substitution can also be expressed as a combination of an insertion and a deletion. For
example, in “atttgtgtattcgtgt” the second subsequence “attcgtgt” of length
eight can be obtained from the first one, “atttgtgt”, by inserting “c” between the 3rd and 4th bases
and then deleting the base “t” at the 5th position. The same result is obtained by the single
replacement described above. 
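The three operations are those counted by the standard edit (Levenshtein) distance. The following Python sketch applies it to the example pairs above; it is a generic dynamic-programming formulation, not the matching procedure of the proposed algorithms.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("atttgtgt", "attcgtgt"))  # substitution example
print(edit_distance("attcgtt", "attcgtgt"))   # insertion example
print(edit_distance("attcgtgt", "attcgtt"))   # deletion example
```

Each example pair is one operation apart, whereas the insertion-plus-deletion route for the substitution example uses two operations.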
1 atgaatgaatgaatgaatgaatga 24 
1 atga 4 
5 atga 8 
9 atga 12 
13 atga 16 
17 atga 20 
21 atga 24 
Figure 1. An example of direct repeat of four nucleotides. The sequence is a part of DNA sequence. 
3.2 Palindrome or reverse direct repeat, Inverted exact repeat or complementary 
palindrome and inverted approximate repeat or approximate complementary 
palindrome or approximate reverse complement repeats 
Algorithms FS4RDR and FS4CP use these properties. 
In the palindrome sequence “acgttgca”, the fragment “tgca” is obtained from the fragment
“acgt” by reversing it. 
An inverted repeat (also called a reversed repeat, complemented inverted repeat or reverse
complement repeat) is a sequence of bases that is followed further downstream by its reverse
complement. 
The original sequence of nucleotides is written in the 5' to 3' direction, 
e.g. 5'-atgc-3' 
The complementary sequence is written by pairing ‘a’ with ‘t’ and ‘g’ with ‘c’. The
resulting sequence is in the 3' to 5' orientation: 
3'-tacg-5' 
Reversing this new sequence, 3'-tacg-5', gives the complementary sequence in the
original orientation, i.e. 5'-gcat-3'. 
In the sequence “gcatatgc”, the second subsequence “atgc” of length four is the reverse complement
of the first subsequence “gcat” of length four.
If no nucleotides intervene between the subsequence and its downstream
complement, it is called a palindrome. So in the above case the second subsequence is a
complementary palindrome, or inverted exact repeat, of the first subsequence. Inverted repeats also
indicate regions capable of self-complementary base pairing (regions within a single sequence
that can base pair with each other). 
In the sequence “gcttatgc”, the second subsequence “atgc” is only an approximate reverse
complement of the first subsequence “gctt”. The complement of the second
subsequence is “tacg” in the 3' to 5' direction, so in the original 5' to 3' direction it is “gcat”.
If the 3rd base “t” of the reference subsequence “gctt” is replaced by “a”, it becomes identical to
this reverse complement. 
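The exact and approximate cases can be checked with a short Python sketch. The helper names reverse_complement and mismatches are hypothetical, and the approximate case is limited here to substitutions between equal-length fragments.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")  # a<->t, c<->g

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse to restore the 5' to 3' orientation."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(a: str, b: str):
    """1-based positions at which two equal-length fragments differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]

print(reverse_complement("atgc"))                      # exact case
print(mismatches("gctt", reverse_complement("atgc")))  # approximate case
```

An empty mismatch list corresponds to an exact complementary palindrome; for “gctt” the only mismatch is at the 3rd base, as described above.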
1 ctaggatcgatcgatcgatcgatc 24 
ctagctagctagctagctag -> reverse 
Figure 2. An example of palindrome of four nucleotides. The sequence is a part of DNA sequence. 
1 ctgatcagtcagtcagtcagtcagtcag 24 
agtcagtcagtcagtcagtcagtc -> Complement 
ctgactgactgactgactgactga -> Reverse 
Figure 3. An example of complementary palindrome of four nucleotides. The sequence is a part of DNA 
sequence. 
3.3 Interspersed or Dispersed Repeats 
These are copies of repetitive sequences that are interspersed throughout the genome: a genome
sequence contains two or more copies of a specific subsequence, and the two identical (or nearly
identical) copies are sometimes separated by a stretch of non-repeated DNA. 
For example, in the sequence “3' tagtacgttagt 5'”, the last four bases “tagt” are an exact repeat
of the first four bases “tagt”, while the intermediate subsequence “acgt” is a non-repeated
region. Such a repeat can be present in multiple copies in the genome and can be exact or
approximate. 
Interspersed repeats are classified as Short Interspersed Repeats (SIR) and Long Interspersed
Repeats (LIR). 
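Exact interspersed repeats of a fixed length can be located with a simple position index, sketched below in Python. The function name and the choice k = 4 are assumptions for illustration; approximate copies are not handled.

```python
from collections import defaultdict

def repeated_kmers(seq: str, k: int = 4):
    """Map each k-base fragment to its start positions, keeping separated repeats only."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep fragments whose first and last copies have at least one base between them.
    return {kmer: pos for kmer, pos in positions.items()
            if len(pos) > 1 and pos[-1] - pos[0] > k}

print(repeated_kmers("tagtacgttagt"))
```

For the example sequence the only fragment reported is “tagt”, at 0-based positions 0 and 8, with “acgt” lying between the two copies.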
3.4 Flanking Repeats or Terminal Repeat 
These are sequences that are repeated at both ends of a sequence. Direct terminal
repeats are in the same direction, and inverted terminal repeats are opposite to each other in
direction. 
3.5 Complementary Nature of Nucleotide Sequences 
The DNA double strand is complementary in nature. If one strand is known, the other strand can be
formed from it just by replacing ‘a’ with ‘t’ and ‘g’ with ‘c’, and vice versa. The polymer
forms in a unidirectional manner, from the 5' carbon, which is bound to a phosphate group, to the 3'
carbon, which is bound to a hydroxyl group. DNA sequences are often written in a shorthand
notation using the one-letter name for each nucleotide and listing the nucleotides in the 5' to 3'
direction.
3.6 Compression technique 
Original file (e.g. humdystrop.txt) -> Compression technique FS4 -> Compressed file (e.g. humdystrop_cmp.txt) 
Figure 4. The compression technique. 
Figure 5. An example of sequence humdystrop.txt 
3.7 Decompression Technique 
Compressed file (e.g. humdystrop_cmp.txt) -> Decompression technique RFS4 -> Reconstructed file (e.g. humdystrop_rec.txt) 
Figure 6. The decompression technique.
Figure 7. An example of compressed sequence humdystrop_cmp.txt 
4. SIMULATION RESULTS 
Table 3 gives information about the ten standard benchmark genome sequences [17]. Table 4
shows the compressed size in bytes obtained with fragment size four (FS4), fragment size four with
direct repeat (FS4DR), fragment size four with reverse direct repeat (FS4RDR) and fragment size four
with complementary palindrome (FS4CP). RFS4 means reverse fragment size four, the corresponding
decompression technique. Table 5 shows the resulting bits per base (bpb) of our proposed compression
algorithms when compressing each genome sequence separately. 
Table 3. Information on ten standard benchmark DNA sequences. 
Sequence name   Length (bytes)   Source         File size (KB)
chmpxx          121,024          Chloroplasts   118.1875
chntxx          155,844          Chloroplasts   152.1914
humdystrop      38,770           Human          37.8613
humghcsa        66,495           Human          64.9365
humhdabcd       58,864           Human          57.4844
humhprtb        56,737           Human          55.4072
mpomtcg         186,608          Mitochondria   182.2344
mtpacg          100,314          Mitochondria   97.9629
hehcmvcg        229,354          Viruses        223.9785
vaccg           191,737          Viruses        187.2431
Table 4. Compressed size in bytes for each compression technique. The corresponding bits per base
(bpb) are given in Table 5. 
Sequence name   FS4      FS4DR    FS4RDR   FS4CP
chmpxx          31,945   30,045   30,403   30,203
chntxx          40,960   38,960   38,900   38,700
humdystrop      10,006   10,083   10,011   10,020
humghcsa        18,160   16,576   16,776   16,976
humhdabcd       15,295   14,576   14,376   14,376
humhprtb        14,984   14,153   14,053   14,053
mpomtcg         51,499   46,388   46,188   46,088
mtpacg          26,983   25,008   24,908   24,808
hehcmvcg        60,806   56,806   57,306   57,206
vaccg           52,169   47,852   47,352   47,152
Table 5. The bpb by different compression techniques. 
Sequence name   FS4        FS4DR      FS4RDR     FS4CP
chmpxx          2.111647   1.986052   2.009717   1.996497
chntxx          2.102615   1.999949   1.996869   1.986602
humdystrop      2.064689   2.080578   2.065721   2.067578
humghcsa        2.184826   1.994255   2.018317   2.042379
humhdabcd       2.07869    1.980973   1.953792   1.953792
humhprtb        2.112766   1.995594   1.981494   1.981494
mpomtcg         2.207794   1.988682   1.980108   1.975821
mtpacg          2.151883   1.994378   1.986403   1.978428
hehcmvcg        2.120948   1.981426   1.998866   1.995378
vaccg           2.17669    1.900741   1.975706   1.967362
Average bpb     2.131255   1.990263   1.996699   1.994533
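The bpb figures follow from the compressed size in bytes and the sequence length, as bpb = (bytes * 8) / bases. The Python sketch below applies this to the chmpxx row, using its length from Table 3 and its sizes from Table 4; the function name is hypothetical.

```python
def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    """Bits of compressed output spent per base of the original sequence."""
    return compressed_bytes * 8 / num_bases

# chmpxx: 121,024 bases (Table 3); compressed sizes in bytes (Table 4).
chmpxx_sizes = {"FS4": 31_945, "FS4DR": 30_045, "FS4RDR": 30_403, "FS4CP": 30_203}
for method, size in chmpxx_sizes.items():
    print(method, round(bits_per_base(size, 121_024), 6))
```

A value below 2 means the method beats the plain two-bit-per-base encoding for that sequence.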
5. CONCLUSIONS 
Our proposed compression algorithms were tested on ten standard benchmark DNA sequences. 
The experimental results show that the average bpb is less than two bits per base when we compress
these sequences using any of the three simple repeat properties. After applying all techniques, our
algorithm selects the best result and stores it. When none
of the above properties is used, the bpb is approximately 2, because
each nucleotide base is represented by two bits. 
ACKNOWLEDGEMENTS 
This research work is supported in part by the Computer Science and Engineering Department of 
University of Calcutta, Bioinformatics Centre of Bose Institute and Academy of Technology 
under WBUT. We would like to thank Dr. Zumur Ghosh and Asst. Professor Utpal Nandi for 
their valuable suggestions. 
REFERENCES 
[1] Subhankar Roy, Sunirmal Khatua, Sudipta Roy and Prof. Samir K. Bandyopadhyay, “An Efficient 
Biological Sequence Compression Technique Using LUT and Repeat in the Sequence”, IOSRJCE, 
Vol. 6, Issue 1, pp. 42-50, Sep-Oct. 2012. 
[2] Subhankar Roy, Sunirmal Khatua, “Compression Algorithm for all Specified Bases in Nucleic Acid 
Sequences”, IJCA, Vol. 75, No. 4, pp. 29-34, August. 2013. 
[3] Subhankar Roy, Sunirmal Khatua, “Biological Data Compression Methods Based Upon 
Microsatellite”, IJARCSSE, pp. 1321-1326, Vol. 3, Issue 7, July 2013. 
[4] http://www.gzip.org. 
[5] Wu Choi Ping Paula, “An Approach to Multiple DNA Sequences Compression”, The Hong Kong 
Polytechnic University, xiii, 115 p, PolyU Library Call No.: [THS] LG51 .H577M EIE 2009 Wu , 
February 2009.
[6] Shanika Kuruppu, Simon J. Puglisi and Justin Zobel, “Relative Lempel-Ziv Compression of Genomes 
for Large-Scale Storage and Retrieval”, Springer-Verlag Berlin Heidelberg, LNCS 6393, pp. 201–206, 
2010. 
[7] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE Trans. Inform. 
Theory, 23(3):337–343, 1977. 
[8] Khalid Sayood, “Introduction to Data Compression, 3rd ed.”, University of Nebraska, 2006. 
[9] Mark Nelson and Jean-Loup Gailly, “The Data Compression Book, 2nd ed.”, 2012. 
[10] Alistair Moffat, “Implementing the PPM Data Compression Scheme”, IEEE Transactions on 
Communications, Vol. 38, No. 11, November 1990. 
[11] Zijun Wu, “A Tutorial of The Context Tree Weighting Method: Basic Properties”, November 9, 2005. 
[12] Choi-Ping Paula Wu, Ngai-Fong Law and Wan-Chi Siu, “Cross chromosomal similarity for DNA 
sequence compression”, Biomedical Informatics Publishing Group, pp. 412-416, vol. 2, No. 9, July 
14, 2008. 
[13] Hyoung Do Kim and Ju-Han Kim, “DNA Data Compression Based on the Whole Genome 
Sequence”, Journal of Convergence Information Technology, Vol. 4, No. 3, September 2009. 
[14] Heba Afify, Muhammad Islam and Manal Abdel Wahed, “DNA Lossless Differential Compression 
Algorithm Based on Similarity of Genomic Sequence Database”, International Journal of Computer 
Science & Information Technology, pp. 145 – 154, Vol. 3, No. 4, August 2011. 
[15] P.Raja Rajeswari, Dr. Allam AppaRao, “Genbit Compress Tool (GBC): A Java-Based Tool to 
Compress DNA Sequences and Compute Compression Ratio (bits/base) of Genomes”, International 
Journal of Computer Science and Information Technology, pp. 181-191, Vol. 2, No. 3, June 2010. 
[16] Nour S. Bakr and Amr A. Sharawi, “DNA Lossless Compression Algorithms: Review”, American 
Journal of Bioinformatics Research, pp. 72-81, 2013. 
[17] The GeNML homepage: http://www.cs.tut.fi/~tabus/genml/results.html. 
Authors 
Subhankar Roy is currently working as an Assistant Professor in the Department of 
Computer Science & Engineering, Academy of Technology, India. He received his 
B.Tech degree and M.Tech degree in Computer Science and Engineering in 2012, both 
from the University of Calcutta. He is pursuing research at the University of Calcutta 
under the guidance of Mr. Sunirmal Khatua. His research interests include 
Bioinformatics and compression techniques. He is a member of IEEE. 
Sunirmal Khatua is currently an Assistant Professor in the Department of Computer 
Science and Engineering, University of Calcutta, India. He has received the M.E. degree 
in Computer Science and Engineering from Jadavpur University, India in 2006. He is 
also pursuing his PhD in Cloud Computing from Jadavpur University. His research 
interests are in the areas of Distributed Computing, Cloud Computing and Sensor 
Networks. He is a member of IEEE.

ACL読み会2014@PFI "Less Grammar, More Features"
 

DNA data compression algorithms based on redundancy

International Journal in Foundations of Computer Science & Technology (IJFCST), Vol.4, No.6, November 2014
DNA DATA COMPRESSION ALGORITHMS BASED ON REDUNDANCY
Subhankar Roy1 and Sunirmal Khatua2
1Department of Computer Science and Engineering, Academy of Technology, G. T. Road, Aedconagar, Hooghly-712121, W.B., India
2Department of Computer Science and Engineering, University of Calcutta, 92, A.P.C. Road, Kolkata-700009, India

ABSTRACT
Carl Jung spoke of the 'collective unconscious', i.e. we are all connected to each other in some way or the other via our DNA. In most cases there are four bases in DNA: a (Adenine), c (Cytosine), g (Guanine) and t (Thymine). Each of these bases can be represented by two bits, since 2^2 = 4, i.e. a – 00, c – 01, g – 11 and t – 10 respectively, although this assignment is arbitrary. So redundancy within a sequence is more likely to exist. That is why in this paper we explore different types of repeats to compress DNA. These are direct repeats; palindromes or reverse direct repeats; inverted exact repeats, also called complementary palindromes or exact reverse complements; inverted approximate repeats, also called approximate complementary palindromes or approximate reverse complements; interspersed or dispersed repeats; flanking or terminal repeats; etc. Better compression gives better network speed and saves storage space.

KEYWORDS
Direct repeats, Inverted repeats, Complementary palindrome, Interspersed repeats and Flanking repeats.

1. INTRODUCTION
Today, more and more DNA sequences are becoming available [1]. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases are growing with time; therefore this information must be stored or communicated efficiently [2]. Furthermore, sequence compression can be used to define similarities between biological sequences. The only way to handle this situation is to encode these sequences.
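The two-bit representation and the reverse complement mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bit assignment is the one quoted in the abstract, while the function names (`pack`, `unpack`, `reverse_complement`) and the choice to store four bases per byte are assumptions made for the example.

```python
# Illustrative sketch only. The bit assignment follows the mapping quoted in
# the abstract (a=00, c=01, g=11, t=10); all function names are hypothetical.
CODE = {"a": 0b00, "c": 0b01, "g": 0b11, "t": 0b10}
BASE = {bits: base for base, bits in CODE.items()}
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def pack(seq: str) -> bytes:
    """Pack a lowercase a/c/g/t string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from bytes produced by pack."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


def reverse_complement(seq: str) -> str:
    """Reverse the sequence and swap a<->t and c<->g."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

The sequence length has to be stored alongside the packed bytes, because the last byte may be padded. A sequence such as `gaattc` equals its own reverse complement, which is the complementary palindrome case listed in the abstract.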
The achievable compression ratio depends on the DNA sequences themselves [3]. However, standard text compression software such as GZIP [4] cannot compress DNA sequences; it only expands the file to more than two bits per symbol. There are reasons for this. Such software is designed mainly for English text, whereas the regularities in DNA sequences are much subtler, and these tools make no use of the special characteristics of a DNA sequence. For example, it is well known that DNA sequences have very long-term correlation [5], in that subsequences in different regions of a DNA sequence, of the same genome or of different genomes, are quite similar to each other. State-of-the-art DNA encoding schemes rely on exploiting this long-term correlation. In particular, similarity within the DNA sequence is searched for, so that similar subsequences in different fragments can be encoded with reference to an earlier section. DNA sequences thus pose an important new challenge for the study of compression algorithms.
Eclipse uses Cp1252 as its default encoding. UTF-8, by contrast, can represent every character in Unicode and therefore covers a much larger repertoire of recognized symbols (a 16-bit code unit alone allows 2^16 = 65,536 values).
DOI:10.5121/ijfcst.2014.4605
  • 2. The console in Eclipse uses the default encoding of the OS, i.e. the one inherited from the container (Cp1252). In a 16-bit Unicode encoding, such as the one behind Java's char type, each character occupies two bytes, whereas an ASCII character occupies one byte. When a DNA sequence or some plain text is compressed, each line break in the file, i.e. a line feed character (‘\n’) together with a carriage return (‘\r’), then takes four bytes. So if a text file contains many lines, a compressed file that uses Unicode characters gives a poor compression factor. In an 8-bit ASCII-based encoding (2^8 = 256 values) each character is one byte, so the same line break takes two bytes. If the number of lines is 1000, the line breaks cost 1000*4 = 4000 bytes with Unicode and 1000*2 = 2000 bytes with ASCII, i.e. half as much. That is why most efficient compression algorithms are coded in C rather than in Java: Java uses the Unicode character set, whereas C follows the ASCII character set.
2. LITERATURE REVIEW
Compressing DNA sequences is a difficult task for general-purpose compression algorithms, but from the viewpoint of compression theory it is an interesting subject for understanding the properties of various compression algorithms. Dictionary-based methods [6-7] generally use windows of fixed, small width. Small windows are efficient on plain text, whose redundancy is limited and local. In DNA sequences, however, redundancies may occur at very long distances and factors can be very long. Neither Shannon-Fano [8-9] nor Huffman coding [8-9] gives a good compression factor for DNA data.
Huffman coding fails badly on DNA sequences in both the static and the adaptive model [8-9], because DNA sequences contain only four kinds of symbol and their probabilities of occurrence are close. Both Shannon-Fano and Huffman coding follow statistical modelling, creating variable-length codes for the bases in a DNA sequence. A base with a higher probability of occurrence gets a shorter binary code, and every code is an integral number of bits long. Codes are assigned by building a binary tree, called the Huffman tree. The most frequent bases in a DNA sequence are four in number: a, c, g and t. Consider part of a sequence, “actggtcgatgtacacacataggaaccaacccccaaaaa … accat”; Table 1 shows the base frequencies in that sequence.
Table 1. Bases and their frequencies in a sequence
Nucleotides   Frequencies
a             100
c             100
g             50
t             50
Building the Huffman tree from these symbols gives one of the two binary code assignments shown in Table 2.
Table 2. The Huffman code table of nucleotides
Nucleotides   Codes
a             0 / 00
c             10 / 01
g             110 / 10
t             111 / 11
  • 3. The total number of bits required is therefore (100*1 + 100*2 + 50*3 + 50*3) = (100+200+150+150) = 600 bits with the first code, or (100*2 + 100*2 + 50*2 + 50*2) = (200+200+100+100) = 600 bits with the second. The simplest two-bit encoding requires the same number of bits. Because Huffman coding does not consider properties within a sequence such as repeats or similarity, it cannot give a better result. The Shannon-Fano technique likewise needs 600 bits to compress the above file.
In terms of compression ratio, Prediction by Partial Matching (PPM) [10], developed by Cleary and Witten, is one of the best compression algorithms in practice. It is capable of very high compression rates, encoding English text in as little as 2.2 bits/character, and it predicts the next symbol from the most recent symbols seen. However, it cannot compress DNA sequences to less than two bits per symbol either. The Context Tree Weighting (CTW) [11] method, introduced by Willems, Shtarkov and Tjalkens, can compress DNA sequences to less than two bits per symbol. It compresses the data by weighting predictions over a tree of contexts, but it does not use the special structures of biological sequences.
DNA sequences are highly redundant because only four bases, a, c, g and t, occur frequently (setting aside the eleven rare bases). The similarity between subsequences within a particular sequence is therefore high. If, instead of a single sequence, a number of sequences of a particular genome are considered, the similarity between those sequences is higher still. It is thus advantageous to compress different chromosomes together, taking into account both self-chromosomal and cross-chromosomal similarity [12]. A statistical study reveals that the similarity within a sequence is 4.5% of the total sequence length.
By contrast, similarity within two chromosomes is 10%, of which 8% comes from cross-chromosomal similarity and 2% from self-chromosomal similarity. Among 16 chromosomes it is 18%, of which 16.8% is cross-similarity and the rest is self-similarity. A software tool called PatternHunter is used to search for both self- and cross-chromosomal similarity. It can find exact repeats, approximate repeats and complementary palindromes. A compression algorithm based on self- and cross-chromosomal similarity is local in nature, because only the chromosomes of a particular genome are considered. It resembles a dictionary-based encoding algorithm, because each chromosome is compared with the others using PatternHunter. Target chromosomes are compared with reference chromosomes and the differences are compressed. Pointers record the position and the length of each common region in the reference chromosomes. At decompression time, the compressed differences and the reference sequences are used to recover the original sequence.
Instead of compressing each genome sequence of an organism individually, the entire set of genome sequences of an organism can be compressed together [13-14]. This gives a better saving percentage because of the redundant nature of DNA sequences. Each genome sequence of an organism is compared with another genome sequence of that organism and only the differences are compressed; the common regions need not be compressed again. This approach also resembles a dictionary-based encoding algorithm, because each genome sequence is compared with a specific reference genome sequence. The compressed data are therefore composed of the reference sequence, the difference sequences and the locations of the differences, instead of each DNA sequence stored individually. How well this approach works depends on the similarity between the sequences. A mitochondria dataset is reported to reach a 195-fold compression rate.
Original sequence = Reference sequences + The difference within sequence
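The relation above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the target and the reference have the same length and differ only by substitutions; practical tools must also cope with insertions and deletions. The function names `diff_against_reference` and `reconstruct` are hypothetical and are not taken from the cited methods [13-14].

```python
def diff_against_reference(reference: str, target: str) -> list[tuple[int, str]]:
    """Record (position, base) for every place where the target differs from the reference."""
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def reconstruct(reference: str, differences: list[tuple[int, str]]) -> str:
    """Rebuild the original sequence from the reference plus the stored differences."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)


reference = "attcgtgtattcgtgt"
target = "atttgtgtattcgtgt"  # differs from the reference at index 3 only
differences = diff_against_reference(reference, target)
print(differences)  # [(3, 't')]
print(reconstruct(reference, differences) == target)  # True
```

Only the reference and the short difference list need to be stored, which is why the saving grows with the similarity between the sequences.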
  • 4. The GenBit Compress algorithm [15] handles both repetitive and non-repetitive DNA sequences. In this scheme all possible combinations are enumerated. The input sequence is divided into blocks of four bases each, so that 2^8 = 256 combinations can be represented. Every DNA segment of four bases is thus replaced by an 8-bit binary number, e.g. “01010101”. If consecutive fragments are the same, a bit “1” is appended as a 9th bit; if consecutive fragments are different, a bit “0” is appended as the 9th bit to the 8-bit number. GenBit Compress is a simple algorithm that uses no dynamic programming. It takes a DNA sequence of length n as input and divides it into n/4 fragments. Each left-over base (fragment length < 4) is assigned a unique 2-bit code (a = “00”, g = “01”, c = “10”, t = “11”).
3. PROPOSED METHODOLOGY
The general encoding methodology is as follows. Before encoding the next symbol of a DNA sequence with the mapping technique, the algorithms search for direct repeats, reverse direct repeats and complementary palindromes, which are explained in detail below. If one is found with the desired sequence length, the algorithm counts the number of occurrences and represents them by the mapped symbol followed by the repeat count. These algorithms differ from existing ones in that they try the different compression techniques, search for the best result and give as final output the one with the fewest bits per base (bpb).
3.1 Direct exact or approximate repeats in DNA sequence
Algorithm FS4DR uses this property. Repeats in a DNA sequence may be exact or approximate matches. Approximate matches arise through operations such as substitution (replacement), insertion and deletion.
An exact match means that two substrings of a sequence consist of identical bases along the chosen subsequence. For example, in the subsequence “attcgtgtattcgtgt” the first eight bases, “attcgtgt”, and the last eight bases, “attcgtgt”, are the same: the second subsequence is an exact copy of the first. An exact repeat is also called a tandem repeat when a pattern of two or more nucleotides is repeated and the repetitions are adjacent to each other, either directly or inverted. An example is atcgatcgatcgatcg, in which the sequence “atcg” is repeated four times. When the number of repeats is not known, variable or irrelevant, it is sometimes called a variable number tandem repeat (VNTR). The term minisatellite has been used interchangeably with VNTR. When between 10 and 60 nucleotides are repeated, it is called a minisatellite; these occur at more than 1,000 locations in the human genome. Repeats with fewer nucleotides are known as microsatellites or short tandem repeats (STRs); they are usually 2–10 nucleotides long. When exactly two nucleotides are repeated, it is called a dinucleotide repeat (for example: tctctctc…). The microsatellite instability in hereditary nonpolyposis colon cancer most commonly affects such regions. When three nucleotides are repeated, it is called a trinucleotide repeat (for example: tagtagtagtag…), and abnormalities in such regions can give rise to trinucleotide repeat disorders.
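The idea of replacing adjacent exact repeats by a fragment and a repeat count can be sketched as follows. This is only an illustration under simplifying assumptions: the sequence is cut into fixed, frame-aligned fragments (four bases by default) and only identical neighbouring fragments are merged. It is not the authors' FS4DR implementation, and the names `collapse_direct_repeats` and `expand_runs` are hypothetical.

```python
def collapse_direct_repeats(sequence: str, size: int = 4) -> list[tuple[str, int]]:
    """Cut the sequence into fixed-size fragments and merge adjacent identical ones."""
    runs: list[tuple[str, int]] = []
    for start in range(0, len(sequence), size):
        fragment = sequence[start:start + size]
        if runs and runs[-1][0] == fragment:
            # Same fragment as the previous one: increase its repeat count.
            runs[-1] = (fragment, runs[-1][1] + 1)
        else:
            runs.append((fragment, 1))
    return runs


def expand_runs(runs: list[tuple[str, int]]) -> str:
    """Inverse operation: write each fragment out as many times as it was counted."""
    return "".join(fragment * count for fragment, count in runs)


print(collapse_direct_repeats("atcgatcgatcgatcg"))          # [('atcg', 4)]
print(collapse_direct_repeats("attcgtgtattcgtgt", size=8))  # [('attcgtgt', 2)]
```

With the default fragment size of four, the second example yields no merge, because its repeat unit is eight bases long; the chosen fragment size therefore determines which repeats such a scheme can exploit.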
  • 5. An approximate match means that two substrings of a sequence consist of identical bases after one of the following operations has been applied along the chosen subsequence. For example, in the subsequence “atttgtgtattcgtgt” the first eight bases are “atttgtgt” and the last eight are “attcgtgt”. The second subsequence is obtained from the first if the 4th base “t” of the first subsequence is substituted (replaced) by “c”. In the DNA sequence “attcgttattcgtgt”, the second subsequence “attcgtgt” is obtained from the first one, “attcgtt”, by inserting “g” between its 6th and 7th characters. This is called an approximate match with insertion. For an approximate match with deletion, consider the DNA sequence “attcgtgtattcgtt”: the second subsequence “attcgtt” is obtained from the first one, “attcgtgt”, by deleting its 7th base, “g”. In some cases a substitution can be expressed as a combination of an insertion and a deletion. For example, in the subsequence “atttgtgtattcgtgt” the second subsequence “attcgtgt” of length eight can be obtained from the first one, “atttgtgt”, by inserting “c” between the 3rd and 4th bases and then deleting the base “t” at the 5th position. The same result is obtained by simply replacing the character as described above.
1 atgaatgaatgaatgaatgaatga 24
1 atga 4   5 atga 8   9 atga 12   13 atga 16   17 atga 20   21 atga 24
Figure 1. An example of a direct repeat of four nucleotides. The sequence is part of a DNA sequence.
3.2 Palindrome or reverse direct repeat, Inverted exact repeat or complementary palindrome and inverted approximate repeat or approximate complementary palindrome or approximate reverse complement repeats
Algorithms FS4RDR and FS4CP use these properties.
In the palindrome sequence “acgttgca”, the fragment “tgca” is obtained from the fragment “acgt” by reversing it. An inverted repeat (also called a reversed repeat, complemented inverted repeat or reverse complement repeat) is a sequence of bases whose reverse complement occurs further downstream in the sequence. The original sequence of nucleotides is written in the 5' to 3' direction, e.g. 5'-atgc-3'. The complementary sequence is written by matching ‘a’ with ‘t’ and ‘g’ with ‘c’; the resulting sequence is in the 3' to 5' orientation: 3'-tacg-5'. Reversing this new sequence, “3'-tacg-5'”, gives the complementary sequence in the correct orientation, i.e. 5'-gcat-3'. In the sequence “gcatatgc”, the second subsequence “atgc” of length four is therefore the reverse complement of the first subsequence “gcat” of length four.
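The two transformations described above, reversal and reverse complementation, can be checked with a short Python sketch. It assumes lower-case sequences over the four bases a, c, g and t only; the helper names `reverse`, `reverse_complement` and `classify_halves` are hypothetical and serve only to reproduce the worked examples.

```python
# Translation table for the base pairing a<->t and c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse(fragment: str) -> str:
    """Read the fragment backwards."""
    return fragment[::-1]


def reverse_complement(fragment: str) -> str:
    """Complement every base, then reverse to restore the 5' to 3' orientation."""
    return fragment.translate(COMPLEMENT)[::-1]


def classify_halves(sequence: str) -> str:
    """Compare the second half of an even-length sequence with the first half."""
    if len(sequence) % 2 != 0:
        raise ValueError("this sketch expects an even-length sequence")
    half = len(sequence) // 2
    first, second = sequence[:half], sequence[half:]
    if second == first:
        return "direct repeat"
    if second == reverse(first):
        return "palindrome"
    if second == reverse_complement(first):
        return "complementary palindrome"
    return "no exact match"


print(reverse_complement("atgc"))   # gcat
print(classify_halves("acgttgca"))  # palindrome
print(classify_halves("gcatatgc"))  # complementary palindrome
```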
  • 6. If no nucleotides intervene between the subsequence and its downstream complement, it is called a palindrome. In the above case the second subsequence is therefore the complementary palindrome, or inverted exact repeat, of the first subsequence. Inverted repeats also indicate regions capable of self-complementary base pairing (regions within a single sequence that can base-pair with each other). In the sequence “gcttatgc”, the second subsequence “atgc” is the reverse complement of the first subsequence “gctt”, but only approximately. The complement of the second subsequence is “tacg” in the 3' to 5' direction, so in the original 5' to 3' direction it is “gcat”. If the 3rd base “t” of the reference subsequence “gctt” is replaced by “a”, it becomes identical to the compared subsequence.
1 ctaggatcgatcgatcgatcgatc 24
ctagctagctagctagctag -> Reverse
Figure 2. An example of a palindrome of four nucleotides. The sequence is part of a DNA sequence.
1 ctgatcagtcagtcagtcagtcagtcag 24
agtcagtcagtcagtcagtcagtc -> Complement
ctgactgactgactgactgactga -> Reverse
Figure 3. An example of a complementary palindrome of four nucleotides. The sequence is part of a DNA sequence.
3.3 Interspersed or Dispersed Repeats
These are copies of repetitive sequences that are interspersed throughout the genome: a genome sequence contains two or more repeats of a specific subsequence, i.e. two identical (or nearly identical) nucleotide sequences separated by a stretch of non-repeated DNA. For example, in the sequence “3' tagtacgttagt 5'” the last four bases, “tagt”, are an exact repeat of the first four bases, “tagt”, while the intermediate subsequence “acgt” is a non-repeated region. Such a repeat can be present in multiple copies in the genome, and it can be exact or approximate.
Interspersed repeats are classified as Short Interspersed Repeats (SIR) and Long Interspersed Repeats (LIR).
3.4 Flanking Repeats or Terminal Repeat
These are sequences that are repeated at both ends of a sequence. Direct terminal repeats run in the same direction, and inverted terminal repeats are opposite to each other in direction.
3.5 Complementary Nature of Nucleotide Sequences
The DNA double strand is complementary in nature. If one strand is known, the other strand can be formed from it just by replacing ‘a’ with ‘t’ and ‘g’ with ‘c’, and vice versa. The polymer forms in a unidirectional manner from the 5' carbon, which is bound to a phosphate group, to the 3' carbon, which is bound to the hydroxyl group. DNA sequences are often written in a shorthand notation using the one-letter name of each nucleotide and listing the nucleotides in the 5' to 3' direction.
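The terminal repeats of Section 3.4 and the strand complementarity of Section 3.5 can be combined in one small Python sketch. It rests on an assumption that should be stated plainly: "opposite in direction" is interpreted here as the tail being the reverse complement of the head. The function name `terminal_repeat` and the test sequence “gcataaaaatgc” are made up for illustration.

```python
# Translation table for the base pairing a<->t and c<->g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def terminal_repeat(sequence: str, length: int) -> str:
    """Compare the first and last `length` bases of a sequence."""
    if length <= 0 or 2 * length > len(sequence):
        raise ValueError("the two terminal regions must be non-empty and must not overlap")
    head, tail = sequence[:length], sequence[-length:]
    if tail == head:
        return "direct terminal repeat"
    # Assumption: an inverted terminal repeat is the reverse complement of the head.
    if tail == head.translate(COMPLEMENT)[::-1]:
        return "inverted terminal repeat"
    return "none"


print(terminal_repeat("tagtacgttagt", 4))  # direct terminal repeat
print(terminal_repeat("gcataaaaatgc", 4))  # inverted terminal repeat
```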
  • 7. 3.6 Compression technique
Original file (e.g. humdystrop.txt) -> Compression technique FS4 -> Compressed file (e.g. humdystrop_cmp.txt)
Figure 4. The compression technique.
Figure 5. An example of sequence humdystrop.txt
3.7 Decompression Technique
Compressed file (e.g. humdystrop_cmp.txt) -> Decompression technique RFS4 -> Reconstructed file (e.g. humdystrop_rec.txt)
Figure 6. The decompression technique.
  • 8. Figure 7. An example of compressed sequence humdystrop_cmp.txt
4. SIMULATION RESULTS
Table 3 gives information about the ten standard benchmark genome sequences [17]. Table 4 shows the results in bytes for fragment size four (FS4), fragment size four with direct repeat (FS4DR), fragment size four with reverse direct repeat (FS4RDR) and fragment size four with complementary palindrome (FS4CP). RFS4 means reverse fragment size four. Table 5 shows the resulting bits per base (bpb), i.e. the compressed size in bits divided by the number of bases, achieved by our proposed compression algorithm when each genome sequence is compressed separately.
Table 3. Information on the ten standard benchmark DNA sequences.
Sequence name   Length (bytes)   Source         File size (KB)
chmpxx          121,024          Chloroplasts   118.1875
chntxx          155,844          Chloroplasts   152.1914
humdystrop      38,770           Human          37.8613
humghcsa        66,495           Human          64.9365
humhdabcd       58,864           Human          57.4844
humhprtb        56,737           Human          55.4072
mpomtcg         186,608          Mitochondria   182.2344
mtpacg          100,314          Mitochondria   97.9629
hehcmvcg        229,354          Viruses        223.9785
vaccg           191,737          Viruses        187.2431
Table 4. Compressed sequence length in bytes for the different compression techniques. These sizes are the basis for the bits per base (bpb) reported in Table 5.
Sequence name   FS4      FS4DR    FS4RDR   FS4CP
chmpxx          31,945   30,045   30,403   30,203
chntxx          40,960   38,960   38,900   38,700
humdystrop      10,006   10,083   10,011   10,020
humghcsa        18,160   16,576   16,776   16,976
humhdabcd       15,295   14,576   14,376   14,376
humhprtb        14,984   14,153   14,053   14,053
mpomtcg         51,499   46,388   46,188   46,088
mtpacg          26,983   25,008   24,908   24,808
hehcmvcg        60,806   56,806   57,306   57,206
vaccg           52,169   47,852   47,352   47,152
  • 9. Table 5. The bpb for the different compression techniques.
Sequence name   FS4        FS4DR      FS4RDR     FS4CP
chmpxx          2.111647   1.986052   2.009717   1.996497
chntxx          2.102615   1.999949   1.996869   1.986602
humdystrop      2.064689   2.080578   2.065721   2.067578
humghcsa        2.184826   1.994255   2.018317   2.042379
humhdabcd       2.07869    1.980973   1.953792   1.953792
humhprtb        2.112766   1.995594   1.981494   1.981494
mpomtcg         2.207794   1.988682   1.980108   1.975821
mtpacg          2.151883   1.994378   1.986403   1.978428
hehcmvcg        2.120948   1.981426   1.998866   1.995378
vaccg           2.17669    1.900741   1.975706   1.967362
Average bpb     2.131255   1.990263   1.996699   1.994533
5. CONCLUSIONS
Our proposed compression algorithm was tested on ten standard benchmark DNA sequences. The experimental results showed that, on average, the bpb is less than two bits per base when these sequences are compressed using any of the three simple properties. After applying all the techniques, our algorithm selects the best result and stores it. When none of the above characteristics is used, the result is approximately 2 bpb, because each nucleotide base is represented by two bits.
ACKNOWLEDGEMENTS
This research work is supported in part by the Computer Science and Engineering Department of University of Calcutta, the Bioinformatics Centre of Bose Institute and Academy of Technology under WBUT. We would like to thank Dr. Zumur Ghosh and Asst. Professor Utpal Nandi for their valuable suggestions.
REFERENCES
[1] Subhankar Roy, Sunirmal Khatua, Sudipta Roy and Prof. Samir K. Bandyopadhyay, “An Efficient Biological Sequence Compression Technique Using LUT and Repeat in the Sequence”, IOSRJCE, Vol. 6, Issue 1, pp. 42-50, Sep-Oct. 2012.
[2] Subhankar Roy, Sunirmal Khatua, “Compression Algorithm for all Specified Bases in Nucleic Acid Sequences”, IJCA, Vol. 75, No. 4, pp. 29-34, August 2013.
[3] Subhankar Roy, Sunirmal Khatua, “Biological Data Compression Methods Based Upon Microsatellite”, IJARCSSE, Vol. 3, Issue 7, pp. 1321-1326, July 2013.
[4] http://www.gzip.org.
[5] Wu Choi Ping Paula, “An Approach to Multiple DNA Sequences Compression”, The Hong Kong Polytechnic University, xiii, 115 p, PolyU Library Call No.: [THS] LG51 .H577M EIE 2009 Wu, February 2009.
  • 10. [6] Shanika Kuruppu, Simon J. Puglisi and Justin Zobel, “Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval”, Springer-Verlag Berlin Heidelberg, LNCS 6393, pp. 201–206, 2010.
[7] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE Trans. Inform. Theory, 23(3):337–343, 1977.
[8] Khalid Sayood, “Introduction to Data Compression, 3rd ed.”, University of Nebraska, 2006.
[9] Mark Nelson and Jean-Loup Gailly, “The Data Compression Book, 2nd ed.”, 2012.
[10] Alistair Moffat, “Implementing the PPM Data Compression Scheme”, IEEE Transactions on Communications, Vol. 38, No. 11, November 1990.
[11] Zijun Wu, “A Tutorial of The Context Tree Weighting Method: Basic Properties”, November 9, 2005.
[12] Choi-Ping Paula Wu, Ngai-Fong Law and Wan-Chi Siu, “Cross chromosomal similarity for DNA sequence compression”, Biomedical Informatics Publishing Group, Vol. 2, No. 9, pp. 412-416, July 14, 2008.
[13] Hyoung Do Kim and Ju-Han Kim, “DNA Data Compression Based on the Whole Genome Sequence”, Journal of Convergence Information Technology, Vol. 4, No. 3, September 2009.
[14] Heba Afify, Muhammad Islam and Manal Abdel Wahed, “DNA Lossless Differential Compression Algorithm Based on Similarity of Genomic Sequence Database”, International Journal of Computer Science & Information Technology, Vol. 3, No. 4, pp. 145-154, August 2011.
[15] P. Raja Rajeswari, Dr. Allam AppaRao, “Genbit Compress Tool (GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio (bits/base) of Genomes”, International Journal of Computer Science and Information Technology, Vol. 2, No. 3, pp. 181-191, June 2010.
[16] Nour S. Bakr and Amr A. Sharawi, “DNA Lossless Compression Algorithms: Review”, American Journal of Bioinformatics Research, pp. 72-81, 2013.
[17] The GeNML homepage: http://www.cs.tut.fi/~tabus/genml/results.html.
Authors
Subhankar Roy is currently working as an Assistant Professor in the Department of Computer Science & Engineering, Academy of Technology, India. He received his B.Tech degree and M.Tech degree in Computer Science and Engineering in 2012, both from University of Calcutta. He is doing research work at University of Calcutta under the guidance of Mr. Sunirmal Khatua. His research interests include Bioinformatics and compression techniques. He is a member of IEEE.
Sunirmal Khatua is currently an Assistant Professor in the Department of Computer Science and Engineering, University of Calcutta, India. He received the M.E. degree in Computer Science and Engineering from Jadavpur University, India in 2006. He is also pursuing his PhD in Cloud Computing at Jadavpur University. His research interests are in the areas of Distributed Computing, Cloud Computing and Sensor Networks. He is a member of IEEE.