International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), pp. 94-101, © IAEME, www.iaeme.com/ijcet.asp

A NEW REVISITED COMPRESSION TECHNIQUE THROUGH INNOVATIVE PARTITION GROUP BINARY COMPRESSION: A NOVEL APPROACH

V. Hari Prasad, Sphoorthy Engineering College, A.P.

ABSTRACT

The number of sequenced organisms grows day by day, and the resulting accumulation of data in biological databases has become a major challenge. Several classical algorithms have been applied to this problem, but they fail to compress genetic sequences effectively because of the restricted alphabet in which the data are encoded. Existing substitution techniques exploit the repetitive and non-repetitive structure of DNA bases and achieve their best compression rates only when a sequence contains tandem repeats. Grasses such as maize and rice, however, contain very few repeats (around 27 over 67,789 total bp). On such sequences the existing techniques give poor results and degrade to their worst-case behaviour. This paper introduces a novel Innovative Partition Group Binary Compression (PGBC) technique that yields compression rates far better than existing techniques. The algorithm was developed from a comparative study of existing algorithms and is particularly applicable to DNA sequences in genomes with few or no tandem repeats.

Keywords: DNABit Compress, GenBit Compress, HuffBit Compress, Encode, Decode

I. INTRODUCTION

The human genome has finally been deciphered: scientists have succeeded in reading the chain of more than 3 billion base pairs that constitute the human DNA molecule, a process called sequencing. That daunting task required new analytical methods created by bioinformatics. Bioinformatics is, in essence, information technology applied to biological databases. Maintaining such huge databases requires efficient computational tools to store and process the data.
Today more and more DNA sequences are being stored in biological databases, so the size of these databases is growing exponentially. The need for compression therefore arises and is becoming a vital challenge. Many universal algorithms exist, but they fail to compress genetic sequences because of the specificity of the "text". Some classical algorithms have also been applied, but they yield negative compression rates. Compression comes in two flavours: lossless and lossy. Lossy compression is acceptable for multimedia applications such as image, audio and video, where removing unused pixels or noise does not materially change the result. Text compression, in contrast, must always be lossless: after decoding, the entire original text must be recovered exactly. DNA can be encoded, like text, in the four-letter alphabet {A, C, G, T}, so each base (symbol) can be represented by two bits.

General-purpose compression algorithms do not perform well on biological sequences. Giancarlo et al. [1][2] have provided a review of compression algorithms designed for biological sequences. Characterising and comparing genomes is a major task (Koonin 1999 [3]; Wooley 1999 [4]). From a mathematical point of view, compression implies understanding and comprehension (Li and Vitanyi 1998) [5], so compression is a valuable tool for genome comparison and for studying various properties of genomes. DNA sequences, which encode life, should be compressible: it is well known that DNA sequences in higher eukaryotes contain many tandem repeats, essential genes (such as rRNAs) have many copies, and it has been shown that genes sometimes duplicate themselves for evolutionary purposes. Nevertheless, the compression of DNA sequences is not an easy task (Grumbach and Tahi 1994 [6]; Rivals et al. 1995 [7]; Chen et al. 2000 [8]). DNA sequences consist of only four nucleotide bases {a, c, g, t}, so two bits are enough to store each base, yet standard compression software such as compress, gzip, bzip2 and winzip expands a DNA genome file rather than compressing it.

Most existing software tools work well for English text compression (Bell et al. 1990 [9]) but not for DNA genomes. Many text compression algorithms achieve quite good compression ratios, but they have not proved effective on DNA because they do not incorporate the characteristics of DNA sequences, even though DNA can be represented in simple text form. DNA sequences are composed of just four bases, labelled A, T, C and G (for adenine, thymine, cytosine and guanine respectively); T pairs with A, and G pairs with C. Each base can be represented in computer code by two binary digits: A (00), C (01), G (10) and T (11). At first glance this looks like the most efficient way to store DNA sequences. Like the binary alphabet {0, 1} used in computers, the four-letter alphabet of DNA {A, T, C, G} can encode messages of arbitrary complexity when written as long sequences.
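As a concrete illustration of this two-bit representation (a minimal sketch in C, not part of the PGBC method itself; the function name base_to_bits is ours), four bases can be packed into a single byte:

    #include <stdio.h>

    /* Map a nucleotide to its 2-bit code: A=00, C=01, G=10, T=11. */
    static unsigned char base_to_bits(char b)
    {
        switch (b) {
            case 'A': return 0x0;
            case 'C': return 0x1;
            case 'G': return 0x2;
            case 'T': return 0x3;
            default:  return 0x0;   /* unknown symbols not handled in this sketch */
        }
    }

    int main(void)
    {
        const char frag[4] = { 'A', 'C', 'G', 'T' };   /* one 4-base fragment */
        unsigned char packed = 0;

        /* Pack the fragment into one byte: 4 bases x 2 bits = 8 bits. */
        for (int i = 0; i < 4; i++)
            packed = (unsigned char)((packed << 2) | base_to_bits(frag[i]));

        printf("ACGT -> 0x%02X\n", packed);   /* prints 0x1B (00 01 10 11) */
        return 0;
    }

This two-bit baseline (2 bits per base) is the reference point against which the compression ratios discussed in this paper are measured.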
A. Plan of the paper

This paper is organized as follows. Section 2 describes general compression algorithms. Section 3 reviews existing algorithms for compressing genome data. Section 4 presents the proposed algorithm and analyses why it improves on existing techniques. Section 5 gives a comparative study on a sample sequence. Section 6 concludes and outlines future work.
II. GENERAL COMPRESSION ALGORITHMS

The compression of DNA sequences is considered one of the most challenging tasks in the field of data compression. The very first dedicated DNA compressors, BioCompress [10] and its successor BioCompress-2 [11], detect exact repeats and complementary palindromes in the sequence and encode each factor by the pair (l, p), where l is the length of the factor and p is the position of its first occurrence; if this representation is larger than the factor itself, plain two-bit encoding is used instead. Decoding requires many memory references, so performance may degrade.

DNAPack [12] uses Hamming distance to find repeats and complementary palindromes and is implemented with a dynamic programming approach, so it is not simple in design, takes more time to execute and has higher memory requirements. It achieves an average compression rate of 1.6602 bits per base.

DNACompress [13] works on approximate repeats: when there are many tandem repeats it saves bits in the encoding, otherwise the repeats are discarded and the non-repeated sequences are appended at the end. It achieves a compression rate of only 1.72 bits per base, and if there is no tandem repeat in the sequence it may run in its worst case.

DNASequitur is a grammar-based compression algorithm for DNA sequences that infers a context-free grammar to represent the input data. Designing a CFG for the given input may lead to redundancies, and constructing the type-2 language corresponding to the grammar can be ambiguous.

Lossless segment-based compression enables part-by-part decompression by introducing a non-base character, which saves memory, but it works well only when repeating sequences dominate. For sequences such as the AT-rich DNA that constitutes a distinct fraction of the cellular DNA of the archaebacterium Methanococcus voltae, which consists largely of non-repetitive sequences, part-by-part decompression is somewhat tedious.

III. RELATED EXISTING ALGORITHMS

Compression methods fall into two categories:
 • Statistical methods, which compress data by replacing symbols with shorter codes. Huffman coding belongs to this category and is not well suited to large sequences.
 • Dictionary-based methods, which replace longer strings with shorter codes. Cfact, which searches for the longest exactly matching repeat using a tree data structure, is an example.

Based on these methods, several lossless compression algorithms have been built on the two-bit encoding scheme A (00), C (01), G (10), T (11). HUFFBIT COMPRESS [14] and GENBIT COMPRESS [15] have been analysed (best, average and worst case) with respect to the repetitive and non-repetitive bases of DNA. If the given sequence contains many tandem repeats, [14] and [15] run in their best case and achieve good compression ratios; otherwise they run in their worst case and achieve on average 2.323 bits per base [14][15]. Our proposed PGBC (Partitioned Group Binary Compression) technique achieves its best compression ratio even when the given DNA sequence contains no tandem repeats, or only a negligible number in a very long sequence, as in maize and rice grass sequences. The proposed algorithm is therefore better suited to non-repetitive DNA sequences in genomes and achieves an average of 1.333 bits per base, which is far better than existing techniques.
IV. PROPOSED ALGORITHM

Our proposed algorithm was developed from a comparative study of existing techniques, and we concentrate on the non-repetitive portions of DNA sequences. Existing techniques run in their worst case when a DNA sequence contains no tandem repeats; here we target exactly that worst-case scenario and still achieve better compression rates in terms of bits per symbol.

A. Idea behind the algorithm

Every DNA sequence consists of the nucleotides {A, C, G, T}, where each literal is called a BASE and is encoded in two bits as A = 00, C = 01, G = 10, T = 11. The compression ratio is calculated as encoded bits per base:

Compression Ratio = Encoded Bits / Bases.

B. Plan of work

We take an input sample DNA sequence of length n that contains no tandem repeats and divide it into n/4 fragments, each fragment containing four bases (from A, C, G, T). In the encoding process every six fragments are grouped into a partition (P), which consists of two sub-partitions (Fh and Sh). Each base is substituted by its two-bit equivalent before the sub-partitions are formed, and the partitions are then combined into a single group partition set (Gs); the total number of encoded group bits (Egb) is obtained from Gs by grouping all the partitions, and the compression ratio is finally computed as Egb divided by the sequence length (here m = n, the length of the given sequence). Gs represents the binary-equivalent numeric value (rounded to the nearest integer) in terms of bytes of storage; for instance, if the technique is implemented in C, an unsigned int requires 4 bytes of storage. In the decoding process the encoding is reversed, so the lossless DNA property is retained.

For example, if we take a sample DNA sequence of 72 bases, the PGBC technique fragments it into n/4 = 18 fragments and 3 partitions containing 6 sub-partitions, which are then grouped into a single main partition. This gives the number of encoded bits for the given sequence.
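To make the accounting concrete, the following minimal C sketch tallies fragments, partitions and the resulting ratio for a sequence of n bases, following the paper's own figures of 4 bases per fragment, 6 fragments per partition and one 4-byte unsigned int of storage per partition (the macro and variable names are ours):

    #include <stdio.h>

    #define BASES_PER_FRAGMENT   4   /* each fragment holds 4 bases (Section IV-B) */
    #define FRAGMENTS_PER_PART   6   /* six fragments form one partition (Fh + Sh) */
    #define BYTES_PER_PARTITION  4   /* one 4-byte unsigned int per partition, as in the text */

    int main(void)
    {
        unsigned long n = 72;                                        /* sequence length in bases */
        unsigned long fragments  = n / BASES_PER_FRAGMENT;           /* 18 fragments            */
        unsigned long partitions = (fragments + FRAGMENTS_PER_PART - 1)
                                   / FRAGMENTS_PER_PART;             /* 3 partitions            */
        unsigned long bytes = partitions * BYTES_PER_PARTITION;      /* 12 bytes of storage     */
        unsigned long bits  = bytes * 8;                             /* 96 encoded bits         */

        printf("fragments=%lu partitions=%lu storage=%lu bytes (%lu bits)\n",
               fragments, partitions, bytes, bits);
        printf("compression ratio = %.3f bits per base\n", (double)bits / n);
        return 0;
    }

For n = 72 this reproduces the figures used in the analysis below: 18 fragments, 3 partitions, 12 bytes (96 bits) and a ratio of 1.333 bits per base.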
C. Analysis

Let the length of the given sequence be N = 72 bases. This gives 3 partition sets, comprising 6 sub-partitions (3 Fh and 3 Sh), which are finally grouped into one group partition set:

Ps = P1 + P2 + P3.

For each partition we compute the binary-equivalent numeric value (rounded to the nearest integer) in terms of bytes of storage (if each partition is stored in C, two bytes suffice when it fits in an int; otherwise an unsigned long is used):

Gs = Ps = (P1 + P2 + P3).

The total number of encoded bits is Egb = Gs = Ps. Each partition may contain 24 bases, which may not fit in an int, so it is stored in an unsigned long. The sequence is thus divided into three partitions grouped as one set, requiring

4 + 4 + 4 = 12 bytes (96 bits)

of storage in total. Finally the compression ratio is

96 / 72 = 1.333 bits per base.

The encoding and decoding algorithms for DNA compression are as follows.

D. Encoding Algorithm

INP: input string
OPS: encoded string

PROCEDURE ENCODE
Begin
 • Group INP into fragments of four bases each.
 • Generate all possible combinations of DNA bases; the input is assumed to be non-repetitive (no tandem repeats).
 • Group every six fragments into a partition set consisting of two sub-partitions.
 • Assign binary bits (0 and 1) to every base of DNA: A = 00, C = 01, G = 10, T = 11.
 • Calculate Gs for every Ps in INP until the end of INP.
 • Calculate Egb for every Gs until the end of INP.
 • Repeat the previous two steps over the length of INP.
 • Transfer the sequence Egb to the output string OPS.
End.
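A mechanical C sketch of this encoding procedure is given below. It assumes full 24-base partitions and, for simplicity, holds each partition's 48 bits in one 64-bit word, whereas the paper's storage accounting in Section IV-C counts 4 bytes per partition; identifiers such as pgbc_encode are ours:

    #include <stdio.h>
    #include <string.h>

    #define BASES_PER_FRAGMENT 4
    #define FRAGMENTS_PER_PART 6
    #define BASES_PER_PART (BASES_PER_FRAGMENT * FRAGMENTS_PER_PART)   /* 24 bases */

    /* 2-bit code per base: A=00, C=01, G=10, T=11 */
    static unsigned long long base_code(char b)
    {
        switch (b) {
            case 'A': return 0ULL;
            case 'C': return 1ULL;
            case 'G': return 2ULL;
            case 'T': return 3ULL;
            default:  return 0ULL;
        }
    }

    /* Encode: walk the sequence partition by partition, packing the 2-bit codes
     * of the 24 bases of each partition (two 12-base sub-partitions Fh and Sh)
     * into one word of the output array. Returns the number of partitions. */
    static size_t pgbc_encode(const char *seq, size_t n, unsigned long long *out)
    {
        size_t parts = 0;
        for (size_t i = 0; i < n; i += BASES_PER_PART) {
            unsigned long long word = 0;
            for (size_t j = 0; j < BASES_PER_PART && i + j < n; j++)
                word = (word << 2) | base_code(seq[i + j]);
            out[parts++] = word;
        }
        return parts;
    }

    int main(void)
    {
        const char *seq = "ACGTGCGCGATCGCCTGCTAGGCG"
                          "TACGTCGCAGGCGATCGATGTGCT"
                          "AGATCAGATGACTCAGTGCACGAT";   /* the 72-base Sequence 1 of Section V */
        unsigned long long packed[8];
        size_t parts = pgbc_encode(seq, strlen(seq), packed);

        printf("%zu partitions encoded\n", parts);        /* 3 partitions */
        for (size_t p = 0; p < parts; p++)
            printf("partition %zu: 0x%012llX\n", p, packed[p]);
        return 0;
    }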
E. Decoding Algorithm

INP: input string
OPS: decoded string

PROCEDURE DECODE
Begin
 • Generate all possible combinations of (A, C, G, T).
 • Read the binary data of each sub-partition from OPS, replace every pair of bits by its equivalent base (00 = A, 01 = C, 10 = G, 11 = T), and store the result in an array until the end of the input.
 • Repeat the previous step until the end of INP is reached, and calculate Dgb and Ds by the reverse process.
 • Transfer the sequence Db to the input string INP.
End.
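A matching decoder, meant to be compiled alongside the encoder sketch above (it reuses BASES_PER_PART and, like the encoder, assumes full 24-base partitions; trailing partial partitions would need extra bookkeeping), simply reverses the packing:

    /* Decode: read each partition word and emit its 24 bases, most significant
     * pair of bits first, until n bases have been recovered. The bases table
     * inverts the 2-bit codes 00/01/10/11 back to A/C/G/T. */
    static void pgbc_decode(const unsigned long long *in, size_t parts,
                            size_t n, char *out)
    {
        static const char bases[4] = { 'A', 'C', 'G', 'T' };
        size_t k = 0;
        for (size_t p = 0; p < parts && k < n; p++) {
            for (int j = BASES_PER_PART - 1; j >= 0 && k < n; j--) {
                unsigned code = (unsigned)((in[p] >> (2 * j)) & 0x3ULL);
                out[k++] = bases[code];
            }
        }
        out[k] = '\0';
    }

For the 72-base sample sequence of Section V, calling pgbc_decode(packed, parts, 72, buf) with a buffer of at least 73 characters reproduces the original sequence exactly, which is the lossless round trip the decoding algorithm requires.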
V. EXAMPLE AND COMPARISON

Consider the following sequence.

Sequence 1: ACGT GCGC GATC GCCT GCTA GGCG TACG TCGC AGGC GATC GATG TGCT AGAT CAGA TGAC TCAG TGCA CGAT

Sequence length (number of bases) = 72. Bytes required to store it in a text file = 72 bytes. The sequence contains no tandem repeats, so existing algorithms such as HuffBit Compress, GenBit Compress and DNABit Compress run in their worst case and require more bits to encode it:

 HuffBit, GenBit and DNABit Compress = 162 bits (2.25 bits per base)
 GenBit Compress (tool based) = 160 bits (2.23 bits per base)
 PGBC technique = 96 bits (1.333 bits per base)

VI. CONCLUSION AND FUTURE WORK

With our algorithm every base is encoded in 1.33 bits, saving nearly 8 bytes on the given sample sequence; the saving varies with the size of the sequence. The technique is therefore far better than existing ones and can be applied to non-repetitive DNA sequences of genomes; even if the given sequence does contain tandem repeats, the technique achieves the same compression rate on average. In addition, existing techniques use dynamic programming to compress the sequence, which is complex to implement and time consuming; our technique is implemented without a dynamic programming approach, so it is simple and fast. This simplicity reduces processing complexity and should make the method a valuable tool in the bioinformatics era. The algorithm can also be extended to a tool-based approach.

ACKNOWLEDGEMENTS

We would like to thank the other members of the bioinformatics team (faculty of CSE and IT) at Sphoorthy Engineering College, Nadergul (V), R.R. Dist., Hyderabad. I am very grateful to my mother and father, V. Subba Lakshmamma and V. C. Obanna, for supporting every step of my success. Last but not least, thanks to the CSE students at Sphoorthy for sharing their ideas with us while refining our architecture.

REFERENCES

[1] E. Schrodinger. Cambridge University Press, Cambridge, UK, 1944. [PMID: 15985324]
[2] R. Giancarlo et al. A synopsis. Bioinformatics 25: 1575 (2009). [PMID: 19251772]
[3] E. V. Koonin. Bioinformatics 15: 265 (1999).
[4] J. C. Wooley. J. Comput. Biol. 6: 459 (1999). [PMID: 10582579]
[5] C. H. Bennett et al. IEEE Trans. Inform. Theory 44: 4 (1998).
[6] S. Grumbach and F. Tahi. Journal of Information Processing and Management 30(6): 875 (1994).
[7] E. Rivals et al. A guaranteed compression scheme for repetitive DNA sequences. LIFL, Lille I University, Technical Report IT-285 (1995).
[8] X. Chen et al. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 8-11, 2000. [PMID: 11072342]
[9] T. C. Bell et al. New York: Prentice Hall (1990).
[10] J. Ziv and A. Lempel. IEEE Trans. Inf. Theory 23: 337 (1977).
[11] A. Grumbach and F. Tahi. In Proceedings of the IEEE Data Compression Conference.
[12] B. Behzadi. DNA compression challenge revisited.
[13] Allam Appa Rao. DNABIT Compress: compression of DNA sequences. In Proceedings of the Biomedical Informatics Journal (2011).
[14] Allam Appa Rao. HuffBit Compress: compression of DNA using extended binary trees. In Proceedings of the JATIT Journal of Computational Biology and Bioinformatics (2009).
[15] Allam Appa Rao. GenBit Compress: compression of DNA sequences. In Proceedings of the JATIT Journal of Computational Biology and Bioinformatics (2011).
AUTHORS' INFORMATION

V. Hari Prasad, Assoc. Professor, received his B.Tech in CSE from JNTU University, Anantapur, and his M.Tech in CSE from JNTUCEH, Hyderabad, and is pursuing research in the area of bioinformatics at JNTU Kakinada, A.P., as an external research scholar in CSE. He has 10 years of teaching experience in various engineering colleges. He currently heads the CSE Department at Sphoorthy Engineering College, Nadergul (V), Hyderabad. He is a Life Member of MISTE, a Member of IEEE, and UGC-NET qualified. He has presented papers at international and national conferences in various domains. His areas of interest are bioinformatics, databases and artificial intelligence.
