Sunzip User Tool for Data Reduction Using Huffman Algorithm

International Journal of Modern Computer Science and Applications (IJMCSA), ISSN: 2321-2632 (Online), Volume 1, Issue 2, May 2013. RES Publication, http://www.resindia.org

Ramesh Jangid (M.Tech-Computer Science, Jagannath University, Jaipur, India; e-mail: engr.ramesh29@gmail.com)
Sandeep Kumar (Asst. Prof., Computer Science, Jagannath University, Jaipur, India; e-mail: sandpoonia@gmail.com)

Abstract: Smart Huffman Compression is a software appliance designed to compress files more effectively. Implemented as a JSP application, it works at the high level of abstraction that JSP provides over Java Servlets. Smart Huffman Compression encodes digital information using fewer bits, reducing file size without loss of data, in a single, easy-to-manage software appliance form factor. It also provides a decompression facility. Smart Huffman Compression gives an organization an effective solution for reducing file size through lossless compression, and its encoding functionality also strengthens data security. It is necessary to analyze the relationship between different methods and put them into a common framework in order to better understand and better exploit the possibilities that compression offers for image, text, audio, and video data. [1]

Keywords: Data Reduction, Java Servlet, Compression, Encoding, JSP

I. INTRODUCTION

Smart Huffman Compression/Decompression is a software application designed to simplify file compression and make more efficient use of disk space. It also allows better utilization of bandwidth when transferring data. The forms of data that the application manages are:

Data Compression: Simplifies the compression of text in digital form. The text is encoded using fewer bits, and the original text is replaced by those bits.

Image Compression: Includes segmentation, filtering of pixels, and altering of colours to reduce the size of a digital image.

Audio Compression: Reduces the size of digital audio streams and files, with the potential to reduce both the transmission bandwidth and the storage requirements of audio data.

Video Compression: Reduces the size of digital video streams and files by combining spatial image compression with temporal motion compensation. It is a practical implementation of source coding in information theory. Video compression typically operates on square-shaped groups of neighboring pixels, often called macroblocks.

The concept behind the Huffman algorithm is to use a variable-length code for each element of the information. The algorithm analyzes the information to determine the probability of each element; the most probable elements are coded with few bits and the least probable with a greater number of bits.
II. HUFFMAN ALGORITHM

The Huffman algorithm is a compression technique based on variable-length codes. Given the data symbols and their frequencies of occurrence (their probabilities), it constructs a set of variable-length codewords with the shortest average length and assigns them to the symbols. It generally produces good codes, and like the Shannon-Fano method, it produces the best variable-length codes when the probabilities of the symbols are negative powers of 2. The main difference between the two methods is that Shannon-Fano constructs its codes from the top down, building the bits of each codeword from left to right, while Huffman constructs a code tree from the bottom up, building the bits of each codeword from right to left.

Huffman Encoding Algorithm

Step 1: Find the frequency or probability of each symbol in the given text.
Step 2: List all the source symbols in order of decreasing probability in a tabular format.
Step 3: Combine the probabilities of the two symbols having the lowest probabilities, and reorder the resulting probabilities in decreasing order. This step is called reduction 1.
Step 4: Repeat step 3 until only two ordered probabilities remain.
Step 5: Go back and assign 0 and 1 to the two probabilities that were combined in the last reduction step, retaining all assignments made in earlier reductions.
Step 6: Keep working backward in this way until the first column is reached.

Example

Let the given text be: SIDVICIIISIDIDVI

There are five symbols in this text. Their probabilities are:

Symbol   Frequency   Probability
'C'      1/16        0.0625
'D'      3/16        0.1875
'I'      8/16        0.5
'S'      2/16        0.125
'V'      2/16        0.125

According to step 2, ordered by decreasing probability:

Symbol   Probability
'I'      0.5
'D'      0.1875
'S'      0.125
'V'      0.125
'C'      0.0625

Steps 3, 4, and 5 are applied as shown in Figure 1.

Figure 1: Procedure of Huffman Encoding

We can now write the codes for the individual symbols:

Symbol   Code   Code Length
'C'      1001   4
'D'      11     2
'I'      0      1
'S'      101    3
'V'      1000   4

Given the code lengths, the average code length for this text is

L = Σ (Pi × Ni), for i = 1 to m

where Pi is the probability of symbol i and Ni is the code length of symbol i. So

L = (0.0625 × 4) + (0.1875 × 2) + (0.5 × 1) + (0.125 × 3) + (0.125 × 4) = 2

The encoded message becomes:

SIDVICIIISIDIDVI = 101 0 11 1000 0 1001 0 0 0 101 0 11 0 11 1000 0

(The spaces are only to make reading easier.) The compressed output takes 32 bits, and we need at least 10 bits to transfer the Huffman tree by sending the code lengths. The message originally took 48 bits; it now takes at least 42 bits. The codes are used to construct the Huffman tree.

Figure 2: Huffman Tree
Huffman Decoding

The codes of the symbols are based on their probabilities or frequencies of occurrence. The probabilities or frequencies have to be written, as side information, on the output, so that any Huffman decompressor (decoder) will be able to decompress the data. This is easy, because the frequencies are integers and the probabilities can be written as scaled integers; it normally adds just a few hundred bytes to the output. It is also possible to write the variable-length codes themselves on the output, but this may be awkward because the codes have different sizes.

The decoding algorithm is simple. Start at the root and read the first bit off the input (the compressed file). If it is zero, follow the bottom edge of the tree; if it is one, follow the top edge. Read the next bit and move another edge toward the leaves of the tree. When the decoder arrives at a leaf, it finds there the original, uncompressed symbol (normally its ASCII code), and that symbol is emitted by the decoder. The process starts again at the root with the next bit.
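The paper does not include the SunZip source, so the following is only a minimal, self-contained Java sketch of the construction and decoding steps described above, using the usual priority-queue formulation. All names (Node, buildTree, assignCodes, decode) are ours, not the tool's, and the sketch labels edges left = 0 and right = 1 rather than the bottom/top convention used in the text.

```java
import java.util.*;

public class HuffmanSketch {
    // Leaf nodes carry a symbol; internal nodes carry only a combined frequency.
    static class Node {
        final char symbol; final int freq; final Node left, right;
        Node(char s, int f) { symbol = s; freq = f; left = right = null; }
        Node(Node l, Node r) { symbol = 0; freq = l.freq + r.freq; left = l; right = r; }
        boolean isLeaf() { return left == null; }
    }

    // Bottom-up construction: repeatedly merge the two lowest-frequency nodes.
    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        freq.forEach((s, f) -> pq.add(new Node(s, f)));
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        return pq.poll();
    }

    // Walk the tree, appending '0' for a left edge and '1' for a right edge.
    static void assignCodes(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) { codes.put(n.symbol, prefix); return; }
        assignCodes(n.left, prefix + '0', codes);
        assignCodes(n.right, prefix + '1', codes);
    }

    // Decode by walking the tree bit by bit; reaching a leaf emits a symbol
    // and resets the walk to the root, exactly as described in the text.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (char bit : bits.toCharArray()) {
            n = (bit == '0') ? n.left : n.right;
            if (n.isLeaf()) { out.append(n.symbol); n = root; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "SIDVICIIISIDIDVI";  // the example text from section II
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        Node root = buildTree(freq);
        Map<Character, String> codes = new HashMap<>();
        assignCodes(root, "", codes);

        StringBuilder encoded = new StringBuilder();
        for (char c : text.toCharArray()) encoded.append(codes.get(c));

        // Individual codes may differ from the table when frequency ties are
        // broken differently, but the total is 32 bits either way.
        System.out.println(codes);
        System.out.println(encoded.length() + " bits");
        System.out.println(decode(root, encoded.toString()));  // round-trips to SIDVICIIISIDIDVI
    }
}
```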
III. HUFFMAN PERFORMANCE

Huffman coding has been the subject of intensive research in data compression, and there is an algebraic approach to constructing Huffman codes. Robert Gallager showed that the redundancy of Huffman coding is at most p1 + 0.086, where p1 is the probability of the most common symbol in the alphabet. The redundancy is the difference between the average Huffman codeword length and the entropy. Given a large alphabet, such as the set of letters, digits, and punctuation marks used by a natural language, the largest symbol probability is typically around 15-20%, bringing the value of the quantity p1 + 0.086 to around 0.1. This means that Huffman codes are at most about 0.1 bit longer per symbol than those of an ideal entropy encoder, such as arithmetic coding.
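To make the redundancy notion concrete, here is a small sketch (ours, not the paper's) that computes the entropy of the section II example and compares it with the average code length L = 2 obtained there:

```java
// Entropy check for the SIDVICIIISIDIDVI example from section II.
public class EntropyCheck {
    public static void main(String[] args) {
        double[] p = {0.5, 0.1875, 0.125, 0.125, 0.0625};  // I, D, S, V, C
        double h = 0;
        for (double pi : p) h -= pi * Math.log(pi) / Math.log(2);  // H = -sum pi*log2(pi)
        System.out.printf("entropy H   = %.4f bits/symbol%n", h);  // about 1.9528
        System.out.printf("redundancy  = %.4f bits/symbol%n", 2.0 - h);  // about 0.047,
        // comfortably within Gallager's bound p1 + 0.086 = 0.5 + 0.086 = 0.586
    }
}
```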
The Huffman method assumes that the frequencies of occurrence of all the symbols of the alphabet are known to the compressor. In practice, the frequencies are seldom, if ever, known in advance. One approach to this problem is for the compressor to read the original data twice: the first time it only counts the frequencies, and the second time it compresses the data. Between the two passes, the compressor constructs the Huffman tree. Such a two-pass method is sometimes called semi-adaptive and is normally too slow to be practical.

The method that is used in practice is called adaptive (or dynamic) Huffman coding. This method is the basis of the UNIX compact program. The method was originally developed by Faller and Gallager, with substantial improvements by Knuth. The main idea is for the compressor and the decompressor to start with an empty Huffman tree and to modify it as symbols are being read and processed (in the case of the compressor, "processed" means compressed; in the case of the decompressor, it means decompressed). The compressor and decompressor should modify the tree in the same way, so that at any point in the process they use the same codes, although those codes may change from step to step. We say that the compressor and decompressor are synchronized, or that they work in lockstep, although they don't necessarily work together; compression and decompression normally take place at different times. The term mirroring is perhaps a better choice: the decoder mirrors the operations of the encoder.

Initially, the compressor starts with an empty Huffman tree; no symbols have been assigned codes yet. The first symbol read is simply written on the output in its uncompressed form. The symbol is then added to the tree and a code assigned to it. The next time this symbol is encountered, its current code is written on the output and its frequency is incremented by 1. Since this modifies the tree, the tree is examined to see whether it is still a Huffman tree (best codes). If not, it is rearranged, an operation that results in modified codes. The decompressor mirrors the same steps: when it reads the uncompressed form of a symbol, it adds it to the tree and assigns it a code; when it reads a compressed variable-length code, it scans the current tree to determine which symbol the code belongs to, then increments the symbol's frequency and rearranges the tree in the same way as the compressor.

It is immediately clear that the decompressor needs to know whether the item it has just input is an uncompressed symbol (normally an 8-bit ASCII code) or a variable-length code. To remove any ambiguity, each uncompressed symbol is preceded by a special, variable-size escape code. When the decompressor reads this code, it knows that the next eight bits are the ASCII code of a symbol that appears in the compressed file for the first time.

The trouble is that the escape code must not be any of the variable-length codes used for the symbols. These codes, however, are modified every time the tree is rearranged, which is why the escape code must be modified as well. A natural way to do this is to add an empty leaf to the tree, a leaf with a zero frequency of occurrence, that is always assigned to the 0-branch of the tree. Since the leaf is in the tree, it is assigned a variable-length code; this code is the escape code that precedes every uncompressed symbol. As the tree is rearranged, the position of the empty leaf, and thus its code, changes, but this escape code is always used to identify uncompressed symbols in the compressed file.
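The sibling-swap rules of the Faller-Gallager-Knuth method are beyond the scope of a short example, but the lockstep idea can be sketched with a deliberately naive variant (ours, not the algorithm described above) that rebuilds the whole code table after each symbol. It reuses buildTree and assignCodes from the HuffmanSketch example earlier in this section:

```java
import java.util.*;

// Naive "adaptive" model: after every symbol, the code table is rebuilt from
// scratch from the updated frequencies. Encoder and decoder both run this same
// update, so they stay in lockstep. A real implementation (Faller-Gallager-
// Knuth) rearranges the tree incrementally instead of rebuilding it.
public class NaiveAdaptiveModel {
    private final Map<Character, Integer> freq = new HashMap<>();
    private Map<Character, String> codes = new HashMap<>();

    // Returns the current code for c, or null when c is seen for the first
    // time; the caller then emits the escape code followed by the raw symbol.
    public String step(char c) {
        String code = codes.get(c);
        freq.merge(c, 1, Integer::sum);
        Map<Character, String> next = new HashMap<>();
        if (freq.size() > 1) {  // a one-symbol tree has no edges to label yet
            HuffmanSketch.assignCodes(HuffmanSketch.buildTree(freq), "", next);
        }
        codes = next;
        return code;
    }
}
```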
IV. ADVANCEMENTS IN HUFFMAN

Huffman coding is a process that replaces fixed-length symbols (8-bit bytes) with variable-length codes. GNU zip, also known as GZIP, is a compression tool originally intended to replace the compress program used in early Unix systems, and it can be regarded as an advancement of the Huffman algorithm. It is based on an algorithm known as DEFLATE, which is also a lossless data compression algorithm and uses both the LZ77 algorithm and Huffman coding.

Essentially, GZIP refers to the file format of the same name. This format consists of a 10-byte header containing a magic number (a fixed numerical or text value used to signify a file format or protocol, one that never changes and cannot be mistaken for anything else); optional extra headers that may or may not be present (the original file name, for example); a body containing the DEFLATE-compressed payload, which is the data that the headers describe; and an 8-byte footer containing a CRC-32 checksum as well as the length of the original uncompressed data.

GZIP is used when a huge file is compressed; it is very beneficial when we need to save space and time, since it compresses a file into very little space. Because GZIP compresses one large file instead of multiple smaller ones, it can take advantage of redundancy across the files to reduce the size even further. GZIP is purely a compression tool; it relies on another tool, tar, to archive files. Compression is a technique used to reduce the size of a file, while archiving is a technique used to combine multiple files into a single one; in practice, tar gathers all the files into a single tarball, which GZIP then compresses. GZIP is used in UNIX-like operating systems such as the Linux distributions.
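Java's standard library exposes GZIP directly through java.util.zip, so the format described above can be exercised without external tools. A minimal sketch, with file names that are purely illustrative:

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

// Minimal GZIP compression using the JDK's built-in java.util.zip support.
public class GzipExample {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("report.txt");      // hypothetical input file
        Path out = Paths.get("report.txt.gz");
        try (OutputStream os = new GZIPOutputStream(Files.newOutputStream(out))) {
            Files.copy(in, os);  // stream writes the header, DEFLATE body, and CRC-32 footer
        }
        System.out.println(Files.size(in) + " -> " + Files.size(out) + " bytes");
    }
}
```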
V. BENEFITS OF HUFFMAN ENCODING

Huffman encoding is one of the best compression techniques: it is fast, simple, and easy to implement. It starts with a set of symbols whose probabilities are known and constructs a code tree; when the tree is complete, it determines the variable-length prefix codewords for the individual symbols in the text. Huffman compression is applied in three main ways.

Static codes: the implementer of a Huffman compressor/decompressor selects a set of documents judged to be typical, analyzes them, and counts the occurrences of each symbol. Based on these counts, the implementer constructs a fixed Huffman code tree. These codes may not match the symbol probabilities of any particular input file being compressed, but the approach is simple and fast, which is why it is used in fax machines.

Two-pass compression: this produces the ideal codewords for the input file, but the input file is read twice, so the approach is slow. In the first pass, the encoder counts the symbol occurrences and determines the probability of each symbol; it uses this information to construct the Huffman codewords for the file being compressed. In the second pass, the encoder actually compresses the data by replacing each symbol with its respective codeword.

Adaptive compression: adaptive Huffman compression starts with an empty Huffman code tree and updates the tree as the input symbols are read and processed. When a symbol is input, the tree is searched for it; if the symbol is in the tree, its codeword is used, otherwise it is added to the tree and a new codeword is assigned to it. In the latter case the tree is examined and rearranged to keep it a Huffman code tree. This process has to be done carefully to make sure that the decoder can perform it in the same way as the encoder, in lockstep, which makes it difficult to implement.

VI. SAMPLE SCREENSHOTS

Figures 1 and 2 (screenshots) show the SunZip compression tool, which displays the original file size, the number of distinct characters, the compressed file size, and the compression ratio. Full details for the tested file formats (text and MP3) are given in the tables below.

VII. COMPARISON TABLE

Type of file: TXT

Algorithm          S.No   Original Size   Compressed Size   Compression Ratio   Distinct Chars   Best
HUFFMAN            1      1702            1081              63.51%              50
                   2      334             321               96.11%              45
                   3      48890           32249             65.96%              93
SHANNON-FANO       1      1702            1114              65.45%              50
                   2      334             331               99.10%              45
                   3      48890           33666             68.86%              93
GZIP               1      1702            812               47.71%                               yes
                   2      334             183               54.79%
                   3      48890           10734             21.96%
COSMO              1      1702            1335              78.44%              50
                   2      334             304               91.02%              45
                   3      48890           42880             87.71%              93
JUNK CODE BINARY   1      1702            1205              70.80%              50
                   2      334             276               82.63%              45
                   3      48890           37950             77.62%              93
LZW                1      1702            1273              74.79%                               yes
                   2      334             333               99.70%
                   3      48890           23058             47.16%

Table 1

Remarks (algorithms ordered from best to worst):
Time: GZIP, HUFFMAN, SHANNON-FANO, JUNK CODE BINARY, LZW
Compression: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON-FANO, COSMO
Space required: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON-FANO, COSMO

Table 1 uses text files, and the data reduction techniques are distinguished by time, compression ratio, and space required. It indicates that GZIP and LZW are the better techniques for compressing text files. Table 2 repeats the comparison for the MP3 file format, where Huffman and GZIP are the better techniques; overall, Huffman provides good data reduction.
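The compression ratio column in both tables is simply the compressed size expressed as a percentage of the original size (smaller is better, and values above 100% mean the "compressed" file grew). A one-method sketch of ours, reproducing the first HUFFMAN row of Table 1:

```java
// Compression ratio as reported in Tables 1 and 2.
public class Ratio {
    static double ratio(long originalBytes, long compressedBytes) {
        return 100.0 * compressedBytes / originalBytes;
    }
    public static void main(String[] args) {
        System.out.printf("%.2f%%%n", ratio(1702, 1081));  // prints 63.51%
    }
}
```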
Type of file: MP3

Algorithm          S.No   Original Size   Compressed Size   Compression Ratio   Distinct Chars   Better Algo
HUFFMAN            1      7323106         7294912           99.62%              256              yes
                   2      4789888         4781161           99.82%              256
                   3      5933509         5904888           99.52%              256
SHANNON-FANO       1      7323106         7404275           101.11%             256
                   2      4789888         4865326           101.57%             256
                   3      5933509         5986265           100.89%             256
GZIP               1      7323106         7223973           98.65%                               yes
                   2      4789888         4733205           98.82%
                   3      5933509         5846223           98.53%
JUNK CODE BINARY   1      7323106         7854595           107.26%             256
                   2      4789888         5153463           107.59%             256
                   3      5933509         6363777           107.25%             256
RLE                1      7323106         7411408           101.21%
                   2      4789888         4834808           100.94%
                   3      5933509         5994496           101.03%
LZW                1      7323106         10186483          139.10%
                   2      4789888         6820672           142.40%
                   3      5933509         8217964           138.50%

Table 2

Remarks (algorithms ordered from best to worst):
Time: RLE, GZIP, HUFFMAN, SHANNON-FANO, JUNK CODE BINARY, LZW
Compression: GZIP, HUFFMAN, RLE, SHANNON-FANO, JUNK CODE BINARY, LZW
Space required: GZIP, HUFFMAN, RLE, SHANNON-FANO, JUNK CODE BINARY, LZW

VIII. CONCLUSION

The Huffman algorithm is a lossless compression technique. Huffman is the most efficient of the methods compared, but it requires two passes over the data. The amount of compression, of course, depends on the type of file being compressed. Data with little redundancy, such as executable programs or object code files, typically compresses poorly, resulting in a file that is 50 to 95% of the original file size. Still images and animation files tend to compress well and typically result in a file that is only 2 to 20% of the original file size. It should be noted that once a file has been compressed, there is virtually no gain in compressing it again; storing or transmitting compressed files over a system that applies further compression will not increase the compression ratio. In DEFLATE-style formats, Huffman codes are also used to differentiate between kinds of data, i.e. literal values and back references.

IX. REFERENCES

[1] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Transactions on Information Theory, vol. 42, no. 3, pp. 972-976, May 1996.
[2] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. 23, pp. 337-343, May 1977.
[3] Mridul K. M., "Lossless Huffman coding technique for image compression and reconstruction using binary trees," IJCTA, vol. 3, no. 1, pp. 76-79, Feb. 2012.
[4] A. B. Watson, "Image Compression Using the DCT," Mathematica Journal, pp. 81-88, 1995.
[5] D. E. Knuth, "Dynamic Huffman Coding," Journal of Algorithms, vol. 6, pp. 163-180, 1985.
[6] Dzung Tien Hoang and Jeffrey Scott Vitter, "Fast and Efficient Algorithms for Video Compression and Rate Control," June 20, 1998.

AUTHORS' BIOGRAPHIES

First Author: Ramesh Jangid is an M.Tech (Computer Science) student at Jagannath University, Jaipur, and a member of IACSIT and IAENG. He received a B.E. in computer science engineering from Rajasthan University, Jaipur, in 2008. His specializations are data structures, computer networking, Red Hat Linux, real-time systems, and cloud computing.

Second Author: Mr. Sandeep Kumar is an Assistant Professor in the computer science department at Jagannath University, Jaipur. He holds an M.Tech, is pursuing a Ph.D., and has published various journal and international papers. He is a member of IACSIT and IAENG. His areas of specialization are data structures, computer networks, artificial intelligence, and database management systems.
