Text compression in LZW and Flate

  1. By Subeer Rangra (08EBKCS059) & Mukul Ranjan (08EBKCS029)
  2. Index
     1. Introduction to Data Compression
     2. Introduction to Text Compression
     3. LZW
        3.1 LZW Encoding Algorithm
        3.2 Encoding a String Example
        3.3 LZW Decoding Algorithm
        3.4 Decoding a String Example
     4. Flate Compression
        4.1 Decomposition
            4.1.1 Huffman Coding
            4.1.2 LZ77 Compression
            4.1.3 Putting Both Together
     5. Advantages and Disadvantages
        5.1 LZW
        5.2 Flate
     6. Conclusion
  3. 1. Introduction to Data Compression
     - Encoding information using fewer bits than the original representation.
     - Data compression is achieved when redundancies are reduced or eliminated.
     - Lossless: no information is lost.
     - Lossy: some information is lost.
     - Compression reduces the data storage space.
  4. Introduction to Data Compression .... contd.
     - Reduces the transmission time needed over the network.
     - Data must be decompressed or decoded to be reused.
     - Can be symmetrical or asymmetrical.
     - Can be implemented in software or hardware.
  5. 2. Introduction to Text Compression
     - The compression of text-based data.
     - There is a major difference between text and image compression: databases, binary programs and text on one side, and sound, image and video signals on the other.
     - Text compression needs lossless compression.
     - Needed for literary works, product catalogues, genomic databases and raw text databases.
  6. 3. LZW (Lempel-Ziv-Welch)
     - Starts with a dictionary of all the single characters and gradually builds up the dictionary as the information is processed.
     - Lossless compression, hence it works well for text compression.
     - A dictionary- or code-table-based encoding algorithm.
     - Uses a code table, with 4096 entries being a common choice.
     - Tries to identify repeated sequences of data and adds them to the code table.
  7. LZW (Lempel-Ziv-Welch) .... contd.
     - A general compression algorithm capable of working on almost any type of data.
     - Large text files in English can typically be compressed to about half their size.
     - Used in GIF (Graphics Interchange Format) to reduce file size without degrading the visual quality.
  8. 3.1 LZW Encoding Algorithm
     1. STRING = get input character
     2. WHILE not end of input stream DO
     3.   CHARACTER = get input character
     4.   IF STRING+CHARACTER is in the string table THEN
     5.     STRING = STRING+CHARACTER
     6.   ELSE
     7.     output the code for STRING
     8.     add STRING+CHARACTER to the string table
     9.     STRING = CHARACTER
     10.  END of IF
     11. END of WHILE
     12. output the code for STRING
  9. LZW Encoding Flowchart
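     To make the pseudocode concrete, here is a minimal Python sketch of the encoding loop. It assumes the same 27-symbol initial dictionary used in the worked example on the following slides ('#' = 0, 'A' = 1, ..., 'Z' = 26); the variable-width bit packing and the '#' stop code are left out for brevity.

        def lzw_encode(data):
            # Initial dictionary of single characters: '#' = 0, 'A' = 1, ..., 'Z' = 26.
            dictionary = {"#": 0}
            dictionary.update({chr(ord("A") + i): i + 1 for i in range(26)})
            next_code = 27  # first code available for multi-character entries

            string = ""
            output = []
            for character in data:
                combined = string + character
                if combined in dictionary:
                    string = combined                  # keep extending the current match
                else:
                    output.append(dictionary[string])  # output the code for STRING
                    dictionary[combined] = next_code   # add STRING+CHARACTER to the table
                    next_code += 1
                    string = character
            if string:
                output.append(dictionary[string])      # output the code for the final STRING
            return output

        print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
        # [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34]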
  10. 3.2 Encoding a String Example
      To encode a string of characters:
      1. First generate an initial dictionary of single characters:

         Symbol | Binary | Decimal
         #      | 00000  | 0
         A      | 00001  | 1
         B      | 00010  | 2
         C      | 00011  | 3
         D      | 00100  | 4
         E      | 00101  | 5
         ...continuing up to Z (11010 = 26)
  11. Encoding a String Example .... contd.
      2. Example: TOBEORNOTTOBEORTOBEORNOT

         Current Sequence | Next Char | Output Code | Output Bits | Extended Dictionary | Comments
         NULL             | T         |             |             |                     |
         T                | O         | 20          | 10100       | 27: TO              | 27 = first available code after 0 through 26
         O                | B         | 15          | 01111       | 28: OB              |
         B                | E         | 2           | 00010       | 29: BE              |
         E                | O         | 5           | 00101       | 30: EO              |
         O                | R         | 15          | 01111       | 31: OR              |
         R                | N         | 18          | 10010       | 32: RN              | 32 requires 6 bits, so for the next output use 6 bits
         N                | O         | 14          | 001110      | 33: NO              |
         O                | T         | 15          | 001111      | 34: OT              |
         T                | T         | 20          | 010100      | 35: TT              |
         TO               | B         | 27          | 011011      | 36: TOB             |
         BE               | O         | 29          | 011101      | 37: BEO             |
  12. Encoding a String Example .... contd.

      Current Sequence | Next Char | Output Code | Output Bits | Extended Dictionary | Comments
      OR               | T         | 31          | 011111      | 38: ORT             |
      TOB              | E         | 36          | 100100      | 39: TOBE            |
      EO               | R         | 30          | 011110      | 40: EOR             |
      RN               | O         | 32          | 100000      | 41: RNO             |
      OT               | #         | 34          | 100010      |                     | # stops the algorithm; send the current sequence
      #                |           | 0           | 000000      |                     | and the stop code
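      The 'Output Bits' column grows from 5 to 6 bits as soon as code 32 exists. Below is a small sketch of that width rule, assuming (as in this example) that exactly one dictionary entry is added per emitted code; the stop code and real packing conventions (e.g. GIF's) are not modelled.

         def pack_codes(codes, alphabet_size=27):
             # Width = bits needed for the highest code in the table at output time;
             # entries 27, 28, ... are added one per emitted code.
             packed = []
             for i, code in enumerate(codes):
                 highest_code = alphabet_size - 1 + i
                 width = highest_code.bit_length()
                 packed.append(format(code, "0{}b".format(width)))
             return packed

         print(pack_codes([20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34]))
         # ['10100', '01111', '00010', '00101', '01111', '10010', '001110', ...]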
  13. 3.3 LZW Decoding Algorithm
      1. Read OLD_CODE
      2. output OLD_CODE
      3. CHARACTER = OLD_CODE
      4. WHILE there are still input characters DO
      5.   Read NEW_CODE
      6.   IF NEW_CODE is not in the translation table THEN
      7.     STRING = get translation of OLD_CODE
      8.     STRING = STRING+CHARACTER
      9.   ELSE
      10.    STRING = get translation of NEW_CODE
      11.  END of IF
      12.  output STRING
      13.  CHARACTER = first character in STRING
      14.  add translation of OLD_CODE + CHARACTER to the translation table
      15.  OLD_CODE = NEW_CODE
      16. END of WHILE
  14. LZW Decoding Flowchart
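      A matching Python sketch of the decoding loop above, again assuming the 27-symbol initial dictionary from the example. The only subtle branch is the case where NEW_CODE is not yet in the table, which can only be the entry currently being built (the previous string plus its own first character).

         def lzw_decode(codes):
             # Same initial dictionary as the encoder, keyed by code this time.
             dictionary = {0: "#"}
             dictionary.update({i + 1: chr(ord("A") + i) for i in range(26)})
             next_code = 27

             previous = dictionary[codes[0]]   # translation of OLD_CODE
             result = [previous]

             for new_code in codes[1:]:
                 if new_code in dictionary:
                     string = dictionary[new_code]
                 else:
                     # NEW_CODE not in the table yet: STRING = previous string + its first char.
                     string = previous + previous[0]
                 result.append(string)
                 # Step 14: add (translation of OLD_CODE) + first character of STRING.
                 dictionary[next_code] = previous + string[0]
                 next_code += 1
                 previous = string
             return "".join(result)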
  15. 3.4 Decoding a String Example
      To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed as they are always simply concatenations of previous entries.

      Input Bits | Code | Output Sequence | New Dictionary Entry (Full) | New Dictionary Entry (Conjecture) | Comments
      10100      | 20   | T               |                             | 27: T?                            |
      01111      | 15   | O               | 27: TO                      | 28: O?                            |
      00010      | 2    | B               | 28: OB                      | 29: B?                            |
      00101      | 5    | E               | 29: BE                      | 30: E?                            |
      01111      | 15   | O               | 30: EO                      | 31: O?                            |
      10010      | 18   | R               | 31: OR                      | 32: R?                            | created code 31 (last to fit in 5 bits), so start reading input at 6 bits
      001110     | 14   | N               | 32: RN                      | 33: N?                            |
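      Using the two sketches above, a quick round trip on the example string confirms that the decoder rebuilds the dictionary on its own, without any code table being transmitted:

         text = "TOBEORNOTTOBEORTOBEORNOT"
         codes = lzw_encode(text)           # [20, 15, 2, 5, 15, 18, 14, ...]
         assert lzw_decode(codes) == text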
  16. 4. Flate Compression
      - A lossless data compression method.
      - Can discover and exploit many patterns in the input data.
      - An improvement over LZW compression: Flate-encoded data is usually much more compact than LZW-encoded output.
      - Originally defined by Phil Katz for version 2 of his PKZIP archiving tool and later specified in RFC 1951.
      - Used in PDF compression: Adobe uses Flate compression for PDF files.
  17. 4.1 Decomposition
      - The Flate specification defines a lossless data format that compresses data using a combination of the LZ77 algorithm and Huffman coding.
      - Hence the format can readily be implemented in a manner not covered by patents.
      - How these two algorithms work is explained below, followed by how they are combined to produce Flate compression.
  18. 4.1.1 Huffman Coding
      - A type of entropy encoding algorithm.
      - Used for lossless data compression.
      - Generates variable-length codes based on the frequency of occurrence of the characters.
      - The idea is to assign the shortest code to the character with the highest probability of occurrence.
  19. Huffman Coding .... contd.
      The algorithm starts by assigning each element a 'weight': a number that represents its relative frequency within the data to be compressed. Taking the set of weights {1, 2, 3, 3, 4} as an example:
      1. The weights are assigned to the leaf nodes of the Huffman tree to be formed.
  20. Huffman Coding .... contd.
      2. In the first step, the two nodes with the lowest weights (highest priority, i.e. lowest probability), 1 and 2, are merged to create a new tree with a root of weight 3.
  21. Huffman Coding .... contd.
      3. There are now three trees with weight 3 at their roots, so one of the singleton 3-weighted nodes is chosen and merged with the tree formed in the previous step, creating a new tree of weight 6.
  22. Huffman Coding .... contd.
      4. Now the two minimum trees are the two singleton nodes of weights 3 and 4. These are combined to form a new tree of weight 7.
  23. Huffman Coding .... contd.
      5. Finally, the last two remaining trees (of weights 6 and 7) are merged into the complete Huffman tree.
  24. Huffman Coding .... contd.
      - When all nodes have been recombined into a single "Huffman tree", any element in the tree can be reached by starting at the root and selecting 0 or 1 at each step.
      - Each element now has a Huffman code: the sequence of 0s and 1s that represents its path through the tree.
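      A compact way to reproduce the construction above is with a priority queue that always merges the two lowest-weight trees. The sketch below uses the same weight set {1, 2, 3, 3, 4}; the symbol names a-e are placeholders, not taken from the slides.

         import heapq

         def huffman_codes(weights):
             # Each heap entry is (total weight, tie-breaker, {symbol: code so far}).
             heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(weights.items())]
             heapq.heapify(heap)
             counter = len(heap)
             while len(heap) > 1:
                 # Merge the two minimum-weight trees, prefixing 0/1 onto their codes.
                 w1, _, left = heapq.heappop(heap)
                 w2, _, right = heapq.heappop(heap)
                 merged = {sym: "0" + code for sym, code in left.items()}
                 merged.update({sym: "1" + code for sym, code in right.items()})
                 heapq.heappush(heap, (w1 + w2, counter, merged))
                 counter += 1
             return heap[0][2]

         print(huffman_codes({"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}))
         # e.g. {'c': '00', 'd': '01', 'a': '100', 'b': '101', 'e': '11'}
         # (exact codes depend on tie-breaking, but lighter weights get longer codes)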
  25. 4.1.2 LZ77 Compression
      - Works by finding sequences of data that are repeated.
      - A lossless data compression algorithm.
      - Maintains a 'sliding window' during compression, which means the compressor keeps a record of what the last characters were.
      - Goes through the text with a sliding window consisting of a search buffer and a look-ahead buffer.
      - The search buffer is used as the dictionary.
  26. LZ77 Compression .... contd.
      1. Suppose the input text is AABABBBABAABABBBABBABB.
      2. The first block found is simply A, encoded as (0,A). The next is AB, encoded as (1,B), where 1 is a reference to A: A|AB|ABBBABAABABBBABBABB
      3. The next block is ABB, which is encoded as (2,B), where 2 is a reference to AB, entered in the dictionary one iteration ago. Continuing this way, the string parses into A|AB|ABB|B|ABA|ABAB|BB|ABBA|BB
  27. LZ77 Compression .... contd.
      At the end of the algorithm, the dictionary is:

      Reference | Phrase | Encoding
      1         | A      | (0,A)
      2         | AB     | (1,B)
      3         | ABB    | (2,B)
      4         | B      | (0,B)
      5         | ABA    | (2,A)
      6         | ABAB   | (5,B)
      7         | BB     | (4,B)
      8         | ABBA   | (3,A)
      9         | BB     | (7,0)
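      The worked example above uses a growing phrase dictionary. The classic sliding-window formulation described on slide 25 instead emits (offset, length, next character) triples that point back into the search buffer. A minimal sketch of that formulation follows; the window and look-ahead sizes are arbitrary illustration values, not taken from the slides or from the Flate format.

         def lz77_encode(data, window=32, lookahead=8):
             i = 0
             triples = []
             while i < len(data):
                 best_offset, best_length = 0, 0
                 # Search the window for the longest match with the look-ahead buffer.
                 for j in range(max(0, i - window), i):
                     length = 0
                     while (length < lookahead
                            and i + length < len(data) - 1
                            and data[j + length] == data[i + length]):
                         length += 1
                     if length > best_length:
                         best_offset, best_length = i - j, length
                 # Emit (offset back into the window, match length, next literal).
                 triples.append((best_offset, best_length, data[i + best_length]))
                 i += best_length + 1
             return triples

         def lz77_decode(triples):
             out = []
             for offset, length, next_char in triples:
                 start = len(out) - offset
                 for k in range(length):
                     out.append(out[start + k])  # copies may overlap what was just written
                 out.append(next_char)
             return "".join(out)

         text = "AABABBBABAABABBBABBABB"
         assert lz77_decode(lz77_encode(text)) == text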
  28. 4.1.3 Putting Both Together
      Flate is a smart algorithm that adapts the way it compresses data to the actual data themselves. The compressor has three modes of compression available:
      1. Not compressed at all: an intelligent choice when the data has already been compressed.
      2. Compression, first with LZ77 and then with a slightly modified version of Huffman coding. The trees that are used are defined by the Flate specification itself.
  29. Putting Both Together .... contd.
      3. Compression, first with LZ77 and then with Huffman coding, using trees that the compressor creates and stores along with the data.
      The data is broken up into blocks; each block uses a single mode of compression.
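      Flate/DEFLATE as specified in RFC 1951 is what Python's zlib module implements (wrapped in a small zlib header and checksum), so the block-based LZ77 + Huffman pipeline described above can be exercised directly. The sample text here is only an illustration.

         import zlib

         text = b"TOBEORNOTTOBEORTOBEORNOT" * 100    # highly redundant sample input
         compressed = zlib.compress(text, 9)          # LZ77 matching + Huffman coding
         assert zlib.decompress(compressed) == text   # lossless round trip
         print(len(text), "bytes ->", len(compressed), "bytes")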
  30. 5. Advantages & Disadvantages
      5.1 LZW
      Advantages
      - It is a lossless compression algorithm, hence no information is lost.
      - The code table need not be passed between the compression and decompression stages.
      - Simple, fast and good compression.
      Disadvantages
      - The dictionary can become too large; one approach is to throw the dictionary away when it reaches a certain size.
      - Useful only for large amounts of text data where redundancy is high.
  31. Advantages & Disadvantages
      5.2 Flate Compression
      Advantages
      - Huffman coding is easy to implement.
      - Flate is a lossless compression technique, hence no loss of text.
      - Simple, fast and good compression.
      - Freedom to choose the type of compression based on the needs of the content.
      Disadvantages
      - Overhead is generated by the Huffman tree generation.
      - The resulting compression code becomes complex as it combines LZ77 and Huffman.
      - It is quite tricky to understand and correctly apply the right combination of LZ77 and Huffman.
  32. 6. Conclusion
      - LZW has various advantages when used to compress large text data, such as English text, which has high redundancy.
      - Both LZW and Flate are software-based, dictionary-based and lossless methods of compression.
      - Text compression needs a lossless compression technique.
      - Flate, which is readily used in PDF files, is an adaptive, flexible and more complex way to compress text.
  33. Thank You
