By Subeer Rangra (08EBKCS059) & Mukul Ranjan (08EBKCS029)
Index
1.   Introduction to Data Compression
2.   Introduction to Text Compression
3.   LZW
     3.1 LZW Encoding Algorithm
     3.2 Encoding a String Example
     3.3 LZW Decoding Algorithm
     3.4 Decoding a String Example
4.   Flate Compression
     4.1 Decomposition
        4.1.1 Huffman Coding
        4.1.2 LZ77 Compression
        4.1.3 Putting both together
5.   Advantages and Disadvantages
     5.1 LZW
     5.2 Flate
6.   Conclusion
1. Introduction to Data Compression
 Encoding information using fewer bits than the
 original representation.
 Data compression is achieved when redundancies are
 reduced or eliminated.
 Lossless: no information is lost.

 Lossy: some information is lost.

 Compression reduces the data storage space.
Introduction to Data Compression (contd.)
 Reduces the transmission time needed over the network.

 Data must be decompressed or decoded before it can be reused.

 Can be symmetrical (compression and decompression take similar effort) or asymmetrical.

 Can be implemented in software or hardware.
2. Introduction to Text Compression
 The compression of text-based data.

 There is a major difference between text and image compression:
  databases, binary programs and text are on one side, and sound,
  image and video signals on the other.

 Text compression needs lossless compression.

 Needed in literary works, product catalogues, genomic
  databases, raw text databases.
3. LZW (Lempel-Ziv-Welch)
 Starts with a dictionary of all the single characters and gradually
  extends the dictionary as the input is processed.

 Lossless, hence well suited to text compression.

 A dictionary (code table) based encoding algorithm.

 Uses a code table; 4096 entries (12-bit codes) is a common choice.

 Identifies repeated sequences of data and adds them to
  the code table.
LZW (Lempel-Ziv-Welch) (contd.)
 A general compression algorithm capable of working
  on almost any type of data.

 Large text files in English can typically be
  compressed to about half their original size.

 Used in GIF (Graphics Interchange Format) to reduce
  the file size without degrading the visual quality.
3.1 LZW Encoding Algorithm
STRING = get input character
WHILE not end of input stream DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
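The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a production encoder: the function name `lzw_encode` and the `alphabet` parameter are assumptions made for this example, and the table is allowed to grow without the 4096-entry limit mentioned earlier.

```python
import string


def lzw_encode(text, alphabet):
    """Return the list of LZW codes for `text`.

    `alphabet` lists the single-character entries in code order
    (hypothetical parameter; the slides fix it to '#', 'A'..'Z').
    """
    table = {symbol: code for code, symbol in enumerate(alphabet)}
    if not text:
        return []
    codes = []
    current = text[0]                         # STRING = get input character
    for ch in text[1:]:                       # WHILE not end of input stream
        if current + ch in table:
            current += ch                     # keep extending the match
        else:
            codes.append(table[current])      # output the code for STRING
            table[current + ch] = len(table)  # next free code
            current = ch
    codes.append(table[current])              # output the code for the last STRING
    return codes


ALPHABET = "#" + string.ascii_uppercase       # '#' = 0, 'A' = 1, ..., 'Z' = 26
codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", ALPHABET)
print(codes)
```

With the 27-symbol initial dictionary used in the example of section 3.2, the printed codes should agree with the Code column of that example.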
LZW Encoding Flowchart
3.2 Encoding a String Example
 To encode a string of characters:
1.   First generate an initial dictionary of single characters.

              Symbol      Binary      Decimal
              #           00000       0
              A           00001       1
              B           00010       2
              C           00011       3
              D           00100       4
              E           00101       5
              ...         (continues up to Z = 11010 = 26)
Encoding a String Example (contd.)
2. Example: TOBEORNOTTOBEORTOBEORNOT# (the # marks the end of the input)

   Current    Next    Output   Output    Extended
   Sequence   Char    Code     Bits      Dictionary   Comments
   NULL       T
   T          O       20       10100     27: TO       27 = first available code after 0 through 26
   O          B       15       01111     28: OB
   B          E       2        00010     29: BE
   E          O       5        00101     30: EO
   O          R       15       01111     31: OR
   R          N       18       10010     32: RN       32 requires 6 bits, so for next output use 6 bits
   N          O       14       001110    33: NO
   O          T       15       001111    34: OT
   T          T       20       010100    35: TT
   TO         B       27       011011    36: TOB
   BE         O       29       011101    37: BEO
   OR         T       31       011111    38: ORT
   TOB        E       36       100100    39: TOBE
   EO         R       30       011110    40: EOR
   RN         O       32       100000    41: RNO
   OT         #       34       100010                 # stops the algorithm; send the current sequence
                      0        000000                 and the stop code
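The widths in the Bits column follow from the size of the dictionary at the moment each code is sent. A small sketch of that rule, assuming one dictionary entry is added after every output and that the width is just large enough for the highest code assigned so far (the helper name `pack_codes` is hypothetical):

```python
def pack_codes(codes, initial_size):
    """Render each code with the bit width in force when it is emitted.

    Before the i-th output (0-based) the table holds initial_size + i
    entries, so the highest assigned code is initial_size + i - 1.
    """
    bit_strings = []
    for i, code in enumerate(codes):
        width = (initial_size + i - 1).bit_length()
        bit_strings.append(format(code, "0{}b".format(width)))
    return bit_strings


# Codes taken from the Code column of the table above.
table_codes = [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]
bit_strings = pack_codes(table_codes, initial_size=27)
total_bits = sum(len(b) for b in bit_strings)
print(bit_strings)
print(total_bits, "bits, versus", 25 * 5, "bits at a fixed 5 bits per symbol")
```

Under these assumptions the first six codes are written with 5 bits and the remaining eleven with 6 bits, as in the table.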
3.3 LZW Decoding Algorithm
Read OLD_CODE
STRING = get translation of OLD_CODE
output STRING
CHARACTER = first character in STRING
WHILE there are still input codes DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add translation of OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
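A Python sketch of the decoding loop above, under the same assumptions as the encoding example (initial dictionary '#', 'A'..'Z'; the function name `lzw_decode` is hypothetical). The only case where a code is missing from the table is when it is the entry the decoder is about to create, which is what the IF branch handles.

```python
import string


def lzw_decode(codes, alphabet):
    """Rebuild the text from a list of LZW codes."""
    table = {code: symbol for code, symbol in enumerate(alphabet)}
    if not codes:
        return ""
    previous = table[codes[0]]                # translation of OLD_CODE
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code not yet in the table: it must be previous + its first char.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code: {}".format(code))
        pieces.append(entry)
        table[len(table)] = previous + entry[0]   # OLD translation + CHARACTER
        previous = entry
    return "".join(pieces)


ALPHABET = "#" + string.ascii_uppercase
decoded = lzw_decode(
    [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0],
    ALPHABET,
)
print(decoded)
```

The decoder never receives the extended dictionary; it rebuilds entries 27 onward from the codes alone.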
LZW Decoding Flowchart
3.4 Decoding a String Example
 To decode LZW-compressed data, one needs to know
   in advance the initial dictionary used, but additional
   entries can be reconstructed as they are always simply
   concatenations of previous entries.

   Input    Input   Output     New Dictionary Entry
   Bits     Code    Sequence   Full       Conjecture   Comments
   10100    20      T                     27: T?
   01111    15      O          27: TO     28: O?
   00010    2       B          28: OB     29: B?
   00101    5       E          29: BE     30: E?
   01111    15      O          30: EO     31: O?
   10010    18      R          31: OR     32: R?       created code 31 (last to fit in 5 bits)
   001110   14      N          32: RN     33: N?       so start reading input at 6 bits
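The switch from 5-bit to 6-bit input can be sketched as a small bit reader. This assumes the decoder counts its conjectured entry as well as the completed ones, so that it changes width at the same point as the encoder; the helper name `unpack_codes` is hypothetical.

```python
def unpack_codes(bitstring, initial_size):
    """Split a string of '0'/'1' characters into variable-width codes."""
    codes = []
    pos = 0
    while pos < len(bitstring):
        # Width needed for the highest code the encoder could have used
        # when it wrote this code (one new entry per code already read).
        width = (initial_size + len(codes) - 1).bit_length()
        chunk = bitstring[pos:pos + width]
        if len(chunk) < width:
            raise ValueError("truncated input")
        codes.append(int(chunk, 2))
        pos += width
    return codes


# The seven inputs shown in the table above, concatenated.
bits = "10100" + "01111" + "00010" + "00101" + "01111" + "10010" + "001110"
read_codes = unpack_codes(bits, initial_size=27)
print(read_codes)
```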
4. Flate Compression
 A lossless data compression method.
 Can discover and exploit many patterns in the input
  data.
 An improvement over LZW compression: Flate-encoded
  data is usually much more compact than
  LZW-encoded output.
 Originally defined by Phil Katz for version 2 of
  his PKZIP archiving tool (where it is called Deflate)
  and later specified in RFC 1951.
 Used in PDF files, where it appears as the Flate
  (FlateDecode) compression filter.
4.1 Decomposition
 The Flate specification defines a lossless data format that
  compresses data using a combination of the LZ77 algorithm
  and Huffman coding.
 The format can be implemented readily in a manner
  not covered by patents.
 How each of these two algorithms works is
  explained below, followed by how the two are combined
  to produce Flate compression.
4.1.1 Huffman Coding
 A type of entropy encoding algorithm.

 Used for lossless data compression.

 Generates variable-length codes.

 The variable-length codes are generated based on the
 frequency of occurrence of the characters.
 The idea is to assign the shortest code to the character
 with the highest probability of occurrence.
Huffman Coding (contd.)
 The algorithm starts by assigning each element a
  'weight', a number that represents its relative
  frequency within the data to be compressed.
Taking as an example the set of weights {1,2,3,3,4}:
1. The weights are assigned to the leaves of the
   Huffman tree to be formed.
2. In the first step, the two nodes with the lowest weights
   (highest priority, lowest probability), 1 and 2, are
   merged to create a new tree with a root of weight 3.
3. Now there are three nodes with weight 3 at their
   roots. The new tree is merged with one of the two
   singleton nodes of weight 3, giving a tree of weight 6.
4. Now our two minimum trees are the two singleton
   nodes of weights 3 and 4. We combine these to
   form a new tree of weight 7.
5. Finally we merge our last two remaining trees, of
   weights 6 and 7, into a single tree of weight 13.
 When all nodes have been combined into a single
  "Huffman tree", then by starting at the root and
  selecting 0 or 1 at each step, you can reach any element
  in the tree.
 Each element now has a Huffman code, which is the
  sequence of 0s and 1s that represents its path
  through the tree.
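The merging procedure can be sketched with a priority queue. This is an illustrative sketch, not the exact tree drawn on the slides: the symbol names `a` to `e` are invented labels for the weights {1,2,3,3,4}, and ties between equal weights may be broken differently, which changes individual codes but not the total encoded length.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Return {symbol: bit string} for a dict mapping symbol -> weight."""
    order = count()                       # tie-breaker so nodes are never compared
    heap = [(w, next(order), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # two lowest-weight trees
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                 # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


weights = {"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}   # hypothetical symbols
codes = huffman_codes(weights)
total_bits = sum(weights[s] * len(codes[s]) for s in weights)
print(codes, total_bits)
```

The two lightest symbols end up with the longest codes, and no code is a prefix of another, so a decoder can read the bit stream without separators.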
4.1.2 LZ77 Compression
 Works by finding sequences of data that are
  repeated.
 A lossless data compression algorithm.
 Maintains a 'sliding window' during compression,
  which means that the compressor keeps a record of
  the most recently seen characters.
 Goes through the text with a sliding window consisting
  of a search buffer and a look-ahead buffer.
 The search buffer is used as the dictionary.
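A minimal sketch of the sliding-window idea, assuming the common textbook output format of (offset, length, next character) triples. The function names and the default buffer sizes are choices made for this illustration only; they are not taken from the Flate specification, which encodes matches differently.

```python
def lz77_encode(data, window=16, lookahead=8):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    triples = []
    i = 0
    while i < len(data):
        best_offset, best_length = 0, 0
        start = max(0, i - window)                     # search buffer start
        max_len = min(lookahead, len(data) - i - 1)    # keep one char for next_char
        for j in range(start, i):
            length = 0
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        triples.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return triples


def lz77_decode(triples):
    """Invert lz77_encode by copying from the already decoded output."""
    out = []
    for offset, length, next_char in triples:
        for _ in range(length):
            out.append(out[-offset])       # copy one char at a time (may overlap)
        out.append(next_char)
    return "".join(out)


text = "AABABBBABAABABBBABBABB"
triples = lz77_encode(text)
print(triples)
print(lz77_decode(triples) == text)
```

Here the search buffer is simply the last `window` characters already encoded, and each triple says "go back offset characters, copy length characters, then append next_char".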
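LZ77 Compression (contd.)
 Note: the worked example below parses the input into numbered
  dictionary phrases, which is the LZ78 style of the Lempel-Ziv
  family; LZ77 itself refers back into the sliding window by
  offset and length.
1. Suppose the input text is
    AABABBBABAABABBBABBABB
2. The first block found is simply A, encoded as (0,A), where 0
   means "no earlier phrase". The next is AB, encoded as (1,B),
   where 1 is a reference to A:
    A|AB|ABBBABAABABBBABBABB
3. The next block is ABB, which is encoded as (2,B),
   where 2 is a reference to AB, entered in the
   dictionary one iteration ago. Continuing this way, the
   string parses into
    A|AB|ABB|B|ABA|ABAB|BB|ABBA|BB
 At the end of the algorithm, the dictionary is:
              Reference       Phrase        Encoding
              1               A             (0,A)
              2               AB            (1,B)
              3               ABB           (2,B)
              4               B             (0,B)
              5               ABA           (2,A)
              6               ABAB          (5,B)
              7               BB            (4,B)
              8               ABBA          (3,A)
              9               BB            (7,0)
 The final phrase BB is already in the dictionary as entry 7, so it
  is written as a bare reference with no new character, (7,0).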
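The (reference, character) pairs in this table can be reproduced with a short phrase-dictionary parser of the kind usually described as LZ78. This is a sketch under that reading of the example; the function name is hypothetical, and the final pair uses an empty string where the table writes 0 for "no new character".

```python
def lz78_parse(text):
    """Return (reference, char) pairs; reference 0 means 'no earlier phrase'."""
    dictionary = {"": 0}          # phrase -> reference number
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                              # keep extending a known phrase
        else:
            pairs.append((dictionary[phrase], ch))    # known prefix + one new char
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                        # input ended inside a known phrase
        pairs.append((dictionary[phrase], ""))
    return pairs


pairs = lz78_parse("AABABBBABAABABBBABBABB")
print(pairs)
```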
4.1.3 Putting Both Together
Flate adapts the way it compresses data to the data
itself. The compressor has three modes of compression
available:
1. No compression at all, a sensible choice when the
    data has already been compressed.
2. Compression first with LZ77 and then with a slightly
    modified version of Huffman coding, using trees that
    are defined by the Flate specification itself.
3. Compression first with LZ77 and then with Huffman
   coding, using trees that the compressor creates and stores
   along with the data.
The data is broken up into blocks, and each block uses a
single mode of compression.
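Flate streams can be produced with Python's standard `zlib` module, which is one implementation of this format. The sketch below is only an illustration: the helper names are invented, the sample text is arbitrary, and the library rather than the caller decides which block mode to use. Level 0 asks for no compression, while level 9 asks for the strongest LZ77 plus Huffman compression; a negative `wbits` value selects a raw stream without the zlib header.

```python
import zlib


def raw_deflate(data, level):
    """Compress `data` to a raw Deflate stream (no zlib header or checksum)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(blob):
    """Decompress a raw Deflate stream."""
    return zlib.decompress(blob, -15)


text = b"TOBEORNOTTOBEORTOBEORNOT" * 40     # arbitrary, highly repetitive sample
stored = raw_deflate(text, 0)               # level 0: data kept uncompressed
packed = raw_deflate(text, 9)               # level 9: LZ77 + Huffman coding

print(len(text), len(stored), len(packed))
print(raw_inflate(stored) == text, raw_inflate(packed) == text)
```

The uncompressed output is slightly larger than the input because each block still carries a small header, which is why that mode is only worthwhile for data that does not compress.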
5. Advantages & Disadvantages
5.1 LZW
Advantages
   Is a lossless compression algorithm, hence no information is lost.
   The code table need not be passed between
    the compressor and the decompressor.
   Simple and fast, with good compression.
Disadvantages
   The dictionary can become too large; one approach is to throw
    the dictionary away when it reaches a certain size.
   Useful mainly for large amounts of text data where redundancy
    is high.
Advantages & Disadvantages (contd.)
5.2 Flate Compression
Advantages
    Huffman coding is easy to implement.
    Flate is a lossless compression technique, hence no loss of text.
    Fast, with good compression.
    Freedom to choose the type of compression based on the needs of the
     content.
Disadvantages
    Overhead is generated by Huffman tree generation.
    The resulting compression code becomes complex, as it
     combines LZ77 and Huffman coding.
    It is quite tricky to understand and correctly apply the right
     combination of LZ77 and Huffman coding.
6. Conclusion
 LZW has various advantages when used to
  compress large amounts of English text, which
  has high redundancy.
 Both LZW and Flate are software-based, dictionary-based
  and lossless methods of compression.
 Text compression needs a lossless compression
  technique.
 Flate, which is widely used in PDF files, is an adaptive,
  flexible and complex way to compress text.
Thank You

Text compression in LZW and Flate
