UNIT II
TEXT COMPRESSION
Outline
Compression techniques
Run length coding
Huffman coding
Adaptive Huffman Coding
Arithmetic coding
Shannon-Fano coding
Dictionary techniques
LZW family algorithms.
Introduction
 Compression: the process of coding that will effectively reduce
the total number of bits needed to represent certain information.
Fig.1: A General Data Compression Scheme.
Introduction
 If the compression and decompression processes induce no
information loss, then the compression scheme is lossless;
otherwise, it is lossy.
 Compression ratio = B0 / B1, where
 B0 – number of bits before compression
 B1 – number of bits after compression
 In general, we would desire any codec (encoder/decoder
scheme) to have a compression ratio much larger than 1.0.
 The higher the compression ratio, the better the lossless
compression scheme, as long as it is computationally feasible.
Basics of Information Theory
 What is entropy? Entropy is a measure of the number of specific
ways in which a system may be arranged, commonly
understood as a measure of the disorder of a system.
 In information theory, the entropy of a source S with symbol probabilities pi is
H(S) = Σ pi log2(1/pi), i.e., the minimum average number of bits needed per symbol.
 As an example, if the information source S is a gray-level
digital image, each si is a gray-level intensity ranging from
0 to 2^k − 1, where k is the number of bits used to
represent each pixel in an uncompressed image.
 We need to find the entropy of this image, which gives the minimum average
number of bits per pixel needed to represent the image after compression.
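As a quick numeric check (a minimal Python sketch, not taken from the slides), the entropy of a small source can be computed directly from its symbol frequencies; the word HELLO is used here purely as an illustration:

from collections import Counter
from math import log2

def entropy(message: str) -> float:
    """Average number of bits per symbol an ideal code would need."""
    counts = Counter(message)
    total = len(message)
    return sum((c / total) * log2(total / c) for c in counts.values())

print(round(entropy("HELLO"), 3))   # 1.922 bits per symbol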
Run-Length Coding
 RLC is one of the simplest forms of data compression.
 The basic idea is that if the information source has the property that
symbols tend to form continuous groups, then each such symbol and the
length of its group can be coded.
 Consider a screen containing plain black text on a solid white
background.
 There will be many long runs of white pixels in the blank space, and
many short runs of black pixels within the text. Let us take a
hypothetical single scan line, with B representing a black pixel and W
representing white: WWWWWBWWWWBBBWWWWWWBWWW
 If we apply the run-length encoding (RLE) data compression algorithm
to the above hypothetical scan line, we get the following:
5W1B4W3B6W1B3W
 The run-length code represents the original 23 characters in only 14.
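A minimal Python sketch of this run-length scheme, using the same count-then-symbol format as the example above (real RLE formats vary, so treat this as illustrative only):

def rle_encode(s: str) -> str:
    """Encode each run as <count><symbol>, e.g. WWWWWB -> 5W1B."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # find the end of the current run
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle_encode("WWWWWBWWWWBBBWWWWWWBWWW"))   # 5W1B4W3B6W1B3W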
Variable-Length Coding
 Variable-length coding (VLC) is one of the best-known
entropy coding methods.
 Here, we will study the Shannon–Fano algorithm, Huffman
coding, and adaptive Huffman coding.
Shannon–Fano Algorithm
 To illustrate the algorithm, let us suppose the symbols to be
coded are the characters in the word HELLO.
 The frequency count of the symbols is
Symbol H E L O
Count 1 1 2 1
 The encoding steps of the Shannon–Fano algorithm can be
presented in the following top-down manner:
 1. Sort the symbols according to the frequency count of their
occurrences.
 2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts
contain only one symbol.
Shannon–Fano Algorithm
 A natural way of implementing the above procedure is to build
a binary tree.
 As a convention, let us assign bit 0 to the left branches and 1
to the right branches.
 Initially, the symbols are sorted as LHEO.
 As Fig. 7.3 shows, the first division yields two parts: L with a
count of 2, denoted as L:(2); and H, E and O with a total count
of 3, denoted as H, E, O:(3).
 The second division yields H:(1) and E, O:(2).
 The last division is E:(1) and O:(1).
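The two steps above can be sketched in Python (a minimal illustration; how ties are broken when the two halves cannot be balanced exactly is an implementation detail and may differ between textbooks):

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs, pre-sorted by descending count.
    Returns a dict mapping each symbol to its code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    split, best_diff = 1, float("inf")
    for i in range(1, len(symbols)):          # choose the split closest to half the total count
        left = sum(c for _, c in symbols[:i])
        diff = abs(2 * left - total)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code               # left part gets prefix bit 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code               # right part gets prefix bit 1
    return codes

print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'}  -- matches Table 7.1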
Shannon–Fano Algorithm
Fig. 7.3: Coding Tree for HELLO by Shannon-Fano.
Table 7.1: Result of Performing Shannon-Fano on HELLO
Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         0      2
H        1       2.32         10     2
E        1       2.32         110    3
O        1       2.32         111    3
TOTAL # of bits: 10
Another coding tree for HELLO by Shannon-Fano.
Another Result of Performing Shannon-Fano on HELLO (see Fig. 7.4)
Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         00     4
H        1       2.32         01     2
E        1       2.32         10     2
O        1       2.32         11     2
TOTAL # of bits: 10
Shannon–Fano Algorithm-Analysis
 The Shannon–Fano algorithm delivers satisfactory coding results
for data compression, but it was soon outperformed and overtaken
by the Huffman coding method.
 The Huffman algorithm requires prior statistical knowledge about
the information source, and such information is often not available.
 This is particularly true in multimedia applications, where future
data is unknown before its arrival, as for example in live (or
streaming) audio and video.
 Even when the statistics are available, the transmission of the
symbol table could represent heavy overhead
 The solution is to use adaptive Huffman coding compression
algorithms, in which statistics are gathered and updated
dynamically as the data stream arrives.
LOSSLESS COMPRESSION
In lossless data compression, the integrity of the data is
preserved. The original data and the data after compression
and decompression are exactly the same because, in these
methods, the compression and decompression algorithms are
exact inverses of each other: no part of the data is lost in the
process. Redundant data is removed in compression and
added during decompression. Lossless compression methods
are normally used when we cannot afford to lose any data.
Run-length encoding
Run-length encoding is probably the simplest method of
compression. It can be used to compress data made of any
combination of symbols. It does not need to know the frequency of
occurrence of symbols and can be very efficient if data is
represented as 0s and 1s.
The general idea behind this method is to replace consecutive
repeating occurrences of a symbol by one occurrence of the symbol
followed by the number of occurrences.
The method can be even more efficient if the data uses only two
symbols (for example 0 and 1) in its bit pattern and one symbol is
more frequent than the other.
[Figure: run-length encoding example]
[Figure: run-length encoding for two symbols]
Huffman coding
Huffman coding assigns shorter codes to symbols that occur more
frequently and longer codes to those that occur less frequently. For
example, imagine we have a text file that uses only five characters
(A, B, C, D, E). Before we can assign bit patterns to each character,
we assign each character a weight based on its frequency of use. In
this example, assume that the frequency of the characters is as
shown in Table 15.1.
Huffman coding
A character’s code is found by starting at the root and following the
branches that lead to that character. The code itself is the bit value
of each branch on the path, taken in sequence.
Final tree and code
Encoding
Let us see how to encode text using the code for our five
characters. Figure 15.6 shows the original and the encoded text.
Huffman encoding
Decoding
The recipient has a very easy job in decoding the data it receives.
The figure below shows how decoding takes place.
Huffman decoding
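A compact Python sketch of Huffman coding and decoding follows. It is illustrative only: the heap-based construction and the character weights are assumptions (Table 15.1 is not reproduced in this text), and tie-breaking may produce different code words than the figures while keeping the same code lengths:

import heapq

def huffman_codes(freqs):
    """Build a prefix code from a {symbol: weight} map by repeatedly
    merging the two lightest subtrees."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)                     # tie-breaker so tuples never compare the dicts
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}        # left branch gets 0
        merged.update({s: "1" + code for s, code in c2.items()})  # right branch gets 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Illustrative weights only; substitute the values from Table 15.1.
codes = huffman_codes({"A": 17, "B": 12, "C": 12, "D": 27, "E": 32})
text = "BACADAEAD"
encoded = "".join(codes[ch] for ch in text)

# Decoding: the code is prefix-free, so a greedy left-to-right match is unambiguous.
reverse = {v: k for k, v in codes.items()}
decoded, buf = "", ""
for bit in encoded:
    buf += bit
    if buf in reverse:
        decoded += reverse[buf]
        buf = ""
print(codes)
print(encoded, "->", decoded)          # decoded equals the original text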
Adaptive Huffman Coding
Extended Huffman is in book: group symbols
together
Adaptive Huffman: statistics are gathered and
updated dynamically as the data stream arrives
ENCODER
-------
Initial_code();
while not EOF
{
get(c);
encode(c);
update_tree(c);
}
DECODER
-------
Initial_code();
while not EOF
{
decode(c);
output(c);
update_tree(c);
}
Adaptive Coding
Motivations:
 The previous algorithms (both Shannon-Fano and Huffman) require statistical
knowledge that is often not available (e.g., live audio, video).
 Even when it is available, it could be a heavy overhead.
 Higher-order models incur more overhead. For example, a 255-entry
probability table would be required for an order-0 model. An order-1 model
would require 255 such probability tables. (An order-1 model considers the
probabilities of occurrences of pairs of symbols.)
The solution is to use adaptive algorithms. Adaptive Huffman Coding is
one such mechanism that we will study.
The idea of “adaptiveness” is, however, applicable to other
compression algorithms as well.
Adaptive Coding
ENCODER
Initialize_model();
do {
  c = getc( input );
  encode( c, output );
  update_model( c );
} while ( c != eof );
DECODER
Initialize_model();
while ( (c = decode( input )) != eof ) {
  putc( c, output );
  update_model( c );
}
 The key is that both the encoder and the decoder use exactly the same
Initialize_model and update_model routines.
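The sketch below illustrates only this encoder/decoder symmetry. It is a deliberately simplified stand-in that rebuilds a Huffman table from the running counts after every symbol (every alphabet symbol starts with count 1, so no escape/NYT code is needed); it is not the sibling-property update procedure described next:

import heapq

def build_codes(counts):
    """Huffman code for the current counts; deterministic, so the encoder and
    decoder always rebuild exactly the same table."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, nid, merged))
        nid += 1
    return heap[0][2]

def encode(msg, alphabet):
    counts = {s: 1 for s in alphabet}              # Initialize_model()
    bits = ""
    for c in msg:
        bits += build_codes(counts)[c]             # encode with the current model
        counts[c] += 1                             # update_model(c)
    return bits

def decode(bits, alphabet):
    counts = {s: 1 for s in alphabet}              # identical Initialize_model()
    rev = {v: k for k, v in build_codes(counts).items()}
    out, buf = "", ""
    for b in bits:
        buf += b
        if buf in rev:                             # prefix-free, so a match is unambiguous
            c = rev[buf]
            out += c
            buf = ""
            counts[c] += 1                         # identical update_model(c)
            rev = {v: k for k, v in build_codes(counts).items()}
    return out

msg = "aardvark"
bits = encode(msg, "adkrv")
print(bits, "->", decode(bits, "adkrv"))           # round-trips back to "aardvark"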
The Sibling Property
The node numbers will be assigned in such a way that:
1. A node with a higher weight will have a higher node
number
2. A parent node will always have a higher node number
than its children.
In a nutshell, the sibling property requires that the nodes
(internal and leaf) are arranged in order of increasing
weights.
The update procedure swaps nodes in violation of the sibling
property.
 The identification of nodes in violation of the sibling
property is achieved by using the notion of a block.
 All nodes that have the same weight are said to belong
to one block
Flowchart of the update procedure
The update procedure can be summarized as follows:
1. If the symbol appears for the first time, the NYT node gives birth to a new NYT node and a new external node for the symbol; the weights of the new external node and of the old NYT node are incremented and the node numbers are adjusted; the procedure then continues from the old NYT node (go to step 4).
2. Otherwise, go to the symbol's external node. If its node number is not the maximum in its block, switch it with the highest-numbered node in the block.
3. Increment the node's weight.
4. If the current node is the root, stop; otherwise go to the parent node and repeat from step 2.
 The Huffman tree is initialized
with a single node, known as the
Not-Yet-Transmitted (NYT) or
escape code. This code will be sent
every time that a new character,
which is not in the tree, is
encountered, followed by the
ASCII encoding of the character.
This allows the decompressor
to distinguish between a code and
a new character. Also, the
procedure creates a new node for
the character and a new NYT
from the old NYT node.
 The root node will have the
highest node number because it
has the highest weight.
Example
[Figure: the initial Huffman tree, which consists of the single NYT node (#0), and an example Huffman tree after some symbols have been processed in accordance with the sibling property. In the example tree the leaves are B (weight 2), C (weight 2), D (weight 2) and E (weight 10); the root is node #8 with weight 16. Counts (number of occurrences): B:2, C:2, D:2, E:10.]
Example
[Figure: the Huffman tree after the first appearance of symbol A. The NYT node gives birth to a new NYT node (#0) and an external node for A (weight 1, node #1); the weights on the path to the root each grow by 1, so the root weight becomes 16+1 = 17. Counts: A:1, B:2, C:2, D:2, E:10.]
Increment
[Figure: an increment in the count for A propagates up to the root; every weight on the path from A's node to the root grows by 1 (root weight 17+1 = 18). Counts: A:1+1 = 2, B:2, C:2, D:2, E:10.]
Swapping
[Figure: another increment in the count for A results in a swap. A's weight is about to reach 3 while its node number (#1) is not the highest in its weight-2 block, so node #1 (A) is swapped with node #5 (D) before the increment; the weights on the path to the root are then incremented (root weight 18+1 = 19). Counts: A:2+1 = 3, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: another increment in the count for A propagates up to the root (A's weight 3+1 = 4, root weight 19+1 = 20). Counts: A:3+1 = 4, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: another increment in the count for A causes a swap of a sub-tree: nodes 5 and 6 are swapped (A, weight 4, trades places with the internal node of weight 4), after which the increment can propagate up. Counts: A:4+1, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: further swapping is needed to fix the tree: nodes 8 and 9 are swapped (the internal node of weight 10 trades places with E, whose weight is also 10). A's weight becomes 4+1 = 5. Counts: A:4+1, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: the tree after the update is complete: E (weight 10) is now node #8, the internal node above it has weight 10+1 = 11 (node #9), and the root weight is 20+1 = 21. Counts: A:5, B:2, C:2, D:2, E:10.]
Arithmetic Coding
Arithmetic coding is based on the concept of interval
subdividing.
In arithmetic coding a source ensemble is represented by an interval
between 0 and 1 on the real number line.
Each symbol of the ensemble narrows this interval.
As the interval becomes smaller, the number of bits needed to specify it
grows.
Arithmetic coding assumes an explicit probabilistic model of the source.
It uses the probabilities of the source messages to successively narrow
the interval used to represent the ensemble.
• A high-probability message narrows the interval less than a low-probability
message, so that high-probability messages contribute fewer bits to the coded
ensemble.
Arithmetic Coding: Description
In the following discussion, we will use M as the size of the alphabet of the
data source,
N[x] as symbol x's probability, and
Q[x] as symbol x's cumulative probability (i.e., Q[i] = N[0] + N[1] + ... + N[i]).
Assuming we know the probability of each symbol of the data source,
we can allocate to each symbol an interval whose width is proportional to its
probability, such that the intervals do not overlap.
This can be done if we use the cumulative probabilities as the two ends of each
interval: the two ends of the interval for symbol x are Q[x-1] and Q[x].
Symbol x is said to own the range [Q[x-1], Q[x]).
Arithmetic Coding: Encoder
We begin with the interval [0,1) and subdivide the interval
iteratively.
For each symbol entered, the current interval is divided according to the
probabilities of the alphabet.
The interval corresponding to the symbol is picked as the interval to be
processed further.
The procedure continues until all symbols in the message have been
processed.
Since each symbol's interval does not overlap with others, for each possible
message there is a unique interval assigned.
We can represent the message with the interval's two ends [L,H). In fact,
taking any single value in the interval as the encoded code is enough, and
usually the left end L is selected.
Arithmetic Coding Algorithm
L = 0.0; H = 1.0;
While ( (x = getc(input)) != EOF )
{
R = (H-L);
H = L + R * Q[x];
L = L + R * Q[x-1];
}
Output(L);
R is the interval range, and H and L are two ends of the current code
interval. x is the new symbol to be encoded.
L and H are initialized to 0 and 1, respectively.
Arithmetic Coding: Encoder example

Symbol, x   Probability, N[x]   [Q[x-1], Q[x])
A           0.4                 [0.0, 0.4)
B           0.3                 [0.4, 0.7)
C           0.2                 [0.7, 0.9)
D           0.1                 [0.9, 1.0)

String to encode: BCAB. The interval narrows with each symbol:
after B: [0.4, 0.7)
after C: [0.61, 0.67)
after A: [0.61, 0.634)
after B: [0.6196, 0.6268)
Code sent: 0.6196 (the lower end of the final interval)
Decoding Algorithm
 When decoding, the code value v is located within the current code interval to
find the symbol x such that Q[x-1] <= v < Q[x]. The procedure
iterates until all symbols are decoded.
v = input_code();
for (;;) {
x = find_symbol_straddling_this_range(v);
putc(x);
R = Q[x] – Q[x-1];
v = (v – Q[x-1])/R;
}
v        Output char, x   Q[x-1]   Q[x]   R
0.6196   B                0.4      0.7    0.3
0.732    C                0.7      0.9    0.2
0.16     A                0.0      0.4    0.4
0.4      B                0.4      0.7    0.3
0.0      (all symbols decoded)
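The encoder and decoder above can be sketched in a few lines of Python. Exact rational arithmetic (fractions) is used so that the BCAB example reproduces 0.6196 without rounding surprises; practical coders use scaled integer arithmetic instead. The decoder here is simply told the message length (see the EOF discussion below):

from fractions import Fraction as F

# Cumulative ranges from the table above; each symbol owns [Q[x-1], Q[x]).
ranges = {"A": (F(0), F(4, 10)), "B": (F(4, 10), F(7, 10)),
          "C": (F(7, 10), F(9, 10)), "D": (F(9, 10), F(1))}

def ac_encode(msg):
    low, high = F(0), F(1)
    for x in msg:
        r = high - low                                   # current interval width
        q_lo, q_hi = ranges[x]
        low, high = low + r * q_lo, low + r * q_hi       # narrow to x's sub-interval
    return low                                           # send the lower end

def ac_decode(code, n):
    """Decode n symbols; the length must be known in advance."""
    out, v = "", code
    for _ in range(n):
        for x, (q_lo, q_hi) in ranges.items():
            if q_lo <= v < q_hi:                         # find the symbol that owns v
                out += x
                v = (v - q_lo) / (q_hi - q_lo)           # rescale v for the next symbol
                break
    return out

code = ac_encode("BCAB")
print(float(code))            # 0.6196
print(ac_decode(code, 4))     # BCAB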
Arithmetic Coding: Issues
The zero-frequency problem: each symbol's predicted probability must not be zero,
or its interval will have zero width and interval renormalization would fail. This is
called the zero-frequency problem. Models that adapt online may run into it when old
counts are decayed.
The EOF problem:
Assume we pick the lower end of the interval as the encoded code. Two messages may yield
the same code if one message is identical to the other except for a trailing run of the
first symbol of the table (the symbol whose interval starts at 0) appended as a suffix.
• For example, BCAB, BCABA, BCABAA and BCABAAA all have the same lower end but
different upper ends. (Try it.)
The simplest solution is to let the decoder know the length of the encoded message: the
length is known if the message size is fixed, or it can be transmitted first. However, this is
not feasible if the data size is not known beforehand, such as live broadcast data, or if it is
too costly to do so, such as tapes whose size is unknown at the beginning.
Another solution is to introduce a special EOF symbol into the alphabet. The symbol
takes a small interval and is used only at the end of the message. When the decoder detects
the EOF symbol, it knows the end of the message has been reached.
Dictionary-Based Compression
The compression algorithms we studied so far use a statistical model to
encode single symbols
Compression: Encode symbols into bit strings that use fewer bits.
Dictionary-based algorithms do not encode single symbols as variable-
length bit strings; they encode variable-length strings of symbols as
single tokens
The tokens form an index into a phrase dictionary
If the tokens are smaller than the phrases they replace, compression occurs.
Dictionary-based compression is easier to understand because it uses
a strategy that programmers are familiar with: using indexes into
databases to retrieve information from large amounts of storage, for
example telephone numbers or postal codes.
Dictionary-Based Compression: Example
Consider the Random House Dictionary of the English Language,
Second edition, Unabridged. Using this dictionary, the string:
A good example of how dictionary based compression works
can be coded as:
1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2
Coding:
Uses the dictionary as a simple lookup table
Each word is coded as x/y, where, x gives the page in the dictionary and y gives
the number of the word on that page.
 The dictionary has 2,200 pages with fewer than 256 entries per page: therefore x
requires 12 bits and y requires 8 bits, i.e., 20 bits per word (2.5 bytes per word).
 Using ASCII coding, the above string requires 48 bytes (counting only the letters),
whereas our encoding requires only 20 bytes (2.5 bytes × 8 words): less than half the original size.
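A small sketch of the packing arithmetic above (the 12-bit/8-bit split comes from the slide; the pack function itself is only a hypothetical illustration):

def pack(page: int, word: int) -> int:
    """Pack one dictionary reference into 20 bits: 12 bits of page number,
    8 bits of word index on that page."""
    assert 0 <= page < 4096 and 0 <= word < 256
    return (page << 8) | word

refs = [(1, 1), (822, 3), (674, 4), (1343, 60), (928, 75), (550, 32), (173, 46), (421, 2)]
packed = [pack(p, w) for p, w in refs]
print(len(packed) * 20, "bits =", len(packed) * 20 // 8, "bytes")   # 160 bits = 20 bytes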
Adaptive Dictionary-based Compression
Build the dictionary adaptively
Necessary when the source data is not plain text, say audio or video data.
Is better tailored to the specific source.
Original methods are due to Ziv and Lempel in 1977 (LZ77) and 1978
(LZ78). Terry Welch improved the scheme in 1984 (the result is called LZW
compression). It is used in UNIX compress and in GIF.
LZ77: A sliding window technique in which the dictionary consists of a
set of fixed length phrases found in a window into the previously
processed text
LZ78: Instead of using fixed-length phrases from a window into the text,
it builds phrases up one symbol at a time, adding a new symbol to an
existing phrase when a match occurs.
LZW Algorithm
Preliminaries:
 A dictionary that is indexed by “codes” is used.
 The dictionary is assumed to be initialized with 256 entries (indexed
with ASCII codes 0 through 255) representing the ASCII table.
 The compression algorithm assumes that the output is either a file or
a communication channel; the input is assumed to be a file or a buffer.
 Conversely, the decompression algorithm assumes that the input is
a file or a communication channel and the output is a file or a buffer.
file/buffer → Compression → compressed file / communication channel → Decompression → file/buffer
LZW Algorithm
LZW Compression:
set w = NIL
loop
read a character k
if wk exists in the dictionary
w = wk
else
output the code for w
add wk to the dictionary
w = k
endloop
The program reads one character at a time. If the current work string plus the new
character (wk) is in the dictionary, the character is appended to the work string and the
program waits for the next one (this happens on the first character as well). If wk is not
in the dictionary, the program adds wk to the dictionary, sends over the wire (or writes
to a file) the code assigned to the work string w without the new character, and then
sets the work string to the new character.
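The same procedure in runnable Python, as a sketch: the dictionary is pre-loaded with the 256 single characters, and every output token is an integer code (codes below 256 stand for single characters). It reproduces the trace on the next slide:

def lzw_compress(s: str):
    dictionary = {chr(i): i for i in range(256)}   # codes 0-255: single characters
    next_code = 256
    w, out = "", []
    for k in s:
        wk = w + k
        if wk in dictionary:
            w = wk                                 # keep growing the work string
        else:
            out.append(dictionary[w])              # emit the code for w
            dictionary[wk] = next_code             # add wk as a new phrase
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])                  # flush the last work string
    return out

print(lzw_compress("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
# i.e. ^ W E D <256> E <260> <261> <257> B <260> T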
Input String: ^WED^WE^WEE^WEB^WET

w     k     Output   Index   Symbol
NIL   ^
^     W     ^        256     ^W
W     E     W        257     WE
E     D     E        258     ED
D     ^     D        259     D^
^     W
^W    E     256      260     ^WE
E     ^     E        261     E^
^     W
^W    E
^WE   E     260      262     ^WEE
E     ^
E^    W     261      263     E^W
W     E
WE    B     257      264     WEB
B     ^     B        265     B^
^     W
^W    E
^WE   T     260      266     ^WET
T     EOF   T
set w = NIL
loop
read a character k
if wk exists in the dictionary
w = wk
else
output the code for w
add wk to the dictionary
w = k
endloop
LZW Algorithm
LZW Decompression:
read fixed length token k (code or char)
output k
w = k
loop
read a fixed length token k
entry = dictionary entry for k
output entry
add w + first char of entry to
the dictionary
w = entry
endloop
The nice thing is that the decompressor builds its own dictionary on its side, one that
exactly matches the compressor's, so that only the codes need to be sent.
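A matching Python sketch of the decompressor. One detail is added that the pseudocode above omits: the corner case where a received code is not yet in the dictionary (the classic "KwKwK" case); it cannot occur in this particular example but does occur in general:

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:
            entry = w + w[0]                       # code not yet in the dictionary
        out.append(entry)
        dictionary[next_code] = w + entry[0]       # mirror the compressor's new entry
        next_code += 1
        w = entry
    return "".join(out)

print(lzw_decompress([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
# ^WED^WE^WEE^WEB^WET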
Example of LZW
Input String (to decode): ^WED<256>E<260><261><257>B<260>T

w     k       Output   Index   Symbol
      ^       ^
^     W       W        256     ^W
W     E       E        257     WE
E     D       D        258     ED
D     <256>   ^W       259     D^
^W    E       E        260     ^WE
E     <260>   ^WE      261     E^
^WE   <261>   E^       262     ^WEE
E^    <257>   WE       263     E^W
WE    B       B        264     WEB
B     <260>   ^WE      265     B^
^WE   T       T        266     ^WET
read a fixed length token k
(code or char)
output k
w = k
loop
read a fixed length token k
(code or char)
entry = dictionary entry for k
output entry
add w + first char of entry to
the dictionary
w = entry
endloop
LZW Algorithm - Discussion
Tokens are 9 bits wide: a leading 0 marks an ASCII character (values 0 to 255),
and a leading 1 marks a code (values 256 to 511).
 Where is the compression?
 Original String to decode : ^WED^WE^WEE^WEB^WET
 Decoded String : ^WED<256>E<260><261><257>B<260>T
 Plain ASCII coding of the string : 19 * 8 bits = 152 bits
 LZW coding of the string: 12*9 bits = 108 bits (7 symbols and 5 codes,
each of 9 bits)
 Why 9 bits?
 An ASCII character has a value ranging from 0 to 255
 All tokens have fixed length
 There has to be a distinction in representation between an
ASCII character and a Code (assigned to strings of length 2 or more)
 Codes can only have values 256 and above
LZW Algorithm – Discussion (continued)
With 9 bits we can only have a maximum of 256 codes for strings of length
2 or above (with the first 256 entries for ASCII characters)
 Original LZW uses dictionary with 4K entries, with the length of each
symbol/code being 12 bits
[Figure: layout of a 12-bit token; values 0 to 255 are the ASCII characters,
values 256 to 4095 are codes.]
 With 12 bits, we can have a maximum of 2^12 – 256 = 3840 codes.
Practical implementations of the LZW algorithm follow two approaches:
Flush the dictionary periodically
– no wasted codes
Grow the length of the codes as the algorithm proceeds
- First start with a length of 9 bits for the codes.
- Once we run out of codes, increase the length to 10 bits. When we run out of
codes with 10 bits, increase the code length to 11 bits, and so on.
- This is more efficient.
With 9-bit tokens: prefix 0 = ASCII characters, prefix 1 = codes 256-511.
Out of codes? Grow to 10-bit tokens: 00 = ASCII, 01 = codes 256-511,
10 = codes 512-767, 11 = codes 768-1023.
Out of codes again? Grow to 11-bit tokens: 000 = ASCII, 001 = codes 256-511,
010 = codes 512-767, 011 = codes 768-1023, 100 = codes 1024-1279,
101 = codes 1280-1535, 110 = codes 1536-1791, 111 = codes 1792-2047.