UNIT II
TEXT COMPRESSION
Outline
Compression techniques
Run length coding
Huffman coding
Adaptive Huffman Coding
Arithmetic coding
Shannon-Fano coding
Dictionary techniques
LZW family algorithms.
Introduction
 Compression: the process of coding that will effectively reduce
the total number of bits needed to represent certain information.
Fig.1: A General Data Compression Scheme.
Introduction
 If the compression and decompression processes induce no
information loss, then the compression scheme is lossless;
otherwise, it is lossy.
 Compression ratio = B0 / B1, where
 B0 – number of bits before compression
 B1 – number of bits after compression
 In general, we would desire any codec (encoder/decoder
scheme) to have a compression ratio much larger than 1.0.
 The higher the compression ratio, the better the lossless
compression scheme, as long as it is computationally feasible.
Basics of Information Theory
 What is entropy? Entropy is a measure of the number of specific
ways in which a system may be arranged, commonly
understood as a measure of the disorder of a system.
 In information theory, the entropy of a source S with symbol probabilities pi is
H(S) = Σ pi log2(1/pi), i.e., the minimum average number of bits needed per symbol.
 As an example, if the information source S is a gray-level
digital image, each si is a gray-level intensity ranging from
0 to 2^k − 1, where k is the number of bits used to
represent each pixel in an uncompressed image.
 We need to find the entropy of this image, which gives the minimum average
number of bits per pixel needed to represent the image after compression.
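As a quick numeric check (a minimal Python sketch, not taken from the slides), the entropy of a small source can be computed directly from its symbol frequencies; the word HELLO is used here purely as an illustration:

from collections import Counter
from math import log2

def entropy(message: str) -> float:
    """Average number of bits per symbol an ideal code would need."""
    counts = Counter(message)
    total = len(message)
    return sum((c / total) * log2(total / c) for c in counts.values())

print(round(entropy("HELLO"), 3))   # 1.922 bits per symbol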
Run-Length Coding
 RLC is one of the simplest forms of data compression.
 The basic idea is that if the information source has the property that
symbols tend to form continuous groups, then each such symbol and the
length of its group can be coded.
 Consider a screen containing plain black text on a solid white
background.
 There will be many long runs of white pixels in the blank space, and
many short runs of black pixels within the text. Let us take a
hypothetical single scan line, with B representing a black pixel and W
representing white: WWWWWBWWWWBBBWWWWWWBWWW
 If we apply the run-length encoding (RLE) data compression algorithm
to the above hypothetical scan line, we get the following:
5W1B4W3B6W1B3W
 The run-length code represents the original 23 characters in only 14.
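A minimal Python sketch of this run-length scheme, using the same count-then-symbol format as the example above (real RLE formats vary, so treat this as illustrative only):

def rle_encode(s: str) -> str:
    """Encode each run as <count><symbol>, e.g. WWWWWB -> 5W1B."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # find the end of the current run
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle_encode("WWWWWBWWWWBBBWWWWWWBWWW"))   # 5W1B4W3B6W1B3W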
Variable-Length Coding
 Variable-length coding (VLC) is one of the best-known
entropy coding methods.
 Here, we will study the Shannon–Fano algorithm, Huffman
coding, and adaptive Huffman coding.
Shannon–Fano Algorithm
 To illustrate the algorithm, let us suppose the symbols to be
coded are the characters in the word HELLO.
 The frequency count of the symbols is
Symbol H E L O
Count 1 1 2 1
 The encoding steps of the Shannon–Fano algorithm can be
presented in the following top-down manner:
 1. Sort the symbols according to the frequency count of their
occurrences.
 2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts
contain only one symbol.
Shannon–Fano Algorithm
 A natural way of implementing the above procedure is to build
a binary tree.
 As a convention, let us assign bit 0 to the left branches and 1
to the right branches.
 Initially, the symbols are sorted as LHEO.
 As Fig. 7.3 shows, the first division yields two parts: L with a
count of 2, denoted as L:(2); and H, E and O with a total count
of 3, denoted as H, E, O:(3).
 The second division yields H:(1) and E, O:(2).
 The last division is E:(1) and O:(1).
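The two steps above can be sketched in Python (a minimal illustration; how ties are broken when the two halves cannot be balanced exactly is an implementation detail and may differ between textbooks):

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs, pre-sorted by descending count.
    Returns a dict mapping each symbol to its code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    split, best_diff = 1, float("inf")
    for i in range(1, len(symbols)):          # choose the split closest to half the total count
        left = sum(c for _, c in symbols[:i])
        diff = abs(2 * left - total)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code               # left part gets prefix bit 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code               # right part gets prefix bit 1
    return codes

print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
# {'L': '0', 'H': '10', 'E': '110', 'O': '111'}  -- matches Table 7.1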
Shannon–Fano Algorithm
Fig. 7.3: Coding Tree for HELLO by Shannon-Fano.
Table 7.1: Result of Performing Shannon-Fano on HELLO
Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         0      2
H        1       2.32         10     2
E        1       2.32         110    3
O        1       2.32         111    3
TOTAL # of bits: 10
Another coding tree for HELLO by Shannon-Fano.
Another Result of Performing Shannon-Fano on HELLO (see Fig. 7.4)
Symbol   Count   log2(1/pi)   Code   # of bits used
L        2       1.32         00     4
H        1       2.32         01     2
E        1       2.32         10     2
O        1       2.32         11     2
TOTAL # of bits: 10
Shannon–Fano Algorithm-Analysis
 The Shannon–Fano algorithm delivers satisfactory coding results
for data compression, but it was soon outperformed and overtaken
by the Huffman coding method.
 The Huffman algorithm requires prior statistical knowledge about
the information source, and such information is often not available.
 This is particularly true in multimedia applications, where future
data is unknown before its arrival, as for example in live (or
streaming) audio and video.
 Even when the statistics are available, the transmission of the
symbol table could represent heavy overhead
 The solution is to use adaptive Huffman coding compression
algorithms, in which statistics are gathered and updated
dynamically as the data stream arrives.
LOSSLESS COMPRESSION
In lossless data compression, the integrity of the data is
preserved. The original data and the data after compression
and decompression are exactly the same because, in these
methods, the compression and decompression algorithms are
exact inverses of each other: no part of the data is lost in the
process. Redundant data is removed in compression and
added during decompression. Lossless compression methods
are normally used when we cannot afford to lose any data.
Run-length encoding
Run-length encoding is probably the simplest method of
compression. It can be used to compress data made of any
combination of symbols. It does not need to know the frequency of
occurrence of symbols and can be very efficient if data is
represented as 0s and 1s.
The general idea behind this method is to replace consecutive
repeating occurrences of a symbol by one occurrence of the symbol
followed by the number of occurrences.
The method can be even more efficient if the data uses only two
symbols (for example 0 and 1) in its bit pattern and one symbol is
more frequent than the other.
[Figure: run-length encoding example]
[Figure: run-length encoding for two symbols]
Huffman coding
Huffman coding assigns shorter codes to symbols that occur more
frequently and longer codes to those that occur less frequently. For
example, imagine we have a text file that uses only five characters
(A, B, C, D, E). Before we can assign bit patterns to each character,
we assign each character a weight based on its frequency of use. In
this example, assume that the frequency of the characters is as
shown in Table 15.1.
Huffman coding
A character’s code is found by starting at the root and following the
branches that lead to that character. The code itself is the bit value
of each branch on the path, taken in sequence.
Final tree and code
Encoding
Let us see how to encode text using the code for our five
characters. Figure 15.6 shows the original and the encoded text.
Huffman encoding
Decoding
The recipient has a very easy job in decoding the data it receives.
The figure below shows how decoding takes place.
Huffman decoding
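A compact Python sketch of Huffman coding and decoding follows. It is illustrative only: the heap-based construction and the character weights are assumptions (Table 15.1 is not reproduced in this text), and tie-breaking may produce different code words than the figures while keeping the same code lengths:

import heapq

def huffman_codes(freqs):
    """Build a prefix code from a {symbol: weight} map by repeatedly
    merging the two lightest subtrees."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)                     # tie-breaker so tuples never compare the dicts
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}        # left branch gets 0
        merged.update({s: "1" + code for s, code in c2.items()})  # right branch gets 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Illustrative weights only; substitute the values from Table 15.1.
codes = huffman_codes({"A": 17, "B": 12, "C": 12, "D": 27, "E": 32})
text = "BACADAEAD"
encoded = "".join(codes[ch] for ch in text)

# Decoding: the code is prefix-free, so a greedy left-to-right match is unambiguous.
reverse = {v: k for k, v in codes.items()}
decoded, buf = "", ""
for bit in encoded:
    buf += bit
    if buf in reverse:
        decoded += reverse[buf]
        buf = ""
print(codes)
print(encoded, "->", decoded)          # decoded equals the original text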
Adaptive Huffman Coding
Extended Huffman is in book: group symbols
together
Adaptive Huffman: statistics are gathered and
updated dynamically as the data stream arrives
ENCODER
-------
Initial_code();
while not EOF
{
get(c);
encode(c);
update_tree(c);
}
DECODER
-------
Initial_code();
while not EOF
{
decode(c);
output(c);
update_tree(c);
}
Adaptive Coding
Motivations:
 The previous algorithms (both Shannon-Fano and Huffman) require statistical
knowledge that is often not available (e.g., live audio, video).
 Even when it is available, it could be a heavy overhead.
 Higher-order models incur more overhead. For example, a 255-entry
probability table would be required for an order-0 model. An order-1 model
would require 255 such probability tables. (An order-1 model considers the
probabilities of occurrences of pairs of symbols.)
The solution is to use adaptive algorithms. Adaptive Huffman Coding is
one such mechanism that we will study.
The idea of “adaptiveness” is, however, applicable to other
compression algorithms as well.
Adaptive Coding
ENCODER
Initialize_model();
do {
  c = getc( input );
  encode( c, output );
  update_model( c );
} while ( c != eof );
DECODER
Initialize_model();
while ( (c = decode( input )) != eof ) {
  putc( c, output );
  update_model( c );
}
 The key is that both the encoder and the decoder use exactly the same
Initialize_model and update_model routines.
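The sketch below illustrates only this encoder/decoder symmetry. It is a deliberately simplified stand-in that rebuilds a Huffman table from the running counts after every symbol (every alphabet symbol starts with count 1, so no escape/NYT code is needed); it is not the sibling-property update procedure described next:

import heapq

def build_codes(counts):
    """Huffman code for the current counts; deterministic, so the encoder and
    decoder always rebuild exactly the same table."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, nid, merged))
        nid += 1
    return heap[0][2]

def encode(msg, alphabet):
    counts = {s: 1 for s in alphabet}              # Initialize_model()
    bits = ""
    for c in msg:
        bits += build_codes(counts)[c]             # encode with the current model
        counts[c] += 1                             # update_model(c)
    return bits

def decode(bits, alphabet):
    counts = {s: 1 for s in alphabet}              # identical Initialize_model()
    rev = {v: k for k, v in build_codes(counts).items()}
    out, buf = "", ""
    for b in bits:
        buf += b
        if buf in rev:                             # prefix-free, so a match is unambiguous
            c = rev[buf]
            out += c
            buf = ""
            counts[c] += 1                         # identical update_model(c)
            rev = {v: k for k, v in build_codes(counts).items()}
    return out

msg = "aardvark"
bits = encode(msg, "adkrv")
print(bits, "->", decode(bits, "adkrv"))           # round-trips back to "aardvark"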
The Sibling Property
The node numbers will be assigned in such a way that:
1. A node with a higher weight will have a higher node
number
2. A parent node will always have a higher node number
than its children.
In a nutshell, the sibling property requires that the nodes
(internal and leaf) are arranged in order of increasing
weights.
The update procedure swaps nodes in violation of the sibling
property.
 The identification of nodes in violation of the sibling
property is achieved by using the notion of a block.
 All nodes that have the same weight are said to belong
to one block
Flowchart of the update procedure
The update procedure can be summarized as follows:
1. If the symbol appears for the first time, the NYT node gives birth to a new NYT node and a new external node for the symbol; the weights of the new external node and of the old NYT node are incremented and the node numbers are adjusted; the procedure then continues from the old NYT node (go to step 4).
2. Otherwise, go to the symbol's external node. If its node number is not the maximum in its block, switch it with the highest-numbered node in the block.
3. Increment the node's weight.
4. If the current node is the root, stop; otherwise go to the parent node and repeat from step 2.
 The Huffman tree is initialized
with a single node, known as the
Not-Yet-Transmitted (NYT) or
escape code. This code will be sent
every time that a new character,
which is not in the tree, is
encountered, followed by the
ASCII encoding of the character.
This allows the decompressor
to distinguish between a code and
a new character. Also, the
procedure creates a new node for
the character and a new NYT
from the old NYT node.
 The root node will have the
highest node number because it
has the highest weight.
Example
[Figure: the initial Huffman tree, which consists of the single NYT node (#0), and an example Huffman tree after some symbols have been processed in accordance with the sibling property. In the example tree the leaves are B (weight 2), C (weight 2), D (weight 2) and E (weight 10); the root is node #8 with weight 16. Counts (number of occurrences): B:2, C:2, D:2, E:10.]
Example
[Figure: the Huffman tree after the first appearance of symbol A. The NYT node gives birth to a new NYT node (#0) and an external node for A (weight 1, node #1); the weights on the path to the root each grow by 1, so the root weight becomes 16+1 = 17. Counts: A:1, B:2, C:2, D:2, E:10.]
Increment
[Figure: an increment in the count for A propagates up to the root; every weight on the path from A's node to the root grows by 1 (root weight 17+1 = 18). Counts: A:1+1 = 2, B:2, C:2, D:2, E:10.]
Swapping
[Figure: another increment in the count for A results in a swap. A's weight is about to reach 3 while its node number (#1) is not the highest in its weight-2 block, so node #1 (A) is swapped with node #5 (D) before the increment; the weights on the path to the root are then incremented (root weight 18+1 = 19). Counts: A:2+1 = 3, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: another increment in the count for A propagates up to the root (A's weight 3+1 = 4, root weight 19+1 = 20). Counts: A:3+1 = 4, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: another increment in the count for A causes a swap of a sub-tree: nodes 5 and 6 are swapped (A, weight 4, trades places with the internal node of weight 4), after which the increment can propagate up. Counts: A:4+1, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: further swapping is needed to fix the tree: nodes 8 and 9 are swapped (the internal node of weight 10 trades places with E, whose weight is also 10). A's weight becomes 4+1 = 5. Counts: A:4+1, B:2, C:2, D:2, E:10.]
Swapping … contd.
[Figure: the tree after the update is complete: E (weight 10) is now node #8, the internal node above it has weight 10+1 = 11 (node #9), and the root weight is 20+1 = 21. Counts: A:5, B:2, C:2, D:2, E:10.]
Arithmetic Coding
Arithmetic coding is based on the concept of interval
subdividing.
In arithmetic coding a source ensemble is represented by an interval
between 0 and 1 on the real number line.
Each symbol of the ensemble narrows this interval.
As the interval becomes smaller, the number of bits needed to specify it
grows.
Arithmetic coding assumes an explicit probabilistic model of the source.
It uses the probabilities of the source messages to successively narrow
the interval used to represent the ensemble.
• A high-probability message narrows the interval less than a low-probability
message, so that high-probability messages contribute fewer bits to the coded
ensemble.
Arithmetic Coding: Description
In the following discussion, we will use M as the size of the alphabet of the
data source,
N[x] as symbol x's probability, and
Q[x] as symbol x's cumulative probability (i.e., Q[i] = N[0] + N[1] + ... + N[i]).
Assuming we know the probability of each symbol of the data source,
we can allocate to each symbol an interval whose width is proportional to its
probability, such that the intervals do not overlap.
This can be done if we use the cumulative probabilities as the two ends of each
interval: the two ends of the interval for symbol x are Q[x-1] and Q[x].
Symbol x is said to own the range [Q[x-1], Q[x]).
Arithmetic Coding: Encoder
We begin with the interval [0,1) and subdivide the interval
iteratively.
For each symbol entered, the current interval is divided according to the
probabilities of the alphabet.
The interval corresponding to the symbol is picked as the interval to be
processed further.
The procedure continues until all symbols in the message have been
processed.
Since each symbol's interval does not overlap with others, for each possible
message there is a unique interval assigned.
We can represent the message with the interval's two ends [L,H). In fact,
taking any single value in the interval as the encoded code is enough, and
usually the left end L is selected.
Arithmetic Coding Algorithm
L = 0.0; H = 1.0;
While ( (x = getc(input)) != EOF )
{
R = (H-L);
H = L + R * Q[x];
L = L + R * Q[x-1];
}
Output(L);
R is the interval range, and H and L are two ends of the current code
interval. x is the new symbol to be encoded.
L and H are initialized to 0 and 1, respectively.
Arithmetic Coding: Encoder example

Symbol, x   Probability, N[x]   [Q[x-1], Q[x])
A           0.4                 [0.0, 0.4)
B           0.3                 [0.4, 0.7)
C           0.2                 [0.7, 0.9)
D           0.1                 [0.9, 1.0)

String to encode: BCAB. The interval narrows with each symbol:
after B: [0.4, 0.7)
after C: [0.61, 0.67)
after A: [0.61, 0.634)
after B: [0.6196, 0.6268)
Code sent: 0.6196 (the lower end of the final interval)
Decoding Algorithm
 When decoding, the code value v is located within the current code interval to
find the symbol x such that Q[x-1] <= v < Q[x]. The procedure
iterates until all symbols are decoded.
v = input_code();
for (;;) {
x = find_symbol_straddling_this_range(v);
putc(x);
R = Q[x] – Q[x-1];
v = (v – Q[x-1])/R;
}
v        Output char, x   Q[x-1]   Q[x]   R
0.6196   B                0.4      0.7    0.3
0.732    C                0.7      0.9    0.2
0.16     A                0.0      0.4    0.4
0.4      B                0.4      0.7    0.3
0.0      (all symbols decoded)
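The encoder and decoder above can be sketched in a few lines of Python. Exact rational arithmetic (fractions) is used so that the BCAB example reproduces 0.6196 without rounding surprises; practical coders use scaled integer arithmetic instead. The decoder here is simply told the message length (see the EOF discussion below):

from fractions import Fraction as F

# Cumulative ranges from the table above; each symbol owns [Q[x-1], Q[x]).
ranges = {"A": (F(0), F(4, 10)), "B": (F(4, 10), F(7, 10)),
          "C": (F(7, 10), F(9, 10)), "D": (F(9, 10), F(1))}

def ac_encode(msg):
    low, high = F(0), F(1)
    for x in msg:
        r = high - low                                   # current interval width
        q_lo, q_hi = ranges[x]
        low, high = low + r * q_lo, low + r * q_hi       # narrow to x's sub-interval
    return low                                           # send the lower end

def ac_decode(code, n):
    """Decode n symbols; the length must be known in advance."""
    out, v = "", code
    for _ in range(n):
        for x, (q_lo, q_hi) in ranges.items():
            if q_lo <= v < q_hi:                         # find the symbol that owns v
                out += x
                v = (v - q_lo) / (q_hi - q_lo)           # rescale v for the next symbol
                break
    return out

code = ac_encode("BCAB")
print(float(code))            # 0.6196
print(ac_decode(code, 4))     # BCAB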
Arithmetic Coding: Issues
The zero-frequency problem: each symbol's predicted probability must not be zero,
or its interval will have zero width and interval renormalization would fail. This is
called the zero-frequency problem. Models that adapt online may run into it when old
counts are decayed.
The EOF problem:
Assume we pick the lower end of the interval as the encoded code. Two messages may yield
the same code if one message is identical to the other except for a trailing run of the
first symbol of the table (the symbol whose interval starts at 0) appended as a suffix.
• For example, BCAB, BCABA, BCABAA and BCABAAA all have the same lower end but
different upper ends. (Try it.)
The simplest solution is to let the decoder know the length of the encoded message: the
length is known if the message size is fixed, or it can be transmitted first. However, this is
not feasible if the data size is not known beforehand, such as live broadcast data, or if it is
too costly to do so, such as tapes whose size is unknown at the beginning.
Another solution is to introduce a special EOF symbol into the alphabet. The symbol
takes a small interval and is used only at the end of the message. When the decoder detects
the EOF symbol, it knows the end of the message has been reached.
Dictionary-Based Compression
The compression algorithms we studied so far use a statistical model to
encode single symbols
Compression: Encode symbols into bit strings that use fewer bits.
Dictionary-based algorithms do not encode single symbols as variable-
length bit strings; they encode variable-length strings of symbols as
single tokens
The tokens form an index into a phrase dictionary
If the tokens are smaller than the phrases they replace, compression occurs.
Dictionary-based compression is easier to understand because it uses
a strategy that programmers are familiar with: using indexes into
databases to retrieve information from large amounts of storage, for
example telephone numbers or postal codes.
Dictionary-Based Compression: Example
Consider the Random House Dictionary of the English Language,
Second edition, Unabridged. Using this dictionary, the string:
A good example of how dictionary based compression works
can be coded as:
1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2
Coding:
Uses the dictionary as a simple lookup table
Each word is coded as x/y, where, x gives the page in the dictionary and y gives
the number of the word on that page.
 The dictionary has 2,200 pages with fewer than 256 entries per page: therefore x
requires 12 bits and y requires 8 bits, i.e., 20 bits per word (2.5 bytes per word).
 Using ASCII coding, the above string requires 48 bytes (counting only the letters),
whereas our encoding requires only 20 bytes (2.5 bytes × 8 words): less than half the original size.
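A small sketch of the packing arithmetic above (the 12-bit/8-bit split comes from the slide; the pack function itself is only a hypothetical illustration):

def pack(page: int, word: int) -> int:
    """Pack one dictionary reference into 20 bits: 12 bits of page number,
    8 bits of word index on that page."""
    assert 0 <= page < 4096 and 0 <= word < 256
    return (page << 8) | word

refs = [(1, 1), (822, 3), (674, 4), (1343, 60), (928, 75), (550, 32), (173, 46), (421, 2)]
packed = [pack(p, w) for p, w in refs]
print(len(packed) * 20, "bits =", len(packed) * 20 // 8, "bytes")   # 160 bits = 20 bytes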
Adaptive Dictionary-based Compression
Build the dictionary adaptively
Necessary when the source data is not plain text, say audio or video data.
Is better tailored to the specific source.
Original methods are due to Ziv and Lempel in 1977 (LZ77) and 1978
(LZ78). Terry Welch improved the scheme in 1984 (the result is called LZW
compression). It is used in UNIX compress and in GIF.
LZ77: A sliding window technique in which the dictionary consists of a
set of fixed length phrases found in a window into the previously
processed text
LZ78: Instead of using fixed-length phrases from a window into the text,
it builds phrases up one symbol at a time, adding a new symbol to an
existing phrase when a match occurs.
LZW Algorithm
Preliminaries:
 A dictionary that is indexed by “codes” is used.
 The dictionary is assumed to be initialized with 256 entries (indexed
with ASCII codes 0 through 255) representing the ASCII table.
 The compression algorithm assumes that the output is either a file or
a communication channel; the input is assumed to be a file or a buffer.
 Conversely, the decompression algorithm assumes that the input is
a file or a communication channel and the output is a file or a buffer.
file/buffer → Compression → compressed file / communication channel → Decompression → file/buffer
LZW Algorithm
LZW Compression:
set w = NIL
loop
read a character k
if wk exists in the dictionary
w = wk
else
output the code for w
add wk to the dictionary
w = k
endloop
The program reads one character at a time. If the current work string plus the new
character (wk) is in the dictionary, the character is appended to the work string and the
program waits for the next one (this happens on the first character as well). If wk is not
in the dictionary, the program adds wk to the dictionary, sends over the wire (or writes
to a file) the code assigned to the work string w without the new character, and then
sets the work string to the new character.
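The same procedure in runnable Python, as a sketch: the dictionary is pre-loaded with the 256 single characters, and every output token is an integer code (codes below 256 stand for single characters). It reproduces the trace on the next slide:

def lzw_compress(s: str):
    dictionary = {chr(i): i for i in range(256)}   # codes 0-255: single characters
    next_code = 256
    w, out = "", []
    for k in s:
        wk = w + k
        if wk in dictionary:
            w = wk                                 # keep growing the work string
        else:
            out.append(dictionary[w])              # emit the code for w
            dictionary[wk] = next_code             # add wk as a new phrase
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])                  # flush the last work string
    return out

print(lzw_compress("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
# i.e. ^ W E D <256> E <260> <261> <257> B <260> T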
Input String: ^WED^WE^WEE^WEB^WET

w     k     Output   Index   Symbol
NIL   ^
^     W     ^        256     ^W
W     E     W        257     WE
E     D     E        258     ED
D     ^     D        259     D^
^     W
^W    E     256      260     ^WE
E     ^     E        261     E^
^     W
^W    E
^WE   E     260      262     ^WEE
E     ^
E^    W     261      263     E^W
W     E
WE    B     257      264     WEB
B     ^     B        265     B^
^     W
^W    E
^WE   T     260      266     ^WET
T     EOF   T
set w = NIL
loop
read a character k
if wk exists in the dictionary
w = wk
else
output the code for w
add wk to the dictionary
w = k
endloop
LZW Algorithm
LZW Decompression:
read fixed length token k (code or char)
output k
w = k
loop
read a fixed length token k
entry = dictionary entry for k
output entry
add w + first char of entry to
the dictionary
w = entry
endloop
The nice thing is that the decompressor builds its own dictionary on its side, one that
exactly matches the compressor's, so that only the codes need to be sent.
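A matching Python sketch of the decompressor. One detail is added that the pseudocode above omits: the corner case where a received code is not yet in the dictionary (the classic "KwKwK" case); it cannot occur in this particular example but does occur in general:

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:
            entry = w + w[0]                       # code not yet in the dictionary
        out.append(entry)
        dictionary[next_code] = w + entry[0]       # mirror the compressor's new entry
        next_code += 1
        w = entry
    return "".join(out)

print(lzw_decompress([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
# ^WED^WE^WEE^WEB^WET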
Example of LZW
Input String (to decode): ^WED<256>E<260><261><257>B<260>T

w     k       Output   Index   Symbol
      ^       ^
^     W       W        256     ^W
W     E       E        257     WE
E     D       D        258     ED
D     <256>   ^W       259     D^
^W    E       E        260     ^WE
E     <260>   ^WE      261     E^
^WE   <261>   E^       262     ^WEE
E^    <257>   WE       263     E^W
WE    B       B        264     WEB
B     <260>   ^WE      265     B^
^WE   T       T        266     ^WET
read a fixed length token k
(code or char)
output k
w = k
loop
read a fixed length token k
(code or char)
entry = dictionary entry for k
output entry
add w + first char of entry to
the dictionary
w = entry
endloop
LZW Algorithm - Discussion
Tokens are 9 bits wide: a leading 0 marks an ASCII character (values 0 to 255),
and a leading 1 marks a code (values 256 to 511).
 Where is the compression?
 Original String to decode : ^WED^WE^WEE^WEB^WET
 Decoded String : ^WED<256>E<260><261><257>B<260>T
 Plain ASCII coding of the string : 19 * 8 bits = 152 bits
 LZW coding of the string: 12*9 bits = 108 bits (7 symbols and 5 codes,
each of 9 bits)
 Why 9 bits?
 An ASCII character has a value ranging from 0 to 255
 All tokens have fixed length
 There has to be a distinction in representation between an
ASCII character and a Code (assigned to strings of length 2 or more)
 Codes can only have values 256 and above
LZW Algorithm – Discussion (continued)
With 9 bits we can only have a maximum of 256 codes for strings of length
2 or above (with the first 256 entries for ASCII characters)
 Original LZW uses dictionary with 4K entries, with the length of each
symbol/code being 12 bits
[Figure: layout of a 12-bit token; values 0 to 255 are the ASCII characters,
values 256 to 4095 are codes.]
 With 12 bits, we can have a maximum of 2^12 – 256 = 3840 codes.
Practical implementations of the LZW algorithm follow two approaches:
Flush the dictionary periodically
– no wasted codes
Grow the length of the codes as the algorithm proceeds
- First start with a length of 9 bits for the codes.
- Once we run out of codes, increase the length to 10 bits. When we run out of
codes with 10 bits, increase the code length to 11 bits, and so on.
- This is more efficient.
With 9-bit tokens: prefix 0 = ASCII characters, prefix 1 = codes 256-511.
Out of codes? Grow to 10-bit tokens: 00 = ASCII, 01 = codes 256-511,
10 = codes 512-767, 11 = codes 768-1023.
Out of codes again? Grow to 11-bit tokens: 000 = ASCII, 001 = codes 256-511,
010 = codes 512-767, 011 = codes 768-1023, 100 = codes 1024-1279,
101 = codes 1280-1535, 110 = codes 1536-1791, 111 = codes 1792-2047.