2.
SILICON
Contents
• Introduction
• What, when
• Some questions
• Uses
• Major steps
• Types of data compression
• Disadvantages
• Conclusion
3.
INTRODUCTION
Data Compression - What:
• As the name implies, compression makes your data smaller, saving space.
• It looks for repetitive sequences or patterns in the data - e.g. "the" in "the quick the brown fox the".
• We are more repetitive than we think - text often compresses by over 50%.
• Lossless vs. lossy
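The claim that text often compresses by more than half is easy to check with any general-purpose lossless compressor. A minimal sketch using Python's standard zlib module (my choice of tool; the slides name none):

```python
import zlib

# A deliberately repetitive message, echoing the slide's example phrase.
text = ("the quick the brown fox the " * 40).encode("ascii")
packed = zlib.compress(text, level=9)
print(len(text), len(packed))  # the packed size is far below half the original
assert zlib.decompress(packed) == text  # lossless: the original comes back exactly
```

Real prose is less repetitive than this toy string, but English text still routinely shrinks by 50% or more under such compressors.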
4.
Data Compression - WHY
• Most data from nature has redundancy.
• There is more data than the actual information contained in the data.
• Squeezing out the excess data amounts to compression.
• However, unsqueezing is necessary to be able to figure out what the data means.
• Is it always possible to compress?
  - Consider a two-bit sequence.
  - Can you always compress it to one bit?
• Such questions reveal the limits of compression and give clues on how to compress well.
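The two-bit question is a counting (pigeonhole) argument: there are four distinct two-bit sequences but only two one-bit codewords, so no lossless scheme can shorten every input. A quick enumeration makes the point:

```python
from itertools import product

two_bit = ["".join(bits) for bits in product("01", repeat=2)]  # all 4 messages
one_bit = ["0", "1"]                                           # only 2 codewords
print(two_bit, one_bit)
# 4 messages cannot map one-to-one onto 2 shorter codewords, so at least
# two inputs would collide and could not be decoded: compression must fail
# on some inputs, and only redundant data can be squeezed.
```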
5.
Question: Why do we want to make files smaller?
Answer:
• To use less storage, i.e., saving costs.
• To transmit these files faster, decreasing access time, or keeping the same access time but with a lower and cheaper bandwidth.
• To process the file sequentially faster.
7.
MAJOR STEPS
• Preparation: includes analog-to-digital conversion and generating an appropriate digital representation of the information. An image is divided into blocks of 8x8 pixels, each represented by a fixed number of bits per pixel.
• Processing: the first stage of the compression process, which makes use of sophisticated algorithms.
• Quantization: processes the result of the previous step. It specifies the granularity of the mapping of real numbers onto integers, resulting in a reduction of precision.
• Entropy encoding: the last step. It compresses a sequential digital data stream without loss, e.g. compressing a sequence of zeroes by specifying the number of occurrences.
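The zero-run idea in the entropy-encoding step can be sketched as follows. The `("Z", count)` marker is my own hypothetical format, just to illustrate replacing each run of zeroes by its length:

```python
def encode_zero_runs(data):
    """Replace each run of zeroes with ('Z', run_length); pass other values through."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            j = i
            while j < len(data) and data[j] == 0:
                j += 1          # extend the run of zeroes
            out.append(("Z", j - i))
            i = j
        else:
            out.append(data[i])
            i += 1
    return out

print(encode_zero_runs([5, 0, 0, 0, 0, 7, 0, 0, 9]))
# [5, ('Z', 4), 7, ('Z', 2), 9]
```

Real entropy coders (Huffman, arithmetic coding) appear later in the deck; this only shows the run-counting idea.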
8.
USES OF DATA COMPRESSION
• More and more data is being stored electronically. Digital video libraries, for example, contain vast amounts of data, and compression allows cost-effective storage of the data.
• New technology has allowed the possibility of interactive digital television, and the demand is for high-quality transmissions, a wide selection of programs to choose from, and inexpensive hardware. But for digital television to be a success, it must use data compression [Saxton, 1996]. Data compression reduces the number of bits required to represent or transmit information.
9.
TYPES OF DATA COMPRESSION
• Entropy encoding - lossless. Data is considered a simple digital sequence and the semantics of the data are ignored.
• Source encoding - lossy. Takes the semantics of the data into account. The amount of compression depends on the data contents.
• Hybrid encoding - a combination of entropy and source encoding. Most multimedia systems use these.
10.
TYPES OF DATA COMPRESSION
• Entropy encoding - lossless.
  - Data in the data stream is considered a simple digital sequence and the semantics of the data are ignored.
  - Short code words for frequently occurring symbols; longer code words for more infrequently occurring symbols.
    For example: E occurs frequently in English, so we should give it a shorter code than Q.
  - Examples of entropy encoding:
    - Lossless data compression
    - Huffman coding
    - Arithmetic coding
11.
LOSSLESS DATA COMPRESSION
• Run-Length Coding
  - Runs (sequences) of data are stored as a single value and count, rather than as the individual run.
  - Example, this:
      WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
    becomes:
      12WB12W3B24WB14W
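A minimal run-length encoder that reproduces the slide's example. The output format is inferred from the result shown: a count before each character, omitted when the run length is 1:

```python
from itertools import groupby

def rle_encode(s):
    """Encode runs as <count><char>; the count is omitted for runs of length 1."""
    out = []
    for ch, run in groupby(s):          # groupby yields consecutive equal chars
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)

example = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
print(rle_encode(example))  # 12WB12W3B24WB14W
```

Note this toy scheme is ambiguous for data that itself contains digits; real run-length coders use escape markers or fixed-width fields to avoid that.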
12.
• Data is not lost - the original is really needed.
  - Text compression.
  - Compression of computer binaries to fit on a floppy.
• Compression ratio typically 2:1 to 8:1 for lossless compression on many kinds of files.
• Statistical techniques:
  - Huffman coding.
  - Arithmetic coding.
• Dictionary techniques:
  - LZW, LZ77.
• Standards - Morse code, Braille, Unix compress, gzip, zip, bzip, GIF, PNG, JBIG, Lossless JPEG.
13.
SHANNON-FANO CODING
• Shannon's lossless source coding theorem is based on the concept of block coding. To illustrate this concept, we introduce a special information source in which the alphabet consists of only two letters: A = {a, b}.
1. First-Order Block Code
15.
An example:
Note that 24 bits are used to represent 24 characters - an average of 1 bit/character.
16.
• Second-Order Block Code: pairs of characters are mapped to either one, two, or three bits.
17.
B2    P(B2)   Codeword
aa    0.45    0
bb    0.45    10
ab    0.05    110
ba    0.05    111

R = 0.825 bits/character
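The rate R follows directly from the table: the expected codeword length per pair, divided by the two characters each pair covers. A quick check:

```python
# Second-order block code from the table: pair -> (probability, codeword).
pairs = {"aa": (0.45, "0"), "bb": (0.45, "10"),
         "ab": (0.05, "110"), "ba": (0.05, "111")}

# Expected bits per pair, then per character (each block covers 2 characters).
bits_per_pair = sum(p * len(code) for p, code in pairs.values())
print(round(bits_per_pair / 2, 3))  # 0.825
```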
18.
An example:
Note that 20 bits are used to represent 24 characters - an average of 0.83 bits/character.
19.
• Third-Order Block Code: triplets of characters are mapped to bit sequences of lengths one through six.
20.
B3    P(B3)   Codeword
aaa   0.405   0
bbb   0.405   10
aab   0.045   1100
abb   0.045   1101
bba   0.045   1110
baa   0.045   11110
aba   0.005   111110
bab   0.005   111111

R = 0.68 bits/character
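The stated R = 0.68 bits/character can be checked mechanically. The probabilities for the rarer triplets below are reconstructed so the table is self-consistent (the export lists them incompletely), chosen so they sum to 1 and reproduce the stated rate:

```python
# Third-order block code: triplet -> (probability, codeword).
code = {"aaa": (0.405, "0"),     "bbb": (0.405, "10"),
        "aab": (0.045, "1100"),  "abb": (0.045, "1101"),
        "bba": (0.045, "1110"),  "baa": (0.045, "11110"),
        "aba": (0.005, "111110"), "bab": (0.005, "111111")}

total_p = sum(p for p, _ in code.values())
words = [w for _, w in code.values()]
# Prefix-free: no codeword is a prefix of another (needed for unique decoding).
prefix_free = all(not b.startswith(a) for a in words for b in words if a != b)
# Expected bits per triplet, divided by the 3 characters each block covers.
rate = sum(p * len(w) for p, w in code.values()) / 3
print(round(total_p, 3), prefix_free, round(rate, 2))  # 1.0 True 0.68
```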
21.
• An example:
Note that 17 bits are used to represent 24 characters - an average of 0.71 bits/character.
22.
HUFFMAN CODING
• Suppose messages are made of the letters a, b, c, d, and e, which appear with probabilities .12, .4, .15, .08, and .25, respectively.
• We wish to encode each character into a sequence of 0's and 1's so that no code for a character is the prefix of the code for another.
• Answer (using Huffman's algorithm, given on the next slide): a=1111, b=0, c=110, d=1110, e=10.
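A compact sketch of Huffman's algorithm: repeatedly merge the two least probable subtrees, prefixing 0 to one side's codewords and 1 to the other's. With these five distinct probabilities it reproduces exactly the answer above:

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    tick = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # left branch gets 0
        merged.update({s: "1" + w for s, w in c2.items()})  # right gets 1
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

codes = huffman({"a": .12, "b": .4, "c": .15, "d": .08, "e": .25})
print(codes)  # b=0, e=10, c=110, d=1110, a=1111 (matches the slide)
```

With different tie-breaking the individual codewords can differ, but the code lengths (and hence the expected bits per character) would be the same.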
26.
HUFFMAN CODING
• Example
• n = 5, w[0:4] = [2, 5, 4, 7, 9].
• 2 = 010
• 5 = 00
• 4 = 011
• 7 = 10
• 9 = 11
[Tree figure: leaves 2 and 4 merge into 6; 6 and 5 merge into 11; 7 and 9 merge into 16; 11 and 16 merge into the root 27. Left edges are labeled 0, right edges 1.]
27.
LZ-77 ENCODING
• Good as they are, Huffman and arithmetic coding are not perfect for encoding text because they don't capture the higher-order relationships between words and phrases. There is a simple, clever, and effective approach to compressing text known as "LZ-77", which uses the redundant nature of text to provide compression.
28.
For an example, consider the phrase:
the_rain_in_Spain_falls_mainly_in_the_plain
where the underscores ("_") indicate spaces. This uncompressed message is 43 bytes, or 344 bits, long.
29.
the_rain_in_Spain_falls_mainly_in_the_plain
At first, LZ-77 simply outputs uncompressed characters, since there are no previous occurrences of any strings to refer back to. In our example, these characters will not be compressed:
1- the_rain_
The next chunk of the message, "in_", has occurred earlier in the message and can be represented as a pointer back to that earlier text, along with a length field. This gives:
2- the_rain_<3,3>
30.
the_rain_in_Spain_falls_mainly_in_the_plain
The next characters, "Sp", have not occurred before and have to be output uncompressed:
3- the_rain_<3,3>Sp
However, the characters "ain_" have already been sent, so they are encoded with a pointer:
4- the_rain_<3,3>Sp<9,4>
The characters "falls_m" are output uncompressed, but "ain" has been used before in "rain" and "Spain", so once again it is encoded with a pointer:
5- the_rain_<3,3>Sp<9,4>falls_m<11,3>
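A toy greedy LZ-77 encoder/decoder reproducing the pointers above. The minimum match length of 3 and the preference for the closest match on ties are my assumptions, chosen to match the slide's output; real LZ77 implementations also bound the search window:

```python
def lz77_encode(s, min_len=3):
    """Greedy LZ-77: emit literal characters, or (distance, length) pointers
    for repeats of at least min_len characters, preferring the closest match."""
    out, i = [], 0
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(i):                      # candidate earlier match start
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1                          # extend match (may overlap i)
            if k >= min_len and k >= best_len:  # >= prefers later (closer) j
                best_len, best_dist = k, i - j
        if best_len:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(s[i])
            i += 1
    return out

def lz77_decode(tokens):
    s = []
    for t in tokens:
        if isinstance(t, tuple):                # (distance, length) pointer
            dist, length = t
            for _ in range(length):
                s.append(s[-dist])              # copy from dist chars back
        else:
            s.append(t)
    return "".join(s)

msg = "the_rain_in_Spain_falls_mainly_in_the_plain"
tokens = lz77_encode(msg)
assert lz77_decode(tokens) == msg
print(tokens[:13])  # ends ... (3, 3), 'S', 'p', (9, 4): step 4 of the slide
```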
32.
ARITHMETIC CODING
• Huffman coding looks pretty slick, and it is, but there's a way to improve on it, known as "arithmetic coding". The idea is subtle and best explained by example.
• Suppose we have a message that only contains the characters A, B, and C, with the following frequencies, expressed as fractions:
• A: 0.5  B: 0.2  C: 0.3
33.
letter   probability   interval    binary fraction
------   -----------   ---------   ---------------
C        0.3           0.0 : 0.3   0
B        0.2           0.3 : 0.5   0.011 = 3/8 = 0.375
A        0.5           0.5 : 1.0   0.1 = 1/2 = 0.5
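The table's intervals drive the encoder: each symbol narrows the current interval to its sub-range, and any binary fraction inside the final interval encodes the whole message. A minimal sketch using exact fractions (the deck itself stops at the table, so this is an illustrative reconstruction):

```python
from fractions import Fraction as F

# Cumulative intervals from the table: C -> [0, 0.3), B -> [0.3, 0.5), A -> [0.5, 1).
intervals = {"C": (F(0), F(3, 10)), "B": (F(3, 10), F(1, 2)), "A": (F(1, 2), F(1))}

def encode(message):
    """Narrow [lo, hi) once per symbol; any number in the final interval
    identifies the message (given its length)."""
    lo, hi = F(0), F(1)
    for ch in message:
        c_lo, c_hi = intervals[ch]
        span = hi - lo
        lo, hi = lo + span * c_lo, lo + span * c_hi
    return lo, hi

lo, hi = encode("ABC")
print(lo, hi)  # 13/20 17/25, i.e. the interval [0.65, 0.68)
```

Because the interval can end up narrower than any single-codeword scheme requires, arithmetic coding can beat Huffman's one-codeword-per-symbol bound.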
34.
Irreversible Compression
• Irreversible compression is based on the assumption that some information can be sacrificed. [Irreversible compression is also called Entropy Reduction.]
• Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels. The new image contains 1 pixel for every 16 pixels in the original image.
• There is usually no way to determine what the original pixels were from the one new pixel.
• In data files, irreversible compression is seldom used. However, it is used in image and speech processing.
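The 400-to-100 shrink can be sketched as block averaging: every 4-by-4 block collapses to one pixel, and the 16 original values cannot be recovered from that single number. (Averaging is my assumption; the slide does not say how the surviving pixel is chosen.)

```python
def downsample(img, k=4):
    """Shrink a 2-D grayscale image by k in each dimension by averaging
    each k-by-k block into one pixel (irreversible: 16 values -> 1 for k=4)."""
    h, w = len(img), len(img[0])
    return [[sum(img[y + dy][x + dx] for dy in range(k) for dx in range(k)) // (k * k)
             for x in range(0, w, k)]
            for y in range(0, h, k)]

img = [[(x + y) % 256 for x in range(8)] for y in range(8)]  # toy 8x8 gradient
small = downsample(img)
print(small)  # [[3, 7], [7, 11]] - a 2x2 result; many 8x8 inputs map to it
```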
35.
LOSSY COMPRESSION
• Data is lost, but not too much:
  - Audio.
  - Video.
  - Still images, medical images, photographs.
• Compression ratios of 10:1 often yield quite acceptable quality.
• Major techniques include:
  - Vector Quantization.
  - Block transforms.
  - Standards - JPEG, JPEG 2000, MPEG (1, 2, 4, 7).
37.
DISADVANTAGES
Some techniques can compress data efficiently, but there is a chance of losing data.
38.
CONCLUSION
From the above description, no single algorithm has been developed that is applicable to every kind of data file. But this difficulty can be handled by using hybrid data compression. In this IT era, data compression is essential, even though some data may be lost.