Data Compression

Chameli Devi Group Of Institutions
Chameli Devi School Of Engineering
Guided By:-
Shadab Pasha
Submitted By:-
Arvind Carpenter

Contents
1. Introduction
2. Categorization of Compression
3. Lossless Compression
4. Run-length Encoding
5. Huffman Coding
6. Lempel Ziv (LZ) Encoding
7. Lossy Compression
8. Image Compression (JPEG) Encoding
9. Video Compression (MPEG) Encoding
10. Audio Compression (MP3)
11. Conclusion
12. References

• Video: 30 pictures per second
• Each picture = 200,000 dots or pixels
• 8 bits to represent each primary color
• For RGB = 3 x 8 = 24 bits per pixel (2^8 x 2^8 x 2^8 possible colors)
• Bits required for a one-second movie = 30 x 200,000 x 24 = 144,000,000 bits
• A two-hour movie requires 2 x 60 x 60 x 144,000,000 = 1,036,800,000,000 bits
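The storage estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming 24 bits per pixel (8 bits for each of the three RGB channels) together with the frame rate and frame size given on the slide. The constant names are chosen here for illustration.

```python
FRAMES_PER_SECOND = 30
PIXELS_PER_FRAME = 200_000
BITS_PER_PIXEL = 3 * 8  # 8 bits for each of R, G and B

# Uncompressed bits needed for one second of video
bits_per_second = FRAMES_PER_SECOND * PIXELS_PER_FRAME * BITS_PER_PIXEL

# Uncompressed bits needed for a two-hour movie
bits_two_hours = 2 * 60 * 60 * bits_per_second

print(bits_per_second)   # 144000000
print(bits_two_hours)    # 1036800000000
```

At 8 bits per byte this is roughly 130 gigabytes for one uncompressed movie, which is why compression is needed.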

Introduction
• Compression is a way to reduce the number of bits in a
frame while retaining its meaning.
• Decreases space, time to transmit, and cost
• The technique is to identify redundancy and eliminate it
• If a file contains only capital letters, we may encode all
the 26 letters using 5-bit numbers instead of 8-bit
ASCII codes
• If the file has n characters, then the savings = (8n-5n)/8n
= 37.5%
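A minimal sketch of the 5-bit idea, assuming the text contains only the capital letters A to Z. The function names pack_capitals and unpack_capitals are made up for this example, and the bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
def pack_capitals(text: str) -> str:
    """Encode each capital letter as a 5-bit number (A=00000 ... Z=11001)."""
    return "".join(format(ord(ch) - ord("A"), "05b") for ch in text)

def unpack_capitals(bits: str) -> str:
    """Reverse pack_capitals by reading the bit string 5 bits at a time."""
    return "".join(
        chr(int(bits[i:i + 5], 2) + ord("A")) for i in range(0, len(bits), 5)
    )

text = "HELLO"
bits = pack_capitals(text)
saving = (8 * len(text) - len(bits)) / (8 * len(text))

print(bits)                    # 25 bits instead of 40
print(unpack_capitals(bits))   # HELLO
print(saving)                  # 0.375
```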

Lossless Compression
In lossless data compression:-
o The integrity of the data is preserved.
o The original data and the data after compression and
decompression are exactly the same.
o No data loss.
o Redundant data is removed in compression and added
during decompression.
o Lossless compression methods are normally used
when we cannot afford to lose any data.

Run-length Encoding
Run-length encoding is simple and lossless
Here is how it works

Notice that there are 9
pieces of fruit.
We can store this information as is...

There is a much better way...
Check it out!

Currently, to read the line
of fruit aloud exactly as
it appears, you would say:
“apple, apple, apple, pear, pear, banana, orange, orange, apple”
Kind of redundant...

To save on space we can
“compress” the
information...

Notice that there are multiples of
certain fruits....

Now if we read these aloud it’s not
so weird:
“Three apples, two pears, one banana, two oranges
and one apple”
...and it saves SPACE

Now to translate into
computer terms...
A scan line contains a run of numbers...
55556987444425555611111988888222222222
...Using run-length Encoding
(4,5) (1,6) (1,9) (1,8) (1,7)
(4,4) (1,2) (4,5) (1,6) (5,1)
(1,9) (5,8) (9,2)
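The scan-line example can be reproduced with a short sketch. The function names rle_encode and rle_decode are chosen here for illustration; each pair is (count, value), matching the notation on the slide.

```python
from itertools import groupby

def rle_encode(data: str) -> list[tuple[int, str]]:
    """Collapse each run of identical symbols into a (count, value) pair."""
    return [(len(list(group)), value) for value, group in groupby(data)]

def rle_decode(pairs: list[tuple[int, str]]) -> str:
    """Expand (count, value) pairs back into the original string."""
    return "".join(value * count for count, value in pairs)

scan_line = "55556987444425555611111988888222222222"
pairs = rle_encode(scan_line)

print(pairs)  # starts with (4, '5'), (1, '6'), (1, '9'), ...
print(rle_decode(pairs) == scan_line)  # True
```

Note that runs of length 1 cost more to store as pairs than as plain values, so run-length encoding only pays off when long runs are common.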

To Sum it up.....
In Wikipedia terms.....
Run-length encoding (RLE) is a very simple
form of data compression in which runs of data
(that is, sequences in which the same data
value occurs in many consecutive data
elements) are stored as a single data value
and count, rather than as the original run.

Huffman Coding
• Huffman coding is credited to David Albert Huffman
• Huffman coding is an entropy encoding algorithm used
for lossless data compression.
• Huffman coding is a method of storing strings of data as
binary code in an efficient manner
• Huffman coding uses variable-length coding, which
means that each symbol in the data being encoded is
converted to a binary code word whose length depends on how often that
symbol is used: frequent symbols get shorter code words
• A tree is used to decide which binary code to give to each
character

The (Real) Basic Algorithm
• Scan text to be compressed and tally occurrence of all
characters.
• Sort or prioritize characters based on number of
occurrences in text.
• Build Huffman code tree based on prioritized list.
• Perform a traversal of tree to determine all code words.
• Scan text again and create new file using the Huffman
codes.
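The steps above can be sketched in Python using a priority queue from the standard heapq module. This is one possible implementation, not the one behind the slides: the function name huffman_codes, the tie-breaking counter and the tuple-based tree are assumptions made for illustration. Ties between equal counts may be broken differently from the slides, so individual code words can differ while the total encoded length stays the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text: str) -> dict[str, str]:
    """Return a {character: code word} table built with Huffman's algorithm."""
    freq = Counter(text)                       # step 1: tally occurrences
    tiebreak = count()                         # keeps heap entries comparable
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                        # step 2: prioritize by count
    if not heap:
        return {}
    if len(heap) == 1:                         # single distinct symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:                       # step 3: build the tree
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:       # step 4: traverse for codes
        if isinstance(node, tuple):            # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                  # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

text = "Eerie eyes seen near lake."
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)    # step 5: rescan and encode
print(len(encoded))                            # 84
```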

Building a Tree
Scan the original text
• Consider the following short text:
Eerie eyes seen near lake.
• Count up the occurrences of all characters in the text
CS 102

Building a Tree
What characters are present?
E e r i space
y s n a l k .

Building a Tree
What is the frequency of each character in the
text?
Char   Freq
E      1
e      8
r      2
i      1
space  4
y      1
s      2
n      2
a      2
l      1
k      1
.      1

Building a Tree
• The queue after inserting all nodes
• Null pointers are not shown
E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8

• Each step dequeues the two nodes with the lowest counts and enqueues a new node holding the sum of their counts:
1. E:1 + i:1 -> 2
2. y:1 + l:1 -> 2
3. k:1 + .:1 -> 2
4. r:2 + s:2 -> 4
5. n:2 + a:2 -> 4
6. (E,i):2 + (y,l):2 -> 4
7. (k,.):2 + sp:4 -> 6
8. (r,s):4 + (n,a):4 -> 8
9. (E,i,y,l):4 + (k,.,sp):6 -> 10
10. e:8 + (r,s,n,a):8 -> 16
11. (E,i,y,l,k,.,sp):10 + (e,r,s,n,a):16 -> 26
• What is happening to the characters with a low number of occurrences?
• After enqueueing this node
there is only one node left
in the priority queue.

Encoding the File
Traverse Tree for Codes
• Perform a traversal of the
tree to obtain new code
words
• Going left is a 0, going right
is a 1
• A code word is only
completed when a leaf
node is reached

ENCODING THE FILE
TRAVERSE TREE FOR CODES
Char Code
E 0000
i 0001
y 0010
l 0011
k 0100
. 0101
space 011
e 10
r 1100
s 1101
n 1110
a 1111
[Figure: the completed Huffman tree, root count 26]

ENCODING THE FILE
• Rescan text and encode file
using new code words
Char Code
E 0000
i 0001
y 0010
l 0011
k 0100
. 0101
space 011
e 10
r 1100
s 1101
n 1110
a 1111
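000010110000011001110001010110101111011010111
00111110101111110001100111111010010
0101
• Why is there no need for a
separator character?
• Because no code word is a prefix of another, the bit stream can be split into code words in only one way.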
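The prefix property can be seen in a small decoder. The sketch below uses the code table from this slide; the names CODES, encode and decode are chosen here for illustration. The decoder reads one bit at a time and emits a character as soon as the bits collected so far match a code word, which is safe only because no code word is a prefix of another.

```python
CODES = {
    "E": "0000", "i": "0001", "y": "0010", "l": "0011",
    "k": "0100", ".": "0101", " ": "011", "e": "10",
    "r": "1100", "s": "1101", "n": "1110", "a": "1111",
}

def encode(text: str, codes: dict[str, str]) -> str:
    """Concatenate the code word of every character, with no separators."""
    return "".join(codes[ch] for ch in text)

def decode(bits: str, codes: dict[str, str]) -> str:
    """Rebuild the text by matching code words bit by bit."""
    lookup = {code: ch for ch, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in lookup:      # a complete code word has been read
            out.append(lookup[current])
            current = ""
    return "".join(out)

text = "Eerie eyes seen near lake."
bits = encode(text, CODES)
print(len(bits))             # 84
print(decode(bits, CODES))   # Eerie eyes seen near lake.
```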

ENCODING THE FILE
RESULTS
• Have we made things any
better?
• 84 bits to encode the text
• ASCII would take 8 * 26 =
208 bits
000010110000011001110001010110101111011010111
00111110101111110001100111111010010
0101

Lempel Ziv (LZ) Encoding
• Data compression up until the late 1970s was mainly directed
towards creating better methodologies for Huffman coding.
• An innovative, radically different method was introduced
in 1977 by Abraham Lempel and Jacob Ziv.
• This technique (called Lempel-Ziv) actually consists of two
considerably different algorithms, LZ77 and LZ78.
• Due to patents, LZ77 and LZ78 led to many variants.
LZ77 variants: LZR, LZSS, LZB, LZH
LZ78 variants: LZW, LZC, LZT, LZMW, LZJ, LZFG
• The zip and unzip use the LZH technique while UNIX's
compress methods belong to the LZW and LZC classes

EXAMPLE : LZ78 COMPRESSION
Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ78 algorithm.
The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)
Note: The above is just a representation; the commas and parentheses are not transmitted.

EXAMPLE : LZ78 COMPRESSION (CONT’D)
1. A is not in the Dictionary; insert it.
2. B is not in the Dictionary; insert it.
3. B is in the Dictionary.
   BC is not in the Dictionary; insert it.
4. BC is in the Dictionary.
   BCA is not in the Dictionary; insert it.
5. B is in the Dictionary.
   BA is not in the Dictionary; insert it.
6. BCA is in the Dictionary.
   BCAA is not in the Dictionary; insert it.
7. BCA is in the Dictionary.
   BCAA is in the Dictionary.
   BCAAB is not in the Dictionary; insert it.
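The walk-through above can be sketched as follows. The function names lz78_compress and lz78_decompress are chosen for illustration, and the output is kept as a list of (index, character) pairs, where index 0 stands for the empty prefix. Emitting a final pair with an empty character when the input ends inside a known phrase is an assumption of this sketch, not something stated on the slides.

```python
def lz78_compress(text: str) -> list[tuple[int, str]]:
    """Return LZ78 (dictionary index, next character) pairs for text."""
    dictionary: dict[str, int] = {}   # phrase -> 1-based dictionary index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending the known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                        # input ended inside a known phrase
        output.append((dictionary[phrase], ""))
    return output

def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]                    # index 0 is the empty prefix
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

message = "ABBCBCABABCAABCAAB"
pairs = lz78_compress(message)
print(pairs)
# [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
print(lz78_decompress(pairs) == message)  # True
```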

Lossy Compression Methods
• Used for compressing images and video files
(our eyes cannot distinguish subtle changes, so
lossy data is acceptable).
• These methods are cheaper: they take less time and
space.
• Several methods:
o JPEG: compress pictures and graphics
o MPEG: compress video
o MP3: compress audio

JPEG Encoding
• Used to compress pictures and graphics.
• In JPEG, a grayscale picture is divided into 8x8
pixel blocks to decrease the number of
calculations.
• Basic idea:
o Change the picture into a linear (vector) set of numbers that
reveals the redundancies.
o The redundancies are then removed by one of the lossless
compression methods.

JPEG Encoding - DCT
DCT: Discrete Cosine Transform
DCT transforms the 64 values in an 8x8 pixel block
in a way that the relative relationships between
pixels are kept but the redundancies are
revealed.
• Example:
A gradient grayscale
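As a rough illustration of how the DCT reveals redundancy, the sketch below applies the usual 8x8 two-dimensional DCT (with the 1/4 scaling and the 1/sqrt(2) factor for index 0 that JPEG descriptions commonly use) to a horizontal gradient. The sample block and the function name dct_8x8 are assumptions made for this example and are not taken from the slides.

```python
import math

N = 8

def dct_8x8(block: list[list[float]]) -> list[list[float]]:
    """Return the 8x8 table T of DCT coefficients for an 8x8 pixel block."""
    def c(k: int) -> float:
        return 1 / math.sqrt(2) if k == 0 else 1.0

    table = [[0.0] * N for _ in range(N)]
    for m in range(N):
        for n in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * m * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * n * math.pi / (2 * N))
                    )
            table[m][n] = 0.25 * c(m) * c(n) * total
    return table

# Hypothetical gradient block: every row runs 20, 30, ..., 90
gradient = [[20 + 10 * y for y in range(N)] for _ in range(N)]
T = dct_8x8(gradient)

for row in T:
    print([round(v, 1) for v in row])
```

Because every row of this picture is identical, only the first row of T carries information; all other entries are zero up to rounding error, which is exactly the redundancy the later steps remove.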

Quantization & Compression
• Quantization:
o After the T table is created, the values are quantized to reduce the
number of bits needed for encoding.
o Quantization divides each value by a constant, then
drops the fraction. This is done to optimize the number of bits
and the number of 0s for each particular application.
• Compression:
o Quantized values are read from the table and redundant 0s are
removed.
o To cluster the 0s together, the table is read diagonally in a
zigzag fashion. The reason is that if the picture doesn’t have fine
changes, the bottom right corner of the table is all 0s.
o JPEG usually uses lossless run-length encoding at the
compression phase.
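The quantize, zigzag and run-length steps can be sketched on a small table. A 4x4 table with invented values is used here to keep the output short (JPEG itself works on 8x8 tables), and a single divisor is assumed for every entry. The names quantize, zigzag and run_lengths are chosen for illustration.

```python
from itertools import groupby

def quantize(table: list[list[float]], divisor: float) -> list[list[int]]:
    """Divide each value by a constant and drop the fraction."""
    return [[int(value / divisor) for value in row] for row in table]

def zigzag(table: list[list[int]]) -> list[int]:
    """Read a square table diagonal by diagonal, alternating direction."""
    n = len(table)
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Cells on the same diagonal share r + c; odd diagonals run downwards
    # (increasing row), even diagonals run upwards (increasing column).
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [table[r][c] for r, c in cells]

def run_lengths(values: list[int]) -> list[tuple[int, int]]:
    """Run-length encode the sequence as (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

# Hypothetical T table: large values top left, small values bottom right
t_table = [
    [160, 24, 3, 0],
    [18, 6, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
]
sequence = zigzag(quantize(t_table, 2))
print(sequence)               # the non-zero values come first
print(run_lengths(sequence))  # the ten trailing 0s collapse into one pair
```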

MPEG Encoding
• Used to compress video.
• Basic idea:
o Each video is a rapid sequence of a set of
frames. Each frame is a spatial combination
of pixels, or a picture.
o Compressing video =
spatially compressing each frame
+
temporally compressing a set of
frames.

MPEG Encoding
• Spatial Compression
• Each frame is spatially compressed by JPEG.
• Temporal Compression
• Redundant frames are removed.
• For example, in a static scene in which someone is talking,
most frames are the same except for the segment around the
speaker’s lips, which changes from one frame to the next.
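The idea behind temporal compression can be sketched by storing only the pixels that differ from the previous frame. This is a deliberate simplification for illustration: real MPEG encoders use predicted frames and motion compensation rather than a plain pixel-by-pixel difference, and the names frame_delta and apply_delta and the tiny frames below are invented for this example.

```python
def frame_delta(previous: list[list[int]], current: list[list[int]]) -> dict:
    """Return {(row, col): new_value} for the pixels that changed."""
    return {
        (r, c): value
        for r, row in enumerate(current)
        for c, value in enumerate(row)
        if previous[r][c] != value
    }

def apply_delta(previous: list[list[int]], delta: dict) -> list[list[int]]:
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = [row[:] for row in previous]
    for (r, c), value in delta.items():
        frame[r][c] = value
    return frame

# Two hypothetical 3x4 frames: only the "lips" in the middle change
frame1 = [[5, 5, 5, 5], [5, 9, 9, 5], [5, 5, 5, 5]]
frame2 = [[5, 5, 5, 5], [5, 7, 8, 5], [5, 5, 5, 5]]

delta = frame_delta(frame1, frame2)
print(delta)                                 # {(1, 1): 7, (1, 2): 8}
print(apply_delta(frame1, delta) == frame2)  # True
```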

Audio Compression
Used for speech or music
• Speech: compress a 64 kbps digitized signal
• Music: compress a 1.411 Mbps signal
Two categories of techniques:
• Predictive encoding
• Perceptual encoding

Audio Encoding
•Predictive Encoding
•Only the differences between samples are encoded, not
the whole sample values.
•Several standards: GSM (13 kbps), G.729 (8 kbps), and
G.723.1 (6.3 or 5.3 kbps)
•Perceptual Encoding: MP3
•CD-quality audio needs at least 1.411 Mbps and cannot
be sent over the Internet without compression.
•MP3 (MPEG audio layer 3) uses perceptual encoding
technique to compress audio.
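The predictive encoding idea above can be sketched as simple difference (delta) coding: the first sample is kept and every later sample is replaced by its difference from the previous one. The speech standards listed on the slide use far more elaborate predictors; the function names and sample values here are invented for illustration.

```python
from itertools import accumulate

def delta_encode(samples: list[int]) -> list[int]:
    """Keep the first sample, then store only sample-to-sample differences."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the samples as the running sum of the differences."""
    return list(accumulate(deltas))

samples = [100, 102, 105, 105, 103]
deltas = delta_encode(samples)
print(deltas)                # [100, 2, 3, 0, -2]
print(delta_decode(deltas))  # [100, 102, 105, 105, 103]
```

Neighbouring audio samples are usually close in value, so the differences are small numbers that need fewer bits than the samples themselves.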

Conclusion
Compression is used in all types of data
to save space and time. There are two
types of data compression: lossy and
lossless. Lossy techniques are used for
images, video and audio, where we
can bear some data loss. Lossless techniques
are used for textual data, which can be
encoded through run-length, Huffman
and Lempel-Ziv coding.

References
• http://www.csie.kuas.edu.tw/course/cs/english/ch-15.ppt
• CS157B-Lecture 19 by Professor Lee
http://cs.sjsu.edu/~lee/cs157b/cs157b.html
• “The Essentials of Computer Organization
and Architecture” by Linda Null and Julia
Lobur
• http://www.wekipedia.com
