DATA COMPRESSION & ITS TECHNIQUES
DATA COMPRESSION
 The process of reducing the volume of
data by applying a compression technique is
called compression. The resulting data is
called compressed data.
 The reverse process of reproducing the
original data from compressed data is called
decompression. The resulting data is called
decompressed data.
REASONS TO COMPRESS
• Make optimal use of limited storage space
• Save time and help to optimize resources
 If compression and decompression are done in the I/O
processor, less time is required to move data to or
from the storage subsystem, freeing the I/O bus for other
work
 When sending data over a communication line: less time
to transmit and less storage needed at the host
TYPES OF COMPRESSION TECHNIQUES
Compression techniques can be
categorized based on the following
considerations:
• Lossless or lossy
• Symmetrical or asymmetrical
• Software or hardware
1. Lossless or lossy
 If the decompressed data is the same as the original data, it is
referred to as lossless compression, otherwise the compression is
lossy.
2. Symmetrical or asymmetrical
 In symmetrical compression, the times required to compress and to
decompress are roughly the same.
 In asymmetrical compression, the time taken for compression is
usually much longer than decompression.
3. Software or hardware
 A compression technique may be implemented either in hardware
or software. Compared with software codecs (coder and decoder),
hardware codecs generally offer better performance.
DATA COMPRESSION METHODS
 Data compression is about storing and sending a smaller
number of bits.
 There are two major categories of methods to compress
data: lossless and lossy methods.
LOSSLESS COMPRESSION
METHODS
 In lossless methods, original data and the data after
compression and decompression are exactly the same.
 Redundant data is removed in compression and added
during decompression.
 Lossless methods are used when we can’t afford to lose
any data: legal and medical documents, computer
programs.
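The lossless property can be checked directly. The sketch below uses Python's standard zlib module as a stand-in for any lossless method; the sample data is made up for illustration.

```python
import zlib

# Highly redundant sample data (an assumption, chosen so it compresses well).
original = b"AAAAABBBBBCCCCC" * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
print("lossless:", restored == original)
```

The round trip returns exactly the original bytes, which is what distinguishes a lossless method from a lossy one.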
RUN-LENGTH ENCODING
 Simplest method of compression.
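 How: replace each run of consecutive repeating occurrences of a symbol by the
number of occurrences followed by one occurrence of the symbol itself.
Example: Consider the string Xtmprsqzntwlfb, which contains no repeated runs.
After RLE encoding, this string becomes:
1X1t1m1p1r1s1q1z1n1t1w1l1f1b
The encoded string is twice as long as the original, so this is a worst case for RLE.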
RLE schemes are simple and fast, but their compression efficiency depends on the type of data being
encoded.
Example: A black-and-white image that is mostly white, such as the page of a book, will encode very
well, due to the large amount of contiguous data that is all the same color. An image with many colors
that is very busy in appearance, however, such as a photograph, will not encode very well. The
complexity of the image is expressed as a large number of different colors, so there will be
relatively few runs of the same color.
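A minimal run-length coder can be sketched as follows. It writes each run as the count followed by the symbol, as in the string example above, and it assumes the symbols themselves are never digits; the function names are illustrative.

```python
import re
from itertools import groupby

def rle_encode(text):
    """Encode each run as <count><symbol>, e.g. 'WWWB' -> '3W1B'."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(text))

def rle_decode(encoded):
    """Invert rle_encode; assumes the symbols themselves are not digits."""
    return "".join(symbol * int(count)
                   for count, symbol in re.findall(r"(\d+)(\D)", encoded))

print(rle_encode("Xtmprsqzntwlfb"))            # no runs: output is twice as long
print(rle_encode("W" * 12 + "B" + "W" * 14))   # long runs: 12W1B14W
```

The second call mimics a mostly white scan line and shrinks 27 symbols to 8 characters, while the first shows the worst case.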
HUFFMAN CODING
 Assign fewer bits to symbols that occur more frequently and more
bits to symbols that appear less often.
 A Huffman code is not unique, but every Huffman code for a given source has the
same average code length.
 Algorithm:
1. Make a leaf node for each code symbol
Add the generation probability of each symbol to the leaf node
2. Take the two nodes with the smallest probabilities and connect them into a new
node
Add 1 or 0 to each of the two branches
The probability of the new node is the sum of the probabilities of the two
connected nodes
3. If there is only one node left, the code construction is completed. If not, go back to
(2)
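The three steps above can be sketched with a priority queue. This is one possible implementation, not the only one; the probabilities are a made-up source, and the function name is illustrative.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Step 1: one leaf per symbol; each entry carries the codes built so far.
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a lone symbol still needs one bit
        return {sym: "0" for sym in probabilities}
    # Steps 2 and 3: merge the two least probable nodes until one remains.
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}        # branch labelled 0
        merged.update({s: "1" + c for s, c in codes1.items()})  # branch labelled 1
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}  # assumed example source
code = huffman_code(probs)
average = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print("average code length:", round(average, 3), "bits/symbol")
```

Swapping the 0 and 1 labels or breaking ties differently gives a different code table, but the average length stays at 2.2 bits per symbol for this source.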
HUFFMAN CODING
[Figure: example of building a Huffman tree]
HUFFMAN CODING
[Figure: encoding and decoding with a Huffman code]
LEMPEL ZIV ENCODING
 Lempel-Ziv encoding is dictionary-based.
 Dictionary coding techniques rely upon the observation that there are
correlations between parts of data (recurring patterns). The basic idea is to
replace those repetitions by (shorter) references to a "dictionary" containing the
original.
 A dictionary-based method may be static or dynamic, depending on how the
dictionary is created and used.
 A static dictionary is prepared before the encoded message is communicated
to the receiver’s end. All possible chars/words/phrases are inserted into the
dictionary and indexed.
 The main drawback of the static method is that performance depends on the text
to be encoded and is highly dependent on the organization of the
chars/words/phrases in the dictionary.
 Secondly, if the text contains any word that is not in the dictionary, the method fails.
 The solution to the problem is dynamic dictionary compression. In this method,
the dictionary is built while the text is being encoded.
 The LZ77, LZ78 and LZW techniques use dynamic dictionary
compression.
LZ77 (LEMPEL-ZIV) COMPRESSION
TECHNIQUE
The dictionary used is actually a portion of the input text, which has been
recently encoded.
The text that needs to be encoded is compared with the strings of symbols in the
dictionary.
The longest matched string in the dictionary is characterized by a pointer
(sometimes called a token), which is represented by a triple of data items.
Note that this triple functions as an index to the dictionary.
In this way, a variable-length string of symbols is mapped to a fixed-length
pointer.
The LZ77 algorithm uses a sliding window. The window consists of two
parts: a search buffer and a look-ahead buffer.
The search buffer contains the portion of the text stream that has recently been
encoded (the dictionary).
The look-ahead buffer contains the text to be encoded next.
The window slides through the input text stream from beginning to end during
the entire encoding process.
LZ77 window example:
SEARCH BUFFER: c a b r a c a d
LOOK-AHEAD BUFFER: a b r a r r a r r
The pointer marks the start of the match in the search buffer.
1. To encode the sequence in look-ahead buffer, the encoder moves a search pointer
back through the search buffer until it encounters a match to the first symbol in the
look-ahead buffer. The distance of the pointer from the look-ahead buffer is called
the offset.
2. The encoder then examines the symbols following the symbol at the pointer location
to see if they match consecutive symbols in the look-ahead buffer. The number of
consecutive symbols in the search buffer that match consecutive symbols in the
look-ahead buffer, starting with the first symbol, is called the length of the match.
The encoder searches the search buffer for the longest match.
3. Once the longest match has been found, the encoder encodes it with a
triple <o,l,c> where o is the offset, l is the length of the match and c is
the code-word corresponding to the symbol in the look-ahead buffer
that follows the match.
 In the diagram, the longest match starts at the first a of the search buffer
(c a b r a c a d) and covers a b r a. The offset o in this case is 7, l is 4, and
the symbol in the look-ahead buffer following the match is r.
 The reason for sending the third element in the triple is to take care of
the situation where no match for the symbol in the look-ahead buffer can
be found in the search buffer. In this case, the offset and the match
length values are set to 0, and the third element of the triple is the code
for the symbol itself.
 Decoding is basically a table look-up procedure and
can be done by reversing the encoding procedure.
 The limitation of the approach is that if the distance between the
repeated patterns in the input text stream is larger than the size of the
search buffer, then the approach cannot utilize the structure to compress
the text. The longest match possible is roughly the size of the look-ahead
buffer.
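The encoder and decoder described above can be sketched as follows. The buffer sizes (7 and 6), the sample string and the function names are assumptions for illustration; triples hold the raw symbol rather than a code word.

```python
def lz77_encode(text, search_size=7, lookahead_size=6):
    """Return a list of (offset, length, next_symbol) triples."""
    triples, pos = [], 0
    while pos < len(text):
        best_offset, best_length = 0, 0
        # Leave one symbol of the look-ahead buffer for the third triple element.
        max_length = min(lookahead_size, len(text) - pos) - 1
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            # A match may run on from the search buffer into the look-ahead buffer.
            while (length < max_length
                   and text[pos - offset + length] == text[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        triples.append((best_offset, best_length, text[pos + best_length]))
        pos += best_length + 1
    return triples

def lz77_decode(triples):
    """Rebuild the text by copying from the already decoded output."""
    out = []
    for offset, length, symbol in triples:
        for _ in range(length):
            out.append(out[-offset])  # one symbol at a time, so overlaps work
        out.append(symbol)
    return "".join(out)

triples = lz77_encode("cabracadabrarrarrad")
print(triples)
print(lz77_decode(triples))
```

For this sample string the last two triples are (7, 4, 'r') and (3, 5, 'd'). When no match exists the triple has offset 0 and length 0, as described above, and the second of those triples shows a match that extends past the search buffer into the look-ahead buffer.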
LZ78 (LEMPEL-ZIV) COMPRESSION
TECHNIQUE
LZ78 does not use a sliding window.
It uses the encoded text as a dictionary which, potentially, does not have a fixed
size.
Each time a pointer (token) is issued, the encoded string is added to the
dictionary.
Once a preset limit on the dictionary size has been reached, either the
dictionary is frozen for the rest of the input (if the coding efficiency is good), or it is
reset to empty, i.e., it must be rebuilt.
Instead of the triples used in LZ77, only pairs are used in LZ78.
Specifically, only the position of the pointer to the matched string and the
symbol following the matched string need to be encoded.
Example: The string S = 001212121021012101221011 is to be encoded.
The figure shows the encoding process.
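The encoding of S can also be traced in code. This is a sketch with no dictionary size limit, where index 0 stands for "no matching entry"; the function names are illustrative.

```python
def lz78_encode(text):
    """Return (index, symbol) pairs; index 0 means no dictionary entry matched."""
    dictionary = {}           # phrase -> index, grows as pairs are emitted
    pairs, phrase = [], ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol  # keep extending the current match
        else:
            pairs.append((dictionary.get(phrase, 0), symbol))
            dictionary[phrase + symbol] = len(dictionary) + 1
            phrase = ""
    if phrase:  # input ended inside a known phrase: emit it with no new symbol
        pairs.append((dictionary[phrase], ""))
    return pairs

def lz78_decode(pairs):
    """Rebuild the text while growing the same dictionary as the encoder."""
    phrases, out = [""], []   # phrases[0] is the empty phrase
    for index, symbol in pairs:
        phrase = phrases[index] + symbol
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

S = "001212121021012101221011"
pairs = lz78_encode(S)
print(pairs)
print(lz78_decode(pairs) == S)
```

S is parsed into the nine phrases 0, 01, 2, 1, 21, 210, 2101, 21012 and 21011, giving the pairs (0,0), (1,1), (0,2), (0,1), (3,1), (5,0), (6,1), (7,2), (7,1).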
DECODING PROCESS
LZW COMPRESSION
TECHNIQUE
 LZW, an improved version of the original LZ78 algorithm, is perhaps the
most famous modification and is sometimes even mistakenly
referred to as the Lempel Ziv algorithm.
 It basically applies the principle of not explicitly transmitting the
next non-matching symbol to the LZ78 algorithm. The only
remaining outputs of this improved algorithm are fixed-length
references to the dictionary (indexes).
 The dictionary has to be initialized with all the symbols of the
input alphabet and this initial dictionary needs to be made known
to the decoder.
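A compact sketch of LZW follows. It assumes every input symbol is in the agreed initial alphabet, lets the dictionary grow without limit, and emits indexes as plain integers rather than fixed-length bit fields; the function names are illustrative.

```python
def lzw_encode(text, alphabet):
    """Return a list of dictionary indexes; alphabet seeds the dictionary."""
    dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
    phrase, codes = "", []
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol
        else:
            codes.append(dictionary[phrase])
            dictionary[phrase + symbol] = len(dictionary)
            phrase = symbol   # the non-matching symbol starts the next phrase
    if phrase:
        codes.append(dictionary[phrase])
    return codes

def lzw_decode(codes, alphabet):
    """Rebuild the text; the decoder must start from the same alphabet."""
    if not codes:
        return ""
    entries = list(alphabet)
    previous = entries[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # An index one past the table end refers to the entry being built now.
        current = entries[code] if code < len(entries) else previous + previous[0]
        entries.append(previous + current[0])
        out.append(current)
        previous = current
    return "".join(out)

codes = lzw_encode("abababab", "ab")
print(codes)
print(lzw_decode(codes, "ab"))
```

Only indexes are transmitted, and the decoder recovers each new dictionary entry one step behind the encoder, which is why the initial dictionary has to be shared in advance.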
LOSSY COMPRESSION
METHODS
 Used for compressing images, audio and video files (our eyes and ears
cannot distinguish subtle changes, so some loss of data is
acceptable).
 These methods are cheaper: they need less time and space.
 Several methods:
 JPEG: compress pictures and graphics
 MPEG: compress video
 MP3: compress audio
THE JPEG STANDARD
 Joint Photographic Experts Group
 JPEG is the standard compression technique for still images
 Lossy compression
 Employs a transform coding method using the DCT (Discrete
Cosine Transform)
 Main Steps in JPEG Image Compression
1. Transform RGB to YIQ or YUV and subsample color.
2. DCT on image blocks.
3. Quantization.
4. Zig-zag ordering and run-length encoding.
5. Entropy coding.
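The zig-zag ordering in step 4 can be sketched as a sort over anti-diagonals. The function names are illustrative, and the small 4 × 4 block is only a test pattern; JPEG itself works on 8 × 8 blocks.

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n zig-zag scan."""
    # Cells on one anti-diagonal share row + col; the direction alternates.
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def zigzag_scan(block):
    """Flatten a square block (list of rows) in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

block = [[4 * r + c for c in range(4)] for r in range(4)]  # test pattern 0..15
print(zigzag_scan(block))
```

The scan visits low-frequency coefficients first, so the many zero-valued high-frequency coefficients left after quantization end up in long runs that run-length encoding handles well.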
BLOCK DIAGRAM FOR JPEG
ENCODER
DCT ON IMAGE BLOCKS
 Each image is divided into 8 × 8 blocks. The 2D
DCT is applied to each block image f(i, j), with
output being the DCT coefficients F(u, v) for
each block.
 By applying Discrete Cosine Transform (DCT),
the data in time (spatial) domain can be
transformed into frequency domain.
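A direct, unoptimized version of the block transform can be sketched as below. It assumes the orthonormal DCT-II scaling, which for 8 × 8 blocks equals the usual C(u)C(v)/4 factor; real encoders use fast algorithms instead of this quadruple loop.

```python
import math

def dct_2d(block):
    """2-D DCT-II of a square block f(i, j), returned as F(u, v)."""
    n = len(block)

    def scale(k):
        # Orthonormal scaling; for n = 8 this is C(k) / 2 with C(0) = 1 / sqrt(2).
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    return [
        [
            scale(u) * scale(v) * sum(
                block[i][j]
                * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                * math.cos((2 * j + 1) * v * math.pi / (2 * n))
                for i in range(n) for j in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]

flat = [[100] * 8 for _ in range(8)]   # a uniform gray block (made-up data)
F = dct_2d(flat)
print(round(F[0][0], 6))               # all energy lands in the DC term F(0, 0)
```

For a uniform block only F(0, 0) is non-zero, which illustrates why smooth image regions compress well after the transform.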
QUANTIZATION IN JPEG
• $\hat{F}(u, v) = \mathrm{round}\left( F(u, v) / Q(u, v) \right)$
• $F(u, v)$ represents a DCT coefficient, $Q(u, v)$ is a
“quantization matrix” entry, and $\hat{F}(u, v)$
represents the quantized DCT coefficient which
JPEG will use in the succeeding entropy coding.
• The quantization step is the main source for loss
in JPEG compression.
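The loss introduced by this step can be seen in a small sketch. The coefficient and quantization values below are made up for illustration, and Python's round() is used, which rounds exact halves to the nearest even integer.

```python
def quantize(F, Q):
    """Apply F_hat(u, v) = round(F(u, v) / Q(u, v)) entry by entry."""
    return [[round(f / q) for f, q in zip(f_row, q_row)]
            for f_row, q_row in zip(F, Q)]

def dequantize(F_hat, Q):
    """Decoder side: multiply back; the rounding error is not recoverable."""
    return [[f * q for f, q in zip(f_row, q_row)]
            for f_row, q_row in zip(F_hat, Q)]

F = [[-415.4, 30.2], [12.1, 3.9]]   # assumed DCT coefficients (2 x 2 excerpt)
Q = [[16, 11], [12, 14]]            # assumed quantization matrix entries

F_hat = quantize(F, Q)
print(F_hat)
print(dequantize(F_hat, Q))
```

The reconstructed values differ from the original coefficients, and the small coefficient 3.9 is lost entirely; larger Q entries discard more detail and give higher compression.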
MPEG
ENCODING
 Used to compress video.
 Basic idea:
 Each video is a rapid sequence of a set of frames. Each
frame is a spatial combination of pixels, or a picture.
 Compressing video =
spatially compressing each frame
+
temporally compressing a set of frames.
• Spatial Compression
 Each frame is spatially compressed with a JPEG-style method.
• Temporal Compression
 Redundant information between successive frames is removed.
 For example, in a static scene in which someone is talking, most frames
are the same except for the segment around the speaker’s lips, which
changes from one frame to the next.
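The idea of temporal compression can be illustrated with plain frame differencing. This is a deliberate simplification and an assumption of this sketch: actual MPEG encoders predict blocks with motion compensation rather than listing changed pixels. Frames are flat lists of pixel values.

```python
def frame_delta(previous, current):
    """List (index, new_value) for the pixels that changed between two frames."""
    return [(i, new) for i, (old, new) in enumerate(zip(previous, current))
            if old != new]

def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for i, value in delta:
        frame[i] = value
    return frame

frame1 = [10] * 16                  # made-up 4 x 4 frame, flattened
frame2 = list(frame1)
frame2[5], frame2[6] = 200, 210     # only the "lips" region changes

delta = frame_delta(frame1, frame2)
print(delta)
print(apply_delta(frame1, delta) == frame2)
```

Only two of the sixteen pixels need to be stored for the second frame, which is the saving that temporal compression exploits in a static scene.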
THREE TYPES OF FRAMES
 Intra frames (I-frames)
 Self-contained frames, coded in the same way as JPEG images
 Predictive frames (P-frames)
 Encoded from the previous I or P reference frame
 Bi-directional frames (B-frames)
 Encoded from previous and future I or P frames
[Figure: sequence of I, P and B frames]
3-D SCALABLE MEDICAL IMAGE
COMPRESSION WITH OPTIMIZED VOLUME
OF INTEREST CODING
 The method employs the 3-D integer wavelet
transform and EBCOT (embedded block
coding with optimized truncation) to create
the bit stream.
 Provides random access.
 Provides resolution and quality scalability.
 High reconstruction quality.
 Optimized volume of interest (VOI) coding.
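An integer wavelet transform maps integers to integers and can be inverted exactly. The slides do not say which filter the method uses, so the sketch below assumes the simplest case, one level of the integer Haar transform (also called the S-transform) in one dimension; a 3-D transform would apply such a step along each axis of the volume.

```python
def haar_forward(signal):
    """One level of the integer Haar (S) transform on an even-length list."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    details = [b - a for a, b in pairs]
    # a + floor(d / 2) equals floor((a + b) / 2), an integer average.
    averages = [a + (d >> 1) for (a, _), d in zip(pairs, details)]
    return averages, details

def haar_inverse(averages, details):
    """Exactly undo haar_forward using the same integer arithmetic."""
    signal = []
    for s, d in zip(averages, details):
        a = s - (d >> 1)
        signal += [a, a + d]
    return signal

samples = [5, 8, 8, 5, 10, 10, 0, 255]   # made-up voxel intensities
averages, details = haar_forward(samples)
print(averages, details)
print(haar_inverse(averages, details) == samples)
```

Because every step uses integer arithmetic, the original samples are recovered exactly, which is what allows a wavelet coder to offer lossless reconstruction.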
3-D SCALABLE MEDICAL IMAGE
COMPRESSION WITH OPTIMIZED VOLUME
OF INTEREST CODING
Block Diagram
LOW BIT-RATE EFFICIENT COMPRESSION
FOR SEISMIC DATA
Based on a combination of five techniques:
1. Fast 2-D wavelet transform.
2. Rearrangement of wavelet coefficients for efficient
processing.
3. Zerotree coding of correlated coefficients.
4. Gradual successive approximation of the wavelet
coefficients.
5. Lossless entropy coding of the quantized coefficients
using either adaptive arithmetic coding, which is slower but
gives a better compression rate, or adaptive run-length
coding, which is faster but compresses less well than
arithmetic coding.
Fig: Compression algorithm. Stages shown: wavelet coefficients, set threshold,
classify pass, coding pass, arithmetic coder, output; the diagram also marks
the quantization loop.
THANK YOU
