LZ SOURCE CODING TECHNIQUES, OCTOBER 30, 2016
Lempel-Ziv Source Coding Techniques
LITU ROUT
Indian Institute of Space Science and Technology
liturout1997@gmail.com
Abstract—Many data compression techniques are available for
efficient transmission and storage of data in less memory
space. The Lempel-Ziv (LZ) scheme is one of these lossless data
compression techniques. LZ is not a single algorithm but a whole
family of algorithms derived from the basic algorithms proposed
in 1977 and 1978. Today these algorithms are known by the
initials of the authors and the year of publication. LZ77 exploits
the fact that words and phrases in a text are likely to be
repeated: when there is a repetition, it can be replaced by a
pointer to the earlier occurrence, thereby saving memory space.
LZ78 is a dictionary-based technique. A dictionary is created
while the data are being encoded, so encoding can be done on the
fly. The dictionary need not be transmitted, as it can be
regenerated at the receiving end on the fly. If there is an
overflow in the dictionary, one bit is added to the code word
until all the characters are included.
Index Terms—Lossless, LZ77, LZ78, Repetitive text/patterns, Dictionary scheme
I. INTRODUCTION
DATA compression deals with the reduction of the space
needed to store information, thereby reducing the
amount of time required to transmit data. These compression
techniques are mainly based on the identification and isolation
of redundant information. Data are compressed so that they still
meet the minimum requirements of the reconstructed signal.
“All of the books in the world contain no more information
than is broadcast as video in a single large American city
in a single year. Not all bits have equal value.” -Carl Sagan
Some data compression schemes are lossless, i.e. the
exact information can be reconstructed from the transmitted
data. This is needed when we cannot afford to lose any detail
of the information, as with medical imaging, text, computer
executable files, etc.
Some schemes are lossy, i.e. only an approximation of the
information can be reconstructed from the transmitted data. In
some cases an approximate result is all we need. Lossy
compression achieves better compression ratios than lossless
compression. Data such as multimedia images, video and
audio are more easily compressed by lossy compression
techniques because of the way the human auditory and visual
systems work. Human ears cannot detect very small changes
in sound, so rather than sending the whole information,
the data can be compressed and sent with an indistinguishable result.
Lossy compression has a better compression ratio, but it
is limited to audio, video and images, where some loss is
acceptable. Each scheme has its own merits and demerits, so the
question of which one is better is irrelevant here.
There have been quite a few lossless data compression
algorithms based on the probabilistic or dictionary methods
first proposed by Lempel and Ziv in their 1977
and 1978 papers. The dictionary-based Lempel-Ziv scheme
is divided into two families: those derived
from LZ77 (LZ77, LZSS, LZH and LZB) and those derived
from LZ78 (LZ78, LZW and LZFG).
In this paper I give a brief description of the LZ77
series and the LZ78 series.
A. LZ77 Series
LZ77 exploits the fact that the words and phrases in a text
are likely to be repeated. When there is a repetition, the new
occurrence can refer to the previous one by a
pointer. The method needs no prior knowledge of, and makes no
assumptions about, the characteristics of the source.
In the LZ77 approach, the dictionary is simply a portion
of the previously encoded sequence. The encoder examines
the input sequence through a sliding window which consists
of two parts: a search buffer that contains a portion of the
recently encoded sequence and a look-ahead buffer that
contains the next portion of the sequence to be encoded. The
algorithm searches the sliding window for the longest match
with the beginning of the look-ahead buffer and outputs a
reference (a pointer) to that match. It is possible that there is
no match at all, so the output cannot contain just pointers. In
LZ77 the reference is always output as a triple <o,l,c>, where
‘o’ is an offset to the match, ‘l’ is length of the match, and ‘c’
is the next symbol after the match. If there is no match, the
algorithm outputs a null-pointer (both the offset and the match
length equal to 0) and the first symbol in the look-ahead buffer.
The values of the offset and the match length must
be limited to some maximum constants. Moreover, the
compression performance of LZ77 depends mainly on these
values. Usually the offset is encoded in 12-16 bits, so it is
limited to at most 65535 symbols; there is thus no need to
remember more than the last 65535 symbols seen in the sliding
window. The match length is usually encoded in 8 bits,
which gives a maximum match length of 255.
Algorithm of LZ77:

while (lookAheadBuffer not empty)
    get a reference (position, length) to the longest match;
    if (length > 0)
        output (position, length, next symbol);
        shift the window length+1 positions along;
    else
        output (0, 0, first symbol in the look-ahead buffer);
        shift the window 1 character along;
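The pseudocode above can be sketched in Python. This is a naive illustration only: the function names and buffer sizes are my own choices, not from the original paper, and a real implementation would use efficient search structures instead of a linear scan.

```python
def lz77_encode(data, search_size=4096, lookahead_size=15):
    """Encode `data` as (offset, length, next_symbol) triples.
    offset == length == 0 signals a literal with no match."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the window for the longest match with the look-ahead buffer.
        for j in range(max(0, i - search_size), i):
            length = 0
            while (length < lookahead_size
                   and i + length + 1 < len(data)       # keep a next symbol
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the string by copying each match, then the explicit symbol."""
    out = []
    for off, length, sym in triples:
        start = len(out) - off
        for k in range(length):          # symbol-by-symbol copy allows
            out.append(out[start + k])   # overlapping (run-length) matches
        out.append(sym)
    return ''.join(out)
```

Encoding 'abracadabra' this way yields triples such as (0, 0, 'a') for literals and (7, 3, 'a') for the repeated 'abr'.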
There are many ways in which the LZ77 scheme can be made more
efficient, and many of the improvements deal with the efficient
encoding of the triples. There are several variations on the LZ77
scheme; the best known are LZSS, LZH and LZB.
Fig. 1: Lempel-Ziv Derivatives
1) LZSS : LZSS is a lossless data compression algorithm. It is
a derivative of LZ77, created by James Storer and Thomas
Szymanski in 1982. The difference between LZ77 and LZSS
is that LZSS does not allow a reference to be longer than the
string it replaces; it uses a break-even
point as a threshold to omit such references. It also uses a flag
bit to indicate whether the next item is a pointer or a single symbol.
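The break-even idea can be sketched in Python as follows. The threshold and window sizes here are illustrative assumptions, not values from Storer and Szymanski: each token carries a flag selecting between a literal and an (offset, length) pointer, and matches shorter than the break-even length are emitted as literals.

```python
MIN_MATCH = 3   # assumed break-even threshold: shorter matches would cost
                # more to encode as a pointer than as plain literals

def lzss_tokens(data, window=4096, max_len=18):
    """Emit (flag, payload) tokens: flag 0 -> literal symbol,
    flag 1 -> (offset, length) pointer. A naive sketch of the LZSS idea."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= MIN_MATCH:        # pointer only past the break-even point
            out.append((1, (best_off, best_len)))
            i += best_len
        else:
            out.append((0, data[i]))
            i += 1
    return out

def lzss_decode(tokens):
    out = []
    for flag, payload in tokens:
        if flag:
            off, length = payload
            start = len(out) - off
            for k in range(length):
                out.append(out[start + k])
        else:
            out.append(payload)
    return ''.join(out)
```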
2) LZH : LZH is a scheme that combines the Lempel-Ziv
and Huffman techniques. Here coding is performed
in two passes. The first is essentially the same as LZSS, while the
second uses statistics measured in the first pass to code the pointers
and explicit characters using Huffman coding.
Fig. 2 and Fig. 3 indicate that various ASCII files are
compressed to an average bits per character (BPC) which is
a little less than half of the original 8 BPC. Of the LZ77-series
schemes mentioned in this paper, LZB has the lowest average
BPC, around 3.11, and thereby provides a higher compression
rate than the others.
Fig. 2: LZ77 Comparison For various data sets
Fig. 3: LZ77 Comparison For various data sets
B. LZ78 Series
In 1978 Jacob Ziv and Abraham Lempel presented their
dictionary-based scheme, which is known as LZ78. It is a
dictionary-based compression algorithm that maintains an
explicit dictionary. This dictionary has to be built at both the
encoding and the decoding side, and the two sides must follow the same
rules to ensure that they use an identical dictionary. The
codewords output by the algorithm consist of two elements
<i,c>, where 'i' is an index referring to the longest matching
dictionary entry and 'c' is the first non-matching symbol. In
addition to outputting the codeword for storage/transmission,
the algorithm also adds the index-and-symbol pair to the
dictionary. When a symbol is not yet found in the
dictionary, the codeword has the index value 0 and the symbol is added
to the dictionary as well. The algorithm gradually builds up the
dictionary with this method. The algorithm for LZ78 is given
below:
w := NIL;
while (there is input)
    K := next symbol from input;
    if (wK exists in the dictionary)
        w := wK;
    else
        output (index(w), K);
        add wK to the dictionary;
        w := NIL;
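A minimal Python sketch of this dictionary build-up follows. The function names are mine, and a real coder would also bound the dictionary size, as discussed below; index 0 stands for the empty prefix, as in the pseudocode.

```python
def lz78_encode(data):
    """Encode `data` as (index, symbol) pairs; index 0 means 'no prefix'."""
    dictionary = {}          # phrase -> index (1-based)
    w, out = '', []
    for k in data:
        if w + k in dictionary:
            w += k           # keep extending the current phrase
        else:
            out.append((dictionary.get(w, 0), k))
            dictionary[w + k] = len(dictionary) + 1
            w = ''
    if w:                    # flush a trailing phrase that matched fully
        out.append((dictionary[w], ''))
    return out

def lz78_decode(pairs):
    phrases = ['']           # index 0 is the empty phrase
    out = []
    for idx, sym in pairs:
        phrase = phrases[idx] + sym
        out.append(phrase)
        phrases.append(phrase)   # decoder rebuilds the same dictionary
    return ''.join(out)
```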
LZ78 can hold patterns for a longer duration because of
its dictionary-based scheme, but it has the serious
drawback of a large dictionary size. As the number of patterns
increases, the length of the dictionary also increases, and finally it
affects the performance of the encoding process. One of the
main advantages of LZ78 over LZ77 is its dictionary-based
compression technique, which helps in faster encoding. The
important property of LZ77 that LZ78 preserves is that the
decoding process is much faster than the encoding process.
1) LZW : Terry Welch presented his algorithm, based on
LZ78 and LZSS, in 1984. Rather than generating the dictionary
from scratch, he initialized the dictionary with all possible
input alphabet symbols. If the combination of the current letter
and the next letter is not found in the dictionary, the combined
word is added to the dictionary. This guarantees that a match
will always be found, so LZW only ever sends an index into
the dictionary. The input to the encoder is accumulated in a
pattern 'w' as long as 'w' is contained in the dictionary. If the
addition of another letter 'K' results in a pattern 'wK' that is
not in the dictionary, then the index of 'w' is transmitted to the
receiver, the pattern 'wK' is added to the dictionary, and a new
pattern is started with the letter 'K'. The algorithm
then proceeds as follows:
w := NIL;
while (there is input)
    K := next symbol from input;
    if (wK exists in the dictionary)
        w := wK;
    else
        output (index(w));
        add wK to the dictionary;
        w := K;
output (index(w));    // flush the final pattern
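Assuming single-byte input symbols, the encoder can be sketched in Python as follows. The dictionary is pre-loaded with the 256 byte codes, so new entries start at 256, matching the worked example below.

```python
def lzw_encode(data):
    """LZW over single-byte symbols: the dictionary starts with all 256
    codes. A minimal sketch; real coders also bound the code size."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = '', []
    for k in data:
        if w + k in dictionary:
            w += k                       # keep extending the pattern
        else:
            out.append(dictionary[w])    # emit index of longest known prefix
            dictionary[w + k] = len(dictionary)
            w = k                        # start a new pattern with k
    if w:
        out.append(dictionary[w])        # flush the final pattern
    return out
```

For 'thisisthe' this produces exactly the sequence worked out below: [116, 104, 105, 115, 258, 256, 101].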
The LZW algorithm is one of the most useful compression
algorithms of recent decades. The compression and decompression
schemes for the string 'thisisthe' are illustrated below. Each
letter is represented by its ASCII code.
LZW compression Flow Chart :
Current Next Output Add to Dictionary
t(116) h(104) t(116) th(256)
h(104) i(105) h(104) hi(257)
i(105) s(115) i(105) is(258)
s(115) i(105) s(115) si(259)
i(105) s(115)
‘is’ is present in
the dictionary
not added
is(258) t(116) is(258) ist(260)
t(116) h(104)
‘th’ is present
in the dictio-
nary
not added
th(256) e(101) th(256) the(261)
e(101) e(101)
ASCII uses 8 bits to represent each character. In the
uncompressed scheme 9 letters are transmitted, whereas
the LZW scheme transmits only 7 codewords. In the uncompressed
version the total number of bits transmitted is 8*9 = 72. In the
compressed version, say 9 bits are used to represent each word in
the dictionary, so a total of 9*7 = 63 bits are transmitted.

Data transmitted: 116 104 105 115 258 256 101

Percentage of data transmitted = (63/72)*100 = 87.5%
LZW decompression flow:

Assuming that the received data are not altered by the channel,
the data received are: 116 104 105 115 258 256 101. The
decompression process is as follows:

Current   Next   Output          Add to Dictionary
116       104    116             116 104 (256)
104       105    104             104 105 (257)
105       115    105             105 115 (258)
115       258    115             115 105 (259)
258       256    105 115 (258)   105 115 116 (260)
256       101    116 104 (256)   116 104 101 (261)
101       -      101             -
• In the fourth row of the table above, the next code is 258,
which is present in the dictionary, so it is replaced by
the corresponding symbol pair, i.e. 105 and 115.
• In the fifth row, since both codes are already in the
dictionary, we would want to add 258 256, i.e. 105 115 116
104, to the dictionary; but at each step we can add only one
extra symbol. That is why only the first symbol of the
pair 116 104 is added to the dictionary along with 105 115.
• The data obtained at the output of the decompressor are:
116 104 105 115 105 115 116 104 101,
which is the ASCII representation of the string
'thisisthe'.
• Since the exact information has been fully reconstructed,
this is a lossless data compression scheme.
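The decoding steps in the bullets above can be sketched as follows. This is a hedged illustration; the `w + w[0]` branch handles the corner case where a code arrives before the decoder has finished defining it, which does not occur in this example.

```python
def lzw_decode(codes):
    """Rebuild the string from LZW codes, growing the dictionary one
    extra symbol at a time, exactly as the table above does."""
    phrases = {i: chr(i) for i in range(256)}
    w = phrases[codes[0]]
    out = [w]
    for code in codes[1:]:
        # A code may have just been added by the encoder and not yet be
        # known here; in that case the entry must be w plus its own first symbol.
        entry = phrases[code] if code in phrases else w + w[0]
        out.append(entry)
        phrases[len(phrases)] = w + entry[0]   # add one extra symbol only
        w = entry
    return ''.join(out)
```

Feeding in 116 104 105 115 258 256 101 reproduces 'thisisthe'.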
In the original proposal of LZW, the pointer size is 12 bits,
allowing for up to 4096 dictionary entries.
Once this limit is reached, the dictionary becomes static.
2) LZFG: LZFG, developed by Fiala and Greene, gives fast
encoding and decoding and good compression without undue
storage requirements. The algorithm uses the same
dictionary-building technique as LZ78, the only difference
being that it stores the elements in a tree data structure.
Here the encoded characters are placed in a window (as in
LZ77) so that the oldest phrases can be removed from the
dictionary.
The overall performance, in terms of average BPC, of the
LZ78-family coding methods discussed above is shown in
Fig. 4 and Fig. 5.
Fig. 4: LZ78 Comparison For various data sets
Fig. 5: LZ78 Comparison For various data sets
From the figures above it is clear that LZFG provides a better
average BPC than the others. Its average BPC (2.89) is much
lower than that of the other data compression schemes mentioned
in this paper.
Compression Ratio :
The compression ratio relates the size of the uncompressed
data to the size of the compressed data. Most algorithms have a
typical range of compression ratios that they can achieve over a
variety of data sets. Because of this, it is usually more useful
to look at an average compression ratio for a particular method.
Compression affects picture quality: the higher the compression
ratio, the poorer the quality of the resulting image. So while
compressing data or images this fact is taken into consideration.

Compression ratio = (size of original data) / (size of compressed data)
Using the LZW algorithm, 60-70% compression can be achieved
for monochrome images and text files with repeated patterns.
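Applying the formula to the 'thisisthe' example above gives a quick arithmetic check (the helper name is mine, for illustration only):

```python
def compression_ratio(original_bits, compressed_bits):
    """Compression ratio = size of original data / size of compressed data."""
    return original_bits / compressed_bits

original_bits = 9 * 8      # 9 ASCII characters at 8 bits each
compressed_bits = 7 * 9    # 7 LZW codewords at 9 bits each
ratio = compression_ratio(original_bits, compressed_bits)    # 72/63, about 1.14
percent_transmitted = 100 * compressed_bits / original_bits  # 87.5 percent
```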
II. CONCLUSION
Parkinson's Law: data expands to fill the space available.
In this paper, various techniques derived from LZ77 and
LZ78 have been discussed and their algorithms described.
Among the algorithms presented, LZB outperforms the rest of
the LZ77 series with an average BPC of 3.11, and in the LZ78
series LZFG outperforms the rest with an average BPC of
2.89. Lempel and Ziv built the foundation of lossless
compression for most of the algorithms that are widely used
nowadays; few innovative lossless algorithms that are not
derived from the LZ family have appeared since. Many
researchers are working in this area to fit more and more
data into the available space, in keeping with Parkinson's law.
ACKNOWLEDGMENT
I would like to thank my parents for giving me the opportunity
to pursue my dream at this reputed space institute. I would
like to thank Dr. Vineeth B.S. for his continuous guidance
and support; without his encouragement this research would
not have been successful. I am thankful to Dr. V.K. Dadhwal,
Director of IIST, for allowing me to do this research. Finally,
my deepest gratitude goes to Google and YouTube for helping
me build a strong foundation in this area and clear all my
doubts.
REFERENCES
[1] Ziv J. and Lempel A., "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory, 23 (3), pp. 337-
343, May 1977.
[2] Ziv J. and Lempel A., "Compression of Individual Sequences via Variable-
Rate Coding", IEEE Transactions on Information Theory, 24 (5), pp. 530-
536, September 1978.
[3] Huffman D.A., "A Method for the Construction of Minimum-Redundancy
Codes", Proceedings of the Institute of Radio Engineers, 40 (9), pp. 1098-
1101, September 1952.
[4] Shannon C.E., "A Mathematical Theory of Communication", Bell System
Technical Journal, vol. 27, pp. 379-423, July 1948.
[5] Storer J.A. and Szymanski T.G., "Data Compression via Textual Substitution",
Journal of the ACM, 29, pp. 928-951, 1982.
[6] Welch T.A., "A Technique for High-Performance Data Compression", IEEE
Computer, 17, pp. 8-19, 1984.
[7] Banikazemi M., "LZB: Data Compression with Bounded References",
Proceedings of the 2009 Data Compression Conference, IEEE Computer
Society, 2009.
[8] Smith S.W., The Scientist and Engineer's Guide to Digital Signal
Processing.
[9] Salomon D., Data Compression: The Complete Reference.
[10] Hoffman R., Data Compression in Digital Systems (Digital Multimedia
Standards Series).
[11] Cover T.M. and Thomas J.A., Elements of Information Theory.
[12] Faller N., "An Adaptive System for Data Compression", in Record of
the 7th Asilomar Conference on Circuits, Systems and Computers, pp.
593-597, Piscataway, NJ, 1973, IEEE Press.
[13] Fano R.M., "The Transmission of Information", Technical Report No.
65, Research Laboratory of Electronics, M.I.T., Cambridge, Mass., 1949.
[14] Knuth D.E., "Dynamic Huffman Coding", Journal of Algorithms,
6 (2), pp. 163-180, June 1985.
Litu Rout
Indian Institute of Space Science and Technology
Department of Avionics
Bachelor of Technology
Student ID: SC14B101
liturout1997@gmail.com