FINAL PROJECT REPORT
A Statistical Analysis and Extended Implementation of Compression Algorithms
Guided By: Prof. Bhaskar Chaudhary
Prepared By: Dhrumil Shah (201201117)
Abstract
This project report presents a detailed analysis of several compression algorithms and their implementation, including some new functionality developed for them in an attempt to increase their efficiency. A systematic testing procedure was applied to these algorithms, and a comparative analysis of the results and properties exhibited by the different algorithms under different parameters has been tabulated in this report in both statistical and graphical form.
INTRODUCTION
Data compression, as the name suggests, is a technique by which the same amount of information is transmitted using a smaller number of bits; for example, by replacing a string of twenty repeated digits with a command to repeat the digit twenty times.
Data compression can be either lossy or lossless. As the names suggest, in lossless data compression no information is lost: everything present at the beginning is still there, unchanged, at the end. Bits are reduced by identifying and eliminating statistical redundancy.
Data compression shrinks a file so that it takes up less space, which is desirable for both data storage and data communication.
Storage space on disks is expensive, so a file which occupies less disk space is "cheaper" than an uncompressed file. Smaller files are also desirable for data communication, because the smaller a file is, the faster it can be transferred.
A compressed file therefore effectively increases the speed of data transfer over an uncompressed file.
ALGORITHMS COVERED
The algorithms worked on and covered in this report are:
1. Huffman encoding
Finds the minimum-length bit string which can be used to encode a string of symbols.
Huffman encoding is widely used in text compression.
Uses a table of the frequency of occurrence of each symbol or character in the input. This frequency table is derived from the input itself by studying its nature.
Higher-frequency characters receive shorter bit encodings, while lower-frequency characters receive longer ones.
Using a greedy approach, the n characters start out as n single-node sub-trees; at each step the two least-weight nodes are combined into a tree whose root node is assigned the sum of the two weights.
Runs in O(n log n) time (see the sketch below).
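The following is a minimal sketch of Huffman coding in Python; it is illustrative only and not the implementation used in this project, and all names are hypothetical. It builds the frequency table, repeatedly merges the two least-weight nodes using a heap, and derives the bit codes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table (character -> bit string) for `text`."""
    freq = Counter(text)
    # Each heap entry: (weight, tie_breaker, {char: code_built_so_far})
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # edge case: only one distinct symbol
        _, _, codes = heap[0]
        return {ch: "0" for ch in codes}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)       # the two least-weight nodes
        w2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

text = "WWWWWBBWWWWWW"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes, encoded)
```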
2. Run Length Encoding
Runs of data, i.e. long sequences of the same value, are stored as a single value and a count rather than as the original run.
Performs lossless data compression.
Most useful on data which contains many such runs.
For instance, a sequence with runs such as WWWWWBBWWWWWW can be encoded as 5W2B6W, clearly translating to 5 whites, then 2 blues, then 6 whites.
Runs in O(n) time (a minimal sketch follows).
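A minimal run-length encoding and decoding sketch in Python; it is illustrative and not the project's own implementation.

```python
import re

def rle_encode(data: str) -> str:
    """Encode runs of repeated characters as <count><char> pairs."""
    if not data:
        return ""
    out = []
    prev, count = data[0], 1
    for ch in data[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{count}{prev}")
            prev, count = ch, 1
    out.append(f"{count}{prev}")
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Expand <count><char> pairs back into the original run."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))

print(rle_encode("WWWWWBBWWWWWW"))   # -> 5W2B6W
print(rle_decode("5W2B6W"))          # -> WWWWWBBWWWWWW
```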
3. LZW algorithm
LZ77 and LZ78 are lossless data compression algorithms: they form the basis of DEFLATE (used in PNG files) and of GIF.
Lossless data compression algorithms: the original data is completely reconstructed from the compressed data. Lossy data compression, by contrast, only allows an approximation of the original data to be reconstructed (although this improves compression rates).
Lossless compression algorithms are used in ZIP file formats; PNG and GIF use lossless compression, as do 3D graphics formats (OpenCTM) and cryptography.
Both LZ77 and LZ78 are DICTIONARY CODERS (substitution coders), a class of lossless compression algorithms.
They look for matches between the data to be compressed and a set of strings (contained in a data structure maintained by the encoder) called the dictionary.
When a match is found, a reference to the string's position in the data structure is emitted.
In LZ77, compression must start at the beginning of the input itself.
Typically every character is stored with 8 bits (up to 256 unique symbols for this data). LZW extends the codes to 9 to 12 bits per symbol, with new codes assigned to combinations of symbols which occurred previously in the string (see the sketch below).
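A minimal sketch of LZW compression in Python; it is illustrative, and the project's implementation and dictionary layout may differ. It seeds the dictionary with single characters and emits integer codes, adding each new string-plus-character combination to the dictionary as it goes.

```python
def lzw_compress(text: str):
    """Return a list of integer codes for `text` using LZW."""
    # Seed the dictionary with the single characters seen in the input
    # (a full implementation would seed all 256 byte values).
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                  # keep extending the match
        else:
            codes.append(dictionary[current])    # emit code for the longest match
            dictionary[candidate] = next_code    # add the new string to the dictionary
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes, dictionary

codes, dictionary = lzw_compress("WWWWWBBWWWWWW")
print(codes)
```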
NATURE OF DOCUMENT
We did the testing on a full book, as discussed in the meeting.
For the file "graph_theory.txt", the nature of the document is described below in terms of the number of characters, words, sentences and paragraphs:
No. of characters = 267722
No. of words = 51000
No. of sentences = 5400
No. of paragraphs = 5418
Occurrence of vowels
In the same file, "graph_theory.txt", the percentage occurrence of the individual vowel characters is taken into account.
'U' – 196 occurrences of the character 'U' in a text of over 2.5 lakh characters; 'U' occurs about 0.36% of the time.
For all the subsequent text files, we have followed the same procedure for determining the nature of the document.
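The report does not include its analysis script; the following is a hypothetical Python sketch of how such document statistics (character, word, sentence and paragraph counts, plus vowel percentages) could be gathered. The splitting rules are simple assumptions and will not exactly reproduce the numbers above.

```python
import re

def document_stats(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    stats = {
        "characters": len(text),
        "words": len(text.split()),
        # naive sentence split on ., ! and ?
        "sentences": len([s for s in re.split(r"[.!?]+", text) if s.strip()]),
        # paragraphs separated by blank lines
        "paragraphs": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
    }
    for vowel in "aeiou":
        count = text.lower().count(vowel)
        stats[f"vowel_{vowel}_percent"] = 100.0 * count / max(len(text), 1)
    return stats

print(document_stats("graph_theory.txt"))
```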
TABULAR ANALYSIS
Comparison of the four algorithms, by metric (values measured on a 3.38 kB test file unless a different file size is noted):
Original file size: 3.38 kB for all four algorithms (Huffman encoding, run length encoding, LZW, LZW variant).
Compressed file size (approximate, using the percentage compression): Huffman 2.5 kB; run length 3.390 kB; LZW 3.4 kB; LZW variant 2.63 kB.
Compression time: Huffman 0.77 ms; run length 2.95 ms; LZW 19.38 ms for a file size of 1108 kB; LZW variant 32.3 ms for a file size of 500 kB.
Decompression time: Huffman 0.72 ms; run length 0.73 ms; LZW 15.5 ms; LZW variant 0.72 ms for a file size of 500 kB.
Memory space: Huffman ---; run length ---; LZW 500 kB for a file size of 1100 kB; LZW variant 728 kB for a file size of 1100 kB.
Percentage compression: Huffman 40%; run length negative (about -0.5%); LZW 25% for a file size of 1100 kB; LZW variant 22% for a file size of 1100 kB.
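For reference, the percentage compression reported in the table is computed from the original and compressed file sizes in the usual way (this formula is implied by the table rather than stated explicitly in the report):

```latex
\text{percentage compression} = \frac{\text{original size} - \text{compressed size}}{\text{original size}} \times 100\%
```

A negative value, as for run length encoding, means the "compressed" file is actually larger than the original.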
GRAPHICAL ANALYSIS
After implementing all the algorithms mentioned above, we generated graphs highlighting the significant comparisons between these algorithms on some important factors such as compression rate, time taken to compress the files, and memory usage.
1. Original file size versus compressed file size for all four algorithms.
The above graph shows the original file size (in kB) on the x-axis and the compression percentage on the y-axis. Here the least compression is achieved by run length encoding, due to the simplicity of the algorithm, while Huffman coding compresses the original file by about 40% in almost every case. From the graph we can easily see that the LZW algorithm is slightly more efficient in terms of compression than the LZW variant. Huffman coding builds a frequency-ordered heap and assigns shorter codes to more frequent characters, which is what makes it more effective at compression than the other algorithms.
2. Compression time versus original file size.
Here the compression time is shown on the y-axis. Because the LZW variant stores only 2 characters in each dictionary entry, it takes much longer to compress large files than standard LZW, which can easily be seen from the graph. Run length encoding simply counts consecutive occurrences of a character and encodes them with a number, while Huffman encoding builds a heap starting from the least frequently occurring characters.
3. Decompression time versus original file size.
The above graph shows the time required by each algorithm to decompress the compressed file back into the original file. Huffman decoding requires traversing the code tree for every encoded symbol in order to find the character corresponding to the compressed code. This does not take much time for smaller files, but when files are relatively big it takes a considerable amount of time, as shown in the graph.
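A minimal, self-contained sketch of Huffman decoding by tree traversal (illustrative only; the project's actual implementation is not shown here). It rebuilds a decoding tree from a code table such as the one produced in the earlier Huffman sketch and walks it bit by bit.

```python
def build_decode_tree(codes):
    """Turn a {char: bitstring} table into a nested-dict decoding tree."""
    root = {}
    for ch, bits in codes.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = ch
    return root

def huffman_decode(bits, codes):
    root = build_decode_tree(codes)
    out, node = [], root
    for b in bits:
        node = node[b]
        if isinstance(node, str):     # reached a leaf: emit the character
            out.append(node)
            node = root
    return "".join(out)

codes = {"W": "1", "B": "0"}          # example code table
print(huffman_decode("1111100111111", codes))  # -> WWWWWBBWWWWWW
```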
4. Dictionary space utilized when compressing an original file of a given size.
Here the two algorithms compared are LZW and its variant, because both are dictionary-based algorithms that store strings at indexes of a dictionary. The benefit of using the LZW variant is that its dictionary consumes much less space than that of the original LZW when compressing relatively big files. Thus, if the priority is memory management rather than compression efficiency, then of the two the LZW variant should be preferred.
CONCLUSION
The algorithm which we implemented is a variant of LZW in which we restricted the length of each dictionary entry to 2 characters. This was done in order to avoid searching for long strings in the dictionary entries (a sketch of this restriction is given below).
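A hypothetical sketch of such a restricted variant, based only on the description above (the project's actual code is not reproduced here): it behaves like LZW but refuses to add dictionary entries longer than 2 characters, so the dictionary stays small at the cost of some compression.

```python
def lzw_variant_compress(text: str, max_entry_len: int = 2):
    """LZW-style compression with dictionary entries capped at `max_entry_len` characters."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            if len(candidate) <= max_entry_len:   # only short entries are added
                dictionary[candidate] = next_code
                next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes, dictionary

codes, small_dict = lzw_variant_compress("WWWWWBBWWWWWW")
print(len(small_dict), codes)
```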
Observing the graph (Compressed Files vs Decompressed Files), we found that we were not able to achieve a higher compression than with LZW, so in terms of compression performance the modified version of LZW was of little use.
What we did achieve with this algorithm, however, is that its dictionary consumed much less memory than that of normal LZW, which can be clearly seen from the graph (Memory Space vs Original File).
The algorithm designed is therefore more efficient in memory usage than in compression performance.
The algorithm designed is used to compress text files.