FINAL PROJECT REPORT
A Statistical Analysis and Extended Implementation of
Compression Algorithms
Guided By: Prof. Bhaskar Chaudhary
Prepared By: Dhrumil Shah (201201117)
Abstract
This report presents a detailed analysis of several compression algorithms and their
implementation, together with new functionality developed for them in an attempt to
increase their efficiency. A systematic testing procedure was applied to these
compression algorithms, and a comparative analysis of the results and properties they
exhibit under different parameters is tabulated in this report in both statistical and
graphical form.
INTRODUCTION

Data compression, as the name suggests, is a technique by which the same information is
conveyed using a smaller number of bits; for example, by replacing a string of twenty repeated
digits with a command to repeat the digit twenty times.

Data compression can be either lossy or lossless. As the name suggests, in lossless data
compression no information is lost: all of the information present at the beginning is intact at the
end. Bits are reduced by identifying and eliminating statistical redundancy.





Data compression shrinks a file so that it takes up less space. This is desirable for both data
storage and data communication.

Storage space on disks is expensive, so a file which occupies less disk space is "cheaper" than an
uncompressed file. Smaller files are also desirable for data communication: the smaller the
file, the faster it can be transferred. A compressed file thus effectively increases the speed of
data transfer over an uncompressed one.



ALGORITHMS COVERED
The algorithms implemented and covered in this report are:
1. Huffman encoding
 Finds the minimum-length bit string that can be used to encode a given string of symbols.
Huffman encoding is widely used in text compression.

 Uses a table of the frequency of occurrence of each symbol or character in the input. This
frequency table is derived from the input itself by studying its nature.

 Higher-frequency letters receive shorter encodings than lower-frequency letters, which
receive longer ones.

 Following a greedy approach, the n characters start out as n subtrees; the two least-weight
nodes are repeatedly combined into a tree whose root node is assigned the sum of the two
nodes' weights.

 Runs in O(n log n) time.
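The construction described above can be sketched in Python. This is an illustrative sketch, not the report's own implementation; it returns the code table as a dict mapping each character to its bit string:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Frequency table derived from the input itself.
    freq = Counter(text)
    # Each heap entry: (weight, tiebreaker, {symbol: code-so-far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate input with one distinct symbol
        return {ch: "0" for _, _, d in heap for ch in d}
    tiebreak = len(heap)
    while len(heap) > 1:
        # Greedy step: combine the two least-weight subtrees.
        w1, _, lo = heapq.heappop(heap)
        w2, _, hi = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in lo.items()}
        merged.update({ch: "1" + code for ch, code in hi.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("this is an example")
encoded = "".join(codes[ch] for ch in "this is an example")
```

Higher-frequency characters end up with shorter bit strings, and because the codes come from a tree the table is prefix-free.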


2. Run Length Encoding
 Runs of data (long sequences of the same value) are stored as a single value and a count
rather than as the original sequence.

 Performs lossless data compression.

 Most useful on data which contains many such runs.

 For instance, a sequence with runs such as WWWWWBBWWWWWW can be encoded
as 5W2B6W, translating to 5 whites, then 2 blues, then 6 whites.

 Runs in O(n) time.
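A minimal Python sketch of the scheme just described (illustrative; counts are written in decimal, so it assumes the input itself contains no digit characters):

```python
from itertools import groupby

def rle_encode(s):
    # Each run of identical characters becomes <count><char>.
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

def rle_decode(encoded):
    # Rebuild runs; a count may span several digits (e.g. "12W").
    out, digits = [], ""
    for ch in encoded:
        if ch.isdigit():
            digits += ch
        else:
            out.append(ch * int(digits))
            digits = ""
    return "".join(out)

print(rle_encode("WWWWWBBWWWWWW"))  # prints 5W2B6W
```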
3. LZW algorithm
 LZ77 and LZ78 are lossless data compression algorithms; they form the basis of DEFLATE
(used in PNG files) and of GIF.

 Lossless data compression algorithms reconstruct the original data completely from the
compressed data; lossy compression reconstructs only an approximation of the original data
(although this improves compression rates).

 Lossless compression algorithms are used in the ZIP file format; PNG and GIF use lossless
compression, as do 3D graphics formats (OpenCTM) and cryptography.

 Both LZ77 and LZ78 are DICTIONARY CODERS (substitution coders), a class of lossless
compression algorithms.

 They look for matches between the data to be compressed and a set of strings, called the
dictionary, held in a data structure maintained by the encoder.

 When a match is found, a reference to the string's position in that data structure is emitted.

 In LZ77, decompression must start at the beginning of the input, since back-references
point into already-decoded data.

 Typically every character is stored with 8 bits (allowing up to 256 unique symbols). LZW
extends this to 9 to 12 bits per code, assigning new codes to combinations of symbols
which occurred previously in the string.
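The LZW encoding loop can be sketched as follows. This is an illustrative sketch rather than the report's implementation: it starts from the 256 single-byte symbols and emits a list of integer codes, leaving the 9-to-12-bit packing aside:

```python
def lzw_encode(data):
    # Dictionary starts with the 256 single-character strings.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current, out = "", []
    for ch in data:
        if current + ch in table:
            current += ch               # keep extending the match
        else:
            out.append(table[current])  # emit the longest match found
            table[current + ch] = next_code  # learn a new combination
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT")  # 16 codes for 24 characters
```

Repeated substrings such as "TOBE" are eventually emitted as a single code, which is where the compression comes from.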
NATURE OF DOCUMENT
 We did the testing on a complete book, as discussed in the meeting.

 For the file “graph_theory.txt”, the nature of the document is described below, classified by
the number of characters, words, sentences, and paragraphs:
No. of characters = 267722
No. of words = 51000
No. of sentences = 5400
No. of paragraphs = 5418

 Occurrence of vowels

In the same file, “graph_theory.txt”, the percentage occurrence of each individual vowel character was
taken into account. For example, ‘U’ has 196 occurrences in a text of over 2.5 lakh characters,
occurring about 0.36% of the time.

 For all subsequent text files, we followed the same procedure for determining the nature of the
document.
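The character and vowel statistics above can be gathered with a short script along these lines (an illustrative sketch; the report's own counting procedure is not shown):

```python
from collections import Counter

def document_profile(text):
    # Character count, word count, and percentage occurrence of each vowel.
    counts = Counter(text)
    total = len(text)
    vowel_pct = {v: 100.0 * counts[v] / total for v in "aeiouAEIOU"}
    return {"characters": total,
            "words": len(text.split()),
            "vowel_percent": vowel_pct}

# e.g. profile = document_profile(open("graph_theory.txt").read())
profile = document_profile("Data compression shrinks a file.")
```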
TABULAR ANALYSIS
                         Huffman Encoding    Run Length Encoding   LZW algorithm             LZW variant
Original file size       3.38 kilobytes      3.38 kilobytes        3.38 kilobytes            3.38 kilobytes
Compressed file size     2.5 kilobytes       3.390 kilobytes       3.4 kilobytes             2.63 kilobytes
Compression time         0.77 milliseconds   2.95 milliseconds     19.38 ms (1108 kB file)   32.3 ms (500 kB file)
Decompression time       0.72 milliseconds   0.73 milliseconds     15.5 ms (500 kB file)     0.72 milliseconds
Memory space             ---                 ---                   500 kB (1100 kB file)     728 kB (1100 kB file)
Percentage compression   40%                 -0.5% (expansion)     25% (1100 kB file)        22% (1100 kB file)

(Compressed file sizes are approximate, computed from the percentage compression.)
GRAPHICAL ANALYSIS
After implementing all the algorithms mentioned above, we generated graphs highlighting
significant comparisons between these algorithms based on important factors such as
compression rate, the time taken to compress files, and other factors.
1. Original file size versus compressed file size for all four algorithms.
The graph shows the original file size (in kB) on the X-axis and the compression percentage on the Y-axis.
The least compression is achieved by run-length encoding, owing to the simplicity of the algorithm, while
Huffman coding achieves about 40% compression of the original file size in almost every case. From the
graph we can also see that the lzw algorithm is slightly more efficient in terms of compression than
lzw_variant. Huffman coding works by building a heap, which makes it more effective at achieving
compression than the other algorithms.
2. Compression time versus original file size.
Here compression time is shown on the Y-axis. Because lzw_variant stores only 2-character strings at its
dictionary indexes, it takes much longer than lzw to compress large files, as can easily be seen from the
graph. Run-length encoding counts the occurrences of characters and encodes each run with a number,
while Huffman encoding builds a heap starting from the least frequently occurring characters.
3. Decompression time versus original file size.
The graph above shows the time each algorithm requires to decompress the compressed file back into the
original. Huffman decoding must traverse the heap each time to find the character corresponding to a
compressed code. This does not take much time on smaller files, but on relatively big files it takes a
large amount of time, as shown in the graph.
4. Dictionary space used to compress the original file, for various file sizes.
The two algorithms compared here are lzw and its variant, because both are dictionary-based algorithms
that store particular strings at dictionary indexes. The benefit of lzw_variant is that its dictionary
consumes much less space than the original lzw's when compressing relatively big files. Thus, if the
priority is space management rather than compression efficiency, lzw_variant should be preferred of
the two.
CONCLUSION
 The algorithm we implemented is a variant of LZW in which we restricted dictionary entries
to a length of 2 characters. This was done to avoid searching for long strings among the
dictionary indexes.

 Observing the graph (Compressed Files vs Decompressed Files), we found that we could not
achieve higher compression than LZW; in terms of compression performance, the modified
LZW was of little use.

 What we did achieve is that the variant's dictionary consumed much less memory than
normal LZW's, as can be clearly seen from the graph (Memory Space vs Original File).

 The algorithm designed is thus more efficient in memory usage than in compression performance.
It is used to compress text files.
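The report does not reproduce the variant's source code, so the following Python sketch is hypothetical: it applies the restriction described above by never adding dictionary entries longer than 2 characters (the max_entry_len parameter is our own name for this):

```python
def lzw_variant_encode(data, max_entry_len=2):
    # Standard LZW loop, except overly long entries are never learned,
    # keeping the dictionary (and its memory footprint) small.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current, out = "", []
    for ch in data:
        if current + ch in table:
            current += ch
        else:
            out.append(table[current])
            if len(current + ch) <= max_entry_len:  # the restriction
                table[current + ch] = next_code
                next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out
```

On the sample string TOBEORNOTTOBEORTOBEORNOT this emits 17 codes where plain LZW emits 16, consistent with the observation that the variant trades a little compression for a much smaller dictionary.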
