International Journal of Advanced Engineering, Management and Science (IJAEMS) [Vol-6, Issue-6, Jun-2020]
https://dx.doi.org/10.22161/ijaems.66.2 ISSN: 2454-1311
www.ijaems.com Page | 231
Comparison of Compression Algorithms in Text
Data for Data Mining
Roya Mahmoudi*
, Mansoureh Zare
Department of Computer and Information Technology Engineering, Sabzevar branch, Islamic Azad University, Sabzevar, Iran.
*Corresponding Author
Abstract— Text data compression is the process of reducing text size by encoding it. In practice we sometimes reach the point where the physical limitations of a text, or of data sent along with it, must be addressed more quickly, so text compression methods need to be adaptive and robust. Compressing text data without losing the original content is important because it significantly reduces storage and communication costs. This study focuses on different techniques for encoding text files and compares them for data processing. Several memory-efficient encoding schemes are analyzed and implemented in this work: Shannon-Fano coding, Huffman coding, Repeated Huffman coding, LZW coding, and Run-length coding. The analysis shows how these coding techniques work, how much compression each achieves, and how much memory each technique requires, making it possible to compare them and determine which technique is better under which conditions. Experiments show that Repeated Huffman coding achieves a higher compression ratio. In addition, the modified RLE coding performs better than the standard version when compressing text data.
Keywords— data compression, data mining, encoding, text data.
I. INTRODUCTION
In computer science and information theory, data compression encodes information using fewer bits or symbols than the original representation. Data compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth [3]. In the present situation, where we face an abundance of data, data mining together with data refinement and compression is used to cope. The problem, however, is that heavy compression may cause data loss [2]. Data is one of the most valuable assets of any organization, and the loss of some information can lead to major problems for an organization's decisions and strategies. The purpose here is to present and implement some data compression algorithms efficiently and to compare them with the methods used for data compression in data mining. This includes calculating some of the important compression factors for each of these algorithms, comparing different encoding techniques, improving the performance of different data compression techniques, and selecting an encoding method suited to the textual data considered in data mining.
Data mining, which has become very popular in recent years, refers to the use of data analysis tools to discover previously unknown, valid patterns and relationships [13]. These tools may include statistical models, mathematical algorithms, and machine learning methods that improve automatically with experience, for example through neural networks or decision trees. Data mining is not limited to data collection and management; it also involves the analysis of data and its refinement and compression. Applications that explore data by examining text or multimedia files feed the data warehouses that have become popular [10]. This type of warehouse stores and processes a variety of data types, from text to images, for use in organizational strategies and for a better understanding of data as information. Their advantage over plain databases is the ability to refine and compress data so as to preserve important data and delete trivial data [5]. Each database has its own query language. A long-standing problem in data analysis is the increasing size of data sets. This trend demonstrates the need for more efficient compression schemes, and also for analytical operations that work directly on compressed data. Efficient compression schemes can be designed by exploiting patterns and intrinsic structures in the data [9]. Redundancy in the data is one such feature that can significantly enhance compression.
Text compression is a process of reducing text size by encoding its information more compactly, so that fewer bits and bytes are needed to store the data [1]. It requires a lossless technique that does not lose any data when compressing a text file, so that decompression returns the text file to its original state [4]. Text compression and decompression techniques are intended for natural languages such as English, whose large data sets exhibit high redundancy, and for other data with a similar sequential structure, such as program source code. However, these methods can be applied to any data type to achieve some compression [11]. The importance of text compression lies in reducing storage hardware, data transfer time, and bandwidth: file compression can significantly reduce the cost of a hard drive or solid-state drive, improve effective network bandwidth, and save costs [7]. In this study, English text data of various sizes is used to apply the compression and decompression techniques. Text compression reduces data storage space by reducing or eliminating redundancy: the encoding uses fewer bits than the original representation and eliminates unnecessary information. It is a lossless process in which the number of bits and bytes needed to store the information is reduced.
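The lossless round-trip property described above can be illustrated with the simplest of the schemes studied in this paper, run-length encoding. The following is an illustrative sketch using (symbol, run-length) pairs, not the authors' implementation:

```python
def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse consecutive duplicates into (symbol, run_length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (symbol, run_length) pairs back to the original text."""
    return "".join(ch * n for ch, n in runs)

text = "aaabbbbcd"
runs = rle_encode(text)                # [('a', 3), ('b', 4), ('c', 1), ('d', 1)]
assert rle_decode(runs) == text        # lossless: decompression restores the input
```

Decoding restores the input exactly, which is the defining property of a lossless scheme; whether the pair list is actually smaller than the input depends entirely on how many consecutive duplicates the input contains.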
Previous investigations
Various methods for compressing data have been developed over roughly forty years. Shannon, Fano, and Huffman developed compression algorithms in the late 1940s [10]. In 1949, Shannon and Fano devised a compression method that systematically assigns code words based on the probability of blocks, known as Shannon-Fano coding. Another method of compressing data was introduced in 1951 by Huffman, known as the Huffman algorithm, which has been used for compression ever since.
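Huffman's construction can be sketched in a few lines: repeatedly merge the two least frequent symbol groups, prefixing their codes with 0 and 1 as they merge. This is an illustrative sketch of the classical algorithm, not the implementation evaluated in this paper:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (total frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                     # degenerate case: one distinct symbol
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)    # the two least frequent groups
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' occurs most often (5 times), so its code is never longer than any other
assert all(len(codes['a']) <= len(codes[s]) for s in codes)
```

The resulting table is prefix-free by construction (no code is a prefix of another), so the concatenated bit stream can be decoded unambiguously.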
A review [13] has shown that text data compression using the Shannon-Fano algorithm performs the same as the Huffman algorithm when all characters in a string repeat equally often, and likewise when the expression is short and only one character in the text repeats. When the input is long text whose strings or words contain a more varied mix of characters, the Huffman algorithm has a greater impact on compression.
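For contrast with Huffman's bottom-up merging, Shannon-Fano coding works top-down: symbols sorted by frequency are recursively split into two groups of as near-equal total frequency as possible, one group extending its codes with 0 and the other with 1. A minimal sketch of the classical scheme (again, not the paper's implementation):

```python
from collections import Counter

def shannon_fano_codes(text: str) -> dict[str, str]:
    """Assign codes by recursively splitting the frequency-sorted symbol list
    into two halves of roughly equal total frequency."""
    freq = sorted(Counter(text).items(), key=lambda kv: -kv[1])
    codes = {s: "" for s, _ in freq}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(f for _, f in group)
        running, cut = 0, 1
        for i, (_, f) in enumerate(group[:-1], start=1):
            running += f
            cut = i
            if running >= total / 2:   # stop once the first half is "heavy enough"
                break
        for s, _ in group[:cut]:
            codes[s] += "0"
        for s, _ in group[cut:]:
            codes[s] += "1"
        split(group[:cut])
        split(group[cut:])

    split(freq)
    return codes

codes = shannon_fano_codes("abracadabra")
# frequent symbols end up in small groups early and thus get short codes
assert len(codes['a']) <= len(codes['c'])
```

On "abracadabra" this yields a prefix-free table in which 'a' (5 occurrences) gets a 2-bit code while the rare 'c' and 'd' get 3-bit codes, matching the review's observation that the two algorithms behave similarly on such inputs.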
In another review [2], the compression ratio, compression time, and decompression time of the RLE, Huffman, Shannon-Fano, Adaptive Huffman, arithmetic, and Lempel-Ziv-Welch (LZW) coding algorithms were compared using random text files. The results show that compression time increases as the text file size increases. Arithmetic coding was less time-consuming than the other algorithms, including both the Huffman and Shannon-Fano approaches. The LZW algorithm worked well only for small files. The compression time of the Huffman algorithm grew with the encoded size and was among the highest. The decompression time of all algorithms was less than 500,000 milliseconds, except for the Adaptive Huffman and LZW algorithms. The compression ratio was similar for small files, except for RLE coding; LZW coding worked best.
In another study [9], a comparison was made between RLE, Huffman, and LZW coding, as well as LZW followed by Huffman and finally RLE, on random doc, txt, bmp, tiff, gif, and jpg files. The study showed that LZW and Huffman give almost the same results when used to compress text files.
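Unlike the statistical coders above, LZW is a dictionary method: it learns repeated phrases on the fly and emits dictionary indices. A sketch of the encoder (for brevity, the initial dictionary here holds only the symbols actually seen, rather than the full 256-byte alphabet of standard LZW, so a decoder would have to rebuild the same starting dictionary):

```python
def lzw_encode(text: str) -> list[int]:
    """LZW: grow a phrase dictionary on the fly and emit dictionary indices."""
    # Initial dictionary: the distinct characters in order of first appearance
    dictionary = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                               # keep extending the phrase
        else:
            out.append(dictionary[phrase])             # emit the longest known phrase
            dictionary[phrase + ch] = len(dictionary)  # learn the new phrase
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

codes = lzw_encode("abababab")
# repeated substrings collapse into ever-longer dictionary phrases,
# so 8 input characters become only 5 output codes
assert len(codes) < len("abababab")
```

This is why LZW shines on text with recurring words and phrases: the dictionary grows to cover them, and each occurrence after the first costs a single index.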
In another study [4], data compression algorithms such as LZW, Huffman, fixed-length code (FLC), and Huffman after fixed-length code (HFLC) were examined on English text files in terms of compressed size, compression ratio, time (speed), and entropy. The best algorithm across all compression measures was LZW, followed by Huffman, Huffman after fixed-length code (HFLC), and fixed-length code (FLC), with entropies of 4.719, 4.855, 5.014, and 6.889, respectively.
A similar study [5] analyzed the Huffman algorithm and compared it with other common compression techniques such as arithmetic coding, LZW, and RLE, based on their use in different applications and their benefits. Huffman coding is very efficient, and RLE coding significantly reduces file size when sequences of identical pixels occur frequently and can be stored with fewer bits. The LZW algorithm is mostly used for TIFF, GIF, and text files; it is a fast, lossless algorithm that is very easy to use, while Huffman coding is used in JPEG compression, where it produces optimal, compact codes but is relatively slow.
In another paper [7], Shannon-Fano coding, Huffman coding, Adaptive Huffman coding, RLE, arithmetic coding, LZ77, LZ78, and LZW were tested on the Calgary corpus. Among the statistical compression techniques, arithmetic coding was more effective than the other methods. In a further entropy review, an English text file was encoded with Shannon-Fano coding, Huffman coding, Run-length encoding (RLE), and Lempel-Ziv-Welch (LZW) coding. The compression ratio of Shannon-Fano and Huffman coding was almost the same, and the two algorithms can save 54.7% of space. The compression ratio of the Lempel-Ziv-Welch algorithm is low compared to the Huffman and Shannon-Fano algorithms, and it was concluded that Huffman coding gives the best result for text files.
In another study [12], data was first compressed by a run-length-based code, such as the Golomb code, FDR, EFDR, MFDR, SAFDR, or OLEL coding, and the result was then compressed again with Huffman coding. This double compression with the Huffman code achieved a 50.8% compression ratio, with better results for data sets containing incremental data.
A study [1] analyzed execution time, compression ratio, and compression efficiency in a distributed client-server environment using four compression algorithms: the Huffman, Shannon-Fano, LZW, and Run-length algorithms. A client's data is distributed across multiple processors/servers, compressed by servers in remote locations, and sent back to the client. The Sigrid framework was used, and the results showed that the LZ algorithm achieves better efficiency/scalability and compression ratio, but is slower than the other algorithms. Huffman coding, LZW coding, and LZW-based Huffman coding were also compared for multiple and single compression. This showed that LZW-based Huffman coding can compress data more than the other techniques: its maximum compression ratio was 4.41, whereas plain LZW achieved 4.17. LZW-based Huffman compression is therefore in some cases better than LZW compression.
II. METHOD
As a test, five compression and decompression techniques were used: Shannon-Fano coding, Huffman coding, Repeated Huffman coding, Lempel-Ziv-Welch coding, and Run-length coding together with its modified variant. Text files of different sizes were used as data sets, and the compression ratio was measured to determine which compression method is best.
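The compression ratio used throughout this comparison is simply output size divided by input size, so values above 1 mean the "compressed" file grew. A sketch of the measurement harness; zlib is used here only as a stand-in compressor for illustration (it is not one of the five techniques compared in this paper):

```python
import zlib

def compression_ratio(data: bytes, compress) -> float:
    """Ratio = compressed size / original size; values above 1.0 mean expansion."""
    return len(compress(data)) / len(data)

# Repetitive English text compresses well, so the ratio falls below 1.0
text = b"the quick brown fox jumps over the lazy dog " * 100
ratio = compression_ratio(text, zlib.compress)
assert ratio < 1.0
```

Running the same harness over each algorithm and each input file yields the per-file ratios reported in Table 1.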
III. RESULTS
The compression ratio depends on the size of the output file: the more compressed the output file, the lower the compression ratio. When the compression ratio exceeds 1, the output file is larger than the original input file. Here we can see that for most input files the modified RLE coding gives a compression ratio greater than 1, meaning that this algorithm expands the original files rather than compressing them. RLE coding works best for files with consecutive duplicates, which typical text files do not contain; for this reason its compression ratio is greater than 1. The results also show that Shannon-Fano coding is better than Huffman coding for most inputs, since for some input files the symbol code lengths in Huffman coding become so large that the output files grow beyond the original input files. It can be said that Repeated Huffman coding shows the best compression ratio for most files.
If the codes are shorter, the algorithms need less memory. Considering the average code length, Run-length coding has no value because it creates no code table. Among the other techniques, Shannon-Fano coding produces a shorter average code length than Huffman coding and modified Run-length coding, and thus the best average code length for most files. Likewise, the smaller the standard deviation of the code lengths, the less memory the algorithm occupies. Here, too, Run-length coding has no value for the standard deviation, since it creates no codes; among the other techniques, Shannon-Fano coding has a smaller standard deviation than Huffman coding and modified Run-length coding, giving the best standard deviation for most files.
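The memory argument above rests on the frequency-weighted mean and standard deviation of the per-symbol code lengths. A sketch of that computation, using a small hypothetical code table purely for illustration:

```python
import math
from collections import Counter

def code_length_stats(text: str, codes: dict[str, str]) -> tuple[float, float]:
    """Frequency-weighted mean and standard deviation of code lengths."""
    freq = Counter(text)
    total = sum(freq.values())
    mean = sum(freq[s] * len(codes[s]) for s in freq) / total
    var = sum(freq[s] * (len(codes[s]) - mean) ** 2 for s in freq) / total
    return mean, math.sqrt(var)

# Hypothetical prefix-free table: 'a' is frequent, so it gets the 1-bit code
codes = {"a": "0", "b": "10", "c": "11"}
mean, sd = code_length_stats("aaabbc", codes)
# mean = (3*1 + 2*2 + 1*2) / 6 = 1.5 bits per symbol
assert abs(mean - 1.5) < 1e-9
```

A smaller mean means a shorter encoded stream, and a smaller standard deviation means the code lengths (and hence the table) are more uniform, which is the sense in which the comparison above ranks the algorithms' memory use.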
From the above analysis, we can say that Repeated Huffman coding works best in most cases: it has a smaller compression ratio, average code length, and standard deviation, so it appears to be the best algorithm among them. It ultimately requires less memory than the others for the output file after the final pass.
If extra runtime is available, Repeated Huffman coding is worthwhile; at lower runtimes, Shannon-Fano coding works quite well. RLE coding can be very effective for files with consecutive duplicates, and modified RLE coding can be very useful if the intermediate file created by applying Huffman coding to the input contains consecutive runs of 0s and 1s. Finally, each compression technique has its pros and cons; how much compression can be achieved depends on the content of the input files. Table 1 shows the sizes of the four input files after applying the compression algorithms.
Table 1 - Comparison of compression algorithms

Input File | Characteristics    | Huffman | Shannon-Fano | Run-length | Modified Run-length
Size (KB)  |                    | Coding  | Coding       | Coding     | Coding
1561       | Output File Size   | 87.5 KB | 3.89 KB      | 291 KB     | 163 KB
           | Compression Ratio  | 0.58    | 0.04         | 1.96       | 1.09
           | Avg. Code Length   | 12.38   | 6.35         | -          | 9.83
           | Standard Deviation | 12.33   | 0.19         | -          | 15.33
1172       | Output File Size   | 77.3 KB | 91.8 KB      | 245 KB     | 164 KB
           | Compression Ratio  | 0.54    | 0.83         | 1.96       | 1.39
           | Avg. Code Length   | 8.95    | 6.13         | -          | 9.27
           | Standard Deviation | 13.89   | 0.10         | -          | 14.77
4533       | Output File Size   | 698 KB  | 339 KB       | 388 KB     | 581 KB
           | Compression Ratio  | 1.87    | 0.87         | 0.91       | 1.41
           | Avg. Code Length   | 26.99   | 6.49         | -          | 28.12
           | Standard Deviation | 414.78  | 0.27         | -          | 412.79
4704       | Output File Size   | 765 KB  | 387 KB       | 919 KB     | 477 KB
           | Compression Ratio  | 1.79    | 0.79         | 1.91       | 1.01
           | Avg. Code Length   | 34.09   | 6.36         | -          | 34.16
           | Standard Deviation | 450.87  | 0.28         | -          | 447.01
IV. CONCLUSION
Text compression is very important for data refinement in data mining, because textual data is an important part of the data that data mining handles. It therefore requires a lossless technique, so that no useful information is lost when compressing. The investigations in this study computed and compared the compression ratio, mean code length, and standard deviation of Shannon-Fano coding, Huffman coding, Repeated Huffman coding, Run-length coding, and modified Run-length coding, and examined how each one compresses, so that the most effective algorithm can be chosen based on the size of the input text file, the type of content, the available memory, and the execution time. A new approach to data compression, called the "Run-length Run Algorithm", offers much better compression than the run-length algorithm.
REFERENCES
[1] P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J.
Weickert, “Evaluating the true potential of diffusion-based
inpainting in a compression context,” Signal Processing:
Image Commun., No. 46,40–53 (2016).
[2] M. N. Huda, “Study on Huffman Coding,” Graduate
Thesis, 2004.
[3] The Canterbury Corpus [Online].
http://corpus.canterbury.ac.nz/resources/cantrbry.zip.
Accessed 30 May (2018)
[4] Habib, A., Rahman, M.S.: Balancing decoding speed and
memory usage for Huffman codes using quaternary tree.
Appl. Inf. 4, 5 (2017). https://doi.org/10.1186/s40535-016-
0032-z
[5] S.R. Kodituwakku and U. S. Amarasinghe, “Comparison of
Lossless Data Compression Algorithms for Text Data,”
Indian Journal of Computer Science and Engineering, vol.
I(4), 2007, pp. 416-426.
[6] C. Lamorahan, B. Pinontoan and N. Nainggolan, “ Data
Compression Using Shannon-Fano Algorithm,” JdC, Vol.
2, No. 2, September, 2013, pp. 10-17.
[7] M. A. Khan, “Evaluation of Basic Data Compression
Algorithms in a Distributed Environment,” Journal of Basic
& Applied Sciences, Vol. 8, 2012, pp. 362-365.
[8] D. Belazzougui, S.J. Puglisi, Range predecessor and
Lempel–Ziv parsing, in: Proc. 27th Annual ACM–SIAM
Symposium on Discrete Algorithms (SODA), 2016, pp.
2053–2071.
[9] P. Bille, M.B. Ettienne, I.L. Gørtz, H.W. Vildhøj, Time-
space trade-offs for Lempel–Ziv compressed indexing,
Theoret. Comput. Sci. 713 (2018) 66–77.
[10] D. Kempa, D. Kosolobov, LZ-End parsing in compressed
space, in: Proc. 27th Data Compression Conference (DCC),
2017, pp. 350–359, Extended version in
https://arxiv.org/pdf/1611.01769.pdf.
[11] G. Navarro, A self-index on Block Trees, in: Proc. 24th
International Symposium on String Processing and
Information Retrieval (SPIRE), in: LNCS, vol. 10508,
2017, pp. 278–289.
[12] T. Nishimoto, T. I, S. Inenaga, H. Bannai, M. Takeda,
Dynamic index and LZ factorization in compressed space,
in: Proc. Prague Stringology Conference (PSC), 2016, pp.
158–170.
[13] D. Belazzougui, S.J. Puglisi, Y. Tabei, Access, rank, select
in grammar-compressed strings, in: Proc. 23rd Annual
European Symposium on Algorithms (ESA), in: LNCS,
vol. 9294, 2015, pp. 142–154.
[14] D. Kempa, N. Prezza, At the roots of dictionary
compression: string attractors, in: Proc. 50th Annual ACM
SIGACT Symposium on Theory of Computing (STOC),
2018, pp. 827–840.

WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 

Recently uploaded (20)

★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 

Data compression is useful because it helps reduce the consumption of expensive resources, such as hard-disk space and transmission bandwidth [3]. In the present situation, where we face an abundance of data, data mining relies on data refinement and compression; the risk, however, is that aggressive compression may cause data loss [2]. Data is one of the most valuable assets of any organization, and losing even part of it can seriously harm the organization's decisions and strategies. The purpose of this work is to implement several data compression algorithms efficiently and compare them with the methods used for data compression in data mining: calculating the important compression factors for each algorithm, comparing the different encoding techniques, improving their performance, and selecting an encoding method suitable for the textual data under consideration.

Data mining, which has become very popular in recent years, refers to the use of data analysis tools to discover valid patterns and relationships that were previously unknown [13]. These tools may include statistical models, mathematical algorithms, and machine learning methods that improve automatically with experience, such as neural networks or decision trees. Data mining is not limited to data collection and management; it also involves the analysis, refinement, and compression of data. Applications that mine text or multimedia files rely on data warehouses, which have become popular [10]. This type of warehouse stores and processes a variety of data types, from text to images, for use in organizational strategies and for a better understanding of data as information. Their advantage over ordinary databases is the ability to refine and compress data, preserving important data and deleting trivial data [5]. Each database has its own query language. A long-standing problem in data analysis is the growing size of data sets. This trend demonstrates the need for more efficient compression schemes, and also for analytical operations that work directly on compressed data. Efficient compression schemes can be designed by exploiting patterns and intrinsic structures in the data [9]; data repetition is one such feature that can significantly improve compression.

Text compression is a process that reduces text size by encoding the same information in fewer bits and bytes [1]. It requires a lossless technique, so that no data is lost when a text file is compressed and the file can be restored to its original state [4]. Text compression and decompression techniques are intended for natural languages, such as large, highly redundant English data sets, and for other data with a similar sequential structure, such as program source code; nevertheless, the approach can be applied to any data type to achieve some compression [11]. The importance of text compression lies in reducing storage hardware, data transfer time, and bandwidth: file compression can significantly cut the cost of a hard disk or solid-state drive and improve effective network bandwidth [7]. In this work, English text files of various sizes are used to apply the compression and decompression techniques. Compression works by reducing or eliminating redundancy: the encoder uses fewer bits than the original representation and discards unnecessary information, losslessly cutting the number of bits and bytes needed to store the data.

Previous investigations
Methods for compressing data have been developing for about forty years. Shannon, Fano, and Huffman developed the first compression algorithms in the late 1940s [10].
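The core idea behind Huffman's method, which recurs throughout the studies reviewed below, can be sketched in a few lines of Python: repeatedly merge the two least frequent subtrees, then read the codes off the resulting tree. This is a minimal illustration only, not the implementation evaluated in this paper.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):  # internal node: branch 0 / 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                        # leaf: record the symbol's code
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes
```

Because frequent symbols end up nearer the root, they receive shorter codes, and the resulting code is prefix-free, so the bit stream can be decoded unambiguously.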
In 1949, Shannon and Fano devised a compression method that systematically assigns code words based on the probabilities of blocks, known as Shannon-Fano coding. Another compression method was introduced in 1951 by Huffman and is known as the Huffman algorithm; it has been used for compression ever since. One review [13] showed that text compression with the Shannon-Fano algorithm performs the same as the Huffman algorithm when all characters occur equally often in a string, and likewise when the input is short and only one character repeats. When the input text is long and contains a more varied mix of characters, the Huffman algorithm compresses more effectively. Another review [2] compared the compression ratio, compression time, and decompression time of the RLE, Huffman, Shannon-Fano, adaptive Huffman, arithmetic, and Lempel-Ziv-Welch (LZW) coding algorithms on random text files. The results show that compression time increases with text file size. Both the Huffman and Shannon-Fano approaches took less time than the other algorithms on medium-sized inputs, while LZW worked well only for small files. The Huffman algorithm's compression time was the highest relative to its encoded size. The decompression time of all algorithms was under 500,000 milliseconds, except for adaptive Huffman and LZW. The compression ratios were similar for small files, except for RLE coding, with LZW performing best. Another study [9] compared RLE, Huffman, LZW, arithmetic coding, LZW followed by Huffman, and finally RLE, on random doc, txt, BMP, TIFF, GIF, and JPG files.
These studies showed that LZW and Huffman give almost the same results when used to compress text files. Another study [4] examined data compression algorithms such as LZW, Huffman, fixed-length code (FLC), and Huffman after fixed-length code (HFLC) on English text files in terms of compressed size, ratio, time (speed), and entropy. Across all compression measures, the best algorithm was LZW, followed by Huffman, HFLC, and FLC, with entropies of 4.719, 4.855, 5.014, and 6.889, respectively. A similar study [5] analyzed the Huffman algorithm and compared it with other common compression techniques such as arithmetic coding, LZW, and RLE, based on their use in different programs and their benefits. Arithmetic coding is a very efficient code, and RLE coding significantly reduces file size when sequences of identical pixels, representable in fewer bits, occur frequently. The LZW algorithm is mostly used for TIFF, GIF, and text files; it is fast, lossless, and very easy to use. The Huffman algorithm is used in JPEG compression, where it produces compact, near-optimal code but is relatively slow. In another paper [7], Shannon-Fano coding, Huffman coding, adaptive Huffman coding, RLE, arithmetic coding, LZ77, LZ78, and LZW were tested on the Calgary corpus; among the statistical compression techniques, arithmetic coding was more effective than the other methods. In another review, the entropy of an English text file was calculated for Shannon-Fano, Huffman, run-length (RLE), and Lempel-Ziv-Welch (LZW) coding. The compression ratios of Shannon-Fano and Huffman coding were almost the same, and both algorithms can save 54.7% of space. The compression ratio of the Lempel-Ziv-Welch algorithm is low compared with the Huffman and Shannon-Fano algorithms, and it was concluded that the Huffman coding algorithm gives the best result for text files. In another study [12], data was first compressed by a run-length-based code, such as the Golomb code, FDR, EFDR, MFDR, SAFDR, or OLEL coding, and the result was then compressed again with a Huffman code; this double compression with the Huffman code achieved a 50.8% compression ratio, with better results on data sets containing incremental data. One study [1] analyzed execution time, compression ratio, and compression efficiency in a distributed client-server environment using four compression algorithms: Huffman, Shannon-Fano, LZW, and run-length coding. A client's data is distributed across multiple processors/servers, compressed by servers at remote locations, and sent back to the client. Using the Sigrid framework, the results showed that the LZ algorithm achieves better efficiency/scalability and compression ratio, but is slower than the other algorithms. Huffman coding, LZW coding, and LZW-based Huffman coding have also been compared for multiple and single compression. This showed that Huffman-based LZW coding can compress data more than the other three techniques: in the normal case the Huffman-based LZW compression ratio is 4.41, whereas the maximum compression ratio of plain LZW is 4.17. Huffman-based LZW compression is therefore in some cases better than LZW compression.

II. METHOD
As an experiment, six compression and decompression techniques were used: Shannon-Fano coding, Huffman coding, repeated (comparative) Huffman coding, Lempel-Ziv-Welch coding, and both standard and modified RLE coding. Text files of different sizes were used as data sets, and the compression ratio was measured to determine which compression method performs best.

III. RESULTS
The compression ratio depends on the size of the output file: the more the output file is compressed, the lower the compression ratio. When the compression ratio exceeds 1, the output file is larger than the original input file. For most input files, modified RLE coding yields a compression ratio greater than 1, meaning that the algorithm expands the original files rather than compressing them. RLE coding works best for files with long consecutive runs of the same symbol, which typical data files do not have; for this reason the compression ratio exceeds 1. The results also show that Shannon-Fano coding is better than Huffman coding for most inputs, since for some input files the symbol code lengths in Huffman coding become so large that the output files grow beyond the original inputs. Repeated Huffman coding shows the best compression ratio for most files. The shorter the codes, the less memory the algorithms need. Considering the average code length, run-length coding has no value, because no code table is created in run-length coding. Among the other techniques, Shannon-Fano coding produces shorter codes than Huffman coding and modified run-length coding, while repeated Huffman coding shows the best average code length for most files. Likewise, the smaller the standard deviation, the less memory the algorithm occupies; here, too, run-length coding has no standard deviation value, since no code table is created in run-length coding.
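The metrics discussed here can be stated concretely. The following Python sketch (an illustration, not the paper's code) computes the compression ratio, the frequency-weighted average code length, and the standard deviation of code lengths for a given code table, and shows why plain RLE expands ordinary prose; the `<count><char>` serialization format in `rle_encode` is an assumption made for the example.

```python
from collections import Counter
from math import sqrt

def compression_ratio(input_size, output_size):
    """Output size over input size; a value above 1 means expansion."""
    return output_size / input_size

def code_length_stats(text, codes):
    """Frequency-weighted average code length and the standard
    deviation of per-symbol code lengths for a code table."""
    freq = Counter(text)
    total = sum(freq.values())
    lengths = {s: len(codes[s]) for s in freq}
    avg = sum(freq[s] * lengths[s] for s in freq) / total
    var = sum(freq[s] * (lengths[s] - avg) ** 2 for s in freq) / total
    return avg, sqrt(var)

def rle_encode(text):
    """Toy RLE: each run of a repeated character becomes '<count><char>',
    e.g. 'aaab' -> '3a1b'."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # extend the current run
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

# RLE shrinks run-heavy input but expands ordinary prose:
runs = "aaaaaabbbbcc"
prose = "compression of text"
compression_ratio(len(runs), len(rle_encode(runs)))    # 0.5 (helps)
compression_ratio(len(prose), len(rle_encode(prose)))  # > 1 (expands)
```

On the run-heavy string the ratio drops to 0.5, while on ordinary prose nearly every run has length 1, so each character costs two output characters and the ratio rises above 1, matching the expansion observed for RLE above.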
For the other techniques, Shannon-Fano coding has a smaller standard deviation than Huffman coding and modified run-length coding, while repeated Huffman coding provides the best standard deviation for most files. From the above analysis we can say that repeated Huffman coding works best in most cases: it has the smallest compression ratio, average code length, and standard deviation, so it appears to be the best algorithm among them, ultimately requiring less memory than the others for the final output file. If the extra runtime that repeated Huffman coding demands is not available, Shannon-Fano coding works quite well at a lower runtime. RLE coding can be very effective for files with consecutive runs of duplicates, and modified RLE coding can be very useful when the intermediate file created by applying Huffman coding to the input contains consecutive runs of 0s and 1s. Finally, each compression technique has its pros and cons; how much compression can be achieved depends on the content of the input files. Table 1 shows the sizes of four input files after applying the compression algorithms.
Table 1 - Comparison of compression algorithms

Input file (.txt)  Characteristic       Modified RLE  RLE      Shannon-Fano  Huffman
156.1 KB           Output file size     163 KB        291 KB   3.89 KB       87.5 KB
                   Compression ratio    1.09          1.96     0.04          0.58
                   Avg. code length     9.83          -        6.35          12.38
                   Standard deviation   15.33         -        0.19          12.33
117.2 KB           Output file size     164 KB        245 KB   91.8 KB       77.3 KB
                   Compression ratio    1.39          1.96     0.83          0.54
                   Avg. code length     9.27          -        6.13          8.95
                   Standard deviation   14.77         -        0.10          13.89
453.3 KB           Output file size     581 KB        388 KB   339 KB        698 KB
                   Compression ratio    1.41          0.91     0.87          1.87
                   Avg. code length     28.12         -        6.49          26.99
                   Standard deviation   412.79        -        0.27          414.78
470.4 KB           Output file size     477 KB        919 KB   387 KB        765 KB
                   Compression ratio    1.01          1.91     0.79          1.79
                   Avg. code length     34.16         -        6.36          34.09
                   Standard deviation   447.01        -        0.28          450.87

IV. CONCLUSION
Text compression is very important for data refinement in data mining, because textual data forms an important part of the data being mined. It therefore requires a lossless technique, so that no useful information is lost when the data is compressed. The investigations in this study computed and compared the compression ratio, mean code length, and standard deviation of Shannon-Fano coding, Huffman coding, repeated Huffman coding, run-length coding, and modified run-length coding, examining how each technique compresses so that the most effective algorithm can be chosen based on the size of the input text file, the type of content, the available memory, and the execution time. The modified run-length algorithm, a new approach to data compression, offers much better compression than the plain run-length algorithm.

REFERENCES
[1] P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J. Weickert, "Evaluating the true potential of diffusion-based inpainting in a compression context," Signal Processing: Image Communication, no. 46, pp. 40-53, 2016.
[2] M. N. Huda, "Study on Huffman Coding," Graduate Thesis, 2004.
[3] The Canterbury Corpus [Online]. http://corpus.canterbury.ac.nz/resources/cantrbry.zip. Accessed 30 May 2018.
[4] A. Habib and M. S. Rahman, "Balancing decoding speed and memory usage for Huffman codes using quaternary tree," Applied Informatics, vol. 4, no. 5, 2017. https://doi.org/10.1186/s40535-016-0032-z
[5] S. R. Kodituwakku and U. S. Amarasinghe, "Comparison of Lossless Data Compression Algorithms for Text Data," Indian Journal of Computer Science and Engineering, vol. 1, no. 4, 2007, pp. 416-426.
[6] C. Lamorahan, B. Pinontoan, and N. Nainggolan, "Data Compression Using Shannon-Fano Algorithm," JdC, vol. 2, no. 2, September 2013, pp. 10-17.
[7] M. A. Khan, "Evaluation of Basic Data Compression Algorithms in a Distributed Environment," Journal of Basic & Applied Sciences, vol. 8, 2012, pp. 362-365.
[8] D. Belazzougui and S. J. Puglisi, "Range predecessor and Lempel-Ziv parsing," in Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2016, pp. 2053-2071.
[9] P. Bille, M. B. Ettienne, I. L. Gørtz, and H. W. Vildhøj, "Time-space trade-offs for Lempel-Ziv compressed indexing," Theoretical Computer Science, vol. 713, 2018, pp. 66-77.
[10] D. Kempa and D. Kosolobov, "LZ-End parsing in compressed space," in Proc. 27th Data Compression Conference (DCC), 2017, pp. 350-359. Extended version: https://arxiv.org/pdf/1611.01769.pdf
[11] G. Navarro, "A self-index on Block Trees," in Proc. 24th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, vol. 10508, 2017, pp. 278-289.
[12] T. Nishimoto, T. I., S. Inenaga, H. Bannai, and M. Takeda, "Dynamic index and LZ factorization in compressed space," in Proc. Prague Stringology Conference (PSC), 2016, pp. 158-170.
[13] D. Belazzougui, S. J. Puglisi, and Y. Tabei, "Access, rank, select in grammar-compressed strings," in Proc. 23rd Annual European Symposium on Algorithms (ESA), LNCS, vol. 9294, 2015, pp. 142-154.
[14] D. Kempa and N. Prezza, "At the roots of dictionary compression: string attractors," in Proc. 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018, pp. 827-840.