In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm, which achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed, and then substitutes these sequences with the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to compression. This leads to significantly improved compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two-pass scheme in which the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and again achieve significantly improved compression, between 11% and 20%, compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
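As a rough illustration of this idea (a sketch only, not the authors' exact implementation), the following Python code builds a small grammar from the most frequent bigraphs and substitutes each occurrence with a fresh non-terminal before the text would be handed to PPM; the limit of 64 rules and the use of private-use Unicode code points as non-terminals are assumptions made for the example.

```python
from collections import Counter

def build_bigraph_grammar(text, max_rules=64, symbol_start=0xE000):
    """Pick the most frequent bigraphs and assign each one a fresh
    non-terminal drawn from the Unicode private-use area (assumed here)."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    rules = {}
    for i, (bigraph, _) in enumerate(counts.most_common(max_rules)):
        rules[bigraph] = chr(symbol_start + i)   # bigraph -> non-terminal
    return rules

def preprocess(text, rules):
    """Left-to-right greedy substitution of bigraphs by their non-terminals."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in rules:
            out.append(rules[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def postprocess(encoded, rules):
    """Invert the substitution after decompression."""
    inverse = {v: k for k, v in rules.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)

if __name__ == "__main__":
    sample = "the theory of the thing " * 10
    grammar = build_bigraph_grammar(sample)
    packed = preprocess(sample, grammar)
    assert postprocess(packed, grammar) == sample
    print(len(sample), "->", len(packed), "symbols before PPM is applied")
```

Applying build_bigraph_grammar and preprocess a second time to the already substituted text corresponds to the two-pass scheme mentioned above.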
This document describes techniques for creating compact and fast n-gram language models. It presents several implementations of n-gram language models that are both small in size and fast to query. The most compact implementation can store the 4 billion n-grams from the Google Web1T corpus in just 23 bits per n-gram, using techniques like implicitly encoding words and applying variable-length encodings to context deltas. It also discusses methods for improving query speed, such as a novel language model caching technique that speeds up both their implementations and SRILM by up to 300%.
International Journal of Engineering Research and Development - IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture Engineering,
Aerospace Engineering.
Supplementary material for my following paper: Infinite Latent Process Decomp... - Tomonari Masada
This document proposes an infinite latent process decomposition (iLPD) model for microarray data. iLPD extends latent process decomposition (LPD) to allow for an infinite number of latent processes. It presents iLPD's generative process and joint distribution. The document also introduces auxiliary variables and provides a collapsed variational Bayesian inference approach for iLPD. This involves deriving a lower bound for the log evidence to evaluate iLPD's efficiency and compare it to LPD. The proposed method improves on past work by treating full posterior distributions over hyperparameters and removing dependencies in computing the variational lower bound.
Efficiency lossless data techniques for Arabic text compression - ijcsit
This document summarizes a study that evaluated the efficiency of the LZW and BWT data compression techniques on Arabic text files of different sizes and categories (vowelized, partially vowelized, unvowelized). It found that the enhanced LZW technique, which took advantage of Arabic letters having a single case, achieved the highest compression ratios. The enhanced LZW performed better than standard LZW and BWT for all categories of Arabic text. While LZW generally compressed Arabic text better than English text, BWT only performed better on vowelized Arabic versus English. The study concluded the enhanced LZW was the most effective technique for compressing Arabic texts.
BERT is a pre-trained language representation model that uses the Transformer architecture. It is pre-trained using two unsupervised tasks: masked language modeling and next sentence prediction. BERT can then be fine-tuned on downstream NLP tasks like question answering and text classification. When fine-tuned on SQuAD, BERT achieved state-of-the-art results by using the output hidden states to predict the start and end positions of answers within paragraphs. Later work like RoBERTa and ALBERT improved on BERT by modifying pre-training procedures and model architectures.
In this paper, low-complexity architectures for finding the first two maximum or minimum values are considered; such architectures are of paramount importance in several applications, including iterative decoders. The min-sum processing step produces only two distinct output magnitude values irrespective of the number of incoming bit-to-check messages. The new micro-architecture structures utilize the minimum number of comparators by exploiting the concept of survivors in the search, resulting in a reduced number of comparisons and consequently reduced energy use. Multipliers are complex units and play an important role in determining the overall area, speed and power consumption of digital designs; by using the proposed multiplier, parameters such as latency, complexity and power consumption can be minimized. The decoding algorithms we propose generalize and unify the decoding schemes originally presented for product codes and for low-density parity-check codes.
Analysis of LDPC Codes under Wi-Max IEEE 802.16e - IJERD Editor
LDPC codes have been shown to be by far the best coding scheme for transmitting messages over a noisy channel. The main aim of this paper is to study the behaviour of LDPC codes under the IEEE 802.16e guidelines. Rate-1/2 LDPC codes have been implemented on an AWGN channel, and the results show that they can be used on such channels with low BER. The BER can be further reduced by increasing the block length.
This project report presents an analysis of compression algorithms like Huffman encoding, run length encoding, LZW, and a variant of LZW. Testing was conducted on these algorithms to analyze compression rates, compression and decompression times, and memory usage. Huffman encoding achieved around 40% compression while run length encoding resulted in negligible compression. LZW performed better than its variant in compression but the variant used significantly less memory. The variant balanced compression performance and memory usage better than standard LZW for large files.
Performance analysis and implementation for nonbinary quasi cyclic ldpc decod... - ijwmn
Non-binary low-density parity-check (NB-LDPC) codes are an extension of binary LDPC codes with significantly better performance. Although various kinds of low-complexity iterative decoding algorithms have been proposed, VLSI implementation of NB-LDPC decoders remains a major challenge because of their high complexity and long latency. In this brief, a highly efficient check node processing scheme that greatly reduces the processing delay is proposed, including a Min-Max decoding algorithm and a check node unit. Compared with previous works, the latency of the check node unit can be reduced by up to 52%. In addition, the efficiency of the presented techniques is demonstrated with a design for the (620, 310) NB-QC-LDPC decoder.
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio... - RSIS International
In this paper, we have designed the VLSI hardware for a novel RS decoding algorithm suitable for multi-Gb/s communication systems. We show that the performance benefit of the algorithm is truly realized when it is implemented in hardware, thus avoiding the extra processing time of the fetch-decode-execute cycle of traditional microprocessor-based computing systems. The new algorithm's lower time complexity, combined with its application-specific hardware implementation, makes it suitable for high-speed real-time systems with hard timing constraints. The design is implemented as digital hardware using VHDL.
A Compression & Encryption Algorithms on DNA Sequences Using R 2 P & Selectiv...IJMERJOURNAL
ABSTRACT: The size of DNA (Deoxyribonucleic Acid) sequences is varying in the range of millions to billions of nucleotides and two or three times bigger annually. Therefore efficient lossless compression technique, data structures to efficiently store, access, secure communicate and search these large datasets are necessary. This compression algorithm for genetic sequences, based on searching the exact repeat, reverse and palindrome (R2P) substring substitution and create a Library file. The R2P substring is replaced by corresponding ASCII character where for repeat, selecting ASCII characters ranging from 33 to 33+72, for reverse from 33+73 to 33+73+72 and for palindrome from 179 to 179+72. The selective encryption technique, the data are encrypted either in the library file or in compressed file or in both, also by using ASCII code and online library file acting as a signature. Selective encryption, where a part of message is encrypted keeping the remaining part unencrypted, can be a viable proposition for running encryption system in resource constraint devices. The algorithm can approach a moderate compression rate, provide strong data security, the running time is very few second and the complexity is O(n2 ). Also the compressed data again compressed by renounced compressor for reducing the compression rate & ratio. This techniques can approach a compression rate of 2.004871bits/base.
Reduced Energy Min-Max Decoding Algorithm for LDPC Code with Adder Correction... - ijceronline
In this paper, efficient architectures for finding the first two maximum or minimum values, which are of paramount importance in several applications including iterative decoders, are analysed, and an LDPC decoder with an adder correction term is proposed. The min-sum processing step produces only two different output magnitude values irrespective of the number of incoming bit-to-check messages. The new micro-architecture layouts employ the minimum number of comparators by exploiting the concept of survivors in the search, resulting in a reduced number of comparisons and consequently reduced energy use. Multipliers are complex units and play an important role in determining the overall area, speed and power consumption of digital designs; by using the proposed multiplier, parameters such as latency, complexity and power consumption can be minimized.
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper) - Enrique Monzo Solves
Implementation of high-speed decoding of non-binary irregular LDPC codes using CUDA GPUs.
Moritz Beermann, Enrique Monzó, Laurent Schmalen, Peter Vary
IEEE SiPS, Oct. 2013, Taipei, Taiwan
This document discusses various data compression techniques. It begins with an introduction to compression and its goals of reducing storage space and transmission time. Then it discusses lossless techniques like Huffman coding, Lempel-Ziv coding, run-length encoding and pattern substitution. The document also briefly covers lossy compression and entropy encoding algorithms like Shannon-Fano coding and arithmetic coding. Key compression methods and their applications are summarized throughout.
Deep Packet Inspection with Regular Expression Matching - Editor IJCATR
Deep packet inspection directs, persists, filters and logs IP-based application and Web services traffic based on content encapsulated in a packet's header or payload, regardless of the protocol or application type. In content scanning, the packet payload is compared against a set of patterns specified as regular expressions. With deep packet inspection in place through a single intelligent network device, companies can boost performance without buying expensive servers or additional security products. Regular expressions are typically matched through deterministic finite automata (DFAs), but large rule sets need an amount of memory that is too large for practical implementation. Many recent works have proposed improvements to address this issue, but they increase the number of transitions (and hence of memory accesses) per character. This paper presents a new representation for DFAs, orthogonal to most of the previous solutions, called delta finite automata (δFA), which considerably reduces states and transitions while preserving a single transition per character, thus allowing fast matching. A further optimization exploits Nth-order relationships within the DFA by adopting the concept of temporary transitions.
The document discusses different techniques for compressing multimedia data such as text, images, audio and video. It describes how compression works by removing redundancy in digital data and exploiting properties of human perception. It then explains different compression methods including lossless compression, lossy compression, entropy encoding, and specific algorithms like Huffman encoding and arithmetic coding. The goal of compression is to reduce the size of files to reduce storage and bandwidth requirements for transmission.
This document summarizes a research paper on improving regular expression matching using a look-ahead finite automata approach. It proposes LaFA, which optimizes regular expression detection by allowing out-of-order detection, systematically reordering the detection sequence, and sharing states among automata. LaFA uses three main modules - a timestamp lookup module to detect non-repeating patterns, a character lookup module to check individual characters, and a repetition detection module to handle repeating patterns. The system architecture involves capturing packets, extracting payloads, and using these modules in parallel to match regular expressions and improve detection speed for intrusion detection systems.
This document summarizes a method for recognizing online handwritten Tamil characters using fractal features. A novel fractal coding technique is used that exploits redundancy to achieve better compression while requiring less memory and encoding time with minimal distortion. Fractal codes are generated from preprocessed Tamil characters by partitioning them into ranges and finding the most similar transformed domain segments. These fractal codes are then used to classify characters with 90% accuracy, outperforming a nearest neighbor classifier.
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA... - ijnlc
With the recent developments in the field of Natural Language Processing, there has been a rise in the use of different architectures for Neural Machine Translation. Transformer architectures are used to achieve state-of-the-art accuracy, but they are very computationally expensive to train, and not everyone can afford setups with high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. The LSTM performed well in the experiment and took comparatively less time to train than the transformers, making it suitable for situations with time constraints.
This document discusses data compression using Python. It begins with explaining how information theory can be used to more efficiently encode messages by assigning shorter codewords to more common symbols. An example encoding scheme is shown to reduce redundancy from 32% to 3%. The document then demonstrates a Python implementation of Huffman encoding and decoding to compress a sample text file by 45%. Key information theory concepts like entropy, uncertainty, and coding efficiency are also overviewed.
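A minimal Huffman coder along the lines described above (a generic sketch, not the document's own implementation; the heap-based tree construction and the bit-string output are choices made here):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter codewords."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # prefix-free, so the first hit is the symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

if __name__ == "__main__":
    msg = "abracadabra" * 20
    codes = huffman_codes(msg)
    bits = encode(msg, codes)
    assert decode(bits, codes) == msg
    print(f"{len(bits) / (8 * len(msg)):.0%} of the original size (ignoring the code table)")
```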
Non autoregressive neural text-to-speech review - June-Woo Kim
Non autoregressive neural text-to-speech, Peng, Kainan, et al. "Non-autoregressive neural text-to-speech." International Conference on Machine Learning. PMLR, 2020. review by June-Woo Kim
Tutorial on Parallel Computing and Message Passing Model - C4 - Marcirio Chaves
This document provides a tutorial on communicating non-contiguous data and mixed data types in parallel computing using MPI (Message Passing Interface). It discusses several strategies for sending this type of complex data, including sending multiple messages, buffering using pack/unpack, and defining derived datatypes. It also covers collective communication operations like broadcast, scatter/gather, and reductions.
Data Compression - Text Compression - Run Length Encoding - MANISH T I
Run-length encoding (RLE) replaces consecutive repeated characters in data with a single character and count. For example, "aaabbc" would compress to "3a2bc". RLE works best on data with many repetitive characters like spaces. It has limitations for natural language text which contains few repetitions longer than doubles. Variants include digram encoding which compresses common letter pairs, and differencing which encodes differences between successive values like temperatures instead of absolute values.
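A short sketch of the count-only-when-repeated convention used in the example above ("aaabbc" becoming "3a2bc"); omitting the count for single characters and assuming the input contains no digits are simplifications taken for this illustration:

```python
from itertools import groupby

def rle_encode(data):
    """Emit '<count><char>' for runs, and the bare character for singletons."""
    out = []
    for ch, group in groupby(data):
        run = len(list(group))
        out.append(f"{run}{ch}" if run > 1 else ch)
    return "".join(out)

def rle_decode(encoded):
    """Invert the scheme above (assumes the original data contains no digits)."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch
        else:
            out.append(ch * (int(count) if count else 1))
            count = ""
    return "".join(out)

if __name__ == "__main__":
    assert rle_encode("aaabbc") == "3a2bc"
    assert rle_decode("3a2bc") == "aaabbc"
```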
This document provides an overview of coding theory and recent advances in low-density parity-check (LDPC) codes. It discusses Shannon's channel coding theorem and how modern error-correcting codes achieve rates close to channel capacity. LDPC codes are described as having sparse parity-check matrices and being decoded iteratively using message passing. The performance of LDPC codes can be analyzed using density evolution and threshold calculations. Linear programming decoding is introduced as an alternative decoding approach that has connections to message passing decoding.
Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Simil... - Koji Matsuda
The document presents a unified approach for measuring semantic similarity between texts at multiple levels (sense, word, text) using semantic signatures. It generates semantic signatures through multi-seeded random walks over the WordNet graph. It then aligns and disambiguates words and senses to extract sense "seeds" for the signatures. Finally, it calculates signature similarity using measures like cosine similarity, weighted overlap, and top-k Jaccard. The approach provides a unified framework for semantic similarity that can be applied to various NLP tasks.
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD T... - IJNSA Journal
This paper introduces a multi-layer hybrid text steganography approach that utilizes word tagging and recoloring. Existing approaches are designed to excel in either imperceptibility, hiding capacity, or robustness. The proposed approach does not use the ordinary sequential embedding process; it overcomes the issues of current approaches by addressing imperceptibility, hiding capacity, and robustness together through a hybrid of a linguistic technique and a format-based technique. The linguistic technique divides the cover text into embedding layers, where each layer consists of a sequence of words sharing a single part of speech detected by a POS tagger, while the format-based technique recolors the letters of the cover text with near-RGB color coding to embed 12 bits of the secret message in each letter, which yields high hiding capacity and keeps the embedding blind. Robustness is achieved through the multi-layer embedding process, and the generated stego key significantly strengthens the security of the embedded messages and their size. An experimental comparison shows that the proposed approach is better than currently developed approaches at providing a balance between the imperceptibility, hiding capacity, and robustness criteria.
Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text - ijcsit
In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially adjust the alphabet in some manner (for example, by expanding or reducing it) prior to the compression algorithm then being applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded by the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbols streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
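The second technique above separates the text into a vocabulary stream and a symbols stream that are then encoded independently. The following sketch shows one plausible way to perform such a split; the tokenisation rule and the explicit layout record used to rebuild the text are assumptions of this illustration, not the paper's exact format.

```python
import re

def split_streams(text):
    """Separate word tokens (vocabulary stream) from everything else
    (symbols stream), keeping enough structure to rebuild the text."""
    tokens = re.findall(r"\w+|\W+", text, flags=re.UNICODE)
    vocab, symbols, layout = [], [], []
    for tok in tokens:
        if tok[0].isalnum() or tok[0] == "_":
            vocab.append(tok)
            layout.append("W")       # a word goes here
        else:
            symbols.append(tok)
            layout.append("S")       # a run of symbols/whitespace goes here
    return vocab, symbols, layout

def join_streams(vocab, symbols, layout):
    """Re-interleave the two streams using the recorded layout."""
    vi = si = 0
    out = []
    for mark in layout:
        if mark == "W":
            out.append(vocab[vi]); vi += 1
        else:
            out.append(symbols[si]); si += 1
    return "".join(out)

if __name__ == "__main__":
    text = "In this paper, several new techniques are described."
    v, s, layout = split_streams(text)
    assert join_streams(v, s, layout) == text
    # each stream would then be compressed separately, e.g. with PPM
```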
Parallel random projection using R high performance computing for planted mot... - TELKOMNIKA JOURNAL
Motif discovery in DNA sequences is one of the most important issues in bioinformatics, so algorithms that deal with the problem accurately and quickly have always been a goal of bioinformatics research. This study modifies the random projection algorithm to be implemented on R high performance computing (i.e., the R package pbdMPI). Several steps are needed to achieve this objective: preprocessing the data, splitting the data according to the number of batches, modifying and implementing random projection in the pbdMPI package, and then aggregating the results. To validate the proposed approach, experiments were conducted on several benchmark datasets with a sensitivity analysis on the number of cores and batches. Experimental results show that the computational cost can be reduced: computation on 6 cores is around 34 times faster than in standalone mode. The proposed approach can thus be used for motif discovery effectively and efficiently.
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION - cscpconf
We present symbolic and neural approaches for Arabic paraphrasing that yield high paraphrasing accuracy. This is the first work on sentence-level paraphrase generation for Arabic and the first using neural models to generate paraphrased sentences for Arabic. We present and compare several methods for paraphrasing and obtaining monolingual parallel data. We share a large-coverage phrase dictionary for Arabic and contribute a large parallel monolingual corpus that can be used in developing new seq-to-seq models for paraphrasing; this is the first large monolingual corpus of Arabic. We also present first results in Arabic paraphrasing using seq-to-seq neural methods. Additionally, we propose a novel automatic evaluation metric for paraphrasing that correlates highly with human judgement.
The document discusses various data compression techniques. It defines data compression as the process of reducing the size of data through the use of compression algorithms. There are two main types of compression: lossless, where the original and decompressed data are identical, and lossy, where some data may be lost during compression. Common lossless techniques mentioned include run-length encoding, Huffman coding, and Lempel-Ziv encoding, while lossy methods discussed are JPEG for images and MPEG for video. The document also provides examples to illustrate how several of these compression algorithms work.
Performance Improvement Of Bengali Text Compression Using Transliteration And... - IJERA Editor
In this paper, we propose a new compression technique based on transliteration of Bengali text to English. Compared to Bengali, English uses far fewer distinct symbols, so transliterating Bengali text to English reduces the number of characters to be coded. Huffman coding is well known for producing optimal compression. When the Huffman principle is applied to the transliterated text, significant performance improvement is achieved in terms of decoding speed and space requirement compared to Unicode compression.
ADAPTIVE AUTOMATA FOR GRAMMAR BASED TEXT COMPRESSION - csandit
The Internet and the ubiquitous presence of computing devices are generating a continuously growing amount of information. However, the entropy of that information is not uniform, which allows data compression algorithms to reduce the demand for more powerful processors and larger data storage equipment. This paper presents an adaptive rule-driven device, the adaptive automaton, as the device for identifying repetitive patterns to be compressed in a grammar-based lossless data compression scheme.
- The document discusses compilation analysis and performance analysis of Feel++ scientific applications using Scalasca.
- It presents compilation analysis of Feel++ using examples of mesh manipulation and discusses performance analysis using Feel++'s TIME class or Scalasca instrumentation.
- The document analyzes the Laplacian case study in Feel++ using different compilation options and polynomial dimensions and presents results from performance analysis with Scalasca.
This document discusses neural machine translation techniques for translating English to Japanese and Korean to Japanese. It proposes using character sequences rather than word sequences to address issues with rare words. It also describes using a bidirectional RNN encoder-decoder with attention and representing target characters with B-I tags. Experimental results show the modified RNN model outperforms earlier versions and that character-level phrase-based machine translation for Korean-Japanese performs comparably to word-level methods.
The novel lossless text compression technique using ambigram logic and huf... - Alexander Decker
This document summarizes a novel lossless text compression technique that uses ambigrams and Huffman coding. It begins with an introduction to ambigrams and Huffman coding. It then describes how the technique works by first using ambigrams to compress the text by about 50% and then further compressing it with Huffman coding to achieve over 60% compression overall. The technique is lossless as it allows perfect reconstruction of the original text. It also discusses potential applications in steganography by embedding confidential data into digital images.
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA... - ijnlc
The quality of Neural Machine Translation (NMT) systems, like that of Statistical Machine Translation (SMT) systems, heavily depends on the size of the training data set, while for some language pairs high-quality parallel data are a scarce resource. To respond to this low-resource training data bottleneck, we employ the pivoting approach in both the neural MT and statistical MT frameworks. In our experiments on Persian-Spanish, taken as an under-resourced translation task, we found that this method, in both frameworks, significantly improves translation quality in comparison to the standard direct translation approach.
A Comparison of Serial and Parallel Substring Matching Algorithms - zexin wan
This document summarizes a study comparing serial and parallel substring matching algorithms. It describes implementing a naive serial algorithm and dynamic programming algorithm in parallel using RPI's supercomputer. Testing on randomly generated and real-world text data showed the parallel naive algorithm outperformed the parallel dynamic algorithm for smaller files, while the dynamic algorithm scaled better to larger files. Analysis found the parallel algorithms grew exponentially with input size due to blocking MPI calls, though the dynamic implementation had memory issues for large files due to its 2D array structure. On average, the parallel implementations provided a 25x speedup over the fastest serial algorithm.
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER... - IJCNCJournal
In the past years, interconnection networks have been used quite often, especially in applications where parallelization is critical. Message packets transmitted through such networks can be interrupted using buffers in order to maximize network usage and minimize the time required for all messages to reach their destination. However, preempting a packet results in topology reconfiguration and consequently in a time cost. The problem of scheduling message packets through such a network is referred to as PBS and is known to be NP-hard. In this paper we have critically improved variations of polynomially solvable instances of Open Shop to approximate PBS. We have combined these variations and called the induced algorithm I_HSA (Improved Hybridic Scheduling Algorithm). We ran experiments to establish the efficiency of I_HSA and found that on all datasets used it produces schedules very close to the optimal. In addition, we tested I_HSA with datasets that follow non-uniform distributions and provide statistical data that better illustrates its performance. To further establish I_HSA's efficiency, we ran tests comparing it to SGA, another algorithm which, when tested in the past, has yielded excellent results.
Sentence Validation by Statistical Language Modeling and Semantic Relations - Editor IJCATR
This paper deals with sentence validation, a sub-field of Natural Language Processing. It finds various applications in different areas as it deals with understanding and manipulating natural language (English in most cases). The effort is therefore on understanding and extracting the important information delivered to the computer, making efficient human-computer interaction possible. Sentence validation is approached in two ways: a statistical approach and a semantic approach. In both approaches the database is trained with the help of sample sentences from the Brown corpus of NLTK. The statistical approach uses a trigram technique based on the N-gram Markov model and modified Kneser-Ney smoothing to handle zero probabilities. As another statistical test, tagging and chunking of sentences containing named entities is carried out using pre-defined grammar rules and semantic tree parsing, and the chunked sentences are fed into another database, upon which testing is carried out. Finally, semantic analysis is carried out by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations are studied. Hence, a comparison of the three different models is calculated and formulated. Graphs pertaining to the probabilities of the three approaches are plotted, which clearly demarcate them and throw light on the findings of the project.
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –... - IRJET Journal
This document summarizes research on developing parallel algorithms to optimize solving the longest common subsequence (LCS) problem. LCS is commonly used for sequence comparison in bioinformatics. Traditional sequential dynamic programming algorithms have complexity of O(mn) for sequences of lengths m and n. The document reviews parallel algorithms developed using tools like OpenMP and GPUs like CUDA to reduce computation time. It proposes the authors' own optimized parallel algorithm for multi-core CPUs using OpenMP.
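For reference, the O(mn) dynamic-programming recurrence that such parallel versions accelerate can be written in a few lines; this sequential Python sketch is only the textbook baseline, not the OpenMP implementation proposed in the paper.

```python
def lcs_length(a, b):
    """Classic O(m*n) dynamic programme: dp[i][j] is the LCS length
    of the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

if __name__ == "__main__":
    # Parallel variants typically sweep this table anti-diagonal by
    # anti-diagonal, since cells on the same anti-diagonal are independent.
    print(lcs_length("ABCBDAB", "BDCABA"))  # -> 4
```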
Extractive Document Summarization - An Unsupervised Approach - Findwise
1. This paper presents and evaluates an unsupervised extractive document summarization system that uses TextRank, K-means clustering, and one-class SVM algorithms for sentence ranking.
2. The system achieves state-of-the-art performance on the DUC 2002 English dataset with a ROUGE score of 0.4797 and can also summarize Swedish documents.
3. Domain knowledge is added through sentence boosting to improve summarization of news articles, and similarities between sentences are calculated to avoid redundancy for multi-document summarization.
This paper introduces a new computer-based plagiarism detection technique that combines longest common substring clustering, substring matching, and keyword similarity. Documents are clustered based on longest common subsequence to create groups of similar documents. Substring matching is then used to compare substrings between documents in each cluster. Finally, keyword similarity is evaluated based on input keywords to further detect plagiarism. The technique aims to more efficiently detect plagiarism compared to traditional substring-based algorithms by first clustering similar documents before applying the detection methods.
Similar to Grammar Based Pre-Processing for PPM
Home security is of paramount importance in today's world, where we rely ever more on technology; using technology to make homes safer and easier to control from anywhere matters, and home security is essential for the occupants' safety. In this paper, we present a low-cost, AI-based home security system. The system has a user-friendly interface, allowing users to start model training and face detection with simple keyboard commands. Our goal is to introduce an innovative home security system using facial recognition technology. Unlike traditional systems, this system is trained on and saves images of friends and family members. The system scans this folder to recognize familiar faces and provides real-time monitoring. If an unfamiliar face is detected, it promptly sends an email alert, ensuring a proactive response to potential security threats.
In the era of data-driven warfare, the integration of big data and machine learning (ML) techniques has
become paramount for enhancing defence capabilities. This research report delves into the applications of
big data and ML in the defence sector, exploring their potential to revolutionize intelligence gathering,
strategic decision-making, and operational efficiency. By leveraging vast amounts of data and advanced
algorithms, these technologies offer unprecedented opportunities for threat detection, predictive analysis,
and optimized resource allocation. However, their adoption also raises critical concerns regarding data
privacy, ethical implications, and the potential for misuse. This report aims to provide a comprehensive
understanding of the current state of big data and ML in defence, while examining the challenges and
ethical considerations that must be addressed to ensure responsible and effective implementation.
Cloud Computing, being one of the most recent innovative developments of the IT world, has been
instrumental not just to the success of SMEs but, through their productivity and innovative contribution to
the economy, has even made a remarkable contribution to the economic growth of the United States. To
this end, the study focuses on how cloud computing technology has impacted economic growth through
SMEs in the United States. Relevant literature connected to the variables of interest in this study was
reviewed, and secondary data was generated and utilized in the analysis section of this paper. The findings
of this paper revealed that there have been meaningful contributions that the usage of virtualization has
made in the commercial dealings of small firms in the United States, and this has also been reflected in the
economic growth of the country. This paper further revealed that as important as cloud-based software is,
some SMEs are still skeptical about how it can help improve their business and increase their bottom line
and hence have failed to adopt it. Apart from the SMEs, some notable large firms in different industries,
including information and educational services, have adopted cloud computing technology and hence
contributed to the economic growth of the United States. Lastly, findings from our inferential statistics
revealed that no discernible change has occurred in innovation between small and big businesses in the
adoption of cloud computing. Both categories of businesses adopt cloud computing in the same way, and
their contribution to the American economy has no significant difference in the usage of virtualization.
Energy-constrained Wireless Sensor Networks (WSNs) have garnered significant research interest in
recent years. Multiple-Input Multiple-Output (MIMO), or Cooperative MIMO, represents a specialized
application of MIMO technology within WSNs. This approach operates effectively, especially in
challenging and resource-constrained environments. By facilitating collaboration among sensor nodes,
Cooperative MIMO enhances reliability, coverage, and energy efficiency in WSN deployments.
Consequently, MIMO finds application in diverse WSN scenarios, spanning environmental monitoring,
industrial automation, and healthcare applications.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication. IJCSIT publishes original research papers and review papers, as well as auxiliary material such as: research papers, case studies, technical reports etc.
With growing, Car parking increases with the number of car users. With the increased use of smartphones
and their applications, users prefer mobile phone-based solutions. This paper proposes the Smart Parking
Management System (SPMS) that depends on Arduino parts, Android applications, and based on IoT. This
gave the client the ability to check available parking spaces and reserve a parking spot. IR sensors are
utilized to know if a car park space is allowed. Its area data are transmitted using the WI-FI module to the
server and are recovered by the mobile application which offers many options attractively and with no cost
to users and lets the user check reservation details. With IoT technology, the smart parking system can be
connected wirelessly to easily track available locations.
Welcome to AIRCC's International Journal of Computer Science and Information Technology (IJCSIT), your gateway to the latest advancements in the dynamic fields of Computer Science and Information Systems.
Computer-Assisted Language Learning (CALL) are computer-based tutoring systems that deal with
linguistic skills. Adding intelligence in such systems is mainly based on using Natural Language
Processing (NLP) tools to diagnose student errors, especially in language grammar. However, most such
systems do not consider the modeling of student competence in linguistic skills, especially for the Arabic
language. In this paper, we will deal with basic grammar concepts of the Arabic language taught for the
fourth grade of the elementary school in Egypt. This is through Arabic Grammar Trainer (AGTrainer)
which is an Intelligent CALL. The implemented system (AGTrainer) trains the students through different
questions that deal with the different concepts and have different difficulty levels. Constraint-based student
modeling (CBSM) technique is used as a short-term student model. CBSM is used to define in small grain
level the different grammar skills through the defined skill structures. The main contribution of this paper
is the hierarchal representation of the system's basic grammar skills as domain knowledge. That
representation is used as a mechanism for efficiently checking constraints to model the student knowledge
and diagnose the student errors and identify their cause. In addition, satisfying constraints and the number
of trails the student takes for answering each question and fuzzy logic decision system are used to
determine the student learning level for each lesson as a long-term model. The results of the evaluation
showed the system's effectiveness in learning in addition to the satisfaction of students and teachers with its
features and abilities.
In the realm of computer security, the importance of efficient and reliable user authentication methods has
become increasingly critical. This paper examines the potential of mouse movement dynamics as a
consistent metric for continuous authentication. By analysing user mouse movement patterns in two
contrasting gaming scenarios, "Team Fortress" and "Poly Bridge," we investigate the distinctive
behavioral patterns inherent in high-intensity and low-intensity UI interactions. The study extends beyond
conventional methodologies by employing a range of machine learning models. These models are carefully
selected to assess their effectiveness in capturing and interpreting the subtleties of user behavior as
reflected in their mouse movements. This multifaceted approach allows for a more nuanced and
comprehensive understanding of user interaction patterns. Our findings reveal that mouse movement
dynamics can serve as a reliable indicator for continuous user authentication. The diverse machine
learning models employed in this study demonstrate competent performance in user verification, marking
an improvement over previous methods used in this field. This research contributes to the ongoing efforts to
enhance computer security and highlights the potential of leveraging user behavior, specifically mouse
dynamics, in developing robust authentication systems.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
Image segmentation and classification tasks in computer vision have proven to be highly effective using neural networks, specifically Convolutional Neural Networks (CNNs). These tasks have numerous
practical applications, such as in medical imaging, autonomous driving, and surveillance. CNNs are capable
of learning complex features directly from images and achieving outstanding performance across several
datasets. In this work, we have utilized three different datasets to investigate the efficacy of various preprocessing and classification techniques in accurssedately segmenting and classifying different structures
within the MRI and natural images. We have utilized both sample gradient and Canny Edge Detection
methods for pre-processing, and K-means clustering have been applied to segment the images. Image
augmentation improves the size and diversity of datasets for training the models for image classification
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
This research aims to further understanding in the field of continuous authentication using behavioural
biometrics. We are contributing a novel dataset that encompasses the gesture data of 15 users playing
Minecraft with a Samsung Tablet, each for a duration of 15 minutes. Utilizing this dataset, we employed
machine learning (ML) binary classifiers, being Random Forest (RF), K-Nearest Neighbors (KNN), and
Support Vector Classifier (SVC), to determine the authenticity of specific user actions. Our most robust
model was SVC, which achieved an average accuracy of approximately 90%, demonstrating that touch
dynamics can effectively distinguish users. However, further studies are needed to make it viable option
for authentication systems. You can access our dataset at the following
link:https://github.com/AuthenTech2023/authentech-repo
This paper discusses the capabilities and limitations of GPT-3 (0), a state-of-the-art language model, in the
context of text understanding. We begin by describing the architecture and training process of GPT-3, and
provide an overview of its impressive performance across a wide range of natural language processing
tasks, such as language translation, question-answering, and text completion. Throughout this research
project, a summarizing tool was also created to help us retrieve content from any types of document,
specifically IELTS (0) Reading Test data in this project. We also aimed to improve the accuracy of the
summarizing, as well as question-answering capabilities of GPT-3 (0) via long text
In the realm of computer security, the importance of efficient and reliable user authentication methods has
become increasingly critical. This paper examines the potential of mouse movement dynamics as a
consistent metric for continuous authentication. By analysing user mouse movement patterns in two
contrasting gaming scenarios, "Team Fortress" and "Poly Bridge," we investigate the distinctive
behavioral patterns inherent in high-intensity and low-intensity UI interactions. The study extends beyond
conventional methodologies by employing a range of machine learning models. These models are carefully
selected to assess their effectiveness in capturing and interpreting the subtleties of user behavior as
reflected in their mouse movements. This multifaceted approach allows for a more nuanced and
comprehensive understanding of user interaction patterns. Our findings reveal that mouse movement
dynamics can serve as a reliable indicator for continuous user authentication. The diverse machine
learning models employed in this study demonstrate competent performance in user verification, marking
an improvement over previous methods used in this field. This research contributes to the ongoing efforts to
enhance computer security and highlights the potential of leveraging user behavior, specifically mouse
dynamics, in developing robust authentication systems.
Image segmentation and classification tasks in computer vision have proven to be highly effective using neural networks, specifically Convolutional Neural Networks (CNNs). These tasks have numerous
practical applications, such as in medical imaging, autonomous driving, and surveillance. CNNs are capable
of learning complex features directly from images and achieving outstanding performance across several
datasets. In this work, we have utilized three different datasets to investigate the efficacy of various preprocessing and classification techniques in accurssedately segmenting and classifying different structures
within the MRI and natural images. We have utilized both sample gradient and Canny Edge Detection
methods for pre-processing, and K-means clustering have been applied to segment the images. Image
augmentation improves the size and diversity of datasets for training the models for image classification.
This work highlights transfer learning’s effectiveness in image classification using CNNs and VGG 16 that
provides insights into the selection of pre-trained models and hyper parameters for optimal performance.
We have proposed a comprehensive approach for image segmentation and classification, incorporating preprocessing techniques, the K-means algorithm for segmentation, and employing deep learning models such
as CNN and VGG 16 for classification.
- The document presents 6 different models for defining foot size in Tunisia: 2 statistical models, 2 neural network models using unsupervised learning, and 2 models combining neural networks and fuzzy logic.
- The statistical models (SM and SHM) are based on applying statistical equations to morphological foot data.
- The neural network models (MSK and MHSK) use self-organizing Kohonen maps to cluster foot data and model full and half sizes.
- The fuzzy neural network models (MSFK and MHSFK) incorporate fuzzy logic into the neural network learning process to better account for uncertainty in foot sizes.
The security of Electric Vehicle (EV) charging has gained momentum after the increase in the EV adoption
in the past few years. Mobile applications have been integrated into EV charging systems that mainly use a
cloud-based platform to host their services and data. Like many complex systems, cloud systems are
susceptible to cyberattacks if proper measures are not taken by the organization to secure them. In this
paper, we explore the security of key components in the EV charging infrastructure, including the mobile
application and its cloud service. We conducted an experiment that initiated a Man in the Middle attack
between an EV app and its cloud services. Our results showed that it is possible to launch attacks against
the connected infrastructure by taking advantage of vulnerabilities that may have substantial economic and
operational ramifications on the EV charging ecosystem. We conclude by providing mitigation suggestions
and future research directions.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
The AIRCC's International Journal of Computer Science and Information Technology (IJCSIT) is devoted to fields of Computer Science and Information Systems. The IJCSIT is a open access peer-reviewed scientific journal published in electronic form as well as print form. The mission of this journal is to publish original contributions in its field in order to propagate knowledge amongst its readers and to be a reference publication.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
This presentation was provided by Rebecca Benner, Ph.D., of the American Society of Anesthesiologists, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
A Visual Guide to 1 Samuel | A Tale of Two HeartsSteve Thomason
These slides walk through the story of 1 Samuel. Samuel is the last judge of Israel. The people reject God and want a king. Saul is anointed as the first king, but he is not a good king. David, the shepherd boy is anointed and Saul is envious of him. David shows honor while Saul continues to self destruct.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Temple of Asclepius in Thrace. Excavation resultsKrassimira Luka
The temple and the sanctuary around were dedicated to Asklepios Zmidrenus. This name has been known since 1875 when an inscription dedicated to him was discovered in Rome. The inscription is dated in 227 AD and was left by soldiers originating from the city of Philippopolis (modern Plovdiv).
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
Grammar Based Pre-Processing for PPM
International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 1, February 2017
DOI: 10.5121/ijcsit.2017.9101
GRAMMAR-BASED PRE-PROCESSING FOR PPM

William J. Teahan¹ and Nojood O. Aljehane²

¹Department of Computer Science, Bangor University, Bangor, UK
²Department of Computer Science, University of Tabuk, Tabuk, Saudi Arabia
ABSTRACT
In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching
(PPM) compression algorithm. This achieves significantly better compression for different natural
language texts compared to other well-known compression methods. Our method first generates a grammar
based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in
the text being compressed and then substitutes these sequences using the respective non-terminal symbols
defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly
improved results in compression for various natural languages (a 5% improvement for American English,
10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We
describe further improvements using a two pass scheme where the grammar-based pre-processing is
applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary
Corpus and also achieve significantly improved results in compression, between 11% and 20%, when
compared with other compression algorithms, including a grammar-based approach, the Sequitur
algorithm.
KEYWORDS
CFG, Grammar-based, Pre-processing, PPM, Encoding.
1. INTRODUCTION
1.1. PREDICTION BY PARTIAL MATCHING
The Prediction by Partial Matching (PPM) compression algorithm is one of the most effective forms of statistical compression. First described by Cleary and Witten in 1984 [1], there are many variants of the basic algorithm, such as PPMA and PPMB [1], PPMC [2], PPMD [3], PPM* [4], PPMZ [5] and PPMii [6]. The prediction in PPM depends on a bounded number of previous characters or symbols. To predict the next character or symbol, PPM uses models of different orders, starting from the highest order down to the lowest. An escape probability handles the case where a new symbol appears in the context [1, 2]. Despite its cost in terms of memory and execution speed, PPM usually attains better compression rates than other well-known compression methods.
One of the primary motivations for our research is the application of PPM to natural language
processing. Therefore, we report in this paper results on the use of PPM on natural language texts
as well as results on the Calgary Corpus, a standard corpus used to compare text compression
algorithms. PPM has achieved excellent results in various natural language processing
applications such as language identification and segmentation, text categorisation, cryptology,
and optical character recognition (OCR) [7].
Each symbol or character in PPM is encoded using arithmetic coding with the probabilities estimated by the different models [8]. To illustrate the operation of PPM, Table 1 shows the state of the models of orders k = 2, 1, 0 and -1 after the input string "abcdbc" has been processed. In this example, to estimate the probability of symbols in the contexts, the PPM algorithm starts from the highest order, k = 2. If the context successfully predicts the next symbol or character, the probability associated with that symbol is used to encode it. Otherwise, an escape probability is estimated and encoded to let the encoder move down to the next highest order, in this case k = 1, and so on until (if needed) the encoder reaches the lowest order, k = -1, where every symbol or character is encoded with the same probability 1/|A|, with A being the alphabet used in the contexts. Experiments show that the maximum order that usually gives good compression rates for English is five [1][8][7]. For Arabic text, experiments show that an order seven PPM model gives a good compression rate [9].
Table 1: PPMC model after processing the string "abcdbc".
For example, if "c" followed the string "abcdbc", the order 2 model would encode it directly with the probability listed in Table 1, because a successful prediction ("ab→c") can be made. Suppose instead that "a" follows the input string "abcdbc". The escape probability for the order 2 model would be encoded arithmetically, and the encoding process drops from the order 2 model down to the order 1 model. In order 1, the encoder still does not predict "a", so another escape is encoded and the process drops down to order 0, where a probability for "a" can at last be encoded. Therefore, to encode "a", the total probability is the product of the two escape probabilities and the order 0 probability for "a", which amounts to 5.2 bits.
If a previously unseen character such as "n" follows the input string "abcdbc", the process escapes all the way down to the lowest order, the k = -1 model. In this model, all symbols are encoded with the same probability 1/|A|, where A is the alphabet. Supposing the size of the alphabet is 256, the probability for the new character "n" is 1/256 in the order -1 model. Therefore, the character "n" is encoded using the product of the escape probabilities for orders 2, 1 and 0 and the probability 1/256, which amounts to 11.4 bits.
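To make the escape mechanism concrete, the following minimal Python sketch (our own illustration, not code from the paper) builds the context counts for "abcdbc" and computes the code length of the next symbol by backing off through the orders. It uses a simple PPMC-style escape estimate (escape count equal to the number of distinct symbols seen) and no exclusions, so the bit counts it prints are only close to, not identical with, the figures quoted above; the names train_counts and code_length are our own.

    import math
    from collections import defaultdict

    def train_counts(text, max_order=2):
        """Count symbol occurrences for every context of length 0..max_order."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, ch in enumerate(text):
            for order in range(max_order + 1):
                if i - order >= 0:
                    counts[text[i - order:i]][ch] += 1
        return counts

    def code_length(counts, context, symbol, alphabet_size=256, max_order=2):
        """Bits needed to encode `symbol` after `context`, escaping down the
        orders with a PPMC-style estimate and no exclusions."""
        bits = 0.0
        for order in range(max_order, -1, -1):
            ctx = context[-order:] if order else ""
            seen = counts.get(ctx, {})
            total, distinct = sum(seen.values()), len(seen)
            if total == 0:
                continue                      # context never seen: no cost
            if symbol in seen:
                return bits - math.log2(seen[symbol] / (total + distinct))
            bits -= math.log2(distinct / (total + distinct))   # escape
        return bits + math.log2(alphabet_size)  # order -1: uniform over alphabet

    counts = train_counts("abcdbc")
    print(code_length(counts, "bc", "a"))   # escapes down to order 0 (about 5.3 bits here)
    print(code_length(counts, "bc", "n"))   # unseen symbol, order -1 (about 11.3 bits here)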
As our method uses both grammar-based and text pre-processing techniques prior to compression,
we will now briefly discuss related work in these two areas.
1.2. GRAMMAR-BASED COMPRESSION ALGORITHMS
Grammar-based compression algorithms depend on using a context-free grammar (CFG) to help compress the input text. The grammar and the text are compressed by arithmetic coding or by different statistical encoders [10]. Two examples of grammar-based compression schemes are the Sequitur algorithm [11] and the Re-Pair algorithm [12].
Sequitur is an on-line algorithm that was developed by Nevill-Manning and Witten in 1997 [11].
Sequitur uses hierarchical structure as specified by a recursive grammar that is generated
iteratively in order to compress the text. Sequitur depends on repeatedly adding rules into the
grammar for the most frequent digram sequence (which may consist of terminal symbols that
appear in the text or non-terminal symbols that have been added to the grammar previously). The
rule for the start symbol S shows the current state of the corrected text sequence as it is being
processed. For instance, the text "abcdbcabc" is converted into three rules, with S as the start symbol and A and B as non-terminal symbols: S→BdAB, A→bc, B→aA.
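As a quick check that these three rules really describe the original text, the short Python snippet below (an illustration added here, not part of Sequitur itself) expands the grammar back into the string it represents.

    # Expand the example grammar S→BdAB, A→bc, B→aA back into its text.
    rules = {"S": "BdAB", "A": "bc", "B": "aA"}

    def expand(symbol):
        if symbol not in rules:
            return symbol                     # terminal symbol
        return "".join(expand(s) for s in rules[symbol])

    print(expand("S"))                        # prints "abcdbcabc"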
In contrast, the Re-Pair algorithm proposed by Larsson and Moffat [12] in 2000 is off-line. Like Sequitur, Re-Pair replaces the most frequent pair of symbols with a new symbol in the source message, essentially extending the alphabet. The frequencies of symbol pairs are then re-evaluated, and the process repeats until no pair of symbols occurs twice. Through this off-line process, what can be considered a dictionary is generated, and an explicit representation of this dictionary is encoded as part of the compressed message.
1.3. TEXT PRE-PROCESSING FOR DATA COMPRESSION
Abel and Teahan [13] discuss text pre-processing techniques that have been found useful for improving the overall compression of the Gzip, Burrows-Wheeler Transform (BWT) and PPM schemes. The techniques work in such a way that they can easily be reversed while decoding, in a post-processing stage that follows the decompression stage. Their paper surveys a long list of prior work in this area along with various new techniques, and presents experimental results that show significant improvement in overall compression for the Calgary Corpus. The methods most similar to the one described in this paper are the bigraph replacement scheme described by Teahan [7] and the token replacement scheme described by Abel and Teahan [13].
The main contribution of the work described in this paper is the improved pre-processing method for PPM. This is due to the discovery that, instead of using a fixed set of bigraphs for replacement from a standard reference source (such as the Brown corpus), significantly better compression results can be achieved by using bigraphs obtained from the text being compressed itself.
The rest of the paper is organised as follows. Our new approach is discussed in the next section.
Then we discuss experimental results on natural language texts and the Calgary Corpus by
comparing how well the new scheme performs compared to other well-known methods. The
summary and conclusions are presented in the final section.
2. GRAMMAR BASED PRE-PROCESSING FOR PPM (GR-PPM)
In this section, a new off-line approach based on Context Free Grammars (CFGs) is presented for compressing text files. This algorithm, which we call GR-PPM (short for Grammar-based pre-processing for PPM), uses both CFGs and PPM as a general-purpose adaptive compression method for text files.
In our approach, we essentially use the N most frequent n-graphs (e.g. bigraphs when n=2) in the source files to first generate a grammar with one rule for each n-graph. We then substitute every occurrence of these n-graphs in the text with the single unique non-terminal symbol specified by its rule in the grammar, in a single pass through the file. (Further schemes described below may use multiple passes to repeatedly substitute commonly occurring sequences of n-graphs and non-terminal symbols.) This is done during the pre-processing phase prior to the compression phase, in a manner that allows the text to be easily regenerated during the post-processing stage. For bigraphs, for example, we call the variant of our scheme GRB-PPM (Grammar Bigraphs for PPM). Our new method shows good results when replacing the N most frequent bigraphs (for example, when N=100), but also when using trigraphs (n=3) in a variant which we have called GRT-PPM (Grammar Trigraphs for PPM).
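The paper gives no code for this step, so the following is only a minimal sketch of the grammar-generation pass under our own assumptions: the function name, the use of Unicode private-use characters as non-terminal symbols, and the start offset are not specified in the paper.

    import string
    from collections import Counter

    def build_grammar(text, n=2, N=100, start=0xE000):
        """One rule per n-graph for the N most frequent n-graphs of `text`.
        Spaces and punctuation are skipped (as in the paper), and only
        n-graphs occurring more than once are kept (rule utility)."""
        skip = set(string.punctuation) | set(string.whitespace)
        ngraphs = Counter(
            text[i:i + n] for i in range(len(text) - n + 1)
            if not any(ch in skip for ch in text[i:i + n])
        )
        frequent = [g for g, c in ngraphs.most_common(N) if c > 1]
        # Map each n-graph to a fresh non-terminal (private-use) symbol.
        return {g: chr(start + i) for i, g in enumerate(frequent)}

A corresponding substitution pass is sketched later in this section.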
Each natural language text contains a high percentage of common n-graphs which comprise a significant proportion of the text [7]. Substitution of these n-graphs using our context-free grammar scheme and standard PPM can significantly improve overall compression, as shown below. For example, natural languages contain common sequences of two characters (bigraphs) that often repeat in the same order in many different words, such as "th", "ed" and "in" in English, and "ال", "في" and "لا" in Arabic, and so on.
The frequencies of common sequences of characters in reference corpora (such as the Brown
Corpus for American English [14] and the LOB Corpus for British English [15]) can be used to
define the n-graphs that will be substituted (without the need to encode the grammar separately,
making it possible to have the algorithm work in an online manner rather than offline). However, although this method can be quite effective, what we have found to be most effective for our scheme is to determine the list of n-graphs that define the grammar in a single or double pass through the text being compressed prior to the compression phase, and then to encode the grammar separately along with the corrected text, which is itself encoded using PPM.
Our method replaces common sequences similar to both Sequitur and Re-Pair, but unlike them,
this is not done iteratively on the current most common digram sequence or phrase, but is done by
replacing the most common sequences as the text is processed from beginning to end in a single
pass (although further passes may occur later). Also, the PPM algorithm is used as the encoder once the common sequences have been replaced, whereas Re-Pair uses a dictionary-based approach for the encoding stage. Like Re-Pair, our method is only off-line during the phase which generates the grammar.
Our approach adapts the bigraph replacement text pre-processing approach of Teahan [7] by
using an offline technique to generate the list of bigraphs first from the source file being
compressed. This approach is considered within a grammar-based context, and the approach is
further extended by considering multiple passes and recursive grammars.
Figure 1 shows the whole process of GR-PPM. First, the CFG will be generated from the original
source files by taking the N most frequent n-graphs and replacing them with the non-terminal
symbols as defined by their rules in the grammar. After rules are produced, the sender will do
grammar-based pre-processing to correct the text. Then, the corrected text is encoded by using
PPMD, and the resulting compressed text is then sent to the receiver. The receiver then decodes
the text by using PPMD to decompress the compressed file that was sent. Grammar-based post-
processing then facilitates the reverse mapping by replacing the encoded non-terminal symbols
with the original n characters or n-graphs.
Figure 1: The complete process of grammar-based pre-processing for Prediction by Partial Matching (GR-PPM).
One final step is the need to encode the grammar so that it is also known by the receiver since, as
stated, we have found we can achieve better compression by encoding separately a grammar
generated from the source file itself rather than using a general grammar (known to both sender
and receiver) that was generated from a reference corpus. We choose to use a very simple method
for encoding the grammar: simply transmitting the lists of bigraphs or trigraphs directly. (So for a grammar containing N = 100 rules in the GRB-PPM scheme, where n = 2 and the character size is 1 byte (for ASCII text files, say), this incurs an overhead of N × n bytes, or 200 bytes, for each file being encoded.)
Table 2 illustrates the process of GRB-PPM using a line taken from the song by Manfred Mann (one published paper uses this song as a good example of repeating characters [12]). The sequence is "singing do wah diddy diddy dum diddy do". For GRB-PPM, for example, there are four bigraphs which are repeated more than once in the first pass through the text: in, do, di and dd. These bigraphs will thus be substituted with new non-terminal symbols, say A, B, C and D, respectively, and included in the grammar (for example, if we choose N = 4). Note that substitution is performed as the text is processed from left to right. If a bigraph is substituted, the process moves to the character following the bigraph before continuing.
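A sketch of this left-to-right substitution pass (our own illustration, paired with the build_grammar sketch above) is shown below; after an n-graph is replaced, scanning resumes at the character that follows it, so replacements never overlap.

    def substitute(text, grammar, n=2):
        """Replace n-graphs listed in `grammar` with their non-terminal
        symbols, scanning left to right and skipping past each substitution."""
        out, i = [], 0
        while i < len(text):
            g = text[i:i + n]
            if g in grammar:
                out.append(grammar[g])
                i += n                 # continue after the substituted n-graph
            else:
                out.append(text[i])
                i += 1
        return "".join(out)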
6. International Journal of Computer Science & Information Technology (IJCSIT) Vol 9, No 1, February 2017
6
Table 2: An example of how GR-PPM works for a sample text.

Pass   Grammar                       String & corrected strings
                                     singing.do.wah.diddy.diddy.dum.diddy.do
1st    A→in, B→do, C→di, D→dd        sAgAg.B.wah.CDy.CDy.dum.CDy.B
2nd    E→Ag, F→CD                    sEE.B.wah.Fy.Fy.dum.Fy.B
In GRBB-PPM (Grammar Bigraph to Bigraph for PPM), new rules are formed in a second pass through the corrected text that resulted from the first pass, for the further bigraphs Ag and CD. These are then represented with the non-terminals E and F, which are added to the expanded grammar. After all bigraphs have been substituted, the message is reduced to the new sequence sEE.B.wah.Fy.Fy.dum.Fy.B.
Note that we ignore spaces and any punctuation characters because, based on our experiments, including these symbols decreases the compression rate. Moreover, the grammar is transmitted to the receiver along with the corrected text, after all bigraphs in the original text have been substituted with their non-terminal symbols. In the above example, it is clear that the number of symbols is reduced from 31 symbols in the original text to 23 symbols in the first pass (GRB-PPM) and to 20 symbols in the next pass (GRBB-PPM).
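Using the substitute sketch above with the grammars of Table 2 written out explicitly (rather than derived automatically), the two passes can be reproduced as follows; the dots stand for the space bullets shown in the table.

    text = "singing.do.wah.diddy.diddy.dum.diddy.do"
    pass1 = {"in": "A", "do": "B", "di": "C", "dd": "D"}   # first-pass grammar
    pass2 = {"Ag": "E", "CD": "F"}                         # second-pass grammar

    step1 = substitute(text, pass1)    # "sAgAg.B.wah.CDy.CDy.dum.CDy.B"
    step2 = substitute(step1, pass2)   # "sEE.B.wah.Fy.Fy.dum.Fy.B"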
The grammars in both GRB-PPM and GRT-PPM share the same characteristic: no pair of characters appears in the grammar more than once. This property ensures that every bigraph in the grammar is unique, a property called non-terminal uniqueness, using the same terminology proposed by Nevill-Manning and Witten [11]. To make sure that each rule in the grammar is useful, a second property, referred to as rule utility, requires that every rule in the grammar is used more than once in the corrected text sequence. Both properties hold for the grammars that GR-PPM generates and are discussed in more detail next.
2.1. NON-TERMINAL UNIQUENESS
For the more general case (i.e. considering n-graphs, not just bigraphs), each n-graph has to appear only once in the grammar and is substituted by a single non-terminal symbol. To prevent the same n-graph from occurring elsewhere in the corrected text sequence, each occurrence is substituted according to the rule generated by the algorithm. In the example of Table 2, the most frequent bigraphs that form the grammar are added at the same time as the text is processed in each pass. For example, when in appears in the text more than once, the new non-terminal symbol A is substituted. On the other hand, in the Sequitur algorithm [11], only the most frequent digram (i.e. bigraph in our terminology) is added incrementally to the grammar.
2.2. RULE UTILITY
Every rule in the grammar should be used more than once to ensure that the rule utility constraint is satisfied. When di appears in the text more than once, the new non-terminal symbol C is substituted. In our approach, rule utility does not require the creating and deleting of rules, which makes the rules more stable. This method retains the tracking of long files, thus avoiding
the requirement of exterior data structures. However, in the Sequitur algorithm, when the new letter in the input appears in the first rule S, the grammar creates a new rule and deletes the old digram [11]. The process of deleting and creating rules in the grammar each time a new symbol appears is unstable and inefficient. In the Sequitur approach, the grammar is being dynamically created. To avoid separate data structures, we apply a multi-pass approach for GR-PPM and add multiple symbols to the grammar at the same time, and this gives greater stability and efficiency.

2.3. HIERARCHICAL GRAMMATICAL STRUCTURES

In order to further illustrate our method, Figures 2a and 2b show the hierarchical grammatical structures that are generated by our approach for two different languages, English and Arabic. The hierarchical grammatical structures are formed based on the most frequent two characters or bigraph in each text. For example, in Figure 2a, the word the is split into th and e (th is the bigraph in GRB-PPM, and the non-terminal that represents it and the letter e forms the second bigraph for the next pass for GRBB-PPM), and so on for other words in the texts.

Figure 2: Hierarchical structure in the grammars generated by our algorithm for sample sequences in two languages: (a) English (b) Arabic.

The same algorithm generates the Arabic version in Figure 2b, where the word ﻛﻔﻪ is also split into ﻛﻒ and ﻩ in a similar way (ﻛﻒ is the bigraph and ﻩ is the second bigraph). We use bullets for spaces to make them more visible, but nevertheless, spaces will not be considered in our algorithms and structures. On the other hand, in the Sequitur algorithm, spaces are part of the algorithm, which means they are also part of the grammatical structures generated by the algorithm.

Figure 3: Pseudo-code for the GR-PPM algorithm.

Figure 3 summarises the algorithm using pseudo-code. Line 1 is a loop defining how many passes the algorithm performs (from 1 up to a maximum of P passes). Lines 2 to 4 are a loop to find the N most frequent bigraphs and substitute them with non-terminal symbols. Lines 3 through 5 implement the bigraph utility constraints. Line 7 compresses the final text file by using PPMD after substituting the N bigraphs.
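Since the pseudo-code of Figure 3 is not reproduced here, the sketch below gives our own reading of the loop structure it describes, built on the build_grammar and substitute functions sketched earlier; the final PPMD compression step is only indicated by a comment.

    def gr_ppm_preprocess(text, n=2, N=100, passes=2):
        """Multi-pass GR-PPM pre-processing: each pass builds a grammar from
        the N most frequent n-graphs of the current text and substitutes them."""
        grammars = []
        for p in range(passes):
            # Offset the non-terminal range so symbols from different passes
            # never collide (a detail the paper leaves open).
            grammar = build_grammar(text, n=n, N=N, start=0xE000 + p * N)
            if not grammar:
                break
            text = substitute(text, grammar, n=n)
            grammars.append(grammar)
        # The corrected text would now be compressed with PPMD, and the grammars
        # (the lists of n-graphs) transmitted alongside it.
        return text, grammars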
3. EXPERIMENTAL RESULTS
This section discusses experimental results for the variants of our method GR-PPM described above for the compression of various text files, compared with other well-known schemes.
We have found that variants of the GR-PPM algorithm achieve the best compression ratio for
texts in different languages, such as English, Arabic, Chinese, Welsh and Persian. Also, we have
compared the results with those of different compression methods that are known to obtain good
results, including Gzip and Bzip2. BS-PPM uses order 4 PPMD to compress UTF-8 text files and
is a recently published variant of PPM that achieves excellent results for natural language texts
[9]. For both GRB-PPM and GRT-PPM, we use the 100 most frequent bigraphs or trigraphs and
order 4 PPMD for the encoding stage.
Table 3 compares the results of using GRB-PPM and GRT-PPM (using N = 100 and order 4
PPMD) with different versions of well-known compression methods, such as Gzip, Bzip2 and
BS-PPM order 4. (PPMD has become the de facto standard for comparing variants of the PPM
algorithm, and so is included with our results listed here.) We do these experiments with different
text files in different languages, including American English, British English, Arabic, Chinese,
Welsh and Persian. We use the Brown corpus for American English and LOB for British English.
For Arabic, we use the BACC [16]. For Persian, the Hamshahri corpus is used [17]. The LCMC
corpus is used for Chinese [18] and the CEG corpus is used for Welsh [19].
It is clear that GRB-PPM achieves the best compression rate (shown in bold font) in bits per
character (bpc) for all cases in different languages. Also, GRB-PPM is significantly better than
various other compression methods. For instance, for Arabic text, GRB-PPM shows a nearly 45%
improvement over Gzip and approximately 15% improvement over Bzip2. For Chinese, GRB-
PPM shows a 36% improvement over BS-PPM and 38% improvement over Gzip. For the Brown
corpus, GRB-PPM shows nearly a 35% improvement over Gzip and approximately 15%
improvement over Bzip2. For the LOB corpus, GRB-PPM shows a 36% improvement over Gzip
and 15% improvement over Bzip2. For Welsh, GRB-PPM shows a 31% improvement over BS-
PPM and 48% improvement over Gzip. For Persian, GRB-PPM shows nearly a 50%
improvement over Gzip and approximately 22% improvement over Bzip2.
GRBB-PPM and GRTT-PPM, which are the second passes of GRB-PPM and GRT-PPM
respectively, achieve better compression ratios than their single pass variants for all the different
language texts, such as English, Arabic, Welsh and Persian. In GRBB-PPM and GRTT-PPM, we
use the 100 most frequent bigraphs for both passes and order 4 PPMD for the encoding stage as
before.
Table 3: GRB-PPM and GRT-PPM compared with other compression methods for different natural language texts.

File        Language          Size      Gzip   Bzip2  PPMD4  BS-PPM  GRT-PPM  GRB-PPM
                                        (bpc)  (bpc)  (bpc)  (bpc)   (bpc)    (bpc)
Brown       American English  5968707   3.05   2.33   2.14   2.11    2.03     1.99
LOB         British English   6085270   2.95   2.22   2.03   2.10    1.92     1.88
LCMC        Chinese           5379203   2.58   1.64   1.61   2.49    1.61     1.59
BACC        Arabic            69497469  2.14   1.41   1.68   1.32    1.29     1.21
Hamshahri   Persian           53472934  2.58   1.64   1.68   1.25    1.31     1.22
CEG         Welsh             6753317   2.91   1.95   1.69   2.20    1.57     1.51
Average                                 2.14   1.86   1.80   1.91    1.62     1.56
Table 4 shows the results of the different single pass and double pass variants of GR-PPM. The
results for the single pass variants GRB-PPM and GRT-PPM have been included from Table 3 for
ease of comparison as well as the results of the grammar-based compression algorithm Sequitur.
It is clear that GRBB-PPM achieves the best compression rate (bpc) for almost all cases across the different language texts, with only the single result on the Chinese text, the LCMC corpus, being better for the Sequitur algorithm. For Arabic, GRBB-PPM shows a 29% improvement over PPMD and a nearly 19% improvement over the Sequitur algorithm. For American English (Brown), GRBB-PPM shows an 8% improvement over PPMD and a nearly 23% improvement over Sequitur. For British English (LOB), GRBB-PPM shows a 20% improvement over the Sequitur algorithm. For Welsh, GRBB-PPM shows a 12% improvement over PPMD and a 27% improvement over the Sequitur algorithm. For Persian, GRBB-PPM shows a 27% improvement over PPMD and a nearly 14% improvement over Sequitur.
Table 4: Variants of GR-PPM compared with other compression methods for different natural language texts.

File        PPMD Order4  Sequitur  GRT-PPM  GRTT-PPM  GRB-PPM  GRBB-PPM
            (bpc)        (bpc)     (bpc)    (bpc)     (bpc)    (bpc)
Brown       2.14         2.55      2.03     2.00      1.99     1.97
LOB         2.03         2.34      1.92     1.90      1.88     1.88
LCMC        1.61         1.45      1.61     1.61      1.59     1.59
BACC        1.68         1.47      1.29     1.29      1.21     1.20
Hamshahri   1.68         1.42      1.31     1.31      1.22     1.22
CEG         1.69         2.04      1.57     1.54      1.51     1.49
Average     1.80         1.87      1.62     1.60      1.56     1.55
Table 5 shows the compression rates for PPMC, Sequitur [20], Gzip, GRB-PPM and GRBB-PPM on the Calgary corpus. Overall, the GRBB-PPM algorithm outperforms all of these well-known compression methods. GRBB-PPM shows on average a nearly 19% improvement over the Sequitur algorithm, and on average 17%, 12% and 1% improvements over Gzip, PPMC and GRB-PPM, respectively. Although GRBB-PPM achieves similar results to GRB-PPM on the book1, book2, news and pic files, GRBB-PPM is better than GRB-PPM for the other files.
Table 5: Performance of various compression schemes on the Calgary Corpus.

File     Size     PPMC   Gzip   Sequitur  GRB-PPM  GRBB-PPM
         (bytes)  (bpc)  (bpc)  (bpc)     (bpc)    (bpc)
bib      111261   2.12   2.53   2.48      1.87     1.85
book1    768771   2.52   3.26   2.82      2.25     2.25
book2    610856   2.28   2.70   2.46      1.91     1.91
news     377109   2.77   3.07   2.85      2.32     2.32
paper1   53161    2.48   2.79   2.89      2.34     2.32
paper2   82199    2.46   2.89   2.87      2.29     2.26
pic      513216   0.98   0.82   0.90      0.81     0.81
progc    39611    2.49   2.69   2.83      2.36     2.33
progl    71646    1.87   1.81   1.95      1.66     1.61
progp    49379    1.82   1.82   1.87      1.70     1.64
trans    93695    1.75   1.62   1.69      1.48     1.45
Average           2.14   2.29   2.32      1.91     1.88
4. SUMMARY AND CONCLUSIONS
In this paper, we have described new algorithms for improving the compression of different natural language texts. These algorithms work by substituting repeated symbols (bigraphs or trigraphs) with non-terminal symbols defined by grammatical rules in a CFG before using PPM to compress the text files. The algorithms are governed by two constraints: non-terminal uniqueness and rule utility. These techniques also work well as general-purpose compression methods for text files.
REFERENCES
[1] J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,”
Commun. IEEE Trans., vol. 32, no. 4, pp. 396–402, 1984.
[2] A. Moffat, “Implementing the PPM data compression scheme,” IEEE Trans. Commun., vol. 38, no.
11, pp. 1917–1921, 1990.
[3] P. Howard, “The design and analysis of efficient lossless data compression systems,” Ph.D.
dissertation, Dept. Comput. Sci.,Brown Univ., Providence, RI, Jun. 1993.
[4] J. Cleary and W. Teahan, "Unbounded length contexts for PPM," The Computer Journal, vol. 40, nos. 2 and 3, pp. 67–75, Feb. 1997.
[5] C. Bloom. “Solving the problems of context modeling.” Informally published report, see
http://www.cbloom.com/papers. 1998.
[6] D. Shkarin. “PPM: One step to practicality”. Proc. Data Compression Conference, pp. 202-211,
2002. IEEE.
[7] W. Teahan, “Modelling English text,” Ph.D. dissertation, School of Computer Science, University of
Waikato, 1998.
[8] I. Witten, R. Neal and J. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, no. 6, June 1987.
[9] W. Teahan and K. M. Alhawiti, “Design, compilation and preliminary statistics of compression
corpus of written Arabic,” Technical Report, Bangor University, School of Computer Science,
2013.
[10] J. Kieffer and E. Yang, “Grammar-based codes: a new class of universal lossless source codes,” Inf.
Theory, IEEE Trans., vol. 46, no. 3, pp. 737–754, May. 2000.
[11] C. Nevill-Manning and I. Witten, "Identifying hierarchical structure in sequences: A linear-time algorithm," J. Artif. Intell. Res. (JAIR), vol. 7, pp. 67–82, 1997.
[12] N. Larsson and A. Moffat, “Off-line dictionary-based compression,” Proc. IEEE, vol. 88, pp. 1722–
1732, Nov. 2000.
[13] J. Abel and W. Teahan, “Universal text pre-processing for data compression,” IEEE Transactions on
Computers, 54.5: 497-507, 2005.
[14] W. Francis and H. Kucera, "Brown Corpus Manual," Brown University, 1979.
[15] S. Johansson, "The Tagged LOB Corpus: User's Manual," 1986.
[16] W. Teahan and K. Alhawiti, "Pre-processing for PPM: Compressing UTF-8 encoded natural language text," Int. J. Comput., vol. 7, no. 2, pp. 41–51, Apr. 2015.
[17] Ale Ahmad et al., “Hamshahri: A standard Persian text collection,” Knowledge-Based System, vol.
22, no. 5, pp. 382–387, 2009.
[18] A. McEnery and Z. Xiao, “The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual
and contrastive language study,” Religion, vol. 17, pp. 3–4, 2004.
[19] N.C. Ellis et al., “Cronfa Electroneg o Gymraeg (CEG): a 1 million word lexical database and
frequency count for Welsh,” 2001.
[20] C. Nevill-Manning and I. Witten, “Compression and explanation using hierarchical grammars,”
Comput. J., vol. 40, no. 2/3, pp 103–116, 1997.