Data compression plays a significant role in minimizing storage size and accelerating data transmission over a communication channel. The quality of dictionary-based text compression is addressed here with a new approach: hybrid dictionary compression (HDC), which uses a single compression system instead of separate systems for characters, syllables, or words. HDC is a new technique that has proven itself especially on text files. Compression effectiveness is affected by the quality of the hybrid dictionary of characters, syllables, and words, and the dictionary is created with a forward analysis of the text file. In this research, the genetic algorithm (GA) is used as the search engine for obtaining this dictionary and performing the compression; GAs are stochastic search algorithms based on natural selection and the mechanisms of genetics. The research aims to combine Huffman's multiple trees and GAs in one compression system. Huffman coding is a compression method that creates the code lengths of a variable-length prefix code, and the genetic algorithm is used to search for the Huffman tree [1] that gives the best compression ratio. Because different codes for the text characters lead to different trees, GAs can serve as the search algorithm for the Huffman method: they create a population of candidate solutions and guide the search towards better ones; after evaluation, the fittest individuals survive and their genetic information is reused.
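As a concrete illustration of this idea, the sketch below (a minimal, hypothetical Python example, not the paper's implementation) shows how a GA fitness function could score one candidate hybrid dictionary: the text is greedily segmented with the dictionary entries, the resulting symbols are Huffman-coded, and the total coded length in bits is the quantity the GA minimizes.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Code length per symbol for a variable-length prefix (Huffman) code."""
    if len(freqs) == 1:                       # degenerate single-symbol case
        return {next(iter(freqs)): 1}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1                     # merged symbols sit one level deeper
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return depth

def greedy_segment(text, dictionary):
    """Split text into dictionary entries where possible, single characters otherwise."""
    entries = sorted(dictionary, key=len, reverse=True)
    symbols, i = [], 0
    while i < len(text):
        for e in entries:
            if e and text.startswith(e, i):
                symbols.append(e)
                i += len(e)
                break
        else:
            symbols.append(text[i])
            i += 1
    return symbols

def fitness(text, dictionary):
    """Total Huffman-coded size in bits -- the smaller, the fitter the individual."""
    freqs = Counter(greedy_segment(text, dictionary))
    lengths = huffman_code_lengths(freqs)
    return sum(lengths[s] * f for s, f in freqs.items())

# A GA individual could simply be the set of syllables/words kept in the dictionary:
print(fitness("the cat sat on the mat", {"the ", "at", " "}))
```

Under this sketch, selection, crossover, and mutation would operate on the dictionary membership itself, with the bit count above as the objective to minimize.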
GENDER AND AUTHORSHIP CATEGORISATION OF ARABIC TEXT FROM TWITTER USING PPM (ijcsit)
In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order-11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
DNA data compression algorithms based on redundancy (ijfcstjournal)
Carl Jung spoke of a 'collective unconscious', i.e. we are all connected to each other in some way or another via our DNA. DNA has four bases: a (adenine), c (cytosine), g (guanine) and t (thymine). Each of these bases can be represented by two bits, since 2^2 = 4, e.g. a–00, c–01, g–11 and t–10, although this assignment is arbitrary. Redundancy within a sequence is therefore likely to exist, which is why this paper explores different types of repeat to compress DNA: direct repeats, palindromes (reverse direct repeats), inverted exact repeats (complementary palindromes or exact reverse complements), inverted approximate repeats (approximate complementary palindromes or approximate reverse complements), interspersed or dispersed repeats, and flanking or terminal repeats. Better compression gives better network speed and saves storage space.
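For reference, the two-bit base coding mentioned above is easy to sketch; the small Python snippet below is an illustration of that packing only, not the repeat-based algorithms the paper studies.

```python
# 2-bit base encoding (a=00, c=01, g=11, t=10); the mapping is arbitrary,
# only its consistency between packing and unpacking matters.
CODE = {'a': 0b00, 'c': 0b01, 'g': 0b11, 't': 0b10}
BASE = {v: k for k, v in CODE.items()}

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    nbits = 2 * len(seq)
    return bits.to_bytes((nbits + 7) // 8, "big"), len(seq)

def unpack(data, n):
    """Recover the original string of n bases from the packed bytes."""
    bits = int.from_bytes(data, "big")
    return "".join(BASE[(bits >> (2 * (n - 1 - i))) & 0b11] for i in range(n))

packed, n = pack("acgtacca")
assert unpack(packed, n) == "acgtacca"   # 8 bases fit in 2 bytes instead of 8
```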
Comparison of Compression Algorithms in text data for Data Mining (IJAEMSJORNAL)
Text data compression is the process of reducing text size by encoding it. In daily practice we often reach the point where the physical limitations on storing or transmitting a text must be addressed, so text compression methods should be adaptive and effective. Compressing text data without losing the original content is important because it significantly reduces storage and communication costs. This study focuses on different techniques for encoding text files and compares them for data-mining purposes. Several memory-efficient encoding schemes are analyzed and implemented: Shannon-Fano coding, Huffman coding, comparative Huffman coding, LZW coding and run-length coding. The analysis shows how these coding techniques work, how much compression each achieves, and how much memory each requires, so the techniques can be compared to find out which one is better under which conditions. Experiments show that comparative Huffman coding yields a higher compression ratio, and that an improved RLE encoding performs better than the typical implementation when compressing text data.
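Run-length encoding, the simplest of the schemes compared above, can be sketched in a few lines; the toy Python version below is only illustrative (practical RLE variants add escape codes so that short runs are not expanded).

```python
def rle_encode(s):
    """Collapse each run of identical characters into a (char, count) pair."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)

pairs = rle_encode("aaaabbbcca")
assert pairs == [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(pairs) == "aaaabbbcca"
```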
Performance analysis on secured data method in natural language steganography (journalBEEI)
The rapid growth in exchanged information that drove the expansion of the internet during the last decade has motivated research in this field, and steganography approaches have recently received unexpected attention. The aim of this paper is therefore to review different performance metrics, covering decoding, decrypting and extracting. Data decoding interprets the received hidden message into a code word; data encryption is the standard way to provide secure communication; decrypting takes an encrypted text and converts it back into the original text; and data extracting is the reverse of the data embedding process. The effectiveness evaluation is mainly determined by the performance metrics, and researchers aim to improve these characteristics. The objective of this paper is to present a review of natural language steganography based on the criteria of performance analysis. The findings clarify the preferred performance metrics used, and the review is intended to help future research in evaluating the performance of natural language steganography in general and of the proposed secured data methods in particular.
Text Steganography Using Compression and Random Number Generators (Editor IJCATR)
Many techniques, such as steganography and cryptography, are used to protect and hide information from unauthorized users. Steganography hides a message inside another message without raising suspicion, while cryptography scrambles a message to conceal its contents. This paper presents a new text steganography approach that works with different languages: based on Pseudorandom Number Generation (PRNG), it embeds the secret message into a generated random cover-text, and the output (stego-text) is compressed to reduce its size. At the receiver side the reverse of these operations is carried out to recover the original message. Two secret keys (a hiding key and an extraction key) are used at both ends for authentication, in order to achieve a high level of security. The model has been applied to different message languages and to both encrypted and unencrypted messages, and the experimental results show the model's capacity and the similarity test values.
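The snippet below is a hedged sketch of this general idea, not the paper's exact scheme: a shared key seeds a PRNG that decides where the secret characters are interleaved among randomly generated cover characters, the stego-text is compressed, and the receiver re-runs the same PRNG to extract the message.

```python
import random, string, zlib

def hide(secret, key, spread=5):
    pos_rng = random.Random(key)      # keyed: decides the gap before each secret char
    cover_rng = random.Random()       # unkeyed: produces throw-away cover characters
    stego = []
    for ch in secret:
        gap = pos_rng.randint(1, spread)
        stego.extend(cover_rng.choice(string.ascii_letters) for _ in range(gap))
        stego.append(ch)
    return zlib.compress("".join(stego).encode())   # the stego-text is then compressed

def reveal(blob, key, length, spread=5):
    pos_rng = random.Random(key)
    text = zlib.decompress(blob).decode()
    out, i = [], 0
    for _ in range(length):
        i += pos_rng.randint(1, spread)              # skip the same gap the sender inserted
        out.append(text[i])
        i += 1
    return "".join(out)

blob = hide("meet at noon", key=1234)
assert reveal(blob, key=1234, length=len("meet at noon")) == "meet at noon"
```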
Sentence Validation by Statistical Language Modeling and Semantic Relations (Editor IJCATR)
This paper deals with sentence validation, a sub-field of Natural Language Processing. It finds applications in many areas, since it deals with understanding and manipulating natural language (English in most cases); the effort is thus on understanding and extracting the important information delivered to the computer and making efficient human-computer interaction possible. Sentence validation is approached in two ways: statistically and semantically. In both approaches a database is trained with sample sentences from the Brown corpus in NLTK. The statistical approach uses a trigram technique based on the N-gram Markov model with modified Kneser-Ney smoothing to handle zero probabilities. As a further statistical test, sentences containing named entities are tagged and chunked using pre-defined grammar rules and semantic tree parsing, and the chunked sentences are fed into another database on which testing is carried out. Finally, semantic analysis is performed by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations are studied; a comparison of the three models is thus calculated and formulated, and graphs of the probabilities produced by the three approaches clearly demarcate them and throw light on the findings of the project.
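As a rough illustration of the statistical part, the sketch below builds trigram counts from the NLTK Brown corpus and scores a sentence with a simple backoff estimate; the backoff is assumed here only to keep the example short and is a stand-in for the modified Kneser-Ney smoothing the paper actually uses.

```python
import math
from collections import Counter
from nltk.corpus import brown          # assumes nltk and its Brown corpus are installed

uni, bi, tri = Counter(), Counter(), Counter()
for sent in brown.sents():
    toks = ["<s>", "<s>"] + [w.lower() for w in sent] + ["</s>"]
    uni.update(toks)
    bi.update(zip(toks, toks[1:]))
    tri.update(zip(toks, toks[1:], toks[2:]))
total = sum(uni.values())

def prob(w1, w2, w3, alpha=0.4):
    """Backoff estimate of P(w3 | w1, w2); a stand-in for modified Kneser-Ney."""
    if tri[(w1, w2, w3)]:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)]:
        return alpha * bi[(w2, w3)] / uni[w2]
    return max(alpha * alpha * uni[w3] / total, 1e-12)   # floor for unseen words

def sentence_logprob(words):
    toks = ["<s>", "<s>"] + [w.lower() for w in words] + ["</s>"]
    return sum(math.log(prob(*toks[i:i + 3])) for i in range(len(toks) - 2))

print(sentence_logprob("the jury said the election was conducted".split()))
```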
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning (ijtsrd)
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP); NLG can generate natural language from a machine representation. Generating suggestions for a sentence is particularly difficult for Indian languages, one major reason being that they are morphologically rich and their word order is essentially the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, a possible set of corrections can be generated for the erroneous part of a sentence. To effectively generate a set of sentences with meaning equivalent to the original sentence, the deep learning (DL) approach trains a model on this task, which requires thousands of example inputs and outputs. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models; however, selecting appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing how well they capture word similarities against existing benchmark datasets of word-pair similarities, and we conduct a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
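An intrinsic evaluation of this kind is straightforward to sketch. The Python fragment below is illustrative only, with placeholder inputs: it correlates a model's cosine similarities with human word-pair judgements using Spearman's rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def evaluate(embeddings, benchmark):
    """embeddings: {word: vector}; benchmark: [(word1, word2, human_score), ...]."""
    model_scores, human_scores = [], []
    for w1, w2, gold in benchmark:
        if w1 in embeddings and w2 in embeddings:       # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# e.g. evaluate(word2vec_vectors, wordsim_pairs)  # hypothetical variable names
```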
Detection and Privacy Preservation of Sensitive Attributes Using Hybrid Approach (rahulmonikasharma)
Privacy Preserving Record Linkage (PPRL) is a major area of database research concerned with linking huge, heterogeneous data sets that hold disjoint or additional information about an entity while veiling its private information. This paper gives an enhanced algorithm for merging two datasets using a sorted-neighborhood deterministic approach, together with an improved preservation algorithm that automatically selects sensitive attributes and applies pattern mining over dynamic queries. The approach offers strong privacy, low computational complexity and scalability, and addresses legitimate concerns over data security and privacy.
5 Lessons Learned from Designing Neural Models for Information Retrieval (Bhaskar Mitra)
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching (IJERA Editor)
Short tandem repeats (STRs) have become important molecular markers for a broad range of applications, such as genome mapping and characterization, phenotype mapping, marker-assisted selection of crop plants, and a range of molecular ecology and diversity studies. These repeated DNA sequences are found in both plants and bacteria. Most computer programs that find STRs fail to report the number of occurrences and exact positions of the repeated pattern, and it is difficult to obtain accurate results from larger datasets, so high-performance computing models are needed to extract certain repeats. One solution is STR detection using parallel string matching, which gives the number of occurrences with the corresponding line number and the exact location of each STR in a genome of any length. We implemented parallel string matching using Java multithreading with multi-core processing: a basic algorithm was implemented and compared with previous algorithms such as Knuth-Morris-Pratt, Boyer-Moore and brute-force string matching, and the results show that the new basic algorithm outperforms them. We then apply this algorithm in parallel string matching using multithreading to reduce the running time on multicore processors. The test results show that multicore processing is remarkably efficient and powerful compared to the sequential versions, and that the proposed STR parallel string matching algorithm is better than the sequential approaches.
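The chunk-and-overlap idea behind such parallel exact matching can be sketched briefly. The paper's implementation is in Java with multithreading; the Python analogue below is only an illustration, splitting the sequence into chunks that overlap by len(pattern) - 1 characters so no occurrence is lost at a boundary and scanning each chunk in a separate worker process.

```python
from multiprocessing import Pool

def find_in_chunk(args):
    """Return absolute start positions of pattern within one chunk."""
    chunk, offset, pattern = args
    hits, i = [], chunk.find(pattern)
    while i != -1:
        hits.append(offset + i)
        i = chunk.find(pattern, i + 1)
    return hits

def parallel_find(sequence, pattern, workers=4):
    size = max(len(pattern), len(sequence) // workers + 1)
    jobs = [(sequence[s:s + size + len(pattern) - 1], s, pattern)
            for s in range(0, len(sequence), size)]
    with Pool(workers) as pool:
        return sorted(p for hits in pool.map(find_in_chunk, jobs) for p in hits)

if __name__ == "__main__":
    print(parallel_find("acgtacgtgacgacgt", "acg"))   # -> [0, 4, 9, 12]
```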
A Comparative Result Analysis of Text Based Steganographic Approaches (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
EXPERIMENTING WITH THE NOVEL APPROACHES IN TEXT STEGANOGRAPHY (IJNSA Journal)
As is commonly known, the steganographic algorithms employ images, audio, video or text files as the medium to ensure hidden exchange of information between multiple contenders to protect the data from the prying eyes. However, using text as the target medium is relatively difficult as compared to the other target media, because of the lack of available redundant information in a text file. In this paper, in the backdrop of the limitations in the prevalent text based steganographic approaches, we propose simple, yet novel approaches that overcome the same. Our approaches are based on combining the random character sequence and feature coding methods to hide a character. We also analytically evaluate the approaches based on metrics viz. hiding strength, time overhead and memory overhead entailed. As compared to other methods, we believe the approaches proposed impart increased randomness and thus aid higher security at lower overhead.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
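The dual-embedding scoring described above can be sketched compactly. The fragment below is a simplified illustration, not the exact published formulation: it averages cosine similarities between query words in the IN space and document words in the OUT space, and mixes the result with a term-counting score.

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def desm_score(query_words, doc_words, in_emb, out_emb):
    """in_emb / out_emb: {word: vector} for the input and output projections."""
    sims = [_cos(in_emb[q], out_emb[d])
            for q in query_words if q in in_emb
            for d in doc_words if d in out_emb]
    return sum(sims) / len(sims) if sims else 0.0

def mixed_score(query_words, doc_words, in_emb, out_emb, tf_score, lam=0.1):
    """Linear mixture of the embedding score and a term-counting score (e.g. TF-IDF)."""
    return lam * desm_score(query_words, doc_words, in_emb, out_emb) + (1 - lam) * tf_score
```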
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD TAGGING AND RECOLORING (IJNSA Journal)
This paper introduces a multi-layer hybrid text steganography approach that utilizes word tagging and recoloring. Existing approaches are designed to be strong in imperceptibility, or in hiding capacity, or in robustness, but rarely all three. The proposed approach does not use the ordinary sequential insertion process; it overcomes the issues of current approaches and pursues imperceptibility, high hiding capacity, and robustness together by combining a linguistic technique with a format-based technique. The linguistic technique divides the cover text into embedding layers, where each layer consists of a sequence of words sharing a single part of speech detected by a POS tagger, while the format-based technique recolors the letters of the cover text with a near RGB color coding to embed 12 bits of the secret message in each letter, which leads to high hiding capacity and blind embedding. Robustness is accomplished through the multi-layer embedding process, and the generated stego key significantly strengthens the security of the embedded messages and their size. The experimental comparison shows that the proposed approach is better than currently developed approaches at providing an ideal balance between the imperceptibility, hiding-capacity, and robustness criteria.
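One plausible way to pack 12 bits into a small ("near") RGB perturbation, shown below purely as an illustration and not necessarily the paper's exact coding, is to keep the high nibble of each channel and overwrite the low nibble of R, G and B with 4 message bits each.

```python
def embed12(rgb, bits12):
    """Hide a 12-bit value in the low nibbles of an (r, g, b) colour."""
    r, g, b = rgb
    return (
        (r & 0xF0) | ((bits12 >> 8) & 0x0F),
        (g & 0xF0) | ((bits12 >> 4) & 0x0F),
        (b & 0xF0) | (bits12 & 0x0F),
    )

def extract12(rgb):
    """Recover the 12-bit value from the low nibbles."""
    r, g, b = rgb
    return ((r & 0x0F) << 8) | ((g & 0x0F) << 4) | (b & 0x0F)

stego = embed12((10, 20, 30), 0xABC)
assert extract12(stego) == 0xABC
```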
Analysis review on feature-based and word-rule based techniques in text steganography (journalBEEI)
This paper presents several techniques used in text steganography in terms of feature-based and word-rule-based methods, and analyses their performance and the metrics used to evaluate them. The aim is to identify the main techniques of text steganography, namely feature-based and word-rule-based, and to recognize the various techniques used with them. The primary technique used in text steganography was found to be the feature-based technique, owing to its simplicity and security, while the common metrics utilized were security, capacity, robustness, and embedding time. Future efforts are suggested to focus on the methods used in text steganography.
Review on feature-based method performance in text steganography (journalBEEI)
Implementing steganography in the text domain is a crucial issue, since it can hide an essential message from an intruder. Much personal information is exchanged as text, and steganography is expected to be the solution for protecting it by hiding a message so that it is unrecognizable to human or machine vision. This paper concerns one of the categories of steganography on the text medium, called text steganography, with a specific focus on the feature-based method. It reviews previous research efforts of the last decade to assess the performance of techniques developed for the feature-based text steganography method, and also discusses the related performance factors that influence these techniques and several issues in their development.
DNA cryptography, a new era of cryptography, has enhanced cryptography in terms of time complexity as well as capacity. It uses DNA strands to hide the information, and the repeated sequences of DNA make it highly difficult for an unintended party to recover the message. The level of security can be increased by using more than one cover medium. This paper discusses a technique that uses two cover media to hide the message: one is DNA and the other is an image. The message is encrypted and converted to fake DNA, and this fake DNA is then hidden in the image. Keywords: DNA, cryptography, DNA cryptography
Smart Huffman Compression is a software appliance designed to compress a file more effectively. Implemented as a JSP, it provides a high-level abstraction over Java servlets. Smart Huffman Compression encodes digital information using fewer bits, reducing the size of a file without loss of data, in a single, easy-to-manage software appliance form factor; it also provides decompression. Smart Huffman Compression offers organizations an effective solution for lossless reduction of file size, and additionally supports data security through its encoding functionality. It is necessary to analyze the relationship between different methods and put them into a framework to better understand and exploit the possibilities that compression provides: image compression, data compression, audio compression, video compression, and so on.
AUDIO CRYPTOGRAPHY VIA ENHANCED GENETIC ALGORITHM (ijma)
As communication technologies surged recently, the secrecy of shared information between communication parts has gained tremendous attention. Many Cryptographic techniques have been proposed/implemented to secure multimedia data and to allay public fears during communication. This paper expands the scope of audio data security via an enhanced genetic algorithm. Here, each individual (audio sample) is genetically engineered to produce new individuals. The enciphering process of the proposed technology acquires, conditions, and transforms each audio sample into bit strings. Bits fission, switching, mutation, fusion, and deconditioning operations are then applied to yield cipher audio signals. The original audio sample is recovered at the receiver's end through a deciphering process without the loss of any inherent message. The novelty of the proposed technique resides in the integration of fission and fusion into the traditional genetic algorithm operators and the use of a single (rather than two) individual(s) for reproduction. The effectiveness of the proposed cryptosystem is demonstrated through simulations and performance analyses.
This presentation is about genetic algorithms; it also includes an introduction to soft computing and hard computing, and is intended as a reference.
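For readers new to the topic, the sketch below shows the textbook GA cycle (tournament selection, one-point crossover, bit-flip mutation) on bit strings; the fitness function is a trivial placeholder, and none of this is specific to the papers listed here.

```python
import random

def fitness(bits):
    return sum(bits)                      # placeholder objective: maximise the number of ones

def tournament(pop, k=3):
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    cut = random.randrange(1, len(a))     # one-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.01):
    return [b ^ (random.random() < rate) for b in bits]   # occasional bit flips

def evolve(pop_size=50, length=32, generations=100):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop = [mutate(crossover(tournament(pop), tournament(pop)))
               for _ in range(pop_size)]
    return max(pop, key=fitness)

print(fitness(evolve()))
```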
Lossless DNA Solidity Using Huffman and Arithmetic Coding (IJERA Editor)
The DNA sequences making up any bacterium comprise the blueprint of that bacterium, so understanding and analyzing the genes within those sequences has become an exceptionally significant task. Naturalists are producing huge volumes of DNA sequences every day, which makes genome sequence catalogues grow exponentially; databases such as GenBank hold millions of DNA sequences occupying many thousands of gigabytes of storage. Compression of genomic sequences can decrease the storage requirements and increase the transmission speed. In this paper we compare two lossless compression algorithms, Huffman and arithmetic coding. In Huffman coding, individual bases are coded and assigned specific binary numbers, whereas in arithmetic coding the entire DNA sequence is coded into a single fractional number to which a binary word is assigned. The compression ratios of both methods are compared, and we conclude that arithmetic coding is the better of the two.
Convolutional Neural Network for Text Classification (Anaïs Addad)
Work under Pr. Nolan, with a team of 4 to implement a convolutional neural network for text classification in TensorFlow using a dataset of Amazon reviews
Image Compression Through Combination Advantages From Existing Techniques (CSCJournals)
The tremendous growth of digital data has created a strong need for compression, whether to minimize memory usage or to increase transmission speed. Although many techniques already exist, there is still room and need for new techniques in this area. In this paper we introduce a new technique for data compression through pixel combinations, usable for both lossless and lossy compression. The technique can be used as a standalone solution, or as an add-on to another compression method to provide better results; it is applied here only to images, but it can easily be modified to work on any other type of data. We present a side-by-side comparison, in terms of compression rate, of our technique against other widely used image compression methods, and show that the compression ratio it achieves ranks among the best in the literature while the algorithm remains simple and easily extensible. Finally, we make the case that our method can intrinsically support and enhance methods used for cryptography, steganography and watermarking.
Genetic algorithm guided key generation in wireless communication (GAKG) (IJCI JOURNAL)
The technique proposed in this paper uses a high-speed stream cipher approach, which is useful where little memory and maximum speed are required for the encryption process. A Self-Acclimatize Genetic Algorithm is exploited to generate the key stream used to encrypt and decrypt the plaintext. A widely practiced way to identify a good set of parameters for a problem is through experimentation, and for this reason the proposed enhanced Self-Acclimatize Genetic Algorithm for key generation (GAKG) offers the most appropriate exploration and exploitation behavior. Parametric tests are performed and the results are compared with some existing classical techniques, showing comparable results for the proposed system.
Develop and design hybrid genetic algorithms with multiple objectives in data compression
IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.10, October 2017
Manuscript received October 5, 2017
Manuscript revised October 20, 2017
Develop and Design Hybrid Genetic Algorithms with Multiple Objectives in Data Compression
Khalil Ibrahim Mohammad Abuzanouneh
Qassim University, College of Computer, IT Department, Saudi Arabia
Summary
Data compression plays a significant role in minimizing storage size and accelerating data transmission over communication channels. The quality of dictionary-based text compression is addressed here with a new approach: hybrid dictionary compression (HDC), in which one compression system replaces several separate compression systems for characters, syllables, or words. HDC is a new technique and has proven itself especially on text files. The compression effectiveness is affected by the quality of the hybrid dictionary of characters, syllables, and words, which is created by a forward analysis of the text file. In this research, genetic algorithms (GAs), stochastic search algorithms based on natural selection and the mechanisms of genetics, are used as the search engine for obtaining this dictionary and performing the data compression. The research aims to combine Huffman's multiple trees and GAs in one compression system. Huffman coding is a compression method that creates the code lengths of a variable-length prefix code, and the genetic algorithm is used to search for and find the best Huffman tree [1], i.e. the one that gives the best compression ratio. Different codes for the text characters lead to different trees; the GAs create the population in the usual way and guide the search toward better solutions, and after each evaluation the fittest individuals survive and their genetic information is reused.
The proposal is a new multi-tree approach that computes and presents the best Huffman code trees of characters, syllables, and words to obtain the best possible compression [2]. The frequencies of items in the text have an impact on the compression rate. This approach generates more distinctive code words and is efficient in space consumption and computation without losing any data of the original files; the experiments and results show that a significant improvement in accuracy can be achieved by using the derived code words [3].
Key words:
Huffman algorithm, Data compression, Genetic information,
Huffman code trees.
1. Introduction
Data compression has been one of the most important topics in recent years. Big data must be stored in data warehouses or archives and must be transmitted through communication channels. Several data compression algorithms were designed to process data byte by byte [4]. For example, input symbols are taken from the ASCII code table, so there are 256 symbols, each represented by 8 bits, and it is easy to store all symbols in the compressed data because the table is available on every system. Data compression is the process of creating a binary representation of a file that requires less storage space than the original file. The main data compression techniques are lossy and lossless compression. Lossless data compression is used when the data has to be decompressed exactly as it was before compression; when files are stored with a lossless technique, losing even a single character can, in the worst case, make the data misleading. There are limits to the amount of space saving that can be obtained with lossless compression: lossless compression ratios are generally in the range of 0.5 to 0.2. Lossy compression works on the assumption that the file does not have to be stored perfectly; much information can be thrown away from images, video, and audio data, and after decompression the data will still be of acceptable quality. A popular method for data compression is Huffman coding, which runs under various platforms. It is used as a multistep compression process to produce better codes; the best code is produced when the symbol probabilities are negative powers of 2.
The method was developed by D. Huffman in 1952 and remains one of the best-studied approaches to data compression. The algorithm starts by building a list of all symbol probabilities in descending order. It then constructs the Huffman tree from the bottom up, with a symbol at every leaf node. This is done in several steps: in each step the two symbols with the minimum probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced by an auxiliary symbol representing them both. Eventually the list is reduced to one auxiliary symbol representing the whole alphabet; after completion, the tree is traversed to determine all the codes.
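The following minimal Python sketch (ours, not the paper's implementation) follows this bottom-up procedure: it repeatedly merges the two lowest-probability entries into an auxiliary node and then traverses the finished tree to read off the code words.

```python
import heapq

def huffman_codes(freq):
    """Build a Huffman tree bottom-up and return {symbol: code}.

    freq: dict mapping symbol -> frequency (or probability).
    """
    # Each heap entry: (weight, tie_breaker, tree); a tree is a symbol or (left, right)
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:                       # degenerate single-symbol case
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # auxiliary node
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: 0 = left, 1 = right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

# Example with the frequencies used in Section 2 (a=7, b=10, c=4, d=3, e=15, f=4);
# the resulting code lengths match those quoted there (2 bits for b and e, 3 otherwise).
print(huffman_codes({"a": 7, "b": 10, "c": 4, "d": 3, "e": 15, "f": 4}))
```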
2. Huffman algorithm
A Huffman code defines a frequency-based coding scheme that generates a set of variable-size code words of minimum average length, as follows:
1- The frequency table is constructed from the text file and sorted in descending order; the binary trees are built, the Huffman tree is derived, and the Huffman code is generated for compression in order to find the best Huffman trees, as in Figure 1 below.
The data is read from the text file by the proposed system to create the initial population, and the chromosomes are produced from that initial population.
Huffman encoding example: when a message has n distinct symbols, block encoding requires log n bits to encode each symbol. Message size with a fixed-length 8-bit code = 43 characters * 8 bits = 344 bits.
Huffman-encoded message:
10 01 10 001 111 01 111 10 01 001 10 000 000 10 01 110 10 01 001 10 110 111 110 10 001 111 01 10 01 10 01 10 10 01 111 110 10 01 111 000 111 10 01 10
The character codes in this message are a=111, b=01, c=001, d=000, e=10, f=110.
Message size after encoding = a(3*7) + b(2*10) + c(3*4) + d(3*3) + e(10*15) + f(3*4) = 224 bits.
Fig 1: Huffman Code Trees
The Huffman encoding is 224 bits long, so the Huffman tree saves 344 - 224 = 120 bits. The data compression ratio is the ratio between the uncompressed and compressed data sizes: CompRatio = UncompSize / CompSize = 344 / 224 ≈ 1.54 to 1. Data-rate saving is defined as the reduction of the data rate relative to the uncompressed data rate: saving = 1 - Compressed DataRate / Uncompressed DataRate ≈ 34.9%.
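These bookkeeping quantities can be computed mechanically; the sketch below is an illustrative helper (function names are ours, not the paper's) for the sizes, ratio, and saving used throughout the paper, with the 344/224 example above shown in the comments.

```python
def encoded_size_bits(freq, codes):
    """Total message size when each symbol uses its variable-length code word."""
    return sum(freq[s] * len(codes[s]) for s in freq)

def fixed_size_bits(freq, bits_per_symbol=8):
    """Size of the same message under a fixed-length code (e.g. 8-bit ASCII)."""
    return sum(freq.values()) * bits_per_symbol

def comp_ratio(uncomp_bits, comp_bits):
    """Compression ratio, e.g. 344 / 224 ≈ 1.54."""
    return uncomp_bits / comp_bits

def space_saving(uncomp_bits, comp_bits):
    """Data-rate / space saving, e.g. 1 - 224/344 ≈ 0.349."""
    return 1 - comp_bits / uncomp_bits
```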
3. Design of Huffman Multiple Trees
This section describes how the GAs are applied to Huffman's multi-trees. The GA design involves several operators: genetic representation, population initialization, fitness function, selection scheme, crossover, and mutation. Huffman multi-trees consist of sequences of adjacent nodes in the graph, so a path-oriented encoding method and path-based crossover are a natural choice. The behavior and effectiveness of a genetic algorithm depend on several parameter settings, including the population size, crossover probability, mutation probability, number of generations, and mutation range [6]. Siding degree and elitism are used to keep the better individuals in the selection scheme. In this paper, multiple-point crossover is applied to obtain its positive effect when used with populations, avoiding the creation of unproductive clones in the algorithm.
In this research the elitism algorithm is applied to avoid losing the best solutions found. Elitism means that, instead of replacing some old individuals with new ones, a set of old individuals is kept as long as they are better than the worst individuals of the next population. Too much elitism may cause premature convergence, which is a really unpleasant consequence; to avoid this, the application of elitism is controlled by selecting only a small number of individuals, no more than one percent of the population.
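The elitism step can be sketched as follows; this is an illustrative Python fragment under our own naming (next_generation, elite_frac), not the author's code, and it simply keeps the best one percent of the old population when forming the new one.

```python
def next_generation(population, offspring, fitness, elite_frac=0.01):
    """Keep a small elite (at most ~1% of the population, as in the paper)
    and fill the rest of the next generation with the best offspring.
    `fitness` is minimized: a smaller compressed size means a better individual."""
    n = len(population)
    n_elite = max(1, int(elite_frac * n))
    elites = sorted(population, key=fitness)[:n_elite]      # best old individuals
    rest = sorted(offspring, key=fitness)[: n - n_elite]    # best new individuals
    return elites + rest
```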
The chromosomes are produced randomly from the initial population, with a length that depends on the variety of symbols inserted in the text file. For each character the frequency is computed and its probability is found, and for each chromosome in the population a Huffman tree is built depending on those probabilities. The code word for each gene is determined from its tree. The fitness function is computed by evaluating the chromosome using the variance of the characters, which depends on calculating the average for each chromosome.
Chromosome 1: 1 1 0 0 0 1 0 1
Chromosome 2: 0 1 0 1 0 1 0 0
Chromosome 3: 1 1 1 0 1 1 0 0
The fitness function is calculated from the evaluation of each individual by computing the average size, as shown in Eqs. 1 and 2:

A = Σ_{i=1..n} (p_i * a_i)                      (1)

Variance = Σ_{i=1..n} p_i * (a_i − A)²          (2)

where p_i is the probability, a_i the number of bits, and A the average size.
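Eqs. 1 and 2 translate directly into code; the following illustrative fragment (our own naming) computes the average code length and its variance for one chromosome's code lengths and symbol probabilities.

```python
def average_code_length(probs, lengths):
    """Eq. (1): A = sum(p_i * a_i) over all symbols."""
    return sum(p * a for p, a in zip(probs, lengths))

def code_length_variance(probs, lengths):
    """Eq. (2): variance = sum(p_i * (a_i - A)^2)."""
    avg = average_code_length(probs, lengths)
    return sum(p * (a - avg) ** 2 for p, a in zip(probs, lengths))

# e.g. probabilities and code lengths for one chromosome's Huffman tree:
# probs = [7/43, 10/43, 4/43, 3/43, 15/43, 4/43]; lengths = [3, 2, 3, 3, 2, 3]
```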
Crossover operations generate a variety of individuals and avoid conflicts between the genes that construct a chromosome, which is the most important property required in our approach. To complete the building of the Huffman trees for each new individual, the fitness function is evaluated after each tree is computed. In the next step, each new offspring is compared with the worst individuals in the existing population and exchanged with the selected worst individual, producing the best variety in the next population.
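As a sketch of this replacement step (our own simplification, not the paper's exact procedure), each offspring is compared with the current worst individual and takes its place only if it is better:

```python
def replace_worst(population, fitness, offspring):
    """Insert each offspring in place of the worst individual if it improves on it.

    `fitness` is minimized (a smaller compressed size is better)."""
    for child in offspring:
        worst_idx = max(range(len(population)), key=lambda i: fitness(population[i]))
        if fitness(child) < fitness(population[worst_idx]):
            population[worst_idx] = child
    return population
```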
3.1 Fitness Function
In this research there are different methods to evaluate the fitness function: the byte-series method evaluates the byte-series data, the byte-frequency method evaluates the frequency of bytes, the word-frequency method evaluates the frequency of words, and the decision method selects the best decision among these methods as multiple objectives [7]. The fitness functions are applied to measure the solution quality of each method. The genetic algorithms are evaluated using the different methods to find the best combination of methods and the best performance, and to determine the best correspondence between the characteristics of the environment files and their algorithms. The genetic algorithm selects a fitness measurement for each process and produces new individuals of the population; every objective is optimized through the evolution process based on its fitness function. The fitness function is the average of the absolute variation between the target compression ratio and the compression achieved on the environment files. The fitness of the byte-frequency method can be expressed through the byte frequency of file n, where file n is a file in the experiment set, and the compression saving of file n when using a given compression algorithm is denoted Comp(algorithm, file n).
3.2 Genetic Operators
In this research a multi-objective algorithm is applied to the individuals. There are several options for applying multi-objective genetic operators to generate a new individual and guide evolution through the search space:
1- A particular operator can be selected and applied to all models within an individual, or a different operator can be selected for each model.
2- The crossover technique is applied between homogeneous model structures, crossing over trees at different positions and swapping genetic material, with the crossover positions restricted during evolution.
3- The system selects an operator with predefined probabilities for Huffman coding on characters, Huffman coding on syllables, and Huffman coding on words.
4- In the crossover, only homologous data are allowed in the crossing process.
5- The system takes the structural constraints of the Huffman code trees into consideration as multiple objectives, to ensure their syntax is respected.
4. Proposed Compression Method
In this research the author proposes a multi-tree representation: the first tree is selected to analyze the byte series, the second tree to analyze the byte frequency, and the third tree to analyze the word frequency, with a decision method that selects the best prediction for the compression ratio.
The test data compression coding and its strategy are based on compression techniques such as run-length coding, statistical coding, and dictionary-based coding. In this paper, the researcher evolves Huffman multi-trees and syllable-based LZW compression using GAs to predict the compression ratio of a specific compression model applied to different texts [12]. The data representation therefore depends on the file type: for example, each unit of a text file is represented as an ASCII character, while an executable or instruction file can be represented as numeric or textual data. The file type can be determined from the file name or extension, and after this stage the compression model can decide how to build the multi-tree representation based on the decision method [9].
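A rough sketch of the decision step follows, assuming a character-level and a word-level tokenization and reusing the huffman_codes helper from the earlier sketch; the tokenization rule and function names are illustrative only, not the paper's exact procedure.

```python
from collections import Counter
import re

def decide_model(text):
    """Pick the tokenization whose Huffman code predicts the smallest encoded size."""
    models = {
        "characters": list(text),
        "words": re.findall(r"\w+|\W", text),   # crude word/punctuation split
    }
    best_name, best_bits = None, float("inf")
    for name, tokens in models.items():
        freq = Counter(tokens)
        codes = huffman_codes(freq)             # helper from the earlier sketch
        bits = sum(freq[t] * len(codes[t]) for t in freq)
        if bits < best_bits:
            best_name, best_bits = name, bits
    return best_name, best_bits
```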
The proposed compression method derives the optimal Huffman code trees via a genetic algorithm. In the experiments all possible strings were known at processing time, and the genetic algorithm was applied to select the most efficient set of strings to use in the Huffman multi-tree encoding. A bit string can be represented as a collection of variable-length patterns matching series of 0 and 1 bits; the matching vectors are assigned a prefix-free code, so a string can be represented as a set of matching-series codes together with the values of the unspecified positions. The GAs generate a set of multi-trees for the matching strings and use them to compress the file data; better fitness produces a better compression ratio.
A binary tree is the best way to visualize any particular encoding diagram: each character is stored at a leaf node, and a character's encoding is obtained by following the path from the root to its node, where a left-going edge represents a 0 and a right-going edge a 1, as shown in Figures 3 and 4. For this diagram tree with the fixed-length encoding, the corresponding quantities are given in Eqs. 3 and 4.
A_m = Σ_{i=1..n, l=0..k} (P_i * a_i + T_l)                   (3)

Variance = Σ_{i=1..n, l=0..k} P_i * ((a_i + T_l) − A_m)²     (4)
d=000, &=001, c=010, f=011, k=100, a=101, b=110, e=111
Fig 2: Illustrates the binary tree diagrams of the fixed-length encoding.
Message size = d(3*3) + &(3*3) + c(3*4) + f(3*4) + k(3*5) + a(3*7) + b(3*10) + e(3*15) = 153 bits.

Symbol      d    &    c    f    k    a    b     e
Bits*Freq   3*3  3*3  3*4  3*4  3*5  3*7  3*10  3*15
Code        000  001  010  011  100  101  110   111
Fig 3: Illustrates the binary tree diagrams of the variable-length encoding.
Message size = a(4*3) + d(4*3) + &(4*3) + c(4*4) + f(3*4) + k(3*5) + b(2*10) + e(2*15) = 129 bits.
Fig 4: illustrates the Huffman multi-trees diagrams of the variable-length
encoding.
T = T1 + T2 + …
Message size = d(3*2) + &(3*2) + c(3*4) + f(3*4) + k(2*5) + a(2*7) + b(2*10) + e(2*15) = 110 bits.
Tree 1 = 36 bits, Tree 2 = 74 bits.

            Tree 1 (Left=0, Right=1)   Tree 2
Symbol      d    &    c    f           k    a    b     e
Bits*Freq   3*2  3*2  3*4  3*4         2*5  2*7  2*10  2*15
Code        00   01   10   11          00   01   10    11
Different types of mutation are applied. The first, named byte mutation, is identical to that used in a standard genetic system: the Huffman multiple trees contain a set of binary nodes, each of which can be a leaf or an internal node and can be replaced with another node; in this case an internal node and a leaf node can be swapped, as in Figures 5 and 6.
Fig 5: The swap mutation probability in the genetic algorithm, applied to a prefix-free tree.
Fig 6: Illustrates a prefix free tree before the combine mutations have
been implemented.
Fig 7: Illustrates the trees from figure 6 after the combine mutations have
been applied, with the nodes ‘cross’ with ‘over’ and ‘some’ with ‘thing’
selected for combination.
There is another new mutation, known as the join mutation, which provides the functionality of creating a new long string constructed from a set of words. Some symbols occur more frequently than others in English text; for example, the most common letters are E and T. An optimal Huffman code tree can therefore be created for the English alphabet and used to encode some texts, in order to assess the compression ratio and the quality of the results [9].
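The sketch below illustrates, under our own simplified tree representation (nested tuples with strings at the leaves), one way the swap and join mutations could operate; it is not the author's implementation, and in this simplification the second source leaf of a join is left in place.

```python
import random

def leaves(tree, path=()):
    """Return [(path, symbol)] for every leaf of a nested-tuple Huffman tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0], path + (0,)) + leaves(tree[1], path + (1,))
    return [(path, tree)]

def replace_at(tree, path, subtree):
    """Return a copy of the tree with the node at `path` replaced by `subtree`."""
    if not path:
        return subtree
    left, right = tree
    if path[0] == 0:
        return (replace_at(left, path[1:], subtree), right)
    return (left, replace_at(right, path[1:], subtree))

def swap_mutation(tree, rng=random):
    """Swap the contents of two randomly chosen leaves (one flavor of swap mutation)."""
    (p1, s1), (p2, s2) = rng.sample(leaves(tree), 2)
    tree = replace_at(tree, p1, s2)
    return replace_at(tree, p2, s1)

def join_mutation(tree, rng=random):
    """Join two randomly chosen leaf strings into a new leaf string; the chosen
    leaf is brought down one level and the new leaf becomes its sibling."""
    (p1, s1), (_, s2) = rng.sample(leaves(tree), 2)
    new_leaf = str(s1) + str(s2)
    return replace_at(tree, p1, (s1, new_leaf))
```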
In this part of the research there are three priorities: the system takes whole words as the first priority, frequent syllables as the second priority, and individual letters as the third priority. In a typical text file some words occur more often than others, and to represent these words in Huffman binary code the common words are replaced with short code sequences and the uncommon words with longer sequences. By increasing the information entropy of the text file, the compression ratio becomes more efficient when the compressed text file contains frequently occurring words with longer character sequences [11].

Code table for the variable-length encoding (cf. Fig 3):
Symbol      d     &     c     f     k    a    b     e
Bits*Freq   4*3   4*3   4*4   4*4   3*5  3*7  2*10  2*15
Code        1111  1110  1000  1001  001  101  01    11
For the sequence of frequent words, the probability of the new string is computed and leaf nodes are selected randomly; two strings are joined together into one new string and a new node is created to contain that string. The new node becomes a new leaf node in the Huffman binary tree, created as the sibling of an existing node.
There is another mutation, known as the word mutation, which provides the functionality of swapping some words between an internal node and a leaf node in the Huffman binary tree described above.
Two leaf nodes are selected randomly for swapping; the words are placed into the Huffman binary tree, and a new node is created to hold the word as one string. One of the two strings, randomly selected from the leaf nodes, is brought down a level, and the new node is made its child [14]. Figures 3 and 4 illustrate one particular case of the word mutation.
Unlike in genetic programming, by the nature of the trees used in this algorithm, any crossover would have a significant potential to cause some strings in the tree to occur twice while others disappear entirely. The fitness of a chromosome is determined by the length of the file, in bits, when compressed using the encoding it represents; in this experiment, therefore, fitness is to be minimized.
In this research the binary tree structure is represented by 2n − 1 bits for n internal nodes: a prefix-free encoding tree is represented in its entirety by a preorder traversal of the tree, which, since it is a full binary tree, identifies each visited node as either a leaf node or an internal node. Each uncompressed string in the encoding occupies 8L bits, where L is the string length in bytes; the characters in the code words are counted toward the total lengths of the symbols, and the compressed data length is determined by looking up each of the encoding trees.
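As an illustration of this structure encoding (a minimal sketch, assuming the tree is full, so a tree with n leaves has 2n − 1 nodes in total), a preorder traversal can emit one bit per node:

```python
def preorder_bits(tree):
    """Serialize a full binary tree with one bit per node in preorder:
    '1' marks an internal node, '0' marks a leaf.  A full binary tree
    with n leaves has 2n - 1 nodes, so its structure costs 2n - 1 bits."""
    if isinstance(tree, tuple):                      # internal node
        return "1" + preorder_bits(tree[0]) + preorder_bits(tree[1])
    return "0"                                       # leaf node

# Example: the tree (('a', 'b'), 'c') serializes to '11000' (5 bits, 3 leaves).
```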
Consider a multicast tree defined by G(VT, ET), where VT ⊆ V, ET ⊆ E and G ⊆ T, and all trees are associated with a link cost C(Ti), as shown in Eq. 5:

T = T1 + T2 + T3 + …,  Level = {0, 1, 2, 3, …, k}            (5)

A unicast request from the root node r to a destination node d has an encoding probability bound Δ, see Eq. 6:

Δ(P_i) = Σ_{e ∈ P_Ti(v, d_i)} P_Ti ≤ Δ                       (6)

The shortest path of encoding probability is obtained by selecting the path from the root to its node to find a series of paths P_i, i ∈ {0, 1, 2, 3, …}, as given in Eq. 7:

P_Ti = Σ_{i=1..n, l=0} (P_i * a_i + l)                       (7)
Consider a symbol a that occurs at a node at depth k in the tree; an array encoding D of length k is given by C(a) = D = {d1, d2, …, dk}, where d_i ∈ [D], and all prefixes of the codes C(d_i) correspond to nodes on the path from the root r to the destination d_i. Let Lmax be the length of the longest encoded symbol, so the possible number of leaves at depth Lmax is D(Lmax); the code word leaf of depth i is then D(Lmax − L_i), and the symbol coding C(a) follows from the encoding lengths E(L_a), see Eq. 8:

C(a) = Σ_{i=1..k} D(Lmax − L_i)                              (8)
In a series of graphs G_i, where i ∈ {0, 1, 2, 3, …}, the value of the fitness function used to evaluate solution quality is the tree cost among a set of candidate solutions, as given in Eq. 9:

f(C_T) = Σ_{i=1..n} P_Ti * T_i                               (9)
The cost of the minimal tree must satisfy the encoding probability constraint; the minimum cost of encoding probability is given in Eq. 10:

C(T) = min_{P_T} Σ_{e ∈ P_Ti(v, d_i)} C(T_i)                 (10)
The stability of the best solution's fitness can be measured by calculating the standard deviation between the fitness values of the different solutions of each process; an increase in the fitness of the solutions and of the average indicates that the system produces good feasible solutions in each new process [5]. The fitness value, which expresses solution quality, should be evaluated accurately and is determined by the fitness function [10]. The fitness algorithm searches for the minimum cost of encoding probability between the root of the tree and a leaf node; the path cost is the primary criterion of solution quality among a set of candidate solutions, so for a chromosome Ch_i representing the path P the fitness is denoted F(Ch_i), see Eq. 11 below:

F(Ch_i) = [ Σ_{ch_i ∈ P_i(r,l)} C(ch_i) ]^(−1)               (11)
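A tiny illustrative helper for Eqs. 10 and 11 (the names are ours): the fitness of a chromosome is the reciprocal of its total path cost, and the best chromosome is the one whose path has minimum cost.

```python
def path_fitness(path_costs):
    """Eq. (11): F(Ch) = [ sum of link costs C(ch_i) along the path ]^(-1)."""
    return 1.0 / sum(path_costs)

def best_chromosome(candidates):
    """Pick the candidate with the minimum total path cost (Eq. 10), i.e. the
    maximum fitness under Eq. (11).  candidates: chromosome id -> list of link costs."""
    return max(candidates, key=lambda c: path_fitness(candidates[c]))

# Example: best_chromosome({"ch1": [3, 2, 4], "ch2": [1, 2, 2]}) returns "ch2".
```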
In this research the system starts by randomly initializing the population of individuals, uses standard genetic operators to guide evolution through the search space of the compression problem, and constructs Huffman multiple-tree encodings that work with different data types and frequency distributions. Several studies were carried out to take advantage of all available knowledge about the data compression parameters. Applying various compression algorithms to a heterogeneous data file generates different compression ratios with extremely long processing times, so the candidate compression algorithms are tested to determine the data compression and avoid wasted time when the data size is less than 1 GB [11]. To avoid applying a random compression algorithm, which could lose efficiency by increasing the storage space or even give unexpected results, the researcher applied several compression algorithms side by side while estimating and analyzing the compression ratio, which helps save computational resources and the time required to execute the compression process [12].
In this paper the proposed algorithms present a new lossless data compression algorithm that uses a genetic algorithm to handle different degrees of data compressibility and compares different multi-trees to minimize the total file size. The GAs were successfully implemented to construct different Huffman multiple trees and to evaluate the compression ratio; the proposed approach, based on GAs and Huffman multi-trees, implements a program that predicts the data compression and achieves high efficiency for the different compression model trees, giving a fast analysis for determining which compression model tree should be used with minimum resources while saving the processing time needed to run the Huffman multi-trees [13].
5. Compression Methods and Techniques
In this research the author assumes that word-based compression is more suitable for large files, syllable-based compression is useful for middle-sized files, and character-based compression is more suitable for short files. Syllable-based LZW compression and Huffman coding with the genetic algorithm are used to compare each compression method with its counterpart variants for words, syllables, and characters, in order to make the best decisions and obtain the minimum cost of Huffman encoding probability [14].
For each language we have created a database of frequent words to improve compression; the words from the database are used to initialize the compression algorithms. As is known, the English language has a characteristic set of syllables [15], so a set of test documents was created for English, together with a decomposition algorithm into syllables, one database of frequent syllables formed from letters, and databases of other frequent syllables.
Three Huffman trees are used to encode the lengths of syllables: the first for Huffman coding trees on characters, the second for Huffman coding trees on syllables, and the third for Huffman coding trees on words. All trees are initialized from the received text documents.
Optimizing compression algorithms: the proposed system applies genetic algorithms to determine the most repetitive strings inside the text file to be compressed; the search population contains a set of nodes that represent different strings, and the Huffman trees are generated so that higher frequencies are encoded with shorter references. In this work the researchers utilized specialized search operators. The final results demonstrate that GAs used with Huffman trees achieve higher compression than the standard coding algorithm [15].
Fig 8: Illustrates the block diagram of the proposed system and its
algorithms.
6. Experiments and Results
Two main criteria should be taken into consideration for the compression methods: the compression ratio and the quality of the results, i.e. the differences between the input file and the output file. The GAs use a genome consisting of binary tree structures, where a genotype represents a particular prefix-free encoding, and the initial population consists of a set of trees over the 256 possible bytes. The experiments were implemented using the Java Genetic Algorithms Package. All experiments were performed with the following parameter settings: a population size of 100-200 individuals, a maximum of 1000 generations, a crossover probability of 80%, a mutation probability of 5%, and a maximum tree depth of 15. The population size and number of generations depend on the type of experimental analysis; in each case the mutation was applied with full probability to obtain the optimal solution, and the 'swap' and 'combine' mutations were used for each new individual.
In the first experiment the space saving is calculated, defined as the decrease in data size relative to the uncompressed data size: Space Saving = 1 − Compressed Size / Uncompressed Size; the higher the space-saving value, the better the compression performs. In these first tests we used a set of text files to study the compression behavior with trees covering the 256 possible 8-bit ASCII characters, and the achieved space savings were compared with the ratios that can be achieved using a standard Huffman coding tree. Table 1 presents the results of applying the GA Huffman coding tree to all files. We performed 10 runs using four text files, letting each run for a total of 1000 generations; the best space saving achieved was 49%, i.e. a compressed size of 4901 bits using the GA Huffman coding tree, compared with 47% (5055 bits) for the same file using the standard Huffman coding tree.
Table 1. Comparison of Huffman coding tree and GA Huffman coding tree space saving, taking into consideration the effect of different file sizes, while the population size and generation number are fixed.

File    HuffTree Comp_Size    GAHuffTree Comp_Size    HuffTree SpaceSaving    GAs HuffTree SpaceSaving
File1   4211                  3762                    0.33                    0.41
File2   4502                  4122                    0.36                    0.42
File3   4952                  4491                    0.40                    0.46
File4   5055                  4901                    0.47                    0.49
In the second experiment we applied the GA Huffman coding algorithm to a set of files using different generation numbers; a file requires more than 800 generations in order to reach the maximum compression ratio and space saving. The results of this experiment satisfied the compression probability and gave good results, as illustrated in Table 2.
Table 2. Results of a set of experiments on files of different sizes, comparing word-based Huffman coding with GAs+Huffman word coding, while the other parameters are fixed.

                          Huff word                   GAs+Huff word
File     FileSize     CompSize    SpaceSaving     CompSize    SpaceSaving
File5    547632       301112      45.02           312621      42.91
File6    717302       395679      44.84           416992      41.87
File7    884235       524361      40.70           545112      38.35
File8    1023351      611224      40.27           639881      37.47
File9    1272142      825545      35.11           841235      33.87
In the third experiment we developed the GAs to extend standard Huffman coding to multi-trees of character encodings in order to obtain a better space saving. We used the effect of the gene probabilities on the compression ratio, applying GA Huffman multi-tree coding to a set of files; increasing the probability of a gene gives that gene a shorter code word and, as a result, increases the space saving of the compression. The results of this experiment satisfied the compression probability and gave good results, as illustrated in Table 3.
Table 3. Results of a set of experiments taking different generation numbers into consideration, while the population size and file sizes are fixed.

No    Generation Number    Population Size    Compression Size    GAsHuffTree SpaceSaving
1     200                  42                 83021               0.54
2     400                  42                 80136               0.56
3     600                  42                 72126               0.60
4     800                  42                 69141               0.62
In the fourth experiment we used GAs with the LZWL and HuffSyllable compression algorithms side by side, estimating and analyzing the compression ratio. The results show an increased compression ratio and a better space saving. We used the 'swap' and 'combine' mutations to improve the gene probabilities in the compression process, applying GA Huffman multi-tree coding to a set of files; increasing the probability of a gene gives that gene a good code word and, as a result, increases the space saving of the compression. The experimental results satisfied the compression probability and gave good results, as illustrated in Figure 9.
Fig 9. Illustrates a set of examples comparing different algorithms on fixed files.
In the fifth experiment we compared GA Huffman coding with the WLZW word-compression algorithm on a set of files of different sizes, while the other parameters were fixed. The GA Huffman encoding, designated HuffWord, was compared in this test with the LZ77 compressor designated WLZ77, a word-based compression method. These experiments measured the effectiveness of the GA Huffman encoding; the results are shown in Figure 10. This test shows that the GA Huffman encoding algorithm is much better than the WLZ77 word-based compression algorithm.
Fig 10. Illustrates a set of examples comparing GA Huffman coding with the WLZW algorithm on a set of files of different sizes, while the other parameters are fixed.
Acknowledgment
The author would like to acknowledge the financial support of
this work from the Deanship of Scientific Research, Qassim
University, according to the agreement of the funded project
No.1597-COC-2016-1-12-S, Qassim, Kingdom of Saudi Arabia.
References
[1] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, Vol. 40, pp. 1098-1101, 1952.
[2] Jeffrey N. Ladino, "Data Compression Algorithms," http://www.faqs.org/faqs/compression-faq/part2/section1.html.
[3] Qusay H. Mahmoud, with contributions from Konstantin Kladko, "Compressing and Decompressing Data using Java," February 2002.
[4] M. Nelson and J.-L. Gailly, The Data Compression Book, 2nd edition, M&T Books, New York, NY, 1995, ISBN 1-55851-434-1.
[5] T. A. Welch, "A Technique for High-Performance Data Compression," Computer, pp. 8-18, 1984.
[6] Khalil Ibrahim Mohammad Abuzanouneh, "Hybrid Multi-Objectives Genetic Algorithms and Immigrants Scheme for Dynamic Routing Problems in Mobile Networks," International Journal of Computer Applications, 164(5): 49-57, April 2017.
[7] Khalil Ibrahim Mohammad Abuzanouneh, "Parallel and Distributed Genetic Algorithm with Multiple Objectives to Improve and Develop of Evolutionary Algorithm," International Journal of Advanced Computer Science and Applications, Vol. 7, Issue 5, 2016.
[8] J. Lansky and M. Zemlicka, "Compression of Small Text Files Using Syllables," Technical Report No. 2006/1, KSI MFF, Praha, January 2006.
[9] G. Ucoluk and H. Toroslu, "A Genetic Algorithm Approach for Verification of the Syllable-Based Text Compression Technique," Journal of Information Science, Vol. 23, No. 5, 1997, pp. 365-372.
[10] M. Nelson, Interactive Data Compression Tutor and The Data Compression Book, 2nd ed., M&T Books, http://www.bham.ac.uk.
[11] M. Nelson, "LZW Data Compression," Dr. Dobb's Journal, October 1989.
[12] D. Hankerson, P. D. Johnson, and G. A. Harris, Introduction to Information Theory and Data Compression.
[13] S. Chowdhury, A. Chowdhury, S. R. Bhadra Chaudhuri, and C. T. Bhunia, "Data Transmission using Online Dynamic Dictionary Based Compression Technique of Fixed and Variable Length Coding," International Conference on Computer Science and Information Technology, 2008.
[14] P. G. Howard and J. C. Vitter, "Arithmetic Coding for Data Compression," Proceedings of the IEEE, Vol. 82, No. 6, 1994, pp. 857-865.
[15] A. Kattan and R. Poli, "Evolutionary Lossless Compression with GP-ZIP*," Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, Atlanta, Georgia, USA, 2008, pp. 1211-1218.