This paper introduces a multi-layer hybrid text steganography approach based on word tagging and recoloring. Existing approaches are designed to excel at only one of imperceptibility, high hiding capacity, or robustness. The proposed approach does not use the ordinary sequential embedding process; it addresses the shortcomings of current approaches by targeting imperceptibility, high hiding capacity, and robustness together, combining a linguistic technique with a format-based technique. The linguistic technique divides the cover text into embedding layers, where each layer consists of the sequence of words that share a single part of speech as detected by a POS tagger. The format-based technique recolors the letters of the cover text with a nearby RGB color code to embed 12 bits of the secret message in each letter, which yields high hiding capacity and conceals the embedding. Robustness is achieved through the multi-layer embedding process, and the generated stego key significantly strengthens the security of the embedded message and its size. A comparison of experimental results shows that the proposed approach is better than currently developed approaches at balancing imperceptibility, high hiding capacity, and robustness.
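The 12-bits-per-letter figure can be illustrated with a minimal sketch. The abstract does not say how the bits are laid out, so the split below (4 bits in the low nibble of each of the R, G, and B channels of a black base color) is an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

BITS_PER_LETTER = 12  # from the abstract: 12 secret bits per recolored letter


def text_to_bits(message: str) -> str:
    """Encode a message as a string of '0'/'1' characters (UTF-8, 8 bits per byte)."""
    return "".join(f"{byte:08b}" for byte in message.encode("utf-8"))


def bits_to_colors(bits: str) -> List[Tuple[int, int, int]]:
    """Map each 12-bit group to an (R, G, B) triple, 4 bits per channel.

    Assumption: the base color is black, so every channel stays in 0..15
    and the recolored letter remains visually close to black.
    """
    # Pad with zeros so the length is a multiple of 12.
    padded = bits + "0" * (-len(bits) % BITS_PER_LETTER)
    colors = []
    for i in range(0, len(padded), BITS_PER_LETTER):
        group = padded[i:i + BITS_PER_LETTER]
        r, g, b = (int(group[j:j + 4], 2) for j in (0, 4, 8))
        colors.append((r, g, b))
    return colors


def colors_to_bits(colors: List[Tuple[int, int, int]]) -> str:
    """Recover the bit string from the per-letter colors."""
    return "".join(f"{channel:04b}" for rgb in colors for channel in rgb)


def bits_to_text(bits: str, n_bytes: int) -> str:
    """Decode the first n_bytes bytes of a bit string back to text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))
    return data.decode("utf-8")


secret = "Hi!"
colors = bits_to_colors(text_to_bits(secret))  # one color per cover letter
recovered = bits_to_text(colors_to_bits(colors), len(secret.encode("utf-8")))
```

Under this assumed layout, a 3-byte message (24 bits) needs only two recolored letters. The layer selection and stego key described in the abstract are not modeled here.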
Arabic named entity recognition using deep learning approach - IJECEIAES
Most Arabic Named Entity Recognition (NER) systems depend heavily on external resources and handmade feature engineering to achieve state-of-the-art results. To overcome these limitations, we propose in this paper a deep learning approach to the Arabic NER task. We introduce a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) and experiment with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model takes two sources of information about words as input, pre-trained word embeddings and character-based representations, and eliminates the need for any task-specific knowledge or feature engineering. We obtained a state-of-the-art result on the standard ANERcorp corpus, with an F1 score of 90.6%.
Analysis review on feature-based and word-rule based techniques in text stega... - journalBEEI
This paper presents several techniques used in text steganography, categorized as feature-based and word-rule based. It also analyses the performance and evaluation metrics of these techniques. The paper aims to identify the main techniques of text steganography, namely feature-based and word-rule based, and to recognize the various techniques used with them. The review finds that the feature-based technique is the primary one used in text steganography because of its simplicity and security. The common evaluation metrics in text steganography are security, capacity, robustness, and embedding time. Future efforts are suggested to focus on the methods used in text steganography.
Review on feature-based method performance in text steganography - journalBEEI
Implementing steganography in the text domain is a crucial issue, since it can hide an essential message from an intruder. Most personal information is carried in text, and steganography is expected to protect it by hiding the message so that neither human nor machine vision recognizes it. This paper is concerned with one category of steganography, text steganography, and focuses specifically on the feature-based method. It reviews research efforts from the last decade to assess the performance of techniques in the development of feature-based text steganography. It also examines related performance factors that influence the techniques and several issues in the development of feature-based text steganography methods.
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM) WITH CONDITIONAL RANDOM FIELDS (... - ijnlc
This study investigates the effectiveness of Knowledge Named Entity Recognition in Online Judges (OJs). OJs lack topic classification and are limited to problem IDs only, so a lot of time is spent finding programming problems, particularly by knowledge entity. A Bidirectional Long Short-Term Memory (BiLSTM) model with Conditional Random Fields (CRF) is applied to recognize the knowledge named entities in solution reports. For the test run, more than 2000 solution reports were crawled from the Online Judges and processed for the model output. The stability of the model is also assessed through the higher F1 value. The results obtained with the proposed BiLSTM-CRF model are more effective (F1: 98.96%) and more efficient in lead time.
A MULTI-LAYER ARABIC TEXT STEGANOGRAPHIC METHOD BASED ON LETTER SHAPING - IJNSA Journal
Text documents are widely used; however, text steganography is more difficult than steganography in other media because text carries little redundant information. This paper presents a text steganography method for Arabic Unicode texts that does not use a normal sequential embedding process, in order to overcome the security issues of current approaches, which are sensitive to steganalysis. The Arabic Unicode text is kept in its main unshaped letters, and the proposed method uses a text file as cover text, hiding one bit in each letter by reshaping the letter according to its position (beginning, middle, or end of the word, or standalone). The hiding process runs through multiple embedding layers, where each layer contains all words with the same tag as detected by the POS tagger, and the embedding layers are selected randomly using the stego key to improve security. The experimental results show that the proposed method satisfies the hiding capacity requirements, improves security, and achieves better imperceptibility than currently developed approaches.
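The layering idea (one layer per POS tag, visited in a key-dependent order) can be sketched as follows. This is an illustration only: the words and tags are hand-assigned toy data rather than the output of a real tagger, the function names are hypothetical, and Python's `random` module stands in for whatever key-driven selection the authors use (it is not cryptographically secure).

```python
import random
from collections import defaultdict


def build_layers(tagged_words):
    """Group words into embedding layers, one layer per POS tag."""
    layers = defaultdict(list)
    for word, tag in tagged_words:
        layers[tag].append(word)
    return dict(layers)


def layer_order(layers, stego_key):
    """Return the tags in a pseudo-random order derived from the stego key.

    Sender and receiver obtain the same order as long as they share the key.
    """
    tags = sorted(layers)  # fixed starting order, independent of input order
    random.Random(stego_key).shuffle(tags)
    return tags


# Toy cover text with hand-assigned tags (an assumption; a POS tagger
# would normally produce these).
tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"),
    ("lazy", "ADJ"), ("dog", "NOUN"),
]

layers = build_layers(tagged)
order = layer_order(layers, stego_key=2024)
```

Embedding would then walk the layers in `order`, hiding bits in the letters of each layer's words; the letter-reshaping step itself is not modeled here.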
Shared-hidden-layer Deep Neural Network for Under-resourced Language the Content - TELKOMNIKA JOURNAL
Training a speech recognizer with under-resourced language data still proves difficult. Indonesian is considered under-resourced because of the lack of a standard speech corpus, text corpus, and dictionary. This research analyzes the efficacy of augmenting limited Indonesian speech training data with training data from a highly resourced language, such as English, to train an Indonesian speech recognizer. The training was performed as shared-hidden-layer deep-neural-network (SHL-DNN) training. An SHL-DNN has language-independent hidden layers and can be pre-trained and trained using multilingual training data in the same way as a monolingual deep neural network. The SHL-DNN trained on Indonesian and English speech data proved effective at decreasing word error rate (WER) in decoding Indonesian dictated speech, achieving a 3.82% absolute decrease compared to a monolingual Indonesian hidden Markov model with Gaussian mixture model emissions (GMM-HMM). The result was confirmed when the SHL-DNN was also used to decode Indonesian spontaneous speech, achieving a 4.19% absolute WER decrease.
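For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation code used in the paper; the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
wer = word_error_rate("saya pergi ke pasar", "saya pergi pasar")
```

An "absolute decrease" is the plain difference between two such rates expressed in percentage points, as opposed to a decrease relative to the baseline value.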
Proposed Arabic Text Steganography Method Based on New Coding Technique - IJERA Editor
Steganography is one of the important fields of information security; it depends on hiding secret information in a cover medium (video, image, audio, text) such that an unauthorized person fails to realize its existence. Run-length encoding (RLE) is a lossless data compression technique used for files that contain many redundant data. Sometimes the RLE output is expanded rather than compressed, and this is the main problem of RLE. In this paper we use a new coding method whose output contains sequences of ones with few zeros, so that the modified RLE we propose is suitable for compression. Finally, we employ the modified RLE output for steganography, based on Unicode and non-printed characters, to hide the secret information in an Arabic text.
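The expansion problem mentioned above is easy to see with plain RLE. The sketch below implements ordinary run-length encoding over a bit string, not the modified RLE the authors propose, whose details are not given in this abstract.

```python
from itertools import groupby


def rle_encode(bits: str):
    """Encode a string as a list of (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bits)]


def rle_decode(pairs) -> str:
    """Invert rle_encode."""
    return "".join(symbol * count for symbol, count in pairs)


# Long runs of ones with few zeros compress well: 16 symbols become 3 pairs.
friendly = "1111111101111111"
friendly_pairs = rle_encode(friendly)

# Alternating input expands: 8 symbols become 8 pairs (16 stored values).
hostile = "10101010"
hostile_pairs = rle_encode(hostile)
```

This is why a pre-coding step that produces long runs of ones with few zeros, as the abstract describes, makes the input better suited to RLE.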
Artificial Neural Networks have proved their efficiency in a large number of research domains. In this paper, we apply Artificial Neural Networks to Arabic text for language modeling, text generation, and missing text prediction. On the one hand, we adapt Recurrent Neural Network architectures to model the Arabic language in order to generate correct Arabic sequences. On the other hand, Convolutional Neural Networks are parameterized, based on some specific features of Arabic, to predict missing text in Arabic documents. We demonstrate the power of our adapted models in generating and predicting correct Arabic text compared to the standard model. The models were trained and tested on known free Arabic datasets. The results are promising, with sufficient accuracy.
Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Sp... - Yuki Tomo
Presented at the Deep Learning Study Group @ Komachi Lab on 12/22: an introduction to "Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny.
Robust extended tokenization framework for romanian by semantic parallel text... - ijnlc
Tokenization is considered a solved problem when reduced to just word borders identification, punctuation
and white spaces handling. Obtaining a high quality outcome from this process is essential for subsequent
NLP piped processes (POS-tagging, WSD). In this paper we claim that to obtain this quality we need to use
in the tokenization disambiguation process all linguistic, morphosyntactic, and semantic-level word-related
information as necessary. We also claim that semantic disambiguation performs much better in a bilingual
context than in a monolingual one. Then we prove that for the disambiguation purposes the bilingual text
provided by high profile on-line machine translation services performs almost to the same level with
human-originated parallel texts (Gold standard). Finally we claim that the tokenization algorithm
incorporated in TORO can be used as a criterion for on-line machine translation services comparative
quality assessment and we provide a setup for this purpose.
Transformer Models have taken over most of the Natural language Inference tasks. In recent
times they have proved to beat several benchmarks. Chunking means splitting the sentences into
tokens and then grouping them in a meaningful way. Chunking is a task that has gradually
moved from POS tag-based statistical models to neural nets using Language models such as
LSTM, Bidirectional LSTMs, attention models, etc. Deep neural net Models are deployed
indirectly for classifying tokens as different tags defined under Named Recognition Tasks. Later
these tags are used in conjunction with pointer frameworks for the final chunking task. In our
paper, we propose an Ensemble Model using a fine-tuned Transformer Model and a recurrent
neural network model together to predict tags and chunk substructures of a sentence. We
analyzed the shortcomings of the transformer models in predicting different tags and then
trained the BILSTM+CNN accordingly to compensate for the same.
Imran Sarwar Bajwa, M. Abbas Choudhary [2006], "A Rule Based System for Speech Language Context Understanding", International Journal of Donghua University (English Edition), Jun 2006, Vol. 23 No. 06, pp:39-42
ATAR: Attention-based LSTM for Arabizi transliteration - IJECEIAES
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25 k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Compression-Based Parts-of-Speech Tagger for The Arabic Language - CSCJournals
This paper explores the use of compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger. The first models were trained using silver-standard data from two different Arabic POS taggers, and the second model used the BAAC corpus, a 50K-term manually annotated MSA corpus, on which the PPM tagger achieved an accuracy of 93.07%. The tag-based models were also used to evaluate the performance of the new tagger by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the silver-standard models reduced the quality of the tag-based compression by an average of 0.43%, whereas the gold-standard model increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
A survey on Script and Language identification for Handwritten document images - iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Performance analysis on secured data method in natural language steganography - journalBEEI
The rapid growth in information exchange that accompanied the expansion of the internet during the last decade has motivated research in this field. Recently, steganography approaches have received unexpected attention. The aim of this paper is to review different performance metrics, covering decoding, decrypting, and extracting. Data decoding interprets the received hidden message into a code word. Data encryption is the best way to provide secure communication, and decrypting takes an encrypted text and converts it back into the original text. Data extracting is the reverse of the data embedding process. The effectiveness of an evaluation is mainly determined by its performance metrics, and researchers aim to improve the characteristics those metrics capture. The objective of this paper is to present a review of natural language steganography based on performance analysis criteria. The findings clarify which performance metrics are preferred. This review is intended to help future research in evaluating natural language steganography in general and secured data methods in natural language steganography in particular.
EXPERIMENTING WITH THE NOVEL APPROACHES IN TEXT STEGANOGRAPHY - IJNSA Journal
As is commonly known, the steganographic algorithms employ images, audio, video or text files as the medium to ensure hidden exchange of information between multiple contenders to protect the data from the prying eyes. However, using text as the target medium is relatively difficult as compared to the other target media, because of the lack of available redundant information in a text file. In this paper, in the backdrop of the limitations in the prevalent text based steganographic approaches, we propose simple, yet novel approaches that overcome the same. Our approaches are based on combining the random character sequence and feature coding methods to hide a character. We also analytically evaluate the approaches based on metrics viz. hiding strength, time overhead and memory overhead entailed. As compared to other methods, we believe the approaches proposed impart increased randomness and thus aid higher security at lower overhead.
THE EVOLUTION OF IDS SOLUTIONS IN WIRELESS AD-HOC NETWORKS TO WIRELESS MESH N... - IJNSA Journal
The domain of wireless networks is inherently vulnerable to attacks because of the unreliable wireless medium. Such networks can be secured from intrusions using either prevention or detection schemes. This paper focuses on intrusion detection rather than prevention of attacks. Because attackers keep improvising, an active prevention method alone cannot provide total security to the system. Herein lies the importance of intrusion detection systems (IDS), which are designed solely to detect intrusions in real time. Wireless networks are broadly classified into Wireless Ad-hoc Networks (WAHNs), Mobile Ad-hoc Networks (MANETs), Wireless Sensor Networks (WSNs), and the most recent Wireless Mesh Networks (WMNs). Several IDS solutions have been proposed for these networks. This paper extends an earlier survey of IDS solutions for MANETs and WMNs, in that the present survey offers a comparative insight into recent IDS solutions for all the subdomains of wireless networks.
TEXT STEGANOGRAPHIC APPROACHES: A COMPARISON - IJNSA Journal
This paper presents three novel approaches to text steganography. The first approach uses the theme of a missing-letter puzzle, where each character of the message is hidden by omitting one or more letters from a word of the cover. The average Jaro score was found to be 0.95, indicating close similarity between the cover and stego files. The second approach hides a message in a wordlist, where the ASCII value of the embedded character determines the length and starting letter of a word. The third approach conceals a message, without degrading the cover, by using the start and end letters of words in the cover. To enhance the security of the secret message, the message is scrambled using a one-time pad scheme before being concealed, and the cipher text is then concealed in the cover. We also present an empirical comparison of the proposed approaches with some of the popular text steganographic approaches and show that our approaches outperform the existing approaches.
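The one-time pad scrambling step can be sketched as a byte-wise XOR with a random pad of the same length as the message. This is the textbook construction, offered as an assumption about what "one-time pad scheme" means here; the function name and sample message are hypothetical, and the concealment step that follows is not modeled.

```python
import secrets


def otp_xor(data: bytes, pad: bytes) -> bytes:
    """XOR data with a pad of equal length; applying it twice restores data."""
    if len(pad) != len(data):
        raise ValueError("pad must be exactly as long as the data")
    return bytes(d ^ p for d, p in zip(data, pad))


message = b"meet at noon"
pad = secrets.token_bytes(len(message))  # fresh random pad, used only once
cipher = otp_xor(message, pad)           # scrambled text to be concealed
plain = otp_xor(cipher, pad)             # receiver reverses it with the same pad
```

The pad must be shared with the receiver in advance and never reused, otherwise the scheme loses its security guarantee.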
Analysis of the evolution of advanced transformer-based language models: Expe... - IAESIJAI
Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP) that focuses on identifying and extracting subjective information in textual material. This can include determining the overall sentiment of a piece of text (e.g., positive or negative), as well as identifying specific emotions or opinions expressed in the text, that involves the use of advanced machine and deep learning techniques. Recently, transformer-based language models make this task of human emotion analysis intuitive, thanks to the attention mechanism and parallel computation. These advantages make such models very powerful on linguistic tasks, unlike recurrent neural networks that spend a lot of time on sequential processing, making them prone to fail when it comes to processing long text. The scope of our paper aims to study the behaviour of the cutting-edge Transformer-based language models on opinion mining and provide a high-level comparison between them to highlight their key particularities. Additionally, our comparative study shows leads and paves the way for production engineers regarding the approach to focus on and is useful for researchers as it provides guidelines for future research subjects.
Traid-bit embedding process on Arabic text steganography method - journalBEEI
The enormous growth in Internet use has been driven by continuous improvement in the area of security. Enhanced security embedding techniques are applied to protect intellectual property. There are numerous types of security mechanisms. Steganography is the art and science of concealing secret information inside a cover medium such as image, audio, video, or text without arousing an eavesdropper's suspicion. Text is ideal for steganography because of its ubiquity. Many steganography embedding techniques use the Arabic language to embed the hidden message in the cover text. Kashida, Shifting Point, and Sharp-edges are three high-capacity Arabic steganography embedding techniques. However, these three techniques perform poorly when embedding the hidden message into the cover text. This paper presents the traid-bit method, which integrates these three Arabic text steganography embedding techniques. It is an effective way to evaluate many embedding techniques at the same time, and it introduces one solution for various cases.
A survey on Script and Language identification for Handwritten document imagesiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Performance analysis on secured data method in natural language steganographyjournalBEEI
The rapid amount of exchange information that causes the expansion of the internet during the last decade has motivated that a research in this field. Recently, steganography approaches have received an unexpected attention. Hence, the aim of this paper is to review different performance metric; covering the decoding, decrypting and extracting performance metric. The process of data decoding interprets the received hidden message into a code word. As such, data encryption is the best way to provide a secure communication. Decrypting take an encrypted text and converting it back into an original text. Data extracting is a process which is the reverse of the data embedding process. The effectiveness evaluation is mainly determined by the performance metric aspect. The intention of researchers is to improve performance metric characteristics. The evaluation success is mainly determined by the performance analysis aspect. The objective of this paper is to present a review on the study of steganography in natural language based on the criteria of the performance analysis. The findings review will clarify the preferred performance metric aspects used. This review is hoped to help future research in evaluating the performance analysis of natural language in general and the proposed secured data revealed on natural language steganography in specific.
EXPERIMENTING WITH THE NOVEL APPROACHES IN TEXT STEGANOGRAPHYIJNSA Journal
As is commonly known, the steganographic algorithms employ images, audio, video or text files as the medium to ensure hidden exchange of information between multiple contenders to protect the data from the prying eyes. However, using text as the target medium is relatively difficult as compared to the other target media, because of the lack of available redundant information in a text file. In this paper, in the backdrop of the limitations in the prevalent text based steganographic approaches, we propose simple, yet novel approaches that overcome the same. Our approaches are based on combining the random character sequence and feature coding methods to hide a character. We also analytically evaluate the approaches based on metrics viz. hiding strength, time overhead and memory overhead entailed. As compared to other methods, we believe the approaches proposed impart increased randomness and thus aid higher security at lower overhead.
THE EVOLUTION OF IDS SOLUTIONS IN WIRELESS AD-HOC NETWORKS TO WIRELESS MESH N...IJNSA Journal
The domain of wireless networks is inherently vulnerable to attacks due to the unreliable wireless medium. Such networks can be secured from intrusions using either prevention or detection schemes. This paper focuses its study on intrusion detection rather than prevention of attacks. As attackers keep on improvising too, an active prevention method alone cannot provide total security to the system. Here in lies the importance of intrusion detection systems (IDS) that are solely designed to detect intrusions in real time. Wireless networks are broadly classified into Wireless Ad-hoc Networks (WAHNs), Mobile Adhoc Networks (MANETs), Wireless Sensor Networks (WSNs) and the most recent Wireless Mesh Networks (WMNs). Several IDS solutions have been proposed for these networks. This paper is an extension to a survey of IDS solutions for MANETs and WMNs published earlier in the sense that the present survey offers a comparative insight of recent IDS solutions for all the sub domains of wireless networks.
TEXT STEGANOGRAPHIC APPROACHES: A COMPARISONIJNSA Journal
This paper presents three novel approaches of text steganography. The first approach uses the theme of missing letter puzzle where each character of message is hidden by missing one or more letters in a word of cover. The average Jaro score was found to be 0.95 indicating closer similarity between cover and stego file. The second approach hides a message in a wordlist where ASCII value of embedded character determines length and starting letter of a word. The third approach conceals a message, without degrading cover, by using start and end letter of words of the cover. For enhancing the security of secret message, the message is scrambled using one-time pad scheme before being concealed and cipher text is then concealed in cover. We also present an empirical comparison of the proposed approaches with some
of the popular text steganographic approaches and show that our approaches outperform the existing approaches.
TEXT STEGANOGRAPHIC APPROACHES: A COMPARISONIJNSA Journal
This paper presents three novel approaches of text steganography. The first approach uses the theme of
missing letter puzzle where each character of message is hidden by missing one or more letters in a word
of cover. The average Jaro score was found to be 0.95 indicating closer similarity between cover and
stego file. The second approach hides a message in a wordlist where ASCII value of embedded character
determines length and starting letter of a word. The third approach conceals a message, without
degrading cover, by using start and end letter of words of the cover. For enhancing the security of secret
message, the message is scrambled using one-time pad scheme before being concealed and cipher text is
then concealed in cover. We also present an empirical comparison of the proposed approaches with some
of the popular text steganographic approaches and show that our approaches outperform the existing
approaches.
Analysis of the evolution of advanced transformer-based language models: Expe...IAESIJAI
Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP) that focuses on identifying and extracting subjective information in textual material. This can include determining the overall sentiment of a piece of text (e.g., positive or negative), as well as identifying specific emotions or opinions expressed in the text, that involves the use of advanced machine and deep learning techniques. Recently, transformer-based language models make this task of human emotion analysis intuitive, thanks to the attention mechanism and parallel computation. These advantages make such models very powerful on linguistic tasks, unlike recurrent neural networks that spend a lot of time on sequential processing, making them prone to fail when it comes to processing long text. The scope of our paper aims to study the behaviour of the cutting-edge Transformer-based language models on opinion mining and provide a high-level comparison between them to highlight their key particularities. Additionally, our comparative study shows leads and paves the way for production engineers regarding the approach to focus on and is useful for researchers as it provides guidelines for future research subjects.
Traid-bit embedding process on Arabic text steganography methodjournalBEEI
The enormous development in the utilization of the Internet has driven by a continuous improvement in the region of security. The enhancement of the security embedded techniques is applied to save the intellectual property. There are numerous types of security mechanisms. Steganography is the art and science of concealing secret information inside a cover media such as image, audio, video and text, without drawing any suspicion to the eavesdropper. The text is ideal for steganography due to its ubiquity. There are many steganography embedded techniques used Arabic language to embed the hidden message in the cover text. Kashida, Shifting Point and Sharp-edges are the three Arabic steganography embedded techniques with high capacity. However, these three techniques have lack of performance to embed the hidden message into the cover text. This paper present about traid-bit method by integrating these three Arabic text steganography embedded techniques. It is an effective way to evaluate many embedded techniques at the same time, and introduced one solution for various cases.
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
Language recognizer/identifier/guesser is the basic application used by humans to identify the language of a text document. It takes simply a file as input and after processing its text, decides the language of text document with precision using LIJ-I, LIJ-II and LIJ-III. LIJ-I results in poor accuracy and strengthen with the use of LIJ-II which is further boosted towards a higher level of accuracy with the use of LIJ-III. It also helps in calculating the probability of digrams and the average percentages of accuracy. LIJ-I considers the complete character sets of each language while the LIJ-II considers only the difference. A JAVA based language recognizer is developed and presented in this paper in detail.
Extractive Summarization with Very Deep Pretrained Language Modelgerogepatton
Recent development of generative pretrained language models has been proven very successful on a wide range of NLP tasks, such as text classification, question answering, textual entailment and so on.In this work, we present a two-phase encoder decoder architecture based on Bidirectional Encoding Representation from Transformers(BERT) for extractive summarization task. We evaluated our model by both automatic metrics and human annotators, and demonstrated that the architecture achieves the state-of-the-art comparable result on large scale corpus - CNN/Daily Mail1. As the best of our knowledge, this is the first work that applies BERT based architecture to a text summarization task and achieved the state-of-the-art comparable result.
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELijaia
Recent development of generative pretrained language models has been proven very successful on a wide
range of NLP tasks, such as text classification, question answering, textual entailment and so on. In this
work, we present a two-phase encoder decoder architecture based on Bidirectional Encoding
Representation from Transformers(BERT) for extractive summarization task. We evaluated our model by
both automatic metrics and human annotators, and demonstrated that the architecture achieves the stateof-the-art comparable result on large scale corpus – ‘CNN/Daily Mail1
As the best of our knowledge’, this
is the first work that applies BERT based architecture to a text summarization task and achieved the stateof-the-art comparable result.
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONijaia
Complicated policy texts require a lot of effort to read, so there is a need for intelligent interpretation of
Chinese policies. To better solve the Chinese Text Summarization task, this paper utilized the mT5 model
as the core framework and initial weights. Additionally, In addition, this paper reduced the model size
through parameter clipping, used the Gap Sentence Generation (GSG) method as unsupervised method,
and improved the Chinese tokenizer. After training on a meticulously processed 30GB Chinese training
corpus, the paper developed the enhanced mT5-GSG model. Then, when fine-tuning the Chinese Policy
text, this paper chose the idea of “Dropout Twice”, and innovatively combined the probability distribution
of the two Dropouts through the Wasserstein distance. Experimental results indicate that the proposed
model achieved Rouge-1, Rouge-2, and Rouge-L scores of 56.13%, 45.76%, and 56.41% respectively on
the Chinese policy text summarization dataset.
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
Complicated policy texts require a lot of effort to read, so there is a need for intelligent interpretation of
Chinese policies. To better solve the Chinese Text Summarization task, this paper utilized the mT5 model
as the core framework and initial weights. Additionally, In addition, this paper reduced the model size
through parameter clipping, used the Gap Sentence Generation (GSG) method as unsupervised method,
and improved the Chinese tokenizer. After training on a meticulously processed 30GB Chinese training
corpus, the paper developed the enhanced mT5-GSG model. Then, when fine-tuning the Chinese Policy
text, this paper chose the idea of “Dropout Twice”, and innovatively combined the probability distribution
of the two Dropouts through the Wasserstein distance. Experimental results indicate that the proposed
model achieved Rouge-1, Rouge-2, and Rouge-L scores of 56.13%, 45.76%, and 56.41% respectively on
the Chinese policy text summarization dataset.
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
Complicated policy texts require a lot of effort to read, so there is a need for intelligent interpretation of
Chinese policies. To better solve the Chinese Text Summarization task, this paper utilized the mT5 model
as the core framework and initial weights. Additionally, In addition, this paper reduced the model size
through parameter clipping, used the Gap Sentence Generation (GSG) method as unsupervised method,
and improved the Chinese tokenizer. After training on a meticulously processed 30GB Chinese training
corpus, the paper developed the enhanced mT5-GSG model. Then, when fine-tuning the Chinese Policy
text, this paper chose the idea of “Dropout Twice”, and innovatively combined the probability distribution
of the two Dropouts through the Wasserstein distance. Experimental results indicate that the proposed
model achieved Rouge-1, Rouge-2, and Rouge-L scores of 56.13%, 45.76%, and 56.41% respectively on
the Chinese policy text summarization dataset.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieved short and important text from the source document. In this paper, we studied various techniques. Plenty of techniques have been developed on English summarization and other Indian languages but very less efforts have been taken for Hindi language. Here, we discusses various techniques in which so many features are included such as time and memory consumption, efficiency, accuracy, ambiguity, redundancy.
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...kevig
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous
speech recognition scenarios such as voice typing and meeting transcriptions still suffer from segmentation
and punctuation problems, resulting from irregular pausing patterns or slow speakers. Transformer
sequence tagging models are effective at capturing long bi-directional context, which is crucial for
automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained
by real-time requirements, making it hard to incorporate the right context when making punctuation
decisions. Context within the segments produced by ASR decoders can be helpful but limiting in overall
punctuation performance for a continuous speech session. In this paper, we propose a streaming approach
for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact
on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation
issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEUscore improvement of 0.66 for the downstream task of Machine Translation (MT).
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...kevig
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcriptions still suffer from segmentation and punctuation problems, resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. Context within the segments produced by ASR decoders can be helpful but limiting in overall punctuation performance for a continuous speech session. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEUscore improvement of 0.66 for the downstream task of Machine Translation (MT).
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it leads to improve the overall performance of NLP applications. In this paper the Deep learning techniques are used to perform NER task on Hindi text data as it found that as compared to English NER, Hindi language NER is not sufficiently done. This is a barrier for resource-scarce languages as many resources are not readily available. Many researchers use various techniques such as rule based, machine learning based and hybrid approaches to solve this problem. Deep learning based algorithms are being developed in large scale as an innovative approach now a days for the advanced NER models which will give the best results out of it. In this paper we devise a Novel architecture based on residual network architecture for preferably Bidirectional Long Short Term Memory (BiLSTM) with fasttext word embedding layers. For this purpose we use pre-trained word embedding to represent the words in the corpus where the NER tags of the words are defined as the used annotated corpora. BiLSTM Development of an NER system for Indian languages is a comparatively difficult task. In this paper, we have done the various experiments to compare the results of NER with normal embedding and fasttext embedding layers to analyse the performance of word embedding with different batch sizes to train the deep learning models. Here we present a state-of-the-art results with said approach F1 Score measures.
Similar to A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD TAGGING AND RGB COLOR CODING (20)
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD TAGGING AND RGB COLOR CODING
International Journal of Network Security & Its Applications (IJNSA) Vol. 10, No. 6, November 2018
DOI: 10.5121/ijnsa.2018.10601
Ali F. Al-Azzawi
Department of Software Engineering, IT Faculty, Philadelphia University, Amman, Jordan
ABSTRACT
This paper introduces a multi-layer hybrid text steganography approach that uses word tagging and
recoloring. Existing approaches are designed to favor only one of imperceptibility, high hiding
capacity, or robustness. The proposed approach does not use the ordinary sequential embedding process;
it addresses the shortcomings of current approaches by targeting imperceptibility, high hiding capacity,
and robustness together, combining a linguistic technique with a format-based technique. The linguistic
technique divides the cover text into embedding layers, where each layer consists of a sequence of words
that share a single part of speech detected by a POS tagger. The format-based technique recolors the
letters of the cover text with a nearby RGB color code to embed 12 bits of the secret message in each
letter, which yields a high hiding capacity and conceals the embedding. Robustness is achieved through
the multi-layer embedding process, and the generated stego key significantly supports the security of
the embedded message and its size. A comparison of the experimental results shows that the proposed
approach is better than currently developed approaches at balancing the imperceptibility, hiding
capacity, and robustness criteria.
KEYWORDS
Text Steganography, Python Programming Language, Multi-layer Encoding, Natural Language
Processing, Color Space
1. INTRODUCTION
Text steganography can involve anything from altering the format of a text, to editing words within a
text, to generating random character sequences, to using context-free grammars to create readable
text [1]. Text steganography is considered the most difficult kind of steganography because a text file
lacks the redundant data that is available in an image, sound, or video file, so changes can easily be
discovered [2]. A text file looks the same as its stored content, whereas in other file types, such as
an image, the structure of the file differs from what we see [3]. Unnoticed changes can be made to an
image, audio, or video file, but in a text file any change, even an extra letter or punctuation mark,
can be noticed by a reader. Text steganography is nevertheless preferable to other steganography
approaches because a text file requires less storage and less time to send and receive over
communication channels [4, 5].
Text steganography can be broadly divided into two types: format-based and linguistic. The format-based
strategy changes the physical format of a text to embed information and can be classified into word
shifting, line shifting, space extension, and feature encoding. The linguistic strategy uses properties
of natural language to embed information and can be classified into syntactic and semantic methods.
These methods are shown in Figure 1 [6, 7].
Figure 1. Text steganography types
The linguistic steganography approach analyzes the linguistic properties of natural language text and
its linguistic structure to hide secret messages [8], and most of these approaches use steganographic
knowledge to hide secret messages inside the syntactical structure itself. Linguistic steganography may
use either semantic or syntactic properties [9]. Syntactic steganography hides secret messages by
modifying the text structure without significantly altering the meaning of the cover text [10], while
semantic steganography hides the secret message by introducing adjustments within the meaning of the
text [8, 11].
Format-based text steganography covers several methods that rely on inserting spaces or non-displayed
characters and on resizing fonts [1]. However, when the stego text is opened, misspellings and
additional space characters can be detected, and changed font sizes can arouse a reader's
suspicion [3].
Text steganography is one of the hardest areas of information hiding because the human eye can easily
recognize any change between the original text and the stego text [12]. There are three fundamental
criteria for evaluating the performance of a text steganographic approach: imperceptibility, hiding
capacity, and robustness [12]; Section 2.4 describes these criteria.
Most existing text steganography methods are not secure enough to prevent detection and are designed to
address either the hiding capacity problem or the robustness problem. In this work, imperceptibility,
high hiding capacity, and robustness are all considered in evaluating the performance of the proposed
text steganography approach. The main objective is to obtain a substantial increase in the capacity for
hiding data in the cover text and to generate stego keys for improved security. To achieve this
objective, we propose a steganography approach based on a linguistic technique and a format-based
technique. The linguistic technique divides the cover text into embedding layers, where each layer
consists of a sequence of words that share a single part of speech detected by POS tagging, while the
format-based technique hides secret data in those layers by coloring letters with a nearby RGB color.
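As a rough illustration of the format-based step, the sketch below packs a secret message into letter colors. It assumes that the 12 bits per letter stated in the abstract are split as 4 bits per channel, which is consistent with restricting R, G, and B to the range 0-15. The bit order, the zero padding, and the function names are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple

BITS_PER_CHANNEL = 4                      # assumption: 4 bits each for R, G, B (values 0-15)
BITS_PER_LETTER = 3 * BITS_PER_CHANNEL    # 12 bits carried by one recolored letter

RGB = Tuple[int, int, int]


def bits_to_rgb(bits: str) -> RGB:
    """Map a 12-character string of '0'/'1' to an (R, G, B) triple in 0..15."""
    if len(bits) != BITS_PER_LETTER or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 12 characters of '0' or '1'")
    # Assumed order: first 4 bits -> R, next 4 -> G, last 4 -> B.
    return int(bits[0:4], 2), int(bits[4:8], 2), int(bits[8:12], 2)


def rgb_to_bits(rgb: RGB) -> str:
    """Inverse of bits_to_rgb: recover the 12 embedded bits from a letter color."""
    if any(not 0 <= channel <= 15 for channel in rgb):
        raise ValueError("each channel must be in the range 0..15")
    r, g, b = rgb
    return f"{r:04b}{g:04b}{b:04b}"


def message_to_colors(message: bytes) -> List[RGB]:
    """Turn a secret message into one near-black color per cover letter."""
    bits = "".join(f"{byte:08b}" for byte in message)
    # Hypothetical padding scheme: append zeros up to a multiple of 12 bits.
    bits += "0" * (-len(bits) % BITS_PER_LETTER)
    return [
        bits_to_rgb(bits[i:i + BITS_PER_LETTER])
        for i in range(0, len(bits), BITS_PER_LETTER)
    ]


if __name__ == "__main__":
    for color in message_to_colors(b"Hi"):
        print(color, rgb_to_bits(color))
```

Under these assumptions, a receiver who knows the message size (for example from the stego key) could concatenate the recovered bit strings and discard the padding.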
The remainder of the paper is organized as follows: Section 2 presents the background, Section 3
discusses the proposed work, Section 4 discusses the results of the proposed work, and Section 5
presents the conclusion and future work.
2. BACKGROUND
In this section, we review the fundamental concepts needed to implement an efficient text steganography
approach: using parts of speech as layers; selecting a suitable range of RGB colors using CIELAB visual
perception; using the Natural Language Toolkit (NLTK), the python-docx package, and the Python
programming language as the development environment; and the most common criteria for designing an
efficient text steganography algorithm.
2.1. PART OF SPEECH
Parts of speech (POS) are valuable because of the large amount of information they give about a word
and its neighbors. Knowing the tag of a word helps predict the likely neighboring words and the
grammatical structure around the word (a noun is typically part of a noun phrase), which makes
part-of-speech tagging a very important element of grammar parsing [13]. While there are several POS
tag lists, current processing of English commonly uses the 45-tag Penn Treebank tag set, with which
large corpora have been labeled [14]; Table 1 shows some of these tags.
Table 1. Some Part-of-speech tags for Penn Treebank
In this work, the words of the cover text are classified according to the 45 tags of the Penn Treebank
tag set, so the classification process divides the cover text into at most 45 hiding layers.
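The layering step can be sketched as a simple grouping of tagged words. The example below takes (word, tag) pairs as input; in the paper's setting these would come from a POS tagger such as NLTK's `nltk.pos_tag`, but here they are hard-coded so that the sketch runs without external data. The function name `build_layers` and the choice to store each word's position alongside it are illustrative assumptions.

```python
from typing import Dict, List, Tuple

TaggedWord = Tuple[str, str]          # (word, Penn Treebank tag)
Layer = List[Tuple[int, str]]         # (position in cover text, word)


def build_layers(tagged_words: List[TaggedWord]) -> Dict[str, Layer]:
    """Group cover-text words into one embedding layer per POS tag.

    Each layer keeps the words in their original order together with
    their positions, so the stego text can be reassembled later.
    """
    layers: Dict[str, Layer] = {}
    for position, (word, tag) in enumerate(tagged_words):
        layers.setdefault(tag, []).append((position, word))
    return layers


if __name__ == "__main__":
    # Hand-tagged sample sentence (tags follow the Penn Treebank tag set).
    sample = [
        ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
        ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
    ]
    for tag, layer in build_layers(sample).items():
        print(tag, layer)
```

Because a dictionary keyed by tag can hold at most one entry per tag, the number of layers is bounded by the size of the tag set, which matches the limit of 45 layers stated above.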
2.2. NATURAL LANGUAGE PROCESSING USING PYTHON
To this end, we implement the proposed approach using the Natural Language Toolkit (NLTK), the
python-docx package, and the Python programming language.
NLTK is a suite of programs and libraries that provides functionality such as tokenization,
part-of-speech tagging, classification, parsing, and semantic analysis for the English language. It is
written in the Python programming language and was developed by Edward Loper and Steven Bird at the
University of Pennsylvania [15]. The python-docx package is used for editing Microsoft Word document
files and for altering the RGB color of letters in those files [16].
2.3. COLOR PERCEPTUAL COMPARISON
The CIELAB model (CIE, Commission Internationale de l'Eclairage) is one of the Lab color space
models employed to measure color differences in relation to human vision, where L stands for
lightness, and a and b for the green-red and blue-yellow components, respectively. Its main
purpose is to provide a perceptually linear space, so that the distance between two points can
be interpreted as the perceptual difference between the corresponding colors [17].
Consider two colors (L1, a1, b1) and (L2, a2, b2) in the CIELAB space. There are several
color-difference formulas for Lab, such as CIE76, CIE94, and CIE2000; in this work the CIE76
formula is used for its simplicity and because only a limited RGB range (0-15) is used. The
color difference is defined as follows [17, 18]:
$$\Delta E_{ab} = \sqrt{(L_2 - L_1)^2 + (a_2 - a_1)^2 + (b_2 - b_1)^2} \qquad (1)$$
which indicates:
not perceptible, if 0 < ∆E < 1
only an experienced observer can notice, if 1 < ∆E < 2
an inexperienced observer can also notice, if 2 < ∆E < 3.5
noticeable, if 3.5 < ∆E < 5
different colors, if ∆E > 5
and the formulas to convert RGB to Lab, by way of the tristimulus values (X, Y, Z) of the color
and (Xn, Yn, Zn) of the reference white, are defined as follows:
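$$L = 116\, g\!\left(\frac{Y}{Y_n}\right) - 16, \quad a = 500\left[g\!\left(\frac{X}{X_n}\right) - g\!\left(\frac{Y}{Y_n}\right)\right], \quad b = 200\left[g\!\left(\frac{Y}{Y_n}\right) - g\!\left(\frac{Z}{Z_n}\right)\right] \qquad (2)$$
$$g(t) = \begin{cases} t^{1/3} & \text{for } t > 0.008856 \\ 7.787\, t + \frac{16}{116} & \text{for } t \le 0.008856 \end{cases} \qquad (3)$$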
The proposed approach depends on changing the color of the letters of the cover text from black
to a nearby color, so that the human eye cannot see any difference. The RGB color space is
restricted to the range 0 to 15 for each of the components R, G, and B. All differences between
black and the other RGB colors in the range 0-15 were measured according to the CIELAB visual
perception metric above, and these differences were found to be less than one, i.e., within the
range not perceptible by the human eye.
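The colour-difference computation described in this section can be sketched as follows. This is a minimal sketch, not the authors' code: it uses the standard CIELAB constants, takes XYZ tristimulus values as input, and assumes a D65 reference white. The function names g, xyz_to_lab, delta_e_cie76, and describe are illustrative. The RGB-to-XYZ step is deliberately left out because the paper does not state which conversion was used, and the resulting ∆E values depend on that choice.

```python
import math

D65_WHITE = (95.047, 100.0, 108.883)  # assumed reference white (Xn, Yn, Zn)


def g(t):
    """Piecewise helper of the XYZ-to-Lab conversion (standard constants)."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0


def xyz_to_lab(x, y, z, white=D65_WHITE):
    """Convert XYZ tristimulus values to (L, a, b)."""
    xn, yn, zn = white
    gx, gy, gz = g(x / xn), g(y / yn), g(z / zn)
    return (116.0 * gy - 16.0, 500.0 * (gx - gy), 200.0 * (gy - gz))


def delta_e_cie76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in Lab space."""
    return math.sqrt(sum((c2 - c1) ** 2 for c1, c2 in zip(lab1, lab2)))


def describe(delta_e):
    """Map a colour difference to the perception bands listed above."""
    if delta_e < 1:
        return "not perceptible"
    if delta_e < 2:
        return "only an experienced observer can notice"
    if delta_e < 3.5:
        return "an inexperienced observer can also notice"
    if delta_e < 5:
        return "noticeable"
    return "different colors"


difference = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))
print(difference, describe(difference))
```

To check a candidate letter colour, one would convert both black and the candidate to Lab and pass the two triples to delta_e_cie76.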
2.4. EVALUATION CRITERIA OF TEXT STEGANOGRAPHY
Researchers consider many criteria when designing algorithms for text steganography. The most
common criteria in recently proposed approaches are hiding capacity, invisibility, and
robustness; an appropriate algorithm must provide an optimal trade-off between these evaluation
criteria in line with the needs of the application [7, 19, 20, 21].
1. HIDING CAPACITY:
The hiding capacity is determined by calculating the maximum length, in bits, of the secret
message that can be inserted in the cover text to construct the stego text (4); however, it is
unhelpful for an algorithm to provide high capacity without also providing adequate protection
[19, 22, 23]. The maximum length of the secret message is determined with respect to the bits
per location, where a location is a specific embedding position in the cover text (e.g., after
punctuation marks or in the spaces between words).
Maximum Embedding = Bits Per Location × Total Locations (4)
2. IMPERCEPTIBILITY:
Imperceptibility is the ability of a concealed message to remain unseen by the human eye. It is
achieved through a high degree of similarity between the cover text and the stego text: the
cover content must remain almost unchanged after the message is embedded, and the cover text
must not be corrupted. The most direct way of measuring the degree of invisibility is to compare
the cover text before and after inserting the secret message. Imperceptibility may be formulated
through the following expression, which states that the cover text remains approximately the
same after the secret message is inserted [19, 25].
CT ≈ ST, where CT is the cover text and ST is the stego text (5)
3. ROBUSTNESS:
The robustness of a steganography algorithm is its resistance to steganalysis attacks in which
an intruder without the authorization key tries to modify or extract the concealed secret
message. Robustness can be estimated numerically using the losing probability (LP), which
indicates how much of the embedded secret message would be lost from the stego text; a lower
probability means a more robust algorithm. The distortion robustness (DR) is found by:
$$\mathrm{DR} = 1 - \mathrm{LP} \qquad (6)$$
where
$$\mathrm{LP} = \frac{\mathrm{NEL}}{\mathrm{CTL}}, \quad \mathrm{NEL} \le \mathrm{CTL}$$
NEL = number of embedding locations, CTL = cover text length
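Equation (6) is simple enough to check by hand, but a short sketch makes the roles of NEL and CTL explicit. The function name distortion_robustness is an assumption, and the sample figures (24 embedding locations in a 40-character cover text) are the ones quoted for the AITSteg test 1 case in Section 4; they are used here only as sample inputs, not as reported results.

```python
def distortion_robustness(nel, ctl):
    """Return DR = 1 - LP, with LP = NEL / CTL.

    nel: number of embedding locations
    ctl: cover text length (must be positive)
    """
    if ctl <= 0:
        raise ValueError("cover text length must be positive")
    losing_probability = nel / ctl
    return 1.0 - losing_probability


dr = distortion_robustness(24, 40)
print(f"{dr:.2f}")  # 0.40
```

A smaller share of modified locations gives a lower LP and therefore a DR closer to 1.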
There is a level of security that keeps attackers from recognizing the secret message visually
or from extracting it [7]. It depends on embedding capacity, invisibility, and robustness, and
an effective steganography approach must strike an appropriate balance between the three
criteria [24].
3. THE PROPOSED APPROACH
The proposed approach embeds any kind of information as a string of bits in a text document. For
instance, a secret text message is converted to a string of bits, and the basic hiding unit is a
chunk of 12 bits of the secret message; that is, 12 bits are embedded in a single letter by
changing its color to a different RGB color close to black. Our plan is therefore to replace the
original RGB color of each letter in the cover text with a close RGB color in order to embed 12
bits of the secret message.
First, the NLTK tokenizer and tagging modules are used to divide the cover text into a sequence
of words with their associated parts of speech. All words with the same part of speech are
collected to construct a layer; the frequencies of these layers (the number of words with the
same tag) are calculated, and the layers are sorted in ascending order according to their
frequencies. Each part of speech with its frequency represents an embedding layer, and the
capacity of the cover text is determined from the number of layers and the number of words
found in each layer through the following equation:
$$\text{No. of embedded bits} = \sum_{i=1}^{N} \sum_{j=1}^{M} \operatorname{len}(word_j) \times 12, \quad word_j \in Layer_i,\ i = 1..N,\ j = 1..M \qquad (7)$$
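Equation (7) amounts to counting the letters of every word in every embedding layer and multiplying by 12. A minimal sketch follows; the function name capacity_bits is an assumption, and the four word groups are the selected layers of the worked example in Section 3.3, so the result is the capacity of those four layers only and not of the whole cover text.

```python
BITS_PER_LETTER = 12  # one 12-bit chunk is hidden in each recolored letter


def capacity_bits(layers):
    """Sum len(word) * 12 over every word of every layer, as in Eq. (7)."""
    return sum(len(word) * BITS_PER_LETTER
               for words in layers
               for word in words)


selected_layers = [
    ["friends", "years", "smiles", "tears"],  # NNS
    ["Count", "Count"],                       # NNP
    ["your", "your"],                         # PRP$
    ["age", "life"],                          # NN
]

bits = capacity_bits(selected_layers)
print(bits, "bits =", bits // 8, "eight-bit characters")  # 576 bits = 72 ...
```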
The number of embedding layers can be chosen up to the total number of layers. A list of
non-repeating random integers from 0 to the maximum layer index is generated for use in the
encoding and decoding processes; the layers are visited in the order in which they appear in
this random layer list, and a word is taken from each visited layer to encode or decode a part
of the secret message. As long as a binary chunk of the secret message remains to be hidden, a
word is taken from the next layer. When no chunks remain, the cover text contains the secret
message and has become the stego text, and a secret stego key is produced, which is used to
verify the integrity of the stego text in the extracting process. The following two sections
explain the proposed algorithms for the embedding and extracting processes in detail.
3.1. EMBEDDING DATA
Embedding secret message algorithm
Input : Cover text file, the Secret message to hide
Output : Stego text file, key
1. The NLTK tokenize module is used to divide the cover text into a sequence of words, denoted
by W = (W1, ..., Wm).
2. The NLTK tagging module is used to assign a part of speech to each word in the sequence W
and to generate a new sequence TW, shown below, in which each word is associated with its part
of speech.
TW = ((W1, T1), ..., (Wm, Tm)), where Wi is a word that has part of speech Ti.
3. The sequence TW of the cover text is divided into a sequence of k layers, denoted by L1, L2,
…, Lk, where each layer contains all words with the same part of speech (POS). These layers are
indexed by the integers 1, 2, …, k, where k is the total number of distinct POS tags found in
the cover text.
4. Sort the k layers in ascending order according to the length of the layer, where
Length of layer i = number of words ∈ Li
5. Select the number of embedding layers (at most k), and generate a non-repeating random list
of layer indices, e.g., SL = (7, 1, k-1, …, 4), where each layer index <= k.
6. Construct an embedding word sequence:
WS = {{WS11, WS12, …, WS1k}, {WS21, WS22, …, WS2k}, {WS31, WS32, …, WS3k}, …}
such that WSij is the word at location i of layer j.
7. Convert the secret message to a binary string BS.
8. Split the bits of BS into a sequence of chunks CL = [C1, C2, …, Ct] such that each chunk
contains 12 bits.
9. For each word W in the embedding word sequence WS:
    For each letter in W:
        If there is still a chunk in CL:
            Get the next chunk C and split it into three groups of 4 bits (R, G, B)
            Color the letter with (R, G, B)
        Else exit
10. Produce a key from the number of embedding letters and the selected layer list SL.
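Steps 7 to 10 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are illustrative, the message is encoded as UTF-8, the last chunk is padded with zero bits, and colors are returned as plain (R, G, B) tuples keyed by (word index, letter index) rather than written to a document. With python-docx, each color would presumably be applied to a one-letter run through its font color property. The word sequence and message are those of the worked example in Section 3.3.

```python
def message_to_bits(message):
    """Step 7: convert the secret message to a binary string (UTF-8 assumed)."""
    return "".join(f"{byte:08b}" for byte in message.encode("utf-8"))


def bits_to_colors(bits, chunk_size=12):
    """Step 8: split into 12-bit chunks and map each to an (R, G, B) triple."""
    colors = []
    for start in range(0, len(bits), chunk_size):
        # Assumption: a short final chunk is padded on the right with zeros.
        chunk = bits[start:start + chunk_size].ljust(chunk_size, "0")
        colors.append(tuple(int(chunk[i:i + 4], 2) for i in (0, 4, 8)))
    return colors


def embed(word_sequence, message, layer_order):
    """Steps 9-10: assign one color per letter and build the key."""
    colors = bits_to_colors(message_to_bits(message))
    positions = [(w, l) for w, word in enumerate(word_sequence)
                 for l in range(len(word))]
    if len(colors) > len(positions):
        raise ValueError("secret message exceeds the capacity of the cover")
    coloring = dict(zip(positions, colors))  # zip stops at the last chunk
    key = (len(colors), list(layer_order))
    return coloring, key


words = ["your", "friends", "age", "Count", "your",
         "years", "life", "Count", "smiles", "tears"]
coloring, key = embed(words, "May you live every day of your life", [2, 0, 3, 1])
print(key)               # (24, [2, 0, 3, 1])
print(coloring[(0, 0)])  # (4, 13, 6): color of the first letter of "your"
```

Every component stays in the range 0-15 because it is read from exactly four bits.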
3.2. EXTRACTING DATA
Extracting Secret message Algorithm
Input : Stego text file, key
Output : Secret message
1. The NLTK tokenizer module is used to divide the stego text into a sequence of words, denoted
by W = (W1, ..., Wm).
2. The NLTK tagging module is used to assign a part of speech to each word in the sequence W
and to generate a new sequence TW that associates each word with its part of speech, denoted by
TW = ((W1, T1), ..., (Wm, Tm)), where Wi is a word that has part of speech Ti.
3. The sequence TW of the stego text is divided into a sequence of k layers, denoted by L1,
L2, …, Lk, where each layer contains all words in the stego text with the same tag. These
layers are indexed by the integers 1, 2, …, k, where k is the total number of distinct POS tags
found in the stego text.
4. Sort the k layers in ascending order according to the length of the layer, where
Length of layer i = number of words ∈ Li
5. Determine the number of embedding letters EL and the layer list SL from the key.
6. Construct an embedding word sequence:
WS = {{WS11, WS12, …, WS1k}, {WS21, WS22, …, WS2k}, {WS31, WS32, …, WS3k}, …}
such that WSij is the word at location i of layer j.
7. For each word W in the embedding word sequence WS:
    For each letter in W:
        If fewer than EL embedding letters have been read:
            Read the RGB color of the embedding letter
            Convert the RGB color to a string of 12 bits and append it to the binary secret message BS
        Else exit
8. Convert the binary string BS to its Unicode representation.
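Steps 7 and 8 of the extraction can be sketched as the inverse mapping from letter colors back to text. This is an illustrative sketch: the function name extract_message is hypothetical, the colors are passed in as a list of (R, G, B) tuples already read in embedding order, and the handling of padding (dropping an incomplete trailing byte and trailing zero bytes) is an assumption, since the algorithm above does not say how padding bits are removed.

```python
def extract_message(letter_colors, embedded_letters):
    """Rebuild the secret text from the colors of the first EL letters.

    letter_colors: (R, G, B) tuples in embedding order, components in 0-15
    embedded_letters: the count EL taken from the stego key
    """
    bits = "".join(f"{component:04b}"
                   for rgb in letter_colors[:embedded_letters]
                   for component in rgb)
    usable = len(bits) - len(bits) % 8  # drop an incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    # Assumption: zero bytes at the end are padding, not message content.
    return data.rstrip(b"\x00").decode("utf-8")


# Two letters carry 24 bits, i.e. exactly three 8-bit characters.
print(extract_message([(4, 13, 6), (1, 7, 9)], 2))  # May
```

The letter count from the key tells the decoder where to stop, so colors of later letters are ignored.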
3.3. AN EXAMPLE FOR THE PROPOSED APPROACH
Consider a cover document that contains the text "Count your age by friends, not years. Count
your life by smiles, not tears", and let the secret message to embed be "May you live every day
of your life".
Using the algorithm in Section 3.1:
1. In steps 1 and 2, the cover text is divided into a sequence of words, and a part of speech
is detected for each word. The list of words with their associated parts of speech is shown in
Figure 2.
[('Count', 'NNP'), ('your', 'PRP$'), ('age', 'NN'), ('by', 'IN'), ('friends', 'NNS'), ('not',
'RB'), ('years', 'NNS'), ('Count', 'NNP'), ('your', 'PRP$'), ('life', 'NN'), ('by', 'IN'),
('smiles', 'NNS'), (',', ','), ('not', 'RB'), ('tears', 'NNS')]
Figure 2. Words with their POS
2. In steps 3 and 4, the sequence of layers for the cover text is determined; each layer
contains all words with the same POS, and the layers are then sorted according to their length,
as shown in Figure 3. There are six layers: the first layer is tagged "NNS" and contains four
words, the second layer is tagged "NNP" and contains two words, and so on.
[('NNS', (4, 0)), ('NNP', (2, 0)), ('PRP$', (2, 0)), ('NN', (2, 0)), ('IN', (2, 0)), ('RB', (2, 0))]
a. Layers arranged according to the length
Count(1) your(2) age(3) by(4) friends(0), not(5) years(0). Count(1) your(2) life(3) by(4) smiles(0), not(5) tears(0)
b. Words with their layers (layer index in parentheses)
Figure 3. POS layers and word layers
3. In steps 5 and 6, the number of layers to be used for embedding is selected, and a
non-repeating random list of layer indices is generated. Suppose we select four layers for
embedding; Figure 4 shows the list of selected layers and the words in these layers.
[('NNS', (4, 0)), ('NNP', (2, 0)), ('PRP$', (2, 0)), ('NN', (2, 0))]
a. Selected Layers
Count(1) your(2) age(3) friends(0) years(0). Count(1) your(2) life(3) smiles(0) tears(0)
b. Words in the selected layers (each color represents a layer; layer index in parentheses)
Figure 4. Selected layers and their words
Assume the generated random list is [2, 0, 3, 1]; the words in the embedding layers are then as
shown in Figure 5:
Layer order: 2, 0, 3, 1
Layer 2: your, your
Layer 0: friends, years, smiles, tears
Layer 3: age, life
Layer 1: Count, Count
Figure 5. Words in the embedding layers
For embedding, one word at a time is taken from each layer in the order shown in Figure 5; if
a layer has no words left, the process moves on to the next layer. The resulting embedding word
sequence appears in Figure 6.
Round 1, layers [2, 0, 3, 1]: your, friends, age, Count
Round 2, layers [2, 0, 3, 1]: your, years, life, Count
Remaining, layers [0, 0]: smiles, tears
Figure 6. Embedding word sequence
4. In step 7, the secret message is converted to a binary bit form as below:
010011010110000101111001001000000111100101101111011101010010000001101100011010
010111011001100101001000000110010101110110011001010111001001111001001000000110
010001100001011110010010000001101111011001100010000001111001011011110111010101
1100100010000001101100011010010110011001100101
5. In steps 8 and 9, the binary string is divided into a sequence of 12-bit chunks, each chunk
is divided into three 4-bit groups, and the values of these groups represent the R, G, and B
colors, respectively. The final chunk contains only the last four message bits and is filled
with zeros. The embedding word sequence, the bit string selected for each word, and the RGB
color selected for each letter are shown in Figure 7.
Word: your
Bits: 010011010110 000101111001 001000000111 100101101111
RGB: [4, 13, 6] [1, 7, 9] [2, 0, 7] [9, 6, 15]
Word: friends
Bits: 011101010010 000001101100 011010010111 011001100101 001000000110 010101110110 011001010111
RGB: [7, 5, 2] [0, 6, 12] [6, 9, 7] [6, 6, 5] [2, 0, 6] [5, 7, 6] [6, 5, 7]
Word: age
Bits: 001001111001 001000000110 010001100001
RGB: [2, 7, 9] [2, 0, 6] [4, 6, 1]
Word: Count
Bits: 011110010010 000001101111 011001100010 000001111001 011011110111
RGB: [7, 9, 2] [0, 6, 15] [6, 6, 2] [0, 7, 9] [6, 15, 7]
Word: your
Bits: 010101110010 001000000110 110001101001 011001100110
RGB: [5, 7, 2] [2, 0, 6] [12, 6, 9] [6, 6, 6]
Word: years
Bits: 010100000000
RGB: [5, 0, 0]
Figure 7. Word bit string, and RGB for letters
The following stego text is generated after each letter in the selected words is recolored with
a near RGB color; it appears identical to the cover text:
“Count your age by friends, not years. Count your life by smiles, not tears”
6. In step 10, the key is generated from the number of embedding letters and the random list.
For this example, the key is 24, [2, 0, 3, 1], which means that the secret message needs 24
letters to be hidden and is distributed over the words of layers 2, 0, 3, and 1. Using
equation (7), the maximum message length that can be hidden in the example cover text is
59*12 = 708 bits, which is equal to 88 characters.
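The layer ordering of Figure 3 and the embedding word sequence of Figure 6 can be reproduced with a short round-robin sketch. The function names order_layers and embedding_sequence are assumptions. The sketch lists the largest layer first because that is the order shown in Figure 3; the largest_first flag can be switched if the opposite order is intended. Layers of equal length keep their order of first appearance in the text.

```python
def order_layers(layers, largest_first=True):
    """Sort (tag, words) layers by length; the sort is stable for ties."""
    return sorted(layers.items(), key=lambda item: len(item[1]),
                  reverse=largest_first)


def embedding_sequence(ordered_layers, layer_order):
    """Take one word per layer in layer_order, skipping exhausted layers."""
    queues = [list(ordered_layers[index][1]) for index in layer_order]
    sequence = []
    while any(queues):          # stop once every selected layer is empty
        for queue in queues:
            if queue:
                sequence.append(queue.pop(0))
    return sequence


# Layers of the example cover text, in order of first appearance.
layers = {
    "NNP": ["Count", "Count"],
    "PRP$": ["your", "your"],
    "NN": ["age", "life"],
    "IN": ["by", "by"],
    "NNS": ["friends", "years", "smiles", "tears"],
    "RB": ["not", "not"],
}

ordered = order_layers(layers)
sequence = embedding_sequence(ordered, [2, 0, 3, 1])
print([tag for tag, _ in ordered])
print(sequence)
```

Both sides must derive the same ordering from the text, which is why the key only needs to carry the letter count and the random layer list.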
3.4. IMPLEMENTATION OF EMBEDDING AND EXTRACTING
The embedding and extracting processes are implemented using the Python programming language,
the NLTK natural language framework, and the python-docx package, as described below:
1. NLTK is used to determine the part of speech of each word in the cover text.
2. The python-docx package is used to read a Microsoft Word document file, write a Microsoft
Word document file, and change the RGB color of letters in a Microsoft Word document file.
The complete implementation of the hiding and extracting processes, together with the tests,
can be found at
https://drive.google.com/file/d/1-L8hmZ3AiOXILAfK2srca-k1jWDe0tUG/view?usp=sharing
4. EXPERIMENTAL RESULTS
There are many approaches with properties similar to those of the proposed text steganography
approach, such as Space mimic, WhiteSteg, SNOW, TWSM, wbStego4open, UniSpaCh, and AITSteg.
Since UniSpaCh and AITSteg perform better than all the others [19, 26], the proposed approach
is compared with AITSteg and UniSpaCh according to the common criteria: hiding capacity,
invisibility, and robustness [19]. For a fair comparison, we use the same cover texts and secret
messages, shown in Table 2 [19], that were used to evaluate AITSteg and UniSpaCh.
Table 2. Cover Texts and Secret Messages
The proposed approach is evaluated and analyzed as follows:
Hiding Capacity: hiding capacity is a major parameter of steganography algorithm performance.
The proposed approach and the AITSteg approach have the same number of embedded characters for
all tests, as shown in Table 3. AITSteg inserts one zero-width character (ZWC) to hide 2 bits
of the secret message, which means that it inserts 32 ZWCs into the cover text to hide only 8
characters (64 bits), and this increases the cover text size unacceptably. The proposed
algorithm, in contrast, hides 12 bits in each letter of the cover text without inserting or
adding any character. Compared with UniSpaCh, the proposed approach is a vast improvement, as
shown in Table 3.
Table 3. Comparisons with other approaches
CL = Cover Text Length, SL = Secret Message Length
Invisibility: Figure 6 shows that the proposed approach does not alter any cover text character
after hiding the secret message, since the embedding algorithm only recolors letters of the
cover text with a close RGB color so that the human eye cannot see any difference. All
differences between black and the other RGB colors in the range 0-15 were measured according to
the CIELAB visual perception metric described above, and we found that these differences are
less than one, i.e., within the range not perceptible by the human eye.
Robustness: the ability to recover the secret message from the stego text is the most
fundamental criterion in steganography. In this work, robustness is achieved by hiding the
secret message in the words of the cover text through altering the color of words that are
selected from multiple, randomly chosen layers. The distortion robustness values of the
proposed approach are calculated using equation (6). As shown in Table 3, the proposed approach
is more robust than UniSpaCh and differs only slightly from AITSteg. However, AITSteg embeds
the whole message at one location at the beginning of the cover content: the first character of
the message is hidden in the first position, the next character in the following position, and
so on. The message is therefore hidden in one continuous stream, which is easier to destroy
than a message distributed throughout the cover text. Besides, AITSteg inserts one character
for every 2 bits of the message, which also makes the embedded message easy to destroy. For
example, the test 1 message has 6 characters (48 bits), so 24 characters must be embedded in a
cover text of 40 characters; even though they are inserted at one location, the embedded
message can easily be destroyed because of its length.
In the proposed approach, even if the hidden message is discovered, it cannot be decoded or
removed from the stego text, since the proposed hiding algorithm does not use an ordinary
sequential insertion process: it combines a random multi-layer hiding technique with recoloring
of the cover text letters in a suitable RGB color to provide imperceptibility. In light of this
evidence, it provides a good balance between the embedding capacity, imperceptibility, and
robustness criteria.
5. CONCLUSIONS
This work presents a text steganography approach based on word tagging and RGB coding. It uses
a word tagging technique to provide high security, and an RGB coding technique to provide high
hiding capacity and imperceptibility. The approach groups all words with the same POS into a
layer to build multiple hiding layers, which supports security, and changes the RGB color of
each letter of the cover text to a close RGB color to obtain high hiding capacity. In addition,
the approach dynamically selects the number of layers, their part-of-speech tags, and their
sequence for hiding a secret message, which makes the steganalysis process more difficult.
Coloring the letters of the words in multiple tagged layers of the cover text with the closest
RGB color is an improvement on the linguistic steganography approach and provides a more
effective tool for data concealment.
REFERENCES
[1] K. Bennett, (2004), “Linguistic steganography- survey, analysis and robustness concerns for hiding
information in text”, Purdue University, CERIAS Tech. Report 2004-13.
[2] S Bhattacharya, P Indu, Duta, SA Biswas, G Sanyal, (2011) “Hiding data in text through in alphabet
letter patterns (CALP)”. Journal of Global Research in Computer Science, 2(3): 33-39
[3] Agarwal, M., (2013). "Text steganographic approaches: a comparison”. arXiv preprint
arXiv:1302.2718.
[4] Shirali-Shahreza, M.H. and Shirali-Shahreza, M., (2006), July. “A new approach to Persian/Arabic
text steganography”. In Computer and Information Science, 2006 and 2006 1st IEEE/ACIS
International Workshop on Component-Based Software Engineering, Software Architecture and
Reuse. ICIS-COMSAR 2006. 5th IEEE/ACIS International Conference on(pp. 310-315). IEEE.
[5] S. H. Low, N. F. Maxemchuk, J. T. Brassil, and L. O. Gorman, (1995). “Document marking and
identification using both line and word shifting”, INFOCOM95 Proceedings of the Fourteenth Annual
Joint Conf. of the IEEE Computer and Communication Societies, 1995, pp. 853-860.
[6] Singh, H., Singh, P.K. and Saroha, K., (2009), February. “A survey on text based steganography”.
In Proceedings of the 3rd National Conference (pp. 26-27).
[7] Taleby Ahvanooey, M., Li, Q., Shim, H.J. and Huang, Y., (2018). “A Comparative Analysis of
Information Hiding Techniques for Copyright Protection of Text Documents”. Security and
Communication Networks, 2018.
[8] Banerjee, I., Bhattacharyya, S. and Sanyal, G., (2011), January. “Novel text steganography through
special code generation”. Int. Conf. On Systemics, Cybernetics, and Informatics (pp. 298-303).
[9] Shirali-Shahreza, M., (2008), February. “Text steganography by changing words spelling”.
In Advanced Communication Technology, ICACT 2008. 10th International Conference on (Vol. 3,
pp. 1912-1913). IEEE.
[10] Topkara, M., Topkara, U. and Atallah, M.J., (2006), October. "Words are not enough: sentence level
natural language watermarking":. In Proceedings of the 4th ACM international workshop on Contents
protection and security (pp. 37-46). ACM.
[11] Kumar, K.A. and Pabboju, S., (2018). “AN OPTIMIZED TEXT STEGANOGRAPHY APPROACH
USING DIFFERENTLY SPELT ENGLISH WORDS".
[12] Krishnan, R. B., Thandra, P. K., & Baba, M. S. (2017). "An overview of text steganography". In Signal
Processing, Communication and Networking (ICSCN), 2017 Fourth International Conference on (pp.
1-6). IEEE.
[13] Jurafsky, D. and Martin, J.H., (20164). "Speech and language processing". London: Pearson.
[14] Marcus, M.P., Marcinkiewicz, M.A. and Santorini, B., (1993). “Building a large annotated corpus of
English: The Penn Treebank”. Computational Linguistics, 19(2), pp.313-330.
[15] Hardeniya, N., Perkins, J., Chopra, D., Joshi, N. and Mathur, I., (2016). “Natural Language
Processing: Python and NLTK”. Packt Publishing Ltd.
[16] http://textx.readthedocs.io/en/v1.4.x/#projects-using-textx, last visit, September, (2018).
[17] Mokrzycki, W.S. and Tatol, M., 2011. "Colour difference∆ E-A survey". Machine Graphics and
Vision, 20(4), pp.383-411.
[18] Patel, I. and Goud, J., (2012). "Colour recognition for blind and colour blind people". Int J. Eng
Innovat Technol, 2(6), pp.38-42.
[19] Ahvanooey, M.T., Li, Q., Hou, J., Mazraeh, H.D. and Zhang, J., (2018). "AITSteg: An Innovative
Text Steganography Technique for Hidden Transmission of Text Message via Social Media". IEEE
Access.
[20] Taleby Ahvanooey, M., Dana Mazraeh, H. and Tabasi, S.H., (2016). "An innovative technique for
web text watermarking" (AITW). Information Security Journal: A Global Perspective, 25(4-6),
pp.191-196.
[21] Aman, M., Khan, A., Ahmad, B. and Kouser, S., (2017). "A hybrid text steganography approach
utilizing Unicode space characters and zero-width character". International Journal on Information
Technologies and Security, 9(1), pp.85-100.
[22] Alotaibi, R.A. and Elrefaei, L.A., (2018). "Improved capacity Arabic text watermarking methods
based on open word space". Journal of King Saud University-Computer and Information
Sciences, 30(2), pp.236-248.
[23] Kouser, S. and Khan, A., (2017). "a novel feature extraction approach: capacity based zero-text
steganography". international journal on information technologies and security, 9(3), pp.85-98.
[24] Zhang, W., Meng, J. and Ma, C., (2018), June. “Research progress of applying digital watermarking
technology for printing”. In 2018 Chinese Control And Decision Conference (CCDC) (pp. 4479-
4482). IEEE.
[25] Kamaruddin, N.S., Kamsin, A., Por, L.Y. and Rahman, H., (2018). "A Review of Text
Watermarking: Theory, Methods, and Applications". IEEE Access, 6, pp.8011-8028.
[26] Kumar, R., Malik, A., Singh, S., Kumar, B. and Chand, S., (2016), April. "A Space based reversible
high capacity text steganography scheme using Font type and style". In Computing, Communication,
and Automation (ICCCA), 2016 International Conference on (pp. 1090-1094). IEEE.