This document discusses unsupervised techniques for deciphering encrypted documents. It presents an approach using noisy-channel modeling and expectation-maximization (EM) to find the most likely plaintext given a ciphertext. As an example, it applies this method to deciphering a basic English letter substitution cipher. The algorithm initializes substitution probabilities uniformly, then iterates to adjust the probabilities based on plaintext and ciphertext letter co-occurrence counts to converge on the most probable plaintext.
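As a concrete illustration of the loop just described, here is a minimal Python sketch of EM for a one-for-one substitution channel. It assumes a fixed unigram plaintext model `plain_probs` (the paper itself uses a bigram plaintext model with forward-backward and Viterbi decoding); the function name and data layout are illustrative, not from the paper.

```python
from collections import defaultdict

def em_substitution(ciphertext, plain_probs, iterations=20):
    """EM for a one-for-one letter substitution channel P(c|p).

    plain_probs: unigram probabilities of plaintext letters (the P(p) model).
    Returns the learned substitution table s[p][c]."""
    plain = list(plain_probs)
    cipher = sorted(set(ciphertext))
    # Initialize substitution probabilities uniformly.
    s = {p: {c: 1.0 / len(cipher) for c in cipher} for p in plain}
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for c in ciphertext:
            # E-step: fractional counts P(p | c) for every plaintext letter.
            z = sum(plain_probs[p] * s[p][c] for p in plain)
            for p in plain:
                counts[p][c] += plain_probs[p] * s[p][c] / z
        # M-step: renormalize expected co-occurrence counts into probabilities.
        for p in plain:
            total = sum(counts[p].values())
            if total > 0:
                s[p] = {c: counts[p][c] / total for c in cipher}
    return s
```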
This document describes techniques for creating compact and fast n-gram language models. It presents several implementations of n-gram language models that are both small in size and fast to query. The most compact implementation can store the 4 billion n-grams from the Google Web1T corpus in just 23 bits per n-gram, using techniques like implicitly encoding words and applying variable-length encodings to context deltas. It also discusses methods for improving query speed, such as a novel language model caching technique that speeds up both their implementations and SRILM by up to 300%.
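As a rough sketch of the variable-length encoding idea mentioned above (not the paper's exact bit-level scheme, which packs below byte granularity to reach 23 bits per n-gram), here is a byte-aligned varint encoding of sorted context deltas in Python:

```python
def encode_deltas(sorted_ids):
    """Delta-encode a sorted list of context ids, then varint-encode each delta.

    Small gaps dominate in sorted n-gram context lists, so most deltas
    fit in a single byte."""
    out = bytearray()
    prev = 0
    for x in sorted_ids:
        delta = x - prev
        prev = x
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)
```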
Learning phoneme mappings for transliteration without parallel data (Attaporn Ninsuwan)
The document presents a method for learning cross-language phoneme mappings without parallel data by framing transliteration as a decipherment problem and using monolingual resources to learn mappings between English and Japanese phonemes. It compares this unsupervised approach to a supervised approach using parallel data and finds the unsupervised method achieves 40% accuracy on a name transliteration task, similar to the supervised approach. The goal is to develop transliteration systems that do not require parallel resources for any language pair.
Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Simil... (Koji Matsuda)
The document presents a unified approach for measuring semantic similarity between texts at multiple levels (sense, word, text) using semantic signatures. It generates semantic signatures through multi-seeded random walks over the WordNet graph. It then aligns and disambiguates words and senses to extract sense "seeds" for the signatures. Finally, it calculates signature similarity using measures like cosine similarity, weighted overlap, and top-k Jaccard. The approach provides a unified framework for semantic similarity that can be applied to various NLP tasks.
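A small Python sketch of the three similarity measures named above, applied to sparse signatures represented as dicts from sense to weight; the weighted-overlap formula follows the rank-based definition commonly used for semantic signatures, and the function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse signatures (dicts: sense -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_overlap(u, v):
    """Weighted overlap: compares rank positions of the shared dimensions."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    rank_u = {k: r for r, k in enumerate(sorted(u, key=u.get, reverse=True), 1)}
    rank_v = {k: r for r, k in enumerate(sorted(v, key=v.get, reverse=True), 1)}
    num = sum(1.0 / (rank_u[k] + rank_v[k]) for k in shared)
    den = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return num / den

def topk_jaccard(u, v, k=10):
    """Jaccard overlap of the k highest-weighted senses in each signature."""
    top_u = set(sorted(u, key=u.get, reverse=True)[:k])
    top_v = set(sorted(v, key=v.get, reverse=True)[:k])
    return len(top_u & top_v) / len(top_u | top_v)
```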
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming (Guy De Pauw)
This document discusses learning Amharic verb morphology using inductive logic programming (ILP). Amharic verbs are complex, conveying information about subject, object, tense, aspect, mood and more through affixation, reduplication and compounding. The authors apply ILP to learn morphological rules from a training set of 216 Amharic verbs. They achieve 86.9% accuracy on a test set of 1,784 verb forms. Key challenges include a lack of similar examples in the training data and learning inappropriate alternation rules. This work contributes to advancing the automatic learning of morphology for under-resourced languages like Amharic.
R is a statistical programming language that is free, open source, and has a large community of contributors. Scripting in R enables reproducible research. R can be used for statistical analysis, modeling, and data visualization, and it is also a general-purpose programming language, with many packages for tasks like time series analysis, machine learning, and reading various data formats. Alternatives include Stata, SPSS, and SAS, but R is more flexible as a programming language.
This paper presents the alignment of verbal predicate constructions with the clitic pronoun "lhe" in the European (EP) and Brazilian (BP) varieties of Portuguese, such as in the sentences "Já lhe arrumaram a bagagem" | "Sua bagagem está seguramente guardada" 'His baggage is safely stowed away', where the EP dative proclisis "lhe" contrasts with the BP possessive pronoun "sua". We have selected several different paraphrastic contrasts, such as proclisis and enclisis, clitic pronouns co-occurring with relative pronouns and negation-type adverbs, among other constructions, to illustrate the linguistic phenomenon. Some differences correspond to real contrasts between the two Portuguese varieties, while others purely represent stylistic choices. The contrasting variants were manually aligned in order to constitute a gold standard dataset, and a typology has been established, to be further enlarged and made publicly available. The paraphrastic alignments were performed in the e-PACT corpus using the CLUE-Aligner tool. The research work was developed in the framework of the eSPERTo project.
Understanding the risk factors of learning in adversarial environments (Pluribus One)
This document summarizes research on developing a theoretical foundation for robust machine learning classifiers that can provide assurances against adversarial manipulation. It proposes measuring a classifier's robustness based on how much its decision boundary rotates under small perturbations to the training data (contamination). For linear classifiers, robustness can be quantified as the expected angular change between the classifier's weight vectors trained on clean vs. contaminated data. This provides an intuitive way to compare learning algorithms and inform the development of more robust algorithms.
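A minimal Python sketch of that robustness measure; `train_fn` and `perturb_fn` are hypothetical placeholders for a concrete learner and contamination process:

```python
import numpy as np

def angular_change(w_clean, w_contaminated):
    """Angle (radians) between weight vectors trained on clean vs.
    contaminated data. A smaller expected angle under random contamination
    indicates a more robust linear classifier."""
    cos = np.dot(w_clean, w_contaminated) / (
        np.linalg.norm(w_clean) * np.linalg.norm(w_contaminated))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def expected_angle(train_fn, perturb_fn, data, trials=100):
    """Estimate the expected angular change by retraining on perturbed
    copies of the training set (train_fn and perturb_fn are placeholders)."""
    w0 = train_fn(data)
    return np.mean([angular_change(w0, train_fn(perturb_fn(data)))
                    for _ in range(trials)])
```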
Learning for semantic parsing using statistical syntactic parsing techniques (UKM university)
This document describes Ruifang Ge's Ph.D. final defense presentation on using statistical syntactic parsing techniques for learning semantic parsing. It introduces two novel syntax-based approaches to semantic parsing called SCISSOR and SYNSEM. SCISSOR is an integrated syntactic-semantic parser that allows both syntax and semantics to be used simultaneously to obtain an accurate combined syntactic-semantic analysis. SYNSEM exploits an existing syntactic parser to produce disambiguated parse trees that drive the compositional meaning composition. Experimental results on two datasets show that SCISSOR achieves competitive performance compared to other semantic parsing systems, and that leveraging syntactic knowledge improves performance on longer sentences.
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015 (RIILP)
This document presents a method for latent domain word alignment to improve alignment accuracy when training on heterogeneous corpora containing data from different domains. It proposes adding a latent domain layer to the standard hidden Markov alignment model to condition alignment probabilities on the domain. The model is trained using an EM algorithm with partial domain supervision from seed samples. Experimental results show the latent domain model improves over a baseline by disentangling domain-specific translation relationships and alignment probabilities, achieving higher precision, recall and lower alignment error rates.
EMNLP 2019 parallel iterative edit models for local sequence transduction (広樹 本間)
- The document presents a Parallel Iterative Edit (PIE) model for local sequence transduction tasks like grammatical error correction.
- The PIE model achieves accuracy competitive with encoder-decoder models by predicting edits instead of tokens (sketched in code after this list), iteratively refining predictions, and factorizing logits over edits and tokens to leverage pre-trained language models.
- Experiments show the PIE model provides a 5-15x speed improvement over encoder-decoder models for grammatical error correction while maintaining comparable accuracy.
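A toy Python sketch of the edits-instead-of-tokens idea, using `difflib` to label each source span with a local edit; PIE's real edit space (copy, append, replace, delete, plus word transformations) and its parallel, iterative prediction are richer than this:

```python
import difflib

def token_edits(source, target):
    """Label the local edits turning source tokens into target tokens
    (copy / replace / delete / insert), in the spirit of edit-prediction
    models for local sequence transduction."""
    ops = []
    sm = difflib.SequenceMatcher(a=source, b=target)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops += [("copy", tok) for tok in source[i1:i2]]
        elif tag == "replace":
            ops.append(("replace", source[i1:i2], target[j1:j2]))
        elif tag == "delete":
            ops += [("delete", tok) for tok in source[i1:i2]]
        elif tag == "insert":
            ops.append(("insert", target[j1:j2]))
    return ops

print(token_edits("he go to school".split(), "he goes to school".split()))
# [('copy', 'he'), ('replace', ['go'], ['goes']), ('copy', 'to'), ('copy', 'school')]
```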
Open Source Natural Language Processing - Francis Bond (jasonong)
The document is a presentation on open source natural language processing (NLP) given by Francis Bond. It introduces Bond's background and outlines the talk, covering topics like machine translation examples, why open source matters for NLP, and the current state of the art. It provides examples of open source machine translation tools like MOSES and LOGON, and discusses challenges in NLP, including ambiguity and the need for large language models and corpora.
This document describes the process of constructing a corpus of spoken and written Santome, a Portuguese-related creole language spoken in Sao Tome and Principe. The corpus contains over 184,000 words from written sources like newspapers and books, as well as transcribed spoken recordings. Efforts were made to standardize the orthography and develop part-of-speech tags for annotation. Metadata is encoded for each text, and the corpus will be made available through a concordancing tool to allow searches while copyright permissions are obtained. The goal is for this and related Gulf of Guinea creole corpora to enable comparative linguistic research.
Tutorial on Parallel Computing and Message Passing Model - C4 (Marcirio Chaves)
This document provides a tutorial on communicating non-contiguous data and mixed data types in parallel computing using MPI (Message Passing Interface). It discusses several strategies for sending this type of complex data, including sending multiple messages, buffering using pack/unpack, and defining derived datatypes. It also covers collective communication operations like broadcast, scatter/gather, and reductions.
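The tutorial targets MPI in C; as a hedged illustration, the same collective operations look like this in Python with the mpi4py bindings:

```python
# Requires an MPI runtime: run with `mpiexec -n 4 python collectives.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Broadcast: the root sends the same object to every rank.
config = comm.bcast({"chunk": 1024} if rank == 0 else None, root=0)

# Scatter: the root splits a list, one piece per rank.
piece = comm.scatter([list(range(i, i + 3)) for i in range(size)]
                     if rank == 0 else None, root=0)

# Reduce: combine per-rank results back at the root.
total = comm.reduce(sum(piece), op=MPI.SUM, root=0)
if rank == 0:
    print("sum of all pieces:", total)
```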
This document provides an overview of security tools that can be used in software development. It discusses coding standards, compiler warnings, version control, design for security, testing, and analysis tools. Coding standards and following compiler warnings can help catch simple errors. Version control allows reviewing changes and implementing hooks to catch issues early. Designing for security from the start is important. Testing on multiple levels from units to full systems helps improve quality. Static and dynamic analysis tools can find bugs without or during execution. Overall, applying security best practices throughout the development cycle leads to higher quality software.
UTTERANCE-LEVEL SEQUENTIAL MODELING FOR DEEP GAUSSIAN PROCESS BASED SPEECH S... (Tomoki Koriyama)
The document proposes incorporating a simple recurrent unit (SRU) into deep Gaussian processes (DGP) for speech synthesis to enable utterance-level sequential modeling. The SRU-DGP model outperformed feedforward DGP and LSTM RNN baselines in subjective evaluations, and achieved faster speech generation than an LSTM RNN. Experimental results on a Japanese speech corpus showed the SRU-DGP model yielded smaller spectral distortion than other neural network and Bayesian neural network baselines. Future work will investigate incorporating other differentiable components like attention into the DGP framework.
ADAPTIVE AUTOMATA FOR GRAMMAR BASED TEXT COMPRESSION (csandit)
The Internet and the ubiquitous presence of computing devices are generating a continuously growing amount of information. This information's entropy is not uniform, however, which allows data compression algorithms to reduce the demand for more powerful processors and larger data storage equipment. This paper presents an adaptive rule-driven device, the adaptive automaton, as the mechanism for identifying repetitive patterns to be compressed in a grammar-based lossless data compression scheme.
Deep Learning for Machine Translation - A dramatic turn of paradigm (MeetupDataScienceRoma)
Presentation at the March meetup of the Machine Learning / Data Science Meetup of Rome: https://www.meetup.com/it-IT/Machine-Learning-Data-Science-Meetup/events/248063386/
In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed, and then substitutes these sequences with the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to compression. This leads to significantly improved compression results for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two-pass scheme, where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved compression results, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
Non autoregressive neural text-to-speech review (June-Woo Kim)
Non autoregressive neural text-to-speech, Peng, Kainan, et al. "Non-autoregressive neural text-to-speech." International Conference on Machine Learning. PMLR, 2020. review by June-Woo Kim
This document discusses various probabilistic language models used in natural language processing applications. It covers n-gram models like bigram and trigram models used for tasks like speech recognition. It describes how probabilistic language models assign probabilities to strings of text based on counting word occurrences. It also discusses techniques like additive smoothing and linear interpolation that are used to handle zero probability word pairs in n-gram models. Finally, it introduces probabilistic context-free grammars which use rewrite rules with associated probabilities to model language structure.
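For instance, additive (Laplace) smoothing for a bigram model fits in a few lines of Python; `alpha` is the additive constant and the formula is the standard one:

```python
from collections import Counter

def bigram_prob(corpus_tokens, alpha=1.0):
    """Bigram probabilities with additive (Laplace) smoothing:
    P(w2 | w1) = (count(w1, w2) + alpha) / (count(w1) + alpha * V)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)
    return prob

p = bigram_prob("the cat sat on the mat".split())
print(p("the", "cat"))   # seen pair: relatively high
print(p("the", "dog"))   # unseen pair: small but nonzero
```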
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ... (Codemotion)
In the beginning there was "rule based" machine translation, like Babelfish, which didn't work at all. Then came statistical machine translation, powering the likes of Google Translate, and all was good. Nowadays it's all about deep learning, and neural machine translation is the state of the art, with unmatched translation fluency. Let's dive into the internals of a neural machine translation system, explaining the principles and the advantages over the past.
This document proposes a novel framework called smooth sparse coding for learning sparse representations of data. It incorporates feature similarity or temporal information present in data sets via non-parametric kernel smoothing. The approach constructs codes that represent neighborhoods of samples rather than individual samples, leading to lower reconstruction error. It also proposes using marginal regression rather than lasso for obtaining sparse codes, providing a dramatic speedup of up to two orders of magnitude without sacrificing accuracy. The document contributes a framework for incorporating domain information into sparse coding, sample complexity results for dictionary learning using smooth sparse coding, an efficient marginal regression training procedure, and successful application to classification tasks with improved accuracy and speed.
This document summarizes an adaptive Turkish anti-spam filtering algorithm that uses both artificial neural networks and Bayes filtering. It has two parts: a morphology module that extracts word roots from Turkish text, and a classification module that learns to classify emails as spam or normal using the extracted word roots. Experimental results showed the Bayes filtering approach achieved up to 95% accuracy for spam detection and 90% for normal emails, outperforming the neural network approaches tested.
This document summarizes research on developing an adaptive Turkish anti-spam filtering system using artificial neural networks and Bayes filtering. The system has two parts: a morphology module that extracts word roots from Turkish text, and a learning module that classifies emails. Experimental results found up to 95% accuracy for spam detection and 90% for normal emails using a Bayes filter approach with a binary feature representation of words. Processing times were around 6 seconds for morphology and under 1 minute for classification, demonstrating the method is effective for Turkish-language spam filtering while requiring reasonable computation. Future work aims to improve accuracy further on larger email corpora.
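A compact Python sketch of Naive Bayes over binary word-presence features, as described above; it assumes root extraction has already been done by the morphology module, and the Turkish example words are made up for illustration:

```python
import math
from collections import defaultdict

class BinaryNaiveBayes:
    """Naive Bayes over binary word-presence features (root appears or not),
    mirroring the binary representation described above."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.labels}
        df = {c: defaultdict(int) for c in self.labels}
        n = {c: labels.count(c) for c in self.labels}
        for words, c in zip(docs, labels):
            for w in set(words):
                df[c][w] += 1
        self.vocab = {w for d in docs for w in d}
        # P(word present | class), Laplace-smoothed.
        self.cond = {c: {w: (df[c][w] + 1) / (n[c] + 2) for w in self.vocab}
                     for c in self.labels}
        return self

    def predict(self, words):
        present = set(words)
        scores = {}
        for c in self.labels:
            s = math.log(self.prior[c])
            for w in self.vocab:
                p = self.cond[c][w]
                s += math.log(p if w in present else 1.0 - p)
            scores[c] = s
        return max(scores, key=scores.get)

clf = BinaryNaiveBayes().fit(
    [["kazan", "bedava"], ["toplanti", "rapor"]], ["spam", "normal"])
print(clf.predict(["bedava", "hediye"]))  # -> "spam"
```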
IRJET- Automatic Language Identification using Hybrid Approach and Classifica... (IRJET Journal)
This document presents a method for automatic language identification that uses a hybrid approach combining n-gram text processing and Naive Bayesian classification algorithms. The method first preprocesses text documents by removing special characters, suffixes, and generating tokens. It then extracts n-gram features from the text and calculates n-gram frequencies. Finally, it uses the n-gram frequencies as inputs to a Naive Bayesian classifier to identify the language of the document. The approach is able to identify languages like Hindi, English, Gujarati, and Sanskrit without requiring any prior information about the number of languages or initial partitioning of texts.
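As a hedged illustration of frequency-profile language identification, here is the classic Cavnar-Trenkle out-of-place rank distance over character n-grams in Python; note this is a swapped-in scoring rule, not the paper's Naive Bayesian classifier:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Character n-gram frequency profile of a text, most frequent first."""
    text = "".join(ch.lower() for ch in text if ch.isalpha() or ch == " ")
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Rank distance between a language profile and a document profile."""
    rank = {g: i for i, g in enumerate(profile)}
    penalty = len(profile)  # cost for n-grams missing from the profile
    return sum(abs(rank.get(g, penalty) - i) for i, g in enumerate(doc_profile))

def identify(text, language_profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(language_profiles,
               key=lambda lang: out_of_place(language_profiles[lang], doc))
```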
The goal of this work is to present an efficient implementation of the Backpropagation (BP) algorithm for training artificial neural networks with general feedforward topology. This leads us to the "consecutive retrieval problem," which studies how to arrange sets efficiently into a sequence so that every set appears contiguously in the sequence. The BP implementation is analyzed, comparing efficiency results with another similar tool. Together with the BP implementation, the data description and manipulation features of our toolkit facilitate the development of experiments in numerous fields.
This document discusses natural language processing and language models. It begins by explaining that natural language processing aims to give computers the ability to process human language in order to perform tasks like dialogue systems, machine translation, and question answering. It then discusses how language models assign probabilities to strings of text to determine whether they are valid sentences. Specifically, it covers n-gram models, which use the previous n-1 words to predict the next, and the smoothing techniques used to handle uncommon words. The document provides an overview of key concepts in natural language processing and language modeling.
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming (TELKOMNIKA JOURNAL)
Genomic repeats detection, i.e., pattern searching over strings to find repeated base pairs in a deoxyribonucleic acid (DNA) sequence, requires a long processing time. This research builds a big-data computational model that searches for patterns in strings by modifying and implementing the Boyer-Moore algorithm on Apache Spark Streaming, applied to human DNA sequences from the Ensembl site. We also perform experiments on cloud computing, varying the specifications of the computer clusters and the human DNA datasets involved. The results show that the proposed computational model on Apache Spark Streaming is faster than standalone computing and parallel computing with multicore. The main contribution of this research, a computational model that reduces computational costs, has therefore been achieved.
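The matching core can be sketched as follows in Python, using the Horspool simplification of Boyer-Moore (bad-character shifts only); the paper's modified, Spark-distributed version is more elaborate:

```python
def boyer_moore_horspool(text, pattern):
    """Find all occurrences of pattern in text using the Horspool
    simplification of Boyer-Moore (bad-character shifts only)."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Shift table: distance from the last occurrence of each character
    # (excluding the final one) to the end of the pattern.
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    hits, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        i += shift.get(text[i + m - 1], m)
    return hits

print(boyer_moore_horspool("ACGTACGTGACG", "ACG"))  # [0, 4, 9]
```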
This document discusses using R analytics in the cloud. It provides an introduction to bioinformatics and analyzing gene expression data from C. elegans to study aging. It explains that R is popular for bioinformatics but limited to single machines. Hadoop and tools like Segue allow scaling R to the cloud. Segue creates AWS clusters and implements lapply for distributed computing. An example analyzes gene correlation at scale using Segue on AWS. The goal is to discover genes responsible for aging through clustered gene expression maps.
This document discusses using decipherment techniques to improve machine translation when parallel data is scarce. It presents an overview of machine translation pipelines and notes that performance drops when parallel data is limited. The document proposes using monolingual data to improve machine translation in real-world scenarios with limited parallel data. It outlines contributions including fast, accurate decipherment of over 1 billion tokens with 93% accuracy, and using decipherment to improve machine translation for domain adaptation and low-resource languages.
This talk will cover various aspects of Logic Programming. We examine Logic Programming in the contexts of Programming Languages, Mathematical Logic and Machine Learning.
We will start with an introduction to Prolog and metaprogramming in Prolog. We will also discuss how miniKanren and Core.Logic differ from Prolog while maintaining the paradigms of logic programming.
We will then cover the Unification Algorithm in depth and examine the mathematical motivations which are rooted in Skolem Normal Form. We will describe the process of converting a statement in first order logic to clausal form logic. We will also discuss the applications of the Unification Algorithm to automated theorem proving and type inferencing.
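A minimal Python sketch of the unification algorithm over terms encoded as tuples, with '?'-prefixed strings as variables; the occurs check is omitted for brevity:

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def walk(t, subst):
    """Follow variable bindings to their current value."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(x, y, subst=None):
    """Robinson-style unification for terms represented as strings
    (constants), '?name' strings (variables), and tuples (compound terms).
    Returns a substitution dict, or None on failure. No occurs check."""
    if subst is None:
        subst = {}
    x, y = walk(x, subst), walk(y, subst)
    if x == y:
        return subst
    if is_var(x):
        return {**subst, x: y}
    if is_var(y):
        return {**subst, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

# parent(?x, bob) unifies with parent(alice, ?y)
print(unify(("parent", "?x", "bob"), ("parent", "alice", "?y")))
# {'?x': 'alice', '?y': 'bob'}
```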
Finally we will look at the role of Prolog in the context of Machine Learning. This is known as Inductive Logic Programming. In that context we will briefly review Decision Tree Learning and its relationship to ILP. We will then examine Sequential Covering Algorithms for learning clauses in Propositional Calculus and then the more general FOIL algorithm for learning sets of Horn clauses in First Order Predicate Calculus. Examples will be given in both Common Lisp and Clojure for these algorithms.
Pierre de Lacaze has over 20 years’ experience with Lisp and AI based technologies. He holds a Bachelor of Science in Applied Mathematics and Computer Science and a Master’s Degree in Computer Science. He is the president of LispNYC.org
contributed articles

Information Distance Between What I Said and What It Heard
DOI: 10.1145/2483852.2483869
By Yang Tang, Di Wang, Jing Bai, Xiaoyan Zhu, and Ming Li

The RSVP voice-recognition search engine improves speech recognition and translation accuracy in question answering.

Voice input is a major requirement for practical question answering (QA) systems designed for smartphones. Speech-recognition technologies are not fully practical, however, due to fundamental problems (such as a noisy environment, speaker diversity, and errors in speech). Here, we define the information distance between a speech-recognition result and a meaningful query from which we can reconstruct the intended query, implementing this framework in our RSVP system.

In 12 test cases covering male, female, child, adult, native, and non-native English speakers, each with 57 to 300 questions from an independent test set of 300 questions, RSVP on average reduced the number of errors by 16% for native speakers and by 30% for non-native speakers over the best-known speech-recognition software. The idea was then extended to translation in the QA domain.

In our project, which is supported by Canada's International Development Research Centre (http://www.idrc.ca/), we built a voice-enabled cross-language QA search engine for cellphone users in the developing world. Using voice input, a QA system would be a convenient tool for people who do not write, for people with impaired vision, and for children who might wish their Talking Tom or R2-D2 really could talk.

The quality of today's speech-recognition technologies, exemplified by systems from Google, Microsoft, and Nuance, does not fully meet such needs for several reasons:

- Noisy environments in common audio situations [1];
- Speech variations, as in, say, adults vs. children, native speakers vs. non-native speakers, and female vs. male, especially when individual voice-input training is not possible, as in our case; and
- Incorrect and incomplete sentences; even customized speech-recognition systems would fail due to coughing, breaks, corrections, and the inability to distinguish between, say, "sailfish" and "sale fish."

Speech-recognition systems can be trained for a "fixed command set" of up to 10,000 items, a paradigm that

Key insights:
- Focusing on an infinite but highly structured domain (such as QA), we significantly improve general-purpose speech recognition results and general-purpose translation results.
- Assembling a large amount of Internet data is key to helping us achieve these goals; in the highly structured QA domain, we collected millions of human-asked questions covering 99% of question types.
- RSVP development is guided by a theory involving information distance.
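Information distance is defined via Kolmogorov complexity, which is uncomputable; a standard practical approximation (not RSVP's actual implementation) is the normalized compression distance, sketched here in Python with zlib standing in for the ideal compressor:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: the uncomputable Kolmogorov
    complexity K(.) in information distance is approximated by the
    compressed length C(.)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy demonstration: pick the candidate query closest to what was heard.
heard = b"sale fish in the atlantic"
candidates = [b"sailfish in the atlantic",
              b"sale of fish quotas",
              b"weather in the atlantic"]
print(min(candidates, key=lambda q: ncd(heard, q)))
```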
Similar to Unsupervised analysis for decipherment problems
This document is a table of contents and introduction for a book titled "jQuery Fundamentals" by Rebecca Murphey. The book covers jQuery basics, core concepts, events, effects, Ajax, plugins, and advanced topics. It includes over 50 code examples to demonstrate jQuery syntax and techniques. The book is available under a Creative Commons license and the source code is hosted on GitHub.
This document provides a preface and table of contents for a book on jQuery concepts. The preface explains that the book is intended to teach intermediate and advanced jQuery concepts through code examples. It highlights some stylistic approaches used in the book, such as emphasizing code over text explanations and using color coding. It also defines some key terms that will be used, and recommends reviewing the jQuery documentation and understanding how the text() method works before reading the book. The table of contents then outlines the book's 12 chapters and their respective sections, which cover topics like selecting, traversing, manipulating, events, plugins and more.
This document proposes techniques for embedding unique codewords in electronic documents to discourage illicit copying and distribution. It describes three coding methods - line-shift coding, word-shift coding, and feature coding - that alter document formatting or text elements in subtle, hard-to-detect ways. Experimental results show the line-shift coding method can reliably decode documents even after photocopying, enabling identification of the intended recipient. The techniques aim to make unauthorized distribution at least as difficult as obtaining documents legitimately from the publisher.
This document discusses the field of computer forensics. It defines computer forensics as the collection, preservation, and analysis of computer-related evidence. The goal is to provide solid legal evidence that can be admitted in court and understood by laypeople. Computer forensics is used to investigate various incidents including human behavior like fraud, physical events like hardware failures, and organizational issues like staff changes. It aims to determine the root cause of system disruptions and failures.
This document discusses techniques for data hiding, which involves embedding additional data into digital media files like images, audio, or text. It describes several constraints on data hiding, such as the amount of data to hide, ensuring the data remains intact if the file is modified, and preventing unauthorized access to the hidden data. The document outlines traditional and novel data hiding techniques and evaluates them for applications like copyright protection, tamper-proofing, and adding supplemental data to files. It also discusses tradeoffs between hiding more data versus making the data more robust against modifications to the file.
This document summarizes an analysis of over 200,000 websites engaged in badware behavior according to Google's Safe Browsing initiative. The analysis found that over half of infected sites were located in China, with the top three Chinese network blocks accounting for 68% of infections in that country. In contrast, infected sites in the US were more distributed. Compared to the previous year, the total number of infected sites increased, likely due to expanded scanning and increased malware distribution through websites.
Steganography has been used for over 2500 years to hide secret messages. The paper explores steganography's history from ancient times through modern digital applications. It discusses early examples like Johannes Trithemius' steganographic treatise in the 15th century. Modern uses include microdots, digital images, audio, and digital watermarks for copyright protection. Terrorist groups may use steganography but there is no public evidence yet. Steganography continues to evolve with technology while attackers work to defeat new techniques.
The document discusses various cryptographic techniques including symmetric and asymmetric encryption. Symmetric encryption uses the same key for encryption and decryption, while asymmetric encryption uses two different keys. The document then describes the Data Encryption Standard (DES) algorithm and its variants, including Triple DES. It also covers the Advanced Encryption Standard (AES) algorithm, its design principles, and modes of operation for block ciphers like ECB, CBC, CFB and OFB.
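To make the mode-of-operation idea concrete, here is a Python sketch of CBC chaining around a deliberately toy block function; in a real system the `encrypt_block` callback would be AES:

```python
import os

BLOCK = 16

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(blocks, iv, encrypt_block):
    """CBC mode: each plaintext block is XORed with the previous ciphertext
    block before the block cipher is applied, so identical plaintext blocks
    yield different ciphertext (unlike ECB)."""
    out, prev = [], iv
    for p in blocks:
        c = encrypt_block(xor(p, prev))
        out.append(c)
        prev = c
    return out

def cbc_decrypt(blocks, iv, decrypt_block):
    out, prev = [], iv
    for c in blocks:
        out.append(xor(decrypt_block(c), prev))
        prev = c
    return out

# Toy "block cipher" (byte reversal) just to exercise the mode.
enc = lambda b: b[::-1]
dec = lambda b: b[::-1]
iv = os.urandom(BLOCK)
pt = [b"A" * BLOCK, b"A" * BLOCK]        # identical plaintext blocks...
ct = cbc_encrypt(pt, iv, enc)
print(ct[0] != ct[1])                    # ...encrypt differently under CBC
assert cbc_decrypt(ct, iv, dec) == pt
```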
This document discusses the topic of steganography, which is hiding secret messages within other harmless messages. It outlines different techniques for hiding messages in text, images, and audio files. For text, it describes line shift coding, word shift coding, and feature coding methods. For images, it explains least significant bit insertion and exploiting the limitations of the human visual system. For audio, it mentions low-bit encoding and other techniques like phase coding and spread spectrum. It also discusses steganalysis, which aims to detect and destroy hidden messages within files.
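A minimal Python sketch of least significant bit insertion over a flat list of 8-bit pixel values; real tools operate on decoded image buffers, which this list merely stands in for:

```python
def embed_lsb(pixels, message: bytes):
    """Hide message bits in the least significant bit of each pixel value.

    Changing the LSB alters each 8-bit value by at most 1, which is
    imperceptible to the human visual system."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover too small")
    return [(p & ~1) | b for p, b in zip(pixels, bits)] + pixels[len(bits):]

def extract_lsb(pixels, n_bytes):
    bits = [p & 1 for p in pixels[:n_bytes * 8]]
    return bytes(sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
                 for k in range(0, len(bits), 8))

cover = list(range(256)) * 2             # stand-in for image pixel data
stego = embed_lsb(cover, b"hi")
assert extract_lsb(stego, 2) == b"hi"
```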
This document discusses the need for computer security and provides an introduction to key concepts. It explains that security is necessary to protect vital information, provide authentication and access control, and ensure availability of resources. The document then outlines common security threats like firewall exploits, software bugs, and denial of service attacks. It also discusses basic security components of confidentiality, integrity, and availability as well as goals of preventing attacks, detecting violations, and enabling recovery.
The document discusses various types of malicious programs including buffer overflows, viruses, worms, Trojan horses, backdoors, and logic bombs. It describes how buffer overflows can corrupt the program stack and be exploited by attackers. It explains that viruses attach themselves to other programs and replicate, worms replicate across networks, and Trojan horses masquerade as legitimate programs. It also outlines different approaches for antivirus software including signature-based, heuristic, activity monitoring, and full-featured protection.
This document discusses various topics relating to web security, including:
1) Different types of web pages like static, dynamic, and active pages and the technologies used to create them like JavaScript, Java, and CGI.
2) Security issues associated with technologies like ActiveX, Java applets, JavaScript, and cookies.
3) Protocols for secure communication like HTTPS, digital certificates, and single sign-on systems.
4) Methods for secure electronic commerce including SET and digital cash technologies.
This document provides an overview of network security topics including attacks like diffing, sniffing, session hijacking and spoofing. It discusses protocols for secure communication including SSL, TLS and IPSec. SSL and TLS provide security at the transport layer by encrypting data between a client and server. IPSec provides security at the network layer for both transport and tunnel modes. Authentication Header and Encapsulating Security Payload are the two security protocols used in IPSec.
This document provides an overview of network security topics including diffing, sniffing, session hijacking, spoofing, SSL, TLS, IPSec, and VPNs. It discusses how these attacks work and methods to protect against them, such as encryption. Network layer security protocols like IPSec are described, which uses authentication headers or encapsulating security payloads to provide security services to packets. Transport layer security protocols SSL and TLS are also summarized, including how they establish encrypted sessions between clients and servers.
This document discusses various topics related to computer security authorization, including multilevel security models like Bell-LaPadula and Biba's model, covert channels, inference control, CAPTCHAs, firewalls, and intrusion detection systems. It also provides an overview of network layers like the network layer, transport layer, TCP, and UDP. The key models discussed are Bell-LaPadula for confidentiality and Biba's model for integrity. Covert channels, inference control, and intrusion detection systems are described as techniques for authorization and access control.
This document discusses various methods of authentication, including message authentication, entity authentication, and digital signatures. It describes techniques such as hashing, message authentication codes (MACs), digital signatures using RSA, and challenge-response authentication. It also covers other authentication methods such as passwords, biometrics, and zero-knowledge proofs. The goal of authentication is to verify the identity of entities and ensure the integrity and authenticity of messages.
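Two of these building blocks, MACs and challenge-response, fit in a short Python sketch using only the standard library; the protocol shown is a toy single-message exchange, not a full authentication scheme:

```python
import hashlib
import hmac
import os

def mac(key: bytes, message: bytes) -> bytes:
    """Message authentication code: anyone holding the shared key can
    verify both the integrity and the authenticity of the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def challenge_response(key: bytes) -> bool:
    """Toy challenge-response: the verifier sends a fresh random nonce,
    the prover returns its MAC; replaying an old response fails because
    the nonce changes every run."""
    nonce = os.urandom(16)            # verifier's challenge
    response = mac(key, nonce)        # prover's answer
    return hmac.compare_digest(response, mac(key, nonce))

key = os.urandom(32)
assert challenge_response(key)
```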
This document discusses the discrete-time Fourier transform (DTFT). It begins by introducing the DTFT and how it can be used to represent aperiodic signals as the sum of complex exponentials. Several properties of the DTFT are then discussed, including linearity, time/frequency shifting, periodicity, and conjugate symmetry. Examples are provided to illustrate how to compute the DTFT of simple signals. The document also discusses how the DTFT can be used to represent periodic signals and impulse trains.
This document discusses the continuous-time Fourier transform. It begins by developing the Fourier transform representation of aperiodic signals as the limit of Fourier series coefficients as the period increases. It then defines the Fourier transform pairs and discusses properties like convergence. Several examples of calculating the Fourier transform of common signals like exponentials, pulses and periodic signals are provided. Key concepts like the sinc function are also introduced.
Chapter 3 - Fourier Series Representation of Periodic Signals (Attaporn Ninsuwan)
This document discusses Fourier series representation of periodic signals. It introduces continuous-time periodic signals and their representation as a linear combination of harmonically related complex exponentials. The coefficients in the Fourier series representation can be determined by multiplying both sides of the representation by complex exponentials and integrating over one period. The key steps are: 1) multiplying both sides by e^(-jnω0t), 2) integrating both sides from 0 to T = 2π/ω0, and 3) using the fact that the integral equals T when k = n and 0 otherwise to obtain an expression for the coefficients a_n. Examples are provided to illustrate these concepts.
Unsupervised Analysis for Decipherment Problems

Kevin Knight, Anish Nair, Nishit Rathod
Information Sciences Institute and Computer Science Department
University of Southern California
knight@isi.edu, {anair,nrathod}@usc.edu

Kenji Yamada
Language Weaver, Inc.
4640 Admiralty Way, Suite 1210
Marina del Rey, CA 90292
kyamada@languageweaver.com

Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499-506, Sydney, July 2006. © 2006 Association for Computational Linguistics.
Abstract

We study a number of natural language decipherment problems using unsupervised learning. These include letter substitution ciphers, character code conversion, phonetic decipherment, and word-based ciphers with relevance to machine translation. Straightforward unsupervised learning techniques most often fail on the first try, so we describe techniques for understanding errors and significantly increasing performance.

1 Introduction

Unsupervised learning holds great promise for breakthroughs in natural language processing. In cases like (Yarowsky, 1995), unsupervised methods offer accuracy results that rival supervised methods (Yarowsky, 1994) while requiring only a fraction of the data preparation effort. Such methods have also been a key driver of progress in statistical machine translation, which depends heavily on unsupervised word alignments (Brown et al., 1993).

There are also interesting problems for which supervised learning is not an option. These include deciphering unknown writing systems, such as the Easter Island rongorongo script and the 20,000-word Voynich manuscript. Deciphering animal language is another case. Machine translation of human languages is another, when we consider language pairs where little or no parallel text is available. Ultimately, unsupervised learning also holds promise for scientific discovery in linguistics. At some point, our programs will begin finding novel, publishable regularities in vast amounts of linguistic data.

2 Decipherment

In this paper, we look at a particular type of unsupervised analysis problem in which we face a ciphertext stream and try to uncover the plaintext that lies behind it. We will investigate several applications that can be profitably analyzed this way. We will also apply the same technical solution to these different problems.

The method follows the well-known noisy-channel framework. At the top level, we want to find the plaintext that maximizes the probability P(plaintext | ciphertext). We first build a probabilistic model P(p) of the plaintext source. We then build a probabilistic channel model P(c | p) that explains how plaintext sequences (like p) become ciphertext sequences (like c). Some of the parameters in these models can be estimated with supervised training, but most cannot.

When we face a new ciphertext sequence c, we first use expectation-maximization (EM) (Dempster, Laird, and Rubin, 1977) to set all free parameters to maximize P(c), which is the same (by Bayes Rule) as maximizing the sum over all p of P(p) · P(c | p). We then use the Viterbi algorithm to choose the p maximizing P(p) · P(c | p), which is the same (by Bayes Rule) as our original goal of maximizing P(p | c), or plaintext given ciphertext.

Figures 1 and 2 show standard EM algorithms (Knight, 1999) for the case in which we have a bigram P(p) model (driven by a two-dimensional b table of bigram probabilities) and a one-for-one P(c | p) model (driven by a two-dimensional s table of substitution probabilities). This case covers Section 3, while more complex models are employed in later sections.

3 English Letter Substitution

An informal substitution cipher (Smith, 1943) disguises a text by substituting code letters for normal letters. This system is usually exclusive, meaning that each plaintext letter maps to only one ciphertext letter, and vice versa. There is surprisingly little published on this problem, e.g., (Peleg and Rosenfeld, 1979), because fast computers led to public-key cryptography before much computer analysis was done on such old-style ciphers. We study this problem first because it resembles many of the other problems we are interested in, and we can generate arbitrary amounts of test data.

We estimate unsmoothed parameter values for an English letter-bigram P(p) from news data. This is a 27x27 table that includes the space character. We then set up a uniform P(c | p), which also happens to be a 27x27 table.
Figure 4: Decipherment error on letter substitution.

Indeed, it is not correct to measure accuracy on a tuning/development data set. Rather, we have demonstrated some general strategies and observations (more data, larger n-grams, stability of good language models) that we can apply to other real decipherment situations. In many such situations, there is only a test set, and tuning is impossible even in principle; fortunately, we observe that the general strategies work robustly across a number of decipherment domains.

4 Character Code Conversion

Many human languages are straightforwardly represented at the character level by some widely-adopted standard (e.g., ASCII). In dealing with other languages (like Arabic), we must be equally prepared to process a few different standards. Documents in yet other languages (like Hindi) are found spread across the web in dozens if not hundreds of specialized encodings. These come with downloadable fonts for viewing. However, they are difficult to handle by computer, for example, to build a full-coverage Hindi web-search engine, or to pool Hindi corpora for training machine translation or speech recognition.

Character conversion tools exist for many pairs of major encoding systems, but it has been the experience of many researchers that these tools are flawed; despite the amount of work that goes into them, 100% accuracy is not to be found. Furthermore, nothing exists for most pairs. We believe that mild annotation techniques allow people to generate conversion tables quite quickly (and we show some results on this), but we follow here an unsupervised approach, as would be required to automatically generate a consistently-encoded Hindi web.

Our ciphertext c is a stream of bytes in an unknown encoding, with space separators; we use integers to represent these bytes, as in Figure 5(a). Our plaintext is a large collection of UTF8 standard Hindi. UTF8 builds complex Hindi character "chunks" out of up to 3 simple and combining characters. A Hindi word is a sequence of chunks, and words are separated by spaces.

We know that c is Hindi; we imagine that it was once UTF8, but that it somehow got enciphered.

Modeling is more complex than in the previous section. First, we have to decide what our plaintext tokens will be. Our first approach was to use chunks. Chunk boundaries are essentially those where we could draw a vertical line in written Hindi without disturbing any characters. We could then set up a model of how UTF8 is "encoded" to the mystery sequence in the putative channel; namely, we let each source chunk map to a particular target byte sequence. (By analogy, we would divide up English text into mostly letters, but would chunk ligatures like "fi" together. In fact, in extracting English text from pdf, we often find "fi" encoded by a single byte.) This model is quite general and holds up across the encodings we have dealt with. However, there are over 900 chunks to contend with, and vast numbers of target byte sequences, so that the P(c | p) table is nearly unmanageable.

Therefore, we use a simpler model. We divide p into individual characters, and we set up a channel in which plaintext characters can map into either one or two ciphertext bytes. Instead of a table like P(c c | p), we set up two tables: P(f | p) for character fertility, and P(c | p) for character-to-byte substitution. This is similar to Model 3 of (Brown et al., 1993), but without null-generated elements or re-ordering.

Our actual ciphertext is an out-of-domain web page with 11,917 words of song lyrics in Hindi, in an idiosyncratic encoding. There is no known tool to convert from this encoding. In order to report error rates, we had to manually annotate a portion of this web page with correct UTF8. This was quite difficult. We were completely unable to do this manually by relying only on the ciphertext byte sequence, even though this is what we are asking our machine to do! But as Hindi readers, we also have access to the web-site rendering in Hindi glyphs, which helps us identify which byte sequences correspond to which Hindi glyphs, and then to UTF8. The labeled portion of our ciphertext consists of 59 running words (281 ciphertext bytes and 201 UTF8 characters).

Because the machine decipherment rarely consists of exactly 201 UTF8 characters, we report edit distance instead of error rate. An edit distance of 0 is perfect, while the edit distance for long incorrect decipherments may be greater than 201. With a source character bigram model, and the above channel, we obtain an edit distance of 161. With a trigram model, we get 127.

Now we introduce another idea that has worked across several decipherment problems. We use a fixed, uniform fertility model and allow EM only to manipulate substitution probabilities.
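The one-or-two-byte channel just described is easiest to see in its generative direction. A hedged sketch, shown as a sampler rather than the trained model (the dictionary layout and function names are ours):

```python
import random

def encipher_char(p, fert, subst):
    """fert[p]: dict {1: prob, 2: prob} for character fertility P(f | p);
    subst[p]: dict byte -> prob for character-to-byte substitution P(c | p)."""
    f = random.choices([1, 2], weights=[fert[p][1], fert[p][2]])[0]
    return random.choices(list(subst[p].keys()),
                          weights=list(subst[p].values()), k=f)

def encipher(plaintext, fert, subst):
    out = []
    for p in plaintext:                 # each character emits 1 or 2 bytes
        out.extend(encipher_char(p, fert, subst))
    return out
```

Fixing the fertility table to be uniform, as the text does, removes one set of free parameters and leaves EM to learn only the substitution table.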
(a) ... 13 5 14 . 16 2 25 26 2 25 . 17 2 13 . 15 2 8 . 7 2 4 2 9 2 2 ...
(b) ... 6 35 . 12 28 49 10 28 . 3 4 6 . 1 10 3 . 29 4 8 20 4 ...
(c) ... 6 35 24 . 12 28 21 4 . 11 6 . 12 25 . 29 8 22 4 ...
(d) ... 6/35/24 . 12/28 21/28 . 3/4 6 . 1/25 . 29 8 20/4 ... *
Figure 5: Hindi character code decipherment. (a) is the Hindi ciphertext byte sequence, (b) is an EM decipherment
using a UTF8 trigram source model, (c) is a decipherment using a UTF8 word frequency model, and (d) is correct
UTF8 (chunks joined with slash). Periods denote spaces between words; * denotes the correct answer.
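The evaluation above compares a decipherment against the 201-character UTF8 reference by edit distance, i.e., standard Levenshtein distance. A self-contained sketch:

```python
def edit_distance(hyp, ref):
    """Minimum number of insertions, deletions, and substitutions
    turning hyp into ref (classic dynamic program)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return d[m][n]
```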
P(13 | 6) = 0.66 *     P( 8 | 24) = 0.48
P(32 | 6) = 0.19       P(14 | 24) = 0.33 *
P( 2 | 6) = 0.13       P(17 | 24) = 0.14
P(16 | 6) = 0.02       P(25 | 24) = 0.04

P( 5 | 35) = 0.61 *    P(16 | 12) = 0.58 *
P(14 | 35) = 0.25      P( 2 | 12) = 0.32 *
P( 2 | 35) = 0.15      P(31 | 12) = 0.03

Figure 6: A portion of the learned P(c | p) substitution probabilities for Hindi decipherment. Correct mappings are marked with *.

This prevents the algorithm from locking onto bad solutions. This gives an improved solution edit distance of 93, as in Figure 5(b), which can be compared to the correct decipherment in 5(d). Figure 6 shows a portion of the learned P(c | p) substitution table, with * indicating correct mappings. 15 out of 59 test words are deciphered exactly correctly. Another 16 out of 59 are perfect except for the addition of one extra UTF8 character (always "4" or "25"). Ours are the first results we know of with unsupervised techniques.

We also experimented with using a word-based source model in place of the character n-gram model. We built a word-unigram P(p) model out of only the top 5000 UTF8 words in our source corpus; it assigns probability zero to any word not in this list. This is a harsh model, considering that 16 out of 59 words in our UTF8-annotated test corpus do not even occur in the list, and are thus unreachable. On the plus side, EM considers only decipherments consisting of sequences of real Hindi words, and the Viterbi decoder only generates genuine Hindi words. The resulting decipherment edit distance is an encouraging 92, with the result shown in Figure 5(c). This model correctly deciphers 25 out of 59 words, with only some overlap to the previous 15 correct out of 59; one or the other of the models is able to perfectly decipher 31 out of 59 words already, making a combination promising.

Our machine is also able to learn in a semi-supervised manner by aligning a cipher corpus with a manually-done translation into UTF8. EM searches for the parameter settings that maximize P(c | p), and a Viterbi alignment is a by-product. For the intuition, see Figure 5(a and d), in which plaintext character "6" occurs twice and may be guessed to correspond with ciphertext byte "13". EM does this perfectly, except for some regions where re-ordering indeed happens. We are able to move back to our chunk-based model in semi-supervised mode, which avoids the re-ordering problem, and we obtain near-perfect decipherment tables when we ask a human to re-type a few hundred words of mystery-encoded text in a UTF8 editor.

5 Phonetic Decipherment

This section expands previous work on phonetic decipherment (Knight and Yamada, 1999). Archaeologists are often faced with an unknown writing system that is believed to represent a known spoken language. That is, the written characters encode phonetic sequences (sometimes individual phonemes, and sometimes whole words), and the relationship between text and sound is to be discovered, followed by the meaning. Viewing text as a code for speech was radical some years ago. It is now the standard view of writing systems, and many even view written Chinese as a straightforward syllabary, albeit one that is much larger and more complex than, say, Japanese kana. Both Linear B and Mayan writing were deciphered by viewing the observed text as a code/cipher for an approximately-known spoken language (Chadwick, 1958; Coe, 1993).

We follow (Knight and Yamada, 1999) in using Spanish as an example. The ciphertext is a 6980-character passage from Don Quixote, as in Figure 7(a). The plaintext is a very large out-of-domain Spanish phoneme sequence from which we compute only phoneme n-gram probabilities. We try deciphering without detailed knowledge of spoken Spanish words and grammar. The goal is for the decipherment to be understandable by modern Spanish speakers.

First, it is necessary to settle on the basic inventory of sounds and characters. Characters are easy; we simply tabulate the distinct ones observed in ciphertext. For sounds, we use a Spanish-relevant subset of the International Phonetic Alphabet (IPA), which seeks to capture all sounds in all languages; the implementation is SAMPA (Speech Assessment Methods Phonetic Alphabet). Here we show the sound and character inventories:

Sounds: B, D, G, J (ny as in canyon), L (y as in yarn), T (th as in thin), a, b, d, e, f, g, i, k, l, m, n, o, p, r, rr (trilled), s, t, tS (ch as in chin), u, x (h as in hat)
(a) primera parte del ingenioso hidalgo don quijote de la mancha
(b) primera parte des intenioso liDasto don fuiLote de la manTia
(c) primera parte del inGenioso biDalGo don fuiLote de la manTia
(d) primera parte del inxenioso iDalGo don kixote de la manSa *
Figure 7: Phonetic decipherment. (a) is written Spanish ciphertext, (b) is an initial decipherment, (c) is an improved
decipherment, and (d) is the correct phonetic transcription.
Characters: ñ, á, é, í, ó, ú, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z

The correct decipherment (Figure 7(d)) is a sequence of 6759 phonemes (here in SAMPA IPA).

We use a P(c | p) model that substitutes a single letter for each phoneme throughout the sequence. This considerably violates the rules of written Spanish (e.g., the K sound is often written with two letters q u, and the two K S sounds are often written x), so we do not expect a perfect decipherment. We do not enforce exclusivity; for example, the S sound may be written as c or s.

An unsmoothed phonetic bigram model gives an edit distance (error) of 805, as in Figure 7(b). Here we study smoothing techniques. A fixed-lambda interpolation smoothing yields 684 errors, while giving each phoneme its own trainable lambda yields a further reduction to 621. The corresponding edit distances for a trigram source model are 595, 703, and 492, the latter shown in Figure 7(c), an error of 7%. (This result is equivalent to Knight and Yamada [1999]'s 4% error, which did not count extra incorrect phonemes produced by decipherment, such as pronunciations of silent letters.) Quality smoothing yields the best results. While even the best decipherment is flawed, it is perfectly understandable when synthesized, and it is very good with respect to the structure of the channel model.

6 Universal Phonetic Decipherment

What if the language behind the script is unknown? The next two sections address this question in two different ways.

One idea is to look for universal constraints on phoneme sequences. If we somehow know that P(K AE N UW L IY) is high, while P(R T M K T K) is low, then we may be able to exploit such knowledge in deciphering an alphabetic writing system. In fact, many universal constraints have been proposed by linguists. Two major camps include syllable theorists (who say that words are composed of syllables, and syllables have internal regular structure (Blevins, 1995)) and anti-syllable theorists (who say that words are composed of phonemes that often constrain each other even across putative syllable boundaries (Steriade, 1998)).

We use the same Don Quixote ciphertext as in the previous section. While the ultimate goal is to label each letter with a phoneme, we first attack a more tractable problem, that of labeling each letter as C (consonant) or V (vowel). Once we know which letters stand for consonant sounds, we can break them down further.

Our first approach is knowledge-free. We put together a fully-connected, uniform trigram source model P(p) over the tokens C, V, and SPACE. Our channel model P(c | p) is also fully-connected and uniform. We allow source as well as channel probabilities to float during training. This almost works, as shown in Figure 8(b). It correctly clusters letters into vowels and consonants, but assigns exactly the wrong labels! A complex cluster analysis (Finch and Chater, 1991) yields similar results.

Our second approach uses syllable theory. Our source model generates each source word in three phases. First, we probabilistically select the number of syllables to generate. Second, we probabilistically fill each slot with a syllable type. Every human language has a clear inventory of allowed syllable types, and many languages share the same inventory. Some exemplars are (Blevins, 1995):

[Table: attested syllable-type inventories over the columns V, CV, CVC, VC, CCV, CCVC, CVCC, VCC, CCVCC for Hua, Cayuvava, Cairene, Mazateco, Mokilese, Sedang, Klamath, Spanish, Finnish, Totonac, and English; English allows all nine types.]

For our purposes, we allow generation of V, VC, VCC, CV, CVC, CCV, CVCC, CCVC, or CCVCC. Elements of the syllable type sequence are chosen independently of each other, except that we disallow vowel-initial syllables following consonant-final syllables, following the phonetic universal tendency to "maximize the onset" (the initial consonant cluster of a syllable). Third, we spell out the chosen syllable types, so that the whole source model yields sequences over the tokens C, V, and SPACE, as before. This spelling-out is deterministic, except that we may turn a V into either one or two Vs, to account for diphthongs. The channel model again maps {C, V} onto {a, b, c, ...}, and we again run EM to learn both source and channel probabilities.

Figure 8(c) shows that this almost works. To make it work, as in 8(d), we force the number of syllables per word in the model to be fixed and uniform, rather than learned. This prevents the system from making analyses that are too short. We also execute several EM runs with randomly initialized P(c | p), and choose the run with the highest resulting P(c).
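The restart strategy just mentioned recurs throughout the paper and is simple to sketch. This assumes the em_substitution sketch from earlier; corpus_logprob, returning log P(c) under the trained model, is a hypothetical helper, not part of any released code:

```python
def best_of_restarts(cipher, b, n_restarts=10):
    """Run EM from several random channel initializations and keep
    the run with the highest resulting P(c)."""
    best_logp, best_s = float("-inf"), None
    for seed in range(n_restarts):
        s = em_substitution(cipher, b, seed=seed)
        logp = corpus_logprob(cipher, b, s)   # hypothetical scorer for log P(c)
        if logp > best_logp:
            best_logp, best_s = logp, s
    return best_s
```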
(a) primera parte del ingenioso hidalgo don quijote de la mancha
(b) VVCVCVC VCVVC VCV CVVCVVCVC VCVCVVC VCV VCVVCVC VC VC VCVVVC
(c) CCV.CV.CV CVC.CV CVC VC.CVC.CV.CV CV.CVC.CV CVC CVC.CV.CV CV CV CVC.CCV
(d) CCV.CV.CV CVC.CV CVC VC.CV.CV.V.CV CV.CVC.CV CVC CV.V.CV.CV CV CV CVC.CCV
(e) NSV.NV.NV NVS.NV NVS VS.NV.SV.V.NV NV.NVS.NV NVS NV.V.NV.NV NV NV NVS.NSV
Figure 8: Universal phonetic decipherment. The ciphertext (a) is the same as in the previous figure. (b) is an unsupervised consonant-vowel decipherment, (c) is a decipherment informed by syllable structure, (d) is an improved decipherment, and (e) is a decipherment that also attempts to distinguish sonorous (S) and non-sonorous (N) consonants.
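The syllable-based source model behind Figure 8(c-d) can be sketched as a generative sampler. The allowed-type list follows the text; the uniform syllable count, the choice probabilities, and the diphthong rate below are illustrative assumptions, not learned values:

```python
import random

TYPES = ["V", "VC", "VCC", "CV", "CVC", "CCV", "CVCC", "CCVC", "CCVCC"]

def generate_word(max_syllables=4):
    n = random.randint(1, max_syllables)       # fixed-and-uniform count, per text
    sylls = []
    for _ in range(n):
        choices = TYPES
        if sylls and sylls[-1].endswith("C"):  # "maximize the onset": no
            choices = [t for t in TYPES        # vowel-initial syllable after
                       if not t.startswith("V")]  # a consonant-final one
        sylls.append(random.choice(choices))
    out = []                                   # deterministic spell-out, except
    for t in sylls:                            # a V may double for diphthongs
        for ch in t:
            out.append(ch)
            if ch == "V" and random.random() < 0.1:   # illustrative rate
                out.append("V")
    return ".".join(sylls), "".join(out)
```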
We see that the Spanish letters are accurately divided into consonants and vowels, and it is also straightforward to ask about the learned syllable generation probabilities; they are CV (0.50), CVC (0.20), V (0.16), VC (0.11), CCV (0.02), CCVC (0.0002).

As a sanity check, we manually remove all P(c | p) parameters that match C with Spanish vowel-letters (a, e, i, o, u, y, and accented versions) and V with Spanish consonant-letters (b, c, d, etc.), then re-run the same EM learning. We obtain the same P(c).

Exactly the same method works for Latin. Interestingly, the fully-connected P(c | p) model leads to a higher P(c) than the "correctly" constrained channel. We find that in the former, the letter i is sometimes treated as a vowel and other times as a consonant. The word "omnium" is analyzed by EM as VC.CV.VC, while "iurium" is analyzed as CVC.CVC.

We went a step further to see if EM could identify which letters encode sonorous versus non-sonorous consonants. Sonorous consonants are taken to be perceptually louder, and include n, m, l, and r. Additionally, vowels are more sonorous than consonants. A universal tendency (the sonority hierarchy) is that syllables have a sonority peak in the middle, which falls off to the left and right. This captures why the syllable G R A R G sounds more typical than R G A G R. There are exceptions, but the tendency is strong.

We modify our source model to generate S (sonorous consonant), N (non-sonorous consonant), V, and SPACE. We do this by changing the spell-out to probabilistically transform CCVC, for example, into either N S V S or N S V N, both of which respect the sonority hierarchy. The result is imperfect, with the EM hijacking the extra symbols. However, if we first run our C, V, SPACE model and feed the learned model to the S, N, V, SPACE model, then it works fairly well, as shown in Figure 8(e). Learned vowels include (in order of generation probability): e, a, o, u, i, y. Learned sonorous consonants include: n, s, r, l, m. Learned non-sonorous consonants include: d, c, t, l, b, m, p, q. The model bootstrapping is good for dealing with too many parameters; we see a similar approach in Brown et al.'s (1993) march from Model 1 to Model 5.

There are many other constraints to explore. For example, physiological constraints make some phonetic combinations more unlikely. AE N T and AE M P work because the second sound leaves the mouth well-prepared to make the third sound, while AE N P does not. These and other constraints complement the model by also working across syllable boundaries. There are also constraints on phoneme inventory (no voiced consonant like B without its unvoiced partner like P) and syllable inventory (no CCV without CV).

7 Brute-Force Phonetic Decipherment

Another approach to universal phonetic decipherment is to build phoneme n-gram databases for all human languages, then fully decipher with respect to each in turn. At the end, we need an automatic procedure for evaluating which source language has the best fit.

There do not seem to be sizeable phoneme-sequence corpora for many languages. Therefore, we used source character models as a stand-in, decoding as in Section 3. We built 80 different source models from sequences we downloaded from the UN Universal Declaration of Human Rights website (www.un.org/Overview/right.html).

Suppose our ciphertext starts "cevzren cnegr qry..." as in Figure 9(a). We decipher it against all 80 source language models, and the results are shown in Figure 9(b-f), ordered by post-training P(c). The system believes 9(a) is enciphered Spanish, but if not, then Galician, Portuguese, or Kurdish. Spanish is actually the correct answer, as the ciphertext is again Don Quixote (put through a simple letter substitution to show the problem from the computer's point of view). Similarly, EM detects that "fpn owoktvcv hu ihgzsnwfv rqcffnw cw..." is actually English, and deciphers it as "the analysis of wocuments pritten in..."

Many writing systems do not write vowel sounds. We can also do a brute-force decipherment of vowel-less writing by extending our channel model: first, we deterministically remove vowel sounds (or letters, in the above case), then we probabilistically substitute letters according to P(c | p). For the ciphertext "ceze ceg qy...", EM still proposes Spanish as the best source language, with decipherment "prmr prt dl..."
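The vowel-less variant prepends a deterministic deletion step to the channel. A toy sketch of that generative direction, with an illustrative Spanish vowel set; a fixed substitution table stands in for the probabilistic channel to keep the sketch short:

```python
def vowelless_encipher(plaintext, subst, vowels=set("aeiouáéíóú")):
    """Deterministically drop vowel letters, then substitute the rest.
    subst: dict letter -> cipher letter (the paper's second step is
    probabilistic; a fixed table is an assumption here)."""
    return "".join(subst.get(ch, ch) for ch in plaintext if ch not in vowels)
```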
(a) cevzren cnegr qry vatravbfb uvqnytb qba dhvwbgr qr yn znapun

     P(c) perplexity   proposed source   edit-dist   best P(p | c) decipherment
(b)  166.28            spanish           434         primera parte del ingenioso hidalgo don quijote de la mancha
(c)  168.75            galician          741         primera palte der ingenioso cidalgo don quixote de da mancca
(d)  169.07            portug.           1487        privera porte dal ingenioso didalgo dom quivote de ho concda
(e)  169.33            kurdish           4041        xwelawe berga mas estaneini hemestu min jieziga ma se lerdhe
...
(f)  179.19            english           4116        wizaris asive bec uitedundl pubsctl bly whualve be ks asequs

Figure 9: Brute-force phonetic decipherment. (a) is ciphertext in an unknown source language, while (b-f) show the best decipherments obtained for some of the 80 candidate source languages, automatically sorted by P(c).
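The ranking in Figure 9 amounts to a loop over candidate source models: train the channel by EM against each, then sort languages by post-training fit to the ciphertext. A hedged sketch reusing best_of_restarts from earlier; corpus_logprob is again a hypothetical log P(c) scorer:

```python
import math

def rank_source_languages(cipher, source_models, n_restarts=3):
    """source_models: dict lang -> (P, P) bigram table. Returns languages
    sorted by post-training per-symbol perplexity (lower = better fit)."""
    ranked = []
    for lang, b in source_models.items():
        s = best_of_restarts(cipher, b, n_restarts)   # EM-trained channel
        logp = corpus_logprob(cipher, b, s)           # hypothetical helper
        ppl = math.exp(-logp / len(cipher))           # per-symbol perplexity
        ranked.append((ppl, lang))
    return sorted(ranked)
```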
8 Word-Based Decoding

Letter-based substitution/transposition schemes are technically called ciphers, while systems that make whole-word substitutions are called codes. As an example code, one might write "I will bring the parrot to Canada" instead of "I will bring the money to John"; or, one might encode every word in a message. Machine translation has code-like characteristics, and indeed, the initial models of (Brown et al., 1993) took a word-substitution/transposition approach, trained on a parallel text.

Because parallel text is scarce, it would be very good to extend unsupervised letter-substitution techniques to word-substitution in MT. Success to date has been limited, however. Here we execute a small-scale example, but completely from scratch.

In this experiment, we know the Arabic cipher names of seven countries: m!lyzy!, !lmksyk, knd!, bryT!ny!, frns!, !str!ly!, and !ndwnysy!. We also know a set of English equivalents, here in no particular order: Mexico, Canada, Malaysia, Britain, Australia, France, and Indonesia. Using non-parallel corpora, can we figure out which word is a translation of which? We use neither spelling information nor exclusivity, since these are not exploitable in the general MT problem.

To create a ciphertext, we add phrases X Y and Y X to the ciphertext whenever X and Y co-occur in the same sentence in the Arabic corpus. Sorting by frequency, this ciphertext looks like:

3385 frns! bryT!ny!
3385 bryT!ny! frns!
450 knd! bryT!ny!
450 bryT!ny! knd!
410 knd! frns!
410 frns! knd!
386 knd! !str!ly!
386 !str!ly! knd!
331 frns! !str!ly!
331 !str!ly! frns!
etc.

We create an English training corpus using the same method on English text, from which we build a bigram P(p) model:

511 France/French Britain/British
511 Britain/British France/French
362 Canada/Canadian Britain/British
362 Britain/British Canada/Canadian
182 France/French Canada/Canadian
182 Canada/Canadian France/French
140 Britain/British Australia/Australian
140 Australia/Australian Britain/British
133 Canada/Canadian Australia/Australian
133 Australia/Australian Canada/Canadian
etc.

Each corpus induces a kind of world map, with high frequency indicating closeness. The task is to figure out how elements of the two world maps correspond.

We train a source English bigram model P(p) on the plaintext, then set up a uniform P(c | p) channel with 7x7=49 parameters. Our initial result is not good: EM locks up after two iterations, and every English word learns the same distribution. When we choose a random initialization for P(c | p), we get a better result, as 4 out of 7 English words correctly map to their Arabic equivalents. With 5 random restarts, we achieve 5 correct, and with 40 or more random restarts, all 7 assignments are always correct. (From among the restarts, we select the one with the best post-EM P(c), not the best accuracy on the task.) The learned P(c | p) dictionary is shown here (correct mappings are marked with *):

P(!str!ly! | Australia/Australian) = 0.93 *
P(!ndwnysy! | Australia/Australian) = 0.03
P(m!lyzy! | Australia/Australian) = 0.02
P(!lmksyk | Australia/Australian) = 0.01

P(bryT!ny! | Britain/British) = 0.98 *
P(!ndwnysy! | Britain/British) = 0.01
P(!str!ly! | Britain/British) = 0.01

P(knd! | Canada/Canadian) = 0.57 *
P(frns! | Canada/Canadian) = 0.33
P(m!lyzy! | Canada/Canadian) = 0.06
P(!ndwnysy! | Canada/Canadian) = 0.04

P(frns! | France/French) = 1.00 *

P(!ndwnysy! | Indonesia/Indonesian) = 1.00 *

P(m!lyzy! | Malaysia/Malaysian) = 0.93 *
P(!lmksyk | Malaysia/Malaysian) = 0.07

P(!lmksyk | Mexico/Mexican) = 0.91 *
P(m!lyzy! | Mexico/Mexican) = 0.07
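The ciphertext construction used in this section is easy to sketch: for each sentence, every co-occurring pair of tracked names X and Y contributes the two-word phrases X Y and Y X. A minimal sketch under those assumptions (function and variable names are ours):

```python
from itertools import combinations

def build_pair_corpus(sentences, tracked):
    """sentences: iterable of token lists; tracked: set of names to watch.
    Returns the list of two-word phrases forming the 'ciphertext'."""
    pairs = []
    for sent in sentences:
        present = sorted(set(sent) & tracked)
        for x, y in combinations(present, 2):
            pairs.append((x, y))        # phrase "X Y"
            pairs.append((y, x))        # phrase "Y X"
    return pairs
```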
9 Conclusion

We have discussed several decipherment problems and shown that they can all be attacked by the same basic method. Our primary contribution is a collection of first empirical results on a number of new problems. We also studied the following techniques in action:

• executing random restarts
• cubing learned channel probabilities before decoding (sketched below)
• using uniform probabilities for parameters of less interest
• checking learned P(c) against the P(c) of a "correct" model
• using a well-smoothed source model P(p)
• bootstrapping larger-parameter models with smaller ones
• appealing to linguistic universals to constrain models

Results on all of our applications were substantially improved using these techniques, and a secondary contribution is to show that they lead to robust improvements across a range of decipherment problems.

All of the experiments in this paper were carried out with the Carmel finite-state toolkit (Graehl, 1997), which supports forward-backward EM with epsilon transitions and loops, parameter tying, and random restarts. It also composes two or more transducers while keeping their transitions separate (and separately trainable) in the composed model. Work described in this paper strongly influenced the toolkit's design.
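Of the listed tricks, cubing the learned channel probabilities is the simplest to show concretely. A minimal sketch, reusing the numpy channel table s from the earlier example; the exponent follows the list item above, while renormalizing per row is our assumption:

```python
import numpy as np

def cube_channel(s):
    """Sharpen a learned channel s[p, c] before Viterbi decoding by cubing
    each probability; per-row renormalization is assumed here."""
    s3 = s ** 3
    return s3 / s3.sum(axis=1, keepdims=True)
```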
Acknowledgements
We would like to thank Kie Zuraw and Cynthia
Hagstrom for conversations about phonetic universals,
and Jonathan Graehl for work on Carmel. This work
was funded in part by NSF Grant 759635.
References

Blevins, J. 1995. The syllable in phonological theory. In J. Goldsmith, editor, Handbook of Phonological Theory. Basil Blackwell, London.

Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).

Chadwick, J. 1958. The Decipherment of Linear B. Cambridge University Press, Cambridge.

Coe, M. 1993. Breaking the Maya Code. Thames and Hudson, New York.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B).

Finch, S. and N. Chater. 1991. A hybrid approach to the automatic learning of linguistic categories. Artificial Intelligence and Simulated Behaviour Quarterly, 78.

Graehl, J. 1997. Carmel finite-state toolkit. http://www.isi.edu/licensed-sw/carmel/.

Knight, K. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4).

Knight, K. and K. Yamada. 1999. A computational approach to deciphering unknown scripts. In ACL Workshop on Unsupervised Learning in Natural Language Processing.

Peleg, S. and A. Rosenfeld. 1979. Breaking substitution ciphers using a relaxation algorithm. Communications of the ACM, 22(11).

Smith, L. 1943. Cryptography. Dover Publications, NY.

Steriade, D. 1998. Alternatives to syllable-based accounts of consonantal phonotactics. In Proc. of the Conference on Linguistics and Phonetics (LP'98).

Yarowsky, D. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proc. ACL.

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. ACL.