This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
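The feature-vector step described above can be sketched as follows. The abstract does not give the exact feature formulas, so these definitions (character-length ratio, punctuation overlap, shared-prefix cognate score) are illustrative assumptions, not the paper's own:

```python
import string

def alignment_features(src, tgt, cognate_prefix=4):
    """Toy feature vector for a candidate sentence pair.

    Illustrative definitions (not the paper's exact ones):
    - length ratio of the shorter to the longer sentence (in characters)
    - punctuation score: overlap of punctuation marks between the two sides
    - cognate score: fraction of source tokens sharing a 4-character prefix
      with some target token (catches names, numbers, loanwords)
    """
    len_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    src_punct = {c for c in src if c in string.punctuation}
    tgt_punct = {c for c in tgt if c in string.punctuation}
    punct_score = len(src_punct & tgt_punct) / max(len(src_punct | tgt_punct), 1)
    src_toks, tgt_toks = src.split(), tgt.split()

    def is_cognate(a, b):
        return (min(len(a), len(b)) >= cognate_prefix
                and a[:cognate_prefix].lower() == b[:cognate_prefix].lower())

    cognates = sum(1 for a in src_toks if any(is_cognate(a, b) for b in tgt_toks))
    cognate_score = cognates / max(len(src_toks), 1)
    return [len_ratio, punct_score, cognate_score]
```

Vectors like these, paired with the manual alignment labels, are what the MR and GA classifiers would be trained on.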
Cross-lingual similarity discrimination with translation characteristics (IJAIA)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model, trained on a bilingual corpus, that divides a set of sentences in the target language into two classes according to their similarity to a given sentence in the source language. Positive outputs of the discriminative model are then ranked by similarity probability, and the translation candidates of the given sentence are selected from the top-n positive results. One problem in model building is the extremely imbalanced training data: positive samples are the translations of the target sentences, while negative samples, the non-translations, are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on an open dataset of 1,500 English-Chinese sentence pairs are evaluated with three metrics, with satisfactory performance well above the baseline system.
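The selection step described above (keep positive classifier outputs, rank by similarity probability, take the top n) can be sketched as a small helper; the probabilities and the 0.5 threshold here are hypothetical stand-ins for the discriminative model's outputs:

```python
def top_n_candidates(sentences, probs, threshold=0.5, n=5):
    """Keep sentences the classifier labels positive (prob >= threshold),
    rank them by similarity probability, and return the top n as
    translation candidates of the source sentence."""
    positives = [(s, p) for s, p in zip(sentences, probs) if p >= threshold]
    positives.sort(key=lambda sp: sp[1], reverse=True)
    return [s for s, _ in positives[:n]]
```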
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA (IJNLC)
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
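The BIO chunk annotation mentioned above can be illustrated with a minimal sketch that turns a morpheme-segmented token into per-character Begin/Inside labels; the hyphen-separated input format is an assumption for illustration, not the paper's corpus format:

```python
def bio_labels(segmented):
    """Map a morpheme-segmented token, e.g. 'un-break-able', to
    per-character labels: 'B' opens each morpheme, 'I' continues it.
    (The 'O' tag of the full BIO scheme is unused word-internally
    in this simple variant.)"""
    labels = []
    for morpheme in segmented.split("-"):
        labels.append("B")
        labels.extend("I" * (len(morpheme) - 1))
    return labels
```

A boundary-detection model is then trained to predict exactly this label sequence from the unsegmented characters.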
Corpus-based part-of-speech disambiguation of Persian (IDES Editor)
In this paper we introduce a method for part-of-speech disambiguation of Persian texts, which uses word-class probabilities from a relatively small training corpus in order to automatically tag unrestricted Persian texts. The experiment has been carried out at two levels, unigram and bigram disambiguation. Comparing the results of the two levels, we show that using the immediate right context of a given word can increase the accuracy of the system to a high degree.
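The bigram-level idea, refining a word's class probabilities with its immediate right context, can be sketched as follows; the probability tables are toy stand-ins for counts estimated from the training corpus:

```python
def best_tag(word, right_tag, p_tag_given_word, p_trans):
    """Pick the tag t maximizing P(t | word) * P(right_tag | t):
    the unigram lexical probability refined by the tag of the
    immediate right context, as in the bigram level of the
    experiment. Both tables here are hypothetical toy data."""
    candidates = p_tag_given_word.get(word, {})
    return max(candidates,
               key=lambda t: candidates[t] * p_trans.get((t, right_tag), 0.0))
```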
International Journal of Engineering Research and Applications (IJERA) is an open-access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low-Power VLSI Design, etc.
ACL-WMT2013. A Description of Tunable Machine Translation Evaluation Systems i... (Lifeng (Aaron) Han)
Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria. Open tools: https://github.com/aaronlifenghan/aaron-project-lepor & https://github.com/aaronlifenghan/aaron-project-hlepor (ACM Digital Library, ACL Anthology)
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Source and target word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether this would be true for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages). Our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top-10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web... (Svetlin Nakov)
False friends are pairs of words in two languages
that are perceived as similar, but have different
meanings, e.g., Gift in German means poison in
English. In this paper, we present several unsupervised
algorithms for acquiring such pairs
from a sentence-aligned bi-text. First, we try different
ways of exploiting simple statistics about
monolingual word occurrences and cross-lingual
word co-occurrences in the bi-text. Second, using
methods from statistical machine translation, we
induce word alignments in an unsupervised way,
from which we estimate lexical translation probabilities,
which we use to measure cross-lingual
semantic similarity. Third, we experiment with
a semantic similarity measure that uses the Web
as a corpus to extract local contexts from text
snippets returned by a search engine, and a bilingual
glossary of known word translation pairs,
used as “bridges”. Finally, all measures are combined
and applied to the task of identifying likely
false friends. The evaluation for Russian and
Bulgarian shows a significant improvement over
previously-proposed algorithms.
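The first family of statistics above starts from perceived (orthographic) similarity between word pairs. A minimal sketch of one common measure, the longest-common-subsequence ratio, is shown below as an illustration rather than the paper's exact formula:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def orth_similarity(a, b):
    """Orthographic similarity in [0, 1]. Candidate false friends are
    pairs that score high here yet rarely co-occur in aligned sentence
    pairs of the bi-text (e.g. German 'Gift' vs. English 'gift')."""
    return lcs_len(a.lower(), b.lower()) / max(len(a), len(b))
```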
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether the same words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter out homonyms from the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION (IJNLC)
Word Sense Disambiguation (WSD) is an important area that impacts the performance of computational-linguistics applications such as machine translation, information retrieval, text summarization, question answering systems, etc. We present a brief history of WSD and discuss the supervised, unsupervised, and knowledge-based approaches to it. Though many WSD algorithms exist, we consider optimal and portable WSD algorithms the most appropriate, since they can be embedded easily in applications of computational linguistics. This paper also gives an overview of some WSD algorithms and their performance, and compares and assesses the need for word sense disambiguation.
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages and
especially the inherent ambiguity of language make MT a very difficult problem. Traditional
approaches to MT have relied on humans supplying linguistic knowledge in the form of rules
to transform text in one language to another. Given the vastness of language, this is a highly
knowledge intensive task. Statistical MT is a radically different approach that automatically
acquires knowledge from large amounts of training data. This knowledge, which is typically
in the form of probabilities of various language features, is used to guide the translation
process. This report provides an overview of MT techniques, and looks in detail at the basic
statistical model.
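The basic statistical model mentioned above is commonly formulated as a noisy channel: choose the target sentence e maximizing P(e) · P(f | e), where P(e) comes from a language model and P(f | e) from a translation model. A toy sketch, with hypothetical probability tables standing in for models learned from training data:

```python
def decode(source, candidates, lm, tm):
    """Noisy-channel decoding sketch: among candidate translations,
    pick e maximizing P(e) * P(f | e). Here lm holds language-model
    probabilities P(e) and tm holds translation probabilities P(f | e);
    both tables are toy stand-ins for models estimated from data."""
    return max(candidates,
               key=lambda e: lm.get(e, 0.0) * tm.get((source, e), 0.0))
```

The language model prefers fluent target sentences; the translation model prefers faithful ones; their product balances the two.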
Lecture 2: From Semantics To Semantic-Oriented Applications (Marina Santini)
From the "Natural Language Processing" LinkedIn group:
John Kontos, Professor of Artificial Intelligence
"I wonder whether translating into formal logic is nothing more than transliteration, which simply isolates the part of the text that can be reasoned upon using the simple inference mechanism of formal logic. The real problem, I think, lies with the part of the text that CANNOT be translated on the one hand, and the part that changes its meaning due to civilization advances on the other. My own proposal is to leave NL text alone and try building inference mechanisms for the UNTRANSLATED text depending on the task requirements.
All the best
John"
Dependency Analysis of Abstract Universal Structures in Korean and English (Jinho Choi)
This thesis gives two contributions in the form of lexical resources to (1) dependency parsing in Korean and (2) semantic parsing in English. First, we describe our methodology for building three dependency treebanks in Korean, derived from existing treebanks and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD). The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining the linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated. To the best of our knowledge, this is the first time that the three Korean treebanks have been converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an ongoing project for augmenting the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments in the Proposition Bank (PropBank), and then does transfer learning of the trained dependency parser for the AMR parsing task. A preliminary step is to prepare dependency data by automatically replacing dependencies that define predicate-argument structure with their corresponding PropBank argument labels during constituent-to-dependency conversion.
To the best of our knowledge, this is the first time that the PropBank labels are directly inserted into dependency structure, producing a new dependency corpus with rich syntactic information as well as semantic role information provided by PropBank that fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, broadly, semantic parsing.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA... (kevig)
This paper refers to the syntactic analysis of phrases in Romanian, as an important process of natural language processing. We suggest a real-time solution based on the idea of using some words or groups of words that indicate the grammatical category, and some specific endings of some parts of a sentence. Our idea is based on characteristics of the Romanian language, where some prepositions, adverbs, or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can perform a dialogue in natural language with assertive and interrogative sentences about a "story" (a set of sentences describing some events from real life).
English-Kazakh parallel corpus for statistical machine translation (IJNLC)
This paper presents problems and solutions in developing an English-Kazakh parallel corpus at the School of Mechanics and Mathematics of the al-Farabi Kazakh National University. The research project included constructing a 1,000,000-word English-Kazakh parallel corpus of legal texts, developing an English-Kazakh translation memory of legal texts from the corpus, and building a statistical machine translation system. The project aims at collecting more than ten million words. The paper further elaborates on the procedures followed to construct the corpus and develop the other products of the research project. Methods used for collecting data and the results are discussed, and errors during the process of collecting data, and how to handle them, are described.
NEURAL SYMBOLIC ARABIC PARAPHRASING WITH AUTOMATIC EVALUATION (cscpconf)
We present symbolic and neural approaches for Arabic paraphrasing that yield high paraphrasing accuracy. This is the first work on sentence-level paraphrase generation for Arabic and the first to use neural models to generate paraphrased sentences for Arabic. We present and compare several methods for paraphrasing and for obtaining monolingual parallel data. We share a large-coverage phrase dictionary for Arabic and contribute a large parallel monolingual corpus that can be used in developing new seq-to-seq models for paraphrasing; this is the first large monolingual corpus of its kind for Arabic. We also present first results in Arabic paraphrasing using seq-to-seq neural methods. Additionally, we propose a novel automatic evaluation metric for paraphrasing that correlates highly with human judgement.
Phonetic Recognition In Words For Persian Text To Speech Systems (paperpublications3)
Abstract: Interest in text-to-speech synthesis has increased worldwide. Text-to-speech systems have been developed for many popular languages, such as English, Spanish, and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research in Persian is still in its infancy. The Persian language possesses many difficulties and exceptions that increase the complexity of text-to-speech systems: for example, short vowels are absent in written text, and homograph words exist. In this paper we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy (PbA) in words, semantic relations, and grammatical rules for finding the proper phonetics. Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title: Phonetic Recognition In Words For Persian Text To Speech Systems
Author: Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
This paper discusses a new metric that has been applied to verify translation quality between sentence pairs in parallel Arabic-English corpora. The metric combines two techniques, one based on sentence length and the other based on compression code length. Experiments on sample parallel Arabic-English test corpora indicate that combining these two techniques improves the accuracy of identifying satisfactory and unsatisfactory sentence pairs compared to sentence length or compression code length alone. The new method proposed in this research is effective at filtering noise and reducing mis-translations, resulting in greatly improved quality.
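The compression-code-length idea can be approximated with a general-purpose compressor such as zlib; the sketch below illustrates the principle (shared structure across a true translation pair, such as numbers and names, helps the pair compress jointly) and is not the paper's exact metric:

```python
import zlib

def compression_score(src, tgt):
    """Toy compression-based association score: compressed size of the
    two sentences separately, relative to the pair compressed jointly.
    Higher values mean the pair shares more compressible structure,
    suggesting a more plausible sentence pairing. Illustrative only."""
    code_len = lambda s: len(zlib.compress(s.encode("utf-8")))
    joint = code_len(src + " " + tgt)
    return (code_len(src) + code_len(tgt)) / joint

def length_score(src, tgt):
    """Sentence-length ratio component used alongside compression."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))
```

A combined filter could flag a pair as unsatisfactory when both scores fall below thresholds tuned on labeled data.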
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS (IJNLC)
Parsing of Arabic phrases is a crucial requirement for many applications, such as question answering and machine translation. The calculus of pregroups was introduced by Lambek as an algebraic computational machinery for the grammatical analysis of natural languages. Pregroup grammars have been used to analyse sentence structure in many European languages, such as English, and non-European languages, such as Japanese. For the Arabic language, Lambek employed the notions of pregroups to analyse some grammatical structures, such as verb conjugation, tense modifiers, and equational sentences. This work attempts to develop the initial phase of an efficient automatic pregroup grammar parser, using a linear approach to analyse the verbal phrases of Modern Standard Arabic (MSA). The proposed system starts by building an Arabic lexicon containing all possible categories of Arabic verbs, then analyses the input Arabic verbal phrase to check whether it is well-formed, using a linear parsing algorithm based on pregroup grammar rules.
Two Level Disambiguation Model for Query TranslationIJECEIAES
Selection of the most suitable translation among all translation candidates returned by bilingual dictionary has always been quiet challenging task for any cross language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation for user query. Algorithms using such statistics have certain shortcomings, which are focused in this paper. We propose a novel method for ambiguity resolution, named „two level disambiguation model‟. At first level disambiguation, the model properly weighs the importance of translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate of being selected as the final translation of a query term. This removes the problem of taking binary decision for translation candidates. At second level disambiguation, the model targets the user query as a single concept and deduces the translation of all query terms simultaneously, taking into account the weights of translation alternatives also. This is contrary to previous researches which select translation for each word in source language query independently. The experimental result with English-Hindi cross language information retrieval shows that the proposed two level disambiguation model achieved 79.53% and 83.50% of monolingual translation and 21.11% and 17.36% improvement compared to greedy disambiguation strategies in terms of MAP for short and long queries respectively.
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...Lifeng (Aaron) Han
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Rule-Based Phonetic Matching Approach for Hindi and MarathiCSEIJJournal
Phonetic matching plays an important role in multilingual information retrieval, where data is
manipulated in multiple languages and user need information in their native language which may be
different from the language where data has been maintained. In such an environment, we need a system
which matches the strings phonetically in any case. Once strings match, we can retrieve the information
irrespective of languages. In this paper, we proposed an approach which matches the strings either in
Hindi or Marathi or in cross-language in order to retrieve information. We compared our proposed
method with soundex and Q-gram methods. We found better results as compared to these approaches.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
ANNOTATED GUIDELINES AND BUILDING REFERENCE CORPUS FOR MYANMAR-ENGLISH WORD A...ijnlc
Reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For Myanmar-English language pairs, there is no reference corpus to evaluate the word alignment tasks. Therefore, we created the guidelines for Myanmar-English word alignment annotation between two languages over contrastive learning and built the Myanmar-English reference corpus consisting of verified alignments from Myanmar ALT of the Asian Language Treebank (ALT). This reference corpus contains confident labels sure (S) and possible (P) for word alignments which are used to test for the purpose of evaluation of the word alignments tasks. We discuss the most linking ambiguities to define consistent and systematic instructions to align manual words. We evaluated the results of annotators agreement using our reference corpus in terms of alignment error rate (AER) in word alignment tasks and discuss the words relationships in terms of BLEU scores.
Indexing of Arabic documents automatically based on lexical analysis kevig
The continuous information explosion through the Internet and all information sources makes it
necessary to perform all information processing activities automatically in quick and reliable
manners. In this paper, we proposed and implemented a method to automatically create and Index
for books written in Arabic language. The process depends largely on text summarization and
abstraction processes to collect main topics and statements in the book. The process is developed
in terms of accuracy and performance and results showed that this process can effectively replace
the effort of manually indexing books and document, a process that can be very useful in all
information processing and retrieval applications.
Indexing of Arabic documents automatically based on lexical analysiskevig
The continuous information explosion through the Internet and all information sources makes it necessary to perform all information processing activities automatically in quick and reliable manners. In this paper, we proposed and implemented a method to automatically create and Index for books written in Arabic language. The process depends largely on text summarization and abstraction processes to collect main topics and statements in the book. The process is developed in terms of accuracy and performance and results showed that this process can effectively replace the effort of manually indexing books and document, a process that can be very useful in all information processing and retrieval applications.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Essentials of Automations: Optimizing FME Workflows with Parameters
Ceis 3
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Sentence Alignment using MR and GA
Mohamed Abdel Fattah
FIE, Helwan University
Cairo, Egypt
mohafi2003@helwan.edu.eg
Abstract
In this paper, two new approaches to align English-Arabic sentences in bilingual parallel corpora based on mathematical
regression (MR) and genetic algorithm (GA) classifiers are presented. A feature vector is extracted from the text pair
under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set
of manually prepared training data was assigned to train the mathematical regression and genetic algorithm models.
Another set of data was used for testing. The results of MR and GA outperform those of the length-based approach.
Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may
contain more, fewer, or different features than the ones used in the current research, such as a lexical matching
feature or Hanzi characters in Japanese-Chinese texts.
Key words: Sentence Alignment, English/ Arabic Parallel Corpus, Parallel Corpora, Machine Translation, Mathematical
Regression, Genetic Algorithm.
1. Introduction
Recent years have seen a great interest in bilingual corpora that are composed of a source text along with a translation
of that text in another language. Nowadays, bilingual corpora have become an essential resource for work in multilingual
natural language processing systems (Fattah et al., 2007; Moore, 2002; Gey et al., 2002; Davis and Ren, 1998), including
data-driven machine translation (Dolan et al., 2002), bilingual lexicography, automatic translation verification, automatic
acquisition of knowledge about translation (Simard, 1999), and cross-language information retrieval (Chen and Gey,
2001; Oard, 1997). It is required that the bilingual corpora be aligned. Given a text and its translation, an alignment is a
segmentation of the two texts such that the nth segment of one text is the translation of the nth segment of the other (as a
special case, empty segments are allowed, corresponding either to translators’ omissions or additions) (Simard, 1999;
Christopher and Kar, 2004). With aligned sentences, further analysis such as phrase and word alignment analysis (Fattah
et al., 2006; Ker and Chang, 1997; Melamed, 1997), bilingual terminology and collocation extraction analysis can be
performed (Dejean et al., 2002; Thomas and Kevin, 2005).
In the last few years, much work has been reported in sentence alignment using different techniques. Length-based
approaches (length as a function of sentence characters (Gale and Church, 1993) or sentence words (Brown et al., 1991))
are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other
language, and that shorter sentences tend to be translated into shorter sentences. These approaches work quite well with a
clean input, such as the Canadian Hansards corpus, whereas they do not work well with noisy document pairs (Thomas
and Kevin, 2005). Cognate based approaches were also proposed and combined with the length-based approach to
improve the alignment accuracy (Simard et al., 1992; Melamed, 1999; Danielsson and Muhlenbock, 2000; Ribeiro et al.,
2001).
Sentence cognates such as digits, alphanumerical symbols, punctuation, and alphabetical words have been used.
However all cognate based approaches are tailored to close Western language pairs. For disparate language pairs, such as
Arabic and English, with a lack of a shared Roman alphabet, it is not possible to rely on the aforementioned cognates to
achieve high-precision sentence alignment of noisy parallel corpora (however, cognates may be efficient when used with
some other approaches). Some other sentence alignment approaches are text based approaches such as the hybrid
dictionary approach (Collier et al., 1998), part-of-speech alignment (Chen and Chen, 1994), and the lexical method
(Chen, 1993). While these methods require little or no prior knowledge of source and target languages and give good
results, they are relatively complex and require significant amounts of parallel text and language resources.
Instead of a one-to-one hard matching of punctuation marks in parallel texts, as used in the cognate approach of Simard
(Simard et al., 1992), Thomas (Thomas and Kevin, 2005) allowed no match and one-to-several matching of punctuation
matches. However, neither Simard nor Thomas took into account the text length between two successive cognates
(Simard’s case) or punctuation marks (Thomas’s case), which increased the system’s confusion and led to an increase in
execution time and a decrease in accuracy. We have avoided this drawback by taking the probability of text length
between successive punctuation marks into account during the punctuation matching process, as will be shown in the
following sections.
In this paper, non-traditional approaches for English-Arabic sentence alignment are presented. For sentence alignment,
we may have a 1-0 match, where one English sentence does not match any of the Arabic sentences, or a 0-1 match, where
one Arabic sentence does not match any of the English sentences. The other matches we focus on are 1-1, 1-2, 2-1, 2-2, 1-3 and
3-1.
There may be more categories in bi-texts, but they are rare. Therefore, only the previously mentioned categories are
considered. If the system finds any other categories they will automatically be misclassified. As illustrated above, we
have eight sentence alignment categories. As such, sentence alignment can be considered as a classification problem,
which may be solved using mathematical regression or genetic algorithm classifiers.
The paper is organized as follows. Section 2 introduces English–Arabic text features. Section 3 illustrates the new
approaches. Section 4 discusses English-Arabic corpus creation. Section 5 shows the experimental results. Finally,
Section 6 gives concluding remarks and discusses future work.
2. English–Arabic text features
As explained in (Fattah et al., 2007), the most important feature for text is the text length, since Gale & Church
achieved good results using this feature.
The second text feature is punctuation marks. We can classify punctuation matching into the following categories:
A. 1-1 matching type, where one English punctuation mark matches one Arabic punctuation mark.
B. 1-0 matching type, where one English punctuation mark does not match any of the Arabic punctuation marks.
C. 0-1 matching type, where one Arabic punctuation mark does not match any of the English punctuation marks.
The probability that a sequence of punctuation marks AP_i = Ap_1 Ap_2 ... Ap_i in an Arabic language text translates to a
sequence of punctuation marks EP_j = Ep_1 Ep_2 ... Ep_j in an English language text is P(AP_i, EP_j). Given a pair of
punctuation sequences corresponding to a pair of parallel sentences, the system searches for the punctuation alignment
that maximizes the probability over all possible alignments:

arg max_AL P(AL | AP_i, EP_j),    (1)

where AL is a punctuation alignment. Assuming that the probabilities of the individually aligned punctuation pairs are
independent, the following formula may be considered:

P(AP_i, EP_j) = ∏_AL P(Ap_k, Ep_k) · P(δ_k | match),    (2)

where P(Ap_k, Ep_k) is the probability of matching Ap_k with Ep_k, which may be calculated as follows:

P(Ap_k, Ep_k) = (number of occurrences of the punctuation pair (Ap_k, Ep_k)) / (total number of punctuation pairs in the manually aligned data).    (3)

P(δ_k | match) is the length-related probability distribution function, where δ_k is a function of the text length of the
source language (the text length between punctuation marks Ep_k and Ep_{k+1}) and the text length of the target language
(the text length between punctuation marks Ap_k and Ap_{k+1}). P(δ_k | match) is derived directly as in (Fattah et al., 2007).
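As an illustration, the frequency estimate of Eq. (3) can be sketched in Python. This is a minimal sketch, not the paper's implementation; the toy punctuation pairs are hypothetical stand-ins for the manually aligned training data.

```python
from collections import Counter

def punctuation_pair_probabilities(aligned_pairs):
    """Estimate P(Ap_k, Ep_k) as in Eq. (3): the count of each
    (Arabic mark, English mark) pair divided by the total number of
    punctuation pairs in the manually aligned data."""
    counts = Counter(aligned_pairs)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

# Hypothetical toy data standing in for the manually aligned corpus:
# the Arabic comma "،" aligned twice with ",", plus two other pairs.
pairs = [("،", ","), ("،", ","), (".", "."), ("؟", "?")]
probs = punctuation_pair_probabilities(pairs)
```

The resulting table is what the dynamic-programming search of Eq. (1) would consult for each candidate punctuation match.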
After specifying the punctuation alignment that maximizes the probability over all possible alignments given a pair of
punctuation sequences (using a dynamic programming framework as in (Gale and Church, 1993)), the system calculates
the punctuation compatibility factor for the text pair under consideration as follows:

γ = c / max(m, n)

Where γ = the punctuation compatibility factor,
c = the number of direct punctuation matches,
n = the number of Arabic punctuation marks,
m = the number of English punctuation marks.
The punctuation compatibility factor is considered the second text pair feature.
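The compatibility factor above is a one-line computation; a sketch (guarding the degenerate case of no punctuation at all, which the formula leaves undefined):

```python
def punctuation_compatibility(c, m, n):
    """Punctuation compatibility factor gamma = c / max(m, n), where c is
    the number of direct punctuation matches, m the number of English
    punctuation marks and n the number of Arabic punctuation marks.
    Returns 0.0 when both sentences contain no punctuation (an assumption
    of this sketch)."""
    denom = max(m, n)
    return c / denom if denom else 0.0

# Example: 2 direct matches, 3 English marks, 4 Arabic marks -> gamma = 0.5
```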
The third text pair feature is the cognate. For disparate language pairs, such as Arabic and English, that lack a shared
alphabet, it is not possible to rely only on cognates to achieve high-precision sentence alignment of noisy parallel
corpora.
However, many UN and scientific Arabic documents contain some English words and expressions. These words may
be used as cognates. We define the cognate factor (Cog) as the number of common items in the sentence pair. When a
sentence pair has no cognate words, the cognate factor is 0.
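The cognate factor can be sketched as a set intersection over tokens. Whitespace tokenisation and the example sentences are assumptions of this sketch; the paper does not specify how items are delimited.

```python
def cognate_factor(arabic_sentence, english_sentence):
    """Cognate factor Cog: the number of items (tokens such as digits,
    acronyms or embedded English words) common to both sentences."""
    shared = set(arabic_sentence.split()) & set(english_sentence.split())
    return len(shared)

# An Arabic sentence quoting "UNESCO" and "2007" shares two items with
# its English translation, so Cog = 2.
```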
3. The Proposed sentence alignment model
The classification framework of the proposed sentence alignment model has two modes of operation. The first is the
training mode, in which features are extracted from 7653 manually aligned English-Arabic sentence pairs and used to train
the Mathematical Regression (MR) and Genetic Algorithm (GA) models. The second is the testing mode, in which features
are extracted from the testing data and the data are aligned using the previously trained models. Alignment is done using a block of 3 sentences for
each language. After aligning a source language sentence and target language sentence, the next 3 sentences are then
looked at. We have used 18 input units and 8 output units for MR and GA. Each input unit represents one input feature
as in (Fattah et al., 2007). The input feature vector X is as follows:
X = [ L(S1a)/L(S1e),
      (L(S1a)+L(S2a))/L(S1e),
      L(S1a)/(L(S1e)+L(S2e)),
      (L(S1a)+L(S2a))/(L(S1e)+L(S2e)),
      (L(S1a)+L(S2a)+L(S3a))/L(S1e),
      L(S1a)/(L(S1e)+L(S2e)+L(S3e)),
      γ(S1a, S1e), γ(S1a S2a, S1e), γ(S1a, S1e S2e), γ(S1a S2a, S1e S2e),
      γ(S1a S2a S3a, S1e), γ(S1a, S1e S2e S3e),
      Cog(S1a, S1e), Cog(S1a S2a, S1e), Cog(S1a, S1e S2e), Cog(S1a S2a, S1e S2e),
      Cog(S1a S2a S3a, S1e), Cog(S1a, S1e S2e S3e) ]

Where:
L(X) is the length in characters of sentence X; γ(X, Y) is the punctuation compatibility factor between sentences
X and Y; Cog(X, Y) is the cognate factor between sentences X and Y.
The output is one of 8 categories, specified as follows. S1a↔0 means that the first Arabic sentence has no English
match, and S1e↔0 means that the first English sentence has no Arabic match. Similarly, the remaining outputs
are S1a↔S1e, S1a S2a↔S1e, S1a↔S1e S2e, S1a S2a↔S1e S2e,
S1a↔S1e S2e S3e and S1a S2a S3a↔S1e.
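The 18-dimensional vector above can be sketched as follows. The six sentence-group combinations mirror the six non-empty match categories; treating γ and Cog as functions of concatenated sentence text, and the zero-length guard, are assumptions of this sketch.

```python
def feature_vector(a, e, gamma, cog):
    """Build the 18-dimensional vector X for a block of three Arabic
    sentences `a` and three English sentences `e`. `gamma` and `cog`
    compute the punctuation compatibility and cognate factors for a
    pair of (possibly concatenated) sentence texts."""
    combos = [
        (a[0], e[0]),                      # 1-1
        (a[0] + a[1], e[0]),               # 2-1
        (a[0], e[0] + e[1]),               # 1-2
        (a[0] + a[1], e[0] + e[1]),        # 2-2
        (a[0] + a[1] + a[2], e[0]),        # 3-1
        (a[0], e[0] + e[1] + e[2]),        # 1-3
    ]
    lengths = [len(x) / len(y) if y else 0.0 for x, y in combos]
    gammas = [gamma(x, y) for x, y in combos]
    cogs = [cog(x, y) for x, y in combos]
    return lengths + gammas + cogs         # 6 + 6 + 6 = 18 features
```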
3.1. Mathematical Regression Model (MR)
Mathematical regression is a good model to estimate the text feature weights (Jann, 2005; Richard, 2006). In this
model a mathematical function relates output to input. The feature parameters of the 7653 manually aligned English-
Arabic sentence pairs are used as independent input variables and the corresponding dependent outputs are specified in
the training phase. A relation between inputs and outputs is then fitted to model the system. Testing data are
subsequently introduced to the model to evaluate its efficiency. In matrix notation, the regression can be represented as follows:
[ Y_0 ]   [ X_0,1   X_0,2   X_0,3  ...  X_0,18 ]   [ w_1  ]
[ Y_1 ] = [ X_1,1   X_1,2   X_1,3  ...  X_1,18 ] · [ w_2  ]
[  :  ]   [   :       :       :    ...    :    ]   [  :   ]
[ Y_m ]   [ X_m,1   X_m,2   X_m,3  ...  X_m,18 ]   [ w_18 ]
(4)

Where
[Y] is the output vector,
[X] is the input matrix (the feature parameters),
[W] is the linear statistical model of the system (the weights w_1, w_2, ..., w_18 in equation 4), and
m is the total number of sentence pairs in the training corpus.
For testing, 1200 English–Arabic sentence pairs were used as the testing data. Equation 4 was applied after using the
defined weights from [W].
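The fit of Eq. (4) and its use at test time can be sketched with an ordinary least-squares solve. This is an illustrative sketch, not the paper's code: it assumes the 8 output categories are encoded one column each (one-hot rows in Y), and the synthetic shapes below are hypothetical.

```python
import numpy as np

def train_mr(X, Y):
    """Fit the linear model Y ~= X @ W of Eq. (4) by least squares.
    X is the (m x 18) feature matrix; Y is the (m x 8) output matrix
    with one column per alignment category."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def classify(W, x):
    """Alignment category of one feature vector: the index of the
    output unit with the largest response."""
    return int(np.argmax(x @ W))
```

At test time, each 18-feature vector from the 1200 held-out pairs is multiplied by the learned [W] and the largest of the eight responses selects the category.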
3.2. Genetic Algorithm Model (GA)
The basic purpose of genetic algorithms (GAs) is optimization. Since optimization problems arise frequently, this
makes GAs quite useful for a great variety of tasks. As in all optimization problems, we are faced with the problem of
maximizing/minimizing an objective function f(x) over a given space X of arbitrary dimension (Russell, & Norvig, 1995;
Yeh, Ke, Yang & Meng, 2005). Therefore, GA can be used to specify the weight of each text feature.
For a block of 3 sentences, a weighted score function, shown in the following equation, is exploited to integrate all
18 feature scores, where w_i indicates the weight of f_i:

Score = w_1·Score_f1 + w_2·Score_f2 + ... + w_18·Score_f18
(5)
The genetic algorithm (GA) is exploited to obtain an appropriate set of feature weights using the 7653 manually
aligned English-Arabic sentence pairs. A chromosome is represented as the combination of all feature weights in the
form (w_1, w_2, ..., w_18). One thousand genomes were produced for each generation. The fitness of each genome is
evaluated (we define fitness as the average accuracy obtained with the genome when the alignment process is applied to
the training corpus), and the fittest 10 genomes are retained to mate and produce new ones for the next generation. In
this experiment, 100 generations are evaluated to obtain steady combinations of feature weights. A
suitable combination of feature weights is found by applying GA. For testing, a set of 1200 sentence pairs was used. Eq.
(5) is applied after using the defined weights from GA execution.
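The evolutionary loop described above can be sketched as follows. The paper fixes only the population size (1000), elite size (10) and generation count (100); the one-point crossover and small additive mutation used here are assumptions of this sketch, and `fitness` stands for the average alignment accuracy on the training pairs.

```python
import random

def evolve_weights(fitness, n_features=18, population_size=1000,
                   elite=10, generations=100):
    """Simple GA over feature-weight chromosomes (w_1, ..., w_18): keep
    the `elite` fittest genomes each generation and breed the rest from
    them by one-point crossover plus a small mutation."""
    population = [[random.random() for _ in range(n_features)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]                 # retain the fittest
        children = []
        while len(children) < population_size - elite:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_features)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            child[random.randrange(n_features)] += random.uniform(-0.1, 0.1)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```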
4. English-Arabic corpus
Although there are very popular Arabic-English resources in the statistical machine translation community, which may
be found in projects such as the “DARPA TIDES program, http://www.ldc.upenn.edu/Projects/TIDES/”, we
have decided to construct our Arabic-English parallel corpus from the Internet to have significant parallel data from
different domains. The approach of (Fattah et al., 2006) is used to construct the English-Arabic corpus.
5. Experimental results
5.1. Length based approach
Let us consider Gale & Church’s length-based approach as explained in (Fattah et al., 2007). We constructed a dynamic
programming framework to conduct experiments using their length-based approach as a baseline to compare
with our proposed system. First of all, it was not clear to us which variable should be considered as the text length:
characters or words? To answer this question, we performed some statistical measurements on 1000 manually aligned
English–Arabic sentence pairs, randomly selected from the previously mentioned corpus. We considered the
relationship between English paragraph length and Arabic paragraph length as a function of the number of words. The
results showed that there is a good correlation (0.987) between English paragraph length and Arabic paragraph length.
Moreover, the ratio and corresponding standard deviation were 0.9826 and 0.2046 respectively. We also considered the
relationship between English paragraph length and Arabic paragraph length as a function of the number of characters.
The results showed a better correlation (0.992) between English paragraph length and Arabic paragraph length.
Moreover, the ratio and corresponding standard deviation were 1.12 and 0.1806 respectively. In comparison to the
previous results, the number of characters as the text length variable is better than words since the correlation is higher
and the standard deviation is lower.
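The three statistics compared above (correlation, length ratio, and its standard deviation) can be reproduced as follows. This is a sketch; the toy lengths below are hypothetical, and the ratio is taken as source length over target length.

```python
import numpy as np

def length_statistics(source_lengths, target_lengths):
    """Correlation between paired paragraph lengths, plus the mean and
    standard deviation of the source/target length ratio."""
    s = np.asarray(source_lengths, dtype=float)
    t = np.asarray(target_lengths, dtype=float)
    corr = np.corrcoef(s, t)[0, 1]
    ratios = s / t
    return corr, ratios.mean(), ratios.std()
```

Run once over word counts and once over character counts, this yields the (correlation, ratio, standard deviation) triples that motivated choosing characters as the length variable.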
We applied the length-based approach (using text length as a function of the number of characters) experiment on a 1200
sentence pair sample, not taken from the training data. Table 1 shows the results. The first column in table 1 represents
the category, the second column is the total number of sentence pairs related to this category, the third column is the
number of sentence pairs that were misclassified and the fourth column is the percentage of this error. Although 1-0, 0-1
and 2-2 are rare cases, we have taken them into consideration to reduce error. Moreover, we did not consider other cases
like (3-3, 3-4, etc.) since they are very rare cases and considering more cases requires more computations and processing
time. When the system finds these cases, it misclassifies them.
5.2. Mathematical Regression approach
The system extracted features from 7653 manually aligned English-Arabic sentence pairs and used them to train a
Mathematical Regression model. 1200 English–Arabic sentence pairs were used as the testing data. These sentences
were used as inputs to the Mathematical Regression after the feature extraction step. Alignment was done using a block
of 3 sentences for each language. After aligning a source language sentence and target language sentence, the next 3
sentences were then looked at as follows:
1- Extract features from the first three English sentences and do the same with the first three Arabic sentences.
2- Construct the feature vector X .
3- Use this feature vector as an input of the Mathematical Regression model.
4- According to the model output, construct the second feature vector. For instance, if the result of the model
is S1a↔0, then read the fourth Arabic sentence and use it, together with the second and third Arabic sentences and the
first three English sentences, to generate the next feature vector X.
5- Continue using this approach until there are no more English-Arabic text pairs.
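The five steps above can be sketched as a loop that advances a three-sentence window on each side by however many sentences the predicted category consumes. The category labels and the consumption table are hypothetical stand-ins for the model's eight output units, and `classify`/`make_vector` stand for the trained model and the feature extraction step.

```python
# Sentences consumed from each side per category: (Arabic, English).
CONSUME = {"1-0": (1, 0), "0-1": (0, 1), "1-1": (1, 1), "2-1": (2, 1),
           "1-2": (1, 2), "2-2": (2, 2), "3-1": (3, 1), "1-3": (1, 3)}

def align(arabic, english, classify, make_vector):
    """Block-of-3 alignment loop: build a feature vector from the current
    three-sentence windows, classify it, link the consumed sentences,
    and advance the windows until one side is exhausted."""
    i = j = 0
    links = []
    while i < len(arabic) and j < len(english):
        category = classify(make_vector(arabic[i:i + 3], english[j:j + 3]))
        da, de = CONSUME[category]
        links.append((arabic[i:i + da], english[j:j + de]))
        i += da
        j += de
    return links
```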
Table 2 shows the results when we applied this approach on the 1200 English – Arabic sentence pairs. It is clear from
Table 2 that the results improved in accuracy over the length-based approach. Additionally, we
applied the MR approach on the entire English-Arabic corpus containing 191,623 English sentences and 183,542 Arabic
sentences. Then we randomly selected 500 sentence pairs from the sentence aligned output file and manually checked
them. The system reported a total error rate of 6.1%.
We decreased the number of sentence pairs used for training the MR model to 4000 sentence pairs to investigate the
effect of the training data size on the total system performance. These 4000 sentence pairs were randomly selected from
the training data. The constructed model was then used to align the entire English-Arabic corpus. Then, we randomly
selected 500 sentence pairs from the sentence aligned output file and manually checked them. The system reported a
total error rate of 6.2%. The reduction of the training data set does not significantly change total system performance.
5.3. Genetic Algorithm approach
The system extracted features from 7653 manually aligned English-Arabic sentence pairs and used them to construct a
Genetic Algorithm model. 1200 English-Arabic sentence pairs were used as the testing data. Using formula (5) and the
five steps described in the previous section (with the GA in place of the MR model), the 1200 English-Arabic sentence
pairs were aligned.
Table 3 shows the results when we applied this approach. Additionally, we applied the GA approach on the entire
English-Arabic corpus. Then, we randomly selected 500 sentence pairs from the sentence aligned output and manually
checked them. The system reported a total error rate of 5.6%.
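A minimal genetic-algorithm sketch in the spirit of the GA classifier used above is shown below: candidate solutions are weight vectors for a linear scoring function over feature vectors. The fitness definition, population size, and crossover and mutation settings are illustrative assumptions, not the paper's actual configuration.

```python
import random

def fitness(weights, data):
    """Negative squared error of the linear model on (features, target) pairs."""
    return -sum((sum(w * xi for w, xi in zip(weights, x)) - y) ** 2
                for x, y in data)

def evolve(data, n_features, pop_size=20, generations=60, seed=0):
    """Evolve weight vectors by elitist selection, one-point crossover
    and Gaussian point mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, data), reverse=True)
        parents = pop[:pop_size // 2]           # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            k = rng.randrange(n_features)       # Gaussian point mutation
            child[k] += rng.gauss(0, 0.1)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, data))
```

Because the fitter half of each generation survives unchanged, the best fitness never decreases; on toy data generated by a known linear rule, the evolved weights quickly beat a zero-weight baseline.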
We decreased the number of sentence pairs used for training the GA to 4000 sentence pairs as in the previous section
to investigate the effect of the training data size on total system performance. These 4000 sentence pairs were randomly
selected from the training data. The constructed model was then used to align the entire English-Arabic corpus. Then, we
randomly selected 500 sentence pairs from the sentence aligned output and manually checked them. The system reported
a total error rate of 5.8%. The reduction of the training data set from 7653 to 4000 does not significantly change the total
system performance.
Table 1
The results using the length-based approach

Category    Frequency   Errors   % Error
1-1         1099        54       4.9%
1-0, 0-1    2           2        100%
1-2, 2-1    88          14       15.9%
2-2         2           1        50%
1-3, 3-1    9           6        66.6%
Total       1200        77       6.4%
Table 2
The results using the Mathematical Regression approach

Category    Frequency   Errors   % Error
1-1         1099        49       4.4%
1-0, 0-1    2           2        100%
1-2, 2-1    88          14       15.9%
2-2         2           1        50%
1-3, 3-1    9           6        66.6%
Total       1200        72       6.0%
Table 3
The results using the Genetic Algorithm model

Category    Frequency   Errors   % Error
1-1         1099        45       4.1%
1-0, 0-1    2           2        100%
1-2, 2-1    88          13       14.7%
2-2         2           1        50%
1-3, 3-1    9           6        66.6%
Total       1200        67       5.6%
5.4. Discussion
The length-based approach did not give bad results, as shown in Table 1 (the total error was 6.4%). This is explained
by the strong correlation (0.992) and low standard deviation (0.1806) between English and Arabic text lengths, which
together account for the good results in Table 1. The Mathematical Regression and Genetic Algorithm approaches both
decreased the total error compared to the length-based approach.
Because the approaches are feature-based, researchers can substitute different feature values when addressing other
natural language processing problems.
6. Conclusions
In this paper, we investigated the use of Mathematical Regression and Genetic Algorithm models for sentence
alignment. We applied the new approaches to a sample of an English-Arabic parallel corpus, where they outperformed
the length-based approach and improved total system performance in terms of accuracy. Because the approaches are
feature-based, researchers have the opportunity to use many varieties of features depending on the language pair and
text type (for instance, Hanzi characters could be used for Japanese-Chinese texts).