A survey on parallel corpora alignment
André Santos
CeSIUM, Dep. Informática, Universidade do Minho, Campus de Gualtar, 4710-057 BRAGA
Email: pg15973@alunos.uminho.pt
Universidade do Minho - Master Course on Informatics - State of the Art Reports 2011. MI-STAR 2011, Pages 117–128.
A parallel text is the set formed by a text and its translation (in which case it is called a bitext) or translations. Parallel text alignment is the task of identifying correspondences between blocks or tokens in each half of a bitext. Aligned parallel corpora are used in several different areas of linguistics and computational linguistics research.
In this paper, a survey on parallel text alignment is presented: the historical background is provided, and the main methods are described. A list of relevant tools and projects is presented as well.
Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing
General Terms: Algorithms, Languages, Human factors, Performance
Additional Key Words and Phrases: Alignment, Parallel, Corpora, NLP
1. INTRODUCTION
1.1 Aligned parallel texts
A parallel text is the set formed by a text and its translation (in which case it is called a bitext) or translations. Parallel text alignment is the task of identifying correspondences between blocks or tokens of each half of a bitext. Aligned parallel texts are currently used in a wide range of areas: knowledge retrieval, machine learning, natural language processing, and others [Véronis 2000].
Bitexts are generally obtained by performing an alignment between two texts. This alignment can usually be made at paragraph, sentence or even word level.
1.2 Precision and recall
As in many knowledge retrieval, natural language processing or pattern recognition systems, the performance of parallel text alignment algorithms is usually measured in terms of precision, recall [Olson and Delen 2008] and F-score [Rijsbergen 1979].
In this specific context, these concepts can be explained simply as follows: given a set of parallel documents $D$, there is a number $C_{\text{total}}$ of correct correspondences between them. After aligning the texts with an aligner $A$, $T_A(D)$ represents the total number of correspondences found by $A$ in $D$, and $C_A(D)$ represents the number of correct correspondences found by $A$ in $D$. The precision $P_A(D)$ and recall $R_A(D)$ are then given by:

$$P_A(D) = \frac{C_A(D)}{T_A(D)} \qquad (1)$$
$$R_A(D) = \frac{C_A(D)}{C_{\text{total}}} \qquad (2)$$

The F-score is the harmonic mean of precision and recall, obtained with the following equation:

$$\text{F-score} = \frac{2 \times P \times R}{P + R} \qquad (3)$$
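These measures translate directly into code. A minimal sketch in Python (the function and argument names are ours, for illustration):

```python
def precision_recall_fscore(found, correct, total_correct):
    """Evaluate an aligner: `found` is T_A(D), `correct` is C_A(D),
    and `total_correct` is C_total."""
    precision = correct / found if found else 0.0
    recall = correct / total_correct if total_correct else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# An aligner proposing 110 links, 95 of them correct, out of 100 real ones:
print(precision_recall_fscore(110, 95, 100))  # approx. (0.864, 0.950, 0.905)
```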
1.3 Alignment details
The alignment of texts may be performed at several levels of granularity. Usually, when the objective is to achieve low-level alignment, like word alignment, higher-level alignment is performed first, which helps to obtain improved results.
Independently of the type of alignment being performed, several outcomes must be considered for each correspondence found: the most common case is when a source text sentence corresponds exactly to a target text sentence (1:1). Less frequently, there are omissions (1:0), additions (0:1) or something more complex (m:n), usually with 1 ≤ m, n ≤ 2.
An example of a sentence-level alignment is presented in Table I.
Table I. Extract of a sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron.

Portuguese | Russian
A actividade do laser começará em 30 segundos. | Лазер будет включен через 30 секунд.
Ponham os óculos de protecção. Abandonem a área. | Наденьте защитные очки и покиньте главный зал
Deixe-me ver se nós temos a luz verde. | Интересно, будет ли на этот раз зелёное свечение...
Área do alvo protegida? | - Облучаемая область свободна?
Verificaram a segurança. | - Да, её уже проверили.
20 segundos | 20 секунд.
- Parece bom. | - Вроде всё нормально.
- Deixe-o iniciar. | - Можно запускать.
Isto é o que nós temos esperado. | Включаю.
Through the years, several alignment methods have been proposed. The next section describes the evolution of the field and details the most relevant efforts to create gold standards and evaluate existing systems (in this context, gold standards are hand-performed alignments which were thoroughly checked and are considered “correct”; they are generally used to test and assess other systems). Section 3 describes the alignment process step-by-step, and Section 4 lists several of the most relevant tools and projects currently being developed or used. The last section presents some conclusions about this state of the art and the corpora alignment area itself.
2. BACKGROUND
Parallel text use in automatic language processing was first tried in the late fifties, but probably due to limitations in the computers of that time (both in storage space and computing power) and the reduced availability of large amounts of textual data in digital format, the results were not encouraging [Véronis 2000].
In the eighties, with computers several orders of magnitude faster, a similar increase in storage capacity and an increasing amount of large corpora available, a new interest in text alignment emerged, resulting in several publications in the following years.
The first documented approaches to sentence alignment were based on measuring the length of the sentences in each text. Kay's [1991] system was the first to devise an automatic parallel alignment method. This method was based on the idea that when a sentence corresponds to another, the words in them must also correspond. All the necessary information, including the lexical mapping, was derived from the texts themselves.
The algorithm initially sets up a matrix of correspondences based on sentences which are reasonable candidates: the initial and final sentences have a good probability of corresponding to each other, and the remaining sentences should be distributed close to the matrix's main diagonal. Then the word distributions are calculated; words with similar distribution values are matched and considered anchor points, narrowing the “alignment corridor” of the candidate sentences. The algorithm then iterates until it converges on a minimal solution.
This system was not efficient enough to apply to large corpora [Moore 2002].
Two other similar methods were proposed, the main difference between the two being how sentence length was measured: Brown et al. [1991] counted words, while Gale and Church [1993] counted characters. The authors assumed that the length of a sentence is highly correlated with the length of its translation. Additionally, they concluded that there is a relatively fixed ratio between the sentence lengths in any two languages. The optimal alignment minimizes the total dissimilarity of all the aligned sentences. Gale and Church [1993] achieved the optimal alignment with a dynamic programming algorithm, while Brown et al. [1991] applied Hidden Markov Models.
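A minimal sketch of the length-based idea in Python, using a simplified absolute-difference cost and only 1:1, 1:0 and 0:1 correspondences (Gale and Church's actual cost is probabilistic, based on character counts, and also handles 2:1, 1:2 and 2:2 matches):

```python
def align_by_length(src_lens, tgt_lens, skip_penalty=10.0):
    """Align two sentence sequences using only their lengths.

    src_lens, tgt_lens -- character counts of the source/target sentences.
    Returns (i, j) index pairs for 1:1 matches; omitted or added sentences
    (1:0 / 0:1 cases) simply do not appear in any pair.
    """
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    # cost[i][j]: best cost of aligning the first i source, j target sentences.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1:1 match, penalized by length mismatch
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, True)
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:  # 1:0
                cost[i + 1][j] = cost[i][j] + skip_penalty
                back[i + 1][j] = (i, j, False)
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:  # 0:1
                cost[i][j + 1] = cost[i][j] + skip_penalty
                back[i][j + 1] = (i, j, False)
    # Trace the optimal path back from (n, m), collecting the 1:1 matches.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]

print(align_by_length([40, 12, 71], [38, 75]))  # [(0, 0), (2, 1)]
```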
In 1993, Chen [1993] proposed an alignment method which relied on additional lexical information. Although not the first of its kind, this algorithm was the first lexical-based aligner efficient enough for large corpora. It required a minimum of human intervention (at least 100 sentences had to be aligned for each language pair to bootstrap the translation model) and it was capable of handling large deletions in text.
Another family of algorithms was proposed soon after, based on the concept of cognates. The definition of cognate may vary a little, but generally cognates are defined as occurrences of tokens that are graphically or otherwise identical in some way. These tokens may be dates, proper nouns, some punctuation marks or even closely-spelled words: Simard and Plamondon [1998] considered cognates any two words which share the first four characters. When recognisable elements like dates, technical terms or markup are present in considerable amounts, this method works well even with unrelated languages, despite some noticeable loss of performance.
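The Simard and Plamondon criterion is simple to state in code; a sketch (the function name is ours):

```python
def are_cognates(w1, w2, prefix=4):
    """Cognate test in the style of Simard and Plamondon [1998]: identical
    tokens (dates, proper nouns, punctuation) or words sharing their first
    four characters."""
    if w1 == w2:
        return True
    return (len(w1) >= prefix and len(w2) >= prefix
            and w1[:prefix].lower() == w2[:prefix].lower())

print(are_cognates("parliament", "parlement"))  # True: both start with "parl"
print(are_cognates("laser", "лазер"))           # False under this purely graphic test
```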
Most methods developed since rely on one or more of these main ideas: length-based, lexical or dictionary-based, or partial similarity (cognate) based.
2.1 ARCADE
In the mid-nineties, the number of studies describing alignment techniques was already impressive. Comparing their performance, however, was difficult due to the aligners' sensitivity to factors such as the type of text used or different interpretations of what is a “correct alignment”. Details on the protocols or the evaluation performed on the systems developed were also frequently not disclosed. A precise evaluation of the techniques would be most valuable; yet, there was no established framework for the evaluation of parallel text alignment.
ARCADE I was a project to evaluate parallel text alignment systems which took place between 1995 and 1999, consisting in a competition among systems at the international level [Langlais et al. 1998; Véronis and Langlais 2000]. It was divided in two phases: the first two years were spent on corpus collection and preparation, and methodology definition, and a small competition on sentence alignment was held; in the remaining time, the sentence alignment competition was opened to a larger number of teams, and a second track was created to evaluate word-level alignment.
For the sentence alignment track, several types of texts were selected: institutional texts, technical manuals, scientific articles and literature. To build the reference alignment, an automatic aligner was used, followed by hand-verification by two different persons, who annotated the text.
F-score values were based on the number of correct alignments, proposed alignments and reference alignments, and were calculated for different types of granularity. The best systems achieved an F-score of over 0.985 on institutional and scientific texts. On the other hand, all systems performed poorly on the literature texts.
For the word track, the systems had to perform translation spotting, a sub-problem of full alignment: for a given word or expression, the objective is to find its translation in the target text. A set of sixty French words (20 adjectives, 20 nouns and 20 verbs) was selected based on frequency criteria and polysemy features. The results were better for adjectives (0.94 precision and recall for the best system) than for verbs (0.72 and 0.62 respectively). The overall results were 0.77 and 0.73 respectively for the best system.
The ARCADE project had a few limitations: the systems were tested through limited tasks which did not reflect their full capacity, and only one language pair (French-English) was tested. Nonetheless, it made it possible to conclude, at the time, that the techniques were satisfactory for texts with a similar structure. The same techniques were not as good when it came to aligning texts whose structure did not match perfectly. Methodological advances on sentence alignment and, in a limited form, word alignment were also made possible, resulting in methods and tools for the generation of reference data, and a set of measures for system performance assessment. Additionally, a large standardized bilingual corpus was constructed and made available as a gold standard for future evaluation.
2.2 ARCADE-II
The second campaign of ARCADE was held between 2003 and 2005, and differed from the first one in its multilingual context and in the types of alignment addressed. French was used as the pivot language for the study of 10 other language pairs: English, German, Italian and Spanish for western-European languages; Arabic, Chinese, Greek, Japanese, Persian and Russian for more distant languages using non-Latin scripts.
Two tasks were proposed: sentence alignment and word alignment. Each task
involved the alignment of Western-European languages and distant languages as
well. For the sentence alignment, a pre-segmented corpus and a raw corpus were
provided.
The results showed that sentence alignment is more difficult on raw corpora
(and also, curiously, that German is harder than the other languages). The systems
achieved an F-score between 0.94 and 0.97 on raw corpora, and between 0.98 and
0.99 on the segmented one. As for distant-language alignment, two systems (P1 and P2) were evaluated. The results showed that sentence segmentation is very hard on non-Latin scripts: P1 was incapable of performing it, and P2 scored 0.421 on raw corpora alignment.
Evaluating word alignment presents some additional difficulties, given the differences in word order, part-of-speech and syntactic structure, discontinuity of multi-token expressions, etc., which can be found between a text and its translations. Sometimes it is not even clear how some words should be aligned when they do not have a direct correspondence in the other language. Nevertheless, a small competition was held, in which the systems were supposed to identify the translations of named-entity phrases in the parallel text.
The scope of this task was limited; yet, it allowed a test protocol and metrics to be defined.
2.3 Blinker project
In 1998, a “bilingual linker”, Blinker, was created in order to help bilingual annotators (people who go through bitexts and annotate the correspondences between sentences or words) in the process of linking word tokens that are mutual translations in parallel texts. This tool was developed in the context of a project whose goal was to manually produce a gold standard which could be used for future research and development [Melamed 1998b]. Along with this tool, several annotation guidelines were defined [Melamed 1998a], in order to increase consistency between the annotations produced by all the teams.
The parallel texts chosen for alignment in this project were a modern French version and a modern English version of the Bible. This choice was motivated by the canonical division of the Bible into verses, which is common across most of the translations. This was used as a “ready-made, indisputable and fairly detailed bitext map”.
Several methods were used to improve reliability. First, as many annotators as possible were recruited. This made it possible to identify deviations from the norm, and to evaluate the gold standard produced in terms of inter-annotator agreement rates (how much the annotators agreed with each other).
Second, Blinker was developed with some unique features, such as forcing the annotator to annotate all the words in a given sentence before proceeding (either by linking them with the corresponding target text words, or by declaring them “not translated”). These features aimed to ensure that a good first approximation was produced, and to discourage the annotators' natural tendency to classify words whose translation was less obvious as “not translated”.
Third, the annotation guidelines were created to reduce experimenter bias, and a
monetary bonus was offered to the annotators who stayed closest to the guidelines.
The inter-annotator agreement rates on the gold standard were approximately
82% (92% if function words were ignored), which indicated that the gold standard
is reasonably reliable and that the task is reasonably easy to replicate.
3. ALIGNMENT PROCESS
Despite the existence of several methods and tools to align corpora, the process
itself shares some common steps. In this section the alignment process is detailed
step-by-step.
3.1 Gathering corpora
As mentioned in the previous section, the lack of available corpora was one of the reasons pointed out for why the earlier text alignment efforts did not work out. Fortunately, with the massification of computers and the globalization of the internet, several sources of alignable corpora have appeared:
– Literary Texts. Books published in electronic format are becoming increasingly common. While the vast majority are subject to copyright restrictions, some of them are not. Project Gutenberg, for example, is a volunteer effort to digitize and archive cultural works; currently it comprises more than 33,000 books, most of them in the public domain [Hart and Newby 1997]. National libraries are beginning to maintain repositories of eBooks as well [Varga et al. 2005]. Sometimes it is also possible to find publishers who are willing to cooperate by providing their books, as long as they are used for research purposes only.
– Religious Texts. The Bible has been translated into over 400 languages, and other religious books are widespread as well. Additionally, the Catholic Church translates papal edicts from the original Latin into other languages, and the Taizé Community website (http://www.taize.fr) also provides a great deal of its content in several languages.
– International Law. Important legal documents, such as the Universal Declaration of Human Rights (http://www.un.org/en/documents/udhr/index.shtml) or the Kyoto Protocol (http://unfccc.int/kyoto_protocol/items/2830.php), are freely available in many different languages. One of the first corpora used in parallel text alignment was the Hansards, the proceedings of the Canadian Parliament [Brown et al. 1991; Gale and Church 1993; Kay 1991; Chen 1993].
– Movie subtitles. There are several on-line databases of movie subtitles. Depending on the movie, it is possible to find dozens of subtitle files, in several different languages (frequently even more than one version per language). Subtitles are shared on the internet as plain text files (sometimes tagged with extra information such as language, genre or release year) [Tiedemann 2007].
– Software internationalization. There is an increasing amount of multilingual documentation belonging to software available in several countries. Open source software is particularly useful, as it is not subject to copyright restrictions [Tiedemann et al. 2004].
– Bilingual Magazines. Magazines like National Geographic, Rolling Stone or frequent flyer magazines are often published in other languages besides English, and in several countries there are magazines with complete mirror translations into English.
– Websites. Websites often allow the user to choose the language in which they are presented. This means that a web crawler may be pointed at these websites to retrieve all reachable pages, a process called “mining the Web” [Resnik and Smith 2003; Tsvetkov and Wintner 2010]. Corporate webpages are one example of websites available in several languages. Additionally, some international companies have their subsidiaries publish their reports both in their native language and in their common language (usually English).
3.2 Input
The format of the documents to align usually depends on where they come from: literary texts are often obtained either as PDF or plain text files; legal documents usually come as PDF files; corporate websites come in HTML; and movie subtitles come in either SubRip or microDVD format.
Globally accepted as the standard format for sharing text documents, PDF files are usually converted to some kind of plain text format: either unformatted plain text, HTML or XML. Although several tools are available to perform this conversion, given that the final layout of PDF files is dictated by graphical directives, without any notion of text structure, the final result of the conversion is frequently less than perfect: remains of page structure, loss of text formatting and noise introduction are common problems. A comparative study of the available tools can be found in Robinson [2001].
HTML or XML files are less problematic. Besides being more easily processed, some of their markup elements may be used to guide the alignment process: for example, the header tags in HTML may be used to delimit the different sections of the document, and text formatting tags such as <i> or <b> might be preserved and used as well. Depending on the schema used, XML annotations can be a useful source of information too.
3.3 Document alignment
Frequently, a large number of documents is retrieved from a given source with no information provided about which documents match each other. This is typically the case when documents are gathered from websites with a crawler.
Fortunately, when there are multiple versions of the same page available in different languages, it is common to include in the name a substring identifying the language: for instance, Portuguese, por or pt [Almeida and Simões 2010]. These substrings usually follow either the ISO-639-2 (http://www.loc.gov/standards/iso639-2/php/code_list.php) or the ISO 3166 (http://www.iso.org/iso/country_codes/iso_3166_code_lists.htm) standards. This way, it is possible to write a script with a set of rules which help to find corresponding documents.
Other ways to match documents are to analyze their meta-information, or the path where they were stored.
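A minimal sketch of such a rule-based matching script (the file-name patterns and the language codes covered are illustrative, not a complete rule set):

```python
import re
from collections import defaultdict

# Illustrative patterns: "report_pt.html", "report.en.html", "report-fr.html" ...
LANG_RE = re.compile(r"[._-](pt|por|en|eng|fr|fra)(?=\.|$)", re.IGNORECASE)

def group_by_base_name(filenames):
    """Group files that differ only in their language-code substring."""
    groups = defaultdict(dict)
    for name in filenames:
        m = LANG_RE.search(name)
        if m:
            base = name[:m.start()] + name[m.end():]  # name with the code removed
            groups[base][m.group(1).lower()] = name
    # Keep only the bases for which at least two languages were found.
    return {b: langs for b, langs in groups.items() if len(langs) > 1}

print(group_by_base_name(["report_pt.html", "report_en.html", "about_en.html"]))
# {'report.html': {'pt': 'report_pt.html', 'en': 'report_en.html'}}
```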
3.4 Paragraph/sentence boundary detection
The higher levels of alignment are section, paragraph or, most commonly, sentence alignment. In order to perform sentence-level alignment, texts must be segmented into sentences. Some word-level aligners work on sentence-segmented corpora as well, which allows them to achieve better results.
Splitting a text into sentences is not a trivial task, because the formal definition of what constitutes a sentence is a problem that has eluded linguistic research for quite a while (see Simard [1998] for further details on this subject). Véronis and Langlais [2000] give an example of several valid yet divergent segmentations of sentences.
A simple heuristic for a sentence segmenter is to consider symbols like .!? as sentence terminators. This method may be improved by taking into account other terminator characters or abbreviation patterns; however, these patterns are typically language-dependent, which makes it impossible to have an exhaustive list for all languages [Koehn 2005] (an example of such a sentence segmenter can be downloaded at http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/).
This process usually outputs the given documents with the sentences clearly delimited.
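A sketch of this heuristic, with a deliberately small, illustrative abbreviation list:

```python
import re

# Illustrative, necessarily incomplete and language-dependent abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "sr", "sra", "etc", "vs"}

def split_sentences(text):
    """Naive segmenter: . ! ? end a sentence, unless the preceding token
    is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s|$)", text):
        before = text[start:m.start()].split()
        if before and before[-1].lower() in ABBREVIATIONS:
            continue  # e.g. the dot in "Dr." does not end the sentence
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. The laser starts in 30 seconds!"))
# ['Dr. Smith arrived.', 'The laser starts in 30 seconds!']
```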
3.5 Sentence Alignment
There are several documented algorithms and tools available to perform sentence-level alignment. Generally speaking, they can be divided into three categories: length-based, dictionary or lexicon-based, and partial similarity-based (see Section 2 for more information).
Generally, sentence aligners take as input the texts to align and, in some cases, additional information, such as a dictionary, to help establish the correspondences. A typical sentence alignment algorithm starts by calculating alignment scores, trying to find the most reliable initial points of alignment, called “anchor points”. This score may be calculated based on similarity in terms of length, words, lexicon or even syntax trees [Tiedemann 2010]. After finding the anchor points, the process is repeated, trying to align the intermediate points. Typically, this ends when no new correspondences are found.
The alignment is performed without allowing cross-matching, meaning that the sentences in the source text must be matched in the same order in the target text.
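A sketch of this anchor-point strategy, assuming a user-supplied scoring function (the names and the exact recursion are ours; real aligners use more refined scores and stopping criteria). Because each anchor splits both texts into a left and a right region that are aligned independently, the no-cross-matching property holds by construction:

```python
def anchor_align(src, tgt, score, threshold=0.5):
    """Recursively align two sentence lists: pick the best-scoring pair as
    an anchor, then align the regions to its left and right independently.
    `score(a, b)` is any similarity measure (length ratio, shared cognates...).
    Returns (src_index, tgt_index) pairs in source order."""
    if not src or not tgt:
        return []
    s, i, j = max(((score(a, b), i, j)
                   for i, a in enumerate(src)
                   for j, b in enumerate(tgt)), key=lambda t: t[0])
    if s < threshold:  # no reliable anchor left in this region
        return []
    left = anchor_align(src[:i], tgt[:j], score, threshold)
    right = anchor_align(src[i + 1:], tgt[j + 1:], score, threshold)
    # Re-index the right-hand results back into the full lists.
    return left + [(i, j)] + [(i + 1 + a, j + 1 + b) for a, b in right]

length_score = lambda a, b: min(len(a), len(b)) / max(len(a), len(b))
print(anchor_align(["aaaa", "bb"], ["cccc", "dd"], length_score))  # [(0, 0), (1, 1)]
```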
3.6 Tokenizer
Splitting sentences into words makes it possible to subsequently perform word-level alignment. As with sentence segmentation, it is possible to implement a very simple heuristic to accomplish this task by defining sets of characters which are to be considered either word characters or non-word characters. For example, in Portuguese, it is possible to define A-Z, a-z, - and ' (and all the characters with diacritics) as belonging inside words, and all others as being non-word characters.
However, here too there are languages for which other methods must be used: in Chinese, for example, word boundaries are not as easy to determine as in Western-European languages. This imposes interesting challenges on how to align texts in some languages at word level.
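A sketch of such a character-class tokenizer for Portuguese (the character ranges are illustrative):

```python
import re

# Word characters: ASCII letters, hyphen, apostrophe and Latin-1 diacritics.
WORD_RE = re.compile(r"[A-Za-zÀ-ÖØ-öø-ÿ'-]+")

def tokenize(sentence):
    """Return the list of word tokens; everything else is a separator."""
    return WORD_RE.findall(sentence)

print(tokenize("Deixe-me ver se nós temos a luz verde."))
# ['Deixe-me', 'ver', 'se', 'nós', 'temos', 'a', 'luz', 'verde']
```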
3.7 Word alignment
The word-alignment process presents some similarities with higher-level alignments. However, it is a more complex process, given the more frequent order inversions, differences in part-of-speech and syntactic structure, and multi-word units. As a result, research in this area is less advanced than in sentence alignment [Véronis and Langlais 2000; Chiao et al. 2006].
Multi-word units (MWUs) are idiomatic expressions composed of more than one word. MWUs must be taken into account because frequently they cannot be translated word-by-word; the corresponding expression must be found, whether it is also an MWU or a single word [Tiedemann 1999; 2004]. This happens frequently when aligning English and German, for instance, because of German compound words such as Geschwindigkeitsbegrenzung (in English, speed limit).
3.8 Output
The output formats vary greatly. Generally, every format must include some kind of sentence identification (either through the offsets of its beginning and end characters or through a given identification code) and the pairs of correspondences; optionally, additional information may also be included, like a confidence value or the stemmed form of the word.
BAF, for example, is available in the following formats [Simard 1998]:
– COAL format. An alignment in this format is composed of three files: two plain text files containing the actual sentences and words originally provided, and an alignment file which contains a sequence of pairs $[(s_1, t_1), (s_2, t_2), \ldots, (s_n, t_n)]$. The $i$-th segment in the source text begins at position $s_i$ and ends at position $s_{i+1} - 1$; the corresponding segment in the target text is delimited by $t_i$ and $t_{i+1} - 1$ (see the sketch after this list).
– CES format. This format is also composed of three files: two text files in CESANA format, enriched with SGML mark-up that uniquely identifies each sentence, and an alignment file in CESALIGN format, containing a list of pairs of sentence identifiers. The results of ARCADE II were also stored in this format.
– HTML format. Not intended for further processing purposes. This is a visualization
format, displayable in a common browser.
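As an illustration of the COAL offset convention described above, a sketch that decodes such a pair list (the data are made up; only the begin/end convention comes from the description, and the final pair is used here as an end-of-text sentinel):

```python
def coal_segments(offset_pairs, src_text, tgt_text):
    """Turn a COAL-style pair list [(s1, t1), ..., (sn, tn)] into aligned
    text segments: segment i spans s_i .. s_{i+1}-1 in the source text
    and t_i .. t_{i+1}-1 in the target text."""
    segments = []
    for (s, t), (s_next, t_next) in zip(offset_pairs, offset_pairs[1:]):
        segments.append((src_text[s:s_next], tgt_text[t:t_next]))
    return segments

# Hypothetical data; the last pair marks the end of both texts.
pairs = [(0, 0), (20, 23), (40, 45)]
src = "First src sentence. Second src sentence."
tgt = "Premiere phrase cible. Deuxieme phrase cible."
for s_seg, t_seg in coal_segments(pairs, src, tgt):
    print(repr(s_seg), "<->", repr(t_seg))
```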
Other possible formats are translation memory (TM) formats [Désilets et al. 2008]. Translation memories are databases (in the broader sense of the word) that store pairs of “segments” (either paragraphs, sentences or words). TMs store the source text and its translation in language pairs called translation units. There are multiple TM formats available.
4. CURRENT PROJECTS AND TOOLS
This section lists some of the most relevant projects and tools being developed or
used at the moment.
4.1 NATools
NATools is a set of tools for processing, analyzing and extracting translation resources from parallel corpora, developed at Universidade do Minho. It includes sentence and word aligners, a probabilistic translation dictionary (PTD) extractor, a corpus server, a set of corpora and dictionary query tools, and tools for extracting bilingual resources [Simões and Almeida 2007; 2006].
4.2 Giza++
Giza++ is an extension of an older program, Giza. This tool performs statistical alignment, implementing several Hidden Markov Models and advanced techniques which improve alignment results [Och and Ney 2000].
4.3 hunalign
hunalign is a sentence-level aligner, written in C++ and built on top of Gale and Church's algorithm. When provided with a dictionary, hunalign uses that information to help in the alignment process, though it is able to work without one [Varga et al. 2005].
4.4 Per-Fide
Per-Fide is a project from Universidade do Minho which aims to compile parallel corpora between Portuguese and six other languages (Español, Russian, Français, Italiano, Deutsch and English) [Araújo et al. 2010]. The corpora include different kinds of text, such as literary, religious, journalistic, legal and technical. The corpora are sentence-level aligned, and include morphological and morphosyntactic annotations where possible. Automatic extraction of resources is implemented, and the results are made available on the net.
4.5 cwb-align
Also known as easy-align, this tool is integrated in the IMS CWB Open Corpus Workbench, a collection of open source tools for managing and querying large text corpora with linguistic annotations, based on an efficient query processor, CQP [Evert 2001]. It is considered a de facto standard, given its quality and implemented features, such as tools for encoding, indexing, compression, decoding and frequency distributions, a query processor, and a CWB/Perl API for post-processing, scripting and web interfaces.
4.6 WinAlign
WinAlign is a commercial solution from the Trados package, developed for professional translators. It accepts previous translations as translation memories and uses them to guide further alignments. This is also useful when clients provide reference material from previous jobs. The Trados package may be purchased at http://www.translationzone.com/.
5. CONCLUSIONS
This paper presented the field of parallel corpora alignment. Its historical background
and first development initiatives were described, and the alignment process was
detailed step-by-step. A list of currently relevant projects and tools was also presented.
Several methods and algorithms for sentence-level text alignment have been devised and improved throughout the years. While the existing methods perform acceptably for well-formatted and almost perfectly parallel texts, their performance decreases greatly when the bitexts present major deletions, inversions or uncleaned markup. The quality of the sentence segmentation process also plays a determinant role in the alignment.
Word-level alignment, being considerably more difficult, is still at an early stage. Despite its similarities with sentence alignment, the elevated number of inversions and deletions and the existence of multi-word units must be taken into account. Here too, the tokenizer's quality has a decisive influence on the final results.
There is a wide range of applications for aligned parallel corpora, especially in the areas of computational linguistics, lexicography, machine translation and knowledge extraction.
In order to improve the lower-level alignments, we feel there is a need for a structural aligner, capable of aligning big blocks of text (like sections or chapters), which would make it possible to divide the problem of aligning a given pair of texts into smaller subproblems of aligning each of their sections. This would also allow unmatched sections (a common problem when aligning literary texts, for example) to be detected and discarded beforehand.
REFERENCES
Almeida, J. and Simões, A. 2010. Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites. In Proceedings of LREC-2010. 50.
Araújo, S., Almeida, J., Dias, I., and Simões, A. 2010. Apresentação do projecto Per-Fide: Paralelizando o Português com seis outras línguas. Linguamática, 71.
Brown, P., Lai, J., and Mercer, R. 1991. Aligning sentences in parallel corpora. 169–176.
Chen, S. 1993. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 9–16.
Chiao, Y., Kraif, O., Laurent, D., Nguyen, T., Semmar, N., Stuck, F., Véronis, J., and Zaghouani, W. 2006. Evaluation of multilingual text alignment systems: the ARCADE II project. In Proceedings of LREC-2006. Citeseer.
Désilets, A., Farley, B., Stojanovic, M., and Patenaude, G. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer 30, 27–28.
Evert, S. 2001. The CQP query language tutorial. IMS Stuttgart 13.
Gale, W. and Church, K. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19, 1, 75–102.
Hart, M. and Newby, G. 1997. Project Gutenberg. http://www.gutenberg.org/wiki/Main_Page.
Kay, M. 1991. Text-translation alignment. In ACH/ALLC '91: "Making Connections" Conference Handbook. Tempe, Arizona.
Koehn, P. 2005. Europarl: A parallel corpus for statistical machine translation. 5.
Langlais, P., Simard, M., Veronis, J., Armstrong, S., Bonhomme, P., Debili, F., Isabelle, P., Souissi, E., and Theron, P. 1998. Arcade: A cooperative research project on parallel text alignment evaluation.
Melamed, I. D. 1998a. Annotation style guide for the Blinker project. ArXiv preprint cmp-lg/9805004.
Melamed, I. D. 1998b. Manual annotation of translational equivalence: The Blinker project. ArXiv preprint cmp-lg/9805005.
Moore, R. 2002. Fast and accurate sentence alignment of bilingual corpora. Machine Translation: From Research to Real Users, 135–144.
Och, F. and Ney, H. 2000. Improved statistical alignment models. 440–447.
Olson, D. and Delen, D. 2008. Advanced Data Mining Techniques. Springer Verlag.
Resnik, P. and Smith, N. 2003. The Web as a parallel corpus. Computational Linguistics 29, 3, 349–380.
Rijsbergen, C. J. V. 1979. Information Retrieval, 2nd ed. Butterworth-Heinemann, Newton, MA, USA.
Robinson, N. 2001. A Comparison of Utilities for Converting from Postscript or Portable Document Format to Text. Tech. rep., CERN-OPEN-2001.
Simard, M. 1998. The BAF: a corpus of English-French bitext. In First International Conference on Language Resources and Evaluation. Vol. 1. Citeseer, 489–494.
Simard, M. and Plamondon, P. 1998. Bilingual sentence alignment: Balancing robustness and accuracy. Machine Translation 13, 1, 59–80.
Simões, A. and Almeida, J. 2006. NatServer: a client-server architecture for building parallel corpora applications. Procesamiento del Lenguaje Natural 37, 91–97.
Simões, A. and Almeida, J. 2007. Parallel corpora based translation resources extraction. Procesamiento del Lenguaje Natural 39, 265–272.
Tiedemann, J. 1999. Word alignment step by step. 216–227.
Tiedemann, J. 2004. Word to word alignment strategies. 212.
Tiedemann, J. 2007. Building a multilingual parallel subtitle corpus. Proc. CLIN.
Tiedemann, J. 2010. Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree Alignment. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010).
Tiedemann, J., Nygaard, L., and Hf, T. 2004. The OPUS corpus – parallel and free. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC). Citeseer.
Tsvetkov, Y. and Wintner, S. 2010. Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content. In Proceedings of LREC-2010.
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., and Trón, V. 2005. Parallel corpora for medium density languages. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005.
Véronis, J. 2000. From the Rosetta stone to the information society. 1–24.
Véronis, J. and Langlais, P. 2000. Evaluation of parallel text alignment systems. Vol. 13, 369–388.