Traduco: A collaborative web-based CAT environment for the interpretation and translation of texts

Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi and
Giulia Benotto
Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio
Nazionale delle Ricerche, Pisa, Italy
Abstract
Traduco is a web-based collaborative tool aimed at supporting the translation of texts that pose particularly challenging interpretative issues. Nowadays, Computer-Assisted Translation (CAT) tools are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process. Traduco extends most of the standard components of a traditional CAT tool with specific features necessary to support the interpretation and translation of complex texts (like the Babylonian Talmud, which we present here as a case study), which pose particular comprehension issues. Traduco goes beyond translation and printing: it includes features for the addition of notes and annotations and the creation of glossaries. Translators, editors, supervisors, and end-users accessing Traduco are able to use components that can ease the translation process through the use of CAT technologies, the supervision and management of the whole process of translation and publishing, the export of translations and notes in standard formats for desktop publishing software and in TEI format, and, soon, the possibility of performing automatic linguistic analysis of the text. Moreover, Traduco allows users to insert notes, comments, annotations, and bibliographical references. The design and development of Traduco required the adoption of a multidisciplinary approach, leveraging advances in software engineering, computational linguistics, knowledge engineering, and publishing.
1 Introduction
Traduco is a web-based collaborative tool aimed at
supporting the translation of texts that pose particu-
larly challenging interpretative issues. The develop-
ment of Traduco was started in 2012 at the Institute
for Computational Linguistics ‘A. Zampolli’ of the
Italian National Research Council (ILC-CNR) for
the translation of the Babylonian Talmud (BT) into
Italian within the context of the ‘Progetto Traduzione
del Talmud Babilonese’ (PTTB), monitored by the
Italian Presidency of the Council of Ministers and
coordinated by the Union of Italian Jewish
Communities and the Italian Rabbinical College.
Nowadays, many Computer-Assisted Translation (CAT) tools, both commercial and free of charge, are already available.¹ Professional translators use Translation Memories (TMs) on a regular basis (Lagoudaki 2009). The best-known commercial systems are Across,² Déjà Vu,³ memoQ,⁴ MultiTrans,⁵ SDL Trados,⁶ Similis,⁷ Transit,⁸ and Wordfast,⁹ while non-commercial ones are OpenTM,¹⁰ OmegaT,¹¹ Olanto,¹² Transolution,¹³ and Matecat.¹⁴ They are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process, allowing translators to save a significant amount of time and effort.

Correspondence: Emiliano Giovannetti, Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy. E-mail: emiliano.giovannetti@ilc.cnr.it
Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017. © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqw054 Advance Access published on 26 October 2016

The approach adopted for the development of
Traduco is instead more oriented towards covering
aspects related to the specific needs of the translator
community working on texts with particular inter-
pretative issues. To translate these texts, a translator
is required to have two kinds of competences: in
language (as a translator) and in the ‘content’ of
the text to be translated (as a scholar). Since a por-
tion of text (and sometimes even a single word) can
be difficult to interpret and translate, the possibility
provided by a collaborative environment to in-
stantly consult translations done by others becomes
a necessity. From the end-user’s point of view, the
understanding of these texts requires a translation to
be enriched with explanations, notes, and glossary
entries.
Although some of the cited tools integrate
Machine Translation (MT) techniques, the lack of
linguistically annotated resources and large collec-
tions of parallel texts involving the source and the
target languages has prevented us from considering
any statistical MT toolkit. Instead, we have imple-
mented a TM enabling translators to re-elaborate
the plain and literal translation of the text and to
integrate it with explicative additions. To the best of
our knowledge, this is the first application of CAT
technologies specifically designed to support the
translation of complex texts like the BT.
In this article, we describe the main characteris-
tics of Traduco (Section 2) and we discuss all the
features that make it different from other CAT tools
(Section 3). As a matter of fact, although initially
designed to provide basic support in the collabora-
tive translation of the Talmud, over the years
Traduco has undergone several upgrades. Some of
these involve the integration of state-of-the-art
approaches aimed at improving the performance
of the Translation Memory System (TMS), the com-
ponent devoted to suggesting translations to users
(Section 4).
At the current stage of development, Traduco is almost ready to be released as an open-source, language- and text-independent collaborative web environment for the translation of texts that are challenging from a scholarly point of view. However, we intend to continue to release new versions, integrating new features, as we briefly discuss in Section 5.
2 The Traduco System
Traduco is made up of various components, each implementing specific functionalities targeted at different types of users (Fig. 1). The system goes beyond mere translation and printing: it includes features for the addition of notes and annotations and the creation of glossaries. Translators, editors, supervisors, and end-users accessing Traduco will therefore be able to use components that can:
• ease the translation process through the use of CAT technologies, including indexers and TM tools (see Section 3.1);
• allow users to insert notes, comments, annotations, and bibliographical references (see Section 3.2);
• supervise and manage the whole process of translation and publishing (see Section 3.3);
• export translations and notes in standard formats for desktop publishing software and in Text Encoding Initiative (TEI) format (see Section 3.4);
• perform automatic linguistic analysis of the text (see Section 4).
As concerns the current use of Traduco, i.e. the translation of the BT, by the end of the project Traduco will have produced two resources:
• the printed edition of the Italian translation of the BT;
• the digital edition of the translated and annotated BT, which users will be able to consult online.

The following two subsections describe both the
general architecture of Traduco and the technical
solutions adopted for its implementation.
2.1 Characteristics of the system
Fig. 1 General architecture of Traduco

The design of the Traduco architecture took into account both the guidelines for the creation of models and tools for digital publishing, as suggested by the scientific community (Interedition¹⁵ first and foremost), and the actual user demands pertaining to the BT translation project. So far, none of the systems (or frameworks) available commercially, freely distributed in academic circles or, more generally, described in the literature (see Section 1) is able to satisfy the many requirements of a modern environment for the translation of texts raising particular interpretative issues. Traduco was conceived to fulfil these requirements. As a matter of fact, the system is:
• Built with a component-based architecture: as with the pieces of a puzzle, developers willing to build their own system must be able to draw from a pool of independent basic components. The component-based architectural structure was facilitated by the technology adopted (i.e. the object-oriented Java programming language);
• Accessible through the Web: the Web is the ideal working environment for collaborative authoring and publishing activities; as opposed to desktop applications, which require the installation of specific client software on computers, web-based applications require just a browser (e.g. Firefox, Safari, Chrome) and a connection to the Internet through which the user can communicate with the system running on a remote server. The advantages of web-based applications are considerable; for example, a system update can be applied without the users having to update any software on their computers;
• Collaborative: the online environment, combined with the reliability of the technological framework used, allows a team of users (translators, revisors, editors, supervisors, domain experts, etc.) to work on the same data collaboratively; the system keeps track of the information already stored inside the database (organized as hierarchically structured translation fragments) and prevents the same sentence from being translated by more than one person. Furthermore, the supervisors can keep track, in real time, of the work done by the translators as they translate new sections of the text they have been assigned;
• Based on open-source technologies: software development based on open-source technology is encouraged by the scientific community and allows developers to autonomously implement ad hoc extensions and customizations to the code; in this case, the system is developed using the set of open-source technologies collected in the Java 2 Standard Edition (J2SE) framework, which over the years has become synonymous with the development of solid, secure, and efficient professional applications. J2SE is the most stable, tested, and documented technological platform for the integration of mission-critical systems that require distributed access, session transactionality, persistence management, and rich interface component libraries;
• Equipped with tools for text annotation and ready for language processing: when available, annotation and Natural Language Processing (NLP) tools can be applied to both the source and the translated texts for tasks such as semantic annotation, linguistic analysis (typically, morpho-syntactic tagging, stemming, or lemmatization), terminology extraction, named entity extraction, etc.; an annotated text can be, among other things, (1) queried on a linguistic, lexical, or semantic basis, and (2) used to boost the TMS;
• Adaptable to different languages: the text processing and linguistic analysis components included in a system should be relatively easy to adapt to different languages; the technology included in Traduco, for example, is based on UTF-8 for character encoding (thus covering the vast majority of languages) and on supervised statistical models for linguistic analysis that can be re-trained to process other languages (if pre-annotated corpora are available).
2.2 Technical solutions
From a technical point of view, Traduco was designed as a group of independent web-based components connected by interfaces. It is based on the software design pattern known as ‘three-tier architecture’, and it uses Apache Tomcat v7.0 as its web server. The component-based architectural structure was implemented with the object-oriented J2SE framework, enhanced with Contexts and Dependency Injection annotations, using the Weld v2.2.4 reference implementation. Relational persistence and query services are managed by Hibernate v4.3.7, which is responsible for the mapping from Java classes to the MySQL v5.0 database tables. To provide a very responsive TMS component, we adopted an inverted index data structure (Patil et al., 2011). Finally, the presentation layer was implemented with JavaServer Faces, a framework for building component-based user interfaces for web applications, using Oracle's Mojarra implementation v2.2.9 and the PrimeFaces v5.1 library.
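
The role of the inverted index can be illustrated with a minimal Python sketch. Traduco itself is written in Java and its actual index is not described in this section, so the class name `InvertedIndex`, the whitespace tokenization, and the sample segments are all hypothetical. The sketch only shows the general idea: each token is mapped to the identifiers of the segments containing it, so a lookup touches a handful of candidates instead of scanning the whole TM.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: token -> ids of the segments containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of segment ids
        self.segments = {}                # segment id -> list of tokens

    def add(self, seg_id, text):
        tokens = text.split()  # assumption: plain whitespace tokenization
        self.segments[seg_id] = tokens
        for token in tokens:
            self.postings[token].add(seg_id)

    def candidates(self, query):
        """Return ids of segments sharing at least one token with the query."""
        found = set()
        for token in query.split():
            found |= self.postings.get(token, set())
        return found


index = InvertedIndex()
index.add(1, "alpha beta gamma")
index.add(2, "beta delta")
index.add(3, "epsilon zeta")
print(sorted(index.candidates("beta gamma")))
```

Only the candidates returned by such a lookup would then need to be compared with the query by a more expensive similarity function.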
3 The Components of Traduco
Traduco extends most of the standard components
of a traditional CAT tool with specific features ne-
cessary to support the interpretation and translation
of texts like the BT, which pose particular comprehension issues. The design and development of Traduco required the adoption of a multidisciplinary approach, leveraging advances in software engineering, computational linguistics, knowledge engineering, publishing, and, in the case of its application to the translation of the BT, Talmudic knowledge and Ancient Semitic linguistics.
The basic functioning of Traduco does not differ from that of other CAT tools. As shown in Fig. 2, a translator can view the hierarchical structure (on the left)
of the translated text—organized, in the case of the
BT, in tractates, chapters, blocks, logical units, and
strings (the segments). In the central part of the
table, a translator can insert new segments paired
with their translations, either manually one after
the other, or by creating multiple segments all at
once and then translating them separately. Just
above the translation table, a collapsible ‘Filter’ sec-
tion allows the users to execute searches, both on
the source and on the translated texts. Finally, on
the right of the table there are the notes, glossary
entries, and translation suggestions relative to the
selected pair of segments.
Traduco can support the translation of complex texts, which often require specific reformulations in order to be correctly understood by non-scholars.
Figure 3 shows how the translation of each segment
can be performed by differentiating the ‘literal’
translation (in bold) from explicative additions,
i.e. ‘contextual information’. Segments having the
same literal part can then differ by their contextual
information. The tractate of the source segment of
each translation is called ‘context’.
In the following subsections we outline the most
relevant features implemented by the components
of Traduco that have been developed to face the
translation of textual corpora with complex philo-
logical and linguistic peculiarities, as in the case of
the BT.
3.1 Translation memory system
One of the core components of a CAT tool is the TMS. TMSs rely on a TM consisting of a sentence-pair database which automatically stores all the translated text segments together with the source text during the translation process (Reinke 2013). Basically, the main purpose of a TMS is to allow translators to reuse existing translations. A TMS generally consists of:
• a TM database, containing pairs of segments $(s, t)$, where $s$ is the source-language segment of text and $t$ is its translation in the target language (see Section 3.1.1);
• a similarity function $Sim$;
• a similarity threshold.
Given a segment $s_q$ to be translated, the TMS returns a translation $t_q$ by searching the TM for the best match, i.e. a pair $(s, t_q)$ whose similarity $Sim(s, s_q)$ is maximal among the pairs that reach the threshold, if such a pair exists (Sikes 2007). The function $Sim$ measures the similarity between two source-language segments. Typically, it produces a percentage value, where 100% stands for ‘identical segments’, i.e. an exact match, and 0% for ‘completely different segments’. Intermediate percentage values are called fuzzy matches. The TMS ranks suggestions by similarity and presents them to translators. Typically, when there is no exact match, the translator needs to edit the proposed suggestions. The next subsections describe how the system retrieves pertinent translation suggestions (see Section 3.1.2), and how the system presents the suggestions to each translator (see Section 3.1.3).
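
The best-match lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sim` and `best_match` are hypothetical, the similarity is a stand-in based on the standard library's `difflib.SequenceMatcher` (not the measure used by Traduco), and the default threshold of 70 is an arbitrary choice that happens to coincide with the lowest star band of Section 3.1.3.

```python
from difflib import SequenceMatcher


def sim(s, s_q):
    """Stand-in similarity in percent (NOT the measure used by Traduco)."""
    return 100.0 * SequenceMatcher(None, s.split(), s_q.split()).ratio()


def best_match(tm, s_q, threshold=70.0):
    """Return (similarity, source, translation) of the best match, or None."""
    scored = [(sim(s, s_q), s, t) for s, t in tm]
    # Keep only the pairs whose similarity reaches the threshold.
    scored = [item for item in scored if item[0] >= threshold]
    return max(scored, key=lambda item: item[0]) if scored else None


# Hypothetical TM content: (source segment, translation) pairs.
tm = [("a b c d", "translation 1"), ("a b x d", "translation 2")]
print(best_match(tm, "a b c d"))  # an exact match exists
print(best_match(tm, "q r"))      # no pair reaches the threshold
```

Returning `None` when no pair reaches the threshold corresponds to the "if such a pair exists" condition in the definition.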
3.1.1 Translation memory
The TM is organized at the segment level. A segment is a portion of original text having an arbitrary length. We formally define the TM $= \{(s_i, T_i, A_i, c_i)\}$, with $i$ ranging from 1 to $n$, as a set of $n$ tuples, where each tuple is defined by:
• $s_i$, the source segment;
• $T_i = \{t_i^1, \ldots, t_i^k\}$, the set of translations of $s_i$ with $k \geq 1$, where each $t_i^j$, with $1 \leq j \leq k$, includes a literal part exactly corresponding to the source segment, and an explicative addition, hereafter referred to as contextual information;
• $A_i = \{a_i^1, \ldots, a_i^k\}$, the set of translator identifiers of each translation of $s_i$ in $T_i$ with $k \geq 1$;
• $c_i$, the context of $s_i$, referring to the tractate to which $s_i$ belongs.
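
As a reading aid, the tuple structure can be mirrored by a small Python data model. The class and field names (`Translation`, `TMEntry`, `literal`, `contextual`, `translator_id`) are hypothetical and do not come from the Traduco code base; pairing each translation with its translator identifier in one object is a simplifying assumption that merges the sets $T_i$ and $A_i$.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    literal: str        # part corresponding exactly to the source segment
    contextual: str     # explicative addition (contextual information)
    translator_id: str  # identifier of the translator (an element of A_i)


@dataclass
class TMEntry:
    source: str                                       # s_i
    context: str                                      # c_i, the tractate
    translations: list = field(default_factory=list)  # T_i paired with A_i

    def add(self, literal, contextual, translator_id):
        self.translations.append(Translation(literal, contextual, translator_id))


# Placeholder values only; no real Talmudic text is reproduced here.
entry = TMEntry(source="<source segment>", context="<tractate>")
entry.add("literal text", "first explanation", "translator-1")
entry.add("literal text", "second explanation", "translator-2")
print(len(entry.translations), {t.literal for t in entry.translations})
```

The two translations in the example share the same literal part and differ only in their contextual information, which is the situation the definition is meant to capture.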
3.1.2 Retrieval of similar segments
It is well known that most TMSs are based on variants of the Levenshtein distance, i.e. the minimum number of edit operations required to transform one string into another, normalized over the length of the query segment. In the case of the BT, we did not take into account variants aware of linguistic information, since no available NLP tools are suitable for processing ancient North-western Semitic languages, such as the different Hebrew and Aramaic varieties attested in the BT (see Section 4).

Fig. 2 Main interface of Traduco. (a) hierarchical structure of the translated text; (b) translation table and filter; (c) translation references: notes, glossary entries, translation suggestions
Fig. 3 Example of literal translations (bold font) and contextual information (plain font)

We chose to adopt a similarity measure $Sim$ based on edit distance, $ED(s_1, s_2)$, considering two segments to be more similar when the same terms tend to appear in the same order. Given a segment $s_q$ to be translated, the formula in Fig. 4 measures the similarity between $s_q$ and a segment $s$.¹⁶
One of the novelties we introduced is the way in
which suggestions are ranked. In particular, we con-
sider (1) the authors of translations and (2) the context
(the tractate of reference). Information about the author of a translation and its tractate of reference is useful for both translators and revisors. On the one hand, translators can evaluate the reliability of the suggested translations on the basis of the authority and expertise of their respective authors. On the other hand, revisors can exploit both kinds of information to ensure a more homogeneous and fluent translation.
Our algorithm is based on dynamic programming, and its implementation follows Navarro (2001). To compute $ED(s_1, s_2)$, it builds a matrix $M(0..|s_1|,\, 0..|s_2|)$, where each element $m_{i,j}$ represents the minimum number of token mutations required to transform $s_1(1..i)$ into $s_2(1..j)$. The computation process is shown in Fig. 5; it relies on the quantity $\min(m_{i-1,j},\, m_{i,j-1},\, m_{i-1,j-1})$, and the final cost is given by $m_{|s_1|,|s_2|}$. The TMS returns the translated strings having the lowest costs. Basically, given a segment to be translated, many other source segments can equal it exactly, and each of these segments can be paired with multiple translations in the TM. Figure 6 shows an example of the computation of $ED(s_1, s_2)$. Thanks to the technical solutions described in Section 2.2, the TMS can retrieve and present the translation suggestions in only a few milliseconds.
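
The dynamic-programming computation can be sketched as follows. Since Figs 4 and 5 are not reproduced in the text, the recurrence below is the standard token-level Levenshtein recurrence, assumed here to correspond to the one described; the function names, the whitespace tokenization, and the normalization used in `similarity` (relative to the longer segment, so that the result stays between 0 and 100) are likewise assumptions rather than the authors' exact formula.

```python
def edit_distance(s1, s2):
    """Token-level Levenshtein distance between two segments."""
    a, b = s1.split(), s2.split()  # assumption: whitespace tokenization
    # m[i][j] = minimum number of token edits turning a[:i] into b[:j]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        m[i][0] = i  # delete all i tokens
    for j in range(len(b) + 1):
        m[0][j] = j  # insert all j tokens
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                m[i][j] = m[i - 1][j - 1]  # tokens match: no extra cost
            else:
                m[i][j] = 1 + min(m[i - 1][j], m[i][j - 1], m[i - 1][j - 1])
    return m[len(a)][len(b)]  # final cost


def similarity(s, s_q):
    """Assumed normalization: percentage relative to the longer segment."""
    longest = max(len(s.split()), len(s_q.split()))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(s, s_q) / longest)


print(edit_distance("a b c d", "a b x d"), similarity("a b c d", "a b x d"))
```

Working on tokens rather than characters means that replacing one word counts as a single edit, which fits the idea that segments are more similar when the same terms appear in the same order.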
3.1.3 Presentation of translation suggestions
The Traduco user interface displays each suggestion with a number of stars, as shown in Fig. 7. The number is assigned on the basis of how fuzzy the match between source segments is: five-star suggestions are considered perfect (exact match, $Sim = 100\%$); four stars indicate that a few corrections are probably required (fuzzy match, $85\% \leq Sim \leq 99\%$); and three stars indicate, in most cases, acceptable suggestions (weak fuzzy match, $70\% \leq Sim \leq 84\%$). The TMS orders by context and by author the suggestions that are ranked with the same number of stars. Each translator, for example, can then approve as correct the literal part of a suggested translation and modify only its contextual information.
A translator can choose to filter the proposed
suggestions to visualize just his/her own transla-
tions, revised translations or translations belonging
to the particular tractate on which (s)he is working.
Of course, each new translation is added to the TM,
thus increasing the pool of translations available.
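
The star bands and the secondary ordering can be illustrated with a short Python sketch. The band boundaries come from the text above; everything else is an assumption made for illustration: the function names, the dictionary keys, the decision to hide suggestions below 70%, and the specific tie-breaking rule (same tractate first, then a preferred author), since the text does not say how context and author are compared.

```python
def stars(sim_percent):
    """Map a similarity percentage to the star bands described in the text."""
    if sim_percent >= 100:
        return 5  # exact match
    if sim_percent >= 85:
        return 4  # fuzzy match
    if sim_percent >= 70:
        return 3  # weak fuzzy match
    return 0      # assumption: weaker matches are not suggested


def rank(suggestions, current_context, preferred_author):
    """Order by stars, then same context first, then preferred author first."""
    shown = [s for s in suggestions if stars(s["sim"]) > 0]
    return sorted(
        shown,
        key=lambda s: (
            -stars(s["sim"]),
            s["context"] != current_context,  # False sorts before True
            s["author"] != preferred_author,
        ),
    )


# Hypothetical suggestions; "T1"/"T2" stand for tractates, "A"/"B" for authors.
suggestions = [
    {"sim": 90, "context": "T2", "author": "B", "text": "x"},
    {"sim": 100, "context": "T2", "author": "A", "text": "y"},
    {"sim": 92, "context": "T1", "author": "B", "text": "z"},
    {"sim": 50, "context": "T1", "author": "A", "text": "w"},
]
print([s["text"] for s in rank(suggestions, "T1", "A")])
```

In this example the two four-star suggestions are separated by context only: the one from the tractate being translated is listed first even though both fall in the same band.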
3.1.4 TMS performance
The evaluation of the performance of a system like
Traduco is not a trivial task. Unlike a typical CAT
tool, the aim of Traduco is not limited to increasing
the translation pace, but it is meant to support the
translation process by offering a collaborative envir-
onment in which users can translate their own por-
tions of texts by exploiting the translation of similar
source segments (that could greatly differ in the ex-
plicative additions) done by others.
Before carrying out an empirical evaluation, we
analysed the redundancy of the TM by considering
the similar segments. To estimate the TM perform-
ance we conducted a jackknife experiment (Wu
Fig. 4 Similarity function
Fig. 5 Computation process of ED(s1, s2)