Traduco: A collaborative web-based CAT environment for the interpretation and translation of texts
Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi and
Giulia Benotto
Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio
Nazionale delle Ricerche, Pisa, Italy
Abstract
Traduco is a web-based collaborative tool aimed at supporting the translation of texts that pose particularly challenging interpretative issues. Nowadays, Computer-Assisted Translation (CAT) tools are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process. Traduco extends most of the standard components of a traditional CAT tool with specific features necessary to support the interpretation and translation of complex texts (such as the Babylonian Talmud, presented here as a case study), which pose particular comprehension issues. Traduco goes beyond the translation and its printing: it includes features for the addition of notes and annotations and the creation of glossaries. Translators, editors, supervisors, and end-users accessing Traduco are able to use components that can ease the translation process through the use of CAT technologies, the supervision and management of the whole process of translation and publishing, the export of translations and notes in standard formats for desktop publishing software and in TEI format, and, soon, the possibility to perform automatic linguistic analysis of the text. Moreover, Traduco allows the users to insert notes, comments, annotations, and bibliographical references. The design and development of Traduco required a multidisciplinary approach, drawing on advances in software engineering, computational linguistics, knowledge engineering, and publishing.
Correspondence: Emiliano Giovannetti, Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy. E-mail: emiliano.giovannetti@ilc.cnr.it
Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017. © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com
doi:10.1093/llc/fqw054 Advance Access published on 26 October 2016
1 Introduction
Traduco is a web-based collaborative tool aimed at supporting the translation of texts that pose particularly challenging interpretative issues. The development of Traduco was started in 2012 at the Institute for Computational Linguistics ‘A. Zampolli’ of the Italian National Research Council (ILC-CNR) for the translation of the Babylonian Talmud (BT) into Italian within the context of the ‘Progetto Traduzione del Talmud Babilonese’ (PTTB), monitored by the Italian Presidency of the Council of Ministers and coordinated by the Union of Italian Jewish Communities and the Italian Rabbinical College.
Nowadays, many Computer-Assisted Translation (CAT) tools, both commercial and free of charge, are already available.1 Professional translators use Translation Memories (TMs) on a regular basis (Lagoudaki 2009). The best-known commercial systems are Across,2 Déjà Vu,3 memoQ,4 MultiTrans,5 SDL Trados,6 Similis,7 Transit,8 and Wordfast;9 while non-commercial ones are OpenTM,10 OmegaT,11 Olanto,12 Transolution,13 and Matecat.14 They are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process, allowing translators to save a significant amount of time and effort.
The approach adopted for the development of Traduco is instead oriented towards the specific needs of translators working on texts with particular interpretative issues. To translate these texts, a translator is required to have two kinds of competence: in language (as a translator) and in the ‘content’ of the text to be translated (as a scholar). Since a portion of text (and sometimes even a single word) can be difficult to interpret and translate, the possibility provided by a collaborative environment to instantly consult translations done by others becomes a necessity. From the end-user’s point of view, the understanding of these texts requires a translation to be enriched with explanations, notes, and glossary entries.
Although some of the cited tools integrate Machine Translation (MT) techniques, the lack of linguistically annotated resources and large collections of parallel texts involving the source and the target languages has prevented us from considering any statistical MT toolkit. Instead, we have implemented a TM enabling translators to re-elaborate the plain and literal translation of the text and to integrate it with explicative additions. To the best of our knowledge, this is the first application of CAT technologies specifically designed to support the translation of complex texts like the BT.
In this article, we describe the main characteristics of Traduco (Section 2) and we discuss the features that make it different from other CAT tools (Section 3). Although initially designed to provide basic support in the collaborative translation of the Talmud, over the years Traduco has undergone several upgrades. Some of these involve the integration of state-of-the-art approaches aimed at improving the performance of the Translation Memory System (TMS), the component devoted to suggesting translations to users (Section 4).
At the current stage of development, Traduco is almost ready to be released as an open-source, language- and text-independent collaborative web environment for the translation of texts that are challenging from a scholarly point of view. However, we intend to continue to release new versions, with the integration of new features, as we shall briefly discuss in Section 5.
2 The Traduco System
Traduco is made up of various components, each implementing specific functionalities targeted at different types of users (Fig. 1). The system goes beyond the mere translation and its printing: it includes features for the addition of notes and annotations and the creation of glossaries. Translators, editors, supervisors, and end-users accessing Traduco will therefore be able to use components that can:
• ease the translation process through the use of CAT technologies, including indexers and TM tools (see Section 3.1);
• allow the users to insert notes, comments, annotations, and bibliographical references (see Section 3.2);
• supervise and manage the whole process of translation and publishing (see Section 3.3);
• export translations and notes in standard formats for desktop publishing software and Text Encoding Initiative (TEI) format (see Section 3.4);
• perform automatic linguistic analysis of the text (see Section 4).
As concerns the current use of Traduco, i.e. the translation of the BT, by the end of the project, Traduco will have produced two resources:
• the printed edition of the Italian translation of the BT;
• the digital edition of the translated and annotated BT, which users will be able to consult online.
The following two subsections describe the general architecture of Traduco and the technical solutions adopted for its implementation.
Fig. 1 General architecture of Traduco
2.1 Characteristics of the system
The design of the Traduco architecture took into account both the guidelines for the creation of models and tools for digital publishing suggested by the scientific community (Interedition15 first and foremost) and the actual user demands of the BT translation project. So far, none of the systems or frameworks available commercially, freely distributed in academic circles or, more generally, described in the literature (see Section 1) satisfies the many requirements posed by a modern environment for the translation of texts raising particular interpretative issues. Traduco was conceived to fulfil these requirements. In particular, the system is:
• Built with a component-based architecture: as with the pieces of a puzzle, developers willing to build their own system must be able to draw from a pool of independent basic components. The component-based architectural structure was facilitated by the technology adopted (i.e. the object-oriented Java programming language);
• Accessible through the Web: the Web is the ideal working environment for collaborative authoring and publishing activities; unlike desktop applications, which require the installation of specific client software on each computer, web-based applications require only a browser (e.g. Firefox, Safari, Chrome) and an Internet connection through which the user communicates with the system running on a remote server. The advantages of web-based applications are considerable; for example, a system update can be applied without the users having to update any software on their computers;
• Collaborative: the online environment, combined with the reliability of the technological framework used, allows a team of users (translators, revisors, editors, supervisors, domain experts, etc.) to work on the same data collaboratively; the system keeps track of the information already stored inside the database (organized as hierarchically structured translation fragments) and prevents the same sentence from being translated by more than one person. Furthermore, the supervisors can keep track, in real time, of the work done by the translators as they translate new sections of the text they have been assigned;
• Based on open-source technologies: software development based on open-source technology is encouraged by the scientific community and allows developers to autonomously implement ad hoc extensions and customizations to the code; in this case, the system is developed using the set of open-source technologies collected in the Java 2 Standard Edition (J2SE) framework, which over the years has become synonymous with the development of solid, secure, and efficient professional applications. J2SE is among the most stable, tested, and documented technological platforms for the integration of mission-critical systems that require distributed access, session transactionality, persistence management, and rich interface component libraries;
• Equipped with tools for text annotation and ready for language processing: when available, annotation and Natural Language Processing (NLP) tools can be applied to both the source and the translated texts for tasks such as semantic annotation, linguistic analysis (typically, morpho-syntactic tagging, stemming, or lemmatization), terminology extraction, named entity extraction, etc.; an annotated text can be, among other things, (1) queried on a linguistic, lexical, or semantic basis, and (2) used to boost the TMS;
• Adaptable to different languages: the text processing and linguistic analysis components included in a system should be relatively easy to adapt to different languages; the technology included in Traduco, for example, is based on UTF-8 for character encoding (thus covering the vast majority of languages) and on supervised statistical models for linguistic analysis that can be re-trained to process other languages (if pre-annotated corpora are available).
2.2 Technical solutions
From a technical point of view, Traduco was designed as a group of independent web-based components connected by interfaces. It is based on the software design pattern known as ‘three-tier architecture’ and uses Apache Tomcat v7.0 as its web server. The component-based architecture was implemented on the object-oriented J2SE framework, enhanced with Contexts and Dependency Injection annotations through the Weld v2.2.4 reference implementation. Relational persistence and query services are managed by Hibernate v4.3.7, which is responsible for mapping Java classes to the MySQL v5.0 database tables. To provide a very responsive TMS component, we adopted an inverted index data structure (Patil et al., 2011). Finally, the presentation layer was implemented with JavaServer Faces, a framework for building component-based user interfaces for web applications, using Oracle's Mojarra implementation v2.2.9 and the Primefaces v5.1 library.
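As a rough illustration of why an inverted index helps keep the TMS responsive, the sketch below maps each token to the identifiers of the segments that contain it, so that only segments sharing at least one token with the query need to be scored. It is a minimal Python sketch (Traduco itself is written in Java); the class name `InvertedIndex`, the whitespace tokenisation, and the sample strings are assumptions made for illustration, not details of the actual system.

```python
from collections import defaultdict


class InvertedIndex:
    """Hypothetical token -> segment-id index (illustrative only)."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> ids of segments containing it
        self._segments = {}                # segment id -> source text

    def add(self, seg_id, text):
        self._segments[seg_id] = text
        for token in text.split():         # assumption: whitespace tokenisation
            self._postings[token].add(seg_id)

    def candidates(self, query):
        """Return ids of segments sharing at least one token with the query."""
        found = set()
        for token in query.split():
            found |= self._postings.get(token, set())
        return found


index = InvertedIndex()
index.add(1, "alpha beta gamma")
index.add(2, "beta delta")
index.add(3, "epsilon zeta")
print(sorted(index.candidates("beta gamma")))  # prints [1, 2]
```

Segment 3 is never examined for this query, which is the point of the structure: the more expensive similarity computation is run only on the candidate set.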
3 The Components of Traduco
Traduco extends most of the standard components of a traditional CAT tool with specific features necessary to support the interpretation and translation of texts like the BT, which pose particular comprehension issues. The design and development of Traduco required a multidisciplinary approach, drawing on advances in software engineering, computational linguistics, knowledge engineering, publishing, and, in the case of its application to the translation of the BT, on Talmudic knowledge and Ancient Semitic linguistics.
The basic functioning of Traduco does not differ from that of other CAT tools. As shown in Fig. 2, a translator can view the hierarchical structure (on the left) of the translated text, organized, in the case of the BT, in tractates, chapters, blocks, logical units, and strings (the segments). In the central part of the interface, in the translation table, a translator can insert new segments paired with their translations, either manually one after the other, or by creating multiple segments all at once and then translating them separately. Just above the translation table, a collapsible ‘Filter’ section allows the users to execute searches, both on the source and on the translated texts. Finally, on the right of the table there are the notes, glossary entries, and translation suggestions relative to the selected pair of segments.
Traduco supports the translation of complex texts that often require specific reformulations in order to be correctly understood by non-scholars. Figure 3 shows how the translation of each segment can be performed by differentiating the ‘literal’ translation (in bold) from explicative additions, i.e. ‘contextual information’. Segments having the same literal part can then differ in their contextual information. The tractate to which the source segment of each translation belongs is called its ‘context’.
In the following subsections we outline the most relevant features implemented by the components of Traduco that have been developed to handle the translation of textual corpora with complex philological and linguistic peculiarities, as in the case of the BT.
3.1 Translation memory system
One of the core components of a CAT tool is the TMS. A TMS relies on a TM, a sentence-pair database that automatically stores all the translated text segments together with the source text during the translation process (Reinke 2013). The main purpose of a TMS is to allow translators to reuse existing translations. A TMS generally consists of:
• a TM database, containing pairs of segments $(s, t)$, where $s$ is the source-language segment of text and $t$ is its translation in the target language (see Section 3.1.1);
• a similarity function $Sim$;
• a threshold on the similarity value.
Given a segment $s_q$ to be translated, the TMS returns a translation $t_q$ by searching the TM for the best match, i.e. a pair $(s, t_q)$ whose similarity $Sim(s, s_q)$ is not below the threshold and is maximal, if such a pair exists (Sikes 2007). The function $Sim$ measures the similarity between two source-language segments. Typically, it produces a percentage value, where 100% stands for ‘identical segments’, i.e. exact match, and 0% for ‘completely different segments’. Intermediate percentage values are called fuzzy matches. The TMS ranks suggestions by similarity and presents them to translators. Typically, when there is no exact match, the translator needs to edit the proposed suggestions. The next subsections describe how the system retrieves pertinent translation suggestions (see Section 3.1.2), and how the system presents the suggestions to each translator (see Section 3.1.3).
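The best-match lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_sim` is a placeholder similarity built on the standard library's `difflib.SequenceMatcher`, not the measure used by Traduco, and the function names, the default threshold of 0.7, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def toy_sim(a, b):
    """Placeholder similarity in [0, 1]; not the measure used by Traduco."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(tm, query, sim=toy_sim, threshold=0.7):
    """Return (source, translation, score) for the most similar pair, or None."""
    best = None
    for source, translation in tm:
        score = sim(source, query)
        # Keep the pair only if it reaches the threshold and beats the current best.
        if score >= threshold and (best is None or score > best[2]):
            best = (source, translation, score)
    return best


tm = [("a b c d", "target-1"), ("x y z", "target-2")]
print(best_match(tm, "a b c d"))  # exact match: ('a b c d', 'target-1', 1.0)
print(best_match(tm, "q r s"))    # nothing reaches the threshold: None
```

A real TMS would return a ranked list of suggestions rather than a single pair; the single best match is shown here only because it mirrors the definition given in the text.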
3.1.1 Translation memory
The TM is organized at the segment level. A segment is a portion of original text having an arbitrary length. We formally define $TM = \{(s_i, T_i, A_i, c_i)\}$, with $i$ ranging from 1 to $n$, as a set of $n$ tuples, where each tuple is defined by:
• $s_i$, the source segment;
• $T_i = \{t_i^1, \ldots, t_i^k\}$, the set of translations of $s_i$ with $k \geq 1$, where each $t_i^j$ (with $1 \leq j \leq k$) includes a literal part exactly corresponding to the source segment and an explicative addition, hereafter referred to as contextual information;
• $A_i = \{a_i^1, \ldots, a_i^k\}$, the set of translator identifiers of each translation of $s_i$ in $T_i$, with $k \geq 1$;
• $c_i$, the context of $s_i$, referring to the tractate to which $s_i$ belongs.
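One possible in-memory representation of such a tuple is sketched below in Python. The class and field names (`TMEntry`, `Translation`, `literal`, `contextual`) are hypothetical and chosen for readability; the sketch pairs each translation with its translator identifier instead of keeping two parallel sets, which is a design assumption and not necessarily how Traduco stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    literal: str          # part corresponding exactly to the source segment
    contextual: str       # explicative addition ("contextual information")
    translator_id: str    # identifier of the translator who produced it


@dataclass
class TMEntry:
    source: str                                        # the source segment
    context: str                                       # the tractate the segment belongs to
    translations: list = field(default_factory=list)   # one item per translation

    def add(self, literal, contextual, translator_id):
        self.translations.append(Translation(literal, contextual, translator_id))

    def translators(self):
        """Translator identifiers, in the same order as the translations."""
        return [t.translator_id for t in self.translations]


entry = TMEntry(source="<source segment>", context="<tractate>")
entry.add("literal text", "added explanation", "translator-1")
entry.add("literal text", "a different explanation", "translator-2")
print(entry.translators())  # prints ['translator-1', 'translator-2']
```

The two sample translations share the same literal part and differ only in their contextual information, which is the situation the formal definition is meant to capture. Since the definition requires at least one translation per tuple, an entry with an empty list would not yet be a valid TM tuple.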
3.1.2 Retrieval of similar segments
It is well known that most TMSs are based on variants of the ‘Levenshtein’ distance, i.e. the minimum number of edit operations required to transform one string into another, normalized over the length of the query segment. In the case of the BT, we did not take into account variants aware of linguistic information, since no available NLP tools are suitable for processing ancient North-western Semitic languages, such as the different Hebrew and Aramaic idioms attested in the BT (see Section 4).
Fig. 2 Main interface of Traduco. (a) hierarchical structure of the translated text; (b) translation table and filter; (c) translation references: notes, glossary entries, translation suggestions
Fig. 3 Example of literal translations (bold font) and contextual information (plain font)
We chose to adopt a similarity measure $Sim$ based on edit distance, $ED(s_1, s_2)$, considering two segments to be more similar when the same terms tend to appear in the same order. Given a segment $s_q$ to be translated, the formula in Figure 4 measures the similarity between $s_q$ and a segment $s$.16
One of the novelties we introduced is the way in which suggestions are ranked. In particular, we consider (1) the authors of translations and (2) the context (the tractate of reference). Information about the author of a translation and about its tractate of reference is useful for both translators and revisors. On the one hand, translators can evaluate the reliability of the suggested translations on the basis of the authority and expertise of their authors. On the other hand, revisors can exploit both kinds of information to ensure a more homogeneous and fluent translation.
Our algorithm is based on dynamic programming, and its implementation follows Navarro (2001). To compute $ED(s_1, s_2)$, it builds a matrix $M_{0..|s_1|,\,0..|s_2|}$, where each element $m_{i,j}$ represents the minimum number of token mutations required to transform $s_1(1..i)$ into $s_2(1..j)$. The computation process is indicated in Figure 5, which uses the minimum over the three neighbouring cells, $\min(m_{i-1,j},\, m_{i,j-1},\, m_{i-1,j-1})$; the final cost is given by $m_{|s_1|,|s_2|}$. The TMS returns the translated strings having the lowest costs. Note that, given a segment to be translated, many other source segments can equal it exactly, and each of these segments can be paired with multiple translations in the TM. Figure 6 shows an example of $ED(s_1, s_2)$ computation. Thanks to the technical solutions described in Section 2.2, the TMS can retrieve and present the translation suggestions in only a few milliseconds.
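The dynamic programming scheme can be illustrated with the following Python sketch, which uses the standard token-level edit distance recurrence found in Navarro (2001). Since Figures 4 and 5 are not reproduced here, two points are assumptions: the recurrence may differ in detail from the one in Figure 5, and the normalisation in `similarity` (one minus the distance divided by the query length, floored at zero) stands in for the formula of Figure 4. Function names and whitespace tokenisation are likewise illustrative.

```python
def edit_distance(s1, s2):
    """Token-level edit distance computed by dynamic programming."""
    a, b = s1.split(), s2.split()
    # m[i][j] = minimum number of token edits turning a[:i] into b[:j]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        m[i][0] = i                      # delete all i tokens
    for j in range(len(b) + 1):
        m[0][j] = j                      # insert all j tokens
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            else:
                m[i][j] = 1 + min(m[i - 1][j], m[i][j - 1], m[i - 1][j - 1])
    return m[len(a)][len(b)]


def similarity(query, segment):
    """Assumed normalisation: 1 - ED / |query|, floored at 0."""
    n = len(query.split())
    if n == 0:
        return 0.0
    return max(0.0, 1.0 - edit_distance(segment, query) / n)


print(edit_distance("a b c d", "a x c"))   # prints 2 (one substitution, one deletion)
print(similarity("a b c d", "a b c e"))    # prints 0.75
```

Working on tokens rather than characters matches the description of $m_{i,j}$ as a count of token mutations, and it rewards segments in which the same terms appear in the same order.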
3.1.3 Presentation of translation suggestions
The Traduco user interface shows each suggestion accompanied by a number of stars, as shown in Fig. 7. The number is assigned on the basis of how fuzzy the match between source segments is: five-star suggestions are considered perfect (exact match, $Sim = 100\%$); four stars indicate that a few corrections are probably required (fuzzy match, $85\% \leq Sim \leq 99\%$); and three stars indicate, in most cases, acceptable suggestions (weak fuzzy match, $70\% \leq Sim \leq 84\%$). The TMS orders suggestions with the same number of stars by context and by author. Each translator can then, for example, approve the literal part of a suggested translation as correct and modify only its contextual information.
A translator can choose to filter the proposed
suggestions to visualize just his/her own transla-
tions, revised translations or translations belonging
to the particular tractate on which (s)he is working.
Of course, each new translation is added to the TM,
thus increasing the pool of translations available.
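The star bands and the ordering of equally rated suggestions can be sketched as follows in Python. The band boundaries come from the text; everything else is an assumption made for illustration: suggestions below 70% are simply hidden, and ties within a star band are broken by preferring suggestions from the tractate currently being translated and then the user's own translations. The dictionary keys and function names are hypothetical.

```python
def stars(sim_percent):
    """Map a similarity percentage to the star bands described in the text."""
    if sim_percent >= 100:
        return 5          # exact match
    if sim_percent >= 85:
        return 4          # fuzzy match
    if sim_percent >= 70:
        return 3          # weak fuzzy match
    return 0              # assumption: weaker matches are not shown


def rank(suggestions, current_context, current_author):
    """Order by stars, then same context first, then own translations first."""
    shown = [s for s in suggestions if stars(s["sim"]) > 0]
    return sorted(
        shown,
        key=lambda s: (
            -stars(s["sim"]),                 # more stars first
            s["context"] != current_context,  # False sorts before True
            s["author"] != current_author,
        ),
    )


suggestions = [
    {"text": "t1", "sim": 90, "context": "X", "author": "u2"},
    {"text": "t2", "sim": 100, "context": "Y", "author": "u2"},
    {"text": "t3", "sim": 88, "context": "Y", "author": "u1"},
    {"text": "t4", "sim": 50, "context": "Y", "author": "u1"},
]
print([s["text"] for s in rank(suggestions, "Y", "u1")])  # prints ['t2', 't3', 't1']
```

Here `t3` precedes `t1` even though its similarity is lower, because both fall in the four-star band and `t3` comes from the current context.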
3.1.4 TMS performance
The evaluation of the performance of a system like Traduco is not a trivial task. Unlike a typical CAT tool, Traduco is not limited to increasing the translation pace: it is meant to support the translation process by offering a collaborative environment in which users can translate their own portions of text by exploiting the translations of similar source segments (which could differ greatly in their explicative additions) done by others.
Before carrying out an empirical evaluation, we analysed the redundancy of the TM by considering the similar segments. To estimate the TM performance we conducted a jackknife experiment (Wu
Fig. 4 Similarity function
Fig. 5 Computation process of ED(s1, s2)
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...KokoStevan
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 

Recently uploaded (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 

Traduco: A collaborative web-based CAT environment for the interpretation and translation of texts

Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi and Giulia Benotto
Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio Nazionale delle Ricerche, Pisa, Italy

Abstract

Traduco is a web-based collaborative tool aimed at supporting the translation of texts that pose particularly challenging interpretative issues. Nowadays, Computer-Assisted Translation (CAT) tools are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process. Traduco extends most of the standard components of a traditional CAT tool with specific features necessary to support the interpretation and translation of complex texts (like the Babylonian Talmud, which we present here as a case study) that pose particular comprehension issues. Traduco goes beyond translation and printing: it includes features for adding notes and annotations and for creating glossaries. Translators, editors, supervisors, and end-users accessing Traduco can use components that ease the translation process through CAT technologies, support the supervision and management of the whole translation and publishing process, export translations and notes in standard formats for desktop publishing software and in TEI format, and, soon, perform automatic linguistic analysis of the text. Moreover, Traduco allows users to insert notes, comments, annotations, and bibliographical references.
The design and development of Traduco required the adoption of a multidisciplinary approach, drawing on advances in software engineering, computational linguistics, knowledge engineering, and publishing.

Correspondence: Emiliano Giovannetti, Istituto di Linguistica Computazionale ‘A. Zampolli’, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy. E-mail: emiliano.giovannetti@ilc.cnr.it

Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017, i47. © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com. doi:10.1093/llc/fqw054. Advance Access published on 26 October 2016. Downloaded from https://academic.oup.com/dsh/article-abstract/32/suppl_1/i47/2418172 by guest on 09 April 2020.

1 Introduction

Traduco is a web-based collaborative tool aimed at supporting the translation of texts that pose particularly challenging interpretative issues. The development of Traduco started in 2012 at the Institute for Computational Linguistics ‘A. Zampolli’ of the Italian National Research Council (ILC-CNR) for the translation of the Babylonian Talmud (BT) into Italian within the context of the ‘Progetto Traduzione del Talmud Babilonese’ (PTTB), monitored by the Italian Presidency of the Council of Ministers and coordinated by the Union of Italian Jewish Communities and the Italian Rabbinical College.

Nowadays, many Computer-Assisted Translation (CAT) tools, both commercial and free of charge, are already available.¹ Professional translators use Translation Memories (TMs) on a regular basis (Lagoudaki 2009). The best-known commercial systems are Across,² Déjà Vu,³ memoQ,⁴ MultiTrans,⁵ SDL Trados,⁶ Similis,⁷ Transit,⁸ and Wordfast;⁹ non-commercial ones are OpenTM,¹⁰ OmegaT,¹¹ Olanto,¹² Transolution,¹³ and Matecat.¹⁴ They are mainly applied to the translation of technical manuals or legislative texts and are aimed at speeding up the translation process, allowing translators to save a significant amount of time and effort.

The approach adopted for the development of Traduco is instead oriented towards the specific needs of the community of translators working on texts with particular interpretative issues. To translate these texts, a translator needs two kinds of competence: in language (as a translator) and in the ‘content’ of the text to be translated (as a scholar). Since a portion of text (and sometimes even a single word) can be difficult to interpret and translate, the possibility provided by a collaborative environment to instantly consult translations done by others becomes a necessity. From the end-user’s point of view, understanding these texts requires a translation enriched with explanations, notes, and glossary entries.

Although some of the cited tools integrate Machine Translation (MT) techniques, the lack of linguistically annotated resources and of large collections of parallel texts involving the source and target languages has prevented us from considering any statistical MT toolkit. Instead, we have implemented a TM enabling translators to re-elaborate the plain and literal translation of the text and to integrate it with explicative additions. To the best of our knowledge, this is the first application of CAT technologies specifically designed to support the translation of complex texts like the BT.
In this article, we describe the main characteristics of Traduco (Section 2) and discuss the features that distinguish it from other CAT tools (Section 3). Although initially designed to provide basic support for the collaborative translation of the Talmud, Traduco has undergone several upgrades over the years. Some of these involve the integration of state-of-the-art approaches aimed at improving the performance of the Translation Memory System (TMS), the component devoted to suggesting translations to users (Section 4).

At the current stage of development, Traduco is almost ready to be released as an open-source, language- and text-independent collaborative web environment for the translation of challenging scholarly texts. We nevertheless intend to keep releasing new versions that integrate new features, as we briefly discuss in Section 5.

2 The Traduco System

Traduco is made up of various components, each implementing specific functionalities targeted at different types of users (Fig. 1). The system goes beyond mere translation and printing: it includes features for adding notes and annotations and for creating glossaries. Translators, editors, supervisors, and end-users accessing Traduco will therefore be able to use components that can:

  • ease the translation process through the use of CAT technologies, including indexers and TM tools (see Section 3.1);
  • allow the users to insert notes, comments, annotations, and bibliographical references (see Section 3.2);
  • supervise and manage the whole process of translation and publishing (see Section 3.3);
  • export translations and notes in standard formats for desktop publishing software and in Text Encoding Initiative (TEI) format (see Section 3.4);
  • perform automatic linguistic analysis of the text (see Section 4).

As concerns the current use of Traduco, i.e. the translation of the BT, by the end of the project Traduco will have produced two resources:

  • the printed edition of the Italian translation of the BT;
  • the digital edition of the translated and annotated BT, which users will be able to consult online.
The following two subsections describe the general architecture of Traduco and the technical solutions adopted for its implementation.

2.1 Characteristics of the system

The design of the Traduco architecture took into account both the guidelines for the creation of models and tools for digital publishing suggested by the scientific community (Interedition¹⁵ first and foremost) and the actual user demands of the project for the BT translation. So far, none of the systems or frameworks available commercially, freely distributed in academic circles or, more generally, described in the literature (see Section 1) satisfies the multiplicity of requirements of a modern environment for the translation of texts raising particular interpretative issues. Traduco was conceived to fulfil these requirements.

Fig. 1 General architecture of Traduco

The system is:

  • Built with a component-based architecture: as with the pieces of a puzzle, developers willing to build their own system must be able to draw from a pool of independent basic components. The component-based architectural structure was facilitated by the technology adopted (i.e. the object-oriented Java programming language);
  • Accessible through the Web: the Web is the ideal working environment for collaborative authoring and publishing activities; unlike desktop applications, which require the installation of specific client software on each computer, web-based applications require only a browser (e.g. Firefox, Safari, Chrome) and an Internet connection through which the user can communicate with the system running on a remote server. The advantages of web-based applications are considerable; for example, a system update can be applied without the users having to update any software on their computers;
  • Collaborative: the online environment, combined with the reliability of the technological framework used, allows a team of users (translators, revisers, editors, supervisors, domain experts, etc.) to work on the same data collaboratively; the system keeps track of the information already stored in the database (organized as hierarchically structured translation fragments) and prevents the same sentence from being translated by more than one person.
    Furthermore, supervisors can keep track, in real time, of the work done by the translators as they translate the new sections of text they have been assigned;
  • Based on open-source technologies: software development based on open-source technology is encouraged by the scientific community and allows developers to autonomously implement ad hoc extensions and customizations to the code; in this case, the system is developed using the set of open-source technologies collected in the Java 2 Standard Edition (J2SE) framework, which over the years has become synonymous with the development of solid, secure, and efficient professional applications. J2SE is a stable, well-tested, and well-documented technological platform for the integration of mission-critical systems that require distributed access, session transactionality, persistence management, and rich interface component libraries;
  • Equipped with tools for text annotation and ready for language processing: when available, annotation and Natural Language Processing (NLP) tools can be applied to both the source and the translated texts for tasks such as semantic annotation, linguistic analysis (typically morpho-syntactic tagging, stemming, or lemmatization), terminology extraction, named entity extraction, etc.; an annotated text can be, among other things, (1) queried on a linguistic, lexical, or semantic basis, and (2) used to boost the TMS;
  • Adaptable to different languages: the text processing and linguistic analysis components included in a system should be relatively easy to adapt to different languages; the technology included in Traduco, for example, is based on UTF-8 for character encoding (thus covering the vast majority of languages) and on supervised statistical models for linguistic analysis that can be re-trained to process other languages (if pre-annotated corpora are available).
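The collaborative constraint described above, where the system prevents the same sentence from being translated by more than one person, can be illustrated with a minimal Python sketch. The `SegmentRegistry` class, its method names, and the path-like segment identifiers are hypothetical and are not taken from Traduco, which stores its data in a relational database; the sketch only shows a first-claim-wins policy held in memory.

```python
class SegmentRegistry:
    """Hypothetical in-memory record of which translator holds each segment."""

    def __init__(self):
        # Maps a segment identifier to the identifier of its translator.
        self._owner = {}

    def claim(self, segment_id, translator_id):
        """Assign the segment to the translator unless someone else holds it.

        Returns True when the translator holds the segment after the call.
        """
        owner = self._owner.setdefault(segment_id, translator_id)
        return owner == translator_id

    def owner_of(self, segment_id):
        """Return the translator holding the segment, or None if unclaimed."""
        return self._owner.get(segment_id)


# Illustrative identifiers only; they do not reflect Traduco's data model.
registry = SegmentRegistry()
first = registry.claim("tractate-1/chapter-1/segment-3", "translator_a")
second = registry.claim("tractate-1/chapter-1/segment-3", "translator_b")
```

In a multi-user deployment the same rule would more plausibly be enforced by the database, for instance through a uniqueness constraint, so that concurrent requests cannot both succeed.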
2.2 Technical solutions

From a technical point of view, Traduco was designed as a group of independent web-based components connected by interfaces. It is based on the software design pattern known as ‘three-tier architecture’ and uses Apache Tomcat v7.0 as its web server. The component-based architectural structure was implemented with the object-oriented J2SE framework, enhanced with Contexts and Dependency Injection annotations, using the Weld v2.2.4 reference implementation. Relational persistence and query services are managed by Hibernate v4.3.7, which is responsible for the mapping from Java classes to the MySQL v5.0 database tables. To provide a very responsive TMS component, we adopted an inverted index data structure (Patil et al., 2011). Finally, the presentation layer was implemented with JavaServer Faces, a framework for building component-based user interfaces for web applications, based on the Oracle Mojarra implementation v2.2.9 and using the PrimeFaces v5.1 library.

3 The Components of Traduco

Traduco extends most of the standard components of a traditional CAT tool with specific features necessary to support the interpretation and translation of texts like the BT, which pose particular comprehension issues. The design and development of Traduco required the adoption of a multidisciplinary approach, drawing on advances in software engineering, computational linguistics, knowledge engineering, publishing, and, in the case of its application to the translation of the BT, on Talmudic knowledge and Ancient Semitic linguistics.

The basic functioning of Traduco does not differ from that of other CAT tools. As shown in Fig. 2, a translator can view the hierarchical structure (on the left) of the translated text, organized, in the case of the BT, in tractates, chapters, blocks, logical units, and strings (the segments). In the central part of the table, a translator can insert new segments paired with their translations, either manually one after the other, or by creating multiple segments all at once and then translating them separately. Just above the translation table, a collapsible ‘Filter’ section allows users to run searches on both the source and the translated texts. Finally, on the right of the table there are the notes, glossary entries, and translation suggestions relative to the selected pair of segments.

Traduco can support the translation of complex texts that often require specific reformulations to be correctly understood by non-scholars.
Figure 3 shows how the translation of each segment can be performed by differentiating the ‘literal’ translation (in bold) from explicative additions, i.e. ‘contextual information’. Segments having the same literal part can then differ in their contextual information. The tractate of the source segment of each translation is called ‘context’.

In the following subsections we outline the most relevant features implemented by the components of Traduco that have been developed to tackle the translation of textual corpora with complex philological and linguistic peculiarities, as in the case of the BT.

3.1 Translation memory system

One of the core components of a CAT tool is the TMS. TMSs rely on a TM consisting of a sentence-pair database that automatically stores all the translated text segments together with the source text during the translation process (Reinke 2013). The main purpose of a TMS is to allow translators to reuse existing translations. A TMS generally consists of:

  • a TM database, containing pairs of segments (s, t), where s is the source-language segment of text and t is its translation in the target language (see Section 3.1.1);
  • a similarity function Sim;
  • a similarity threshold.

Given a segment s_q to be translated, the TMS returns a translation t_q by searching the TM for the best match, i.e. a pair (s, t_q) whose similarity Sim(s, s_q) is maximal and not below the threshold, if such a pair exists (Sikes 2007). The function Sim measures the similarity between two source-language segments. Typically, it produces a percentage value, where 100% stands for ‘identical segments’, i.e. an exact match, and 0% for ‘completely different segments’. Intermediate percentage values are called fuzzy matches. The TMS ranks suggestions by similarity and presents them to translators. Typically, when there is no exact match, the translator needs to edit the proposed suggestions.
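The best-match lookup just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Traduco's implementation: the function names, the toy English/Italian pairs, and the 0.7 threshold are invented for the example, and `difflib.SequenceMatcher` over whitespace tokens is only a stand-in for the Sim function, which the article defines through the formula in Figure 4.

```python
from difflib import SequenceMatcher


def token_similarity(a, b):
    """Stand-in similarity in [0, 1] over whitespace tokens (not the paper's Sim)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(tm, query, sim=token_similarity, threshold=0.7):
    """Return (source, translation, score) for the most similar stored source
    segment, or None when no stored segment reaches the threshold."""
    best = None
    for source, translation in tm:
        score = sim(source, query)
        if score >= threshold and (best is None or score > best[2]):
            best = (source, translation, score)
    return best


# Toy translation memory: (source segment, translation) pairs, invented here.
tm = [
    ("the red door is open", "la porta rossa è aperta"),
    ("the window is closed", "la finestra è chiusa"),
]

exact = best_match(tm, "the red door is open")    # identical source segment
fuzzy = best_match(tm, "the red door is closed")  # one token differs
missing = best_match(tm, "something else entirely")
```

Passing the similarity function as a parameter keeps the lookup independent of how Sim is computed, so an edit-distance-based measure can be swapped in without changing `best_match`.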
The next subsections describe how the system retrieves pertinent translation suggestions (see Section 3.1.2) and how it presents them to each translator (see Section 3.1.3).

3.1.1 Translation memory

The TM is organized at the segment level. A segment is a portion of original text of arbitrary length. We formally define TM = {(s_i, T_i, A_i, c_i)}, with i ranging from 1 to n, as a set of n tuples, where each tuple is defined by:

  • s_i, the source segment;
  • T_i = {t_i^1, ..., t_i^k}, the set of translations of s_i, with k ≥ 1, where each t_i^j (1 ≤ j ≤ k) includes a literal part corresponding exactly to the source segment and an explicative addition, hereafter referred to as contextual information;
  • A_i = {a_i^1, ..., a_i^k}, the set of translator identifiers of each translation of s_i in T_i, with k ≥ 1;
  • c_i, the context of s_i, referring to the tractate to which s_i belongs.

Fig. 2 Main interface of Traduco. (a) hierarchical structure of the translated text; (b) translation table and filter; (c) translation references: notes, glossary entries, translation suggestions

Fig. 3 Example of literal translations (bold font) and contextual information (plain font)

3.1.2 Retrieval of similar segments

It is well known that most TMSs are based on variants of the ‘Levenshtein’ distance, i.e. the minimum number of edit operations required to transform one string into another, normalized over the length of the query segment. In the case of the BT, we did not take into account variants aware of linguistic information, since no available NLP tools are suitable for processing ancient North-western Semitic languages, such as the different Hebrew and Aramaic idioms attested in the BT (see Section 4).

We chose to adopt a similarity measure Sim based on edit distance, ED(s_1, s_2), considering two segments to be more similar when the same terms tend to appear in the same order. Given a segment s_q to be translated, the formula in Figure 4 measures the similarity between s_q and a segment s.¹⁶

One of the novelties we introduced is the way in which suggestions are ranked. In particular, we consider (1) the authors of translations and (2) the context (the tractate of reference). Information about the author of a translation and about the tractate of reference is useful to both translators and revisers. On the one hand, translators can evaluate the reliability of the suggested translations on the basis of the authority and expertise of the translators who produced them. On the other hand, revisers can exploit both kinds of information to ensure a more homogeneous and fluent translation.

Our algorithm is based on dynamic programming, and its implementation follows Navarro (2001). To compute ED(s_1, s_2), it builds a matrix M(0..|s_1|, 0..|s_2|), where each element m(i, j) represents the minimum number of token mutations required to transform s_1(1..i) into s_2(1..j). The computation process is indicated in Figure 5, where the minimum is taken over the three neighbouring cells, min(m(i − 1, j), m(i, j − 1), m(i − 1, j − 1)), and the final cost is represented by m(|s_1|, |s_2|). The TMS returns the translated strings having the lowest costs. Given a segment to be translated, many other source segments can match it exactly, and each of these segments can be paired with multiple translations in the TM. Figure 6 shows an example of ED(s_1, s_2) computation.
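As an illustration of the dynamic-programming scheme described above, the following Python sketch computes a token-level edit distance with the standard Levenshtein recurrence. The function names are hypothetical, and the normalisation used in `similarity` (one minus the distance divided by the token count of the longer segment) is an assumption made for this example, since the formula of Figure 4 is not reproduced in the text.

```python
def edit_distance(s1, s2):
    """Minimum number of token insertions, deletions, and substitutions
    needed to turn the token sequence s1 into the token sequence s2."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # m[i][j] holds the cost of transforming s1[:i] into s2[:j].
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i  # delete all i tokens of the prefix of s1
    for j in range(cols):
        m[0][j] = j  # insert all j tokens of the prefix of s2
    for i in range(1, rows):
        for j in range(1, cols):
            if s1[i - 1] == s2[j - 1]:
                m[i][j] = m[i - 1][j - 1]  # tokens agree: no extra cost
            else:
                m[i][j] = 1 + min(m[i - 1][j], m[i][j - 1], m[i - 1][j - 1])
    return m[-1][-1]


def similarity(query, candidate):
    """Assumed normalisation: 1 - ED / max(token counts), a value in [0, 1]."""
    q, c = query.split(), candidate.split()
    if not q and not c:
        return 1.0
    return 1.0 - edit_distance(q, c) / max(len(q), len(c))


# Segments are compared token by token, so word order matters.
same_order = similarity("a b c d", "a b x d")
swapped = similarity("a b c d", "b a d c")
```

Because the distance is computed over tokens rather than characters, two segments sharing the same words in a different order score lower than two segments that differ in a single word, which matches the stated preference for terms appearing in the same order.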
Thanks to the technical solutions described in Section 2.2, the TMS can retrieve and present the translation suggestions in only a few milliseconds.

3.1.3 Presentation of translation suggestions

The Traduco user interface shows each suggestion accompanied by a number of stars, as shown in Fig. 7. The number is assigned on the basis of how fuzzy the match between source segments is: five-star suggestions are considered perfect (exact match, Sim = 100%); four stars indicate that a few corrections are probably required (fuzzy match, 85% ≤ Sim ≤ 99%); and three stars indicate, in most cases, acceptable suggestions (weak fuzzy match, 70% ≤ Sim ≤ 84%). The TMS orders suggestions ranked with the same number of stars by context and by author. Each translator can then, for example, approve the literal part of a suggested translation as correct and modify only its contextual information.

A translator can choose to filter the proposed suggestions to display only his/her own translations, revised translations, or translations belonging to the particular tractate on which (s)he is working. Each new translation is added to the TM, thus increasing the pool of translations available.

3.1.4 TMS performance

Evaluating the performance of a system like Traduco is not a trivial task. Unlike a typical CAT tool, Traduco does not aim only to increase the translation pace: it is meant to support the translation process by offering a collaborative environment in which users can translate their own portions of text by exploiting the translations of similar source segments (which may differ greatly in their explicative additions) done by others.

Fig. 4 Similarity function

Fig. 5 Computation process of ED(s1, s2)

Before undergoing an empirical evaluation, we analysed the redundancy of the TM by considering the similar segments. To estimate the TM performance we conducted a jackknife experiment (Wu