New Tools and Resources to Support Machine Translation

Anabela Barreiro
barreiro_anabela@hotmail.com
FLUP & CLUP-Linguateca
New York University
Mestrado em Tradução Jurídica e Empresarial (Master's in Legal and Business Translation)
Lisbon, 8 January 2008

Outline

Human Translation vs Machine Translation
A clear distinction in objectives and purpose must be drawn
between human translation and machine translation!
•They use different methods
•They apply to different types of texts
•They serve different purposes
•They face different barriers
•They are NOT in competition!

Human Translation
Professional translation requires:
•a profound knowledge of the source language and native
proficiency in the target language
•above-average writing skills
•an insightful knowledge of the social-cultural aspects of the
source and target languages
•knowledge of the grammar of the two languages, their
writing conventions, and the situational and cultural context
•In the case of scientific and technical translation, subject
matter knowledge is required, including terminologies of the
field or knowledge domain.

Human Translation
Theory of translation has been dealing with controversial
issues:
•problems related to privileging meaning over form
•visibility or invisibility of the translator
•being faithful to the author or trying to make the text
accessible to the reader (and which kind of reader)
•giving value to the source language culture (foreignise) or
making the text suitable for the target language culture
(domesticate)
•Allowing languages/cultures with more impact to
predominate over languages/cultures with less impact, or being
creative, etc.

Human Translation
The most relevant aspect in translation is to define the
purpose of each translation, which is related to the
characteristics of each text.
… And to define paraphrasing capabilities.

Human Translation: Types of Texts
A certain subjectivity and distance from the source
language text is allowed in translation of literary text for the
sake of maintaining the artistic and aesthetic aspects of the
target language text [Hermans, 1985] [Landers, 2001].
Literary translation may be considered an ART [Leighton,
1990] [Weaver, 2002], where the translator has more freedom
of expression.

Human Translation: Types of Texts
Technical, commercial, and legal translators, like the
authors of the original texts, are more restrained in their use of
language, and they need to be precise and convey the exact
meaning of the original text.
Technical texts are not meant to be beautiful but rather
to be informative, instructive and explanatory. Their main
function is to be clear, so the easier they are to read, the better
they are understood.
Technical translation may be regarded as a CRAFT
[Newmark, 1988] [Biguenet & Schulte, 1989] for which both
technical and linguistic competence are essential, but creativity
and vagueness are prohibited.

Machine Translation
With more translation being performed by machines,
new challenges are imposed on the field, theoretical traditions
are shaken, and the need to rethink the status of translation
becomes more evident. Of all automated applications, machine
translation compels us to reconsider the nature of translation.
ART and CRAFT are NOT appropriate concepts for
machine translation, because it must necessarily rely on
linguistics and computer science.

Machine Translation
1- Automated translation of text or speech from one natural
language into another
2- An important tool that assists human translators
3- It has become available to the general public in the last few
years due to:
• sophisticated computers
• continuous development of computer software capabilities
• internet boom

Machine Translation (cont.)

Machine Translation Bottlenecks
1.Complexity of language
2.Ambiguity of language
3.Wordiness (related to text quality)

Machine Translation: Limitations
• The task of delivering high-quality machine translation of certain
types of texts and complex linguistic phenomena is difficult
• It is difficult to grasp humour, sarcasm, and other human feelings
expressed in/by means of sophisticated linguistic expression
• It is difficult to handle extra-sentential, extra-textual and
extra-linguistic information (problems of culture or context),
because knowledge of the world cannot be assumed
• It is difficult to deal with anaphora resolution

Machine Translation Linguistic Challenges
1.Homography
2.Cross-language phenomena (lexical divergences and idioms
and cross-language syntactic transformations, such as
passives)
3.Identification of named entities
4.Capacity to deal with long sentences and wordiness
5.Unusual alterations to the order of words in the target
language
6.Enhanced dictionaries and grammars to recognize and
translate multiword expressions

Machine Translation Linguistic Challenges: Examples
• Handling of ellipsis
advanced ambiguity problems – related to anaphora
O João visitou muitos países do mundo. A Maria não visitou nenhum.
=> João has visited many countries in the world. Maria hasn’t visited any.

• Common-noun nuance resolution / homography
(1) ele não quis tomar partido de ninguém
(2) ele é um bom partido
(3) ele tirou partido da situação
(4) ele pertence a esse partido (político)
(5) o copo está partido
(6) já esteve em melhor partido

Portuguese source: Francisco Vieira adianta ainda que está a fazer um esforço no sentido de tomar uma decisão ainda esta semana, definindo se avança ou não para uma candidatura à RTLRS. [CdP]
Translation Engine | Translation Results
FreeTranslation | Francisco Scallop advances even if is it do an effort in the sense of take a decision still this week, defined advances or not for a candidacy to the RTLRS.
WorldLingo | advances despite he is to make an effort in the direction to still take a decision this week, defining if he advances or he does not stop a candidacy to the RTLRS.

English source: I can't make a decision about anything these days. [Compara]
Translation Engine | Translation Results
Google | Eu não posso fazer a uma decisão sobre qualquer coisa estes dias.
Amikai | que eu não posso fazer para uma decisão sobre qualquer coisa estes dias.
FreeTranslation | Eu não posso tomar uma decisão sobre algo estes dias.
Babelfish | Eu não posso fazer a uma decisão sobre qualquer coisa estes dias.
WorldLingo | Eu no posso fazer a uma deciso sobre qualquer coisa estes dias.
E-Translation Server | Não posso tomar uma decisão sobre qualquer coisa estes dias.

Multiword Expressions: Support Verb Constructions
A support verb construction (= predicate noun construction)
is a multiword expression containing a verb with weak semantic value
and a noun that is the predicate of the sentence.
Predicate nouns can be:
•morphologically related to a verb
fazer uma apresentação de = apresentar
pay a visit to = to visit
•autonomous
fazer um mestrado - *mestrar
have fun - *to fun

Main Objectives
1.Build a body of lexical, syntactic and semantic knowledge
around support verb constructions
2.Apply this linguistic knowledge to paraphrasing
3.Improve machine translation

Outcome: Resources
Port4NooJ
•an open-source, ontology-driven Portuguese linguistic
system, which integrates a bilingual extension for
Portuguese-English machine translation
DicTUM
•Dicionário de Termos e Unidades Multipalavra
•a Dictionary of Multiword Expressions

Outcome: Tools
ReWriter
•a monolingual paraphraser to pre-edit texts, using
paraphrasing capabilities
•Portuguese version ReEscreve
ParaMT
•a bilingual/multilingual paraphraser to be integrated in
machine translation systems

Resources
Port4NooJ - Publicly available at:
http://www.nooj4nlp.net
http://www.linguateca.pt/Repositorio/Port4Nooj/
Based on:
•NooJ linguistic environment (http://www.nooj4nlp.net/)
•OpenLogos English-Portuguese dictionary (http://logos-os.dfki.de/)
OpenLogos is an open-source derivative of the Logos Machine Translation System
Data Used
•COMPARA (http://www.linguateca.pt/COMPARA)
•METRA (http://www.linguateca.pt/metra)
•Other corpora

Port4NooJ Dictionaries
Entry fields, in order: Lemma, Part of Speech, Inflectional Paradigm (FLX=), Syntactic-Semantic Attributes, English Transfer (EN=)

General dictionary sample representing all PoS, variable and invariable forms
mesa,N+FLX=CASA+CO+surf+EN=table
cair,V+FLX=ATRAIR+INMO+IntoType+EN=fall
holandês,A+FLX=INGLÊS+AN+lang+EN=Dutch
actualmente,ADV+FLX=FACILMENTE+TEMP+punc+pres+EN=nowadays
alguém,PRO+IMPERS+INDEF+EN=somebody
porque,RELINT+why+EN=why
e,CONJ+JOIN+EN=and
durante,PREP+TEMP+EN=during
cada,DET+IMPERS+INDEF+SG+EN=each
terceiro+NUM+ord+EN=one third
a curto prazo,ADV+TEMP+EN=in the short run
a favor de,PREP+CAUS+EN=in favor of
cada um,PRO+INDEF+SG+EN=each one
de quem,INT+ThatType+EN=whose
quem quer que seja,REL+WhateverType+EN=whoever
além disso,CONJ+COOR+EN=besides
um quarto,NUM+frac+EN=one fourth

Sample of the dictionary of Terms and Multiword Expressions (DicTUM)
adro da igreja,N+FLX=MENINO+PL+encl+EN=churchyard
cabo de vassoura,N+FLX=MENINO+COtool+EN=broomstick
bebida alcoólica,N+FLX=CASA+MA+liqu+EN=alcoholic drink+UNAMB
bebida alcoólica,N+FLX=CASA+MA+liqu+EN=booze+slang
cor de laranja,A+NAV+Apred+EN=orange
sul-americano,A+FLX=ALTO+AN+des+EN=South American

Sample of invariable compounds in the general dictionary
a curto prazo,ADV+LocTime+TEMP+EN=in the short run
fora de serviço,ADV+STAT+phr+EN=out of order
há muito tempo,ADV+LocTime+TEMP+puncpast+EN=a long time ago
isto é,CONJ+COOR+EN=i.e.
já não,CONJ+COOR+EN=no longer
mesmo assim,CONJ+SUB+EN=even so
juntamente com,PREP+ASSOC+EN=along with
à direita de,PREP+Loc+AT+EN=at the right of
em conformidade com,PREP+ALOG+EN=in congruence with

Sample of the dictionary of Biomedical Terms
HIV,N+FLX=PORTUGAL+AB+state+IMMUN+EN=HIV
doença maníaco-depressiva,N+FLX=CASA+AB+state+MH+EN=manic-depressive disorder
doença bipolar,N+FLX=CASA+AB+state+MH+EN=bipolar disorder
asma,N+FLX=CASA+AB+state+PULM+EN=asthma

Sample of the dictionary of Proper Names
Amesterdão,N+PL+city+EN=Amsterdam
Estados Unidos da América,N+PL+coun+EN=United States of America
África,N+PL+cont+EN=Africa
Extremo Oriente,N+PL+othprop+EN=Far East
Mediterrâneo,N+FLX=ANO+PL+water+EN=Mediterranean
Alpes Peninos,N+FLX=ALPES+PL+othprop+EN=Pennine Alps
ONU,N+AN+org+EN=UN

Port4NooJ Dictionaries
Sample of terms
classified as Information
+ Instructional/legal
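
The dictionary entries shown in the samples can be read mechanically. The following is a minimal Python sketch, not the NooJ implementation. It assumes that the first comma separates the lemma from the rest of the entry, that "+" separates fields, and that fields containing "=" (such as FLX and EN) are key/value pairs while the others are bare attributes. The names Entry and parse_entry are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary line split into its parts (hypothetical structure)."""
    lemma: str
    pos: str
    features: dict = field(default_factory=dict)    # key=value fields, e.g. FLX, EN
    attributes: list = field(default_factory=list)  # bare fields, e.g. CO, surf


def parse_entry(line: str) -> Entry:
    """Split 'lemma,POS+field+field...' into an Entry.

    Assumption: the lemma contains no comma and field values contain no '+'.
    """
    lemma, _, info = line.partition(",")
    pos, *fields = info.split("+")
    entry = Entry(lemma=lemma.strip(), pos=pos.strip())
    for item in fields:
        key, sep, value = item.partition("=")
        if sep:
            entry.features[key] = value
        else:
            entry.attributes.append(item)
    return entry


if __name__ == "__main__":
    sample = parse_entry("mesa,N+FLX=CASA+CO+surf+EN=table")
    print(sample.lemma, sample.pos, sample.features, sample.attributes)
```

Keeping key/value fields apart from bare attributes makes it easy to look up the English transfer (EN) or the inflectional paradigm (FLX) without knowing how many syntactic-semantic attributes an entry carries.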

Syntactic-Semantic Ontology

    Abstract representation language
    Hierarchical taxonomy (sets, supersets and (sometimes) subsets)
    Based on Logos SAL ontology
    Integrated in the dictionary
    It represents both meaning (semantics), and structure (syntax)
    Over 1,000 categories


Noun Supersets
concrete
mass
animate
place
information
abstract
process (intr)
process (tr)
measure
time
aspective
Sets and Subsets of the CONCRETE Noun Superset
•functionals: receptacles; bearing surfaces; links/bridges; thresholds, focal points, barriers; conduits; fasteners; devices, tools; cloth things; structural elements; concretizations of verbals; concretizations of mass nouns; undifferentiated functionals; product/brand names
•agentives: software; vehicles; meters; machines/systems; communication agents; concrete chemical agents; undifferentiated agentives
•natural things: minute flora; plants; trees; trees/wood; miscellaneous natural things
•other concrete sets*: impulses/lights; blemishes/marks; edibles (non-mass); edibles/color; classifiers; amorphous; atomistic; undifferentiated concrete things
*With one exception, these sets have no subsets
Category | Mnemonic | Examples in English | Examples in Portuguese
agentives | CO+undagt | See subsets | See subsets
software | CO+soft | routine | rotina, ficheiro
concrete chemical agents | CO+chem | catalyst, warhead | ácido sulfúrico
machines/systems | CO+mach | battery, camera | máquina fotográfica
vehicles | CO+vehic | truck, ship | automóvel
meters | CO+meter | clock, gauge | manómetro
communication agents | CO+comm | radio, radar | rádio
functionals | CO+undfunc | trinket, ornament | ornamento
devices/tools | CO+tool | pliers | alicate
fasteners | CO+fast | nail, tendon | prego
bearing surfaces | CO+surf | table, shelf | mesa
receptacles | CO+recp | bottle, barrel | garrafa
conduits | CO+cond | chute, artery | artéria
thresholds/focal points/barriers | CO+barr | wall, door | porta
links/bridges | CO+link | circuit, nerve | circuito
cloth things | CO+cloth | shirt, blanket | camisola
structural elements | CO+struc | spar, bone | osso
concretizations of verbals | CO+verb | threading |
concretizations of mass nouns | CO+mass | acid lining |
product/brand names | CO+brand | Windows NT | Windows NT
natural things | CO+nat | See subsets | See subsets
minute flora | CO+flora | algae, spore | alga
plants | CO+plant | rose, weed | erva
trees | CO+tree | apple, willow | macieira
trees/wood | CO+trwd | oak, maple | carvalho
misc. natural things | CO+mnat | pebble, iceberg | iceberg
edibles (non-mass) | CO+ednm | pork chop | costoleta
edibles/color | CO+edcol | orange, cherry | laranja
impulses/lights | CO+light | lamp, beam | lâmpada
blemishes/marks | CO+blem | scratch, freckle | sarda
classifiers | CO+class | element | elemento
amorphous | CO+amor | breeze, tide | brisa
atomistic | CO+atom | electron, atom | átomo
undifferentiated | CO+obj | trifle, curio |

Categories of
CONCRETE nouns
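
The superset / set / subset hierarchy can be pictured as a nested lookup keyed by mnemonic. The sketch below is only an illustration under stated assumptions: it covers a handful of CONCRETE mnemonics taken from the table, it assumes the grouping of subsets under functionals, agentives and natural things shown on the slide, and the names TAXONOMY and describe are hypothetical. It does not reproduce how the ontology is actually stored in Port4NooJ or Logos SAL.

```python
# Hypothetical, partial encoding of the CONCRETE superset:
# superset code -> name and sets; set name -> {subset code: subset name}.
TAXONOMY = {
    "CO": {
        "name": "concrete",
        "sets": {
            "functionals": {
                "surf": "bearing surfaces",
                "recp": "receptacles",
                "tool": "devices/tools",
            },
            "agentives": {
                "soft": "software",
                "vehic": "vehicles",
                "mach": "machines/systems",
            },
            "natural things": {
                "plant": "plants",
                "tree": "trees",
            },
        },
    },
}


def describe(mnemonic: str):
    """Return (superset, set, subset) for a mnemonic such as 'CO+surf'.

    Returns None when the code is not in this partial table.
    """
    superset_code, _, subset_code = mnemonic.partition("+")
    superset = TAXONOMY.get(superset_code)
    if superset is None:
        return None
    for set_name, subsets in superset["sets"].items():
        if subset_code in subsets:
            return (superset["name"], set_name, subsets[subset_code])
    return None


if __name__ == "__main__":
    # 'mesa' carries the attributes CO+surf in the general dictionary sample.
    print(describe("CO+surf"))
```

A dictionary entry such as mesa,N+FLX=CASA+CO+surf+EN=table would then resolve to the path concrete, functionals, bearing surfaces.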

ME - MEASURE Noun Sets and Subsets
Sets and Subsets | Mnemonics (= SynSem) | Examples
abstract concepts measured by unit | ME+abs | humidity, length
discrete measurable concepts | ME+dis | sum, increment
units of measure | ME+unit | See subsets
units of weight | ME+unit+wt | ounce, pound
units of velocity | ME+unit+vel | mph, megahertz
units of volume measure | ME+unit+vol | gallon, liter
units of temperature | ME+unit+temp | degrees celsius
units of energy/force | ME+unit+ener | watt, horsepower
measurement systems | ME+unit+sys | fahrenheit, kelvin
units of duration | ME+unit+dur | hour, minute, year
specialized units of measure | ME+unit+spec | oersted, ohm, phon
units of money/value | ME+unit+value | dollar, euro, forint
units of linear/area measure | ME+unit+lin | inch, yard, mile
general undifferentiated measure | ME+undif | degree, gross, share

Categories of
MEASURE nouns

Inflectional and Derivational Description
•Noun Inflectional Paradigm
•Adjective Inflectional Paradigm
•Pronoun Inflectional Paradigm
•Verb Inflectional Paradigm
•Adverb Inflectional Paradigm
•Determiner Inflectional Paradigm
•Interrogative Pronoun Inflectional Paradigm
•Nominalization Derivational Paradigm

Paraphrasing and Translation Grammars
Translation and
bilingual paraphrasing
of simple sentences
Graph to translate simple
sentences

Verb entries:
• Identification of derivational paradigms for nominalizations
(annotation NDRV) and predicate adjectives (annotation ADRV)
• Link to the derived noun’s support verbs and to the adjective’s
copula verbs (annotation VSUP and annotation VCOP)
adaptar,V+FLX=FALAR+Aux=1+INOP57+Subset=132+EN=adapt+VSUP=fazer+DRV=NDRV00:CANÇÃO
azedar,V+FLX=LIMPAR+Aux=1+OBJTRundif98+Subset=740+EN=sour+VCOP=estar+DRV=ADRV00:ALTO
Explicit Marking of Derivation and Support Verb

Adjective entries:
• Identification of derivational paradigms for adverbializations
(annotation AVDRV)
literal,A+FLX=PRINCIPAL+IN+symb+EN=literal+DRV=AVDRV00:LITERALMENTE
Autonomous predicate nouns:
• Identification of autonomous predicate nouns (annotation
Npred)
• Identification of a semantically related verb
curso,N+FLX=ANO+Npred+IN+inst+EN=course+VSUP=tirar+VRB=estudar+NPrep=de+Det=um
Explicit Marking of Derivation and Semantic Verb Association
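
The annotations on the curso entry carry enough information to assemble a support verb construction and its verbal paraphrase. The sketch below is one possible reading, not the actual ReWriter grammar: it assumes that VSUP, Det, the lemma and NPrep combine in that order ("tirar um curso de") and that VRB names the verb that can replace the whole sequence ("estudar"). It handles only the uninflected form, the test sentence is invented for illustration, and the function names are hypothetical.

```python
import re

# Entry copied from the slide on autonomous predicate nouns.
CURSO = "curso,N+FLX=ANO+Npred+IN+inst+EN=course+VSUP=tirar+VRB=estudar+NPrep=de+Det=um"


def parse_features(entry: str) -> dict:
    """Collect the lemma and every key=value field of a dictionary entry."""
    lemma, _, info = entry.partition(",")
    features = {"lemma": lemma}
    for item in info.split("+"):
        key, sep, value = item.partition("=")
        if sep:
            features[key] = value
    return features


def svc_rule(entry: str):
    """Build (support verb construction, paraphrasing verb) from an entry.

    Assumed word order: support verb, determiner, predicate noun, preposition.
    """
    f = parse_features(entry)
    svc = " ".join([f["VSUP"], f["Det"], f["lemma"], f["NPrep"]])
    return svc, f["VRB"]


def paraphrase(text: str, entry: str) -> str:
    """Replace the uninflected SVC in text with the related verb."""
    svc, verb = svc_rule(entry)
    return re.sub(r"\b" + re.escape(svc) + r"\b", verb, text)


if __name__ == "__main__":
    print(svc_rule(CURSO))
    print(paraphrase("Ela quer tirar um curso de medicina.", CURSO))
```

A real paraphraser would also need to match inflected forms of the support verb and inserted modifiers, which this string replacement does not attempt.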

ReWriter: a Monolingual Standalone Paraphraser
Recognition and monolingual paraphrasing
of support verb constructions
(support verb construction / morphologically related lexical verb)

ReWriter: Examples
Recognition and paraphrasing of elementary
support verb constructions
co-occurring with predicate nouns
of the biomedical field
(support verb construction / lexical verb or
stylistic variant / non-elementary support verb
construction)
Elementary SVC > Lexical Verb
Elementary SVC > non-elementary SVC
realizar/efectuar
Elementary SVC > sujeitar-se a
submeter-se a
ONLY if the SUBJECT is a patient

ReWriter: Application - Interface
Interactive ReWriter
for word processing applications
such as text editing

ReWriter: Application - Interface

ReWriter: Extensibility
1.Applications to General Language
2.Applications to Technical Language

ReWriter: Extensibility - Examples
[Paraphrasing adverbials]
à volta da órbita ≡ periorbital (popular versus technical)
around the orbit of the eye ≡ periorbital
[Paraphrasing relative clauses - into adjectival past
participles]
N0 que têm sido escritos ≡ N0 que foram descritos ≡ N0 escritos
N0 that have been written ≡ N0 that were described ≡ N0 written

[Paraphrasing if clauses]
se for necessário ≡ se necessário
if it is necessary ≡ if necessary

[Paraphrasing coordinated noun phrases - conjoining
or disjoining]
recursos linguísticos para o ensino e para a investigação
Ŧ ?linguistic resources for teaching and for research
≡ recursos linguísticos para o ensino e a investigação
Ŧ linguistic resources for teaching and research
[Paraphrasing subjunctive clauses - into infinitives]
pedimos o favor que confirme a sua participação
Ŧ *we ask the favor that you confirm your attendance
≡ pedimos o favor de confirmar a sua participação
Ŧ *we ask the favor of confirming your attendance

[Paraphrasing marked-up constructions]
se a necessidade do utilizador é criar um texto em linguagem controlada
Ŧ ?if the end-user need is to create controlled language text
≡ se o utilizador necessita de criar um texto em linguagem controlada
Ŧ if the end-user needs to create controlled language text
[Paraphrasing of vague and undefined or null subject sentences]
(whenever the real subject/actor is known)
[-] houve um grito na rua [N-PRON]/≡ alguém gritou na rua
Ŧ there was shouting in the street [N-PRON]/≡ someone shouted in the
street

[Paraphrasing passives - whenever suitable]
Esse livro foi escrito por Saramago em 2008 ≡ Saramago escreveu
esse livro em 2008
That book was written by Saramago in 2008 ≡ Saramago wrote that
book in 2008
Florida foi atingida por um tornado ≡ Um tornado atingiu a Florida
Florida was hit by a tornado ≡ A tornado hit Florida
O carro foi roubado ≡ Alguém roubou o carro
The car was stolen ≡ Someone stole the car

ParaMT: a Bilingual/Multilingual Paraphraser for MT
Recognition and bilingual paraphrasing of support verb constructions
(Portuguese support verb construction / corresponding English verb)

Preliminary Quantitative Results

Support verb | SVC Recognition Precision | SVC Recognition Recall | SVC Paraphrasing Precision
Pôr | 73/73 - 100% | 73/100 - 73% | 72/73 - 98.6%
Tomar | 75/75 - 100% | 75/100 - 75% | 68/73 - 93.1%
Ter | 65/65 - 100% | 65/100 - 65% | 59/65 - 90.7%
Dar | 57/60 - 95% | 57/100 - 57% | 46/51 - 90.1%
Fazer | 43/45 - 95.5% | 43/100 - 43% | 40/45 - 88.8%
Average | 62.6/63.6 - 98.4% | 62.6/100 - 62.6% | 57/61 - 93.4%
Evaluation of recognition and paraphrasing
of support verb constructions
500 sentences
100 for each elementary support verb
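
The recognition figures follow the usual definitions of precision and recall. The Python sketch below recomputes them from the counts in the table, assuming that each numerator is the number of correct recognitions, that the precision denominator is the number of constructions the system recognized, and that the recall denominator (100) is the number of sentences sampled per support verb. The variable and function names are illustrative only.

```python
# Counts copied from the recognition columns of the results table:
# verb -> (correct recognitions, all recognitions, sentences in the sample)
RECOGNITION = {
    "Pôr": (73, 73, 100),
    "Tomar": (75, 75, 100),
    "Ter": (65, 65, 100),
    "Dar": (57, 60, 100),
    "Fazer": (43, 45, 100),
}


def precision(correct: int, recognized: int) -> float:
    """Share of recognized constructions that are correct."""
    return correct / recognized


def recall(correct: int, expected: int) -> float:
    """Share of expected constructions that were correctly recognized."""
    return correct / expected


def pooled_scores(table: dict):
    """Precision and recall over all verbs, summing the counts first."""
    correct = sum(c for c, _, _ in table.values())
    recognized = sum(r for _, r, _ in table.values())
    expected = sum(e for _, _, e in table.values())
    return precision(correct, recognized), recall(correct, expected)


if __name__ == "__main__":
    for verb, (c, r, e) in RECOGNITION.items():
        print(f"{verb}: precision {precision(c, r):.3f}, recall {recall(c, e):.3f}")
    p, r = pooled_scores(RECOGNITION)
    print(f"All verbs: precision {p:.3f}, recall {r:.3f}")
```

Summing the counts gives 313/318 and 313/500, which are the same ratios as the 62.6/63.6 and 62.6/100 reported in the Average row.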

Conclusions
Linguistic knowledge applied to a machine
translation system improves its output quality.
Effective results from linguistically based research
on paraphrases can save substantial effort and
resources employed by machine translation systems.

Thank you for your attention!
Acknowledgements
This work was partly supported by grant SFRH/BD/14076/2003
from Fundação para a Ciência e a Tecnologia, co-financed by
POSI and partly by Fundação para a Computação Científica
Nacional.
