When Multiwords Go Bad in Machine Translation

This presentation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.

Transcript

  • 1. WHEN MULTIWORDS GO BAD IN MACHINE TRANSLATION
    Anabela Barreiro, L2F - INESC-ID, Portugal
    Johanna Monti, University of Sassari, Italy
    Brigitte Orliac, Logos Institute, USA
    Fernando Batista, L2F - INESC-ID and ISCTE, Portugal
  • 2. OUTLINE
    • Introduction
    • Multiwords in NLP and MT
    • RBMT and SMT approaches to multiword processing
    • Multiword assessment
      – Corpus
      – Multiword taxonomy
      – Error categorization for multiword translations
      – Quantitative results
      – Analysis of relevant problems
    • OL semantico-syntactic rules for multiword translation precision
    • Main conclusions and future work
  • 3. INTRODUCTION
    • MT has become popular, widespread and useful to society
    • Every internet user is now using MT (with or without knowing it!)
    • HOWEVER, linguistic quality is still a serious problem, as translations contain morphological, syntactic and semantic errors
    • Successful MWU processing still represents one of the most significant linguistic challenges for MT systems
  • 4. MULTIWORDS IN NLP AND MT
    • Crucial role in NLP
    • Essential in MT
      – Unfeasible to include all MWU in dictionaries
      – Poor syntactic and semantic analysis reduces the performance of NLP systems
      – Fragmentation of any part of a MWU leads to generation errors
      – Incorrect MWU generation has a negative impact on the understandability and quality of the translated text
  • 5. CRITICAL PROBLEMS FOR MULTIWORD PROCESSING
    • Lack/degree of compositionality
    • Constituent dependencies
      – contiguous (adjacent): no inserts
      – non-contiguous (remote): inserts
    • Morphosyntactic variations
    Freely available RBMT and SMT fail at translating MWU:
      – RBMT systems fail for lack of MWU coverage
      – SMT systems fail for lack of linguistic (semantico-syntactic) knowledge to process them, leading to structural problems
  • 6. MT APPROACHES TO MULTIWORD PROCESSING
    • Importance of correct processing of MWU so that they can be translated correctly by MT systems (Sag et al., 2001; Thurmair, 2004; Rayson et al., 2010; Monti, 2013)
    • Solutions to resolve MWU translation problems:
      – Use of generative dependency grammars with features (Diaconescu, 2004)
      – Grouping bilingual MWU before performing statistical alignment (Lambert and Banchs, 2006)
      – Paraphrasing MWU (Barreiro, 2010)
  • 7. MULTIWORD PROCESSING IN RBMT
    • Lexical approach (sketched below)
      – MWU as single lemmata in dictionaries
      – Suitable for contiguous compounds
    • Compositional approach
      – MWU processing by means of POS tagging and syntactic analysis of its different components
      – Suitable for compounds not coded in the dictionary and for verbal constructions
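A minimal Python sketch of the lexical approach described above, assuming a toy MWU dictionary and greedy longest-match lookup; the entries, names and matching strategy are illustrative assumptions, not OpenLogos resources.

```python
# Lexical approach sketch: contiguous MWU stored as single dictionary entries
# and recognized by greedy longest-match lookup. Toy data, for illustration only.
MWU_DICT = {("air", "conditioning"): "COMPN", ("in", "front", "of"): "PREPADV"}
MAX_LEN = max(len(key) for key in MWU_DICT)

def tokenize_with_mwu(tokens):
    """Group contiguous tokens into a MWU when they match a dictionary entry."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in MWU_DICT:
                out.append(("_".join(key), MWU_DICT[key]))
                i += n
                break
        else:
            out.append((tokens[i], None))  # ordinary single word
            i += 1
    return out

print(tokenize_with_mwu("the air conditioning in front of the door".split()))
# [('the', None), ('air_conditioning', 'COMPN'), ('in_front_of', 'PREPADV'), ('the', None), ('door', None)]
```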
  • 8. MULTIWORD PROCESSING IN SMT
    • Traditional approach to word alignment (Brown et al., 1993)
      – Inability to handle many-to-many correspondences
    • Current state-of-the-art phrase-based SMT systems (Koehn et al., 2003)
      – The correct translation of a MWU occurs if its constituents are marked and aligned as parts of consecutive phrases in the training set
      – Phrases are defined as sequences of contiguous words (n-grams) with limited linguistic information (mostly syntactic); see the sketch below
        • "will stay" - linguistically meaningful
        • "that he" - no linguistic significance
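To make the contiguity restriction concrete, here is a minimal hypothetical sketch of phrase enumeration over contiguous n-grams; the example sentence is an illustrative assumption, not taken from the study's corpus, and the split phrasal verb "make N up" follows the taxonomy example later in the deck.

```python
# Phrase-based sketch: only contiguous n-grams become candidate phrases, so a
# non-contiguous MWU such as "make N up" never surfaces as a single phrase.
def contiguous_phrases(tokens, max_len=4):
    """Enumerate all contiguous n-grams up to max_len words."""
    return [tuple(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1)]

tokens = "she made the whole story up".split()
phrases = set(contiguous_phrases(tokens))

print(("made", "up") in phrases)                     # False: the MWU is never one phrase
print(("made", "the", "whole", "story") in phrases)  # True: only fragments are available
```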
  • 9. MULTIWORD PROCESSING IN SMT
    • MWU processing and translation in SMT as a problem of:
      – Automatically learning and integrating translations
      – Word sense disambiguation
      – Word alignment (Barreiro et al., 2013)
    • Integration of phrase-based models with linguistic knowledge:
      – Identification of possible monolingual MWU (Wu et al., 2008; Okita et al., 2010)
      – Integration of bilingual domain MWU in SMT (Ren et al., 2009)
      – Incorporation of machine-readable dictionaries and glossaries, treating these resources as phrases in the phrase-based table (Okuma et al., 2008)
      – Identification and grouping of MWU prior to statistical alignment (Lambert and Banchs, 2006)
  • 10. OUR WORK
    • Linguistic analysis and error categorization of the MWU translations
    • 2 MT systems: OpenLogos RBMT and Google Translate SMT
    • 3 language pairs: EN-FR, EN-IT and EN-PT
    • Analysis of MWU translations by 3 MT expert linguists
    • MWU taxonomy to evaluate MWU (in any system, independently of the approach)
    • OpenLogos solution to MWU processing in MT
  • 11. METHODOLOGY
    • Created a corpus of MWU
    • Translated the sentences containing the MWU into FR, IT and PT using the OL and GT systems
    • The purpose of our work WAS NOT to compare and evaluate systems; it WAS to assess and measure the quality of MWU translation independently of the two systems considered
    • Developed an empirically-driven taxonomy for MWU
    • Analysed the MWU translation errors based on this taxonomy
      – The different errors were categorized by MT expert linguists of the respective target languages
  • 12. CORPUS
    • 150 English sentences - news and internet
    • Average of ~4-5 MWU per sentence
    • The corpus was divided into 3 sets of 50 sentences, translated for each language pair by the 2 systems
    • 3 native linguists reviewed 50 sentences each for the 3 target languages and evaluated the MWU translations for each of these languages (1 evaluator per language), classifying the translations according to a binary evaluation metric (see the sketch below):
      – OK for correct translations
      – ERR for incorrect translations
    • None of the systems was specifically trained for the task - texts were not domain specific
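A minimal sketch, assuming a simple record format for these binary judgments, of how the OK/ERR verdicts can be tallied per system and language pair; the field layout and the example records are hypothetical, not the annotators' actual files.

```python
from collections import Counter

# Hypothetical judgment records: (system, language pair, MWU, verdict)
judgments = [
    ("OL", "EN-FR", "hit-run driver", "ERR"),
    ("GT", "EN-FR", "hit-run driver", "OK"),
    ("OL", "EN-PT", "raise the rent", "OK"),
]

# Count OK and ERR verdicts for every (system, language pair) combination
counts = Counter((system, pair, verdict) for system, pair, _, verdict in judgments)

for (system, pair, verdict), n in sorted(counts.items()):
    print(f"{system} {pair} {verdict}: {n}")
```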
  • 13. MULTIWORD TAXONOMY (Type / Subtype / Acronym / Example)
    VERB
      – Compound Verb (COMPV): may have been done, have [already] shown
      – Support Verb Construction (SVC): make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat; to play a [very important] role
      – Prepositional Verb (PREPV): deal with, give N to
      – Phrasal Verb (PHRV): closing down, make N up, slow down to; stand up to, mix N up with
      – Other Verbal Expression (VEXPR): in trying to, hold N in place
    NOUN
      – Compound Noun (COMPN): union spokesman, constraint-based grammar, air conditioning
      – Prepositional Noun (PREPN): interest in, right side of
    ADJECTIVE
      – Compound Adjective (COMPADJ): cost-cutting
      – Prepositional Adjective (PREPADJ): famous for; similar to
    ADVERB
      – Compound Adverb (COMPADV): in a fast way, most notably, last time
      – Prepositional Adverb (PREPADV): in front of
    DETERMINER
      – Compound Determiner (COMPDET): certain of these
      – Prepositional Determiner (PREPDET): most of
    CONJUNCTION
      – Compound Conjunction (COMPCONJ): in order to, as a result of, rather than
    PREPOSITION
      – Compound Preposition (COMPPREP): as part of
    OTHER EXPRESSION
      – Named Entity (NE): Economic Council
      – Idiomatic Expression (IDIOM): get to the bottom of the situation, purr like a cat; for goodness’ sake
      – Lexical Bundle (BUNDLE): I believe that, as much if not more than, if I were you
  • 14. RESULTS
    • The results shed some light on the demand for higher precision MWU translation
    • MWU occur frequently in our corpus - several times within the same sentence:
      "Witnesses said the speeding car may have been playing tag with another vehicle when it veered into the southbound lane occupied by Lopez' truck shortly before 8 p.m. Sunday"
      – may have been playing tag with - COMPV - idiomatic PREPSVC
      – veered into - PREPV
      – southbound lane - COMPN
      – 8 p.m. Sunday - double temporal expression (time + date)
  • 15. QUANTITATIVE RESULTS: correct and incorrect MWU translations
      System  Lang pair   OK   ERR  Total
      OL      EN-FR       40    48     88
      OL      EN-IT       36    83    119
      OL      EN-PT       60    96    156
      OL      Total      136   227    363
      GT      EN-FR       70    38    108
      GT      EN-IT       59    47    106
      GT      EN-PT       67    47    114
      GT      Total      196   132    328
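For readability, the counts above can also be expressed as the share of MWU translations judged OK; the percentages below are simple arithmetic over the table, not additional figures reported in the presentation.

```python
# Share of MWU translations judged OK, computed from the table above.
TABLE = {
    ("OL", "EN-FR"): (40, 48), ("OL", "EN-IT"): (36, 83), ("OL", "EN-PT"): (60, 96),
    ("GT", "EN-FR"): (70, 38), ("GT", "EN-IT"): (59, 47), ("GT", "EN-PT"): (67, 47),
}

for (system, pair), (ok, err) in TABLE.items():
    print(f"{system} {pair}: {ok}/{ok + err} OK ({ok / (ok + err):.0%})")
```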
  • 16. QUANTITATIVE RESULTS: performance for the 3 most frequent MWU
      EN-FR    OL Ok  OL Error  GT Ok  GT Error
      VERB       17      21       27      12
      COMPN       8      10       13      18
      NE          6       4       16       4

      EN-IT    OL Ok  OL Error  GT Ok  GT Error
      COMPN      14      39       26      21
      VERB       10      12        6      15
      NE          2       8       14       2

      EN-PT    OL Ok  OL Error  GT Ok  GT Error
      VERB       30      21       11      23
      COMPN      28      12       18      17
      NE         11      26        9       9
  • 17. MULTIWORDS “GOING BAD” IN FRENCH
    General language or domain-specific COMPN - 32.5%
      – hit-run driver: "pilot hit run" (incorrect) vs. "chauffeur/conducteur ayant commis un délit de fuite" (correct)
      – nuclear fuel cycle: "cycle de combustion nucléaire" (incorrect) vs. "cycle de combustible nucléaire" (correct)
    SVC - 18.6%
      – is a bit misleading (adjectival): "est un égarement de morceau" (incorrect) vs. "est quelque peu trompeur" (correct)
      – it has [wide] applicability (nominal): "il a l’applicabilité large" (incorrect) vs. "il a de nombreuses possibilités d’application" (correct)
  • 18. LOGOS APPROACH TO MULTIWORD TRANSLATION
    Main linguistic knowledge bases of the LOGOS system:
    • Dictionaries
    • Semantico-syntactic rules - analysis, transfer and generation
    • Semantic Table (SEMTAB) - language-pair specific rules
      – Analysis and translation of words in their context
      – Invoked after dictionary look-up and during the execution of target transfer rules to solve analysis and lexical ambiguity problems
      – Verb dependencies - different verb argument structures: speak to, speak against, speak of, speak on N (radio, TV, television, etc.)
      – MWU of different nature
  • 19. LOGOS APPROACH TO MULTIWORD TRANSLATION
    SAL - Semantico-syntactic Abstraction Language
    – Taxonomy: 3 levels organized hierarchically: Supersets / Sets / Subsets
    – Semantico-syntactic continuum from NL word to word class:
      • Literal word: airport
      • Head morph: port
      • SAL Subset: Agfunc (agentive functional location)
      • SAL Set: func (functional location)
      • SAL Superset: PL (place)
      • Word Class: N
    – SAL combines both the lexical and the compositional approaches in order to process different types of MWU
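A minimal sketch of walking a word up the SAL abstraction levels, using the "airport" example from the slide; the lexicon layout and function name are assumptions for illustration, not actual OpenLogos data structures.

```python
# Toy SAL-style lexicon entry for "airport", following the levels on the slide.
SAL_LEXICON = {
    "airport": {
        "head_morph": "port",
        "subset": "Agfunc",   # agentive functional location
        "set": "func",        # functional location
        "superset": "PL",     # place
        "word_class": "N",
    },
}

def abstraction_chain(word):
    """Return the word's representations from literal form up to word class."""
    e = SAL_LEXICON[word]
    return [word, e["head_morph"], e["subset"], e["set"], e["superset"], e["word_class"]]

print(" > ".join(abstraction_chain("airport")))
# airport > port > Agfunc > func > PL > N
```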
  • 20. RESOLUTION OF POLYSEMY
      NL String        SEMTAB Rule             Portuguese Transfer
      raise a child    V(‘raise’) N(ANdes)     criar ...
      raise corn       V(‘raise’) N(MAedib)    cultivar ...
      raise the rent   V(‘raise’) N(MEabs)     aumentar ...
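A minimal sketch of the kind of lexical selection these SEMTAB rules express: the Portuguese verb for "raise" is chosen from the semantic class of its object noun. The class codes and target verbs follow the slide; the dictionaries and function are illustrative assumptions, not the Logos rule formalism.

```python
# SAL classes of the example object nouns (from the slide).
NOUN_CLASS = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}

# SEMTAB-style choices for V('raise') depending on the object's class.
RAISE_RULES = {
    "ANdes": "criar",      # raise a child -> criar
    "MAedib": "cultivar",  # raise corn -> cultivar
    "MEabs": "aumentar",   # raise the rent -> aumentar
}

def transfer_raise(object_noun):
    """Pick the Portuguese verb for 'raise' from the SAL class of its object."""
    return RAISE_RULES[NOUN_CLASS[object_noun]]

for noun in ("child", "corn", "rent"):
    print(f"raise + {noun} ({NOUN_CLASS[noun]}) -> {transfer_raise(noun)}")
```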
  • 21. DEEP STRUCTURE RULES OF SEMTAB
    A single deep-structure rule matches multiple surface-structures and produces correct target transfers:
      – he raised the rent → ele aumentou a renda (V+Object)
      – the raising of the rent → o aumento da renda (Gerund)
      – the rent, raised by … → a renda, aumentada por… (Part. ADJ)
      – a rent raise → um aumento de renda (Noun)
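A minimal sketch of the idea that one deep-structure transfer can serve all four surface realizations above; the target strings come from the slide, while the shape of the transfer entry and the dispatch by construction type are illustrative assumptions.

```python
# One deep entry for RAISE + RENT, holding the target forms needed to realize
# the different surface structures listed on the slide.
TRANSFER = {
    ("raise", "rent"): {
        "verb_past": "aumentou",
        "noun": "aumento",
        "participle_fem": "aumentada",
        "object": "renda",
    }
}

def realize(construction, deep_pair):
    """Realize the single deep transfer as a given surface construction."""
    t = TRANSFER[deep_pair]
    surface = {
        "V+Object": f"ele {t['verb_past']} a {t['object']}",          # he raised the rent
        "Gerund": f"o {t['noun']} da {t['object']}",                  # the raising of the rent
        "Part. ADJ": f"a {t['object']}, {t['participle_fem']} por…",  # the rent, raised by …
        "Noun": f"um {t['noun']} de {t['object']}",                   # a rent raise
    }
    return surface[construction]

for c in ("V+Object", "Gerund", "Part. ADJ", "Noun"):
    print(c, "->", realize(c, ("raise", "rent")))
```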
  • 22. CONCLUSIONS
    • MWU – problematic for MT systems independently of the approach
    • Literal translations lead to unclear/incorrect translations or loss of meaning
    • Correct identification and analysis of source language MWU is a challenging task, but the starting point for higher quality MT
      – Linguistic quality evaluation metrics
      – Systematic categorization of errors by MT expert linguists
      – Specific corpora for MWU evaluation
    • OpenLogos approach to MWU processing uses semantico-syntactic rules, which can contribute to MWU translation quality with reference to any language pair
  • 23. FUTURE WORK
    • Research on how OpenLogos linguistic knowledge (SEMTAB) can be applied to an SMT system to correct MWU errors … and vice-versa
    • Successful combination of linguistic PRECISION (OL approach) and COVERAGE (GT approach) in resolving the MWU problem: an evolution in the MT field
    • Successful integration of semantico-syntactic knowledge in SMT: a solution for achieving high quality MT
      – The accomplishment of this task requires a combination of expertise in MT technology and deep linguistic knowledge to address the reverse research avenue: the integration of SMT technology/processes in RBMT to advance MT
  • 24. Thank you! This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.
