Cross language alignments - challenges guidelines and gold sets

Published in: Technology
  • (Spoken in Portuguese:) Before starting this presentation, I would like to thank Priberam for the opportunity to show our work at this seminar. And now, I will proceed in English… Good morning. My name is Anabela Barreiro. I am an invited researcher at the Spoken Language Systems Laboratory, at INESC-ID, Lisbon. Today, I will present “Cross-Language Alignments: Challenges, Guidelines and Gold Sets”, done in collaboration with my colleagues Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista and João Graça. In this presentation, I will describe the key cross-language annotation guidelines to provide support for machine translation systems. The guidelines aim at improving the quality of the machine translation output by using linguistically-informed and motivated annotation of special case multiwords and semantico-syntactic translation units.
  • This presentation is divided into three parts. I will describe CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
  • I will focus on the challenges to the alignment of special cross-linguistic cases, such as multiword units, lexical and non-lexical realization (the pro-drop phenomenon, determiners and zero determiners), noun adjuncts, and idiosyncrasies of each language.
  • The main use of word alignments is SMT. [Brown et al., 1990] introduced the concept of word alignment and applied it directly to an SMT system. [Och and Ney, 2004] used it as a primary resource for phrase-based machine translation. [Galley et al., 2004] used it as a resource for syntax-based machine translation.
  • In recent years, with the increase of freely available parallel corpora, a huge development took place in SMT. Many workshops and evaluation tasks have been dedicated to multi-language word alignment. Some projects too. For example, the Blinker project aimed at aligning words between French and English texts. Many word alignment guidelines have been suggested.
  • Redefinition of word alignment: word and multiword, phrase, expression – translation unit.
  • Despite the growing number of available multi-language sentence-aligned parallel corpora and alignment tools, the number of publicly available manual word alignments is restricted to a few language pairs. Word alignment is a desirable resource.
  • The guidelines were based on the alignment of bilingual texts of the common test set of the publicly available Europarl corpus, which contains proceedings of the European Parliament in the different official languages of the EU. The work provides 6 gold alignment sets. The bilingual texts cover all possible combinations between the English, Spanish, French, and Portuguese languages.
  • CLUE-Aligner is a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
  • Only when one of these elements is elided do we use block alignments. When the elements are lexically realized, determiners, pronouns and the individual elements of the relatives are single-aligned with the corresponding elements in the parallel sentence. Exceptions: discontinuous multiword units with a small number of inserts are aligned.
  • Other examples of aligned and not aligned MWU. Phrasal verbs: aligned – look into the problem – debruçar-se sobre este problema; not aligned – (230). Verb compounds: aligned – has also increased (22); not aligned - . French negation: aligned – ne pas; not aligned - .
  • (da presidência is S-aligned with presidency.) Presidency communication is in the corpus – but it does not sound right!
  • NOT A GOOD SOLUTION – it does not account for the double negation structure.
  • The gold collection and alignment tool are publicly available.

    1. Cross-Language Alignments: Challenges, Guidelines and Gold Sets
       Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
       technology from seed
       L2F - Spoken Language Systems Laboratory
       Anabela Barreiro, Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista, João Graça
    2. Outline – Part 1
       • Word alignment
       • Basic concepts
       • Applications
       • State of the art
       • Limitations
       • Paraphrase alignment
       • Multiword, meaning and translation unit alignment: importance
       • Our task
       • Alignment tool: CLUE-Aligner
    3. Outline – Part 2
       • General annotation guidelines
       • Cross-linguistic major challenges to word alignment
       • Annotation guidelines for multiword units and lexical and non-lexical realization phenomena
       • Pro-dropping
       • Articles and zero articles
       • Examples: continuous multiword units
       • Examples: continuous and discontinuous support verb constructions
       Challenges covered:
       – Preposition-dependency (V, N and Adj)
       – Active vs passive
       – Choice of noun pre-modifiers
       – Different PoS with same semantics (V vs process N)
       – Noun adjuncts
       – Coordination
       – Anaphora: choice of co-referents
       – Impersonal constructions
       – Contractions
       – Style: idiomatic vs non-idiomatic
       – Antonyms and negation constructions
       – Romance languages double negation
       – Singular vs plural
       – Flexible/loose paraphrasing constructions
       – Idiosyncrasies of each language
    4. Outline – Part 3
       • Our contribution
       • Annotation process
       • Preliminary results
       • Discussion
       • Future work
    5. Word Alignment: Basic Concepts
       • Objects representing the mapping of words (or expressions) which are semantically equivalent in a source and a target sentence of a parallel corpus [Brown et al., 1990]
         – Matrix of n * m entries, where n is a position in the source sentence and m is a position in the target sentence. An entry a_{n,m} in that matrix specifies whether the word at position n is part of a translation of the word at position m in the target language
       • Task of word alignment - identifying translational equivalences (= semantic correspondences) in the aligned sentence pairs of a parallel text [Hearne & Way, 2011]
       • Translational equivalences - graphically represented in a grid by the intersection of single segments (individual words) or blocks (semantico-syntactic units, phrases, expressions)
    6. Word Alignment: Basic Concepts
       • Sure alignment (S-alignment)
         – Unambiguous and valid in all contexts
           • EN system
           • ES sistema
           • FR système
           • PT sistema
       • Possible alignment (P-alignment)
         – Ambiguous and invalid in some contexts
           • EN be
           • ES ser/estar/haber/existir
           • FR être/avoir/exister
           • PT ser/estar/haver/existir
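The matrix view of word alignment and the sure/possible distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence pair, the `Link` enum, and the helper names `build_alignment_matrix` and `links_of` are hypothetical and are not taken from CLUE-Aligner or from the gold sets. The `is`/`é` link is marked as possible by analogy with the `be` → `ser/estar/…` example on the slide; the other links are illustrative choices.

```python
from enum import Enum


class Link(Enum):
    """Alignment status of one (source position, target position) cell."""
    NONE = 0
    SURE = 1      # S-alignment: unambiguous, valid in all contexts
    POSSIBLE = 2  # P-alignment: ambiguous, invalid in some contexts


def build_alignment_matrix(source, target, links):
    """Return a len(source) x len(target) grid of Link values."""
    matrix = [[Link.NONE for _ in target] for _ in source]
    for (n, m), kind in links.items():
        matrix[n][m] = kind
    return matrix


def links_of(matrix, kind):
    """List the (n, m) cells of the grid that carry the given link kind."""
    return [(n, m)
            for n, row in enumerate(matrix)
            for m, cell in enumerate(row)
            if cell is kind]


# Hypothetical EN-PT sentence pair, chosen only for illustration.
source = ["the", "system", "is", "new"]
target = ["o", "sistema", "é", "novo"]
links = {
    (0, 0): Link.SURE,
    (1, 1): Link.SURE,      # system / sistema
    (2, 2): Link.POSSIBLE,  # a form of "be" / a form of "ser"
    (3, 3): Link.SURE,
}

matrix = build_alignment_matrix(source, target, links)
sure_links = links_of(matrix, Link.SURE)
possible_links = links_of(matrix, Link.POSSIBLE)

for n, m in sure_links:
    print(f"S: {source[n]} - {target[m]}")
for n, m in possible_links:
    print(f"P: {source[n]} - {target[m]}")
```

A block alignment could be represented in the same grid by linking every cell in a rectangle of source and target positions; that extension is left out here to keep the sketch short.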
    7. Word Alignment: Applications
       • Statistical machine translation
         – [Brown et al., 1990] – statistical machine translation
         – [Och and Ney, 2004] – phrase-based machine translation
         – [Galley et al., 2004] – syntax-based machine translation
       • Annotation projection
       • Extraction of bilingual lexica
       • Evaluation of machine translation systems
    8. Word Alignment: State of the Art
       • Workshops and evaluation tasks (multi-language)
         – http://www.cse.unt.edu/~rada/wp/
         – http://www.statmt.org/wpt05
         – http://www.lpl.univ-aix.fr/projects/arcade
       • Projects
         – Blinker project – French-English: http://nlp.cs.nyu.edu/blinker/
       • Guidelines: [Melamed, 1998], [Och and Ney, 2000], [Lambert et al., 2005], [Kruijff-Korbayová et al., 2006], [Graça et al., 2004]
    9. Word Alignment: Limitations
       • Language does not operate on a word-for-word basis
       • A large number of words are undissociated
         – Multiword units
           • [Gross and Senellart, 1998] - +40% of 1 year of Le Monde are MWU
           • [Sag et al., 2002] – 50-70% of specialized lexica are MWU
           • [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+ words (not including general-purpose MWU, e.g., generic compounds, lexical bundles, phrasal verbs, fixed expressions, which also occur in domain-specific texts)
         – Translation units
         – Meaning units
         – Paraphrases
       • Segment and block alignment (sure and possible)
    10. Example: Segment and Block Alignment (Sure and Possible)
    11. Paraphrase Alignment
        • Monolingual
          – [Callison-Burch et al., 2006]
            • Annotation guidelines for paraphrase alignment
            • Paraphrases - sentences that convey the same meaning but are worded differently
            • Alignment of words, phrases, expressions, within the same language
        • Bilingual = (non-literal) translation
          – Need to account for paraphrases across languages
    12. Multiword, Meaning and Translation Unit Alignment: Importance
        • Publicly available manual word alignments are restricted to a few language pairs
        • Manual word alignments are a desired resource
          – Evaluation of word alignment algorithms
          – Training of supervised and semi-supervised algorithms
          – Tuning of parameters for different types of model
        • But, “name”, “concept” and “techniques” of alignment need to be linguistically sophisticated to be more useful and help provide improved machine translation!
    13. Our Task
        • EuroParl corpus [Koehn, 2005]
        • 6 gold alignment sets
          – 400 alignments each set (400x6=2,400)
        • Languages: English, French, Portuguese and Spanish
          – Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]
        • Guidelines for multi-language manual word annotations (with inter-annotator agreement)
        • Linguistically-informed (and linguistically-motivated) cross-language multiword unit and paraphrase alignment (translation unit alignment)
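The figures on this slide follow from simple counting: four languages give six unordered language pairs, and 400 aligned sentence pairs per set give 2,400 in total. A minimal Python check, where the variable names are illustrative and pairs are treated as unordered (so `es-pt` here corresponds to the slide's `[pt-es]`):

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
sentences_per_set = 400  # figure stated on the slide

# Every unordered pair of distinct languages is one gold alignment set.
pairs = [f"{a}-{b}" for a, b in combinations(languages, 2)]
total_sentence_pairs = len(pairs) * sentences_per_set

print(pairs)
print(total_sentence_pairs)
```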
    14. CLUE-Aligner Alignment Tool
        CLUE-Aligner = Cross-Language Unit Elicitation Aligner
        • Helps reduce ambiguity in the alignment process
        • Facilitates the alignment of translation units
    15. Major Challenges (4 different classes)
        • semantico-discursive
          – emphatic linguistic constructions
            • tautology
            • pleonasm and repetition
            • focus constructions
        • lexical and semantico-syntactic
          – multiword units
          – compound verbs
          – prepositional predicates
    16. Major Challenges (4 different classes)
        • morphological
          – contracted forms
          – lexical versus non-lexical realization
            • articles and zero articles
            • pro-dropping
              – subject pronoun drop
              – empty relative pronoun
        • morpho-syntactic
          – free noun adjuncts
    17. General Annotation Guidelines
        Table columns: Linguistic phenomenon | No alignment | P-alignment
        [the column of each X mark is not preserved in this transcript]
        • Incomplete or non-translation: X
        • Incorrect translation and typo: X*
        • Approximate correspondence (numeric): X
        • Non-obligatory linguistic structure
          – Pleonasm: X
          – Repetition of words or expressions: X
          – Redundancy or additional/extra information: X
        • Mismatching pronoun, determiner, verbs, etc.: X
        • Abbreviations versus full word: X
        • Punctuation mark
          – Different but correct: X
          – Incorrect / mismatch: X
          – Missing: X
        * If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned
    18. Annotation Guidelines
        Table columns: Linguistic phenomenon | No alignment | Block-alignment (S-align, P-align)
        [the column of each X mark is not preserved in this transcript]
        • Multiword unit
          – continuous: X X
          – discontinuous: X*
        • Lexical versus non-lexical realization
          – article + N versus zero-article + N (Ø people = PT as pessoas): X
          – pro-drop + V versus pronoun + V (I went = PT Ø fui): X
          – empty relative pronoun versus realized relative pronoun (N that I met = N I met = PT que (eu) conheci): X
          – relative versus participial adjective (that was written = written = PT escrito): X
        * Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)
    19. Annotation Guidelines
        Table columns: Continuous multiword units | Block-S-alignment | Block-P-alignment
        [the column of each single X mark is not preserved in this transcript]
        • Support verb construction: X X
        • Compound: X X
        • Phrasal verb: X X
        • Named entity: X X
        • Date and time expression: X
        • Lexical bundle: X
        • Idiomatic expression: X
        • Domain term: X
        • French negation (ne pas): X
        • English infinitive (to + V): X X
        [Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit
    20. Example: Continuous Support Verb Constructions (alignment)
        ES aprueba plenamente
        FR approuve pleinement
    21. Example: Discontinuous Support Verb Constructions (no alignment)
        ES para que acelere la directiva sobre pensiones complementares
        FR pour faire avancer la directive sur les pensions complementaires
    22. Cross-Linguistic Challenges
        • Prepositional predicates
          EN I too should like to congratulate [NE] on his excellent report
          ES también yo quisiera felicitar a mi colega [NE] por su excelente informe
          FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent rapport
          PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente relatório

          EN […] our Asian partners prefer to deal with questions which unite us
          ES […] nuestros socios asiáticos prefieren dedicarse a las questiones que nos unen
          FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit
          PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas questões comuns
        Callouts: Segment S-alignment; Impossible to annotate discontinuous preposition-dependency; Block P-alignment
    23. Cross-Linguistic Challenges
        • Prepositional verbs
          agree with, aim at/for, allow for, apologise for, apply for, approve of, argue with/about, ask for, attend to, believe in, belong to, choose between, comment on, compare with, complain about, concentrate on, congratulate on, consist of, deal with, decide on, forgive s/o for, hope for, insist on, interfere with/in, joke about, laugh at, lend s/th to s/o, listen to, long for, object to, pay for, prepare for, prevent s/o from, provide s/o with, refer to, rely on, run for, smile at, succeed in, suffer from, stand for, thank s/o for, think of/about, volunteer to, wait for, warn s/o about, worry about
    24. Cross-Linguistic Challenges
        • Prepositional nouns
          attack on, attitude towards, in agreement, on strike, cruelty towards, comparison between, on average, in trouble, difficulty in/with, decrease in, on condition, on behalf of, knowledge of, disadvantage of, delay in, connection between, reason for, increase in, in doubt, difference between/of, rise in, preference for, information about, under guarantee, solution to, reduction in, need for, in power, use of, at risk, protection from, reaction to, in a hurry, at stake, report on, result of, in practice, in theory, room for, trouble with
    25. Cross-Linguistic Challenges
        • Prepositional adjectives
          delighted at/about, different from, dissatisfied with, doubtful about, enthusiastic about, envious of, excited about, famous for, fed up with, fond of, frightened of, friendly with, good at, guilty of, incapable of, interested in, jealous of, keen on, kind to, mad at/about, opposed to, pleased with, popular with, proud of, puzzled by/about, safe from, satisfied with, sensitive to(wards), serious about, sick of, similar to, sorry for/about, suspicious of, sympathetic to(wards), tired of, typical of, unaware of, used to
    26. Cross-Linguistic Challenges
        • Noun Adjuncts
          – Compounds
            • European investment bank = banco europeu de investimento
              [Adj N N] = [N Adj [de N]]
          – Free noun phrases (not compounds)
            • presidency communication = comunicação da presidência
              [N N] = [N [de N]]
        Callouts: Block S-alignment; Segment S-alignment; Block-P-alignment of [de N]
    27. Cross-Linguistic Challenges
        • Contractions
          – two or more words with different parts-of-speech overlap, which makes syntactic analysis and generation difficult
          – in cross-language analysis, the contrast between languages that have contractions and languages that do not have them, or do not have them in the same contexts, presents additional difficulties
          – The alignment of one segment that corresponds to a contracted form in one language with the corresponding segments where elements are not contracted in the other language of the parallel pair is pragmatically motivated
    28. Example: Contractions (block-P-alignment)
        Interference with the support verb construction
        EN to make a reference to
        PT fazer uma referência a
    29. Example: Contractions (block-P-alignment)
        Interference with the support verb construction
        ES hacer una referencia a
        FR faire référence a
    30. Cross-Linguistic Challenges
        • Singular versus plural (related to determiner)
          EN in every official language of the union
          ES en todos los idiomas oficiales de la unión
          FR dans toutes les langues officielles de l'union
          PT em cada uma das línguas oficiais da união
        • Active versus passive
          EN before new member states are admitted
          ES antes de la incorporación de nuevos miembros
          FR avant l'admission de nouveaux membres
          PT antes da entrada de novos membros
        Callouts: Block or segment P-alignment; Block-S-alignment if there is some fixedness (such as in this case); Block P-alignment
    31. Cross-Linguistic Challenges
        • Coordination
          EN which we will send to the council and Ø parliament
          ES que enviaremos al consejo y al parlamento
          FR qui sera envoyée au conseil et au parlement
          PT que remeterá ao conselho e ao parlamento
        • Style: idiomatic versus non-idiomatic
          EN which began four years ago
          ES que empezó hace quatro años
          FR qui a vu le jour il y a quatre ans
          PT que se iniciou há quatro anos
        Callouts: No alignment; Block P-alignment
    32. Cross-Linguistic Challenges
        • Choice of noun pre-modifiers
          EN we should use that public funding for those types of project which are most difficult to finance through the private sector
          ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos que tienen mayor dificuldad para ser financiados por el sector privado
          FR nous devrions recourir au financement public pour les projets que le secteur privé boude
          PT o financiamento público deveria ser utilizado para os projectos que registam maiores dificuldades em serem financiados pelo sector privado
        Callout: Block P-alignment
          EN despite certain difficulties
          PT apesar das dificuldades
    33. Cross-Linguistic Challenges
        • Anaphora - choice of co-referents (noun versus pronoun)
          EN it is not acceptable that we assisted Korea during the Asean crisis by means of IMF loans and suchlike, only for Korea still to be subsidising its shipyards
          ES no resulta procedente que hayamos ayudado a Corea en la crisis de la Asean a través de préstamos del FMI, etc. y que Corea siga subvencionando sus astilleros
          FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner ses chantiers navals
          PT é inadmissível que, depois de termos ajudado a Coreia, através de créditos do FMI, etc., na crise da Asean, este país continue a subvencionar agora os seus estaleiros navais
        Callout: Segment or block P-alignment
    34. Cross-Linguistic Challenges
        • Antonyms and negation constructions
          EN the countries of Asia have not unfortunately been in favour of that proposal
          ES los países de Asia desgraciadamente no han sido favorables a dicha propuesta
          FR les pays d'Asie ont malheureusement rejeté cette proposition
          PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta proposta
        Callout: Block S-alignment together with adverb (insert in EN and FR)
    35. Cross-Linguistic Challenges
        • Flexible/loose paraphrasing constructions
          EN and we shall vote against it
          ES y merece nuestra condena
          FR et dénonçons
          PT e merece a nossa condenação

          EN 1993 was a significant year
          ES el año 1993 es una fecha notable
          FR l’année 1993 est à marquer d’une pierre blanche
          PT 1993 é uma data charneira
        Callout: Block P-alignment
    36. Cross-Linguistic Challenges
        • Different parts-of-speech with same semantics (verbs versus process nouns)
          EN we must use all the financial instruments at our disposal to rapidly develop the market
          ES es preciso utilizar todos los instrumentos financieros disponibles para un rápido desarollo ulterior del mercado
          FR il faut utiliser tous les instruments financiers disponibles pour développer rapidement le marché
          PT todos os instrumentos financeiros disponíveis deverão ser aplicados para continuar a desenvolver rapidamente o mercado
        Callouts: Block S-alignment (with internal segment P-alignments); EN and PT: Segment S-alignment; No alignment of [continuar a]
    37. Cross-Linguistic Challenges
        • Impersonal constructions (+ “impersonal” relative versus participial adjective)
          EN we must fully support the demands that have been made
          ES hay que apoyar plenamente las exigencias que se han formulado
          FR il faut par conséquent appuyer les requêtes formulées
          PT as reivindicações formuladas deverão ser plenamente apoiadas
        Callouts:
          – Block P-alignment
          – Internal P-alignment: EN we must, ES hay que, FR il faut
          – Internal segment S-alignment - adverb and verb (EN, ES, FR)
          – Internal segment P-alignment - verb (PT)
    38. Cross-Linguistic Challenges
        • Romance languages double negation (+ coordination)
          EN it is not, therefore, surprising that there is, in this context, no real integration or genuine political dialogue
          ES no es nada sorprendente, entonces, que en ese contexto, no haya ni verdadera integración ni verdadero diálogo político
          FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration véritable, ni dialogue politique véritable
          PT assim, não é de espantar que, nesse contexto, não exista verdadeira integração nem verdadeiro diálogo político
        Callouts:
          – Block P-alignment of the relative existential with adverbial (insert)
            EN that there is, in this context, no
            ES que en esse contexto, no haya
            FR qu’il n’y ait dans ce contexte
            PT que, nesse contexto, não exista
          – Segment P-alignment of negation and negation connector
            EN no – or
            ES ni – ni
            FR n’ – ni
            PT Ø - nem
    39. Cross-Linguistic Challenges
        • Idiosyncrasies of languages
          – Portuguese inflected infinitive (peculiar verb tense)
          – English to+Infinitive
          – French negation
          – English apostrophe
          – …
        • Sociolinguistic differences
    40. Our Contribution
        • Tool CLUE-Aligner
        • Annotated corpora
        • Cross-language resources – gold collection
          Publicly available on the META-NET website: http://metanet4u.l2f.inesc-id.pt/
        • Guidelines
          – http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf
    41. Annotation Process
        • Annotation of 400 x 6 (2,400 sentence alignments) by a linguist
        • Alignment of a subset by a second linguist (25 sentences of the English-Portuguese language pair)
        • Inter-annotator agreement
    42. Preliminary Results
        Corpus statistics
          language   words   avg. words
          en         11158   27.9
          es         11664   29.2
          fr         12464   31.2
          pt         11649   29.1
        Alignment counts (the slide labels the two tables “Block (MWU) alignment” and “Segment (word) alignment”)
          pair    Sure   Possible   Total
          en-pt   6684   418        7102
          en-fr   7025   569        7594
          en-es   7636   399        8035
          es-fr   7477   767        8244
          pt-es   7958   557        8515
          pt-fr   7029   782        7811

          pair    Sure   Possible   Total
          en-pt   2588   602        3190
          en-fr   3865   414        4279
          en-es   3551   351        3902
          es-fr   3516   495        4011
          pt-es   3162   382        3544
          pt-fr   3253   698        3951
    43. Inter-annotator Agreement
        • Statistical significance for kappa is rarely reported. However, a number of magnitude guidelines have appeared in the literature.
          – Landis & Koch (1977) consider
            • kappas between .4 and .6 as a moderate agreement
            • kappas between .8 and 1 as an almost perfect agreement
          – Fleiss (1981) (equally arbitrary guidelines) characterizes
            • kappas from .40 to .75 as fair to good
            • kappas over .75 as excellent
        • This set of guidelines is, however, by no means universally accepted
        Cohen's kappa coefficient
          Multi-word units (MWU)   0.541
          Word alignments (WA)     0.984
          Total                    0.871
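For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation, not the script used to obtain the figures on the slide. The label scheme (`S`, `P`, `N` for sure, possible and no alignment) and the two annotation lists are made-up assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions for ten candidate links (not data from the gold sets).
annotator_1 = ["S", "S", "P", "N", "S", "N", "P", "S", "N", "N"]
annotator_2 = ["S", "S", "P", "N", "P", "N", "P", "S", "N", "S"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

With these made-up labels the observed agreement is 0.8 and the chance agreement is 0.34, so kappa is 0.46 / 0.66, which falls in the "fair to good" band of the Fleiss guideline quoted above.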
    44. Discussion
        • Difficulties in analyzing fluency, stylistics (including word order), paraphrase, etc.
        • Alignments do not always work bi-directionally (sometimes the source-target direction for a language pair matters)
        • Levels of alignment and ranking systems (n-grams, morphology, semantico-syntactic level, phrase, paraphrase, etc.)
        • Terminology imprecision is found in corpora (it leads to poor quality machine translation)
    45. Future Work
        • Integration of lexica (multiword units, etc.) obtained via the use of local grammars – use multiword units as ONE (1) segment of alignment, whenever that is possible (contiguous, etc.)
        • Pre-processing of contractions and post-processing of elements that need to be contracted is important if applied to machine translation or to create “more polished” lexica
        • Evaluation of the current alignments in a statistical machine translation system to see if translation quality improves
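The contraction pre-processing mentioned above could look roughly like the following sketch, which splits a handful of Portuguese preposition + article contractions before alignment. It is an assumption about one possible approach, not the authors' implementation: the `CONTRACTIONS` table is deliberately partial, the function name `split_contractions` is hypothetical, and a plain lookup ignores ambiguous forms (for example `nos`, which can also be a pronoun, is left out) and discards the original capitalization of a split token.

```python
# Partial, illustrative table of Portuguese preposition + article contractions.
CONTRACTIONS = {
    "da": ("de", "a"),
    "do": ("de", "o"),
    "das": ("de", "as"),
    "dos": ("de", "os"),
    "na": ("em", "a"),
    "no": ("em", "o"),
    "nas": ("em", "as"),
    "ao": ("a", "o"),
    "aos": ("a", "os"),
    "pelo": ("por", "o"),
    "pela": ("por", "a"),
}


def split_contractions(tokens):
    """Replace each known contraction by its two component tokens."""
    expanded = []
    for token in tokens:
        # Unknown tokens pass through unchanged as a one-element tuple.
        expanded.extend(CONTRACTIONS.get(token.lower(), (token,)))
    return expanded


# Phrase taken from the noun-adjunct slide: "comunicação da presidência".
tokens = ["comunicação", "da", "presidência"]
print(split_contractions(tokens))
```

After splitting, `de` and `a` become separate alignment targets; a matching post-processing step would re-contract them when generating Portuguese output.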
    46. Future Work
        • Machine learning of recognition and alignment of multiword units
          – based on segment alignments, i.e., individual words inside the multiword unit
          – based on multiword units of a parallel sentence in another language or language pair alignment
        • Use of local grammars that identify and process discontinuous multiword units and other complex linguistic phenomena to combine with word alignment techniques – how to combine?
    47. Main Conclusion
        • Bringing linguistics into SMT at the start is the first inevitable place where hybridization should be possible.
        • We believe that it would be productive to convert texts on both sides of a translation pair into a common semantico-syntactic representation before applying statistics to them. For this, each language would have to have a parser capable of producing homogeneous output.
        • If this common representation were available, that would bring vast possibilities for multi-linguistic SMT.
    48. Thank you!
    49. Cross-Language Alignments: Challenges, Guidelines and Gold Sets
        Anabela Barreiro, Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista, João Graça
