Parallel text extraction from multimodal comparable corpora: Presentation Transcript

  • Parallel text extraction from multimodal comparable corpora
    Haithem Afli, Loïc Barrault and Holger Schwenk
    LIUM, University of Le Maine, 72085 Le Mans cedex 9, FRANCE
    FirstName.LastName@lium.univ-lemans.fr
    Oct 22, 2012. JapTal 2012, Kanazawa, JAPAN
  • Outline
    1 Introduction and Context: Statistical Machine Translation; Parallel and Comparable Corpora
    2 Existing Works: Exploiting Comparable Corpora; Main Existing Methods
    3 Proposed Approach: System Architecture; Several Issues; Task Description; Experimental Setup; Results
    4 Conclusion and Discussion
  • Statistical Machine Translation
    Purpose: text translation.
    Approach: statistical, given by $t^* = \arg\max_t P(s \mid t)\, P(t)$, where s is the source sentence and t a candidate translation.
    Modeling: translation model P(s|t); language model P(t).
    Decoding algorithm: the argmax search.
    Open-source tools such as Moses and Joshua are available.
    ⇒ needs parallel data
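The decision rule on this slide follows from Bayes' rule. A short derivation, using the slide's symbols (s for the source sentence, t for a candidate translation); the starting point P(t | s) is the standard noisy-channel formulation and is not spelled out on the slide:

```latex
\begin{align*}
t^* &= \arg\max_t P(t \mid s) \\
    &= \arg\max_t \frac{P(s \mid t)\, P(t)}{P(s)} \\
    &= \arg\max_t P(s \mid t)\, P(t)
\end{align*}
```

The last step holds because P(s) does not depend on t, so it cannot change which t maximizes the expression.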
  • Parallel Corpora
    Texts that are translations of each other.
    An essential resource for MT:
      provide training data for statistical translation models
      also useful for other NLP applications
    Expensive and time-consuming to prepare: translate, sentence-align, ...
    But limited in size, language and domain.
    ⇒ There are no better data than more data
  • Comparable Corpora
    Generally not parallel, but with overlapping information.
    Readily available, mainly from newswire: AFP, Al JAZEERA, BBC ...
    Much larger quantities than parallel corpora.
    Multiple languages and genres.
    Large collections available for NLP tasks, e.g. Gigaword corpora from LDC: English, Arabic, Chinese, French, Spanish.
  • Exploiting comparable corpora
    Extract parallel documents:
      using structural information
      extending parallel sentence alignment algorithms
    Extract parallel sentence pairs:
      with sentence alignment algorithms
      cross-lingual IR methods
      translation approach
  • Main Existing Methods
    Web crawling [Resnik and Smith, 2003]: use URLs to find matching documents.
    Alignment [Brown et al., 1991]: use word alignment models to judge how close a source and a target document (or sentence) are.
    Cross-lingual IR [Munteanu and Marcu, 2005]: use a lexicon to translate source words and apply information retrieval techniques.
    Translation [Rauf and Schwenk, 2011]: use an SMT system to translate documents and apply information retrieval.
  • Goal: Exploiting multimodal comparable corpora
    [Diagram: text and audio → parallel text extraction → parallel text]
  • Proposed Approach
    Build a baseline SMT system (using generic data).
    Transcribe the audio data.
    Translate the transcribed text.
    Use the translations as queries for IR to find the "matching" sentences in the target comparable corpus.
    Use TER between the SMT translation and the retrieved sentences to detect parallel ones.
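The last step of the approach above (keeping a retrieved sentence only when it is close enough to the SMT translation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it approximates TER by word-level edit distance divided by the length of the retrieved sentence and ignores the block shifts that real TER allows. The function names, the choice of the retrieved sentence as the reference side, and the threshold value are all assumptions.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis, reference):
    """Edit distance over reference length, as a percentage.

    Simplification: real TER also allows block shifts; they are omitted here.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 100.0
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def filter_parallel(candidates, threshold):
    """Keep (source, retrieved) pairs whose translation is close to the retrieved sentence.

    candidates: iterable of (source, smt_translation, retrieved_sentence).
    threshold: hypothetical maximum approximate TER, in percent.
    """
    kept = []
    for source, translation, retrieved in candidates:
        if approx_ter(translation, retrieved) <= threshold:
            kept.append((source, retrieved))
    return kept


candidates = [
    ("i wrote a story", "j ai écrit un article", "j ai écrit un article"),
    ("hello", "bonjour", "au revoir tout le monde"),
]
kept = filter_parallel(candidates, threshold=50.0)
```

With these toy inputs only the first pair survives, because the second translation shares no words with the sentence retrieved for it.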
  • System Architecture
    [Diagram: multimodal comparable corpora. Audio L1 → ASR → Transc. L1 → SMT → Transl. L2 → IR over Texts L2 → Text L2 → Bitext]
  • Several issues
    Feasibility: are multimodal comparable corpora useful for extracting parallel text?
    Quality: is parallel text extracted from multimodal corpora as good as bitext extracted from comparable text?
    Effectiveness: since one of our motivations for exploiting comparable corpora is to adapt an SMT system to a specific domain, the extracted bitext needs to improve SMT performance.
  • Task description (1)
    Analyze the impact of the errors of each module ⇒ three different types of experiments were conducted.
    Exp 1: the reference translations are used as queries for the IR system. → This is the most favorable condition; it simulates the case where the ASR and SMT systems make no errors.
    Exp 2: the reference transcription is used as input to the SMT system. → In this case, the errors come only from the SMT system, since no ASR is involved.
    Exp 3: the complete proposed framework. → It corresponds to a real scenario.
  • Task description (2)
    [Diagram of the three experimental conditions]
    Exp 1: TEDbi.FR → IR
    Exp 2: TEDbi.En → SMT → TEDbi_tran.FR → IR
    Exp 3: TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR
    In all three conditions, IR searches the French text: ccb2 + %TrainTED.fr
  • Task description (3)
    Importance of the degree of similarity between the two parts of the comparable corpora ⇒ we artificially created four comparable corpora with different degrees of similarity:
      the source part of our comparable corpus is always the same
      the target-language part consists of a large generic corpus plus 25%, 50%, 75% and 100%, respectively, of the reference translations
    Evaluation of the approach:
      the extracted parallel data are re-injected into the baseline system
      systems are evaluated using the BLEU score
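The construction of the four target-side corpora described above can be sketched as below. The slides do not say how the 25%, 50% and 75% subsets of the reference translations were selected, so taking the first portion of the list is an assumption, as are the function name and the toy data.

```python
def build_index_corpus(generic, reference_translations, fraction):
    """Return the generic sentences plus a leading share of the reference translations.

    Assumption: the subset is simply the first `fraction` of the list.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n = round(len(reference_translations) * fraction)
    return list(generic) + list(reference_translations[:n])


# Toy stand-ins for the large generic corpus and the reference translations.
generic = [f"generic sentence {i}" for i in range(10)]
refs = [f"reference translation {i}" for i in range(8)]

corpora = {pct: build_index_corpus(generic, refs, pct / 100)
           for pct in (25, 50, 75, 100)}
sizes = {pct: len(corpus) for pct, corpus in corpora.items()}
```

Each corpus shares the same generic part, so the only thing that varies between conditions is how many truly parallel sentences are available for the IR step to find.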
  • Experimental Setup: Data (TED task in IWSLT)
    Training bitexts    # words    In domain?
    nc7                 3.7M       no
    eparl7              56.4M      no
    TEDasr              1.8M       yes
    TEDbi               1.9M       yes
    Development and test
    Dev                 # words
    dev.outASR          36k
    dev.refSMT          38k
    Test                # words
    tst.outASR          8.7k
    tst.refSMT          9.1k
  • Experimental Setup: Modules
    ASR: a five-pass system based on CMU Sphinx, with a WER of about 18%.
    SMT: a phrase-based system built with the Moses SMT toolkit:
      trained on generic bitext only
      word alignments in both directions are calculated using GIZA++
      phrases and lexical reordering are extracted using the default settings of the Moses toolkit
      the parameters were tuned on dev.outASR, using the MERT tool
    IR: a system based on the Lemur IR toolkit:
      index all target-language (French) text data
      transform the translated source-language (English) text into queries using the Indri Query Language
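The query-building step of the IR module could look like the sketch below. The slides only state that translations are turned into Indri Query Language queries; wrapping the terms in the `#combine` operator, lowercasing, and stripping punctuation are assumptions made for illustration, and `to_indri_query` is a hypothetical helper name.

```python
import re


def to_indri_query(sentence):
    """Turn a translated sentence into an Indri-style #combine query string.

    Returns None when the sentence contains no usable terms.
    """
    # Keep only word characters; stray punctuation can confuse a query parser.
    terms = re.findall(r"\w+", sentence.lower())
    if not terms:
        return None
    return "#combine(" + " ".join(terms) + ")"


query = to_indri_query("J'ai écrit un article.")
```

Returning None for empty input lets the caller skip sentences that would otherwise produce a malformed query.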
  • Results
    Table: BLEU scores on dev and test after adaptation of a baseline system with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi).
    Experiment         Dev      Test
    Baseline system    22.93    23.96
    Exp1               24.14    25.14
    Exp2               23.90    25.15
    Exp3               23.40    24.69
    Extracted sentences do improve the SMT system.
    The BLEU score of the adapted system matches that of Exp1 in most cases.
    ⇒ errors introduced by the SMT and ASR systems have no major impact on the performance of the parallel sentence extraction algorithm
  • Results
    Table: BLEU scores for different degrees of parallelism of the comparable corpus.
    Experiment         Dev      Test     # injected words
    Baseline system    22.93    23.96    -
    25% TEDbi          23.11    24.40    ∼110k
    50% TEDbi          23.27    24.58    ∼215k
    75% TEDbi          23.43    24.42    ∼293k
    100% TEDbi         23.40    24.69    ∼393k
    The degree of similarity of the comparable corpus matters for both the performance of the extraction process and the quality of the extracted parallel sentences.
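As a quick arithmetic check on the table above, the test-set gain over the baseline can be computed directly from the reported BLEU scores. The numbers are copied from the slide; the variable names are illustrative.

```python
# Test-set BLEU scores as reported on the slide.
baseline_test = 23.96
test_bleu = {
    "25% TEDbi": 24.40,
    "50% TEDbi": 24.58,
    "75% TEDbi": 24.42,
    "100% TEDbi": 24.69,
}

# Absolute BLEU gain of each adapted system over the baseline.
gains = {name: round(score - baseline_test, 2)
         for name, score in test_bleu.items()}
```

This gives gains of 0.44, 0.62, 0.46 and 0.73 BLEU points. The largest gain comes from the 100% condition, although the test score does not rise strictly with the injected fraction (75% scores below 50%).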
  • Results
    e.g. 1 (domain adaptation)
      Source sentence: i wrote a story about genetically engineered food
      Baseline Sys: jai écrit un article sur la nourriture génétiquement modifiée
      Adapted Sys: jai écrit un article sur les produits alimentaires génétiquement modifiés
    e.g. 2 (oral vocabulary correction)
      Source sentence: yeah youre right lets fix it
      Baseline Sys: yeah tu as raison de réparer
      Adapted Sys: euh oui tu as raison il faut réparer
  • Conclusion
    We proposed extending the exploitation of comparable corpora to multimodal comparable corpora, i.e. the source side is available as audio and the target side as text.
    An encouraging result: we automatically aligned source audio in one language with texts in another language, without human intervention to transcribe and translate the data.
    We were able to adapt a generic SMT system to the task of lecture translation by extracting parallel data from a multimodal comparable corpus.
  • Perspectives
    Apply this task at a much larger scale, i.e. using hundreds of hours of speech and hundreds of millions of words.
    Work on different specific domains or subdomains.
    Iterate the process, using the extracted bitexts to translate the source sentences again.
    Calculate the degree of similarity of the corpus before using it.
  • References
    Brown, P. F., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th annual meeting on ACL, pages 169–176.
    Munteanu, D. S. and Marcu, D. (2005). Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.
    Rauf, S. A. and Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4):341–375.
    Resnik, P. and Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29:349–380.
  • Thank you
  • Results (1)
    [Two plots of BLEU score on dev (y-axis, 22.5 to 24.5) against TER threshold (x-axis, 0 to 100), each with curves for Exp1, Exp2 and Exp3]
    Figure: BLEU score on dev using SMT systems adapted with bitexts extracted from ccb2 + 100% TEDbi index corpus.
    Figure: BLEU score on dev using SMT systems adapted with bitexts extracted from ccb2 + 75% TEDbi index corpus.
    The choice of the appropriate TER threshold depends on the type of data.
  • Crawling the Web [Resnik and Smith, 2003]
    Search for web pages with similar URLs.
    Many companies and organizations have their web pages in multiple languages.
    Identified by language ID, e.g. http://x.../y.../z.en and http://x.../y.../z.fr
    Pages have links to parallel pages.
    A web crawler exploits this structural information.
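The URL-matching idea on this slide can be illustrated with a small sketch that pairs pages whose URLs differ only in a trailing language ID. It is not Resnik and Smith's system; the function name, the suffix convention and the example.org URLs are hypothetical.

```python
from collections import defaultdict


def pair_urls_by_language(urls, lang_a="en", lang_b="fr"):
    """Pair URLs that are identical except for a trailing language suffix."""
    groups = defaultdict(dict)
    for url in urls:
        stem, dot, suffix = url.rpartition(".")
        # Only group URLs that end in one of the two language IDs.
        if dot and suffix in (lang_a, lang_b):
            groups[stem][suffix] = url
    return sorted((g[lang_a], g[lang_b])
                  for g in groups.values()
                  if lang_a in g and lang_b in g)


urls = [
    "http://example.org/a/page.en",
    "http://example.org/a/page.fr",
    "http://example.org/b/only.en",
]
pairs = pair_urls_by_language(urls)
```

Pages without a counterpart in the other language, like the third URL here, are simply left unpaired.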
  • Alignment Approach [Brown et al., 1991]
    Train an initial lexicon on parallel data.
    Use the lexicon to calculate an alignment score between documents (or sentences), typically with IBM1.
    Select the most reliable document (sentence) pairs.
    Add them to the parallel training data and retrain → bootstrapping.
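The scoring step above can be sketched with an IBM Model 1 style likelihood. This is a simplified illustration under stated assumptions: the constant length term of Model 1 is dropped, the score is normalized by source length so that sentences of different lengths are comparable, unseen word pairs get a small floor probability, and the toy lexicon values are made up.

```python
import math


def ibm1_log_score(source, target, lexicon, floor=1e-6):
    """Length-normalized IBM Model 1 log-likelihood of `source` given `target`.

    lexicon maps (source_word, target_word) to a translation probability.
    """
    src = source.split()
    tgt = ["NULL"] + target.split()  # Model 1 lets source words align to NULL
    if not src:
        return float("-inf")
    total = 0.0
    for s in src:
        # Uniform alignment prior: average over all target positions.
        p = sum(lexicon.get((s, t), 0.0) for t in tgt) / len(tgt)
        total += math.log(max(p, floor))
    return total / len(src)


# Hypothetical toy lexicon for illustration only.
lexicon = {("la", "the"): 0.7, ("maison", "house"): 0.8}
good = ibm1_log_score("la maison", "the house", lexicon)
bad = ibm1_log_score("la maison", "a dog", lexicon)
```

Candidate pairs would then be ranked by this score, with the highest-scoring ones added to the training data for the next bootstrapping round.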
  • Finding Comparable Documents [Zhao and Vogel, 2002]
    Given comparable documents, find (nearly) parallel sentences.
    Xinhua News Agency publishes news in English and Chinese.
    Calculate similarity based on a lexicon.
    Iterative process.
  • CLIR Approach [Munteanu and Marcu, 2005]
    [Figure: CLIR approach]
  • Translation Approach [Rauf and Schwenk, 2011]
    [Figure: Translation approach]