Introduction
                                  Existing Works
                              Proposed Approach
                                      Conclusion




           Parallel text extraction from multimodal
                      comparable corpora

           Haithem Afli, Loïc Barrault and Holger Schwenk
                                LIUM, University of Le Maine
                              72085 Le Mans cedex 9, FRANCE
                          FirstName.LastName@lium.univ-lemans.fr

                                  Oct 22, 2012
                         JapTal 2012, Kanazawa - JAPAN






    Outline

        1   Introduction and Context
                  Statistical Machine Translation
                  Parallel and Comparable Corpora
        2   Existing Works
                  Exploiting Comparable Corpora
                  Main Existing Methods
        3   Proposed Approach
                  System Architecture
                  Several Issues
                  Task Description
                  Experimental setup
                  Results
        4   Conclusion and Discussion



    Statistical Machine Translation

          Purpose : text translation
          Approach : statistical, given by :

              $t^{*} = \arg\max_{t} P(s \mid t)\, P(t)$

              (s : source sentence, t : candidate target sentence)

          Modeling
                Translation Model : P(s|t)
                Language Model : P(t)
                Decoding Algorithm : argmax
          Open-source tools such as Moses and Joshua are available
          ⇒ needs parallel data
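
The decision rule above can be illustrated with a minimal sketch. It assumes a tiny candidate list and made-up probabilities; the dictionaries `translation_model` and `language_model` and the function `decode` are hypothetical names for this example, not part of Moses or Joshua, which search a far larger hypothesis space.

```python
import math

# Hypothetical toy models: every probability below is invented for illustration.
# translation_model[(s, t)] stands in for P(s|t); language_model[t] stands in for P(t).
translation_model = {
    ("i wrote a story", "j'ai écrit un article"): 0.30,
    ("i wrote a story", "j'ai écrit une histoire"): 0.40,
    ("i wrote a story", "je écrire histoire"): 0.50,
}
language_model = {
    "j'ai écrit un article": 0.020,
    "j'ai écrit une histoire": 0.030,
    "je écrire histoire": 0.0001,
}


def decode(source, candidates):
    """Return the candidate t that maximizes log P(s|t) + log P(t)."""
    def score(target):
        return (math.log(translation_model[(source, target)])
                + math.log(language_model[target]))
    return max(candidates, key=score)


best = decode("i wrote a story", list(language_model))
print(best)
```

With these invented numbers, the third candidate has the highest P(s|t) but loses because the language model gives it a very low P(t). That is the role the product P(s|t)P(t) plays in the formula.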





    Parallel Corpora


            Texts that are translations of each other
            An essential resource for MT
                  Provide training data for statistical translation models
                  Also useful for other NLP applications
                  Expensive and time consuming to prepare
                         Translate, Sentence Align, ...
            But limited in
                  Size, Language and Domain
        ⇒ There is no better data than more data





    Comparable Corpora


         Generally not parallel, but overlapping information
         Readily available
               Mainly from Newswire
                      AFP, Al Jazeera, BBC ...
               Much larger quantities than parallel corpora
               Multiple languages and Genres
         Large collections available for NLP tasks
               e.g. Gigaword corpora from LDC
                      English, Arabic, Chinese, French, Spanish






    Exploiting comparable corpora
          Extract parallel documents
                Using structural information
                Extending parallel sentence alignment algorithms




          Extract parallel sentence pairs
                With sentence alignment algorithms
                Cross-lingual IR methods
                Translation approach



    Main Existing Methods


         Web crawling [Resnik and Smith, 2003] : use URLs to find
         matching documents
         Alignment [Brown et al., 1991] : use word alignment models
         to judge how close a source and a target document (or
         sentence) are
         Cross-lingual IR [Munteanu and Marcu, 2005] : use a lexicon
         to translate source words and apply information retrieval
         techniques
         Translation [Rauf and Schwenk, 2011] : use an SMT system to
         translate documents and apply information retrieval





    Goal : Exploiting multimodal comparable corpora




         Figure: Audio + Text → Parallel text extraction → Parallel text




    Proposed Approach



         Build a baseline SMT system (using generic data)
         Transcribe the audio data
         Translate the transcribed text
         Use the translations as IR queries to find the "matching"
         sentences in the target comparable corpus
         Use TER between each SMT translation and the retrieved
         sentences to detect parallel ones
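
The last step above, filtering retrieved sentences by TER, can be sketched as follows. This is a simplified approximation, not the TER tool itself: it counts only word insertions, deletions and substitutions (true TER also allows block shifts) and, as an assumption, normalizes by the length of the retrieved sentence. The function names and the threshold of 50 are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(translation, retrieved):
    """Shift-free TER approximation, in percent (assumption: no block shifts)."""
    hyp, ref = translation.split(), retrieved.split()
    if not ref:
        return float("inf") if hyp else 0.0
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def select_parallel(candidates, threshold=50.0):
    """Keep (source, retrieved) pairs whose approximate TER is under the threshold.

    candidates: list of (source_transcript, smt_translation, retrieved_sentence).
    """
    return [(src, ret) for src, trans, ret in candidates
            if approx_ter(trans, ret) <= threshold]


candidates = [
    ("src1", "le chat dort", "le chat dort"),
    ("src2", "le chat dort", "il pleut beaucoup ce soir"),
]
print(select_parallel(candidates))
```

Only the first pair survives here: its translation and retrieved sentence are identical, while the second pair shares no words.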






    System Architecture


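         Figure: processing pipeline over the multimodal comparable corpora
               Audio L1 → ASR → Transc. L1 → SMT → Transl. L2
               Transl. L2 → IR (over Texts L2) → Text L2
               Bitext (shown alongside the SMT and IR stages)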






    Several issues


          Feasibility : Are multimodal comparable corpora useful for
          extracting parallel text ?
          Quality : Is parallel text extracted from multimodal corpora
          as good as bitext extracted from comparable text ?
          Effectiveness : since one of our motivations for exploiting
          comparable corpora is to adapt an SMT system to a specific
          domain, the extracted bitext needs to improve SMT
          performance.






    Task description (1)


          Analyze the impact of the errors of each module
          ⇒ we conducted three different types of experiments
                Exp 1 : we use the reference translations as queries for the IR
                system → This is the most favorable condition : it simulates
                the case where the ASR and SMT systems make no errors.
                Exp 2 : we use the reference transcriptions as input to the SMT
                system → In this case, the errors come only from the SMT
                system since no ASR is involved.
                Exp 3 : the complete proposed framework → It
                corresponds to a real scenario.






    Task description (2)
          Exp 1 : TEDbi.FR → IR → French text
          Exp 2 : TEDbi.En → SMT → TEDbi_tran.FR → IR → French text
          Exp 3 : TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR → French text
          IR index corpus (all three experiments) : ccb2 + %TrainTED.fr





    Task description (3)

          Importance of the degree of similarity between the two parts
          of the comparable corpora
          ⇒ we artificially created four comparable corpora with
          different degrees of similarity
                the source part of our comparable corpus is always the same
                the target-language part of the comparable corpus consists of a
                large generic corpus plus 25%, 50%, 75% or 100%,
                respectively, of the reference translations
          Evaluation of the approach
                the extracted parallel data are re-injected into the baseline
                system
                systems are evaluated using the BLEU score
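
The construction of the four target-side corpora described above can be sketched as follows. The slides do not say how each subset of the reference translations was chosen, so taking the first n sentences is an assumption; `build_index_corpora` and its arguments are hypothetical names.

```python
def build_index_corpora(generic, reference_translations,
                        fractions=(0.25, 0.50, 0.75, 1.00)):
    """Return one target-side corpus per fraction of reference translations.

    Each corpus is the generic corpus plus the first n reference sentences,
    where n is the given fraction of the reference set (assumption: prefix
    selection; the original selection method is not stated).
    """
    corpora = {}
    for fraction in fractions:
        n = round(len(reference_translations) * fraction)
        corpora[fraction] = list(generic) + list(reference_translations[:n])
    return corpora


generic = ["generic sentence 1", "generic sentence 2"]
references = ["ref 1", "ref 2", "ref 3", "ref 4"]
corpora = build_index_corpora(generic, references)
for fraction, corpus in corpora.items():
    print(f"{fraction:.0%}: {len(corpus)} sentences")
```

The source side stays fixed across all four conditions, so only the proportion of true translations present on the target side varies.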





    Experimental Setup : Data (TED task in IWSLT)
         Training
                              bitexts         # words      in domain ?
                              nc7             3.7M         no
                              eparl7          56.4M        no
                              TEDasr          1.8M         yes
                              TEDbi           1.9M         yes
         Development and test
                              Dev             # words
                              dev.outASR      36k
                              dev.refSMT      38k
                              Test            # words
                              tst.outASR      8.7k
                              tst.refSMT      9.1k



    Experimental Setup : Modules

         ASR : a five-pass system based on CMU Sphinx
               has a WER of about 18%
         SMT : a phrase-based system based on the Moses SMT toolkit
               trained on generic bitexts only
               word alignments in both directions are calculated using
               GIZA++
               phrases and lexical reordering are extracted using the default
               settings of the Moses toolkit
               the parameters were tuned on dev.outASR, using the MERT
               tool
         IR : a system based on the Lemur IR toolkit
               indexes all target-language (French) text data
               converts the translated source-language (English) text into
               queries using the Indri Query Language
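
The query-construction step of the IR module could look like the sketch below. It wraps the words of a translated sentence in Indri's `#combine(...)` belief operator. The tokenization (lowercasing and keeping only word characters, which splits "j'ai" into "j ai") is an assumption and may not match how the original index was built; `to_indri_query` is a hypothetical helper, not part of the Lemur toolkit.

```python
import re


def to_indri_query(sentence):
    """Turn a translated sentence into a bag-of-words Indri query string.

    Punctuation is dropped so that characters with a meaning in the Indri
    Query Language (such as '#', '(' and ')') cannot leak into the query.
    """
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return ""
    return "#combine(" + " ".join(words) + ")"


print(to_indri_query("J'ai écrit un article."))
```

Each translated sentence yields one query, and the top-ranked French sentences returned by the index become candidates for the TER filter.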



    Results

         Table: BLEU scores on dev and test after adaptation of a baseline system
         with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi)

                                Experiment                     Dev      Test
                                Baseline system               22.93     23.96
                                Exp1                          24.14     25.14
                                Exp2                          23.90     25.15
                                Exp3                          23.40     24.69
              Extracted sentences do improve the SMT system
              BLEU scores of the systems adapted in Exp2 and Exp3 are
              close to those of Exp1 in most cases
              ⇒ errors introduced by the SMT and ASR systems have no
              major impact on the performance of the parallel sentence
              extraction algorithm


    Results

         Table: BLEU scores for different degrees of parallelism of the comparable
         corpus.
                 Experiment                   Dev        Test      # injected words
                 Baseline system             22.93       23.96     -
                 25% TEDbi                   23.11       24.40     ∼110k
                 50% TEDbi                   23.27       24.58     ∼215k
                 75% TEDbi                   23.43       24.42     ∼293k
                 100% TEDbi                  23.40       24.69     ∼393k

              The degree of similarity of the comparable corpus affects
              both the performance of the extraction process
              and the quality of the extracted parallel sentences



    Results
         e.g. 1 : Domain adaptation
               Source sentence : i wrote a story about genetically engineered food
               Baseline Sys : j'ai écrit un article sur la nourriture génétiquement modifiée
               Adapted Sys : j'ai écrit un article sur les produits alimentaires génétiquement modifiés

         e.g. 2 : Oral vocabulary correction
               Source sentence : yeah you're right let's fix it
               Baseline Sys : yeah tu as raison de réparer
               Adapted Sys : euh oui tu as raison il faut réparer






    Conclusion


         Proposed extending the exploitation of comparable corpora to
         multimodal comparable corpora, i.e. the source side is
         available as audio and the target side as text
         An encouraging result : source audio in one language was
         automatically aligned with texts in another language, without
         human intervention to transcribe and translate the data
         A generic SMT system was adapted to the task of lecture
         translation by extracting parallel data from a multimodal
         comparable corpus





    Perspectives



          Apply this task at a much larger scale, i.e. using hundreds of
          hours of speech and hundreds of millions of words
          Work on different specific domains or subdomains
          Iterate the process, using the extracted bitexts to
          translate the source sentences again
          Calculate the degree of similarity of the corpus before
          using it






         Brown, P. F., Lai, J. C., and Mercer, R. L. (1991).
         Aligning sentences in parallel corpora.
         In Proceedings of the 29th Annual Meeting of the ACL, pages
         169–176.
         Munteanu, D. S. and Marcu, D. (2005).
         Improving machine translation performance by exploiting
         non-parallel corpora.
         Computational Linguistics, 31(4):477–504.
         Rauf, S. A. and Schwenk, H. (2011).
         Parallel sentence generation from comparable corpora for
         improved SMT.
         Machine Translation, 25(4):341–375.
         Resnik, P. and Smith, N. A. (2003).
         The web as a parallel corpus.
         Computational Linguistics, 29(3):349–380.


    Thank you






    Results (1)
         Figure: BLEU score on dev using SMT systems adapted with bitexts
         extracted from the ccb2 + 100% TEDbi index corpus
         (x-axis: TER threshold, 0 to 100; y-axis: BLEU score, 22.5 to 24.5;
         curves: Exp1, Exp2, Exp3).

         Figure: BLEU score on dev using SMT systems adapted with bitexts
         extracted from the ccb2 + 75% TEDbi index corpus
         (same axes and curves).

         The choice of the appropriate TER threshold
         depends on the type of data



    Crawling the Web [Resnik and Smith, 2003]



         Search for web pages with similar URLs
               Many companies and organizations have their web pages in
               multiple languages
               Identified by language ID, e.g.
                      http://x.../y.../z.en and http://x.../y.../z.fr
               Pages have links to parallel pages
         A web crawler exploits this structural information
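
The URL-matching idea above can be sketched as follows, assuming the language ID is the final dot-separated suffix of the URL, as in the `z.en` / `z.fr` pattern. The function name and the `example.org` URLs are hypothetical; real sites use many other conventions (path segments, query parameters) that this sketch ignores.

```python
from collections import defaultdict


def pair_urls_by_language(urls, lang_a="en", lang_b="fr"):
    """Pair URLs that differ only in a trailing language suffix."""
    by_stem = defaultdict(dict)
    for url in urls:
        stem, dot, suffix = url.rpartition(".")
        if dot and suffix in (lang_a, lang_b):
            by_stem[stem][suffix] = url
    # Keep only stems for which both language versions were seen.
    return [(versions[lang_a], versions[lang_b])
            for versions in by_stem.values()
            if lang_a in versions and lang_b in versions]


urls = [
    "http://example.org/a/page.en",
    "http://example.org/a/page.fr",
    "http://example.org/b/other.en",
]
print(pair_urls_by_language(urls))
```

The third URL has no French counterpart, so it is left unpaired. Candidate pairs found this way would still need a content check before being treated as parallel documents.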






    Alignment Approach [Brown et al., 1991]



         Train an initial lexicon on parallel data
         Use the lexicon to calculate an alignment score between documents
         (or sentences)
               Typically IBM Model 1
         Select the most reliable document (sentence) pairs
         Add them to the parallel training data and retrain → bootstrapping
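
The scoring step above can be sketched with an IBM Model 1 style score. This is a simplified illustration under several assumptions: `lexicon` is a hypothetical dictionary mapping (source word, target word) pairs to translation probabilities, the constant length term of Model 1 is omitted, unseen pairs are floored at a small value, and the log-probability is averaged per source word so that sentences of different lengths can be compared.

```python
import math


def ibm1_log_score(source_words, target_words, lexicon, floor=1e-9):
    """Average per-word log-probability of the source given the target.

    For each source word, the lexical probabilities are averaged over all
    target words plus a NULL word, as in IBM Model 1.
    """
    targets = ["NULL"] + list(target_words)
    log_p = 0.0
    for s in source_words:
        total = sum(lexicon.get((s, t), 0.0) for t in targets)
        log_p += math.log(max(total, floor) / len(targets))
    return log_p / max(len(source_words), 1)


# Invented lexicon entries, for illustration only.
lexicon = {("la", "the"): 0.7, ("maison", "house"): 0.8}

parallel = ibm1_log_score(["la", "maison"], ["the", "house"], lexicon)
unrelated = ibm1_log_score(["la", "maison"], ["green", "ideas"], lexicon)
print(parallel > unrelated)
```

In a bootstrapping loop, the highest-scoring pairs would be added to the training data and the lexicon re-estimated before the next round.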






    Finding Comparable Documents [Zhao and Vogel, 2002]




         Given comparable documents, find (nearly) parallel sentences
         Xinhua News Agency publishes news in English and Chinese
         Calculate similarity based on lexicon
         Iterative process






    CLIR Approach [Munteanu and Marcu, 2005]




                                    Figure: CLIR Approach





    Translation Approach [Rauf and Schwenk, 2011]




                               Figure: Translation Approach


29/ 29    Haithem Afli, Lo¨ Barrault and Holger Schwenk
                         ıc                              Parallel text extraction from multimodal comparable corpor

Parallel text extraction from multimodal comparable corpora

  • 1.
    Introduction Existing Works Proposed Approach Conclusion Parallel text extraction from multimodal comparable corpora Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc LIUM, University of Le Maine 72085 Le Mans cedex 9, FRANCE FirstName.LastName@lium.univ-lemans.fr Oct 22, 2012 JapTal 2012, Kanazawa - JAPAN 1/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 2.
    Introduction Existing Works Proposed Approach Conclusion Outline 1 Introduction and Context Statistical Machine Translation Parallel and Comparable Corpora 2 Existing Works Exploiting Comparable Corpora Main Existing Methods 3 Proposed Approach System Architecture Several Issues Task Description Experimental setup Results 4 Conclusion and Discussion 2/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 3.
    Introduction Existing Works Proposed Approach Conclusion Statistical Machine Translation Purpose : text translation Approach : Statistical, given by : t ∗ = arg max P(s|t)P(t) t Modeling Translation Model : P(s|t) Language Model : P(t) Decoding Algorithme : argmax Some open source tools are available like Moses and Joshua ⇒ needs parallel data 3/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 4.
    Introduction Existing Works Proposed Approach Conclusion Parallel Corpora Texts that are translations of each other An essential resource for MT Provide training data for statistical translation models Also useful for other NLP applications Expensive and time consuming to prepare Translate, Sentence Align, ... But limited in Size, Language and Domain ⇒ There are no better data than more data 4/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 5.
    Introduction Existing Works Proposed Approach Conclusion Comparable Corpora Generally not parallel, but overlapping information Readily available Mainly from Newswire AFP, Al JAZEERA, BBC ... Much larger quantities than parallel corpora Multiple languages and Genres Large collections available for NLP tasks e.g. Gigaword corpora from LDC English, Arabic, Chinese, French, Spanish 5/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 6.
    Introduction Existing Works Proposed Approach Conclusion Exploiting comparable corpora Extract parallel documents Using structural information Extending parallel sentence alignement algorithms Extract parallel sentence pairs With sentence alignement algorithms Cross-lingual IR methods Translation aproach 6/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 7.
    Introduction Existing Works Proposed Approach Conclusion Main Existing Methods Webcrawling [Resnik and Smith, 2003] : use URLs to find matching documents Alignment [Brown et al., 1991] : use word alignment models to judge how close a source and a target document (sentence) are Crosslingual IR [Munteanu and Marcu, 2005] : use lexicon to translate source words and apply information retrieval techniques Translation [Rauf and Schwenk, 2011] : use SMT system to translate documents and apply information retrieval 7/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 8.
    Introduction Existing Works Proposed Approach Conclusion Goal : Exploiting multimodal comparable corpora Text Audio Parallel text extraction Parallel text 8/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 9.
    Introduction Existing Works Proposed Approach Conclusion Proposed Approach Build a baseline SMT system (using generic data ) Transcribe the audio data Translate the transcribed text Use translations as queries for IR to find the ”matching” sentences in the target comparable corpus Use TER between SMT translation and the found sentences to detect parallel ones 9/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpora
  • 10.
    Introduction Existing Works Proposed Approach Conclusion System Architecture Multimodal Audio L1 comparable corpora ASR Transc. L1 SMT Bitext Transl. L2 IR Texts L2 Text L2 10/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpor
  • 11.
    Introduction Existing Works Proposed Approach Conclusion Several issues Feasibility : Is the multimodal comparable corpora useful to extract parallel text ? Good quality : Can we get a parallel text generated from multimodal corpora good as the bitext extracted from comparable text ? Effectiveness : since one of our motivations for exploiting comparable corpora is to adapt a SMT system for a specific domain, extracted bitext needs to be useful to improve SMT performance. 11/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpor
  • 12.
    Introduction Existing Works Proposed Approach Conclusion Task description (1) Analyze the impact of the errors of each module ⇒ conducted three different types of experiments Exp 1 : we use the reference translations as queries for the IR system → This is the most favorable condition, it simulates the case where the ASR and the SMT systems do not commit any error. Exp 2 : we use the reference transcription as input to the SMT system → In this case, the errors come only from the SMT system since no ASR is involved. Exp 3 : represents the complete proposed framework → It corresponds to a real scenario. 12/ 29 Haithem Afli, Lo¨ Barrault and Holger Schwenk ıc Parallel text extraction from multimodal comparable corpor
  • 13.
    Task description (2)
      [Figure: data flow of the three experiments.
       Exp 1: TEDbi.FR → IR → French text.
       Exp 2: TEDbi.En → SMT → TEDbi_tran.FR → IR → French text.
       Exp 3: TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR → French text.
       In all three, IR searches the index ccb2 + %TrainTED.fr.]
  • 14.
    Task description (3)
      Importance of the degree of similarity between the two parts of the comparable corpora ⇒ we artificially created four comparable corpora with different degrees of similarity:
        the source part of our comparable corpus is always the same
        the target-language part of the comparable corpus consists of a large generic corpus plus 25%, 50%, 75% and 100%, respectively, of the reference translations
      Evaluation of the approach:
        the extracted parallel data are re-injected into the baseline system
        the systems are evaluated using the BLEU score
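The construction of the four target sides can be sketched as below. The slides do not say how the 25%, 50% and 75% subsets of the reference translations were chosen, so random sampling with a fixed seed is an assumption here, and `build_target_side` is a hypothetical helper name.

```python
import random
from typing import List


def build_target_side(generic: List[str], reference: List[str],
                      fraction: float, seed: int = 0) -> List[str]:
    """Return the generic corpus plus a fraction of the reference translations.

    Assumption: the subset is a random sample; the original selection
    procedure is not described in the slides.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_kept = round(len(reference) * fraction)
    return generic + rng.sample(reference, n_kept)


if __name__ == "__main__":
    generic = ["phrase générique 1", "phrase générique 2"]
    reference = ["ref 1", "ref 2", "ref 3", "ref 4"]
    for fraction in (0.25, 0.50, 0.75, 1.00):
        corpus = build_target_side(generic, reference, fraction)
        print(f"{fraction:.0%}: {len(corpus)} sentences")
```

The source side stays fixed across the four corpora, so only the amount of truly parallel material hidden in the target side varies.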
  • 15.
    Experimental Setup: Data (TED task in IWSLT)
      Training bitexts   # words   In-domain?
      nc7                3.7M      no
      eparl7             56.4M     no
      TEDasr             1.8M      yes
      TEDbi              1.9M      yes
      Development and test
      Dev                # words
      dev.outASR         36k
      dev.refSMT         38k
      Test               # words
      tst.outASR         8.7k
      tst.refSMT         9.1k
  • 16.
    Experimental Setup: Modules
      ASR: a five-pass system based on CMU Sphinx
        WER of about 18%
      SMT: a phrase-based system built with the Moses SMT toolkit
        trained on generic bitext only
        word alignments in both directions are calculated using GIZA++
        phrases and lexical reordering are extracted using the default settings of the Moses toolkit
        the parameters were tuned on dev.outASR using the MERT tool
      IR: a system based on the Lemur IR toolkit
        all target-language (French) text data are indexed
        the translated source-language (English) text is transformed into queries using the Indri Query Language
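As an illustration of the last IR step, a translated sentence can be turned into a bag-of-words Indri query with the `#combine` operator. The slides do not give the exact query form the authors used, so the tokenisation and the choice of `#combine` are assumptions, and `to_indri_query` is a hypothetical helper.

```python
import re


def to_indri_query(sentence: str) -> str:
    """Build a bag-of-words Indri query from a translated sentence.

    Assumption: lowercase the text and keep only word characters, so that
    punctuation cannot be read as query syntax.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return ""
    return "#combine(" + " ".join(tokens) + ")"


if __name__ == "__main__":
    print(to_indri_query("J'ai écrit un article sur la nourriture."))
```

Each query is then run against the index of French text, and the top-ranked sentences become candidates for the TER filter.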
  • 17.
    Results
      Table: BLEU scores on dev and test after adaptation of a baseline system with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi).
      Experiment        Dev     Test
      Baseline system   22.93   23.96
      Exp1              24.14   25.14
      Exp2              23.90   25.15
      Exp3              23.40   24.69
      Extracted sentences do improve the SMT system.
      The BLEU score of the adapted system matches that of Exp1 in most cases
      ⇒ errors introduced by the SMT and ASR systems have no major impact on the performance of the parallel sentence extraction algorithm.
  • 18.
    Results
      Table: BLEU scores for different degrees of parallelism of the comparable corpus.
      Experiment        Dev     Test    # injected words
      Baseline system   22.93   23.96   -
      25% TEDbi         23.11   24.40   ∼110k
      50% TEDbi         23.27   24.58   ∼215k
      75% TEDbi         23.43   24.42   ∼293k
      100% TEDbi        23.40   24.69   ∼393k
      The degree of similarity of the comparable corpus matters for both the performance of the extraction process and the quality of the extracted parallel sentences.
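To make the table easier to read, the gain of each adapted system over the baseline can be computed directly from the reported scores. The snippet below only does that arithmetic; the variable and function names are illustrative.

```python
from typing import Dict

Scores = Dict[str, float]

BASELINE: Scores = {"dev": 22.93, "test": 23.96}
ADAPTED: Dict[str, Scores] = {
    "25% TEDbi": {"dev": 23.11, "test": 24.40},
    "50% TEDbi": {"dev": 23.27, "test": 24.58},
    "75% TEDbi": {"dev": 23.43, "test": 24.42},
    "100% TEDbi": {"dev": 23.40, "test": 24.69},
}


def bleu_gains(baseline: Scores, adapted: Dict[str, Scores]) -> Dict[str, Scores]:
    """BLEU difference between each adapted system and the baseline."""
    return {
        name: {split: round(score - baseline[split], 2)
               for split, score in scores.items()}
        for name, scores in adapted.items()
    }


GAINS = bleu_gains(BASELINE, ADAPTED)

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name}: dev +{gain['dev']:.2f}, test +{gain['test']:.2f}")
```

On the test set the gains are 0.44, 0.62, 0.46 and 0.73 BLEU points for 25%, 50%, 75% and 100% TEDbi respectively, so the largest improvement comes with the most parallel corpus, although the progression is not strictly monotonic.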
  • 19.
    Results
      e.g. 1 (domain adaptation)
        Source sentence: i wrote a story about genetically engineered food
        Baseline Sys: j'ai écrit un article sur la nourriture génétiquement modifiée
        Adapted Sys: j'ai écrit un article sur les produits alimentaires génétiquement modifiés
      e.g. 2 (oral vocabulary correction)
        Source sentence: yeah you're right let's fix it
        Baseline Sys: yeah tu as raison de réparer
        Adapted Sys: euh oui tu as raison il faut réparer
  • 20.
    Conclusion
      We proposed extending the exploitation of comparable corpora to multimodal comparable corpora, i.e. the source side is available as audio and the target side as text.
      The result is encouraging: we automatically aligned source audio in one language with texts in another language, without human intervention to transcribe or translate the data.
      We were able to adapt a generic SMT system to the task of lecture translation by extracting parallel data from a multimodal comparable corpus.
  • 21.
    Perspectives
      Apply this approach at a much larger scale, i.e. using hundreds of hours of speech and hundreds of millions of words.
      Work on different specific domains or subdomains.
      Iterate the process, using the extracted bitexts to translate the source sentences again.
      Calculate the degree of similarity of the corpus before using it.
  • 22.
    Brown, P. F., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th annual meeting on ACL, pages 169–176.
    Munteanu, D. S. and Marcu, D. (2005). Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.
    Rauf, S. A. and Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4):341–375.
    Resnik, P. and Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29:349–380.
  • 23.
    Thank you
  • 24.
    Results (1)
      [Figure, left: BLEU score on dev (y-axis) against TER threshold (x-axis, 0 to 100) for Exp1, Exp2 and Exp3, using SMT systems adapted with bitexts extracted from the ccb2 + 100% TEDbi index corpus.]
      [Figure, right: same axes and curves, using SMT systems adapted with bitexts extracted from the ccb2 + 75% TEDbi index corpus.]
      The choice of the appropriate TER threshold depends on the type of data.
  • 25.
    Crawling the Web [Resnik and Smith, 2003]
      Search for web pages with similar URLs.
      Many companies and organizations have their web pages in multiple languages.
      Pages are identified by language ID, e.g. http://x.../y.../z.en and http://x.../y.../z.fr
      Pages have links to parallel pages.
      A web crawler exploits this structural information.
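The URL heuristic can be illustrated with a small sketch that pairs pages whose addresses differ only in a trailing two-letter language ID. It covers only that one pattern and does not reproduce the system of Resnik and Smith; `pair_by_language_suffix` and the example.org URLs are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: the language ID is a two-letter suffix after the last dot.
LANG_SUFFIX = re.compile(r"^(?P<stem>.+)\.(?P<lang>[a-z]{2})$")


def pair_by_language_suffix(urls: Iterable[str], src: str = "en",
                            tgt: str = "fr") -> List[Tuple[str, str]]:
    """Pair URLs that share a stem and end in the source and target language IDs."""
    by_stem: Dict[str, Dict[str, str]] = defaultdict(dict)
    for url in urls:
        match = LANG_SUFFIX.match(url)
        if match:
            by_stem[match.group("stem")][match.group("lang")] = url
    return [(langs[src], langs[tgt])
            for _, langs in sorted(by_stem.items())
            if src in langs and tgt in langs]


if __name__ == "__main__":
    crawled = [
        "http://example.org/a/z.en",
        "http://example.org/a/z.fr",
        "http://example.org/b/y.en",
    ]
    print(pair_by_language_suffix(crawled))
```

Pages without a counterpart in the other language, like the third URL here, are simply left unpaired.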
  • 26.
    Alignment Approach [Brown et al., 1991]
      Train an initial lexicon on parallel data.
      Use the lexicon to calculate an alignment score between documents (or sentences).
        Typically IBM1
      Select the most reliable document (sentence) pairs.
      Add them to the parallel training data and retrain -> bootstrapping
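The scoring step can be sketched with IBM Model 1, where each target word is explained by the average lexical probability over the source words plus a NULL word. This shows scoring only, not lexicon training; the length normalisation, the probability floor and the toy lexicon are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

# Maps (target_word, source_word) to the lexical probability t(target | source).
Lexicon = Dict[Tuple[str, str], float]


def ibm1_log_score(src: List[str], tgt: List[str], lexicon: Lexicon,
                   floor: float = 1e-9) -> float:
    """Length-normalised log P(tgt | src) under IBM Model 1.

    The constant factor epsilon is dropped, and unseen word pairs are
    floored so that the logarithm stays defined (both are assumptions).
    """
    if not tgt:
        raise ValueError("target sentence must not be empty")
    src_with_null = ["NULL"] + src
    total = 0.0
    for target_word in tgt:
        prob = sum(lexicon.get((target_word, source_word), 0.0)
                   for source_word in src_with_null) / len(src_with_null)
        total += math.log(max(prob, floor))
    return total / len(tgt)


if __name__ == "__main__":
    toy_lexicon = {("la", "the"): 1.0, ("maison", "house"): 1.0}
    print(ibm1_log_score(["the", "house"], ["la", "maison"], toy_lexicon))
    print(ibm1_log_score(["the", "house"], ["chien"], toy_lexicon))
```

Pairs whose score exceeds a chosen cut-off would be added to the training data before the lexicon is retrained.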
  • 27.
    Finding Comparable Documents [Zhao and Vogel, 2002]
      Given comparable documents, find (nearly) parallel sentences.
      Xinhua News Agency publishes news in English and Chinese.
      Calculate similarity based on a lexicon.
      Iterative process.
  • 28.
    CLIR Approach [Munteanu and Marcu, 2005]
      [Figure: CLIR Approach]
  • 29.
    Translation Approach [Rauf and Schwenk, 2011]
      [Figure: Translation Approach]