
Understanding Machine Translation and the Challenge of Patents


Delivered at the annual conference of the Patent Information Users Group (PIUG Annual Conference 2013)
April 29th 2013
Alexandria, VA, USA.

Published in: Technology

  1. Understanding Machine Translation and the Challenge of Patents
     Dr. John Tinsley, CEO, IPTranslator
     PIUG Annual Conference 2013, Alexandria, VA, April 29th
  2. PIUG Annual Conference, Alexandria, April 29, 2013
     The need for translation
     Accelerating global growth in the volume of patents:
     - 10.7% increase in PCT applications in 2011
     - China: +33.4%
     - Japan: +21%
  3. Why listen to me?
     Machine translation is what I do!
     - BSc in Computational Linguistics
     - PhD in Machine Translation (DCU, CMU)
     - Software Engineer for MT (CNGL)
     - Founder of IPTranslator
  4. Machine Translation: The Basics
     Machine Translation = automatic translation
     - The use of computers to translate from one language into another
     - The use of computers to automate some, or all, of the translation process
     Statistical Machine Translation (SMT)
     - An approach to Machine Translation in which translations for an input are estimated from previously seen translation examples and their associated (inferred) probabilities
     - e.g. IPTranslator, Google Translate
     Previous approaches:
     - Rule-based (or transfer-based): based on linguistic rules, e.g. Systran; Altavista’s Babelfish
     - Example-based: based on translation examples and inferred linguistic patterns
     SMT is now by far the predominant approach.
  5. Bilingual Corpora
     A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
     - document(s)
     - book(s)
     A bilingual corpus is a collection of corresponding texts, in multiple languages
     - A document and its translation
     - A book in multiple languages
     - The European Parliament proceedings
     Note:
     - source language = original language, or the language we’re translating from
     - target language = the language we’re translating into
     [Figure label: a bilingual corpus]
  6. Aligned Bilingual Corpora
     A document-aligned bilingual corpus corresponds on a document level.
     For translation, we require sentence-aligned bilingual corpora:
     - The sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, and so on
     - Often referred to as parallel aligned corpora
     Sentence-aligned bilingual parallel corpora are essential for statistical machine translation.
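As a rough illustration of the line-by-line correspondence described on slide 6, the sketch below pairs two lists of sentences by position. The function name `align_sentences` is hypothetical, and the sentence pairs are borrowed from other slides in this deck; real corpora are usually stored as two files read line by line.

```python
def align_sentences(source_lines, target_lines):
    """Pair line i of the source text with line i of the target text."""
    # A sentence-aligned corpus must have exactly one target line per source line.
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    return list(zip(source_lines, target_lines))


# English (source) and Spanish (target) sentences taken from slides 7 and 16.
source_lines = [
    "I have a cat",
    "A group of rival companies seek sanctions against Google",
]
target_lines = [
    "Tengo un gato",
    "Un grupo de compañías rivales pide sanciones contra Google",
]

pairs = align_sentences(source_lines, target_lines)
for src, tgt in pairs:
    print(f"{src}  <->  {tgt}")
```

The length check is the only validation here; it cannot detect a pair of lines that line up by position but are not actually translations of each other.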
  7. Learning From Previous Translations
     Suppose we already know (from a sentence-aligned bilingual corpus) that:
     - “dog” is translated as “perro”
     - “I have a cat” is translated as “Tengo un gato”
     We can theoretically translate:
     - “I have a dog” -> “Tengo un perro”
     - Even though we have never seen “I have a dog” before
     Statistical machine translation induces information about unseen input, based on previously known translations
     - Primarily co-occurrence statistics
     - Takes contextual information into account
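The reasoning on slide 7 can be sketched as a simple substitution: find a known sentence that differs from the input by one word, then swap in the known translation of that word. This is only a toy illustration, not how an SMT system works internally. The function name is hypothetical, and the entry "cat" -> "gato" is an assumption added here (the slide leaves that correspondence implicit).

```python
def translate_by_substitution(sentence, known_sentences, lexicon):
    """Translate a sentence that differs by one word from a known sentence."""
    words = sentence.split()
    for src, tgt in known_sentences.items():
        src_words = src.split()
        if len(src_words) != len(words):
            continue
        # Positions where the input differs from the known source sentence.
        diffs = [i for i, (a, b) in enumerate(zip(src_words, words)) if a != b]
        if len(diffs) != 1:
            continue
        old, new = src_words[diffs[0]], words[diffs[0]]
        tgt_words = tgt.split()
        # We need translations of both the old and the new word.
        if old in lexicon and new in lexicon and lexicon[old] in tgt_words:
            return " ".join(
                lexicon[new] if w == lexicon[old] else w for w in tgt_words
            )
    return None  # nothing similar enough has been seen before


known_sentences = {"I have a cat": "Tengo un gato"}
lexicon = {"dog": "perro", "cat": "gato"}  # "cat" -> "gato" is assumed

print(translate_by_substitution("I have a dog", known_sentences, lexicon))
```

An input with no sufficiently similar known sentence returns `None`, which is the sparseness problem in miniature: nothing can be said about material that has never been seen.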
  8. Statistical Machine Translation
     - Example of a small sentence-aligned bilingual corpus for English-French
  9. Statistical Machine Translation
     - We take some new input to translate
  10. Statistical Machine Translation
      - We take some new input to translate
      - From the corpus we can infer possible target (French) translations for various source (English) words
      - We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
  11. Statistical Machine Translation
      Given a previously unseen input sentence, and our collated statistics, we can estimate a translation.
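The corpus shown on slides 8 to 11 is not preserved in this transcript, so the sketch below uses a small invented English-French corpus to illustrate the idea of picking translations from co-occurrence statistics. Raw counts alone tend to tie or to favour frequent words, so this sketch scores candidates with the Dice coefficient, which is a choice made here for illustration and not something the slides specify. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def train(corpus):
    """Count how often words occur, and co-occur, across aligned sentence pairs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the same pair.
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_translation(word, model):
    """Return the target word with the highest Dice score for a source word."""
    src_count, tgt_count, pair_count = model
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else word


def translate(sentence, model):
    """Word-by-word lookup; deliberately ignores word order."""
    return " ".join(best_translation(w, model) for w in sentence.split())


# Invented toy corpus (not the one from the slides).
corpus = [
    ("the house", "la maison"),
    ("the blue flower", "la fleur bleue"),
    ("a house", "une maison"),
    ("a flower", "une fleur"),
]
model = train(corpus)

# "a blue house" never appears in the corpus.
print(translate("a blue house", model))
```

The lookup finds a plausible word for each input word but leaves them in English order ("une bleue maison" where French would put the adjective after the noun), which is why real systems also model phrases and word order.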
  12. Advanced Modelling
      All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation.
      The previous example is very simplistic:
      - In reality, SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
      - Upwards of 2M sentence pairs on average for large-scale systems
      Statistics are calculated to represent:
      - Word-to-word translation probabilities
      - Phrase-to-phrase translation probabilities
      - Word order probabilities
      - Structural information (i.e. syntactic information)
      - Fluency of the final output
  13. Data is Key
      For SMT, data is key:
      - Information (word/phrase correspondences and associated statistics) is based only on what we have seen before in the data
      It is important that the data used to train SMT systems is:
      - Of sufficient size: avoid sparseness/skewed statistics
      - Representative and relevant: contains the right type of language
      - High-quality: absence of misspellings, incorrect alignments etc.
      - Proofed by human translators
      [Figure label: training data]
  14. Why is MT Difficult?
      A word or a phrase can have more than one meaning (ambiguity – lexical or structural)
      - E.g. “bank”, “dive”; “I saw the man with the telescope”
      People use language creatively
      - New words are cropping up all the time
      Linguistic differences between languages
      - E.g. structure of Irish sentences vs. structure of English sentences: “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
      There can be more than one way to express the same meaning
      - “New York”, “The Big Apple”, “NYC”
  15. Why is MT Difficult?
      - Israeli officials are responsible for airport security.
      - Israel is in charge of the security at this airport.
      - The security work for this airport is the responsibility of the Israel government.
      - Israeli side was in charge of the security of this airport.
      - Israel is responsible for the airport’s security.
      - Israel is responsible for safety work at this airport.
      - Israel presides over the security of the airport.
      - Israel took charge of the airport security.
      - The safety of this airport is taken charge of by Israel.
      - This airport’s security is the responsibility of the Israeli security officials.
  16. Not all languages are created equal
      It’s easier to translate between some language pairs than others.
      - A group of rival companies seek sanctions against Google
        Un grupo de compañías rivales pide sanciones contra Google
      - We believe that the delegates will make their decision after a long debate
        Wir glauben dass die Delegierten ihre Entscheidung nach einer langen Debatte treffen
      - Thank you very much
        Go raibh míle maith agat (Lit: May you have a thousand good things)
  17. The Challenge of Patents
      - Long sentences
      - Complex constructions
      Examples:
      - L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3
      - … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
  18. The Challenge of Patents
      - Very long sentences as standard
      - Grammatically incomplete, using nominal and telegraphic style (!)
      - Passive forms are frequent
      - Frequent use of subordinate clauses, participles, implicit constructs
      - Inconsistent and incorrect spelling
      - High use of neologisms
      - Instances of synonymy and polysemy
      - Spurious use of punctuation
      Callouts:
      • Authoring guide for “to be translated” text
      • Patents break almost all of the rules!
      • “Thanks, guys(!)”
  19. Evaluating Machine Translation Quality
      Automatic Evaluation
      Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
      - Pros: quick, cheap, consistent
      - Cons: inflexible, cannot be used on ‘new’ input
      Human Evaluation
      Assessment of output by a bilingual evaluator
      - Pros: reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
      - Cons: slow, expensive, subjective
      Task Based Evaluation
      Fluency vs. Adequacy
  20. Evaluating Machine Translation Quality
      Task Based Evaluation
      - Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
      - To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
      - Why?
      Fluency vs. Adequacy
      - Fluency: how fluent and grammatically correct the translation output is
      - Adequacy: how accurately the translation conveys the meaning of the source
      Example:
      - Source: La gran casa roja
      - Output 1: The big blue house
      - Output 2: The big house red
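To make the automatic evaluation idea from slides 19 and 20 concrete, the sketch below scores the two example outputs against a reference using clipped n-gram precision, a simplified ingredient of metrics such as BLEU rather than a full metric. The reference translation "The big red house" is an assumption added here (the slide shows only the source and the two outputs), and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit to the number of times the reference has it.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "The big red house"  # assumed human reference for "La gran casa roja"
output_1 = "The big blue house"  # fluent, but the colour is wrong
output_2 = "The big house red"   # right words, wrong order

for name, output in [("Output 1", output_1), ("Output 2", output_2)]:
    p1 = ngram_precision(output, reference, 1)
    p2 = ngram_precision(output, reference, 2)
    print(f"{name}: unigram precision {p1:.2f}, bigram precision {p2:.2f}")
```

On single words Output 2 scores higher (every word is in the reference), while on word pairs the two outputs score the same. A surface comparison like this does not separate fluency from adequacy on its own, which is consistent with the slide's point that such scores need to be complemented by human and task-based evaluation.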
  21. Practical uses of Machine Translation
      Understand its limitations and you’ll understand its capabilities!
      No:
      • Translate a patent for filing
      • Translate literature for publication
      • Translate marketing materials
      Yes:
      • Productivity tool for professional translation
      • Understand foreign patents
      • Localisation processes and “controlled” content
  22. Thank you!
      Dr. John Tinsley
      john@iptranslator.com
      @IPTranslator
  23. German Verb Movement
      We like that Götze scored a goal in the final.
      Uns gefällt, dass Götze ein Tor im Finale geschossen hat.
      (we like that Götze a goal in the final scored has)
  24. Sentence: 这是一篇有趣的文章
      Words: 这是 一篇 有趣 的 文章
      (zhèshì yīpiān yǒuqù de wénzhāng)
      (This is an interesting article)
      种水果的农民
      The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
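Slide 24 shows that Chinese text has to be split into words before it can be translated, because the written sentence has no spaces. One simple way to sketch this is forward maximum matching against a word list: at each position, take the longest dictionary word that fits. This technique and the tiny dictionary below are assumptions chosen for illustration; the slide does not say which segmentation method is used.

```python
def segment(sentence, dictionary, max_word_len=4):
    """Greedy forward maximum matching word segmentation."""
    words = []
    pos = 0
    while pos < len(sentence):
        # Try the longest candidate first, shrinking until a word is found.
        for length in range(min(max_word_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + length]
            # Fall back to a single character if nothing in the dictionary fits.
            if candidate in dictionary or length == 1:
                words.append(candidate)
                pos += length
                break
    return words


# Toy dictionary containing only the words from the slide's example.
dictionary = {"这是", "一篇", "有趣", "的", "文章"}

print(" ".join(segment("这是一篇有趣的文章", dictionary)))
```

Greedy matching can pick the wrong split when several dictionary words overlap, so it should be read as a baseline illustration of the problem and not as a recommended method.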
  25. English: “Software”
      - Simplified: 软件
      - Traditional: 軟體
      English: “Network”
      - Simplified: 网络
      - Traditional: 網路
      Russian word order:
      (A) Я пошёл в магазин: I went to the shop
      (B) В магазин пошёл я: I went to the shop
      (C) Пошёл я в магазин: I went to the shop