“Machine Translation 101”
And The Challenge of Patents
John Tinsley
Director / Co-Founder
EPOPIC. 5th Nov 2014, Warsaw
The need for translation
50% of all PCT applications in 2013 came from Asia
Why listen to me?
§  BSc in Computational Linguistics
§  PhD in Machine Translation
§  Language Technology consultant
§  Founder of Iconic Translation Machines
Machine Translation is what I do!
The world’s first and only patent-specific machine translation system
Machine Translation: The Basics
Machine Translation = automatic translation
§  The use of computers to translate from one language into another
§  The use of computers to automate some, or all, of the translation process
Statistical Machine Translation (SMT)
§  An approach to Machine Translation, where translations for an input are estimated based on previously seen translation examples and associated (inferred) probabilities.
§  e.g. IPTranslator, Google Translate
Other approaches
§  Rule-based (or transfer-based): based on linguistic rules
•  e.g. Systran; Altavista’s Babelfish
§  Example-based: based on translation examples and inferred linguistic patterns
SMT is now by far the predominant approach
Bilingual Corpora
A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
§  document(s)
§  book(s)
A bilingual corpus is a collection of corresponding texts, in multiple languages
§  a document & its translation
§  a book in multiple languages
§  European Parliament proceedings
Note: source language = original language, or the language we’re translating from
target language = the language we’re translating into
Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds on a document
level
For translation, we require sentence-aligned bilingual corpora
§  The sentence on line 1 in the source language text corresponds
to (i.e. is a translation of) the sentence on line 1 in the target
language text etc.
§  Often referred to as parallel aligned corpora
Sentence aligned bilingual parallel corpora
are essential for statistical machine translation
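A minimal sketch of what "sentence-aligned" means in practice: line i on the source side is paired with line i on the target side. The function name `pair_sentences` and the two example sentences are assumptions made for illustration, not part of any particular toolkit.

```python
from typing import List, Tuple


def pair_sentences(source_lines: List[str], target_lines: List[str]) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text."""
    if len(source_lines) != len(target_lines):
        # A sentence-aligned corpus must have one target line per source line.
        raise ValueError("source and target must have the same number of lines")
    return [(s.strip(), t.strip()) for s, t in zip(source_lines, target_lines)]


# Illustrative data only (hypothetical English-Spanish lines).
english = ["I have a cat", "the dog"]
spanish = ["Tengo un gato", "el perro"]

corpus = pair_sentences(english, spanish)
print(corpus[0])
```

A length mismatch is treated as an error here because every later statistic relies on the line-by-line correspondence.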
Learning from Previous Translations
Suppose we already know
(from a sentence-aligned bilingual
corpus) that:
§  “dog” is translated as “perro”
§  “I have a cat” is translated as
“Tengo un gato”
We can theoretically translate:
§  “I have a dog” → “Tengo un perro”
§  Even though we have never seen “I
have a dog” before
Statistical machine translation induces information about unseen input, based on
previously known translations:
§  Primarily co-occurrence statistics
§  Takes contextual information into account
Statistical Machine Translation
§  Example of a small sentence-aligned bilingual corpus for English-French
§  We take some new sentence to translate
§  From the corpus we can infer possible target (French) translations for various source (English) words
§  We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
§  Given a previously unseen input sentence, and our collated statistics, we can estimate a translation
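The idea of picking translations from co-occurrence statistics can be sketched on a toy corpus. Everything here is an assumption for illustration: the three English-French sentence pairs are invented, and the Dice coefficient is used as one simple way to normalise raw co-occurrence counts (raw counts alone would tie "house" between "la" and "maison"). Real SMT systems estimate far richer models.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned English-French corpus.
corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for src, tgt in corpus:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)                       # sentences containing each source word
    tgt_count.update(tgt_words)                       # sentences containing each target word
    pair_count.update(product(src_words, tgt_words))  # sentence pairs containing both words


def dice(s: str, t: str) -> float:
    """Dice coefficient: co-occurrence normalised by how common both words are."""
    return 2 * pair_count[(s, t)] / (src_count[s] + tgt_count[t])


def best_translation(s: str) -> str:
    candidates = [t for (s2, t) in pair_count if s2 == s]
    if not candidates:
        return s  # unknown word: pass it through unchanged
    return max(candidates, key=lambda t: dice(s, t))


def translate(sentence: str) -> str:
    """Word-by-word translation; no reordering is attempted."""
    return " ".join(best_translation(w) for w in sentence.split())


print(translate("the blue flower"))
```

The sentence "the blue flower" never appears in the toy corpus, yet each word gets a plausible translation. The output keeps English word order ("la bleue fleur" rather than "la fleur bleue"), which shows why word-by-word statistics alone are not enough.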
Advanced MT
Previous example is very simplistic
§  In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
§  Upwards of 2M sentence pairs on average for large-scale systems
Other statistics calculated include
§  Word-to-word translation probabilities
§  Phrase-to-phrase translation probabilities
§  Word order probabilities
§  Linguistic information (are the words nouns, verbs?)
§  Fluency of the final output
All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation
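One common way to combine several statistics of this kind is a weighted sum of log-probabilities, with the highest-scoring candidate selected. The sketch below assumes that setup; the feature names, weights, candidate sentences and probabilities are all invented for illustration and do not come from any trained system.

```python
import math

# Hypothetical feature weights (in practice these are tuned on held-out data).
WEIGHTS = {"translation": 1.0, "word_order": 0.5, "fluency": 1.0}


def score(features: dict) -> float:
    """Weighted sum of log-probabilities; higher is better."""
    return sum(WEIGHTS[name] * math.log(p) for name, p in features.items())


# Two candidate French outputs for "the blue flower", with made-up probabilities.
candidates = {
    "la fleur bleue": {"translation": 0.6, "word_order": 0.4, "fluency": 0.5},
    "la bleue fleur": {"translation": 0.6, "word_order": 0.6, "fluency": 0.05},
}

best = max(candidates, key=lambda c: score(candidates[c]))
print(best)
```

Both candidates use the same word translations, so the fluency statistic is what separates them in this toy setup.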
Data is Key
For SMT data is key
§  Information (word/phrase correspondences and associated statistics) is only based
on what we have seen before in the data
Important that data used to train SMT systems is:
§  Of sufficient size
•  avoid sparseness/skewed statistics
§  Representative and relevant
•  contains the right type of language
§  High-quality
•  absence of misspellings, incorrect alignments etc.
•  Proofed by human translators
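Some of these quality checks can be automated before training. The following is a rough sketch, assuming three simple heuristics: drop pairs with an empty side, drop exact duplicates, and drop pairs whose word counts differ so much that a misalignment is likely. The function name `clean_pairs` and the threshold `max_ratio=3.0` are hypothetical choices, not recommended values.

```python
from typing import List, Tuple


def clean_pairs(pairs: List[Tuple[str, str]], max_ratio: float = 3.0) -> List[Tuple[str, str]]:
    """Keep only sentence pairs that pass basic sanity checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is missing
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much: likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate skews the statistics
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Illustrative data only.
raw = [
    ("the house", "la maison"),
    ("the house", "la maison"),                              # duplicate
    ("the flower", ""),                                      # empty target
    ("yes", "la maison bleue est très grande aujourd'hui"),  # 1 word vs 7 words
]

print(clean_pairs(raw))
```

Heuristics like these catch only the crudest problems; misspellings and subtle misalignments still need human review.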
Why is MT Difficult?
A word or a phrase can have more than one meaning (ambiguity – lexical or
structural)
§  e.g. “bank”, “dive”, “I saw the man with the telescope”
People use language creatively
§  New words are cropping up all the time
Linguistic differences between languages
§  e.g. structure of Irish sentences vs. structure of English sentences:
§  “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning.
§  “New York”, “The Big Apple”, “NYC”
Why is MT Difficult?
§  Israeli officials are responsible for airport security.
§  Israel is in charge of the security at this airport.
§  The security work for this airport is the responsibility of the Israel government.
§  Israeli side was in charge of the security of this airport.
§  Israel is responsible for the airport’s security.
§  Israel is responsible for safety work at this airport.
§  Israel presides over the security of the airport.
§  Israel took charge of the airport security.
§  The safety of this airport is taken charge of by Israel.
§  This airport’s security is the responsibility of the Israeli security officials.
No single solution for all languages
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
No single solution for all languages
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
The Challenge of Patents
Technical constructions
§  L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …
§  maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
Long Sentences
§  Largest single document: 249,322 words
§  Longest Sentence: 1,417 words
The Challenge of Patents
§  Very long sentences as standard
§  Grammatically incomplete using nominal and telegraphic style (!)
§  Passive forms are frequent
§  Frequent use of subordinate clauses, participles, implicit constructs
§  Inconsistent and incorrect spelling
§  High use of neologisms
§  Instances of synonymy and polysemy
§  Spurious use of punctuation
Authoring guide for “to be translated” text
Patents break almost all of the rules!
Evaluating Machine Translation Quality
Automatic Evaluation
§  Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
§  Pros: Quick, cheap, consistent
§  Cons: Inflexible, cannot be used on ‘new’ input
Human Evaluation
§  Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
§  Cons: Slow, expensive, subjective
Task-Based Evaluation
§  Fluency vs. Adequacy
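Comparing MT output against a human reference can be sketched as n-gram overlap. This is a deliberately simplified illustration, not a full metric such as BLEU: it computes only the fraction of output n-grams that also occur in a single reference. The reference sentence below is an assumed human translation, chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(output: str, reference: str, n: int) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    out = ngrams(output.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(out.values())
    if total == 0:
        return 0.0
    # Clip each match by how often the n-gram occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in out.items())
    return matched / total


reference = "The big red house"  # assumed human reference translation
for output in ("The big blue house", "The big house red"):
    print(output, ngram_precision(output, reference, 1), ngram_precision(output, reference, 2))
```

The scores are quick and repeatable, but they only say how closely the output matches this one reference. That is the inflexibility noted among the cons.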
Evaluating Machine Translation Quality
Task Based Evaluation
§  Standalone evaluation of MT systems is necessary to get a sense of the
overall quality of a system
§  To determine the ultimate usability of an MT system, extrinsic (task-based) evaluation is required
§  Why? Fluency vs. Adequacy
Fluency: how fluent and grammatically correct the translation output is
Adequacy: how accurately the translation conveys the meaning of the source
Source: La gran casa roja
Output 1: The big blue house (fluent, but not adequate: “roja” means red)
Output 2: The big house red (adequate, but not fluent)
Practical uses of Machine Translation
Understand its limitations and you’ll understand
its capabilities!
No
§  Translate a patent for filing
§  Translate literature for
publication
§  Translate marketing materials
§  Anything mission critical
without review
Yes
§  Productivity tool for
professional translation
§  Understand foreign patents
§  Localisation processes and “controlled” content
§  High volume, e.g. eDiscovery
We provide Machine Translation
solutions with Subject Matter Expertise
We do this using Linguistic Engineering
An “ensemble” MT architecture
Data Engineering
What is Linguistic Engineering?
[Diagram labels: Input, Pre-processing, Training Data, Post-processing, Output]
Data Engineering + Linguistic Engineering
An “ensemble” architecture
[Diagram labels:]
§  Chinese pre-ordering rules
§  Statistical Post-editing
§  Input
§  Output
§  Training Data
§  Spanish med-device entity recognizer
§  Multi-output Combination
§  Korean pharma tokenizer
§  Patent input classifier
§  Client TM/terminology (optional)
§  Japanese script normalisation
§  German Compounding rules
§  Engines: Moses, RBMT, Moses, Moses
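The general shape of a pipeline with pre-processing and post-processing around a translation engine can be sketched as below. This is only an illustration of chaining steps: every function here (`build_pipeline`, `normalise_whitespace`, `toy_engine`, `capitalise`) is a hypothetical stand-in. It does not reproduce the components named on the slide, nor the combination of several engines.

```python
from typing import Callable, List

Step = Callable[[str], str]


def build_pipeline(pre: List[Step], engine: Step, post: List[Step]) -> Step:
    """Chain pre-processing steps, one engine, and post-processing steps."""
    def run(text: str) -> str:
        for step in pre:
            text = step(text)
        text = engine(text)
        for step in post:
            text = step(text)
        return text
    return run


# Hypothetical stand-ins for real components.
def normalise_whitespace(text: str) -> str:
    return " ".join(text.split())


def toy_engine(text: str) -> str:
    lexicon = {"la": "the", "maison": "house"}  # toy word lookup, not a real MT engine
    return " ".join(lexicon.get(word, word) for word in text.split())


def capitalise(text: str) -> str:
    return text[:1].upper() + text[1:]


translate = build_pipeline([normalise_whitespace], toy_engine, [capitalise])
print(translate("  la   maison "))
```

Keeping each step as a plain function makes it straightforward to swap in language-specific or domain-specific steps without touching the engine.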
What is the value for users?
Specialist solutions deliver more useable outcomes for the user
§  Post-editing = Increased productivity
§  For information purposes = Extract more meaning
§  Multilingual search = Retrieve more relevant results
How this impacts translation quality
[Bar chart: Iconic vs. Google vs. Systran, Portuguese to English; axis scale 0 to 50]
Thank You!
john@iptranslator.com
@IconicTrans

To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 

Recently uploaded (20)

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

"Machine Translation 101" and the Challenge of Patents

  • 1. “Machine Translation 101” And The Challenge of Patents John Tinsley Director / Co-Founder EPOPIC. 5th Nov 2014, Warsaw
  • 2. The need for translation 50% of all PCT applications in 2013 came from Asia
  • 3. Why listen to me? Machine Translation is what I do! BSc in Computational Linguistics; PhD in Machine Translation; Language Technology consultant; Founder of Iconic Translation Machines. The world’s first and only patent-specific machine translation system
  • 4. Machine Translation: The Basics. Machine Translation = automatic translation: the use of computers to translate from one language into another; the use of computers to automate some, or all, of the translation process. Statistical Machine Translation (SMT): an approach to Machine Translation where translations for an input are estimated based on previously seen translation examples and associated (inferred) probabilities, e.g. IPTranslator, Google Translate. Other approaches: rule-based (or transfer-based), based on linguistic rules, e.g. Systran, Altavista’s Babelfish; example-based, based on translation examples and inferred linguistic patterns. SMT is now by far the predominant approach
  • 5. Bilingual Corpora. A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language: document(s), book(s). A bilingual corpus is a collection of corresponding texts, in multiple languages: a document & its translation; a book in multiple languages; European Parliament proceedings. Note: source language = original language or language we’re translating from; target language = language we’re translating into
  • 6. Aligned Bilingual Corpora. A document-aligned bilingual corpus corresponds on a document level. For translation, we require sentence-aligned bilingual corpora: the sentence on line 1 in the source language text corresponds to (i.e. is a translation of) the sentence on line 1 in the target language text, etc. Often referred to as parallel aligned corpora. Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
  • 7. Learning from Previous Translations. Suppose we already know (from a sentence-aligned bilingual corpus) that: “dog” is translated as “perro”; “I have a cat” is translated as “Tengo un gato”. We can theoretically translate: “I have a dog” → “Tengo un perro”, even though we have never seen “I have a dog” before. Statistical machine translation induces information about unseen input, based on previously known translations: primarily co-occurrence statistics; takes contextual information into account
  • 8. Statistical Machine Translation §  Example of a small sentence-aligned bilingual corpus for English-French
  • 9. Statistical Machine Translation §  We take some new sentence to translate
  • 10. Statistical Machine Translation §  From the corpus we can infer possible target (French) translations for various source (English) words §  We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
  • 11. Statistical Machine Translation Given a previously unseen input sentence, and our collated statistics, we can estimate translation
  • 12. Advanced MT. Previous example is very simplistic. All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation. In reality SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages; upwards of 2M sentence pairs on average for large-scale systems. Other statistics calculated include: word-to-word translation probabilities; phrase-to-phrase translation probabilities; word order probabilities; linguistic information (are the words nouns, verbs?); fluency of the final output
  • 13. Data is Key. For SMT data is key: information (word/phrase correspondences and associated statistics) is only based on what we have seen before in the data. Important that the training data for SMT systems is: of sufficient size (avoid sparseness/skewed statistics); representative and relevant (contains the right type of language); high-quality (absence of misspellings, incorrect alignments etc.; proofed by human translators)
  • 14. Why is MT Difficult? A word or a phrase can have more than one meaning (ambiguity – lexical or structural) §  e.g. “bank”, “dive”, “I saw the man with the telescope” People use language creatively §  New words are cropping up all the time Linguistic differences between languages §  e.g. structure of Irish sentences vs. structure of English sentences: §  “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry” There can be more than one way to express the same meaning. §  “New York”, “The Big Apple”, “NYC”
  • 15. Why is MT Difficult? §  Israeli officials are responsible for airport security. §  Israel is in charge of the security at this airport. §  The security work for this airport is the responsibility of the Israel government. §  Israeli side was in charge of the security of this airport. §  Israel is responsible for the airport’s security. §  Israel is responsible for safety work at this airport. §  Israel presides over the security of the airport. §  Israel took charge of the airport security. §  The safety of this airport is taken charge of by Israel. §  This airport’s security is the responsibility of the Israeli security officials.
  • 16. No single solution for all languages Number agreement: the house / the houses vs. la maison / les maisons Gender agreement: the house / the cheese vs. la maison / le fromage English - Spanish English - French
  • 17. No single solution for all languages English - German English - Chinese 种水果的农民 The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
  • 18. The Challenge of Patents L is an organic group selected from -CH2- (OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C. Long Sentences Technical constructions Largest single document: 249,322 words Longest Sentence: 1,417 words
  • 19. The Challenge of Patents. Very long sentences as standard; grammatically incomplete using nominal and telegraphic style (!); passive forms are frequent; frequent use of subordinate clauses, participles, implicit constructs; inconsistent and incorrect spelling; high use of neologisms; instances of synonymy and polysemy; spurious use of punctuation. Authoring guide for “to be translated” text. Patents break almost all of the rules!
  • 20. Evaluating Machine Translation Quality. Automatic Evaluation: judge the quality of an MT system by comparing its output against a human-produced “reference” translation. Pros: quick, cheap, consistent. Cons: inflexible, cannot be used on ‘new’ input. Human Evaluation: Pros: reliable, flexible, multi-faceted (fluency, error analyses, benchmarking). Cons: slow, expensive, subjective. Fluency vs. Adequacy. Task-Based Evaluation
  • 21. Evaluating Machine Translation Quality: Task-Based Evaluation. Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system. To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required. Why? Fluency vs. Adequacy. Fluency: how fluent and grammatically correct the translation output is. Adequacy: how accurately the translation conveys the meaning of the source. Source: La gran casa roja. Output 1: The big blue house. Output 2: The big house red
  • 22. Practical uses of Machine Translation. Understand its limitations and you’ll understand its capabilities! No: translate a patent for filing; translate literature for publication; translate marketing materials; anything mission critical without review. Yes: productivity tool for professional translation; understand foreign patents; localisation processes and “controlled” content; high volume, e.g. eDiscovery
  • 23. We provide Machine Translation solutions with Subject Matter Expertise
  • 24. We do this using Linguistic Engineering
  • 25. An “ensemble” MT architecture
  • 26. Data Engineering What is Linguistic Engineering? Pre-processing Post-processing Input Output Training Data
  • 27. Data Engineering + Linguistic Engineering An “ensemble” architecture Chinese pre-ordering rules Statistical Post-editing Input Output Training Data Spanish med-device entity recognizer Multi-output Combination Korean pharma tokenizer Patent input classifier Client TM/terminology (optional) Japanese script normalisation German Compounding rules Moses RBMT Moses Moses
  • 28. What is the value for users? Specialist solutions deliver more useable outcomes for the user. Post-editing = increased productivity; for information purposes = extract more meaning; multilingual search = retrieve more relevant results
  • 29. How this impacts translation quality [Bar chart, Portuguese to English: Iconic vs. Google vs. Systran, axis 0 to 50]
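
The co-occurrence idea from slides 7 to 11 can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any system named in the deck: the toy English-Spanish corpus, the function names `train` and `best_translation`, and the choice of the Dice coefficient as the association score are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned toy corpus (source, target), lowercased.
# Line i of the source side is a translation of line i of the target side.
corpus = [
    ("dog", "perro"),
    ("i have a cat", "tengo un gato"),
    ("a dog", "un perro"),
    ("a cat", "un gato"),
]

def train(pairs):
    """Count in how many sentence pairs each word and each word pair occurs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the same pair.
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count

def best_translation(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_count, tgt_count, pair_count = model
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None

model = train(corpus)
for w in ["a", "dog", "cat"]:
    print(w, "->", best_translation(w, model))
```

Looking words up one at a time does not yet produce a sentence: in this toy corpus both “i” and “have” score highest with “tengo”, which is one reason real systems add the phrase-level and word-order statistics listed on slide 12.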

Editor's Notes

  1. Second point is important. It has different uses and usability. The concept of FAHQMT is no more. Focus is now on HAMT and PEMT. The problem with rule-based systems is that they didn’t scale: you need bilingual experts for each language pair. SMT is the predominant approach.
  2. Starting point for all systems is data. The most important aspect is the quality of the data…
  3. They are essential and the quality is crucial. The translations must be accurate and the alignment must be correct, otherwise we infer the wrong things. Introduce “noise” into our systems.
  4. How do we use these corpora? It’s all about learning and remembering things we’ve seen before, the same way you might go about translating something
  5. Ok, so the translation isn’t exactly right here. It should be “Je parle à la fille” but we haven’t seen enough examples (don’t have enough data) for reliable estimates; we’re just going on the counts of the words.
  6. How likely a word is to translate to another word, as you have seen. How likely the different phrases are to translate as one another. What’s the likelihood a certain word will have a different position in the target sentence. Sometimes we take into account linguistic information about the words: is it a verb, then it should go here, articles should precede nouns, etc. Look at models of the target language and see if what we have produced makes sense (can these words go together in this order?)
  7. Google Translate aims to be a general system, but what happens when you’re translating a sports website? Quality issues can be caused by the fact that there’s a lot of other data in their models than sports news. Similarly, if I have a translation system for car manuals, it won’t be any good at translating sports websites. This is reflected in our systems at IPTranslator too, where all of our models are built using patents which have been filed in multiple languages to ensure we get the style correct (patents are a bigger fish than this though)
  8. The simple answer is that language is complex! Which is what makes it difficult to learn but also so interesting at the same time! Who has the telescope, him or me? New words, especially in patents. And new usage of words. The verb “to tweet” didn’t exist so long ago…
  9. The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding, for each language pair, what the differences are between them, e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
  10. With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, require smarter people! For Chinese, we need to identify these DE constructions so we know to move the head noun. There is no tense: going into English, how do we know what tense? There’s no article! We have to generate it! The DE particle has many translations: which one? FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese! ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST. AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
  11. But of course it’s not just that easy. Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for translation software. Let’s look for example at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim). Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences. And then we have patents which introduce a whole new level of complexity on top of the language issues… Patents are hard to read, never mind translate, never mind try to teach a computer how to translate them!
  12. Sometimes it’s hard to tell whether the translation is bad or that’s simply how the original patent was written
  13. Commercial machine translation is plagued with misleading marketing with unrealistic claims and promises, so we need to manage expectations. When I say NO, I mean no in a fully-automatic manner with no human intervention. Filing: not when meaning is CRUCIAL. Publication: no, there will be errors. Marketing: no, not with subtleties, idioms, etc.
  14. If you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content. The same applies to MT technology (and its providers)
  15. High quality data is essential for most effective approaches to MT. Clean data is engineering to build MT systems. But it is just an ingredient. You still need to cook the data for the specific language, the specific content type and writing style. This varies from language to language, domain to domain. We need to know how to cook it, we need to understand the language, the content, the style and not only take this into account, but make it integral to the development process. This is linguistic engineering.
  16. As a developer, you cannot be dogmatic when it comes to approaches to MT. We’re not a statistical MT vendor, we don’t focus on Moses, we’re not a rule-based MT vendor. We don’t do hybrid MT. We do all of them. We call this an “ensemble” approach. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together, e.g. for Chinese-English patent MT, maybe you need a statistical decoder, with some rules for automatic post-editing. Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and maybe some machine learning based pre-processes. We study. We learn what ensemble works for a particular configuration and that’s what we implement.
  17. Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data and many clients simply don’t have it. By being completely reliant on the data… We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the content being translated, often technical nuances, terminology etc. that need to be specially accounted for. ***ALSO need to develop special processes for languages…
  18. Let’s get rid of the concept of a central MT system, statistical, hybrid or whatever. Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?... Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data. That presents a challenge when the training data changes for each system that’s built. I’ll come back to this later…
  19. General advantages of this approach to MT
  20. All of these examples are using our IPTranslator systems which have been developed for patent machine translation. First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we take the assumption that the client has no additional data with which to build an engine from scratch, so we need an “existing” option. These results correlated well with human assessment of adequacy, another of which we can look at here…
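
To make the idea of automatic evaluation against a reference translation more concrete, here is a small Python sketch of clipped n-gram precision, the building block that BLEU combines over several n-gram lengths. It is a simplification, not full BLEU: there is no brevity penalty and no averaging across n. The function names and the reference sentence “The big red house” are assumptions for illustration; the two outputs are the ones from slide 21.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Share of candidate n-grams also found in the reference (counts clipped)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit to the number of times the reference has it.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "The big red house"   # assumed human reference, not from the slides
output_1 = "The big blue house"   # fluent, but the colour is wrong
output_2 = "The big house red"    # right words, wrong order

for name, out in [("Output 1", output_1), ("Output 2", output_2)]:
    print(name, ngram_precision(out, reference, 1), ngram_precision(out, reference, 2))
```

On this example Output 2 matches every reference word but only one of its three bigrams, so word-level overlap alone says nothing about word order; that is part of why a single automatic score is hard to interpret by itself.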