“Data & Linguistics”
Delivering Machine Translation with Subject Matter Expertise
John Tinsley
Director / Co-Founder
Localization World. 31st Oct 2014, Vancouver
Machine Translation
with Subject Matter Expertise
From Data Engineering to Linguistic Engineering
“Ensemble” MT architecture
The world’s first and only patent-specific MT system that’s ready to go
Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
Patents: an MT nightmare
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
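The markup in the two patent excerpts above (`<2>` for a superscript, `[deg.]` for a degree sign) is the kind of thing a pre-processing step might normalise before translation. A minimal sketch in Python, assuming only those two conventions; the function name and the rules are hypothetical and are not taken from Iconic's system:

```python
import re

# Hypothetical normalisation rules for the markup seen in the excerpts above.
_RULES = [
    (re.compile(r"<(\d+)>"), r"^\1"),    # N/mm<2>   -> N/mm^2
    (re.compile(r"\[deg\.\]\s*"), "°"),  # 0[deg.] C -> 0°C
]


def normalise_patent_text(text: str) -> str:
    """Apply each rule in order and return the cleaned text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


stress = normalise_patent_text("maximum stress of 1.2 to 3.5 N/mm<2>")
elongation = normalise_patent_text("1,300% at 0[deg.] C.")
```

Keeping the rules in a list makes it easy to add or remove a convention per language pair without touching the function itself.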
“Most of these things are not like the other”
Many languages aren’t a dream either
(And teaches the teacher her students language the Arabic)
Spanish – Italian English – Spanish Arabic – English
Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
Data Engineering + Linguistic Engineering
An “ensemble” architecture
Input
Patent input classifier
Chinese pre-ordering rules
Japanese script normalisation
Korean pharma tokenizer
German compounding rules
Spanish med-device entity recognizer
Training Data
Client TM/terminology (optional)
Moses
RBMT
Moses
Moses
Multi-output combination
Statistical post-editing
Output
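One way to picture the "ensemble" idea is as a chain: pre-processing steps, several engines run on the same input, a combination step, then post-processing. The sketch below is an assumption about the overall shape only. The function names are hypothetical, the engines are toy stand-ins rather than Moses or an RBMT system, and "pick the longest candidate" is a placeholder for real multi-output combination.

```python
from typing import Callable, List

Step = Callable[[str], str]      # a pre- or post-processing component
Engine = Callable[[str], str]    # an MT engine: source text in, translation out


def run_ensemble(
    text: str,
    pre_steps: List[Step],
    engines: List[Engine],
    combine: Callable[[List[str]], str],
    post_steps: List[Step],
) -> str:
    """Pre-process, translate with every engine, combine, then post-process."""
    for step in pre_steps:
        text = step(text)
    candidates = [engine(text) for engine in engines]
    output = combine(candidates)
    for step in post_steps:
        output = step(output)
    return output


def pick_longest(candidates: List[str]) -> str:
    """Toy stand-in for multi-output combination: keep the longest candidate."""
    return max(candidates, key=len)


# Toy components so the sketch runs end to end.
toy_engines = [lambda s: f"[smt] {s}", lambda s: f"[rbmt] {s}"]
result = run_ensemble(
    "  patent claim text ",
    pre_steps=[str.strip],
    engines=toy_engines,
    combine=pick_longest,
    post_steps=[lambda s: s + "."],
)
```

Because each component is just a function from string to string, a language-specific step such as a tokenizer or a compounding rule can be added to or dropped from the chain per language pair.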
Easier said than done
“A very particular set of skills”
MT Knowledge (from a scientific perspective)
Domain Knowledge (the nature of the content)
Linguistic Knowledge (the characteristics of the language)
MT Knowledge
Implementation
•  Computer science!
•  Programming
•  Data structures
•  Algorithms
Science
•  Machine learning
•  Probability theory
•  Bayesian statistics
•  Markov Models
Domain Knowledge
What’s important?
•  Chemical names
•  References to figures
•  Claim cross-references
Where do we learn?
•  Commercial partners
•  LSPs & Translators
•  Research
Consistent across langs?
•  Japanese abstract order
•  Numbering / bullets
•  Document layout
Document types?
•  Patents
•  Applications, reports
•  Pharmaceutical
•  IFUs, labels
Iconic
Translation Machines
Linguistic Knowledge
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
Linguistic Knowledge
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
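The Chinese example shows why pre-ordering helps: the head noun follows the particle 的, whereas English puts it first. A toy pre-ordering rule can be sketched as below. It assumes the input is already segmented into tokens and handles only a single 的; the function name is hypothetical and real rule sets would be far richer.

```python
from typing import List

DE = "的"  # particle linking a modifying clause to the noun that follows it


def preorder_relative_clause(tokens: List[str]) -> List[str]:
    """Move the head noun that follows 的 in front of the clause before it."""
    if DE not in tokens:
        return list(tokens)
    i = tokens.index(DE)
    clause, head = tokens[:i], tokens[i + 1:]
    return head + [DE] + clause


# "grow fruit (particle) farmer" becomes "farmer (particle) grow fruit"
reordered = preorder_relative_clause(["种", "水果", "的", "农民"])
```

After reordering, the source word order is closer to "the farmer who grows fruit", which leaves the translation engine with less long-distance movement to model.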
If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse
reactions (SCAR), including Steven Johnson’s syndrome
(SJS) and toxic epidermal necrolysis (TEN), are
associated with a genetic marker, the HLA-B*5801
allele.”
“IPTranslator is perfect for someone who needs to search [patents]
across multiple languages and is useful in the case of both
patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites
+ Data
+ Time
+ €€€
= ???
New Prerequisites
+ No data needed
+ Systems are ready to go
+ No upfront cost
= Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
Case Studies
1.  What this approach means straight up in terms of quality…
2.  Productivity gains from using these systems…
3.  As a foundation for client customization…
Case 1: Quality
[Bar chart, Portuguese to English: Iconic vs. Google vs. Systran; y-axis 0–50]
Case 1: Quality
German to English
[Bar chart, evaluator scores on a 1–5 scale]
Evaluator 1: 2.83
Evaluator 2: 4
Evaluator 3: 3.86
Average: 3.56
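As a quick check on the German to English figures, the average can be recomputed from the three evaluator scores. This sketch assumes the data labels 2.83, 4 and 3.86 belong to Evaluators 1, 2 and 3 in that order, which is consistent with the reported average of 3.56.

```python
from statistics import mean

# Assumed mapping of chart data labels to evaluators (in the order they appear).
scores = {"Evaluator 1": 2.83, "Evaluator 2": 4.0, "Evaluator 3": 3.86}

# Round to two decimals to match the precision shown on the slide.
average = round(mean(list(scores.values())), 2)
```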
Case 2: Productivity
Business Need
Machine Translation technology for the legal industry
Iconic had a domain-specific MT solution for that industry
Case 2: Productivity
Process (1)
Translation samples required for initial evaluation
Delivered immediately and initial results were positive
Case 2: Productivity
Process (2)
Integrate Iconic with GlobalSight for productivity pilot
“The complexities and unforeseen but inevitable surprises of MT integration in large scale production processes were handled both competently and efficiently.”
Case 2: Productivity
Performance
>20% productivity increase for translator post-editing Iconic output
“Measurable productivity gains delivered from the outset”
Case 2: Productivity
Looking forward
•  Ongoing improvement through feedback from translators
•  Ongoing improvement through the incorporation of post-edits
•  More than 5 million words translated to date for Asian languages
•  Periodic roll-out of new languages over time
Case 3: Customization
-  Modify our patent machine translation engines for “Written Opinions” on patents
-  0.25% new data, 2 new ensemble processes
[Bar chart, Chinese to English: Iconic vs. Google, Baseline and + Modification; data labels 21, 20, 27; y-axis 0–60]
Case 3: Customization
Productivity threshold
Essentially out of domain – not viable for post-editing
Case 3: Customization
Productivity threshold
After customization – 25% gain in productivity
Take home messages…
All content is not created equal
We cannot afford to be dogmatic when it comes to MT
Know your subject matter!
Domain specific MT is about more than just data
+ Linguistics!
Thank You!
john@iptranslator.com
@IconicTrans

Data and Linguistics: Delivering Machine Translation with Subject Matter Expertise

  • 1. “Data & Linguistics” Delivering Machine Translation with Subject Matter Expertise John Tinsley Director / Co-Founder Localization World. 31st Oct 2014, Vancouver
  • 3. From Data Engineering to Linguistic Engineering
  • 5. The world’s first and only patent specific MT system that’s ready to go
  • 6. Data Engineering What is Linguistic Engineering? Pre-processing Post-processing Input Output Training Data
  • 7. Patents: an MT nightmare L is an organic group selected from -CH2- (OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C. Long Sentences Technical constructions Largest single document: 249,322 words Longest Sentence: 1,417 words
  • 8. “Most of these things are not like the other” Many languages aren’t a dream either (And teaches the teacher her students language the Arabic) Spanish – Italian English – Spanish Arabic – English
  • 9. Data Engineering What is Linguistic Engineering? Pre-processing Post-processing Input Output Training Data
  • 10. Data Engineering + Linguistic Engineering An “ensemble” architecture Chinese pre-ordering rules Statistical Post-editing Input Output Training Data Spanish med-device entity recognizer Multi-output Combination Korean pharma tokenizer Patent input classifier Client TM/terminology (optional) Japanese script normalisation German Compounding rules Moses RBMT Moses Moses
  • 11. Easier said than done “A very particular set of skills” MT Knowledge (from a scientific perspective) Domain Knowledge (the nature of the content) Linguistic Knowledge (the characteristics of the language)
  • 12. MT Knowledge Implementation •  Computer science! •  Programming •  Data structures •  Algorithms Science •  Machine learning •  Probability theory •  Bayesian statistics •  Markov Models
  • 13. Domain Knowledge What’s important? •  Chemical names •  References to figures •  Claim cross-references Where do we learn? •  Commercial partners •  LSPs & Translators •  Research Consistent across langs? •  Japanese abstract order •  Numbering / bullets •  Document layout Document types? •  Patents •  Applications, reports •  Pharmaceutical •  IFUs, labels Iconic Translation Machines
  • 14. Linguistic Knowledge Number agreement: the house / the houses vs. la maison / les maisons Gender agreement: the house / the cheese vs. la maison / le frommage English - Spanish English - French
  • 15. Linguistic Knowledge English - German English - Chinese 种水果的农民 The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
  • 16. If you don’t understand it, you can’t translate it MT with Subject Matter Expertise “Allopurinol-induced serious cutaneous adverse reactions (SCAR), including Steven Johnson’s syndrome (SJS) and toxic epidermal necrolysis (TEN), are associated with a genetic marker, the HLA-B*5801 allele.” “IPTranslator is perfect for someone who needs to search [patents] across multiple languages and with is useful in the case of both patentability and infringement searches.” – Aalt van de Kuilen, Global Head of Patent Information, Abbott Machine Translation for Patents
  • 17. What is the value for users? Specialist solutions deliver more useable outcomes for the user Post-editing For information purposes Multilingual search Increased productivity Extract more meaning Retrieve more relevant results = = =
  • 18. De-risking the machine translation proposition What is the value for users? + Data + Time + €€€ = ??? + No data needed + Systems are ready to go + No upfront cost = Evaluate immediately New PrerequisitesTypical Prerequisites Customisation. Refinement. » Incorporation of user feedback » Incremental training with post-edits » Tuning for specific input types
  • 19. Case Studies 1.  What this approach means straight up in terms of quality… 2.  Productivity gains from using these systems… 3.  As a foundation for client customization…
  • 21. Case 1: Quality 2.83 4 3.86 3.56 1 1.5 2 2.5 3 3.5 4 4.5 5 Evaluator 1 Evaluator 2 Evaluator 3 Average German to English TranslationGerman to English
  • 22. Case 2: Productivity Iconic had a domain-specific MT solution for that industry Machine Translation technology for the legal industry Business Need
  • 23. Case 2: Productivity Delivered immediately and initial results were positive Translation samples required for initial evaluation Process (1)
  • 24. Case 2: Productivity “The complexities and unforeseen but inevitable surprises of MT integration in large scale production processes were handled both competently and efficiently.” Integrate Iconic with GlobalSight for productivity pilot Process (2)
  • 25. Case 2: Productivity >20% productivity increase for translator post-editing Iconic output “Measurable productivity gains delivered from the outset” Performance
  • 26. Case 2: Productivity •  Ongoing improvement through feedback from translators •  Ongoing improvement through the incorporation of post-edits •  More than 5 million words translated to date for Asian languages •  Periodic roll-out of new languages over time Looking forward
  • 27. Case 3: Customization -  Modify our patent machine translation engines for “Written Opinions” on patents -  0.25% new data, 2 new ensemble processes [Bar chart: Chinese to English; labels: Iconic, Google, + Modification, Baseline; axis 0–60; value labels 21, 20, 27]
  • 28. Case 3: Customization Productivity threshold Essentially out of domain – not viable for post-editing
  • 29. Case 3: Customization Productivity threshold After customization – 25% gain in productivity
  • 30. All content is not created equal We cannot afford to be dogmatic when it comes to MT Know your subject matter! Domain specific MT is about more than just data Take home messages… + Linguistics!
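The Case 1 figures can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the four values on the chart map, in order, to Evaluator 1, Evaluator 2, Evaluator 3 and the average, and using the rule of thumb from the speaker notes that a score of 3 or higher means "usable". The variable names are illustrative only.

```python
from statistics import mean

# Per-evaluator adequacy scores as read from the Case 1 chart (assumed mapping).
scores = {"Evaluator 1": 2.83, "Evaluator 2": 4.0, "Evaluator 3": 3.86}

# Rule of thumb from the speaker notes: 3 or higher is "usable".
USABLE_THRESHOLD = 3.0

# Average across the three evaluators, rounded as on the slide.
average = round(mean(scores.values()), 2)

# Evaluators whose own mean rating falls below the usability threshold.
below = [name for name, score in scores.items() if score < USABLE_THRESHOLD]

print(f"Average adequacy: {average}")
print(f"Average is usable: {average >= USABLE_THRESHOLD}")
print(f"Evaluators below threshold: {below}")
```

Under that assumed mapping the computed average matches the 3.56 shown on the slide. These are per-evaluator means, not per-segment scores, so they say nothing about how many individual segments fell below 3.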

Editor's Notes

  1. In this presentation, I’m going to talk about our experience of developing machine translation engines for complex content and languages. Looking at where we get to when we reach the limitations of existing technology and approaches, particularly focusing on WHY we reached that ceiling *WHAT was it about the content and the language that could be overcome. From there, I’ll look at what we need to do to advance the technology and, FROM OUR PERSPECTIVE as MT technology developers and providers, tell you about what we discovered we needed to know, what skillsets and knowhow we needed in our team to achieve this. I’ll then WRAP UP with some case studies which will serve to illustrate the benefits that can be seen as a result of taking this approach. For DEVELOPERS, I hope we can share our experiences with you, and for BUYERS OR USERS OF MT, my hope is that, from your perspective, this talk will pull back the curtain a little bit on MT development, which has kinda been a bit of a black box.
  2. Just a little bit by way of an overview of Iconic Translation Machines to introduce the concepts I'm going to talk about. We develop what we call “MT with Subject Matter Expertise” The concept is that if you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content. *And the same applies to MT technology* ----- Meeting Notes (14/10/2014 12:52) ----- Our philosophy
  3. High quality data is essential for most effective approaches to MT. Cleaning data to build MT systems is data engineering. But it is just an ingredient. You still need to cook the data for the specific language, the specific content type and writing style. This varies from language to language, domain to domain. We need to know how to cook it, we need to understand the language, the content, the style and not only take this into account, but make it integral to the development process. This is linguistic engineering.
  4. How do you go about building such a concept? To answer this, I want to introduce the concept of the ensemble architecture for machine translation. As a developer, you cannot be dogmatic when it comes to approaches to MT. There are many approaches: you can be a statistical MT vendor, you can focus on Moses, you can use rule-based MT. Or you might do some sort of hybrid MT. In the “ensemble” approach, WE DO ALL OF THEM. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together. e.g. for Chinese-English patent MT, maybe you need a statistical decoder, with some rules for automatic post-editing. Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and maybe some machine learning based pre-processes. You study. You learn what ensemble works for a particular configuration and that’s what you implement.
  5. An instance of this approach is our IPTranslator service for patent/IP/legal translation and I’ll mention patents as an example of a highly complex content type as I go through the rest of the presentation.
  6. To understand this Linguistic Engineering approach, let’s first describe DATA ENGINEERING. Existing approaches to MT typically use the following process – if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent – AND THAT’S WHY IT’S USED, BECAUSE IT CAN WORK - but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you often need A LOT of data (and many clients simply don’t have it.) But then you’re being completely reliant on the data to capture all of the nuances of language and content, and this isn’t enough. We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the content being translated, often technical nuances, terminology etc. that need to be specially accounted for. ***ALSO need to develop special processes for languages… LET’S LOOK AT WHY
  7. But of course it’s not just that easy. Patents for example have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators as well as for Translation Software. Let’s look for example at this patent – what’s highlighted in blue is a SINGLE sentence, (which is an individual legal claim). Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
  8. To quote Sesame Street…or to slightly modify a line from a famous Sesame Street song… “Most of these things are not like the other” AS A RULE OF THUMB, the more similar languages are to one another, the easier they are for machine translation. Particularly in terms of the order of the words in a sentence, and then also grammatically. The closer they are, the more you can get away with just using statistical MT and throwing lots of data into the system. But most of them are not like one another. But what if the languages are SO grammatically different from all perspectives?! Like English and Arabic, where Arabic has a different word order, frequently doesn’t have a verb, affixes pronouns, articles, and conjunctions to verbs (when they ARE there) and nouns. Look at this example which shows many of these phenomena together. Firstly, the words are in a totally different order if we read it out as it would be word for word… and it manages to say all that in 5 words due to all the affixes, compounding, and morphology. Data cannot solve these problems either. Each one of these phenomena needs to be addressed. And that’s where the linguistic knowledge and linguistic engineering comes in…
  9. Existing vendors or MT providers use the following process – if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data and many clients simply don’t have it. We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the CONTENT being translated, often technical nuances, terminology etc. that need to be specially accounted for, AS WELL as the LANGUAGE being translated, which again cannot just be a generic process.
  10. Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever. Yes we have training data and input, we’ll have some output, and some processes, but what is the journey?... Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data. That presents a challenge when the training data changes for each system that’s built. LATER, I’ll come back to this and look at some examples where we have QUANTIFIED the impact and the value in taking this approach BUT FIRST, I want to talk about WHY we took this approach and WHAT we learned over the course of the last few years… ----- Meeting Notes (14/10/2014 12:52) ----- **Good if you can develop the systems with the training data that you know you're going to use...
  11. THAT’S WHAT’S REQUIRED AND DEVELOPMENT OF THE VARIOUS COGS IS AN ONGOING PROCESS. However, as with most areas of natural language processing (like MT itself as the over-arching process) these things aren’t perfect. You know the way MT is improving, well so is syntactic parsing of German, named-entity recognition in Japanese, Arabic morphological analysis, so it’s about constant iterative improvement. THAT’S WHY THERE ARE NO BREAKTHROUGHS, NO SILVER BULLETS IN MT DEVELOPMENT. We work hard, we improve our German parsing, we improve our German systems a bit… But all of that is easier said than done. When building a technical team to do this, we have to look closely at what sort of skillset we need. Let me tell you, what we came across is quite the high bar. It’s a talent pool that’s thin on the ground for a number of reasons, which I’ll get to… To quote another movie, a compatriot of mine, Liam Neeson in the film Taken “You need a very particular set of skills”. Now this is not per person, but these are skills you really need to have within your team to get the most you can out of your MT systems **NOW START SLIDES** Over the course of our existence, we’ve identified three key areas in which you need to have expertise in order to be able to develop adequate MT engines for different languages and content types… 1…2…3 ----- Meeting Notes (14/10/2014 15:58) ----- 16 minutes to here
  12. Let’s look first at MT knowledge. THIS IS NOT JUST KNOWING HOW TO RUN MOSES. You can’t treat it as a black box. I believe MT knowledge here is two-fold. You have to know the science (THEORY), and you have to know how to implement the science (PRACTICE). They don’t always go hand in hand…we’re talking implementation from a product development perspective, not from a “let’s hack together my idea in some scripts held together by string so that I can write a paper about my results and it doesn’t really matter how efficiently it works!” So then if we know the theory, we know how to develop a maximum-entropy classifier to identify chemical names in Korean – we then need to understand the mechanics of the MT engine in order to implement this along with all of the other components in an efficient manner. Examples of machine learning methods: support vector machines, decision trees, neural networks. Examples of probability models: Bayesian, HMMs, Maximum Likelihood. Examples of programming languages/styles: Java, Python, C++, MapReduce. Examples of data structures: hashmaps, databases. Examples of algorithms: sorting/searching, parsing. OUTRO: one of the biggest challenges in this regard is finding talent with this skillset. MT grads and postgraduates are thin on the ground and many of them are on an academic career path. Couple that with the fact that the research groups are dotted around the world and hiring becomes a real challenge. There was actually an interesting panel about this at the AMTA conference…
  13. So that’s what you need to be able to develop with the MT. With that, what is it that you actually need to develop? Well, we can split this into two sets of components that need to work together. First are those for the DOMAIN, and then those for the LANGUAGE itself. Looking at the DOMAIN KNOWLEDGE required first, what do we need to know? 1. WHAT’S IMPORTANT IN THIS DOMAIN? 2. WHAT TYPES OF DOCUMENTS ARE THERE? 3. ARE THESE CHARACTERISTICS CONSISTENT ACROSS LANGUAGES? 4. WHERE DO WE FIND THIS INFORMATION OUT?
  14. The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding THE RELATIONSHIP between the languages you’re translating to and from, what the differences are between them e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
  15. With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, require smarter people! Chinese, need to identify these DE constructions so we know to move the head noun No tense, going into English, how do we know what tense? There’s no article! We have to generate it! DE particle has many translations, which one! FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese! ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
  16. AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE The whole motivation for this is the same as if you’re hiring a linguist for translation: they simply need to have technical subject matter expertise. Otherwise, how can they understand everything? “If you don’t understand it, you can’t translate it” The same applies to MT. The training and translation process needs to know what it’s dealing with so it can use the right terms, do the right preprocessing, etc. That’s what we’ve done with our flagship offering, IPTranslator. Systems have subject matter expertise because they were developed with, and evaluated and used by patent information specialists.
  17. General advantages of this approach to MT ANALOGY of buying fresh fruit…
  18. Obviously one of the issues in adopting machine translation technology is the risk that’s involved. You invest in a program, it doesn’t deliver straight away, it might start bringing you returns but when? How long? If ever? If we look specifically at the approach we’ve taken: Our proposition helps to de-risk the adoption of MT from A QUALITY PERSPECTIVE and a DELIVERY PERSPECTIVE. Typical setup involves: data, across all languages. How much do you have? Is that enough? Is it clean? Is it yours to give away? Time, how long is development going to take? Will MT be good enough straight away after that? If not, when? What’s the upfront cost for customisation or subscription to the service?
  19. That’s the value for the users for the whole concept, but what if we get down to the nuts and bolts of it and talk about the value in terms of the returns…what does using this type of MT get you? To give an illustration, I’ll run through 3 quick examples and case studies from our own experiences. The first of which will look at what this does in terms of straight up quality of the MT output After that, no pun intended, we’ll see how that translates to productivity when post-editing the output Finally, we’ll look at what you can do when you have these systems built and ready to go in terms of customisation, with minimal effort..
  20. All of these examples are using our IPTranslator systems which have been developed for patent machine translation. First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we take the assumption that the client has no additional data with which to build an engine from scratch, so we need an “existing” option. These results correlated well with human assessment of adequacy, another of which we can look at here…
  21. For our German to English system, we had 3 evaluators look at around 400 segments each and rank them from 1-5 in terms of how adequately they carried the meaning from the source to the target. Typically, a score of 3 or higher indicates that the segments are “usable” – i.e. readable and understandable. So they’re just a couple of brief examples to show that this approach is developing systems that can produce good quality output, without the need for additional adaptation for each individual user. I want to now look at a case study that illustrates how these systems, as they are, with these levels of quality, can produce output that leads to more productive post-editing…
  22. This is a case study with WeLocalize who had a particular business need…
  23. For English to Chinese MT…
  24. Used on a daily basis. So this ongoing improvement through incorporation of client-specific data is related to our third case study about how these engines that we’ve been building with linguistic engineering can serve as a solid backbone for customized engines…
  25. This is a case with another of our clients who have a substantial patent translation business. They had a slightly different need in that, rather than the translation of patent documents themselves, they wanted to translate what are known as Written Opinions, essentially reports from patent examiners about the validity of a patent application. From an MT perspective, while a lot of the technical terminology is the same, the register is completely different. These written opinions contain first person, questions, opinions – sentence structures and words that just aren’t in patents and consequently not in our original systems. If we looked at how our systems performed when trying to handle this, we get a BLEU score of around 21 where Google, a system designed for whatever’s thrown at it, gets a score of 20 – so around the same. What we need to do is modify these systems for this particular type of text. What we had at hand to do this was some TMs from our client, not much though, it amounted to around 0.25% of the amount of data we’d trained our original engines with. We also developed a couple of processes to add to our ensemble architecture to handle specifics of these Reports, such as consistent references to PCT (patent cooperation treaty) Regulations. This resulted in the performance more than doubling….
  26. In terms of how this correlated into post-editing productivity for the client, well let’s look at this scatter plot. Each dot is a segment in our test. Along the horizontal axis we have the length of the segment in words. On the vertical axis we have a proprietary score that correlates with post-editing productivity whereby a score of 0.4 means, roughly, there’ll be some productivity from post-editing. Above means most likely not, and the lower, the less editing is required. So here we can see that only a small portion of the segments fall below the threshold so, basically, the document (which is essentially out of domain) is NOT VIABLE for this MT system.
  27. However, AFTER we do the customisation we see that a large number of the segments drop below the line, a bit over 60% of them, with quite a few hitting the 0 score also. When we run the numbers on these, they lead to productivity gains of around 25% ----- Meeting Notes (13/10/2014 17:03) ----- The heavy lifting has been done
  28. Some of these points may be obvious, but allow me to elaborate. All content is not created equal (to modify a well known phrase); as such, the (machine) translation process has to be different. We cannot afford to be dogmatic when it comes to MT; one size does not fit all. If we are practitioners of SMT, we’re restricting ourselves. Even being “hybrid” is restrictive. It’s SMT + rules, or rule-based + statistical post-editing. Domain specific MT is about more than just data; a sufficient amount of good quality clean training data is obviously a key component in the MT training process (especially for SMT) but it’s not everything. To use a cooking analogy, data is to MT what the ingredients are to a chef. The chef (in this case the training/development process) needs to know what to do with the ingredients. To bring it back to MT, the training and translation processes need to be informed by the data, by the content type and the subject matter. Training is sensitive to data. So you could have the most refined approach but data will be the biggest variable to quality. Our approach allows us to deliver high quality “out of the box” which we then refine as opposed to the great unknown of training from scratch.
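The “ensemble” idea from the notes, where a different chain of processes is chosen per language pair and content type, could be sketched roughly as below. This is an illustration under stated assumptions, not Iconic's implementation: the `Ensemble` class, the `REGISTRY` table, `select_ensemble` and all three toy components are hypothetical names, and the "decoder" is a word lookup standing in for a real statistical or rule-based engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A step is any function from text to text (pre-process, decode, post-process).
Step = Callable[[str], str]


@dataclass
class Ensemble:
    """An ordered chain of steps chosen for one configuration (hypothetical)."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def translate(self, text: str) -> str:
        # Run the input through every step in order.
        for step in self.steps:
            text = step(text)
        return text


# --- Toy stand-ins for real components (all hypothetical) ---

def normalise_whitespace(text: str) -> str:
    """Pre-processing: collapse runs of whitespace."""
    return " ".join(text.split())


TOY_LEXICON = {"bonjour": "hello", "monde": "world"}


def toy_decoder(text: str) -> str:
    """Placeholder for an MT engine: word-by-word lookup, unknown words pass through."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def capitalise_first(text: str) -> str:
    """Post-editing rule: upper-case the first character."""
    return text[:1].upper() + text[1:]


# One ensemble per (source language, target language, content type).
REGISTRY: Dict[Tuple[str, str, str], Ensemble] = {
    ("zh", "en", "patent"): Ensemble(
        "decoder + post-editing rule",
        [normalise_whitespace, toy_decoder, capitalise_first],
    ),
    ("fr", "en", "abstract"): Ensemble(
        "decoder only",
        [normalise_whitespace, toy_decoder],
    ),
}


def select_ensemble(source: str, target: str, content_type: str) -> Ensemble:
    """Pick the chain configured for this language pair and content type."""
    key = (source, target, content_type)
    if key not in REGISTRY:
        raise LookupError(f"No ensemble configured for {key}")
    return REGISTRY[key]


if __name__ == "__main__":
    engine = select_ensemble("fr", "en", "abstract")
    print(engine.name, "->", engine.translate("  bonjour   monde "))
```

The point of the sketch is only the shape: the components are interchangeable functions, and which ones run is decided per configuration rather than fixed around one central MT system.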