There are a number of current approaches to developing commercial machine translation systems, ranging from do-it-yourself platforms to fully customized development as a professional service. While these various approaches have their relative merits, they all present a number of drawbacks for the end user, be it the inability to handle complex content or a long and expensive period of development and testing.
At Iconic Translation Machines, our approach goes beyond basic engineering of data to build MT systems and overcome these drawbacks. We combine deep domain knowledge and linguistic expertise to deliver highly focused MT engines for targeted domains and languages. Our IPTranslator service, for example, has been developed using this approach to produce intelligent MT systems adapted for patent and legal content. We demonstrate how this approach has delivered significant value to end users and describe how these systems serve as an ideal launchpad for ongoing adaptation and optimization.
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For the latest updates go to http://www.statmt.org/mosescore/
or follow us on Twitter - #MosesCore
My INSURER PTE LTD - Insurtech Innovation Award 2024
TAUS MT Showcase, Beyond Data, John Tinsley, Iconic Translation Machines
1. Wednesday,
4
June
Beyond
Data:
Delivering
Machine
Transla;on
with
Subject
Ma@er
Exper;se
John
Tinsley,
Iconic
Transla1on
Machines
TAUS
Machine
Transla;on
Showcase
2014
Dublin
(Ireland)
The
research
within
the
project
MosesCore
leading
to
these
results
has
received
funding
from
the
European
Union
7th
Framework
Programme,
grant
agreement
no
288487
2. Beyond Data
Delivering Machine Translation with
Subject Matter Expertise
John Tinsley
Director / Co-Founder
TAUS MT Showcase. 4th June 2014, Dublin
6. The world’s first and only patent specific MT
system that’s ready to go
7. Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
8. Patents: an MT nightmare
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
9. Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
10. Data Engineering + Linguistic Engineering
An “ensemble” architecture
Chinese pre-ordering
rules
Statistical
Post-editing
Input
Output
Training Data
Spanish med-device
entity recognizer
Multi-output
Combination
Korean pharma
tokenizer
Patent input
classifier
Client TM/terminology (optional)
Japanese script
normalisation
German
Compounding rules
Moses
RBMT
Moses
Moses
11. If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse
reactions (SCAR), including Steven Johnson’s syndrome
(SJS) and toxic epidermal necrolysis (TEN), are
associated with a genetic marker, the HLA-B*5801
allele.”!
“IPTranslator is perfect for someone who needs to search [patents]
across multiple languages and with is useful in the case of both
patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
12. What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing
For information purposes
Multilingual search
Increased productivity
Extract more meaning
Retrieve more relevant results
=
=
=
13. De-risking the machine translation proposition
What is the value for users?
+ Data
+ Time
+ €€€
= ???
+ No data needed
+ Systems are ready to go
+ No upfront cost
= Evaluate immediately
Our PrerequisitesTypical Prerequisites
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
15. Iconic in practice
Iconic had a domain-specific MT solution for that industry
Machine Translation technology for the legal industry
Business Need
16. Iconic in practice
Delivered immediately and initial results were positive
Translation samples required for initial evaluation
Process (1)
17. Iconic in practice
“The complexities and unforeseen but inevitable surprises of MT integration
in large scale production processes were handled both competently and
efficiently.”
Integrate Iconic with GlobalSight for productivity pilot
Process (2)
18. Iconic in practice
>20% productivity increase for translator post-editing Iconic output
“Iconic delivered measurable productivity gains from the outset”
Performance
19. Iconic in practice
• Ongoing improvement through feedback from translators
• Ongoing improvement through the incorporation of post-edits
• More than 4 million words translated to date for Asian languages
• Periodic roll-out of new languages over time
Looking forward
20. Need: short-term solution to provide on-demand translation through
a web search interface
Iconic in practice
Process: integrate directly through Iconic API and evaluate quality
and throughput concurrently
Outcomes: in 5 months of production for English-Portuguese alone,
we processed:
• 15,526 translation requests
• 14,606,374 words
21. All content is not created equal
We cannot afford to be dogmatic when it
comes to MT
Domain specific MT is about more than just
data
Know your subject matter!
Take home messages…