"Machine Translation 101" and the Challenge of Patents

“Machine Translation 101”
And The Challenge of Patents
John Tinsley
Director / Co-Founder
EPOPIC. 5th Nov 2014, Warsaw

The need for translation
50% of all PCT applications in 2013 came from Asia

BSc in Computational Linguistics
  PhD in Machine Translation
  Language Technology consultant
  Founder of Iconic Translation Machines
Why listen to me?
Machine Translation is what I do!
The world’s first and only patent specific machine translation system

§  The use of computers to translate from one language into another
§  The use of computers to automate some, or all, of the translation
process
§  An approach to Machine Translation, where translations for an input are
estimated based on previous seen translation examples and associated
(inferred) probabilities.
§  e.g. IPTranslator, Google Translate
§  Rule-based (or transfer-based): based on linguistic rules
•  e.g. Systran; Altavista’s Babelfish
§  Example-based: based on translation examples and inferred linguistic
patterns
Machine Translation: The Basics
Machine Translation = automatic translation
Statistical Machine Translation (SMT)
Other approaches
SMT is now by far the predominant approach

A corpus (pl. corpora) is a collection
of texts, in electronic format, in a
single language
§  document(s)
§  book(s)
Bilingual Corpora
a bilingual corpus
  Note source language = original language or language we’re translating from
target language = language we’re translating into
A bilingual corpus is a collection of
corresponding texts, in multiple
languages
§  a document & its translation
§  a book in multiple languages
§  European Parliament proceedings

Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds on a document
level
For translation, we required sentence-aligned bilingual corpora
§  The sentence on line 1 in the source language text corresponds
to (i.e. is a translation of) the sentence on line 1 in the target
language text etc.
§  Often referred to as parallel aligned corpora
Sentence aligned bilingual parallel corpora
are essential for statistical machine translation

Learning from Previous Translations
Suppose we already know
(from a sentence-aligned bilingual
corpus) that:
§  “dog” is translated as “perro”
§  “I have a cat” is translated as
“Tengo un gato”
We can theoretically translate:
§  “I have a dog” à “Tengo un perro”
§  Even though we have never seen “I
have a dog” before
Statistical machine translation induces information about unseen input, based on
previously known translations:
§  Primarily co-occurrence statistics
§  Takes contextual information into account

Statistical Machine Translation
§  Example of a small sentence-aligned
bilingual corpus for English-French

§  We take some new sentence to translate

§  From the corpus we can infer possible target (French)
translations for various source (English) words
§  We can then select the most probable translations
based on simple frequencies (co-occurrence statistics)

Given a previously unseen input sentence, and our collated statistics,
we can estimate translation

Advanced MT
All modern approaches are based on building translations for complete
sentences by putting together smaller pieces of translation
Previous example is very simplistic
§  In reality SMT systems calculate much more complex statistical models
over millions of sentence pairs for a pair of languages
§  Upwards of 2M sentence pairs on average for large-scale systems
§  Word-to-word translation probabilities
§  Phrase-to-phrase translation probabilities
§  Word order probabilities
§  Linguistic information (are the words nouns, verbs?)
§  Fluency of the final output

Previous example is very simplistic
Other statistics calculated include

Data is Key
For SMT data is key
§  Information (word/phrase correspondences and associated statistics) is only based
on what we have seen before in the data
Important that data used to train SMT systems is:
§  Of sufficient size
§  avoid sparseness/skewed statistics
§  Representative and relevant
§  contains the right type of language
§  High-quality
§  absence of misspellings,
incorrect alignments etc.
§  Proofed by human
translators
training data

Why is MT Difﬁcult?
A word or a phrase can have more than one meaning (ambiguity – lexical or
structural)
§  e.g. “bank”, “dive”, “I saw the man with the telescope”
People use language creatively
§  New words are cropping up all the time
Linguistic differences between languages
§  e.g. structure of Irish sentences vs. structure of English sentences:
§  “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning.
§  “New York”, “The Big Apple”, “NYC”

Why is MT Difﬁcult?
§  Israeli officials are responsible for airport security.
§  Israel is in charge of the security at this airport.
§  The security work for this airport is the responsibility of the Israel government.
§  Israeli side was in charge of the security of this airport.
§  Israel is responsible for the airport’s security.
§  Israel is responsible for safety work at this airport.
§  Israel presides over the security of the airport.
§  Israel took charge of the airport security.
§  The safety of this airport is taken charge of by Israel.
§  This airport’s security is the responsibility of the Israeli security officials.

No single solution for all languages
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le frommage
English - Spanish
English - French

No single solution for all languages
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]

The Challenge of Patents
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words

The Challenge of Patents
  Very
long
sentences
as
standard

  Gramma1cally
incomplete
using

nominal
and
telegraphic
style
(!)

  Passive
forms
are
frequent

  Frequent
use
of
subordinate
clauses,

par1ciples,
implicit
constructs

  Inconsistent
and
incorrect
spelling

  High
use
of
neologisms

  Instances
of
synonymy
and
polysemy

  Spurious
use
of
punctua1on

Authoring guide
for “to be
translated” text
Patents break
almost all of the
rules!

Judge the quality of an MT system by comparing its output against a
human-produced “reference” translation
§  Pros: Quick, cheap, consistent
§  Cons: Inflexible, cannot be used on ‘new’ input
§  Pros: Reliable, flexible, multi-faceted (fluency, error analyses,
benchmarking)
§  Cons: Slow, expensive, subjective
§  Fluency vs. Adequacy
Evaluating Machine Translation Quality
Automatic Evaluation
Human Evaluation
Task-Based Evaluation

Evaluating Machine Translation Quality
Task Based Evaluation
§  Standalone evaluation of MT systems is necessary to get a sense of the
overall quality of a system
§  To determine the ultimate usability of an MT system, intrinsic task-based
evaluation is required
§  Why? Fluency vs. Adequacy
Fluency how fluent and grammatically correct the translation
output is
Adequacy how accurately the translation conveys the meaning of the
source
Output 1 The big blue house
Output 2 The big house red
Source La gran casa roja
Task-Based Evaluation

Practical uses of Machine Translation
Understand its limitations and you’ll understand
its capabilities!
No
§  Translate a patent for filing
§  Translate literature for
publication
§  Translate marketing materials
§  Anything mission critical
without review
Yes
§  Productivity tool for
professional translation
§  Understand foreign patents
§  Localisation processes and
“controlled’ content
§  High volume, e.g. eDiscovery

We provide Machine Translation
solutions with Subject Matter Expertise

We do this using Linguistic Engineering

An “ensemble” MT architecture

Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data

Data Engineering + Linguistic Engineering
An “ensemble” architecture
Chinese pre-ordering
rules
Statistical
Post-editing
Input
Output
Training Data
Spanish med-device
entity recognizer
Multi-output
Combination
Korean pharma
tokenizer
Patent input
classifier
Client TM/terminology (optional)
Japanese script
normalisation
German
Compounding rules
Moses
RBMT
Moses
Moses

What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing
For information purposes
Multilingual search
Increased productivity
Extract more meaning
Retrieve more relevant results
=
=
=

How this impacts translation quality
0
5
10
15
20
25
30
35
40
45
50
Iconic
Google
Systran
Portuguese to English

Thank You!
john@iptranslator.com
@IconicTrans

"Machine Translation 101" and the Challenge of Patents

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to "Machine Translation 101" and the Challenge of Patents

Similar to "Machine Translation 101" and the Challenge of Patents (20)

Recently uploaded

Recently uploaded (20)

"Machine Translation 101" and the Challenge of Patents

Editor's Notes