ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains 1. Copyright © 2014, Asia Online Pte Ltd
High volume, High Quality Patent Translation across Multiple Domains
Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net 2. Copyright © 2014, Asia Online Pte Ltd
•Language Studio™ is a language processing platform, not just a translation tool
•We currently support 534 language pairs
•Our very first customer was LexisNexis Univentio in 2008
–Our first commercial engine was translating Japanese patents into English
•Not all customers are in the patent space, but patents are the most complex content that we have ever encountered 3. Copyright © 2014, Asia Online Pte Ltd
•Collectively our customers are translating more than 2 billion words per day
•One single customer is translating more than 1 billion words a day of patent content
•Our highest rate of throughput required by a customer (government) to date is 600 million words per minute
–Yes, we can support this volume if you can provide the hardware – approx. 25K CPU cores
–Currently being designed and architected ahead of deployment 4. Copyright © 2014, Asia Online Pte Ltd
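As a rough sanity check on the figures above, the quoted throughput can be divided by the quoted core count. This is only back-of-envelope arithmetic on the two numbers from the slide; it assumes the load is spread evenly across cores, which the slide does not state.

```python
# Figures quoted on the slide.
WORDS_PER_MINUTE = 600_000_000
CPU_CORES = 25_000  # "approx. 25K CPU cores"

# Assumption: work is spread evenly over all cores.
words_per_core_per_minute = WORDS_PER_MINUTE / CPU_CORES
words_per_core_per_second = words_per_core_per_minute / 60

print(f"{words_per_core_per_minute:,.0f} words per core per minute")
print(f"{words_per_core_per_second:,.0f} words per core per second")
```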
•Equivalent of 20 million four drawer filing cabinets filled with text.
•The volume of data is expected to increase by 20 times by 2020. 5. Copyright © 2014, Asia Online Pte Ltd
A method of distilling a polymerizable vinyl compound selected from the group consisting of acrolein, methacrolein, acrylic acid, methacrylic acid, hydroxyethyl acrylate, hydroxyethyl methacrylate, hydroxypropyl acrylate, hydroxypropyl methacrylate, glycidyl acrylate and glycidyl methacrylate, the method comprising distilling the polymerizable vinyl compound in the presence of a polymerization inhibitor using a distillation tower having perforated trays without downcomers and wherein the temperature of the inner wall of the tower is maintained at a temperature sufficient to prevent the condensation of the vapor being distilled, whereby the polymerizable vinyl compound is distilled without the formation of polymer. 7. Copyright © 2014, Asia Online Pte Ltd
Translate 13 million historical patents from Japanese to English and also translate all new Japanese patents going forward. Follow this with the same task in many other languages.
It would take a human translator 152,257 years to translate all existing Japanese patents into English and would cost US$ 40 billion. 8. Copyright © 2014, Asia Online Pte Ltd
Quality requires an understanding of the data
There is no exception to this rule 9. Copyright © 2014, Asia Online Pte Ltd
•Structured XML
–Header
•Language
•IPC
•…
–Sections
•Title
•Claim
•Abstract
•Description 10. Copyright © 2014, Asia Online Pte Ltd
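A minimal sketch of reading such a structured patent record, so that each section can later be handled in its own writing style. The tag names and the sample record are hypothetical; each patent office uses its own schema, and the slides do not show one.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag names and values are illustrative only.
SAMPLE = """<patent>
  <header><language>ja</language><ipc>C07C</ipc></header>
  <sections>
    <title>Distillation method</title>
    <claim>A method of distilling a polymerizable vinyl compound.</claim>
    <abstract>A distillation tower with perforated trays is used.</abstract>
    <description>The inner wall of the tower is heated.</description>
  </sections>
</patent>"""


def read_patent(xml_text):
    """Return (header fields, ordered list of (section name, text))."""
    root = ET.fromstring(xml_text)
    header = {child.tag: (child.text or "").strip() for child in root.find("header")}
    sections = [(child.tag, (child.text or "").strip()) for child in root.find("sections")]
    return header, sections


header, sections = read_patent(SAMPLE)
print(header["language"], header["ipc"])
for name, text in sections:
    print(name, "->", text)
```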
•Writing Style Changes
–Between domains of knowledge
–Between sections of the patent document
•Multiple Classes Of Data
–Formulas
•Detection
•Transformation
•Protection
–Reference Numbers
•Breaks fluency of translation
•Not part of the text; metadata
–Numbers + Units
–Dates
–Patent Numbers 11. Copyright © 2014, Asia Online Pte Ltd
•Content Formatting
–Broken sentences
–Wrong encoding
–OCR
•Different data formats
–USPTO, EPO, WP and many others have their own formats
–Changes in format in different offices
•Quality of Learning Data
–Spelling errors
–Poor quality human translations
–Words glued together
–OCR
•the data provided told us it wasn’t OCRed, but… 12. Copyright © 2014, Asia Online Pte Ltd
•Gaps in Data
–Many terms are not in the learning data
•Tricks By Authors
–Changing writing mechanism
•e.g. switching to Katakana when there is a perfectly good Kanji term
•Bilingual Data
–Matching patent documents between various patent office formats
–Matching sentences
–Removing poor quality translations
–Fixing “broken data” 13. Copyright © 2014, Asia Online Pte Ltd
•Sentence Length
–The longest patent sentence we have seen so far is 4,500 words in a single sentence
•Throughput Requirements
–Front File
•Translated and published within X hours of being published by Y patent office
–Back File
•All patents going back to X within 3 months
–This is millions of documents 15. Copyright © 2014, Asia Online Pte Ltd
•Unique Customization and Quality Improvement Plan
•Clean Data Strategy
•One Engine, Multiple Writing Styles
–Writing Styles By
•Content Domain
•Document Section
–Sentence by sentence domain switching
•Hybrid – Rules + Syntax + Statistics
•Multiple Translations
–Only the best will do
•Ongoing Improvement
–Driven by Quality and Measurement 17. Copyright © 2014, Asia Online Pte Ltd
Data Cleaning
Data Preparation
Data Collections
Training
Diagnostics and Fine Tuning
Original Translation Sources
Translate
Quality Assurance
Language Pair Foundation Data
Domain Foundation Data 18. Copyright © 2014, Asia Online Pte Ltd
Language Pair Foundation
Domain Foundation
Client Data
+
=
Custom Engine
Asia Online Foundation Data
+
Sub-Domain Specific Data
Manufactured Data 19. Copyright © 2014, Asia Online Pte Ltd
•Definition
–Domain
–Target Audience
–Preferred Writing Style
–Glossaries, Non-Translatable Terms, Preferred Capitalization
–Special Formatting Requirements
–Quality Requirements
•Data Gathering
–Source data in domain
–Bilingual data to support domain
–Monolingual data to support domain
•Data Analysis
–Gap analysis
–High frequency terms
–Term extraction
•Data Generation
–Supporting grammar structures
–Source Data Analysis
•Cleaning of Data
•Tuning and Test Set Preparation
•Diagnostic Engine
–Fine tuning
Provided by client and gathered from third parties. 20. Copyright © 2014, Asia Online Pte Ltd
•Data Preparation
–Language ID
–Encoding ID
–Class Definition
–Rule Definition
–Writing Style Definition
–Data Alignment
–Data Cleaning & Repair
–Gap Analysis
–Word segmentation
–De-compounding
–Data Manufacturing
–Spelling Correction
–Domain detection
–Syntax parsing
–Reordering rules
–Data structuring rules
–Language Normalization
–Term Normalization
21. Copyright © 2014, Asia Online Pte Ltd
•Engine Training
–5 major categories
•Leverage IPC
•Override option for user to bypass IPC logic
–4 writing styles
•Title, Claim, Abstract, Description
–20 different sub-engines
•5 categories x 4 styles
–Tuning/testing data for each of the 20 sub-engines
–Integration of 20 sub-engines into a single engine 22. Copyright © 2014, Asia Online Pte Ltd
•Runtime Translation
–Pre-Translation Corrections
–Domain detection
–Syntax parsing
–Reordering rules
–Data structuring rules
–Statistical translation
–Multi-candidate translations
–Class extraction and processing 24. Copyright © 2014, Asia Online Pte Ltd
•There is no magic in MT, human effort is required.
•The quality of the output and its suitability for purpose are directly proportional to the amount of human effort.
•Without human direction, MT will cost more in the long term and is more likely to fail. 25. Copyright © 2014, Asia Online Pte Ltd
•Source
–The entire body of data in the back file
•Target
–Every USPTO patent published from 1976 until current
•Bilingual Data
–USPTO, EPO, etc. matching documents 26. Copyright © 2014, Asia Online Pte Ltd
•This is the actual format from one customer 29. Copyright © 2014, Asia Online Pte Ltd
Dirty Data SMT Model
•Data
–Gathered from as many sources as possible.
–Domain of knowledge does not matter.
–Data quality is not important.
–Data quantity is important.
•Theory
–Good data will be more statistically relevant.
Clean Data SMT Model
•Data
–Gathered from a small number of trusted quality sources.
–Domain of knowledge must match target.
–Data quality is very important.
–Data quantity is less important.
•Theory
–Bad or undesirable patterns cannot be learned if they don’t exist in the data. 30. Copyright © 2014, Asia Online Pte Ltd
English Source | Human Translation | Google Translation | Google Context
I went to the bank | Fui al banco | Fui al banco | Bank as in finance
I went to the bank to deposit money | Fui al banco para depositar dinero | Fui al banco a depositar el dinero | Bank as in finance
I went to the bank of the turn in my car | Fui en coche a la inclinación de la vuelta | Fui a la orilla de la vuelta en mi coche | Bank as in river bank
I put my car into the bank of the turn | Puse mi coche en la inclinación de la vuelta. | Pongo mi coche en el banco de la vuelta | Bank as in finance
I swam to the bank of the river | Nadé en la orilla del río | Nadé hasta la orilla del río | Bank as in river bank
I banked my money | Deposité mi dinero | Yo depositado mi dinero | Banked as in finance
I banked my car into the turn | Incliné mi coche en la vuelta | Yo depositado mi coche en la vuelta | Banked as in finance
I banked my plane into a steep dive | Incliné mi avión en para una zambullida. | Yo depositado en mi avión en picada | Banked as in finance
Issue: The above examples show that Google is biased towards the banking and finance domain.
Cause: There is much more multilingual banking and finance data available to learn from than there is aeronautical or water sports data available. 31. Copyright © 2014, Asia Online Pte Ltd
Dirty Data SMT Baseline: Client Data, Initial Customization, Improvement (20% Required for Noticeable Improvement)
Language Studio™ Clean Data SMT Foundation: Client Data, Manufactured Data, Initial Customization, Improvement (< 0.1%) 33. Copyright © 2014, Asia Online Pte Ltd
•Language Studio™ provides tools and processes for normalization of terminology
•Benefits include cost reductions, faster deliverables, higher customer satisfaction and happier post editors 34. Copyright © 2014, Asia Online Pte Ltd
Translation quality can be greatly improved by performing 3 similar but different cross references of data.
1. All Source Data to be Translated vs. Bilingual Data
Goal: Identify words in the source data to be translated that are not in the bilingual data.
Benefit: Ensures all words in the data to be translated are known and will be translated correctly.
Action: Human translate or locate word lists from industry sources and directories and add to bilingual data.
2. Monolingual Target Language Data vs. Bilingual Data
Goal: Identify words in the monolingual target language data that are not in the bilingual data.
Benefit: Ensures all words in the monolingual target language data are known, ensuring that data to be translated in future but not yet known will be translated better.
Action: Human translate or locate word lists from industry sources and directories and add to bilingual data.
3. Bilingual Data vs. Monolingual Target Language Data
Goal: Identify words in the bilingual data that are missing or low frequency in the monolingual target language data.
Benefit: Ensures that there is enough grammatical representation of the words, phrases and terminology in the monolingual target language data. This delivers greater fluency in translation output.
Action: Generate monolingual target language data using Language Studio™ Pro Crawl and Generate Tools and add to monolingual data. 35. Copyright © 2014, Asia Online Pte Ltd
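The three cross references can be sketched as simple vocabulary comparisons. This is an illustrative approximation, not the Language Studio™ implementation: tokenization is a plain whitespace split, and the function name, parameter names and the `min_count` threshold are assumptions.

```python
from collections import Counter


def tokens(lines):
    """Lowercased whitespace tokens; a real system would use proper segmentation."""
    return [word.lower() for line in lines for word in line.split()]


def gap_analysis(source_to_translate, bilingual_source, bilingual_target,
                 monolingual_target, min_count=2):
    bi_src = set(tokens(bilingual_source))
    bi_tgt = set(tokens(bilingual_target))
    mono_counts = Counter(tokens(monolingual_target))
    return {
        # 1: words to be translated that the bilingual data has never seen
        1: sorted(set(tokens(source_to_translate)) - bi_src),
        # 2: target-language words that the bilingual data has never seen
        2: sorted(set(mono_counts) - bi_tgt),
        # 3: bilingual target words that are missing or rare in the monolingual data
        3: sorted(w for w in bi_tgt if mono_counts[w] < min_count),
    }


gaps = gap_analysis(
    source_to_translate=["la torre de destilación"],
    bilingual_source=["la torre"],
    bilingual_target=["the tower"],
    monolingual_target=["the tower", "the tray", "the column"],
)
print(gaps)
```

Lists 1 and 2 would go to a human translator or a term list lookup; list 3 indicates where more monolingual text would need to be gathered or generated.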
Gruppenmasterdatenverarbeitungsvorrichtungssynchronisationsinformation
Leistungswirkungsgradindexmarkierungsberechnungseinrichtung
Schwenkmotorbetriebsdrehmomentbegrenzungswertberechnungsschritt
Differenzialmechanismusumschaltbedingungsänderungseinrichtung
Kraftstoffverbrauchsratenprioritätsmodusauswahlschalter
Reproduktionsunmöglichkeitsgegenmaßnahmeneinrichtung
Telefonbuchdatenübertragungsprotokollverbindungsabschnitts
Leistungswirkungsgradindexmarkierungsberechnungseinrichtung
Bezugspunktsolldrehungsgeschwindigkeitsfestlegungsabschnitt
Höhenstandsaufnahmedifferenzdrucksondenresonanzverstimmung
Maschinenrotationspumpenkapazitätsbefehlwandlungsabschnitt
Brennkraftmaschinenausgangsdrehmomenterfassungseinrichtung
Telefonbuchdatenübertragungsprotokollverbindungsabschnitt
übermaßwankwinkelauftrittstendenzbeurteilungseinrichtung
Unterstützungsdrehmomentbegrenzungswertberechnungsschritt
Personenwahrscheinlichkeitsberechnungsverarbeitungsroutine
Positionsaktualisierungsinformationsübertragungszeitpunkt
Automatikgetriebehydraulikfluidtemperaturerfassungseinheit
Leistungswirkungsgradindexmarkierungsberechungseinrichtung
Octadecylaminodimethyltrimethoxysilylpropylammoniumchlorid
Katalysatorverschlechterungsbeurteilungseinrichtung
Kraftstoffverbrauchsprioritätsmodusauswahlschalter
37. Copyright © 2014, Asia Online Pte Ltd
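Compounds like those listed above are rarely present as whole words in the learning data, which is why de-compounding appears in the data preparation steps. The following is a minimal dictionary-based sketch, assuming a known-word lexicon and a single linking element ("s"); the function name and the tiny lexicon are hypothetical, and production de-compounders are considerably more involved.

```python
def decompound(word, lexicon, linkers=("s",)):
    """Split `word` into lexicon entries, allowing a linking element after a part.

    Returns the split with the fewest parts, or None if no full split exists.
    """
    word = word.lower()
    best = {0: []}  # best[i] = fewest-parts split found for word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            if start not in best:
                continue
            piece = word[start:end]
            # Try the piece as-is, then with a trailing linking element removed.
            candidates = [piece] + [
                piece[: -len(link)]
                for link in linkers
                if piece.endswith(link) and len(piece) > len(link)
            ]
            for cand in candidates:
                if cand in lexicon:
                    parts = best[start] + [cand]
                    if end not in best or len(parts) < len(best[end]):
                        best[end] = parts
    return best.get(len(word))


# Hypothetical lexicon covering one compound from the list above.
LEXICON = {"kraftstoff", "verbrauch", "priorität", "modus", "auswahl", "schalter"}
parts = decompound("Kraftstoffverbrauchsprioritätsmodusauswahlschalter", LEXICON)
print(parts)
```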
•Generic MT from Google, Bing, etc. offers unknown productivity gains and sometimes productivity loss due to lack of control.
•Competitors offer < 20-40% productivity gains due to domain centric and “dirty data SMT” customization model.
•Language Studio™ :
–Targets of 150-300%+ productivity gains with granular sub-domain “clean data SMT” approach.
–Provides complete control of writing style, terminology and is mapped to target audience reducing editing effort.
Language Pair: EN-ES
Top-Level Domain: Automotive
Engines/Sub-Domains:
Client: Honda, Toyota
Product: Cars, Motorbikes
Target Audience / Purpose: Marketing, Service Reports, User Manuals, Engineering Service Manuals
Customization Level and Typical Productivity Gain:
Generic (Google/Bing Quality Level): ????
Domain (Typical Competitor Quality Level): < 20-40%
Engines/Sub-Domains: 50%+, 90%+, 150-300%+ 38. Copyright © 2014, Asia Online Pte Ltd
Translated text can be stylized based on the style of the Monolingual data.
Source (ES): Newspaper article
Bilingual Data (Possible Vocabulary): Millions of Sentence Pairs
Monolingual Data (Writing Style & Grammar):
–Business News (The Economist, New York Times, Forbes): EN text written in the style of business news
–Children’s Books (Harry Potter, Rupert the Bear, Famous Five): EN text written in the style of children’s books 39. Copyright © 2014, Asia Online Pte Ltd
Spanish Original Before Translation:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
Business News After Translation:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Children’s Books After Translation:
A lot of care was taken to not upset others when organizing the meeting between the two long time enemies. 40. Copyright © 2014, Asia Online Pte Ltd
•5 different main categories
–Tests were performed on more granular categories, but they did not have much impact for the effort
–Categories automatically detected using the IPC data
•IPCs within various ranges are mapped into 1 of 5 categories
•4 writing styles determined by the XML identifiers for the Title, Claims, Abstract and Description section.
•Language Studio is configured to recognize a sentence header and change style for every sentence based on the header.
•This permits 20 writing styles within a single engine.
–Changes the use of bilingual and monolingual data as required per style 42. Copyright © 2014, Asia Online Pte Ltd
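A sketch of how a per-sentence header could select one of the 20 styles. The slides do not give the actual IPC ranges, category names or header syntax, so the mapping, the names and the bracketed header format below are all hypothetical.

```python
STYLES = ("title", "claim", "abstract", "description")

# Hypothetical mapping from IPC section letter to one of 5 categories.
IPC_SECTION_TO_CATEGORY = {
    "A": "life_sciences",
    "B": "mechanical",
    "C": "chemistry",
    "D": "chemistry",
    "E": "mechanical",
    "F": "mechanical",
    "G": "physics",
    "H": "electrical",
}


def sub_engine(ipc_code, section, category_override=None):
    """Pick the writing style for one sentence; the override bypasses IPC logic."""
    if section not in STYLES:
        raise ValueError(f"unknown section: {section}")
    category = category_override or IPC_SECTION_TO_CATEGORY[ipc_code[0].upper()]
    return f"{category}/{section}"


def tag_sentences(ipc_code, sections):
    """Prefix each (section, sentence) pair with a header naming its style."""
    return [f"[{sub_engine(ipc_code, name)}] {text}" for name, text in sections]


print(tag_sentences("C07C", [("title", "Distillation method"),
                             ("claim", "A method of distilling a vinyl compound.")]))
```

Because the header travels with each sentence, a single engine can switch style sentence by sentence instead of routing whole documents to separate engines.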
Pre-Processing Rules
Hybrid Rules and SMT Engine Model
Hybrid Rules and Corrective Statistical Engine Model
• Sentence Segmentation
• Word Segmentation
• Phrase Reordering
• Dates and Numbers
• Patterns, Formulas etc.
• Pre-Normalization
• Spell Checking
• Custom Runtime Glossary
• Pre-Formatting
• Capitalization
• Post-Formatting
• Grammar Checking
• Post-Normalization
• XML Tag Reinsertion
• Currency Conversion
• Cross Referencing
• Other custom post processing
This is more of a Band-Aid approach, as the core MT is still a traditional Rules Based MT Engine
Statistical Machine Translation
Post-Processing Rules
Statistical Correction of Rules Errors
Translation Rules
EN
No
Yes
ES
No
Yes
• Statistical Smoothing
• “Automated Post Editing” 43. Copyright © 2014, Asia Online Pte Ltd
•Problem
–Reference numbers break translation fluency
•Solution
–Use JavaScript rules
–Remove the reference number before translation, recording its original position
–Track the position of the word associated with the reference number as it moves during translation, and reinsert the number after translation
However, malware on electronic device 103 must still make requests of resource 106 if it is to carry out malicious activities.
Apartments are in very good condition, well equipped and furnished to a very good standard.
los apartamentos están en |0-2,0, 0=0 0=1 1=2 2=3 | muy buenas condiciones |3-5,0, 0=0 1=1 2=2 | , |6-6,0, | bien equipados y amueblados |7-10,0, 0=0 1=1 2=2 3=3 | a un nivel muy bueno |11-15,0, 0=0 1=1 2=3 3=4 4=2 | . |16-16,0, | 44. Copyright © 2014, Asia Online Pte Ltd
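The remove, track and reinsert steps can be sketched as follows. The slides say the rules are written in JavaScript; Python is used here purely for illustration. The sketch assumes a simple source-index to target-index word alignment supplied by the decoder rather than the richer alignment format shown above, and all function names are hypothetical.

```python
def strip_reference_numbers(tokens):
    """Remove numeric reference tokens, remembering the word each one followed."""
    kept, refs = [], []
    for tok in tokens:
        if tok.isdigit() and kept:
            # Index, in the stripped sentence, of the word this number follows.
            refs.append((len(kept) - 1, tok))
        else:
            kept.append(tok)
    return kept, refs


def reinsert_reference_numbers(target_tokens, refs, alignment):
    """alignment maps a source word index to the target index it moved to."""
    out = list(target_tokens)
    # Insert from right to left so earlier target indices stay valid.
    for src_idx, ref in sorted(refs, key=lambda r: alignment[r[0]], reverse=True):
        out.insert(alignment[src_idx] + 1, ref)
    return out


source = "the device 103 requests resource 106".split()
stripped, refs = strip_reference_numbers(source)

# Stand-in for the MT output and its word alignment.
target = "el dispositivo solicita el recurso".split()
alignment = {0: 0, 1: 1, 2: 2, 3: 4}

print(" ".join(reinsert_reference_numbers(target, refs, alignment)))
```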
•Problem:
–A data element with an infinite number of possible values, or one that is highly variable, which statistics will not handle well
•Solution
–Use JavaScript rules
–Associate the data element with the class and store data on a Session object
–Substitute the data element with the class identifier
–Translate with the class – all data of the class will be treated the same
–After translation merge the data element back into the class using word tracking information
Original: The above-identified U.S. patent application Ser. No. 13/155,881, filed Jun. 8, 2011 provides further details of searching by image.
With classes substituted: The above-identified @PATENTNOPREFIX@ @PATENTNO@, filed @DATE@ provides further details of searching by image. 49. Copyright © 2014, Asia Online Pte Ltd
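A minimal sketch of the substitute and merge-back steps using the example sentence above. The regular expressions are illustrative guesses that cover only this example, the `session` dictionary stands in for the Session object, and restoring by order of appearance is a simplification of the word tracking the slide describes. The slides state the rules are JavaScript; Python is used here only for illustration.

```python
import re

# Hypothetical patterns; real class rules would be far more thorough.
CLASS_PATTERNS = [
    ("PATENTNOPREFIX", re.compile(r"U\.S\. patent application Ser\. No\.")),
    ("PATENTNO", re.compile(r"\d{2}/\d{3},\d{3}")),
    ("DATE", re.compile(
        r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2}, \d{4}")),
]


def protect(text):
    """Replace each data element with its class identifier; keep values in a session."""
    session = {}
    for name, pattern in CLASS_PATTERNS:
        def swap(match, name=name):
            session.setdefault(name, []).append(match.group(0))
            return f"@{name}@"
        text = pattern.sub(swap, text)
    return text, session


def restore(text, session):
    """Merge stored values back, assuming same-class elements keep their order."""
    queues = {name: list(values) for name, values in session.items()}
    return re.sub(r"@([A-Z]+)@", lambda m: queues[m.group(1)].pop(0), text)


original = ("The above-identified U.S. patent application Ser. No. 13/155,881, "
            "filed Jun. 8, 2011 provides further details of searching by image.")
protected, session = protect(original)
print(protected)
print(restore(protected, session))
```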
•Problem:
–Sometimes it is not possible to predict the best approach to deliver the best quality
•Solution:
–Perform multiple approaches and score them
•Language Studio supports multiple ordering and restructuring formats for a single segment of data.
•Each can be evaluated independently using a number of scoring metrics and the best quality translation result returned
–Scores for Segment Level Confidence, Language Model, Source Matching, TM Matching, Terminology Confidence 51. Copyright © 2014, Asia Online Pte Ltd
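Selecting among multiple candidate translations can be sketched as a weighted combination of the listed scores. The slides name the metrics but not how they are combined, so the weighted sum, the equal default weights and the sample scores below are assumptions.

```python
METRICS = (
    "segment_confidence",
    "language_model",
    "source_matching",
    "tm_matching",
    "terminology_confidence",
)


def best_candidate(candidates, weights=None):
    """Return the candidate with the highest weighted sum of its metric scores."""
    weights = weights or {metric: 1.0 for metric in METRICS}

    def combined(candidate):
        return sum(weights[m] * candidate["scores"][m] for m in METRICS)

    return max(candidates, key=combined)


# Two candidates for one segment, e.g. from different reordering strategies.
candidates = [
    {"text": "candidate from reordering A", "scores": {m: 0.5 for m in METRICS}},
    {"text": "candidate from reordering B", "scores": {m: 0.75 for m in METRICS}},
]
print(best_candidate(candidates)["text"])
```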
1. Customize
Create a new custom engine using foundation data and your own language assets
2. Measure
Measure the quality of the engine for rating and future improvement comparisons
3. Improve
Provide corrective feedback removing potential for translation errors.
4. Manage
Manage translation projects while generating corrective data for quality improvement. 52. Copyright © 2014, Asia Online Pte Ltd
•Exception handling
–Long sentences
–Bad sentences
–Bugbears
•New Data
–Integrate quickly as it is produced by various patent offices
–Data produced regularly
•Hire Specialists
–People who understand the engine and know how to refine it, to work on data and rules
•Outsource Term Translation
–Find a specialist that can translate terms from Gap Analysis 53. Copyright © 2014, Asia Online Pte Ltd
•Coined by Laura Rossi from LexisNexis
–A nasty or bad word that should never be in the translation output
•Previous solution
–Find in the phrase table data
•Remove
•Re-binarize
–Find in the training data
•Remove
–Very time consuming
•Language Studio Solution
–Bad word list
–Can be updated any time
–Translation engine decoder will ignore any data that has a bad word in it 54. Copyright © 2014, Asia Online Pte Ltd
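The bad word list approach can be sketched as a filter applied at lookup time, so the list can change without removing entries from the phrase table or re-binarizing it. The class, its methods and the sample entries are hypothetical and only illustrate the idea of the decoder ignoring data that contains a listed word.

```python
class PhraseTable:
    """Toy phrase table: source phrase -> list of (target phrase, score)."""

    def __init__(self, entries):
        self.entries = entries
        self.bad_words = set()

    def update_bad_words(self, words):
        # Can be called at any time; no retraining or rebuilding is needed.
        self.bad_words = {w.lower() for w in words}

    def lookup(self, source_phrase):
        # Skip any option whose target side contains a listed word.
        return [
            (target, score)
            for target, score in self.entries.get(source_phrase, [])
            if not self.bad_words & set(target.lower().split())
        ]


table = PhraseTable({"装置": [("device", 0.6), ("gizmo", 0.3), ("apparatus", 0.1)]})
print(table.lookup("装置"))
table.update_bad_words(["gizmo"])
print(table.lookup("装置"))
```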
•Training data can often have gaps in coverage and an excess of data in other areas.
•Gaps in coverage reduce translation quality.
•Gaps can quickly be filled via post editing the machine translated output and submitting the data back to the system for further learning.
•Many gaps can be filled with monolingual data only.
•Further gaps can be identified and resolved by analyzing the text that is to be translated for high frequency terms and unknown words
•In some cases incorrect data may be statistically more relevant. Post editing will raise the relevance of the correct grammar.
Chart: Example of Training Data. Labels: Data Volume, Gaps in Topic Coverage, Sufficient Data Threshold, Data Shortfall, Post Edited Feedback and Generated Data to Fill Gaps.
More initial data provided for training results in greater vocabulary and grammatical coverage above the Sufficient Data Threshold and less post editing feedback required. 55. Copyright © 2014, Asia Online Pte Ltd
•Document and Proximity Translations
–All existing translation platforms translate at a sentence level only.
–By leveraging information in the document or in near proximity to the current sentence, higher quality translations are possible.
•Immediate Quality Updates
–Updates to engine quality within 60 minutes of making edits.
–Updates to engine quality by learning automatically from external sources.
•Improved Slavic language support
–Generation of inflected forms
–Deeper grammatical and syntactical analysis 56. Copyright © 2014, Asia Online Pte Ltd
High volume, High Quality Patent Translation across Multiple Domains
Dion Wiggins
Chief Executive Officer
dion.wiggins@asiaonline.net