Manuel Herranz
When search engine [techniques] meet machine translation
Introducing Machine Translation
Unrelated tech changes the game…
ElasticTM
• TM database built from TMX files
• Based on a state-of-the-art full-text search engine
• Extremely fast indexing, search and retrieval
• Supports advanced text retrieval techniques (fuzzy match, regular expressions)
• Easily scalable
• Role-based security
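As an illustration of the first point, here is a minimal sketch of reading translation units out of a TMX file with the Python standard library. The sample document and the `read_tmx` helper are assumptions made for this example; the slides do not show how ElasticTM imports TMX.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace, so ElementTree exposes it
# under this qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data, reusing the sentence pair from the slides.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>I have 3 cats</seg></tuv>
      <tuv xml:lang="es"><seg>Yo tengo 3 gatos</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return one {language: segment} dict per translation unit (<tu>)."""
    units = []
    for tu in ET.fromstring(text).iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_tmx(SAMPLE_TMX)
print(units)
```

Each resulting dict could then be split by language and sent to the corresponding monolingual index.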
ElasticTM - Search Engine
• Lucene-based search engines considered:
○ Solr and ElasticSearch
○ Mature open source projects
○ Similar capabilities & performance
• ElasticSearch was picked mainly because of:
○ Out-of-the-box scalability
○ Powerful Query DSL (query language)
○ Role-based security (via plugin)
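To give a feel for the Query DSL mentioned above, the sketch below builds the JSON body of a fuzzy `match` query as a plain Python dict. The field name `segment` and the helper `fuzzy_segment_query` are hypothetical; the slides do not show ElasticTM's actual index schema or queries.

```python
import json


def fuzzy_segment_query(text, field="segment", size=5):
    """Build a Query DSL body for a fuzzy full-text match on one field."""
    return {
        "size": size,  # number of candidate segments to return
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Let the engine pick the edit distance per term length.
                    "fuzziness": "AUTO",
                }
            }
        },
    }


body = fuzzy_segment_query("I have 2 cats")
print(json.dumps(body, indent=2))
```

The dict is what would be sent as the body of a search request against the index for the source language.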
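ElasticTM - Design
[Diagram: ElasticTM combines a search engine holding per-language indices (EN, ES, FR, ..., NL) with a Map DB holding language-pair mappings (EN <-> ES, FR <-> ES, FR <-> NL, ...).]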
ElasticTM - Design (cont'd)
• Monolingual indices
○ Memory-efficient
○ Implicit transitive language pairs
• Bilingual mappings
○ Fast bidirectional id <-> id mapping
• Role-based security system
○ Admin, project admin, user, etc.
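One way to read "implicit transitive language pairs": if EN <-> ES and FR <-> ES mappings exist, an EN segment can reach its FR counterpart through the shared ES segment. The sketch below illustrates that idea only; `SegmentMap`, `pivot` and the ids are hypothetical names, not ElasticTM's implementation.

```python
from collections import defaultdict


class SegmentMap:
    """Bidirectional id <-> id links between two monolingual indices."""

    def __init__(self):
        self.forward = defaultdict(set)   # left id  -> right ids
        self.backward = defaultdict(set)  # right id -> left ids

    def link(self, left_id, right_id):
        self.forward[left_id].add(right_id)
        self.backward[right_id].add(left_id)


def pivot(left_to_pivot, other_to_pivot, left_id):
    """Find ids in a third language that share a pivot segment with left_id."""
    found = set()
    for pivot_id in left_to_pivot.forward.get(left_id, ()):
        found |= other_to_pivot.backward.get(pivot_id, set())
    return found


en_es = SegmentMap()
en_es.link("en-1", "es-7")
fr_es = SegmentMap()
fr_es.link("fr-4", "es-7")

print(pivot(en_es, fr_es, "en-1"))  # {'fr-4'}
```

No EN <-> FR mapping is stored anywhere; the pair is derived at query time from the two maps that share ES.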
ElasticTM - Map
• Mapping source language segments to a target language
• Bidirectional map (id to id)
• Supports quick bulk incremental updates
Alternatives?
• NoSQL key-value databases
○ MongoDB
○ CouchDB
○ Redis
○ ElasticSearch, many others …
• SQL databases
○ MySQL ✗
○ PostgreSQL ✗
Concerns: lack of upsert support for bulk updates, handling duplicate entries, scalability
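The requirements on this slide (bidirectional lookup, bulk incremental updates, upserts, duplicate handling) can be sketched with an in-memory structure. This is an illustration under the assumption of a one-to-one mapping between ids; `BiMap` is a hypothetical class, not the storage backend used by ElasticTM.

```python
class BiMap:
    """One-to-one id <-> id map with upsert semantics (assumed for this sketch)."""

    def __init__(self):
        self.src_to_tgt = {}
        self.tgt_to_src = {}

    def upsert(self, src_id, tgt_id):
        """Insert or overwrite a pair, dropping stale links in both directions."""
        old_tgt = self.src_to_tgt.pop(src_id, None)
        if old_tgt is not None:
            self.tgt_to_src.pop(old_tgt, None)
        old_src = self.tgt_to_src.pop(tgt_id, None)
        if old_src is not None:
            self.src_to_tgt.pop(old_src, None)
        self.src_to_tgt[src_id] = tgt_id
        self.tgt_to_src[tgt_id] = src_id

    def bulk_upsert(self, pairs):
        """Apply many pairs; exact duplicates are skipped. Returns the change count."""
        changed = 0
        for src_id, tgt_id in pairs:
            if self.src_to_tgt.get(src_id) == tgt_id:
                continue  # already stored, nothing to do
            self.upsert(src_id, tgt_id)
            changed += 1
        return changed


m = BiMap()
first = m.bulk_upsert([("s1", "t1"), ("s2", "t2"), ("s1", "t1")])
second = m.bulk_upsert([("s1", "t3")])
print(first, second, m.src_to_tgt)
```

The second call overwrites the `s1` entry rather than adding a second one, which is the behaviour a bulk incremental import needs.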
ElasticTM - Map - Benchmarks
[Charts: time (s) and memory (MB) per backend; the lower, the better]
ElasticTM - Map - Benchmarks
Operation     | ElasticSearch | MongoDB | CouchDB | Redis
Add (47K)     | 83s           | 432s    | 67s     | 458s
Add (440K)    | 858s          | 6112s   | 644s    | 621s
Query (10K)   | 51s           | 187s    | 458s    | 72s
Query (440K)  | 1400s         | 6451s   | 19647s  | 1210s
Memory        | 252M          | 549M    | 771M    | 148M
ElasticTM - Scaling
[Diagram: two cluster layouts. 1) Cluster labelled EN, ..., alongside EN-ES1 and EN-ES2. 2) Cluster labelled EN1, ES1, ES2, EN2, ..., alongside EN-ES1, EN-ES2 and ES.]
Translation Memory (TM)
• Pre-translations stored in a database and offered as suggestions
• A matching algorithm proposes relevant translations
○ Exact match and fuzzy match
○ Segment similarity based on characters or tokens
NLP improves the matching algorithm
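The difference between character-based and token-based similarity can be sketched with Python's `difflib`. This is one possible scoring function chosen for illustration; the slides do not say which metric ElasticTM uses, so the scores will not necessarily match the percentages shown in the examples that follow.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Similarity in [0, 1] computed over characters."""
    return SequenceMatcher(None, a, b).ratio()


def token_similarity(a, b):
    """Similarity in [0, 1] computed over whitespace-separated tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(query, tm_sources, similarity=char_similarity):
    """Return (score, source) for the TM source segment closest to the query."""
    return max((similarity(query, source), source) for source in tm_sources)


tm_sources = ["I have 3 cats", "I have a cat and I am very happy"]
score, source = best_match("I have 2 cats", tm_sources)
print(round(score, 3), source)
print(token_similarity("I have 2 cats", "I have 3 cats"))
```

A single changed digit costs little at character level but a full token at token level, which is why the two scores differ for the same pair.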
Approach
• Statistical Machine Translation (SMT)
• Computer-Aided Translation (CAT) environment
Run maintenance
• Search and replace
• Update TM entries
• Import & export entries
Translation Memory
TM processing
ElasticTM
Full-text search engine + NLP techniques
Basic examples of TM Matching & processing
Input
{
"input_source": "I have 2 cats",
"output_target": " "
}
Original TM: fuzzy match
{
"source_TM": "I have 3 cats",
"target_TM": "Yo tengo 3 gatos",
"score": "80%"
}
TM after preprocessing: perfect match by substitution
{
"source_TM": "I have <number> cats",
"target_TM": "Yo tengo <number> gatos",
"score": "100%"
}
• URLs
• Emails
• Dates
• Units
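The substitution step in this example can be sketched as follows: numbers are masked with a `<number>` placeholder before lookup, and the original values are put back into the target. The regular expression, the helper names and the dict-based TM are assumptions for illustration; URLs, emails, dates and units would need their own patterns, and the sketch assumes numbers appear in the same order in source and target.

```python
import re

# Integers and simple decimals such as 3, 2.5 or 1,5 (an assumed pattern).
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def mask_numbers(text):
    """Replace each number with <number>; return the masked text and the values."""
    values = NUMBER.findall(text)
    return NUMBER.sub("<number>", text), values


def unmask_numbers(template, values):
    """Fill <number> placeholders from left to right with the given values."""
    remaining = iter(values)
    # If values run out, the placeholder is left untouched.
    return re.sub(r"<number>", lambda m: next(remaining, m.group(0)), template)


def translate_by_substitution(source, tm):
    """Look up the masked source in the TM and restore the numbers in the target."""
    masked, values = mask_numbers(source)
    target_template = tm.get(masked)
    if target_template is None:
        return None
    return unmask_numbers(target_template, values)


tm = {"I have <number> cats": "Yo tengo <number> gatos"}
print(translate_by_substitution("I have 2 cats", tm))  # Yo tengo 2 gatos
```

With the placeholder in place, "I have 2 cats" and "I have 3 cats" become the same TM key, which is what turns the fuzzy match into a perfect one.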
Basic examples of TM Matching & processing
Input
{
"input_source": "I have a cat",
"output_target": " "
}
Original TM: fuzzy match
{
"source_TM": "I have a cat and I am very happy",
"target_TM": "Yo tengo un gato y estoy muy feliz",
"score": "44%"
}
TM after preprocessing: perfect match by substitution
{
"source_TM": "I have a cat",
"target_TM": "Yo tengo un gato",
"score": "100%"
}
{
"source_TM": "I am very happy",
"target_TM": "Estoy muy feliz"
}
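The preprocessing in this example splits one TM unit into two shorter units at the conjunction. A naive sketch of that idea is shown below, assuming the conjunction is known for each language and that the resulting parts align by position; `split_unit` and its parameters are hypothetical, and real alignment is considerably harder.

```python
import re


def split_on(text, conjunction):
    """Split text on a standalone conjunction and capitalise each part."""
    pattern = r"\s+" + re.escape(conjunction) + r"\s+"
    parts = [part.strip() for part in re.split(pattern, text) if part.strip()]
    return [part[0].upper() + part[1:] for part in parts]


def split_unit(source, target, source_conj="and", target_conj="y"):
    """Split a TM unit into aligned sub-units, or keep it whole if counts differ."""
    src_parts = split_on(source, source_conj)
    tgt_parts = split_on(target, target_conj)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]  # cannot align by position, leave untouched
    return list(zip(src_parts, tgt_parts))


pairs = split_unit(
    "I have a cat and I am very happy",
    "Yo tengo un gato y estoy muy feliz",
)
for src, tgt in pairs:
    print(src, "->", tgt)
```

After the split, the input "I have a cat" matches the first sub-unit exactly instead of scoring 44% against the full sentence.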
Improving TM Matching
• Several languages → maximise the reuse of existing human translations
• Linguistic features → improve fuzzy matching
○ String transformation
○ Segmentation rules
○ POS tagger
○ Tokenizer
[Diagram: languages EN, ES, PT, JA, ..., FR shown on both sides]
Improving TM Matching
Linguistic approach to improve matching
• Segment the text by sentence
○ Using delimiters like . ? ! , - :
○ Limiting the total number of words
• Intra-sentence segmentation
○ Using conjunctions, adverbs, determiners, pronouns
○ Other approaches
• Replace segments
○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables
• POS pattern string
• Named entity recognition
[Diagram labels: ElasticTM, TMX files, source text]
(Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)
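The first step in the list, sentence segmentation with a cap on segment length, might look like the sketch below. It only uses a subset of the listed delimiters (. ? ! :) and a hypothetical `max_words` parameter; it is not taken from the cited works or from ElasticTM.

```python
import re


def segment(text, max_words=10):
    """Split on sentence delimiters, then cap each piece at max_words words."""
    # Split after . ? ! : when followed by whitespace; delimiters stay attached.
    pieces = re.split(r"(?<=[.?!:])\s+", text.strip())
    segments = []
    for piece in pieces:
        words = piece.split()
        for start in range(0, len(words), max_words):
            segments.append(" ".join(words[start:start + max_words]))
    return segments


result = segment("I have a cat. Are you happy? Yes!", max_words=3)
print(result)
```

Shorter segments raise the chance of an exact or near-exact hit in the TM, at the cost of losing some context around each piece.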
Challenges
• Morphologically rich and non-Indo-European languages
• Go beyond statistics (ongoing work, part of the EXPERT project)
Hybrid approaches improve certain language pairs: Japanese (R&D with Japanese partners), morphologically rich languages, Semitic languages.
• Continue building revenue streams on MT
MT allows Pangeanic to build other technologies (web, search, etc.), enhance and improve its solutions for its client portfolio, and offer new services.
Thank you!
Questions?
m.herranz@pangeanic.com
#pangeanic pangeanic

Pangea v3 - when engine search meets machine translation, Manuel Herranz, Pangeanic
