Embracing diversity searching over multiple languages

Embracing Diversity: Searching
over multiple languages
Tommaso Teofili
Suneel Marthi
June 12, 2017
Berlin Buzzwords, Berlin, Germany
1
$WhoAreWe
Tommaso Teofili
 @tteofili
Software Engineer, Adobe Systems
Member of the Apache Software Foundation
PMC Chair, Apache Lucene
Committer and PMC member on Apache Joshua, Apache OpenNLP, Apache Jackrabbit
Suneel Marthi
 @suneelmarthi
Principal Software Engineer, Office of Technology, Red Hat
Member of the Apache Software Foundation
Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
2
Agenda
What is Multi-Lingual Search ?
Why Multi-Lingual Search ?
What is Statistical Machine Translation ?
Overview of Apache Joshua
Dataflow Pipeline
Demo
3
What is Multi-Lingual Search ?
4
Searching
over content written in different languages
with users speaking different languages
both
Parallel corpora
Translating queries
Translating documents
5
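A minimal sketch of the "translating queries" option, assuming a toy inverted index and a hypothetical word-level dictionary (`DE_TO_EN`) in place of a real translation system. All names and documents here are illustrative, not taken from the demo.

```python
from collections import defaultdict

# Hypothetical word-level dictionary standing in for a real translation system.
DE_TO_EN = {"gebäude": "building", "hoch": "high"}

# Hypothetical mixed-language document collection.
DOCS = {
    1: "the building is high",
    2: "das gebäude ist hoch",
    3: "a tower in berlin",
}

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query, translate=None):
    """Return ids of documents matching any query term.

    If `translate` is given, each term is also searched in translated form,
    so one query reaches content written in both languages.
    """
    terms = set(query.lower().split())
    if translate:
        terms |= {translate[t] for t in terms if t in translate}
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index(DOCS)
native_only = search(index, "gebäude")
multilingual = search(index, "gebäude", translate=DE_TO_EN)
```

Translating the query at search time leaves the index untouched; translating documents instead moves the cost to indexing time.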
Why Multi-Lingual Search ?
6
Embracing diversity
Most online tech content is in English
Wikipedia dumps:
en: 62 GB
de: 17 GB
it: 10 GB
Good number of non-English speaking users
A lot of search queries are composed in English
Preferable to retrieve search results in the user's native language
… or even to consolidate all results in one language
7
UC1 — tech domain, native first
8
UC2 — native only ?
9
What is Machine Translation ?
10
Generate translations from statistical models trained
on bilingual corpora.
Translation happens per a probability distribution
p(e|f)
e = string in the target language (English)
f = string in the source language (Spanish)
e~ = argmax_e p(e|f) = argmax_e p(f|e) * p(e)
(Bayes' rule; p(f) is constant for a given input, so it drops out of the argmax)
e~ = best translation, the one with the highest probability
11
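A small worked example of the argmax, assuming made-up scores for three candidate translations of a Spanish input. The probabilities and candidate strings are hypothetical; a real system derives them from a translation model and a language model.

```python
# Hypothetical scores for translating the Spanish string f = "la casa".
# translation_model holds p(f|e), language_model holds p(e). Values are made up.
translation_model = {"the house": 0.40, "house the": 0.40, "the home": 0.25}
language_model = {"the house": 0.020, "house the": 0.001, "the home": 0.010}

def best_translation(tm, lm):
    """Return argmax_e p(f|e) * p(e) over the candidate strings e."""
    return max(tm, key=lambda e: tm[e] * lm[e])

e_best = best_translation(translation_model, language_model)
```

The translation model alone cannot separate "the house" from "house the" here; the language model term p(e) is what prefers the fluent word order.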
Word-based Translation
12
How to translate a word → lookup in dictionary
Gebäude — building, house, tower.
Multiple translations
some more frequent than others
for instance: house and building most common
13
Look at a parallel corpus
(German text along with English translation)
Translation of Gebäude    Count           Probability
house                     5.28 billion    0.51
building                  4.16 billion    0.402
tower                     9.28 million    0.09
14
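Probabilities like these are typically relative frequencies of the counts. A minimal sketch, assuming small hypothetical counts rather than the corpus figures:

```python
def lexical_probabilities(counts):
    """Relative-frequency estimate: p(e|f) = count(f, e) / sum of count(f, e') over all e'."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical counts of how often "Gebäude" was aligned to each English word.
counts = {"house": 50, "building": 40, "tower": 10}
probs = lexical_probabilities(counts)
```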
Alignment
In a parallel text (or when we translate), we align
the words in one language with the words in the other
Das   Gebäude    ist   hoch
 ↓       ↓        ↓     ↓
the   building   is    high
Word positions are numbered 1 to 4
15
Alignment Function
Define the Alignment with an Alignment Function
Mapping an English target word at position i to a
German source word at position j with a function a :
i → j
Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
16
One-to-Many Translation
A source word could translate into multiple target words
Das    ist   ein   Hochhaus
 ↓      ↓     ↓    ↙   ↓   ↘
This   is    a   high rise building
17
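The alignment function can be sketched as a plain mapping from target position i to source position j. The helper names below are hypothetical; the two alignments are the ones from the slides.

```python
from collections import defaultdict

def aligned_pairs(source, target, a):
    """List (target word, source word) pairs for alignment a: i -> j (1-based)."""
    return [(target[i - 1], source[a[i] - 1]) for i in sorted(a)]

def targets_per_source(a):
    """Invert a: group target positions by the source position they align to."""
    grouped = defaultdict(list)
    for i, j in sorted(a.items()):
        grouped[j].append(i)
    return dict(grouped)

# One-to-one example: every target word aligns to the source word at the same position.
source = "Das Gebäude ist hoch".split()
target = "the building is high".split()
a = {1: 1, 2: 2, 3: 3, 4: 4}
pairs = aligned_pairs(source, target, a)

# One-to-many example: "Hochhaus" (source position 4) yields three target words.
a_many = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4}
inverse = targets_per_source(a_many)
```

Because a maps target positions to source positions, several target words can share one source word without any change to the representation.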
Phrase-based Translation
18
Phrase-Based Models
Word-Based Models translate words as atomic units
Phrase-Based Models translate phrases as atomic units
Advantages:
many-to-many translation can handle non-compositional phrases
use of local context in translation
the more data, the longer phrases can be learned
“Standard Model”, used by Google Translate and others
19
Phrase-Based Model
Berlin ist ein herausragendes Kunst- und Kulturzentrum .
↓ ↓ ↓ ↓ ↓ ↓
Berlin is an outstanding art and cultural center .
Foreign input is segmented into phrases
Each phrase is translated into English
Phrases are reordered
20
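The three steps can be sketched with a toy phrase table. This is a simplified illustration under stated assumptions: the table entries are hypothetical, segmentation is greedy longest-match, and the example needs no reordering, so phrases are emitted in source order. A real decoder scores many segmentations and orderings.

```python
# Hypothetical phrase table; a real one is learned from a parallel corpus
# and stores several scored translations per phrase.
PHRASE_TABLE = {
    ("berlin",): "Berlin",
    ("ist",): "is",
    ("ein",): "an",
    ("herausragendes",): "outstanding",
    ("kunst-", "und", "kulturzentrum"): "art and cultural center",
    (".",): ".",
}

def segment(tokens, table, max_len=3):
    """Greedy longest-match segmentation of the input into known phrases."""
    phrases, start = [], 0
    while start < len(tokens):
        for length in range(min(max_len, len(tokens) - start), 0, -1):
            phrase = tuple(t.lower() for t in tokens[start:start + length])
            if phrase in table:
                phrases.append(phrase)
                start += length
                break
        else:
            # Unknown word: pass it through as its own phrase.
            phrases.append((tokens[start],))
            start += 1
    return phrases

def translate(sentence, table=PHRASE_TABLE):
    """Segment, translate each phrase, and join in source order (no reordering)."""
    phrases = segment(sentence.split(), table)
    return " ".join(table.get(p, " ".join(p)) for p in phrases)

result = translate("Berlin ist ein herausragendes Kunst- und Kulturzentrum .")
```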
Decoding
21
We have a mathematical model for translation
p(e|f)
Task of decoding: find the translation e_best with the
highest probability
Two types of error
the most probable translation is bad → fix the
model
search does not find the most probable translation
→ fix the search
e_best = argmax_e p(e|f)
22
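A toy illustration of a search error, assuming a hypothetical two-word model with made-up conditional probabilities. Exhaustive search returns the true argmax; greedy search commits to the locally best first word and misses it.

```python
from itertools import product

# Hypothetical scores p(word | previous word); "<s>" marks the sentence start.
MODEL = {
    ("<s>", "he"): 0.6, ("<s>", "it"): 0.4,
    ("he", "drinks"): 0.5, ("he", "drank"): 0.2,
    ("it", "drinks"): 0.9, ("it", "drank"): 0.1,
}
# Candidate words for each output position.
CHOICES = [["he", "it"], ["drinks", "drank"]]

def score(words):
    """Probability of a word sequence under the toy model."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= MODEL[(prev, w)]
        prev = w
    return p

def exhaustive_search():
    """Score every candidate and return the true argmax."""
    return max(product(*CHOICES), key=score)

def greedy_search():
    """Commit to the locally best word at each position."""
    words, prev = [], "<s>"
    for options in CHOICES:
        best = max(options, key=lambda w: MODEL[(prev, w)])
        words.append(best)
        prev = best
    return tuple(words)

e_best = exhaustive_search()
e_greedy = greedy_search()
```

Here the model itself is unchanged between the two runs, so the gap between `e_greedy` and `e_best` is purely a search problem. If `e_best` were a bad translation, the fix would belong in the model instead.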
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
Pick an input phrase, translate it, and repeat until every source word is covered
Step 1: er → he
Step 2: ja noch nichts → does not yet
Step 3: trinkt → drink
Result: he does not yet drink
25
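The process can be sketched as repeated hypothesis expansion with a coverage vector. The phrase options and the `expand` helper are hypothetical; a real decoder explores many competing hypotheses and scores each one.

```python
# Hypothetical phrase options for the example query:
# (start, end) source span -> English phrase.
SOURCE = "er trinkt ja noch nichts".split()
OPTIONS = {
    (0, 1): "he",
    (1, 2): "drink",
    (2, 5): "does not yet",
}

def expand(hypothesis, span):
    """Translate one uncovered source span and append it to the partial output."""
    covered, output = hypothesis
    start, end = span
    if any(covered[start:end]):
        raise ValueError(f"span {span} overlaps words that are already translated")
    new_covered = covered[:start] + [True] * (end - start) + covered[end:]
    return new_covered, output + [OPTIONS[span]]

# Start with nothing covered, then pick input phrases in the order used on the slides.
hypothesis = ([False] * len(SOURCE), [])
for span in [(0, 1), (2, 5), (1, 2)]:
    hypothesis = expand(hypothesis, span)

covered, output = hypothesis
translation = " ".join(output)
```

Source phrases are picked out of order, while the English output is always built left to right; the coverage vector ensures each source word is translated exactly once.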
Apache Joshua
26
Statistical machine translation decoder for phrase-based and hierarchical
machine translation
Written in Java
Provides 64 language packs for machine translation
https://cwiki.apache.org/confluence/display/JOSHUA/Language+P
Project initiated by Johns Hopkins University and the University of Pennsylvania
Presently incubating at the Apache Software Foundation
Used extensively by Amazon.com, NASA JPL
https://cwiki.apache.org/confluence/display/JOSHUA
 @ApacheJoshua
27
Flows
28
References
Apache Joshua: https://cwiki.apache.org/confluence/display/JOSHUA
Apache OpenNLP: https://opennlp.apache.org
GitHub: https://github.com/smarthi/BBuzz-multilang-search
Slides: https://smarthi.github.io/bbuzz17-embracing-diversity-searching-over-multiple-languages/#/
29
Credits
Joern Kottmann — PMC Chair, Apache OpenNLP
Matt Post — PMC Chair, Apache Joshua
Bruno P. Kinoshita — Committer on Apache OpenNLP,
committer and PMC on Apache Commons and Apache
Jena
30
Questions ???
31