PyCon Korea 2019
이홍주 (lee.hongjoo@yandex.com)
A homemade machine-learning-based machine translator (집에서 만든 머신러닝 기반 자동번역기)
#NoDeepLearning
PyCon Korea 2019
이홍주 (lee.hongjoo@yandex.com)
KOREANizer :
SMT based Ro-Ko Transliterator
Introduction
● Korean Input Method SUCKS!
Installing Korean Language on Windows 3.1 and Windows 95
Once upon a time...
Introduction
● https://google.com/inputtools
Google Input Tools
Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
MT Basics
● Three MT Levels : Direct, Transfer, Interlingual
Machine Translation Pyramid (Bernard Vauquois' pyramid)
[Figure: source text is analyzed upward through source syntax and source semantics toward an Interlingua, then generated downward through target semantics and target syntax into target text; Direct, Transfer, and Interlingua mark the three levels.]
MT Basics
● Direct Translation
○ Single phase : translate word by word with some
reorderings
○ lack of analysis
■ long-range reordering
● “Sources said that IBM bought Lotus yesterday”
● “소식통(은) 어제 IBM(이) Lotus(를) 샀다(고) 말했다"
■ syntactic role ambiguity
● “They said that I like ice-cream”
● “They like that ice-cream”
Machine Translation Pyramid (Bernard Vauquois' pyramid)
[Figure: the Direct route runs straight from source text to target text along the base of the pyramid.]
MT Basics
● Interlingual Translation
○ Two phases
■ Analysis : Analyze the source language into a semantic
representation
■ Generation : Convert the representation into the target language
Machine Translation Pyramid (Bernard Vauquois' pyramid)
[Figure: source text → analysis → semantics / Interlingua → generation → target text.]
MT Basics
● Transfer based Translation
○ Three phases
■ Analysis : Analyze the source language’s structure
■ Transfer : Convert the source’s structure to a target’s
■ Generation : Convert the target structure into the target language
Machine Translation Pyramid (Bernard Vauquois' pyramid)
[Figure: source text → analysis → Transfer → generation → target text.]
MT Basics
● Transfer based Translation
○ Levels of Transfer : Words, Phrases, Syntax
Machine Translation Pyramid (Bernard Vauquois' pyramid)
[Figure: transfer can happen at the level of words, phrases, or syntax on both the source and target sides.]
MT Basics
● Early ideas
○ 18th C. - Bayes' theorem
○ 1948 - Noisy-Channel coding theorem
○ 1949 - Warren Weaver’s memo
● Statistical Machine Translation (SMT):
○ 1988 - Word-based models
○ 2003 - Phrase-based models
○ 2006 - Google Translate
● Neural Machine Translation (NMT):
○ 2013 - First papers on pure NMT
○ 2015 - NMT enters shared tasks (WMT, IWSLT)
○ 2016 - In production
History
MT Basics
● SMT as Noisy Channel
○ Said in English, received in Spanish
The Noisy Channel Model
Good Morning! ¡Buenos días!
MT Basics
● SMT as Noisy Channel
○ By convention, we use E for the source language and F for the foreign language.
The Noisy Channel Model
[Figure: e: Good Morning! (e ∈ Language E) → translation → f: ¡Buenos días! (f ∈ Language F)]
MT Basics
● SMT as Noisy Channel
○ P(f|e)
The Noisy Channel Model
[Figure: e: Good Morning! (e ∈ Language E) → translation, P(f|e) → f: ¡Buenos días! (f ∈ Language F)]
T(f) = ê = argmax_e P(e|f)
MT Basics The Noisy Channel Model
[Figure: e: Good Morning! (e ∈ Language E) → translation → f: ¡Buenos días! (f ∈ Language F)]
MT Basics
● P(e) - Language Model
○ models the fluency of the translation
○ data : corpus in the target language E
● P(f|e) - Translation Model
○ models the adequacy of the translation
○ data : parallel corpus of F and E pairs
● argmax_e - Decoder
○ given LM, TM and f, generate the most fluent and adequate translation result ê
SMT Systems
MT Basics SMT Systems
[Figure: a Spanish/English parallel corpus is statistically analyzed to build the Translation Model (TM), and an English corpus is statistically analyzed to build the Language Model (LM); Spanish input → TM → broken English → LM → decoding → English.]
MT Basics
● candidates based on Translation Model alone
○ Que hambre tengo yo
What hunger have I p(s|e) = 0.000014
Hungry I am so p(s|e) = 0.000001
I am so hungry p(s|e) = 0.0000015
Have I that hunger p(s|e) = 0.000020
...
SMT Systems
MT Basics
● with Language Model
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
MT Basics
● by Decoding
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
argmax_e = I am so hungry
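To make the argmax concrete, here is a minimal sketch (not from the talk; the numbers are simply copied from the table above) that picks the candidate with the largest p(s|e) x p(e) product:

candidates = {
    'What hunger have I':  0.000014  * 0.000001,
    'Hungry I am so':      0.000001  * 0.0000014,
    'I am so hungry':      0.0000015 * 0.0001,
    'Have I that hunger':  0.000020  * 0.00000098,
}
best = max(candidates, key=candidates.get)
print(best)   # 'I am so hungry'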
Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the next word?
○ “I am _____”
■ Sam, eggs, ham, not, ...
○ P(Sam | I am) = ?
Language Model
SMT Components
I am Sam
Sam I am
I do not like green eggs and ham
Language Model
SMT Components
I am Sam
Sam I am
I do not like green eggs and ham
Language Model
SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = ?
Language Model
SMT Components
● Predict the probability of a sequence of words:
○ w = (w1 w2 w3 ... wk)
○ A model computes P(wk | w1 w2 w3 ... wk-1) or P(w)
● Application
○ speech recognition : P(I saw a van) >> P(eyes awe of an)
○ spelling correction : P(about fifteen minutes from) > P(about fifteen
minuets from)
○ machine translation : P(high winds tonite) > P(large winds tonite)
○ handwriting recognition
○ suggestions (search keyword, messaging,...)
Language Model
SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
Language Model
SMT Components
● Chain rule
○ two random variables : P(x1, x2) = P(x1) P(x2|x1)
○ more than two variables : P(x1, ..., xn) = P(x1) P(x2|x1) ... P(xn|x1, ..., xn-1)
○ example, n = 4 variables : P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1, x2) P(x4|x1, x2, x3)
SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
Language Model
SMT Components
● Markov assumption
○ Conditional probability distribution of future states depends only on
the present state, not on the sequence of events that preceded it
(wikipedia)
○ Bi-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | green )
○ Tri-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | like green )
Language Model
SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
● Markov assumption:
○ P(wi | w1 ... wi-1) ≈ P(wi | wi-n+1 ... wi-1)
○ ex) n = 2 ⇒ P(wi | w1 ... wi-1) ≈ P(wi | wi-1)
Language Model
SMT Components
● For a Bi-gram Language Model (n=2):
○ P(w) = P(w1) P(w2|w1) ... P(wk|wk-1)
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = P(I) x P(am | I) x P(Sam | am)
= 3/14 x 2/3 x 1 = 1/7
Language Model
SMT Components
● Bi-gram Language Model (Normalized):
○ P(w) = P(w1|<s>) P(w2|w1) ... P(wk|wk-1) P(</s>|wk)
● Given a corpus
○ “<s> I am Sam </s>”
○ “<s> Sam I am </s>”
○ “<s> I do not like green eggs and ham </s>”
● What’s the probability of the whole sequence?
○ P(<s> I am Sam </s>) = P(I|<s>) x P(am|I) x P(Sam|am) x P(</s>|Sam)
= 2/3 x 2/3 x 1/2 x 1/2 = 1/9
Language Model
● generate Bi-gram
SMT Components
from typing import Callable, List

def generate_ngrams(
        s: str,
        n: int,
        tokenize: Callable[[str], List[str]] = lambda x: x.split()) -> List[str]:
    # 'a b c' => ['a', 'b', 'c']
    tokens = tokenize(s)
    # zip(['a', 'b', 'c'], ['b', 'c']) => [('a', 'b'), ('b', 'c')]
    ngrams = zip(*[tokens[i:] for i in range(n)])
    # ['a b', 'b c']
    return [" ".join(ngram) for ngram in ngrams]
Language Model
● generate Bi-gram
SMT Components
corpus = ['I am Sam',
          'Sam I am',
          'I do not like green eggs and ham']

def tokenizer(s: str) -> List[str]:
    return ['<s>'] + s.split() + ['</s>']
Language Model
SMT Components
>>> corpus[0]
'I am Sam'
>>> generate_ngrams(corpus[0], 2, tokenizer)
['<s> I', 'I am', 'am Sam', 'Sam </s>']
>>> [bigram for s in corpus for bigram in generate_ngrams(s, 2, tokenizer)]
['<s> I',
'I am',
'am Sam',
'Sam </s>',
'<s> Sam',
... ,
'and ham',
'ham </s>']
Language Model
● Train Bi-gram Language Model
SMT Components
from collections import Counter

counter_numer = Counter()   # bigram counts
counter_denom = Counter()   # history (first word) counts
prob = {}                   # bigram probabilities

# `bigrams` is the flat list built on the previous slide
for bigram in bigrams:
    # count('a b')
    counter_numer[bigram] += 1
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # count('a')
    counter_denom[w_denom] += 1

for bigram, count in counter_numer.items():
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # P('b'|'a') = count('a b') / count('a')
    prob[bigram] = counter_numer[bigram] / counter_denom[w_denom]
Language Model
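Tying the previous slides together, here is a minimal sketch (assuming the prob dict trained above and the tokenizer / generate_ngrams helpers from the earlier slides) that scores a whole sentence with the bi-gram model:

def sentence_prob(s: str) -> float:
    p = 1.0
    for bigram in generate_ngrams(s, 2, tokenizer):
        p *= prob.get(bigram, 0.0)   # unseen bigrams get probability 0 (no smoothing)
    return p

>>> sentence_prob('I am Sam')
0.1111111111111111   # = 2/3 x 2/3 x 1/2 x 1/2 = 1/9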
● How do we model P(f|e)?
● Given a parallel corpus of <e, f> sentence pairs:
○ e has le words, e = (e1 ... e_le)
○ f has lf words, f = (f1 ... f_lf)
● Introduce an alignment function a : j ⇒ i
○ Alignments a between e and f:
○ a = { a1, ..., a_lf }, aj ∈ { 0 ... le }
○ (le + 1)^lf possible alignments
○ f[j] is aligned to e[a[j]], which is e[i]
SMT Components Translation Model
● Example: le = 6, lf = 7
○ e = And(1) the(2) program(3) has(4) been(5) implemented(6)
○ f = Le(1) programme(2) a(3) ete(4) mis(5) en(6) application(7)
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
SMT Components Translation Model
● Modeling the generation of f given e: P(f,a|e,θ)
○ f, e : observable variables
○ a : hidden variables
○ θ : parameters
● Likelihood Maximization
○ p(f|e,θ) = Σ_a p(f,a|e,θ) → max_θ
○ but we don’t have labeled alignments
● EM-algorithm
○ E-step : estimates posterior probabilities for alignments
○ M-step : updates parameter θ of the model
SMT Components Translation Model
● IBM Model 1
○ For parameter θ, the length of the foreign sentence (lf) is used
○ Alignments are uniformly distributed by assumption
■ P(a | e, lf) = 1 / (le+1)^lf
● Translation Probability
○ estimate P(f | a, e, lf) = ∏_{j=1..lf} t(fj | e_a(j))
● Result:
○ P(f,a|e,lf) = P(a|e,lf) x P(f|a,e,lf) = 1/(le+1)^lf ∏_{j=1..lf} t(fj | e_a(j))
SMT Components Translation Model
● Example: le = 6, lf = 7
○ e = And(1) the(2) program(3) has(4) been(5) implemented(6)
○ f = Le(1) programme(2) a(3) ete(4) mis(5) en(6) application(7)
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
p(f|a,e) = t(Le|the) x t(programme|program) x t(a|has) x
t(ete|been) x t(mis|implemented) x t(en|implemented) x
t(application|implemented)
SMT Components Translation Model
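To make the formula concrete, here is a minimal sketch (not from the talk; the t(...) values are made-up placeholders, not trained parameters) that scores the alignment above under IBM Model 1:

e = ['NULL', 'And', 'the', 'program', 'has', 'been', 'implemented']   # e[0] is the NULL word
f = ['Le', 'programme', 'a', 'ete', 'mis', 'en', 'application']
a = [2, 3, 4, 5, 6, 6, 6]                      # a[j] = index into e for f[j]

t = {('Le', 'the'): 0.4, ('programme', 'program'): 0.6, ('a', 'has'): 0.3,
     ('ete', 'been'): 0.5, ('mis', 'implemented'): 0.2, ('en', 'implemented'): 0.1,
     ('application', 'implemented'): 0.2}      # hypothetical values

le, lf = len(e) - 1, len(f)
p_alignment = 1.0 / (le + 1) ** lf             # uniform alignment prior, 1 / (le+1)^lf
p_translation = 1.0
for j, fj in enumerate(f):
    p_translation *= t[(fj, e[a[j]])]          # t(f_j | e_a(j))
print(p_alignment * p_translation)             # P(f, a | e, lf)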
SMT Components
● EM Algorithm
○ For incomplete data
■ Observable sentence pairs, but no alignments within each sentence pair
○ Chicken and Egg Problem
■ if we had the alignments, we could estimate the translation parameters
■ if we had the parameters, we could estimate the alignments
Translation Model
SMT Components
● EM Algorithm (2)
○ initialize all alignments equally likely
○ model learns… (ex. “la” is often aligned to “the”)
Translation Model
● EM Algorithm (3)
○ after first iteration
○ some alignments... (ex. “la” is often aligned to “the”)
SMT Components Translation Model
● EM Algorithm (4)
○ after another iteration
○ alignments become apparent (ex. “fleur” and “flower” are more likely)
SMT Components Translation Model
● EM Algorithm (5)
○ convergence
○ hidden alignments revealed by EM
SMT Components Translation Model
● EM Algorithm (6)
○ parameter estimation from the corpus of aligned sentence pairs
SMT Components Translation Model
p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...
SMT Components
from collections import defaultdict

UNIFORM = 1.0        # any uniform initial value works: Model 1 re-normalizes in the first E-step
ITERATION_MAX = 10   # number of EM iterations

def IbmModelOneTrainEM(E, F):
    TM = {}          # translation model
    total_s = {}
    num_sentence = len(E)
    assert len(E) == len(F)

    # initialize with a uniform distribution
    for i in range(num_sentence):
        for e in E[i]:
            for f in F[i]:
                TM[(e, f)] = UNIFORM
Translation Model
SMT Components
    # EM iterations
    for _ in range(ITERATION_MAX):
        count = defaultdict(float)
        total = defaultdict(float)
        for i in range(num_sentence):
            # E-step: normalization term for each word e of the sentence pair
            for e in E[i]:
                total_s[e] = 0
                for f in F[i]:
                    total_s[e] += TM[(e, f)]
            # collect fractional counts
            for e in E[i]:
                for f in F[i]:
                    count[(e, f)] += TM[(e, f)] / total_s[e]
                    total[f] += TM[(e, f)] / total_s[e]
Translation Model
SMT Components
        # M-step: recalculate parameters
        for (e, f) in count.keys():
            TM[(e, f)] = count[(e, f)] / total[f]
    return TM
Translation Model
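A minimal usage sketch (not from the talk; a tiny toy corpus where E and F are lists of tokenized sentence pairs, as the function above expects):

E = [['the', 'house'], ['the', 'book'], ['a', 'book']]
F = [['la', 'maison'], ['le', 'livre'], ['un', 'livre']]
TM = IbmModelOneTrainEM(E, F)
# frequently co-occurring pairs such as ('book', 'livre') end up with the largest values
for pair, p in sorted(TM.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(p, 3))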
● Decoding is NP-complete
○ Given a Language Model and Translation Model, the decoder
constructs the possible translations and looks for the most probable
one.
● Traveling Salesman Problem
○ Words as vertices with translation probability
○ Edges with bigram probability as a weight
○ valid sentences are Hamiltonian paths
SMT Components Decoder
[Figure: example word graph with vertices "I", "am", "so", "sleepy", "."]
SMT Components
● Solutions
○ Dynamic Programming : reduces time complexity from exponential to polynomial by storing the results of subproblems and avoiding re-computation
○ Viterbi Algorithm : a DP algorithm that finds the best-scoring path through a finite set of states
○ Beam Search (Stack Decoding) : Keeps a list of the best N candidates
seen so far
Decoder
SMT Components
from collections import namedtuple

hypothesis = namedtuple('hypothesis', 'logprob, lm_state, predecessor, phrase')
initial_hypothesis = hypothesis(0.0, '<s>', None, None)
stacks = [{} for _ in f] + [{}]
stacks[0]['<s>'] = initial_hypothesis
for i, stack in enumerate(stacks[:-1]):
    for h in sorted(stack.values(), key=lambda h: -h.logprob)[:STACKSIZE]:  # prune
        for j in range(i + 1, len(f) + 1):
            if f[i:j] in tm:
                for phrase in tm[f[i:j]]:
                    logprob = h.logprob + phrase.logprob
                    lm_state = h.lm_state
                    for word in phrase.english.split():
                        (lm_state, word_logprob) = lm.score(lm_state, word)
                        logprob += word_logprob
                    logprob += lm.end(lm_state) if j == len(f) else 0.0
                    new_hypothesis = hypothesis(logprob, lm_state, h, phrase)
                    if lm_state not in stacks[j] or stacks[j][lm_state].logprob < logprob:
                        # second case is recombination
                        stacks[j][lm_state] = new_hypothesis
winner = max(stacks[-1].values(), key=lambda h: h.logprob)
Decoder
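The winning hypothesis only stores its last phrase, so the full translation is read off by walking the predecessor links; a minimal sketch (assuming the hypothesis namedtuple above):

def extract_english(h):
    # walk back to the initial hypothesis and concatenate the phrases in order
    if h.predecessor is None:
        return ''
    return extract_english(h.predecessor) + h.phrase.english + ' '

print(extract_english(winner).strip())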
Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web (demo)
Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Grapheme (자소)
■ A featural alphabet of 24 basic consonant (자음) and vowel (모음) letters, plus doubled and compound forms
자음 ㄱ ㄲ ㅋ ㄷ ㄸ ㅌ ㅂ ㅃ ㅍ ㅈ ㅉ ㅊ ㅅ ㅆ ㅎ ㄴ ㅁ ㅇ ㄹ
roman g, k kk k d, t tt t b, p pp p j jj ch s ss h n m ng r, l
모음 ㅏ ㅓ ㅗ ㅜ ㅡ ㅣ ㅐ ㅔ ㅚ ㅟ ㅑ ㅕ ㅛ ㅠ ㅒ ㅖ ㅘ ㅙ ㅝ ㅞ ㅢ
roman a ae o u eu i ae e oe wi ya yeo yo yu yae ye wa wae wo we ui
About Hangeul
Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Syllables (음절)
■ Letters are grouped into blocks each of which transcribes a syllable
■ ㅎ(h) + ㅏ(a) + ㄴ(n) ⇒ 한 (han)
■ ㄱ(g) + ㅡ(eu) + ㄹ(l) ⇒ 글 (geul)
About Hangeul
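As a side note, the grouping of jamo into syllable blocks is simple arithmetic in Unicode; a minimal sketch (not part of the talk's pipeline) that composes the two examples above:

# code point = 0xAC00 + (initial_index * 21 + medial_index) * 28 + final_index
CHOSEONG = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'                      # 19 initial consonants
JUNGSEONG = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'                 # 21 medial vowels
JONGSEONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 28 finals (incl. none)

def compose(cho: str, jung: str, jong: str = '') -> str:
    return chr(0xAC00 + (CHOSEONG.index(cho) * 21 + JUNGSEONG.index(jung)) * 28
               + JONGSEONG.index(jong))

>>> compose('ㅎ', 'ㅏ', 'ㄴ') + compose('ㄱ', 'ㅡ', 'ㄹ')
'한글'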
Ro-Ko Transliteration
● Romanization
○ Romanization is the transcription of the sounds of a foreign language into Roman letters as phonemes. (ex. 한글 ⇒ hangeul)
● Back-romanization
○ Translating a string of roman characters into the best phonetically matching words in a target language. (ex. hangeul ⇒ 한글)
○ Applications
■ Useful as a keyboard input system without installing the desired language's input method. (ex. Google Input Tools)
■ Search engine (ex. loanword)
■ Address transliteration
○ Korean Back-romanization = Ro-Ko Transliteration
Problem
Ro-Ko Transliteration
● The Goal
○ Finding the best phonetic matching Hangeul string from a Roman
string
● Two Challenges
○ more than one possible way
■ Romanization : ‘한글’ → ‘hangul’ | ‘hangeul’ | ‘hanguel’
■ Back-romanization : ‘hangul’ | ‘hangeul’ | ‘hanguel’→ ‘한글’
■ finding the most probable matching word
○ Segmental Alignment
■ ‘kanmagi’ → ‘칸(kan)막(mag)이(i)’ vs ‘칸(kan)마(ma)기(gi)’
■ finding the most probable segment in roman string
○ This project focuses on bi-text segmental alignment problem.
Problem
Ro-Ko Transliteration
● Phoneme based alignment
○ Monotonic many-to-one segmental alignment.
■ Mapping one or more roman phonemes(음소) to one hangeul grapheme(자소)
○ “hangeul 한글” (includes 2-to-1 case)
■ [ ㅎㅏㄴㄱㅡㄹ, h a n g e u l ]→[ ㅎ/h, ㅏ/a, ㄴ/n, ㄱ/g, ㅡ/eu, ㄹ/l ]
○ “kanmagi 칸막이” (only 1-to-1 cases)
■ [ ㅋㅏㄴㅁㅏㄱㅇㅣ, k a n m a g i ]→[ ㅋ/k, ㅏ/a, ㄴ/n, ㅁ/m, ㅏ/a, ㄱ/g, ㅣ/i ]
Approaches
Ro-Ko Transliteration
● Phoneme based alignment (Drawback)
○ disassembled sequence (ㅋㅏㄴㅁㅏㄱㅇㅣ) of Hangeul syllables
(칸막이) needs to be assembled again by another segmentation
algorithm (or some heuristics)
○ it may produce wrong segmentation
○ 칸막이 → disassemble : ㅋㅏㄴㅁㅏㄱㅇㅣ→ reassemble : 칸마기
○ Though ‘칸막이’ and ‘칸마기’ have very similar pronunciations, ‘칸마기’ is incorrect and ‘칸막이’ should be the answer
Approaches
Ro-Ko Transliteration
● Syllable based Alignment (Translation Model)
○ Align monotonically by syllables
○ preserve syllables with no disassembling, and run many-to-one
(roman phonemes to a hangeul syllable) segmental alignment.
○ “한글 hangeul”
■ [ 한 글, h a n g e u l ]→[ 한/han, 글/geul ]
■ [ 칸 막 이, k a n m a g i ]→[ 칸/kan, 막/mag, 이/i ]
[Figure: Training Data (한글 hangeul, 한자 hanja, 글자 geulja) → unsupervised segmental alignment → Translation Model (한 글 han geul, 한 자 han ja, 글 자 geul ja) → P(Ro|Ko)]
Approaches
Training Data → Language Model
Ro-Ko Transliteration
● Syllable based Language Model
○ Train a syllable-based bi-gram language model on all Korean words in the training data (see the sketch after this slide)
[Figure: Korean words (한글, 한자, 글자) → Bigrams (p(<s>한), p(한글), p(글</s>); p(<s>한), p(한자), p(자</s>); p(<s>글), p(글자), p(자</s>)) → P(Ko)]
Approaches
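A minimal sketch of this step (reusing generate_ngrams from the Language Model slides; the word list is just the toy example from the figure):

korean_words = ['한글', '한자', '글자']

def syllable_tokenizer(w: str) -> List[str]:
    return ['<s>'] + list(w) + ['</s>']

>>> [generate_ngrams(w, 2, syllable_tokenizer) for w in korean_words]
[['<s> 한', '한 글', '글 </s>'],
 ['<s> 한', '한 자', '자 </s>'],
 ['<s> 글', '글 자', '자 </s>']]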
Ro-Ko Transliteration
● Decode a new input Roman sequence
○ T(Ro) = Ko^ = argmax_Ko P(Ko) P(Ro|Ko)
[Figure: decode new roman sequences into the best matching output: singgeulbeonggeul → 싱글벙글, hangeulja → 한글자, hangeulnal → 한글날]
Approaches
Ro-Ko Transliteration
from typing import Sequence, Tuple

import requests
from bs4 import BeautifulSoup

def parse_page(
        baseurl: str,
        page: int) -> Sequence[Tuple[str, str, str]]:
    response = requests.get(baseurl + str(page))
    page = BeautifulSoup(response.text, 'html.parser')
    songs = page.find_all('div', {'class': KPOP_SONG})
    records = []
    for s in songs:
        attrs = s.find('a', {'class': TITLE}).attrs
        artist, title = attrs['title'].split('-', 1)
        songurl = attrs['href']
        records.append((artist, title, songurl))
    return records
>>> parse_page(BASEURL, 216)
[('A.C.E', 'Take Me Higher', '/a-c-e-take-me-higher/'),
('ONF', 'Complete (널 만난 순간)',
'/onf-complete-neol-mannan-sungan/'),
...
]
Training Data
Ro-Ko Transliteration Training Data
Ro-Ko Transliteration
def get_song_text(song_url: str) -> str:
    response = requests.get(song_url)
    song = BeautifulSoup(response.text, 'html.parser')
    text = song.h2.parent.text
    return text
>>> print( get_song_text('/bts-boy-with-luv-jageun-geosdeureul-wihan-si/') )
방탄소년단 (Bangtan Boys) BTS – 작은 것들을 위한 시 (Feat. Halsey) Boy with Luv Lyrics
Genre : R&B/Soul
Release Date : 2019-04-12
Language : Korean
BTS – Boy with Luv Hangul
모든 게 궁금해
How’s your day
Oh tell me
뭐가 널 행복하게 하는지
...
BTS – Boy with Luv Romanization
modeun ge gunggeumhae
How’s your day
...
Training Data
● 12,095 songs
● 1,586,305 lines
● 1,900,775 bi-word pairs (“모르는 moreuneun”)
● 121,469 unique bi-word pairs
Ro-Ko Transliteration Training Data
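A minimal sketch of how such bi-word pairs could be extracted (an assumption, not the talk's exact code; hangul_lines and roman_lines stand for the parallel Hangul / Romanization line lists scraped above):

def biword_pairs(hangul_lines, roman_lines):
    pairs = []
    for ko, ro in zip(hangul_lines, roman_lines):
        ko_words, ro_words = ko.split(), ro.split()
        if len(ko_words) == len(ro_words):      # keep only cleanly aligned lines
            pairs.extend(zip(ko_words, ro_words))
    return pairs                                # e.g. ('모르는', 'moreuneun')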
Ro-Ko Transliteration
● http://koreanizer.herokuapp.com
Koreanizer
[Screenshot: key in a roman phoneme sequence]
Contacts
lee.hongjoo@yandex.com
https://www.linkedin.com/in/hongjoo-lee/
Some examples and figures are cited or adapted from lecture materials and tutorials, mostly those of Michael Collins, Kevin Knight, and Philipp Koehn.