Koreanizer is a Roman-to-Korean transliterator (back-romanizer) built on Statistical Machine Translation techniques: an n-gram language model, an IBM alignment model as the translation model, and a decoding algorithm.
These slides introduce Koreanizer and the techniques applied in the system, presented at a session of PyCon KR '19.
5. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
6. MT Basics
● Three MT Levels : Direct, Transfer, Interlingual
[Figure: Machine Translation Pyramid (Bernard Vauquois' pyramid) — source text is analyzed up through source syntax and source semantics to an interlingua, then generated down through target semantics and target syntax to target text; the Transfer and Direct paths cut across at lower levels]
7. MT Basics
● Direct Translation
○ Single phase : translate word by word with some reorderings
○ lack of analysis
■ long-range reordering
● “Sources said that IBM bought Lotus yesterday”
● “소식통(은) 어제 IBM(이) Lotus(를) 샀다(고) 말했다”
■ syntactic role ambiguity
● “They said that I like ice-cream”
● “They like that ice-cream”
[Figure: Vauquois pyramid with the Direct path (source text → target text) highlighted]
8. MT Basics
● Interlingual Translation
○ Two phases
■ Analysis : Analyze the source language into a semantic representation
■ Generation : Convert the representation into the target language
[Figure: Vauquois pyramid with the Interlingua path highlighted — analysis from source text up to semantics and the interlingua, then generation down to target text]
9. MT Basics
● Transfer based Translation
○ Three phases
■ Analysis : Analyze the source language’s structure
■ Transfer : Convert the source structure into a target structure
■ Generation : Convert the target structure into the target language
[Figure: Vauquois pyramid with the Transfer path highlighted — analysis, transfer between source and target structures, generation]
10. MT Basics
● Transfer based Translation
○ Levels of Transfer : Words, Phrases, Syntax
[Figure: Vauquois pyramid showing the levels of transfer — words, phrases, syntax — mirrored on the source and target sides]
11. MT Basics
● Foundational ideas
○ 18th c. - Bayes’ theorem
○ 1948 - Noisy-channel coding theorem
○ 1949 - Warren Weaver’s memo
● Statistical Machine Translation (SMT):
○ 1988 - Word-based models
○ 2003 - Phrase-based models
○ 2006 - Google Translate
● Neural Machine Translation (NMT):
○ 2013 - First papers on pure NMT
○ 2015 - NMT enters shared tasks (WMT, IWSLT)
○ 2016 - In production
History
12. MT Basics
● SMT as Noisy Channel
○ Said in English, received in Spanish
The Noisy Channel Model
Good Morning! ¡Buenos días!
13. MT Basics
● SMT as Noisy Channel
○ By convention, we use E for the source of the channel (the message we want to recover), F for the foreign language we observe.
The Noisy Channel Model
e: Good Morning! f: ¡Buenos días!
Language E
(e ∈ E)
Language F
(f ∈ F)
translation
14. MT Basics
● SMT as Noisy Channel
○ P(f|e)
The Noisy Channel Model
e: Good Morning! f: ¡Buenos días!
Language E
(e ∈ E)
Language F
(f ∈ F)
translation
P(f|e)
T(f) = ê = argmax_e P(e|f) = argmax_e P(f|e)P(e)
16. MT Basics
● P(e) - Language Model
○ models the fluency of the translation
○ data : corpus in the target language E
● P(f|e) - Translation Model
○ models the adequacy of the translation
○ data : parallel corpus of F and E pairs
● argmax_e - Decoder
○ given the LM, the TM, and f, generate the most fluent and adequate translation result ê
SMT Systems
17. MT Basics SMT Systems
[Figure: SMT system pipeline — statistical analysis of a Spanish/English parallel corpus trains the TM, statistical analysis of an English corpus trains the LM; decoding maps Spanish through “broken English” (TM) into fluent English (LM)]
18. MT Basics
● candidates based on Translation Model alone
○ Que hambre tengo yo
What hunger have I p(s|e) = 0.000014
Hungry I am so p(s|e) = 0.000001
I am so hungry p(s|e) = 0.0000015
Have I that hunger p(s|e) = 0.000020
...
SMT Systems
19. MT Basics
● with Language Model
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
20. MT Basics
● by Decoding
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
argmax_e p(s|e)p(e) = I am so hungry
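A minimal sketch of this noisy-channel scoring in Python, assuming toy dictionaries that hold the (illustrative) probabilities from the slides above:

# hedged sketch: pick argmax_e p(s|e) * p(e) over the candidate translations
tm_prob = {'What hunger have I': 0.000014,
           'Hungry I am so': 0.000001,
           'I am so hungry': 0.0000015,
           'Have I that hunger': 0.000020}
lm_prob = {'What hunger have I': 0.000001,
           'Hungry I am so': 0.0000014,
           'I am so hungry': 0.0001,
           'Have I that hunger': 0.00000098}

best = max(tm_prob, key=lambda e: tm_prob[e] * lm_prob[e])
print(best)  # 'I am so hungry'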
21. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
22. SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the next word?
○ “I am _____”
■ Sam, eggs, ham, not, ...
○ P(Sam | I am) = ?
Language Model
23. SMT Components
I am Sam
Sam I am
I do not like green eggs and ham
Language Model
25. SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = ?
Language Model
26. SMT Components
● Predict probability of a sequence of words:
○ w = (w1 w2 w3 ... wk)
○ A model computes P(wk|w1 w2 w3 ... wk-1) or P(w)
● Application
○ speech recognition : P(I saw a van) >> P(eyes awe of an)
○ spelling correction : P(about fifteen minutes from) > P(about fifteen
minuets from)
○ machine translation : P(high winds tonite) > P(large winds tonite)
○ handwriting recognition
○ suggestions (search keyword, messaging,...)
Language Model
27. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
Language Model
28. SMT Components
● Chain rule
○ two random variables : P(A, B) = P(A) P(B|A)
○ more than two variables : P(w1 w2 ... wn) = P(w1) P(w2|w1) ... P(wn|w1 ... wn-1)
○ example, n = 4 variables : P(w1 w2 w3 w4) = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3)
Language Model
29. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
Language Model
30. SMT Components
● Markov assumption
○ Conditional probability distribution of future states depends only on
the present state, not on the sequence of events that preceded it
(wikipedia)
○ Bi-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | green )
○ Tri-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | like green )
Language Model
31. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
● Markov assumption:
○ P(wi|w1 ... wi-1) ≈ P(wi|wi-n+1 ... wi-1)
○ ex) n = 2 ⇒ P(wi|w1 ... wi-1) ≈ P(wi|wi-1)
Language Model
32. SMT Components
● For Bi-gram Language Model (n=2):
○ P(w) = P(w1) P(w2|w1) ... P(wk|wk-1)
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = P(I) x P(am | I) x P(Sam | am)
= 3/14 x 2/3 x 1 = 1/7
Language Model
33. SMT Components
● Bi-gram Language Model (Normalized):
○ P(w) = P(w1|<s>) P(w2|w1) ... P(wk|wk-1) P(</s>|wk)
● Given a corpus
○ “<s> I am Sam </s>”
○ “<s> Sam I am </s>”
○ “<s> I do not like green eggs and ham </s>”
● What’s the probability of the whole sequence?
○ P(<s> I am Sam </s>) = P(I|<s>) x P(am|I) x P(Sam|am) x P(</s>|Sam)
= 2/3 x 2/3 x 1/2 x 1/2 = 1/9
Language Model
34. ● generate Bi-gram
SMT Components
from typing import Callable, List

def generate_ngrams(
        s: str,
        n: int,
        tokenize: Callable[[str], List[str]] = lambda x: x.split()) -> List[str]:
    # 'a b c' => ['a', 'b', 'c']
    tokens = tokenize(s)
    # zip(['a', 'b', 'c'], ['b', 'c']) => [('a', 'b'), ('b', 'c')]
    ngrams = zip(*[tokens[i:] for i in range(n)])
    # [('a', 'b'), ('b', 'c')] => ['a b', 'b c']
    return [' '.join(ngram) for ngram in ngrams]
Language Model
35. ● generate Bi-gram
SMT Components
corpus = ['I am Sam',
          'Sam I am',
          'I do not like green eggs and ham']

def tokenizer(s: str) -> List[str]:
    return ['<s>'] + s.split() + ['</s>']
Language Model
36. SMT Components
>>> corpus[0]
'I am Sam'
>>> generate_ngrams(corpus[0], 2, tokenizer)
['<s> I', 'I am', 'am Sam', 'Sam </s>']
>>> [bigram for s in corpus for bigram in generate_ngrams(s, 2, tokenizer)]
['<s> I',
'I am',
'am Sam',
'Sam </s>',
'<s> Sam',
... ,
'and ham',
'ham </s>']
Language Model
37. ● Train Bi-gram Language Model
SMT Components
from collections import Counter

counter_numer = Counter()   # bigram counts
counter_denom = Counter()   # history (first word) counts
prob = {}

for bigram in bigrams:
    # count('a b')
    counter_numer[bigram] += 1
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # count('a')
    counter_denom[w_denom] += 1

for bigram, count in counter_numer.items():
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # P('b'|'a') = count('a b') / count('a')
    prob[bigram] = count / counter_denom[w_denom]
Language Model
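Putting the pieces together, a minimal sketch that scores a whole sentence with the trained model (it reuses generate_ngrams, tokenizer, and the prob dict built above):

bigrams = [bg for s in corpus for bg in generate_ngrams(s, 2, tokenizer)]
# ... run the training loop above over `bigrams` to fill `prob` ...

def sentence_prob(s: str) -> float:
    p = 1.0
    for bigram in generate_ngrams(s, 2, tokenizer):
        p *= prob.get(bigram, 0.0)  # unseen bigram => probability 0 (no smoothing)
    return p

>>> sentence_prob('I am Sam')  # P(I|<s>) x P(am|I) x P(Sam|am) x P(</s>|Sam)
0.1111...                      # = 2/3 x 2/3 x 1/2 x 1/2 = 1/9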
38. ● How do we model P(f|e)?
● Given a parallel corpus of <e, f> sentence pairs:
○ e has le words, e = (e1 ... e_le)
○ f has lf words, f = (f1 ... f_lf)
● Introduce an alignment function a : j ⇒ i
○ Alignments a between e and f:
○ a = { a1, ..., a_lf }, aj ∈ { 0 ... le }
○ (le + 1)^lf possible alignments
○ f[j] is aligned to e[a[j]], which is e[i]
SMT Components Translation Model
39. ● Example: le = 6, lf = 7
○ e = And₁ the₂ program₃ has₄ been₅ implemented₆
○ f = Le₁ programme₂ a₃ ete₄ mis₅ en₆ application₇
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
SMT Components Translation Model
40. ● Modeling the generation of f given e: P(f,a|e,θ)
○ f, e : observable variables
○ a : hidden variables
○ θ : parameters
● Likelihood Maximization
○ P(f|e, θ) = Σ_a P(f, a|e, θ) → max_θ
○ but we don’t have labeled alignments
● EM-algorithm
○ E-step : estimates posterior probabilities for alignments
○ M-step : updates parameter θ of the model
SMT Components Translation Model
41. ● IBM Model 1
○ For parameter θ, the length of the foreign sentence (lf) is used
○ Alignments are uniformly distributed by assumption
■ P(a|e, lf) = 1 / (le + 1)^lf
● Translation Probability
○ estimate : P(f|a, e, lf) = ∏_{j=1..lf} t(fj|e_a(j))
● Result:
○ P(f, a|e, lf) = P(a|e, lf) × P(f|a, e, lf) = 1/(le + 1)^lf × ∏_{j=1..lf} t(fj|e_a(j))
SMT Components Translation Model
42. ● Example: le = 6, lf = 7
○ e = And₁ the₂ program₃ has₄ been₅ implemented₆
○ f = Le₁ programme₂ a₃ ete₄ mis₅ en₆ application₇
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
P(f|a, e) = t(Le|the) × t(programme|program) × t(a|has) ×
            t(ete|been) × t(mis|implemented) × t(en|implemented) ×
            t(application|implemented)
SMT Components Translation Model
43. SMT Components
● EM Algorithm
○ For incomplete data
■ Sentence pairs are observable, but the alignments within each pair are not
○ Chicken and Egg Problem
■ if we had the alignments, we could estimate the translation parameters
■ if we had the parameters, we could estimate the alignments
Translation Model
44. SMT Components
● EM Algorithm (2)
○ initialize all alignments equally likely
○ model learns… (ex. “la” is often aligned to “the”)
Translation Model
45. ● EM Algorithm (3)
○ after first iteration
○ some alignments... (ex. “la” is often aligned to “the”)
SMT Components Translation Model
46. ● EM Algorithm (4)
○ after another iteration
○ alignments become apparent (ex. “fleur” aligned to “flower” becomes more likely)
SMT Components Translation Model
47. ● EM Algorithm (5)
○ convergence
○ hidden alignments revealed by EM
SMT Components Translation Model
48. ● EM Algorithm (6)
○ parameter estimation from the corpus of aligned sentence pairs
SMT Components Translation Model
p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...
49. SMT Components
from collections import defaultdict

UNIFORM = 0.25        # any positive constant; EM renormalizes it
ITERATION_MAX = 10

def IbmModelOneTrainEM(E, F):
    TM = {}           # translation model t(e|f)
    total_s = {}
    num_sentence = len(E)
    assert len(E) == len(F)

    # initialize: assume a uniform distribution
    for i in range(num_sentence):
        for e in E[i]:
            for f in F[i]:
                TM[(e, f)] = UNIFORM
Translation Model
50. SMT Components
    # EM iteration
    for iteration in range(ITERATION_MAX):
        count = defaultdict(float)   # expected counts for (e, f) pairs
        total = defaultdict(float)   # expected counts for each f
        for i in range(num_sentence):
            # E-step: normalize over all f linked to each e
            for e in E[i]:
                total_s[e] = 0.0
                for f in F[i]:
                    total_s[e] += TM[(e, f)]
            # collect expected counts
            for e in E[i]:
                for f in F[i]:
                    count[(e, f)] += TM[(e, f)] / total_s[e]
                    total[f] += TM[(e, f)] / total_s[e]
51. SMT Components
        # M-step: recalculate parameters from the expected counts
        for x in count.keys():
            f = x[1]
            TM[x] = count[x] / total[f]   # total is a defaultdict, so no KeyError
    return TM
Translation Model
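A quick usage sketch on a toy parallel corpus (the word lists below are illustrative, not from the slides):

E = [['the', 'house'], ['the', 'book'], ['a', 'book']]
F = [['la', 'maison'], ['le', 'livre'], ['un', 'livre']]

TM = IbmModelOneTrainEM(E, F)
# after a few EM iterations, pairs like ('book', 'livre') accumulate
# most of the probability mass for their foreign word
for pair, p in sorted(TM.items(), key=lambda kv: -kv[1])[:4]:
    print(pair, round(p, 3))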
52. ● Decoding is NP-complete
○ Given a Language Model and Translation Model, the decoder
constructs the possible translations and looks for the most probable
one.
● Traveling Salesman Problem
○ Words as vertices with translation probability
○ Edges with bigram probability as a weight
○ valid sentences are Hamiltonian paths
SMT Components Decoder
[Figure: word graph over “I”, “am”, “so”, “sleepy”, “.” — a valid sentence corresponds to a Hamiltonian path]
53. SMT Components
● Solutions
○ Dynamic Programming : reduces time complexity from exponential to polynomial by storing the results of subproblems and avoiding their re-computation
○ Viterbi Algorithm : a DP algorithm to find the shortest path through a finite number of states
○ Beam Search (Stack Decoding) : keeps a list of the best N candidates seen so far
Decoder
54. SMT Components
from collections import namedtuple

hypothesis = namedtuple('hypothesis', 'logprob, lm_state, predecessor, phrase')
initial_hypothesis = hypothesis(0.0, '<s>', None, None)
stacks = [{} for _ in f] + [{}]
stacks[0]['<s>'] = initial_hypothesis
for i, stack in enumerate(stacks[:-1]):
    for h in sorted(stack.values(), key=lambda h: -h.logprob)[:STACKSIZE]:  # prune
        for j in range(i + 1, len(f) + 1):
            if f[i:j] in tm:
                for phrase in tm[f[i:j]]:
                    logprob = h.logprob + phrase.logprob
                    lm_state = h.lm_state
                    for word in phrase.english.split():
                        (lm_state, word_logprob) = lm.score(lm_state, word)
                        logprob += word_logprob
                    logprob += lm.end(lm_state) if j == len(f) else 0.0
                    new_hypothesis = hypothesis(logprob, lm_state, h, phrase)
                    if lm_state not in stacks[j] or stacks[j][lm_state].logprob < logprob:
                        # second case is recombination
                        stacks[j][lm_state] = new_hypothesis
winner = max(stacks[-1].values(), key=lambda h: h.logprob)
Decoder
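The snippet assumes its inputs are already built: f is the foreign token sequence, tm is a phrase table mapping source spans to candidate phrases (each carrying english and logprob fields), lm is a language model exposing score and end, and STACKSIZE is the beam width. Hypotheses that reach the same LM state are recombined, keeping only the higher-scoring one.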
55. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web (demo)
56. Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Grapheme (자소)
■ A featural alphabet of 24 basic consonant (자음) and vowel (모음) letters; the table below also lists the doubled and compound forms
자음 : ㄱ ㄲ ㅋ ㄷ ㄸ ㅌ ㅂ ㅃ ㅍ ㅈ ㅉ ㅊ ㅅ ㅆ ㅎ ㄴ ㅁ ㅇ ㄹ
roman : g/k kk k d/t tt t b/p pp p j jj ch s ss h n m ng r/l
모음 : ㅏ ㅓ ㅗ ㅜ ㅡ ㅣ ㅐ ㅔ ㅚ ㅟ ㅑ ㅕ ㅛ ㅠ ㅒ ㅖ ㅘ ㅙ ㅝ ㅞ ㅢ
roman : a eo o u eu i ae e oe wi ya yeo yo yu yae ye wa wae wo we ui
About Hangeul
57. Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Syllables (음절)
■ Letters are grouped into blocks each of which transcribes a syllable
■ ㅎ(h) + ㅏ(a) + ㄴ(n) ⇒ 한 (han)
■ ㄱ(g) + ㅡ(eu) + ㄹ(l) ⇒ 글 (geul)
About Hangeul
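This block structure maps directly onto Unicode. A minimal sketch (assuming the standard Hangul Syllables composition formula with base U+AC00) of assembling letters into a syllable block:

# standard Unicode jamo orderings: 19 leads, 21 vowels, 27 tails plus 'no tail'
CHOSEONG = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')
JUNGSEONG = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')
JONGSEONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')

def assemble(lead: str, vowel: str, tail: str = '') -> str:
    # syllable = U+AC00 + (lead_index * 21 + vowel_index) * 28 + tail_index
    l, v, t = CHOSEONG.index(lead), JUNGSEONG.index(vowel), JONGSEONG.index(tail)
    return chr(0xAC00 + (l * 21 + v) * 28 + t)

>>> assemble('ㅎ', 'ㅏ', 'ㄴ') + assemble('ㄱ', 'ㅡ', 'ㄹ')
'한글'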
58. Ro-Ko Transliteration
● Romanization
○ Romanization is the transcription of the sounds of a foreign language into Roman (Latin) letters, phoneme by phoneme. (ex. 한글 ⇒ hangeul)
● Back-romanization
○ Translating a string of Roman characters into the best phonetically matching words of a target language. (ex. hangeul ⇒ 한글)
○ Applications
■ Keyboard input without installing the desired language’s input method (ex. Google Input Tools)
■ Search engines (ex. loanwords)
■ Address transliteration
○ Korean back-romanization = Ro-Ko transliteration
Problem
59. Ro-Ko Transliteration
● The Goal
○ Finding the best phonetic matching Hangeul string from a Roman
string
● Two Challenges
○ more than one possible way
■ Romanization : ‘한글’ → ‘hangul’ | ‘hangeul’ | ‘hanguel’
■ Back-romanization : ‘hangul’ | ‘hangeul’ | ‘hanguel’ → ‘한글’
■ finding the most probable matching word
○ Segmental Alignment
■ ‘kanmagi’ → ‘칸(kan)막(mag)이(i)’ vs ‘칸(kan)마(ma)기(gi)’
■ finding the most probable segment in roman string
○ This project focuses on the bi-text segmental alignment problem.
Problem
60. Ro-Ko Transliteration
● Phoneme based alignment
○ Monotonic many-to-one segmental alignment.
■ Mapping one or more roman phonemes (음소) to one hangeul grapheme (자소)
○ “hangeul 한글” (includes 2-to-1 case)
■ [ ㅎㅏㄴㄱㅡㄹ, h a n g e u l ] → [ ㅎ/h, ㅏ/a, ㄴ/n, ㄱ/g, ㅡ/eu, ㄹ/l ]
○ “kanmagi 칸막이” (only 1-to-1 cases)
■ [ ㅋㅏㄴㅁㅏㄱㅇㅣ, k a n m a g i ] → [ ㅋ/k, ㅏ/a, ㄴ/n, ㅁ/m, ㅏ/a, ㄱ/g, ㅣ/i ]
Approaches
61. Ro-Ko Transliteration
● Phoneme based alignment (Drawback)
○ the disassembled sequence (ㅋㅏㄴㅁㅏㄱㅇㅣ) of Hangeul syllables (칸막이) needs to be reassembled by another segmentation algorithm (or some heuristics)
○ this may produce a wrong segmentation
○ 칸막이 → disassemble : ㅋㅏㄴㅁㅏㄱㅇㅣ → reassemble : 칸마기
○ though ‘칸막이’ and ‘칸마기’ have very similar pronunciations,
○ ‘칸마기’ is incorrect and ‘칸막이’ should be the answer
Approaches
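A sketch of that disassembly step, inverting the Unicode composition formula shown earlier (it reuses the CHOSEONG/JUNGSEONG/JONGSEONG tables):

from typing import List

def disassemble(text: str) -> List[str]:
    # invert: syllable = U+AC00 + (lead * 21 + vowel) * 28 + tail
    jamos = []
    for ch in text:
        code = ord(ch) - 0xAC00
        l, v, t = code // (21 * 28), (code % (21 * 28)) // 28, code % 28
        jamos += [CHOSEONG[l], JUNGSEONG[v]] + ([JONGSEONG[t]] if t else [])
    return jamos

>>> ''.join(disassemble('칸막이'))
'ㅋㅏㄴㅁㅏㄱㅇㅣ'
# reassembly is ambiguous: ㄱ may close 막 with silent ㅇ leading 이 (칸막이),
# or serve as the lead of 기 (칸마기) -- exactly the failure described above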
62. Ro-Ko Transliteration
● Syllable based Alignment (Translation Model)
○ Align monotonically by syllables
○ preserve syllables with no disassembling, and run many-to-one
(roman phonemes to a hangeul syllable) segmental alignment.
○ “hangeul 한글”
■ [ 한 글, h a n g e u l ] → [ 한/han, 글/geul ]
○ “kanmagi 칸막이”
■ [ 칸 막 이, k a n m a g i ] → [ 칸/kan, 막/mag, 이/i ]
Training Data → unsupervised segmental alignment → Translation Model P(Ro|Ko)
한글 hangeul → 한/han 글/geul
한자 hanja → 한/han 자/ja
글자 geulja → 글/geul 자/ja
Approaches
63. Ro-Ko Transliteration
● Syllable based Language Model
○ Train a syllable-based bi-gram language model on all Korean words in the training data
Training Data → Bigrams → Language Model P(Ko)
한글 : P(한|<s>), P(글|한), P(</s>|글)
한자 : P(한|<s>), P(자|한), P(</s>|자)
글자 : P(글|<s>), P(자|글), P(</s>|자)
Approaches
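Reusing generate_ngrams from the Language Model section with a syllable-level tokenizer (the helper name below is mine, not from the slides):

def syllable_tokenizer(word: str) -> List[str]:
    # one token per Hangeul syllable block
    return ['<s>'] + list(word) + ['</s>']

>>> generate_ngrams('한글', 2, syllable_tokenizer)
['<s> 한', '한 글', '글 </s>']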
64. Ro-Ko Transliteration
● Decode a new input Roman sequence
○ T(Ro) = K̂o = argmax_Ko P(Ko) P(Ro|Ko)
decode : new roman sequence → best matching output
singgeulbeonggeul → 싱글벙글
hangeulja → 한글자
hangeulnal → 한글날
Approaches
65. Ro-Ko Transliteration
from typing import Sequence, Tuple
import requests
from bs4 import BeautifulSoup

def parse_page(
        baseurl: str,
        page: int) -> Sequence[Tuple[str, str, str]]:
    # KPOP_SONG and TITLE are site-specific CSS class constants defined elsewhere
    response = requests.get(baseurl + str(page))
    soup = BeautifulSoup(response.text, 'html.parser')
    songs = soup.find_all('div', {'class': KPOP_SONG})
    records = []
    for s in songs:
        attrs = s.find('a', {'class': TITLE}).attrs
        artist, title = attrs['title'].split('-', 1)
        songurl = attrs['href']
        records.append((artist, title, songurl))
    return records
>>> parse_page(BASEURL, 216)
[('A.C.E', 'Take Me Higher', '/a-c-e-take-me-higher/'),
('ONF', 'Complete (널 만난 순간)',
'/onf-complete-neol-mannan-sungan/'),
...
]
Training Data
67. Ro-Ko Transliteration
def get_song_text(song_url: str) -> str:
    # hrefs returned by parse_page are site-relative, so prepend BASEURL
    response = requests.get(BASEURL + song_url)
    song = BeautifulSoup(response.text, 'html.parser')
    text = song.h2.parent.text
    return text

>>> print( get_song_text('/bts-boy-with-luv-jageun-geosdeureul-wihan-si/') )
방탄소년단 (Bangtan Boys) BTS – 작은 것들을 위한 시 (Feat. Halsey) Boy with Luv Lyrics
Genre : R&B/Soul
Release Date : 2019-04-12
Language : Korean
BTS – Boy with Luv Hangul
모든 게 궁금해
How’s your day
Oh tell me
뭐가 널 행복하게 하는지
...
BTS – Boy with Luv Romanization
modeun ge gunggeumhae
How’s your day
...
Training Data
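One plausible next step, sketched here under assumptions (this is not the project's actual code): split the page text on the 'Hangul' and 'Romanization' section headers shown above and pair the lyric lines into <Ko, Ro> training examples.

from typing import List, Tuple

def extract_pairs(text: str) -> List[Tuple[str, str]]:
    # sections appear as '... Hangul <lyrics> ... Romanization <lyrics>'
    hangul_part = text.split('Hangul', 1)[1].split('Romanization', 1)[0]
    roman_part = text.split('Romanization', 1)[1]
    hangul_lines = [l.strip() for l in hangul_part.splitlines() if l.strip()]
    roman_lines = [l.strip() for l in roman_part.splitlines() if l.strip()]
    # lyric lines correspond one-to-one when both sections are complete
    return list(zip(hangul_lines, roman_lines))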