Koreanizer is a Roman-to-Korean transliterator (back-romanizer) built on Statistical Machine Translation techniques: an n-gram language model, an IBM alignment model as the translation model, and a decoding algorithm.
These slides introduce Koreanizer and the techniques applied in the system, presented at a session of PyCon KR '19.
5. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
6. MT Basics
● Three MT Levels : Direct, Transfer, Interlingual
[Figure: Machine Translation Pyramid (Bernard Vauquois' pyramid) — source text is analyzed up through source syntax and source semantics to an interlingua, then generated down through target semantics and target syntax to target text; the Transfer and Direct paths cut across at lower levels]
7. MT Basics
● Direct Translation
○ Single phase : translate word by word with some reorderings
○ lack of analysis
■ long-range reordering
● “Sources said that IBM bought Lotus yesterday”
● “소식통(은) 어제 IBM(이) Lotus(를) 샀다(고) 말했다”
■ syntactic role ambiguity
● “They said that I like ice-cream”
● “They like that ice-cream”
[Figure: Vauquois pyramid with the Direct path (source text → target text) highlighted]
8. MT Basics
● Interlingual Translation
○ Two phases
■ Analysis : Analyze the source language into a semantic representation
■ Generation : Convert the representation into the target language
[Figure: Vauquois pyramid with the Interlingua path highlighted — analysis from source text up to semantics and the interlingua, then generation down to target text]
9. MT Basics
● Transfer based Translation
○ Three phases
■ Analysis : Analyze the source language’s structure
■ Transfer : Convert the source structure into a target structure
■ Generation : Convert the target structure into the target language
[Figure: Vauquois pyramid with the Transfer path highlighted — analysis, transfer between source and target structures, generation]
10. MT Basics
● Transfer based Translation
○ Levels of Transfer : Words, Phrases, Syntax
[Figure: Vauquois pyramid showing the levels of transfer — words, phrases, syntax — mirrored on the source and target sides]
11. MT Basics
● Foundational ideas
○ 18th c. - Bayes’ theorem
○ 1948 - Noisy-channel coding theorem
○ 1949 - Warren Weaver’s memo
● Statistical Machine Translation (SMT):
○ 1988 - Word-based models
○ 2003 - Phrase-based models
○ 2006 - Google Translate
● Neural Machine Translation (NMT):
○ 2013 - First papers on pure NMT
○ 2015 - NMT enters shared tasks (WMT, IWSLT)
○ 2016 - In production
History
12. MT Basics
● SMT as Noisy Channel
○ Said in English, received in Spanish
The Noisy Channel Model
Good Morning! ¡Buenos días!
13. MT Basics
● SMT as Noisy Channel
○ By convention, we use E for the source of the channel (the message we want to recover), F for the foreign language we observe.
The Noisy Channel Model
e: Good Morning! f: ¡Buenos días!
Language E
(e ∈ E)
Language F
(f ∈ F)
translation
14. MT Basics
● SMT as Noisy Channel
○ P(f|e)
The Noisy Channel Model
e: Good Morning! f: ¡Buenos días!
Language E
(e ∈ E)
Language F
(f ∈ F)
translation
P(f|e)
T(f) = ê = argmax_e P(e|f) = argmax_e P(f|e)P(e)
16. MT Basics
● P(e) - Language Model
○ models the fluency of the translation
○ data : corpus in the target language E
● P(f|e) - Translation Model
○ models the adequacy of the translation
○ data : parallel corpus of F and E pairs
● argmax_e - Decoder
○ given the LM, the TM, and f, generate the most fluent and adequate translation result ê
SMT Systems
17. MT Basics SMT Systems
[Figure: SMT system pipeline — statistical analysis of a Spanish/English parallel corpus trains the TM, statistical analysis of an English corpus trains the LM; decoding maps Spanish through “broken English” (TM) into fluent English (LM)]
18. MT Basics
● candidates based on Translation Model alone
○ Que hambre tengo yo
What hunger have I p(s|e) = 0.000014
Hungry I am so p(s|e) = 0.000001
I am so hungry p(s|e) = 0.0000015
Have I that hunger p(s|e) = 0.000020
...
SMT Systems
19. MT Basics
● with Language Model
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
20. MT Basics
● by Decoding
○ Que hambre tengo yo
What hunger have I p(s|e)p(e) = 0.000014 x 0.000001
Hungry I am so p(s|e)p(e) = 0.000001 x 0.0000014
I am so hungry p(s|e)p(e) = 0.0000015 x 0.0001
Have I that hunger p(s|e)p(e) = 0.000020 x 0.00000098
...
SMT Systems
argmax_e p(s|e)p(e) = I am so hungry
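A minimal sketch of this noisy-channel scoring in Python, assuming toy dictionaries that hold the (illustrative) probabilities from the slides above:

# hedged sketch: pick argmax_e p(s|e) * p(e) over the candidate translations
tm_prob = {'What hunger have I': 0.000014,
           'Hungry I am so': 0.000001,
           'I am so hungry': 0.0000015,
           'Have I that hunger': 0.000020}
lm_prob = {'What hunger have I': 0.000001,
           'Hungry I am so': 0.0000014,
           'I am so hungry': 0.0001,
           'Have I that hunger': 0.00000098}

best = max(tm_prob, key=lambda e: tm_prob[e] * lm_prob[e])
print(best)  # 'I am so hungry'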
21. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web
22. SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the next word?
○ “I am _____”
■ Sam, eggs, ham, not, ...
○ P(Sam | I am) = ?
Language Model
23. SMT Components
I am Sam
Sam I am
I do not like green eggs and ham
Language Model
25. SMT Components
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = ?
Language Model
26. SMT Components
● Predict probability of a sequence of words:
○ w = (w1 w2 w3 ... wk)
○ A model computes P(wk|w1 w2 w3 ... wk-1) or P(w)
● Application
○ speech recognition : P(I saw a van) >> P(eyes awe of an)
○ spelling correction : P(about fifteen minutes from) > P(about fifteen
minuets from)
○ machine translation : P(high winds tonite) > P(large winds tonite)
○ handwriting recognition
○ suggestions (search keyword, messaging,...)
Language Model
27. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
Language Model
28. SMT Components
● Chain rule
○ two random variables : P(A, B) = P(A) P(B|A)
○ more than two variables : P(w1 w2 ... wn) = P(w1) P(w2|w1) ... P(wn|w1 ... wn-1)
○ example, n = 4 variables : P(w1 w2 w3 w4) = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3)
Language Model
29. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
Language Model
30. SMT Components
● Markov assumption
○ Conditional probability distribution of future states depends only on
the present state, not on the sequence of events that preceded it
(wikipedia)
○ Bi-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | green )
○ Tri-gram approximation
■ P( eggs | I do not like green ) ≈ P( eggs | like green )
Language Model
31. SMT Components
● How to compute a Language Model P(w)
○ Given a sequence of words: w = (w1 w2 w3 ... wk)
○ too many combinations of words
● Chain rule:
○ P(w) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wk|w1 ... wk-1)
○ w1 ... wk-1 is still too long
● Markov assumption:
○ P(wi|w1 ... wi-1) ≈ P(wi|wi-n+1 ... wi-1)
○ ex) n = 2 ⇒ P(wi|w1 ... wi-1) ≈ P(wi|wi-1)
Language Model
32. SMT Components
● For Bi-gram Language Model (n=2):
○ P(w) = P(w1) P(w2|w1) ... P(wk|wk-1)
● Given a corpus
○ “I am Sam”
○ “Sam I am”
○ “I do not like green eggs and ham”
● What’s the probability of the whole sequence?
○ P(I am Sam) = P(I) x P(am | I) x P(Sam | am)
= 3/14 x 2/3 x 1 = 1/7
Language Model
33. SMT Components
● Bi-gram Language Model (Normalized):
○ P(w) = P(w1|<s>) P(w2|w1) ... P(wk|wk-1) P(</s>|wk)
● Given a corpus
○ “<s> I am Sam </s>”
○ “<s> Sam I am </s>”
○ “<s> I do not like green eggs and ham </s>”
● What’s the probability of the whole sequence?
○ P(<s> I am Sam </s>) = P(I|<s>) x P(am|I) x P(Sam|am) x P(</s>|Sam)
= 2/3 x 2/3 x 1/2 x 1/2 = 1/9
Language Model
34. ● generate Bi-gram
SMT Components
from typing import Callable, List

def generate_ngrams(
        s: str,
        n: int,
        tokenize: Callable[[str], List[str]] = lambda x: x.split()) -> List[str]:
    # 'a b c' => ['a', 'b', 'c']
    tokens = tokenize(s)
    # zip(['a', 'b', 'c'], ['b', 'c']) => [('a', 'b'), ('b', 'c')]
    ngrams = zip(*[tokens[i:] for i in range(n)])
    # [('a', 'b'), ('b', 'c')] => ['a b', 'b c']
    return [' '.join(ngram) for ngram in ngrams]
Language Model
35. ● generate Bi-gram
SMT Components
corpus = ['I am Sam',
          'Sam I am',
          'I do not like green eggs and ham']

def tokenizer(s: str) -> List[str]:
    return ['<s>'] + s.split() + ['</s>']
Language Model
36. SMT Components
>>> corpus[0]
'I am Sam'
>>> generate_ngrams(corpus[0], 2, tokenizer)
['<s> I', 'I am', 'am Sam', 'Sam </s>']
>>> [bigram for s in corpus for bigram in generate_ngrams(s, 2, tokenizer)]
['<s> I',
'I am',
'am Sam',
'Sam </s>',
'<s> Sam',
... ,
'and ham',
'ham </s>']
Language Model
37. ● Train Bi-gram Language Model
SMT Components
from collections import Counter

counter_numer = Counter()   # bigram counts
counter_denom = Counter()   # history (first word) counts
prob = {}

for bigram in bigrams:
    # count('a b')
    counter_numer[bigram] += 1
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # count('a')
    counter_denom[w_denom] += 1

for bigram, count in counter_numer.items():
    # 'a b' => 'a'
    w_denom = bigram.split()[0]
    # P('b'|'a') = count('a b') / count('a')
    prob[bigram] = count / counter_denom[w_denom]
Language Model
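Putting the pieces together, a minimal sketch that scores a whole sentence with the trained model (it reuses generate_ngrams, tokenizer, and the prob dict built above):

bigrams = [bg for s in corpus for bg in generate_ngrams(s, 2, tokenizer)]
# ... run the training loop above over `bigrams` to fill `prob` ...

def sentence_prob(s: str) -> float:
    p = 1.0
    for bigram in generate_ngrams(s, 2, tokenizer):
        p *= prob.get(bigram, 0.0)  # unseen bigram => probability 0 (no smoothing)
    return p

>>> sentence_prob('I am Sam')  # P(I|<s>) x P(am|I) x P(Sam|am) x P(</s>|Sam)
0.1111...                      # = 2/3 x 2/3 x 1/2 x 1/2 = 1/9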
38. ● How do we model P(f|e)?
● Given a parallel corpus of <e, f> sentence pairs:
○ e has le words, e = (e1 ... e_le)
○ f has lf words, f = (f1 ... f_lf)
● Introduce an alignment function a : j ⇒ i
○ Alignments a between e and f:
○ a = { a1, ..., a_lf }, aj ∈ { 0 ... le }
○ (le + 1)^lf possible alignments
○ f[j] is aligned to e[a[j]], which is e[i]
SMT Components Translation Model
39. ● Example: le = 6, lf = 7
○ e = And₁ the₂ program₃ has₄ been₅ implemented₆
○ f = Le₁ programme₂ a₃ ete₄ mis₅ en₆ application₇
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
SMT Components Translation Model
40. ● Modeling the generation of f given e: P(f,a|e,θ)
○ f, e : observable variables
○ a : hidden variables
○ θ : parameters
● Likelihood Maximization
○ P(f|e, θ) = Σ_a P(f, a|e, θ) → max_θ
○ but we don’t have labeled alignments
● EM-algorithm
○ E-step : estimates posterior probabilities for alignments
○ M-step : updates parameter θ of the model
SMT Components Translation Model
41. ● IBM Model 1
○ For parameter θ, the length of the foreign sentence (lf) is used
○ Alignments are uniformly distributed by assumption
■ P(a|e, lf) = 1 / (le + 1)^lf
● Translation Probability
○ estimate : P(f|a, e, lf) = ∏_{j=1..lf} t(fj|e_a(j))
● Result:
○ P(f, a|e, lf) = P(a|e, lf) × P(f|a, e, lf) = 1/(le + 1)^lf × ∏_{j=1..lf} t(fj|e_a(j))
SMT Components Translation Model
42. ● Example: le = 6, lf = 7
○ e = And₁ the₂ program₃ has₄ been₅ implemented₆
○ f = Le₁ programme₂ a₃ ete₄ mis₅ en₆ application₇
● One possible alignment:
○ a = { 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 6, 6 → 6, 7 → 6 }
P(f|a, e) = t(Le|the) × t(programme|program) × t(a|has) ×
            t(ete|been) × t(mis|implemented) × t(en|implemented) ×
            t(application|implemented)
SMT Components Translation Model
43. SMT Components
● EM Algorithm
○ For incomplete data
■ Sentence pairs are observable, but the alignments within each pair are not
○ Chicken and Egg Problem
■ if we had the alignments, we could estimate the translation parameters
■ if we had the parameters, we could estimate the alignments
Translation Model
44. SMT Components
● EM Algorithm (2)
○ initialize all alignments equally likely
○ model learns… (ex. “la” is often aligned to “the”)
Translation Model
45. ● EM Algorithm (3)
○ after first iteration
○ some alignments... (ex. “la” is often aligned to “the”)
SMT Components Translation Model
46. ● EM Algorithm (4)
○ after another iteration
○ alignments become apparent (ex. “fleur” aligned to “flower” becomes more likely)
SMT Components Translation Model
47. ● EM Algorithm (5)
○ convergence
○ hidden alignments revealed by EM
SMT Components Translation Model
48. ● EM Algorithm (6)
○ parameter estimation from the corpus of aligned sentence pairs
SMT Components Translation Model
p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...
49. SMT Components
from collections import defaultdict

UNIFORM = 0.25        # any positive constant; EM renormalizes it
ITERATION_MAX = 10

def IbmModelOneTrainEM(E, F):
    TM = {}           # translation model t(e|f)
    total_s = {}
    num_sentence = len(E)
    assert len(E) == len(F)

    # initialize: assume a uniform distribution
    for i in range(num_sentence):
        for e in E[i]:
            for f in F[i]:
                TM[(e, f)] = UNIFORM
Translation Model
50. SMT Components
    # EM iteration
    for iteration in range(ITERATION_MAX):
        count = defaultdict(float)   # expected counts for (e, f) pairs
        total = defaultdict(float)   # expected counts for each f
        for i in range(num_sentence):
            # E-step: normalize over all f linked to each e
            for e in E[i]:
                total_s[e] = 0.0
                for f in F[i]:
                    total_s[e] += TM[(e, f)]
            # collect expected counts
            for e in E[i]:
                for f in F[i]:
                    count[(e, f)] += TM[(e, f)] / total_s[e]
                    total[f] += TM[(e, f)] / total_s[e]
51. SMT Components
        # M-step: recalculate parameters from the expected counts
        for x in count.keys():
            f = x[1]
            TM[x] = count[x] / total[f]   # total is a defaultdict, so no KeyError
    return TM
Translation Model
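A quick usage sketch on a toy parallel corpus (the word lists below are illustrative, not from the slides):

E = [['the', 'house'], ['the', 'book'], ['a', 'book']]
F = [['la', 'maison'], ['le', 'livre'], ['un', 'livre']]

TM = IbmModelOneTrainEM(E, F)
# after a few EM iterations, pairs like ('book', 'livre') accumulate
# most of the probability mass for their foreign word
for pair, p in sorted(TM.items(), key=lambda kv: -kv[1])[:4]:
    print(pair, round(p, 3))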
52. ● Decoding is NP-complete
○ Given a Language Model and Translation Model, the decoder
constructs the possible translations and looks for the most probable
one.
● Traveling Salesman Problem
○ Words as vertices with translation probability
○ Edges with bigram probability as a weight
○ valid sentences are Hamiltonian paths
SMT Components Decoder
[Figure: word graph over “I”, “am”, “so”, “sleepy”, “.” — a valid sentence corresponds to a Hamiltonian path]
53. SMT Components
● Solutions
○ Dynamic Programming : reduces time complexity from exponential to polynomial by storing the results of subproblems and avoiding their re-computation
○ Viterbi Algorithm : a DP algorithm to find the shortest path through a finite number of states
○ Beam Search (Stack Decoding) : keeps a list of the best N candidates seen so far
Decoder
54. SMT Components
from collections import namedtuple

hypothesis = namedtuple('hypothesis', 'logprob, lm_state, predecessor, phrase')
initial_hypothesis = hypothesis(0.0, '<s>', None, None)
stacks = [{} for _ in f] + [{}]
stacks[0]['<s>'] = initial_hypothesis
for i, stack in enumerate(stacks[:-1]):
    for h in sorted(stack.values(), key=lambda h: -h.logprob)[:STACKSIZE]:  # prune
        for j in range(i + 1, len(f) + 1):
            if f[i:j] in tm:
                for phrase in tm[f[i:j]]:
                    logprob = h.logprob + phrase.logprob
                    lm_state = h.lm_state
                    for word in phrase.english.split():
                        (lm_state, word_logprob) = lm.score(lm_state, word)
                        logprob += word_logprob
                    logprob += lm.end(lm_state) if j == len(f) else 0.0
                    new_hypothesis = hypothesis(logprob, lm_state, h, phrase)
                    if lm_state not in stacks[j] or stacks[j][lm_state].logprob < logprob:
                        # second case is recombination
                        stacks[j][lm_state] = new_hypothesis
winner = max(stacks[-1].values(), key=lambda h: h.logprob)
Decoder
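The snippet assumes its inputs are already built: f is the foreign token sequence, tm is a phrase table mapping source spans to candidate phrases (each carrying english and logprob fields), lm is a language model exposing score and end, and STACKSIZE is the beam width. Hypotheses that reach the same LM state are recombined, keeping only the higher-scoring one.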
55. Outline
• Introduction
• Machine Translation Basics
• SMT Components
• Language Model
• Translation Model
• Decoder
• SMT based Ro-Ko Transliteration
• About Hangeul / Problem Definition / Approaches / Training Data
• Koreanizer on web (demo)
56. Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Grapheme (자소)
■ A featural alphabet of 24 basic consonant (자음) and vowel (모음) letters; the table below also lists the doubled and compound forms
자음 : ㄱ ㄲ ㅋ ㄷ ㄸ ㅌ ㅂ ㅃ ㅍ ㅈ ㅉ ㅊ ㅅ ㅆ ㅎ ㄴ ㅁ ㅇ ㄹ
roman : g/k kk k d/t tt t b/p pp p j jj ch s ss h n m ng r/l
모음 : ㅏ ㅓ ㅗ ㅜ ㅡ ㅣ ㅐ ㅔ ㅚ ㅟ ㅑ ㅕ ㅛ ㅠ ㅒ ㅖ ㅘ ㅙ ㅝ ㅞ ㅢ
roman : a eo o u eu i ae e oe wi ya yeo yo yu yae ye wa wae wo we ui
About Hangeul
57. Ro-Ko Transliteration
● Korean alphabet (Hangeul)
○ Syllables (음절)
■ Letters are grouped into blocks each of which transcribes a syllable
■ ㅎ(h) + ㅏ(a) + ㄴ(n) ⇒ 한 (han)
■ ㄱ(g) + ㅡ(eu) + ㄹ(l) ⇒ 글 (geul)
About Hangeul
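This block structure maps directly onto Unicode. A minimal sketch (assuming the standard Hangul Syllables composition formula with base U+AC00) of assembling letters into a syllable block:

# standard Unicode jamo orderings: 19 leads, 21 vowels, 27 tails plus 'no tail'
CHOSEONG = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')
JUNGSEONG = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')
JONGSEONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')

def assemble(lead: str, vowel: str, tail: str = '') -> str:
    # syllable = U+AC00 + (lead_index * 21 + vowel_index) * 28 + tail_index
    l, v, t = CHOSEONG.index(lead), JUNGSEONG.index(vowel), JONGSEONG.index(tail)
    return chr(0xAC00 + (l * 21 + v) * 28 + t)

>>> assemble('ㅎ', 'ㅏ', 'ㄴ') + assemble('ㄱ', 'ㅡ', 'ㄹ')
'한글'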
58. Ro-Ko Transliteration
● Romanization
○ Romanization is the transcription of the sounds of a foreign language into Roman (Latin) letters, phoneme by phoneme. (ex. 한글 ⇒ hangeul)
● Back-romanization
○ Translating a string of Roman characters into the best phonetically matching words of a target language. (ex. hangeul ⇒ 한글)
○ Applications
■ Keyboard input without installing the desired language’s input method (ex. Google Input Tools)
■ Search engines (ex. loanwords)
■ Address transliteration
○ Korean back-romanization = Ro-Ko transliteration
Problem
59. Ro-Ko Transliteration
● The Goal
○ Finding the best phonetic matching Hangeul string from a Roman
string
● Two Challenges
○ more than one possible way
■ Romanization : ‘한글’ → ‘hangul’ | ‘hangeul’ | ‘hanguel’
■ Back-romanization : ‘hangul’ | ‘hangeul’ | ‘hanguel’ → ‘한글’
■ finding the most probable matching word
○ Segmental Alignment
■ ‘kanmagi’ → ‘칸(kan)막(mag)이(i)’ vs ‘칸(kan)마(ma)기(gi)’
■ finding the most probable segment in roman string
○ This project focuses on the bi-text segmental alignment problem.
Problem
60. Ro-Ko Transliteration
● Phoneme based alignment
○ Monotonic many-to-one segmental alignment.
■ Mapping one or more roman phonemes (음소) to one hangeul grapheme (자소)
○ “hangeul 한글” (includes 2-to-1 case)
■ [ ㅎㅏㄴㄱㅡㄹ, h a n g e u l ] → [ ㅎ/h, ㅏ/a, ㄴ/n, ㄱ/g, ㅡ/eu, ㄹ/l ]
○ “kanmagi 칸막이” (only 1-to-1 cases)
■ [ ㅋㅏㄴㅁㅏㄱㅇㅣ, k a n m a g i ] → [ ㅋ/k, ㅏ/a, ㄴ/n, ㅁ/m, ㅏ/a, ㄱ/g, ㅣ/i ]
Approaches
61. Ro-Ko Transliteration
● Phoneme based alignment (Drawback)
○ the disassembled sequence (ㅋㅏㄴㅁㅏㄱㅇㅣ) of Hangeul syllables (칸막이) needs to be reassembled by another segmentation algorithm (or some heuristics)
○ this may produce a wrong segmentation
○ 칸막이 → disassemble : ㅋㅏㄴㅁㅏㄱㅇㅣ → reassemble : 칸마기
○ though ‘칸막이’ and ‘칸마기’ have very similar pronunciations,
○ ‘칸마기’ is incorrect and ‘칸막이’ should be the answer
Approaches
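A sketch of that disassembly step, inverting the Unicode composition formula shown earlier (it reuses the CHOSEONG/JUNGSEONG/JONGSEONG tables):

from typing import List

def disassemble(text: str) -> List[str]:
    # invert: syllable = U+AC00 + (lead * 21 + vowel) * 28 + tail
    jamos = []
    for ch in text:
        code = ord(ch) - 0xAC00
        l, v, t = code // (21 * 28), (code % (21 * 28)) // 28, code % 28
        jamos += [CHOSEONG[l], JUNGSEONG[v]] + ([JONGSEONG[t]] if t else [])
    return jamos

>>> ''.join(disassemble('칸막이'))
'ㅋㅏㄴㅁㅏㄱㅇㅣ'
# reassembly is ambiguous: ㄱ may close 막 with silent ㅇ leading 이 (칸막이),
# or serve as the lead of 기 (칸마기) -- exactly the failure described above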
62. Ro-Ko Transliteration
● Syllable based Alignment (Translation Model)
○ Align monotonically by syllables
○ preserve syllables with no disassembling, and run many-to-one
(roman phonemes to a hangeul syllable) segmental alignment.
○ “hangeul 한글”
■ [ 한 글, h a n g e u l ] → [ 한/han, 글/geul ]
○ “kanmagi 칸막이”
■ [ 칸 막 이, k a n m a g i ] → [ 칸/kan, 막/mag, 이/i ]
Training Data → unsupervised segmental alignment → Translation Model P(Ro|Ko)
한글 hangeul → 한/han 글/geul
한자 hanja → 한/han 자/ja
글자 geulja → 글/geul 자/ja
Approaches
63. Ro-Ko Transliteration
● Syllable based Language Model
○ Train a syllable-based bi-gram language model on all Korean words in the training data
Training Data → Bigrams → Language Model P(Ko)
한글 : P(한|<s>), P(글|한), P(</s>|글)
한자 : P(한|<s>), P(자|한), P(</s>|자)
글자 : P(글|<s>), P(자|글), P(</s>|자)
Approaches
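Reusing generate_ngrams from the Language Model section with a syllable-level tokenizer (the helper name below is mine, not from the slides):

def syllable_tokenizer(word: str) -> List[str]:
    # one token per Hangeul syllable block
    return ['<s>'] + list(word) + ['</s>']

>>> generate_ngrams('한글', 2, syllable_tokenizer)
['<s> 한', '한 글', '글 </s>']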
64. Ro-Ko Transliteration
● Decode a new input Roman sequence
○ T(Ro) = K̂o = argmax_Ko P(Ko) P(Ro|Ko)
decode : new roman sequence → best matching output
singgeulbeonggeul → 싱글벙글
hangeulja → 한글자
hangeulnal → 한글날
Approaches
65. Ro-Ko Transliteration
from typing import Sequence, Tuple
import requests
from bs4 import BeautifulSoup

def parse_page(
        baseurl: str,
        page: int) -> Sequence[Tuple[str, str, str]]:
    # KPOP_SONG and TITLE are site-specific CSS class constants defined elsewhere
    response = requests.get(baseurl + str(page))
    soup = BeautifulSoup(response.text, 'html.parser')
    songs = soup.find_all('div', {'class': KPOP_SONG})
    records = []
    for s in songs:
        attrs = s.find('a', {'class': TITLE}).attrs
        artist, title = attrs['title'].split('-', 1)
        songurl = attrs['href']
        records.append((artist, title, songurl))
    return records
>>> parse_page(BASEURL, 216)
[('A.C.E', 'Take Me Higher', '/a-c-e-take-me-higher/'),
('ONF', 'Complete (널 만난 순간)',
'/onf-complete-neol-mannan-sungan/'),
...
]
Training Data
67. Ro-Ko Transliteration
def get_song_text(song_url: str) -> str:
    # hrefs returned by parse_page are site-relative, so prepend BASEURL
    response = requests.get(BASEURL + song_url)
    song = BeautifulSoup(response.text, 'html.parser')
    text = song.h2.parent.text
    return text

>>> print( get_song_text('/bts-boy-with-luv-jageun-geosdeureul-wihan-si/') )
방탄소년단 (Bangtan Boys) BTS – 작은 것들을 위한 시 (Feat. Halsey) Boy with Luv Lyrics
Genre : R&B/Soul
Release Date : 2019-04-12
Language : Korean
BTS – Boy with Luv Hangul
모든 게 궁금해
How’s your day
Oh tell me
뭐가 널 행복하게 하는지
...
BTS – Boy with Luv Romanization
modeun ge gunggeumhae
How’s your day
...
Training Data
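One plausible next step, sketched here under assumptions (this is not the project's actual code): split the page text on the 'Hangul' and 'Romanization' section headers shown above and pair the lyric lines into <Ko, Ro> training examples.

from typing import List, Tuple

def extract_pairs(text: str) -> List[Tuple[str, str]]:
    # sections appear as '... Hangul <lyrics> ... Romanization <lyrics>'
    hangul_part = text.split('Hangul', 1)[1].split('Romanization', 1)[0]
    roman_part = text.split('Romanization', 1)[1]
    hangul_lines = [l.strip() for l in hangul_part.splitlines() if l.strip()]
    roman_lines = [l.strip() for l in roman_part.splitlines() if l.strip()]
    # lyric lines correspond one-to-one when both sections are complete
    return list(zip(hangul_lines, roman_lines))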