This document presents a thesis proposal on grapheme-to-phoneme (G2P) conversion for Chinese dialects. The motivation is that Chinese dialects comprise many mutually unintelligible varieties, yet most are written with the same characters. The proposal aims to use information from other dialects and historical data to predict character pronunciations in a target dialect for which little data is available. A preliminary k-nearest-neighbors approach, using character features from Mandarin and Cantonese, predicted readings with over 25% exact matches. The goal is to build grapheme-to-phoneme conversion tables for resource-poor Chinese dialects.
2. • Motivation
• Problem Description
• Problem Definition
• Proposed Solution
• Proposed Evaluation
3. Motivation
• Term definition
• dialects - mutually intelligible language varieties
• Chinese dialects - Sinitic languages that have a common ancestor
• Chinese dialects are more often than not mutually unintelligible.
4. Motivation
• There are many Chinese dialects.
• “innumerable” (Lü, 1980)
• “no one really knows” (Yan, 2006)
• Most of them are written with Chinese characters
5. Motivation
• Inter-dialect MT systems have been developed (Zhang, 1998; Lin and Chen, 1999)
• they are in fact text-to-speech systems with rule-based word substitution
• naive assumption: all Chinese dialects share the same underlying syntax (Chao, 1968)
6. Taiwanese-Mandarin MT system of Lin and Chen
[Figure: Mandarin Input → Word Translation (/beh/) → Character Translit. (/sin kong sam) → Taiwanese Output]
7. Flow of a TTS System
[Figure: Preprocessing → Grapheme to Phoneme Conversion → Wave Form Generation]
8. Problems of the G2P Conversion
• Each character’s reading has to be learned!
• this is a problem if we are going to build these tables for many Chinese dialects.
9. Problem Description
• What can we do for these resource-poor languages?
• The idea is to make use of other dialects’ information and historical data to predict one dialect’s character pronunciation.
10. Language & Character
• Chinese dialects often have more than one reading per character.
• > 40% of characters are polyphonic in Taiwanese
11. Language & Character
• Two systems in one language, e.g. Taiwanese
• hiann5 chui2 (to boil water)
• jian5 sio (to burn)
• different readings, different strata, different functions
• often referred to as literary and colloquial
12. Language & Character
• Usually only the literary readings are used in
• foreign named entities
• chhiu7 na5 (Shu-Lin) but Peh lim5 (Berlin)
• borrowed words
• reading literature
• it would be quite useful if we could predict these readings
13. Problem Definition
• input
• a set of characters
• each language’s character readings
• each character’s description in historical rime books
• presumably only a few known character readings in the target language
14. • output
• one reading for each character in the target language.
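The input and output above can be pictured as a small data structure. This is only a sketch under stated assumptions: the names `CharacterRecord` and `needs_prediction`, the placeholder character IDs, and the language label "Target" are hypothetical and do not come from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CharacterRecord:
    """One input character (hypothetical layout, not the proposal's format)."""
    character: str
    # language name -> known readings of this character in that language
    readings: Dict[str, List[str]] = field(default_factory=dict)
    # categories from a historical rime book, if available
    rime_description: Optional[Tuple[str, ...]] = None


def needs_prediction(records: List[CharacterRecord], target: str) -> List[str]:
    """Return the characters that have no known reading in the target language."""
    return [r.character for r in records if not r.readings.get(target)]


records = [
    # "char_1" already has a (placeholder) reading in the target language.
    CharacterRecord("char_1", {"Mandarin": ["CHUAN1"], "Japanese": ["SEN"],
                               "Target": ["known_reading"]}),
    # "char_2" is only attested in other languages, so it must be predicted.
    CharacterRecord("char_2", {"Mandarin": ["ZHAO4"], "Japanese": ["SHOU"]}),
]
print(needs_prediction(records, "Target"))
```

The output side of the task is then one predicted reading for every character returned by `needs_prediction`.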
15.
Mandarin     SHEN3   CHUAN1   CHAN2, SHAN4, TAN2   ZHAO4
Japanese     SHIN    SEN      ZEN, SEN             SHOU
Vietnamese   thẩm    xuyên    ?                    chiếu
16. Data available so far
• character readings in various languages
• Mandarin, Cantonese, Hokkien, Japanese, Korean, Vietnamese
• historical rime book
•
17. Data Format
• character reading
• phonemic, not phonetic transcription
• phonemes not directly comparable across languages
• onset - (glide+nucleus) - coda - tone
• e.g. : Y - (U+E) - ∅ - 4
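The four-slot format above can be sketched as a small record type. The names `Reading` and `format_reading` are hypothetical and used for illustration only; the slide specifies just the slots, not how they are stored.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    """One character reading: onset - (glide+nucleus) - coda - tone."""
    onset: Optional[str]   # None stands for an empty slot (∅)
    glide: Optional[str]
    nucleus: str
    coda: Optional[str]
    tone: int


def slot(value: Optional[str]) -> str:
    """Render an empty slot as ∅, as in the slide's example."""
    return value if value else "∅"


def format_reading(r: Reading) -> str:
    """Format a reading in the slide's onset - (glide+nucleus) - coda - tone layout."""
    rime = f"{r.glide}+{r.nucleus}" if r.glide else r.nucleus
    return f"{slot(r.onset)} - ({rime}) - {slot(r.coda)} - {r.tone}"


# The slide's example: onset Y, glide U, nucleus E, no coda, tone 4.
example = Reading(onset="Y", glide="U", nucleus="E", coda=None, tone=4)
print(format_reading(example))  # Y - (U+E) - ∅ - 4
```

Keeping the slots separate lets each slot be compared or predicted on its own, which matters because phonemes are not directly comparable across languages.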
18. • historical data
•
• each character can be described with
/ / / / / /
• e.g.
• / / / / / /
19. Preliminary Work
• kNN for prediction
• calculate the “average feature” for each character
• e.g. in Mandarin: ZHAN1 TIAN1 TIE1 DIAN4 CHAN1
• the onset: (ZH, T, D, CH) = (0.2, 0.4, 0.2, 0.2)
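The onset figures in this example can be reproduced by counting onsets over the five readings and normalising. The sketch below is an assumption about how the “average feature” might be computed, not the proposal's actual code: the initial list, `onset_of`, and `onset_distribution` are hypothetical, and Y and W are treated as onsets only because the data format example above does so.

```python
from collections import Counter
from typing import Dict, List

# Mandarin pinyin initials, longest first so that ZH/CH/SH match before Z/C/S.
INITIALS = sorted(
    ["ZH", "CH", "SH", "B", "P", "M", "F", "D", "T", "N", "L", "G", "K",
     "H", "J", "Q", "X", "R", "Z", "C", "S", "Y", "W"],
    key=len,
    reverse=True,
)


def onset_of(reading: str) -> str:
    """Strip the tone digit and return the initial, or ∅ if there is none."""
    syllable = reading.rstrip("0123456789").upper()
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial
    return "∅"


def onset_distribution(readings: List[str]) -> Dict[str, float]:
    """Relative frequency of each onset over a group of readings."""
    counts = Counter(onset_of(r) for r in readings)
    total = sum(counts.values())
    return {onset: count / total for onset, count in counts.items()}


dist = onset_distribution(["ZHAN1", "TIAN1", "TIE1", "DIAN4", "CHAN1"])
print(dist)
```

T occurs twice among the five readings, which gives the 0.4 in the slide; each of the other onsets occurs once, giving 0.2.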
20. • kNN
• compute KL divergence between characters
• the nearest characters vote on the predicted reading
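The two kNN steps can be sketched as follows. This is a minimal illustration, not the proposal's implementation: the additive smoothing constant `eps` is an assumption added here because KL divergence is undefined when the second distribution has zero probability for an observed onset, and the neighbour data and reading labels are invented placeholders.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Distribution = Dict[str, float]


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-6) -> float:
    """KL(p || q) over the union of keys, with additive smoothing (an assumption)."""
    keys = set(p) | set(q)
    ps = {k: p.get(k, 0.0) + eps for k in keys}
    qs = {k: q.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum((ps[k] / zp) * math.log((ps[k] / zp) / (qs[k] / zq)) for k in keys)


def knn_predict(query: Distribution,
                known: List[Tuple[Distribution, str]],
                k: int = 3) -> str:
    """Let the k characters with the smallest divergence vote on the reading."""
    ranked = sorted(known, key=lambda item: kl_divergence(query, item[0]))
    votes = Counter(reading for _, reading in ranked[:k])
    return votes.most_common(1)[0][0]


# Placeholder data: feature distributions of characters whose target-language
# reading is already known, paired with hypothetical reading labels.
known = [
    ({"ZH": 0.25, "T": 0.5, "D": 0.25}, "reading_A"),
    ({"T": 0.5, "D": 0.5}, "reading_A"),
    ({"M": 1.0}, "reading_B"),
    ({"S": 1.0}, "reading_C"),
]
query = {"ZH": 0.2, "T": 0.4, "D": 0.2, "CH": 0.2}
print(knn_predict(query, known, k=3))  # reading_A
```

KL divergence is asymmetric, so the direction (query against neighbour, as here, or the reverse) is a design choice the slides do not specify.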