Proposal Tomorrow


This document discusses a thesis proposal for grapheme to phoneme conversion for Chinese dialects. The motivation is that Chinese dialects have many mutually unintelligible varieties but most are written with the same characters. The proposal aims to use information from other dialects and historical data to predict character pronunciations in a target dialect with limited available data. A preliminary k-nearest neighbors approach was tested using character features from Mandarin and Cantonese, achieving over 25% exact matches for predicting readings. The goal is to build grapheme to phoneme conversion tables for resource-poor Chinese dialects.

Grapheme to Phoneme Conversion for Chinese Dialects
Chu-Cheng’s thesis proposal

• Motivation
• Problem Description
• Problem Definition
• Proposed Solution
• Proposed Evaluation

Motivation
• Term definition
• dialects - mutually intelligible language varieties
• Chinese dialects - Sinitic languages that have a common ancestor
• Chinese dialects are more often than not mutually unintelligible.

Motivation

• There are many Chinese dialects.
• “innumerable” (Lü, 1980)
• “no one really knows” (Yan, 2006)
• Most of them are written with Chinese characters

Motivation
• Inter-dialect MT systems have been developed (Zhang, 1998; Lin and Chen, 1999)
• they are in fact text-to-speech systems with rule-based word substitution
• naive assumption: all Chinese dialects share the same underlying syntax (Chao, 1968)

[Figure: Taiwanese-Mandarin MT system of Lin and Chen. Mandarin Input → Word Translation (/beh/) → Character Translit. (/sin kong sam) → Taiwanese Output]

[Figure: Flow of a TTS System. Preprocessing → Grapheme to Phoneme Conversion → Wave Form Generation]

Problems of the G2P Conversion

• Each character’s reading has to be learned!
• This scales badly if we are going to build these tables for many Chinese dialects.

Problem Description

• What can we do for these resource-poor languages?
• The idea is to make use of other dialects’ information and historical data to predict one dialect’s character pronunciation.

Language & Character

• Chinese dialects often have more than one reading per character.
• More than 40% of characters are polyphonic in Taiwanese

Language & Character
• Two systems in one language, e.g. Taiwanese
• hiann5 chui2 (to boil water)
• jian5 sio (to burn)
• different readings; different strata; different functions
• often referred to as literary and colloquial readings

Language & Character
• Usually only the literary readings are used in
• foreign named entities
• chhiu7 na5 (Shu-Lin) but Peh lim5 (Berlin)

• borrowed words

• literature reading

• it would be quite useful if we could predict them

Problem Definition
• input
• a set of characters
• each language’s character readings
• each character’s description in historical rime books
• presumably only a few character readings of the target language

• output
• one reading for each character in the target language.

Mandarin    SHEN3   CHUAN1   CHAN2, SHAN4, TAN2   ZHAO4
Japanese    SHIN    SEN      ZEN, SEN             SHOU
Vietnamese  thẩm    xuyên    ?                    chiếu

Data available so far

• character readings in various languages
• Mandarin, Cantonese, Hokkien, Japanese, Korean, Vietnamese
• historical rime book
•

Data Format
• character reading

• phonemic, not phonetic transcription

• phonemes not directly comparable across
languages

• onset - (glide+nucleus) - coda - tone

• e.g. : Y - (U+E) - ∅ - 4

• historical data

•
• each character can be described with
/ / / / / /

• e.g.

• / / / / / /
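
The onset - (glide+nucleus) - coda - tone layout described above can be sketched as a small record type. This is only an illustration: the class name `Reading`, its field names, and the `features` and `spell` helpers are hypothetical and not taken from the proposal, and treating glide+nucleus as a single feature is an assumption based on the slot grouping shown on the slide.

```python
from typing import NamedTuple, Optional, Tuple


class Reading(NamedTuple):
    """One phonemic character reading (hypothetical layout)."""
    onset: Optional[str]   # None stands for an empty slot
    glide: Optional[str]
    nucleus: str
    coda: Optional[str]    # None stands for the empty coda (∅)
    tone: int

    def features(self) -> Tuple[str, str, str, int]:
        """Return the four slots: onset, glide+nucleus, coda, tone."""
        rime = (self.glide or "") + self.nucleus
        return (self.onset or "", rime, self.coda or "", self.tone)

    def spell(self) -> str:
        """Join the slots back into a flat romanized syllable."""
        onset, rime, coda, tone = self.features()
        return f"{onset}{rime}{coda}{tone}"


# The slide's example: Y - (U+E) - ∅ - 4
example = Reading(onset="Y", glide="U", nucleus="E", coda=None, tone=4)
print(example.features())
print(example.spell())
```

Keeping the slots separate lets each one be compared or predicted on its own, which matters because the phonemes are not directly comparable across languages.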

Preliminary Work
• kNN for prediction
• calculate the “average feature” for each character
• ex: in Mandarin: ZHAN1 TIAN1 TIE1 DIAN4 CHAN1

• the onset: (ZH, T, D, CH) = (0.2, 0.4, 0.2, 0.2)

• kNN
• compute KL divergence between characters
• the nearest characters vote on the predicted reading
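
The two steps above (an "average feature" per character, then a KL-divergence nearest-neighbour vote) can be sketched as follows. This is a minimal sketch under stated assumptions, not the proposal's implementation: the function names, the `eps` smoothing, the value of `k`, and the toy neighbour data are all hypothetical. The onset list is read off the slide's five Mandarin readings (ZH, T, T, D, CH).

```python
import math
from collections import Counter


def average_feature(values):
    """Relative frequency of each feature value, e.g. of each onset."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) over the union of keys.

    The eps smoothing is an assumption added here so that values
    missing from one distribution do not produce an infinite result.
    """
    keys = set(p) | set(q)
    ps = {k: p.get(k, 0.0) + eps for k in keys}
    qs = {k: q.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[k] / zp) * math.log((ps[k] / zp) / (qs[k] / zq)) for k in keys
    )


def knn_predict(query, labelled, k=3):
    """Majority vote among the k labelled distributions nearest to query.

    labelled is a list of (distribution, target_value) pairs.
    """
    nearest = sorted(labelled, key=lambda item: kl_divergence(query, item[0]))[:k]
    votes = Counter(target for _, target in nearest)
    return votes.most_common(1)[0][0]


# Onsets of ZHAN1 TIAN1 TIE1 DIAN4 CHAN1
onset_dist = average_feature(["ZH", "T", "T", "D", "CH"])
print(onset_dist)

# Hypothetical neighbours: (Mandarin onset distribution, target-dialect onset)
neighbours = [
    ({"ZH": 0.2, "T": 0.4, "D": 0.2, "CH": 0.2}, "t"),
    ({"T": 0.5, "D": 0.5}, "t"),
    ({"M": 1.0}, "b"),
]
print(knn_predict(onset_dist, neighbours, k=3))
```

KL divergence is asymmetric, so the direction used here (query against neighbour) is itself a choice; a symmetrised variant would be an equally plausible reading of the slide.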

Preliminary Work - Results
• 7548 chars, 75 folds (each fold ~ 100 chars)

• Mandarin
• 4 features
• all exact match: 25.7%
• 3 exact match: 21.9%
• 2 exact match: 26.9%
• 1 exact match: 20.2%
• none matches: 5%

• Cantonese
• 4 features
• 4: 27.0%
• 3: 24.7%
• 2: 24.3%
• 1: 16.1%
• none: 8.0%
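
The percentages above report, for each held-out character, how many of the four features were predicted exactly. A hedged sketch of that bookkeeping is given below; the function names, the round-robin fold split, and the toy prediction pairs are assumptions made for illustration, and the slides do not say how the folds were actually formed.

```python
from collections import Counter


def split_folds(items, n_folds):
    """Round-robin split into n_folds held-out sets (one possible scheme)."""
    return [items[i::n_folds] for i in range(n_folds)]


def matched_features(predicted, gold):
    """Number of feature slots where prediction and gold agree exactly."""
    return sum(p == g for p, g in zip(predicted, gold))


def match_histogram(pairs, n_features=4):
    """Share of characters with n_features, ..., 1, 0 matched features."""
    counts = Counter(matched_features(p, g) for p, g in pairs)
    total = len(pairs)
    return {n: counts.get(n, 0) / total for n in range(n_features, -1, -1)}


# 7548 characters in 75 folds gives roughly 100 characters per fold
folds = split_folds(list(range(7548)), 75)
print(len(folds), min(map(len, folds)), max(map(len, folds)))

# Toy (predicted, gold) pairs of (onset, glide+nucleus, coda, tone)
pairs = [
    (("Y", "UE", "", 4), ("Y", "UE", "", 4)),   # all four match
    (("Y", "UE", "", 4), ("Y", "UE", "N", 2)),  # two match
    (("T", "A", "", 1), ("D", "I", "N", 3)),    # none match
]
print(match_histogram(pairs))
```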
