First-Character Filtering Method in Syllable Segmentation using Data Dictionary for Myanmar Language


Written Myanmar text has no distinct word delimiter (white space) between the words of a sentence, which is one of the problems in performing Natural Language Processing tasks for the Myanmar language. To determine the specific meaning or context of each word in the text, a proper syllable segmentation process is needed as a basic component for further linguistic processing.

HTET MYET LYNN
20147728
Department of Computer Engineering
Chosun University
Intelligent Computing Laboratory
First-Character Filtering Method in Syllable Segmentation using Data Dictionary for Myanmar Language

Contents
• Nature of Myanmar Words
• First-Character Filtering (FCF) Method
• Implementation & Result
• Future Work

Nature of Myanmar words (1/3)
• Written like other South-East Asian languages
• No delimiter (whitespace) between words, only between phrases
• No standard rules for whitespace
• 33 consonants, 10 digits

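Nature of Myanmar words (2/3)
Consonant (black), Vowel (blue)
Example: the syllable meaning "good", built up in five steps (Korean phonetic labels as shown on the slide):
1. ㄲ
2. 께이
3. 꺼
4. 까웅
5. 까웅이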

Nature of Myanmar words (3/3)
Writing Methods
• Kinzi
• Stacked Consonants
Example words: "English", "university"

First-Character Filtering (FCF) Method (1/8)
Input Sentence → Get First Character → Syllable Collections → Output Syllable

First-Character Filtering (FCF) Method (2/8)
Syllable Collections

First-Character Filtering (FCF) Method (3/8)
Syllable Collections

First-Character Filtering (FCF) Method (4/8)
Sentence Pre-processing
• Whitespace
• Punctuation marks
• Number of input sentences
• Length of each sentence
Example sentence: "The student goes to school."

First-Character Filtering (FCF) Method (5/8)
Get First Character of the sentence
Input_txt_length = 160

First-Character Filtering (FCF) Method (6/8)
Extract Candidates from Syllable Collections

First-Character Filtering (FCF) Method (7/8)
Extract Candidates from Syllable Collections
Input_txt_length = 160
Length_of_syl = 1, 4, 8, 12, ...

    $candidate = substr($Input_txt, 0, $length_of_syl);
    // Store the candidate only if it equals the syllable from the table
    if ($candidate == $syllable) {
        Store_Candidate($candidate);
    }

Matching candidates are stored in Candidates.txt

First-Character Filtering (FCF) Method (8/8)
Store Final Syllable
Input_txt_length = 160
Final_syllable_length = 20 (final syllable stored in Results.txt)
Input_txt_length = 140 (after removing the final syllable from the start of the input)
destroy candidates.txt

Future Work
Algorithm for:
• Loan words
• Kinzi syllables
• Stacked-consonant syllables
• Word segmentation
Example words: "English", "university", "bus"


Implementation & Result
(Two slides; figures only, no slide text was extracted.)


Editor's Notes

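Hello. I am Htet Myet Lynn, class of 2014. Today I will present my research on syllable segmentation for the Myanmar language. The title of the presentation is "First-Character Filtering Method in Syllable Segmentation using Data Dictionary for Myanmar Language", and a paper on it is in preparation.
As the contents show, I will first look at how Myanmar is written and at the characteristics of Myanmar words. Next I will explain the First-Character Filtering (FCF) method I am working on and report the implementation and results. Finally, I will cover future work.
Let me explain the nature of Myanmar words. Myanmar is written in a way similar to other Southeast Asian languages. In English and Korean sentences, whitespace separates the words, so word segmentation is easy. In Myanmar sentences, whitespace appears between phrases rather than between words, so word segmentation is difficult. Myanmar also has no standard rules for whitespace, so word segmentation remains difficult even where phrases are separated by whitespace. As you can see, Myanmar has 33 consonants and 10 digits.
As you can see, a Myanmar syllable is formed by combining a consonant with vowels. Consonants are shown in black and vowels in blue. When writing Myanmar, the consonant comes first and the vowels attach to it to form one syllable. For example, the syllable at the bottom means "good". To write it, the consonant and vowels are combined step by step, as shown on the left. This is the general structure of a Myanmar syllable, or in other words the writing rule for Myanmar syllables.
There are two other writing methods, called Kinzi and stacked consonants. In Kinzi, depending on the word, the vowel of the preceding consonant is raised above the following consonant. In stacked-consonant writing, the following consonant is written below the vowel of the preceding consonant. However, Kinzi and stacked consonants are used only for particular words. That covers the nature of Myanmar words and the syllable writing rules. As I said earlier, Myanmar sentences have no whitespace between words, so words cannot be segmented directly, and syllable segmentation has to be performed first. Next I will explain the algorithm I built for syllable segmentation.
As you can see, the first figure shows the algorithm in detail and the second shows it in short form. First, the first consonant of the input sentence is extracted, and the syllables for that consonant are looked up in the syllable database. The extracted syllable is then stored and output. Before explaining the algorithm in detail, I will first describe the syllable database that this method uses, called Syllable Collections.
To build the syllable database, I first use MySQL to create one table for each of the 33 Myanmar consonants.
The syllables for each consonant are then stored in that consonant's table. This builds the Syllable Collections data. Next I will explain the algorithm.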
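The one-table-per-consonant layout described above can be sketched in memory. This is only an illustrative stand-in for the MySQL tables: the sample syllables, the name `SAMPLE_SYLLABLES`, and the function `build_syllable_collections` are assumptions, since the slides do not list the actual table contents.

```python
from collections import defaultdict

# Hypothetical sample data; the real Syllable Collections are not given in the slides.
SAMPLE_SYLLABLES = ["က", "ကောင်း", "ကျောင်း", "သူ", "သား"]


def build_syllable_collections(syllables):
    """Group syllables by their first character, one 'table' per consonant."""
    tables = defaultdict(set)
    for syllable in syllables:
        # The consonant is written first, so the first character picks the table.
        tables[syllable[0]].add(syllable)
    return dict(tables)


collections = build_syllable_collections(SAMPLE_SYLLABLES)
for first_char, table in collections.items():
    print(first_char, sorted(table))
```

Keying on the first character means a lookup only ever scans the syllables that could possibly start at the current position, which is the filtering step the method is named after.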
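First, to pre-process the input, whitespace and punctuation marks are stripped out and the number of input sentences is counted. The length of each sentence is then stored. As an example, let us segment the sentence meaning "The student goes to school." First, as you can see, all whitespace is removed from the sentence, producing a new sentence with no whitespace or punctuation.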
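The pre-processing step can be sketched as follows. The sketch assumes that sentences end with the Myanmar section sign (U+104B) and that the only other punctuation to strip is the little section sign (U+104A); the slides do not say which marks are handled, and the sample text and the name `preprocess` are made up for illustration. Lengths here are counted in Unicode code points, which may differ from the unit behind the slide's `Input_txt_length`.

```python
import re

SENTENCE_END = "\u104b"  # Myanmar sign section
PHRASE_MARK = "\u104a"   # Myanmar sign little section


def preprocess(text):
    """Split text into sentences and strip whitespace and punctuation."""
    sentences = []
    for raw in text.split(SENTENCE_END):
        cleaned = re.sub(r"\s+", "", raw).replace(PHRASE_MARK, "")
        if cleaned:  # skip the empty piece after the last sentence mark
            sentences.append(cleaned)
    return sentences


# Toy input built from sample syllables, not a sentence from the slides.
sample = ("သူ ကျောင်း သွား" + SENTENCE_END
          + " သူ" + PHRASE_MARK + " ကောင်း" + SENTENCE_END)
sentences = preprocess(sample)
lengths = [len(s) for s in sentences]
print(len(sentences), lengths)
```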
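As I said earlier, the consonant is written first in a Myanmar word, so the first consonant is extracted from the sentence. With that first consonant in hand, the method is ready to look up syllables in the syllable database.
The syllables in the table for the extracted consonant are fetched, in preparation for finding candidate syllables.
The length of each syllable in the table is extracted first, and for each syllable length a piece of that length is cut from the very start of the input sentence to form a candidate syllable. The candidate is then compared with the syllable in the table. If the two are identical, the candidate is saved to a text file called candidates. Otherwise it is skipped.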
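The candidate extraction just described can be sketched like this. It keeps the candidates in a list instead of a candidates text file, and the table contents and the name `extract_candidates` are assumptions made for the example.

```python
def extract_candidates(input_txt, table):
    """Return every syllable in the table that matches the start of input_txt."""
    candidates = []
    for syllable in table:
        # Cut a piece of the same length as the syllable from the start of the input.
        candidate = input_txt[:len(syllable)]
        # Keep the piece only if it is identical to the table syllable.
        if candidate == syllable:
            candidates.append(candidate)
    return candidates


# Hypothetical table for one consonant, and an input that starts with its second entry.
table = ["က", "ကောင်း", "ကျောင်း"]
input_txt = table[1] + "သူ"
candidates = extract_candidates(input_txt, table)
print(candidates)
```

Several candidates can match at once, because a short syllable may be a prefix of a longer one. That is why a separate step is needed to choose among them.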
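Next, the longest syllable in the candidates text file is selected and saved as the final syllable to a text file called result. Once the final syllable has been saved, the candidates file is deleted. The length of the final syllable is then taken, and that much text is removed from the start of the input sentence. What remains becomes the new input sentence, ready for finding the next syllable. That covers the First-Character Filtering algorithm. Next I will show the implementation and results.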
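Putting the steps together, the whole loop can be sketched as below. This is a sketch under assumptions, not the presented implementation: it uses in-memory structures instead of MySQL and text files, the sample syllables are hypothetical, and the fallback that emits a single character when no table entry matches is added here only so the loop always terminates.

```python
def segment(sentence, collections):
    """Segment a pre-processed sentence by longest match on first-character tables."""
    result = []
    while sentence:
        # Filter by first character: only this table can contain a match.
        table = collections.get(sentence[0], ())
        candidates = [s for s in table if sentence.startswith(s)]
        # Assumption: fall back to one character if the dictionary has no entry.
        final = max(candidates, key=len) if candidates else sentence[0]
        result.append(final)
        # Remove the final syllable from the start of the input.
        sentence = sentence[len(final):]
    return result


# Hypothetical sample syllables grouped by first character.
THU, KYAUNG, KAUNG = "သူ", "ကျောင်း", "ကောင်း"
collections = {}
for syl in (THU, KYAUNG, KAUNG, KAUNG[0]):
    collections.setdefault(syl[0], set()).add(syl)

syllables = segment(THU + KYAUNG + KAUNG, collections)
print(syllables)
```

Choosing the longest candidate is what stops the bare consonant from being emitted when a full syllable beginning with it is present in the table.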
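Here the sentence we entered earlier has been run through syllable segmentation. As you can see, the final syllables are extracted from the sentence and saved to the result file, and the result is displayed.
This is the result of an experiment on one or more sentences taken from a Myanmar news website.
Finally, let me describe the future work I plan to do. So far, syllable segmentation for Myanmar runs without errors when applied to the general writing method, while loanwords, Kinzi writing, and stacked-consonant writing are still under study. I also plan to work on segmenting syllables by syllable rules even when the syllable is missing from the syllable database. Thank you.
If you have any questions, please ask.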
