Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development

Pann Yu Mon Maung Maung Thant Ohnmar Htun Pe San Ko Oo Yoshiki Mikami
†Management and Information Systems Engineering Department, Nagaoka University of Technology
††International University of Japan

Abstract

This paper introduces an automatic Myanmar word analysis program developed for ongoing research on Myanmar search engine development. We collected Myanmar words from documents on the World Wide Web to find out which words are used most frequently. The program is designed for encodings compatible with the Unicode 5.1 standard, and it automatically generates a Markov chain matrix from the resulting words. The program was written in PHP. The head words of a Myanmar-English dictionary are used as index words.

Keywords

Myanmar, Code conversion tools, Myanmar word searching

1. Introduction

The Myanmar language, a member of the Tibeto-Burman subfamily of the Sino-Tibetan language family, is spoken as a mother tongue by more than 37 million Burmese and as a second language by about 20 million members of ethnic minorities in Myanmar. It is the only official language of Myanmar, formerly known as Burma. The Myanmar language is written in a script of circular and semi-circular letters adopted from the Mon script. The Mon script is in turn derived from the Indian Brahmi script, which flourished in the Indian subcontinent between the 5th century B.C. and the 3rd century A.D. According to traditional grammar, the Myanmar language has 33 consonants and 12 vowels.

Since 1990, Myanmar natural language processing tasks have been carried out by the Myanmar Unicode & NLP Research Center. The first Myanmar Unicode font for a GUI environment (Mac) was developed in 1988, and the one for Windows was developed in 1992. In 1998, Myanmar language processing was first discussed at ISO/IEC JTC1 and the Unicode Technical Committee, and the Myanmar character code set was finally included in ISO 10646. The center continues to work on Myanmar language processing so that it copes well with all applications, and covering the whole area still requires considerable effort.

In this research, we propose a program that automatically collects Myanmar words from Myanmar Web pages. The main purpose of this research is to present an analysis of Myanmar words found on Myanmar Web pages in support of Myanmar search engine development. Establishing a Myanmar search engine requires many components, such as indexing rules, a sorting algorithm, a stemming algorithm and a word-breaking algorithm.

In this study, we have collected Myanmar Web pages from various Web sites
including Myanmar daily newspapers, community Web sites and news Web sites, totaling 9,274 Kbytes. We then extracted words from the downloaded Myanmar Web pages. The detailed process of collecting the words and the analysis of the resulting data are discussed in the following sections.

2. Related Research

A number of researchers, both local and worldwide, have collected Myanmar words from different sources for their individual purposes.

In 2007, the Myanmar Unicode and NLP Research Center started developing the Myanmar National Corpus (MNC) [5]. The MNC includes both written and spoken text from various resources. That project is almost finished.

Hla Hla Htay and colleagues [2] have developed Myanmar corpora based on various resources, such as text from official newspapers in Myanmar, over 300 full books, and Myanmar texts from various Web sites including news sites and on-line magazines. In their research, all tasks were processed in ASCII format.

3. Methodology

Figure. 1. Step by step Procedure of Analysis

The step-by-step process of our analysis is shown in Figure 1. First, Myanmar Web pages are collected regardless of their fonts and encodings. Then they are passed to a multi-font converter that converts them to Unicode 5.1. Finally, the program searches for words in the input text, and the resulting words are saved in the database. Each step is explained in more detail in the following subsections.

3.1.First Step : Downloading Myanmar Web Pages

The World Wide Web is the most convenient existing source of linguistic data, providing users with an abundance of texts of various types in a large number of languages. Because these texts are already in electronic form, they are well suited to corpus studies.

Downloading Myanmar Web pages requires a very efficient crawler that can selectively collect only Myanmar Web pages from the World Wide Web. In this research, the Language Specific Crawler (LSC) developed by one of the authors [3] was used. LSC runs concurrently with a language identifier and collects Myanmar Web pages efficiently. The following table lists the sources of the downloaded Web sites. After downloading, the pages were passed to the converter.

Table 1. Detail Information for source data

3.2.Second Step : Conversion of various encoding to Unicode 5.1 Standard

Myanmar texts on the Web use various encodings that are not fully compliant with Unicode 5.1, so the crawled Web pages must be converted to Unicode. If a Web page is already encoded in Unicode, the work becomes easier.

In order to convert various Myanmar encodings to Unicode, an efficient converter is needed. Currently, there are a number of Myanmar font conversion tools available on the
Web. In this research, the Kanaung converter¹ and the Burglish converter² were used. Although both of them work well, their output still needs some editing. For example, the Kanaung converter could not convert ‘ ’ and ‘ ’ correctly. The Burglish converter works correctly when converting from the “Zawgyi-One” font to the “Myanmar3” font, but when converting from the “Wininwa” font to the “Myanmar3” font it cannot convert ‘ ’ accurately. It also does not handle punctuation marks and quotation marks correctly. Manual correction is therefore needed in those cases, although the converters are otherwise nearly accurate.

3.3. Third Step: Word Searching Algorithm

The Myanmar language is written in a syllabic system, and spaces are not always placed between words or sentences. For this reason, a word segmenting algorithm and a word searching algorithm are needed for the Myanmar language. Very little research, using different approaches, has been published on segmenting Myanmar sentences into words [1] [4].

In our program, all of the Myanmar head words included in the Myanmar-English Dictionary³ are used as the index file, which contains 28,000 Myanmar words. These head words are stored in the database and sorted in descending order of syllable length for comparison with the input data. If an input word matches one of the head words, the program retrieves that word. If an input word does not match any entry in the head word list, the program cannot retrieve the word correctly. The accuracy of this algorithm therefore depends largely on the head word list.

Our algorithm uses longest matching to find words in the input data. It starts at the first character of the text and, using the head word list, attempts to find the longest word in the list that begins there. If such a word is found, the longest-matching algorithm marks a boundary at the end of that word and then repeats the same process, searching for the longest match starting at the characters that follow the match. If no match is found in the word list, the character is simply segmented as a word.

3.4. Fourth Step: Frequency Markov Chain Analysis

The program also uses word-based Markov models to calculate a word matrix table that captures word adjacency in sentences (that is, which word most frequently appears after a given word). This gives us high-level background information for word boundary detection when parsing the Myanmar language. Our program first finds the words on the given Web pages and calculates how many times each word appears on the Web sites. After that, the Markov chain matrix table is generated automatically.

4. Result

We downloaded various Web sites, including newspaper sites, blog sites, entertainment sites and sport sites, and collected 9,274 Kbytes of text data. After running the program, a total of 766,892 words were collected and 12,211 unique head words were found.

4.1. Distribution of Words on input string

Mono-syllable words were found to be the most frequently used, because they can be used in several ways. For example, the mono-syllable “ ” was found more than 20,000 times, because it can be used in different ways: in case 1, as a polite prefix to a young man’s name (as in “ ”); in case 2, as a postpositional marker indicating the objective case (as in “ ”); in case 3, as an emphatic particle suffixed to words (as in “ ”); and in case 4, as a postpositional marker indicating destination (as in “ ”). Bi-syllable words are the second most frequent, followed by tri-syllable words and so on. The top ten words sorted by frequency for mono-syllable, bi-syllable, tri-syllable and tetra-syllable words are shown in the following tables.

¹
²
³ Myanmar-English dictionary produced by Department of the Myanmar Language Commission
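The longest-matching procedure of Section 3.3 can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP program: the function name `longest_match_segment` and the toy head-word set are hypothetical, and the sketch compares candidates by character length, whereas the paper sorts head words by syllable length.

```python
def longest_match_segment(text, head_words):
    """Greedy left-to-right longest matching against a head-word list.

    Assumption: candidates are measured in characters, not syllables.
    """
    max_len = max((len(w) for w in head_words), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a head word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in head_words:
                break
        else:
            # No head word starts here: segment the single character as a word.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Toy Latin-letter head words stand in for dictionary entries (hypothetical data).
toy_head_words = {"ab", "abc", "cd", "d"}
print(longest_match_segment("abcdxab", toy_head_words))
# ['abc', 'd', 'x', 'ab']
```

In the toy input the greedy choice of "abc" rules out the alternative split "ab" + "cd", which illustrates the kind of segmentation ambiguity that a longest-match strategy resolves only by preferring the longer head word.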
Table 2. Top ten mono-syllable words

Mono-Syllable   Meaning                                                 Frequency
[ko]            Postpositional marker to indicate objective case       20070 (2.61%)
[ma]            Particle prefixed to a verb to form the negative sense  18181 (2.40%)
[ka]            Postpositional marker to indicate nominative case      17469 (2.30%)
[tal]           Colloquial form of the sentence final                   14424 (1.90%)
[par]           Particle denoting inclusion                             12774 (1.70%)

Table 3. Top ten bi-syllable words

Bi-Syllable     Meaning       Frequency
[Kyun taw]      I (male)      3537 (0.46%)
[Kyun ma]       I (female)    3332 (0.43%)
[Ka lay]        Child         1994 (0.26%)
[A twat]        For           1981 (0.25%)
[Ae di]         That          1737 (0.22%)

Table 4. Top ten tri-syllable words

Tri-Syllable      Meaning            Frequency
[Tha yot saung]   Actor              627 (0.08%)
[Pa ri thet]      Audience           500 (0.06%)
[Sa yar ma]       Teacher (female)   495 (0.06%)
[Thu nge chin]    Friend             404 (0.05%)
[Main ka lay]     Girl               400 (0.05%)

Table 5. Top ten tetra-syllable words

Tetra-Syllable     Meaning        Frequency
[sar yay sa yar]   Author         222 (0.02%)
[a nu pa nyar]     Art            204 (0.02%)
[a chay a nay]     Condition      176 (0.02%)
[a yay a tar]      Writing        157 (0.01%)
[a mhat ta ya]     Remembrance    138 (0.01%)

Figure. 2. Number of Syllables found in Test Data (x-axis: number of syllables; y-axis: number of collected words)

Number of Syllables   Number of collected words
Mono-Syllable         581,355
Bi-Syllable           147,100
Tri-Syllable          27,770
4-Syllable            9,752
5-Syllable            758
6-Syllable            117
7-Syllable            16
8-Syllable            5
9-Syllable            17
10-Syllable           2
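The bar values of Figure 2 can be cross-checked against the total of 766,892 words reported in Section 4, and the same arithmetic gives the percentage shown next to each frequency in the tables. The short Python sketch below uses only numbers from the paper; the names `words_by_syllables` and `share` are illustrative.

```python
# Word counts by syllable length, copied from the bar labels of Figure 2.
words_by_syllables = {
    1: 581_355, 2: 147_100, 3: 27_770, 4: 9_752, 5: 758,
    6: 117, 7: 16, 8: 5, 9: 17, 10: 2,
}

# The bars should add up to the total number of collected words.
TOTAL_WORDS = sum(words_by_syllables.values())


def share(count, total=TOTAL_WORDS):
    """Return count as a percentage of all collected words."""
    return 100.0 * count / total


for n, count in words_by_syllables.items():
    print(f"{n:>2}-syllable: {count:>7,} words ({share(count):.2f}%)")
```

With these figures the bars sum to exactly 766,892, and mono-syllable words account for roughly three quarters of all collected words.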
4.2. Word Level Frequency Matrix

Based on the input string, the program generates a word-level Markov table. From this matrix we can identify adjacent word pairs, which gives us high-level background information for parsing a sentence into words. By applying the same algorithm at the character level, we can also generate a character-level Markov table, which can be used in Myanmar character input methods for mobile phones.

Table 6. Word-Level Matrix (sum of frequency; rows: first word, columns: second word)
1144 1144 722 1273 1217 4893 1564 2343 1339 1511 2850 934 934 1205 1717 2922 809 1754
Grand Total 1205 722 2651 1339 1273 2373 1511 1144 1217 16840

4.3. Distribution of characters on Input String

We analyzed the character-level frequency of the input data. The result is shown in Figure 3. More than 90,000 words begin with “ ”, which makes it the top-ranked initial character, followed by “ ” and so on. No words were found that start with the characters “ ”. We could not find such words even in the Myanmar-English dictionary.

Figure. 3. Total Frequency of Myanmar Characters found in Test Data (x-axis: list of characters; y-axis: number of collected words)
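The word-level Markov table of Section 4.2 is essentially a table of adjacent-pair counts. The following Python sketch shows one way to build it, under the assumption that the text has already been segmented into a token list; the function names and the toy tokens are hypothetical and do not come from the authors' PHP program.

```python
from collections import Counter, defaultdict


def bigram_table(tokens):
    """Count how often each token is immediately followed by each other token."""
    table = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table


def most_frequent_successor(table, first):
    """Return the token that most often follows `first`, or None if unseen."""
    row = table.get(first)
    if not row:
        return None
    return row.most_common(1)[0][0]


def transition_probabilities(table, first):
    """Normalize one row of counts into first-order Markov probabilities."""
    row = table.get(first, Counter())
    total = sum(row.values())
    return {second: count / total for second, count in row.items()}


# Word level: toy tokens stand in for segmented Myanmar words.
words = ["w1", "w2", "w1", "w3", "w1", "w2"]
word_table = bigram_table(words)
print(most_frequent_successor(word_table, "w1"))   # w2
print(transition_probabilities(word_table, "w1"))

# Character level: the same function applied to a string of characters.
char_table = bigram_table("abab")
print(dict(char_table["a"]))                       # {'b': 2}
```

Because `bigram_table` only needs a sequence, the same code serves both the word-level table and the character-level table mentioned above; only the tokenization differs.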
5. Error Analysis

In our test data of 9,274 Kbytes, we found 2,935,233 characters, excluding punctuation marks, numerals and English words. In terms of words, we identified a total of 766,892 Myanmar words (12,211 unique headwords), but 5,861 words (0.76%) were not identified. The errors result from incorrect spelling in the original text, undefined headwords (proper nouns that are not defined in the dictionary) and incorrect descriptions of syllable length in the database. In addition, some errors result from words ending with certain characters such as “ ” (U+1037, Myanmar Sign Dot Below) and from ambiguity in word segmentation. Some examples of errors are listed in Table 7.

Table 7. Some Examples of errors

6. Conclusion

In this paper, we presented a word segmentation program for Myanmar text based on a longest string matching algorithm and a dictionary. We also presented word-level and character-level frequency distributions and the word-level Markov table generated by this program. The program performed the segmentation work well and showed that it can serve as a practical word segmentation engine for various NLP applications, including a Myanmar search engine (in particular, a word stemming engine). The statistical data generated by this program is useful as background information for designing various Myanmar NLP applications, including input systems. As future work, we plan to extend our program by collecting all possible Myanmar words, including not only conversational words but also proper nouns. We expect this ongoing research will yield benefits for our Myanmar search engine development task.

Acknowledgements

We acknowledge and highly appreciate the kind assistance and help given by the Myanmar Unicode & NLP Research Center. We would like to express our thanks to Dr. Daw Myint Myint Than and U Ngwe Tun, who kindly provided us with the data we needed.

References

[1] Hla Hla Htay et al., “Myanmar Word Segmentation using Syllable level Longest Matching”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.
[2] Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy, “Constructing English-Myanmar Parallel Corpora”, The Fourth International Conference on Computer Application, 2006.
[3] Pann Yu Mon, Chew Yew Choong and Yoshiki Mikami, “Language Specific Crawler for Myanmar Pages”, Proceedings of the 11th International Conference on Humans and Computers (HC 2008), Nagaoka, Japan, November 2008.
[4] Tun Thura Thet et al., “Word Segmentation of the Myanmar Language”, Journal of Information Science, Vol. 34, No. 5, pp. 688-704, 2008.
[5] Wunna Ko Ko and Thin Zar Phyo, “Selection of XML tag set for Myanmar National Corpus”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.