
Token classification using Bengali Tokenizer


  1. Token Classification in the Bengali Language Using a Bangla Tokenizer. Presented by Sujit Kumar Das, M.Tech 3rd semester, IT, Roll 021413, No. 363202205. Under the supervision of Mr. Sourish Dhar, Asst. Professor, Dept. of IT, Assam University.
  2. Contents:
     - Introduction
     - Literature Survey
     - Our Proposal
     - Future Works To Be Done
     - Conclusions
     - References
  3. Introduction: What is token classification? Token classification means identifying each token (word or term) in a document and classifying it into one of several predefined categories. These predefined categories can be person names, symbols, punctuation marks, abbreviations, numbers, dates, etc.
  4. Steps in Token Classification:
     - Tokenize the given input text.
     - Assign to each token the class (or tag) that it belongs to.

     For example:

     | Token  | Class  |
     |--------|--------|
     | মাইকেল | Name   |
     | ৪৫     | Number |
     | খবর    | Word   |
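The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in these slides: whitespace splitting stands in for a real tokenizer, and `KNOWN_NAMES` is a hypothetical one-entry name list used only to reproduce the example table.

```python
# Hypothetical name list; a real system would need a far larger resource.
KNOWN_NAMES = {"মাইকেল"}


def tokenize(text):
    """Step 1: split the input text into tokens (whitespace only, as a stand-in)."""
    return text.split()


def classify(token):
    """Step 2: assign a class to a single token."""
    if token in KNOWN_NAMES:
        return "Name"
    if token.isdigit():  # str.isdigit() is also true for Bengali digits such as ৪৫
        return "Number"
    return "Word"


def classify_tokens(text):
    """Run both steps and pair every token with its class."""
    return [(token, classify(token)) for token in tokenize(text)]


print(classify_tokens("মাইকেল ৪৫ খবর"))
```

A lookup list only recognizes names it already contains, which is one reason the later slides plan stemming and POS tagging before classification.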
  5. Introduction: What is tokenization? Tokenization is the process of breaking a stream of text up into words, phrases, symbols, and other meaningful elements called tokens. A token is a sequence of characters that can be treated as a single logical entity. Typically, tokens are:

     | Natural languages | Programming languages |
     |-------------------|-----------------------|
     | Words             | Identifiers           |
     | Numbers           | Keywords              |
     | Abbreviations     | Operators             |
     | Symbols           | Special symbols       |
     |                   | Constants             |
  6. Introduction (cont.): What is a tokenizer? The job of a tokenizer is to break up a stream of text into tokens. Why a tokenizer?
     - It performs a crucial task in pre-processing any natural language.
     - It helps handle semantic issues in the subsequent stages of machine translation.
     - It produces a structural description of an input sentence.
     - For language modeling, dividing the input text into tokens is mandatory [9].
  7. Literature Survey:
     - A tokenizer is a component of a parser. Parsing natural language text is more difficult than parsing computer languages (such as those handled by compilers and word processors), because the grammars of natural languages are complex and ambiguous and their vocabularies are unbounded [8].
     - Natural language applications, namely information extraction, machine translation, and speech recognition, need an accurate parser [8].
     - A tokenizer plays a significant part in a parser by identifying groups of words that exist as a single, complex word in a sentence. It then breaks the complex word up into its constituents in their appropriate forms [2].
  8. Literature Survey (cont.), Related Works: Some existing standard tokenizers:
     - Stanford Tokenizer for the English language [10].
     - Shallow Tokenizer for the Bengali language.
     - Vaakkriti Tokenizer for the Sanskrit language [2].

     Each of these tokenizers was developed for particular languages only; no single tokenizer works for all languages.
  9. Literature Survey (cont.), Stanford Tokenizer:
     - Developed mainly for English, and later extended to Arabic, Chinese, and Spanish.
     - Implemented in Java.

     [Figure: online interface]
  10. Literature Survey (cont.): [Figure: results after parsing]
  11. Literature Survey (cont.), Shallow Bangla Tokenizer: The shallow parser analyzes a sentence in terms of:
      - Morphological analysis.
      - POS tagging.
      - Chunking.

      Apart from the final output, the intermediate output of the individual modules is also available.
  12. Literature Survey (cont.): [Figure: online interface]
  13. Literature Survey (cont.): [Figure: result after submitting]
  14. Literature Survey (cont.), Bengali Stemmers:
      - A rule-based stemmer for the Bengali language, by Sandipan Sarkar (IBM) and Sivaji Bandyopadhyay (Jadavpur University) [12].
      - A light weight stemmer for Bengali, which was used in a spelling checker, by Md. Zahurul Islam, Md. Nizam Uddin, and Mumit Khan (CRBLP, BRAC University, Dhaka), 2007 [13].
      - Yet Another Suffix Stripper (YASS), which uses a clustering-based approach built on string distance measures and requires no linguistic knowledge, by P. Majumder, Gobinda Kole (ISI), Pabitra Mitra (IIT), and Kalyankumar Dutta (Jadavpur University), 2007 [14].
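  15. Literature Survey (cont.), Comparison of the Three Stemmers:

      | Stemmer      | Method used             | Accuracy (%) |
      |--------------|-------------------------|--------------|
      | Rule-based   | Orthographic syllable   | 89.0         |
      | Light weight | Longest match basis     | 90.8         |
      | YASS         | String distance measure | 88.0         |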
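The "longest match" idea named in the table can be sketched as generic suffix stripping. The suffix list and the minimum stem length below are illustrative assumptions, not the rules of any of the three cited stemmers, and the function makes a single pass only.

```python
import unicodedata


def nfc(text):
    """Normalize so that visually identical Bengali strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Illustrative suffixes only (plural, genitive, objective and definite markers).
SUFFIXES = sorted(
    (nfc(s) for s in ["গুলো", "দের", "ের", "কে", "রা", "টি"]),
    key=len,
    reverse=True,
)
MIN_STEM_LENGTH = 2  # assumed guard against stripping a word down to nothing


def stem(word):
    """Strip the longest matching suffix, if the remaining stem is long enough."""
    word = nfc(word)
    for suffix in SUFFIXES:  # longest suffixes are tried first
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for example in ["ছেলেদের", "বইগুলো", "মানুষের", "খবর"]:
    print(example, "->", stem(example))
```

Trying longer suffixes first matters: a word ending in "দের" also ends in "ের", and stripping the shorter suffix would leave part of the inflection attached to the stem.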
  16. Literature Survey (cont.), POS Tagger:
      - Supervised POS tagging: uses pre-tagged corpora for training, in order to learn information about the tagset, word-tag frequencies, rule sets, etc. [11]. Examples: N-gram, Maximum Entropy model (ME), Hidden Markov Model (HMM).
      - Unsupervised POS tagging: does not require a pre-tagged corpus. It uses advanced computational methods to induce tagsets automatically. Examples: Brill, Baum-Welch algorithm [11].
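As a rough illustration of how a supervised tagger learns word-tag frequencies from a pre-tagged corpus, the sketch below trains a most-likely-tag (unigram) baseline. The four-sentence corpus and the tag names `PRON`, `NOUN`, and `VERB` are made up for the example and are not taken from the SPSAL corpus or tagset.

```python
from collections import Counter, defaultdict

# Toy pre-tagged corpus; words and tag names are illustrative assumptions.
CORPUS = [
    [("আমি", "PRON"), ("ভাত", "NOUN"), ("খাই", "VERB")],
    [("সে", "PRON"), ("বই", "NOUN"), ("কেনে", "VERB")],
    [("আমি", "PRON"), ("বই", "NOUN"), ("কিনি", "VERB")],
    [("রহিম", "NOUN"), ("ভাত", "NOUN"), ("আনে", "VERB")],
]


def train_unigram(tagged_sentences):
    """Count word-tag frequencies and keep the most frequent tag per word."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    most_likely = {
        word: counts.most_common(1)[0][0]
        for word, counts in word_tag_counts.items()
    }
    fallback = tag_counts.most_common(1)[0][0]  # used for unseen words
    return most_likely, fallback


def tag(words, most_likely, fallback):
    """Tag each word independently of its neighbours."""
    return [most_likely.get(word, fallback) for word in words]


most_likely, fallback = train_unigram(CORPUS)
print(tag(["আমি", "বই", "আনে", "কলম"], most_likely, fallback))
```

Because every word is tagged in isolation, this baseline cannot use context to resolve ambiguous words.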
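  17. Literature Survey (cont.), Supervised POS Taggers Comparison:

      | Tagger          | Applied method                                                                    |
      |-----------------|-----------------------------------------------------------------------------------|
      | Uni-gram (N=1)  | Most likely approach                                                              |
      | HMM             | One sentence at a time. Formula: P(word \| tag) * P(tag \| previous n tags)       |
      | Bi-gram (N=2)   | Same as uni-gram, but also considers the tag of the previous word                 |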
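The HMM formula above, P(word | tag) * P(tag | previous n tags), can be made concrete with a small Viterbi decoder. This sketch assumes n = 1 (only the previous tag is used), add-one smoothing, and a made-up toy corpus and tag names; it is not the tagger evaluated in [11].

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag that precedes the first word of a sentence

# Toy pre-tagged corpus; words and tag names are illustrative assumptions.
CORPUS = [
    [("আমি", "PRON"), ("ভাত", "NOUN"), ("খাই", "VERB")],
    [("সে", "PRON"), ("বই", "NOUN"), ("কেনে", "VERB")],
    [("আমি", "PRON"), ("বই", "NOUN"), ("কিনি", "VERB")],
    [("রহিম", "NOUN"), ("ভাত", "NOUN"), ("আনে", "VERB")],
]


def train_hmm(tagged_sentences):
    """Collect emission counts (tag -> word) and transition counts (prev tag -> tag)."""
    emit, trans, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            vocab.add(word)
            prev = tag
    return emit, trans, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under the counts in `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, emit, trans, vocab):
    """Return the tag sequence maximizing the product of the two HMM factors."""
    if not words:
        return []
    tags = list(emit)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 leaves room for unknown words
    # Each column maps tag -> (best log score so far, previous tag on that path).
    columns = [{
        t: (log_prob(trans[START], t, n_tags) + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            emission = log_prob(emit[t], word, n_words)
            column[t] = max(
                (columns[-1][p][0] + log_prob(trans[p], t, n_tags) + emission, p)
                for p in tags
            )
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    best = max(tags, key=lambda t: columns[-1][t][0])
    path = [best]
    for column in reversed(columns[1:]):
        best = column[best][1]
        path.append(best)
    return path[::-1]


emit, trans, vocab = train_hmm(CORPUS)
print(viterbi(["সে", "ভাত", "আনে"], emit, trans, vocab))
```

Working in log space avoids numeric underflow on longer sentences, and decoding the whole sentence at once matches the "one sentence at a time" note in the table.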
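  18. Literature Survey (cont.):

      | Sentences | Tokens | Uni-gram accuracy (%) | Bi-gram accuracy (%) | HMM accuracy (%) |
      |-----------|--------|-----------------------|----------------------|------------------|
      | 87        | 1002   | 28.6                  | 28.6                 | 39.3             |
      | 304       | 4003   | 42.4                  | 41.9                 | 49.7             |
      | 532       | 8026   | 48.1                  | 47.9                 | 53.6             |
      | 677       | 10001  | 49.8                  | 49.5                 | 54.3             |

      Bangla: SPSAL corpus and tagset. Test data: 400 sentences, 5225 tokens from the SPSAL test corpus [11].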
  19. Literature Survey (cont.), Problem Domain:
      - Bangla is very rich in inflections, vibhaktis (suffixes), and karakas, and these are often ambiguous as well.
      - It is not easy to provide the necessary semantic and world knowledge that humans often use when parsing and understanding Bangla sentences.

      So, mainly because of this grammatical vastness, designing a Bangla tokenizer is not an easy task.
  20. Literature Survey (cont.), Bengali Grammar: Parts of speech (POS)
  21. Literature Survey (cont.), Bengali Grammar: Genders. There are four genders in Bengali grammar: (1) Pung lingo (masculine), (2) Stree lingo (feminine), (3) Ubha lingo (common), (4) Klib lingo (neuter).
  22. Literature Survey (cont.), Bengali Grammar: Numbers. Like English, Bengali has two grammatical numbers:
      - Singular: used when referring to a single object or person, e.g. a man, a girl.
      - Plural: used when referring to more than one object or person, e.g. two men, mangoes.
  23. Our Proposal: We are going to develop a system that can tokenize Bengali text and that can also solve the problem of token classification. Resources used:
      - Platform: Windows 7
      - Front end: ASP.NET 4.0
      - Back end: Microsoft Excel spreadsheet
      - Language: C# (C-sharp)
  24. Our Proposal (cont.), Flow Chart: Text Input → Words → Stop Words Removal → Stemming → POS Tag → Classify
  25. Our Proposal (cont.):
      - Input: The input will be a Bengali text.
      - Words (done): The text is split into words after all non-character symbols and white space are removed, and the words are then stored in an Excel file.
      - Stop words removal (done): Stop words are frequently occurring words that do not add relevant information for the text classification task.
      - Root words: The original form of a word that remains after its prefixes and suffixes are stripped is known as its root.
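The word-splitting and stop-word-removal steps can be sketched as follows. The project itself uses C# and an Excel file; this Python version is only an illustration, with CSV output standing in for Excel and a tiny hypothetical stop-word list. The Bengali Unicode block (U+0980 to U+09FF) is matched explicitly because Python's `\w` does not match Bengali vowel signs and would split words apart.

```python
import csv
import io
import re
import unicodedata

# One or more characters from the Bengali block. Note that this range also
# contains the Bengali digits, so a number such as ৪৫ is kept as a "word" here.
WORD_RE = re.compile(r"[\u0980-\u09FF]+")

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {unicodedata.normalize("NFC", w) for w in ["এবং", "ও", "এই", "থেকে"]}


def split_words(text):
    """Drop white space, punctuation and other non-Bengali characters."""
    return WORD_RE.findall(unicodedata.normalize("NFC", text))


def remove_stop_words(words):
    """Keep only the words that are not in the stop-word list."""
    return [word for word in words if word not in STOP_WORDS]


def write_words(words, handle):
    """Store one numbered word per row (CSV as a stand-in for the Excel file)."""
    writer = csv.writer(handle)
    for index, word in enumerate(words, start=1):
        writer.writerow([index, word])


words = split_words("আমি এবং সে বই কিনি।")
kept = remove_stop_words(words)
buffer = io.StringIO()
write_words(kept, buffer)
print(kept)
print(buffer.getvalue())
```

The sentence-final danda (।) lies outside the Bengali block, so the pattern discards it together with spaces and Latin punctuation.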
  26. Our Proposal (cont.):
      - POS tagging: After the root word has been found (stemming), each element is assigned to one of a set of previously defined classes. In this way, every word is tagged with its part of speech (POS).
      - Token classification: After the tokens have been obtained from the tasks above, they are categorized into predefined classes. The classes we will mainly consider are Title, Surname, Collocation, Punctuation, Abbreviation, Number, Date, Unknown, and Foreign word.
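A rule-based first pass over several of the classes listed above might look like the sketch below. It is an assumption-laden illustration rather than the planned system: the title and surname lists are hypothetical, the patterns are deliberately simple, the `Word` class is borrowed from the earlier example table, and Collocation is left out because it spans more than one token.

```python
import re
import unicodedata

# Hypothetical lookup lists; real ones would come from a lexicon.
TITLES = {"শ্রী", "জনাব"}
SURNAMES = {"দাস", "ধর"}

DIGITS = "0-9\u09E6-\u09EF"  # ASCII digits plus Bengali digits
NUMBER_RE = re.compile(rf"[{DIGITS}]+(?:[.,][{DIGITS}]+)*")
DATE_RE = re.compile(rf"[{DIGITS}]{{1,2}}[/-][{DIGITS}]{{1,2}}[/-][{DIGITS}]{{2,4}}")
ABBREVIATION_RE = re.compile(r"(?:[\u0980-\u09E5]{1,3}\.)+")  # short letter runs ending in "."
BENGALI_RE = re.compile(r"[\u0980-\u09FF]+")


def classify(token):
    """Return the first class whose rule matches the token."""
    if not token:
        return "Unknown"
    if token in TITLES:
        return "Title"
    if token in SURNAMES:
        return "Surname"
    if DATE_RE.fullmatch(token):
        return "Date"
    if NUMBER_RE.fullmatch(token):
        return "Number"
    if ABBREVIATION_RE.fullmatch(token):
        return "Abbreviation"
    if all(unicodedata.category(ch).startswith("P") for ch in token):
        return "Punctuation"
    if BENGALI_RE.fullmatch(token):
        return "Word"
    if token.isalpha():  # alphabetic, but not written in Bengali script
        return "Foreign word"
    return "Unknown"


for token in ["জনাব", "দাস", "১২/০৩/২০১৪", "৪৫", "ডা.", "।", "খবর", "Tokenizer", "x1"]:
    print(token, "->", classify(token))
```

The order of the checks matters: dates are tested before plain numbers, and lookup lists before the generic Bengali-word rule, so that the most specific class wins.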
  27. Current Status of Our Work: [Figure: Snapshot 1, system interface]
  28. Current Status of Our Work (cont.): [Figure: Snapshot 2, after loading the text using the Load button]
  29. Current Status of Our Work (cont.): [Figure: Snapshot 3, after getting tokens from the text]
  30. Current Status of Our Work (cont.): [Figure: Snapshot 4, tokens after removing stop words]
  31. Current Status of Our Work (cont.): [Figure: Snapshot 5, after execution the words are split and stored in an Excel file]
  32. Future Works To Be Done:
      - Stemming, i.e. finding root words.
      - POS tagging.
      - Classification.
  33. Conclusions: Although tokenization is a fundamental task in language processing, the richness of Bengali grammar and the structure of Bengali text make it a difficult task for the Bengali language. Stemming is also difficult. Building an effective Bangla tokenizer requires extensive knowledge of Bengali grammar. We hope to develop a system that overcomes the difficulties and limitations of existing Bangla tokenizers, produces tokens efficiently, and finally classifies those tokens.
  34. References:
      - [1] Wikipedia.
      - [2] Aasish Pappu and Ratna Sanyal, “Vaakkriti: Sanskrit Tokenizer”, Indian Institute of Information Technology, Allahabad (U.P.), India.
      - [3] Firoj Alam, S. M. Murtoza Habib, and Mumit Khan, “Text Normalization System for Bangla”, Center for Research on Bangla Language Processing, Department of Computer Science and Engineering, BRAC University, Bangladesh.
      - [4] Goutam Kumar Saha, “Parsing Bengali Text - an Intelligent Approach”, Scientist-F, Centre for Development of Advanced Computing (CDAC), Kolkata.
  35. References (cont.):
      - [5] Kumar Sanjeeb and Shibi Panikkar, “Magic of ASP.Net with C#”.
      - [6] www.C-sharpcorner.com
      - [7] Ilia Smirnov, “Overview of Stemming Algorithms”, http://the-smirnovs.org/info/stemming.pdf.
      - [8] K. M. Azharul Hasan, Al-Mahmud, Amit Mondal, and Amit Saha, “Recognizing Bangla Grammar Using Predictive Parser”, Department of Computer Science and Engineering (CSE), Khulna University of Engineering and Technology (KUET), Khulna-9203, Bangladesh.
      - [9] J. A. Mahar, H. Shaikh, and G. Q. Memon, “Model for Sindhi Text Segmentation into Word Tokens”, Faculty of Engineering, Science and Technology, Hamdard University, Karachi.
  36. References (cont.):
      - [11] Fahim Muhammad Hasan, “Comparison of Different POS Tagging Techniques for Some South Asian Languages”, BRAC University, Dhaka, Bangladesh.
      - [12] Sandipan Sarkar (IBM India) and Sivaji Bandyopadhyay (Computer Science and Engineering Department, Jadavpur University, Kolkata), “Design of a Rule-based Stemmer for Natural Language Text in Bengali”.
      - [13] Md. Zahurul Islam, Md. Nizam Uddin, and Mumit Khan, “A Light Weight Stemmer for Bengali and Its Use in Spelling Checker”, Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh.
      - [14] Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, and Gobinda Kole, “Yet Another Suffix Stripper”, Indian Statistical Institute.
  37. Thank You
