
Natural language processing with python and amharic syntax parse tree by daniel adenew msc



Natural Language Processing is an interdisciplinary field that gives computers the capability to communicate as human beings do. Amharic language processing has improved considerably over time thanks to researchers at PhD and MSc level at AAU. Here I have studied the problem and come up with a limited-scope solution that performs syntax parsing for Amharic and draws syntax parse trees using Python.

Published in: Technology


  1. 1. Amharic Language Syntax Parsing and Parse Tree By: Daniel Adenew MSC (AAU) source code:
  2. 2. Abstract Natural language processing (NLP) is a major field of study in computer science. Computers become far more capable when they are equipped with processing logic that lets them understand, interpret, and communicate in human language. A great deal of work has been done, and is still being done, to bring these communication abilities to computers, and the resulting techniques, tools, and scientific approaches are collectively referred to as NLP. For example, computers must handle characters, words, sentences, paragraphs, sounds, and speech more or less as a human being does. This report looks at how to enable computers to understand human-constructed sentences, a task known in NLP as syntax parsing: identifying how the words in a given sentence are related to each other. The report focuses only on syntax parsing of Amharic sentences; an example sentence is አበበ በሶ በላ፡፡ (some material omitted for space). Keywords: NLP, Python, Syntax Parser, CFG, PCFG, Grammar, Amharic Language Sentence, NLP Tools.
  3. 3. Background Amharic is the official language of Ethiopia. It is a morphologically rich language and shares many characteristics with other members of the Semitic family, such as Arabic and Hebrew. Amharic is the second largest Semitic language: speakers of Arabic number in the hundreds of millions, of Amharic in the tens of millions, and of Hebrew and Tigrinya in the millions [5]. Amharic is quite different in its spoken and written forms, because it has a complex morphology in which nouns (and adjectives) are inflected for gender, number, definiteness, and case. Definite markers and conjunctions are suffixed to nouns, while prepositions are prefixed. As in other Semitic languages, the verbal morphology is rich and based on triconsonantal roots. Several obstacles stand in the way of effective NLP for Amharic. One obstacle to developing NLP tools was the lack of standardization: an international standard for Ethiopic script was agreed on only in 1998 and incorporated into Unicode in 2000 [5]. Another major obstacle has been the lack of large-scale resources such as corpora, and of tools that can reliably handle the language's alphabet, the symbols called 'Fidel', because of the differences between ASCII and Unicode representations. I experienced this first-hand while developing this syntax parser.
  4. 4. Introduction Humans are naturally gifted with communication, whether spoken, signed, or written, and communication plays a vital role in our day-to-day activities. Computers, on the other hand, have a limited capability to communicate with humans. Because computers have become central to simplifying daily life, the need to increase their ability to communicate with humans effectively and efficiently keeps growing. Natural Language Processing, as a field of scientific inquiry, plays an important role in increasing computers' capability to understand natural languages, the languages in which most human knowledge is recorded. NLP is concerned with the design and implementation of tools, techniques, and frameworks that enable computers to communicate effectively with humans, and as humans do.
  5. 5. ..continued As a matter of fact, the tools mentioned above, and many other NLP tools, have been developed for English to a higher degree of acceptance, efficiency, and correctness than for Amharic. Numerous research efforts, completed and ongoing, aim to narrow this gap in different areas of Amharic NLP. Syntax parsing is one of the steps in designing a functional NLP application, and its output can serve as input to many other NLP applications such as grammar checkers, spell checkers, and spelling correction. The central task in syntax parsing is to break a sentence down into manageable components and to understand their context and their relation to each other, so that the correctness of the sentence can be determined. Sentences are the starting point when analyzing written material or documents. Syntax refers to the way words are related to each other in a sentence.
  6. 6. ..continued Today, parsers of different kinds (e.g. probabilistic, rule based) have been developed for languages that have relatively wide use nationally and/or internationally (e.g. English, German, Chinese, etc.) [1]. Example 1: the sentence አበበ የሰዉ አጥር አፈረሰ :: can be parsed as (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር)) (V አፈረሰ))). The syntax parse tree is drawn by the developed syntax parser application.
  7. 7. ..continued Example 2: the sentence አበበ በሶ በላ:: is parsed as (S (NP (N አበበ) (N በሶ)) (VP (V በላ)))
  8. 8. Statement of the problem What is really needed is a syntax parser that can automatically parse a given sentence regardless of its length, that can resolve ambiguities (for example by using probabilistic approaches), and that can be trained on sentences to learn how to parse. One illustration of the gap in NLP tools for Amharic is the Google online translation tool, which supports translation to and from many languages, including morphologically complex languages such as Hebrew and Arabic, but not Amharic.
  9. 9. Statement of the problem The major concern of this report is to contribute a little to research in Amharic NLP by developing a syntactic analyzer (i.e. a sentence parser) using rule-based and probabilistic grammar parsing. The approach followed in this study is to explore current and previous progress on syntax parsers, covering the relevant mechanisms, techniques, tools, theories, and algorithms, because syntax parsing is the second level of analysis in NLP and a very important component of many NLP applications already built, or still to be built, for Amharic. The design and development of the parser combines rule-based and statistical techniques. Statistical NLP applications of this sort require a large volume of data, such as a hand-tagged and hand-parsed corpus. Such corpora are currently available for many natural languages (for instance, English), but no such corpus is available for Amharic. Studies of this kind are believed to contribute to initiating the compilation and production of such a corpus.
  10. 10. Purpose of the Study The purpose of this report is to make a researcher like me familiar with the challenges of NLP for Amharic and with the tools and techniques for developing a syntax parser for Amharic, and so to help fill the gap left by the lack of one. As far as my exploration went in the time available to write the report, there is possibly no other syntax parser built with current technologies that can be used as a component in another NLP application. This report is believed to provide current information, experimental outputs, and challenges for future researchers, and to clear the road a little for syntax parsing in Amharic. It gives a general overview of the available grammar (syntax) parsing methods, algorithms, and tools that can achieve the desired output (a syntax parse tree for a given Amharic sentence), and it provides a sample that can strengthen Amharic syntax parsing, a problem which, in my opinion, is getting closer to being resolved in the near future. God willing, I would like to extend this work into my master's thesis and continue it in a PhD program.
  11. 11. Limitation of the study ● This study uses a very small sample prepared for the purpose of the work, owing to lack of time and the difficulty of finding a well-organized corpus, a machine-readable dictionary, or POS-tagged words. In particular, no POS tagger application for Amharic could be found, so a manual dictionary was used to POS-tag the words of a sentence and later construct a parse tree with my application. ● The prototype developed in this study is assumed to support Amharic sentences of ten or more words, but its real capability could not be established, mainly because of time constraints, my limited linguistic ability to determine the grammar rules and probabilistic rules (which I believe should be used together as a hybrid), and the unavailability of the processed data needed. The prototype can nevertheless support more complex sentences if these limitations are addressed.
  12. 12. Limitation of the study ● This report does not cover more advanced topics such as ambiguity resolution, but it shows sample parsing using probabilistic approaches. ● This study shows a statistical way of parsing a sentence, but the initial probabilities of the words and sentence components were assigned by the syntax parser developer (me). In the future, words and their probabilities should be fed in automatically, from a grammar read from a file (corpus) or a similar dynamic input mechanism.
  13. 13. Literature Review Sentences and Parsing A natural language system must have considerable knowledge about the structure of the language itself, including what the words are, how words are combined to form sentences, what the words mean, how word meanings contribute to sentence meanings, and so on (Allen, 95). The major purpose of parsing in general, and sentence parsing in particular, is extracting structural and semantic information from the input text (Abiyot, 2000). Example: the token sequence 'I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas'. A grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase "in my pajamas" describes the elephant or the shooting event.
  14. 14. Literature Review Grammar for the above sentence, which has multiple structures:
      S -> NP VP
      PP -> P NP
      NP -> Det N | Det N PP | 'I'
      VP -> V NP | VP PP
      Det -> 'an' | 'my'
      N -> 'elephant' | 'pajamas'
      V -> 'shot'
      P -> 'in'
  15. 15. Literature Review The two parsed structures:
      (S (NP I) (VP (V shot) (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
      (S (NP I) (VP (VP (V shot) (NP (Det an) (N elephant))) (PP (P in) (NP (Det my) (N pajamas)))))
  16. 16. Literature Review The syntax parse trees follow. A single sentence can have multiple parse trees; this is referred to as ambiguity.
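The ambiguity described above can be reproduced with a small sketch in plain Python. This is not the author's application and does not use NLTK: the function `all_parses` and the dictionary layout are hypothetical names chosen for illustration. It enumerates every bracketed structure the toy grammar allows by trying each way of splitting a span of tokens among the symbols on a rule's right-hand side.

```python
from functools import lru_cache

# Toy grammar from the slides: nonterminal -> list of right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal (a word).
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}


def all_parses(tokens, start="S"):
    """Return every bracketed parse of `tokens` that starts from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def parses(symbol, i, j):
        # All bracketed strings that derive tokens[i:j] from `symbol`.
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return (symbol,) if j - i == 1 and tokens[i] == symbol else ()
        found = []
        for rhs in GRAMMAR[symbol]:
            for children in expand(rhs, i, j):
                found.append("(" + symbol + " " + " ".join(children) + ")")
        return tuple(found)

    def expand(rhs, i, j):
        # Yield one list of child strings per way of splitting tokens[i:j] over rhs.
        if len(rhs) == 1:
            for tree in parses(rhs[0], i, j):
                yield [tree]
            return
        # The first symbol takes tokens[i:k]; keep at least one token
        # for each of the remaining symbols.
        for k in range(i + 1, j - len(rhs) + 2):
            for left in parses(rhs[0], i, k):
                for rest in expand(rhs[1:], k, j):
                    yield [left] + rest

    return list(parses(start, 0, len(tokens)))


sentence = "I shot an elephant in my pajamas".split()
for tree in all_parses(sentence):
    print(tree)
```

Working over spans rather than expanding rules left to right keeps the left-recursive rule `VP -> VP PP` from looping, since every recursive call covers a strictly shorter span.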
  17. 17. Literature Review Context Free Grammar A context-free grammar (CFG) is a formal system that describes a language by specifying how any legal text can be derived from a distinguished symbol called the axiom, or sentence symbol [5]. An example of a CFG is given below. A sentence like "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ" can be represented using the following grammar:
      S -> NP VP
      VP -> V NP | V NP PP | NP V
      PP -> P NP | P P
      V -> "አየ" | "በላ" | "ተራመዳ"
      NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
      Det -> "የ" | "ለ"
      N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
      P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
  18. 18. Literature Review The syntax parse structure for the above example, and its parse tree as drawn by the developed application, look like the following respectively: (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (V አየ)))
  19. 19. Literature Review Recursive Descent Parsing The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The S → NP VP production permits the parser to replace this goal with two subgoals: find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their left-hand side.
  20. 20. Literature Review Sample code taken from Python Language Processing:
      import nltk

      # Context-free grammar for a small fragment of Amharic
      grammarx = nltk.parse_cfg("""
      S -> NP VP
      VP -> V NP | V NP PP | NP V
      PP -> P NP
      V -> "አየ" | "በላ" | "ተራመዳ"
      NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | N
      Det -> "የ" | "ለ"
      N -> "ሰዉ" | "ውሻ" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
      P -> "በ" | "ላይ" | "በኩል" | "ከ"
      """)
      sent = "አበበ የ ሰዉ ውሻ አየ".split()
      print(sent)
      # Top-down parse of the tokenized sentence
      rd_parser = nltk.RecursiveDescentParser(grammarx)
      for tree in rd_parser.nbest_parse(sent):
          print(tree)
      # Rebuild the tree from its bracketed form and draw it
      parseTree = nltk.Tree.parse('(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N ውሻ)) (V አየ)))', remove_empty_top_bracketing=True)
      parseTree.draw()
  21. 21. ..continued Parsed structure output: (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N ውሻ)) (V አየ))). Syntax parse tree for the above sentence, parsed using the recursive descent parser (top-down).
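To make the goal and subgoal idea concrete, here is a minimal recursive descent sketch in plain Python, assuming the same Amharic grammar fragment and sentence as the sample above. It is an illustration rather than the NLTK implementation: `expand`, `expand_sequence`, and `parse` are hypothetical helper names, and the sketch relies on the grammar having no left-recursive rules.

```python
# Grammar fragment from the slides: nonterminal -> alternative right-hand sides.
# Symbols that are not keys of GRAMMAR are terminals (Amharic words).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"], ["NP", "V"]],
    "PP": [["P", "NP"]],
    "V": [["አየ"], ["በላ"], ["ተራመዳ"]],
    "NP": [["አበበ"], ["ከበደ"], ["ጫላ"], ["Det", "N"], ["Det", "N", "N"],
           ["Det", "N", "PP"], ["N", "N"], ["N"]],
    "Det": [["የ"], ["ለ"]],
    "N": [["ሰዉ"], ["ውሻ"], ["ድመት"], ["ቲልሳኦፕ"], ["መናፈሻ"]],
    "P": [["በ"], ["ላይ"], ["በኩል"], ["ከ"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` matches tokens from `pos`."""
    if symbol not in GRAMMAR:  # terminal: compare with the next input word
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Goal "find a <symbol>": try each production as a list of subgoals.
    for rhs in GRAMMAR[symbol]:
        for children, end in expand_sequence(rhs, tokens, pos):
            yield "(" + symbol + " " + " ".join(children) + ")", end


def expand_sequence(symbols, tokens, pos):
    """Satisfy the subgoals in `symbols` one after another, left to right."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse(tokens, start="S"):
    """Keep only the expansions of `start` that consume the whole sentence."""
    return [tree for tree, end in expand(start, tokens, 0) if end == len(tokens)]


sentence = "አበበ የ ሰዉ ውሻ አየ".split()
for tree in parse(sentence):
    print(tree)
```

Generators give backtracking for free here: when a subgoal fails further down, the enclosing loop simply moves on to the next alternative production.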
  22. 22. ..continued Shift-Reduce Parsing A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the right hand side of a grammar production, and replace them with the left-hand side, until the whole sentence is reduced to an S.[5]
  23. 23. ..continued For the sentence አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ, the parse structure and parse tree representation are given, using the following CFG grammar:
      S -> NP VP
      VP -> V NP | V NP PP | NP V | NP Adj V
      PP -> P NP | P P
      V -> "አየ" | "በላ" | "ተራመዳ"
      NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
      Det -> "የ" | "ለ"
      N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "ቲልሳኦፕ" | "መናፈሻ"
      P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
      Adj -> "ትንሽ"
  24. 24. ..continued Parse structure, parsed using the above grammar: (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ))). Figure 1.8 Parse tree. In a similar manner, keeping the source code of code example 1.0 above, we can use a shift-reduce parser.
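The bottom-up strategy can be sketched in plain Python as follows, assuming the grammar with the `Adj` rule shown above. Two simplifications are assumptions of this sketch, not part of the slides: a shifted word is placed on the stack already labelled with its category (shift and lexical reduce are merged), and the search backtracks over shift and reduce choices, because always reducing as early as possible gets stuck on this sentence. The names `RULES`, `LEXICON`, and `shift_reduce` are hypothetical.

```python
# Phrase rules (left-hand side, right-hand side) from the grammar with Adj.
RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")), ("VP", ("V", "NP", "PP")),
    ("VP", ("NP", "V")), ("VP", ("NP", "Adj", "V")),
    ("PP", ("P", "NP")), ("PP", ("P", "P")),
    ("NP", ("Det", "N")), ("NP", ("Det", "N", "N")), ("NP", ("Det", "N", "PP")),
    ("NP", ("N", "N")), ("NP", ("Det", "N", "N", "PP")),
]

# Lexical rules: each word maps to the single category that introduces it.
LEXICON = {
    "አበበ": "NP", "ከበደ": "NP", "ጫላ": "NP",
    "አየ": "V", "በላ": "V", "ተራመዳ": "V",
    "የ": "Det", "ለ": "Det",
    "ሰዉ": "N", "ውሻ": "N", "አጥር": "N", "ድመት": "N", "ቲልሳኦፕ": "N", "መናፈሻ": "N",
    "በ": "P", "ላይ": "P", "በኩል": "P", "ሆኖ": "P", "ከ": "P",
    "ትንሽ": "Adj",
}


def shift_reduce(tokens):
    """Return the first full parse found by a backtracking search, else None."""

    def search(stack, pos):
        # The stack holds (category, bracketed_tree) pairs.
        if pos == len(tokens) and len(stack) == 1 and stack[0][0] == "S":
            return stack[0][1]
        # Reduce: replace a right-hand side on top of the stack by its left-hand side.
        for lhs, rhs in RULES:
            n = len(rhs)
            if len(stack) >= n and tuple(cat for cat, _ in stack[-n:]) == rhs:
                tree = "(" + lhs + " " + " ".join(t for _, t in stack[-n:]) + ")"
                result = search(stack[:-n] + [(lhs, tree)], pos)
                if result is not None:
                    return result
        # Shift: move the next word onto the stack, labelled with its category.
        if pos < len(tokens):
            word = tokens[pos]
            cat = LEXICON.get(word)
            if cat is not None:
                return search(stack + [(cat, "(" + cat + " " + word + ")")], pos + 1)
        return None

    return search([], 0)


print(shift_reduce("አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()))
```

Every rule in `RULES` has at least two symbols on its right-hand side, so each reduction shrinks the stack and the search always terminates.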
  25. 25. Dependency Grammar Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. A distinct and complementary approach, dependency grammar, focuses instead on how words relate to other words. Dependency is a binary asymmetric relation that holds between a head and its dependents. The head of a sentence is usually taken to be the tensed verb, and every other word is either dependent on the sentence head, or connects to it through a path of dependencies. Sample code taken from the Python syntax parser application:
      import nltk

      # Each line lists a head word and the words that may depend on it
      dep_grammar = nltk.parse_dependency_grammar("""
      'አየ' -> 'አበበ' | 'አጥር' | 'ላይ' | 'ሰዉ'
      'አጥር' -> 'ላይ' | 'ሰዉ' | 'ሆኖ'
      'ሰዉ' -> 'ኧሱ' | 'የ'
      """)
      print(dep_grammar)
  26. 26. ..continued The generated output, showing the dependency of each word:
      Dependency grammar with 9 productions
        'አየ' -> 'አበበ'
        'አየ' -> 'አጥር'
        'አየ' -> 'ላይ'
        'አየ' -> 'ሰዉ'
        'አጥር' -> 'ላይ'
        'አጥር' -> 'ሰዉ'
        'አጥር' -> 'ሆኖ'
        'ሰዉ' -> 'ኧሱ'
        'ሰዉ' -> 'የ'
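The head and dependent relation can also be modelled directly, without any library. The sketch below is an assumption-laden illustration in plain Python: `read_dependency_grammar` and `heads_of` are hypothetical helpers that expand the same three grammar lines into individual head-to-dependent productions and then look up which heads a given word may attach to.

```python
from collections import defaultdict

# Same three lines as the dependency grammar in the slides.
DEP_GRAMMAR = """
'አየ' -> 'አበበ' | 'አጥር' | 'ላይ' | 'ሰዉ'
'አጥር' -> 'ላይ' | 'ሰዉ' | 'ሆኖ'
'ሰዉ' -> 'ኧሱ' | 'የ'
"""


def read_dependency_grammar(text):
    """Expand each "head -> a | b" line into (head, dependent) pairs."""
    productions = []
    for line in text.strip().splitlines():
        head, _, dependents = line.partition("->")
        head = head.strip().strip("'")
        for dependent in dependents.split("|"):
            productions.append((head, dependent.strip().strip("'")))
    return productions


def heads_of(productions):
    """Map every dependent word to the set of heads that license it."""
    table = defaultdict(set)
    for head, dependent in productions:
        table[dependent].add(head)
    return table


productions = read_dependency_grammar(DEP_GRAMMAR)
print("Dependency grammar with", len(productions), "productions")
for head, dependent in productions:
    print(" ", repr(head), "->", repr(dependent))
```

Because dependency is asymmetric, the lookup runs in one direction only: `heads_of(productions)["ላይ"]` lists the heads of ላይ, while ላይ itself has no dependents in this grammar.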
  27. 27. Statistical Approaches In statistical parsing, grammar rules specify the structures allowable in the language, while probabilities specify the distributional regularities of sentence structures in the language. That is, probabilistic reasoning by way of statistical probabilities is introduced to assist reasoning. It means that linguistic specifications and statistical regularities of syntax are combined to be used for better syntax analysis. The probabilistic reasoning has become much more popular in recent years (Yao and Lua, 1998).[1]
  28. 28. Probabilistic CFG parsing A Probabilistic Context-Free Grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. It generates the same set of parses for a text that the corresponding context-free grammar does, and assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the probabilities of the productions used to generate it [1]. PCFGs tend to be robust (Manning and Schütze, 1999) [1]. They produce a model of a language based on real data, and therefore do not have to worry about things like grammatical mistakes, which occur in real-life situations. Although PCFGs have many advantages, a critical disadvantage is that context is not taken into account at all (Cahill, 2000) [8]. In fact, a tri-gram model of a language (here, a model over sequences of three words) would probably achieve better results (Charniak, 1993), even though it takes no account of the internal structure of the language; this may be more applicable to a language like Amharic.
  29. 29. Probabilistic CFG parsing An example PCFG grammar is shown below; the approach is explained after it.
      S -> NP VP [1.0]
      VP -> V NP [0.1]
      VP -> V NP PP [0.3]
      VP -> NP V [0.2]
      VP -> NP Adj V [0.4]
      PP -> P NP [0.2]
      PP -> P P [0.8]
      V -> "አየ" [0.8]
      V -> "በላ" [0.1]
      V -> "ተራመዳ" [0.1]
      NP -> "አበበ" [0.2]
      NP -> "ከበደ" [0.1]
      NP -> "ጫላ" [0.1]
      NP -> Det N [0.1]
      NP -> Det N N [0.1]
      NP -> Det N PP [0.1]
      NP -> N N [0.1]
      NP -> Det N N PP [0.2]
      Det -> "የ" [0.9]
      Det -> "ለ" [0.1]
      N -> "ሰዉ" [0.4]
      N -> "ውሻ" [0.1]
      N -> "አጥር" [0.2]
      N -> "ድመት" [0.1]
      N -> "ቲልሳኦፕ" [0.1]
      N -> "መናፈሻ" [0.1]
      P -> "በ" [0.1]
      P -> "ላይ" [0.4]
      P -> "በኩል" [0.1]
      P -> "ሆኖ" [0.3]
      P -> "ከ" [0.1]
      Adj -> "ትንሽ" [1.0]
  30. 30. Probabilistic CFG parsing The syntax parse structure produced by the Viterbi algorithm with the above grammar is shown below, together with the overall probability of the parse. Code example using Python:
      viterbi_parser = nltk.ViterbiParser(grammer)
      sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
      print(viterbi_parser.parse(sent))
      Output of the above grammar and ViterbiParser in my application using Python:
      (S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ))) (p=8.84736e-05)
  31. 31. Probabilistic CFG parsing From the example of a PCFG with associated probabilities taken from the developed syntax parser application: note that the probabilities of the productions for each grammar symbol category, say NP, must sum to 1.0, so that the sentence can be parsed with the Viterbi algorithm (which selects the most probable route; the algorithm is also used in POS taggers, as in Mesfin 2001 [4]). In this grammar we can see productions within the same category that share the same probability, for example: V -> "አየ" [0.8], V -> "በላ" [0.1], V -> "ተራመዳ" [0.1]. Assume we have the following sentence: አበበ የ ሰዉ አጥር ላይ ሆኖ አየ :: How is it then decided which production, for instance the one ending in "በላ", is used at the end? This is the advantage of a PCFG: the probabilities accumulated along the path so far let us pick the best match. This case is demonstrated in my application; see the source code at the end of this document.
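The two properties discussed above, that each category's probabilities sum to 1.0 and that a parse's probability is the product of its production probabilities, can be checked with a short plain-Python sketch. It is a span-based search for the most probable parse in the spirit of the Viterbi algorithm, not NLTK's `ViterbiParser`; the names `PCFG`, `is_normalised`, and `best_parse` are hypothetical, and the rule probabilities are those of the slide's grammar.

```python
import math
from functools import lru_cache

# (left-hand side, right-hand side, probability), following the slide's PCFG.
PCFG = [
    ("S", ("NP", "VP"), 1.0),
    ("VP", ("V", "NP"), 0.1), ("VP", ("V", "NP", "PP"), 0.3),
    ("VP", ("NP", "V"), 0.2), ("VP", ("NP", "Adj", "V"), 0.4),
    ("PP", ("P", "NP"), 0.2), ("PP", ("P", "P"), 0.8),
    ("V", ("አየ",), 0.8), ("V", ("በላ",), 0.1), ("V", ("ተራመዳ",), 0.1),
    ("NP", ("አበበ",), 0.2), ("NP", ("ከበደ",), 0.1), ("NP", ("ጫላ",), 0.1),
    ("NP", ("Det", "N"), 0.1), ("NP", ("Det", "N", "N"), 0.1),
    ("NP", ("Det", "N", "PP"), 0.1), ("NP", ("N", "N"), 0.1),
    ("NP", ("Det", "N", "N", "PP"), 0.2),
    ("Det", ("የ",), 0.9), ("Det", ("ለ",), 0.1),
    ("N", ("ሰዉ",), 0.4), ("N", ("ውሻ",), 0.1), ("N", ("አጥር",), 0.2),
    ("N", ("ድመት",), 0.1), ("N", ("ቲልሳኦፕ",), 0.1), ("N", ("መናፈሻ",), 0.1),
    ("P", ("በ",), 0.1), ("P", ("ላይ",), 0.4), ("P", ("በኩል",), 0.1),
    ("P", ("ሆኖ",), 0.3), ("P", ("ከ",), 0.1),
    ("Adj", ("ትንሽ",), 1.0),
]

RULES = {}
for lhs, rhs, p in PCFG:
    RULES.setdefault(lhs, []).append((rhs, p))


def is_normalised():
    """True if the production probabilities of every category sum to 1.0."""
    return all(math.isclose(sum(p for _, p in alts), 1.0) for alts in RULES.values())


def best_parse(tokens, start="S"):
    """Return (probability, bracketed tree) of the most probable parse."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def best(symbol, i, j):
        if symbol not in RULES:  # terminal: one matching word, probability 1
            return (1.0, symbol) if j - i == 1 and tokens[i] == symbol else (0.0, None)
        winner = (0.0, None)
        for rhs, p in RULES[symbol]:
            for prob, children in splits(rhs, i, j):
                if p * prob > winner[0]:  # product of rule and child probabilities
                    winner = (p * prob, "(" + symbol + " " + " ".join(children) + ")")
        return winner

    def splits(rhs, i, j):
        # Try every split of tokens[i:j] over rhs, using the best parse per child.
        if len(rhs) == 1:
            prob, tree = best(rhs[0], i, j)
            if tree is not None:
                yield prob, [tree]
            return
        for k in range(i + 1, j - len(rhs) + 2):
            prob, tree = best(rhs[0], i, k)
            if tree is None:
                continue
            for rest_prob, rest in splits(rhs[1:], k, j):
                yield prob * rest_prob, [tree] + rest

    return best(start, 0, len(tokens))


print(is_normalised())
print(best_parse("አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()))
```

Multiplying the probabilities of the twelve productions in the winning tree by hand (1.0 × 0.2 × 0.4 × 0.2 × 0.9 × 0.4 × 0.2 × 0.8 × 0.4 × 0.3 × 1.0 × 0.8) gives 8.84736e-05, the value reported for this sentence in the slides.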
  32. 32. Methodology To develop this sample application I took a set of four sample grammars, ranging from simple to complex production rules, assigned probabilities to their productions for probabilistic parsing, and had the application print each parse structure and draw its parse tree based on the grammar. On the source-code side, I used a collection of tools that support the main application for different purposes, listed below. ● Python 3.2 ● NLTK 3.0, a Python-based natural language processing toolkit ● KeyMan keyboard, a Unicode keyboard writer (Amharic) ● PyScripter 3.2, an interactive IDE for Python.
  33. 33. Methodology To set up the application in a local environment, first install Python 3.2, then download NLTK 3.0 and install it under the Python directory, because it is used as a library inside the Python code. Then download the NLTK data using Python itself. Example using the command line on Windows: [Go to CMD] type Python in the Windows `CMD`, then type the command to download the data. NLTK must be installed first, following its installation instructions.
  34. 34. Methodology
  35. 35. Significance of study This study matters because, as far as I know, no parser of this kind has yet been developed for Amharic. It opens up possibilities for easing the parsing of Amharic sentences and moves Amharic syntax parsing approaches one step ahead. The study also shows that there is an easy and more accurate way of parsing syntax for Amharic. I am not claiming that this study surpasses the earlier work of other researchers, but I think it addresses some of the problems they mention in their studies [Alebachew, Abiyot, Mesfin], such as probabilistic approaches, automatic parsing, and the need to write a grammar parser, along with further gains from the programming results.
  36. 36. Significance of study I believe that by taking this study further, as a more advanced research project with more time and effort, a real syntax parser for Amharic can be developed. This study worked hard on how to handle Amharic sentences using rule-based and probabilistic approaches, and its code and application output are available at the end of this document. It can also motivate researchers, students, and stakeholders to move forward from where I left off in the limited time available; I believe they can benefit a great deal from the source code and the method I have suggested. Above all, what I want to emphasize is the growth of Amharic NLP capabilities, and that is what this study is dedicated to.
  38. 38. Reference
      [1] Atelach Alemu Argaw. Automatic Sentence Parsing for Amharic Text: An Experiment Using Probabilistic Context Free Grammars. Thesis submitted in partial fulfilment of the requirement for the degree of Master of Science in Information Science.
      [2] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Copyright 2006. Draft of June 25, 2007.
      [3] Abiyot Bayou. Design and Development of Word Parser for Amharic Language. Masters thesis, Addis Ababa University. 2000.
      [4] Mesfin Getachew. Automatic Part of Speech Tagging for Amharic: An Experiment Using Stochastic Hidden Markov (HMM) Approach. Masters thesis, Addis Ababa University. 2001.
      [5]
      [6] Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing. 2010.
      [7] Björn Gambäck. Tagging and Verifying an Amharic News Corpus. Norwegian University of Science and Technology, Trondheim, Norway.
      [8] According to my development tool [file:///home/dadenew/Special%20Attenziona/ch08.html]
  39. 39. Thank you! Comment and contact me @ LinkedIn: daniel adenew, Academia: daniel adenew, Google: daniel adenew, SlideShare: daniel adenew, dannymanone