International Journal on Natural Language Computing (IJNLC) Vol. 6, No.6, December 2017
DOI: 10.5121/ijnlc.2017.6602
SETSWANA PART OF SPEECH TAGGING
Gabofetswe Malema, Boago Okgetheng and Moffat Motlhanka
Department of Computer Science, University of Botswana, Gaborone, Botswana
ABSTRACT
Part of speech tagging is one of the basic steps in natural language processing. Although it has been
investigated for many languages around the world, very little has been done for the Setswana language.
Setswana is written disjunctively and some words play multiple functions in a sentence. These
features make part of speech tagging more challenging. This paper presents a finite state method for
identifying one of the compound parts of speech, the relative. Results show an 82% identification rate,
which is lower than for other languages. The results also show that the model can identify the start of a
relative 97% of the time but fails to identify where it stops 13% of the time. The model fails due to the
limitations of the morphological analyser and due to more complex sentences not accounted for in the
model.
KEYWORDS
Setswana, part of speech tagging, morphological analysis
1. INTRODUCTION
Part of speech tagging is the process of identifying the parts of speech in a sentence for a given
language. It is considered to be one of the fundamental stages of natural language processing for any
language. It is a pre-processing stage for advanced applications such as machine learning,
translation, and grammar checking [1]. Studies in this area are classified into three main approaches:
statistical, rule based and hybrid [1][2][3][4]. Statistical methods are provided with
tagged learning data which trains the tagger. The approach has been shown to give good results, with
accuracy in the 90% range for many languages [2]. However, the approach works well only if good
training data is available. The rule based approach applies the language’s rules to identify parts of
speech. Although the rule based method relies heavily on knowledge of the language, it can give better
accuracy and feedback [2]. The hybrid approach combines the benefits of the statistical and rule
based approaches. Initially words are tagged using the statistical approach, and if there are words
that need disambiguation, language rules are used to resolve them [4].
The complexity of tagging varies from language to language. In some languages most tags are
just single words or tokens separated by spaces, which makes them easy to identify. However, in some
languages the problem is nontrivial. For example, in Setswana, which is written disjunctively, a
few words are in some cases grouped together to form a part of speech. A few studies
have highlighted the complexity of Setswana part of speech tagging and tokenization. In
[5] a statistical approach is used to tag Northern Sotho, which is close to Setswana. The
study obtained a performance of 94%. The study, however, does not indicate whether compound
parts of speech were included or not. A finite state machine approach is used in [6] to analyse
Setswana verb morphology, which also obtained a 94% rate. The tokenization problem is related
to this study in that for one to perform tagging they also need to tokenize. In [6] the main aim is
to tokenize or tag verbs. In both studies performance results are high and promising. However,
the results are not good enough to lead to a completely automated part of speech tagger. The studies
do not offer general approaches to single words (tokens) and compound tokens or parts of speech.
Setswana is a low resourced language and has limited resources in the form of tagged text. We
therefore choose to use the rule based approach to perform part of speech tagging. Setswana is a
Bantu language spoken in Botswana, South Africa, Zimbabwe and Namibia. There are few
automation tools for the Setswana language. For Setswana to experience an explosion of NLP
applications like other languages, basic tools such as analyzers and part of speech taggers need to
be developed. In this study we present a way to identify the relative (leamanyi) and the possessive
(lerui), which are among the common parts of speech.
2. SETSWANA PARTS OF SPEECH
Sentences in a language are made up of parts of speech which include nouns, pronouns, adverbs,
adjectives and verbs. Each part plays a role in giving the sentence its meaning. Depending on the
language a sentence has a structure based on the language grammar. Setswana uses a disjunctive
orthography. That is, in some cases a few tokens (words) are written separately but are only
meaningful as a unit. Individual words play different functions in a sentence depending on the words
around them. This poses a great challenge, as a word could have many functions, as also explained
in [6]. For example the word ba could be:
1) a possessive concord: batho ba gagwe (his people)
2) a relative concord: ba ba lemang (those planting)
3) a pronoun: ba a tsamaya (they are going)
4) a demonstrative: batho ba (these people)
Table 1. Setswana relative and possessive concords according to noun class [8].
Noun class   Prefix   Relative concord   Possessive concord
1            Mo-      yo o               wa
2            Ba-      ba ba              ba
3            Mo-      o o                wa
4            Me-      e e                ya
5            Le-      le le              la
6            Ma-      a a                a
7            Se-      se se              sa
8            Di-      tse di             tsa
9            N-       e e                ya
10           DiN-     tse di             tsa
11           Lo-      lo lo              lwa
14           Bo-      jo bo              jwa
15           Go-      mo go              ga
In the examples above the word ba plays a different role depending on the words around it. In
Setswana such multi-function words occur in almost every sentence, and they include concords,
pronouns and demonstratives. This is particularly the case in adjectives, relatives and adverbs [7][8].
These are made up of a concord, which depends on the noun class, and a root, which can be another
part of speech. In this study we investigate how relatives and possessives can be identified in a sentence.
Table 1 shows relative and possessive concords according to noun class. Based on
sentence examples from Setswana text and literature we have derived general constructs for
relatives and possessives.
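As an illustration, the concords of Table 1 can be held in a two-dimensional array and searched by concord. The sketch below is only one possible layout and is not taken from the authors' implementation; the class name ConcordTable and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: Table 1 stored as a 2D array (the layout is an assumption). */
final class ConcordTable {
    // Columns: noun class, prefix, relative concord, possessive concord.
    static final String[][] CONCORDS = {
        {"1", "Mo-", "yo o", "wa"},
        {"2", "Ba-", "ba ba", "ba"},
        {"3", "Mo-", "o o", "wa"},
        {"4", "Me-", "e e", "ya"},
        {"5", "Le-", "le le", "la"},
        {"6", "Ma-", "a a", "a"},
        {"7", "Se-", "se se", "sa"},
        {"8", "Di-", "tse di", "tsa"},
        {"9", "N-", "e e", "ya"},
        {"10", "DiN-", "tse di", "tsa"},
        {"11", "Lo-", "lo lo", "lwa"},
        {"14", "Bo-", "jo bo", "jwa"},
        {"15", "Go-", "mo go", "ga"},
    };

    /** Returns the noun classes whose relative concord consists of the two given tokens. */
    static List<String> classesForRelative(String first, String second) {
        List<String> classes = new ArrayList<>();
        String concord = first + " " + second;
        for (String[] row : CONCORDS) {
            if (row[2].equals(concord)) {
                classes.add(row[0]);
            }
        }
        return classes;
    }

    /** Returns the noun classes whose possessive concord is the given token. */
    static List<String> classesForPossessive(String token) {
        List<String> classes = new ArrayList<>();
        for (String[] row : CONCORDS) {
            if (row[3].equals(token)) {
                classes.add(row[0]);
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        System.out.println(classesForRelative("e", "e"));   // [4, 9]
        System.out.println(classesForPossessive("ya"));     // [4, 9]
        System.out.println(classesForRelative("ba", "a"));  // []
    }
}
```

Because several noun classes share a concord (for example e e for classes 4 and 9), the lookup returns a list of candidate classes rather than a single class.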
After investigating how relatives are used in Setswana sentences we derived the following
constructs.
cc + X, where cc is one of the possible relative concords from Table 1 and
X can be verb+ng, negative + verb+ng, tense + verb+ng, an adjective or a locative adverb.
Examples are:
yo o lelang (the one crying)
yo o tla lelang (the one that will cry)
yo o thata (the tough one)
yo o kwa godimo (the one on the top)
yo o sa leleng (the one not crying)
We have also derived constructs for possessives as follows:
cc + X, where cc is one of the possible possessive concords from Table 1 and
X can be a pronoun, noun, relative, possessive, adjective or demonstrative.
Examples are:
ya bone (theirs)
ya Thabo (Thabo’s)
ya wa Thabo (it is for Thabo’s wife, car, etc.)
ya ele (for that one)
ya yo moleele (for the tall one)
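Since X in the possessive construct can itself be a possessive, the construct is recursive. A minimal sketch of that recursion follows. It assumes tokens have already been classified, treats a multi-word adjective such as yo moleele as a single ADJECTIVE unit, and leaves out the case where X is a relative. The class PossessiveSketch, the Tag names and the method endOfPossessive are hypothetical and are not the authors' implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: recognising the possessive construct cc + X recursively. */
final class PossessiveSketch {
    /** Hypothetical token classes; a real tagger would obtain them from a dictionary. */
    enum Tag { POSS_CC, PRONOUN, NOUN, ADJECTIVE, DEMONSTRATIVE, OTHER }

    /**
     * Tries to read a possessive starting at index start.
     * Returns the index just past the possessive, or -1 if none starts there.
     */
    static int endOfPossessive(List<Tag> tags, int start) {
        if (start >= tags.size() || tags.get(start) != Tag.POSS_CC) {
            return -1;
        }
        int next = start + 1;
        if (next >= tags.size()) {
            return -1;                               // a concord alone is not a possessive
        }
        switch (tags.get(next)) {
            case PRONOUN:
            case NOUN:
            case ADJECTIVE:
            case DEMONSTRATIVE:
                return next + 1;                     // cc + simple X
            case POSS_CC:
                return endOfPossessive(tags, next);  // cc + nested possessive
            default:
                return -1;
        }
    }

    public static void main(String[] args) {
        // ya bone: concord + pronoun
        System.out.println(endOfPossessive(Arrays.asList(Tag.POSS_CC, Tag.PRONOUN), 0));           // 2
        // ya wa Thabo: concord + (concord + noun)
        System.out.println(endOfPossessive(Arrays.asList(Tag.POSS_CC, Tag.POSS_CC, Tag.NOUN), 0)); // 3
        // concord followed by an unrelated token
        System.out.println(endOfPossessive(Arrays.asList(Tag.POSS_CC, Tag.OTHER), 0));             // -1
    }
}
```

Returning the end index rather than a boolean lets a caller know where the possessive stops, which is the part of the task that becomes harder as constructs get longer.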
From these constructs we derived a finite state diagram, which is then represented as a
two-dimensional array showing all the transitions from the time a concord is detected to the final state
when a relative is positively identified. It should be noted that the constructs above are not
exhaustive. There are more complex indirect relatives and possessives. In this study we have
looked at a few direct constructs to demonstrate the use of state machines for identification of
compound Setswana parts of speech.
Figure 1. Compressed relative state diagram.
Figure 1 shows a compressed state diagram for identifying a relative part of speech. Starting at
state 0 the machine waits for an appropriate relative concord to move to state 1 at which it can
accept a few inputs to move to the accepting state 3. The diagram is compressed in that adverbs
and adjectives are also state diagrams. In Setswana descriptive parts of speech can be formed
using other parts of speech including other descriptive parts of speech. For example,
e e fa nokeng (the one at/by the river)
This example is a relative which is created using a locative adverb. e e is a relative concord for
noun class 4 (Table 1) and fa nokeng (at/by the river) is a locative adverb. Therefore the whole
phrase is classified as a relative.
The equivalent state transition table is as shown in Figure 2.
Figure 2. State transition table for relative state diagram in figure 1.
The state table is implemented in Java as a 2D array. The numbers in the table cells indicate the
next state. At each state the transition is determined by the next input. If the transition lands in an
empty cell, the tokens read so far do not form a relative. We developed the possessive
state diagram and transition table in the same way as for the relative.
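A table-driven recognizer of this kind might look like the sketch below. The cell values are read from the relative state transition table as best it can be recovered, so they should be treated as an assumption, as should the choice of -1 to stand for an empty cell. The class RelativeFsm, the Input names and the method accepts are hypothetical. Each input is a token class, with a two-word concord such as yo o counted as one CC input.

```java
/** Illustrative sketch: a relative recognizer driven by a 2D transition table. */
final class RelativeFsm {
    /** Input classes, in the column order of the transition table. */
    enum Input { CC, VERB, TENSE, ADJECTIVE, ADVERB }

    static final int REJECT = -1;   // stands for an empty table cell
    static final int ACCEPT = 3;    // accepting state

    // Rows are states 0..2; columns follow the Input enum order.
    static final int[][] NEXT = {
        //  CC      VERB    TENSE   ADJ     ADV
        {   1,      REJECT, REJECT, REJECT, REJECT },  // state 0: wait for a concord
        {   REJECT, 3,      2,      3,      3      },  // state 1: concord seen
        {   REJECT, 3,      REJECT, REJECT, REJECT },  // state 2: tense marker seen
    };

    /** Returns true if the classified tokens, read in order, form a relative. */
    static boolean accepts(Input... inputs) {
        int state = 0;
        for (Input in : inputs) {
            if (state == ACCEPT) {
                return false;                 // extra input after acceptance
            }
            state = NEXT[state][in.ordinal()];
            if (state == REJECT) {
                return false;                 // landed in an empty cell
            }
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        // yo o lelang: concord + verb+ng
        System.out.println(accepts(Input.CC, Input.VERB));              // true
        // yo o tla lelang: concord + tense + verb+ng
        System.out.println(accepts(Input.CC, Input.TENSE, Input.VERB)); // true
        // a concord followed by another concord
        System.out.println(accepts(Input.CC, Input.CC));                // false
    }
}
```

Keeping the transitions in data rather than in branching code means a possessive recognizer can reuse the same loop with a different array.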
Figure 3. Block diagram of proposed tagger.
3. PROPOSED PART OF SPEECH TAGGER
Based on the finite state transition tables derived in the section above, we developed a tool
that detects and identifies a relative or a possessive in a sentence, as shown in Figure 3. The
transition tables are implemented in Java as 2D arrays. Concords are stored in another 2D array, as
shown in Table 1. The tool uses a dictionary and a morphological analyser. The dictionary has a list
of tagged words. That is, it identifies whether a given token is a pronoun, noun, verb, enumerative,
demonstrative or another part of speech. After identifying a valid concord the tool needs to classify
the next token to determine the next state in the table. The dictionary does not contain all words in
their different forms. The morphological analyser is therefore used to reduce words, which are then fed
back into the dictionary for identification and classification. In Setswana verbs can be modified
into many forms including intensive, passive, reciprocal, neuter, plural, reflexive and
others. Nouns also take different forms such as plural, locative and diminutive. Since the
dictionary will not have all the words with their derivatives we use the morphological analyser
developed in [9] to help in classifying derived words. For example, given a token berekile
(worked), the dictionary will only have the root form of the word which is bereka (work). The
morphological analyser will reduce berekile to bereka, which the dictionary will identify as a
verb. The dictionary and the morphological analyser are actually implemented as one module.
The tagger scans a sentence from left to right until it detects a possible start of a relative or
possessive concord. Once that is detected, the tagger traverses the corresponding state table until it
accepts or rejects the input.
State transition table for the relative state diagram (rows are current states, columns are input
classes, entries are next states, and a dash marks an empty cell):
State   cc   verb   tense   adjective   adverb
0       1    -      -       -           -
1       -    3      2       3           3
2       -    3      -       -           -
Block diagram of the proposed tagger (components only): tokens (input), Part of Speech Detector,
Dictionary, Morphological Analyzer, accept/reject (output).
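The dictionary lookup with a morphological fallback described in this section can be sketched as follows. This is a toy stand-in, not the analyser of [9]: the two dictionary entries, the class TokenClassifier, and the reduce method, which only undoes the ending -ile as in the berekile example, are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: dictionary lookup with a morphological fallback. */
final class TokenClassifier {
    // Tiny stand-in dictionary of root forms; the entries are examples only.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("bereka", "verb");
        DICTIONARY.put("bone", "pronoun");
    }

    /**
     * Hypothetical stand-in for the morphological analyser: it only undoes
     * the ending -ile (berekile -> bereka) and leaves every other token as is.
     */
    static String reduce(String token) {
        if (token.endsWith("ile") && token.length() > 3) {
            return token.substring(0, token.length() - 3) + "a";
        }
        return token;
    }

    /** Returns the tag for a token, or "unknown" if neither lookup succeeds. */
    static String classify(String token) {
        String word = token.toLowerCase();
        String tag = DICTIONARY.get(word);
        if (tag == null) {
            tag = DICTIONARY.get(reduce(word));   // retry with the reduced form
        }
        return tag == null ? "unknown" : tag;
    }

    public static void main(String[] args) {
        System.out.println(classify("bereka"));    // verb
        System.out.println(classify("berekile"));  // verb
        System.out.println(classify("xyz"));       // unknown
    }
}
```

Trying the dictionary first and reducing only on a miss keeps root forms from being altered by the reduction step.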
4. PERFORMANCE RESULTS
The proposed tool was tested with a 10-page Setswana document. The results show an overall
performance rate of 82%. From the text, 77 relatives were manually identified, of which 65
were identified by the tool, giving a performance rate of 84% for relatives. The tool could
detect 97% of relatives but fails in some cases to detect where they stop.
Also from the test text, 111 possessives were manually identified, of which 89 were correctly
identified, giving a performance rate of 80%. The tool could detect 98% of possessives but fails in
some cases to detect where they end. It should be noted that these results are only for direct relatives
and possessives and do not include indirect ones. Analysis of the results shows that the tool
fails to classify some words correctly due to limitations of the dictionary and the morphological
analyser. We noticed that in some cases a word could be classified as both an adjective and a
noun, but our dictionaries in some cases have only one classification. It will be important to
develop a tagged dictionary specifically for the purposes of tagging. The tool also fails to
correctly identify the end of some relatives and possessives, especially the long ones which are
made up of other parts of speech. A relative could be made up of other parts of speech such as
locative adverbs. In our implementation, other compound parts of speech were not fully defined.
The developed rules are limited in identifying complex constructs of these compound parts of
speech. We found that in some cases these parts of speech are made up of three to five parts of
speech in a recursive manner. That is, a possessive can be created using a relative, and that
possessive can be created using a relative or some other part of speech.
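The rates reported in this section follow from the reported counts. The short check below recomputes them; the class RateCheck and the method percent are hypothetical names, and the overall figure is computed here by pooling the relative and possessive counts, which is an assumption about how the 82% was obtained.

```java
/** Illustrative sketch: recomputing the reported rates from the reported counts. */
final class RateCheck {
    /** Returns correct/total as a percentage rounded to the nearest whole number. */
    static long percent(int correct, int total) {
        return Math.round(100.0 * correct / total);
    }

    public static void main(String[] args) {
        int relTotal = 77, relFound = 65;      // relatives, counts from the text
        int posTotal = 111, posFound = 89;     // possessives, counts from the text
        System.out.println(percent(relFound, relTotal));                       // 84
        System.out.println(percent(posFound, posTotal));                       // 80
        System.out.println(percent(relFound + posFound, relTotal + posTotal)); // 82
    }
}
```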
5. CONCLUSIONS
The disjunctive orthography of the Setswana language makes it challenging to automate tokenization
and part of speech tagging. We developed finite state diagrams for relatives and possessives and
implemented them in Java. The tagger is supported by a dictionary and a morphological
analyser. On the test data the tool detects shorter parts of speech better than longer ones.
Overall the tool achieved an 82% identification rate. The tool can be further improved by improving
the dictionary and the morphological analyser and by adding the other parts of speech that are used
within relatives and possessives. The tagger relies heavily on the dictionary and morphological
analyser to identify tokens. The approach could also be applied to other compound parts of speech in a
similar way.
REFERENCES
[1] Charniak E (1997) “Statistical techniques for Natural Language parsing”, AI Magazine, 18(4), pp. 33-44
[2] Brants T (2000) “A statistical part of speech tagger”, PANCL’00 Proceedings of the sixth conference
on applied natural language processing Association for Computational Linguistics
[3] Brill E (1992) “A simple rule based part of speech tagger”, In Proceedings of the third conference on
Applied Natural Language processing, ACL, Trento, Italy
[4] Brill E (1995) “Transformation Based Error-Driven Learning and Natural language Processing: A case
study in Part of Speech Tagging”, Computational Linguistics
[5] Faaß G, Heid U, Taljard E & Prinsloo D (2009) “Part-of-Speech tagging of Northern Sotho:
Disambiguating polysemous function words”, Proceedings of the EACL 2009 Workshop on
Language Technologies for African Languages (AfLaT 2009), pp. 38-45, Athens, Greece, 31 March
2009
[6] Pretorius L, Viljoen B, Pretorius R and Berg A (2009) “A finite state approach to Setswana verb
morphology”, International Workshop on finite state methods and natural Language Processing
FSMNLP 2009: Finite State Methods and Natural language Processing, pp. 131 – 138
[7] Cole D T, “An Introduction to Tswana grammar”, Longmans and Green, Cape Town.
[8] Mogapi, K, “Thuto Puo ya Setswana”, Longman Botswana, 184, ISBN:0582 619033
[9] Malema G, Motlogelwa N, Okgetheng B, Mogotlhwane O, “Setswana Verb Analyzer and Generator”,
International Journal of Computational Linguistics (IJCL), Vol 7, issue 1, 2016
AUTHORS
Dr. G. Malema is a Senior Lecturer at the Department of Computer Science, University of Botswana. He
obtained his PhD in Computer Engineering in 2008 from The University of Adelaide. He has been working on
automation tools for Setswana for the past 3 years.
Mr. Boago Okgetheng is a research assistant. He graduated with a BSc in Computer Science from the
University of Botswana in 2015.
Mr. Moffat Motlhanka is currently a fourth-year Computer Science student at the University of Botswana.

NLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENE
 
January 2024: Top 10 Downloaded Articles in Natural Language Computing
January 2024: Top 10 Downloaded Articles in Natural Language ComputingJanuary 2024: Top 10 Downloaded Articles in Natural Language Computing
January 2024: Top 10 Downloaded Articles in Natural Language Computing
 
Clustering Web Search Results for Effective Arabic Language Browsing
Clustering Web Search Results for Effective Arabic Language BrowsingClustering Web Search Results for Effective Arabic Language Browsing
Clustering Web Search Results for Effective Arabic Language Browsing
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 

Recently uploaded (20)

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 

SETSWANA PART OF SPEECH TAGGING

International Journal on Natural Language Computing (IJNLC) Vol. 6, No.6, December 2017
DOI: 10.5121/ijnlc.2017.6602

Gabofetswe Malema, Boago Okgetheng and Moffat Motlhanka
Department of Computer Science, University of Botswana, Gaborone, Botswana

ABSTRACT
Part of speech tagging is one of the basic steps in natural language processing. Although it has been investigated for many languages around the world, very little has been done for the Setswana language. Setswana is written disjunctively and some words play multiple functions in a sentence. These features make part of speech tagging more challenging. This paper presents a finite state method for identifying one of the compound parts of speech, the relative. Results show an 82% identification rate, which is lower than for other languages. The results also show that the model can identify the start of a relative 97% of the time but fails to identify where it stops 13% of the time. The model fails due to the limitations of the morphological analyser and due to more complex sentences not accounted for in the model.

KEYWORDS
Setswana, part of speech tagging, morphological analysis

1. INTRODUCTION
Part of speech tagging is a process that identifies parts of speech in a sentence for a given language. It is considered to be one of the fundamental stages of natural language processing for any language. It is a pre-processing stage for advanced applications such as machine learning, translation, and grammar checking [1]. Studies in this area are classified into three main approaches: statistical, rule based and hybrid [1][2][3][4]. Statistical methods are provided with tagged learning data which trains the tagger. The approach has been shown to give good results, in the region of 90%, for many languages [2]. However, the approach works well only when good training data is available. The rule based approach applies the language's rules to identify parts of speech.
Although the rule based method relies heavily on knowledge of the language, it can give better accuracy and feedback [2]. The hybrid approach combines the benefits of the statistical and rule based approaches. Initially words are tagged using the statistical approach, and if there are words that need disambiguation, language rules are used to resolve them [4].

The complexity of tagging varies from language to language. In some languages most tags are single words or tokens separated by spaces, which makes them easy to identify. In other languages the problem is nontrivial. For example, in Setswana, which is written disjunctively, a few words are in some cases grouped together to form a part of speech. A few studies have highlighted the complexity of Setswana part of speech tagging and tokenization. In [5] a statistical approach is used to tag Northern Sotho, which is close to the Setswana language. The study obtained a performance of 94%. The study, however, does not indicate whether compound parts of speech were included or not. A finite state machine approach is used in [6] to analyse Setswana verb morphology, which also obtained a 94% rate. The tokenization problem is related to this study in that to perform tagging one also needs to tokenize. In [6] the main aim is to tokenize or tag verbs. In both studies performance results are high and promising. However, the results are not good enough to lead to a completely automated part of speech tagger. The studies do not offer general approaches that cover both single words (tokens) and compound tokens or parts of speech.
Setswana is a low resourced language and has limited resources in the form of tagged text. We therefore choose to use the rule based approach to perform part of speech tagging. Setswana is a Bantu language spoken in Botswana, South Africa, Zimbabwe and Namibia. There are few automation tools for the Setswana language. For Setswana to experience an explosion of NLP applications like other languages, basic tools such as analyzers and part of speech taggers need to be developed. In this study we present a way to identify the relative (leamanyi) and the possessive (lerui), which are among the common parts of speech.

2. SETSWANA PARTS OF SPEECH
Sentences in a language are made up of parts of speech, which include nouns, pronouns, adverbs, adjectives and verbs. Each part plays a role in giving the sentence its meaning. Depending on the language, a sentence has a structure based on the language grammar. Setswana uses a disjunctive orthography. That is, in some cases a few tokens (words) are written separately but are only meaningful as a unit. Individual words play different functions in a sentence depending on the words around them. This poses a great challenge, as a word could have many functions, as also explained in [6]. For example, the word ba could be:
1) a possessive concord: batho ba gagwe (his people)
2) a relative concord: ba ba lemang (those planting)
3) a pronoun: ba a tsamaya (they are going)
4) a demonstrative: batho ba (these people)

Table 1. Setswana relative and possessive concords according to noun class [8].

Noun class   Prefix   Relative concord   Possessive concord
1            Mo-      yo o               wa
2            Ba-      ba ba              ba
3            Mo-      o o                wa
4            Me-      e e                ya
5            Le-      le le              la
6            Ma-      a a                a
7            Se-      se se              sa
8            Di-      tse di             tsa
9            N-       e e                ya
10           DiN-     tse di             tsa
11           Lo-      lo lo              lwa
14           Bo-      jo bo              jwa
15           Go-      mo go              ga

In the examples above the word ba plays a different role depending on the words around it.
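The concord columns of Table 1 lend themselves to a simple lookup structure. The following is a minimal sketch, not the authors' code: the class name `ConcordTable` and its two helper methods are hypothetical, and only the table values come from the paper. It shows how a token such as ba, or a token pair such as e e, maps back to candidate noun classes, and why a concord match alone cannot settle the part of speech.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: Table 1 held as a 2D array, with hypothetical lookup helpers. */
class ConcordTable {
    // Columns: noun class, prefix, relative concord, possessive concord (values from Table 1).
    static final String[][] CONCORDS = {
        {"1",  "Mo-",  "yo o",   "wa"},
        {"2",  "Ba-",  "ba ba",  "ba"},
        {"3",  "Mo-",  "o o",    "wa"},
        {"4",  "Me-",  "e e",    "ya"},
        {"5",  "Le-",  "le le",  "la"},
        {"6",  "Ma-",  "a a",    "a"},
        {"7",  "Se-",  "se se",  "sa"},
        {"8",  "Di-",  "tse di", "tsa"},
        {"9",  "N-",   "e e",    "ya"},
        {"10", "DiN-", "tse di", "tsa"},
        {"11", "Lo-",  "lo lo",  "lwa"},
        {"14", "Bo-",  "jo bo",  "jwa"},
        {"15", "Go-",  "mo go",  "ga"},
    };

    /** Noun classes whose relative concord is the two-token sequence "first second". */
    static List<String> relativeClasses(String first, String second) {
        String pair = first + " " + second;
        List<String> classes = new ArrayList<>();
        for (String[] row : CONCORDS) {
            if (row[2].equals(pair)) {
                classes.add(row[0]);
            }
        }
        return classes;
    }

    /** Noun classes whose possessive concord is the given single token. */
    static List<String> possessiveClasses(String token) {
        List<String> classes = new ArrayList<>();
        for (String[] row : CONCORDS) {
            if (row[3].equals(token)) {
                classes.add(row[0]);
            }
        }
        return classes;
    }
}
```

With this layout, `possessiveClasses("ba")` and `relativeClasses("ba", "ba")` both point to noun class 2, so the token ba on its own is ambiguous until the following tokens are examined.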
In Setswana such words appear in almost every sentence, and they include concords, pronouns and demonstratives. This is particularly the case in adjectives, relatives and adverbs [7][8], which are made up of a concord that depends on the noun class and a root that can be another part of speech. In this study we investigate how relatives and possessives can be identified in a sentence. Table 1 shows relative and possessive concords according to noun class. Based on sentence examples from Setswana text and literature we have derived general constructs for relatives and possessives. After investigating how relatives are used in Setswana sentences we derived the following constructs.
cc + X, where cc is one of the possible concords from Table 1 and X can be verb+ng, negative + verb+ng, tense + verb+ng, an adjective or a locative adverb.

Examples are:
yo o lelang (the one crying)
yo o tla lelang (the one that will cry)
yo o thata (the tough one)
yo o kwa godimo (the one on the top)
yo o sa leleng (the one not crying)

We have also derived constructs for possessives as follows:

cc + X, where cc is one of the possible concords from Table 1 and X can be a pronoun, noun, relative, possessive, adjective or demonstrative.

Examples are:
ya bone (theirs)
ya Thabo (Thabo's)
ya wa Thabo (it is for Thabo's wife, car, etc.)
ya ele (for that one)
ya yo moleele (for the tall one)

From these constructs we derived a finite state diagram, which is then represented as a two-dimensional array showing all the transitions from the time a concord is detected to the final state when a relative is positively identified. It has to be noted that the constructs above are not exhaustive. There are more complex indirect relatives and possessives. In this study we have looked at a few direct constructs to demonstrate the use of state machines for identification of compound Setswana parts of speech.

Figure 1. Compressed relative state diagram.

Figure 1 shows a compressed state diagram for identifying a relative part of speech. Starting at state 0, the machine waits for an appropriate relative concord to move to state 1, at which it can accept a few inputs to move to the accepting state 3. The diagram is compressed in that adverbs and adjectives are also state diagrams. In Setswana, descriptive parts of speech can be formed using other parts of speech, including other descriptive parts of speech. For example,

e e fa nokeng (the one at/by the river)

This example is a relative which is created using a locative adverb. e e is a relative concord for noun class 4 (Table 1) and fa nokeng (at/by the river) is a locative adverb. Therefore the whole phrase is classified as a relative. The equivalent state transition table is shown in Figure 2.

Figure 2. State transition table for relative state diagram in Figure 1.

The state table is implemented in Java as a 2D array. The numbers in the table cells indicate the next state. At each state the transition is determined by the next input. If the transition lands in an empty cell, the tokens read so far do not form a relative. We developed the possessive state diagram and transition table in the same way as for the relative.

Figure 3. Block diagram of proposed tagger.

3. PROPOSED PART OF SPEECH TAGGER
Based on the finite state transition tables derived in the section above, we developed a tool that detects and identifies a relative or a possessive in a sentence, as shown in Figure 3. The transition tables are implemented in Java as 2D arrays. Concords are stored in another 2D array, as shown in Table 1. The tool uses a dictionary and a morphological analyser. The dictionary has a list of tagged words. That is, it identifies whether a given token is a pronoun, noun, verb, enumerative, demonstrative or another part of speech. After identifying a valid concord the tool needs to classify the next token to determine the next state in the table. The dictionary does not contain all words in their different forms. The morphological analyser is therefore used to reduce words, which are then fed back into the dictionary for identification and classification.
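The 2D-array representation described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: the class name `RelativeFsm`, the constants and the `accepts` method are hypothetical, the cell values are read from the flattened transition-table figure, and the sketch assumes that an upstream step has already classified each token (with the two-token concord treated as a single cc input).

```java
/** Sketch only: the relative state machine as a 2D transition array. */
class RelativeFsm {
    // Input classes, used as column indices of the transition table.
    static final int CC = 0, VERB = 1, TENSE = 2, ADJECTIVE = 3, ADVERB = 4;
    static final int REJECT = -1; // stands for an empty cell in the table
    static final int ACCEPT = 3;  // accepting state: a relative has been identified

    // TRANSITIONS[state][input] gives the next state.
    static final int[][] TRANSITIONS = {
        //  cc      verb    tense   adjective  adverb
        {   1,      REJECT, REJECT, REJECT,    REJECT }, // state 0: waiting for a relative concord
        {   REJECT, 3,      2,      3,         3      }, // state 1: concord seen
        {   REJECT, 3,      REJECT, REJECT,    REJECT }, // state 2: tense marker seen, verb expected
    };

    /** Returns true if the input classes drive the machine from state 0 to the accepting state. */
    static boolean accepts(int[] inputs) {
        int state = 0;
        for (int input : inputs) {
            if (state == ACCEPT) {
                return false; // extra input after acceptance is not handled by this sketch
            }
            state = TRANSITIONS[state][input];
            if (state == REJECT) {
                return false; // landed in an empty cell: not a relative
            }
        }
        return state == ACCEPT;
    }
}
```

Under these assumptions, yo o tla lelang would arrive as the sequence cc, tense, verb and be accepted via states 1, 2 and 3, while yo o thata would arrive as cc, adjective and be accepted directly from state 1. Compound inputs such as locative adverbs would need their own sub-machines, which this sketch does not model.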
In Setswana, verbs can be modified into many forms, including intensive, passive, reciprocal, neuter, plural, reflexive and others. Nouns also take different forms, such as plural, locative and diminutive. Since the dictionary will not have all the words with their derivatives, we use the morphological analyser developed in [9] to help in classifying derived words. For example, given a token berekile (worked), the dictionary will only have the root form of the word, which is bereka (work). The morphological analyser will reduce berekile to bereka, which the dictionary will identify as a verb. The dictionary and the morphological analyser are actually implemented as one module. The tagger scans a sentence from left to right until it detects a possible start of a relative or possessive concord. Once that is detected, the tagger traverses the corresponding state table until it accepts or rejects the input.

State transition table for the relative state diagram (rows are current states, columns are input classes, entries are next states, empty cells reject):

State   cc   verb   tense   adjective   adverb
0       1
1            3      2       3           3
2            3

Block diagram of the proposed tagger: tokens are passed to the Part of Speech Detector, which uses the Dictionary and the Morphological Analyzer and outputs accept/reject.

4. PERFORMANCE RESULTS
The proposed tool was tested with a 10-page Setswana document. The results show an overall performance rate of 82%. From the text, 77 relatives were manually identified, of which 65 were identified by the tool, giving a performance rate of 84% for relatives. The tool could detect 97% of relatives but fails in some cases to detect where they stop. Also from the test text, 111 possessives were manually identified, of which 89 were correctly identified, giving a performance rate of 80%. The tool could detect 98% of possessives but fails in some cases to detect where they end. It has to be noted that these results are only for direct relatives and possessives and do not include indirect ones.

The results analysis shows that the tool fails to classify some words correctly due to limitations of the dictionary and the morphological analyser. We noticed that in some cases words could be classified as both adjectives and nouns, but our dictionaries in some cases have only one classification. It will be important to develop a tagged dictionary specifically for the purposes of tagging. The tool also fails to correctly identify the end of some relatives and possessives, especially the long ones which are made up of other parts of speech. A relative could be made up of other parts of speech such as locative adverbs. In our implementation, other compound parts of speech were not fully defined. The developed rules are limited in identifying complex constructs of these compound parts of speech.
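The reported rates can be checked from the counts given in this section. The overall figure below assumes that the 82% is the pooled rate over relatives and possessives together; the paper does not state how it was computed.

```latex
\[
\frac{65}{77} \approx 0.844, \qquad
\frac{89}{111} \approx 0.802, \qquad
\frac{65 + 89}{77 + 111} = \frac{154}{188} \approx 0.819
\]
```

Rounded to whole percentages these give 84%, 80% and 82%, which agrees with the figures quoted in the text.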
We found that in some cases these parts of speech are made up of three to five parts of speech in a recursive manner. That is, a possessive can be created using a relative, and that possessive can be created using a relative or some other part of speech.

5. CONCLUSIONS
The disjunctive orthography of the Setswana language makes it challenging to automate tokenization and part of speech tagging. We developed finite state diagrams for relatives and possessives and implemented them in Java. The tagger is supported by a dictionary and a morphological analyser. On the test data the tool detects shorter parts of speech well compared to longer ones. Overall the tool achieved an 82% identification rate. The tool can be further improved by improving the dictionary and the morphological analyser and by adding the other parts of speech that are used within relatives and possessives. The tagger relies heavily on the dictionary and morphological analyser to identify tokens. The approach could also be applied to other compound parts of speech in a similar way.

REFERENCES
[1] Charniak E (1997) "Statistical techniques for Natural Language parsing", AI Magazine, 18(4), pp. 33-44
[2] Brants T (2000) "A statistical part of speech tagger", PANCL'00 Proceedings of the sixth conference on applied natural language processing, Association for Computational Linguistics
[3] Brill E (1992) "A simple rule based part of speech tagger", In Proceedings of the third conference on Applied Natural Language Processing, ACL, Trento, Italy
[4] Brill E (1995) "Transformation Based Error-Driven Learning and Natural Language Processing: A case study in Part of Speech Tagging", Computational Linguistics
[5] Faaß G, Heid U, Taljard E & Prinsloo D (2009) "Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words", Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages (AfLaT 2009), pp. 38-45, Athens, Greece, 31 March 2009
[6] Pretorius L, Viljoen B, Pretorius R and Berg A (2009) "A finite state approach to Setswana verb morphology", International Workshop on Finite State Methods and Natural Language Processing, FSMNLP 2009: Finite State Methods and Natural Language Processing, pp. 131-138
[7] Cole D T, "An Introduction to Tswana grammar", Longmans and Green, Cape Town.
[8] Mogapi K, "Thuto Puo ya Setswana", Longman Botswana, 184, ISBN: 0582 619033
[9] Malema G, Motlogelwa N, Okgetheng B, Mogotlhwane O, "Setswana Verb Analyzer and Generator", International Journal of Computational Linguistics (IJCL), Vol 7, issue 1, 2016

AUTHORS
Dr. G. Malema is a Senior Lecturer at the Department of Computer Science, University of Botswana. He obtained his PhD in Computer Engineering in 2008 from The University of Adelaide. He has been working on automation tools for Setswana for the past 3 years.

Mr. Boago Okgetheng is an assistant researcher. He graduated with a BSc in Computer Science from the University of Botswana in 2015.

Mr. Moffat Motlhanka is currently a fourth-year Computer Science student at the University of Botswana.