SlideShare a Scribd company logo
1 of 4
Download to read offline
Bangla Documents Analyzer
Samina Azad1
, Mohammad Shamsul Arefin1
, A.N.M. Rezaul Karim2
1
Department of Computer Science and Engineering, Chittagong University of Engineering and Technology.
2
Department of Computer Science and Engineering, International Islamic University Chittagong.
e-mail: samin03@yahoo.com, sarefin@cuet.ac.bd, zakianaser@yahoo.com
ABSTRACT
This paper presents the implementation of a system
that analyzes Bangla documents to recognize its
grammatical structures. The system takes a Bangla
document as input. Then the system isolates all the
individual sentences and extracts the words from
each individual sentence eliminating the punctuation
marks. A collection of stored knowledge and rules
are applied to the words and sentences to detect
grammatical features of the individual sentences.
Finally the system performs different types of
analysis from different aspects such as word
detection, tense of verb recognition and types of
sentence determination.
KEY WORDS: word code, word detection, tense
detection, type detection.
1. INTRODUCTION
It was a great attempt of the modern computer
scientists to set up an intelligent machine that could
understand human languages. But the earlier
researches kept the attention only on English as well-
recognized international language. Recently some
great efforts have been made to process some other
languages. Bangla is a very rich language and
approximately 10% of the people in the world use to
speak in Bangla [1]. A most recent effort invoking
Bangla is a system that takes Bangla statements
written in English character set and displays it in
Bangla characters. Unfortunately any of these
systems is not able to provide sufficient aids for
transformation of Bangla language to any others.
Here, the need arises to understand the linguistic
distinctiveness of Bangla language. It is unconcerned
so far to have any application software that
grammatically analyzes Bangla documents for
notifying structural language attributes. In this paper
we are presenting a system that performs
morphological and syntactic analysis of Bangla
statements to signify some grammatical constructs
i.e. parts-of-speech, tense of verb, type of sentence
etc using artificial knowledge. This analysis is a
precondition for linguistic transformation.
2. BANGLA DOCUMENT ANALYSIS
The system breaks the text into easily perceivable
units and finally makes some decisions based on
simple heuristic knowledge. The system isolates the
sentences within it, detects each of the parts of
speeches, determines the roots of verbs, determines
types of verbs using roots, notifies the Specifiers and
Beevokties, finds out the numbers of nouns, detects
the persons i.e. first person, second person and third
person, decides the tense of verb of a sentence and
finally classifies the sentences in simple, complex
and compound through logical analysis.
For doing these tasks, first the system performs
morphological analysis of the sentences. Each word
is categorized in its own parts-of-speeches and finite
verb of the sentence is detected. This is done by
operating on the root of the verb. The root is
combined with some suffixes called – ‘Beevokty’ to
form the verb. The verb is then classified on the basis
of these suffixes. The verb detects the tense of the
sentence. After morphological analysis, the system
syntactically checks the sentence to discover its type.
2.1 Representation of Bangla characters
In this system Bangla characters are represented
using unicode conventions. The unicode are mapped
on the keyboard. For example- if the user presses the
key “k” the character will be displayed as ‘ ’. The
keys also represent KARs and FOLAs. For example-
if we type “a” it will be displayed as ‘ ’. Compound
Bangla characters are represented by combination of
keys. Some examples are given in Table-1.
Table-1: Examples of Compound Letters
Key Typed Unicode Display
jnHm
MitHr
AkHxr
SkHti
JYHja
2.2 Preprocessing of the text document
Sentences in the document are isolated and each
sentence is sent to the analyzer. A sentence ends with
punctuations like ‘| ’, ‘?’, ‘!’ etc. Words are then
segregated for morphological analysis. Each word is
terminated with a space or punctuation.
2.3 Morphological Analysis
In morphological analysis stage of natural language
processing - individual words are isolated and
analyzed into their categories and punctuations are
separated from the words [2].
2.3.1 Word detection
In this phase, individual words of the document are
classified in parts of speech invoking word detection
method. First, the whole word is searched in the
database. If the word is found then this method
terminates immediately by setting: detection=yes. If
the word is not found then any substring of the word
that matches some words in the database is detected
and the system checks if the prefix and the suffix of
the word are valid. The prefix is accounted as valid if
it is found in the list of Upasorgo. The suffix is valid
if it is a Modifier or any types of Beevokty or
Nirdeshok. The Nirdesok defines the number of the
nouns. Beevokty denotes the case of the word in the
sentence. There are two sorts of Beevokties – Nam
Beevokty and Kriya Beevokty. Nam Beevokty is
placed after nouns and Kriya Beevokty is placed
after verbs. Some examples [3] are listed in Table-2.
Table-2: Formats of ‘Beevokty’ and ‘Nirdesok’
The flowchart of the word detection module is given
in Figure-1.
Figure-1: The word detection module
Sometimes the word contains a modifier at the end as
a suffix to modify the original word as in -
etc.
The different formats of the prefix and suffix
combination are listed in Table-3.
Table-3: Different formats of suffix and prefix
2.3.2 Verb detection
Roots of the verbs are detected as described in the
sub-subsection 2.3.1. The only difference is - there
are some extra processing tasks. The roots of Bangla
verbs are classified into twenty classes. While
mapping the words the root is detected to signify its
class. Each class has its own list of Kriya Beevokty.
The classes of roots are listed in Table-4 [4].
Table-4: Classes of the Roots
After detecting the class of the root, the system
checks the list of beevokties only for that class using
a heuristic technique. If any Beevokty matches with
the suffix of that word then the system sets word
detection = yes.
2.3.3 Tense of verb detection
The flowchart of tense detection module is given in
Figure-2.
Figure-2: Tense of verb detection module
Our system detects the tense of verb while going
through the verb detection process. Each Beevokty in
the list has a tense code. When the system gets a
match between the suffix of the original word and an
element in the list of Kriya Beevokty, it immediately
collects the tense code of that element. For example:
if the original word is , the system first
checks the word in database and detects the root - .
This root is of class- and so the system will check
the list of Kriya Beevokty only for this class. Finally
the Kriya Beevokty is found to be . From the
predefined stored knowledge the system finally gives
decision that the sentence is in past continuous tense.
2.4 Syntactic Analysis
Syntactic analysis transforms linear word-sequences
into structure that shows how words are interrelated.
Word sequences may be rejected if they violate the
rules for how words may be combined [2].
2.4.1 Exclusion of ambiguity
We have represented each type of parts-of-speech by
assigning a code to the word. This code is unique for
each type of parts-of-speech. By checking this code
(Figure-3) we can easily identify the parts-of-speech.
Figure-3: Word Code Assignment Scheme
Because the same word in the database may be stored
as different types of parts-of-speech, the system
needs to invoke some extra efforts to eliminate the
ambiguity. This effort is gained by another detection
module, which tries to obtain the correct word code
using the interrelations of the words in the sentence.
For example: If the system gets the word ‘ ’, it
immediately checked in the database and detected as
as well as . So the word code (Figure-3)
will be 1+2=3 which is ambiguous shown in Table-5.
Table-5: Ambiguous words
To avoid this situation we apply some rules. Let, n is
the total number of words in the sentence and w[i] be
the words in the sentence where i=1, 2, 3...n. Then
word codes will be corrected using some rules as in
Table-6 and Table-7 considering different cases.
Table-6: Forward Scanning from Right-to-Left
Table-7: Reverse Scanning from Left-to-Right
2.4.2 Types of sentence detection
After all the words are detected unambiguously the
system scans the whole sentence and observes the
co-ordination of the words to determine its type.
There are three types of sentences [5]. They are:
simple sentences ( ), complex sentences
( ) and compound sentences ( ).
Simple sentence: The sentence will contain only one
finite verb ( ) and/or other non-finite
verbs ( ) i.e. only one word will have
the code=32. If there is any conjunction in the
sentence like , then the conjunction
will connect two words of similar types i.e. the code
for word[i-1] will be same as that for word[i+1]
where the conjunction is the ith
word of the sentence.
The conjunction in the sentence must not connect
two verbs i.e. two words of code 16 or 32. If the
sentence contains any comma, then the comma will
connect the words of the same code. The comma will
not connect two words of code 16 or 32.
Complex sentence: The sentence contains more than
one finite verbs and/or other non-finite verbs i.e. at
least two words with code=32 and may have some
other words with code=16. There will be no
conjunction in the sentence to connect two words as
in the simple sentence rather then joining two
clauses. The sentence may contain comma. The
comma conjunct two clauses, but the first clause will
have a transitive verb with no object in that clause
i.e. no word in the first clause will have code=1. If
the sentence contains pair of words like -
etc. then it will be complex sentence. If the sentence
encloses words like - etc then it
will also be complex sentence.
Compound sentence: The sentence contains more
than one finite verbs or other non-finite verbs i.e. at
least two words with code=32 and may have some
other words with code=16. The sentences can also
include some specific words such as
etc. It
can also have some conjunctions -
etc. but here connecting clauses rather then words i.e.
the code for word[i-1] will not be same as that for
word[i+1] where the conjunction is the ith
word of the
sentence. The conjunction in the sentence may
connect two verbs i.e. two words of word code 16 or
32. The sentence may contain comma to make
contact between two or more clauses rather than
words. If any clause includes a transitive verb then it
must also include an object of code=1.
3. PERFORMANCE ANALYSIS
To measure the accuracy of the system we have
incorporated some arbitrary sample documents and
different analyzing processes such as- word
detection, tense of verb determination and types of
sentence detection are applied on them. We also
detected all these outcomes manually to verify the
accuracy of the system. The performance of the
system is then considered by comparing these results
and is expressed in percentage.
3.1 The Word Detection Process
Words are searched in the database and if any
ambiguity found then rules in Table-6 are applied to
avoid them. The result is summarized in Figure-4.
Figure-4: Results of Word Detection
The performance can be improved by adding some
additional rules in the syntactic analysis of the
system.
3.2 The Tense Detection Process
The tense of verb is detected on the basis of the main
verb. The tense detection module presents an
accuracy of 98% with 15 samples. A portion of the
experiment is depicted in Figure-5.
Figure-5: Results of Tense Detection
3.3 The Type Detection Process
The types of sentences are detected with syntactic
analysis of the system. Performance can be improved
by adding syntactic rules. Results with 15 samples
are depicted in Figure-6 with an accuracy of 99%.
Figure-6: Evaluation of Type Detection
4. CONCLUSIONS
In our system Bangla characters are presented as
unicode characters. So, the system is machine
independent and need not to utilize any specialized
keyboard or font. Often it is very bothersome to type
the whole documents and here user can load existing
text documents from the system and operate on it.
The system is based on some knowledgebase and
some predefined rules. It will fail to detect a word if
the word is not in the database. In this regard, we
have enclosed some events to allow new entries in
the vocabulary. The user can add, delete and modify
the database. The performance can be improved by
updating the detection modules of the system.
References
[1] A. Tanvir, M. F. Zibran, R. Shammi and M.
A. Sattar, “Computer Representation of Bangla
Characters and Sorting of Bangla Words”, Proc.
of International Conference on Computer and
Information Technology, 27-28 December 2002.
[2] Rich, Night, “Artificial Intelligence”.
[3] M. Hoque, “Bangla Vashar Beyakoron and
Rochonaritee”.
[4] S. M. Chowdhury, S. M. H. Chowdhury, I.
Khalil, K. D. Muhammod and S. Lahiri, “Bangla
Vashar Beyakoron”.
[5] M. Asadujjaman and S.H. Rahman,
“Ucchotoro Bangla Beyakoron and Rochona”.

More Related Content

What's hot

KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONKANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
 
A Novel Approach for Rule Based Translation of English to Marathi
A Novel Approach for Rule Based Translation of English to MarathiA Novel Approach for Rule Based Translation of English to Marathi
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
 
Significance of error analysis
Significance of error analysisSignificance of error analysis
Significance of error analysisnirmeennimmu
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesijnlc
 
Discrete mathematics (topic)
Discrete mathematics (topic)Discrete mathematics (topic)
Discrete mathematics (topic)Mahfuzur Rahman
 
Development of learned dictionary based spoken language
Development of learned dictionary based spoken languageDevelopment of learned dictionary based spoken language
Development of learned dictionary based spoken languagePallavi Bharti
 
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONijnlc
 
Word sense disambiguation a survey
Word sense disambiguation  a surveyWord sense disambiguation  a survey
Word sense disambiguation a surveyijctcm
 
Division_3_Fianna_O'Brien
Division_3_Fianna_O'BrienDivision_3_Fianna_O'Brien
Division_3_Fianna_O'BrienFianna O'Brien
 
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONEFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONIJDKP
 
Permutations and combinations ppt
Permutations and combinations pptPermutations and combinations ppt
Permutations and combinations pptPriya !!!
 
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEkevig
 

What's hot (20)

KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONKANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
 
D2 anandkumar
D2 anandkumarD2 anandkumar
D2 anandkumar
 
Anandkumar novel approach
Anandkumar novel approachAnandkumar novel approach
Anandkumar novel approach
 
C7 agramakirshnan2
C7 agramakirshnan2C7 agramakirshnan2
C7 agramakirshnan2
 
A Novel Approach for Rule Based Translation of English to Marathi
A Novel Approach for Rule Based Translation of English to MarathiA Novel Approach for Rule Based Translation of English to Marathi
A Novel Approach for Rule Based Translation of English to Marathi
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
Significance of error analysis
Significance of error analysisSignificance of error analysis
Significance of error analysis
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
E1 geetha2 karthikeyan
E1 geetha2 karthikeyanE1 geetha2 karthikeyan
E1 geetha2 karthikeyan
 
Discrete mathematics (topic)
Discrete mathematics (topic)Discrete mathematics (topic)
Discrete mathematics (topic)
 
Development of learned dictionary based spoken language
Development of learned dictionary based spoken languageDevelopment of learned dictionary based spoken language
Development of learned dictionary based spoken language
 
I1 geetha3 revathi
I1 geetha3 revathiI1 geetha3 revathi
I1 geetha3 revathi
 
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...
 
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATIONIjnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Ijnlc020306NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
 
Word sense disambiguation a survey
Word sense disambiguation  a surveyWord sense disambiguation  a survey
Word sense disambiguation a survey
 
Division_3_Fianna_O'Brien
Division_3_Fianna_O'BrienDivision_3_Fianna_O'Brien
Division_3_Fianna_O'Brien
 
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONEFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
 
Permutations and combinations ppt
Permutations and combinations pptPermutations and combinations ppt
Permutations and combinations ppt
 
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 

Viewers also liked

Tiens blood pressure controller
Tiens blood pressure controllerTiens blood pressure controller
Tiens blood pressure controllerSartaj Ahmad
 
Quantum resonance magnetic analyzer presentation dr kamaljit singh
Quantum resonance magnetic analyzer presentation dr kamaljit singhQuantum resonance magnetic analyzer presentation dr kamaljit singh
Quantum resonance magnetic analyzer presentation dr kamaljit singhDr Kamaljit Singh
 
tiens international,network marketing,business
tiens international,network marketing,businesstiens international,network marketing,business
tiens international,network marketing,businessariniazi
 
Tiens catalogue eng
Tiens catalogue engTiens catalogue eng
Tiens catalogue engSILVIA S
 
TIENS Products Cataloge
TIENS Products CatalogeTIENS Products Cataloge
TIENS Products CatalogeQaiser Rafique
 
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of BangladeshNice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh28051971
 
Health Is Wealth By Debroah Ferrari
Health Is Wealth By Debroah FerrariHealth Is Wealth By Debroah Ferrari
Health Is Wealth By Debroah FerrariDeborah Ferrari
 
Health is Wealth (TIENS OPP)
Health is Wealth (TIENS OPP)Health is Wealth (TIENS OPP)
Health is Wealth (TIENS OPP)Darius Topacio
 
Health Powerpoint
Health PowerpointHealth Powerpoint
Health Powerpointpasch1kb
 

Viewers also liked (16)

Bop english
Bop englishBop english
Bop english
 
Antilip Tea
Antilip TeaAntilip Tea
Antilip Tea
 
Tiens blood pressure controller
Tiens blood pressure controllerTiens blood pressure controller
Tiens blood pressure controller
 
Quantum resonance magnetic analyzer presentation dr kamaljit singh
Quantum resonance magnetic analyzer presentation dr kamaljit singhQuantum resonance magnetic analyzer presentation dr kamaljit singh
Quantum resonance magnetic analyzer presentation dr kamaljit singh
 
tiens international,network marketing,business
tiens international,network marketing,businesstiens international,network marketing,business
tiens international,network marketing,business
 
Tiens catalogue eng
Tiens catalogue engTiens catalogue eng
Tiens catalogue eng
 
Tiens Profiles
Tiens ProfilesTiens Profiles
Tiens Profiles
 
TIENS Toothpaste
TIENS ToothpasteTIENS Toothpaste
TIENS Toothpaste
 
Slimming Tea
Slimming TeaSlimming Tea
Slimming Tea
 
TIENS Products Cataloge
TIENS Products CatalogeTIENS Products Cataloge
TIENS Products Cataloge
 
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of BangladeshNice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh
Nice Tiens Open Plan presentation By Shimol Huda, 8 Star Leader of Bangladesh
 
Health Is Wealth By Debroah Ferrari
Health Is Wealth By Debroah FerrariHealth Is Wealth By Debroah Ferrari
Health Is Wealth By Debroah Ferrari
 
Health is Wealth (TIENS OPP)
Health is Wealth (TIENS OPP)Health is Wealth (TIENS OPP)
Health is Wealth (TIENS OPP)
 
Attitude Is Everything For Success
Attitude Is Everything For SuccessAttitude Is Everything For Success
Attitude Is Everything For Success
 
8 Stps To Success
8 Stps To Success8 Stps To Success
8 Stps To Success
 
Health Powerpoint
Health PowerpointHealth Powerpoint
Health Powerpoint
 

Similar to Bangla Document Analyzer Detects Grammatical Structures

Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPAndi Wu
 
Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis
Automatic Generation of Compound Word Lexicon for Marathi Speech SynthesisAutomatic Generation of Compound Word Lexicon for Marathi Speech Synthesis
Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesisiosrjce
 
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...ijctcm
 
A Corpus-Based Concatenative Speech Synthesis System for Marathi
A Corpus-Based Concatenative Speech Synthesis System for MarathiA Corpus-Based Concatenative Speech Synthesis System for Marathi
A Corpus-Based Concatenative Speech Synthesis System for Marathiiosrjce
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcsandit
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcscpconf
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesIJERA Editor
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesAlgoscale Technologies Inc.
 
Emotion Detection from Text
Emotion Detection from TextEmotion Detection from Text
Emotion Detection from TextIJERD Editor
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)ijnlc
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466IJRAT
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Taggingkevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI) International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI) inventionjournals
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 

Similar to Bangla Document Analyzer Detects Grammatical Structures (20)

Chinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLPChinese Word Segmentation in MSR-NLP
Chinese Word Segmentation in MSR-NLP
 
Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis
Automatic Generation of Compound Word Lexicon for Marathi Speech SynthesisAutomatic Generation of Compound Word Lexicon for Marathi Speech Synthesis
Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis
 
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A...
 
A Corpus-Based Concatenative Speech Synthesis System for Marathi
A Corpus-Based Concatenative Speech Synthesis System for MarathiA Corpus-Based Concatenative Speech Synthesis System for Marathi
A Corpus-Based Concatenative Speech Synthesis System for Marathi
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar Rules
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
 
Emotion Detection from Text
Emotion Detection from TextEmotion Detection from Text
Emotion Detection from Text
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Pxc3898474
Pxc3898474Pxc3898474
Pxc3898474
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI) International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 

Bangla Document Analyzer Detects Grammatical Structures

  • 1. Bangla Documents Analyzer Samina Azad1 , Mohammad Shamsul Arefin1 , A.N.M. Rezaul Karim2 1 Department of Computer Science and Engineering, Chittagong University of Engineering and Technology. 2 Department of Computer Science and Engineering, International Islamic University Chittagong. e-mail: samin03@yahoo.com, sarefin@cuet.ac.bd, zakianaser@yahoo.com ABSTRACT This paper presents the implementation of a system that analyzes Bangla documents to recognize its grammatical structures. The system takes a Bangla document as input. Then the system isolates all the individual sentences and extracts the words from each individual sentence eliminating the punctuation marks. A collection of stored knowledge and rules are applied to the words and sentences to detect grammatical features of the individual sentences. Finally the system performs different types of analysis from different aspects such as word detection, tense of verb recognition and types of sentence determination. KEY WORDS: word code, word detection, tense detection, type detection. 1. INTRODUCTION It was a great attempt of the modern computer scientists to set up an intelligent machine that could understand human languages. But the earlier researches kept the attention only on English as well- recognized international language. Recently some great efforts have been made to process some other languages. Bangla is a very rich language and approximately 10% of the people in the world use to speak in Bangla [1]. A most recent effort invoking Bangla is a system that takes Bangla statements written in English character set and displays it in Bangla characters. Unfortunately any of these systems is not able to provide sufficient aids for transformation of Bangla language to any others. Here, the need arises to understand the linguistic distinctiveness of Bangla language. It is unconcerned so far to have any application software that grammatically analyzes Bangla documents for notifying structural language attributes. In this paper we are presenting a system that performs morphological and syntactic analysis of Bangla statements to signify some grammatical constructs i.e. parts-of-speech, tense of verb, type of sentence etc using artificial knowledge. This analysis is a precondition for linguistic transformation. 2. BANGLA DOCUMENT ANALYSIS The system breaks the text into easily perceivable units and finally makes some decisions based on simple heuristic knowledge. The system isolates the sentences within it, detects each of the parts of speeches, determines the roots of verbs, determines types of verbs using roots, notifies the Specifiers and Beevokties, finds out the numbers of nouns, detects the persons i.e. first person, second person and third person, decides the tense of verb of a sentence and finally classifies the sentences in simple, complex and compound through logical analysis. For doing these tasks, first the system performs morphological analysis of the sentences. Each word is categorized in its own parts-of-speeches and finite verb of the sentence is detected. This is done by operating on the root of the verb. The root is combined with some suffixes called – ‘Beevokty’ to form the verb. The verb is then classified on the basis of these suffixes. The verb detects the tense of the sentence. After morphological analysis, the system syntactically checks the sentence to discover its type. 2.1 Representation of Bangla characters In this system Bangla characters are represented using unicode conventions. The unicode are mapped on the keyboard. For example- if the user presses the key “k” the character will be displayed as ‘ ’. The keys also represent KARs and FOLAs. For example- if we type “a” it will be displayed as ‘ ’. Compound Bangla characters are represented by combination of keys. Some examples are given in Table-1. Table-1: Examples of Compound Letters Key Typed Unicode Display jnHm MitHr AkHxr SkHti JYHja 2.2 Preprocessing of the text document Sentences in the document are isolated and each sentence is sent to the analyzer. A sentence ends with punctuations like ‘| ’, ‘?’, ‘!’ etc. Words are then segregated for morphological analysis. Each word is terminated with a space or punctuation. 2.3 Morphological Analysis In morphological analysis stage of natural language processing - individual words are isolated and analyzed into their categories and punctuations are separated from the words [2].
  • 2. 2.3.1 Word detection In this phase, individual words of the document are classified in parts of speech invoking word detection method. First, the whole word is searched in the database. If the word is found then this method terminates immediately by setting: detection=yes. If the word is not found then any substring of the word that matches some words in the database is detected and the system checks if the prefix and the suffix of the word are valid. The prefix is accounted as valid if it is found in the list of Upasorgo. The suffix is valid if it is a Modifier or any types of Beevokty or Nirdeshok. The Nirdesok defines the number of the nouns. Beevokty denotes the case of the word in the sentence. There are two sorts of Beevokties – Nam Beevokty and Kriya Beevokty. Nam Beevokty is placed after nouns and Kriya Beevokty is placed after verbs. Some examples [3] are listed in Table-2. Table-2: Formats of ‘Beevokty’ and ‘Nirdesok’ The flowchart of the word detection module is given in Figure-1. Figure-1: The word detection module Sometimes the word contains a modifier at the end as a suffix to modify the original word as in - etc. The different formats of the prefix and suffix combination are listed in Table-3. Table-3: Different formats of suffix and prefix 2.3.2 Verb detection Roots of the verbs are detected as described in the sub-subsection 2.3.1. The only difference is - there are some extra processing tasks. The roots of Bangla verbs are classified into twenty classes. While mapping the words the root is detected to signify its class. Each class has its own list of Kriya Beevokty. The classes of roots are listed in Table-4 [4]. Table-4: Classes of the Roots After detecting the class of the root, the system checks the list of beevokties only for that class using a heuristic technique. If any Beevokty matches with the suffix of that word then the system sets word detection = yes. 2.3.3 Tense of verb detection The flowchart of tense detection module is given in Figure-2. Figure-2: Tense of verb detection module
  • 3. Our system detects the tense of verb while going through the verb detection process. Each Beevokty in the list has a tense code. When the system gets a match between the suffix of the original word and an element in the list of Kriya Beevokty, it immediately collects the tense code of that element. For example: if the original word is , the system first checks the word in database and detects the root - . This root is of class- and so the system will check the list of Kriya Beevokty only for this class. Finally the Kriya Beevokty is found to be . From the predefined stored knowledge the system finally gives decision that the sentence is in past continuous tense. 2.4 Syntactic Analysis Syntactic analysis transforms linear word-sequences into structure that shows how words are interrelated. Word sequences may be rejected if they violate the rules for how words may be combined [2]. 2.4.1 Exclusion of ambiguity We have represented each type of parts-of-speech by assigning a code to the word. This code is unique for each type of parts-of-speech. By checking this code (Figure-3) we can easily identify the parts-of-speech. Figure-3: Word Code Assignment Scheme Because the same word in the database may be stored as different types of parts-of-speech, the system needs to invoke some extra efforts to eliminate the ambiguity. This effort is gained by another detection module, which tries to obtain the correct word code using the interrelations of the words in the sentence. For example: If the system gets the word ‘ ’, it immediately checked in the database and detected as as well as . So the word code (Figure-3) will be 1+2=3 which is ambiguous shown in Table-5. Table-5: Ambiguous words To avoid this situation we apply some rules. Let, n is the total number of words in the sentence and w[i] be the words in the sentence where i=1, 2, 3...n. Then word codes will be corrected using some rules as in Table-6 and Table-7 considering different cases. Table-6: Forward Scanning from Right-to-Left Table-7: Reverse Scanning from Left-to-Right 2.4.2 Types of sentence detection After all the words are detected unambiguously the system scans the whole sentence and observes the co-ordination of the words to determine its type. There are three types of sentences [5]. They are: simple sentences ( ), complex sentences ( ) and compound sentences ( ). Simple sentence: The sentence will contain only one finite verb ( ) and/or other non-finite verbs ( ) i.e. only one word will have the code=32. If there is any conjunction in the sentence like , then the conjunction will connect two words of similar types i.e. the code for word[i-1] will be same as that for word[i+1] where the conjunction is the ith word of the sentence. The conjunction in the sentence must not connect two verbs i.e. two words of code 16 or 32. If the sentence contains any comma, then the comma will connect the words of the same code. The comma will not connect two words of code 16 or 32. Complex sentence: The sentence contains more than one finite verbs and/or other non-finite verbs i.e. at least two words with code=32 and may have some other words with code=16. There will be no conjunction in the sentence to connect two words as in the simple sentence rather then joining two clauses. The sentence may contain comma. The comma conjunct two clauses, but the first clause will have a transitive verb with no object in that clause i.e. no word in the first clause will have code=1. If
  • 4. the sentence contains pair of words like - etc. then it will be complex sentence. If the sentence encloses words like - etc then it will also be complex sentence. Compound sentence: The sentence contains more than one finite verbs or other non-finite verbs i.e. at least two words with code=32 and may have some other words with code=16. The sentences can also include some specific words such as etc. It can also have some conjunctions - etc. but here connecting clauses rather then words i.e. the code for word[i-1] will not be same as that for word[i+1] where the conjunction is the ith word of the sentence. The conjunction in the sentence may connect two verbs i.e. two words of word code 16 or 32. The sentence may contain comma to make contact between two or more clauses rather than words. If any clause includes a transitive verb then it must also include an object of code=1. 3. PERFORMANCE ANALYSIS To measure the accuracy of the system we have incorporated some arbitrary sample documents and different analyzing processes such as- word detection, tense of verb determination and types of sentence detection are applied on them. We also detected all these outcomes manually to verify the accuracy of the system. The performance of the system is then considered by comparing these results and is expressed in percentage. 3.1 The Word Detection Process Words are searched in the database and if any ambiguity found then rules in Table-6 are applied to avoid them. The result is summarized in Figure-4. Figure-4: Results of Word Detection The performance can be improved by adding some additional rules in the syntactic analysis of the system. 3.2 The Tense Detection Process The tense of verb is detected on the basis of the main verb. The tense detection module presents an accuracy of 98% with 15 samples. A portion of the experiment is depicted in Figure-5. Figure-5: Results of Tense Detection 3.3 The Type Detection Process The types of sentences are detected with syntactic analysis of the system. Performance can be improved by adding syntactic rules. Results with 15 samples are depicted in Figure-6 with an accuracy of 99%. Figure-6: Evaluation of Type Detection 4. CONCLUSIONS In our system Bangla characters are presented as unicode characters. So, the system is machine independent and need not to utilize any specialized keyboard or font. Often it is very bothersome to type the whole documents and here user can load existing text documents from the system and operate on it. The system is based on some knowledgebase and some predefined rules. It will fail to detect a word if the word is not in the database. In this regard, we have enclosed some events to allow new entries in the vocabulary. The user can add, delete and modify the database. The performance can be improved by updating the detection modules of the system. References [1] A. Tanvir, M. F. Zibran, R. Shammi and M. A. Sattar, “Computer Representation of Bangla Characters and Sorting of Bangla Words”, Proc. of International Conference on Computer and Information Technology, 27-28 December 2002. [2] Rich, Night, “Artificial Intelligence”. [3] M. Hoque, “Bangla Vashar Beyakoron and Rochonaritee”. [4] S. M. Chowdhury, S. M. H. Chowdhury, I. Khalil, K. D. Muhammod and S. Lahiri, “Bangla Vashar Beyakoron”. [5] M. Asadujjaman and S.H. Rahman, “Ucchotoro Bangla Beyakoron and Rochona”.