1. Text Processing Framework for Hindi
Project ID : 2
Team Number -51
Mentor : Pulkit Parikh
Members:
Deepanshu Vijay (201302093)
Utsav Chokshi (201505581)
Arushi Dogra (201302084)
2. Problem Statement :
● The goal of this project is to develop a Text Processing
framework for Hindi.
● The framework that is required to develop should include all the
basic text processing algorithms such as, Stop word detection,
Tokenization, Sentence Breaker, POS Tagging, Key Concepts
Identification, Entity Recognition and Categorization.
3. Tasks Division
● There were three deliverables:
○ Phase1 : Scope Document
○ Phase2
○ Phase3
● All the deliverables are in the github repository.
5. Phase 3
● Named Entity Recognition
● Categorization of given Hindi Documents
6. Dataset
● We were provided with the dataset which contains hindi
paragraphs crawled from different URLs.
● Link to the Dataset :
https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
7. Tokenization
● Tokenization is the process of breaking a stream of text up into
words,phrases,symbols or other meaningful elements called
Tokens.
● We have done tokenisation by using nltk toolkit, regex-based
tokenization and also by using the indic-tokenizer(LTRC)
8. Tokenization Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी।
● Tokens :
{सूरज,की,ककरणें,अब,पहाड़ों,पर,खाली,हाथ,नहीं,आएगी,बेरोजगारी,के ,ददद,के ,कराहते,लोग़ों,के ,कलए,ये,ककरणें,अ
पने,साथ,बेगारी,का,इलाज,लेकर,आएगी}
10. Sentence Breaker Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी
डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की
है। इस योजना के जररये पहाड़ों पर रहने वालो को सरकार 5 ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90
फीसद अनुदान देगी। “
11. Sentence Breaker Example
● Sentences Generated :
○ सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी
○ बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी
○ इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पवदतीय इलाक़ों
के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है
○ इस योजना के जररये पहाड़ों पर रहने वालो को सरकार ककलोवाट तक पॉवर स्टेशन बनाने के कलए
90 फीसद अनुदान देगी
12. Stop-Word Detection
● Stop Words are the most frequent occurring words in the
corpus.
● These are basically the function words.
● Our approach for identifying stop words was first finding the
unigrams and maintain their counts.
● We Sorted the Unigrams based on their counts and then
depending on the size of the dataset top k( we took k as 50)
unigrams were detected as stop-words.
13. Stemmer
● Rule based stemmer is built.
● Paper followed :
http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-
Ramanathan.pdf
14. Identifying Variations
● Two steps were followed for this :-
○ Checking the soundex code generated by the tokens or the
transliteration of the tokens to english
○ Checking if the words are synonyms (wordnet)
● If both the conditions mentioned above are true, then the words
are similar.
● Soundex and transliteration tools were taken from Libindic .
15. Example
● The input is given as two words and output tells whether they are same or
not.
● This can also be used to list all the same words in the corpus.
● चम्पक and चंपक have same soundex/transliteration and also they have the
same meaning , so the words are same.
16. POS Tagging
● POS(Part-Of-Speech) Tagging is process of assigning each
word of sentence a POS tag from predefined set.
● Challenge : Resolving Lexical Ambiguity
○ Tagging of text is a complex task as many times we get words which
have different tag categories as they are used in different context.
● Solution :
○ This problem can be resolved by looking at the word/tag
combinations of the surrounding words with respect to the
ambiguous word.
17. POS Tagging : Example
● For POS Tagging, we have implemented two probabilistic
models :
○ Hidden Markov Model(HMM)
○ Conditional Random Fields(CRF)
● POS Tagging Example
{ सूरज -> NN,की -> PSP ,ककरणें -> NN,अब -> PRP,पहाड़ों -> NN ,
पर -> PSP, खाली -> JJ, हाथ -> NN,नहीं -> NEG,आएगी -> V}
● We have used BIS tagset for POS tagging.
18. POS Tagging : HMM
● A POS tagger based on HMM assigns the best tag to a word by
calculating the forward and backward probabilities of tags along
with the sequence provided as an input.
● Probability for particular tag-word combination is calculated
using following formula :
● P(ti|t-1) : Prob of current tag given the previous tag
P(ti+1|ti) : Prob of future tag given the current tag
P(wi|ti) : Prob of word given the current tag
19. POS Tagging : CRF
● A POS tagger based on CRF assigns the best tag to a word by
considering unigram and bigram features defined in template.
● Sample template file :
20. Named Entity Recognition
● Named Entity Recognition is process of identifying named
entities from given paragraph (group of sentences).
● Named entities are mainly nouns which belong to certain
categories like persons, places, organizations, numerical
quantities, expressions of times etc.
● NER Example:
○ {सूरज -> ENAMEX_LOCATION_B,
हाथ -> ENAMEX_LIVTHINGS_B}
21. NER : CRF
● For NER, probabilistic model Conditional Random Fields(CRF) is
used.
● We have implemented NER using CRF++ toolkit and tested with
different template of features.
● Generally, NER is followed by POS Tagging and hence we have
used POS Tags as features along with words.
23. Categorization of given Documents
● Three classification techniques are followed :
○ KNN based classifier
○ Naive bayes classifier
○ SVM Based Classifier
● The data set is divided in the ratio of 80:20 (training : testing)
● Only those entries of the dataset which have mcategory are
considered and the corresponding value is the tag assigned.
24. KNN based classifier
● The text with mcategory given is taken into consideration
● The similarity of the the two text documents is calculated by
using the tf-idf score.
● K most similar documents to the test document are considered
and the tag which is majorly in the k documents is assigned to
the test document.
● Accuracy : 89% (k=9)
25. Naive Bayes Classifier
● The text with mcategory given is taken into consideration.
● Naive bayes classifiers are simple probabilistic classifiers
based on applying Bayes theorem with strong independence
assumption b/w the features.
● Accuracy : 92%
26. SVM based Classifier
● The text with “mcategory” given is taken into consideration.
○ “Text” : Feature
○ “mCategory” : Class-label.
● SVM based classifier are non-probabilistic binary linear
classifiers.
● As we have more than two categories [Multi-Class
classification], we have used OneVsRestClassifier and then
selected category which gets maximum decision function
value.
● Accuracy : 90%
27. Tools Used
● We have used ‘PYTHON’ for implementing all modules.
● In python we have mainly used following libraries :
○ NLTK
○ SKLEARN
○ INDIC
○ SOUNDEX
○ TextBlob
● Along with python, we have used linux based toolkit CRF++.
28. Challenges
● For transliteration and Soundex the cmu mapping dictionaries are
considered which only map one alphabet to just 1 other alphabet which
might not be the case always as anusvara maps to more than one
alphabet. So wrong codes are generated sometimes.
● Sometimes the wordnet dictionaries are not up-to-date.
● We have applied rule-based stemming but rules don’t apply every time and
hence the word is stemmed to wrong root word.
● For NER, we have choice of different sets of features. Each feature set was
giving different accuracy. So we need to select the best and minimal
feature set out of it.
29. Applications
● Efficient retrieval of hindi language resources
● Annotating/Categorizing hindi documents
● Building responsive UIs in hindi language