SlideShare a Scribd company logo
1 of 30
Text Processing Framework for Hindi
Project ID : 2
Team Number -51
Mentor : Pulkit Parikh
Members:
Deepanshu Vijay (201302093)
Utsav Chokshi (201505581)
Arushi Dogra (201302084)
Problem Statement :
● The goal of this project is to develop a Text Processing
framework for Hindi.
● The framework that is required to develop should include all the
basic text processing algorithms such as, Stop word detection,
Tokenization, Sentence Breaker, POS Tagging, Key Concepts
Identification, Entity Recognition and Categorization.
Tasks Division
● There were three deliverables:
○ Phase1 : Scope Document
○ Phase2
○ Phase3
● All the deliverables are in the github repository.
Phase 2
● Parsing
● Tokenization
● Sentence Breaker
● Stop Word Detection
● Identifying Variations
● POS Tagging
Phase 3
● Named Entity Recognition
● Categorization of given Hindi Documents
Dataset
● We were provided with the dataset which contains hindi
paragraphs crawled from different URLs.
● Link to the Dataset :
https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
Tokenization
● Tokenization is the process of breaking a stream of text up into
words,phrases,symbols or other meaningful elements called
Tokens.
● We have done tokenisation by using nltk toolkit, regex-based
tokenization and also by using the indic-tokenizer(LTRC)
Tokenization Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी।
● Tokens :
{सूरज,की,ककरणें,अब,पहाड़ों,पर,खाली,हाथ,नहीं,आएगी,बेरोजगारी,के ,ददद,के ,कराहते,लोग़ों,के ,कलए,ये,ककरणें,अ
पने,साथ,बेगारी,का,इलाज,लेकर,आएगी}
Sentence Breaker
● This module involves separating the sentences from a
paragraph text.
Sentence Breaker Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी
डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की
है। इस योजना के जररये पहाड़ों पर रहने वालो को सरकार 5 ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90
फीसद अनुदान देगी। “
Sentence Breaker Example
● Sentences Generated :
○ सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी
○ बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी
○ इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पवदतीय इलाक़ों
के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है
○ इस योजना के जररये पहाड़ों पर रहने वालो को सरकार ककलोवाट तक पॉवर स्टेशन बनाने के कलए
90 फीसद अनुदान देगी
Stop-Word Detection
● Stop Words are the most frequent occurring words in the
corpus.
● These are basically the function words.
● Our approach for identifying stop words was first finding the
unigrams and maintain their counts.
● We Sorted the Unigrams based on their counts and then
depending on the size of the dataset top k( we took k as 50)
unigrams were detected as stop-words.
Stemmer
● Rule based stemmer is built.
● Paper followed :
http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-
Ramanathan.pdf
Identifying Variations
● Two steps were followed for this :-
○ Checking the soundex code generated by the tokens or the
transliteration of the tokens to english
○ Checking if the words are synonyms (wordnet)
● If both the conditions mentioned above are true, then the words
are similar.
● Soundex and transliteration tools were taken from Libindic .
Example
● The input is given as two words and output tells whether they are same or
not.
● This can also be used to list all the same words in the corpus.
● चम्पक and चंपक have same soundex/transliteration and also they have the
same meaning , so the words are same.
POS Tagging
● POS(Part-Of-Speech) Tagging is process of assigning each
word of sentence a POS tag from predefined set.
● Challenge : Resolving Lexical Ambiguity
○ Tagging of text is a complex task as many times we get words which
have different tag categories as they are used in different context.
● Solution :
○ This problem can be resolved by looking at the word/tag
combinations of the surrounding words with respect to the
ambiguous word.
POS Tagging : Example
● For POS Tagging, we have implemented two probabilistic
models :
○ Hidden Markov Model(HMM)
○ Conditional Random Fields(CRF)
● POS Tagging Example
{ सूरज -> NN,की -> PSP ,ककरणें -> NN,अब -> PRP,पहाड़ों -> NN ,
पर -> PSP, खाली -> JJ, हाथ -> NN,नहीं -> NEG,आएगी -> V}
● We have used BIS tagset for POS tagging.
POS Tagging : HMM
● A POS tagger based on HMM assigns the best tag to a word by
calculating the forward and backward probabilities of tags along
with the sequence provided as an input.
● Probability for particular tag-word combination is calculated
using following formula :
● P(ti|t-1) : Prob of current tag given the previous tag
P(ti+1|ti) : Prob of future tag given the current tag
P(wi|ti) : Prob of word given the current tag
POS Tagging : CRF
● A POS tagger based on CRF assigns the best tag to a word by
considering unigram and bigram features defined in template.
● Sample template file :
Named Entity Recognition
● Named Entity Recognition is process of identifying named
entities from given paragraph (group of sentences).
● Named entities are mainly nouns which belong to certain
categories like persons, places, organizations, numerical
quantities, expressions of times etc.
● NER Example:
○ {सूरज -> ENAMEX_LOCATION_B,
हाथ -> ENAMEX_LIVTHINGS_B}
NER : CRF
● For NER, probabilistic model Conditional Random Fields(CRF) is
used.
● We have implemented NER using CRF++ toolkit and tested with
different template of features.
● Generally, NER is followed by POS Tagging and hence we have
used POS Tags as features along with words.
NER : CRF : Template Examples
Categorization of given Documents
● Three classification techniques are followed :
○ KNN based classifier
○ Naive bayes classifier
○ SVM Based Classifier
● The data set is divided in the ratio of 80:20 (training : testing)
● Only those entries of the dataset which have mcategory are
considered and the corresponding value is the tag assigned.
KNN based classifier
● The text with mcategory given is taken into consideration
● The similarity of the the two text documents is calculated by
using the tf-idf score.
● K most similar documents to the test document are considered
and the tag which is majorly in the k documents is assigned to
the test document.
● Accuracy : 89% (k=9)
Naive Bayes Classifier
● The text with mcategory given is taken into consideration.
● Naive bayes classifiers are simple probabilistic classifiers
based on applying Bayes theorem with strong independence
assumption b/w the features.
● Accuracy : 92%
SVM based Classifier
● The text with “mcategory” given is taken into consideration.
○ “Text” : Feature
○ “mCategory” : Class-label.
● SVM based classifier are non-probabilistic binary linear
classifiers.
● As we have more than two categories [Multi-Class
classification], we have used OneVsRestClassifier and then
selected category which gets maximum decision function
value.
● Accuracy : 90%
Tools Used
● We have used ‘PYTHON’ for implementing all modules.
● In python we have mainly used following libraries :
○ NLTK
○ SKLEARN
○ INDIC
○ SOUNDEX
○ TextBlob
● Along with python, we have used linux based toolkit CRF++.
Challenges
● For transliteration and Soundex the cmu mapping dictionaries are
considered which only map one alphabet to just 1 other alphabet which
might not be the case always as anusvara maps to more than one
alphabet. So wrong codes are generated sometimes.
● Sometimes the wordnet dictionaries are not up-to-date.
● We have applied rule-based stemming but rules don’t apply every time and
hence the word is stemmed to wrong root word.
● For NER, we have choice of different sets of features. Each feature set was
giving different accuracy. So we need to select the best and minimal
feature set out of it.
Applications
● Efficient retrieval of hindi language resources
● Annotating/Categorizing hindi documents
● Building responsive UIs in hindi language
References
● http://airccj.org/CSCP/vol3/csit3639.pdf [POSTagger]
● http://talukdar.net/papers/KBCS04_HPL-1.pdf [Tokenization]
● https://sanchay.co.in/papers/cpms-long-iwlc-06.pdf [Identifying Variations]
● http://www.enggjournals.com/ijcse/doc/IJCSE12-04-05-213.pdf [Entity
Recognition]

More Related Content

What's hot

Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
kperi
 

What's hot (20)

Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
 
Text summarization
Text summarizationText summarization
Text summarization
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
Text Classification
Text ClassificationText Classification
Text Classification
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
[ppt]
[ppt][ppt]
[ppt]
 

Similar to Text Processing Framework for Hindi

Document Summarization
Document SummarizationDocument Summarization
Document Summarization
Pratik Kumar
 
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
QuantInsti
 

Similar to Text Processing Framework for Hindi (20)

End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
ICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptxICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptx
 
Classification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenClassification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or Evergreen
 
Document Summarization
Document SummarizationDocument Summarization
Document Summarization
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
 
Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
Tensorflow
TensorflowTensorflow
Tensorflow
 
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overview
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorial
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 

Recently uploaded

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
ssuserdda66b
 

Recently uploaded (20)

Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 

Text Processing Framework for Hindi

  • 1. Text Processing Framework for Hindi Project ID : 2 Team Number -51 Mentor : Pulkit Parikh Members: Deepanshu Vijay (201302093) Utsav Chokshi (201505581) Arushi Dogra (201302084)
  • 2. Problem Statement : ● The goal of this project is to develop a Text Processing framework for Hindi. ● The framework that is required to develop should include all the basic text processing algorithms such as, Stop word detection, Tokenization, Sentence Breaker, POS Tagging, Key Concepts Identification, Entity Recognition and Categorization.
  • 3. Tasks Division ● There were three deliverables: ○ Phase1 : Scope Document ○ Phase2 ○ Phase3 ● All the deliverables are in the github repository.
  • 4. Phase 2 ● Parsing ● Tokenization ● Sentence Breaker ● Stop Word Detection ● Identifying Variations ● POS Tagging
  • 5. Phase 3 ● Named Entity Recognition ● Categorization of given Hindi Documents
  • 6. Dataset ● We were provided with the dataset which contains hindi paragraphs crawled from different URLs. ● Link to the Dataset : https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
  • 7. Tokenization ● Tokenization is the process of breaking a stream of text up into words,phrases,symbols or other meaningful elements called Tokens. ● We have done tokenisation by using nltk toolkit, regex-based tokenization and also by using the indic-tokenizer(LTRC)
  • 8. Tokenization Example ● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। ● Tokens : {सूरज,की,ककरणें,अब,पहाड़ों,पर,खाली,हाथ,नहीं,आएगी,बेरोजगारी,के ,ददद,के ,कराहते,लोग़ों,के ,कलए,ये,ककरणें,अ पने,साथ,बेगारी,का,इलाज,लेकर,आएगी}
  • 9. Sentence Breaker ● This module involves separating the sentences from a paragraph text.
  • 10. Sentence Breaker Example ● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है। इस योजना के जररये पहाड़ों पर रहने वालो को सरकार 5 ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90 फीसद अनुदान देगी। “
  • 11. Sentence Breaker Example ● Sentences Generated : ○ सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी ○ बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी ○ इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है ○ इस योजना के जररये पहाड़ों पर रहने वालो को सरकार ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90 फीसद अनुदान देगी
  • 12. Stop-Word Detection ● Stop Words are the most frequent occurring words in the corpus. ● These are basically the function words. ● Our approach for identifying stop words was first finding the unigrams and maintain their counts. ● We Sorted the Unigrams based on their counts and then depending on the size of the dataset top k( we took k as 50) unigrams were detected as stop-words.
  • 13. Stemmer ● Rule based stemmer is built. ● Paper followed : http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6- Ramanathan.pdf
  • 14. Identifying Variations ● Two steps were followed for this :- ○ Checking the soundex code generated by the tokens or the transliteration of the tokens to english ○ Checking if the words are synonyms (wordnet) ● If both the conditions mentioned above are true, then the words are similar. ● Soundex and transliteration tools were taken from Libindic .
  • 15. Example ● The input is given as two words and output tells whether they are same or not. ● This can also be used to list all the same words in the corpus. ● चम्पक and चंपक have same soundex/transliteration and also they have the same meaning , so the words are same.
  • 16. POS Tagging ● POS(Part-Of-Speech) Tagging is process of assigning each word of sentence a POS tag from predefined set. ● Challenge : Resolving Lexical Ambiguity ○ Tagging of text is a complex task as many times we get words which have different tag categories as they are used in different context. ● Solution : ○ This problem can be resolved by looking at the word/tag combinations of the surrounding words with respect to the ambiguous word.
  • 17. POS Tagging : Example ● For POS Tagging, we have implemented two probabilistic models : ○ Hidden Markov Model(HMM) ○ Conditional Random Fields(CRF) ● POS Tagging Example { सूरज -> NN,की -> PSP ,ककरणें -> NN,अब -> PRP,पहाड़ों -> NN , पर -> PSP, खाली -> JJ, हाथ -> NN,नहीं -> NEG,आएगी -> V} ● We have used BIS tagset for POS tagging.
  • 18. POS Tagging : HMM ● A POS tagger based on HMM assigns the best tag to a word by calculating the forward and backward probabilities of tags along with the sequence provided as an input. ● Probability for particular tag-word combination is calculated using following formula : ● P(ti|t-1) : Prob of current tag given the previous tag P(ti+1|ti) : Prob of future tag given the current tag P(wi|ti) : Prob of word given the current tag
  • 19. POS Tagging : CRF ● A POS tagger based on CRF assigns the best tag to a word by considering unigram and bigram features defined in template. ● Sample template file :
  • 20. Named Entity Recognition ● Named Entity Recognition is process of identifying named entities from given paragraph (group of sentences). ● Named entities are mainly nouns which belong to certain categories like persons, places, organizations, numerical quantities, expressions of times etc. ● NER Example: ○ {सूरज -> ENAMEX_LOCATION_B, हाथ -> ENAMEX_LIVTHINGS_B}
  • 21. NER : CRF ● For NER, probabilistic model Conditional Random Fields(CRF) is used. ● We have implemented NER using CRF++ toolkit and tested with different template of features. ● Generally, NER is followed by POS Tagging and hence we have used POS Tags as features along with words.
  • 22. NER : CRF : Template Examples
  • 23. Categorization of given Documents ● Three classification techniques are followed : ○ KNN based classifier ○ Naive bayes classifier ○ SVM Based Classifier ● The data set is divided in the ratio of 80:20 (training : testing) ● Only those entries of the dataset which have mcategory are considered and the corresponding value is the tag assigned.
  • 24. KNN based classifier ● The text with mcategory given is taken into consideration ● The similarity of the the two text documents is calculated by using the tf-idf score. ● K most similar documents to the test document are considered and the tag which is majorly in the k documents is assigned to the test document. ● Accuracy : 89% (k=9)
  • 25. Naive Bayes Classifier ● The text with mcategory given is taken into consideration. ● Naive bayes classifiers are simple probabilistic classifiers based on applying Bayes theorem with strong independence assumption b/w the features. ● Accuracy : 92%
  • 26. SVM based Classifier ● The text with “mcategory” given is taken into consideration. ○ “Text” : Feature ○ “mCategory” : Class-label. ● SVM based classifier are non-probabilistic binary linear classifiers. ● As we have more than two categories [Multi-Class classification], we have used OneVsRestClassifier and then selected category which gets maximum decision function value. ● Accuracy : 90%
  • 27. Tools Used ● We have used ‘PYTHON’ for implementing all modules. ● In python we have mainly used following libraries : ○ NLTK ○ SKLEARN ○ INDIC ○ SOUNDEX ○ TextBlob ● Along with python, we have used linux based toolkit CRF++.
  • 28. Challenges ● For transliteration and Soundex the cmu mapping dictionaries are considered which only map one alphabet to just 1 other alphabet which might not be the case always as anusvara maps to more than one alphabet. So wrong codes are generated sometimes. ● Sometimes the wordnet dictionaries are not up-to-date. ● We have applied rule-based stemming but rules don’t apply every time and hence the word is stemmed to wrong root word. ● For NER, we have choice of different sets of features. Each feature set was giving different accuracy. So we need to select the best and minimal feature set out of it.
  • 29. Applications ● Efficient retrieval of hindi language resources ● Annotating/Categorizing hindi documents ● Building responsive UIs in hindi language
  • 30. References ● http://airccj.org/CSCP/vol3/csit3639.pdf [POSTagger] ● http://talukdar.net/papers/KBCS04_HPL-1.pdf [Tokenization] ● https://sanchay.co.in/papers/cpms-long-iwlc-06.pdf [Identifying Variations] ● http://www.enggjournals.com/ijcse/doc/IJCSE12-04-05-213.pdf [Entity Recognition]