SlideShare a Scribd company logo
Text Processing Framework for Hindi
Project ID : 2
Team Number -51
Mentor : Pulkit Parikh
Members:
Deepanshu Vijay (201302093)
Utsav Chokshi (201505581)
Arushi Dogra (201302084)
Problem Statement :
● The goal of this project is to develop a Text Processing
framework for Hindi.
● The framework that is required to develop should include all the
basic text processing algorithms such as, Stop word detection,
Tokenization, Sentence Breaker, POS Tagging, Key Concepts
Identification, Entity Recognition and Categorization.
Tasks Division
● There were three deliverables:
○ Phase1 : Scope Document
○ Phase2
○ Phase3
● All the deliverables are in the github repository.
Phase 2
● Parsing
● Tokenization
● Sentence Breaker
● Stop Word Detection
● Identifying Variations
● POS Tagging
Phase 3
● Named Entity Recognition
● Categorization of given Hindi Documents
Dataset
● We were provided with the dataset which contains hindi
paragraphs crawled from different URLs.
● Link to the Dataset :
https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
Tokenization
● Tokenization is the process of breaking a stream of text up into
words,phrases,symbols or other meaningful elements called
Tokens.
● We have done tokenisation by using nltk toolkit, regex-based
tokenization and also by using the indic-tokenizer(LTRC)
Tokenization Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी।
● Tokens :
{सूरज,की,ककरणें,अब,पहाड़ों,पर,खाली,हाथ,नहीं,आएगी,बेरोजगारी,के ,ददद,के ,कराहते,लोग़ों,के ,कलए,ये,ककरणें,अ
पने,साथ,बेगारी,का,इलाज,लेकर,आएगी}
Sentence Breaker
● This module involves separating the sentences from a
paragraph text.
Sentence Breaker Example
● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के
कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी
डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की
है। इस योजना के जररये पहाड़ों पर रहने वालो को सरकार 5 ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90
फीसद अनुदान देगी। “
Sentence Breaker Example
● Sentences Generated :
○ सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी
○ बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी
○ इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पवदतीय इलाक़ों
के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है
○ इस योजना के जररये पहाड़ों पर रहने वालो को सरकार ककलोवाट तक पॉवर स्टेशन बनाने के कलए
90 फीसद अनुदान देगी
Stop-Word Detection
● Stop Words are the most frequent occurring words in the
corpus.
● These are basically the function words.
● Our approach for identifying stop words was first finding the
unigrams and maintain their counts.
● We Sorted the Unigrams based on their counts and then
depending on the size of the dataset top k( we took k as 50)
unigrams were detected as stop-words.
Stemmer
● Rule based stemmer is built.
● Paper followed :
http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-
Ramanathan.pdf
Identifying Variations
● Two steps were followed for this :-
○ Checking the soundex code generated by the tokens or the
transliteration of the tokens to english
○ Checking if the words are synonyms (wordnet)
● If both the conditions mentioned above are true, then the words
are similar.
● Soundex and transliteration tools were taken from Libindic .
Example
● The input is given as two words and output tells whether they are same or
not.
● This can also be used to list all the same words in the corpus.
● चम्पक and चंपक have same soundex/transliteration and also they have the
same meaning , so the words are same.
POS Tagging
● POS(Part-Of-Speech) Tagging is process of assigning each
word of sentence a POS tag from predefined set.
● Challenge : Resolving Lexical Ambiguity
○ Tagging of text is a complex task as many times we get words which
have different tag categories as they are used in different context.
● Solution :
○ This problem can be resolved by looking at the word/tag
combinations of the surrounding words with respect to the
ambiguous word.
POS Tagging : Example
● For POS Tagging, we have implemented two probabilistic
models :
○ Hidden Markov Model(HMM)
○ Conditional Random Fields(CRF)
● POS Tagging Example
{ सूरज -> NN,की -> PSP ,ककरणें -> NN,अब -> PRP,पहाड़ों -> NN ,
पर -> PSP, खाली -> JJ, हाथ -> NN,नहीं -> NEG,आएगी -> V}
● We have used BIS tagset for POS tagging.
POS Tagging : HMM
● A POS tagger based on HMM assigns the best tag to a word by
calculating the forward and backward probabilities of tags along
with the sequence provided as an input.
● Probability for particular tag-word combination is calculated
using following formula :
● P(ti|t-1) : Prob of current tag given the previous tag
P(ti+1|ti) : Prob of future tag given the current tag
P(wi|ti) : Prob of word given the current tag
POS Tagging : CRF
● A POS tagger based on CRF assigns the best tag to a word by
considering unigram and bigram features defined in template.
● Sample template file :
Named Entity Recognition
● Named Entity Recognition is process of identifying named
entities from given paragraph (group of sentences).
● Named entities are mainly nouns which belong to certain
categories like persons, places, organizations, numerical
quantities, expressions of times etc.
● NER Example:
○ {सूरज -> ENAMEX_LOCATION_B,
हाथ -> ENAMEX_LIVTHINGS_B}
NER : CRF
● For NER, probabilistic model Conditional Random Fields(CRF) is
used.
● We have implemented NER using CRF++ toolkit and tested with
different template of features.
● Generally, NER is followed by POS Tagging and hence we have
used POS Tags as features along with words.
NER : CRF : Template Examples
Categorization of given Documents
● Three classification techniques are followed :
○ KNN based classifier
○ Naive bayes classifier
○ SVM Based Classifier
● The data set is divided in the ratio of 80:20 (training : testing)
● Only those entries of the dataset which have mcategory are
considered and the corresponding value is the tag assigned.
KNN based classifier
● The text with mcategory given is taken into consideration
● The similarity of the the two text documents is calculated by
using the tf-idf score.
● K most similar documents to the test document are considered
and the tag which is majorly in the k documents is assigned to
the test document.
● Accuracy : 89% (k=9)
Naive Bayes Classifier
● The text with mcategory given is taken into consideration.
● Naive bayes classifiers are simple probabilistic classifiers
based on applying Bayes theorem with strong independence
assumption b/w the features.
● Accuracy : 92%
SVM based Classifier
● The text with “mcategory” given is taken into consideration.
○ “Text” : Feature
○ “mCategory” : Class-label.
● SVM based classifier are non-probabilistic binary linear
classifiers.
● As we have more than two categories [Multi-Class
classification], we have used OneVsRestClassifier and then
selected category which gets maximum decision function
value.
● Accuracy : 90%
Tools Used
● We have used ‘PYTHON’ for implementing all modules.
● In python we have mainly used following libraries :
○ NLTK
○ SKLEARN
○ INDIC
○ SOUNDEX
○ TextBlob
● Along with python, we have used linux based toolkit CRF++.
Challenges
● For transliteration and Soundex the cmu mapping dictionaries are
considered which only map one alphabet to just 1 other alphabet which
might not be the case always as anusvara maps to more than one
alphabet. So wrong codes are generated sometimes.
● Sometimes the wordnet dictionaries are not up-to-date.
● We have applied rule-based stemming but rules don’t apply every time and
hence the word is stemmed to wrong root word.
● For NER, we have choice of different sets of features. Each feature set was
giving different accuracy. So we need to select the best and minimal
feature set out of it.
Applications
● Efficient retrieval of hindi language resources
● Annotating/Categorizing hindi documents
● Building responsive UIs in hindi language
References
● http://airccj.org/CSCP/vol3/csit3639.pdf [POSTagger]
● http://talukdar.net/papers/KBCS04_HPL-1.pdf [Tokenization]
● https://sanchay.co.in/papers/cpms-long-iwlc-06.pdf [Identifying Variations]
● http://www.enggjournals.com/ijcse/doc/IJCSE12-04-05-213.pdf [Entity
Recognition]

More Related Content

What's hot

Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
Felipe Moraes
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
hyunyoung Lee
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
Abdullah Khan Zehady
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
Abdelaziz Al-Rihawi
 
Text summarization
Text summarizationText summarization
Text summarization
Akash Karwande
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
Viet-Trung TRAN
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
Alexey Grigorev
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata londonkperi
 
Text Classification
Text ClassificationText Classification
Text Classification
RAX Automation Suite
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
Eugene Nho
 
Word2Vec
Word2VecWord2Vec
Word2Vec
hyunyoung Lee
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
Roelof Pieters
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
Liangqun Lu
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Pramati Technologies
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
Roelof Pieters
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
Nishant Kumar
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Daniele Di Mitri
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
君 廖
 

What's hot (20)

Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
 
Text summarization
Text summarizationText summarization
Text summarization
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
Text Classification
Text ClassificationText Classification
Text Classification
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
NIPS 2017 Competition Track : Personalized Cancer Treatment -- Classifying Cl...
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
[ppt]
[ppt][ppt]
[ppt]
 

Similar to Text Processing Framework for Hindi

End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
Jayavardhan Reddy Peddamail
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents
Sharvil Katariya
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Yuki Tomo
 
ICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptxICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptx
DeepaGupta205807
 
Classification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenClassification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or Evergreen
Monis Javed
 
Document Summarization
Document SummarizationDocument Summarization
Document SummarizationPratik Kumar
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB
 
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
QuantInsti
 
Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...
Olga Zinkevych
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
iwan_rg
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
iosrjce
 
D017232729
D017232729D017232729
D017232729
IOSR Journals
 
Tensorflow
TensorflowTensorflow
Tensorflow
Knoldus Inc.
 
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
AIST
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Jinho Choi
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
FEG
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overview
ananth
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorial
Ganesh Venkataraman
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
National Information Standards Organization (NISO)
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
sadakpramodh
 

Similar to Text Processing Framework for Hindi (20)

End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
ICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptxICSE20_Tao_slides.pptx
ICSE20_Tao_slides.pptx
 
Classification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenClassification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or Evergreen
 
Document Summarization
Document SummarizationDocument Summarization
Document Summarization
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...
 
Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
Tensorflow
TensorflowTensorflow
Tensorflow
 
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
Elena Bolshakova and Natalia Efremova - A Heuristic Strategy for Extracting T...
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
 
MaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - OverviewMaxEnt (Loglinear) Models - Overview
MaxEnt (Loglinear) Models - Overview
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorial
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 

Recently uploaded

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
chanes7
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
deeptiverma2406
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 

Recently uploaded (20)

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 

Text Processing Framework for Hindi

  • 1. Text Processing Framework for Hindi Project ID : 2 Team Number -51 Mentor : Pulkit Parikh Members: Deepanshu Vijay (201302093) Utsav Chokshi (201505581) Arushi Dogra (201302084)
  • 2. Problem Statement : ● The goal of this project is to develop a Text Processing framework for Hindi. ● The framework that is required to develop should include all the basic text processing algorithms such as, Stop word detection, Tokenization, Sentence Breaker, POS Tagging, Key Concepts Identification, Entity Recognition and Categorization.
  • 3. Tasks Division ● There were three deliverables: ○ Phase1 : Scope Document ○ Phase2 ○ Phase3 ● All the deliverables are in the github repository.
  • 4. Phase 2 ● Parsing ● Tokenization ● Sentence Breaker ● Stop Word Detection ● Identifying Variations ● POS Tagging
  • 5. Phase 3 ● Named Entity Recognition ● Categorization of given Hindi Documents
  • 6. Dataset ● We were provided with the dataset which contains hindi paragraphs crawled from different URLs. ● Link to the Dataset : https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
  • 7. Tokenization ● Tokenization is the process of breaking a stream of text up into words,phrases,symbols or other meaningful elements called Tokens. ● We have done tokenisation by using nltk toolkit, regex-based tokenization and also by using the indic-tokenizer(LTRC)
  • 8. Tokenization Example ● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। ● Tokens : {सूरज,की,ककरणें,अब,पहाड़ों,पर,खाली,हाथ,नहीं,आएगी,बेरोजगारी,के ,ददद,के ,कराहते,लोग़ों,के ,कलए,ये,ककरणें,अ पने,साथ,बेगारी,का,इलाज,लेकर,आएगी}
  • 9. Sentence Breaker ● This module involves separating the sentences from a paragraph text.
  • 10. Sentence Breaker Example ● Text : सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है। इस योजना के जररये पहाड़ों पर रहने वालो को सरकार 5 ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90 फीसद अनुदान देगी। “
  • 11. Sentence Breaker Example ● Sentences Generated : ○ सूरज की ककरणें अब पहाड़ों पर खाली हाथ नहीं आएगी ○ बेरोजगारी के ददद के कराहते लोग़ों के कलए ये ककरणें अपने साथ बेगारी का इलाज लेकर आएगी ○ इसके कलए प्रदेश में उत्तराखंड ररन्यूएबल एनजी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पवदतीय इलाक़ों के कलए सूयोदय स्वरोजगार योजना की शुरुआत की है ○ इस योजना के जररये पहाड़ों पर रहने वालो को सरकार ककलोवाट तक पॉवर स्टेशन बनाने के कलए 90 फीसद अनुदान देगी
  • 12. Stop-Word Detection ● Stop Words are the most frequent occurring words in the corpus. ● These are basically the function words. ● Our approach for identifying stop words was first finding the unigrams and maintain their counts. ● We Sorted the Unigrams based on their counts and then depending on the size of the dataset top k( we took k as 50) unigrams were detected as stop-words.
  • 13. Stemmer ● Rule based stemmer is built. ● Paper followed : http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6- Ramanathan.pdf
  • 14. Identifying Variations ● Two steps were followed for this :- ○ Checking the soundex code generated by the tokens or the transliteration of the tokens to english ○ Checking if the words are synonyms (wordnet) ● If both the conditions mentioned above are true, then the words are similar. ● Soundex and transliteration tools were taken from Libindic .
  • 15. Example ● The input is given as two words and output tells whether they are same or not. ● This can also be used to list all the same words in the corpus. ● चम्पक and चंपक have same soundex/transliteration and also they have the same meaning , so the words are same.
  • 16. POS Tagging ● POS(Part-Of-Speech) Tagging is process of assigning each word of sentence a POS tag from predefined set. ● Challenge : Resolving Lexical Ambiguity ○ Tagging of text is a complex task as many times we get words which have different tag categories as they are used in different context. ● Solution : ○ This problem can be resolved by looking at the word/tag combinations of the surrounding words with respect to the ambiguous word.
  • 17. POS Tagging : Example ● For POS Tagging, we have implemented two probabilistic models : ○ Hidden Markov Model(HMM) ○ Conditional Random Fields(CRF) ● POS Tagging Example { सूरज -> NN,की -> PSP ,ककरणें -> NN,अब -> PRP,पहाड़ों -> NN , पर -> PSP, खाली -> JJ, हाथ -> NN,नहीं -> NEG,आएगी -> V} ● We have used BIS tagset for POS tagging.
  • 18. POS Tagging : HMM ● A POS tagger based on HMM assigns the best tag to a word by calculating the forward and backward probabilities of tags along with the sequence provided as an input. ● Probability for particular tag-word combination is calculated using following formula : ● P(ti|t-1) : Prob of current tag given the previous tag P(ti+1|ti) : Prob of future tag given the current tag P(wi|ti) : Prob of word given the current tag
  • 19. POS Tagging : CRF ● A POS tagger based on CRF assigns the best tag to a word by considering unigram and bigram features defined in template. ● Sample template file :
  • 20. Named Entity Recognition ● Named Entity Recognition is process of identifying named entities from given paragraph (group of sentences). ● Named entities are mainly nouns which belong to certain categories like persons, places, organizations, numerical quantities, expressions of times etc. ● NER Example: ○ {सूरज -> ENAMEX_LOCATION_B, हाथ -> ENAMEX_LIVTHINGS_B}
  • 21. NER : CRF ● For NER, probabilistic model Conditional Random Fields(CRF) is used. ● We have implemented NER using CRF++ toolkit and tested with different template of features. ● Generally, NER is followed by POS Tagging and hence we have used POS Tags as features along with words.
  • 22. NER : CRF : Template Examples
  • 23. Categorization of given Documents ● Three classification techniques are followed : ○ KNN based classifier ○ Naive bayes classifier ○ SVM Based Classifier ● The data set is divided in the ratio of 80:20 (training : testing) ● Only those entries of the dataset which have mcategory are considered and the corresponding value is the tag assigned.
  • 24. KNN based classifier ● The text with mcategory given is taken into consideration ● The similarity of the the two text documents is calculated by using the tf-idf score. ● K most similar documents to the test document are considered and the tag which is majorly in the k documents is assigned to the test document. ● Accuracy : 89% (k=9)
  • 25. Naive Bayes Classifier ● The text with mcategory given is taken into consideration. ● Naive bayes classifiers are simple probabilistic classifiers based on applying Bayes theorem with strong independence assumption b/w the features. ● Accuracy : 92%
  • 26. SVM based Classifier ● The text with “mcategory” given is taken into consideration. ○ “Text” : Feature ○ “mCategory” : Class-label. ● SVM based classifier are non-probabilistic binary linear classifiers. ● As we have more than two categories [Multi-Class classification], we have used OneVsRestClassifier and then selected category which gets maximum decision function value. ● Accuracy : 90%
  • 27. Tools Used ● We have used ‘PYTHON’ for implementing all modules. ● In python we have mainly used following libraries : ○ NLTK ○ SKLEARN ○ INDIC ○ SOUNDEX ○ TextBlob ● Along with python, we have used linux based toolkit CRF++.
  • 28. Challenges ● For transliteration and Soundex the cmu mapping dictionaries are considered which only map one alphabet to just 1 other alphabet which might not be the case always as anusvara maps to more than one alphabet. So wrong codes are generated sometimes. ● Sometimes the wordnet dictionaries are not up-to-date. ● We have applied rule-based stemming but rules don’t apply every time and hence the word is stemmed to wrong root word. ● For NER, we have choice of different sets of features. Each feature set was giving different accuracy. So we need to select the best and minimal feature set out of it.
  • 29. Applications ● Efficient retrieval of hindi language resources ● Annotating/Categorizing hindi documents ● Building responsive UIs in hindi language
  • 30. References ● http://airccj.org/CSCP/vol3/csit3639.pdf [POSTagger] ● http://talukdar.net/papers/KBCS04_HPL-1.pdf [Tokenization] ● https://sanchay.co.in/papers/cpms-long-iwlc-06.pdf [Identifying Variations] ● http://www.enggjournals.com/ijcse/doc/IJCSE12-04-05-213.pdf [Entity Recognition]