This project was developed as part of the IRE coursework at IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short, important text from a source document. In this paper, we study various techniques. Many techniques have been developed for summarizing English and other Indian languages, but far less effort has gone into Hindi. Here, we discuss various techniques and the features they involve, such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
A deep dive into the world of word vectors. We will cover the bigram model, skip-gram, CBOW, and GloVe. Starting from the simplest models, we will journey through key results and ideas in this area.
Sentence-level translation quality estimation with cross-lingual transformers.
Please consider citing our paper
@InProceedings{transquest:2020,
author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
title = {TransQuest: Translation Quality Estimation with Cross-lingual Transformers},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
year = {2020}
}
The following are the questions I tried to answer in this presentation:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are its business applications?
In the first few slides, extractive summarization is explained with its pros and cons; in the next section, abstractive summarization is explained.
In the last section, the business applications of each are highlighted.
Representation Learning of Vectors of Words and Phrases (Felipe Moraes)
A talk about representation learning using word vectors such as Word2Vec and Paragraph Vector. It also introduces neural network language models (NNLMs) and presents some applications of NNLMs, such as sentiment analysis and information retrieval.
This is a presentation, given at a Natural Language Processing Lab seminar, on what skip-gram and CBOW are:
- how to build word vectors using skip-gram & CBOW.
Extractive summarization selects objects from the entire collection without modifying the objects themselves; the goal is to select whole sentences (without modifying them).
Automatic text summarization is the process of reducing the text content and retaining the
important points of the document. Generally, there are two approaches for automatic text summarization:
Extractive and Abstractive. The process of extractive based text summarization can be divided into two
phases: pre-processing and processing. In this paper, we discuss some of the extractive based text
summarization approaches used by researchers. We also provide the features for extractive based text
summarization process. We also present the available linguistic preprocessing tools with their features,
which are used for automatic text summarization. The tools and parameters useful for evaluating the
generated summary are also discussed in this paper. Moreover, we explain our proposed lexical chain
analysis approach, with sample generated lexical chains, for extractive based automatic text summarization.
We also provide the evaluation results of our system generated summary. The proposed lexical chain
analysis approach can be used to solve different text mining problems like topic classification, sentiment
analysis, and summarization.
The Text Classification slides contain research results about possible natural language processing algorithms. Specifically, they give a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
BERT: Bidirectional Encoder Representations from Transformers (Liangqun Lu)
BERT was developed by Google AI Language and came out in October 2018. It has achieved the best performance in many NLP tasks, so if you are interested in NLP, studying BERT is a good way to go.
Document Clustering using LDA | Haridas Narayanaswamy (Pramati Technologies)
This talk covers how one can find the latent topics in a collection of documents without any labels (unsupervised learning). It also covers Latent Dirichlet Allocation (LDA), a type of document clustering model. LDA can be used in multiple NLP pipelines, e.g., document clustering, topic evaluation, feature extraction, document similarity studies, and text summarisation. Evaluating the quality of results from such unsupervised models is a challenge; we discuss a few effective evaluation methods.
Visual-Semantic Embeddings: Some Thoughts on Language (Roelof Pieters)
Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks has led to new approaches and new state of the art results in many natural language processing tasks. One such exciting - and most recent - trend can be seen in multimodal approaches fusing techniques and models of natural language processing (NLP) with that of computer vision.
The talk aims to give an overview of the NLP part of this trend. It starts with a short overview of the challenges in creating deep networks for language, what makes a “good” language model, and the specific requirements of semantic word spaces for multi-modal embeddings.
This presentation describes our approach to the Personalized Medicine: Redefining Cancer Treatment competition organized by Memorial Sloan Kettering Cancer Center (MSKCC). Nishant Kumar and Lezhi Li presented it at the NIPS 2017 Competition Track. Understanding the genetic mutations that really matter in a cancer tumor is a challenging task with a potentially huge impact on millions of lives.
MSKCC made available an expert-annotated knowledge base in which world-class researchers and oncologists have manually annotated thousands of mutations. Machine learning is applied to this labelled data, consisting of genes, mutations, and relevant text-based clinical literature, to build a multi-class classifier that assigns each gene-mutation pair to one of 9 given classes, some of which can be clinically relevant. We found that it is possible to train a machine learning model, using natural language processing techniques like TF-IDF, bag of words, Latent Dirichlet Allocation, and cosine similarity along with ensemble modelling techniques like meta-bagging with boosting (XGBoost), that predicts for each gene-mutation pair the most likely category it might fall into. Thus a smart clinical-literature reading system can be built to help oncologists quickly identify the clinical relevance of a mutation and design the treatment based on that.
Semantic annotation is done by first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations. The vectors are then used as features to train a classifier, producing a model that can label a document with ACM classification tree categories, with the help of the Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", ICML 2014.
Deep Learning Study Group @ Komachi Lab: "Learning Character-level Representations for Part-of-Speech Tagging" (Yuki Tomo)
At the 12/22 Deep Learning study group at the Komachi Lab, I presented
"Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny.
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB
Today an increasingly large number of products use machine learning to deliver a great personalized user experience, and workplace software is no exception. Learn how Spoke uses MongoDB to do dynamic model training in real time from user interaction data and automatically train and serve thousands of models, with multiple customized models per client.
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ...QuantInsti
Terry Benzschawel (Founder and Principal at Benzschawel Scientific, LLC) and Ishan Shah (AVP, Content & Research at QuantInsti) helm this Masterclass about Natural Language Processing in Trading.
In this session, Terry and Ishan discuss:
- How is Natural Language Processing applied in financial markets?
- Calculating Daily Sentiment Score on Quantra learning portal
- Compare different word embedding methods with their pros and cons
- How does Quantra learning portal provide a unique learning experience?
Summary:
This introductory webinar describes the use of Natural Language Processing (NLP) techniques in the context of building trading strategies for 1-day horizons for the corporate bond market and equities markets using news headlines.
We describe various methods for converting text into digital representations and for extracting sentiment scores from those embeddings.
Looking ahead to the course, we find that approaches using the latest advances in NLP are better suited to predicting future returns in credit indices by using news headlines directly as inputs, instead of news-headline sentiments.
About us:
Quantra is an online education portal that specializes in Algorithmic and Quantitative trading. Quantra offers various bite-sized, self-paced and interactive courses that are perfect for busy professionals, seeking implementable knowledge in this domain.
Useful links and some bonus content:
[Blog] Step By Step Guide - http://bit.ly/StepByStepGuideNLP
[Course] Natural Language Processing in Trading - http://bit.ly/QuantraNLPT
[Courses] All Quantra courses - http://bit.ly/AllQuantraCourses
Find more info on - https://quantra.quantinsti.com/
Like us on Facebook: https://www.facebook.com/goquantra/
Follow us on Twitter: https://twitter.com/GoQuantra
Overview of Text Classification Approaches: Algorithms & Software, Volodymyr Lyubinets (Olga Zinkevych)
The main points of the presentation: an overview of text classification approaches, algorithms, and software.
Summary: For the last 2 months I've been building a system for classifying customer support tickets into several categories in terms of product area, importance, etc. Throughout that time I've tried several approaches and benchmarked them against each other. In this talk I showcase some of my findings, including algorithms that perform well and relevant software. This talk is useful for someone who needs to build a text categorization system, or someone who just wants an overview of one of the most popular NLP research problems (classification).
In this talk you will learn:
* About various approaches used for text classification (e.g. approaches based on TF-IDF, or approaches based on word embeddings and RNNs - recurrent neural nets).
* How these approaches perform against each other on real-world data.
* Software that is useful for implementing these approaches.
* Research behind some of these approaches.
http://dataconf.com.ua/speaker-page/volodymyr-lyubinets.php
https://www.youtube.com/watch?v=shmc-MI-xbo&index=5&list=PL5_LBM8-5sLjbRFUtXaUpg84gtJtyc4Pu
Class Diagram Extraction from Textual Requirements Using NLP Techniques (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
The presentation introduces you to Tensorflow, different types of NLP techniques like CBOW and skip-gram and also Jupyter-notebook. It explains the topics through a problem statement where we wanted to cluster the feedbacks from the knolx sessions, basically it takes you through the process of problem-solving with deep learning models.
In this presentation we discuss the hypothesis of MaxEnt models, describe the role of feature functions and their applications to Natural Language Processing (NLP). The training of the classifier is discussed in a later presentation.
This presentation was provided by William Mattingly of the Smithsonian Institution, for the sixth session of NISO's 2023 Training Series on Text and Data Mining. Session six, "Text Mining Techniques" was held on Thursday, November 16, 2023.
1. Text Processing Framework for Hindi
Project ID : 2
Team Number -51
Mentor : Pulkit Parikh
Members:
Deepanshu Vijay (201302093)
Utsav Chokshi (201505581)
Arushi Dogra (201302084)
2. Problem Statement :
● The goal of this project is to develop a text processing
framework for Hindi.
● The framework to be developed should include all the basic
text processing algorithms, such as stop-word detection,
tokenization, sentence breaking, POS tagging, key concept
identification, and entity recognition and categorization.
3. Tasks Division
● There were three deliverables:
○ Phase1 : Scope Document
○ Phase2
○ Phase3
● All the deliverables are in the GitHub repository.
5. Phase 3
● Named Entity Recognition
● Categorization of given Hindi Documents
6. Dataset
● We were provided with a dataset containing Hindi
paragraphs crawled from different URLs.
● Link to the Dataset :
https://www.dropbox.com/s/h5sgwg3dfdoqfxo/hi.data?dl=0
7. Tokenization
● Tokenization is the process of breaking a stream of text up into
words, phrases, symbols, or other meaningful elements called
tokens.
● We have done tokenization using the NLTK toolkit, regex-based
tokenization, and the Indic tokenizer (LTRC).
8. Tokenization Example
● Text : सूरज की किरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के दर्द के कराहते लोगों के
लिए ये किरणें अपने साथ बेगारी का इलाज लेकर आएगी।
● Tokens :
{सूरज, की, किरणें, अब, पहाड़ों, पर, खाली, हाथ, नहीं, आएगी, बेरोजगारी, के, दर्द, के, कराहते, लोगों, के, लिए, ये, किरणें, अपने, साथ, बेगारी, का, इलाज, लेकर, आएगी}
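The regex-based variant of the tokenizer above can be sketched in a few lines. This is a minimal illustration, not the project's exact implementation; unlike the slide's token list, it keeps the danda (।) as its own token.

```python
import re

# Minimal regex-based Hindi tokenizer sketch: a token is either a danda
# (।) or a maximal run of non-space, non-danda characters.
TOKEN_RE = re.compile(r"।|[^\s।]+")

def tokenize(text):
    """Return the list of tokens found in `text`."""
    return TOKEN_RE.findall(text)

print(tokenize("सूरज की किरणें आएगी।"))
```

In practice, additional punctuation classes would be split off the same way.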
10. Sentence Breaker Example
● Text : सूरज की किरणें अब पहाड़ों पर खाली हाथ नहीं आएगी। बेरोजगारी के दर्द के कराहते लोगों के
लिए ये किरणें अपने साथ बेगारी का इलाज लेकर आएगी। इसके लिए प्रदेश में उत्तराखंड रिन्यूएबल एनर्जी
डेवलपमेंट एजेंसी (उरेडा) ने खासतौर से पर्वतीय इलाकों के लिए सूर्योदय स्वरोजगार योजना की शुरुआत की
है। इस योजना के जरिये पहाड़ों पर रहने वालों को सरकार 5 किलोवाट तक पॉवर स्टेशन बनाने के लिए 90
फीसद अनुदान देगी।
11. Sentence Breaker Example
● Sentences Generated :
○ सूरज की किरणें अब पहाड़ों पर खाली हाथ नहीं आएगी
○ बेरोजगारी के दर्द के कराहते लोगों के लिए ये किरणें अपने साथ बेगारी का इलाज लेकर आएगी
○ इसके लिए प्रदेश में उत्तराखंड रिन्यूएबल एनर्जी डेवलपमेंट एजेंसी उरेडा ने खासतौर से पर्वतीय इलाकों
के लिए सूर्योदय स्वरोजगार योजना की शुरुआत की है
○ इस योजना के जरिये पहाड़ों पर रहने वालों को सरकार 5 किलोवाट तक पॉवर स्टेशन बनाने के लिए
90 फीसद अनुदान देगी
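The sentence breaking behind this example can be sketched as a split on Hindi sentence-final punctuation. This is an illustrative sketch under the assumption that the danda (।), '?' and '!' end sentences; the generated sentences above suggest the project's version also strips other punctuation such as parentheses.

```python
import re

# Split on the danda (।), '?' and '!', then trim whitespace and drop
# empty pieces.
def split_sentences(text):
    return [s.strip() for s in re.split(r"[।?!]", text) if s.strip()]

print(split_sentences("पहला वाक्य। दूसरा वाक्य।"))
```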
12. Stop-Word Detection
● Stop words are the most frequently occurring words in the
corpus.
● These are basically the function words.
● Our approach to identifying stop words was to first find the
unigrams and maintain their counts.
● We sorted the unigrams by count and, depending on the size
of the dataset, detected the top k (we took k = 50) unigrams as
stop words.
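The frequency-based detector described above fits in a few lines. A minimal sketch, assuming tokenization has already been done (k is a parameter; the project used k = 50):

```python
from collections import Counter

# Count unigrams and return the k most frequent tokens as stop words.
def detect_stopwords(tokens, k=50):
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(k)]

corpus = "के की का के में की के".split()
print(detect_stopwords(corpus, k=2))
```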
13. Stemmer
● A rule-based stemmer was built.
● Paper followed :
http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf
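The core of a rule-based stemmer in the spirit of the paper above is longest-suffix stripping. The suffix list below is a small illustrative subset, not the paper's full rule set:

```python
# Strip the longest matching suffix from a small, illustrative list of
# Hindi inflectional endings; leave at least two characters as the stem.
SUFFIXES = sorted(
    ["ियाँ", "ियों", "ाएँ", "ों", "ें", "ीं", "ाँ", "ा", "ी", "ि", "ु", "ू", "े", "ो"],
    key=len, reverse=True,  # try longest suffixes first
)

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

print(stem("किताबों"))
```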
14. Identifying Variations
● Two steps were followed for this :-
○ checking the soundex codes generated from the tokens, or from
the transliteration of the tokens to English
○ checking whether the words are synonyms (WordNet)
● If both of the conditions mentioned above hold, the words
are considered the same.
● The soundex and transliteration tools were taken from LibIndic.
15. Example
● The input is two words, and the output tells whether they are the same or
not.
● This can also be used to list all the equivalent words in the corpus.
● चम्पक and चंपक have the same soundex/transliteration and also the
same meaning, so the words are the same.
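The two-step check can be sketched as follows. The project used LibIndic's soundex/transliteration and WordNet; as a self-contained stand-in, this sketch implements the classic English soundex over assumed transliterations and takes the synonym check as a boolean input.

```python
# Classic English soundex: keep the first letter, encode the rest by
# consonant class, drop repeats and vowels, pad/truncate to 4 chars.
def soundex(word):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

# Two words are treated as the same iff their soundex codes match AND
# they are synonyms (the synonym lookup is stubbed out here).
def same_word(w1, w2, are_synonyms):
    return soundex(w1) == soundex(w2) and are_synonyms

# e.g. two transliterations of चम्पक / चंपक (illustrative spellings)
print(same_word("champak", "chanpak", are_synonyms=True))
```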
16. POS Tagging
● POS (part-of-speech) tagging is the process of assigning each
word of a sentence a POS tag from a predefined set.
● Challenge : resolving lexical ambiguity
○ Tagging text is a complex task, as many words fall into different
tag categories when used in different contexts.
● Solution :
○ This problem can be resolved by looking at the word/tag
combinations of the surrounding words with respect to the
ambiguous word.
17. POS Tagging : Example
● For POS tagging, we have implemented two probabilistic
models :
○ Hidden Markov Model (HMM)
○ Conditional Random Fields (CRF)
● POS tagging example :
{सूरज -> NN, की -> PSP, किरणें -> NN, अब -> PRP, पहाड़ों -> NN,
पर -> PSP, खाली -> JJ, हाथ -> NN, नहीं -> NEG, आएगी -> V}
● We have used the BIS tagset for POS tagging.
18. POS Tagging : HMM
● A POS tagger based on an HMM assigns the best tag to a word by
calculating the forward and backward probabilities of tags over
the sequence provided as input.
● The probability for a particular tag-word combination is calculated
using the following formula :
P(ti) ∝ P(ti | ti-1) · P(ti+1 | ti) · P(wi | ti)
● P(ti | ti-1) : probability of the current tag given the previous tag
P(ti+1 | ti) : probability of the future tag given the current tag
P(wi | ti) : probability of the word given the current tag
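HMM decoding of this kind is usually done with dynamic programming. Below is a compact Viterbi-style sketch (the slide describes forward and backward probabilities; Viterbi maximization is shown here for brevity). All probabilities and the toy transliterated words are illustrative, not the project's trained model.

```python
# Find the most probable tag sequence under transition probabilities
# trans[(t_prev, t)] = P(t | t_prev), emission probabilities
# emit[(t, w)] = P(w | t), and start probabilities start[t].
def viterbi(words, tags, trans, emit, start):
    V = [{t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda tp: V[-1][tp] * trans.get((tp, t), 0.0))
            col[t] = V[-1][best] * trans.get((best, t), 0.0) * emit.get((t, w), 0.0)
            ptr[t] = best
        V.append(col)
        back.append(ptr)
    # follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["NN", "PSP"]
start = {"NN": 0.7, "PSP": 0.3}
trans = {("NN", "PSP"): 0.6, ("NN", "NN"): 0.4, ("PSP", "NN"): 0.8, ("PSP", "PSP"): 0.2}
emit = {("NN", "suraj"): 0.5, ("PSP", "ki"): 0.5, ("NN", "kirane"): 0.4}
print(viterbi(["suraj", "ki", "kirane"], tags, trans, emit, start))
```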
19. POS Tagging : CRF
● A POS tagger based on CRFs assigns the best tag to a word by
considering unigram and bigram features defined in a template.
● Sample template file :
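The template image from the slide is not reproduced in this transcript. For reference, a typical CRF++ template with unigram (U) and bigram (B) feature macros looks like the following; this is an illustrative example, not the project's actual template:

```
# Unigram features: the current word and a window of +/-2 words
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# B emits bigram features over adjacent output tags
B
```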
20. Named Entity Recognition
● Named entity recognition is the process of identifying named
entities in a given paragraph (group of sentences).
● Named entities are mainly nouns which belong to certain
categories like persons, places, organizations, numerical
quantities, expressions of time, etc.
● NER example :
○ {सूरज -> ENAMEX_LOCATION_B,
हाथ -> ENAMEX_LIVTHINGS_B}
21. NER : CRF
● For NER, the probabilistic model Conditional Random Fields (CRF)
is used.
● We have implemented NER using the CRF++ toolkit and tested it
with different feature templates.
● Generally, NER follows POS tagging, and hence we have
used POS tags as features along with the words.
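CRF++ expects its training data in a whitespace-separated column format, one token per line with a blank line between sentences. A small sketch of preparing that format (the `O` label for non-entities is an assumption):

```python
# Write (token, POS, NER-label) triples in CRF++'s column format:
# one token per line, columns separated by tabs, blank line between
# sentences.
def to_crfpp(sentences):
    lines = []
    for sent in sentences:
        for token, pos, label in sent:
            lines.append(f"{token}\t{pos}\t{label}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

sent = [[("सूरज", "NN", "ENAMEX_LOCATION_B"), ("की", "PSP", "O")]]
print(to_crfpp(sent))
```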
23. Categorization of Given Documents
● Three classification techniques are followed :
○ KNN-based classifier
○ Naive Bayes classifier
○ SVM-based classifier
● The dataset is divided in the ratio of 80:20 (training : testing).
● Only those entries of the dataset which have an mcategory are
considered, and the corresponding value is the tag assigned.
24. KNN-Based Classifier
● The text with an mcategory given is taken into consideration.
● The similarity of two text documents is calculated using
the tf-idf score.
● The k documents most similar to the test document are
considered, and the tag occurring most often among those k
documents is assigned to the test document.
● Accuracy : 89% (k=9)
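The tf-idf plus cosine-similarity kNN scheme described above can be sketched in pure Python (the project used standard libraries; the documents and tags below are illustrative toys):

```python
import math
from collections import Counter

# Build a sparse tf-idf vector (dict of term -> weight) per document.
def tfidf_vectors(docs):
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    return [{t: c * math.log(N / df[t]) for t, c in Counter(doc.split()).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Assign the majority tag among the k training documents most similar
# to the test document.
def knn_predict(train_docs, train_tags, test_doc, k=9):
    vecs = tfidf_vectors(train_docs + [test_doc])
    sims = [(cosine(vecs[-1], v), tag) for v, tag in zip(vecs, train_tags)]
    top = sorted(sims, reverse=True)[:k]
    return Counter(tag for _, tag in top).most_common(1)[0][0]

train = ["cricket match score", "film actor song", "match wicket cricket"]
tags = ["sports", "entertainment", "sports"]
print(knn_predict(train, tags, "cricket score today", k=3))
```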
25. Naive Bayes Classifier
● The text with an mcategory given is taken into consideration.
● Naive Bayes classifiers are simple probabilistic classifiers
based on applying Bayes' theorem with a strong independence
assumption between the features.
● Accuracy : 92%
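A multinomial Naive Bayes text classifier with Laplace smoothing can be sketched in pure Python (the project used library implementations; the data below is an illustrative toy):

```python
import math
from collections import Counter

# Count class priors and per-class word counts from (doc, label) pairs.
def train_nb(docs, labels):
    classes = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for c in word_counts for w in word_counts[c]}
    return classes, word_counts, vocab

# Pick the class maximizing log P(c) + sum over words of
# log P(w | c), with add-one (Laplace) smoothing.
def predict_nb(model, doc):
    classes, word_counts, vocab = model
    total = sum(classes.values())
    best, best_lp = None, -math.inf
    for c in classes:
        lp = math.log(classes[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in doc.split():
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(
    ["cricket match score", "match wicket cricket", "film actor song"],
    ["sports", "sports", "entertainment"],
)
print(predict_nb(model, "cricket score"))
```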
26. SVM-Based Classifier
● The text with an “mcategory” given is taken into consideration.
○ “Text” : feature
○ “mCategory” : class label
● SVM-based classifiers are non-probabilistic binary linear
classifiers.
● As we have more than two categories (multi-class
classification), we have used OneVsRestClassifier and then
selected the category which gets the maximum decision-function
value.
● Accuracy : 90%
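The one-vs-rest scheme above trains one binary linear model per category and picks the category with the highest decision value. To keep this sketch self-contained, the binary model here is a tiny perceptron rather than an SVM (the project used an SVM, e.g. scikit-learn's `OneVsRestClassifier`); everything else is the same one-vs-rest mechanics, on illustrative toy data.

```python
# Bag-of-words indicator features over a fixed vocabulary.
def featurize(doc, vocab):
    words = set(doc.split())
    return [1.0 if w in words else 0.0 for w in vocab]

# Train one binary linear scorer; y holds +1 / -1 labels for a single
# "this class vs. the rest" task.
def train_linear(X, y, epochs=10):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, yi in zip(X, y):
            if yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + yi * xi for wi, xi in zip(w, x)]  # perceptron update
                b += yi
    return w, b

# One-vs-rest prediction: the class with the maximum decision value wins.
def ovr_predict(doc, vocab, models):
    x = featurize(doc, vocab)
    decision = {c: sum(wi * xi for wi, xi in zip(w, x)) + b
                for c, (w, b) in models.items()}
    return max(decision, key=decision.get)

vocab = ["cricket", "match", "film", "song"]
docs, labels = ["cricket match", "film song"], ["sports", "entertainment"]
X = [featurize(d, vocab) for d in docs]
models = {c: train_linear(X, [1 if l == c else -1 for l in labels])
          for c in set(labels)}
print(ovr_predict("cricket match today", vocab, models))
```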
27. Tools Used
● We have used Python for implementing all modules.
● In Python we have mainly used the following libraries :
○ NLTK
○ sklearn
○ indic
○ soundex
○ TextBlob
● Along with Python, we have used the Linux-based toolkit CRF++.
28. Challenges
● For transliteration and soundex, the CMU mapping dictionaries are
used, which map each character to just one other character. This is
not always correct, as the anusvara maps to more than one
character, so wrong codes are sometimes generated.
● Sometimes the WordNet dictionaries are not up to date.
● We have applied rule-based stemming, but the rules don't apply
every time, and hence a word is sometimes stemmed to the wrong root.
● For NER, we have a choice of different feature sets, each giving
a different accuracy, so we needed to select the best and minimal
feature set.
29. Applications
● Efficient retrieval of Hindi language resources
● Annotating/categorizing Hindi documents
● Building responsive UIs in Hindi