Part-of-speech tagging automatically labels the words of a text with tags that can be used to determine the structure of a sentence. In this paper we propose an approach to the problem that is inspired by human behavior. We use a combination of rule-based and dictionary-based approaches to tackle the problem. Our goal in this paper is to design a simple yet effective system for POS tagging that also helps us understand human behavior more effectively.
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING (cscpconf)
This paper proposes a semi-supervised learning approach to detect jargon words in text. It handles jargon words directly in the text as well as abbreviated forms like sounds-alike words. It uses a sliding window technique to detect suspicious words that partially match jargon words. A learning methodology assigns probabilities to suspicious words based on the concept derived from the text and stores them with a counter. Words are marked as jargon when the probability passes a threshold.
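The partial-matching step described above can be sketched roughly as follows. This is only an illustration of the sliding-window idea, not the paper's actual implementation; the jargon lexicon and the 75% similarity threshold are assumptions.

```python
# Sketch of sliding-window partial matching of tokens against a jargon
# lexicon. The lexicon and the 0.75 threshold are illustrative assumptions.

def window_match(token: str, jargon: str, threshold: float = 0.75) -> bool:
    """Slide a window of len(jargon) characters over the token and report
    a match if any window is at least `threshold` identical to the jargon word."""
    w = len(jargon)
    if len(token) < w:
        return False
    for i in range(len(token) - w + 1):
        window = token[i:i + w]
        same = sum(1 for a, b in zip(window, jargon) if a == b)
        if same / w >= threshold:
            return True
    return False

def find_suspicious(tokens, lexicon):
    """Return the tokens that partially match any known jargon word."""
    return [t for t in tokens if any(window_match(t.lower(), j) for j in lexicon)]

print(find_suspicious("he is such a noobie player".split(), {"noob"}))
# → ['noobie']
```

In the full method, each flagged token would then accumulate a probability and a counter as described, with the final jargon decision made once the probability crosses the threshold.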
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING (csandit)
The proposed approach deals with the detection of jargon words in electronic data in different communication mediums such as the internet and mobile services. In real life, however, jargon words are not always used in their complete word forms; most of the time they appear in abbreviated forms such as sounds-alike forms and taboo morphemes. The proposed approach detects these abbreviated forms as well, using a semi-supervised learning methodology that derives the probability of a suspicious word being a jargon word from the synset and concept analysis of the text.
1) This document discusses stemming algorithms that have been used for the Odia language. Stemming is the process of reducing inflected words to their root or stem for purposes like information retrieval.
2) It reviews different stemming algorithms that have been applied to Odia text, including suffix stripping, affix removal, and stochastic algorithms. It also discusses common errors in stemming like over-stemming and under-stemming.
3) Applications of stemming discussed include information retrieval, text summarization, machine translation, indexing, and question answering systems. The document concludes by surveying prior work on stemming algorithms for Odia.
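The suffix-stripping approach surveyed above can be sketched minimally as follows. The suffix list here is a placeholder (a real Odia stemmer would list Odia-script suffixes), and the minimum-stem-length guard shows one common way over-stemming is avoided.

```python
# Minimal suffix-stripping stemmer sketch. SUFFIXES is illustrative only;
# an actual Odia stemmer would use Odia-script suffixes. The min_stem
# guard prevents over-stemming very short words.

SUFFIXES = ["ing", "ed", "s"]  # placeholder suffix list

def strip_suffix(word: str, min_stem: int = 3) -> str:
    # Try longer suffixes first so "ing" wins over "s" etc.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

print([strip_suffix(w) for w in ["playing", "played", "plays", "as"]])
# → ['play', 'play', 'play', 'as']
```

Note how "as" survives untouched: stripping its "s" would leave a one-letter stem, which is exactly the over-stemming error the survey discusses.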
Detection of slang words in e-data using semi-supervised learning (ijaia)
The proposed algorithmic approach deals with finding the sense of a word in electronic data. Nowadays, in different communication mediums such as the internet and mobile services, people use words that are slang in nature. This approach detects those abusive words using a supervised learning procedure. In real life, however, slang words are not always used in their complete word forms; most of the time they appear in abbreviated forms such as sounds-alike forms and taboo morphemes. The proposed approach can detect those abbreviated forms as well, using a semi-supervised learning procedure. Using the synset and concept analysis of the text, the probability of a suspicious word being a slang word is also evaluated.
The document describes Agaraadhi, a novel online dictionary framework for the Tamil language. The framework indexes over 3 lakh Tamil words, providing morphological analysis, word usage statistics, translations to English, and more. It consists of online and offline components that together enable features like spelling correction, word suggestions, analyzing word usage in literature and social media, and games to support learning. The framework aims to provide more robust Tamil language reference than existing dictionaries.
A Novel Approach for Rule Based Translation of English to Marathi (aciijournal)
This paper presents a design for a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main source-side processing. An English-Marathi bilingual dictionary is created; the system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
IRJET- Short-Text Semantic Similarity using GloVe Word Embedding (IRJET Journal)
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
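The core operation described above, encoding a short text by averaging its word vectors and comparing texts by cosine similarity, can be sketched as follows. The tiny hand-made vectors stand in for real pretrained GloVe embeddings, which would be loaded from a file in practice.

```python
import math

# Toy 3-dimensional vectors standing in for pretrained GloVe embeddings.
VEC = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.0],
    "car":  [0.0, 0.1, 0.9],
    "road": [0.1, 0.0, 0.8],
}

def encode(text):
    """Encode a short text as the average of its known word vectors."""
    words = [w for w in text.lower().split() if w in VEC]
    dim = len(next(iter(VEC.values())))
    return [sum(VEC[w][i] for w in words) / len(words) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related words score higher than unrelated texts.
print(cosine(encode("cat"), encode("dog")) >
      cosine(encode("cat dog"), encode("car road")))
# → True
```

With real GloVe vectors the same two functions apply unchanged; only the `VEC` lookup table would be replaced by the pretrained embedding file.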
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the field has grown vast and advanced. Language is the means of communication and interaction among humans, and in the present scenario, when everything depends on machines and is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was put through the Turing test to check similarity between data, though this was limited to small data sets. Later, various algorithms were developed, along with the concept of Artificial Intelligence (AI), for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques on different parameters.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
1) The paper proposes an efficient Tamil text compaction system that reduces Tamil text to around 40% of the original by identifying word categories and mapping words to compact forms while maintaining meaning.
2) The system handles common Tamil words, abbreviations/acronyms, and numbers by using a morphological analyzer to identify word roots and a generator to re-add suffixes. Compact forms are retrieved from mappings stored in data structures like trees and hashmaps.
3) Testing on over 10,000 words showed the final text was reduced to 40% of the original size, providing a more efficient way to communicate in Tamil on platforms with character limits like social media and text messages.
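The mapping lookup at the heart of the compaction system can be sketched as below. The Latin-script words and compact forms are hypothetical placeholders for actual Tamil entries, and a plain dictionary stands in for the trees and hashmaps the paper uses.

```python
# Sketch of dictionary-based compaction: each word is looked up in a hash
# map of compact forms; unknown words pass through unchanged. The entries
# are hypothetical Latin placeholders for real Tamil words.

COMPACT = {"vanakkam": "vkm", "nandri": "nri"}  # hypothetical mappings

def compact_text(text: str) -> str:
    return " ".join(COMPACT.get(w, w) for w in text.split())

original = "vanakkam nanba nandri"
short = compact_text(original)
print(short, f"{len(short) / len(original):.0%}")
```

In the full system, the morphological analyzer first reduces an inflected word to its root before lookup, and the generator re-adds the suffixes to the retrieved compact form.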
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades, speech-interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach, which requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover varied speech so that the recognizer can build models for all the sounds present. For Telugu, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu; phrases comprising several words (tokens) in English map onto a single word in Telugu. The Telugu language is phonetic in nature as well as morphologically rich, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is a recognition enhancement process developed by us for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a greatly reduced corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
Natural language processing using Python (Prakash Anand)
Natural language processing (NLP) is concerned with interactions between computers and human languages. NLP analyzes text to handle tasks like summarization, translation, sentiment analysis, and topic segmentation. The Natural Language Toolkit (NLTK) is a Python library that provides tools for NLP tasks like tokenization, stemming, tagging, parsing, and classification. Tokenization is the process of splitting text into tokens or chunks. Bag-of-words is an algorithm that encodes text as numeric vectors, representing word presence or absence to allow machine learning on text data. NLP has applications in areas like spam filtering, chatbots, and sentiment analysis.
The document summarizes Sentimatrix, a multilingual sentiment analysis service that can extract sentiments from text and associate them with named entities. It uses a combination of rule-based classification, statistics, and machine learning. The system has modules for preprocessing text, detecting the language, recognizing named entities, and identifying sentiments. It was evaluated on Romanian texts and achieved promising results, with an F-measure of 90.72% for named entity extraction and 66.73% for named entity classification. The system represents sentiments as weights and uses sentiment triggers, modifiers, and negation words to determine the overall sentiment expressed towards an entity.
An Improved Approach for Word Ambiguity Removal (Waqas Tariq)
Word ambiguity removal is the task of removing ambiguity from a word, i.e., identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a part-of-speech tagger and three categories for word sense disambiguation (WSD). Improving human-computer interaction requires better interactions between users and computers; for this, supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the best suitable domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... (ijnlc)
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task in Natural Language Processing that is used in applications such as machine translation, information extraction and retrieval, and open- or closed-domain question answering, owing to its semantic nature. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have also been developed, but in vain, as creating the required training corpus, a sense-tagged corpus, is difficult. This paper presents a hybrid approach to resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
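The use of lexical knowledge for disambiguation can be illustrated with a simplified Lesk-style overlap: pick the sense whose dictionary gloss shares the most words with the context. The two toy glosses below are stand-ins for real WordNet glosses (which the paper accesses via JAWS), so this is an illustration of the principle, not the paper's hybrid method.

```python
# Simplified Lesk-style disambiguation: choose the sense whose gloss
# overlaps most with the sentence context. The glosses below are toy
# stand-ins for real WordNet sense glosses.

SENSES = {
    "bank_1": "financial institution that accepts deposits money loans",
    "bank_2": "sloping land beside a body of water river",
}

def lesk(context: str, senses=SENSES) -> str:
    ctx = set(context.lower().split())

    def overlap(sense):
        return len(ctx & set(senses[sense].split()))

    return max(senses, key=overlap)

print(lesk("he sat on the bank of the river watching the water"))
# → bank_2
```

Adding world knowledge, as the hybrid approach does, amounts to enriching the context and gloss representations beyond this bare word overlap.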
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (ijnlc)
Text analysis has been attracting increasing attention in this data era, and selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far, yet it remains unclear which method is the most effective in practice. This article focuses on evaluating and comparing the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. It discusses the general versatility of feature selection methods and their role in selecting appropriate features for varying data. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine) classification, and different criteria for ranking feature selection methods were used to measure the general versatility of these methods. The results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the better method overall.
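The chi-square score the study favors measures how strongly a feature's presence depends on the class. For a single binary feature (say, presence of a word) and two classes, it can be computed from a 2x2 contingency table; the counts below are made-up illustrative data.

```python
# Chi-square score of one binary feature (e.g. presence of a word) against
# a two-class label, computed from a 2x2 contingency table of observed counts.

def chi_square(table):
    """table[i][j]: observed count of feature=i (0/1) with class=j (0/1)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(2)) for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Made-up counts: the word appears in 30 of 40 class-A texts
# but in only 5 of 40 class-B texts.
print(round(chi_square([[10, 35], [30, 5]]), 2))  # → 31.75
```

A high score like this marks the word as discriminative; a feature distributed identically across classes scores 0 and would be discarded by chi-square selection.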
The document describes a method for detecting hostile content on social media using task adaptive pretraining of transformer models.
Key points:
- The method uses pretrained IndicBERT models fine-tuned on Hindi tweets to generate embeddings for text, hashtags, and emojis, which are concatenated and passed through an MLP classifier.
- An additional stage of "task adaptive pretraining" further pretrains the text encoder on the task data prior to fine-tuning, which improves performance over directly fine-tuning.
- Evaluation on a Hindi hostile tweet detection dataset shows the task adaptive pretraining approach improves F1 scores for hostility detection and related subtasks compared to without additional pretraining.
QUESTION ANALYSIS FOR ARABIC QUESTION ANSWERING SYSTEMS (ijnlc)
The first step in processing a question in Question Answering (QA) systems is a detailed analysis of the question to determine what it is asking for and how best to answer it. Our question analysis uses several techniques to analyze any question given in natural language: a Stanford POS tagger and parser for Arabic, a named entity recognizer, a tokenizer, and stop-word removal, question expansion, question classification, and question focus extraction components. We employ numerous detection rules and a trained classifier using features from this analysis to detect important elements of the question, including: 1) the portion of the question that refers to the answer (the focus); 2) the terms in the question that identify what type of entity is being asked for (the lexical answer types); 3) question expansion; and 4) a process of classifying the question into one or more of several different types. We describe how these elements are identified and evaluate the effect of accurate detection on our question-answering system using the Mean Reciprocal Rank (MRR) accuracy measure.
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi... (IRJET Journal)
This document summarizes a survey on using deep learning to develop grammar checkers for Indian languages. It discusses different types of existing grammar checkers, such as rule-based, statistical, and hybrid approaches. Grammar checkers have been developed for several languages, including English, Afan Oromo, Portuguese, Punjabi, Amharic, Swedish, Nepali, Icelandic, and Hindi. Most use rule-based approaches, where input sentences are checked against predefined grammar rules; statistical approaches check input against annotated corpora; and hybrid approaches combine the two. The survey concludes that while many grammar checkers exist for other languages, only a limited number have been developed for Indian languages, leaving room for future research on developing deep learning-based grammar checkers for them.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a limit on the number of sentences. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, combining scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) to extract keywords, which are later used for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database, and depending on the number of sentences given by the user, a summary is generated.
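The TF-IDF part of the keyword scoring above can be sketched as follows (the GSS-coefficient term the paper additionally combines is omitted here for brevity, and the example documents are made up):

```python
import math

# Sketch of TF-IDF keyword scoring: each word of a document is scored by
# term frequency times inverse document frequency over the collection.
# (The paper additionally combines GSS coefficients; omitted for brevity.)

def tf_idf(docs):
    n = len(docs)
    all_scores = []
    for doc in docs:
        words = doc.lower().split()
        scores = {}
        for w in set(words):
            tf = words.count(w) / len(words)
            df = sum(1 for d in docs if w in d.lower().split())
            scores[w] = tf * math.log(n / df)
        all_scores.append(scores)
    return all_scores

docs = ["the match was a great cricket match",
        "the weather report said rain"]
scores = tf_idf(docs)[0]
print(max(scores, key=scores.get))  # → match
```

Sentences are then ranked by the scores of the keywords they contain, and the top-ranked sentences, up to the user's limit, form the extractive summary.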
QUESTION ANSWERING SYSTEM USING ONTOLOGY IN MARATHI LANGUAGE (ijaia)
This document discusses a proposed question answering system for the Marathi language that uses ontology as a knowledge base. The system aims to provide accurate answers to user questions in Marathi by analyzing queries semantically using ontologies. Ontologies are developed with help from domain experts and represent domain knowledge through semantic relations. The system first analyzes user questions syntactically and semantically. It then extracts candidate answers from the ontology and generates a precise answer in Marathi language to satisfy the original user query. The use of ontology for semantic analysis is meant to enhance the accuracy of answers provided by the question answering system.
Parameters Optimization for Improving ASR Performance in Adverse Real World N... (Waqas Tariq)
Existing research shows that many techniques and methodologies are available for each step of an Automatic Speech Recognition (ASR) system, but performance (minimizing the Word Error Rate, WER, and maximizing the Word Accuracy Rate, WAR) does not depend only on the technique applied. The research indicates that performance mainly depends on the category of noise, the noise level, and the variable sizes of window, frame, frame overlap, etc. considered in existing methods. The main aim of the work presented in this paper is to use variable parameter sizes, such as window size, frame size, and frame overlap percentage, to observe the performance of algorithms for various categories and levels of noise, and to train the system across all parameter sizes and categories of real-world noisy environments to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests with variable parameter sizes. It is observed that it is very hard to evaluate test results and decide on a parameter size for improving and optimizing ASR performance. Hence, this study further suggests feasible and optimal parameter sizes using a Fuzzy Inference System (FIS) for enhancing accuracy in adverse real-world noisy environmental conditions. This work will be helpful for discriminative training of ubiquitous ASR systems for better Human Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, ubiquitous ASR system, Human Computer Interaction (HCI)
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and this information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternate approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of the Devanagari script. Shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
IRJET- A Review on Part-of-Speech Tagging on Gujarati Language (IRJET Journal)
This document reviews part-of-speech tagging methods for the Gujarati language. It discusses rule-based, stochastic, and hybrid techniques for POS tagging. Rule-based methods use linguistic rules but require extensive manual work. Stochastic methods like Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields are more automated but can tag ungrammatical sentences. Hybrid methods combine rule-based and stochastic approaches to achieve high accuracy. The document evaluates different POS tagging methods for the challenges of tagging Gujarati text.
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL (ijfcstjournal)
Named Entity Recognition is the task of recognizing named entities, or proper nouns, in a document and classifying them into different named entity classes. In this paper we introduce a modified tool that not only performs Named Entity Recognition (NER) in any natural language and assists in corpus development (i.e., building training and testing documents), but also solves the unknown-words problem in NER, handles spurious words, and automatically computes the performance metrics for an NER-based system: recall, precision, and F-measure.
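The metric computation the tool automates can be sketched by comparing predicted entity spans against gold-standard spans; the example entities below are made up for illustration.

```python
# Precision, recall, and F-measure for NER output, computed by comparing
# predicted (entity, class) pairs against a gold standard.

def ner_metrics(gold: set, predicted: set):
    tp = len(gold & predicted)  # exact matches of entity and class
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {("John", "PERSON"), ("Paris", "LOCATION"), ("IBM", "ORG")}
pred = {("John", "PERSON"), ("Paris", "ORG")}  # one hit, one misclassified
p, r, f = ner_metrics(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.5 0.33 0.4
```

Note that "Paris" counts as an error here because its class is wrong, which is exactly the distinction between recognition and classification that NER evaluation must capture.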
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (kevig)
This article focuses on evaluating and comparing the available feature selection methods in general versatility regarding authorship attribution problems and tries to identify which method is the most effective. The discussions on general versatility of feature selection methods and its connection in selecting the appropriate features for varying data were done. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis results indicate the best feature selection method is different for each dataset; however, some methods can always extract useful information to discriminate the classes. The chi-square was proved to be a better method overall.
Process sentiments in NLP with the Naive Bayes rule, Random Forest, Support Vector Machine, and much more.
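The Naive Bayes approach mentioned above can be sketched end to end on toy data; in practice one would use scikit-learn's MultinomialNB, and the training samples here are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# A tiny multinomial Naive Bayes sentiment classifier with Laplace
# smoothing, trained on made-up toy data.

def train(samples):
    """samples: list of (text, label). Returns per-class word counts and priors."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in samples:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def predict(text, counts, labels):
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, -math.inf
    for label in labels:
        lp = math.log(labels[label] / sum(labels.values()))  # class prior
        total = sum(counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing handles words unseen in a class
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [("great movie loved it", "pos"), ("wonderful great acting", "pos"),
        ("terrible boring movie", "neg"), ("hated it awful", "neg")]
counts, labels = train(data)
print(predict("great acting loved it", counts, labels))  # → pos
```

Random Forest and SVM classifiers would consume the same word-count features; only the decision rule changes.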
Some Generalized Information Inequalities (ijitjournal)
Information inequalities are very useful and play a fundamental role in the literature of information theory. Applications of information inequalities have been discussed by well-known authors such as Dragomir and Taneja, among many other researchers. In this paper, we consider some new functional information inequalities in the form of generalized information divergence measures. We also consider relations between Csiszar's f-divergence, a new f-divergence, and other well-known divergence measures using information inequalities. Numerical bounds of the information divergence measures have also been studied.
Performance analysis of contourlet based hyperspectral image fusion method (ijitjournal)
Recently, contourlet transform has been widely used in hyperspectral image fusion due to its advantages,
such as high directionality and anisotropy; and studies show that the contourlet-based fusion methods
perform better than the existing conventional methods including wavelet-based fusion methods. Few studies
have been done to comparatively analyze the performance of contourlet-based fusion methods;
furthermore, no research has been done to analyze the contourlet-based fusion methods by focusing on
their unique transform mechanisms. In addition, no research has focused on the original contourlet
transform and its upgraded versions. In this paper, we investigate three different kinds of contourlet
transform: i) original contourlet transform, ii) nonsubsampled contourlet transform, iii) contourlet
transform with sharp frequency localization. The latter two transforms were developed to overcome the
major drawbacks of the original contourlet transform; so it is necessary and beneficial to see how they
perform in the context of hyperspectral image fusion. The results of our comparative analysis show that the
latter two transforms perform better than the original contourlet transform in terms of increasing spatial
resolution and preserving spectral information.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
1) The paper proposes an efficient Tamil text compaction system that reduces Tamil text to around 40% of the original by identifying word categories and mapping words to compact forms while maintaining meaning.
2) The system handles common Tamil words, abbreviations/acronyms, and numbers by using a morphological analyzer to identify word roots and a generator to re-add suffixes. Compact forms are retrieved from mappings stored in data structures like trees and hashmaps.
3) Testing on over 10,000 words showed the final text was reduced to 40% of the original size, providing a more efficient way to communicate in Tamil on platforms with character limits like social media and text messages.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover the different sound orders so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or even millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. Telugu is phonetic in nature in addition to being rich in morphology; that is why the speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in an attempt to build a voice enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a greatly reduced corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
Natural language processing using pythonPrakash Anand
Natural language processing (NLP) is concerned with interactions between computers and human languages. NLP analyzes text to handle tasks like summarization, translation, sentiment analysis, and topic segmentation. The Natural Language Toolkit (NLTK) is a Python library that provides tools for NLP tasks like tokenization, stemming, tagging, parsing, and classification. Tokenization is the process of splitting text into tokens or chunks. Bag-of-words is an algorithm that encodes text as numeric vectors, representing word presence or absence to allow machine learning on text data. NLP has applications in areas like spam filtering, chatbots, and sentiment analysis.
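The tokenization and bag-of-words steps described above can be sketched in a few lines of plain Python (a simplified stand-in for NLTK's tokenizers and vectorizers, not the library itself):

```python
import re

def tokenize(text):
    """Lowercase and split on non-letters -- a crude stand-in for a real
    word tokenizer such as NLTK's word_tokenize."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(corpus):
    """Encode each document as a vector of word counts over a shared vocabulary."""
    vocab = sorted({w for doc in corpus for w in tokenize(doc)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in corpus:
        vec = [0] * len(vocab)
        for w in tokenize(doc):
            vec[index[w]] += 1
        vectors.append(vec)
    return vocab, vectors

corpus = ["NLP analyzes text.", "NLP loves text, text, text!"]
vocab, vectors = bag_of_words(corpus)
print(vocab)    # ['analyzes', 'loves', 'nlp', 'text']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 3]]
```

The resulting numeric vectors are what downstream machine learning models consume.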
The document summarizes Sentimatrix, a multilingual sentiment analysis service that can extract sentiments from text and associate them with named entities. It uses a combination of rule-based classification, statistics, and machine learning. The system has modules for preprocessing text, detecting the language, recognizing named entities, and identifying sentiments. It was evaluated on Romanian texts and achieved promising results, with an F-measure of 90.72% for named entity extraction and 66.73% for named entity classification. The system represents sentiments as weights and uses sentiment triggers, modifiers, and negation words to determine the overall sentiment expressed towards an entity.
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is the task of removing ambiguity from a word, i.e. identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a Part of Speech tagger and three categories for word sense disambiguation (WSD). Improving interactions between users and computers is essential for Human Computer Interaction; for this, supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the best suitable domain of a word.
Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
AN HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH...ijnlc
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, which is a tricky task to perform in Natural Language Processing; because of its semantic nature it is used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems. Researchers have tried unsupervised and knowledge-based learning approaches; however, such approaches have not proved very helpful. Various supervised learning algorithms have been made, but in vain, as creating the training corpus, a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSijnlc
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far; however, it remains unclear which method is the most effective in practice. This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. We discuss the general versatility of feature selection methods and its connection to selecting appropriate features for varying data. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the better method overall.
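For concreteness, the chi-square score that the study found strongest is computed from a 2x2 term/class contingency table; the following is a generic textbook formulation, not the paper's own code:

```python
def chi_square(A, B, C, D):
    """Chi-square association between a term and a class from a 2x2
    contingency table:
      A: docs in the class containing the term
      B: docs outside the class containing the term
      C: docs in the class without the term
      D: docs outside the class without the term
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# A term in 8 of 10 class docs but only 1 of 10 other docs scores far
# higher than a term spread evenly across the classes.
print(chi_square(8, 1, 2, 9))   # strongly class-associated
print(chi_square(5, 5, 5, 5))   # no association -> 0.0
```

Terms are then ranked by this score and the top-k kept as features.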
The document describes a method for detecting hostile content on social media using task adaptive pretraining of transformer models.
Key points:
- The method uses pretrained IndicBERT models fine-tuned on Hindi tweets to generate embeddings for text, hashtags, and emojis, which are concatenated and passed through an MLP classifier.
- An additional stage of "task adaptive pretraining" further pretrains the text encoder on the task data prior to fine-tuning, which improves performance over directly fine-tuning.
- Evaluation on a Hindi hostile tweet detection dataset shows the task adaptive pretraining approach improves F1 scores for hostility detection and related subtasks compared to without additional pretraining.
QUESTION ANALYSIS FOR ARABIC QUESTION ANSWERING SYSTEMS ijnlc
The first step of processing a question in Question Answering(QA) Systems is to carry out a detailed analysis of the question for the purpose of determining what it is asking for and how to perfectly approach answering it. Our Question analysis uses several techniques to analyze any question given in natural language: a Stanford POS Tagger & parser for Arabic language, a named entity recognizer, tokenizer,
Stop-word removal, Question expansion, Question classification and Question focus extraction components. We employ numerous detection rules and trained classifier using features from this analysis to detect important elements of the question, including: 1) the portion of the question that is a referring to the answer (the focus); 2) different terms in the question that identify what type of entity is being asked for (the
lexical answer types); 3) Question expansion ; 4) a process of classifying the question into one or more of several and different types; and We describe how these elements are identified and evaluate the effect of accurate detection on our question-answering system using the Mean Reciprocal Rank(MRR) accuracy measure.
IRJET- Survey on Grammar Checking and Correction using Deep Learning for Indi...IRJET Journal
This document summarizes a survey on using deep learning to develop grammar checkers for Indian languages. It discusses different types of existing grammar checkers such as rule-based, statistical, and hybrid approaches. Grammar checkers have been developed for several languages including English, Afan Oromo, Portuguese, Punjabi, Amharic, Swedish, Nepali, Icelandic, and Hindi. Most use rule-based approaches where input sentences are checked against predefined grammar rules. Statistical approaches check input against annotated corpora. Hybrid approaches combine rule-based and statistical methods. The survey concludes that while many grammar checkers exist for other languages, only a limited number have been developed for Indian languages, leaving room for future research on developing deep learning-based grammar checkers for Indian languages.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents ijsc
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a limit on the number of sentences. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, then combine the scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) to extract keywords, and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
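The TF-IDF part of the keyword scoring described above can be sketched as follows (a generic illustration; the paper additionally combines GSS coefficients, which are omitted here):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_id, k=3):
    """Rank the words of docs[doc_id] by TF * IDF, where
    IDF(w) = log(N / df(w)) over the whole collection."""
    N = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_id])          # term frequency in the target doc
    scores = {w: tf[w] * math.log(N / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

docs = [
    "the summary of the kannada document".split(),
    "the keywords reflect the document content".split(),
    "keywords act as indices for a document".split(),
]
print(tfidf_keywords(docs, 0, k=3))
```

Words occurring in every document (here "document") get IDF 0 and drop out, which is exactly why TF-IDF favours category-specific keywords.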
QUESTION ANSWERING SYSTEM USING ONTOLOGY IN MARATHI LANGUAGEijaia
This document discusses a proposed question answering system for the Marathi language that uses ontology as a knowledge base. The system aims to provide accurate answers to user questions in Marathi by analyzing queries semantically using ontologies. Ontologies are developed with help from domain experts and represent domain knowledge through semantic relations. The system first analyzes user questions syntactically and semantically. It then extracts candidate answers from the ontology and generates a precise answer in Marathi language to satisfy the original user query. The use of ontology for semantic analysis is meant to enhance the accuracy of answers provided by the question answering system.
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
From the existing research it has been observed that many techniques and methodologies are available for performing every step of an Automatic Speech Recognition (ASR) system, but the performance (minimization of Word Error Rate, WER, and maximization of Word Accuracy Rate, WAR) does not depend only on the technique applied in a given method. The research indicates that performance mainly depends on the category of the noise, the level of the noise, and the variable sizes of the window, frame, frame overlap etc. considered in existing methods. The main aim of the work presented in this paper is to use variable sizes of parameters like window size, frame size and frame overlap percentage to observe the performance of algorithms for various categories of noise at different levels, and to train the system for all parameter sizes and categories of real-world noisy environments to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests obtained by applying variable parameter sizes. It is observed that it is very hard to evaluate the test results and decide on parameter sizes for ASR performance improvement and its resultant optimization. Hence, this study further suggests feasible and optimum parameter sizes using a Fuzzy Inference System (FIS) for enhancing resultant accuracy in adverse real-world noisy environmental conditions. This work will be helpful for discriminative training of ubiquitous ASR systems for better Human Computer Interaction (HCI).
Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, Ubiquitous ASR System, Human Computer Interaction (HCI)
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amount of information is lying dormant in historical documents and manuscripts. This information would go futile if not stored in digital form. Searching some relevant information from these scanned images would ideally require converting these document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
IRJET- A Review on Part-of-Speech Tagging on Gujarati LanguageIRJET Journal
This document reviews part-of-speech tagging methods for the Gujarati language. It discusses rule-based, stochastic, and hybrid techniques for POS tagging. Rule-based methods use linguistic rules but require extensive manual work. Stochastic methods like Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields are more automated but can tag ungrammatical sentences. Hybrid methods combine rule-based and stochastic approaches to achieve high accuracy. The document evaluates different POS tagging methods for the challenges of tagging Gujarati text.
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of the user’s query. This helps users to express the information need in their native languages. The machine translation based (MT-based)
approach of CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done in CLIR and MT systems for Marathi language in India.
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOLijfcstjournal
Named Entity Recognition is the task of recognizing Named Entities or proper nouns in a document and then classifying them into different Named Entity classes. In this paper we introduce our modified tool that not only performs Named Entity Recognition (NER) in any natural language and performs the corpus development task, i.e. assists in developing training and testing documents, but also solves the unknown-word problem in NER, handles spurious words, and automatically computes the performance metrics for an NER-based system, i.e. Recall, Precision and F-Measure.
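The performance metrics the tool computes (Recall, Precision, F-Measure) follow the standard definitions; here is a minimal sketch on sets of entity spans, with invented example data:

```python
def ner_metrics(true_entities, predicted_entities):
    """Precision, recall and F-measure for NER, computed on sets of
    (start, end, type) spans -- spurious predictions hurt precision,
    missed entities hurt recall."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (8, 10, "ORG")}
print(ner_metrics(gold, pred))  # two of three spans match exactly
```

Note that a span with the wrong entity class counts as both a false positive and a false negative under this exact-match scheme.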
Decision Making Framework in e-Business Cloud Environment Using Software Metr...ijitjournal
Cloud computing is one of the most important technologies in the IT industry, enabling companies to offer access to their systems and application services on a pay-per-use basis. As a result, several enterprises including Facebook, Microsoft, Google, and Amazon have started offering such services to their clients. Software quality is a key factor in market competition. This paper presents a hybrid framework based on the goal/question/metric paradigm to evaluate the quality and effectiveness of previous software goods in projects, products and organizations in a cloud computing environment. Our approach supports decision making at the project, product and organization levels using neural networks and three angular metrics, i.e., project metrics, product metrics, and organization metrics.
ON THE COVERING RADIUS OF CODES OVER Z4 WITH CHINESE EUCLIDEAN WEIGHTijitjournal
In this paper, we give lower and upper bounds on the covering radius of codes over the ring Z4 with respect to the Chinese Euclidean distance. We also determine the covering radius of various repetition codes and Simplex codes of Type α and Type β, and give bounds on the covering radius for MacDonald codes of both types over Z4.
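As a hedged illustration of the covering-radius concept (using the Lee distance rather than the paper's Chinese Euclidean distance, and a brute-force search that is only feasible for tiny lengths):

```python
from itertools import product

LEE = {0: 0, 1: 1, 2: 2, 3: 1}   # Lee weights of the elements of Z4

def lee_distance(u, v):
    return sum(LEE[(a - b) % 4] for a, b in zip(u, v))

def covering_radius(code, n):
    """Max over all vectors in Z4^n of the Lee distance to the nearest codeword."""
    return max(min(lee_distance(v, c) for c in code)
               for v in product(range(4), repeat=n))

n = 2
repetition = [(c,) * n for c in range(4)]   # codewords 00, 11, 22, 33
print(covering_radius(repetition, n))       # 2: e.g. (0, 2) is Lee distance 2 from every codeword
```

The papers derive such radii analytically; the exhaustive search above merely demonstrates the definition on a toy code.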
FUZZY CLUSTERING FOR IMPROVED POSITIONINGijitjournal
In this research, we focused on developing a positioning system based on common short-range wireless technology. We developed a system that assumes the existence of a number of fixed access points (5+) and employs time difference of arrival (TDOA) with a Kalman filter. We also discuss multi-/tri-lateration and evaluate some of the root causes of dilution of precision (DOP) in positioning using a Kalman filter. This article presents a simpler fuzzy clustering approach (SFCA) to support short-range positioning. We use fixed access points calibrated at 2.4 GHz and apply real-time observed data to our model offline. We have extended the model to mimic dedicated short range communications (DSRC) signals at 5.9 GHz, as defined in IEEE 1609.x. The results are compared in each case against Differential Global Positioning System (DGPS) measurements captured in the same test. We kept a clear Line-of-Sight (LOS) in all our assessments and used vehicles moving at low speed (<60 km/h). Two different options for implementing the SFCA are presented, analysed and compared.
On the Covering Radius of Codes Over Z6 ijitjournal
In this correspondence, we give lower and upper bounds on the covering radius of codes over the finite ring Z6 with respect to different distances such as the Hamming, Lee, Euclidean and Chinese Euclidean distances. We also determine the covering radius of various block repetition codes over Z6.
Grid computing can involve a lot of computational tasks, which require trustworthy computational nodes. Load balancing in grid computing is a technique which optimizes the whole process of assigning computational tasks to processing nodes. Grid computing is a form of distributed computing, but differs from conventional distributed computing in that it tends to be heterogeneous, more loosely coupled and geographically dispersed. Optimization of this process must include maximizing overall resource utilization, balancing the load on each processing unit, and decreasing the overall time. Evolutionary algorithms like genetic algorithms have been studied for the implementation of load balancing across grid networks, but the problem with these genetic algorithms is that they are quite slow when a large number of tasks needs to be processed. In this paper we give a novel approach based on parallel genetic algorithms for enhancing the overall performance and optimization of load balancing across grid nodes.
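A minimal serial genetic algorithm for the task-to-node assignment problem might look like the following sketch (the paper's contribution is a parallel variant, which is not reproduced here; all parameters and data are illustrative):

```python
import random

def makespan(assignment, task_costs, n_nodes):
    """Fitness: the load of the busiest node (lower is better)."""
    loads = [0.0] * n_nodes
    for task, node in enumerate(assignment):
        loads[node] += task_costs[task]
    return max(loads)

def ga_balance(task_costs, n_nodes, pop_size=40, generations=60, seed=1):
    rng = random.Random(seed)
    n = len(task_costs)
    # Chromosome: for each task, the index of the node it is assigned to.
    pop = [[rng.randrange(n_nodes) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: makespan(a, task_costs, n_nodes))
        elite = pop[:pop_size // 2]            # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.3:             # mutation: reassign one task
                child[rng.randrange(n)] = rng.randrange(n_nodes)
            children.append(child)
        pop = elite + children
    return min(pop, key=lambda a: makespan(a, task_costs, n_nodes))

costs = [4, 7, 2, 9, 5, 3, 8, 6, 1, 4]
best = ga_balance(costs, n_nodes=3)
print(best, makespan(best, costs, 3))
```

Parallelizing this amounts to evaluating fitness, or evolving sub-populations, on separate workers and periodically exchanging individuals.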
A High Step Up Hybrid Switch Converter Connected With PV Array For High Volt...ijitjournal
This paper presents a step-up DC-to-DC converter with a hybrid switched-capacitor technique, having a high voltage conversion ratio with small switch voltage stress. The converter is suitable for applications where high voltage conversion is required. The proposed DC-DC converter has a low-voltage-rated MOSFET switch and is connected to a PV array to obtain a high output voltage at small duty ratios; hence it has high efficiency. The principles of operation and the theoretical analysis are presented in this paper. All simulations are done in the MATLAB-SIMULINK environment, and results were obtained with a voltage conversion ratio of 4.
Electroencephalography (EEG) based brain-computer interfaces (BCI) need efficient algorithms to extract discriminative features from raw EEG signals. Selecting optimal spatial-spectral features is key to high-performance motor imagery (MI) classification, one of the main topics in EEG-based brain-computer interfaces. A novel method first formulates feature selection as maximizing the mutual information between class labels and features, and then uses an efficient pattern feature extraction framework to select an effective feature set. The results show the classification accuracy obtained, compared with other existing algorithms.
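The mutual-information criterion mentioned above can be illustrated for discrete features with a few lines of Python (a generic formulation, not the paper's algorithm):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits between two paired discrete sequences, estimated
    from empirical frequencies."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

labels  = [0, 0, 1, 1]
perfect = [0, 0, 1, 1]   # feature identical to the label
useless = [0, 1, 0, 1]   # feature independent of the label
print(mutual_information(perfect, labels))  # 1.0 bit
print(mutual_information(useless, labels))  # 0.0
```

Feature selection then keeps the features whose mutual information with the class labels is largest; for continuous EEG features the probabilities must be estimated by binning or density estimation.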
Rapid phenotyping of prawn biochemical attributes using hyperspectral imagingStuart Hinchliff
A seminar I presented on using hyperspectral imaging to non-invasively phenotype prawns. The project focused on removing the background of the images, developing an intuitive UI and applying statistical algorithms. The aim was to train a model to predict the biochemical attributes of prawns using their spectra.
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
This document proposes a methodology to extract information from big data sources like course handouts and directories and represent it in a graphical, ontological tree format. Keywords are extracted from documents using natural language processing techniques and used to generate a hierarchical tree based on the DMOZ open directory project. The trees provide a comprehensive overview of document content and structure. The method is implemented using Python for natural language processing and Java for visualization. Evaluation on computer science course handouts shows the trees accurately represent topic coverage and depth. Future work aims to increase the number of keywords extracted.
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEijnlc
We propose an automatic classification system for movie genres based on different features from their textual synopses. Our system is first trained on thousands of movie synopses from online open databases, learning relationships between textual signatures and movie genres. It is then tested on other movie synopses, and its results are compared to the true genres obtained from the Wikipedia and Open Movie Database (OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
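A majority vote over per-feature classifier predictions, as the title suggests, can be as simple as the following (an illustrative sketch, not the authors' system):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier genre predictions; ties are broken by the
    order in which labels were first voted."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]

# Three feature-specific classifiers vote on one synopsis.
votes = ["drama", "thriller", "drama"]
print(majority_vote(votes))  # drama
```

Each base classifier can be trained on a different textual signature (keywords, character n-grams, length statistics) and the ensemble decision is just the most common label.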
The MSR-NLP Chinese word segmentation system is part of a full sentence analyzer. It uses a dictionary and rules for basic segmentation, morphology, and named entity recognition to build a word lattice. The system proposes new words, prunes the lattice, and uses a parser to produce the final segmentation. It participated in four segmentation bakeoff tracks, ranking highly in each. An analysis found that parameter tuning, morphology/NER, and lattice pruning contributed most to performance, while the parser helped less. Problems included inconsistent annotations and differences in defining new words.
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
The presentation provides an overview of what an ontology is and how it can be used for representing information and for retrieving data, with a particular focus on the linguistic resources available for supporting this kind of task. It also gives an overview of semantic-based retrieval approaches, highlighting the pros and cons of semantic approaches with respect to classic ones. Use cases are presented and discussed.
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags denote a syntactic classification like noun or verb, and sometimes include additional information such as case markers (number, gender etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches usually followed in parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It uses a large corpus, so its time and space complexity is high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they are not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent tag information, implemented with multithreading. Tagging is done using the probability of the occurrence of the sentence structure along with the dictionary entry.
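The dictionary-plus-adjacent-tag idea can be sketched as a greedy toy tagger (the English words, tags and probabilities below are invented for illustration; the paper targets Malayalam and uses multithreading, both omitted here):

```python
# A toy version of the idea: a dictionary gives candidate tags per word,
# and a bigram probability over adjacent tags picks the most likely one.
DICTIONARY = {          # word -> possible POS tags (hypothetical entries)
    "the": ["DET"],
    "dog": ["NOUN"],
    "runs": ["VERB", "NOUN"],
}
BIGRAM = {              # P(tag | previous tag), illustrative numbers
    ("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.2,
}

def tag(sentence):
    """Greedy left-to-right tagging: among the dictionary's candidate tags,
    pick the one most probable after the previous tag."""
    prev, result = "<s>", []
    for word in sentence:
        candidates = DICTIONARY.get(word, ["NOUN"])   # unknown words default to NOUN
        best = max(candidates, key=lambda t: BIGRAM.get((prev, t), 0.0))
        result.append(best)
        prev = best
    return result

print(tag(["the", "dog", "runs"]))  # ['DET', 'NOUN', 'VERB']
```

A full implementation would score whole tag sequences (e.g. with Viterbi decoding) rather than committing greedily at each word.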
This document discusses folksonomies, which are classification systems created by users tagging online items. It notes that while tags may be imprecise, ambiguous, or personalized, evidence shows tagging vocabularies converge over time. The author studies tag distributions and finds single-use tags do not dominate. The document considers how to foster better tagging through educating users and improving systems, but recognizes that limiting tags risks losing valuable metadata and the diversity inherent in folksonomies. It concludes the best approach is remaining open-minded and retaining as much user-generated metadata as possible.
IRJET - Deep Collaborative Filtering with Aspect Information
This document discusses a proposed system for deep collaborative filtering with aspect information. The system aims to help web users efficiently locate relevant information on unfamiliar topics to increase their knowledge. It utilizes techniques like multi-keyword search, synonym matching, and ontology mapping to return relevant web links, images, and news articles to the user based on their search terms. The proposed system architecture includes an index structure to efficiently search and rank results based on similarity to the search query terms. The implementation and evaluation of the proposed system are also discussed.
Professional fuzzy type-ahead rummage around in XML type-ahead search techni...
Abstract – This is a research venture on the new information-access paradigm called type-ahead search, in which systems find answers to a keyword query on the fly as users type in the query. In this paper we study how to support fuzzy type-ahead search in XML. Supporting fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which have been used by many people on a daily basis. The systems received overwhelmingly positive feedback from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of the systems, and demonstrate several of them. We show that our efficient techniques can indeed allow this search paradigm to scale to large amounts of data.
Index Terms - type-ahead, large data set, server side, online directory, search technique.
The document discusses the development of a text-assisted defence information extractor using open source software. It describes the various components used in information extraction like gazetteers, sentence splitters, tokenizers, part-of-speech taggers, transducers, and JAPE grammars. The extractor helps identify keywords related to defence from a collection of domain-specific text documents. Preliminary results are promising, though more refinement is underway including a GUI for querying and a dictionary to improve keyword searching.
Improving a Lightweight Stemmer for Gujarati Language
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for the Gujarati language has always been a hot research area, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
IRJET - A Pragmatic Supervised Learning Methodology of Hate Speech Detection i...
This document summarizes a paper that proposes a supervised machine learning approach for detecting hate speech in social media. It begins with an introduction to the problem of hate speech online and anonymity enabling harmful communication. It then describes common text preprocessing techniques like tokenization and filtering used to clean text data. Feature extraction methods are discussed, including n-grams, bag-of-words, and word embeddings. Popular machine learning algorithms for classification are also summarized, such as support vector machines, logistic regression, and neural networks. The document concludes by reviewing related work on hate speech detection and challenges around dataset annotation.
With the rapidly increasing growth of internet and web usage, it has become essential to use a specific, powerful tool capable of analyzing and ranking all the reviews and opinions available on the web. In this paper we propose a new and effective approach that uses a powerful sentiment analysis procedure based on ontological adjustment and arrangement. This study also aims to use POS tag order to obtain detailed observations for any review or opinion; it also helps identify all positive/negative sentiments present and suggests a proper sentence inclination. For this we used reviews available on the internet regarding Nokia, and the Stanford parser for POS tagging.
A survey of named entity recognition in Assamese and other Indian languages
Named Entity Recognition is always important when dealing with major Natural Language Processing tasks such as information extraction, question answering, machine translation, document summarization, etc., so in this paper we put forward a survey of named entities in Indian languages with particular reference to Assamese. There are various rule-based and machine learning approaches available for Named Entity Recognition. At the beginning of the paper we give an overview of the available approaches for Named Entity Recognition, and then we discuss the related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity Recognition requires large data sets, gazetteer lists, dictionaries, etc., and useful features such as capitalization, found in English, are absent in Assamese. Apart from this we also describe some of the issues faced while doing Named Entity Recognition in Assamese.
Introduction to Natural Language Processing
Natural language processing (NLP) is a machine learning technology that gives computers the ability to
interpret, manipulate, and comprehend human language.
• Ex: Amazon’s Alexa and Apple’s Siri utilize NLP to listen to user queries and find answers
• We have large volumes of voice and text data from various communication channels like emails, text
messages, social media newsfeeds, video, audio, and more.
• They use NLP software to automatically process this data, analyze the intent or sentiment in the
message, and respond in real time to human communication
• When text mining and machine learning are combined, automated text analysis becomes possible
PREPROCESSING STEPS IN NLP
• Data preprocessing involves preparing and cleaning text data so that machines can analyze it. This
can be done in the following ways:
• Tokenization. Text is broken into smaller units called tokens, such as words, phrases or sentences,
which can then be processed individually.
• Stop word removal. Common words are removed from the text, so unique words that offer the most
information about the text remain.
• Lemmatization and stemming. Lemmatization groups together different inflected versions of the
same word. For example, the word "walking" would be reduced to its root form, or stem, "walk",
for processing.
• Part-of-speech tagging. Words are tagged based on which part of speech they correspond to, such
as nouns, verbs or adjectives.
This paper presents a natural-language-processing-based automated system called DrawPlus for generating UML diagrams, user scenarios and test cases after analyzing a business requirement specification written in natural language. DrawPlus analyzes the natural language and extracts the relevant and required information from the business requirement specification given by the user. The user writes the requirement specification in simple English, and the system has the ability to analyze it using core natural language processing techniques together with well-defined algorithms of its own. After compound analysis and extraction of the associated information, DrawPlus draws the use case diagram, user scenarios and system-level high-level test case descriptions. DrawPlus provides a more convenient and reliable way of generating use cases, user scenarios and test cases, reducing the time and cost of the software development process while accelerating about 70% of the work in the software design and testing phases.
Janani Tharmaseelan, "Cohesive Software Design", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22900.pdf
Paper URL: https://www.ijtsrd.com/computer-science/other/22900/cohesive-software-design/janani-tharmaseelan
Abstract: Text classification, or categorization, is the process of automatically tagging a textual document with its most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization; the multi-label version is more challenging. For the Arabic language, both tasks (especially the latter) become more challenging in the absence of large, free, rich and rational Arabic datasets. We used a subset of a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almost 200k articles distributed into seven categories that we offer to the research community on Arabic computational linguistics.
Code and data in this link:
https://github.com/hezam2022/text-categorization
This document describes a chatbot project that was developed to answer queries from UT-Dallas students. The chatbot uses natural language processing and a domain-specific knowledge base about UT-Dallas and its library to analyze user queries and generate relevant responses. It was implemented as an Android application using speech recognition and contains modules for tokenization, sentiment analysis, and pattern matching to understand and respond to queries. The document outlines the architecture, knowledge representation, matching algorithm, and provides examples of conversations with the chatbot.
This paper presents a model for Chinese word segmentation that integrates it as part of sentence analysis using a parser. The model achieves high accuracy by resolving most ambiguities at the lexical level using dictionary information, but handles cases requiring syntactic context in the parsing process. The complexity usually associated with parsing is reduced by pruning implausible segmentations prior to parsing. The approach is implemented in a natural language understanding system developed at Microsoft Research.
International Journal on Information Theory (IJIT), Vol.4, No.3, July 2015
DOI : 10.5121/ijit.2015.4301
Adhyann – A Hybrid Part-of-Speech Tagger
Nitigya Sharma, Nikki and Gopal Sahni
Department of Computer Science, Bharat Institute of Technology, Meerut (250004)
ABSTRACT
Part-of-speech tagging automatically tags the words of a text with labels that can be used to determine the
structure of a sentence. In this paper we propose an approach to the problem that is inspired by human
behavior. We use a combination of rule-based and dictionary-based approaches to tackle this problem. Our
goal in this paper is to design a simple yet effective system for POS tagging that also helps us understand
human behavior more effectively.
KEYWORDS
POS tagger, Natural Language Processing, Rule Based Approach, Dictionary Based Approach.
1. INTRODUCTION
Every language has a set of tags for its words; these tags describe the role of each word in a
sentence.
A very basic approach to this problem is to use a database that simply maps each word to its
corresponding tag. However, this approach suffers from various problems, such as:
• New words are created every day; a database storing all these words would grow very fast,
yet only a small portion of it would be actively used for tagging.
• Names are also parts of speech and are actively used while framing sentences, yet they are
not supposed to be stored in the database. For instance, in the sentence “Mohan is
hawaldaar.”, “Mohan” is an Indian name and “hawaldaar” is an Indian designation.
• A single word can map to various tags, for instance “fly”, yet only one tag is relevant,
based on its role in the sentence. “Fly” is a verb if it is preceded by “to” and a noun if it is
preceded by “a”.
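To make the ambiguity concrete, the “fly” case above can be sketched as follows; the tag database and the disambiguation rule here are illustrative assumptions, not part of the proposed system:

```python
# Hypothetical sketch: a direct word-to-tag database cannot resolve words
# like "fly" on its own; the preceding word decides the tag
# ("to fly" -> verb, "a fly" -> noun), as described above.
TAG_DB = {
    "fly": ["VERB", "NOUN"],   # ambiguous entry
    "to": ["TO"],
    "a": ["DETERMINER"],
}

def lookup(prev_word, word):
    """Pick a tag for `word`, disambiguating by the preceding word."""
    candidates = TAG_DB.get(word, ["NOUN"])  # unknown words default to noun
    if len(candidates) == 1:
        return candidates[0]
    if prev_word == "to":
        return "VERB"
    if prev_word == "a":
        return "NOUN"
    return candidates[0]  # no helpful context: fall back to first candidate

print(lookup("to", "fly"))  # VERB
print(lookup("a", "fly"))   # NOUN
```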
Another approach to the problem is to use rules that decide the tag of a word in a sentence based
on its position with respect to other words. This approach, as we observe, suffers from the following
problems:
• To a rule-based approach, a sentence is just a sequence of characters that looks like “XXXX
XX XXX XXX.”; for such a sentence, various tagging possibilities exist.
• Even if it identifies some tags, this approach may still fail. For instance, the two
sentences “My car is Black.” and “My car is BMW.” have the same structure and length,
but cannot be effectively tagged until the system understands the difference between “Black”
and “BMW”.
Thus, we require a system that does not possess the above limitations.
2. METHODOLOGY ADOPTED
This system works on a human-like approach. When a human reads a sentence, he identifies the
parts of speech in the sentence to understand its meaning. A sentence
may contain various words, some of which are known to the reader while others are new. For the
new words, the reader reads the complete sentence, tries to identify the proper part-of-speech
tag for the new word, and then remembers that word unconsciously.
The system works on the same principle to achieve this objective. It uses two approaches:
• Dictionary Based Approach
• Rule Based Approach
2.1. Dictionary Based Approach
In this approach, each word is directly mapped against the words already stored in the database.
If a word is found in the database, its respective tag is fetched and assigned to the word.
Each tag set has its own table containing the words that belong to that particular tag. In this way,
words belonging to different tags can be stored easily and separately.
Advantages of the Dictionary Based Approach:
• It has less chance of error.
• It can also tag incorrect sentences.
• It can also tag sentences with ambiguity.
Disadvantages of the Dictionary Based Approach:
• It requires a heavy database initially in order to work.
• Unknown words are tagged as nouns.
• Some words have two or more possible tags.
• It is unable to correctly tag names.
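A minimal sketch of this lookup, assuming illustrative per-tag tables (the word lists here are not taken from the paper):

```python
# Sketch of the dictionary-based approach: each tag set has its own table,
# and tagging a word is a direct lookup across the tables. The noun
# fallback reproduces the disadvantage noted above: unknown words are
# tagged as nouns.
TAG_TABLES = {
    "PRONOUN": {"i", "my", "he"},
    "HVERB": {"is", "was", "are"},
    "NOUN": {"car", "phone"},
}

def dictionary_tag(word):
    for tag, table in TAG_TABLES.items():
        if word.lower() in table:
            return tag
    return "NOUN"  # unknown word: defaults to noun

# "black" is not in any table, so it is (wrongly) tagged NOUN.
print([dictionary_tag(w) for w in "My car is black".split()])
```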
2.2. Rule Based Approach
The rule-based approach uses handwritten grammar rules to tag a sentence with proper tags. It
can efficiently tag unknown words using the sentence structure.
Some handwritten rules are:
• A noun phrase consists of the following sequence of words:
(Determiner)(Adverb)*(Adjective)(Noun).
• A verb phrase consists of (Verb)(Adverb)*.
• Prepositions are followed by a noun.
• Helping verbs are followed by either a verb phrase or a noun phrase.
• If a pronoun is possessive, it is followed by a noun phrase; otherwise by a verb phrase.
• A noun is never followed by “to”.
Each of the above rules has exceptions, and rule-based tagging requires that some of
the tags be known in advance.
Therefore, rule-based tagging first guesses the tag for a word and then evaluates the rules to
check whether the guessed tag fits the sentence properly.
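The guess-and-evaluate cycle can be sketched as follows; the rule table and names here are simplified assumptions rather than the system's actual rule set:

```python
# Sketch of guess-and-evaluate: a guessed tag is kept only if the
# handwritten rule for the preceding tag allows it (e.g. "prepositions
# are followed by a noun", "to" is followed by a verb).
ALLOWED_AFTER = {
    "PREPOSITION": {"DETERMINER", "NOUN"},
    "TO": {"VERB"},
    "DETERMINER": {"ADVERB", "ADJECTIVE", "NOUN"},
}

def evaluate(prev_tag, guessed_tags):
    """Filter the guessed tags for a word against the rule for prev_tag."""
    allowed = ALLOWED_AFTER.get(prev_tag)
    if allowed is None:
        return guessed_tags          # no rule applies: keep all guesses
    kept = [t for t in guessed_tags if t in allowed]
    return kept or guessed_tags      # if every guess violates the rule, keep all

print(evaluate("TO", ["NOUN", "VERB"]))  # ['VERB']
```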
Advantages of Rule Based Tagging:
• It can effectively remove ambiguous tags.
• It can tag words that have never been encountered.
• It has the potential to tag almost any sentence.
Disadvantages of Rule Based Tagging:
• It is slower than dictionary tagging.
• It cannot tag ambiguous sentences such as “My car is _____”.
• It does not improve with data; its answer is always fixed.
2.3. Hybrid Approach
We combined both approaches and built a system that has the advantages of both the rule-based
and the dictionary-based approach.
Initially we tag the words from the database and then apply rule-based tagging on the semi-tagged
sentence to fill the remaining empty tags. We then check these tags for their syntactic validity and
remove ambiguous tags.
After performing the validity check, we store a newly tagged word in a temporary database; a word
moves to its respective tag table only if it occurs at least twice a week.
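The hybrid flow above, including the temporary database and the twice-a-week promotion, might look like this sketch (the word lists, fallback rule and counting scheme are all illustrative assumptions):

```python
# Sketch of the hybrid approach: dictionary tagging first, a rule-based
# fallback for the gaps, and a temporary store from which a word is
# promoted to the known dictionary only after it recurs.
KNOWN = {"my": "PRONOUN", "is": "HVERB"}
temp_db = {}  # word -> occurrence count in the temporary database

def hybrid_tag(sentence):
    tags = []
    for word in sentence.lower().split():
        tag = KNOWN.get(word)
        if tag is None:  # gap: apply a toy rule based on the previous tag
            tag = "NOUN" if tags and tags[-1] == "HVERB" else "ADJECTIVE"
            count = temp_db.get(word, 0) + 1
            temp_db[word] = count
            if count >= 2:          # promote after repeated sightings
                KNOWN[word] = tag
        tags.append(tag)
    return tags

hybrid_tag("my phone is black")
hybrid_tag("my phone is black")
print("phone" in KNOWN)  # True: "phone" recurred and was promoted
```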
2.3.1. Limiting the Size of the Vocabulary
To keep the size of the database as small as possible, the system removes words that are not used
frequently.
The system identifies such words by evaluating, for every word with (Today − Mentioned_day) >= 28,

Occurrence / (Today − Mentioned_day) <= 0.28

where
• Occurrence: number of times the word has occurred
• Today: today's date
• Mentioned_day: date on which the word occurred for the first time
If after 4 weeks the average occurrence is less than twice a week, the word is removed from the
database; otherwise it is stored in its respective tag table. The benefit of performing this operation is that the
system does not store words with wrong tags, names, spelling mistakes, etc.
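The pruning test above can be sketched directly (field names are illustrative); note that 0.28 occurrences per day is roughly twice a week (2/7 ≈ 0.286):

```python
# Sketch of the vocabulary-pruning rule: a word first seen at least
# 28 days ago is dropped when its average occurrence rate falls to
# 0.28 per day (about twice a week) or below.
from datetime import date

def should_remove(occurrence, mentioned_day, today):
    age_days = (today - mentioned_day).days
    if age_days < 28:
        return False                         # too new to judge
    return occurrence / age_days <= 0.28

# Seen 5 times in 30 days: 5/30 ≈ 0.17 <= 0.28, so it is pruned.
print(should_remove(5, date(2015, 6, 1), date(2015, 7, 1)))   # True
print(should_remove(12, date(2015, 6, 1), date(2015, 7, 1)))  # False
```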
Advantages:
• It can effectively remove ambiguous tags.
• It can tag words that have never been encountered.
• It has the potential to tag almost any sentence.
• It has less chance of error.
• It can also tag incorrect sentences.
• It can also tag sentences with ambiguous structure.
3 SYSTEM ARCHITECTURE AND DESIGN:
3.1 The Underlying Model
The system follows a sequential structure that assigns possible tags to the respective
words of a sentence. The tool is divided into the following main modules:
• User interface
• Lexical Analyzer
• POS Tag Generator
• POS Tag Evaluator
• Knowledge
3.1.1 User Interface Module:
This module provides a GUI (Graphical User Interface); apart from the GUI,
the system can also be operated through a command line interface. The various functions
provided by this module are:
Spell checker: checks for spelling errors in real time so that the user can avoid mistakes.
File chooser: allows selecting a file containing the text to be tagged.
Tag list display area: from this area the user can view the tags assigned to each word.
3.1.2 Lexical Analyzer:
The lexical analyzer divides a continuous string into an array of tokens, where each token is
a separate entity. It performs the following operations on the input string:
• Checks the validity of the string.
• Divides the string into individual words.
• Assigns a specific type to each word depending on its kind, such as ATOM,
MATH_LITERAL, COMM, etc.
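A minimal sketch of such a lexer, assuming the token types named above and a simple splitting rule of our own:

```python
# Sketch of the lexical analysis step: split a string into (type, value)
# tokens. The token type names (ATOM, MATH_LITERAL, COMM, PERIOD) follow
# the paper's examples; the splitting logic itself is an assumption.
import re

def lex(text):
    tokens = []
    for part in re.findall(r"[A-Za-z]+|\d+|[.,;!?]", text):
        if part.isalpha():
            tokens.append(("ATOM", part))          # a word
        elif part.isdigit():
            tokens.append(("MATH_LITERAL", part))  # a number
        elif part == ".":
            tokens.append(("PERIOD", part))        # sentence terminator
        else:
            tokens.append(("COMM", part))          # other punctuation
    return tokens

print(lex("i was born to fly."))
```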
3.1.3 POS Tag Generator:
This module is the key module in the tool. It decides the tags for a specific word; it does not
check their validity, it only assigns a proper tag depending on the word's local behavior. It
performs the following operations in sequence:
• Dictionary tagging: in this phase it fills in all the tokens that are already stored in its
knowledge, for example PRONOUN, HVERB, TO, CONJUNCTION, INTERJECTION.
• Rule-based tagging: in this step it fills in tags on the basis of the words' behavior in the
respective sentence. Rule-based tagging is done in two steps:
• Filling the noun phrase: in this step it fills in all the words that can be assigned noun
phrase tags: NOUN, ADJECTIVE, ADVERB.
• Filling the verb phrase: in this step it fills in all the words that can be assigned verb
phrase tags such as VERB, ADVERB.
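The phased filling can be sketched with candidate lists that mirror the [NULL, ...] notation used in the results section; the dictionary contents and fill rules here are illustrative assumptions:

```python
# Sketch of the generator's phases: a dictionary pass seeds each word's
# candidate list, then noun-phrase and verb-phrase passes add candidate
# tags for the words the dictionary did not cover.
DICT = {"i": ["PRONOUN"], "was": ["HVERB"], "to": ["TO", "ADVERB"]}

def generate(words):
    # Dictionary tagging phase: every word starts at [NULL].
    cands = {w: ["NULL"] + DICT.get(w, []) for w in words}
    for w in words:
        if len(cands[w]) == 1:                 # word unknown to the dictionary
            cands[w] += ["NOUN", "ADVERB"]     # noun-phrase filling
            cands[w] += ["VERB"]               # verb-phrase filling
    return cands

print(generate(["i", "was", "born", "to", "fly"]))
```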
3.1.4 POS Tag Evaluator:
This is the last step in the tagging process; it checks the complete tag list for any specific
contradiction, violation or ambiguity. It then performs the following steps on the tagged list:
• Validation: in this step the tags are checked for syntactic validity, and any particular tag
that violates it is removed.
• Ambiguity removal: in this step, if any word has more than one cardinality, the tags of the
respective word are checked and the one which does not fit the scenario is removed.
• Dictionary validation: after the above steps, if any tag still causes ambiguity, it is
removed if it is not in the knowledge of the system.
3.1.5 Knowledge:
This module defines a way to insert and retrieve data from the database. It performs the
following functions:
• Inserting data into the database
• Retrieving data from the database
• Deciding the correct part-of-speech tag to be inserted into the database
• Removing data which is not used frequently
3.2 System Design
Data Flow Diagram: (Level 0, Level 1 and Level 2 diagrams omitted)
ER Diagram: (figure omitted)
4 RESULTS
4.1 Example 1
String entered is
i was born to fly.
<!--------Lexer----------->
Lexemes generated
[<ATOM(i) [NULL]>, <ATOM(was) [NULL]>, <ATOM(born) [NULL]>, <ATOM(to)
[NULL]>, <ATOM(fly) [NULL]><PERIOD>]
<!--------Lexer----------->
<!-------POS Tag Generation-------->
Complete dictionary tagging
[<ATOM(i) [NULL, PRONOUN]><ATOM(was) [NULL, HVERB]><ATOM(born)
[NULL]><ATOM(to) [NULL, TO, ADVERB]><ATOM(fly) [NULL,
ADVERB]><PERIOD> ]
Filling Noun Phrase
[<ATOM(i) [NULL, PRONOUN]><ATOM(was) [NULL, HVERB]><ATOM(born)
[NULL, NOUN]><ATOM(to) [NULL, TO, ADVERB]><ATOM(fly) [NULL,
ADVERB]><PERIOD>]
Filling Verb Phrase
[<ATOM(i) [NULL, PRONOUN]><ATOM(was) [NULL, HVERB,
VERB]><ATOM(born) [NULL, NOUN, ADVERB, VERB]><ATOM(to) [NULL, TO,
ADVERB, VERB]><ATOM(fly) [NULL, ADVERB, VERB]><PERIOD> ]
Intermediate filling
[<ATOM(i) [NULL, PRONOUN, NOUN]>, <ATOM(was) [NULL, HVERB, VERB]>,
<ATOM(born) [NULL, NOUN, ADVERB, VERB]>, <ATOM(to) [NULL, TO,
ADVERB, VERB]>, <ATOM(fly) [NULL, ADVERB, VERB]><PERIOD >]
<!-------POS Tag Generation-------->
<!-------POS Tag Evaluation------->
After post processing
[<ATOM(i) [PRONOUN]>, <ATOM(was) [HVERB]>, <ATOM(born) [NOUN]>,
<ATOM(to) [TO]>, <ATOM(fly) [VERB]><PERIOD>]
END
<!-------POS Tag Evaluation------->
4.2 Example 2
String entered is
My Smart phone is black.
<!--------Lexer----------->
Lexemes generated
[<ATOM(My) [NULL]>, <ATOM(Smart) [NULL]>, <ATOM(phone) [NULL]>,
<ATOM(is) [NULL]>, <ATOM(black) [NULL]><PERIOD>]
<!--------Lexer----------->
<!-------POS Tag Generation-------->
Complete dictionary tagging
[<ATOM(My) [NULL, PRONOUN]><ATOM(Smart) [NULL]><ATOM(phone)
[NULL]><ATOM(is) [NULL, HVERB]><ATOM(black) [NULL,
ADJECTIVE]><PERIOD> ]
Filling Noun Phrase
[<ATOM(My) [NULL, PRONOUN]><ATOM(Smart) [NULL,
ADJECTIVE]><ATOM(phone) [NULL, NOUN]><ATOM(is) [NULL,
HVERB]><ATOM(black) [NULL, ADJECTIVE, NOUN]><PERIOD> ]
Filling Verb Phrase
[<ATOM(My) [NULL, PRONOUN]><ATOM(Smart) [NULL,
ADJECTIVE]><ATOM(phone) [NULL, NOUN]><ATOM(is) [NULL, HVERB,
VERB]><ATOM(black) [NULL, ADJECTIVE, NOUN, ADVERB, VERB]><PERIOD>]
Intermediate filling
[<ATOM(My) [NULL, PRONOUN, NOUN]>, <ATOM(Smart) [NULL, ADJECTIVE]>,
<ATOM(phone) [NULL, NOUN]>, <ATOM(is) [NULL, HVERB, VERB]>,
<ATOM(black) [NULL, ADJECTIVE, NOUN, ADVERB, VERB]><PERIOD>]
<!-------POS Tag Generation-------->
<!-------POS Tag Evaluation------->
After post processing
[<ATOM(My) [PRONOUN]>, <ATOM(Smart) [ADJECTIVE]>, <ATOM(phone)
[NOUN]>, <ATOM(is) [HVERB]>, <ATOM(black) [ADJECTIVE]><PERIOD>]
END
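The noun-phrase filling step in this example can be sketched with a single bracketing rule: between a possessive pronoun and the following head verb, the last untagged word is guessed as the NOUN and earlier untagged words as ADJECTIVEs. This one rule is an assumption for illustration, not the paper's actual rule set.

```python
# Sketch of the noun-phrase filling stage for Example 2.  The single
# PRONOUN ... HVERB bracketing rule is an illustrative assumption.

def fill_noun_phrase(tagged):
    """Between a PRONOUN and the next HVERB, guess the last untagged word
    as the head NOUN and earlier untagged words as ADJECTIVEs."""
    start = next((i for i, (_, t) in enumerate(tagged) if "PRONOUN" in t), None)
    end = next((i for i, (_, t) in enumerate(tagged) if "HVERB" in t), None)
    if start is None or end is None or end - start < 2:
        return tagged  # no noun phrase bracketed by this rule
    inside = list(range(start + 1, end))
    for i in inside[:-1]:
        if tagged[i][1] == ["NULL"]:
            tagged[i][1].append("ADJECTIVE")
    if tagged[inside[-1]][1] == ["NULL"]:
        tagged[inside[-1]][1].append("NOUN")
    return tagged

sentence = [["My", ["NULL", "PRONOUN"]], ["Smart", ["NULL"]],
            ["phone", ["NULL"]], ["is", ["NULL", "HVERB"]],
            ["black", ["NULL", "ADJECTIVE"]]]
print(fill_noun_phrase(sentence))
```

On this input the rule adds ADJECTIVE to "Smart" and NOUN to "phone", matching the "Filling Noun Phrase" line of the trace.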
5 CONCLUSION
We have designed a tool that assigns POS tags to sentences efficiently. The system tackles several of the problems commonly faced by POS taggers and can resolve ambiguous tags. It uses a combination of the rule-based and dictionary-based approaches, drawing on the strengths of both while avoiding the weaknesses of either. The system also improves its knowledge from the tags it generates, and thus becomes more stable and accurate over time.
At present the system can only improve its vocabulary. It could be made more powerful if it also learned rules from the tagged sentences.
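The vocabulary self-learning described above, where tags the system generates are fed back into its dictionary, can be sketched as follows. The frequency-counting scheme is an assumed detail, not the paper's exact mechanism.

```python
# Sketch of dictionary self-learning: tags produced by the tagger are fed
# back so previously unknown words are known on later runs.  The counting
# scheme is an assumed detail, not the paper's exact mechanism.
from collections import defaultdict

class LearningDictionary:
    def __init__(self, seed):
        # word -> tag -> number of times this tag was confirmed
        self.counts = defaultdict(lambda: defaultdict(int))
        for word, tags in seed.items():
            for tag in tags:
                self.counts[word][tag] += 1

    def lookup(self, word):
        """Return known tags, most frequently confirmed first."""
        tags = self.counts.get(word, {})
        return sorted(tags, key=tags.get, reverse=True)

    def learn(self, tagged_sentence):
        """Reinforce the dictionary with tags the system itself produced."""
        for word, tag in tagged_sentence:
            self.counts[word.lower()][tag] += 1

d = LearningDictionary({"was": ["HVERB"]})
d.learn([("i", "PRONOUN"), ("was", "HVERB"), ("born", "NOUN")])
print(d.lookup("born"))  # "born" is now known: ['NOUN']
```

Because every confirmed tag only increments a counter, occasional mis-tags are outvoted over time, which is one way the system could "become more stable and accurate" as the conclusion describes.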