Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser that automatically identifies and then classifies the arguments of a verb in a natural language sentence, here for Hindi and Urdu. For example, in the sentence "Sara won the competition because of her hard work.", 'won' is the main verb and it has three arguments: 'Sara' (Agent), 'hard work' (Reason) and 'competition' (Theme). The problem statement of an SRL is therefore: how do you make a machine identify and then classify the arguments of a verb in a natural language sentence?
Since there are two sub-problems here (identification and classification), our SRL has a pipeline architecture: a binary classifier (Logistic Regression) is first trained to decide whether a word is an argument of a verb in the sentence (yes or no), and a multi-class classifier (SVM with a linear kernel) is then trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause). These 'notions' are called PropBank labels, or semantic labels, as found in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which is essentially the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
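The two-stage pipeline above can be sketched as follows. This is a minimal illustration with random placeholder features and label indices; a real system would extract lexical and syntactic features per word and map indices to the actual PropBank labels.

```python
# Sketch of the two-stage SRL pipeline: identification then classification.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: binary identification -- is this word an argument of the verb?
X_words = rng.normal(size=(200, 10))      # placeholder per-word features
y_is_arg = rng.integers(0, 2, size=200)   # 1 = argument, 0 = not
identifier = LogisticRegression(max_iter=1000).fit(X_words, y_is_arg)

# Stage 2: classify identified arguments into one of the 20 semantic
# classes (Agent, Theme, Location, ...), here just indices 0..19.
X_args = rng.normal(size=(100, 10))
y_label = rng.integers(0, 20, size=100)
classifier = LinearSVC().fit(X_args, y_label)

def label_arguments(X):
    """Run the pipeline: classify only words flagged as arguments."""
    is_arg = identifier.predict(X).astype(bool)
    labels = np.full(len(X), -1)          # -1 = not an argument
    if is_arg.any():
        labels[is_arg] = classifier.predict(X[is_arg])
    return labels

print(label_arguments(X_words[:5]))
```

The pipeline design means stage-2 errors can only occur on words that stage 1 already accepted, which is why identification quality bounds overall labeling quality.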
HANDLING UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING TRANSLITERATION (ijnlc)
The document discusses handling unknown words in named entity recognition using transliteration. It proposes an approach where named entities in training data are transliterated into other languages and stored in transliteration files. During testing, if an unknown entity is encountered, it is checked against the transliteration files and assigned the corresponding tag if found. The approach is shown to achieve 95.8% recall, 96.3% precision and 96.04% F-measure on a multilingual named entity recognition task handling words from English, Hindi, Marathi, Punjabi and Urdu. Performance metrics for named entity recognition systems such as precision, recall and F-measure are also discussed.
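The lookup step described above can be sketched as follows. The transliteration table, the sample words, and the fallback tag "O" are invented for illustration; the paper builds its transliteration files from the training data.

```python
# Minimal sketch of unknown-word handling via a transliteration table.
translit_table = {
    # surface form seen at test time -> (known transliteration, NE tag)
    "dilli":  ("Delhi", "LOCATION"),
    "bharat": ("India", "LOCATION"),
    "sachin": ("Sachin", "PERSON"),
}

def tag_unknown(word, known_tags):
    """Tag a word: use the trained model's tag if the word is known,
    otherwise fall back to the transliteration file."""
    if word in known_tags:                    # covered by training data
        return known_tags[word]
    entry = translit_table.get(word.lower())  # unknown: try transliterations
    return entry[1] if entry else "O"         # "O" = not a named entity

known = {"Mumbai": "LOCATION"}
print(tag_unknown("Mumbai", known))   # LOCATION (from the model)
print(tag_unknown("dilli", known))    # LOCATION (via transliteration)
print(tag_unknown("xyz", known))      # O
```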
NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION (ijnlc)
Transliteration may be defined as the process of mapping the sounds of a text written in one language into another language. This paper discusses transliteration and its use in Named Entity Recognition (NER). We have designed code that performs transliteration and assists in the process of Named Entity Recognition, and we present some results of NER using transliteration.
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... (ijnlc)
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task in Natural Language Processing because of its semantic nature; it is used in applications such as machine translation, information extraction and retrieval, and open- or closed-domain question answering systems. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been built, but largely in vain, as creating the training corpus, a tagged, sense-marked corpus, is tricky. This paper presents a hybrid approach to resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
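The knowledge-based side of such a hybrid approach can be illustrated with a simplified Lesk-style gloss-overlap disambiguator. The tiny sense inventory below is invented for illustration; the paper uses the Princeton WordNet via JAWS.

```python
# Simplified Lesk: pick the sense whose dictionary gloss shares the most
# words with the sentence context.
senses = {
    "bank": {
        "bank.n.01": "sloping land beside a body of water river",
        "bank.n.02": "financial institution that accepts deposits money",
    }
}

def lesk(word, context):
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(ctx & set(gloss.split()))   # count shared words
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(lesk("bank", "I deposited money at the bank"))    # bank.n.02
print(lesk("bank", "we sat on the bank of the river"))  # bank.n.01
```

Supervised components would then override this knowledge-based guess where sense-tagged training data (e.g. SemCor) is available.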
Handling ambiguities and unknown words in named entity recognition using anap... (ijcsa)
Anaphora Resolution is a method of determining what a particular noun phrase or pronoun at a given instance refers to. In this paper, we discuss how Anaphora Resolution is useful in performing computational linguistic tasks in various natural languages, including the Indian languages. We also discuss how Anaphora Resolution helps in handling unknown words in Named Entity Recognition.
This document describes a system for named entity recognition in South and Southeast Asian languages that uses conditional random fields for machine learning followed by rule-based post-processing. The system was tested on Bengali, Hindi, Oriya, Telugu, and Urdu. It uses window-based features and prefixes/suffixes to handle agglutinative properties. Post-processing improves recall by considering secondary tags and handles nested entities. Evaluation shows F-measures from 39-51% depending on the language and entity type. The system achieves decent performance without extensive language-specific resources.
IRJET - Short-Text Semantic Similarity using GloVe Word Embedding (IRJET Journal)
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
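The encode-then-compare step described above can be sketched with averaged word vectors and cosine similarity. The 4-dimensional vectors are toy stand-ins; a real system would load pretrained GloVe vectors (e.g. glove.6B.300d).

```python
# Short-text similarity: mean of word vectors + cosine similarity.
import numpy as np

vectors = {
    "cat":  np.array([0.9, 0.1, 0.0, 0.2]),
    "dog":  np.array([0.8, 0.2, 0.1, 0.2]),
    "car":  np.array([0.0, 0.9, 0.8, 0.1]),
    "road": np.array([0.1, 0.8, 0.9, 0.0]),
}

def encode(text):
    """Encode a short text as the mean of its word vectors."""
    vecs = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = cosine(encode("cat dog"), encode("dog cat"))   # same words: 1.0
s2 = cosine(encode("cat dog"), encode("car road"))  # unrelated: lower
print(s1, s2)
```

Averaging ignores word order, which is one reason such comparisons are usually restricted to short texts.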
Dependency Analysis of Abstract Universal Structures in Korean and English (Jinho Choi)
This thesis gives two contributions in the form of lexical resources to (1) dependency parsing in Korean and (2) semantic parsing in English. First, we describe our methodology for building three dependency treebanks in Korean derived from existing treebanks and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD). The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated. To the best of our knowledge, this is the first time that the three Korean treebanks are converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an on-going project for augmenting the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments in the Proposition Bank (PropBank), and then does a transfer learning of the trained dependency parser for the AMR parsing task. A preliminary step is to prepare dependency data by performing an automatic replacement of dependencies that define predicate argument structure with their corresponding PropBank argument labels during constituent-to-dependency conversion.
To the best of our knowledge, this is the first time that the PropBank labels are directly inserted into dependency structure, producing a new dependency corpus with rich syntactic information as well as semantic role information provided by PropBank that fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, broadly, semantic parsing.
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
Automatic classification of bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to classify Bengali sentences automatically into different groups according to their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis we have used the Naive Bayes probabilistic model as a classifier of sentences. We applied the algorithm to 1747 sentences containing a particular Bengali lexical item which, because of its ambiguous nature, can trigger different senses that give the sentences different meanings. In our experiment we achieved around 84% accuracy in sense classification over the total input sentences. We analyzed the residual sentences that did not comply with our experiment and affected the results, and noted that in many cases wrong syntactic structures and scant semantic information are the main hurdles in the semantic classification of sentences. The applicational relevance of this study is attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
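The Naive Bayes setup described above can be sketched as follows. The toy English sentences and the two sense groups ("river" vs. "finance") are invented for illustration; the paper groups Bengali sentences by WordNet sense.

```python
# Naive Bayes sense classification: sentences containing one ambiguous
# word, grouped by the sense that word carries in each sentence.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "he sat on the river bank watching the water",
    "the bank of the stream was muddy",
    "she opened an account at the bank",
    "the bank approved the loan and the deposit",
]
train_senses = ["river", "river", "finance", "finance"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_sentences, train_senses)

print(model.predict(["the water near the bank was cold"]))    # ['river']
print(model.predict(["the bank charged a fee on the loan"]))  # ['finance']
```

As the abstract notes, sentences with broken syntax or very little context leave few informative co-occurring words, which is exactly where this bag-of-words model fails.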
Clustering Algorithm for Gujarati Language (ijsrd.com)
Natural language processing is still an area under active research, and it is now a platform for researchers worldwide. Natural language processing includes analyzing a language based on its structure and then tagging each word appropriately with its grammatical base. Here we have a set of 50,000 tagged words, and we try to cluster those Gujarati words using an algorithm of our own that we have defined for this processing. Many clustering techniques are available, e.g. single linkage, complete linkage, and average linkage. Here the number of clusters to be formed is not known in advance, so it all depends on the type of data set provided. Clustering is a preprocessing step for stemming, the process by which a root is extracted from a word, e.g. cats = cat + s (cat: noun, plural form).
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names, and miscellaneous (dates and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports on the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of the different contextual information of the words, along with a variety of features that are helpful in predicting the four different named entity (NE) classes: person name, location name, organization name, and miscellaneous (dates and others). Keywords: named entity, Conditional Random Field, NE, CRF, NER, named entity recognition
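The contextual features fed to such a CRF tagger can be sketched as a feature-extraction function. The feature names and the window size of 2 are illustrative choices, not the paper's exact feature set.

```python
# Window-based contextual features for a word at position i, of the kind
# typically passed to a CRF sequence tagger.
def word_features(tokens, i, window=2):
    feats = {
        "word": tokens[i].lower(),
        "prefix3": tokens[i][:3],   # prefixes/suffixes help with
        "suffix3": tokens[i][-3:],  # agglutinative morphology
        "is_digit": tokens[i].isdigit(),
    }
    for off in range(-window, window + 1):
        if off != 0 and 0 <= i + off < len(tokens):
            feats[f"word{off:+d}"] = tokens[i + off].lower()
    return feats

sent = ["Sachin", "was", "born", "in", "Mumbai"]
print(word_features(sent, 4))
```

A CRF then scores whole tag sequences over these per-position features, rather than tagging each word independently.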
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as an important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words, and proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of extracting or separating the syllables of a word. We calculate probabilities for named entities (proper names and locations). For words that do not fall under the category of named entities, separate probabilities are calculated using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
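A toy syllabification rule in the spirit of the approach above can be sketched with a single pattern: each syllable is an optional consonant onset plus a vowel nucleus, with trailing consonants attached to the last syllable. This is a simplification for illustration; the paper's hand-built rules are richer.

```python
# Naive CV syllabification via regular expressions.
import re

def syllabify(word):
    word = word.lower()
    # each match: consonants (possibly none) followed by a vowel run
    parts = re.findall(r"[^aeiou]*[aeiou]+", word)
    if not parts:
        return [word]
    # attach any consonants after the last vowel to the final syllable
    rest = re.sub(r".*[aeiou]", "", word)
    parts[-1] += rest
    return parts

print(syllabify("punjabi"))   # ['pu', 'nja', 'bi']
print(syllabify("london"))    # ['lo', 'ndon']
```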
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. Part-of-speech tagging means assigning a part-of-speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate part-of-speech tag to each word of that sentence. In this article I survey the different work that has been done on Odia POS tagging.
________________________________________________
Suitability of naïve bayesian methods for paragraph level text classification... (ijaia)
This document discusses using Naive Bayesian methods for paragraph-level text classification in the Kannada language. It evaluates the performance of the Naive Bayesian and Naive Bayesian Multinomial models on a corpus of 1791 paragraphs from four categories (Commerce, Social Sciences, Natural Sciences, Aesthetics). Dimensionality reduction techniques like removing stop words and words with low term frequency are applied before classification. The results show that the Naive Bayesian Multinomial model outperforms the simple Naive Bayesian approach for paragraph classification in Kannada.
Comparative performance analysis of two anaphora resolution systems (ijfcstjournal)
Anaphora Resolution is the process of finding referents in a given discourse, and it is one of the complex tasks of linguistics. This paper presents a performance analysis of two computational models that use the Gazetteer method for resolving anaphora in the Hindi language. In the Gazetteer method, different classes (gazettes) of elements are created, and these gazettes provide external knowledge to the system. The two models use recency and an animistic factor for resolving anaphors. For the recency factor, the first model uses the concept of the centering approach and the second uses the concept of the Lappin-Leass approach; gazetteers are used to provide animistic knowledge. This paper presents experimental results for both models. The experiments are conducted on short Hindi stories, news articles, and biography content from Wikipedia. The respective accuracies of the two models are analyzed, and a conclusion is finally drawn as to the best-suited model for the Hindi language.
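The recency-plus-animacy idea above can be sketched as follows: pick the most recent preceding noun whose animacy (from a gazette) matches the pronoun. The gazette entries, pronoun list, and sentences are toy examples, not the papers' actual resources.

```python
# Recency + animistic-knowledge anaphora resolution, in miniature.
animate_gazette = {"ram", "sita", "teacher", "boy"}

def resolve(pronoun, preceding_nouns):
    """Return the most recent antecedent with matching animacy."""
    wants_animate = pronoun.lower() in {"he", "she", "him", "her"}
    for noun in reversed(preceding_nouns):        # recency: scan backwards
        if (noun.lower() in animate_gazette) == wants_animate:
            return noun
    return None

# "Ram went to the market. He bought a book."
print(resolve("He", ["Ram", "market"]))   # Ram (market is inanimate)
print(resolve("It", ["Ram", "market"]))   # market
```

The centering and Lappin-Leass models compared in the paper refine the "scan backwards" step with salience weights rather than a flat recency order.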
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences that are hand-annotated with information about the semantic labels in those sentences. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with semantic-label information.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, and dialogue systems.
In this paper, we present one such resource for a resource-poor Indian language, Urdu. The Proposition Bank of Urdu is built on top of the existing Urdu Treebank (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency-label information). A PropBank adds a layer of semantic information over this Treebank and hence can facilitate semantic parsing and other semantic-level operations on a natural language sentence.
This paper proposes using additional context features from neighboring semantic arguments to improve the accuracy of automatic semantic argument classification. The researchers augment baseline features, which consider each argument independently, with semantic context features that capture the interdependence between arguments of the same predicate. Their experimental results show significant improvement in classification accuracy on the PropBank test set when exploiting argument interdependence through additional context features. Classification accuracy increases to 90.50%, an 18% relative error reduction over the baseline.
This paper describes an attempt to build an automatic system for semantic role labeling (SRL) of nominals (nouns) based on NomBank annotations. The authors treat SRL as a classification problem and explore adapting features from existing PropBank SRL systems for the NomBank task. Their best system achieves an F1 score of 72.73% on test data when using gold syntactic parses, representing the first reported automatic NomBank SRL system.
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Part-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually these tags denote syntactic classes like noun or verb, and sometimes they include additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a part-of-speech tagger for pre-processing.
There are two main approaches to part-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It uses a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the one widely used nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they are not built for Malayalam, which affects the accuracy of Malayalam POS tagging results.
My proposed approach uses dictionary entries along with adjacent-tag information, implemented with multithreading. Here tagging is done using the probability of the occurrence of the sentence structure along with the dictionary entry.
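The dictionary-plus-context idea above can be sketched as follows: a lexicon gives each word's candidate tags, and the previous tag picks among them via transition scores. The lexicon entries and probabilities are invented for illustration, not the paper's resources.

```python
# Dictionary lookup + adjacent-tag disambiguation, in miniature.
lexicon = {
    "the": ["DET"],
    "can": ["AUX", "NOUN"],   # ambiguous entry
    "fish": ["NOUN", "VERB"],
}
# illustrative P(tag | previous tag) values; "<s>" marks sentence start
transition = {
    ("<s>", "DET"): 0.9,
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.1, ("DET", "AUX"): 0.1,
    ("AUX", "VERB"): 0.6, ("NOUN", "AUX"): 0.5, ("NOUN", "VERB"): 0.3,
}

def tag(sentence):
    prev, tags = "<s>", []
    for word in sentence.lower().split():
        candidates = lexicon.get(word, ["NOUN"])  # unknown words -> NOUN
        best = max(candidates,
                   key=lambda t: transition.get((prev, t), 0.01))
        tags.append(best)
        prev = best
    return tags

print(tag("the can"))   # ['DET', 'NOUN']
```

A full stochastic tagger (e.g. Viterbi over an HMM) would score whole tag sequences instead of committing greedily at each word.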
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Abstract: The usage of regular expressions to search text is well known and understood as a useful technique. Regular expressions are generic representations of a string or a collection of strings, and they are one of the most useful tools in computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the usage of regular expressions, illustrated with examples from natural language processing, along with a discussion of different approaches to regular expressions in NLP. Keywords: Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
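The tokenization use of regular expressions mentioned above can be shown with Python's re module: one alternation pattern splits text into number, word, and punctuation tokens (the pattern here is a common textbook sketch, not the paper's).

```python
# Regex tokenization: numbers (with optional decimals), words, punctuation.
import re

# alternatives are tried left to right, so "12.50" is kept whole
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Dr. Smith paid $12.50, didn't he?"))
# ['Dr', '.', 'Smith', 'paid', '$', '12.50', ',', 'didn', "'", 't', 'he', '?']
```

Note the ordered alternation: putting `\w+` first would split "12.50" at the decimal point.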
----------------------------
Anaphora Resolution is the process of finding referents in discourse; in computational linguistics it is a complex and challenging task. This paper focuses on pronominal anaphora resolution, a subpart of anaphora resolution in which pronouns are linked to noun referents. Incorporating anaphora resolution into applications such as automatic summarization, opinion mining, machine translation, and question answering systems can increase their accuracy by 10%. Related work in this field has been done in many languages; this paper focuses on resolving anaphora for the Punjabi language. A model is proposed for resolving anaphora, and an experiment is conducted to measure the accuracy of the system. The model uses two factors: recency and animistic knowledge. The recency factor works on the concept of the Lappin-Leass approach, and the gazetteer method is used to introduce animistic knowledge. The experiment is conducted on a Punjabi story containing more than 1000 words, and results are drawn along with future directions.
A survey of named entity recognition in Assamese and other Indian languages (ijnlc)
Named Entity Recognition is always important when dealing with major Natural Language Processing tasks such as information extraction, question answering, machine translation, and document summarization, so in this paper we put forward a survey of named entities in Indian languages with particular reference to Assamese. Various rule-based and machine learning approaches are available for Named Entity Recognition. At the start of the paper we give an idea of the available approaches for Named Entity Recognition, and then we discuss related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity Recognition requires large data sets, gazetteer lists, dictionaries, etc., and some useful features found in English, such as capitalization, do not exist in Assamese. We also describe some of the issues faced in Assamese while doing Named Entity Recognition.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
The document discusses parts-of-speech (POS) tagging. It defines POS tagging as labeling each word in a sentence with its appropriate part of speech. It provides an example tagged sentence and discusses the challenges of POS tagging, including ambiguity and open/closed word classes. It also discusses common tag sets and stochastic POS tagging using hidden Markov models.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short and important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been made for the Hindi language. Here we discuss various techniques covering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
This document discusses the Interpreter and Iterator patterns from the Gang of Four (GoF) patterns. The Interpreter pattern represents grammars and expressions as object structures to define a language and interpret sentences in that language. The Iterator pattern allows iterating over the elements of an object sequentially without exposing its underlying representation. Both patterns have limitations, and the document discusses when and how they could be improved.
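The Iterator pattern described above can be shown in a short sketch: the collection exposes sequential traversal without revealing that it is backed by a dict internally. The class and its contents are invented for illustration.

```python
# Iterator pattern: clients traverse the collection without knowing its
# underlying representation.
class NameRegistry:
    def __init__(self):
        self._entries = {}          # hidden representation

    def add(self, key, name):
        self._entries[key] = name

    def __iter__(self):             # Python's iterator protocol
        return iter(self._entries.values())

reg = NameRegistry()
reg.add(1, "Interpreter")
reg.add(2, "Iterator")
print([name for name in reg])   # ['Interpreter', 'Iterator']
```

If the registry later switched to a list or a tree internally, client code iterating over it would be unaffected, which is the point of the pattern.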
1) This document discusses stemming algorithms that have been used for the Odia language. Stemming is the process of reducing inflected words to their root or stem for purposes like information retrieval.
2) It reviews different stemming algorithms that have been applied to Odia text, including suffix stripping, affix removal, and stochastic algorithms. It also discusses common errors in stemming like over-stemming and under-stemming.
3) Applications of stemming discussed include information retrieval, text summarization, machine translation, indexing, and question answering systems. The document concludes by surveying prior work on stemming algorithms for Odia.
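The suffix-stripping idea surveyed above can be sketched as follows, shown on English for readability; an Odia stemmer would use an Odia suffix list. The minimum-stem-length check is a common guard against over-stemming.

```python
# Minimal suffix-stripping stemmer: remove the longest matching suffix,
# but never leave a stem shorter than 3 characters.
SUFFIXES = ["ings", "ing", "ed", "es", "s"]   # longest-first

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]          # strip the first match
    return word

print(stem("walking"), stem("dogs"), stem("red"))   # walk dog red
```

Without the length guard, "red" would be over-stemmed to "r" by the "ed" rule, illustrating the over-stemming error class the survey discusses.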
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural Language Processing (NLP) techniques are among the most used techniques in the field of computer applications, and the field has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was measured against the Turing test, though early systems were limited to small data sets. Later, various algorithms were developed along with concepts from AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques across different parameters.
Paraphrasing refers to writing that differs in its textual content or in the arrangement of words
but conveys the same meaning. Identifying a paraphrase is exceptionally important in various real-life applications
such as Information Retrieval, Plagiarism Detection, Text Summarization and Question Answering. A large amount
of work in Paraphrase Detection has been done in English and many Indian Languages. However, there is no
existing system to identify paraphrases in Marathi. This is the first such endeavor in the Marathi Language.
A paraphrase has differently structured sentences, and since Marathi is a semantically strong language, this
system is designed for checking both statistical and semantic similarities of Marathi sentences. Statistical similarity
measure does not need any prior knowledge as it is only based on the factual data of sentences. The factual data
is calculated on the basis of the degree of closeness between the word-set, word-order, word-vector and word-distance. Universal Networking Language (UNL) captures the semantic content of a sentence without any
syntactic detail. Hence, the semantic similarity calculated on the basis of generated UNL graphs for two
Marathi sentences renders semantic equality of two Marathi sentences. The total paraphrase score was calculated
after joining statistical and semantic similarity scores, which gives a judgment on whether there is paraphrase or
non-paraphrase about the Marathi sentences in question.
Automatic classification of Bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definition of words available in the Bengali WordNet, an attempt is made to classify the
Bengali sentences automatically into different groups in accordance with their underlying senses. The input
sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL
project of the Govt. of India, while information about the different senses of particular ambiguous lexical
item is collected from Bengali WordNet. On an experimental basis, we have used the Naive Bayes probabilistic
model as a classifier of sentences. We applied the algorithm to 1747 sentences that contain a
particular Bengali lexical item which, because of its ambiguous nature, can trigger different senses
that render the sentences different in meaning. In our experiment we achieved around 84% accuracy
in sense classification over the total input sentences. We analyzed the residual sentences
that did not comply with our experiment and affected the results, noting that in many cases wrong
syntactic structures and sparse semantic information are the main hurdles in the semantic classification of
sentences. The applicational relevance of this study is attested in automatic text classification, machine
learning, information extraction, and word sense disambiguation.
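As a rough illustration of the classification step, here is a minimal multinomial Naive Bayes sense classifier in Python. The toy English sentences and sense labels are invented stand-ins for the Bengali corpus data, not the paper's material:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_sentences):
    """Train a multinomial Naive Bayes sense classifier.

    labeled_sentences: list of (token_list, sense_label) pairs.
    Returns class priors, per-sense word counts, and the vocabulary."""
    priors = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, sense in labeled_sentences:
        priors[sense] += 1
        word_counts[sense].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum log P(word | sense),
    with add-one (Laplace) smoothing."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for sense in priors:
        n = sum(word_counts[sense].values())
        score = math.log(priors[sense] / total)
        for w in tokens:
            score += math.log((word_counts[sense][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy data for the ambiguous word "bank" (two senses).
data = [
    ("deposit money in the bank".split(), "finance"),
    ("the bank approved the loan".split(), "finance"),
    ("we sat on the river bank".split(), "river"),
    ("the bank of the river flooded".split(), "river"),
]
model = train_nb(data)
print(classify("loan from the bank".split(), *model))  # → finance
```

The same scheme scales to the paper's setting by replacing the toy data with sentences drawn from the corpus, one class per WordNet sense of the ambiguous item.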
Clustering Algorithm for Gujarati Language (ijsrd.com)
The natural language processing area is still under research, but nowadays it is a platform for worldwide researchers. Natural language processing includes analyzing a language based on its structure and then tagging each word appropriately with its grammatical base. Here we have a set of 50,000 tagged words, and we try to cluster those Gujarati words using an algorithm we have defined ourselves. Many clustering techniques are available, e.g. single linkage, complete linkage and average linkage. Here the number of clusters to be formed is not known in advance, so it all depends on the type of data set provided. Clustering is a preprocessing step for stemming. Stemming is the process whereby a root is extracted from a word, e.g. cats = cat + s, meaning cat (noun) in plural form.
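As an illustration of linkage-based clustering as a pre-step to stemming, here is a minimal single-linkage sketch over invented English word forms. The prefix-based distance function and the merge threshold are assumptions for the example, not the paper's own algorithm:

```python
def prefix_distance(a, b):
    """Distance based on the longest common prefix: 0 for identical
    words, 1 for no shared prefix."""
    lcp = 0
    for x, y in zip(a, b):
        if x != y:
            break
        lcp += 1
    return 1.0 - lcp / max(len(a), len(b))

def single_linkage(words, threshold=0.5):
    """Agglomerative single-linkage clustering: repeatedly merge the two
    closest clusters until the closest pair is farther than `threshold`.
    The number of clusters is not fixed in advance, as in the paper."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between closest members.
                d = min(prefix_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

words = ["cat", "cats", "dog", "dogs", "doggy"]
print(single_linkage(words))
```

With this toy data the word forms group by shared stem ({cat, cats} and {dog, dogs, doggy}), which is exactly the grouping a stemmer would then operate on.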
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names and miscellaneous (dates and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval etc. This paper reports the development of a NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Telugu is very new. The system makes use of the contextual information of the words along with a variety of features that are helpful in predicting the four named entity (NE) classes: person name, location name, organization name and miscellaneous (dates and others). Keywords: Named entity, Conditional Random Field, NE, CRF, NER, named entity recognition
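The contextual features such a CRF system consumes can be sketched as a feature-extraction function. The specific feature set and the example sentence below are illustrative assumptions, not the paper's actual features:

```python
def word_features(sentence, i):
    """Contextual feature dict for token i, of the kind typically fed
    to a CRF toolkit: the word itself, affixes, a digit flag, and the
    neighbouring words."""
    w = sentence[i]
    return {
        "word": w,
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "is_digit": w.isdigit(),
        "prev_word": sentence[i - 1] if i > 0 else "<S>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "</S>",
    }

# Hypothetical (English-transliterated) example sentence.
sent = ["Ravi", "visited", "Hyderabad", "on", "5", "May"]
print(word_features(sent, 2))
```

A CRF then learns tag sequences (e.g. person/location/organization/miscellaneous) over these per-token feature dicts, exploiting the dependencies between adjacent tags.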
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as an important research area in the field of machine translation.
Transliteration basically aims to preserve the phonological structure of words, and proper
transliteration of named entities plays a significant role in improving the quality of machine translation.
In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based
approach. We have constructed rules for syllabification, the process of extracting or
separating syllables from words. We calculate probabilities for named entities (proper
names and locations). For words that do not fall under the category of named entities, separate
probabilities are calculated by relative frequency using the statistical machine translation
toolkit MOSES. Using these probabilities we transliterate the input text from English to
Punjabi.
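The relative-frequency estimate that a toolkit like MOSES derives from aligned data can be illustrated in a few lines. The aligned syllable pairs below (romanized source, Gurmukhi target) are invented for the example:

```python
from collections import Counter, defaultdict

def relative_frequency(pairs):
    """Estimate P(target | source) by relative frequency from aligned
    (source_syllable, target_syllable) pairs, as a statistical toolkit
    would from parallel training data."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    probs = {}
    for src, tgts in counts.items():
        total = sum(tgts.values())
        probs[src] = {t: c / total for t, c in tgts.items()}
    return probs

# Invented aligned syllables: "jo" is seen twice as ਜੋ, once as ਜ.
pairs = [("jo", "ਜੋ"), ("jo", "ਜੋ"), ("jo", "ਜ"), ("ti", "ਤੀ")]
p = relative_frequency(pairs)
print(p["jo"])  # P('ਜੋ'|'jo') ≈ 2/3, P('ਜ'|'jo') ≈ 1/3
```

Transliteration then amounts to syllabifying the input and, for each syllable, emitting the highest-probability target candidate.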
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. Part-of-speech tagging means assigning a part-of-speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate tag to each word. In this article I survey the different works done on Odia POS tagging.
________________________________________________
Suitability of naïve Bayesian methods for paragraph level text classification... (ijaia)
This document discusses using Naive Bayesian methods for paragraph-level text classification in the Kannada language. It evaluates the performance of the Naive Bayesian and Naive Bayesian Multinomial models on a corpus of 1791 paragraphs from four categories (Commerce, Social Sciences, Natural Sciences, Aesthetics). Dimensionality reduction techniques like removing stop words and words with low term frequency are applied before classification. The results show that the Naive Bayesian Multinomial model outperforms the simple Naive Bayesian approach for paragraph classification in Kannada.
Comparative performance analysis of two anaphora resolution systems (ijfcstjournal)
Anaphora resolution is the process of finding referents in a given discourse and is one of the
complex tasks of linguistics. This paper presents a performance analysis of two computational
models that use the Gazetteer method for resolving anaphora in the Hindi language. In the Gazetteer method,
different classes (gazettes) of elements are created; these gazettes provide external knowledge
to the system. The two models use Recency and Animistic factors for resolving anaphors. For the Recency factor,
the first model uses the concept of the centering approach and the second uses the Lappin-Leass approach.
Gazetteers are used to provide animistic knowledge. This paper presents the experimental results of both
models. The experiments are conducted on short Hindi stories, news articles and biography content
from Wikipedia. The respective accuracy of each model is analyzed, and finally a conclusion is drawn
on the model best suited for the Hindi language.
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences which are hand-annotated with the information of semantic labels in the respective sentences. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with the semantic labels information.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems etc.
In this paper, we present one such resource for Urdu, a resource-poor Indian language. The Urdu Proposition Bank is built on the existing Urdu Treebank (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency-label information). A Propbank adds a layer of semantic information over this Treebank and hence can facilitate semantic parsing and other semantic-level operations on natural language sentences.
This paper proposes using additional context features from neighboring semantic arguments to improve the accuracy of automatic semantic argument classification. The researchers augment baseline features, which consider each argument independently, with semantic context features that capture the interdependence between arguments of the same predicate. Their experimental results show significant improvement in classification accuracy on the PropBank test set when exploiting argument interdependence through additional context features. Classification accuracy increases to 90.50%, an 18% relative error reduction over the baseline.
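The reported numbers can be cross-checked: given the 90.50% final accuracy and the 18% relative error reduction, the implied baseline accuracy follows directly from the definition of relative error reduction.

```python
def relative_error_reduction(acc_base, acc_new):
    """Fraction of the baseline error removed by the new system."""
    err_base, err_new = 1 - acc_base, 1 - acc_new
    return (err_base - err_new) / err_base

# The summary gives the new accuracy (90.50%) and an 18% relative error
# reduction; the baseline error implied by those two figures is
# 9.5% / (1 - 0.18), i.e. a baseline accuracy of roughly 88.4%.
implied_base_err = (1 - 0.9050) / (1 - 0.18)
print(f"implied baseline accuracy: {1 - implied_base_err:.2%}")
print(f"check: {relative_error_reduction(1 - implied_base_err, 0.9050):.0%}")
```

The baseline accuracy is not stated in the summary above; this derivation only shows that the two reported figures are mutually consistent.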
This paper describes an attempt to build an automatic system for semantic role labeling (SRL) of nominals (nouns) based on NomBank annotations. The authors treat SRL as a classification problem and explore adapting features from existing PropBank SRL systems for the NomBank task. Their best system achieves an F1 score of 72.73% on test data when using gold syntactic parses, representing the first reported automatic NomBank SRL system.
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag mentions the word’s
usage in the sentence. Usually, these tags indicate syntactic classification like noun or verb, and sometimes
include additional information, with case markers (number, gender etc) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches to parts-of-speech tagging: the rule-based approach and the
stochastic approach. The rule-based approach uses predefined handwritten rules; it is the
oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses
probabilistic and statistical information to assign tags to words. It uses a large corpus, so its
time and space complexity are high, whereas the rule-based approach has lower complexity in both
time and space. The stochastic approach is the one widely used nowadays because of its accuracy.
Malayalam is a Dravidian language, highly inflectional, with suffixes attached to root word forms. The
currently used algorithms are efficient machine learning algorithms, but they were not built for
Malayalam, which affects the accuracy of Malayalam POS tagging.
My proposed approach uses dictionary entries along with adjacent-tag information, implemented
with multithreading. Here tagging is done using the probability of the occurrence of the sentence
structure along with the dictionary entry.
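The combination of dictionary entries with adjacent-tag probabilities can be sketched as follows. The English toy dictionary, tag set and transition probabilities are invented, and the greedy left-to-right search is a simplification of the proposed approach:

```python
# Toy dictionary: each word maps to its possible tags.
DICT = {
    "the": ["DET"],
    "dog": ["NOUN"],
    "runs": ["VERB", "NOUN"],
    "fast": ["ADV", "ADJ"],
}
# Toy adjacent-tag (bigram) probabilities P(tag | previous tag).
TRANS = {
    ("<S>", "DET"): 0.8, ("<S>", "NOUN"): 0.2,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "ADV"): 0.6, ("VERB", "ADJ"): 0.4,
}

def tag(sentence):
    """Greedy tagging: among the dictionary's candidate tags for each
    word, pick the one with the highest transition probability from the
    previously chosen tag."""
    prev, out = "<S>", []
    for w in sentence:
        candidates = DICT.get(w, ["NOUN"])  # unknown words default to NOUN
        best = max(candidates, key=lambda t: TRANS.get((prev, t), 0.0))
        out.append(best)
        prev = best
    return out

print(tag(["the", "dog", "runs", "fast"]))  # → ['DET', 'NOUN', 'VERB', 'ADV']
```

The ambiguous "runs" (verb or noun) is resolved to VERB because NOUN→VERB is the more probable adjacent-tag transition, which is the disambiguation idea the abstract describes.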
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Abstract: The use of regular expressions to search text is a well-known and well-understood technique. Regular expressions are generic representations for a string or a collection of strings, and are one of the most useful tools in computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the usage of regular expressions, illustrated with examples from natural language processing, and discusses different approaches to regular expressions in NLP. Keywords: Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
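A typical NLP use of regexps is tokenization. A minimal sketch, with a pattern that captures words (including internal apostrophes), numbers, and single punctuation marks:

```python
import re

# Words with optional internal apostrophe, or a single
# non-word, non-space character (punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Regexps aren't magic, but they're useful!"))
```

Contractions like "aren't" stay as one token while trailing punctuation is split off, which is the behaviour most downstream NLP pipelines expect.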
----------------------------
Anaphora resolution is the process of finding referents in discourse; in computational linguistics it is a
complex and challenging task. This paper focuses on pronominal anaphora resolution, a
subpart of anaphora resolution in which pronouns are linked to noun referents. Incorporating anaphora
resolution into applications such as automatic summarization, opinion mining, machine translation and
question answering systems can increase their accuracy by around 10%. Related work in this field has been done
in many languages; this paper focuses on resolving anaphora for the Punjabi language. A model is proposed
for resolving anaphora and an experiment is conducted to measure the accuracy of the system. The model
uses two factors: Recency and Animistic knowledge. The Recency factor works on the concept of the Lappin-Leass
approach, and animistic knowledge is introduced using the gazetteer method. The experiment is conducted
on a Punjabi story containing more than 1000 words, and results are presented along with future directions.
A survey of named entity recognition in Assamese and other Indian languages (ijnlc)
Named entity recognition is important in major natural language processing
tasks such as information extraction, question answering, machine translation and document summarization,
so in this paper we put forward a survey of named entity recognition in Indian languages, with particular
reference to Assamese. Various rule-based and machine learning approaches are available for
named entity recognition. At the beginning of the paper we give an overview of the available approaches,
and then we discuss related research in this field. Assamese, like other
Indian languages, is agglutinative and suffers from a lack of appropriate resources: named entity
recognition requires large data sets, gazetteer lists, dictionaries etc., and useful features such as
capitalization, found in English, are absent in Assamese. Apart from this, we also describe some of
the issues faced in Assamese while doing named entity recognition.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
The document discusses parts-of-speech (POS) tagging. It defines POS tagging as labeling each word in a sentence with its appropriate part of speech. It provides an example tagged sentence and discusses the challenges of POS tagging, including ambiguity and open/closed word classes. It also discusses common tag sets and stochastic POS tagging using hidden Markov models.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short, important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been made for Hindi. Here, we discuss various techniques in terms of features such as time and memory consumption, efficiency, accuracy, ambiguity and redundancy.
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
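The character n-gram idea behind language identification can be sketched with a simple profile-overlap scorer standing in for the paper's random forest, decision tree and SVM classifiers. The romanized toy words and the scoring rule are invented for illustration:

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character bigrams with start/end markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profiles(labeled_words):
    """Build one n-gram frequency profile per language label."""
    profiles = {}
    for word, lang in labeled_words:
        profiles.setdefault(lang, Counter()).update(char_ngrams(word))
    return profiles

def identify(word, profiles):
    """Score each language by the overlap between the word's n-grams
    and the language profile; return the best-scoring language."""
    grams = char_ngrams(word)
    return max(profiles, key=lambda lang: sum(profiles[lang][g] for g in grams))

# Invented romanized Hindi vs. English training words.
data = [("khana", "hi"), ("karna", "hi"), ("raha", "hi"),
        ("eating", "en"), ("going", "en"), ("doing", "en")]
profiles = train_profiles(data)
print(identify("jana", profiles))  # → hi
```

In practice the n-gram counts become feature vectors for a trained classifier, and context-word and gazetteer features (as in the paper) are appended to the same vector.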
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES (cscpconf)
This paper describes a context free grammar (CFG) based grammatical relations for Myanmar
sentences which combine corpus-based function tagging system. Part of the challenge of
statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order
and a complex morphological system. Function tagging is a pre-processing step to
show grammatical relations of Myanmar sentences. In the task of function tagging, which tags
the function of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging
and chunking information, we use Naive Bayesian theory to disambiguate the possible function
tags of a word. We apply context free grammar (CFG) to find out the grammatical relations of
the function tags. We also create a functional annotated tagged corpus for Myanmar and propose the grammar rules for Myanmar sentences. Experiments show that our analysis achieves a good result with simple sentences and complex sentences.
The MSR-NLP Chinese word segmentation system is part of a full sentence analyzer. It uses a dictionary and rules for basic segmentation, morphology, and named entity recognition to build a word lattice. The system proposes new words, prunes the lattice, and uses a parser to produce the final segmentation. It participated in four segmentation bakeoff tracks, ranking highly in each. An analysis found that parameter tuning, morphology/NER, and lattice pruning contributed most to performance, while the parser helped less. Problems included inconsistent annotations and differences in defining new words.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance on capturing word similarities with existing benchmark datasets of word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by different word embedding methods.
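Such a correlation analysis can be sketched in a few lines: compute cosine similarities from the embeddings and correlate their ranks with human judgements. The 2-d "embeddings" and human scores below are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def ranks(xs):
    """Rank positions of each value (no tie handling; fine for toy data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Invented toy embeddings and human word-pair similarity judgements.
emb = {"king": [0.9, 0.8], "queen": [0.85, 0.82],
       "apple": [0.1, 0.9], "fruit": [0.15, 0.85]}
pairs = [("king", "queen", 9.0), ("apple", "fruit", 8.5),
         ("king", "apple", 2.0), ("queen", "fruit", 2.5)]
human = [h for _, _, h in pairs]
model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
rho = spearman(human, model)
print(f"Spearman correlation: {rho:.2f}")
```

Real evaluations do exactly this on benchmark word-pair datasets, reporting one correlation coefficient per embedding method.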
Implementation of Semantic Analysis Using Domain Ontology (IOSR Journals)
The document describes a semantic analysis system that analyzes feedback from an organization using domain ontology. The system first collects feedback data from students in an unstructured format. It then preprocesses the feedback using part-of-speech tagging to extract meaningful information. The system architecture includes preprocessing the feedback, matching entities in the feedback to an organization ontology using Jaccard similarity, and generating a summarized analysis of the feedback based on the ontology entities. The goal is to group related words and phrases expressed by students under the same entity to produce a meaningful summary for the organization.
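The Jaccard similarity used for matching feedback phrases to ontology entities is a one-liner over token sets; the feedback snippet below is hypothetical:

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical student feedback matched against an ontology entity.
feedback = "the lab equipment needs upgrading".split()
entity = "lab equipment".split()
print(jaccard(feedback, entity))  # → 0.4
```

Phrases scoring above a chosen threshold against an entity are grouped under that entity, which is how related wordings end up in the same summary bucket.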
This document summarizes Jessica Hullman's project modeling word sense disambiguation using support vector machines. She used a dataset from Senseval-2 and achieved an average accuracy of 87% at assigning word senses, with a standard deviation of 15% and median of 92%. The project involved modifying an existing implementation that used part-of-speech tags of neighboring words to classify word senses, training support vector machine classifiers on the Senseval-2 data.
A semi-supervised method for efficient construction of statistical spoken lan... (Seokhwan Kim)
This document presents a semi-supervised framework to efficiently construct statistical spoken language understanding resources with low cost. It generates context patterns from a small set of seed entities and unlabeled utterances. These patterns are used to extract new entities by aligning utterances and replacing entity labels. Extracted entities above a score threshold are added back to the seed set, repeating the process. An evaluation on a corpus achieved high precision and recall in extracting city names, months, and day numbers with this method.
This document describes a combined approach to part-of-speech tagging using both features extraction and hidden Markov models. It analyzes a sample text to extract morphological features of words and assign part-of-speech tags. It then uses a hidden Markov model to determine the maximum probability of a tag based on previous tags for words that are ambiguous. The approach extracts features from unique words in different tags to build a features matrix to aid in part-of-speech tagging of unknown words based on their features.
Part-of-speech tagging in Indian languages is still an open problem, and we still lack a clear approach to implementing a POS tagger for them. In this paper we describe our efforts to build a Hidden Markov Model based part-of-speech tagger. We have used the IL POS tag set for the development of this tagger and achieved an accuracy of 92%.
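The core of an HMM tagger is the Viterbi search for the most likely tag sequence. A minimal sketch; the two-tag toy model and its probabilities are invented, not the IL tag set:

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely tag sequence under a first-order HMM."""
    # Each cell holds (probability, best path ending in that state);
    # unseen words get a tiny emission probability instead of zero.
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}]
    for w in words[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s].get(w, 1e-6),
                 V[-1][p][1] + [s])
                for p in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

# Invented toy model: noun (N) vs. verb (V) tags.
states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # → ['N', 'V']
```

In a real tagger the start, transition and emission tables are estimated by counting over a tagged training corpus, exactly as the surveyed HMM systems do.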
This document describes a verb-based sentiment analysis of Manipuri language documents using conditional random fields for part-of-speech tagging and a manually annotated verb lexicon to determine sentiment polarity. The system was tested on 550 letters to newspaper editors, achieving an average recall of 72.1%, precision of 78.14%, and F-measure of 75% for sentiment classification. The authors conclude the work is an initial effort in sentiment analysis for the highly agglutinative Manipuri language and more methods are needed to improve accuracy.
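The reported F-measure can be verified from the precision and recall figures, since F1 is their harmonic mean:

```python
def f_measure(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported above: precision 78.14%, recall 72.1%.
print(round(f_measure(0.7814, 0.721), 2))  # → 0.75, matching the 75% reported
```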
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE) (ijnlc)
Extracting key phrases from documents is a common task in many applications. In general, a noun
phrase extractor consists of three modules: tokenization, part-of-speech tagging and noun phrase
identification. These are the three main steps in building the new system, ANPE. This paper aims at
extracting Arabic noun phrases from a corpus of documents, with the relevant criteria (recall and precision)
used as evaluation measures. On the one hand, when using NPs rather than single terms, the system
yields more relevant documents among those retrieved; on the other hand, it gives low precision because
the number of retrieved documents decreases. At the end, the researchers conclude and recommend
improvements for more effective and efficient research in the future.
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen... (cscpconf)
Sentiment Analysis (SA) and machine learning techniques are combined to understand the
attitude of a text's writer as implied in a particular text. Although SA is challenging
in itself, it is especially challenging for the Arabic language. In this paper, we enhance
sentiment analysis for Arabic. Our approach begins with special pre-processing
steps. Then we adopt the sentiment keywords co-occurrence measure (SKCM) as a
sentiment-based feature selection method. This feature selection method
was applied to three sentiment corpora using an SVM classifier. We compared our approach with
some traditional methods followed by most SA works. The experimental results were very
promising for enhancing SA accuracy.
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA (ijistjournal)
Ontologies have been applied to many applications in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, as well as their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology ontology while extracting concepts and objects from different sources. The ontology is constructed from three main resources: ACM, Wikipedia and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use Natural Language Processing tools such as OpenNLP and the Stanford lexical dependency parser to explore sentences, then extract sentences based on English patterns to build a training set. We use a random sample from 245 ACM categories to evaluate our results. The results generated show that our system yields superior performance.
Ontologisms have been applied to many applications in recent years, especially on Sematic Web, Information
Retrieval, Information Extraction, and Question and Answer. The purpose of domain-specific ontology
is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic
concepts that characterizes the domain as well as their definitions and interrelationships. This paper will
describe some algorithms for identifying semantic relations and constructing an Information Technology
Ontology, while extracting the concepts and objects from different sources. The Ontology is constructed
based on three main resources: ACM, Wikipedia and unstructured files from ACM Digital Library. Our
algorithms are combined of Natural Language Processing and Machine Learning. We use Natural Language
Processing tools, such as OpenNLP, Stanford Lexical Dependency Parser in order to explore sentences.
We then extract these sentences based on English pattern in order to build training set. We use a
random sample among 245 categories of ACM to evaluate our results. Results generated show that our
system yields superior performance.
Similar to Towards Building Semantic Role Labeler for Indian Languages (20)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Influence of Marketing Strategy and Market Competition on Business Plan
Towards Building Semantic Role Labeler for Indian Languages
Maaz Anwar Nomani♣
and Dipti Misra Sharma♣
FC Kohli Center on Intelligent Systems (KCIS), IIIT-Hyderabad, India♣
maaz.anwar@research.iiit.ac.in, dipti@iiit.ac.in
Abstract
We present a statistical system for identifying the semantic relationships, or semantic roles, for two major Indian languages, Hindi and Urdu. Given an input sentence and a predicate/verb, the system first identifies the arguments pertaining to that verb and then classifies each of them into one of the semantic labels, which can be DOER, THEME, LOCATIVE, CAUSE, PURPOSE, etc. The system is based on 2 statistical classifiers trained on roughly 130,000 words for Urdu and 100,000 words for Hindi that were hand-annotated with semantic roles under the PropBank projects for these two languages. Our system achieves an accuracy of 86% in identifying the arguments of a verb for Hindi and 75% for Urdu. At the subsequent task of classifying the constituents into their semantic roles, the Hindi system achieved 58% precision and 42% recall, whereas the Urdu system performed better, achieving 83% precision and 80% recall. Our study also allowed us to compare the usefulness of different linguistic features and feature combinations in the semantic role labeling task. We also examine the use of statistical syntactic parsing as a feature in the role labeling task.
Keywords: PropBank, Semantic Role labeling, Treebank, Dependency relations
1. Introduction
We introduce a statistical semantic role labeler for Hindi and Urdu, two major Indian languages. A Semantic Role labeler (henceforth, SRL) automatically marks the arguments/valency of a predicate in a sentence. The proposed system is based on a supervised machine learning approach using the Hindi and Urdu PropBanks which are being built for these languages.
The approach is a 2-stage architecture in which, first, the arguments pertaining to a predicate in a sentence are identified by the system, and then those identified arguments are classified into one of the PropBank semantic labels. Our system uses a basic Logistic Regression machine learning algorithm (Pedregosa et al., 2011) for identifying the arguments of a predicate and Support Vector Machines (Pedregosa et al., 2011) to classify those arguments into semantic labels. We have used 10 linguistic features, 6 of which have been derived from the literature and 4 of which are novel. Among the new features, we have used the karaka relations/dependency relations (described later), which turn out to be the most discriminative feature of all. We have also experimented with automatic parses, used as features and derived from state-of-the-art Hindi and Urdu parsers. To the best of our knowledge, this is the first attempt at building a semantic role labeler for any Indian language.
This paper is arranged as follows: Section 2 gives a brief introduction to Semantic Parsing. In Section 3, we give the data statistics of the language resources used. Section 4 discusses related work in semantic role labeling. In Section 5, we briefly describe the Hindi and Urdu PropBanks and Treebanks which are being built for these two Indian languages. Section 6 presents the Semantic Role labeler in detail, covering our approach, its architecture, and the classifiers and features used. We also describe the baseline features, their impact on the Indian-language datasets, and the new features incorporated in the systems. In Section 7, we describe in detail the different experiments performed and show their results. Section 8 examines the impact of using automatic parses as features in the semantic role labeling task. In Section 9, we present a thorough error analysis along with some corpus examples showing erroneous role labeling, and discuss possible reasons for the errors. We conclude the paper in Section 10.
2. Semantic Parsing
Semantic analysis of natural text and languages has always
been an intriguing research area in NLP. Automatic seman-
tic analysis of texts thus becomes a challenging as well as
interesting task which includes tasks like Semantic Pars-
ing, Shallow Semantic Parsing and Semantic Role label-
ing. Automatic and accurate tools that can predict naturally
occurring arguments for a given predicate in a predicate-
argument structure of a sentence can be efficiently used for
Semantic Parsing, Summarization, Information Extraction
tasks, wider NLP areas like Machine Translation and Ques-
tion Answering and modern NLP challenges like parsing
code-mixed data.
Semantic Parsing is essentially the research investigation of
identifying WHO did WHAT to WHOM, WHERE, HOW,
WHY and WHEN etc. in a sentence and adding a layer of
semantic annotation to a sentence produces such a structure. Hindi and Urdu, two major Indian languages, have such a resource in the form of a PropBank (also called a Proposition Bank), which establishes a layer of semantic representation in a Treebank already annotated with dependency labels.
Proposition Bank (Kingsbury and Palmer, 2003) is a corpus
in which the arguments of each verb predicate (simple or
complex) are marked with their semantic roles.
3. Data
Table 1 shows the data statistics of the language resources.
We have used a part of these language resources for the SRL task.
In the experiments reported here, we first performed the argument identification task, which is a binary classification
problem (either ‘argument’ or ‘not an argument’) by tak-
ing 130,000 tokens as training data and 30,000 as test data
from Urdu PropBank. We took around 100,000 tokens as
training data from Hindi PropBank and 20,000 as test data.
Subsequently, we trained a multi-class classifier, SVM on
the same training set, using a well defined feature-set de-
scribed later.
Language Resource   Tokens     Sentences   Predicates   pbrel
Urdu Treebank       ~200,000   ~8,000      -            -
Hindi Treebank      ~350,000   ~14,000     -            -
Urdu Propbank       ~180,000   ~7,000      ~2,200       ~7,000
Hindi Propbank      ~300,000   ~11,000     ~3,200       ~12,000
Table 1: Hindi and Urdu resources statistics
4. Related Work
The last decade has seen exhaustive research in semantic parsing for different languages. Gildea and Jurafsky (2002) started the work on semantic role labeling on the 2001 release of the English Propbank. Surdeanu et al. (2003) built a decision tree classifier for predicting the semantic labels. Gildea and Hockenmaier (2003) used features from Combinatory Categorial Grammar (CCG), which is a form of dependency grammar. Chen and Rambow (2003) used a decision tree classifier and additional syntactic and semantic representations extracted from Tree Adjoining Grammar (TAG). Swier and Stevenson (2004) proposed a novel bootstrapping algorithm for identifying semantic labels. Cohn and Blunsom (2005) applied conditional random fields (CRFs) to the semantic role labeling task. Xue and Palmer (2004) experimented with different linguistic features related to the role labeling task.
Xue and Palmer (2005) built a role labeler for Chinese by showing that verb classes, induced from the predicate-argument information in the frame files, help in semantic role labeling. Johansson and Nugues (2006) came up with a FrameNet-based semantic role labeling system for Swedish text. Toutanova et al. (2005) used joint learning and joint modeling of the argument frames of verbs to improve the overall accuracy of a semantic role labeler.
5. The PropBank and The Treebank
Indian languages are morphologically very rich, and Treebanks for Hindi and Urdu are available which include syntactico-semantic information in the form of dependency annotations as well as lexical semantic information in the form of predicate-argument structures, thus forming PropBanks. The Hindi and Urdu PropBanks are part of a multi-dimensional and multi-layered resource creation effort for the Hindi-Urdu language (Bhatt et al., 2009).
The Hindi Dependency Treebank (HDT) and Urdu Depen-
dency Treebank (UDT) are built following the CPG (Com-
putational Paninian Grammar) framework (Begum et al.,
2008). PropBank establishes a layer of semantic represen-
tation in the Treebank which is already annotated with De-
pendency labels or phrase structures (as in Penn Treebank
for English). Capturing the semantics through predicate-
argument structure involves quite a few challenges pecu-
liar to each predicate type because the syntactic notions in
which the verb’s arguments and adjuncts are realized can
vary based on the senses. Table 2 shows the PropBank la-
bels for Hindi and Urdu and these labels are also used for
building the Semantic Role labeler.
Propbank labels or semantic labels are closely associated
with the karaka relations (Vaidya et al., 2011) (described
later) in their structure though the former are defined on a
verb-by-verb basis.
Label       Description
ARG0        Agent, Experiencer or doer
ARG1        Patient or Theme
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    Attribute or Quality
ARG2-LOC    Physical Location
ARG2-GOL    Goal
ARG2-SOU    Source
ARGM-PRX    Noun-verb construction
ARGM-ADV    Adverb
ARGM-DIR    Direction
ARGM-EXT    Extent or Comparison
ARGM-MNR    Manner
ARGM-PRP    Purpose
ARGM-DIS    Discourse
ARGM-LOC    Abstract Location
ARGM-MNS    Means
ARGM-NEG    Negation
ARGM-TMP    Time
ARGM-CAU    Cause or Reason
Table 2: Hindi and Urdu PropBank labels with definitions
6. Semantic Role labeler
This section gives a thorough analysis of our approach to the role labeling task. At first glance, one might view semantic role labeling as a simple, straightforward multi-class classification problem in which the semantic labels are assigned directly, without a separate identification stage. But such a simple approach does not work, for various reasons.
Xue and Palmer (2004) observe that for a given predicate, many constituents in a syntactic tree are not its semantic arguments. The non-argument count thus overwhelms the argument count for the given predicate, and classifiers will not be efficient at predicting the right arguments or at classifying them. This problem was encountered in both the Hindi and Urdu data-sets as well.
Therefore, we adopted a 2-stage approach of first identifying the arguments and subsequently classifying the identified semantic arguments into one of the semantic labels. We also experimented with the second approach of directly classifying the arguments, and we show by statistics that such a forthright approach does not work for Hindi and Urdu either.
• Approach I
• Argument Identification - This is the sub-task of iden-
tifying a constituent in a dependency tree which repre-
sents the argument for the predicate in the argument-
predicate structure in a Hindi or Urdu PropBank sen-
tence. The identification is done by a binary classifier
(‘argument’ and ‘not an argument’). The classifier has
to predict whether the constituent in the tree is an ar-
gument of that verb or not.
• Argument Classification - After identification of argu-
ments is done, the identified arguments pertaining to
that verb are assigned appropriate semantic labels by
a multi-class classifier which is trained on multiple se-
mantic labels present in the training data. We have
used Gold Parses as well as Automatic Parses from
Hindi and Urdu parsers as features in this task.
• Approach II
• Argument Identification and Argument Classification -
We performed direct classification of arguments on the Hindi and Urdu data and found that the results were not effective. Due to the overwhelming number of non-arguments present in the corpus, the classifier failed to classify the arguments correctly. For example, in the Urdu SRL system with gold parses used as a feature, arguments were classified with a precision of only 20.98% while non-arguments were classified with a precision of 79.71%. With automatic parses, arguments were classified with a precision of 17.12% and non-arguments with a precision of 82.87%.
6.1. SRL Architecture
Figure 1 is the flow diagram of Semantic Role Labeler for
Hindi and Urdu. It is a 2-stage architecture in which at first
the arguments of a predicate in a sentence are identified by a
binary classifier and then in the subsequent stage only these
identified arguments are classified as one of the semantic
labels by a multi-class classifier.
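The two-stage architecture can be sketched with scikit-learn, from which our classifiers are taken (Pedregosa et al., 2011). The chunk features, dependency tags and role labels in this sketch are invented toy values, not PropBank data:

```python
# Illustrative sketch of the 2-stage pipeline: a Logistic Regression
# argument identifier followed by a LinearSVC argument classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Chunk-level training instances: (features, is_argument, role).
train = [
    ({"head": "Sara", "head_pos": "NNP", "dep": "k1"}, 1, "ARG0"),
    ({"head": "competition", "head_pos": "NN", "dep": "k2"}, 1, "ARG1"),
    ({"head": "work", "head_pos": "NN", "dep": "rh"}, 1, "ARGM-CAU"),
    ({"head": "the", "head_pos": "DT", "dep": "det"}, 0, None),
    ({"head": ",", "head_pos": "SYM", "dep": "punc"}, 0, None),
    ({"head": "won", "head_pos": "VB", "dep": "root"}, 0, None),
]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _, _ in train])

# Stage 1: binary identification (argument vs. not-an-argument).
identifier = LogisticRegression().fit(X, [y for _, y, _ in train])

# Stage 2: multi-class classification, trained on argument chunks only.
arg_rows = [i for i, (_, y, _) in enumerate(train) if y == 1]
classifier = LinearSVC().fit(X[arg_rows], [train[i][2] for i in arg_rows])

def label(chunks):
    """Identify arguments first; assign roles only to identified ones."""
    Xc = vec.transform(chunks)
    identified = identifier.predict(Xc)
    roles = [None] * len(chunks)
    for i, is_arg in enumerate(identified):
        if is_arg == 1:
            roles[i] = classifier.predict(Xc[i])[0]
    return roles
```

Training the second-stage classifier only on chunks marked as arguments is what keeps the non-argument majority from overwhelming it, as discussed above.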
6.2. Features: A critical look
Hindi and Urdu are free-word-order languages, i.e. word order is somewhat flexible in these languages, although they are usually described as SOV (subject, object, verb) languages. Hindi and Urdu employ postpositions rather than prepositions (as in English), so the resulting word order of postpositional phrases can be the reverse of that of prepositional phrases in English. We decided to use linguistic features in our Semantic Role Labeler based on our intuition that they might help the labeler in taking decisions to identify and classify the semantic arguments of a predicate in the sentence.
[Figure 1 (flow diagram): sentences with pbrole are extracted and filtered from the PropBank; a binary classifier (Argument Identifier) is trained, and its identified arguments are projected onto the test data; only these arguments are passed to a multi-class classifier (Argument Classifier), whose classified/predicted labels are projected back onto the PropBank test data, yielding test data containing automatic SR labels.]
Figure 1: SRL Architecture
6.2.1. Baseline Features
The task of identifying and classifying the arguments of a verb is done at the chunk/phrase level. Most previous work in Semantic Parsing has shown that the 'head' of the chunk is a very critical feature, because chunks/phrases with certain head words are likely to be arguments of the verb, and likely to be certain types of arguments. This implies that the head word of a chunk is a very discriminative feature for both tasks.
For our Indian-language SRL, we have used a combination of the baseline features introduced by Gildea and Jurafsky (2002) and Xue and Palmer (2004) for English, because we wanted to see the impact of these standard features in the context of the Indian languages and how these features perform w.r.t. free-word-order languages.
Baseline Features for Indian Languages
Predicate                 the predicate itself
Head word                 syntactic head of the chunk/phrase
Head-word POS             its POS category
Phrase Type               syntactic category of the phrase (NP, VP, etc.)
Predicate + Phrase Type   combinational feature
Predicate + Head word     combinational feature
Table 3: Baseline feature template
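As a concrete reading of this template, each chunk yields one feature dictionary; the chunk field names ('head', 'head_pos', 'phrase_type') are a hypothetical encoding of our own, not part of the PropBank annotation:

```python
# Sketch of the baseline feature template (Table 3) for one chunk.
def baseline_features(chunk, predicate):
    """Build the six baseline features for one chunk/phrase."""
    return {
        "predicate": predicate,
        "head": chunk["head"],
        "head_pos": chunk["head_pos"],
        "phrase_type": chunk["phrase_type"],
        # Combinational features pair the predicate with chunk properties.
        "predicate+phrase_type": predicate + "|" + chunk["phrase_type"],
        "predicate+head": predicate + "|" + chunk["head"],
    }

feats = baseline_features(
    {"head": "Sara", "head_pos": "NNP", "phrase_type": "NP"}, "won")
# feats["predicate+head"] == "won|Sara"
```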
6.2.2. New Features
After analyzing the PropBanks of both the languages, we
came up with certain new features for which we had an
intuition that they will contribute significantly towards this
task. These are discussed below:
• Dependency/karaka relation - We experimented with the dependency labels or karaka relations present in the Hindi and Urdu PropBanks. A sentence is composed of a primary modified, which is the root of the dependency tree of the sentence, and this modified is generally the main verb of the sentence. The words modifying this verb participate in the action specified by the verb. The participant relations of the verb with these words/elements are called karaka relations.
There are in total 43 karaka relations for Hindi and Urdu as specified by Begum et al. (2008). Table 4 shows some of the major karaka or dependency relations. These dependency relations help to identify the syntactico-semantic relations in a sentence. Since PropBank labels illustrate the semantic relations between the constituents, we had the intuition that using the karaka relations as features in our classifier to predict the arguments of a verb would have an impact on our system performance. Also, Vaidya et al. (2011) have shown that there is a correlation between dependency and predicate-argument structures for Hindi. Since Hindi and Urdu are linguistically similar languages, we extracted the dependency/karaka relations from both the language Treebanks and used them as a feature, both in isolation and in the full system.
Dependency Label   Description
k1     karta - doer/agent/subject
k2     karma - object/patient
k1s    noun complement
k4     sampradana - recipient
k3     karana - instrument
k4a    experiencer
k2p    goal, destination
k5     apadana - source
k7     location elsewhere
k7p    location in space
k7t    time
adv    adverbs
rh     reason
rd     direction
rt     purpose
Table 4: Major karaka relations or dependency labels of the Hindi and Urdu Treebanks
From Tables 5 and 6, it is clear that dependency relations are indeed the most discriminative feature among all the features used, showing high precision and recall for both languages. The high precision and recall on the Argument Classification task can be attributed to the fact that dependency structure and predicate-argument structure are somewhat similar structures, and dependency labels and PropBank labels are similar in orientation. Therefore, using dependency structures as features to predict and classify the arguments of a predicate is very influential for Hindi and Urdu.
• Named Entities (NEs) - Named Entities also contribute towards the identification task, but they do not contribute heavily towards the classification task. The reason is that NEs are generally arguments of a predicate. This feature has been used in the past by Pradhan et al. (2004) for English, but we have used it differently by combining it with the POS category of the NE. As shown in Tables 5 and 6, NEs have a decent precision of 63% and 61% for Urdu and Hindi respectively when used in isolation for the argument identification task. But the feature is not able to discriminate between the classes of semantic labels to which an identified argument belongs.
• Head + Phrase type - A combination of head word of
the chunk/phrase and syntactic category of the phrase.
• Head POS + Phrase type - A combination of the POS
category of the head word of the chunk/phrase and
syntactic category of the phrase.
Abbreviations used: 'P' = Precision, 'R' = Recall, 'PT' = Phrase Type, 'HW' = Head Word, 'POS' = Part-of-speech.
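The per-label precision, recall and f-scores reported in Tables 5-8 follow the usual definitions; a minimal sketch over toy gold/predicted role sequences (not our evaluation code):

```python
# Per-label precision, recall and f-score from parallel gold and
# predicted role sequences; toy values only.
def prf(gold, pred, role):
    tp = sum(g == p == role for g, p in zip(gold, pred))
    p_den = sum(p == role for p in pred)   # chunks predicted as `role`
    r_den = sum(g == role for g in gold)   # chunks truly `role`
    p = tp / p_den if p_den else 0.0
    r = tp / r_den if r_den else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["ARG0", "ARG1", "ARG0", "ARGM-TMP"]
pred = ["ARG0", "ARG0", "ARG0", "ARGM-TMP"]
p, r, f = prf(gold, pred, "ARG0")
# p = 2/3 (one false positive), r = 1.0 (both gold ARG0 found)
```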
Feature wise performance (Urdu)
Feature        Argument Id.          Argument Classification
               P    R    f-score     P    R    f-score
Predicate      51   72   60          00   01   00
Head-word      71   73   72          25   14   18
Head-POS       51   72   60          06   18   09
PhraseType     51   72   60          08   08   08
HeadPOS-PT     61   72   66          06   18   09
Head-PT        71   74   72          25   14   18
Predicate-HW   72   74   73          01   01   01
Predicate-PT   64   69   66          02   00   00
NEs            63   65   64          23   18   20
Dependency     78   79   78          87   85   86
Table 5: Individual feature performance (Urdu)
• We also experimented with certain other features, such as the chunk boundaries and sentence boundaries of the sentences. We had an intuition that these boundaries would help the classifier learn more effectively and minimize overlap between sentences. But it turned out that both features had a major negative impact on the overall results of the system, so we scrapped them.
The reason for this dip is that chunk boundary and sentence boundary are merely integers for the classifier: there are 'n' numbers for 'n' sentences/chunks, and they carry no significant information.
6.2.3. Classifiers used
We used a basic Logistic Regression classifier and trained it on the Hindi and Urdu data-sets, thus building a binary classifier which labels constituents as 'Argument' or 'Not-an-Argument'. For multi-class classification in the argument classification task, we implemented the LinearSVC class of SVM (Pedregosa et al., 2011), which performs the 'one-vs-the-rest' multi-class strategy. All the parameters of LinearSVC were set to their defaults at the time of training on the Hindi and Urdu data-sets.
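Under LinearSVC's default 'one-vs-the-rest' strategy, one linear separator is learned per semantic label; a minimal sketch at default settings, on invented toy data:

```python
# LinearSVC's default multi-class strategy is one-vs-the-rest:
# it fits one weight vector (one row of coef_) per class.
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-dimensional features with three semantic labels.
X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.], [2., 1.], [1., 2.]])
y = ["ARG0", "ARG1", "ARGM-TMP", "ARG0", "ARG1", "ARGM-TMP"]

clf = LinearSVC().fit(X, y)
# clf.coef_ has shape (n_classes, n_features) == (3, 2)
```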
Feature wise performance (Hindi)
Feature        Argument Id.          Argument Classification
               P    R    f-score     P    R    f-score
Predicate      41   64   50          41   61   49
Head-word      70   70   70          20   04   07
Head-POS       34   58   43          07   08   07
PhraseType     34   58   43          05   07   06
HeadPOS-PT     34   58   43          14   13   13
Head-PT        67   66   66          10   04   06
Predicate-HW   65   65   65          10   04   06
Predicate-PT   41   64   50          04   06   05
NEs            61   64   62          20   16   17
Dependency     88   87   87          52   49   50
Table 6: Individual feature performance (Hindi)
7. Experiments and Results
In this section, we show the results and the extent to which the baseline (standard) features help Semantic Role Labeling in experiments on the Indian data sets. We present results for the baseline features on both tasks and then add the new features in subsequent steps. We have laid out the experiments so that each experiment conveys the usefulness and information gain of a feature on both tasks. For each experiment, the settings of the SVM and the features used in the previous experiment were retained. The baseline features perform well on the argument identification task for both languages, with a precision of 71% and 67% for Hindi and Urdu respectively. The classification task of the identified arguments does not profit much from the baseline features for either language, as can be seen from Tables 7 and 8.
Hindi system Feature Template
Feature        Argument Id.          Argument Classification
               P    R    f-score     P    R    f-score
Baseline       71   71   66          29   14   15
+Dependency    86   86   86          56   38   41
+Head-PT       85   85   86          56   38   41
+HeadPOS-PT    84   84   84          57   40   42
+NEs           85   84   84          58   42   49
Table 7: Hindi system performance
As expected, the best result for both tasks is obtained on incorporating dependency relation information, which further strengthens our conjecture on the importance of dependency labels in semantic role labeling. All 4 experiments show the effectiveness of the features for the Hindi and Urdu data sets. On incorporating NEs in the system, there is an increase on the identification task for Urdu and a marginal increase for Hindi. The classification task also benefits from a marginal precision increment of 1% in both data sets and a recall increment of 2% in Hindi.
However, the 'Head + Phrase Type' and 'HeadPOS + Phrase Type' features had a negative impact on the performance of the classification task for both sets. The identification task benefited from every new feature in Hindi, with dependency relations contributing the most to the final result of the system.
Urdu system Feature Template
Feature        Argument Id.          Argument Classification
               P    R    f-score     P    R    f-score
Baseline       67   70   68          16   14   15
+Head-PT       65   69   67          15   12   13
+HeadPOS-PT    66   69   67          15   12   13
+Dependency    72   74   73          82   80   81
+NEs           75   71   73          83   80   81
Table 8: Urdu system performance
8. Using Automatic Parses
So far, we have shown and discussed results using the hand-corrected parses of the Hindi and Urdu PropBanks. However, if we are building an NLP application, the SRL will have to extract features from automatic parses of the sentences. Also, after gaining insights into the influence of hand-corrected dependency relations, we wanted to analyse the same using automatic parses and evaluate our semantic role labeler on them. These automatic parses are generated by the parser of the respective language.
To come up with such an evaluation, we used the Malt Parser (Nivre et al., 2006) on the same test data described above for both languages. Malt Parser requires data in CONLL format. Using the state-of-the-art Urdu parser (Bhat et al., 2012), we extracted automatic parses from the Urdu test data and used them as a feature for our identification and classification tasks.
Parsing accuracy is measured by LAS, UAS and LA, which stand for Labeled Attachment Score, Unlabeled Attachment Score and Label Accuracy respectively. LAS is the
percentage of words that are assigned the correct head and
dependency label; UAS is the percentage of words with the
correct head, and the label accuracy (LA) is the percentage
of words with the correct dependency label.
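These three metrics can be sketched directly from their definitions, over parallel lists of (head, label) pairs per token (toy values, not the parser's output format):

```python
# LAS, UAS and LA over gold and predicted (head, label) pairs.
def attachment_scores(gold, pred):
    n = len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / n          # head and label
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n    # head only
    la  = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n    # label only
    return las, uas, la

gold = [(2, "k1"), (0, "root"), (2, "k2")]
pred = [(2, "k1"), (0, "root"), (3, "k2")]
las, uas, la = attachment_scores(gold, pred)
# las = 2/3 and uas = 2/3 (one wrong head), la = 1.0 (all labels correct)
```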
Measure   Percentage (%)
LAS       76.61
UAS       88.45
LA        80.03
Table 9: Parsing performance of the Urdu dependency parser
By adding automatic parses to the baseline features of the Urdu system, we observed a marginal increase on the argument identification task and a considerable improvement on the classification task, as shown in Table 10. The smaller jump in the precision of the classification task when using automatic parses, as compared to gold parses, can be attributed to the fact that the gold parses (Urdu PropBank) have been manually validated by annotators, whereas the Urdu parser produces some incorrect parses, as can be seen from Table 9.
Automatic Parses (Urdu)
Feature             Argument Id.          Argument Classification
                    P    R    f-score     P    R    f-score
Baseline            67   70   68          16   14   15
+Automatic Parses   68   66   67          62   59   60
Table 10: Urdu system with Automatic Parses
The same process was repeated using the state-of-the-art Hindi parser (Kosaraju et al., 2010) on the Hindi test data.
Measure   Percentage (%)
LAS       88.19
UAS       94.00
LA        89.77
Table 11: Parsing performance of the Hindi dependency parser
The usage of automatic parses for Hindi has a similar impact on the Hindi SRL system. Table 12 shows the degradation in performance, relative to gold parses, when automatically generated parses are used.
Automatic Parses (Hindi)
Feature             Argument Id.          Argument Classification
                    P    R    f-score     P    R    f-score
Baseline            71   71   66          29   14   15
+Automatic Parses   73   74   73          41   28   33
Table 12: Hindi system with Automatic Parses
9. Error Analysis and Discussion
In this section, we present a thorough error analysis of the
Semantic Role Labeling task for Hindi and Urdu by taking
a sub-set of test data for error analysis. Tables 14 and 15
are the confusion matrices for Hindi and Urdu SRL respec-
tively. As Hindi and Urdu have postpositional phrases, the
postpositions or case markers play a very significant role in
deciding the semantic role for an argument. These postpo-
sitions may not be helpful cues in identifying the argument
but they are when it comes to classifying the arguments.
There are 6 predominant case markers in Hindi and Urdu, viz. 'ne', 'ko', 'mein', 'se', 'ki', 'par'. Table 13 gives the definitions of the Hindi and Urdu case markers. These case markers provide important information about the token preceding them and the nature of the chunk/phrase in which they appear. For example, in both Hindi and Urdu the case clitic 'ne' is often indicative of an agent, and as 'ARG0' is mostly associated with 'agentivity', this can sometimes provide a clue about ARG0. The same holds for 'mein', which most of the time carries 'locative' information in the chunk/phrase. Some postpositions are very ambiguous in nature, and this is where errors in classifying the arguments start. We show in the examples below how the 'ko' and 'mein' postpositions can take multiple types of semantic labels, so the classifier is not very discriminative during multi-class classification.
Case marker   Meaning
ne            Ergative
mein          Locative
par           Locative
ko            Dative/Accusative
se            Instrumental/Ablative
ki            Genitive
Table 13: Hindi and Urdu Case markers
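The case-marker lookup in Table 13 can be expressed as a small feature extractor over argument chunks; this is a hypothetical sketch (the chunk format and function names are assumptions), not the authors' feature code:

```python
# Hypothetical sketch: extract the postposition/case-marker feature for an
# argument chunk, using the marker inventory from Table 13.

CASE_MARKERS = {
    "ne": "Ergative",
    "mein": "Locative",
    "par": "Locative",
    "ko": "Dative/Accusative",
    "se": "Instrumental/Ablative",
    "ki": "Genitive",
}

def case_marker_feature(chunk_tokens):
    """Return the case marker closing the chunk, or 'NONE' if absent."""
    for token in reversed(chunk_tokens):  # markers follow the noun they mark
        if token in CASE_MARKERS:
            return token
    return "NONE"

print(case_marker_feature(["uski", "putri", "ko"]))  # → ko
print(case_marker_feature(["Assam", "mein"]))        # → mein
print(case_marker_feature(["yuwak"]))                # → NONE
```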
As can be seen from the confusion matrix for Hindi, the pair
of labels on which the Hindi SRL shows the maximum confusion
is 'ARG1' and 'ARG0'. The 'ARG0' label is given to those
arguments of a verb which show agentivity or volitionality,
whereas ARG1 marks the theme or object arguments. Another
frequent confusion for the Hindi SRL is between 'ARGM-LOC'
and 'ARG0'. The confusion between 'ARGM-LOC' and 'ARG2-LOC'
can be attributed to the fact that both labels depict locative
information, though ARGM-LOC marks abstract location whereas
ARG2-LOC carries physical location information.
Arg0 Arg1 Arg2 Arg2LOC Arg2SOU Arg2ATR ArgMADV ArgMLOC ArgMDIR
Arg0 23 03 02
Arg1 21 21 04 05 02 03
Arg2 01 01 01
Arg2LOC 02 02
Arg2SOU 02
Arg2ATR 03 01 07
ArgMADV 01
ArgMLOC 05 01 04 01 16
ArgMDIR 02
Table 14: Confusion Matrix of Propbank labels for
Hindi SRL
For Urdu, similar confusion is observed between the 'ARG0'-
'ARG1' and 'ARGM-LOC'-'ARG2-LOC' pairs. Such confusions can
be reduced by designing linguistic features aimed at resolving
the ARG0/ARG1 confusion, since it is the most frequent
erroneous pair for Indian-language SRL. These errors also
occur due to the presence of ambiguous case markers with the
arguments of a verb, and therefore the classifier is unable
to discriminate between the two labels.
Another interesting confusion pair is ARG1:ARG2-ATR. There
are instances in the training set in which the arguments of
a verb are annotated with 'ARG1' or 'ARG2-ATR' depending on
the context and sentence structure.
Arg0 Arg1 Arg2 Arg2LOC Arg2SOU Arg2ATR ArgMADV ArgMLOC ArgMDIR
Arg0 45 02 02 01
Arg1 08 48 03 05 02 04
Arg2 01 07 01
Arg2LOC 10 01 01 03
Arg2SOU 04 02 01 01 04
Arg2ATR 01
ArgMADV 01 12
ArgMLOC 01 36
ArgMDIR 03
Table 15: Confusion Matrix of Propbank labels for
Urdu SRL
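Confusion matrices like Tables 14 and 15 can be built by counting (gold, predicted) label pairs; a minimal self-contained sketch with toy labels (not the paper's actual counts):

```python
# Build a confusion matrix over PropBank labels by counting how often each
# gold label was predicted as each system label. Data below is illustrative.

from collections import Counter

def confusion_matrix(gold_labels, pred_labels):
    """Map each (gold, predicted) label pair to its frequency."""
    return Counter(zip(gold_labels, pred_labels))

gold = ["ARG0", "ARG1", "ARG1", "ARGM-LOC", "ARG2-LOC"]
pred = ["ARG0", "ARG0", "ARG1", "ARG2-LOC", "ARG2-LOC"]

cm = confusion_matrix(gold, pred)
print(cm[("ARG1", "ARG0")])          # → 1  (an ARG0/ARG1 confusion)
print(cm[("ARGM-LOC", "ARG2-LOC")])  # → 1  (a locative-label confusion)
```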
We extracted certain sentences from the Hindi and Urdu test
sets and analyzed how our SRL performs in comparison with
the gold test data. These sentences are shown below:
(1) (iss dauran) [this period; gold: ARGM-TMP, SRL: ARGM-TMP]
    (uski putri ko) [his daughter-DAT; gold: ARG1, SRL: ARG1]
    (yuwak) [man; gold: ARG0, SRL: ARG0]
    (ghumata) [roam; predicate] (rehta tha) [live be-PAST].
    'During this period, the man used to roam around his
    daughter.' HINDI TEST
(2) (idhar) [here; gold: ARGM-DIR, SRL: ARGM-DIS]
    (Assam mein) [Assam-LOC; gold: ARGM-LOC, SRL: ARGM-LOC]
    (BJP) [BJP; gold: ARG1, SRL: ARG0]
    (mukhya party ke roop mein) [main party manner-LOC;
    gold: ARGM-MNR, SRL: ARGM-MNR]
    (ubhar) [emerge; predicate] (rahi hai) [live be-PRS].
    'Here, in Assam, BJP is emerging as the main party.' URDU TEST
(3) (Jwala ne) [Jwala ERG] (bataya) [said] (ki) [that]
    (central market ke muqaable mein) [central market
    comparison-LOC; gold: ARGM-EXT, SRL: ARGM-LOC]
    (yahan) [here; gold: ARG2-LOC, SRL: ARG2-LOC]
    (rate) [rate; gold: ARG1, SRL: ARG1]
    (munaasib) [economical; gold: ARGM-PRX, SRL: ARGM-PRX]
    (rehta hai) [live be-PRS; predicate].
    'Jwala said that rate here is economical as compared
    to central market.' HINDI TEST
(4) (Barack Obama ko) [Barack Obama-DAT; gold: ARG0-GOL, SRL: ARG2]
    (ehsaas) [feel; gold: ARG1, SRL: ARG2-ATR]
    (hai) [be-PRS] (ki) [that]
    (unki party) [his party; gold: ARG0, SRL: ARG1]
    (intekhabaat mein) [election-LOC; gold: ARGM-LOC, SRL: ARGM-LOC]
    (2009 ka) [2009 of] (karishma) [miracle] (nahi) [not]
    (dikha) [see; predicate] (paaegi) [get].
    'Barack Obama feels that his party will not be able to
    repeat the miracle of 2009 in the election.' URDU TEST
In sentence 3, 'central market ke muqaable mein' is
'ARGM-EXT' in the gold test data, as it conveys the meaning
of 'comparison' with some entity, but the Hindi SRL gives it
the ARGM-LOC label. One possible reason may be that the
'mein' case marker occurs more frequently with the ARGM-LOC
or ARG2-LOC label in the locative sense in the training data
than in the comparative sense. In sentence 4, 'Barack Obama
ko' gets the ARG2 label from the Urdu SRL instead of
ARG0-GOL for the same reason: the 'ko' case marker occurs
more frequently with the ARG2 label, i.e. beneficiary, in
the training data than with the experiencer reading it has
here.
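The frequency argument above can be made concrete by counting how often each case marker co-occurs with each semantic label in the training instances; a sketch with hypothetical, illustrative data only:

```python
# Quantify case-marker ambiguity: count marker/label co-occurrences so the
# dominant label per marker is visible. The training pairs below are made up.

from collections import defaultdict, Counter

def marker_label_counts(instances):
    """instances: iterable of (case_marker, semantic_label) pairs."""
    counts = defaultdict(Counter)
    for marker, label in instances:
        counts[marker][label] += 1
    return counts

training = [
    ("ko", "ARG2"), ("ko", "ARG2"), ("ko", "ARG0-GOL"),
    ("mein", "ARGM-LOC"), ("mein", "ARGM-LOC"), ("mein", "ARGM-EXT"),
]
counts = marker_label_counts(training)
print(counts["ko"].most_common(1))    # → [('ARG2', 2)]
print(counts["mein"].most_common(1))  # → [('ARGM-LOC', 2)]
```

A skewed distribution like this is exactly what biases the classifier toward the majority label for an ambiguous marker.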
10. Conclusion
In this work, we have presented a supervised semantic role
labeler for Hindi and Urdu. We have used linguistic features
like predicate, head, head-POS, phrase type etc., as well as
combinations of certain features. In all, 10 features have
been used to guide the classifiers in identifying and
classifying the arguments of a verb.
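The overall pipeline, a binary identification stage followed by multi-class classification, can be sketched structurally as follows. The rule-based stand-ins below are purely illustrative placeholders for the actual trained classifiers (Logistic Regression for identification, a linear-kernel SVM for classification), and the feature dicts are toy data:

```python
# Structural sketch of the two-stage SRL pipeline. Stage 1 decides whether a
# word is an argument of the verb; stage 2 labels the identified arguments.
# Both stages here are toy rules, not the trained models.

def identify(word_features):
    """Stage 1 stand-in: is this word an argument of the predicate?"""
    return word_features.get("pos") != "VM"  # toy rule: skip the verb itself

def classify(word_features):
    """Stage 2 stand-in: assign a PropBank label to an identified argument."""
    marker_to_label = {"ne": "ARG0", "ko": "ARG1", "mein": "ARGM-LOC"}
    return marker_to_label.get(word_features.get("marker"), "ARG1")

def srl_pipeline(sentence_features):
    """Run identification, then classification, over a featurized sentence."""
    return [classify(f) if identify(f) else None for f in sentence_features]

sentence = [
    {"word": "yuwak", "pos": "NN", "marker": "ne"},
    {"word": "putri", "pos": "NN", "marker": "ko"},
    {"word": "ghumata", "pos": "VM", "marker": None},
]
print(srl_pipeline(sentence))  # → ['ARG0', 'ARG1', None]
```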
We further used the gold (dependency) parses as a feature,
which had a substantial impact on the overall accuracy of the
systems. Subsequently, we extracted automatic parses using
state-of-the-art Hindi and Urdu parsers and used them as
features in our SRL. We believe that dependency relations are
the best features for classifying the semantic arguments of a
predicate because of their close correspondence with semantic
labels.
11. Acknowledgements
We would like to thank Mr. Mujadia Vandan for the discussions
we had with him, which gave us better insight into the task.
12. Bibliographical References
Begum, R., Husain, S., Dhwaj, A., Sharma, D. M., Bai, L.,
and Sangal, R. (2008). Dependency annotation scheme for
Indian languages. In IJCNLP, pages 721–726. Citeseer.
Bhat, R. A., Jain, S., and Sharma, D. M. (2012). Experiments
on dependency parsing of Urdu. In The 11th International
Workshop on Treebanks and Linguistic Theories.
Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma,
D. M., and Xia, F. (2009). A multi-representational and
multi-layered treebank for Hindi/Urdu. In Proceedings of
the Third Linguistic Annotation Workshop, pages 186–189.
Association for Computational Linguistics.
Chen, J. and Rambow, O. (2003). Use of deep linguistic
features for the recognition and labeling of semantic ar-
guments. In Proceedings of the 2003 conference on Em-
pirical methods in natural language processing, pages
41–48. Association for Computational Linguistics.
Cohn, T. and Blunsom, P. (2005). Semantic role labelling
with tree conditional random fields. In Proceedings of
the Ninth Conference on Computational Natural Lan-
guage Learning, pages 169–172. Association for Com-
putational Linguistics.
Gildea, D. and Hockenmaier, J. (2003). Identifying seman-
tic roles using combinatory categorial grammar. In Pro-
ceedings of the 2003 conference on Empirical methods
in natural language processing, pages 57–64. Associa-
tion for Computational Linguistics.
Gildea, D. and Jurafsky, D. (2002). Automatic labeling
of semantic roles. Computational linguistics, 28(3):245–
288.
Johansson, R. and Nugues, P. (2006). A FrameNet-based
semantic role labeler for Swedish. In Proceedings of
the COLING/ACL on Main conference poster sessions,
pages 436–443. Association for Computational Linguis-
tics.
Kingsbury, P. and Palmer, M. (2003). PropBank: the next
level of treebank. In Proceedings of Treebanks and Lexical
Theories, volume 3. Citeseer.
Kosaraju, P., Kesidi, S. R., Ainavolu, V. B. R., and
Kukkadapu, P. (2010). Experiments on Indian language
dependency parsing. Proceedings of the ICON10 NLP
Tools Contest: Indian Language Dependency Parsing.
Nivre, J., Hall, J., and Nilsson, J. (2006). MaltParser: A
data-driven parser-generator for dependency parsing. In
Proceedings of LREC, volume 6, pages 2216–2219.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., et al. (2011). Scikit-learn:
Machine learning in Python. The Journal of Machine
Learning Research, 12:2825–2830.
Pradhan, S. S., Ward, W., Hacioglu, K., Martin, J. H., and
Jurafsky, D. (2004). Shallow semantic parsing using
support vector machines. In HLT-NAACL, pages 233–
240.
Surdeanu, M., Harabagiu, S., Williams, J., and Aarseth,
P. (2003). Using predicate-argument structures for in-
formation extraction. In Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics-
Volume 1, pages 8–15. Association for Computational
Linguistics.
Swier, R. S. and Stevenson, S. (2004). Unsupervised se-
mantic role labelling. In Proceedings of EMNLP, vol-
ume 95, page 102.
Toutanova, K., Haghighi, A., and Manning, C. D. (2005).
Joint learning improves semantic role labeling. In Pro-
ceedings of the 43rd Annual Meeting on Association for
Computational Linguistics, pages 589–596. Association
for Computational Linguistics.
Vaidya, A., Choi, J. D., Palmer, M., and Narasimhan, B.
(2011). Analysis of the Hindi Proposition Bank using
dependency structure. In Proceedings of the 5th Linguistic
Annotation Workshop, pages 21–29. Association for
Computational Linguistics.
Xue, N. and Palmer, M. (2004). Calibrating features for
semantic role labeling. In EMNLP, pages 88–94.
Xue, N. and Palmer, M. (2005). Automatic semantic role
labeling for Chinese verbs. In IJCAI, volume 5, pages
1160–1165. Citeseer.