Relation extraction, one of the core tasks of natural language processing, is the process of extracting relationships from text. Extracted relationships hold between two or more entities of a certain type, and these relations may follow different textual patterns. The goal of this paper is to identify noisy patterns in relation extraction for Bangla sentences. The work requires seed tuples, each containing two entities and the relation between them. For English, such seed tuples can be drawn from Freebase, a large collaborative knowledge base of general, structured information for public use; no comparable resource exists for Bangla. We therefore built a Bangla Freebase, which was the main challenge of this work and which can be reused in other Bangla NLP tasks. Using it, we identify noisy patterns for relation extraction by measuring a conflict score.
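The abstract does not spell out how the conflict score is computed, but the idea can be sketched: a candidate pattern is noisy if the entity pairs it matches frequently contradict the relations recorded in the seed tuples. The `conflict_score` definition below (matches that disagree with the seed relation, divided by all matches covered by the seeds) is an illustrative assumption, not the paper's exact formula.

```python
# Illustrative sketch: scoring candidate extraction patterns against seed
# tuples. The specific conflict-score formula is an assumption.
from collections import defaultdict

# Seed tuples: (entity1, entity2) -> relation, e.g. from a (Bangla) Freebase.
seeds = {
    ("Dhaka", "Bangladesh"): "capital_of",
    ("Rabindranath Tagore", "Gitanjali"): "author_of",
}

# Pattern matches found in a corpus: pattern -> list of (e1, e2, relation).
pattern_matches = defaultdict(list)
pattern_matches["X of Y"] = [
    ("Dhaka", "Bangladesh", "capital_of"),
    ("Rabindranath Tagore", "Gitanjali", "capital_of"),  # disagrees with seed
]

def conflict_score(matches, seeds):
    """Fraction of seed-covered matches whose relation disagrees with the seed."""
    covered = [(e1, e2, r) for e1, e2, r in matches if (e1, e2) in seeds]
    if not covered:
        return 0.0
    conflicts = sum(1 for e1, e2, r in covered if seeds[(e1, e2)] != r)
    return conflicts / len(covered)

for pattern, matches in pattern_matches.items():
    print(pattern, conflict_score(matches, seeds))  # "X of Y" -> 0.5
```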
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural Language Processing (NLP) techniques are among the most widely used techniques in computer applications, and the field has grown vast and advanced. Language is the means of communication among humans, and in the present scenario, when nearly everything is computerized, communication between computers and humans has become a necessity. NLP emerged to fulfill this necessity as a means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was shaped by the Turing test, though early systems were limited to small datasets. Later, various algorithms were developed alongside concepts from Artificial Intelligence (AI) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed to date, their applications, and a comparison of these techniques across different parameters.
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA (ijistjournal)
Ontologies have been applied to many applications in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to eliminate conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, along with their definitions and interrelationships. This paper describes algorithms for identifying semantic relations and constructing an Information Technology ontology while extracting concepts and objects from different sources. The ontology is constructed from three main resources: ACM, Wikipedia, and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use NLP tools such as OpenNLP and the Stanford lexical dependency parser to analyze sentences, then extract sentences matching English patterns to build a training set. We evaluate our results on a random sample drawn from 245 ACM categories. The results show that our system yields superior performance.
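The abstract mentions extracting sentences that match English patterns; a minimal sketch of one such lexical pattern (a Hearst-style "such as" hypernym pattern, chosen here for illustration rather than taken from the paper) might look like this:

```python
# Illustrative sketch: harvesting candidate (hypernym, hyponym) relations
# with a single Hearst-style lexical pattern. The pattern choice is an
# assumption; the paper's actual English patterns are not listed here.
import re

# Captures only the head noun before "such as" and one named example after.
PATTERN = re.compile(r"(\w+)\s+such\s+as\s+([A-Z]\w+)")

sentences = [
    "Programming languages such as Java are taught in most curricula.",
    "The lab uses databases such as PostgreSQL for storage.",
]

for sentence in sentences:
    match = PATTERN.search(sentence)
    if match:
        hypernym, hyponym = match.group(1), match.group(2)
        print((hypernym, "is_a_superclass_of", hyponym))
```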
There is a vast amount of unstructured Arabic information on the Web; this data is usually organized as semi-structured text and cannot be used directly. This research proposes a semi-supervised technique that extracts binary relations between two Arabic named entities from the Web. Several works have addressed relation extraction from Latin-script texts, but as far as we know, there is no prior work on Arabic text using a semi-supervised technique. The goal of this research is to extract a large list or table of named entities and relations in a specific domain. Only a small handful of instance relations is required as input from the user. The system exploits summaries from the Google search engine as its source text. These instances are used to extract patterns, and the output is a set of new entities and their relations. Results from four experiments show that precision and recall vary with relation type: precision ranges from 0.61 to 0.75, while recall ranges from 0.71 to 0.83. The best result is obtained for the (player, club) relationship, with 0.72 precision and 0.83 recall.
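The semi-supervised loop described here (seed pairs, search-engine summaries, pattern induction, new pairs) can be sketched in a few lines; the `fetch_snippets` helper below is a hypothetical stand-in for the summary-retrieval step, not a real API, and the examples are invented.

```python
# Minimal bootstrapping sketch of the described pipeline. fetch_snippets()
# is a hypothetical placeholder for querying a search engine.
import re

def fetch_snippets(query):
    # Placeholder: the real system retrieves search-engine summaries here.
    return ["Messi plays for Barcelona in La Liga."]

def induce_pattern(snippet, e1, e2):
    """Turn 'e1 <context> e2' in a snippet into a reusable regex pattern."""
    m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), snippet)
    return r"(\w+)" + re.escape(m.group(1)) + r"(\w+)" if m else None

seed_pairs = [("Messi", "Barcelona")]
patterns = set()
for e1, e2 in seed_pairs:                          # step 1: search with seeds
    for snippet in fetch_snippets('"%s" "%s"' % (e1, e2)):
        pattern = induce_pattern(snippet, e1, e2)  # step 2: induce a pattern
        if pattern:
            patterns.add(pattern)

corpus = ["Ronaldo plays for Madrid in La Liga."]
extracted = set()
for pattern in patterns:                           # step 3: match new pairs
    for text in corpus:
        for m in re.finditer(pattern, text):
            extracted.add((m.group(1), m.group(2)))

print(extracted)  # {('Ronaldo', 'Madrid')}
```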
The Process of Information Extraction through Natural Language Processing (Waqas Tariq)
Information Retrieval (IR) is the discipline that deals with the retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured (e.g., a sentence or even another document) or structured (e.g., a Boolean expression). The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. The topics covered include: formulation of structured and unstructured queries and topic statements; indexing (including term weighting) of document collections; methods for computing the similarity of queries and documents; classification and routing of documents in an incoming stream to users on the basis of topic or need statements; clustering of document collections on the basis of language or topic; and statistical, probabilistic, and semantic methods of analyzing and retrieving documents. Information extraction from text has therefore been pursued actively as an attempt to present knowledge from published material in a computer-readable format. An automated extraction tool would not only save time and effort but also pave the way to discovering hitherto unknown information that is conveyed only implicitly. Work in this area has focused on extracting a wide range of information, such as chromosomal locations of genes, protein functional information, associations of genes by functional relevance, and relationships between entities of interest. While clinical records provide a semi-structured, technically rich data source for mining information, publications in their unstructured format pose a greater challenge, addressed by many approaches.
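Term weighting plus query-document similarity, which this abstract lists as core IR machinery, is commonly realized as TF-IDF vectors compared by cosine similarity. A minimal sketch using scikit-learn (our choice of library, not the paper's) is:

```python
# Minimal TF-IDF + cosine-similarity retrieval sketch (library choice is ours).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Genes are located on chromosomes.",
    "Protein function is described in the literature.",
]
query = ["Where is a gene located on a chromosome?"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # index the collection
query_vector = vectorizer.transform(query)         # weight the query terms

scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(documents[best], scores[best])
```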
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is a bag of terms, which is often unsatisfactory because it does not exploit semantics. In this paper, texts are instead represented in terms of the synsets corresponding to each word; the bag-of-terms representation is thus enriched with synonyms from WordNet. K-means, PSO, and the hybrid PSO+K-means algorithm are applied to clustering text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
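The synset enrichment step can be sketched with NLTK's WordNet interface; English WordNet is assumed here for illustration (the paper works with Nepali, for which resources differ):

```python
# Sketch of enriching a bag of terms with WordNet synonyms (English WordNet
# shown for illustration; the paper targets Nepali).
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def enrich_with_synsets(tokens):
    enriched = set(tokens)
    for token in tokens:
        for synset in wn.synsets(token):
            enriched.update(l.replace("_", " ") for l in synset.lemma_names())
    return enriched

print(enrich_with_synsets(["document", "cluster"]))
```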
An Information Retrieval system is an effective tool that helps a user trace relevant information using Natural Language Processing (NLP). In this research paper, we present an algorithmic Bengali Information Retrieval System (BIRS) that is mathematically and statistically grounded. The paper demonstrates two algorithms for lemmatizing Bengali words, a Trie and Dictionary Based Search by Removing Affix (DBSRA), and compares them with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs' algorithm to recover the correct referents. For question answering, TF-IDF and Cosine Similarity are used to find the most accurate answer in the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We also developed corpora of Bengali root words, synonyms, and stop words, and gathered 672 articles from the popular Bengali newspaper 'The Daily Prothom Alo' as our document collection. To test the system, we created 19,335 questions over this collection and obtained 97.22% accurate answers.
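Of the lemmatization strategies named above, the Edit Distance comparison is the easiest to sketch: pick the root word from the corpus with the smallest Levenshtein distance to the surface form. The roots shown are romanized placeholders, since the actual corpus is Bengali.

```python
# Edit-distance lemmatization sketch: choose the closest known root word.
# Romanized placeholder roots stand in for the Bengali root-word corpus.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

root_words = ["kora", "jaoa", "khaoa"]  # placeholder root-word corpus

def lemmatize(word):
    return min(root_words, key=lambda root: edit_distance(word, root))

print(lemmatize("korchi"))  # -> "kora"
```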
This presentation was held as a guest lecture on corpus linguistics at the University of Paderborn, Germany, on 8 November 2007. I'd like to thank my colleague Anette Rosenbach for inviting me as part of her "Web as Corpus" seminar.
Swoogle: Showcasing the Significance of Semantic Search (IDES Editor)
The World Wide Web hosts vast repositories of information. Retrieving required information from the Internet is a great challenge, since computer applications understand only the structure and layout of web pages and have no access to their intended meaning. The Semantic Web is an effort to enhance the Internet so that computers can process the information presented on the WWW, interpret and communicate with it, and help humans find required essential knowledge. The application of ontologies is the predominant approach driving the evolution of the Semantic Web. The aim of our work is to illustrate how Swoogle, a semantic search engine, helps make computers and the WWW interoperable and more intelligent. In this paper, we discuss issues related to traditional and semantic web searching and outline how an understanding of the semantics of the search terms can be used to provide better results. The experimental results establish that semantic search provides more focused results than traditional search.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether identical words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter homonyms out of the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
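The paper's three filtering strategies are not detailed in this abstract; as a stand-in illustration, one naive semantic filter would drop words that WordNet lists with more than one sense before computing document overlap. This is purely an assumed baseline, not one of the paper's strategies.

```python
# Naive illustrative filter (NOT one of the paper's three strategies):
# drop potentially homonymous words, i.e. words with several WordNet senses,
# before comparing documents.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def unambiguous_terms(tokens):
    return {t for t in tokens if len(wn.synsets(t)) == 1}

doc_a = ["bank", "river", "erosion"]
doc_b = ["bank", "loan", "interest"]

a, b = unambiguous_terms(doc_a), unambiguous_terms(doc_b)
jaccard = len(a & b) / len(a | b) if a | b else 0.0
print(jaccard)  # "bank" is ambiguous, so it no longer inflates the overlap
```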
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS (kevig)
In this paper, a methodology is proposed to mine concepts from documents and use these concepts to generate an objective summary of the claims section of patent documents. The Conceptual Graph (CG) formalism proposed by Sowa (Sowa 1984) is used in this work for representing concepts and their relationships. Automatic identification of concepts and conceptual relations in text documents is a challenging task. This work focuses on the analysis of patent documents, mainly on the claims section, whose writing style is complex because the documents are both technical and legal. We observed that the general in-depth parsers available in the open domain fail to parse claims-section sentences in patent documents, which motivated us to develop a methodology for extracting CGs using other resources. The present work therefore uses shallow parsing, NER, and machine learning techniques for extracting concepts and conceptual relationships from sentences in the claims section. This paper thus discusses (i) generation of the CG, a semantic network, and (ii) generation of an abstractive summary of the claims section, with the aim of producing a summary that is 30% of the whole claims section. We use Restricted Boltzmann Machines (RBMs), a deep learning technique, for automatically extracting CGs. We have tested our methodology on a corpus of 5000 patent documents from the electronics domain, and the results obtained are encouraging and comparable with state-of-the-art systems.
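The abstract does not describe the RBM setup; a minimal sketch of the general idea (learning a latent representation of bag-of-words claim sentences with scikit-learn's `BernoulliRBM`, with hyperparameters and the tiny corpus invented for illustration) is shown below.

```python
# Illustrative RBM over bag-of-words claim sentences; all hyperparameters
# and the miniature corpus are invented for the sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM

claims = [
    "A device comprising a sensor coupled to a processor.",
    "The processor is configured to transmit sensor data.",
]

X = CountVectorizer(binary=True).fit_transform(claims)  # binary visible units
rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
hidden = rbm.fit_transform(X)  # latent features per sentence
print(hidden.shape)            # (2, 8)
```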
This paper proposes a natural-language discourse analysis method for extracting information from news articles in different domains. The discourse analysis uses Rhetorical Structure Theory (RST) to find the coherent groups of text that are most prominent for information extraction; RST's nucleus-satellite concept identifies the most prominent text in a document. After discourse analysis, text analysis is performed to extract domain-related objects and relate them. For extracting the information, a knowledge-based system is used that consists of a domain dictionary, a bag of words for each domain. The system is evaluated against a gold-standard analysis and human judgments of the extracted information.
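The domain dictionary described above is just a bag of words per domain; a toy sketch of matching text against such a dictionary (domains and word lists invented for illustration) could be:

```python
# Toy domain-dictionary matcher; the domains and word bags are invented.
domain_dictionary = {
    "sports": {"match", "goal", "team", "tournament"},
    "finance": {"market", "shares", "bank", "profit"},
}

def detect_domain(tokens):
    scores = {d: len(words & set(tokens))
              for d, words in domain_dictionary.items()}
    return max(scores, key=scores.get)

print(detect_domain(["the", "team", "scored", "a", "goal"]))  # sports
```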
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) for extracting keywords and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences requested by the user, a summary is generated.
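The GSS coefficient for a term t and category c is usually defined as GSS(t, c) = P(t, c)·P(t̄, c̄) − P(t, c̄)·P(t̄, c); a small sketch of scoring terms this way from document counts (toy numbers, not the paper's data) is:

```python
# GSS coefficient sketch from document counts (toy numbers, not paper data).
# GSS(t, c) = P(t, c) * P(not t, not c) - P(t, not c) * P(not t, c)

def gss(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    p = lambda count: count / n
    return p(n_tc) * p(n_nott_notc) - p(n_t_notc) * p(n_nott_c)

# Term appears in 40 of 50 in-category docs, and in 5 of 150 other docs.
print(gss(n_tc=40, n_t_notc=5, n_nott_c=10, n_nott_notc=145))
```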
Different valuable tools for Arabic sentiment analysis: a comparative evaluat... (IJECEIAES)
Arabic Natural Language Processing (ANLP) is a subfield of artificial intelligence (AI) that builds applications for the Arabic language, such as Arabic sentiment analysis (ASA): the task of classifying the feelings and emotions expressed in a text to determine the attitude of the writer (neutral, negative, or positive). When working on ASA, researchers often use tools in their projects without explaining the reason for the choice, or they choose a set of libraries according to their knowledge of a specific programming language. Because of the abundance of their libraries in the ANLP field, and especially in ASA, we rely on the Java and Python programming languages in our research work. This paper makes an in-depth comparative evaluation of valuable Python and Java libraries to deduce the most useful ones for Arabic sentiment analysis. Based on a large variety of influential works in the ASA domain, we deduce that the NLTK, Gensim, and TextBlob libraries are the most useful for ASA in Python. For Java, we conclude that the Weka and CoreNLP tools are the most used and produce strong results in this research domain.
Association Rule Mining Based Extraction of Semantic Relations Using Markov L... (IJwest)
An ontology is a conceptualization of a domain into a human-understandable but machine-readable format consisting of entities, attributes, relationships, and axioms. Ontologies formalize the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations. With semantic relations it would be possible, for example, to extract the whole family tree of a prominent personality using a resource like Wikipedia. Relations describe the semantic relationships among the entities involved, which is useful for a better understanding of human language. Relations can be identified from the result of concept hierarchy extraction; however, the existing ontology learning process produces only the concept hierarchy, not the semantic relations between concepts. Here, we construct predicates and first-order logic formulas, and we perform inference and learn weights using a Markov Logic Network. To improve the relations for every input and between contents, we propose the concept of ARSRE, which can find frequent items between concepts and convert existing lightweight ontologies into formal ones. The experimental results show good extraction of semantic relations compared to the state-of-the-art method.
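The association-rule side of this approach boils down to counting how often concept pairs co-occur and keeping pairs with enough support; a tiny pure-Python sketch (transactions and threshold invented for illustration) is:

```python
# Frequent concept-pair mining sketch (Apriori-style support counting).
# Transactions and the support threshold are invented for illustration.
from itertools import combinations
from collections import Counter

transactions = [  # concepts co-occurring in one document each
    {"person", "birthplace", "city"},
    {"person", "spouse", "city"},
    {"person", "birthplace", "country"},
]
min_support = 2 / 3  # a pair must appear in at least 2 of 3 transactions

pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
frequent = {p: c / len(transactions)
            for p, c in pair_counts.items()
            if c / len(transactions) >= min_support}
print(frequent)  # ('birthplace', 'person') and ('city', 'person') survive
```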
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH (cscpconf)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far there have been many studies of concept mining in English, but this area of study is still immature for Turkish, an agglutinative language. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but since dictionary entries contain synonyms, hypernyms, hyponyms, and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
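A dictionary-definition overlap scorer in the spirit of this method (and of the classic Lesk algorithm) can be sketched as follows; the miniature dictionary is invented for illustration:

```python
# Sketch: score candidate concepts by overlap between document tokens and
# dictionary definition texts (miniature invented dictionary).
dictionary = {
    "bank_finance": "an institution that accepts deposits and makes loans",
    "bank_river": "sloping land beside a body of water such as a river",
}

def best_concept(tokens):
    tokens = set(tokens)
    return max(dictionary,
               key=lambda c: len(tokens & set(dictionary[c].split())))

print(best_concept(["the", "river", "water", "rose"]))  # bank_river
```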
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. The developed method performs the following steps: tag the documents for parsing, replace idioms with their original meaning, calculate semantic weights for document words, and apply a semantic grammar. A similarity measure is computed between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters is demonstrated.
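Two of those steps, idiom replacement followed by hierarchical clustering, can be sketched with SciPy; the idiom table and documents below are invented for illustration:

```python
# Sketch: replace idioms with literal meanings, then cluster hierarchically.
# The idiom table and documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

idioms = {"kick the bucket": "die", "piece of cake": "easy"}

def replace_idioms(text):
    for idiom, meaning in idioms.items():
        text = text.replace(idiom, meaning)
    return text

docs = [
    "the exam was a piece of cake",
    "the test was easy",
    "the old dog may kick the bucket soon",
]
X = TfidfVectorizer().fit_transform([replace_idioms(d) for d in docs])

Z = linkage(X.toarray(), method="average", metric="cosine")
print(fcluster(Z, t=2, criterion="maxclust"))  # exam docs vs dog doc
```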
Effect of Query Formation on Web Search Engine Results (kevig)
Queries to a search engine are generally based on natural language. A query can be expressed in more than one way without changing its meaning, depending on what the searcher has in mind at a particular moment; the searcher's aim is to get the most relevant results regardless of how the query is expressed. In the present paper, we examine search engine results for changes in coverage and in the similarity of the first few results when a query is entered in two semantically equivalent but differently worded formats. Searches were made through the Google search engine, and fifteen pairs of queries were chosen for the study. The t-test was used to check the results on the basis of the total documents found and the similarity of the first five and first ten documents returned for a query entered in the two different formats. It was found that the total coverage is the same, but the first few results differ significantly.
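The statistical comparison described here amounts to a paired t-test over the fifteen query pairs; a sketch with SciPy (made-up counts standing in for the study's measurements) is:

```python
# Paired t-test sketch over query-pair measurements (made-up numbers).
from scipy import stats

# Total documents reported for each of 15 queries under two wordings.
wording_a = [70, 61, 85, 52, 74, 66, 91, 55, 63, 78, 82, 60, 57, 73, 68]
wording_b = [68, 63, 83, 54, 71, 65, 93, 52, 66, 75, 80, 62, 55, 70, 69]

t_stat, p_value = stats.ttest_rel(wording_a, wording_b)
print(t_stat, p_value)  # p < 0.05 would indicate significantly different coverage
```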
Investigations of the Distributions of Phonemic Durations in Hindi and Dogri (kevig)
Speech generation is one of the most important areas of research in speech signal processing and is now gaining serious attention. Speech is a natural form of communication among living beings. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of more natural man-machine interfaces. However, in order to provide functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate more natural-sounding systems. This field of study, also called speech synthesis and more prominently known as text-to-speech synthesis, originated in the mid-eighties with the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field, it is necessary to understand the basic theory of speech production; every language has a different phonetic alphabet and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we recorded five speakers in Dogri (3 male and 5 female) and eight speakers in Hindi (4 male and 4 female). For estimating the durational distributions, the mean of the means of ten instances of each vowel per speaker in both languages was calculated. Investigations have shown that the two durational distributions differ significantly in mean and standard deviation, and that phoneme duration is speaker dependent. The investigation concludes that almost all Dogri phonemes have shorter durations than their Hindi counterparts: the durations in milliseconds of the same phonemes were longer when uttered in Hindi than when spoken by a person with Dogri as their mother tongue. There are many applications directly or indirectly related to this research. The main application may be transforming Dogri speech into Hindi and vice versa; building on this, a speech aid could be developed to teach Dogri to children. The results may also be useful for synthesizing Dogri phonemes using the parameters of Hindi phonemes and for building large-vocabulary speech recognition systems.
May 2024 - Top10 Cited Articles in Natural Language Computing (kevig)
Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perception
Speech is an important biological signal, the primary mode of communication among human beings and the most natural and efficient form of exchanging information. Speech processing is an important aspect of signal processing. In this paper the linear-algebra technique called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving the important parameters of a signal. The parameters derived using SVD may further be reduced by perceptual evaluation of the synthesized speech using only the perceptually important parameters, so that the speech signal can be compressed and its information transmitted in compressed form without losing quality. This technique finds wide application in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD-based feature selection of the input speech on the perception of the processed speech signal. The vowels \a\, \e\, \u\ were recorded from each of six speakers (3 males and 3 females). The vowels were analyzed using SVD-based processing, and the effect of the reduction in singular values was investigated on the perception of the vowels resynthesized from the reduced singular values. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on relevance-judgment tasks using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. We observed two main findings. First, prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant’, indicating that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of ‘relevance’. While the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps to define this balance more precisely; by providing clearer contexts for the term ‘relevance’, few-shot examples contribute to refining the relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.
Genetic Approach for Arabic Part-of-Speech Tagging
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts of speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part-of-speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabi
Machine transliteration has emerged as a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of named entities plays a very significant role in improving the quality of machine translation.
In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based
approach. We have constructed rules for syllabification, the process of extracting or
separating syllables from words. In this approach we calculate probabilities for named entities (proper
names and locations). For words which do not fall under the category of named entities, separate
probabilities are calculated using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities we transliterate our input text from English to
Punjabi.
Improving Dialogue Management Through Data Optimization
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structure
Over the years there has been ongoing interest in detecting the authorship of a text based on statistical properties of the text, such as occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine the authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower-dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generation
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLMs (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phase. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLMs and generative models) while consuming much less energy and requiring fewer resources. These findings suggest that companies should weigh additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to begin considering energy consumption in model evaluations, in order to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Language
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from a human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct-answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% with 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Information Retrieval (IR) is a very important and vast area. When searching for a context, the web returns all
the results related to the query, and identifying the relevant results is a tedious task for the user. Word Sense
Disambiguation (WSD) is the process of identifying the sense of a word in its textual context when the word has
multiple meanings. We have used WSD approaches. This paper presents a proposed Dynamic Page
Rank algorithm that is an improved version of the Page Rank algorithm. The proposed Dynamic Page Rank
algorithm gives much better results than Google's existing Page Rank algorithm. To demonstrate this we have
calculated the Reciprocal Rank for both algorithms and presented comparative results.
Finding out Noisy Patterns for Relation Extraction of Bangla Sentences
International Journal on Natural Language Computing (IJNLC) Vol.9, No.1, February 2020
DOI: 10.5121/ijnlc.2020.9102
FINDING OUT NOISY PATTERNS FOR RELATION
EXTRACTION OF BANGLA SENTENCES
Rukaiya Habib and Md. Musfique Anwar
Department of Computer Science and Engineering, Jahangirnagar University, Savar,
Bangladesh
rukaiya.habib45@gmail.com, manwar@juniv.com
ABSTRACT
Relation extraction is one of the most important parts of natural language processing. It is the process of
extracting relationships from a text. Extracted relationships actually occur between two or more entities of
a certain type and these relations may have different patterns. The goal of the paper is to find out the noisy
patterns for relation extraction of Bangla sentences. For the research work, seed tuples were needed
containing two entities and the relation between them. We can get seed tuples from Freebase. Freebase is a
large collaborative knowledge base and database of general, structured information for public use. But for
Bangla language, there is no available Freebase. So we made Bangla Freebase which was the real
challenge and it can be used for any other NLP based works. Then we tried to find out the noisy patterns
for relation extraction by measuring conflict score.
KEYWORDS
Natural Language Processing, Relation Extraction, Bangla, Conflict Score, Noisy Pattern
1. INTRODUCTION
Natural language processing (NLP) is a branch of artificial intelligence that deals with the
interaction between humans and computers through human language. Its goal is to bridge the gap
between human communication and computer understanding. Relation extraction (RE) is a
fundamental topic of NLP: the task of finding semantic relationships between pairs
of entities. Relation extraction is essential for many well-known tasks such as knowledge base
completion, question answering, medical science and ontology construction [1]. A great deal of
unstructured electronic text is available on the web, such as newspapers, articles, journals,
blogs, and government and private documents. Unstructured text can be turned into structured
text by annotating semantic information.
Here, entities can be persons, organizations or locations. We have to identify the entity types in a
sentence. A relation is defined in the form of a tuple t = (e1, e2, ..., en), where the ei are entities in
a predefined relation r within a document D. Most relation extraction systems focus on binary
relations. Examples of binary relations include born-in(Ruma, Dhaka) and father-of(John David, Eric
David) [8]. For relation extraction, there are different methods: the supervised method, the distant
supervision method and the unsupervised method.
In the supervised method, sentences in a corpus are first hand-labeled for the presence of entities and
relations between them. Lexical, syntactic and semantic features are extracted, as in the Automatic
Content Extraction (ACE) program, to build supervised classifiers that label the relation between a
given pair of entities in a test-set sentence. Labeled training data is expensive to produce and thus
limited in quantity. Another is the distant supervision method for relation extraction, which aligns
texts to a given KB and uses the alignment to learn a relation extractor [2]. Such methods use large
structured data sources (such as Freebase) as the distant supervision information. As
these methods do not need a hand-labeled dataset, and as KBs have been growing fast, they are more
efficient.
In our work, we tried to find out noisy patterns for relation extraction of Bangla sentences using
the distant supervision method. For this method, we need seed tuples, which we can get from a
knowledge base such as Freebase. But there is no Freebase available for Bangla, so we built a
Bangla Freebase which contains a large number of relations.
Fig 1: Noisy pattern identification steps for RE of Bangla sentences
2. PREVIOUS STUDY
Distant supervision can be introduced as an efficient method to scale relation extraction to very
large corpora containing many relations. The authors of [3] proposed a sentence-level attention
model to select the valid instances, which makes full use of the supervision information from
knowledge bases. Entity descriptions from Freebase and Wikipedia pages were also extracted to
supplement background knowledge for their task. The background knowledge not only
provides more information for predicting relations, but also brings better entity representations
for the attention module. Three experiments were conducted on a widely used dataset, and
the experimental results showed that their approach outperforms all the baseline systems
significantly [3].
Modern models of relation extraction for tasks like ACE are based on supervised learning of
relations from small hand-labeled corpora. The authors of [2] used a paradigm that does not require
labeled corpora. This paradigm avoids the domain dependence of ACE-style algorithms
and allows the use of corpora of any size. The experiment used Freebase, a large semantic
database of several thousand relations, to provide distant supervision. For each pair
of entities that appears in some Freebase relation, all sentences containing those entities in a large
unlabeled corpus were selected, and textual features were extracted to train a relation classifier.
Their algorithm combines the advantages of supervised IE and unsupervised IE [2].
There have been many works on relation extraction of entities in English. In this work, we have
worked on relation extraction of Bangla sentences, on which not much research has been
done. So it will be very beneficial for this language. This is the novelty of our work.
3. CREATING BANGLA FREEBASE
Freebase is a large collaborative knowledge base and database of general, structured information
for public use. Its structured data had been harvested from many sources, including individual,
user-submitted wiki contributions. Its aim is to create a global resource so that people (and
machines) can access common information more effectively [9]. It is available in English.
In Freebase, triples have the format (e1, r, e2), where e1 and e2 are the two entities and r
defines the relation. So relations can be found in a known KG, and a large amount of data can be
generated [4]. As mentioned before, we created our own Bangla Freebase, which contains a
large number of relations, with the help of the Wikidata query service and the SPARQL query language.
The number of Bangla articles on the internet is growing day by day, so it has become
necessary to have a structured data store in Bangla. It consists of different types of concepts (topics)
and relationships between those topics. These include different areas such as popular culture
(e.g. films, music, books, sports), location information (restaurants, locations, businesses),
scholarly information (linguistics, biology, astronomy), birthplaces of poets, politicians, actors
and actresses, and general knowledge (Wikipedia). Here we collected more than 100 relations
according to our need. By using SPARQL queries, anyone can find their required relations, so this
knowledge base is very helpful. It will be even more helpful for relation extraction or any other
kind of NLP (Natural Language Processing) work on the Bangla language.
3.1. Wikidata Query Service
Wikidata is a website that belongs to the Wikimedia family of websites. Data from Wikidata is
available in RDF dumps. RDF stands for Resource Description Framework, a general method
for describing data by defining relationships between data objects; it allows data integration
from multiple sources. RDF uses a triple format: a set of three entities that codifies a statement
about semantic data in the form of subject–predicate–object expressions [7]. Wikidata is a place
to store structured data in many languages. The basic entity in Wikidata is an item. An item can
be a thing, a place, a person, an idea or anything else. Wikidata has identifier numbers for
entities and properties.
3.1.1. Entity Identifier
As Wikidata treats all languages in the same way, items don't have names but generic identifiers.
Each identifier is the letter Q followed by a number. For example, the item about the capital of
Japan is not called "Tokyo" but Q1490. To give it a human-readable name, each item has a list of
labels in each language associated with it. So we will see that the English (en) label of Q1490 is
"Tokyo", and it likewise has a corresponding Japanese (ja) label, Bangla (bn) label and so on.
3.1.2. Property Identifier
Every item has a list of statements associated with it. Each statement has a “property” and a
“value”. There is a long list of possible properties. Like items, properties have generic identifiers,
but they begin with the letter P and not Q. For example, the property to indicate the country is
P17, and it has the label "country" in English. The value of P17 (country) for Q1490 (Tokyo) is
Q17 (Japan).
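Taken together, such a statement is stored as a single RDF triple. Written in Turtle notation with the standard Wikidata prefixes (wd: for items, wdt: for properties), the example above reads:

    wd:Q1490  wdt:P17  wd:Q17 .    # (Tokyo) --country--> (Japan)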
3.2. SPARQL Query Process
Many companies nowadays need to extract information from texts such as complaints, either
scraped from the Web or received directly from clients, with the aim of finding actionable
knowledge inside them. SPARQL is a query language that can extract information from natural
language documents pre-annotated with NLP information. SPARQL stands for SPARQL Protocol
and RDF Query Language. It is an RDF query language, able to retrieve and manipulate data
stored in Resource Description Framework (RDF) format, and it allows queries to consist of
triple patterns. It was made a standard by the RDF Data Access Working Group (DAWG) of the
World Wide Web Consortium. SPARQL allows users to write queries against what can loosely be
called "key-value" data [6]. We followed these steps:

We retrieved data according to our need by making queries through an online query service
engine known as the Wikidata query service, available at https://query.wikidata.org/.

SPARQL is a standard query language technology endorsed by the World Wide Web Consortium
for querying any linked-data information source. To make a query, we have to understand
SPARQL syntax. The following query retrieves all the poets who are citizens of Bangladesh:
SELECT ?item ?itemLabel ?occupationLabel ?citizenshipLabel WHERE {
  ?item wdt:P31 wd:Q5.              # instance of (P31) human (Q5)
  ?item wdt:P106 ?occupation.       # occupation (P106)
  ?item wdt:P27 ?citizenship.       # country of citizenship (P27)
  FILTER (?citizenship = wd:Q902).  # Bangladesh (Q902)
  FILTER (?occupation = wd:Q49757). # poet (Q49757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "bn". }
}
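The same query can also be run programmatically against the public endpoint. A minimal Python sketch of this, assuming the requests package is installed (the function name run_query is ours, not part of any library):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    QUERY = """
    SELECT ?item ?itemLabel ?occupationLabel ?citizenshipLabel WHERE {
      ?item wdt:P31 wd:Q5.
      ?item wdt:P106 ?occupation.
      ?item wdt:P27 ?citizenship.
      FILTER (?citizenship = wd:Q902).
      FILTER (?occupation = wd:Q49757).
      SERVICE wikibase:label { bd:serviceParam wikibase:language "bn". }
    }
    """

    def run_query(query):
        """Send a SPARQL query to the Wikidata Query Service, return result rows."""
        response = requests.get(
            ENDPOINT,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    for row in run_query(QUERY):
        print(row["item"]["value"], row["itemLabel"]["value"])

Because the label service is asked for the "bn" language, the itemLabel values come back in Bangla, which is what makes such query results directly usable as entries of the Bangla Freebase.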
Table 1. A part of the Bangla Freebase (English glosses of the Bangla labels)

item                                    | itemLabel                | occupationLabel | citizenshipLabel
http://www.wikidata.org/entity/Q4665322 | (Abdul Gaffar Chowdhury) | (Poet)          | (Bangladesh)
http://www.wikidata.org/entity/Q4665454 | (Abdul Quadir)           | (Poet)          | (Bangladesh)
http://www.wikidata.org/entity/Q4667573 | (Abid Azad)              | (Poet)          | (Bangladesh)
http://www.wikidata.org/entity/Q4670213 | (Abu Hena Mustafa Kamal) | (Poet)          | (Bangladesh)
4. METHODOLOGY
Relation extraction is very significant in NLP-based work, and there are many methods for RE. In
our work, we use distant supervision for relation extraction. In distant supervision, an already
existing database such as Freebase (a knowledge base) is used, and from it we gather examples of
the relations we want to extract; these become our training data. For example, Freebase contains
the fact that Paris is the capital of France. We then label each pair of "France" and "Paris" that
appears in the same sentence as a positive example of the "capital_of_the_country" relation. In
this way a large amount of (possibly noisy) training data can be generated. For this research work
we needed seed tuples, which were collected from the Bangla Freebase made by us. The distant
supervision method, which is very efficient, has been used. Each seed tuple contains two entities
and their relation. There may be different types of entities, such as person, organization, location
or film. We then extracted features from the sentences containing those entities in a large corpus.
So we can say that our goal is to extract the relation between two entities from sentences in a
triple format and map the triple elements to an existing knowledge base [5]. After that, we decided
whether the extracted patterns are valid or not for each relation by measuring a conflict score.
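As a concrete illustration of this labeling step, here is a minimal Python sketch. The seed tuples and sentences are English stand-ins for the Bangla data, and the names (seed_tuples, distant_label) are ours, not from the paper's implementation:

    # Seed tuples taken from the (Bangla) Freebase: (entity1, entity2, relation).
    seed_tuples = [
        ("Humayun Ahmed", "Netrokona", "place_of_birth"),
        ("Rahim", "Rajshahi", "place_of_work"),
    ]

    # Sentences gathered from Wikipedia or another large unlabeled corpus.
    corpus = [
        "Humayun Ahmed was born in Netrokona district.",
        "Rahim was working in Rajshahi.",
        "Humayun Ahmed wrote many popular novels.",  # only one entity: skipped
    ]

    def distant_label(seed_tuples, corpus):
        """Label a sentence with relation r whenever both entities of a
        seed tuple (e1, e2, r) occur in it."""
        examples = []
        for sentence in corpus:
            for e1, e2, relation in seed_tuples:
                if e1 in sentence and e2 in sentence:
                    examples.append((sentence, e1, e2, relation))
        return examples

    for example in distant_label(seed_tuples, corpus):
        print(example)

Such co-occurrence matching is exactly what makes the training data "possibly noisy": a sentence can mention both entities without expressing the relation, which is why the conflict-score filtering of Section 4.4 is needed.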
4.1. Named Entity Recognition
For our work, we had to identify the entities in each sentence. For entity identification, we used
word-level features (e.g., token, prefix, suffix) and list-lookup features (e.g., gazetteers).
Gazetteers include names of countries, major cities, common person names, organization names, etc.
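A gazetteer lookup of this kind reduces to simple membership tests over tokens. A minimal sketch, assuming single-word gazetteer entries; the list contents are illustrative stand-ins, not the gazetteers used in this work:

    # Gazetteers: lookup lists for each entity type (illustrative contents).
    GAZETTEERS = {
        "PERSON": {"Jasimuddin", "Rahim", "Karim", "Nabila"},
        "LOCATION": {"Netrokona", "Faridpur", "Rajshahi", "Sylhet"},
    }

    def tag_entities(tokens):
        """Tag each token with an entity type from the gazetteers, or 'O' (other)."""
        tagged = []
        for token in tokens:
            entity_type = "O"
            for etype, names in GAZETTEERS.items():
                if token in names:
                    entity_type = etype
                    break
            tagged.append((token, entity_type))
        return tagged

    print(tag_entities(["Rahim", "works", "in", "Rajshahi"]))
    # [('Rahim', 'PERSON'), ('works', 'O'), ('in', 'O'), ('Rajshahi', 'LOCATION')]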
4.2. Preparation of the Corpus
A corpus is a collection of real-world text used in linguistics research. The intuition of our distant
supervision approach is to use the Freebase to give us a training set of relations and entity pairs
that participate in those relations. In Freebase there are hundreds of relations, and the Freebase
serves as the source of seed tuples. In our seed tuples, the Bangla synonyms of the relation names
were used. The seed tuples have different relations such as 'birth-place', 'working-place', 'actor',
'film-director' and 'film-producer', each represented by its Bangla synonym. Our seed tuples look
like the following.
Table 2. Seed tuples for relation extraction of Bangla sentences (English glosses of the Bangla entries)

Entity 1          | Entity 2          | Relation
(Humayun Ahmed)   | (Netrokona)       | (place_of_birth)
(Jasimuddin)      | (Faridpur)        | (place_of_birth)
(Rahim)           | (Rajshahi)        | (place_of_work)
(Aynabaji)        | (Amitabh Reza)    | (film_director)
(Monpura)         | (Giasuddin Selim) | (film_director)
(Karim)           | (Sylhet)          | (place_of_work)
(Nabila)          | (Monpura)         | (actress)
For each pair of entities in the seed tuples that appears in some Freebase relation, we used
Wikipedia because it is relatively up-to-date. We found all sentences containing both entities in
Wikipedia or a large unlabeled corpus and collected them. Then we worked on them and
extracted textual features. A part of our corpus looks like the following:
Table 3. A part of our Bangla corpus (English glosses of the Bangla sentences)

No. | Input sentence
1   | (Humayun Ahmed was born in Netrokona district.)
2   | (Tisha acted in Television film.)
3   | (Rabindranath was born in Kolkata.)
4   | (Amitabh Reza directed the Aynabaji film.)
5   | (Ruma has been living in Dhaka for five years.)
6   | (My younger sister, Shanu works at Dhaka.)
7   | (Catherine Masud produced the film 'Earthen Moina'.)
8   | (Rahim was working in Rajshahi.)
4.3. Preparing Data and Pattern Identification
Before identifying patterns, we had to prepare the data. The following steps are performed:

Tokenizing: We tokenized the data by inserting spaces between words and punctuation.

Cleaning: We cleaned the data by removing empty lines, extra spaces and lines that were too
short or too long.

Entity identification: In a given sentence, we identified the entity types and extracted the
relation patterns between them.

Chunking: In preprocessing, consecutive words with the same named entity tag are 'chunked',
like Ruma/PERSON Rahman/PERSON. So if Ruma and Rahman appear together in a sentence,
they will be chunked together as [Ruma Rahman]/PERSON.

Lexical features: To find the patterns, we worked on the lexical features of each sentence,
such as:
i) the sequence of words between the two entities, which is very important;
ii) a flag indicating which entity came first in the sentence;
iii) a window of n words to the left of Entity 1;
iv) a window of n words to the right of Entity 2.
These lexical features help us to identify the patterns of each relation. For example, given a
sentence from the corpus, we get two entities: a person entity at position 1 and a location entity
at position 4. At first we got one person-entity word and then another person-entity word; as they
are of the same type, they are chunked together into a single person entity. These two entities
were available in the seed tuple list. We then identified the positions of the person entity and the
location entity so that we could take the words between the two entities, which form the pattern
for the relation between them. We then filtered out the noisy patterns by using the conflict score.
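These features can be sketched in Python, continuing the illustrative English stand-ins used above (the function name and token layout are ours):

    def extract_features(tokens, e1_pos, e2_pos, n=2):
        """Lexical features of a tagged sentence: the word sequence between
        the two entities (the candidate pattern), an entity-order flag, and
        n-word windows to the left of Entity 1 and right of Entity 2."""
        first, second = sorted((e1_pos, e2_pos))
        return {
            "pattern": tokens[first + 1:second],  # words between the entities
            "entity1_first": e1_pos < e2_pos,     # which entity came first
            "left_window": tokens[max(0, e1_pos - n):e1_pos],
            "right_window": tokens[e2_pos + 1:e2_pos + 1 + n],
        }

    # Tokens of corpus sentence 1 after chunking the two-word person name:
    tokens = ["[Humayun Ahmed]", "was", "born", "in", "Netrokona", "district"]
    print(extract_features(tokens, e1_pos=0, e2_pos=4))
    # pattern -> ['was', 'born', 'in'], a candidate pattern for place_of_birth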
4.4. Conflict Score Formula
The main goal of our work is to identify the valid patterns for relations and to find the noisy
patterns among them. This can be done by computing a conflict score for each pattern. The
Bangla Freebase is used as the source of seed tuples, each containing two entities and their
relation. Our corpus has sentences containing these entities, which provide the distant
supervision. Our research work takes 5 relations: place_of_birth, place_of_work, living_place,
film_director and film_producer, each used in its Bangla synonym form. Sentences that contain
only one entity of a seed tuple create conflict for a relation. So by computing the conflict score,
we can decide which patterns are valid for a relation and thus find the noisy patterns, which are
invalid. The formula for the conflict score of a pattern is:

    Conflict Score = (number of patterns with conflict) / (number of valid patterns)

A threshold value is fixed at 0.3. If the conflict score is less than or equal to the threshold value,
the pattern is valid; otherwise the pattern is invalid, or noisy, for the relation. For person and
location entities we take three relations. A person's working place and birthplace are not
necessarily the same, which helps us to find the conflicting patterns.
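Scoring and thresholding each pattern is then straightforward. A minimal sketch, with counts taken from Table 4 below; the dictionary layout is illustrative:

    THRESHOLD = 0.3  # the fixed threshold used in this work

    # pattern gloss -> (number of patterns with conflict, number of valid patterns)
    pattern_counts = {
        "was born in/at": (1, 20),
        "works at": (9, 1),
        "birth place is": (3, 34),
        "has died": (7, 2),
    }

    def conflict_score(n_conflict, n_valid):
        """Conflict Score = patterns with conflict / valid patterns."""
        return n_conflict / n_valid

    for pattern, (n_conflict, n_valid) in pattern_counts.items():
        score = conflict_score(n_conflict, n_valid)
        verdict = "valid" if score <= THRESHOLD else "noisy"
        print(f"{pattern}: {score:.2f} -> {verdict}")
    # was born in/at: 0.05 -> valid ... has died: 3.50 -> noisy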
5. RESULT AND EVALUATION ON CONFLICT SCORE
We show our results on five relations for Bangla sentences. The threshold value is 0.3. If the
conflict score is less than or equal to 0.3, the pattern is considered a valid pattern for that relation;
otherwise it is a noisy pattern. In future work we will remove these noisy patterns to train our
relation classifier. The conflict scores are given in the following:
Relation 1: Here, the 'place_of_birth' relation is expressed by its Bangla synonym, and the
entities are person and location. For this relation the conflict scores of the different patterns are
given below:

Table 4. Valid pattern identification for the (place_of_birth) relation

No. | Pattern (English gloss) | Patterns with conflict | Valid patterns | Conflict score | Valid or invalid
1   | (was born in/at)        | 1                      | 20             | 0.05           | Valid
2   | (was born in/at)        | 2                      | 17             | 0.18           | Valid
3   | (works at)              | 9                      | 1              | 9              | Invalid
4   | (born in/at)            | 0                      | 14             | 0              | Valid
5   | (went to travel)        | 4                      | 2              | 2              | Invalid
6   | (is/are working at)     | 12                     | 3              | 4              | Invalid
7   | (birth place is)        | 3                      | 34             | 0.09           | Valid
8   | (has died)              | 7                      | 2              | 3.5            | Invalid

So the valid patterns are (was born in/at), (born in/at) and (the birthplace is). The other patterns,
(works at), (went to travel), (is/are working at) and (has died), are noisy patterns for this relation,
and the sentences containing them will be removed.
Relation 2: Here, the 'place_of_work' relation is expressed by its Bangla synonym, and the
entities are person and location. For this relation the conflict scores of the different patterns are
given below:

Table 5. Valid pattern identification for the (place_of_work) relation

No. | Pattern (English gloss)          | Patterns with conflict | Valid patterns | Conflict score | Valid or invalid
1   | (works/work at)                  | 2                      | 16             | 0.125          | Valid
2   | (had been working)               | 0                      | 31             | 0              | Valid
3   | (works/work at)                  | 2                      | 21             | 0.01           | Valid
4   | (was/were born in)               | 12                     | 1              | 12             | Invalid
5   | (has/have gone to travel)        | 5                      | 2              | 2.5            | Invalid
6   | (works/work at)                  | 1                      | 19             | 0.05           | Valid
7   | (has been appointed to the work) | 3                      | 18             | 0.16           | Valid
8   | (arranged the party)             | 5                      | 3              | 1.67           | Invalid

So the valid patterns are (works/work at), (had been working) and (has been appointed to the
work). The other patterns, (was/were born in), (has/have gone to travel) and (arranged the party),
are noisy patterns.
Relation 3: Here, the 'living_place' relation is expressed by its Bangla synonym, and the entities
are person and location. For this relation the conflict scores of the different patterns are given
below:

Table 6. Valid pattern identification for the (living_place) relation

No. | Pattern (English gloss)      | Patterns with conflict | Valid patterns | Conflict score | Valid or invalid
1   | (lives/live in)              | 4                      | 20             | 0.2            | Valid
2   | (has been living)            | 2                      | 11             | 0.11           | Valid
3   | (works at)                   | 5                      | 3              | 1.67           | Invalid
4   | (are the permanent resident) | 1                      | 14             | 0.08           | Valid
5   | (has gone to travel)         | 3                      | 1              | 3              | Invalid
6   | (works at)                   | 4                      | 2              | 2              | Invalid
7   | (has passed the student life)| 10                     | 2              | 5              | Invalid

So the valid patterns are (lives/live in), (has been living) and (are the permanent resident). The
other patterns, (works at), (has gone to travel) and (has passed the student life), are noisy patterns.
Relation 4: Here, the 'film_director' relation is expressed by its Bangla synonym, and the entities
are person and film. For this relation the conflict scores of the different patterns are given below:

Table 7. Valid pattern identification for the (film_director) relation

No. | Pattern (English gloss) | Patterns with conflict | Valid patterns | Conflict score | Valid or invalid
1   | (has directed)          | 4                      | 17             | 0.24           | Valid
2   | (has produced)          | 6                      | 7              | 0.86           | Invalid
3   | (directed)              | 1                      | 20             | 0.05           | Valid
4   | (has directed)          | 4                      | 22             | 0.19           | Valid
5   | (film director is)      | 1                      | 9              | 0.11           | Valid
6   | (has acted)             | 7                      | 4              | 1.75           | Invalid

So the valid patterns are (has directed), (directed) and (the film director is). The other patterns,
(has produced) and (has acted), are noisy patterns.
Relation 5: Here, the 'film_producer' relation is expressed by its Bangla synonym, and the
entities are person and film. For this relation the conflict scores of the different patterns are given
below:

Table 8. Valid pattern identification for the (film_producer) relation

No. | Pattern (English gloss) | Patterns with conflict | Valid patterns | Conflict score | Valid or invalid
1   | (has produced)          | 1                      | 10             | 0.1            | Valid
2   | (has acted)             | 5                      | 1              | 5              | Invalid
3   | (directed)              | 10                     | 2              | 5              | Invalid

So the valid pattern is (has produced); (has acted) and (directed) are noisy patterns for this
relation.
6. FUTURE WORK
In this research work, we worked on relation extraction of Bangla sentences. RE is used in
information extraction, and as Bengali articles on the Web keep increasing, this work holds great
significance for Bangla-language research. Bangla is a very rich language. In future, we will
build a relation classifier from which the noisy patterns for each relation will be removed.
7. CONCLUSIONS
Relation extraction is a fundamental topic in NLP. In this work, we made a Freebase for Bangla,
a large collection of structured data, by using the Wikidata query service. It will be very helpful
for further research in natural language processing of Bangla wherever researchers need seed
tuples. Researchers in areas such as entity extraction and reconciliation, data mining, the
Semantic Web, information retrieval, and ontology creation and analysis can also use it to
support their work. With the help of this Freebase, we obtained our seed tuples, worked on
Bangla sentences for relation extraction using distant supervision, and then found the noisy
patterns using the conflict score.
ACKNOWLEDGEMENTS
We are thankful to the Department of Computer Science & Engineering, Jahangirnagar
University.
REFERENCES
[1] Liu, Liyuan & Ren, Xiang & Zhu, Qi & Zhi, Shi & Gui, Huan & Ji, Heng & Han, Jiawei,
"Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach," arXiv
preprint arXiv:1707.00166 [cs.CL] (2017).
[2] Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, "Distant supervision for relation extraction
without labeled data," in Proceedings of ACL-IJCNLP (2009).
[3] Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, "Distant Supervision for Relation Extraction with
Sentence-Level Attention and Entity Descriptions" (2017). Semantic Scholar.
[4] Wang, Guanying & Zhang, Wen & Wang, Ruoxu & Zhou, Yalin & Chen, Xi & Zhang, Wei & Zhu, Hai
& Chen, Huajun, "Label-Free Distant Supervision for Relation Extraction via Knowledge Graph
Embedding," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing (2018).
[5] Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, et al., "Neural Relation Extraction for
Knowledge Base Enrichment," in Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics (July 2019).
[6] Marc Weise, Steffen Lohmann, Florian Haag, "LD-VOWL: Extracting and Visualizing Schema
Information for Linked Data."
[7] D. Hernández, A. Hogan, and M. Krötzsch, "Reifying RDF: What Works Well With Wikidata?" in
Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems,
vol. 1457 of CEUR Workshop Proceedings, pp. 32–47, CEUR-WS.org (2015).
[8] Nguyen Bach, Sameer Badaskar, "A Review of Relation Extraction." Semantic Scholar.
[9] K. D. Bollacker, P. Tufts, T. Pierce, and R. Cook, "A Platform for Scalable, Collaborative, Structured
Information Integration" (2007).
[10] Zeng, D. & Dai, Y. & Li, F. & Sherratt, R.S. & Wang, J., "Adversarial Learning for Distant Supervised
Relation Extraction," Computers, Materials and Continua (2018).
[11] Wang, Dongsheng & Tiwari, Prayag & Garg, Sahil & Zhu, Hongyin & Bruza, Peter, "Structural Block
Driven Enhanced Convolutional Neural Representation for Relation Extraction," Applied Soft
Computing, 86, 105913 (2019). doi:10.1016/j.asoc.2019.105913.
Authors
Rukaiya Habib is an M.Sc. student in Computer Science and Engineering at
Jahangirnagar University, Savar, Dhaka, Bangladesh. She completed her
B.Sc., also in Computer Science and Engineering, at Jahangirnagar
University in 2017. She is interested in research on Natural Language
Processing, Mobile Ad-hoc Networks and Computer Vision.

Md. Musfique Anwar was awarded a PhD degree by the Department of
Computer Science and Software Engineering, Faculty of Science,
Engineering and Technology, Swinburne University of Technology,
Melbourne, Australia in 2018. He received an M.Sc. degree from the
Department of Intelligence Science and Technology, Graduate School of
Informatics, Kyoto University, Japan in 2013 and a B.Sc. degree in Computer
Science and Engineering from Jahangirnagar University, Savar, Dhaka,
Bangladesh in 2006. Since 2008, he has been a faculty member, currently an
Associate Professor, in the Department of Computer Science and Engineering of
Jahangirnagar University, Savar, Dhaka, Bangladesh. His research focuses on Data
Mining, Social Network Analysis, Natural Language Processing and Software Engineering. He
received the Best Student Paper award at the 29th Australasian Database Conference (ADC) in 2018,
the Best Poster award at the 26th Australasian Database Conference (ADC) in 2015 and the Best
Poster Paper award at the International Workshop on Computer Vision and Intelligent Systems 2019
(IWCVIS2019).