Question Answering has been a well-researched NLP area in recent years. Users increasingly need to query across the variety of information available, whether structured or unstructured. In this paper, we propose a Question Answering module which a) can consume a variety of data formats through a heterogeneous data pipeline that ingests data from product manuals, technical data forums, internal discussion forums, groups, etc.; b) addresses practical challenges faced in real-life situations by pointing to the exact segment of the manual or chat thread that can resolve a user query; and c) provides segments of text when deemed relevant, based on the user query and business context. Our solution provides a comprehensive and detailed pipeline composed of elaborate data ingestion, data parsing, indexing, and querying modules. It is capable of handling a plethora of data sources such as text, images, tables, community forums, and flow charts. Our studies on a variety of business-specific datasets demonstrate the necessity of custom pipelines like the proposed one for solving real-world document question-answering tasks.
Convolutional recurrent neural network with template based representation for... (IJECEIAES)
A complex question answering system is developed to answer different types of questions accurately. Initially, the question in natural language is transformed into an internal representation which captures the semantics and intent of the question. In the proposed work, the internal representation is provided with templates instead of synonyms or keywords. Each internal representation is then mapped to a relevant query against the knowledge base. In the present work, a Template representation based Convolutional Recurrent Neural Network (T-CRNN) is proposed for selecting answers in a Complex Question Answering (CQA) framework. A recurrent neural network is used to obtain the exact correlation between answers and questions and the semantic matching among the collection of answers. Initially, learning is accomplished through a Convolutional Neural Network (CNN), which represents the questions and answers separately. A fixed-length representation is then produced for each question with the help of a fully connected neural network. To model the semantic matching between the answers, the representation of each Question Answer (QA) pair is given to the Recurrent Neural Network (RNN). Finally, for a given question, the correctly correlated answers are identified with a softmax classifier.
An efficient approach for web query preprocessing (IAES IJEECS)
The emergence of Web technology has generated a massive amount of raw data by enabling Internet users to post their opinions, comments, and reviews on the web. Extracting useful information from this raw data can be a very challenging task, and search engines play a critical role in these circumstances. User queries have become a main issue for search engines, so a preprocessing operation is essential. In this paper, we present a framework for natural language preprocessing for efficient data retrieval, along with some of the processing required for effective retrieval, such as elongated word handling, stop word removal, and stemming. The manuscript starts by building a manually annotated dataset and then takes the reader through the detailed steps of the process. Experiments are conducted for individual stages of this process to examine the accuracy of the system.
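The preprocessing stages the abstract lists (elongated word handling, stop word removal, stemming) can be sketched as a small pipeline. The stop-word set, suffix list, and length thresholds below are illustrative assumptions, not the paper's actual resources; a real system would use a full stop-word list and a Porter-style stemmer.

```python
import re

# Illustrative stop-word set; a production pipeline would use a full list.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "are", "in", "on"}

def collapse_elongated(token):
    # Reduce any character repeated 3+ times to a single character,
    # e.g. "soooooo" -> "so" (elongated word handling).
    return re.sub(r"(.)\1{2,}", r"\1", token)

def naive_stem(token):
    # Toy suffix stripping for illustration only.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(query):
    tokens = re.findall(r"[a-z]+", query.lower())
    tokens = [collapse_elongated(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("The phones are soooooo amazing and charging quickly"))
# -> ['phone', 'so', 'amaz', 'charg', 'quickly']
```

Each stage is independent, so individual stages can be evaluated separately, as the experiments in the paper do.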
This project lets students ask college-related queries and get responses through a chatbot, an artificial conversational entity. The system is a web application which answers a student's query. Students simply chat with the bot in any format; there is no specific format the user has to follow. The system helps students stay updated about college activities.
Context Driven Technique for Document Classification (IDES Editor)
In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from LSA, and word-vicinity-driven semantics. Second, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as the training set; its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors, and each document is finally ascribed its appropriate category by an SVM classifier.
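The bootstrap step described above, labeling an initial training set from per-category keywords before any SVM is trained, can be illustrated with a toy sketch. The categories and keyword sets below are invented for illustration, not the paper's actual keyword extraction output.

```python
# Hypothetical per-category keyword sets (the output of the paper's
# first stage would replace these).
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "score", "league"},
    "finance": {"market", "stock", "profit", "shares"},
}

def bootstrap_label(document):
    # Assign the category with the largest keyword overlap; documents
    # with no overlap stay unlabeled and would be routed to the trained
    # binary SVMs instead.
    tokens = set(document.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(bootstrap_label("The team won the match with a record score"))  # -> sports
print(bootstrap_label("nothing relevant here"))                       # -> None
```

The labeled subset then serves as training data for one binary SVM per category, so no manual annotation is needed.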
Framework for Product Recommendation for Review Dataset (rahulmonikasharma)
In the social networking era, product reviews have a significant influence on the purchase decisions of customers, and the market has recognized this. The problem is that customers do not know how these systems work, which results in trust issues. Therefore a different kind of system is needed that helps customers process the information in product reviews. There are different approaches and algorithms for data filtering and recommendation, and most existing recommender systems were developed for commercial domains with millions of users. In this paper we discuss the recommendation system and its related research, and implement different techniques of the recommender system.
Classification-based Retrieval Methods to Enhance Information Discovery on th... (IJMIT JOURNAL)
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task- or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
As the Internet grows, users of search providers continually demand search results that are tailored to their wishes. Personalized Search is one of the options available to users for shaping search results based on the personal data they provide to the search provider. However, this raises privacy concerns, as users are typically anxious about revealing personal information to an often faceless service provider on the Internet. This work proposes to address the privacy issues surrounding personalized search and discusses ways that privacy can be improved, so that users can become more comfortable with the release of their personal information in order to obtain more precise search results.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. Results show that the clustering of features improves the accuracy of document classification.
QUESTION ANSWERING SYSTEM USING ONTOLOGY IN MARATHI LANGUAGE (ijaia)
Humans are always in a quest to extract information related to some topic or entity. A question answering system helps users find the precise answer to a question articulated in natural language. It provides explicit, concise, and accurate answers to user questions rather than the set of relevant documents or web pages that most information retrieval systems return. This paper proposes a question answering system for the Marathi natural language using an ontology as a formal representation of the knowledge base for extracting answers. The ontology is used to express domain-specific knowledge about semantic relations and restrictions in the given domains. The ontologies are developed with the help of domain experts, and the query is analyzed both syntactically and semantically. The results obtained are accurate enough to satisfy the query raised by the user, and the level of accuracy is enhanced since the query is analyzed semantically.
A Hybrid Approach for Supervised Twitter Sentiment Classification
K. Revathy and Dr. B. Sathiyabhama
A Survey of Dynamic Duty Cycle Scheduling Scheme at Media Access Control Layer for Energy Conservation
Prof. M. V. Nimbalkar and Sampada Khandare
A Survey on Privacy Preserving Data Mining Techniques
A. K. Ilavarasi, B. Sathiyabhama and S. Poorani
An Ontology Based System for Predicting Disease using SWRL Rules
Mythili Thirugnanam, Tamizharasi Thirugnanam and R. Mangayarkarasi
Performance Evaluation of Web Services in C#, JAVA, and PHP
Dr. S. Sagayaraj and M. Santhosh Kumar
Semi-Automated Polyhouse Cultivation Using LabVIEW
Prathiba Jonnala and Sivaji Satrasupalli
Performance of Biometric Palm Print Personal Identification Security System Using Ordinal Measures
V. K. Narendira Kumar and Dr. B. Srinivasan
MIMO System for Next Generation Wireless Communication
Sharif, Mohammad Emdadul Haq and Md. Arif Rana
Predicting depression using deep learning and ensemble algorithms on raw twit... (IJECEIAES)
Social networking and microblogging sites such as Twitter are widespread among all generations nowadays; people connect and share their feelings, emotions, pursuits, etc. Depression, one of the most common mental disorders, is an acute state of sadness in which a person loses interest in all activities. If not treated immediately, it can lead to dire consequences, including death. In this era of the virtual world, people are more comfortable expressing their emotions on such sites, as these have become part and parcel of everyday life. The research put forth here thus employs machine learning classifiers on a Twitter data set to detect whether a person's tweet indicates any sign of depression.
Performance Analysis of Selected Classifiers in User Profiling (ijdmtaiir)
User profiles can serve as indicators of personal preferences which can be effectively used when providing personalized services. Building user profiles which capture accurate information about individuals has been a daunting task. Several attempts have been made by researchers to extract information from different data sources to build user profiles for different application domains. Towards this end, in this paper we employ different classification algorithms to create accurate user profiles based on information gathered from demographic data. The aim of this work is to analyze the performance of five of the most effective classification methods, namely Bayesian Network (BN), Naive Bayes (NB), Naive Bayes Updateable (NBU), J48, and Decision Table (DT). Our simulation results show that, in general, J48 has the highest classification accuracy with the lowest error rate. On the other hand, the Naive Bayes and Naive Bayes Updateable classifiers require the least time to build the classification model.
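The two metrics compared in that study, classification accuracy and model build time, can be captured in a small evaluation harness. The majority-class classifier and the data below are invented stand-ins for illustration; any of BN, NB, NBU, J48, or DT could be plugged into the same harness.

```python
import time
from collections import Counter

def majority_classifier(train):
    # Stand-in model: always predicts the most frequent training label.
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority

def evaluate(build_fn, train, test):
    # Time model construction, then measure accuracy on held-out data.
    start = time.perf_counter()
    model = build_fn(train)
    build_time = time.perf_counter() - start
    accuracy = sum(model(x) == y for x, y in test) / len(test)
    return accuracy, build_time

# Invented demographic records: (features, profile label).
train = [({"age": 30}, "A"), ({"age": 40}, "A"), ({"age": 20}, "B")]
test = [({"age": 35}, "A"), ({"age": 25}, "B")]

acc, build_time = evaluate(majority_classifier, train, test)
print(acc)  # -> 0.5
```

Running each candidate classifier through the same `evaluate` call is what makes the accuracy-versus-build-time comparison in the abstract possible.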
Research on Document Indexing in Search Engines. The main theme of information retrieval is to return the exact response to a user's specific query. Information search and retrieval is a large process; to achieve it effectively, we need techniques like document indexing, page ranking, and clustering. Among these, the document index plays a vital role in searching: instead of scanning hundreds of thousands of documents, the engine goes directly to the relevant index entry and produces the output. Our focus here is indexing; the clear meaning of indexing is that storing an index optimizes speed and performance in finding the appropriate document for the user's query. We conclude that a context-based index approach, built mainly from the source documents, is better for query retrieval than searching every page on the server; it saves time and reduces the load on the server.
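The lookup described above (going straight to a term's entry instead of scanning every document) is typically realized as an inverted index. A minimal sketch with invented documents:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of document ids containing it (its posting list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Intersect posting lists: only documents containing every query term match.
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "document indexing speeds retrieval",
    2: "page ranking and clustering",
    3: "indexing and ranking documents",
}
index = build_inverted_index(docs)
print(search(index, "indexing"))          # -> {1, 3}
print(search(index, "indexing ranking"))  # -> {3}
```

Each query touches only the posting lists for its terms, which is exactly why the index saves time and reduces the load on the server.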
Cluster Based Web Search Using Support Vector Machine (CSCJournals)
Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the Web a popular place for collecting information; today, Internet users access billions of web pages online using search engines. Information on the Web comes from many sources, including the websites of companies, organizations, and communities, personal homepages, etc. Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but documents can be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, we exploit the fact that relevant Web pages are often located close to each other in the Web graph of hyperlinks. The paper presents a graphical approach for entity resolution that complements the traditional methodology with analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between various pairs of nodes in the graph, which can significantly improve the quality of entity resolution. Support vector machines (SVMs), a set of related supervised learning methods used for classification, are used to distribute the load of user queries from the server machine to different client machines so that the system remains stable; the system clusters web pages based on their capacities and stores the whole database on the server machine. Keywords: SVM, cluster, ER.
A scalable, lexicon based technique for sentiment analysis (ijfcstjournal)
The rapid increase in the volume of sentiment-rich social media on the web has resulted in increased interest among researchers in sentiment analysis and opinion mining. However, with so much social media available on the web, sentiment analysis is now considered a big data task, and conventional sentiment analysis approaches fail to efficiently handle the vast amount of sentiment data available nowadays. The main focus of this research was to find a technique that can efficiently perform sentiment analysis on big data sets: one that can categorize text as positive, negative, or neutral in a fast and accurate manner. In the research, sentiment analysis was performed on a large data set of tweets using Hadoop, and the performance of the technique was measured in terms of speed and accuracy. The experimental results show that the technique exhibits very good efficiency in handling big sentiment data sets.
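A lexicon-based classifier of the kind described reduces to summing word polarities and thresholding the total. The tiny lexicon below is an invented placeholder for a real resource; Hadoop-style scaling would apply this same per-tweet scoring independently across partitions of the data, which is what makes the approach embarrassingly parallel.

```python
# Toy polarity lexicon; a real system would load a full lexicon.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def classify(tweet):
    # Sum the polarity of every known word, then threshold at zero.
    score = sum(LEXICON.get(tok, 0) for tok in tweet.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great phone"))    # -> positive
print(classify("terrible battery hate it"))   # -> negative
print(classify("just a phone"))               # -> neutral
```

Because each tweet is scored independently with no shared state, mapping `classify` over shards of the tweet set parallelizes trivially in a MapReduce job.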
Open domain Question Answering System - Research project in NLP (GVS Chaitanya)
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards such an ambitious goal is to deal with natural language so that the computer can understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. According to this discipline, Question Answering can be defined as the task that, given a question formulated in natural language, aims at finding one or more concise answers. Improvements in technology and the explosive demand for better information access have reignited interest in Q&A systems. The wealth of information on the web makes it an attractive resource for seeking quick answers to factual questions such as "Who was the first American to land in space?" or "What is the second tallest mountain in the world?", yet today's most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. A Q&A system aims to develop techniques that go beyond retrieval of relevant documents in order to return exact answers to natural language factoid questions.
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING (IJwest)
Machine Reading Comprehension (MRC), particularly extractive closed-domain question-answering, is a prominent field in Natural Language Processing (NLP). Given a question and a passage or set of passages, a machine must be able to extract the appropriate answer from the passage(s). However, the majority of existing questions have only one answer, and more substantial testing on questions with multiple answers, or multi-span questions, has not yet been applied. Thus, we introduce a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets. In addition, we run BERT-based models pre-trained for question-answering on our constructed dataset to evaluate their reading comprehension abilities. Runtime of the base models on the entire dataset is approximately one day, while the runtime for all models on a third of the dataset is a little over two days. Among the three BERT-based models we ran, RoBERTa exhibits the highest consistent performance, regardless of size. We find that all our models perform similarly on this new, multi-span dataset compared to the single-span source datasets. While the models tested on the source datasets were slightly fine-tuned in order to return multiple answers, performance is similar enough to judge that task formulation does not drastically affect question-answering abilities. Our evaluations indicate that these models are indeed capable of adjusting to answer questions that require multiple answers. We hope that our findings will assist future development in question-answering and improve existing question-answering products and methods.
accurate user profiles based on information gathered from
demographic data. The aim of this work is to analyze the
performance of five most effective classification methods,
namely Bayesian Network(BN), Naïve Bayesian(NB), Naives
Bayes Updateable(NBU), J48, and Decision Table(DT). Our
simulation results show that, in general, the J48has the highest
classification accuracy performance with the lowest error rate.
On the other hand, it is found that Naïve Bayesian and Naives
Bayes Updateable classifiers have the lowest time requirement
to build the classification model
Research on Document Indexing in the Search Engines. The main theme of Informational retrieval is to send the exact response of a user for specific Query.
The information search retrieval is a very big process, to achieve this concept we need to develop an application with more effect and we have to use techniques like Document indexing, page ranking, clustering technique. Among all of these Document index is plays avital role while searching why since instead of searching hundreds of thousands of documents it will directly go to the particular index and will give the output here. Here our achievement mainly is indexing, the clear meaning of the indexing is storing an index is to optimize speed and performance in finding the appropriate/corresponding document for the user searched query.
My conclusion is the context based index approach is used in the query retrieval, this is mainly from the source document. Instead of searching every page on server, finding technically is better. Due to this we can save our time, we can reduce the burden of server.
Cluster Based Web Search Using Support Vector MachineCSCJournals
Now days, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the Web a popular place for collecting information. Today, Internet user access billions of web pages online using search engines. Information in the Web comes from many sources, including websites of companies, organizations, communications and personal homepages, etc. Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, the relevant Web pages are often located close to each other in the Web graph of hyperlinks. It presents a graphical approach for entity resolution & complements the traditional methodology with the analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between various pairs of nodes in the graph. It can significantly improve the quality of entity resolution. Using Support vector machines (SVMs) which are a set of related Supervised learning methods used for classification of load of user queries to the sever machine to different client machines so that system will be stable. clusters web pages based on their capacities stores whole database on server machine. Keywords: SVM, cluster; ER.
A scalable, lexicon based technique for sentiment analysisijfcstjournal
Rapid increase in the volume of sentiment rich social media on the web has resulted in an increased
interest among researchers regarding Sentimental Analysis and opinion mining. However, with so much
social media available on the web, sentiment analysis is now considered as a big data task. Hence the
conventional sentiment analysis approaches fails to efficiently handle the vast amount of sentiment data
available now a days. The main focus of the research was to find such a technique that can efficiently
perform sentiment analysis on big data sets. A technique that can categorize the text as positive, negative
and neutral in a fast and accurate manner. In the research, sentiment analysis was performed on a large
data set of tweets using Hadoop and the performance of the technique was measured in form of speed and
accuracy. The experimental results shows that the technique exhibits very good efficiency in handling big
sentiment data sets.
Open domain Question Answering System - Research project in NLPGVS Chaitanya
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards the achievement of such an ambitious goal is to deal with natural language to enable the computer to understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. According to such discipline, Question Answering can be defined as the task that, given a question formulated in natural language , aims at finding one or more concise answers. And the Improvements in Technology and the Explosive demand for better information access has reignited the interest in Q & A systems , The wealth of the information on the web makes it an Interactive resource for seeking quick Answers to factual Questions such as “Who is the first American to land in space ?”, or “what is the second Tallest Mountain in the world ?”, yet Today’s Most advanced web Search systems(Bing , Google , yahoo) make it Surprisingly Tedious to locate the Answers , Q& A System Aims to develop techniques that go beyond Retrieval of Relevant documents in order to return the exact answers using Natural language factoid question
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGIJwest
Machine Reading Comprehension (MRC), particularly extractive close-domain question-answering, is a prominent field in Natural Language Processing (NLP). Given a question and a passage or set of passages, a machine must be able to extract the appropriate answer from the passage(s). However, the majority of these existing questions have only one answer, and more substantial testing on questions with multiple answers, or multi-span questions, has not yet been applied. Thus, we introduce a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets. In addition, we run BERT-based models pre-trained for question-answering on our constructed dataset to evaluate their reading comprehension abilities. Runtime of base models on the entire dataset is approximately one day while the runtime for all models on a third of the dataset is a little over two days. Among the three of BERT-based models we ran, RoBERTa exhibits the highest consistent performance, regardless of size. We find that all our models perform similarly on this new, multi-span dataset compared to the single-span source datasets. While the models tested on the source datasets were slightly fine-tuned in order to return multiple answers, performance is similar enough to judge that task formulation does not drastically affect question-answering abilities. Our evaluations indicate that these models are indeed capable of adjusting to answer questions that require multiple answers. We hope that our findings will assist future development in question-answering and improve existing question-answering products and methods.
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGdannyijwest
Machine Reading Comprehension (MRC), particularly extractive close-domain question-answering, is a prominent field in Natural Language Processing (NLP). Given a question and a passage or set of passages, a machine must be able to extract the appropriate answer from the passage(s). However, the majority of these existing questions have only one answer, and more substantial testing on questions with multiple answers, or multi-span questions, has not yet been applied. Thus, we introduce a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets. In addition, we run BERT-based models pre-trained for question-answering on our constructed dataset to evaluate their reading comprehension abilities. Runtime of base models on the entire datasetis approximately one day while the runtime for all models on a third of the dataset is a little over two days. Among the three of BERT-based models we ran, RoBERTa exhibits the highest consistent performance, regardless of size. We find that all our models perform similarly on this new, multi-span dataset compared to the single-span source datasets. While the models tested on the source datasets were slightly fine-tuned in order to return multiple answers, performance is similar enough to judge that task formulation does not drastically affect question-answering abilities. Our evaluations indicate that these models are indeed capable of adjusting to answer questions that require multiple answers. We hope that our findings will assist future development in question-answering and improve existing question-answering products and methods
Modern Database Management 12th Global Edition by Hoffer solution manual.docxssuserf63bd7
https://qidiantiku.com/solution-manual-for-modern-database-management-12th-global-edition-by-hoffer.shtml
name:Solution manual for Modern Database Management 12th Global Edition by Hoffer
Edition:12th Global Edition
author:by Hoffer
ISBN:ISBN 10: 0133544613 / ISBN 13: 9780133544619
type:solution manual
format:word/zip
All chapter include
Focusing on what leading database practitioners say are the most important aspects to database development, Modern Database Management presents sound pedagogy, and topics that are critical for the practical success of database professionals. The 12th Edition further facilitates learning with illustrations that clarify important concepts and new media resources that make some of the more challenging material more engaging. Also included are general updates and expanded material in the areas undergoing rapid change due to improved managerial practices, database design tools and methodologies, and database technology.
Leveraging Flat Files from the Canvas LMS Data Portal at K-StateShalin Hai-Jew
A lot of data are created in an LMS instance, and much of this can be analyzed for insight. In 2016, Instructure, the makers of Canvas, made their LMS data available to their customers through a data portal (updated monthly). This portal enables access to a number of flat files related to that particular instance. This presentation showcases how this big data was analyzed on a regular laptop with basic office software, to summarize Kansas State University’s use of the LMS. Methods for analysis include the following: basic descriptive statistics, survival analysis, computational linguistic analysis, and others.
The results are reported out with both numbers and data visualizations, including classic pie charts, line graphs, bar charts, mixed-charts, word clouds, and others. The findings provide some insights about how to approach the data, how to use a data dictionary, and other methods for extracting the data for awareness and practical decision-making. This work also is suggestive of next steps for more advanced analysis (using the flat files in a SQL database).
More information about this may be accessed at http://scalar.usc.edu/works/c2c-digital-magazine-spring--summer-2017/wrangling-big-data-in-a-small-tech-ecosystem.
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTIONijistjournal
The user generated content on the web grows rapidly in this emergent information age. The evolutionary changes in technology make use of such information to capture only the user’s essence and finally the useful information are exposed to information seekers. Most of the existing research on text information processing, focuses in the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis for text data available in each forum. This approach analyses the forum text data and computes value for each word of text. The proposed approach combines K-means clustering and Support Vector Machine with PSO (SVM-PSO) classification algorithm that can be used to group the forums into two clusters forming hotspot forums and non-hotspot forums within the current time span. The proposed system accuracy is compared with the other classification algorithms such as Naïve Bayes, Decision tree and SVM. The experiment helps to identify that K-means and SVM-PSO together achieve highly consistent results.
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTIONijistjournal
The user generated content on the web grows rapidly in this emergent information age. The evolutionary changes in technology make use of such information to capture only the user’s essence and finally the useful information are exposed to information seekers. Most of the existing research on text information processing, focuses in the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis for text data available in each forum. This approach analyses the forum text data and computes value for each word of text. The proposed approach combines K-means clustering and Support Vector Machine with PSO (SVM-PSO) classification algorithm that can be used to group the forums into two clusters forming hotspot forums and non-hotspot forums within the current time span. The proposed system accuracy is compared with the other classification algorithms such as Naïve Bayes, Decision tree and SVM. The experiment helps to identify that K-means and SVM-PSO together achieve highly consistent results.
professional fuzzy type-ahead rummage around in xml type-ahead search techni...Kumar Goud
Abstract – It is a research venture on the new information-access standard called type-ahead search, in which systems discover responds to a keyword query on-the-fly as users type in the uncertainty. In this paper we learn how to support fuzzy type-ahead search in XML. Underneath fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which have been used by many people on a daily basis. The systems received overwhelmingly positive feedbacks from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of the systems, and demonstrate several such systems. We show that our efficient techniques can indeed allow this search paradigm to scale on large amounts of data.
Index Terms - type-ahead, large data set, server side, online directory, search technique.
French machine reading for question answeringAli Kabbadj
This paper proposes to unlock the main barrier to machine reading and comprehension French natural language texts. This open the way to machine to find to a question a precise answer buried in the mass of unstructured French texts. Or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective Deep Learning methods need very large training da-tasets. Until now these technics cannot be actually used for French texts Question Answering (Q&A) applications since there was not a large Q&A training dataset. We produced a large (100 000+) French training Dataset for Q&A by translating and adapting the English SQuAD v1.1 Dataset, a GloVe French word and character embed-ding vectors from Wikipedia French Dump. We trained and evaluated of three different Q&A neural network ar-chitectures in French and carried out a French Q&A models with F1 score around 70%.
A STUDY ON THE IMPLEMENTATION OF SEMANTIC WIKI-BASED KNOWLEDGE MANAGEMENT SYSTEM FOR APPLICATION SERVICE PROVIDERS (ASP\’S) IN MANAGING SOFTWARE ARTIFACTS
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Online aptitude test management system project report.pdfKamal Acharya
The purpose of on-line aptitude test system is to take online test in an efficient manner and no time wasting for checking the paper. The main objective of on-line aptitude test system is to efficiently evaluate the candidate thoroughly through a fully automated system that not only saves lot of time but also gives fast results. For students they give papers according to their convenience and time and there is no need of using extra thing like paper, pen etc. This can be used in educational institutions as well as in corporate world. Can be used anywhere any time as it is a web based application (user Location doesn’t matter). No restriction that examiner has to be present when the candidate takes the test.
Every time when lecturers/professors need to conduct examinations they have to sit down think about the questions and then create a whole new set of questions for each and every exam. In some cases the professor may want to give an open book online exam that is the student can take the exam any time anywhere, but the student might have to answer the questions in a limited time period. The professor may want to change the sequence of questions for every student. The problem that a student has is whenever a date for the exam is declared the student has to take it and there is no way he can take it at some other time. This project will create an interface for the examiner to create and store questions in a repository. It will also create an interface for the student to take examinations at his convenience and the questions and/or exams may be timed. Thereby creating an application which can be used by examiners and examinee’s simultaneously.
Examination System is very useful for Teachers/Professors. As in the teaching profession, you are responsible for writing question papers. In the conventional method, you write the question paper on paper, keep question papers separate from answers and all this information you have to keep in a locker to avoid unauthorized access. Using the Examination System you can create a question paper and everything will be written to a single exam file in encrypted format. You can set the General and Administrator password to avoid unauthorized access to your question paper. Every time you start the examination, the program shuffles all the questions and selects them randomly from the database, which reduces the chances of memorizing the questions.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
International Journal on Natural Language Computing (IJNLC) Vol.10, No.6, December 2021
DOI: 10.5121/ijnlc.2021.10601
QUESTION ANSWERING MODULE LEVERAGING
HETEROGENEOUS DATASETS
Abinaya Govindan, Gyan Ranjan and Amit Verma
Neuron7.ai, USA
ABSTRACT
Question Answering has been a well-researched NLP area over recent years. It has become necessary for
users to be able to query through the variety of information available - be it structured or unstructured. In
this paper, we propose a Question Answering module which a) can consume a variety of data formats - a
heterogeneous data pipeline, which ingests data from product manuals, technical data forums, internal
discussion forums, groups, etc. b) addresses practical challenges faced in real-life situations by pointing to
the exact segment of the manual or chat threads which can solve a user query c) provides segments of texts
when deemed relevant, based on user query and business context. Our solution provides a comprehensive
and detailed pipeline that is composed of elaborate data ingestion, data parsing, indexing, and querying
modules. Our solution is capable of handling a plethora of data sources such as text, images, tables,
community forums, and flow charts. Our studies, performed on a variety of business-specific datasets,
demonstrate the necessity of custom pipelines like the proposed one for solving several real-world
document question-answering problems.
KEYWORDS
Machine comprehension, document parser, question answering, information retrieval, heterogeneous data
1. INTRODUCTION
In this study, we examine factoid question answering when the data source is restricted to a
constrained domain, such as Hi-Tech. The agent peruses several product manuals as the data
source to determine a technical answer to a client query. Product manuals are a continuously
evolving source of information, since a product typically has multiple versions and releases.
Structured knowledge bases such as FAQs are easier for machines to digest and process, but they
may not be comprehensive and require time and manual effort to create and validate; manuals, by
contrast, offer a more reliable and up-to-date source. This makes manuals the ideal candidate for
a long-term and scalable system. However, manuals are written for humans to read rather than for
machines to parse, which makes automatic parsing difficult.
In businesses, compiling product manuals requires time, effort, and coordination across several
teams, which often means manuals are unavailable for long periods. We can mitigate this by
leveraging real-time interactive data sources such as community forums and technical discussion
forums, where users post the problems they face and community members respond with solutions and
useful links. Extracting data from these forums helps businesses build a crowdsourced knowledge
platform. This data, however, cannot be consumed in raw form: it carries noise that must be
cleaned and processed before our system can use it effectively. Community conversation threads
are therefore also an integral component of our system for extracting relevant information.
The goal of this system is to reduce the time and manual effort required to locate relevant
articles in the manuals or community conversation forums and then navigate to the appropriate
section with the most suitable answer. For every user request, the system returns a set of
related sections from numerous guides and conversations, together with their parent documents,
just like a standard question-and-answer system. This design follows the business rationale that
several manuals may contain information relevant to a request. When a user asks a question such
as "What is the expected time for my battery to be fully charged?" and the solution can be found
in the manuals of several devices, all of those sections must be recommended to the user. The
system must also be able to interpret any additional context provided by the user that helps
narrow down the manuals: for example, if the user asks "What is the expected time for device A's
battery to be charged?", the system must recognize that device A is additional context and search
only the manuals for device A.
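This context-narrowing step can be sketched as a simple lookup, shown below purely for illustration; the device names and manual identifiers are hypothetical, and the paper's actual system recognizes context with learned models rather than string matching.

```python
# Hypothetical catalogue of known device names; a real deployment would
# load these from the business's product metadata.
KNOWN_DEVICES = {"device A", "device B"}

def extract_device_context(query: str) -> set:
    """Return the set of known devices mentioned in the query, if any."""
    query_lower = query.lower()
    return {d for d in KNOWN_DEVICES if d.lower() in query_lower}

def candidate_manuals(query: str, manuals_by_device: dict) -> list:
    """Restrict the search space to the manuals of the devices the user
    named; fall back to all manuals when the query has no device context."""
    devices = extract_device_context(query)
    if not devices:
        return [m for ms in manuals_by_device.values() for m in ms]
    return [m for d in sorted(devices) for m in manuals_by_device.get(d, [])]
```

With this sketch, a query mentioning device A is routed only to device A's manuals, while a device-free query fans out to every manual.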
Using hundreds of conversation threads and user manuals for question answering requires a
document indexer engine in the question answering system, and it must run at scale because
questions are answered in real time. For each user query, the system should therefore retrieve
the relevant sections from among hundreds of acquired manuals and conversations. Because it can
process both textual (paragraphs, summaries, etc.) and non-textual (tables, images, flow
diagrams, etc.) information, this system can be extended to any domain as long as manuals,
documents, or even books and articles are available.
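As a minimal sketch of such an indexer-retriever, the snippet below builds TF-IDF vectors over manual sections and ranks them by cosine similarity to a query. The section identifiers and texts are invented for illustration, and the paper's indexer additionally handles images, tables, and other non-textual data that this sketch omits.

```python
import math
from collections import Counter

def tokenize(text: str) -> list:
    return text.lower().split()

def build_index(sections: dict) -> dict:
    """Index each manual section as a TF-IDF vector keyed by section id."""
    docs = {sid: Counter(tokenize(text)) for sid, text in sections.items()}
    n = len(docs)
    df = Counter(term for tf in docs.values() for term in tf)
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = {
        sid: {term: count * idf[term] for term, count in tf.items()}
        for sid, tf in docs.items()
    }
    return {"idf": idf, "vectors": vectors}

def retrieve(index: dict, query: str, k: int = 3) -> list:
    """Return the ids of the top-k sections by cosine similarity."""
    q_tf = Counter(tokenize(query))
    q_vec = {t: c * index["idf"].get(t, 0.0) for t, c in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values())) or 1.0
    scores = {}
    for sid, vec in index["vectors"].items():
        dot = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        scores[sid] = dot / (q_norm * norm)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A battery-charging query against a toy index of three sections would rank the battery section of the relevant manual first, which is the behavior the indexer must deliver at the scale of hundreds of manuals.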
In traditional question answering scenarios, a small chunk of text can be regarded as the answer
to the question asked. We cannot, however, make the same assumption for our business use case. If
a user asks "What are the steps for me to log in to a device?", the response cannot be conveyed
in a short segment and must instead be answered with an entire section, such as one titled "How
to use and set up". As a result, the system should decide in real time whether the answer should
be returned as a short span or as a larger portion of text.
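As a rough illustration of this span-versus-section decision, a heuristic stand-in is sketched below; the cue phrases are invented for the example, whereas the system described in this paper uses a trained classifier rather than keyword rules.

```python
# Heuristic stand-in for a trained answer-type classifier: procedural
# "how to" questions are routed to a whole section, factoid questions
# to a short span. The cue list is illustrative only.
PROCEDURAL_CUES = ("how do i", "how to", "steps", "set up", "install", "configure")

def answer_granularity(question: str) -> str:
    """Return 'section' for procedural questions, 'span' otherwise."""
    q = question.lower()
    if any(cue in q for cue in PROCEDURAL_CUES):
        return "section"
    return "span"
```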
In this paper, we show how various existing systems attempt to solve this domain-based question
answering by comparing their performance on a standard business dataset. We also introduce the
Intelligent Question Answering system, which is composed of:
– Document parser, a transformer-based deep learning model that can parse and manage a
wide range of unstructured data, including images, tables, and textual content. The parser
comprises two sub-modules: one to parse user manual documents and one to extract
conversation threads from the technical forums.
– Document indexer, a module that indexes documents with essential information in indexed
databases, keeping all the different types of data, such as images and tables, in a single
collection.
– Document retriever, a natural-language-based query processor that handles several
business-specific preprocessing steps, recognizes whether the query carries any "context", and
fetches the top relevant chunks of text from the indexed database.
– Document reader, a multi-layer transformer-based model fine-tuned for the task of
domain-specific question answering. The document reader also includes a classifier trained to
decide whether the answer should be a short segment or a section of text.
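The composition of the four modules above can be sketched as a thin pipeline skeleton. The stage signatures below are assumptions made for illustration; in the actual system the parser and reader are transformer models, not the placeholder callables used here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch of how the four modules compose into one pipeline.
# Each stage is an assumed callable supplied by the caller.
@dataclass
class QAPipeline:
    parse: Callable        # document parser: raw document -> list of sections
    index: Callable        # document indexer: sections -> index object
    retrieve: Callable     # retriever: (index, query) -> top sections
    read: Callable         # reader: (section, query) -> answer text

    def answer(self, raw_docs: List[bytes], query: str) -> str:
        sections = [s for doc in raw_docs for s in self.parse(doc)]
        idx = self.index(sections)
        top = self.retrieve(idx, query)
        return self.read(top[0], query) if top else ""
```

Plugging trivial stand-ins into each slot (e.g. newline splitting for the parser, substring matching for the retriever) is enough to exercise the control flow end to end before the real models are attached.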
In this paper, we study the application of several deep learning models to the question answering
task. Our experiments show that the Intelligent Question Answering system outperforms
traditional question answering systems on standard business-specific datasets.
1.1. Related work
Research teams first focused on factoid questions: queries whose answers can be extracted with
certainty from a defined text source. In real-world scenarios there may be a few deviations from
this assumption, which can be handled by additional classifier modules. These teams were mostly
interested in factoid questions such as "Where was X born?" and "Which year did Y take place?".
The emphasis has since shifted to difficult problems encountered regularly in the real world,
such as "How can Y be done?" or "A was moved from B to C and later to D. Where is A now?". These
questions can require complex comprehension and context inference, as well as information flow
between sentences, and can no longer be solved with simple comprehension models or named entity
models. The Text Retrieval Conference (TREC) has featured an annual evaluation track of question
answering systems since 1999 (Voorhees 2001, 2003b). Following TREC's success, both the CLEF and
NTCIR workshops began multilingual and cross-lingual QA tracks in 2002, focusing on European and
Asian languages respectively (Magnini et al. 2006; Yutaka Sasaki and Lin 2005) [8]. Other
datasets focused on question answering, such as P. Rajpurkar et al. 2016 [11] and P. Rajpurkar et
al. 2018 [10], were also released in the past few years.
The body of knowledge in the field of question answering has expanded to the point where various
models reliably address the QA domain. However, the majority of these models and strategies focus
on academic data sources that have been curated by humans and follow proper grammatical and
linguistic patterns.
Data is commonly distributed across pages or portions of pages in the real world, making parsing
and further inference of this type of data significantly more difficult than in the academic setting.
There are also several advanced complete pipeline QA systems that leverage either the Web, as
does QuASE (Sun et al., 2015) [13], or Wikipedia as a resource, as do Microsoft’s AskMSR
(Brill et al., 2002) [14], IBM’s DeepQA (Ferrucci et al., 2010) [20] and YodaQA (Baudiš, 2015;
Baudiš and Šedivý, 2015) [15]. AskMSR is a search-engine-based QA system that prioritizes
“data redundancy over complex linguistic analyses of either queries or probable responses”; in
other words, it does not prioritize machine understanding as we do. A few methods attempt to deal
with both unstructured and structured data, such as text segments and documents, as well as
knowledge bases and databases. DeepQA is one such instance. Other systems based on DeepQA,
such as YodaQA, incorporate information extraction from unstructured sources such as websites,
text, and Wikipedia.
This task is difficult since researchers must deal with scalability and accuracy issues. Rapid
development has been made in recent years, and the performance of factoid and open-domain QA
systems has greatly improved (Sasaki et al. [8], 2017; S. Schwager et al., 2019 [4] ;K. Jiang et al.,
2019 [1]). Several alternatives have been offered, notably DrQA’s two-stage retriever-reader
system (Chen et al., 2017) [9], end-to-end transformer-based models (S. Schwager et al., 2019) [4],
and unified framework-based models that handle all text-based language tasks (Raffel et al.,
2020) [7].
QAConv, as developed by C.-S. Wu et al. [23], is a new dataset that uses conversations as a
knowledge source. The data is mainly composed of business emails, panel discussions, technical
conversations and work channels. Unlike open-domain and task-oriented dialogues, these
conversations are usually long, complex, asynchronous, and involve strong domain knowledge.
Traditional question answering models have proven to be inefficient on this kind of data, proving
the need for a custom and enhanced pipeline to address such data sources.
International Journal on Natural Language Computing (IJNLC) Vol.10, No.6, December 2021
2. OUR PROPOSAL
Our system, Intelligent Question Answering Pipeline, is described in the following sections, and
it is made up of four parts: Document Parser (1), Document Indexer (2), Document Retriever (3),
and Document Reader (4) modules. Fig. 5 shows the entire architecture.
2.1. Document parser
The document parser is the input processing block that reads the data and converts it into a format
that can be handled by ensuing modules. The document parser is composed of two main building
blocks a) The deep learning based framework that is used to convert conversational threads from
technical forums and internal chat forums to our desired format so that they can be further
processed for question answering. b) The Mask RCNN based fine-tuned instance segmentation
model, which is fine-tuned to recognise tables and images with embedded text in the document,
is the core part of the document parser. In addition to the existing branch for classification and
bounding box regression, Mask R-CNN extends Faster RCNN by adding a branch for
predicting segmentation masks on each Region of Interest (RoI).
Deep learning based conversational thread parser The deep learning based conversation
parser module is depicted in Fig. 1.
Fig 1. Detailed overview of Intelligent Question Answering for Product manuals
We built a specialized pipeline to process information from technical forums and community
websites. The pipeline has the following characteristics:
– uses natural language inference models to identify the key sections of a conversation
– from the conversation threads, extracts the most relevant response
– compiles the information into a usable format that can be put in an indexed database with
information collected from manuals.
An intent categorization model is used in the initial section of the conversational thread parser.
This model is built on a BERT-based classifier that has been fine-tuned using the natural
language inference dataset. As described in [26], the classifier was fine-tuned for the terms
entailment, non-entailment, and neutral. Framing intent classification as a zero-shot textual entailment problem, we employ this
fine-tuned model to classify intents in our data. The classes for which the model was trained, the
“intents”, are “application related”, “feature update”, “software error”, etc. These intents were the
noun chunks most prominent in the data and were prioritised using domain expertise
provided by agents.
The intent identification model determines the question’s intent from a list of pre-defined intents
Ik and preserves messages in the conversation thread that are also part of the same intent as Ik.
This removes non-relevant messages, which is the first degree of noise reduction. For example, if
the user asks, “My battery does not function, has anyone had this issue,” the intent classifier
predicts “software error” as the intent for the inquiry. The model then predicts intents for all
individual messages in the threads, keeping just those that contain the expected intent of
“software error.”
Since these forums are typically free-text based, there is a risk of users adding non-relevant
inputs to the threads. This degree of intent-based-filtering helps the model remove all the
unnecessary messages such as “Thanks” or other random remarks in the threads.
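The intent-based filtering above can be sketched as follows. The `classify_intent` stub here is a hypothetical stand-in for the fine-tuned BERT entailment model (its keyword rules are illustrative only); only the keep-messages-with-matching-intent logic is faithful to the pipeline:

```python
# Sketch of intent-based thread filtering. `classify_intent` is a toy stand-in
# for the fine-tuned zero-shot BERT model; the filtering logic is the point.
INTENTS = ["application related", "feature update", "software error"]

def classify_intent(text):
    """Toy intent classifier standing in for the fine-tuned BERT model."""
    text = text.lower()
    if any(w in text for w in ("error", "crash", "does not function", "bug")):
        return "software error"
    if any(w in text for w in ("update", "new feature", "release")):
        return "feature update"
    return "application related"

def filter_thread(question, messages):
    """Keep only messages whose predicted intent matches the question's intent."""
    target = classify_intent(question)
    return [m for m in messages if classify_intent(m) == target]

thread = [
    "My phone crashes after the update, any fix?",         # software error -> kept
    "Thanks!",                                             # no matching intent -> dropped
    "Reinstalling the firmware fixed the error for me.",   # software error -> kept
]
kept = filter_thread("My battery does not function, has anyone had this issue", thread)
```

With this toy classifier, "Thanks!" is dropped while the two error-related messages survive, mirroring the noise reduction described above.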
The Answer identification module makes up the second half of the conversational thread parser.
Since we do not have ground-truth information about which answer in a thread actually solved a
user’s question, the answer identification module plays a crucial role in generating question-
answer pairs which can be considered “completely error free” and can be indexed with the other
data sources. Once the first level of noisy messages has been removed from a thread by the
intent identification module, it is critical to determine whether a proposed response truly answers the
user’s inquiry and to keep only that.
This is performed by our question-answering system, which is built on a modern text-to-text
framework, T5 [27]. The data we utilised to fine-tune our model was curated in a way
similar to the BoolQA dataset, in which we extracted question and answer pairs from community
forums. Around 700 discussion threads were retrieved, with a total of 2500 question-answer pairs
annotated from these threads. The data structure is as shown in Fig. 2:
To encode the question and answer together into a single input in a text-in text-out format, we
first establish a uniform encoding format. The query is encoded first, followed by the knowledge
context: the candidate responses for each of the dialogue threads. The delimiter “\n” is used to
join these inputs into a single sequence. Unlike other sequence-to-sequence models fine-tuned on
T5, we do not use any prefixes, data, or task-specific identifiers in the encoding or model.
The model is trained in such a way that the answers are inferred from the context and question,
which is far superior to standard question answering systems that extract or abstract answer
blocks from the provided context.
Fig 2. Sample data format used to fine tune T5 model
The model has been trained to minimise the categorical cross-entropy loss, defined as
L = −∑_{c=1}^{N_c} y_c log(ŷ_c)
where y ∈ R^{N_c} is a one-hot label vector and N_c is the number of classes.
Once the model has been fine-tuned to identify whether an answer is really the technical solution
that helps solve a question, we use this model to fully clean our data and extract question and
corresponding answer pairs from conversations. This is done by passing questions and all the
noise-removed candidate answers, in our desired encoded format, to the fine-tuned model. Since
the model has been trained to return “yes” if a context actually contains the technical solution to
the question, we retain only the answers in a thread for which the model’s output is “yes”.
If more than one answer is predicted relevant (i.e. the model output is “yes”),
we use the prediction probabilities from the output softmax layer and keep the answer with
the maximum prediction probability. This finally narrows the entire thread down to one particular
answer, removing the other noisy responses, which can then be passed to the
question answering modules.
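The selection step can be sketched as follows, with mocked (label, probability) pairs standing in for the fine-tuned model's softmax outputs:

```python
# Sketch of answer selection: among candidates labelled "yes" by the model,
# keep the one with the highest softmax probability. Model outputs are mocked.
def select_answer(candidates, predictions):
    """candidates: answer strings; predictions: (label, probability) pairs."""
    relevant = [
        (prob, ans)
        for ans, (label, prob) in zip(candidates, predictions)
        if label == "yes"
    ]
    if not relevant:
        return None
    return max(relevant)[1]  # answer with the maximum "yes" probability

answers = ["Try rebooting.", "Flash firmware v2.1.", "Same issue here."]
preds = [("yes", 0.62), ("yes", 0.91), ("no", 0.88)]
best = select_answer(answers, preds)
```

Note that a high-probability "no" never wins: only "yes"-labelled candidates compete on probability, which is exactly the narrowing described above.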
Once we have extracted the correct response to each question in the user forum, we need to
encode it in the same format as the other data sources so that it can be pushed into the same
indexed databases. From the entire data of the technical forums and community forums, the
data collator extracts structured pairs of questions and related responses. The collator then adds
suitable metadata fields to act as data source identifiers before pushing the collected data into the
indexed databases.
Mask RCNN based instance segmentation model The architecture of a Mask RCNN is as
depicted in Figs. 3 and 4.
The premise of Mask R-CNN is simple: for each candidate object, Faster R-CNN [17] has
two outputs, a class label and a bounding-box offset; to this, a third branch is added that
produces the object mask.
Fig 3. Mask RCNN framework for instance segmentation
Fig 4. Head architecture of Faster R-CNN
The additional mask output, however, is distinct from the class and box outputs, necessitating the
extraction of a much finer spatial arrangement of an object. Because Mask RCNN uses
image-centric training, images are resized so that their shorter edge is 800 pixels. Formally,
the multi-task loss on each sampled RoI used during training is defined as
L = Lcls + Lbox + Lmask
The classification loss Lcls is defined as
Lcls(p, u) = −log p_u
which is the log loss for the true class u, and the bounding-box loss Lbox, over the true regression
targets v and the predicted tuple t^u for class u, is defined as
L_box(t^u, v) = ∑_{i ∈ {x,y,w,h}} smooth_{L1}(t_i^u − v_i),  where smooth_{L1}(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise.
For each RoI, the mask branch produces a K·m²-dimensional output, which encodes K binary
masks of resolution m × m, one for each of the K classes. A per-pixel sigmoid is applied, and Lmask is
defined as the average binary cross-entropy loss. Lmask is only defined on the k-th mask for a RoI
associated with ground-truth class k (the other mask outputs do not contribute to the loss).
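As a numeric illustration of the multi-task loss L = Lcls + Lbox + Lmask, the sketch below combines the log loss for the true class, a smooth-L1 box loss (as in Fast R-CNN), and an average binary cross-entropy over a flattened toy mask; all input values are illustrative, not from the trained model:

```python
# Toy computation of L = Lcls + Lbox + Lmask for one sampled RoI.
import math

def smooth_l1(x):
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5

def multitask_loss(p_true_class, box_pred, box_target, mask_pred, mask_target):
    # log loss for the true class u
    l_cls = -math.log(p_true_class)
    # smooth-L1 over the four box coordinates (x, y, w, h)
    l_box = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    # per-pixel sigmoid outputs -> average binary cross-entropy over the mask
    l_mask = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(mask_pred, mask_target)
    ) / len(mask_pred)
    return l_cls + l_box + l_mask

loss = multitask_loss(
    p_true_class=0.9,
    box_pred=[0.1, 0.2, 0.0, 0.1],
    box_target=[0.0, 0.0, 0.0, 0.0],
    mask_pred=[0.9, 0.9, 0.8, 0.8],   # flattened 2x2 mask, post-sigmoid
    mask_target=[1.0, 1.0, 1.0, 1.0], # ground-truth mask for class k
)
```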
The mask RCNN [18] based object detector is fine-tuned as depicted in Fig. 3. The following
modules make up the Document Parser’s primary stages:
– The RCNN model has been fine-tuned to recognise two primary objects: tables and images
with captions.
– The Document Parser then uses the fine-tuned model to divide the incoming document into
three categories: tables, images with captions, and paragraph sections.
– Based on the identified object, all three sections of the document are then stored in suitable
databases.
Fig 5. Detailed overview of Intelligent Question Answering for Product manuals
2.2. Document indexer
The indexer is responsible for indexing and storing the parsed data in structured databases and
indexed databases. These sections are indexed with relevant indicator/metadata characteristics
during index time, which can then be mapped to the object class such as table, text, and so on. We
chose a Lucene-based indexer for indexed databases after examining several business
characteristics such as data volume, indexing speed, and retrieval speed during query time. The
document parsing and indexing is done in batches, with triggers configured to start the process
when new documents are added to the source repository.
2.3. Document retriever
Fig 6. Fine tuned word embedding vector space
Fig. 7. Fine tuned models to parse user query
The real-time query process is handled by the retriever and reader after the documents have been
batch indexed. Even though we have thousands of documents spanning hundreds of
pages, we can query through them in real time thanks to the document retriever’s quick
information retrieval. The modules that make up the retriever are as follows:
Contextualised synonyms extractor We use GloVe word embeddings [21], which have been
fine-tuned on our business data, to give the retriever semantic abilities during the query stage.
These models seek to maximise the log probability as a context window scans across the corpus;
for brevity, we do not explore their intricacies further here. Although the training is done in an
online, stochastic manner, the implied global objective function can be expressed as
J = ∑_{i,j=1}^{V} f(X_{ij}) (w_i^T w̃_j + b_i + b̃_j − log X_{ij})²
This results in word embeddings that look like Fig. 6 in the higher-dimensional vector space.
Using the fine-tuned word embeddings, we cluster words based on their embeddings to group
semantically similar words together, resulting in a set of contextual synonyms that aid semantic
and contextual query retrieval.
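The synonym-extraction step can be sketched with cosine similarity over toy 2-dimensional vectors standing in for the fine-tuned GloVe embeddings; the vocabulary, vectors, and threshold are all assumptions for illustration:

```python
# Sketch of contextual-synonym extraction: words whose embeddings are close
# in cosine similarity are grouped together. Toy 2-d vectors stand in for
# the GloVe embeddings fine-tuned on business data.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def contextual_synonyms(word, embeddings, threshold=0.95):
    """Return words within `threshold` cosine similarity of `word`."""
    target = embeddings[word]
    return sorted(
        w for w, v in embeddings.items()
        if w != word and cosine(target, v) >= threshold
    )

emb = {
    "error":   [0.9, 0.1],
    "fault":   [0.85, 0.15],
    "battery": [0.1, 0.9],
}
syns = contextual_synonyms("error", emb)
```

Here "fault" lands in the same neighbourhood as "error", while "battery" does not, which is the behaviour the contextual-synonym set relies on at query time.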
Metadata and Context extractor A context extractor, which is a named entity model that has
been trained on metadata such as product family, product line, and model name, is the next stage
of document retrieval. For the objective of single phrase tagging, this named entity model was
trained using a variation of BERT [5]. This model’s construction is depicted in Fig. 8. After
extracting the entities from the user query, the metadata information is used to enhance the
retrieval by pulling information only from documents that have relevant metadata.
Fig. 8. Architecture for Named entity model fine Tuning
Fig. 9. Training data preparation for Question Answering
Tf-idf based retriever and collator The pre-processed query is then run through a Lucene-based
indexed Query processor, which converts the query into a Lucene index-specific format and
returns the n most relevant documents, along with their IDs, Tf-Idf relevance scores, and
metadata. As shown in Fig. 7, the user query is delivered in parallel to all object classes.
The gathered document results are then sent into our BM25-based similarity scorer, which re-
prioritizes the retrieved results as follows:
score(D, Q) = ∑_{i=1}^{n} IDF(q_i) · f(q_i, D)(k₁ + 1) / (f(q_i, D) + k₁(1 − b + b·|D|/avgdl))
where f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length,
avgdl is the average document length in the collection, and k₁ and b are free parameters.
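The BM25 re-prioritisation can be sketched with the standard formulation and its usual free parameters k1 and b; the toy corpus below stands in for the documents returned by the Lucene retriever:

```python
# Minimal BM25 scorer over a toy tokenised corpus; in the pipeline this step
# re-prioritises the Lucene-retrieved results rather than the whole corpus.
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)          # document frequency
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = doc.count(q)                                 # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["battery", "replacement", "steps"],
    ["battery", "error", "codes"],
    ["display", "settings"],
]
ranked = sorted(
    corpus,
    key=lambda d: bm25_score(["battery", "error"], d, corpus),
    reverse=True,
)
```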
2.4. Document Reader
The document reader is responsible for producing the final consumable user results using the
following models once the results have been retrieved and aggregated:
Fine tuned textual and Image question answering model Despite the widespread availability
of pre-trained question-answering models that can answer generic queries, they typically produce
less than satisfactory results on real-world questions. For this use case, we constructed an unsupervised
question answering training dataset (similar to Cloze translation [16]), which was then utilised to
fine-tune pre-trained BERT [5] models, as shown in Fig. 9.
One difference between the usual SQuAD scenario and our commercial use case is that the
questions we might be asked require more detailed responses. It’s possible that the questions
related to our business case are What or Why or How inquiries, such as How shall I switch my
phone off? or What is the meaning of error X?. In the first situation, a paragraph may be required
as an answer, whereas in the second, only a small portion of the paragraph may be required. To
categorise incoming questions into one of these classes, a conventional Naive Bayes classifier is
employed, which decides whether the answer to the question should be a short answer or a long
one.
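The short-vs-long routing decision can be sketched with a toy multinomial Naive Bayes classifier, written from scratch here for self-containment; the training examples and both class names are illustrative, not our production training set:

```python
# Toy multinomial Naive Bayes (Laplace-smoothed) standing in for the
# conventional classifier that routes questions to short vs long answers.
import math
from collections import Counter

class NaiveBayes:
    def fit(self, texts, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        def log_prob(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for w in text.lower().split():
                # add-one (Laplace) smoothing for unseen words
                lp += math.log((self.word_counts[c][w] + 1) / total)
            return lp
        return max(self.classes, key=log_prob)

train_x = [
    "what is the meaning of error X",
    "what does code 42 mean",
    "how shall I switch my phone off",
    "how can I reset the device",
]
train_y = ["short", "short", "long", "long"]
clf = NaiveBayes().fit(train_x, train_y)
pred = clf.predict("what is error 7")
```

"What/Why" definition-style questions lean toward the "short" class and "How" procedural questions toward "long", matching the routing described above.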
This approach is also applied to images that have text captions. The OCR parser in the
image-based question-answering model extracts textual information from the document’s picture
segments. The fine-tuned question-answering model is then used to extract relevant answer
segments.
Tabular question answering model The Data Reader’s second section uses fine-tuned models
based on the BERT architecture to answer table questions. This method is based on K.
Chakrabarti et al.’s TableQnA [2]. This module involves finding the correct table from a list of
tables and utilising the TableQnA model to get the cell that could be the answer. The module also
includes a collection of table handling and parsing algorithms that convert multiple tabular forms
to the TableQnA format, as well as a post-processor that returns the answer in a user-friendly
format.
3. EXPERIMENTS AND DATA
For the purposes of comparison, we evaluated our pipeline against two widely used standard
approaches.
3.1. Standard Tf-Idf based Retrieval Based Approach
As a domain, information retrieval has been thought of as a search engine-based task in which
retrieval from indexed databases leads to the final solution. For many question types, a basic
inverted index lookup followed by term vector model scoring works fairly well. With minimum
contextualisation, we indexed all of the manuals’ subsections into a Lucene-based indexed
database. The test queries were then passed to the database as database queries, and the top
relevant section of the manual was extracted as the answer.
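This baseline can be sketched as a toy in-memory index with term-vector (Tf-Idf) scoring; the Lucene index used in practice is replaced here by plain Python dictionaries for illustration:

```python
# Sketch of the Tf-Idf baseline: score each manual subsection against the
# query with tf * idf and return the top-scoring one.
import math
from collections import Counter

def tfidf_retrieve(query, sections):
    """Return the index of the highest-scoring section for the query."""
    N = len(sections)
    docs = [Counter(s.lower().split()) for s in sections]
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(d.keys())
    def score(d):
        return sum(
            d[t] * math.log(N / df[t])
            for t in query.lower().split() if t in d
        )
    return max(range(N), key=lambda i: score(docs[i]))

sections = [
    "battery replacement requires a screwdriver",
    "to reset the device hold the power button",
    "display brightness settings menu",
]
best = tfidf_retrieve("reset device power", sections)
```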
3.2. SQuAD Based Retrieval Approaches
Traditional SQuAD-based models [4] were the second option we wanted to test to see how they
performed in real-world use cases. Because the SQuAD model has only ever been evaluated on
extremely small text blocks, when we apply it for larger chunks of text like Wikipedia or
documents, it frequently fails in terms of both run time and performance by failing to capture the
correct area of the product manual.
3.3. Comparison of Average Run Time
On a test size of 50 user questions, the methods’ average execution time is as follows:
Table 1. Average run time (ms)
Tf-Idf based    SQuAD based    Our approach
100             300000         150
Because the objective is to return the response in real-time, the SQuAD approach makes it
infeasible for the user to wait approximately 5 minutes for each question to arrive at the
corresponding answer.
3.4. Comparison of Performance metrics
We chose measures for this experiment that would demonstrate the predicted answer’s lexical and
semantic similarity to the actual answer. The Rouge score was the first metric used, defined as
ROUGE-N = (∑_{ngram ∈ B} count_match(ngram)) / (∑_{ngram ∈ B} count(ngram))
where
count_match(ngram) = n(A ∩ B)
A token t is considered to be common between A and B if the semantic similarity between t and
at least one token in sequence B is greater than a pre-defined threshold. We use ROUGE-1 and
ROUGE-2.
The Overlap similarity metric, on the other hand, seeks to capture the similarity between the
expected and predicted answer. We define it as:
Overlap(S1, S2) = |{t ∈ S1 : t is semantically matched in S2}| / |S1|
This metric denotes the percentage of tokens in the expected answer S1 which are semantically
equivalent to tokens in S2. Semantic similarity between two vectors A and B is measured as the
cosine similarity
sim(A, B) = (A · B) / (||A|| ||B||)
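The overlap computation can be sketched as follows, with toy token embeddings standing in for the actual vectors; the vocabulary and the 0.9 threshold are assumptions for illustration:

```python
# Sketch of the overlap measure: the fraction of expected-answer tokens (S1)
# semantically matched (cosine similarity >= threshold) by some token in the
# predicted answer (S2).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def overlap(s1_tokens, s2_tokens, emb, threshold=0.9):
    matched = sum(
        1 for t in s1_tokens
        if any(cosine(emb[t], emb[u]) >= threshold for u in s2_tokens)
    )
    return matched / len(s1_tokens)

emb = {
    "restart": [1.0, 0.0],
    "reboot":  [0.99, 0.05],
    "phone":   [0.0, 1.0],
    "menu":    [0.5, 0.5],
}
score = overlap(["restart", "phone"], ["reboot", "menu"], emb)
```

"restart" is matched by the near-identical "reboot" vector while "phone" has no close counterpart, giving an overlap of 0.5 for this toy pair.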
The performance numbers are as below:
Table 2. Performance comparison

Approach                    Metric              Performance
Tf-Idf based approach       Rouge1 - F score    0.0838
                            Rouge2 - F score    0.0514
                            Overlap measure     0.2245
SQuAD based approach        Rouge1 - F score    0.0609
                            Rouge2 - F score    0.0224
                            Overlap measure     0.2041
Intelligent QnA Pipeline    Rouge1 - F score    0.2463
                            Rouge2 - F score    0.2094
                            Overlap measure     0.5918
The figures shown here are for when only the first prediction was taken into account for
comparison. When the top three results are considered, the numbers increase significantly and
were accepted for the business use-case.
3.5. Inference
The Tf-IDF approach worked well for simple scenarios, but the returned responses were too long
for the end user to absorb. In the majority of these cases, the solutions were only available in a
subsection of the manuals, making it impractical in a business scenario to trawl through extensive
passages to find the answer. Furthermore, because there were other sections with the same
keywords, the answer was frequently not found in the first returned result, but rather somewhere
further down. The SQuAD-based results had two major drawbacks: run time and result quality.
For both simple and complicated problems, the model frequently failed to return the correct
answer. This demonstrates that, while traditional methods can perform well in some situations,
the domain-specific refinements built into our pipeline have proven helpful in obtaining the right
response in the least amount of time.
4. CONCLUSION
In this paper, we proposed a novel question-answering pipeline built over structured and
unstructured data sources such as manuals, images, and product user guides. Our main focus
is on extracting information from heterogeneous data sources, which usually
pose several practical challenges to parse and understand. We’ve also shown, with concrete data,
how to layer bespoke domain knowledge on top of existing and traditional systems to produce
contextualised outputs for a certain business domain. Our method has also been used in a variety
of domain use-cases to help users find relevant answers to their problems fast.
This paper extends the work done on question answering on product manuals as studied in [24].
We apply several fine-tuned models to handle heterogeneous data, which required significant
engineering effort and yields improved performance. One limitation of the existing approach is that the current pipeline does not allow
the incorporation of human feedback for various sub modules. As described in Future Work, we
expect to undertake that as part of our pipeline enhancement initiatives.
5. FUTURE WORK
In the current system, there is a simple layer of user feedback and named entities as mentioned in
[25] that is used to improve the final answer generated by the automated pipeline. User signals
such as feedback and more annotated data for new labels will be incorporated into our future
work. Individual components of the current pipeline will perform better as a result of these
signals being plugged into various pipeline sub-modules. One area of improvement, as proposed
by G. Abinaya et al. [22], is designing a system capable of ingesting such feedback. The
aforementioned structure will include dedicated modules that determine which section of the
pipeline feedback should flow into, the cadence at which feedback should be reflected in the
module, and the weights that should be assigned to feedback based on its importance, user
preferences, and other factors.
For the scope of this paper, we prioritised the question answering module, which was trained for
the Hi-Tech domain. Another area of consideration in the named entity recognizer module will be
the incorporation of domain knowledge and domain specific jargons, starting with the generation
of domain specific data using unstructured methodologies and then developing named entity
classifiers utilising this data.
REFERENCES
[1] K. Jiang, D. Wu, and H. Jiang, “FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style
Question-Answer Pairs with Freebase,” Proceedings of NAACL-HLT 2019, pages 318–323,
Minneapolis, Minnesota, June 2 - June 7, 2019.
[2] K. Chakrabarti, Z. Chen, S. Shakeri, G. Cao, and S. Chaudhuri, “TableQnA: Answering List-Intent
Queries With Web Tables,” Proceedings of the VLDB Endowment, Vol. 12, 2020.
[3] C. Deng, G. Zeng, Z. Cai, and X. Xiao, “ A Survey of Knowledge Based Question Answering with
Deep Learning,” Journal on Artificial Intelligence 2020, No. 4 , pages 157-166
[4] S. Schwager and J. Solitario, “Question and Answering on SQuAD 2.0: BERT Is All You Need,”
ArXiv e-prints of 2019.
[5] J. Devlin , M.W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding,” ArXiv e-prints of 2018.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin
“Attention Is All You Need,” 31st Conference on Neural Information Processing Systems (NIPS
2017), Long Beach, CA, USA.
[7] C. Raffel , N. Shazeer, A. Roberts, K. Lee , S. Narang , M. Matena , Y. Zhou, W. Li and P. J. Liu,
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” ArXiv e-prints
of 2020
[8] Y. Sasaki, Y.Chen, K. Chen and C.J. Lin, “Overview of the NTCIR-5 Cross-Lingual Question
Answering Task,” Proceedings of NTCIR-5 Workshop Meeting, 2005, Tokyo, Japan
[9] D. A. Chen, J. W. Fisch, and A. Bordes. “ Reading Wikipedia to Answer Open-Domain Questions”
ArXiv e-prints, 2017.
[10] P. Rajpurkar, R. Jia, P. Liang. “Know What You Don’t Know: Unanswerable Questions for SQuAD,”
arXiv:1806.03822, 2018.
[11] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang. “ SQuAD: 100,000+ Questions for Machine
Comprehension of Text,” arXiv:1606.05250v3, 2016.
[12] W. T. Yih, X. D. He and C. Meek, “Semantic parsing for single-relation question answering,” in Proc.
of the 52nd Annual MTG of the Association for Computational Linguistics, Baltimore, Maryland,
USA, pp. 643–648, 2014.
[13] H. Sun, H. Ma, W. Yih, C. Tsai, J. Liu, and M.i Chang. “Open domain question answering via
semantic enrichment,” Proceedings of the 24th International Conference on World Wide Web. ACM
2015, pages 1045–1055.
[14] E. Brill, S. Dumais, and M. Banko. “An analysis of the AskMSR question-answering system,”
Empirical Methods in Natural Language Processing (EMNLP). pages 257–264.’
[15] P. Baudiš and J. Šedivý. “Modeling of the question answering task in the YodaQA system,”
International Conference of the Cross- Language Evaluation Forum for European Languages.
Springer 2015, pages 222–228.
[16] P. Lewis, L. Denoyer and S. Riedel. “Unsupervised Question Answering by Cloze Translation,”
ArXiv e-prints of 2019
[17] R. Girshick. “Fast R-CNN,” ArXiv e-prints of 2015
[18] K. He, G. Gkioxari, P. Dollár and R. Girshick. “Mask R-CNN,” ArXiv e-prints of 2018
[19] B. Magnini, D. Giampiccolo, L. Aunimo, C. Ayache, P. Osenova, A. Peñas, M. Rijke, B. Sacaleanu,
D. Santos and R. Sutcliffe, “The Multilingual Question Answering Track at CLEF,” CLEF, 2006
[20] Ferrucci, David and Brown, Eric and Chu-Carroll, Jennifer and Fan, James and Gondek, David and
Kalyanpur, Aditya and Lally, Adam and Murdock, J William and Nyberg, Eric and Prager, John and
Schlaefer, Nico and Welty, Christopher “Building Watson: An Overview of the DeepQA Project,”. AI
Magazine, 2010. pages 59-79
[21] J. Pennington , R. Socher and C. Manning “GloVe: Global Vectors for Word Representation,”
Empirical Methods in Natural Language Processing, 2014
[22] G. Abinaya, G. Ranjan and P. Aswin Karthik, “Continuous learning mechanism of NLU-ML models
boosted by human feedback,” International Conference on Computational Intelligence in Data
Science (ICCIDS), 2019, pages 1-6
[23] C-S Wu, A. Madotto, W. Liu, P. Fung, C. Xiong, “QAConv: Question Answering on Informative
Conversations,” CoRR 2021
[24] A. Govindan, G. Ranjan, A. Verma, “Intelligent Question Answering Module for Product Manuals,”
NATL - Nov, 2021
[25] A. Govindan, G. Ranjan, A. Verma, “Unsupervised Named Entity Recognition for Hi-Tech Domain,”
CNDC - Nov, 2021
[26] W. Yin, J. Hay and D. Roth, “Benchmarking Zero-shot Text Classification: Datasets, Evaluation and
Entailment Approach,” CORR 2019
[27] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li and P. J. Liu,
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” CoRR 2019.