The document describes a system that uses artificial immune systems and natural language processing techniques like semantic role labeling and named entity recognition to automatically generate questions from text. It introduces a model that applies these techniques to extract semantic patterns from sentences, trains a classifier using artificial immune systems to classify question types, and then generates questions by matching patterns. The system was tested on sentences from various sources and showed promising results, correctly determining question types 95% of the time and generating matching questions 87% of the time.
Factoid based natural language question generation system (Animesh Shaw)
The proposed system addresses the generation of factoid or wh-questions from sentences in a corpus consisting of factual, descriptive and unbiased details. We discuss our heuristic algorithm for sentence simplification and pre-processing; the knowledge base extracted in this step is stored in a structured format that assists further processing. We then discuss the sentence semantic relations that enable us to construct questions following certain recognizable patterns among different sentence entities, followed by an evaluation of the questions generated.
An important problem in Thai word segmentation is the sentential noun phrase. Existing studies try to minimize the problem, but no research has addressed it directly. This study investigates an approach to resolving it using conditional random fields, a probabilistic model for segmenting and labeling sequence data. The results show that our technique detected noun phrases correctly more than 78.61% of the time.
A study on the approaches of developing a named entity recognition tool (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Translating natural language competency questions into SPARQL queries, WEB 2013 (Leila Zemmouchi-Ghomari)
This is our presentation at WEB 2013, the First International Conference on Building and Exploring Web Based Environments, held in Seville, Spain (January 27 to February 1).
Named Entity Recognition Using Web Document Corpus (IJMIT JOURNAL)
This paper introduces a named entity recognition approach for textual corpora. A Named Entity (NE) can be a location, person, organization, date, time, etc., characterized by its instances. An NE is found in texts accompanied by contexts: the words to the left or right of the NE. The work mainly aims at identifying contexts that induce the NE's nature. For example, the occurrence of the word "President" in a text means that this context may be followed by the name of a president, as in President "Obama". Likewise, a word preceded by the string "footballer" is likely the name of a footballer. NE recognition may thus be viewed as a classification task in which every word is assigned to an NE class according to its context.
The aim of this study is then to identify and classify the contexts that are most relevant for recognizing an NE, i.e., those most frequently found with it. A learning approach using a training corpus of web documents, constructed from learning examples, is suggested. Frequency representations and modified tf-idf representations are used to calculate context weights based on context frequency, learning-example frequency, and document frequency in the corpus.
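The context-weighting idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it weights each word flanking a target entity by its context frequency times an inverse-document-frequency factor. The toy corpus, whitespace tokenization, and single-token entity matching are all assumptions made for the example.

```python
import math
from collections import Counter

def context_weights(docs, entity):
    """Score words appearing immediately left or right of `entity` with a
    tf-idf-style weight: context frequency scaled by the inverse document
    frequency of the context word in the corpus."""
    ctx_counts = Counter()   # how often each word flanks the entity
    doc_freq = Counter()     # in how many documents each word occurs
    for doc in docs:
        tokens = doc.split()
        for w in set(tokens):
            doc_freq[w] += 1
        for i, tok in enumerate(tokens):
            if tok == entity:
                if i > 0:
                    ctx_counts[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    ctx_counts[tokens[i + 1]] += 1
    n_docs = len(docs)
    return {w: tf * math.log(n_docs / doc_freq[w])
            for w, tf in ctx_counts.items()}

docs = [
    "President Obama met the press",
    "the footballer Messi scored",
    "President Macron spoke in Paris",
]
# "Obama" and "Macron" each flank "President" once and appear in only
# one document, so they get the full idf weight.
weights = context_weights(docs, "President")
```

A real system would of course tokenize properly and use the learning-example frequency as an additional factor, as the abstract mentions.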
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES (acijjournal)
Named Entity Recognition, an important subject of Natural Language Processing, is a key technology for information extraction, information retrieval, question answering and other text-processing applications. In this study, we evaluate previously well-established association measures as an initial attempt to extract two-word named entities from a Turkish corpus. Furthermore, we propose a new association measure and compare it with the other methods. The methods are evaluated using precision and recall.
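One well-established association measure for two-word candidates of this kind is pointwise mutual information (PMI). The abstract does not name its specific measures, so the sketch below is a generic illustration: PMI compares the observed bigram probability against what independence of the two words would predict, and a high value suggests a collocation such as a two-word named entity.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the bigram (w1, w2):
    log( P(w1, w2) / (P(w1) * P(w2)) )."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log(p_xy / (p_x * p_y))

tokens = ("New York is large and New York is busy "
          "while old towns are new in spirit").split()
```

In practice such scores are computed over every adjacent word pair in the corpus and the top-ranked pairs are proposed as entity candidates, then checked against a gold standard for precision and recall.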
Architecture of an ontology based domain-specific natural language question a... (IJwest)
Question answering (QA) systems aim at retrieving precise information from a large collection of documents in response to a query. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic web question answering. The proposed architecture defines four basic modules suitable for enhancing current QA capabilities with the ability to process complex questions. The first module is question processing, which analyses and classifies the question and also reformulates the user query. The second module retrieves the relevant documents. The next module processes the retrieved documents, and the last module performs the extraction and generation of a response. Natural language processing techniques are used for processing the question and documents and also for answer extraction. Ontology and domain knowledge are used for reformulating queries and identifying the relations. The aim of the system is to generate a short and specific answer to a question asked in natural language in a specific domain. We achieved 94% accuracy of natural language question answering in our implementation.
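The four-module architecture described above can be illustrated as a toy pipeline. Every helper below is a deliberately naive stand-in (keyword overlap instead of real NLP, and no ontology), intended only to show how the modules chain together; none of it reproduces the paper's actual components.

```python
def process_question(question):
    """Module 1: classify the question and reformulate it as a keyword query."""
    q = question.lower().rstrip("?").split()
    qtype = q[0] if q[0] in ("who", "what", "where", "when") else "what"
    stop = {"who", "what", "where", "when", "is", "the", "of", "a"}
    return qtype, [w for w in q if w not in stop]

def retrieve_documents(query, documents):
    """Module 2: keep documents sharing at least one query term."""
    return [d for d in documents if set(query) & set(d.lower().split())]

def process_documents(documents, query):
    """Module 3: split retrieved documents into sentences, ranked by overlap."""
    sents = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    return sorted(sents,
                  key=lambda s: len(set(query) & set(s.lower().split())),
                  reverse=True)

def extract_answer(passages, qtype):
    """Module 4: return the best passage as a (very rough) answer."""
    return passages[0] if passages else None

def answer(question, documents):
    qtype, query = process_question(question)
    docs = retrieve_documents(query, documents)
    return extract_answer(process_documents(docs, query), qtype)

documents = ["Paris is the capital of France. It has many museums.",
             "Bananas are yellow."]
ans = answer("What is the capital of France?", documents)
```

In the real system each stand-in would be replaced by the NLP and ontology machinery the abstract describes, but the data flow between the four modules is the same.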
A Novel Technique for Name Identification from Homeopathy Diagnosis Discussio... (home)
Named entities are the most informative elements of a textual document, and identification of names is very important for extracting further information from text. We have developed a conditional random field based system to identify the named entities in homeopathic diagnosis discussion forum text. We manually annotated a training corpus for the task. As manual creation of a sufficiently large annotated corpus is costly and time consuming, we use an active learning based semi-supervised framework to increase the efficiency of the system with the help of un-annotated data. Our system achieves the highest f-value of
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing requirement on the Internet for users to access information expressed in languages other than their own has led to CLIR becoming established as a major topic in IR.
Ontologies have been applied to many applications in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterize the domain, together with their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology Ontology, extracting the concepts and objects from different sources. The Ontology is constructed from three main resources: ACM, Wikipedia and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use Natural Language Processing tools, such as OpenNLP and the Stanford Lexical Dependency Parser, to explore sentences. We then extract these sentences based on English patterns in order to build the training set. We use a random sample among 245 categories of ACM to evaluate our results. The results show that our system yields superior performance.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and passed through the Turing test to check the similarity between data, but was limited to small data sets. Later, various algorithms were developed along with concepts from AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques on different parameters.
Novelty detection via topic modeling in research articles (csandit)
In today's world, redundancy is a vital problem faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles: a method of identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in a research paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means to process them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favor the hierarchical Pachinko Allocation Model when used for document retrieval.
In this paper, we present a survey on Word Sense Disambiguation (WSD). In nearly all major languages around the world, research in WSD has been conducted to varying extents. We survey the different approaches adopted in different research works, the state of the art in performance in this domain, recent works in different Indian languages, and finally work in the Bengali language. We also survey the different competitions in this field and the benchmark results obtained from those competitions.
In this research work we have developed a new scoring mathematical model that works on five types of questions. The question text features are first extracted and a score is computed based on the question's structure with respect to its template structure; an answer score is then calculated against both the question and the paragraph text, to finally arrive at the index of the most probable answer with respect to the question.
A New Concept Extraction Method for Ontology Construction From Arabic Text (CSCJournals)
Ontology is one of the most popular representation models used for knowledge representation, sharing and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects, and due to this complexity, automatic Arabic terminology extraction is difficult. In addition, concept extraction from Arabic documents is a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single- and multi-word concepts has been proposed, implemented and evaluated. Our method combines linguistic information, statistical information and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter removes unwanted strings. To determine the relevance of these candidates within the domain, different statistical measures and a new domain-relevance measure are implemented, for the first time, for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module. Concept scores are calculated according to their statistical values and domain-knowledge values. To evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach for extracting concepts for Arabic ontology construction.
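The candidate-extraction step described above (POS-pattern matching, then stop-word filtering, then a domain-relevance score) can be sketched generically. The POS pattern (a run of NN tags), the stop list, and the relevance formula below are assumptions made for illustration; the paper's actual patterns and measures for Arabic are not reproduced here.

```python
from collections import Counter

STOP = {"the", "of", "and", "in", "a"}

def concept_candidates(tagged):
    """Extract maximal noun sequences (POS pattern NN+) as candidate
    concepts, then drop candidates that are stop words."""
    cands, current = [], []
    for word, pos in tagged + [("", "")]:   # sentinel flushes the last run
        if pos.startswith("NN"):
            current.append(word)
        else:
            if current:
                cands.append(" ".join(current))
            current = []
    return [c for c in cands if c.lower() not in STOP]

def domain_relevance(term, domain_freq, general_freq):
    """Toy domain-relevance score: how much more frequent the term is in
    the domain corpus than in a general corpus (add-one smoothed)."""
    return domain_freq[term] / (1 + general_freq[term])

tagged = [("the", "DT"), ("ontology", "NN"), ("construction", "NN"),
          ("of", "IN"), ("text", "NN")]
cands = concept_candidates(tagged)
domain_freq = Counter({"ontology construction": 8})
general_freq = Counter({"ontology construction": 1})
```

Candidates scoring above a threshold would then be accepted as domain concepts; the abstract adds a further adjustment from a domain-knowledge resource.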
Domain Specific Named Entity Recognition Using Supervised Approach (Waqas Tariq)
This paper introduces a Named Entity Recognition approach for textual corpora. Supervised statistical methods are used to develop our system, which can categorize NEs belonging to the particular domain for which it is trained. As named entities appear in text surrounded by contexts (the words to the left or right of the NE), we focus on extracting NE contexts from text and then performing statistical computation on them, using n-gram modeling to extract the contexts. Our methodology first extracts the left and right tri-grams surrounding NE instances in the training corpus and calculates their probabilities. All extracted tri-grams, along with their calculated probabilities, are then stored in a file. During testing, the system detects unrecognized NEs in the testing corpus and categorizes them using the tri-gram probabilities calculated at training time. The proposed system consists of two modules, Knowledge Acquisition and NE Recognition: the Knowledge Acquisition module extracts the tri-grams surrounding NEs in the training corpus, and the NE Recognition module categorizes the Named Entities in the testing corpus.
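The two-module scheme just described (collect left/right tri-grams around known NEs with probabilities, then categorize new candidates by those probabilities) can be sketched as follows. The training-data format, the toy corpus, and the additive scoring are assumptions for the example, not the paper's exact design.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Knowledge Acquisition sketch. `corpus` holds (tokens, index, category)
    triples: a token list, the position of a known NE, and its category.
    Collects left and right tri-grams around each NE and converts the
    counts to per-category probabilities."""
    counts = defaultdict(Counter)
    for tokens, i, cat in corpus:
        counts[cat][("L", tuple(tokens[max(0, i - 3):i]))] += 1
        counts[cat][("R", tuple(tokens[i + 1:i + 4]))] += 1
    return {cat: {g: c / sum(ctr.values()) for g, c in ctr.items()}
            for cat, ctr in counts.items()}

def categorize(model, tokens, i):
    """NE Recognition sketch: score each category by the probabilities of
    the candidate's left and right tri-gram contexts."""
    left = ("L", tuple(tokens[max(0, i - 3):i]))
    right = ("R", tuple(tokens[i + 1:i + 4]))
    scores = {cat: probs.get(left, 0) + probs.get(right, 0)
              for cat, probs in model.items()}
    return max(scores, key=scores.get)

corpus = [
    ("president Obama spoke to the press".split(), 1, "PERSON"),
    ("footballer Messi scored a great goal".split(), 1, "PERSON"),
    ("the city of Paris is beautiful".split(), 3, "LOCATION"),
]
model = train(corpus)
```

An unseen name in a familiar context, such as "president Trump spoke ...", then inherits the category whose stored contexts match best.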
Text mining aims to discover new, previously unknown or hidden information by automatically extracting it from various written resources. Applying knowledge discovery methods to unstructured text is known as Knowledge Discovery in Text, or text data mining, also called Text Mining. Most techniques used in Text Mining are based on the statistical study of a term, either a word or a phrase. Different algorithms have been used in previous text mining work. For example, the Single-Link Algorithm and Self-Organizing Maps (SOM) offer an approach for visualizing high-dimensional data and a very useful tool for processing textual data based on a projection method. Genetic and sequential algorithms provide the capability for multiscale representation of datasets and are fast to compute with little CPU time, based on the Isolet reduced subsets in unsupervised feature selection. We propose the Vector Space Model and a concept-based analysis algorithm, which will improve text clustering quality so that a better text clustering result may be achieved. The proposed algorithm behaves well in terms of robustness and consistency with respect to the formation of the Neural Network.
Question Classification using Semantic, Syntactic and Lexical features (dannyijwest)
Weather forecasting is among the most challenging problems around the world. There are various reasons for this, owing to its experimental value in meteorology, but it is also a typical unbiased time-series forecasting problem in scientific research. Many methods have been proposed by various scientists; the motive behind this research is to predict more accurately. This paper contributes to the same using an artificial neural network (ANN), simulated in MATLAB, to predict two important weather parameters: maximum and minimum temperature. The model was trained using 60 years of real data (1901-1960) and tested over 40 years to forecast maximum and minimum temperature. The results, based on the mean square error (MSE) function, confirm that this model, based on a multilayer perceptron, has the potential for successful application to weather forecasting.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define a pattern, and the pattern for each named entity category is used as a seed pattern to identify named entities in the test set. Pattern scoring and tuple value scoring enable the generation of new patterns to identify the named entity categories. We evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and obtained an average f-measure of 75% for both languages.
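The seed-pattern bootstrapping loop described above can be sketched in miniature: alternate between collecting context patterns around the currently known entities and promoting words that appear in the highest-scoring patterns. Single-word contexts, frequency-based pattern scoring, and the toy sentences are simplifying assumptions; the paper's pattern and tuple scoring functions are richer.

```python
from collections import Counter

def bootstrap(sentences, seeds, rounds=2, top_patterns=3):
    """Semi-supervised NER bootstrapping sketch: (1) collect left/right
    context patterns around known entities, (2) promote words occurring
    in the best patterns to new entities, and repeat."""
    entities = set(seeds)
    for _ in range(rounds):
        patterns = Counter()
        for s in sentences:
            toks = s.split()
            for i, t in enumerate(toks):
                if t in entities:
                    if i > 0:
                        patterns[("L", toks[i - 1])] += 1
                    if i + 1 < len(toks):
                        patterns[("R", toks[i + 1])] += 1
        best = {p for p, _ in patterns.most_common(top_patterns)}
        for s in sentences:
            toks = s.split()
            for i, t in enumerate(toks):
                if i > 0 and ("L", toks[i - 1]) in best:
                    entities.add(t)
                if i + 1 < len(toks) and ("R", toks[i + 1]) in best:
                    entities.add(t)
    return entities

sentences = ["president Obama spoke",
             "president Putin arrived",
             "president Macron spoke"]
ents = bootstrap(sentences, {"Obama"})
```

Starting from the single seed "Obama", the pattern "president _" is learned in the first round and pulls in the other two names. Real systems add pattern scoring precisely to keep noisy patterns from promoting junk.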
Answer extraction and passage retrieval for... (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of retrieving, from a collection of documents, the text portions that contain the answer to a user's question. These QASs use a variety of linguistic tools that are able to deal with small fragments of text. Therefore, to retrieve the documents containing the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purpose of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then, we compute the similarity score between the user's question terms and each passage. The top five passages (those with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2% and F1-measure of 87%.
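The retrieval step just described (segment the top-ranked documents into passages, score each against the question terms, keep the best) can be sketched with a bag-of-words cosine similarity. Whitespace tokenization, fixed-length passages, and cosine scoring are assumed for this illustration; the paper's actual similarity function is not specified in the abstract.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists treated as bags of words."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_passages(question, documents, k=5, passage_len=20):
    """Split each document into fixed-size passages, score each against
    the question terms, and keep the k highest-scoring passages."""
    q = question.lower().split()
    passages = []
    for doc in documents:
        toks = doc.lower().split()
        for i in range(0, len(toks), passage_len):
            passages.append(toks[i:i + passage_len])
    passages.sort(key=lambda p: cosine(q, p), reverse=True)
    return [" ".join(p) for p in passages[:k]]

docs = ["cairo is the capital of egypt and a large city",
        "football is popular in many countries around the world"]
best = top_passages("what is the capital of egypt", docs, k=1)
```

Answer extraction would then run only on the surviving passages rather than the whole collection, which is the point of the reduction.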
Text mining efforts to innovate new, previous unknown or hidden data by automatically extracting
collection of information from various written resources. Applying knowledge detection method to
formless text is known as Knowledge Discovery in Text or Text data mining and also called Text Mining.
Most of the techniques used in Text Mining are found on the statistical study of a term either word or
phrase. There are different algorithms in Text mining are used in the previous method. For example
Single-Link Algorithm and Self-Organizing Mapping(SOM) is introduces an approach for visualizing
high-dimensional data and a very useful tool for processing textual data based on Projection method.
Genetic and Sequential algorithms are provide the capability for multiscale representation of datasets and
fast to compute with less CPU time based on the Isolet Reduces subsets in Unsupervised Feature
Selection. We are going to propose the Vector Space Model and Concept based analysis algorithm it will
improve the text clustering quality and a better text clustering result may achieve. We think it is a good
behavior of the proposed algorithm is in terms of toughness and constancy with respect to the formation of
Neural Network.
Question Classification using Semantic, Syntactic and Lexical featuresdannyijwest
Weather forecasting is most challenging problem around the world. There are various reason because of its experimented values in meteorology, but it is also a typical unbiased time series forecasting problem in scientific research. A lots of methods proposed by various scientists. The motive behind research is to predict more accurate. This paper contribute the same using artificial neural network (ANN) and simulated in MATLAB to predict two important weather parameters i.e. maximum and minimum temperature. The model has been trained using past 60 years of real data collected from(1901-1960) and tested over 40 years to forecast maximum and minimum temperature. The results based on mean square error function (MSE) confirm, this model which is based on multilayer perceptron has the potential to successful application to weather forecasting
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
Answer extraction and passage retrieval forWaheeb Ahmed
—Question Answering systems (QASs) do the task of
retrieving text portions from a collection of documents that
contain the answer to the user’s questions. These QASs use a
variety of linguistic tools that be able to deal with small
fragments of text. Therefore, to retrieve the documents which
contains the answer from a large document collections, QASs
employ Information Retrieval (IR) techniques to minimize the
number of documents collections to a treatable amount of
relevant text. In this paper, we propose a model for passage
retrieval model that do this task with a better performance for
the purpose of Arabic QASs. We first segment each the top five
ranked documents returned by the IR module into passages.
Then, we compute the similarity score between the user’s
question terms and each passage. The top five passages (with
high similarity score) are retrieved are retrieved. Finally,
Answer Extraction techniques are applied to extract the final
answer. Our method achieved an average for precision of
87.25%, Recall of 86.2% and F1-measure of 87%.
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This
paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean
model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the
researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
The World Wide Web holds a large size of different information. Sometimes while searching the World Wide Web, users always do not gain the type of information they expect. In the subject of information extraction, extracting semantic relationships between terms from documents become a challenge. This paper proposes a system helps in retrieving documents based on the query expansion and tackles the extracting of semantic relationships from biological documents. This system retrieved documents that are relevant to the input terms then it extracts the existence of a relationship. In this system, we use Boolean model and the pattern recognition which helps in determining the relevant documents and determining the place of the relationship in the biological document. The system constructs a term-relation table that accelerates the relation extracting part. The proposed method offers another usage of the system so the researchers can use it to figure out the relationship between two biological terms through the available information in the biological documents. Also for the retrieved documents, the system measures the percentage of the precision and recall.
Exploring Semantic Question Generation Methodology and a Case Study for Algor...IJCI JOURNAL
Assessment of student performance is one of the most important tasks in the educational process. Thus, formulating questions and creating tests takes the instructor a lot of time and effort. However, the time spent for learning acquisition and on exam preparation could be utilized in better ways. With the technical development in representing and linking data, ontologies have been used in academic fields to represent the terms in a field by defining concepts and categories classifies the subject. Also, the emergence of such methods that represent the data and link it logically contributed to the creation of methods and tools for creating questions. These tools can be used in existing learning systems to provide effective solutions to assist the teacher in creating test questions. This research paper introduces a semantic methodology for automating question generation in the domain of Algorithms. The primary objective of this approach is to support instructors in effectively incorporating automatically generated questions into their instructional practice, thereby enhancing the teaching and learning experience.
Generation of Question and Answer from Unstructured Document using Gaussian M...IJACEE IJACEE
Question Answering (QA) system is one of the ever growing applications in Natural Language Processing. The purpose of Automatic Question and Answer Generation system is to generate all possible questions and its relevant answers from a given unstructured document. Complex sentences are simplified to make question generation easier. The accuracy of the generated questions is measured by identifying the subtopics from the text using Gaussian Mixture Neural Topic Model (GMNTM).The similarity between generated questions and text are calculated using Extended String Subsequence Kernel (ESSK). The syntactic correctness of the questions is measured by Syntactic Tree Kernel which computes the similarity scores between each sentence in the given context and generated questions. Based on the similarity score, questions are ranked. The answers for the generated questions are extracted using Pattern Matching Approach. This system is expected to produce better accuracy when compared with the system using Latent Dirichlet Allocation (LDA) for subtopic identification.
Automatic Generation of Multiple Choice Questions using Surface-based Semanti...CSCJournals
Multiple Choice Questions (MCQs) are a popular large-scale assessment tool. MCQs make it much easier for test-takers to take tests and for examiners to interpret their results; however, they are very expensive to compile manually, and they often need to be produced on a large scale and within short iterative cycles. We examine the problem of automated MCQ generation with the help of unsupervised Relation Extraction, a technique used in a number of related Natural Language Processing problems. Unsupervised Relation Extraction aims to identify the most important named entities and terminology in a document and then recognize semantic relations between them, without any prior knowledge as to the semantic types of the relations or their specific linguistic realization. We investigated a number of relation extraction patterns and tested a number of assumptions about linguistic expression of semantic relations between named entities. Our findings indicate that an optimized configuration of our MCQ generation system is capable of achieving high precision rates, which are much more important than recall in the automatic generation of MCQs. Its enhancement with linguistic knowledge further helps to produce significantly better patterns. We furthermore carried out a user-centric evaluation of the system, where subject domain experts from biomedical domain evaluated automatically generated MCQ items in terms of readability, usefulness of semantic relations, relevance, acceptability of questions and distractors and overall MCQ usability. The results of this evaluation make it possible for us to draw conclusions about the utility of the approach in practical e-Learning applications.
A survey of named entity recognition in assamese and other indian languagesijnlc
Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
Ontologisms have been applied to many applications in recent years, especially on Sematic Web, Information Retrieval, Information Extraction, and Question and Answer. The purpose of domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain as well as their definitions and interrelationships. This paper will describe some algorithms for identifying semantic relations and constructing an Information Technology Ontology, while extracting the concepts and objects from different sources. The Ontology is constructed based on three main resources: ACM, Wikipedia and unstructured files from ACM Digital Library. Our algorithms are combined of Natural Language Processing and Machine Learning. We use Natural Language Processing tools, such as OpenNLP, Stanford Lexical Dependency Parser in order to explore sentences. We then extract these sentences based on English pattern in order to build training set. We use a random sample among 245 categories of ACM to evaluate our results. Results generated show that our system yields superior performance.
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
Current lexica and machine learning based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training is time consuming and error-prone. Second, the
prediction’s accuracy entails sentences and their corresponding training text
should fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that the utilizing different NLP
pipelines and semantic relationships impacts the quality of the sentiment
analyzers. In particular, results indicate that coupling lemmatization and
knowledge-based n-gram features proved to produce higher accuracy results.
With this coupling, the accuracy of the SVM classifier has improved to
90.43%, while it was 86.83%, 90.11%, 86.20%, respectively using the three
other classifiers.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
Natural Language Processing NLP is the one of the major filed of Natural Language Generation NLG . NLG can generate natural language from a machine representation. Generating suggestions for a sentence especially for Indian languages is much difficult. One of the major reason is that it is morphologically rich and the format is just reverse of English language. By using deep learning approach with the help of Long Short Term Memory LSTM layers we can generate a possible set of solutions for erroneous part in a sentence. To effectively generate a bunch of sentences having equivalent meaning as the original sentence using Deep Learning DL approach is to train a model on this task, e.g. we need thousands of examples of inputs and outputs with which to train a model. Veena S Nair | Amina Beevi A ""Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
Semantic based automatic question generation using artificial immune system
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.5, No.8, 2014
Semantic Based Automatic Question Generation using Artificial
Immune System
Ibrahim Eldesoky Fattoh
Teacher Assistant, Faculty of Information Technology, MUST University, Cairo, Egypt.
Abstract
This research introduces a semantic based Automatic Question Generation system that uses both Semantic Role
Labeling and Named Entity Recognition techniques to convert the input sentence into a semantic pattern. A
training phase is applied to build a classifier, using an Artificial Immune System, that is able to classify the
patterns according to question type. The question types considered here are WH-questions such as who,
when, where, why, and how. A pattern matching phase is then applied to select the best matching question
pattern for the test sentence. The proposed system is tested against a set of sentences obtained from different
sources, including Wikipedia articles, the TREC 2007 dataset for question answering, and an English book of grade II prep.
The proposed model shows promising results, determining the question type with a classification accuracy
of 95%, and generating (matching) the new question patterns with 87% accuracy.
1. Introduction
Natural language processing is a popular area of artificial intelligence with a wide range of applications,
such as text summarization, machine translation, question answering, and question generation [1]. Question
generation is an important component in advanced learning technologies such as intelligent tutoring systems. In
recent years, the field of Natural Language Processing (NLP) has made a wide transformation from studying
questions as a part of the question answering task to the task of Automatic Question Generation.
Developing Automatic Question Generation systems has been an important research issue because it
requires insights from a variety of disciplines, including Artificial Intelligence, Natural Language
Understanding, and Natural Language Generation. One definition of Question Generation is the process
of automatically generating questions from various inputs such as raw text, a database, or a semantic representation
[2]. The process of AQG is divided into a set of subtasks: content selection, question type
identification, and question construction. According to previous work on question
generation, there are two question formats. The first is multiple choice questions, which ask about a word in a
given sentence; the word may be an adjective, adverb, vocabulary item, etc. The second format is entity question
systems, or Text-to-Text QG (such as factual questions), which ask about a word or phrase corresponding to a
particular entity in a given sentence; the question types are what, who, why, etc. This research introduces a
learning model using the Artificial Immune System (AIS) for the second type of question format. The proposed
learning model depends on the semantic roles extracted by SRL, the named entities extracted by NER, and
the suitability of the artificial immune system for supervised learning problems, to build a classifier generator model
that classifies the question type and then generates the best matching question(s) for a given sentence. The rest of
the paper is organized as follows: section 2 discusses related work on AQG, section 3 describes AIS,
section 4 introduces both SRL and NER, section 5 introduces the proposed model AIQGS, section 6 shows the
experimental results and evaluation, and section 7 presents conclusions and future work.
2. Related work
In this section, a review of previous AQG systems that generate from a sentence, for the second question format
mentioned in section 1, is introduced. Previous efforts in QG from text can be broadly divided into three categories:
syntax based, semantics based, and template based. The three categories are not entirely disjoint. In the first
category, syntax based methods follow a common technique: parse the sentence to determine its syntactic structure,
simplify the sentence if possible, identify key phrases, and apply syntactic transformation rules and question
word replacement. There are many systems in the literature for syntax based methods; Kalady et al. [3], Varga
and Ha [4], Wolfe [5], and Ali et al. [6] are a sample of these methods. In the second category, semantic
based methods also rely on transformations to generate questions from a declarative sentence, but they depend on
semantic rather than syntactic analysis. Mannem et al. [7] present a system that combines SRL with
syntactic transformations to generate questions. Yao and Zhang [8] demonstrate an approach to QG based on
minimal recursion semantics (MRS), a framework for shallow semantic analysis developed by Copestake et
al. [9]; their method uses an eight-stage pipeline to convert input text into a set of questions. The third category,
template based methods, offers the ability to ask questions that are not as tightly coupled to the exact wording of
the source text as syntax and semantics based methods. Cai et al. [10] presented NLGML (Natural Language
Generation Markup Language), a language that can be used to generate questions of any desired type. Wang et
al. [11] introduced a system to generate questions automatically based on question templates which are
created by training the system on many medical articles. The model used in this research belongs to the semantic
based methods, with a learning phase that introduces a classifier to decide the question type and a
recognizer for the question pattern that specifies the generated question.
3. Artificial Immune System
An Artificial Immune System is a computational model based upon metaphors of the natural immune system
[12]. From an information-processing perspective, the immune system is a parallel and distributed adaptive
system with a partially decentralized control mechanism. The immune system utilizes feature extraction, learning,
memory storage, and associative retrieval in order to solve recognition and classification tasks. In particular, it
learns to recognize relevant patterns, remembers patterns that have been seen previously, and is able to
efficiently construct pattern detectors. The overall behavior is an emergent property of many local interactions.
Such information-processing capabilities of the immune system provide many useful aspects in the field of
computation [13]. AIS emerged as a new branch of Computational Intelligence (CI) in the 1990s. A number
of AIS models exist, and they are used in pattern recognition, fault detection, computer security, data analysis,
scheduling, machine learning, search optimization, and a variety of other applications that researchers are
exploring in science and engineering [12, 13]. There are four basic algorithms in AIS: the
network model, positive selection, negative selection, and clonal selection [12]. The main features of the clonal
selection theory [14] are proliferation and differentiation on stimulation of cells with antigens; generation of new
random genetic changes, expressed subsequently as a diverse antibody pattern; and elimination of newly
differentiated lymphocytes carrying low affinity antigenic receptors. The clonal selection algorithm is the one
used in the learning phase of the proposed model, as will be shown in section 5.
4. Semantic Role Labeling (SRL) and Named Entity Recognition (NER)
The natural language processing community has recently experienced a growth of interest in semantic roles,
since they describe WHO did WHAT to WHOM, WHEN, WHERE, WHY, HOW, etc. for a given situation, and
contribute to the construction of meaning [15]. SRL has been used in many different applications, such as automatic
text summarization [15] and automatic question answering [16]. Given a sentence, a semantic role labeler
attempts to identify the predicates (relations and actions) and the semantic entities associated with each of those
predicates. The set of semantic roles used in PropBank [17] includes both predicate-specific roles, whose precise
meaning is determined by their predicate, and general-purpose adjunct-like modifier roles, whose meaning is
consistent across all predicates. The predicate-specific roles are Arg0, Arg1, ..., Arg5 and ArgA. Table 1 shows a
complete list of the modifier roles. For a sentence like "Columbus discovered America in 1492", the SRL
parse would be as seen in (1).
[Columbus /Arg0] [discovered /v:Discover] [America /Arg1] [in 1492/ ArgM-Tmp] (1)
Table 1: PropBank Argument Roles
Role Meaning
ArgM-LOC Location
ArgM-EXT Extent
ArgM-DIS Discourse connectives
ArgM-ADV Adverbial
ArgM-NEG Negation marker
ArgM-MOD Modal verb
ArgM-CAU Cause
ArgM-TMP Temporal
ArgM-PNC Purpose
ArgM-MNR Manner
ArgM-DIR Direction
ArgM-PRD Secondary prediction
Recognition of named entities (e.g. people, organizations, locations, etc.) is also an essential task in many natural
language processing applications nowadays. Named entity recognition (NER) has received much attention in the
research community, and considerable progress has been achieved in many domains, such as newswire and
biomedical text [18]. For a sentence like "Columbus discovered America in 1492", the output of NER would be
as in (2).
[Person Columbus] discovered [GPE America] in [date 1492]. (2)
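The construction of a semantic pattern from annotations like (1) and (2) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the annotations are supplied by hand here, whereas the paper obtains them from the University of Illinois SRL and NER tools.

```python
# Hypothetical sketch of turning SRL and NER annotations into a semantic
# pattern, in the spirit of examples (1) and (2). Each annotated span is
# replaced by its combined SRL/NER tag; the predicate is kept as-is.

def build_pattern(annotated):
    """Replace each tagged span with its tag, keeping untagged tokens."""
    return " ".join(f"<{tag}>" if tag else token for token, tag in annotated)

# (span, tag) pairs merging the SRL parse (1) with the NER output (2)
sentence = [
    ("Columbus", "Arg0/Person"),
    ("discovered", None),            # the predicate stays as-is
    ("America", "Arg1/GPE"),
    ("in 1492", "ArgM-TMP/Date"),
]

print(build_pattern(sentence))
# <Arg0/Person> discovered <Arg1/GPE> <ArgM-TMP/Date>
```

The resulting pattern string abstracts away the concrete words, so that sentences with the same role/entity structure map to the same pattern.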
The attributes extracted from both NER and SRL act as the targets we search for in the sentence to identify
the question type. Table 2 shows the extracted attributes, their source, and their associated question type as
used in this research.
Table 2: Attributes used from SRL and NER and their question types
Target Source Question Type
<AM-MNR> SRL How
<AM-CAUS> SRL Why
<AM-ADV> starts with for SRL Why
<AM-PNC> starts with to SRL Why
<Person> NER Who
<AM-LOC> SRL Where
<Location> NER Where
<AM-TMP> SRL When
<Date> NER When
<Time> NER When
The next section will show how SRL and NER are used in the content selection phase of the proposed model
for AQG.
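The mapping in Table 2 amounts to a lookup from a target attribute to its question type, with two conditional cases that depend on the first word of the phrase. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
# Lookup table reproducing Table 2: each SRL/NER target attribute maps
# to the WH-question type it triggers. AM-ADV and AM-PNC are handled
# separately because their type depends on the phrase's first word.

QUESTION_TYPE = {
    "AM-MNR": "How",       # SRL manner
    "AM-CAUS": "Why",      # SRL cause
    "Person": "Who",       # NER person
    "AM-LOC": "Where",     # SRL location
    "Location": "Where",   # NER location
    "AM-TMP": "When",      # SRL temporal
    "Date": "When",        # NER date
    "Time": "When",        # NER time
}

def question_type(target, phrase=""):
    """Return the question type for a target attribute, or None."""
    if target == "AM-ADV" and phrase.lower().startswith("for"):
        return "Why"
    if target == "AM-PNC" and phrase.lower().startswith("to"):
        return "Why"
    return QUESTION_TYPE.get(target)

print(question_type("AM-TMP"))            # When
print(question_type("AM-PNC", "to win"))  # Why
```

A sentence may contain several targets, in which case each matching target yields a candidate question type.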
5. Proposed Model (AIQGS)
To generate a question, we need to perform named entity recognition, semantic role labeling, and sentence and
question pattern generation for the sentences used in training; use a learning algorithm to generate a classifier
able to classify the test sentence according to its question type; and use the classifier patterns to generate the
question pattern for the test sentence. The overall view of AIQGS is shown in figure 1. The input sentence is
passed to the feature extraction phase, where it is passed to both the NER system and
the SRL system to extract its attributes, as shown in (1) and (2) in the last section, and its pattern is then constructed. After
generating the sentence pattern, its question pattern(s) are also generated. Both the NER and SRL systems of the University of
Illinois [19, 20] were used in this research.
ALG_DATA_PREPERATION (Sentence)
For each sentence S in the training set
    Sent_Pat ← Get the sentence pattern using NER and SRL
    Set the question type(s) for the Sent_Pat
    Quest_Pat ← Build the question pattern(s) for the Sent_Pat
End for
Return Sent_Pat and Quest_Pat.
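The data preparation step can be sketched in Python roughly as follows. This is a hypothetical sketch: the two helper functions stand in for the NER/SRL pattern extraction and question pattern construction described above, and are stubbed out with the Columbus example so the sketch is runnable.

```python
# Hypothetical sketch of ALG_DATA_PREPERATION. In a real system the two
# helpers below would invoke the NER and SRL tools; here they are stubs.

def get_sentence_pattern(sentence):
    # Placeholder: a real system would run NER and SRL on the sentence.
    return "<Arg0/Person> discovered <Arg1/GPE> <ArgM-TMP/Date>"

def build_question_patterns(sent_pat, question_types):
    # Placeholder: build one question pattern per question type.
    return [f"{qt} <pattern: {sent_pat}>?" for qt in question_types]

def prepare_training_data(training_set):
    """Return (sentence pattern, question patterns) pairs."""
    prepared = []
    for sentence, question_types in training_set:
        sent_pat = get_sentence_pattern(sentence)
        quest_pats = build_question_patterns(sent_pat, question_types)
        prepared.append((sent_pat, quest_pats))
    return prepared

data = prepare_training_data(
    [("Columbus discovered America in 1492.", ["Who", "When"])]
)
print(data[0][1][0])  # Who <pattern: ...>?
```

Each training sentence thus contributes one sentence pattern together with one question pattern per applicable question type.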
After the training data set is prepared as shown in ALG_DATA_PREPERATION, it is passed to the learning phase. An AIS classifier generator based on the clonal selection mechanism is proposed for the training phase. The proposed algorithm is shown in ALG_TRAIN.
Figure 1: Flow diagram of the AQG process
The training phase starts by choosing random patterns from the training patterns to act as memory cells (MCs). Then, for each antigen (AG) in the training data, the affinity between the antigen and every memory cell is measured using the Euclidean distance, and a stimulation value is calculated for each memory cell. The memory cell with the highest stimulation value is chosen to be cloned in ALG_CLONAL_SELECTION. The stimulation value of each mutated clone is also calculated, and the clone with the highest stimulation value is chosen as a memory cell candidate; in addition, if the mutated clone's stimulation value is greater than that of the memory cell, the memory cell is removed from the MCs pool.
ALG_TRAIN (Sent_Pat, Quest_Pat)
MCPool ← Get random sentence and question patterns from Sent_Pat and Quest_Pat for each class
For each antigen (AG) in the training patterns
    Mc_Aff ← Get the affinity (AG, class MCs)
    Mc_Stim_Val ← Set the stimulation value for each MC (1 - Mc_Aff)
    Mc_High ← Get the MC with the highest stimulation value and its question pattern
    Mc_Clones ← CLONAL_SELECTION (Mc_High, AG, Mc_Stim_Val)
End for
ALG_CLONAL_SELECTION (MC, AG, Stim_Val)
Number_of_Clones = Stim_Val * Clonal_Rate
For each new clone
    Clone_Aff ← Get the affinity (clone, AG)
    Mutated_Clone ← Mutate the clone
    Mutated_Clone_Stim_Val ← Calculate the mutated clone's stimulation value with the AG
    Select the mutated clone with the highest stimulation value and put it in the MCs pool
    If Mutated_Clone_Stim_Val > Stim_Val
        Remove the MC from the MCs pool
End for
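The affinity and stimulation computations and the clonal-selection step can be sketched in Python. This is a hedged reconstruction under stated assumptions: patterns are fixed-length numeric vectors, the mutation scheme pulls each position toward the antigen with some probability, and the clonal rate is a free parameter; none of these details beyond the text above come from the paper.

```python
import math
import random

def affinity(a, b):
    """Euclidean distance between two numeric patterns."""
    return math.dist(a, b)

def stimulation(a, b):
    """Stimulation value as defined above: 1 - affinity."""
    return 1.0 - affinity(a, b)

def clonal_selection(mc, antigen, stim_val, clonal_rate=10, mut_prob=0.3,
                     rng=random):
    """Clone the highest-stimulated memory cell, mutate the clones,
    and return the best mutated clone with its stimulation value."""
    n_clones = max(1, int(stim_val * clonal_rate))
    best_clone, best_stim = None, float("-inf")
    for _ in range(n_clones):
        # Assumed mutation scheme: each position is pulled toward the
        # antigen with probability mut_prob.
        clone = [antigen[i] if rng.random() < mut_prob else g
                 for i, g in enumerate(mc)]
        s = stimulation(clone, antigen)
        if s > best_stim:
            best_clone, best_stim = clone, s
    return best_clone, best_stim
```

If the returned stimulation exceeds the original memory cell's value, the caller removes that memory cell from the pool and keeps the mutated clone, as in ALG_CLONAL_SELECTION.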
After the training phase finishes, the classifier consists of the memory cells in the memory cell pool; the question patterns of these memory cells remain associated with them, as do those of every mutated clone added to the memory cell pool.
The testing phase starts by preparing the new sentence as a pattern, then calculating the stimulation value of each memory cell in the classifier against that pattern. After that, the stimulation value of each class is calculated by summing the stimulation values of all memory cells belonging to the class. The class with the highest average stimulation value is chosen as the question type for the sentence. The memory cell(s) in that class whose stimulation value exceeds a given threshold (0.5 in this research) are chosen, and their question patterns are retrieved as the questions generated for the input sentence. The generated pattern is then mapped back into a sentence to form the actual question.
ALG_TEST (Test_Sentence)
AG ← DATA_PREPERATION (Test_Sentence)
For each MC in MCsPool
    Calculate the stimulation value (MC, AG)
End for
For each class
    Calculate the average stimulation value of the class for that AG
End for
Return the class with the highest average stimulation value as the question type
Get the question pattern(s) from the MCs pool belonging to the MC(s) with stimulation value > 0.5
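ALG_TEST can be sketched as below, where memory cells are (vector, class label, question pattern) triples and the stimulation function is the one defined for training. Names are illustrative, not from the paper.

```python
# Sketch of ALG_TEST: pick the class with the highest average stimulation,
# then return the question patterns of that class's memory cells whose
# stimulation exceeds the 0.5 threshold used in the paper.
def classify(test_pattern, memory_cells, stimulation, threshold=0.5):
    """memory_cells: list of (vector, class_label, question_pattern)."""
    by_class = {}
    for vec, label, _ in memory_cells:
        by_class.setdefault(label, []).append(stimulation(vec, test_pattern))
    best = max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))
    patterns = [qp for vec, label, qp in memory_cells
                if label == best and stimulation(vec, test_pattern) > threshold]
    return best, patterns
```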
5.1 Walk through example for data preparation
Applying the data preparation phase to the example shown in section 2, merging the outputs of (1) and (2) yields a pattern like (3):
{Arg0}<person> + {VBD} + {Arg1}<GPE> + {AM-TMP}<date> (3)
The output of the SRL is between { } and the output of the NER is between < >. Table 2 shows how the question type is chosen according to the targets extracted by NER and SRL. Searching pattern (3), the targets Person, AM-TMP, and Date are found, so two question patterns are generated for the sentence: one for a WHO question and one for a WHEN question. The two patterns are
Who + {VBD} + {Arg1} <GPE> + {AM-TMP} <date>? (4)
When + did + {Arg0} <person> + {V} + {Arg1} <GPE>? (5)
The verb {V} is obtained from the output of the SRL labeling. The sentence pattern (3) and its question patterns (4) and (5) are interpreted as two feature vectors in the training set, one vector for each question type. The feature vector is represented numerically with the values 1, 0, and 5: one means that the attribute exists in the sentence, zero means that the attribute does not exist in the sentence, and five marks the attribute(s) that determine the question type.
6. Experimental Results
The proposed system is applied to a set of sentences extracted from different sources: Wikipedia articles, the TREC 2007 dataset for question answering, and an English book of grade II prep. 170 sentences are extracted and mapped into 250 patterns using SRL and NER as shown previously. The number of patterns is greater than the number of sentences because some sentences are mapped into more than one pattern when the sentence has more than one SRL or NER label. The 250 patterns are used in training and testing. The system is tested using cross-validation with 10 folds; in each fold, 25 patterns are chosen for testing. Two trials are applied to the system. In the first trial, 25 patterns are randomly chosen as memory cells in each fold; in the second trial, 50 patterns are chosen as memory cells in each fold. Each time, after extracting the memory cells and the test patterns, the training phase is applied to the memory cells and the remaining patterns in the training set to generate the classifier, and testing is then applied to the chosen 25 patterns. Table 3 shows the results obtained from the two trials in each fold: the number of memory cells, the number of truly classified patterns, the number of falsely classified patterns, and the number of truly generated patterns for the truly classified sentences.
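The cross-validation setup just described can be sketched as an index split over the 250 patterns; the shuffling seed and function name are assumptions for illustration.

```python
import random

# Sketch of the 10-fold cross-validation split: 250 patterns, 25 held out
# for testing in each fold, the remaining 225 available for training.
def ten_fold_indices(n_patterns=250, n_folds=10, seed=0):
    idx = list(range(n_patterns))
    random.Random(seed).shuffle(idx)
    size = n_patterns // n_folds  # 25 patterns per test fold
    return [(idx[i * size:(i + 1) * size],              # test indices
             idx[:i * size] + idx[(i + 1) * size:])     # training indices
            for i in range(n_folds)]
```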
Table 3: Results of classification and question patterns generated using 25 and 50 M-cells in each test fold
EXP#   No. of    No. of truly   No. of falsely   No. of truly        Percent of true   Percent of truly
       M-Cells   classified     classified       generated patterns  classification    generated patterns
 1       25          20              5                 16                80.00%             80.00%
 1       50          23              2                 19                92.00%             82.61%
 2       25          19              6                 12                76.00%             63.16%
 2       50          25              0                 22               100.00%             88.00%
 3       25          20              5                 14                80.00%             70.00%
 3       50          23              2                 19                92.00%             82.61%
 4       25          22              3                 15                88.00%             68.18%
 4       50          25              0                 22               100.00%             88.00%
 5       25          22              3                 17                88.00%             77.27%
 5       50          23              2                 22                92.00%             95.65%
 6       25          21              4                 15                84.00%             71.43%
 6       50          24              1                 19                96.00%             79.17%
 7       25          22              3                 16                88.00%             72.73%
 7       50          25              0                 21               100.00%             84.00%
 8       25          20              5                 15                80.00%             75.00%
 8       50          23              2                 21                92.00%             91.30%
 9       25          19              6                 13                76.00%             68.42%
 9       50          25              0                 23               100.00%             92.00%
10       25          18              7                 14                72.00%             77.78%
10       50          23              2                 21                92.00%             91.30%
The totals and averages of the overall test results for the 10 folds of the two trials are shown in table 4 and figure 2. From tables 3 and 4, it is clear that increasing the number of memory cells at the beginning of the training phase increases the number of patterns in the classifier. This, in turn, gives the system the ability to classify more accurately and increases the number of question patterns that are generated or recognized correctly. Out of the 10 folds, the classification accuracy reached 100% four times when the number of memory cells at the beginning of the training phase was increased to 50. The percentage of truly generated patterns also increased in all folds, as shown in the last column of table 3.
Figure 2: (A) the 10 folds of trial 1; (B) the 10 folds of trial 2
The overall classification accuracy increased from 81% to more than 95% when the number of memory cells was increased in the second trial, as shown in table 4 and figure 3. This increase in classification accuracy leads
to an increase in the total number of truly generated patterns, which jumps to 209 out of 239 truly classified patterns. The increase in both the total truly classified and the total truly generated patterns reflects the effect of increasing the memory cells in the training process, yielding a good classifier that produces an accurate classification ratio.
Table 4: the total of both truly classified and generated patterns for the two trials
Trial No.   No. of M-Cells   Total truly classified   Total truly generated   Percentage of truly classified   Percentage of truly generated
    1             25                  203                      147                       81.20%                          72.41%
    2             50                  239                      209                       95.60%                          87.45%
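The percentages in table 4 follow directly from the raw counts: classification is measured over all 250 test patterns, and generation over the truly classified patterns. A quick check:

```python
# Recompute the Table 4 percentages from the raw counts.
def percentages(truly_classified, truly_generated, total_patterns=250):
    pct_classified = round(100 * truly_classified / total_patterns, 2)
    pct_generated = round(100 * truly_generated / truly_classified, 2)
    return pct_classified, pct_generated
```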
The most important measures for classification problems are the precision, recall, and f-measure for each class.
Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
F-measure = (2 × Precision × Recall) / (Precision + Recall) (3)
where TP, FP, and FN are the true-positive, false-positive, and false-negative counts for a class.
Figure 3: Total truly classified and total truly generated patterns for the two trials
The precision measures the probability that the retrieved class is relevant, the recall measures the probability that
a relevant class is retrieved, and the F-measure is the harmonic mean of precision and recall. These measures are
shown in table 5 for each question type in this problem.
Table 5: precision, recall, and F-measure for classification of question types
Question Type Precision Recall F-measure
Who 0.898 1 0.946
How 0.973 1 0.986
Why 1 0.971 0.985
When 1 0.828 0.906
Where 1 1 1
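The precision, recall, and F-measure values above can be computed from per-class counts as below; the counts used in the example are illustrative, not the paper's.

```python
# Precision, recall, and F-measure from per-class true-positive (tp),
# false-positive (fp), and false-negative (fn) counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```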
The number of truly generated patterns differs for each question type used in this research. The reason is the difference in the number of patterns that exist in the training set for each question type; also, since the memory cells are chosen randomly in each fold, the patterns chosen for one type may sometimes favor or penalize the other types. There are no common measurement units for the automatic question generation problem: most systems use manual evaluation by experts, and some others use precision, recall, and F-measure to assess the acceptability of the generated questions. In this research, because the question patterns of the test sentences are prepared in advance, we can use precision, recall, and F-measure. Table 6 and figure 4 illustrate the precision, recall, and F-measure for each generated question type used in this research. The precision of the generated questions for When, Where, and Why exceeds 90% of their tested patterns, while How and Who are below this percentage; increasing the training patterns for these two types may increase the percentage of correctly generated questions for them.
Table 6: precision, recall, and F-measure for each generated question type.
Question Type Precision Recall F-Measure
Who 0.818 0.9 1
How 0.886 1 0.939
Why 0.909 0.882 0.896
When 0.917 0.772 0.838
Where 0.941 1 0.970
Figure 4: Precision, Recall, and F-measure for all question types
Table 6 shows that Where, When, and Why have the highest precision values and also the highest percentages of truly generated questions. We can therefore say that as the precision value of a question type increases, the percentage of its truly generated questions increases.
7. Conclusion and future work
This research introduced a novel automatic question generation model based on a learning model built on the meta-dynamics of the Artificial Immune System. The learning model depends on mapping the input sentence into a pattern whose attributes come from two important natural language processing models, SRL and NER, which together build the pattern of the sentence. The training phase generates the classifier based on the clonal selection algorithm inherited from natural immunology. The meta-dynamics of clonal selection, which apply cloning and mutation to the highly stimulated patterns, lead to a strong classifier able to classify new patterns with accuracy greater than 95%. The classification ratio obtained demonstrates the power of the artificial immune system in multi-class classification problems, constituting the first contribution of this research: a learning model for identifying the question type. Preserving the question patterns during training together with the memory cells that constitute the classifier is the idea used to generate the question pattern for a new test pattern, providing the second contribution of this research: a new methodology for generating the question. The percentage of truly generated patterns reached 87%, which appears to be a promising ratio for this problem compared to other techniques used in generating questions automatically. Most related systems use a set of predefined rules that perform syntactic transformations from a sentence to a question, like [6] and [7]. Unfortunately, the system cannot be compared directly to other systems for several reasons: the data set used by every system is different, and the question types generated also differ from system to system. We plan to introduce a set of enhancements in the future: taking into account the syntactic information of the sentence besides the semantic information already used; increasing the number of training sentences and question types, such as what, which, how many, and how long questions; and trying to enhance the classification accuracy by hybridizing clonal selection with a genetic algorithm, since increasing the classification accuracy may lead to an increase in the truly generated patterns.
References:
[1] Bednarik L. and Kovacs L.. “Implementation and assessment of the automatic question generation module”.
CogInfoCom 2012. 3rd IEEE International Conference on Cognitive infocommunications. December 2-5, 2012.
Kosice, Slovakia.
[2] Rus V., Cai Z., and Graesser A.. “Question Generation: Example of A Multi-year Evaluation Campaign”. In
Proceedings of 1st Question Generation Workshop, September 25-26, 2008, NSF, Arlington, VA.
[3] Kalady S., Elikkottil A., and Das R.. ”Natural language question generation using syntax and keywords”. In
Proceedings of QG2010: The Third Workshop on Question Generation, pages 1–10, 2010.
[4] Varga A., and Ha L. “ Wlv: A question generation system for the qgstec 2010 task b” . In Proceedings of
QG2010: The Third Workshop on Question Generation, pages 80–83, 2010.
[5] Wolfe J.” Automatic question generation from text-an aid to independent study”. In ACM SIGCUE Outlook,
volume 10, pages 104–112. ACM, 1976.
[6] Ali H., Chali Y., and Hasan S. “ Automation of question generation from Sentences”. In Proceedings of
QG2010: The Third Workshop on Question Generation, pages 58–67, 2010.
[7] Mannem P., Prasad R., and Joshi A.. “Question generation from paragraphs at upenn: Qgstec system
description” . In Proceedings of QG2010: The Third Workshop on Question Generation, pages 84–91, 2010.
[8] Yao X., and Zhang Y. “Question generation with minimal recursion semantics”. In Proceedings of QG2010:
The Third Workshop on Question Generation, pages 68–75. Citeseer, 2010.
[9] Copestake A., Flickinger D., Pollard D., and Sag I.” Minimal recursion semantics: An introduction”.
Research on Language and Computation, 3(2-3):281–332, 2005.
[10] Cai Z., Rus V., Kim H., Susarla S.C., Karnam P., and Graesser A.C. "NLGML: A markup language for question generation". In Thomas Reeves and Shirley Yamashita, editors, Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2006, pages 2747–2752, Honolulu, Hawaii, USA, October 2006. AACE.
[11] Wang W., Hao T., and Liu W., “Automatic Question Generation for Learning Evaluation in Medicine “.
Advances in Web Based Learning- ICWL 2007(2008), 242-251.
[12] De Castro L. N., "Immune Cognition, Micro-evolution, and a Personal Account on Immune Engineering". S.E.E.D. Journal (Semiotics, Evolution, Energy, and Development), University of Toronto, 3(3), 2003.
[13] Dasgupta D., "Advances in Artificial Immune Systems". IEEE Computational intelligence magazine,
November 2006.
[14] De Castro L. N., and Von Zuben F. J. "Learning and Optimization Using the Clonal Selection Principle". IEEE Transactions on Evolutionary Computation, Vol. 6, No. 3, June 2002.
[15] Trandabat D. “ Using Semantic Roles to Improve Summaries”.
ENLG '11 Proceedings of the 13th European Workshop on Natural Language Generation.
pages 164-169. 2011.
[16] Pizzato L., and Molla D..”Indexing on Semantic Roles for Question Answering” . Coling 2008: Proceedings
of the 2nd workshop on Information Retrieval for Question Answering (IR4QA). pages 74–81. Manchester, UK.
August 2008.
[17] Palmer M., Gildea D., and Kingsbury P.. “The proposition bank: An annotated corpus of semantic roles”.
Computational Linguistics, 31(1):71–106, 2005.
[18] Tkachenko M., and Simanovsky A. "Named Entity Recognition: Exploring Features". Proceedings of KONVENS 2012 (Main track: oral presentations), Vienna, September 20, 2012.
[19] Ratinov L., and Roth D.” Design Challenges and Misconceptions in Named Entity Recognition”. CoNLL
(2009)
[20] Punyakanok V., Roth D., and Yih W. "The Importance of Syntactic Parsing and Inference in Semantic Role Labeling". Computational Linguistics (2008).