This summary provides the key details from the document in 3 sentences:
The document discusses using a machine learning approach to classify traceability links between requirements. It proposes a 2-learner model that uses both lexical features from word pairs and features derived from a hand-built ontology. The model achieves a 56% reduction in error compared to a baseline using only lexical features, and performance is improved further by generating additional pseudo training instances from the ontology.
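As a rough illustration of combining lexical word-pair features with ontology-derived features and ontology-generated pseudo instances, the sketch below uses scikit-learn; the feature definitions, toy ontology, word pairs, and pseudo-instance rule are all invented for illustration and are not the paper's setup.

```python
# Sketch: traceability-link classification from lexical + ontology-derived
# features, with extra pseudo training instances generated from the ontology.
# All data, feature names and the ontology are invented for illustration.
from sklearn.linear_model import LogisticRegression

ONTOLOGY_RELATED = {("engine", "throttle"), ("brake", "pedal")}   # toy ontology edges

def features(word_a, word_b):
    return [
        1.0 if word_a == word_b else 0.0,                                  # lexical: exact match
        len(set(word_a) & set(word_b)) / len(set(word_a) | set(word_b)),   # lexical: char overlap
        1.0 if (word_a, word_b) in ONTOLOGY_RELATED else 0.0,              # ontology feature
    ]

# Hand-labelled word pairs (1 = linked requirement terms, 0 = unrelated).
pairs = [("engine", "engine", 1), ("engine", "throttle", 1),
         ("engine", "invoice", 0), ("brake", "salary", 0)]

# Pseudo instances: every ontology edge is added as an extra positive example.
pseudo = [(a, b, 1) for a, b in ONTOLOGY_RELATED]

X = [features(a, b) for a, b, _ in pairs + pseudo]
y = [label for _, _, label in pairs + pseudo]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("brake", "pedal")]))   # the ontology feature pushes this toward 1
```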
Conceptual similarity measurement algorithm for domain specific ontologyZac Darcy
This paper presents a similarity measurement algorithm for domain-specific terms collected in an ontology-based data integration system. The algorithm can be used in ontology mapping and in the query service of such a system; in this paper, we focus on the web query service as the application of the proposed algorithm. Concept similarity is important for the web query service because the words in a user's input query do not wholly match the concepts in the ontology, so we need to extract the concepts that match or relate to the input words with the help of the machine-readable dictionary WordNet. For some words whose similarity cannot be confirmed by WordNet, we use generated mapping rules in the query generation procedure. We demonstrate the effect of this algorithm with two-degree semantic results of web mining by generating the concept results obtained from the input query.
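As an illustration of the WordNet-based matching step described above, the following sketch (a minimal example, not the paper's implementation; the ontology labels and threshold are made up) looks up noun synsets of each query word with NLTK and keeps the ontology concepts whose best path similarity exceeds a threshold.

```python
# Minimal sketch of matching query words to ontology concept labels via WordNet.
# Assumes nltk is installed and the WordNet corpus has been downloaded
# (nltk.download("wordnet")). Concept labels and threshold are illustrative.
from nltk.corpus import wordnet as wn

ONTOLOGY_CONCEPTS = ["vehicle", "hotel", "airline", "restaurant"]  # hypothetical labels

def best_similarity(word_a, word_b):
    """Highest path similarity over all noun synset pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(word_a, pos=wn.NOUN):
        for s2 in wn.synsets(word_b, pos=wn.NOUN):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def match_concepts(query_words, threshold=0.3):
    """Return ontology concepts that match or relate to the query words."""
    matches = {}
    for word in query_words:
        for concept in ONTOLOGY_CONCEPTS:
            score = best_similarity(word, concept)
            if score >= threshold:
                matches.setdefault(word, []).append((concept, round(score, 2)))
    return matches

if __name__ == "__main__":
    print(match_concepts(["car", "motel"]))
```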
Semantic similarity, and semantic relatedness in particular, is very important in the current scenario due to the huge demand for natural-language-processing-based applications such as chatbots and information retrieval systems such as knowledge-base-driven FAQ systems. Current approaches generally use similarity measures that do not exploit the context-sensitive relationships between words, which leads to erroneous similarity predictions and limits their usefulness in real-life applications. This work proposes a novel approach that gives an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction yields more accurate similarity predictions and, in turn, higher accuracy for information retrieval systems.
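One way to read "taking their context into consideration" is to disambiguate each word against its sentence before measuring similarity. The sketch below is an interpretation, not the authors' method: it picks a sense for each word with NLTK's Lesk implementation and then compares the chosen senses with Wu-Palmer similarity.

```python
# Sketch: context-corrected word relatedness via sense disambiguation.
# Requires nltk with the 'wordnet' and 'punkt' data packages downloaded.
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def contextual_relatedness(sentence, word1, word2):
    """Disambiguate both words in the sentence, then compare the chosen senses."""
    context = word_tokenize(sentence.lower())
    sense1 = lesk(context, word1, pos=wn.NOUN)
    sense2 = lesk(context, word2, pos=wn.NOUN)
    if sense1 is None or sense2 is None:
        return None  # no noun sense found for one of the words
    return sense1.wup_similarity(sense2)  # Wu-Palmer similarity of the chosen senses

if __name__ == "__main__":
    s = "I deposited the cheque at the bank near the river"
    print(contextual_relatedness(s, "bank", "cheque"))
```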
There is a vast amount of unstructured Arabic information on the Web; this data is usually organized as semi-structured text and cannot be used directly. This research proposes a semi-supervised technique that extracts binary relations between two Arabic named entities from the Web. Several works have addressed relation extraction from Latin-script texts, but as far as we know there is no work on Arabic text using a semi-supervised technique. The goal of this research is to extract a large list or table of named entities and their relations in a specific domain. Only a handful of instance relations are required as input from the user. The system exploits summaries from the Google search engine as source text; these instances are used to extract patterns, and the output is a set of new entities and their relations. Results from four experiments show that precision and recall vary according to relation type: precision ranges from 0.61 to 0.75 while recall ranges from 0.71 to 0.83. The best result is obtained for the (player, club) relationship, with 0.72 precision and 0.83 recall.
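A rough sketch of the bootstrapping idea (seed pairs → patterns → new pairs) is shown below; it works over an in-memory list of snippets instead of live Google results, and the seed pairs, snippets, and pattern format are invented for illustration.

```python
# Sketch of semi-supervised relation bootstrapping from search-engine snippets.
# Seeds, snippets, and the pattern format are illustrative only.
import re

SEEDS = [("Messi", "Barcelona"), ("Salah", "Liverpool")]      # (player, club) seeds
SNIPPETS = [
    "Messi plays for Barcelona in La Liga.",
    "Salah plays for Liverpool since 2017.",
    "Benzema plays for Real Madrid.",
]

def extract_patterns(seeds, snippets):
    """Turn the text between a known entity pair into a reusable pattern."""
    patterns = set()
    for ent1, ent2 in seeds:
        for snippet in snippets:
            match = re.search(re.escape(ent1) + r"\s+(.{1,40}?)\s+" + re.escape(ent2), snippet)
            if match:
                patterns.add(match.group(1).strip())   # e.g. "plays for"
    return patterns

def extract_pairs(patterns, snippets):
    """Apply each pattern to find new candidate entity pairs."""
    pairs = set()
    for pattern in patterns:
        for snippet in snippets:
            for m in re.finditer(r"([A-Z]\w+)\s+" + re.escape(pattern) + r"\s+([A-Z]\w+)", snippet):
                pairs.add((m.group(1), m.group(2)))
    return pairs

patterns = extract_patterns(SEEDS, SNIPPETS)
print(patterns)                            # {'plays for'}
print(extract_pairs(patterns, SNIPPETS))   # includes ('Benzema', 'Real')
```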
IRJET - Deep Collaborative Filtering with Aspect InformationIRJET Journal
This document discusses a proposed system for deep collaborative filtering with aspect information. The system aims to help web users efficiently locate relevant information on unfamiliar topics to increase their knowledge. It utilizes techniques like multi-keyword search, synonym matching, and ontology mapping to return relevant web links, images, and news articles to the user based on their search terms. The proposed system architecture includes an index structure to efficiently search and rank results based on similarity to the search query terms. The implementation and evaluation of the proposed system are also discussed.
Coverage-Criteria-for-Testing-SQL-QueriesMohamed Reda
This document discusses testing SQL queries by defining coverage criteria. It introduces coverage criteria for evaluating how well a test suite exercises different situations that could affect the data retrieved by an SQL query. These include criteria related to query clauses like selection, joining, grouping and having. The document also discusses representing SQL queries as control flow graphs and applying criteria like condition coverage to account for all possible combinations of condition evaluations. Automatic test case generation and population of test databases is discussed to evaluate coverage based on the criteria.
IRJET- A Novel Approach Automatically Categorizing Software TechnologiesIRJET Journal
This document proposes an automatic approach called Witt to categorize software technologies based on their descriptions. Witt takes a sentence describing a technology as input and outputs a general category (e.g. integrated development environment) along with qualifying attributes. It applies natural language processing and the Levenshtein distance algorithm to compare string similarities and categorize technologies from large datasets. The system architecture first obtains data on software methodologies and labels. It then applies NLP and Levenshtein distance to find hypernyms and transform them into categories with attributes for classification.
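The category matching step can be pictured with a plain Levenshtein distance; the sketch below is a simplification with made-up category names, not Witt itself, assigning a candidate hypernym phrase to the closest known category string.

```python
# Sketch: assign a candidate hypernym phrase to its closest known category
# using Levenshtein (edit) distance. The category list is illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

CATEGORIES = ["integrated development environment", "web framework", "build tool"]

def closest_category(phrase: str) -> str:
    return min(CATEGORIES, key=lambda c: levenshtein(phrase.lower(), c))

print(closest_category("integrated developement enviroment"))  # -> integrated development environment
```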
Analysis of Opinionated Text for Opinion Miningmlaij
In sentiment analysis, the polarities of the opinions expressed on an object or feature are determined to assess whether the sentiment of a sentence or document is positive, negative, or neutral. Naturally, the object/feature is a noun that refers to a product or a component of a product, say the "lens" of a camera, and the opinions expressed on it are captured in adjectives, verbs, adverbs, and nouns themselves. Beyond such words, other meta-information and diverse effective features also play an important role in influencing the sentiment polarity and contribute significantly to the performance of the system. In this paper, some of this associated information/meta-data is explored and investigated in sentiment text. Based on the analysis results presented here, there is scope for further assessment and utilization of the meta-information as features in text categorization, ranking of text documents, identification of spam documents, and polarity classification problems.
Enhancing Keyword Query Results Over Database for Improving User Satisfaction ijmpict
Storing data in relational databases to support keyword queries is widely increasing, but search results often do not give effective answers to a keyword query, which makes the system inflexible from the user's perspective. It would therefore be helpful to recognize queries that return low-ranked results. Here we estimate query performance prediction to determine the effectiveness of a search performed in response to a query, and the features of such hard queries are studied by taking into account the contents of the database and the result list. One related database problem is missing data, which can be handled by imputation. An inTeractive Retrieving-Inferring data imputation method (TRIP) is used, which alternates retrieving and inferring to fill missing attribute values in the database. By considering both the prediction of hard queries and imputation over the database, we can obtain better keyword search results.
The document proposes a method called Page Count and Snippets Method (PCSM) to estimate semantic similarity between words using information from web search engines. PCSM uses both page counts and lexical patterns extracted from snippets to measure semantic similarity. It defines five page-count-based co-occurrence measures and extracts lexical patterns from snippets to identify semantic relations between words. A support vector machine is used to integrate the similarity scores from the page count and snippet methods. The method is evaluated on benchmark datasets and shows improved correlation compared to existing methods.
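The page-count side of such a method reduces to simple arithmetic over hit counts. The sketch below computes four common measures of this kind (WebJaccard, WebDice, WebOverlap, WebPMI) from invented page counts; the real method queries a search engine and the paper's exact five measures may differ.

```python
# Sketch: page-count based co-occurrence scores between two words.
# N is an assumed size of the indexed web; all counts here are invented.
import math

N = 10_000_000_000  # assumed number of indexed pages

def web_jaccard(p, q, pq):
    return 0.0 if pq == 0 else pq / (p + q - pq)

def web_dice(p, q, pq):
    return 0.0 if pq == 0 else 2 * pq / (p + q)

def web_overlap(p, q, pq):
    return 0.0 if pq == 0 else pq / min(p, q)

def web_pmi(p, q, pq):
    if pq == 0:
        return 0.0
    return math.log2((pq / N) / ((p / N) * (q / N)))

# Hypothetical hit counts for "car", "automobile", and the conjunction of both.
p, q, pq = 120_000_000, 40_000_000, 9_000_000
print(web_jaccard(p, q, pq), web_dice(p, q, pq), web_overlap(p, q, pq), web_pmi(p, q, pq))
```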
Document Retrieval System, a Case StudyIJERA Editor
In this work we propose a method for automatic indexing and retrieval. The method returns the most likely documents related to the input query. The technique used in this project is singular-value decomposition: a large term-by-document matrix is analyzed and decomposed into 100 factors, and documents are represented by 100-item vectors of factor weights. Queries, in turn, are represented as pseudo-document vectors formed from weighted combinations of terms.
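A compact way to see the SVD step is shown below: a toy latent-semantic-indexing sketch with a tiny random matrix, keeping far fewer than the 100 factors used in the paper. Documents become factor-weight vectors and a query is folded in as a pseudo-document.

```python
# Sketch of latent semantic indexing: decompose a term-by-document matrix,
# keep k factors, and fold a query in as a pseudo-document. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(12, 6)).astype(float)   # 12 terms x 6 documents

k = 3                                    # retained factors (100 in the paper)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

doc_vectors = Vt_k.T                     # each row: one document in factor space

query = rng.integers(0, 2, size=12).astype(float)     # toy term-frequency query
query_vec = query @ U_k @ np.linalg.inv(S_k)          # pseudo-document projection

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(query_vec, d) for d in doc_vectors]
print(sorted(range(len(scores)), key=lambda i: -scores[i]))  # documents ranked by similarity
```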
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithmijnlc
Information Retrieval (IR) is a very important and vast area. When searching for context, the web returns all results related to the query, and identifying the relevant result is the most tedious task for a user. Word Sense Disambiguation (WSD) is the process of identifying the sense of a word in its textual context when the word has multiple meanings; we have used WSD approaches here. This paper presents a proposed Dynamic Page Rank algorithm that is an improved version of the PageRank algorithm. The proposed Dynamic Page Rank algorithm gives much better results than Google's existing PageRank algorithm; to show this, we have calculated the Reciprocal Rank for both algorithms and present comparative results.
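For reference, the baseline looks like the power-iteration sketch below (classic PageRank, not the proposed dynamic variant, on a made-up four-page link graph).

```python
# Sketch: classic PageRank by power iteration on a tiny, invented link graph.
DAMPING = 0.85

links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += DAMPING * share
        rank = new_rank
    return rank

print(pagerank(links))   # C should accumulate the highest rank
```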
IRJET- An Efficient Way to Querying XML Database using Natural LanguageIRJET Journal
This document discusses an efficient way to query XML databases using natural language. It proposes a framework that can accept English language queries and translate them into XQuery or SQL expressions to retrieve data from an XML database. The system performs linguistic processing to map tokens in the natural language query to XQuery fragments, then executes the translated query against the database. Existing approaches are discussed that typically use semantic and syntactic analysis to represent the query logically before translation, but have limitations in handling ambiguity. The proposed system aims to improve query translation accuracy by leveraging token relationships and classifications determined from natural language parsing.
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes the similarity between feature vectors to determine the similarity between texts. The goal is to implement natural language processing using text segmentation, which supports automated evaluation of subjective exam answers. It reviews related work applying machine learning methods such as K-NN, support vector machines, and decision trees to tasks like text categorization and clustering.
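As a sketch of the K-NN idea on text (TF-IDF feature vectors plus cosine nearest neighbours; the toy answers and labels below are invented, not the paper's data):

```python
# Sketch: k-NN over TF-IDF feature vectors of short answer texts.
# Requires scikit-learn; the training answers and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = [
    "a stack is a last in first out data structure",
    "a queue is a first in first out data structure",
    "photosynthesis converts light energy into chemical energy",
]
train_labels = ["cs", "cs", "biology"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, train_labels)

test = vectorizer.transform(["a queue removes elements in first in first out order"])
print(knn.predict(test))   # -> ['cs']
```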
Architecture of an ontology based domain-specific natural language question a...IJwest
The document summarizes the architecture of an ontology-based domain-specific natural language question answering system. The proposed architecture defines four main modules: 1) question processing which analyzes and classifies questions and reformulates queries, 2) document retrieval which retrieves relevant documents, 3) document processing which processes retrieved documents, and 4) answer extraction which extracts and generates responses. Natural language processing techniques and ontologies are used to analyze questions and documents and extract relationships and answers. The system aims to generate concise, specific answers to natural language questions in a given domain and achieved 94% accuracy in testing.
An efficient approach for web query preprocessingIAESIJEECS
The emergence of Web technology generated a massive amount of raw data by enabling Internet users to post their opinions, comments, and reviews on the web. Extracting useful information from this raw data can be very challenging, and search engines play a critical role in these circumstances. User queries are a main issue for search engines, so a preprocessing operation is essential. In this paper, we present a framework for natural language preprocessing for efficient data retrieval, along with some of the processing required for effective retrieval, such as elongated word handling, stop word removal, and stemming. The manuscript starts by building a manually annotated dataset and then takes the reader through the detailed steps of the process. Experiments are conducted for specific stages of this process to examine the accuracy of the system.
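A minimal version of such a preprocessing chain (elongated-word collapsing, stop-word removal, stemming) could look like the sketch below; it uses NLTK's English resources for illustration, whereas the paper's pipeline and language may differ.

```python
# Sketch: query preprocessing with elongated-word handling, stop-word removal
# and stemming. Requires nltk with 'stopwords' and 'punkt' data downloaded.
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def collapse_elongation(token):
    """Reduce runs of 3+ identical characters to a single character."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def preprocess(query):
    tokens = word_tokenize(query.lower())
    tokens = [collapse_elongation(t) for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [STEMMER.stem(t) for t in tokens]

print(preprocess("the moviiie was sooo amazing and the actors were great"))
# -> ['movi', 'amaz', 'actor', 'great'] (approximately; stems depend on the stemmer)
```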
Immune-Inspired Method for Selecting the Optimal Solution in Semantic Web Ser...IJwest
The increasing interest in developing efficient and effective optimization techniques has led researchers to turn their attention towards biology. Biology offers many clues for designing novel optimization techniques; such approaches exhibit self-organizing capabilities and allow promising solutions to be reached without a central coordinator. In this paper we handle the problem of dynamic web service composition using the clonal selection algorithm. To assess the optimality of a given composition, we use the QoS attributes of the services involved in the workflow as well as the semantic similarity between these components. The experimental evaluation shows that the proposed approach performs better than other approaches such as the genetic algorithm.
IRJET- Text Document Clustering using K-Means Algorithm IRJET Journal
This document discusses using the K-Means clustering algorithm to cluster text documents and compares it to using K-Means clustering with dimension reduction techniques. It uses the BBC Sports dataset containing 737 documents in 5 classes. The document outlines preprocessing the text, creating a document term matrix, applying K-Means clustering, and using dimension reduction techniques like InfoGain before clustering. It evaluates the different methods using precision, recall, accuracy, and F-measure, finding that K-Means with InfoGain dimension reduction outperforms standard K-Means clustering.
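The compared setup can be approximated with scikit-learn, using mutual information as a stand-in for InfoGain (an assumption; the paper's exact feature selector may differ) on top of TF-IDF features before K-Means. The corpus and labels below are toy data, not the BBC Sports dataset.

```python
# Sketch: K-Means over TF-IDF features, with and without information-gain-style
# feature selection (mutual information as a proxy). Toy corpus and labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.cluster import KMeans

docs = [
    "the striker scored a late goal in the football match",
    "the bowler took five wickets in the cricket test",
    "the tennis champion won the final in straight sets",
    "a hat trick of goals decided the football derby",
]
labels = [0, 1, 2, 0]   # only used to drive the feature selector in this sketch

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Plain K-Means on all features.
plain = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# K-Means after keeping the k most informative features w.r.t. the labels.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, labels)
selected = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sel)

print(plain, selected)
```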
DEEP LEARNING SENTIMENT ANALYSIS OF AMAZON.COM REVIEWS AND RATINGSijscai
The document summarizes research on using deep learning techniques for sentiment analysis of Amazon product reviews and ratings. Specifically, it trains recurrent neural networks using paragraph vectors of reviews to learn product embeddings that capture temporal relationships between reviews. This helps identify mismatches between highly positive/negative reviews and low/high ratings. A web service applies the model to reviews and warns users if the predicted sentiment differs from their given rating.
Profile Analysis of Users in Data Analytics DomainDrjabez
Data Analytics and Data Science have been moving forward rapidly in recent years. We see many companies hiring people for data analysis and data science, especially in India, and many recruiting firms use Stack Overflow to fish for potential candidates. The industry has also started to recruit people based on shapes of expertise: a person's expertise is metaphorically outlined by letter shapes such as I, T, M, and hyphen, based on their experience in an area (depth) and the number of areas of interest (width). This proposal builds upon the work of mining shapes of user expertise in a typical online social Question and Answer (Q&A) community where expert users often answer questions posed by other users. We deal with the temporal analysis of expertise among the Q&A community users, in terms of how the users/experts have evolved over time.
Keywords— Shapes of expertise, Graph communities, Expertise evolution, Q&A community
This document presents a system for extracting named entities and their relationships from unstructured text data using n-gram features with hidden Markov models and conditional random fields. The system first extracts n-gram, part-of-speech, and lexicon features from documents, then trains a hidden Markov model to classify entities and a conditional random field with kernel approach to detect relationships between entities. Evaluation shows the proposed system achieves 98.03% accuracy, 88.80% precision, and 87.50% recall for entity detection, outperforming a support vector machine baseline. For relationship extraction, it achieves 87.46% accuracy, 84.46% precision, and 82.46% recall, again outperforming the SVM baseline.
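The feature-extraction stage described (n-gram, part-of-speech, and lexicon features per token) can be sketched as below; the gazetteer and feature names are invented, and the actual HMM/CRF training is out of scope here.

```python
# Sketch: per-token feature dictionaries (character n-grams, POS tag, lexicon flag)
# of the kind typically fed to a CRF sequence labeler. Requires nltk with the
# 'punkt' and averaged perceptron tagger data; the lexicon is invented.
from nltk import word_tokenize, pos_tag

PERSON_LEXICON = {"alice", "bob"}   # illustrative gazetteer

def char_ngrams(word, n=3):
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def token_features(sentence):
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)
    feats = []
    for i, (word, pos) in enumerate(tagged):
        feats.append({
            "word.lower": word.lower(),
            "pos": pos,
            "is_title": word.istitle(),
            "in_person_lexicon": word.lower() in PERSON_LEXICON,
            "prev_word": tagged[i - 1][0].lower() if i > 0 else "<s>",
            "char_trigrams": char_ngrams(word),
        })
    return feats

for f in token_features("Alice joined Acme Corp in 2019"):
    print(f)
```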
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGdannyijwest
Social networks have become one of the most popular platforms that allow users to communicate and share their interests without being in the same geographical location. The great and rapid growth of social media sites such as Facebook, LinkedIn, and Twitter generates a huge amount of user-generated content. Thus, improving information quality and integrity becomes a great challenge for all social media sites, which should allow users to get the desired content or be linked to the best relation using improved search/link techniques. Introducing semantics to social networks therefore widens the representation of the social networks. In this paper, a new model of social networks based on semantic tag ranking is introduced. This model is based on the concept of multi-agent systems. In the proposed model, the representation of social links is extended by the semantic relationships found in the vocabularies known as tags in most social networks. The proposed model for the social media engine is based on enhanced Latent Dirichlet Allocation (E-LDA) as a semantic indexing algorithm, combined with TagRank as a social network ranking algorithm. The improvement in the E-LDA phase is achieved by optimizing the LDA algorithm with optimal parameters; a filter is then introduced to enhance the final indexing output. In the ranking phase, using TagRank on top of the indexing phase improves the ranking output. Simulation results of the proposed model show improvements in both indexing and ranking output.
DOMINANT FEATURES IDENTIFICATION FOR COVERT NODES IN 9/11 ATTACK USING THEIR ...IJNSA Journal
The document presents a framework called SoNMine that identifies key players in the 9/11 covert network using node behavioral profiles. It generates profiles by analyzing node behaviors based on path types extracted from the network's multi-relational structure. The framework identifies outlier nodes with dense connections or high communication as influential players. It also determines dominant features that help classify normal and outlier nodes more accurately.
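One simple reading of "nodes with dense connections or high communication" is a centrality-based outlier scan, sketched below with NetworkX on an invented graph; the actual SoNMine behavioral profiles are richer than this.

```python
# Sketch: flag potentially influential nodes by degree and betweenness centrality.
# Requires networkx; the edge list is invented, not the 9/11 network data.
import networkx as nx

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("d", "e"), ("e", "f"), ("f", "g"), ("d", "g")]
G = nx.Graph(edges)

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

def outliers(scores, factor=1.5):
    """Nodes whose score exceeds factor x the mean score."""
    mean = sum(scores.values()) / len(scores)
    return {n: round(s, 3) for n, s in scores.items() if s > factor * mean}

print("high degree:", outliers(degree))
print("high betweenness:", outliers(betweenness))
```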
IRJET - Voice based Natural Language Query ProcessingIRJET Journal
This document describes a voice-based natural language query processing system that allows non-expert users to interact with a database using natural language queries. The system takes a user's spoken query as input, converts it to text using speech recognition, analyzes the text to generate a SQL query, executes the SQL query against the database, and displays the results in a table. The system addresses challenges like ambiguity through techniques such as tokenization, lexical analysis, syntactic analysis, and semantic analysis to map the natural language query to a valid SQL query.
An efficient approach to query reformulation in web searcheSAT Journals
Abstract: A wide range of problems in natural language processing, data mining, bioinformatics, and information retrieval can be categorized as string transformation, which is the task addressed here: given an input string, the system generates the top k most likely output strings related to that input. In this paper we propose a novel probabilistic method for string transformation that is both accurate and efficient. The approach uses a log-linear model, a method for training the model, and an algorithm that generates the top k outputs. The log-linear model is defined as a conditional probability distribution over an output string and the set of transformation rules, conditioned on the input string. The string generation algorithm, which is based on pruning, is guaranteed to produce the top k candidates. The proposed technique is applied to spelling error correction of queries and to query reformulation in web search. Previous work did not consider spelling error correction and query reformulation together, did not treat efficiency as an important issue, and did not focus on improving both the accuracy and the efficiency of string transformation. Experimental results on large-scale data show that the proposed method is highly accurate and efficient. Keywords: Log linear method, Query reformulation, Spelling error correction.
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALijaia
This document proposes a methodology to extract information from big data sources like course handouts and directories and represent it in a graphical, ontological tree format. Keywords are extracted from documents using natural language processing techniques and used to generate a hierarchical tree based on the DMOZ open directory project. The trees provide a comprehensive overview of document content and structure. The method is implemented using Python for natural language processing and Java for visualization. Evaluation on computer science course handouts shows the trees accurately represent topic coverage and depth. Future work aims to increase the number of keywords extracted.
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online sources such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique to organize the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. A method has been developed that performs the following: tagging the documents for parsing, replacing idioms with their original meaning, calculating semantic weights for document words, and applying semantic grammar. A similarity measure is obtained between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method is evaluated on different datasets with standard performance measures, and its effectiveness in developing meaningful clusters has been demonstrated.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, Fast Fuzzy Feature Clustering for text classification is proposed. It is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011. Each word in a document's feature vector is grouped into a cluster in fewer iterations: the number of iterations required to obtain the cluster centers is reduced by transforming the cluster-center dimension from n dimensions to 2 dimensions, using a slightly modified Principal Component Analysis for the dimension reduction. Experimental results show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, and this is verified on three benchmark datasets.
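The dimension-reduction step (projecting cluster centers from n dimensions down to 2 before iterating) can be pictured with scikit-learn's plain PCA, as in the sketch below; the random data stands in for real feature-vector cluster centers and the paper's modified PCA is not reproduced.

```python
# Sketch: reduce n-dimensional cluster centers to 2 dimensions with PCA,
# so distance computations in later iterations are cheaper. Random toy data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
centers = rng.normal(size=(20, 300))       # 20 cluster centers in 300 dimensions

pca = PCA(n_components=2)
centers_2d = pca.fit_transform(centers)    # same 20 centers, now 2-dimensional

print(centers_2d.shape)                    # (20, 2)
print(pca.explained_variance_ratio_)       # variance captured by the two components
```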
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
Current lexica and machine learning based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training is time consuming and error-prone. Second, the
prediction’s accuracy entails sentences and their corresponding training text
should fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that utilizing different NLP pipelines and semantic relationships impacts the quality of the sentiment analyzers. In particular, results indicate that coupling lemmatization with knowledge-based n-gram features produces higher accuracy. With this coupling, the accuracy of the SVM classifier improved to 90.43%, compared with 86.83%, 90.11%, and 86.20% for the three other classifiers.
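The best-performing combination reported (lemmatization plus n-gram features feeding an SVM) roughly corresponds to the scikit-learn pipeline sketched below; the tiny review set is invented and the WordNet-based expansion step is omitted.

```python
# Sketch: lemmatized uni/bi-gram features into a linear SVM sentiment classifier.
# Requires scikit-learn and nltk ('wordnet', 'punkt' data). Toy reviews only.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    return [lemmatizer.lemmatize(t.lower()) for t in word_tokenize(text) if t.isalpha()]

reviews = [
    "a wonderful, moving film with great performances",
    "the plot was dull and the acting was terrible",
    "loved every minute of this brilliant movie",
    "boring, predictable and far too long",
]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(
    TfidfVectorizer(tokenizer=lemma_tokenizer, ngram_range=(1, 2), token_pattern=None),
    LinearSVC(),
)
model.fit(reviews, labels)
print(model.predict(["a brilliant and moving story"]))   # expected: ['pos'] on this toy data
```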
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemIRJET Journal
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
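The entity/relation extraction step can be sketched with spaCy's dependency parse: pick a verb, take its nominal subject and object, and treat the triple as a candidate knowledge-graph edge. This is a simplification of the described pipeline and assumes the en_core_web_sm model is installed.

```python
# Sketch: naive (subject, relation, object) triplet extraction with spaCy,
# usable as candidate edges for a knowledge graph. Requires the
# en_core_web_sm model (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triplets(text):
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triplets.append((subj.text, token.lemma_, obj.text))
    return triplets

print(extract_triplets("Acme Corporation acquired Widget Ltd. The company reported strong revenue."))
```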
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
Applying natural language processing-related algorithms is currently popular in legal applications, for instance, classification of legal documents, contract review, and machine translation. All of the above machine learning algorithms need to encode the words in a document as vectors. The word embedding model is a modern distributed word representation approach and the most common unsupervised word encoding method; it makes the encoded words easy for other algorithms to consume and thereby supports downstream natural language processing tasks. The most common and practical approach to evaluating the accuracy of a word embedding model uses a benchmark set built from linguistic rules or relationships between words to perform analogical reasoning via algebraic calculation. This paper proposes a Legal Analogical Reasoning Questions Set (LARQS) of 1,256 questions, built from a corpus of 2,388 Chinese codices using five kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models. Moreover, we discovered that legal relations may be ubiquitous in word embedding models.
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
This document describes the development of a new legal word embedding evaluation dataset for Chinese called LARQS (Legal Analogical Reasoning Questions Set). It was created using a corpus of 2,388 Chinese legal documents and contains 1,256 questions evaluating 5 categories of legal relationships. The document discusses word embedding and existing evaluation benchmarks. It then describes how LARQS was created by legal experts and its potential usefulness compared to general-purpose benchmarks for evaluating legal-domain word embeddings.
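Analogy-style evaluation of a word embedding reduces to the usual vector arithmetic check; the sketch below uses gensim's KeyedVectors with a placeholder embedding file and placeholder question words (the actual LARQS questions are in Chinese and tied to legal relations).

```python
# Sketch: scoring analogy questions "a : b :: c : d" against a word embedding.
# Uses gensim; the embedding file name and the question words are placeholders.
from gensim.models import KeyedVectors

# Hypothetical path to a word2vec-format embedding trained on a legal corpus.
model = KeyedVectors.load_word2vec_format("legal_embeddings.bin", binary=True)

questions = [
    ("theft", "larceny", "contract", "agreement"),   # placeholder analogy tuples
]

def analogy_accuracy(model, questions, topn=1):
    correct = 0
    for a, b, c, expected in questions:
        predicted = model.most_similar(positive=[b, c], negative=[a], topn=topn)
        if expected in {word for word, _ in predicted}:
            correct += 1
    return correct / len(questions)

print(analogy_accuracy(model, questions))
```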
Query expansion using novel use case scenario relationship for finding featur...IJECEIAES
Feature location is a technique for determining the source code that implements specific features in software; it was developed to help minimize program comprehension effort. The main challenge of feature location research is how to bridge the gap between abstract keywords in use cases and details in source code. Use case scenarios are software requirements artifacts that state the input, logic, rules, actor, and output of a function in the software, and a sentence in one use case scenario sometimes describes a sentence in another use case scenario. This study contributes by creating expansion queries for feature location from the relationships between use case scenarios. The relationships include inner association, outer association, and intratoken association. The research employs latent Dirichlet allocation (LDA) to create topic models of the source code. Query expansion using inner, outer, and intratoken associations was tested for finding feature locations in a Java-based open-source project. The best precision rate was 50%. The best recall was 100%, found for several use case scenarios implemented in a few files. The best average precision rate was 16.7%, found in the inner association experiments, and the best average recall rate was 68.3%, found in the experiments using all compound associations.
This document presents a framework for reusing existing software agents through ontological engineering. The framework includes components like a user interface agent, query processor, mapping agent, transfer agent, wrapper agent, and remote agents containing ontologies. The query processor reformulates the user's query, the mapping agent identifies relevant ontologies, and the transfer agent sends the query to remote agents. The remote agents provide ontologies as output, which are then integrated/merged and presented back to the user interface agent. The goal is to enable reuse of heterogeneous agents across different development environments through a standardized ontology representation.
Cluster Based Web Search Using Support Vector MachineCSCJournals
Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the Web a popular place for collecting information, and today Internet users access billions of web pages online using search engines. Information on the Web comes from many sources, including websites of companies, organizations, and communities, as well as personal homepages. Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query; to overcome this problem, we exploit the fact that relevant Web pages are often located close to each other in the Web graph of hyperlinks. The work presents a graphical approach to entity resolution that complements the traditional methodology with analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between pairs of nodes in the graph, which can significantly improve the quality of entity resolution. Support vector machines (SVMs), a set of related supervised learning methods used for classification, are used to distribute the load of user queries from the server machine across different client machines so that the system remains stable; web pages are clustered based on their capacities, and the whole database is stored on the server machine. Keywords: SVM, cluster, ER.
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...IJwest
An ontology is a conceptualization of a domain into a human-understandable yet machine-readable format consisting of entities, attributes, relationships and axioms. Ontologies formalize the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations. Using semantic relations, it would be possible to extract the whole family tree of a prominent personality from a resource such as Wikipedia. In a way, relations describe the semantic relationships among the entities involved, which is beneficial for a better understanding of human language. Relations can be identified from the result of concept hierarchy extraction. The existing ontology learning process only produces the result of concept hierarchy extraction; it does not produce the semantic relations between the concepts. Here, predicates and first-order logic formulas are constructed, and inference and weight learning are performed using a Markov Logic Network. To improve the relations of every input and also the relations between the contents, the concept of ARSRE is proposed. This method can find the frequent items between concepts and convert existing lightweight ontologies into formal ones. The experimental results show good extraction of semantic relations compared to the state-of-the-art method.
Performance Evaluation of Query Processing Techniques in Information Retrievalidescitation
The first element of the search process is the query.
The user query, being on average restricted to two or three
keywords, is often ambiguous to the search engine.
Given the user query, the goal of an Information Retrieval
[IR] system is to retrieve information which might be useful
or relevant to the information need of the user. Hence, the
query processing plays an important role in IR system.
The query processing can be divided into four categories
i.e. query expansion, query optimization, query classification and
query parsing. In this paper an attempt is made to evaluate the
performance of query processing algorithms in each of the
category. The evaluation was based on dataset as specified by
Forum for Information Retrieval [FIRE15]. The criteria used
for evaluation are precision and relative recall. The analysis is
based on the importance of each step in query processing. The
experimental results show the significance of each step
in query processing as well as the relevance of web semantics
and spelling correction in the user query.
Implementation of Semantic Analysis Using Domain OntologyIOSR Journals
The document describes a semantic analysis system that analyzes feedback from an organization using domain ontology. The system first collects feedback data from students in an unstructured format. It then preprocesses the feedback using part-of-speech tagging to extract meaningful information. The system architecture includes preprocessing the feedback, matching entities in the feedback to an organization ontology using Jaccard similarity, and generating a summarized analysis of the feedback based on the ontology entities. The goal is to group related words and phrases expressed by students under the same entity to produce a meaningful summary for the organization.
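The Jaccard-based entity matching step described above can be sketched in a few lines; the token-level treatment of phrases and the example entity names are illustrative assumptions rather than the system's exact implementation:

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_entity(phrase, ontology_entities, threshold=0.5):
    """Return the ontology entity whose token set best matches the phrase."""
    tokens = set(phrase.lower().split())
    best, best_sim = None, 0.0
    for entity in ontology_entities:
        sim = jaccard(tokens, set(entity.lower().split()))
        if sim > best_sim:
            best, best_sim = entity, sim
    return best if best_sim >= threshold else None

# Example: map a feedback phrase to a hypothetical organization-ontology entity.
print(match_entity("the computer lab equipment", ["computer lab", "library", "cafeteria"]))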
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IJCSEA Journal
This document summarizes a research paper that proposes a method for dynamically measuring coupling in distributed object-oriented software systems. The method involves three steps: instrumentation of the Java Virtual Machine to trace method calls, post-processing of the trace files to merge information, and calculation of coupling metrics based on the dynamic traces. The implementation results show that the proposed approach can effectively measure coupling metrics dynamically by accounting for polymorphism and dynamic binding, overcoming limitations of traditional static coupling analysis.
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IJCSEA Journal
Software metrics are increasingly playing a central role in the planning and control of software development projects. Coupling measures have important applications in software development and maintenance. Existing literature on software metrics is mainly focused on centralized systems, while work in the area of distributed systems, particularly service-oriented systems, is scarce. Distributed systems with service-oriented components run in an even more heterogeneous networking and execution environment. Traditional coupling measures take into account only “static” couplings. They do not account for “dynamic” couplings due to polymorphism and may significantly underestimate the complexity of software and misjudge the need for code inspection, testing and debugging. This is expected to result in poor predictive accuracy of quality models for distributed object-oriented systems that rely on static coupling measurements. In order to overcome these issues, we propose a hybrid model for measuring coupling dynamically in distributed object-oriented software. The proposed method has three steps: instrumentation, post-processing and coupling measurement. Initially, the instrumentation process is performed, in which an instrumented JVM that has been modified to trace method calls is used. During this process, three trace files are created, namely .prf, .clp and .svp. In the second step, the information in these files is merged. At the end of this step, the merged detailed trace of each JVM contains pointers to the merged trace files of the other JVMs such that the path of every remote call from the client to the server can be uniquely identified. Finally, the coupling metrics are measured dynamically. The implementation results show that the proposed system effectively measures the coupling metrics dynamically.
Building a recommendation system based on the job offers extracted from the w...IJECEIAES
Recruitment, or job search, is increasingly carried out throughout the world by a large population of users through various channels, such as websites, platforms, and professional networks. Given the large volume of information related to job descriptions and user profiles, it is complicated to appropriately match a user's profile with a job description, and vice versa. The traditional job search approach has drawbacks, since the job seeker needs to search for job offers on each recruitment platform, manage multiple accounts, and apply for the relevant job vacancies, which wastes considerable time and effort. The contribution of this research work is the construction of a recommendation system based on job offers extracted from the web and on the e-portfolios of job seekers. After the extraction of the data, natural language processing is applied so that the data are structured and ready for filtering and analysis. The proposed system is a content-based system: it measures the degree of correspondence between the attributes of the e-portfolio and those of each job offer from the same list of competence specialties using the Euclidean distance, and the results are sorted in decreasing order of relevance to display the most relevant to the least relevant job offers.
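A minimal sketch of the content-based matching step follows; the numeric attribute vectors and offer identifiers are hypothetical placeholders for the attributes extracted from e-portfolios and job offers:

import numpy as np

def rank_job_offers(profile_vec, offers):
    """Rank job offers by Euclidean distance to the e-portfolio vector (closest, i.e. most relevant, first)."""
    profile = np.asarray(profile_vec, dtype=float)
    scored = [(offer_id, float(np.linalg.norm(profile - np.asarray(vec, dtype=float))))
              for offer_id, vec in offers]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical skill-level vectors for three job offers.
offers = [("offer_1", [3, 0, 2]), ("offer_2", [1, 1, 1]), ("offer_3", [3, 1, 2])]
print(rank_job_offers([3, 1, 2], offers))   # offer_3 comes first (distance 0)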
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...IJwest
This document describes a proposed system for automatic semantic annotation of web documents based on ontology elements and relationships. It begins with an introduction to semantic web and annotation. The proposed system architecture matches topics in text to entities in an ontology document. It utilizes WordNet as a lexical ontology and ontology resources to extract knowledge from text and generate annotations. The main components of the system include a text analyzer, ontology parser, and knowledge extractor. The system aims to automatically generate metadata to improve information retrieval for non-technical users.
Candidate Link Generation Using Semantic Pheromone Swarmkevig
Requirements tracing of natural Language artifacts consists of document parsing, Candidate Link
Generation, evaluation and analysis. Candidate Link Generation deals with checking if the high-level
artifact has been fulfilled by the low-level artifact. Requirements traceability is an important activity
undertaken as part of ensuring the quality of software in the early stages of the Software Development Life
Cycle (SDLC). The Semantic Relatedness between the terms is not considered in the existing system; hence
the Candidate Link Generation is not effective. In the proposed system, a hybrid technique combining both
the Semantic Ranking and Pheromone Swarm is implemented. Simple swarm agents are given the freedom to
operate on their own, determining the search path randomly based on the environment. A pheromone swarm
agent's decision on which term to select or which path to take is influenced by the presence of pheromone markings
on the inspected object. A Semantic Graph is constructed using the semantic relatedness between two terms,
computed based on the highest-value path connecting any pair of terms. The performance is evaluated with
Simple, Pheromone and Semantic Pheromone Swarm techniques. The Semantic Pheromone Swarm
provides better results when compared to Simple and Pheromone Swarm Techniques.
This document discusses and compares several agent-assisted methodologies for developing multi-agent systems:
- It reviews Gaia, HLIM, PASSI, and Tropos methodologies, outlining their key models and phases. Gaia focuses on analysis and design, HLIM models internal and external agent behavior, and PASSI and Tropos incorporate UML modeling.
- It then proposes a new MAB methodology intended to address shortcomings of existing approaches. MAB includes requirements, analysis, design, and implementation phases and models such as use case maps and agent roles.
- Finally, it concludes that agent technologies represent a promising approach for developing complex software systems, but that matching methodologies to problem domains and developing princip
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
Tracing Requirements as a Problem of Machine Learning
International Journal of Software Engineering & Applications (IJSEA), Vol.9, No.4, July 2018
DOI:10.5121/ijsea.2018.9402
TRACING REQUIREMENTS AS A PROBLEM OF
MACHINE LEARNING
Zeheng Li and LiGuo Huang
Southern Methodist University, Dallas, Texas, USA
ABSTRACT
Software requirements engineering and evolution are essential to the software development process, defining
and elaborating what is to be built in a project. Requirements are mostly written in text and will later evolve
to fine-grained and actionable artifacts with details about system configurations, technology stacks, etc.
Tracing the evolution of requirements enables stakeholders to determine the origin of each requirement and
understand how well the software’s design reflects its requirements. Since recovering requirements traceability
is not a trivial task, a machine learning approach is used to classify traceability links between associated
requirements. In particular, a 2-learner, ontology-based, pseudo-instances-enhanced approach, where two
classifiers are trained to separately exploit two types of features, lexical features and features derived from
a hand-built ontology, is investigated for such task. The hand-built ontology is also leveraged to generate
pseudo training instances to improve machine learning results. In comparison to a supervised baseline
system that uses only lexical features, our approach yields a relative error reduction of 56.0%. Most
interestingly, results do not deteriorate when the hand-built ontology is replaced with its automatically
constructed counterpart.
KEYWORDS
Requirements Traceability, Software Design, Machine Learning
1. INTRODUCTION
Evolution and refinement of requirements guide the software system development process by
defining and specifying what is to be built for a software system. Requirement specifications, mostly
documented in natural language, are refined with additional design and implementation details as
a software project moves forward in its development life cycle. An important task in software
requirements engineering process is requirements traceability, which is concerned with linking
requirements in which one is a refinement of the other. Being able to establish traceability links
allows stakeholders to find the source of each requirement and track every change that has been
made to it, and ensures the continuous understanding of the problem that needs to be solved so
that the right system is delivered.
In practice, one is given a set of high-level (coarse-grained) requirements and a set of low-level
(fine-grained) requirements, and requirements traceability aims to find for each high-level
requirement all the low-level requirements that refine it. Note that the resulting mapping
between high- and low-level requirements is many-to-many, because a low-level requirement can
potentially refine more than one high-level requirement.
As an example, consider the three high-level requirements and two low-level requirements shown
in Figure 1 about the well-known Pine email system. In this example, three traceability links
should be established between the high-level (HR) and low-level requirements (UC): (1) HR01 is
refined by UC01 (because UC01 specifies the shortcut key for saving an entry in the address
book); (2) HR02 is refined by UC01 (because UC01 specifies how to store contacts in the address
book); and (3) HR03 is refined by UC02 (because both of them are concerned with the help
system).
Figure 1. Samples of high- and low-level requirements.
From the perspective of the information retrieval and text mining, requirements traceability is a
very challenging task. First, there could be abundant information irrelevant to the establishment
of a link in one or both of the requirements. For instance, the information in the Description
section of UC01 appears to be irrelevant to the establishment of the link between UC01 and
HR02. Worse still, as the goal is to induce a many-to-many mapping, information irrelevant to the
establishment of one link could be relevant to the establishment of another link involving the
same requirement. For instance, while the Description section appears to be irrelevant to linking
UC01 and HR02, a traceability link does exist between UC01 and HR01. Above all, a link can
exist between a pair of requirements (HR01 and UC01) even if they do not possess any
overlapping or semantically similar content words.
Virtually all existing approaches to the requirements traceability task were developed in the soft-
ware engineering (SE) research community. Related work on this task can be classified into two
categories: manual and automated approaches. As for manual approaches, requirements
traceability links are manually recovered by developers. Automated approaches, on the other
hand, have relied on information retrieval (IR) techniques, which recover links based on
similarity computed between a given pair of requirements. Hence, such similarity-based
approaches are unable to recover links between those pairs that do not contain overlapping or
semantically similar words or phrases as mentioned above.
In light of this weakness, requirements traceability is recast as a supervised binary
classification task, where each pair of high- and low-level requirements is to be classified as
positive (having a link) or negative (not having a link). Each pair of requirements is represented
by two types of features. First, word-pair features are employed, each of which is composed
of a word taken from each of the two requirements involved. These features will enable the
learning algorithm to identify both semantically similar and dissimilar word pairs that are
strongly indicative of a refinement relation between the two requirements, thus overcoming the
aforementioned weakness associated with similarity-based approaches.
Next, features are derived from an ontology hand-built by a domain expert. The sample ontology
built for the Pine dataset is shown in Table 1. The ontology contains only a verb clustering and a
noun clustering: the verbs are clustered by the function they perform, whereas a noun cluster
corresponds to a (domain-specific) semantic type.
There are at least two reasons why the ontology might be useful for identifying traceability
links. First, since only those verbs and nouns that (1) appear in the training data and (2) are
deemed relevant by the domain expert for link identification are included in the ontology, it
provides guidance to the learner as to which words/phrases in the requirements it should focus
on in the learning process. Second, the verb and noun clusters provide a robust generalization
of the words/phrases in the requirements. For instance, a word pair that is relevant for link
identification may still be ignored by the learner due to its infrequency of occurrence. The
features that are computed based on these clusters, on the other hand, will be more robust to the
infrequency problem and therefore potentially provide better generalizations.
Last, treating the ontology as a set of natural annotator rationales, pseudo training instances are used
to help the learners by indicating the importance of different parts of a document as well
as by increasing the number of training instances.
Our main contribution in this paper lies in the proposal of a 2-learner, ontology-based, pseudo-
instances-enhanced approach to the task of traceability link prediction, where, for the sake of
robustness, two classifiers are trained to separately exploit the word-pair features, the ontology
based features, and ontology-enhanced pseudo instances. Results on a traceability dataset
involving the Pine domain reveal that our use of two learners and the ontology-based features are
both key to the success of our approach: it significantly outperforms not only a supervised
baseline system that uses only word pairs features, but also a system that trains a single classifier
over both the word pairs and the ontology-based features. Perhaps most interestingly, results do
not deteriorate when the hand-built ontology is replaced with an automatically constructed
ontology. Moreover, by feeding the learners pseudo training instances generated from the ontology,
the performance is further improved significantly.
The rest of the paper is organized as follows. Section 2 describes related work. Section 3
introduces the Pine dataset and our hand-built ontology is described in Section 4. Then Section 5
presents our 2-learner, ontology-based, pseudo-instances-enhanced approach to traceability link
prediction. Finally, Section 6 presents evaluation results and Section 7 draws conclusions.
2. RELATED WORK
2.1. MANUAL APPROACHES
Traditional manual requirements tracing is usually accomplished by system analysts with the help
of requirement management tools, where analysts visually examine each pair of requirements
documented in the requirement management tools to build the Requirement Traceability Matrix
(RTM). Most existing requirement management tools (e.g., Rational DOORS, Rational RequisitePro,
CASE) support traceability analysis. Manual tracing is often based on observing the
Table 1. Manual ontology for Pine: (a) noun clustering; (b) verb clustering.
potential relevance between a pair of requirements belonging to different categories or at different
levels of details. The manual process is human-intensive and error-prone given a large set of
requirements. Moreover, such domain knowledge could be lost due to requirements changes,
distributed teams, or system refactoring during the life cycle of system development and evolution.
2.2. AUTOMATED APPROACHES
Automated or semi-automated requirements traceability has been explored by many researchers.
Pierce [2] designed a tool that maintains a requirements database to aid automated requirements
tracing. Jackson [3] proposed a keyphrase based approach for tracing a large number of
requirements of a large Surface Ship Command System. More advanced approaches relying on
information retrieval (IR) techniques, such as the tf-idf-based vector space model [4], Latent
Semantic Indexing [5–7], probabilistic networks [8], and Latent Dirichlet Allocation [9], have
been investigated, where traceability links were generated by calculating the textual similarity
between requirements using similarity measures such as Dice, Jaccard, and Cosine coefficients
[10]. All these methods were developed based on either matching keywords or identifying similar
words across a pair of requirements. In recent years, Li [11] studied the feasibility of employing
supervised learning to accomplish this task. Guo [12] applied word embedding and recurrent
neural network to generate trace links.
3. DATASET
The well known Pine system is used for evaluation. This dataset consists of a set of 49 (high-
level) requirements and a set of 51 (low-level) use case specifications about Pine, an email system
developed at the University of Washington. Statistics on the dataset are provided on Table 2. The
dataset has a skewed class distribution: out of the 2499 pairs of requirement and use case
specification, only 10% (250) are considered traceability links.
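To make the classification setup concrete, the candidate instances are simply the cross product of high- and low-level requirements; the sketch below uses placeholder identifiers and a hypothetical gold-link set:

from itertools import product

def make_instances(high_reqs, low_reqs, gold_links):
    """Pair every high-level requirement with every low-level one; label 1 if the pair is a link."""
    return [((h, u), 1 if (h, u) in gold_links else 0)
            for h, u in product(high_reqs, low_reqs)]

# With 49 high-level and 51 low-level requirements this yields 49 * 51 = 2499 pairs,
# of which only 250 (about 10%) are positive in the Pine dataset.
print(make_instances(["HR01", "HR02"], ["UC01", "UC02"], {("HR01", "UC01")}))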
Table 2. Statistics on the Pine dataset.
4. HAND-BUILDING THE ONTOLOGY
As mentioned before, our ontology is composed of a verb clustering and a noun clustering. A soft-
ware engineer who has expertise in both requirements traceability and the Pine software domain is
employed to hand-build the ontology. Using his domain expertise, the engineer first identified the
noun categories and verb categories that are relevant for traceability prediction. Then, by inspect-
ing the training data, he manually populated each noun/verb category with the words and phrases
collected from the training data.
As will be discussed in Section 6, our approach is evaluated by using 5-fold cross validation. Since
the nouns/verbs in the ontology were collected only from the training data, the software engineer
built five ontologies, one for each fold experiment. Hence, nouns/verbs that appear in only the test
data in each fold experiment will not be in the ontology. In other words, our test data are truly
held-out w.r.t. the construction of the ontology. Table 1 shows the ontology built for one of the
fold experiments. Note that the five ontologies employ the same set of noun and verb categories,
differing only w.r.t. the nouns and verbs that populate each category. As it can be seen from Table
1, eight groups of nouns and ten groups of verbs are defined. Each noun category represents a
domain-specific semantic class, and each verb category corresponds to a function performed by
the action underlying a verb.
5. APPROACH
This section describes our supervised approach along with its three extensions.
5.1 CLASSIFIER TRAINING
Each instance corresponds to a high-level requirement and a low-level requirement. Hence, in-
stances are created by pairing each high-level requirement with each low-level requirement. The
class value of an instance is positive if the two requirements involved should be linked; otherwise,
it is negative. To conduct 5-fold cross-validation experiments, instances are randomly partitioned
into five folds of roughly the same size. A classifier is trained on only four folds and evaluated
on the remaining fold in each fold experiment. Each instance is represented using seven types of
features, as follows:
Same words. One binary feature is created for each word w appearing in the training data. Its
value is 1 if w appears in both requirements in the pair under consideration. Hence, this feature
type contains the subset of the word pair features mentioned earlier where the two words in the
pair are the same.
Different words. One binary feature is created for each word pair (wi, wj) collected from the
training instances, where wi and wj are non-identical words appearing in a high-level requirement
and a low-level requirement respectively. Its value is 1 if wi and wj appear in the high-level and low-level pair under consideration, respectively.
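A minimal sketch of how the first two feature types could be extracted for a single requirement pair; the feature-naming scheme is an assumption made for illustration, and in training only word pairs observed in the training data would be kept:

def word_pair_features(high_words, low_words):
    """Binary 'same word' and 'different word pair' features for one requirement pair."""
    high, low = set(high_words), set(low_words)
    feats = {}
    for w in high & low:                     # feature type 1: same words
        feats["same=" + w] = 1
    for wi in high:                          # feature type 2: different-word pairs
        for wj in low:
            if wi != wj:
                feats["pair=" + wi + "|" + wj] = 1
    return feats

print(word_pair_features(["save", "address", "book"], ["store", "address", "entry"]))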
Verb pairs. One binary feature is created for each verb pair (vi, vj) collected from the training
instances, where (1) vi and vj appear in a high-level requirement and a low-level requirement
respectively, and (2) both verbs appear in the ontology. Its value is 1 if vi and vj appear in the
high-level and low-level pair under consideration, respectively. Using these verb pairs as features
may allow the learner to focus on verbs that are relevant to traceability prediction.
Verb group pairs. For each verb pair feature described above, one binary feature is created by
replacing each verb in the pair with its cluster id in the ontology. Its value is 1 if the two verb
groups in the pair appear in the high-level and low-level pair under consideration, respectively.
These features may enable the resulting classifier to provide robust generalizations in cases where
the learner chooses to ignore certain useful verb pairs owing to their infrequency of occurrence.
Noun pairs. One binary feature is created for each noun pair (ni, nj) collected from the training
instances, where (1) ni and nj appear in a high-level requirement and a low-level requirement
respectively, and (2) both nouns appear in the ontology. Its value is computed in the same manner
as the verb pairs. These noun pairs may help the learner to focus on nouns that are relevant to
traceability prediction.
Noun group pairs. For each noun pair feature described above, one binary feature is created by
replacing each noun in the pair with its cluster id in the ontology. Its value is computed in the
same manner as the verb group pairs. These features may enable the classifier to provide robust
generalizations in cases where the learner chooses to ignore certain useful noun pairs owing to
their infrequency of occurrence.
Dependency pairs. In some cases, the noun/verb pairs may not provide sufficient information
for traceability prediction. For example, the verb pair feature (delete, delete) is suggestive of
a positive instance, but the instance may turn out to be negative if one requirement concerns
deleting messages and the other concerns deleting folders. As another example, the noun pair
feature (folder, folder) is suggestive of a positive instance, but the instance may turn out to be
negative if one requirement concerns creating folders and the other concerns deleting folders.
In other words, useful features are those that encode not the verbs and nouns in isolation but the
relationship between them. To do so, each requirement is parsed by the Stanford dependency
parser [13], and each noun-verb pair (ni,vj) is collected if it’s connected by a dependency relation.
Binary features are created by pairing each related noun-verb pair found in a high-level training
requirement with each related noun-verb pair found in a low-level training requirement. The
feature value is 1 if the two noun-verb pairs appear in the pair of requirements under consideration.
To enable the learner to focus on learning from relevant verbs and nouns, only verbs and nouns
that appear in the ontology are used to create these features.
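A rough sketch of the dependency-pair extraction; spaCy is used here merely as a stand-in for the Stanford dependency parser, and the set of object relations is an assumption:

import spacy

nlp = spacy.load("en_core_web_sm")          # requires the small English model to be installed
OBJ_RELATIONS = {"dobj", "iobj", "obj"}     # assumed labels for direct/indirect objects

def noun_verb_pairs(text):
    """Collect (noun, verb) lemma pairs connected by an object dependency relation."""
    pairs = set()
    for tok in nlp(text):
        if tok.pos_ == "NOUN" and tok.dep_ in OBJ_RELATIONS and tok.head.pos_ == "VERB":
            pairs.add((tok.lemma_, tok.head.lemma_))
    return pairs

print(noun_verb_pairs("The user deletes a message from the folder."))
# e.g. {('message', 'delete')}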
LIBSVM [14] is employed as the learning algorithm for training a binary SVM classifier on the
training set. In particular, the linear kernel is chosen, and the C value (the regularization
parameter) is tuned to maximize F-score on the development (dev) set. All other learning parameters are
set to their default values.
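For concreteness, the training and tuning step can be sketched as follows; scikit-learn's SVC (which wraps LIBSVM) stands in for LIBSVM itself, and the candidate C grid is an assumption:

from sklearn.svm import SVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score

def train_linear_svm(train_feats, train_labels, dev_feats, dev_labels,
                     c_grid=(0.01, 0.1, 1, 10, 100)):
    """Train a linear-kernel SVM, tuning C to maximize F-score on the dev set."""
    vec = DictVectorizer()
    X_train = vec.fit_transform(train_feats)   # train_feats: list of feature dicts
    X_dev = vec.transform(dev_feats)
    best = None
    for c in c_grid:
        clf = SVC(kernel="linear", C=c).fit(X_train, train_labels)
        f1 = f1_score(dev_labels, clf.predict(X_dev))
        if best is None or f1 > best[0]:
            best = (f1, c, clf)
    return best                                 # (dev F-score, chosen C, fitted classifier)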
To improve performance, feature selection (FS) is employed by using the backward elimination
algorithm [15]. Starting with all seven feature types, the algorithm iteratively removes one feature
type at a time until only one feature type is left. Specifically, in each iteration, it removes the
feature type whose removal yields the largest F-score on the dev set. The feature subset that achieves
the largest F-score on the dev set over all iterations is picked to be applied to the test set.
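The backward-elimination loop itself is simple; in the sketch below, feature types are named groups and train_and_score is an assumed helper that trains a classifier on the given groups and returns the dev-set F-score:

def backward_elimination(feature_types, train_and_score):
    """Greedy backward elimination over feature types, keeping the subset with the best dev F-score."""
    current = list(feature_types)
    best_subset, best_f1 = list(current), train_and_score(current)
    while len(current) > 1:
        # Remove the feature type whose removal yields the largest dev-set F-score.
        f1, drop = max((train_and_score([t for t in current if t != d]), d) for d in current)
        current.remove(drop)
        if f1 > best_f1:
            best_f1, best_subset = f1, list(current)
    return best_subset, best_f1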
Note that tuning the C value (from libSVM) and selecting the feature subset both require the use
of a dev set. In each fold experiment, one fold is reserved for development and the remaining three
folds are used for training classifiers. The C value is jointly tuned with the selection of the feature subset
that maximizes F-score on the dev set.
5.2. THREE EXTENSIONS
The following presents three extensions to our supervised approach.
5.2.1. EMPLOYING TWO VIEWS
Our first extension involves splitting our feature sets into two views (i.e., disjoint subsets) and
training one classifier on each view. To motivate this extension, recall that the ontology is com-
posed of words and phrases that are deemed relevant to traceability prediction according to a SE
expert. In other words, the (word- and cluster-based) features derived from the ontology (i.e.,
features 3–7 in our feature set) are sufficient for traceability prediction, and the remaining
features (features 1 and 2) are not needed according to the expert. While some of the word pairs
that appear in features 1 and 2 also appear in features 3–7, most of them do not. If these expert-
determined irrelevant features are indeed irrelevant, then retaining them could be harmful for
classification because they significantly outnumber their relevant counterparts. However, if some
of these features are relevant (because some relevant words are missed by the expert, for
instance), then removing them would not be a good idea either.
Our solution to this dilemma is to divide the feature set into two views. Given the above discussion,
a natural feature split would involve putting the ontology-based features (features 3–7) into one
view and the remaining ones (features 1–2) into the other view. Then one SVM classifier is trained
on each view as before. During test time, both classifiers are applied to a test instance, classifying
it using the prediction associated with the higher confidence value. This setup would prevent the
expert-determined irrelevant features from affecting the relevant ones, and at the same time avoid
totally discarding them in case they do contain some relevant information.
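A sketch of the resulting 2-learner prediction rule; using the absolute decision_function value as the confidence score is an assumption about how confidence is operationalized:

import numpy as np

def predict_two_views(clf_words, clf_onto, X_words, X_onto):
    """Combine two view-specific SVMs by taking the more confident of the two predictions."""
    conf_w = clf_words.decision_function(X_words)   # signed distance to the hyperplane
    conf_o = clf_onto.decision_function(X_onto)
    decision = np.where(np.abs(conf_w) >= np.abs(conf_o), conf_w, conf_o)
    return (decision > 0).astype(int)               # 1 = link, 0 = no link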
A natural question is: why not simply use backward elimination to identify the irrelevant features?
While FS could help, it may not be as powerful as one would think because (1) backward
elimination is greedy; and (2) the features are selected using a fairly small set of instances (i.e.,
the dev set) and may therefore be biased towards the dev set.
In fact, our 2-learner setup and FS are considered as complementary rather than competing
solutions to our dilemma. In particular, FS is to be used in the 2-learner setup: when training the
classifiers on the two views, backward elimination is employed in the same way as before by
removing the feature type (from one of the two classifiers) whose removal yields the highest F-
score on the dev set in each iteration.
5.2.2. LEARNING THE ONTOLOGY
An interesting question is: can the ontology be learned instead of hand-built? Not only is this
question interesting from a research perspective, it is of practical relevance: even if a domain
expert is available, hand-constructing the ontology is a time-consuming and error-prone process.
The following describes the steps for ontology learning, which involves producing a verb
clustering and a noun clustering.
Step 1: Verb/Noun selection. The nouns, noun phrases (NPs) and verbs in the training set will be
clustered. Specifically, a verb/noun/NP is selected if (1) it appears more than once in the training data;
(2) it contains at least three characters (thus avoiding verbs such as be); and (3) it appears in the
high-level but not the low-level requirements and vice versa.
Step 2: Verb/Noun representation. Each noun/NP/verb is represented as a feature vector. Each
verb v is represented using the set of nouns/NPs collected in Step 1. The value of each feature is
binary: 1 if the corresponding noun/NP occurs as the direct or indirect object of v in the training
data (as determined by the Stanford dependency parser), and 0 otherwise. Similarly, each noun
n is represented using the set of verbs collected in Step 1. The value of each feature is binary:
1 if n serves as the direct or indirect object of the corresponding verb in the training data, and 0
otherwise.
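A sketch of this representation step; dependency extraction is abstracted into a list of (noun, verb) object pairs, as in the earlier parsing sketch:

import numpy as np

def build_representations(nouns, verbs, object_pairs):
    """Binary co-occurrence vectors: each verb over nouns/NPs, and each noun/NP over verbs."""
    n_idx = {n: i for i, n in enumerate(nouns)}
    v_idx = {v: i for i, v in enumerate(verbs)}
    verb_vecs = np.zeros((len(verbs), len(nouns)), dtype=int)
    noun_vecs = np.zeros((len(nouns), len(verbs)), dtype=int)
    for n, v in object_pairs:                 # n occurs as a direct/indirect object of v
        if n in n_idx and v in v_idx:
            verb_vecs[v_idx[v], n_idx[n]] = 1
            noun_vecs[n_idx[n], v_idx[v]] = 1
    return verb_vecs, noun_vecs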
Step 3: Clustering. Verbs and the nouns/NPs are clustered separately to produce a verb clustering
and a noun clustering. Two clustering algorithms are experimented with. The first one, which is referred
to as Simple, is the classical single-link algorithm. Single-link is an agglomerative algorithm where
each object to be clustered is initially in its own cluster. In each iteration, it merges the two most
similar clusters and stops when the desired number of clusters is reached. The second clustering
algorithm is motivated by the following observation. A better verb clustering could be produced
if each verb were represented using noun categories rather than nouns/NPs, because there is no
need to distinguish between the nouns in the same category in order to produce the verb clusters.
Similarly, a better noun clustering could be produced if each noun were represented using verb categories rather than verbs.
In practice, the noun and verb categories do not exist (because they are what the clustering
algorithm is trying to produce). However, the (partial) verb clusters produced during the verb
clustering process can be used to improve noun clustering and vice versa. This motivates our
Interactive clustering algorithm. Like Simple, Interactive is also a single-link clustering
algorithm. Unlike Simple, which produces the two clusterings separately, Interactive interleaves
the verb and noun clustering processes, as described below.
Initially, each verb and each noun is in its own cluster. In each iteration, (1) the two most similar
verb clusters are merged; (2) the nouns' feature representations are updated by merging the two
verb features that correspond to the newly formed verb cluster; (3) the two most similar noun
clusters are merged using this updated feature representation for nouns; and (4) the verbs' feature
representations are updated by merging the two noun features that correspond to the newly
formed noun cluster. As in Simple, Interactive terminates when the desired number of clusters is
reached.
For both clustering algorithms, the similarity between two objects is computed by taking the dot
product of their feature vectors. Since both clustering algorithms are single-link, the similarity
between two clusters is the similarity between the two most similar objects in the two clusters.
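A minimal sketch of the Simple variant (single-link agglomerative clustering with dot-product similarity); the Interactive variant would interleave two such loops and merge feature columns after each merge:

import numpy as np

def single_link_cluster(vectors, num_clusters):
    """Single-link agglomerative clustering with dot-product similarity between objects."""
    clusters = [[i] for i in range(len(vectors))]
    sim = vectors @ vectors.T                 # pairwise object similarities
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster similarity = similarity of the two most similar members.
                s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters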
Since the number of clusters to be produced is not known a priori, three noun clusterings
and three verb clusterings (with 10, 15, and 20 clusters each) are generated. Then the combination
of noun clustering, verb clustering, the C value, and the feature subset that maximizes F-score on
the dev set is selected, and the resulting combination is applied to the test set.
5.2.3. EXPLOITING RATIONALES
This section describes another extension to the baseline: exploiting rationales to generate
additional training instances for the SVM learner.
Background
The idea of using annotator rationales to improve text classification was proposed by Zaidan et al.
[1]. A rationale is a human-annotated text fragment that motivates an annotator to assign a
particular label to a training document. In their work on classifying the sentiment expressed in
movie reviews as positive or negative, Zaidan et al. generate additional training instances by
removing rationales from documents. Since these pseudo-instances lack information that the
annotators thought was important, an SVM learner should be less confident about the label of
these weaker instances (by placing the hyperplane closer to the less confidently labeled training
instances). A learner that successfully learns this difference in confidence assigns a higher
importance to the pieces of text that are present only in the original instances. Thus the pseudo-
instances help the learner both by providing an indication of which parts of the documents are
important and by increasing the number of training instances.
Application to Traceability Prediction
Unlike in sentiment analysis, where rationales can be identified for both positive and negative
training reviews, in traceability prediction, rationales can only be identified for the positive training
instances (i.e., pairs with links). As noted before, the reason is that in traceability prediction,
an instance is labeled as negative because of the absence of evidence that the two requirements
involved should be linked, rather than the presence of evidence that they should not be linked.
Hence, only positive pseudo-instances will be created for training a traceability predictor.
Zaidan et al.’s method cannot be applied as is to create positive pseudo-instances. According to
their method, (1) a pair of linked requirements is chosen, (2) the rationales from both of them are
removed, (3) a positive pseudo-instance from the remaining text fragments is created, and (4) a
constraint is added to the SVM learner, forcing the learner to classify that positive pseudo-instance
less confidently than the original positive instance. Creating positive pseudo-instances in this way
is problematic for our task. The reason is simple: a negative instance in our task stems from the
absence of evidence that the two requirements should be linked. In other words, after removing
the rationales from a pair of linked requirements, the pseudo-instance created from the remaining
text fragments should be labeled as negative.
Given this observation, one option is to employ Zaidan et al.’s method to create negative pseudo
instances. Another option would be to create a positive pseudo-instance from each pair of linked
requirements by removing any text fragments from the pair that are not part of a rationale. In
other words, only the rationales are used to create positive pseudo-instances. Both options could
be viable, but positive rather than negative pseudo-instances are chosen to add to our training set,
as adding positive pseudo-instances will not aggravate the class imbalance problem.
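A sketch of how both kinds of pseudo-instances could be derived from a linked pair once rationales are available; representing rationales as word sets is an assumption made for illustration:

def make_pseudo_instances(high_words, low_words, high_rationale, low_rationale):
    """Positive pseudo-instance keeps only rationale words; the negative one removes them."""
    positive = ([w for w in high_words if w in high_rationale],
                [w for w in low_words if w in low_rationale])
    negative = ([w for w in high_words if w not in high_rationale],
                [w for w in low_words if w not in low_rationale])
    return positive, negative    # both are fed through the same feature extraction as real pairs

pos, neg = make_pseudo_instances(
    ["save", "entry", "address", "book", "quickly"],
    ["store", "contact", "address", "book", "screen"],
    {"save", "address", "book"}, {"store", "address", "book"})
print(pos)   # (['save', 'address', 'book'], ['store', 'address', 'book'])
print(neg)   # (['entry', 'quickly'], ['contact', 'screen'])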
Unlike Zaidan et al., who force the learner to classify pseudo-instances less confidently than the
original instances, our learner decide whether it wants to classify these additional training in-
stances more or less confidently based on the dev data. In other words, this confidence parameter
(denoted as µ in Zaidan et al.’s paper) is tuned jointly with the C value to maximize F-score on the
dev set. Note that pseudo-instances are created only for the training set, as rationales are annotated
only in the training documents.
To better understand our annotator rationale framework, let us define it more formally. Recall that
in a standard soft-margin SVM, the goal is to find w and ξ to minimize

\[
\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} \;+\; C \sum_{i} \xi_{i}
\]

subject to

\[
c_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i,
\]

where xi is a training example; ci ∈ {−1, 1} is the class label of xi; ξi is a slack variable that
allows xi to be misclassified if necessary; and C > 0 is the misclassification penalty (a.k.a. the
regularization parameter).
The following constraints are added to enable this standard soft-margin SVM to also learn from
the positive pseudo-instances:
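One plausible form of these constraints, consistent with the definitions below (the exact inequality is an assumption rather than the authors' formulation), is

\[
\mathbf{w} \cdot \mathbf{v}_i + b \;\ge\; \mu - \xi_i, \qquad \xi_i \ge 0,
\]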
where vi is the positive pseudo-instance created from positive example xi, ξi ≥ 0 is the slack
variable associated with vi, and µ is the margin size (which controls how confident the classifier
is in classifying the pseudo-instances).
Similarly, the following constraints are added to learn from the negative pseudo-instances:
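Analogously, a plausible form for the negative pseudo-instances (again an assumed reconstruction) is

\[
-\,(\mathbf{w} \cdot \mathbf{u}_{ij} + b) \;\ge\; \mu - \xi_{ij}, \qquad \xi_{ij} \ge 0,
\]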
where uij is the jth negative pseudo-instance created from positive example xi, ξij ≥ 0 is the slack
variable associated with uij, and µ is the margin size.
Our learner decides how confidently it wants to classify these additional training instances based
on the dev data. Specifically, this confidence parameter µ is tuned jointly with the C value to
maximize F-score on the dev set.
6. EVALUATION
6.1. EXPERIMENTAL SETUP
F-score, which is the unweighted harmonic mean of recall and precision, is employed as the
evaluation measure. Recall is the percentage of links in the gold standard that are recovered by
our system. Precision is the percentage of links recovered by our system that are correct. Each
document is preprocessed by removing stopwords and stemming the remaining words. All
results are obtained via 5-fold cross validation.
6.2. RESULTS AND DISCUSSION
6.2.1. BASELINE SYSTEMS
There are two unsupervised and two supervised baselines.
Baseline 1: Tf.Idf. Thresholds 0.1 to 0.9 with an increment of 0.1 are tested and results are
reported using the best threshold, essentially giving an advantage to it in the performance
comparison. From row 1 of Table 3, it achieves an F-score of 54.5%.
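A sketch of such a baseline follows; the use of cosine similarity over tf-idf vectors is consistent with the similarity measures cited in Section 2.2, but the exact configuration (stopword handling, threshold) is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_links(high_texts, low_texts, threshold=0.3):
    """Label a (high, low) pair as a link if tf-idf cosine similarity exceeds the threshold."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(high_texts + low_texts)
    high, low = X[:len(high_texts)], X[len(high_texts):]
    sims = cosine_similarity(high, low)        # shape: (n_high, n_low)
    return [(i, j) for i in range(sims.shape[0])
                   for j in range(sims.shape[1]) if sims[i, j] > threshold]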
Baseline 2: LDA. Also motivated by previous work, LDA is employed as our second unsuper-
vised baseline. An LDA is trained on our data to produce n topics (where n=10, 20, . . ., 60). Then
the n topics are used as features for representing each document, where the value of a feature is the
probability the document belongs to the corresponding topic. Cosine is used as the similarity
measure. Any pair of requirements whose similarity exceeds a given threshold is labeled as
positive. Thresholds from 0.1 to 0.9 with an increment of 0.1 are tested and results are reported
using the best threshold, essentially giving an advantage to it in the performance comparison.
From row 2 of Table 3, it achieves an F-score of 34.2%.
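A corresponding sketch of this baseline using scikit-learn; the preprocessing and the choice of library are stand-ins rather than the authors' exact setup:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def lda_links(high_texts, low_texts, n_topics=20, threshold=0.5):
    """Represent each requirement by its topic distribution; link pairs whose cosine exceeds the threshold."""
    counts = CountVectorizer(stop_words="english").fit_transform(high_texts + low_texts)
    topics = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit_transform(counts)
    high, low = topics[:len(high_texts)], topics[len(high_texts):]
    sims = cosine_similarity(high, low)
    return [(i, j) for i in range(sims.shape[0])
                   for j in range(sims.shape[1]) if sims[i, j] > threshold]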
Baseline 3: Features 1 and 2. As the first supervised baseline, an SVM classifier is trained using
only features 1 and 2 (all the word pairs). From row 3 of Table 3, it achieves F-scores of 57.1%
(without FS) and 67.7% (with FS). These results suggest that FS is indeed useful.
Baseline 4: Features 1, 2, and LDA. As the second supervised baseline, the feature set used in
Baseline 3 is augmented with the LDA features used in Baseline 2, and then an SVM classifier is
trained. The best n (number of topics) is selected using the dev set. Clearly from row 4 of Table 3,
this is the best of the four baselines: it significantly outperforms Baseline 3 regardless of whether
feature selection is performed, suggesting the usefulness of the LDA features.
6.2.2. OUR APPROACH
Next, our 2-learner, ontology-based approach is evaluated, without using pseudo instances created
from rationales. In the single-learner experiments, a classifier is trained on the seven features
described in Section 5.1, whereas in the 2-learner experiments, these seven features are split as
described in Section 5.2.1.
Setting 1: Single learner, manual clusters. From row 5 of Table 3 under “No pseudo” column,
this classifier significantly outperforms the best baseline (Baseline 4): F-scores increase by 4.6%
(without FS) and 3.5% (with FS). Since the only difference between this and Baseline 4 lies in
whether the LDA features or the ontology-based features are used, these results seem to suggest
that features formed from the clusters in our hand-built ontology are more useful than the LDA
features.
Table 3. Five-fold cross-validation results.
Notes: R, P, and F denote Recall, Precision, and F-score, respectively.
Setting 2: Single learner, induced clusters. As shown in row 6 of Table 3 under the “No pseudo” column,
this classifier performs statistically indistinguishably from the one in Setting 1. This is an
encouraging result: it shows that even when features are created from induced rather than
manual clusters, performance does not significantly drop regardless of whether FS is performed.
Setting 3: Two learners, manual clusters. As shown in row 7 of Table 3 under the “No pseudo” column, this classifier performs significantly better than the one in Setting 1: F-scores increase by 9.2% (without FS) and 3.0% (with FS). As the two settings differ only w.r.t. whether one or two learners are used, the improvements suggest the effectiveness of our 2-learner framework.
Setting 4: Two learners, induced clusters. As shown in row 8 of Table 3 under the “No pseudo” column,
this classifier performs significantly better than the one in Setting 2: F-scores increase by 9.3%
(without FS) and 5.8% (with FS). It also performs indistinguishably from the one in Setting 3.
Taken together, these results suggest that (1) our 2-learner framework is effective in improving
performance, and (2) features derived from induced clusters are as effective as those from manual
clusters.
Overall, these results show that (1) our 2-learner, pseudo-instances-enhanced, ontology-based approach is effective, and (2) feature selection consistently improves performance.
To gain insight into which features and which clustering outputs are selected, consider the best-performing system (row 8 in Table 3): as determined on the dev set, it uses features 1 (same words), 3 (verb pairs), 4 (verb group pairs), and 5 (noun pairs), as well as the Interactive clustering output (with 20 clusters).
6.2.3. FURTHER INVESTIGATION
This section examines the use of pseudo-instances during model training to boost learning performance.
Using Positive Pseudo-instances
The “Pseudo pos only” column of Table 3 shows the results when each of the systems is trained
with additional positive pseudo-instances.
Comparing row-wise with the “No pseudo” column, employing positive pseudo-instances increases performance on Pine: F-scores rise by 1.1–3.9% without FS and 1.8–3.8% with FS, and the differences are statistically significant in all cases. These results suggest that the addition of positive pseudo-instances is useful for traceability link prediction.
Using Positive and Negative Pseudo-instances
The “Pseudo pos+neg” column of Table 3 shows the results when each of the systems is trained
with additional positive and negative pseudo-instances.
Comparing these results with the corresponding “Pseudo pos only” results, additionally employing negative pseudo-instances improves performance almost consistently: F-scores rise by 0.8–2.0% without FS and by up to 2.4% with FS, with one exception (one learner with manual clusters and FS, which drops by 0.7%). The differences in two of the four 1-learner cases (manual clusters with FS, induced clusters without FS) are statistically indistinguishable, but the improvements in all four 2-learner cases are statistically significant. These results suggest that the additional negative pseudo-instances provide useful supplementary information for traceability link prediction.
In addition, adding the features derived from manual/induced clusters to the supervised baseline consistently improves its performance: F-scores rise significantly by 1.3–14.5%.
Finally, the best results in our experiments are achieved when both positive and negative pseudo-instances are used in combination with manual/induced clusters and feature selection: F-scores reach 81.1–81.3%. These results translate to significant improvements in F-score over the supervised baseline (with no pseudo-instances) of 13.4–13.6%, or relative error reductions of 41.5–42.1%.
Pseudo-instances from Residuals
Recall that Zaidan et al. [1] created pseudo-instances from the text fragments that remain after the rationales are removed. In Section 5.2.3, it was argued that their method of creating positive pseudo-instances is problematic for our requirements traceability task. In this subsection, this claim is verified empirically.
Specifically, the “Pseudo residual” column of Table 3 shows the results when each of the “No
pseudo” systems is additionally trained on the positive pseudo-instances created using Zaidan et
al.’s method. Comparing these results with the corresponding “Pseudo pos+neg” results, replacing our method of creating positive pseudo-instances with Zaidan et al.’s method causes
the F-scores to drop significantly by 21.0–30.3% in all cases. In fact, comparing these results with the corresponding “No pseudo” results shows that employing positive pseudo-instances created with Zaidan et al.’s method yields significantly worse results than not employing pseudo-instances at all. These results provide suggestive evidence for our claim.
7. CONCLUSIONS
A 2-learner, ontology-based, pseudo-instances-enhanced approach to supervised traceability prediction has been investigated. The results showed that (1) our approach is effective: compared with the best baseline, relative error is reduced by 56.0%; (2) the pseudo-instances extension is effective and helps mitigate situations in which human-labelled links are insufficient; and (3) interestingly, results obtained via induced clusters were as competitive as those obtained via manual clusters, which indicates the potential to automate the construction of ontology rationales for traceability prediction and thereby reduce human effort.
REFERENCES
[1] O. Zaidan, J. Eisner, and C. Piatko, "Using "annotator rationales" to improve machine learning for text categorization," in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, 2007, pp. 260–267.
[2] R. A. Pierce, "A requirements tracing tool," ACM SIGSOFT Software Engineering Notes, vol. 3, no. 5, pp. 53–60, 1978.
[3] J. Jackson, "A keyphrase based traceability scheme," in Tools and Techniques for Maintaining Traceability During Design, IEE Colloquium on. IET, 1991, pp. 2–1.
[4] S. K. Sundaram, J. H. Hayes, and A. Dekhtyar, "Baselines in requirements tracing," ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–6, 2005.
[5] M. Lormans and A. Van Deursen, "Can LSI help reconstructing requirements traceability in design and test?" in Software Maintenance and Reengineering, 2006. CSMR 2006. Proceedings of the 10th European Conference on. IEEE, 2006, 10 pp.
[6] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora, "Recovering traceability links in software artifact management systems using information retrieval methods," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 16, no. 4, p. 13, 2007.
[7] A. De Lucia, R. Oliveto, and G. Tortora, "Assessing IR-based traceability recovery tools through controlled experiments," Empirical Software Engineering, vol. 14, no. 1, pp. 57–92, 2009.
[8] J. Cleland-Huang, R. Settimi, C. Duan, and X. Zou, "Utilizing supporting evidence to improve dynamic requirements traceability," in Requirements Engineering, 2005. Proceedings. 13th IEEE International Conference on. IEEE, 2005, pp. 135–144.
[9] D. Port, A. Nikora, J. H. Hayes, and L. Huang, "Text mining support for software requirements: Traceability assurance," in System Sciences (HICSS), 2011 44th Hawaii International Conference on. IEEE, 2011, pp. 1–11.
[10] J. N. Dag, B. Regnell, P. Carlshamre, M. Andersson, and J. Karlsson, "A feasibility study of automated natural language requirements analysis in market-driven development," Requirements Engineering, vol. 7, no. 1, pp. 20–33, 2002.
[11] Z. Li, M. Chen, L. Huang, and V. Ng, “Recovering traceability links in requirements documents,” in
Proceedings of the Nineteenth Conference on Computational Natural Language Learning, 2015, pp.
237–246.
[12] J. Guo, J. Cheng, and J. Cleland-Huang, “Semantically enhanced software traceability using deep
learning techniques,” in 2017 IEEE/ACM 39th International Conference on Software Engineering
(ICSE), May 2017, pp. 3–14.
[13] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating typed dependency parses from
phrase structure parses,” in Proceedings of the 5th International Conference on Language Resources
and Evaluation, 2006, pp. 449–454.
[14] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[15] A. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial
Intelligence, vol. 97, no. 1–2, pp. 245–271, 1997.