Semantic similarity measures play an important role in information retrieval, natural language processing, and various web tasks such as relation extraction, community mining, document clustering, and automatic meta-data extraction. In this paper, we propose a Pattern Retrieval Algorithm (PRA) to compute a semantic similarity measure between words by combining the page count method and the web snippets method. Four association measures are used to find the semantic similarity between words in the page count method using web search engines. We use a Sequential Minimal Optimization (SMO) support vector machine (SVM) to find the optimal combination of page-count-based similarity scores and top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous word pairs and non-synonymous word pairs. The proposed approach aims to improve the correlation, precision, recall, and F-measure values compared to existing methods; the proposed algorithm outperforms existing methods with a correlation value of 89.8%.
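The abstract does not name the four association measures; in comparable page-count work they are commonly WebJaccard, WebOverlap, WebDice and WebPMI, so the sketch below assumes those four. The page counts p, q and pq (hits for each word alone and for their conjunctive query) are inputs the caller would obtain from a search engine API, and the index size N is a hypothetical constant.

    from math import log2

    N = 10**10  # assumed size of the search engine's index (hypothetical)

    def web_jaccard(p, q, pq):
        # p, q, pq: page counts for "p", "q", and "p AND q"
        return 0.0 if pq == 0 else pq / (p + q - pq)

    def web_overlap(p, q, pq):
        return 0.0 if pq == 0 else pq / min(p, q)

    def web_dice(p, q, pq):
        return 0.0 if pq == 0 else 2 * pq / (p + q)

    def web_pmi(p, q, pq):
        # pointwise mutual information over estimated probabilities
        if p == 0 or q == 0 or pq == 0:
            return 0.0
        return log2((pq / N) / ((p / N) * (q / N)))

The four scores can then serve as the page-count features fed to the SVM described above.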
Computing semantic similarity measure between words using web search engine - csandit
Semantic similarity measures between words play an important role in information retrieval, natural language processing, and various tasks on the web. In this paper, we propose a Modified Pattern Extraction Algorithm to compute a supervised semantic similarity measure between words by combining the page count method and the web snippets method. Four association measures are used to find the semantic similarity between words in the page count method using web search engines. We use a Sequential Minimal Optimization (SMO) support vector machine (SVM) to find the optimal combination of page-count-based similarity scores and top-ranking patterns from the web snippets method. The SVM is trained to classify synonymous word pairs and non-synonymous word pairs. The proposed Modified Pattern Extraction Algorithm outperforms existing methods with a correlation value of 89.8 percent.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
SEMANTIC INFORMATION EXTRACTION IN UNIVERSITY DOMAIN - cscpconf
Today’s conventional search engines rarely provide content that is truly relevant to the user’s search query, because the context and semantics of the user’s request are not analyzed to the full extent. Hence the need for semantic web search (SWS), an emerging area of web search that combines Natural Language Processing and Artificial Intelligence.
The objective of this work is to design, develop and implement a semantic search engine, SIEU (Semantic Information Extraction in University Domain), confined to the university domain. SIEU uses an ontology as a knowledge base for the information retrieval process. It is not a mere keyword search: it sits one layer above what Google or any other search engine retrieves by analyzing just the keywords. Here the query is analyzed both syntactically and semantically.
The developed system retrieves web results more relevant to the user query through keyword expansion. The results obtained are accurate enough to satisfy the user's request, and the level of accuracy is enhanced because the query is analyzed semantically. The system will be of great use to developers and researchers who work on the web. The Google results are re-ranked and optimized to surface the relevant links; for ranking, an algorithm has been applied that fetches more apt results for the user query.
Document Retrieval System, a Case Study - IJERA Editor
In this work we propose a method for automatic indexing and retrieval that returns the documents most likely to be related to the input query. The technique used in this project is singular-value decomposition (SVD): a large term-by-document matrix is analyzed and decomposed into 100 factors. Documents are represented by 100-item vectors of factor weights, while queries are represented as pseudo-document vectors formed from weighted combinations of terms.
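As a rough illustration of the retrieval scheme just described (a sketch, not the authors' code), the following decomposes a term-by-document matrix with numpy's SVD, keeps up to 100 factors, folds a query in as a pseudo-document, and ranks documents by cosine similarity.

    import numpy as np

    def build_lsi(term_doc, k=100):
        # term_doc: terms x documents matrix of (weighted) term counts
        U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
        U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
        doc_vecs = (np.diag(s_k) @ Vt_k).T   # documents as k-dim factor-weight vectors
        return U_k, s_k, doc_vecs

    def query_vector(q_terms, U_k, s_k):
        # fold the query in as a pseudo-document: q' = q U_k diag(1/s_k)
        return q_terms @ U_k @ np.diag(1.0 / s_k)

    def rank(qv, doc_vecs):
        # cosine similarity between the query vector and every document vector
        sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv) + 1e-12)
        return np.argsort(-sims)  # document indices, most similar first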
Extracting and Reducing the Semantic Information Content of Web Documents to ... - ijsrd.com
This document discusses various techniques for semantic document retrieval and summarization. It begins by introducing the challenges of semantic search and techniques like word sense disambiguation that aim to improve search relevance. It then discusses using ontologies and semantic networks to reduce the semantic content of documents in order to support semantic document retrieval. Finally, it proposes using relationship-based data cleaning techniques to disambiguate references between entities by analyzing their features and relationships.
IRJET- Semantic Web Mining and Semantic Search Engine: A Review - IRJET Journal
This document provides an overview of the semantic web, semantic web mining, and semantic search engines. It discusses how the semantic web aims to make web data machine-readable through technologies like RDF and ontologies. Semantic web mining involves extracting useful knowledge from the semantic web. Semantic search engines then allow users to retrieve more precise and meaningful data from the semantic web through the use of semantic technologies. The document outlines challenges for semantic search engines and opportunities for further research.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
Research on Document Indexing in Search Engines. The main goal of information retrieval is to return the exact response to a user's specific query.
Information search and retrieval is a large process; to realize it, we need to build an effective application using techniques such as document indexing, page ranking, and clustering. Among these, the document index plays a vital role in searching: instead of scanning hundreds of thousands of documents, the engine goes directly to the relevant index entry and returns the output. Our focus here is indexing; indexing means storing an index in order to optimize the speed and performance of finding the documents that correspond to the user's query.
Our conclusion is that a context-based index approach, built mainly from the source documents, should be used in query retrieval. Instead of searching every page on the server, index lookup is technically better: it saves time and reduces the load on the server.
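The passage describes the core idea of an inverted index; a minimal sketch of that idea (illustrative only, not the paper's system):

    from collections import defaultdict

    def build_index(docs):
        # docs: {doc_id: text}; index maps term -> set of doc_ids containing it
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        # AND semantics: return documents containing every query term
        sets = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()

    docs = {1: "semantic web search", 2: "inverted index search engine"}
    idx = build_index(docs)
    print(search(idx, "search engine"))  # -> {2}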
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document discusses several key aspects of mathematics and algorithms used in internet information retrieval and search engines:
1. It explains how search engines like Google can rapidly rank billions of web pages using algorithms based on the topology and link structure of the web graph, such as PageRank.
2. It describes two main types of page ranking algorithms - static importance ranking based on link analysis, and dynamic relevance ranking based on statistical learning models to match pages to queries.
3. It proposes a new ranking algorithm called BrowseRank that models user browsing behavior using Markov chains and takes into account visit duration to better reflect true page importance.
The document discusses several mathematical models and algorithms used in internet information retrieval and search engines:
1. Markov chain methods can be used to model a user's web surfing behavior and page visit transitions.
2. BrowseRank models user browsing as a Markov process to calculate page importance based on observed user behavior rather than artificial assumptions (a minimal sketch follows after this list).
3. Learning to rank problems in information retrieval can be framed as a two-layer statistical learning problem where queries are the first layer and document relevance judgments are the second layer.
4. Stability theory can provide generalization bounds for learning to rank algorithms under this two-layer framework. Modifying algorithms like SVM and Boosting to have query-level stability improves performance.
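The summaries above do not include BrowseRank's actual equations; the sketch below is only a simplified reading of the stated idea: estimate a transition matrix from observed browsing sessions, find its stationary distribution by power iteration, and weight pages by average dwell time. The session and dwell inputs are hypothetical.

    from collections import defaultdict

    def browse_rank(sessions, dwell, iters=50):
        # sessions: list of page-id sequences observed in user logs
        # dwell: {page: average seconds spent}, used to weight importance
        counts = defaultdict(lambda: defaultdict(int))
        for s in sessions:
            for a, b in zip(s, s[1:]):
                counts[a][b] += 1
        pages = sorted({p for s in sessions for p in s})
        score = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):  # power iteration on the estimated chain
            nxt = {p: 0.0 for p in pages}
            for a in pages:
                total = sum(counts[a].values())
                if total == 0:  # dangling page: spread its mass uniformly
                    for p in pages:
                        nxt[p] += score[a] / len(pages)
                else:
                    for b, c in counts[a].items():
                        nxt[b] += score[a] * c / total
            score = nxt
        # weight stationary probability by average dwell time
        return {p: score[p] * dwell.get(p, 1.0) for p in pages}

    print(browse_rank([["a", "b", "c"], ["b", "c"]], {"a": 5, "b": 30, "c": 60}))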
Context Sensitive Search String Composition Algorithm using User Intention to... - IJECEIAES
Finding the required URL among the first few result pages of a search engine is still a challenging task. This may require a number of reformulations of the search string, adversely affecting the user's search time. Query ambiguity and polysemy are major reasons for not obtaining relevant results in the top few result pages. Efficient query composition and data organization are necessary for getting effective results. The context of the information need and the user intent may improve the autocomplete feature of existing search engines. This research proposes a Funnel Mesh-5 (FM5) algorithm to construct a search string taking into account the context of the information need and the user intention, with three main steps: 1) predict user intention with user profiles and past searches via a weighted mesh structure; 2) resolve ambiguity and polysemy of search strings with context and user intention; 3) generate a personalized, disambiguated search string by query expansion encompassing the user intention and the predicted query. Experimental results for the proposed approach and a comparison with direct use of a search engine are presented, along with a comparison of the FM5 algorithm with the K Nearest Neighbor algorithm for user intention identification. The proposed system provides better precision of search results for ambiguous search strings, with improved identification of the user intention. Results are presented for an English-language dataset as well as a Marathi (an Indian language) dataset of ambiguous search strings.
IJCER (www.ijceronline.com) International Journal of computational Engineerin... - ijceronline
This document summarizes a research paper on developing user profiles from search engine queries to enable personalized search results. It discusses how current search engines generally return the same results regardless of individual user interests. The paper proposes methods to construct user profiles capturing both positive and negative preferences from search histories and click-through data. Experimental results showed profiles including both preferences performed best by improving query clustering and separating similar vs. dissimilar queries. Future work aims to use profiles for collaborative filtering and predicting new query intents.
This document summarizes a research paper that proposes a framework for personalized web search using query log and clickthrough data. The framework implements a re-ranking approach that combines user search context and browsing behavior to generate personalized search results with high relevance. The framework consists of five components: a request handler, query processor, result handler, event handler, and response handler. The result handler applies a re-ranking approach using query log and clickthrough data to personalize search results before returning them to the user. An evaluation found the framework and re-ranking approach to be effective for personalized web search and information retrieval.
This document presents a general framework for building classifiers and clustering models using hidden topics to deal with short and sparse text data. It analyzes hidden topics from a large universal dataset using LDA. These topics are then used to enrich both the training data and new short text data by combining them with the topic distributions. This helps reduce data sparseness and improves classification and clustering accuracy for short texts like web snippets. The framework is also applied to contextual advertising by matching web pages and ads based on their hidden topic similarity.
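A minimal sketch of the enrichment step described above, using scikit-learn's LDA in place of whatever toolkit the paper used; the tiny corpora and the topic count are placeholders, and a real "universal dataset" would be far larger.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # placeholder background corpus; in practice a large universal dataset
    universal = ["machine learning methods for text",
                 "web search engines index documents",
                 "topic models uncover latent structure"]
    snippets = ["web search", "latent topic"]  # short, sparse texts to enrich

    vec = CountVectorizer()
    X_bg = vec.fit_transform(universal)
    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X_bg)

    X_short = vec.transform(snippets)
    topics = lda.transform(X_short)                    # per-snippet topic distributions
    enriched = np.hstack([X_short.toarray(), topics])  # word features + topic features
    print(enriched.shape)

The enriched matrix combines the sparse word counts with dense topic proportions, which is the mechanism the summary describes for reducing data sparseness before classification or clustering.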
Technical Whitepaper: A Knowledge Correlation Search Engine - s0P5a41b
For the technically oriented reader, this brief paper describes the technical foundation of the Knowledge Correlation Search Engine - patented by Make Sence, Inc.
Information Retrieval on Text using Concept Similarity - rahulmonikasharma
This document summarizes a research paper on concept-based information retrieval using semantic analysis and WordNet. It discusses some of the challenges with keyword-based retrieval, such as synonymy and polysemy problems. Concept-based retrieval aims to address these issues by mapping documents and queries to semantic concepts rather than keywords. The paper proposes extracting concepts from text documents using WordNet to identify synonyms, hypernyms and hyponyms. It involves calculating term frequencies to determine a hierarchy of important concepts. The methodology is implemented using Java and WordNet to extract concepts from sample input documents.
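A small sketch of the concept-extraction step as described (synonyms, hypernyms and hyponyms from WordNet), using NLTK's WordNet interface rather than the paper's Java implementation; it assumes the wordnet corpus has been downloaded.

    from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

    def extract_concepts(word):
        concepts = set()
        for syn in wn.synsets(word):
            concepts.update(l.name() for l in syn.lemmas())      # synonyms
            for h in syn.hypernyms():
                concepts.update(l.name() for l in h.lemmas())    # broader concepts
            for h in syn.hyponyms():
                concepts.update(l.name() for l in h.lemmas())    # narrower concepts
        return concepts

    print(sorted(extract_concepts("car"))[:10])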
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper - John Felahi
- The document discusses the evolution of text analytics technologies from early keyword indexing to more advanced mathematical approaches like latent semantic indexing (LSI).
- It explains that early keyword indexing focused only on word frequencies and occurrences, which could lead to false positives and did not capture the conceptual meaning of documents.
- More advanced approaches like LSI use linear algebraic calculations to analyze word co-occurrences across large document sets and derive the conceptual relationships between terms and topics in a way that better mirrors human understanding.
Improving Annotations in Digital Documents using Document Features and Fuzzy ... - IRJET Journal
The document proposes a system to automatically annotate digital documents using document features extracted via natural language processing techniques and fuzzy logic. It aims to improve on existing annotation systems by maintaining semantic accuracy while annotating large amounts of documents. The system first extracts features from documents like titles, sentence length, proper nouns etc. It then uses fuzzy logic to apply the best possible annotations based on weighted feature values. The approach is meant to accurately annotate documents in all conditions while preserving semantic meaning.
Semantic tagging for documents using 'short text' information - csandit
Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers to quickly overview any document. With the ever increasing volume and variety of documents published on the internet, the interest in developing newer and successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging occurs when the full content of the document is not readily accessible. In such a scenario, techniques which use “short text”, e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only “short text” information from the documents. We employ crowd-sourced knowledge from Wikipedia, DBpedia, Freebase, Yago and similar open source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in, or “topic drift” from, the main topic of our query document. We have used a real world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full text information from the document to generate tags. The proposed and the baseline approaches were compared using the author-assigned keywords for the documents as the ground truth information. We found that the tags generated using the proposed approach are better than those of the baseline in terms of overlap with the ground truth tags, measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in terms of the quality of tags generated and the computational time.
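The Jaccard index used in the evaluation above is simple to state; for reference, with hypothetical tag sets:

    def jaccard(tags_a, tags_b):
        # |intersection| / |union| of the two tag sets
        a, b = set(tags_a), set(tags_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    predicted = {"semantic tagging", "knowledge base", "wikipedia"}
    ground_truth = {"semantic tagging", "document annotation"}
    print(round(jaccard(predicted, ground_truth), 3))  # 0.25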
IRJET - Online Assignment Plagiarism Checking using Data Mining and NLP - IRJET Journal
This document presents a proposed system for detecting plagiarism in student assignments submitted online. The system would use data mining algorithms and natural language processing to compare submitted assignments against each other and identify plagiarized content. It would analyze assignments at both the syntactic and semantic levels. The proposed system is intended to more efficiently and accurately detect plagiarism compared to teachers manually reviewing all submissions. The document describes the workflow of the system, including preprocessing of assignments, text analysis, similarity measurement, and algorithms that would be used like Rabin-Karp, KMP and SCAM.
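Of the algorithms the summary names, Rabin-Karp is the classic rolling-hash matcher; a minimal reference implementation (not the proposed system itself):

    def rabin_karp(text, pat, base=256, mod=10**9 + 7):
        n, m = len(text), len(pat)
        if m == 0 or m > n:
            return []
        high = pow(base, m - 1, mod)  # base^(m-1), used to drop the leftmost char
        h_pat = h_win = 0
        for i in range(m):
            h_pat = (h_pat * base + ord(pat[i])) % mod
            h_win = (h_win * base + ord(text[i])) % mod
        hits = []
        for i in range(n - m + 1):
            # verify on hash match to rule out collisions
            if h_pat == h_win and text[i:i + m] == pat:
                hits.append(i)
            if i < n - m:  # roll the window one character to the right
                h_win = ((h_win - ord(text[i]) * high) * base + ord(text[i + m])) % mod
        return hits

    print(rabin_karp("the cat sat on the mat", "the"))  # [0, 15]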
To provide relevant data to users from the massive data available on the web, the Semantic Web technique is used. This presentation gives an introduction to the semantic web and how NLP can be used in it.
The web has become a resourceful tool for almost all domains today. Search engines prominently use the inverted indexing technique to locate the web pages matching a user's query. The performance of an inverted index fundamentally depends upon the search for the keyword in the list maintained by the search engine. Text matching is done with the help of a string matching algorithm, and it is important for any string matching algorithm to quickly locate the occurrences of the user-specified pattern in a large text. In this paper a new string matching algorithm for keyword searching is proposed. The proposed algorithm relies on a new technique based on pattern length and FML (First-Middle-Last) character match. The algorithm is analysed and implemented, with extensive testing and comparisons against Boyer-Moore, Naïve, Improved Naïve, Horspool and Zhu-Takaoka. The results show that the proposed algorithm takes less time than the other existing algorithms.
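The paper's exact algorithm is not reproduced in this summary, so the following is only a plausible sketch of the stated FML idea: test the first, middle and last characters of the pattern at each alignment as a cheap filter before a full comparison.

    def fml_search(text, pat):
        n, m = len(text), len(pat)
        if m == 0 or m > n:
            return []
        first, mid, last = pat[0], pat[m // 2], pat[-1]
        hits = []
        for i in range(n - m + 1):
            # cheap First-Middle-Last filter before the full comparison
            if (text[i] == first and text[i + m - 1] == last
                    and text[i + m // 2] == mid):
                if text[i:i + m] == pat:
                    hits.append(i)
        return hits

    print(fml_search("abracadabra", "abra"))  # [0, 7]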
The document discusses predictive text analytics, including predicting text completions, disambiguating text, and correcting errors. It also discusses extracting entities, concepts, facts, and sentiments from unstructured text sources for applications like search, knowledge discovery, and predictive analytics. Key challenges include the complexity of human language with features like ambiguity and context.
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,... - cscpconf
The document proposes a method called Page Count and Snippets Method (PCSM) to estimate semantic similarity between words using information from web search engines. PCSM uses both page counts and lexical patterns extracted from snippets to measure semantic similarity. It defines five page count-based concurrence measures and extracts lexical patterns from snippets to identify semantic relations between words. Support vector machine is used to integrate the similarity scores from page counts and snippet methods. The method is evaluated on benchmark datasets and shows improved correlation compared to existing methods.
This document discusses methods for measuring semantic similarity between words. It begins by discussing how traditional lexical similarity measurements do not consider semantics. It then discusses several existing approaches that measure semantic similarity using web search engines and text snippets. These approaches calculate word co-occurrence statistics from page counts and analyze lexical patterns extracted from snippets. Pattern clustering is used to group semantically similar patterns. The approaches are evaluated using datasets and metrics like precision and recall. Finally, the document proposes a new method that combines page count statistics, lexical pattern extraction and clustering, and support vector machines to measure semantic similarity.
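A minimal sketch of the final integration step these summaries describe: concatenate page-count similarity scores with lexical-pattern-cluster features and train an SVM to separate synonymous from non-synonymous pairs. scikit-learn's SVC (an SMO-style solver under the hood) stands in for the SMO implementation the papers mention, and the feature values below are illustrative, not real data.

    import numpy as np
    from sklearn.svm import SVC

    def features(page_count_scores, cluster_scores):
        # page_count_scores: e.g. [WebJaccard, WebOverlap, WebDice, WebPMI]
        # cluster_scores: weight of each lexical-pattern cluster for this pair
        return np.concatenate([page_count_scores, cluster_scores])

    # toy training data: 1 = synonymous pair, 0 = non-synonymous (illustrative only)
    X = np.array([[0.80, 0.90, 0.85, 3.2, 0.70, 0.10],
                  [0.10, 0.20, 0.15, 0.3, 0.05, 0.60],
                  [0.70, 0.80, 0.75, 2.9, 0.60, 0.20],
                  [0.05, 0.10, 0.08, 0.1, 0.02, 0.70]])
    y = np.array([1, 0, 1, 0])

    clf = SVC(kernel="rbf").fit(X, y)
    test_pair = features(np.array([0.75, 0.85, 0.8, 3.0]), np.array([0.65, 0.15]))
    print(clf.predict([test_pair]))           # 1 -> classified as synonymous
    print(clf.decision_function([test_pair])) # signed margin, usable as a similarity score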
Cluster Based Web Search Using Support Vector Machine - CSCJournals
Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the web a popular place for collecting information; today, Internet users access billions of web pages online using search engines. Information on the web comes from many sources, including websites of companies, organizations and communities, personal homepages, etc. Effective representation of web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, the observation is used that relevant web pages are often located close to each other in the web graph of hyperlinks. The paper presents a graphical approach for entity resolution and complements the traditional methodology with the analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between various pairs of nodes in the graph, which can significantly improve the quality of entity resolution. Support vector machines (SVMs), a set of related supervised learning methods used for classification, distribute the load of user queries from the server machine to different client machines so that the system remains stable; web pages are clustered based on their capacities, while the whole database is stored on the server machine. Keywords: SVM, cluster, ER.
Semantic Search of E-Learning Documents Using Ontology Based System - ijcnes
The keyword searching mechanism is traditionally used for information retrieval from web-based systems. However, this mechanism fails to meet the requirements of searching expert knowledge bases built on popular semantic systems. Semantic search of e-learning documents based on ontology is increasingly adopted in information retrieval systems. An ontology-based system simplifies the task of finding correct information on the web by building a search system based on the meaning of a keyword instead of the keyword itself. The major function of the ontology-based system is the development of a specification of conceptualization, which enhances the connection between the information present in web pages and the background knowledge. The semantic gap existing between the keywords found in documents and those in the query can be matched suitably using an ontology-based system. This paper provides a detailed account of the semantic search of e-learning documents using ontology-based systems by making a comparison between various ontology systems. Based on this comparison, the survey attempts to identify possible directions for future research.
This document presents an approach for measuring semantic similarity between terms using a semantic network. It discusses limitations of existing knowledge-based and corpus-based approaches. The proposed approach uses a large-scale isA network obtained from web corpus to compute semantic similarity scores between 0 to 1. It introduces a concept clustering algorithm to handle ambiguous terms. Experimental results show clusters of word contexts and similarity scores generated between word pairs using this semantic network approach. The approach provides an efficient way to compute semantic similarity, especially for large datasets.
Semantic Information Retrieval Using Ontology in University Domain - dannyijwest
Today’s conventional search engines rarely provide content that is truly relevant to the user’s search query, because the context and semantics of the user’s request are not analyzed to the full extent. Hence the need for semantic web search (SWS), an emerging area of web search that combines Natural Language Processing and Artificial Intelligence. The objective of this work is to design, develop and implement a semantic search engine, SIEU (Semantic Information Extraction in University Domain), confined to the university domain. SIEU uses an ontology as a knowledge base for the information retrieval process. It is not a mere keyword search: it sits one layer above what Google or any other search engine retrieves by analyzing just the keywords. Here the query is analyzed both syntactically and semantically. The developed system retrieves web results more relevant to the user query through keyword expansion. The results obtained are accurate enough to satisfy the user's request, and the level of accuracy is enhanced because the query is analyzed semantically. The system will be of great use to developers and researchers who work on the web. The Google results are re-ranked and optimized to surface the relevant links; for ranking, an algorithm has been applied that fetches more apt results for the user query.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
Semantic similarity, and the semantic relatedness measure in particular, is very important in the current scenario due to the huge demand for natural language processing based applications such as chatbots and information retrieval systems such as knowledge-base-driven FAQ systems. Current approaches generally use similarity measures that do not exploit the context-sensitive relationships between words. This leads to erroneous similarity predictions and is not of much use in real-life applications. This work proposes a novel approach that gives an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction results in a more accurate similarity prediction, which in turn yields higher accuracy in information retrieval systems.
A Survey on Approaches of Web Mining in Varied Areas - inventionjournals
There has been a lot of research in recent years on efficient web searching. Several papers have proposed algorithms for user feedback sessions to evaluate the performance of inferring user search goals. When information is retrieved, the user clicks on a particular URL; based on the click rate, ranking is done automatically by clustering the feedback sessions. Web search engines have made enormous contributions to the web and society. They make finding information on the web quick and easy. However, they are far from optimal: a major deficiency of generic search engines is that they follow the "one size fits all" model and are not adaptable to individual users.
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING - IJDKP
Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data like the internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources; such pages either increase the index storage space or increase the serving costs. Detecting these pages has many potential applications, for example indicating plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to perform clustering of documents. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs well in terms of similarity measures, and the identification of duplicate and near-duplicate documents has resulted in reduced memory use in repositories.
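The abstract does not give the algorithm's details; a standard baseline for near-duplicate detection, shown purely for illustration, is w-shingling with Jaccard similarity over shingle sets.

    def shingles(text, w=3):
        # contiguous w-word sequences of the document
        words = text.lower().split()
        return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

    def near_duplicate(a, b, threshold=0.8, w=3):
        sa, sb = shingles(a, w), shingles(b, w)
        if not sa or not sb:
            return False
        sim = len(sa & sb) / len(sa | sb)  # Jaccard over shingle sets
        return sim >= threshold

    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    print(near_duplicate(a, b, threshold=0.5, w=2))  # True (similarity 0.6)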
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVAL - ijcsa
Search engines today are retrieving more than a few thousand web pages for a single query, most of which
are irrelevant. Listing results according to user needs is, therefore, a very real necessity. The challenge lies
in ordering retrieved pages and presenting them to users in line with their interests. Search engines,
therefore, utilize page rank algorithms to analyze and re-rank search results according to the relevance of
the user’s query by estimating (over the web) the importance of a web page. The proposed work
investigates web page ranking methods and recently-developed improvements in web page ranking.
Further, a new content-based web page rank technique is also proposed for implementation. The proposed
technique finds out how important a particular web page is by evaluating the data a user has clicked on, as
well as the contents available on these web pages. The results demonstrate the effectiveness of the proposed
page ranking technique and its efficiency.
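The proposed formula is not given in this summary; the sketch below only illustrates the general idea of blending a content-relevance score with observed click data, with an assumed linear weighting that is not the authors' method.

    def combined_score(content_score, clicks, total_clicks, alpha=0.6):
        # content_score: query/page content relevance in [0, 1]
        # clicks/total_clicks: observed click share for this page on this query
        click_score = clicks / total_clicks if total_clicks else 0.0
        return alpha * content_score + (1 - alpha) * click_score

    pages = {"p1": (0.9, 10), "p2": (0.5, 120), "p3": (0.7, 40)}  # (content, clicks)
    total = sum(c for _, c in pages.values())
    ranked = sorted(pages, key=lambda p: -combined_score(pages[p][0], pages[p][1], total))
    print(ranked)  # pages ordered by the blended score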
The majority of computer and mobile phone enthusiasts use the web for search activity. Web search engines are used for the searching; the results that a search engine returns are provided to it by a software module known as the web crawler. The size of the web is increasing round-the-clock, so the principal problem is searching this huge database for specific information, and stating whether a web page is relevant to a search topic is a dilemma. This paper proposes a crawler called the "PDD crawler", which follows both a link-based and a content-based approach. This crawler follows a completely new crawling strategy to compute the relevance of a page: it analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the maximum content and highest relevance.
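A minimal sketch of the tag-based weighting idea (the specific tag weights are assumptions, not the paper's values), using BeautifulSoup to score a page by where the query terms occur:

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "a": 2.0, "p": 1.0}  # assumed

    def page_weight(html, query_terms):
        soup = BeautifulSoup(html, "html.parser")
        weight = 0.0
        for tag, w in TAG_WEIGHTS.items():
            for node in soup.find_all(tag):
                text = node.get_text().lower()
                # each occurrence of a query term contributes the tag's weight
                weight += w * sum(text.count(t.lower()) for t in query_terms)
        return weight

    html = "<html><title>semantic search</title><p>search engines index pages</p></html>"
    print(page_weight(html, ["search"]))  # 5.0 (title) + 1.0 (p) = 6.0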
PageRank algorithm and its variations: A Survey report (IOSR Journals)
This document provides an overview and comparison of PageRank algorithms. It begins with a brief history of PageRank, developed by Larry Page and Sergey Brin as part of the Google search engine. It then discusses variants like Weighted PageRank and PageRank based on Visits of Links (VOL), which incorporate additional factors like link popularity and user visit data. The document also gives a basic introduction to web mining concepts and categorizes web mining into content, structure, and usage types. It concludes with a comparison of the original PageRank algorithm and its variations.
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
Semantic Search Engine using Ontologies (IJRES Journal)
Nowadays the volume of information on the Web is increasing dramatically, and facilitating users to get useful information has become more and more important to information retrieval systems. While information retrieval technologies have improved to some extent, users are not satisfied with the low precision and recall. With the emergence of the Semantic Web, this situation can be remarkably improved if machines can “understand” the content of web pages. The existing information retrieval technologies can be classified mainly into three classes. The traditional information retrieval technologies are mostly based on the occurrence of words in documents and are limited to string matching; they are of no use when a search is based on the meaning of words rather than on the words themselves. Search engines are limited to string matching and link analysis; the most widely used algorithms are the PageRank algorithm and the HITS algorithm. The PageRank algorithm is based on the number of other pages pointing to a web page and the value of the pages pointing to it, and search engines like Google combine information retrieval techniques with PageRank. In contrast to the PageRank algorithm, the HITS algorithm employs a query-dependent ranking technique and, in addition, produces authority and hub scores. Finally, the widespread availability of machine-understandable information on the Semantic Web offers some opportunities to improve traditional search: if machines could “understand” the content of web pages, searches with high precision and recall would be possible.
The document discusses keyword query routing for keyword search over multiple structured data sources. It proposes computing top-k routing plans based on their potential to contain results for a given keyword query. A keyword-element relationship summary compactly represents keyword and data element relationships, and a multilevel scoring mechanism computes routing plan relevance based on scores at different levels, from keywords to subgraphs. Experiments on 150 public sources showed that relevant plans can be computed in 1 second on average on a desktop computer. Routing helps improve keyword search performance without compromising result quality.
Similar to WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTERN RETRIEVAL ALGORITHM (20)
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR (cscpconf)
The progressive development of Synthetic Aperture Radar (SAR) systems diversifies the exploitation of the images generated by these systems in different applications of geoscience. The detection and monitoring of surface deformations produced by various phenomena has benefited from this evolution and has been realized by interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelations of the interferometric couples used strongly limit the precision of the analysis results obtained by these techniques. In this context, we propose in this work a methodological approach for surface deformation detection and analysis by differential interferograms, to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures, by simulating a linear fault merged with the image couples of the ERS1/ERS2 sensors acquired in a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATION (cscpconf)
A trajectory-guided, concatenative approach for synthesizing high-quality, real-sample lip-motion video is proposed. The automated lip-reading system seeks the real image sample sequence in the library that is closest to the HMM-predicted trajectory while preserving the video data. The object trajectory is obtained by projecting the face patterns into a KDA feature space. The approach identifies a speaker's face by synthesizing the identity surface of a subject's face from a small sample of patterns that sparsely cover the view sphere. A KDA algorithm is used to discriminate the lip-reading images, after which the fundamental lip feature vector is reduced to a low dimension using the 2D-DCT, and the dimensionality of the mouth area is further reduced by PCA to obtain the eigen-lips approach proposed by [33]. The subjective performance results of the cost function under the automatic lip-reading model did not clearly illustrate the superior performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE... (cscpconf)
Universities offer a software engineering capstone course to simulate a real-world working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of this paper is to report on our experience in moving from a Waterfall process to an Agile process in conducting the software engineering capstone project. We present the capstone course designs for both the Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables and assessment plans. To evaluate the improvement, we conducted a survey of two different sections taught by two different instructors to evaluate students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback from students. The survey results show that students were able to attain hands-on experience simulating a real-world working environment. The results also show that the Agile approach helped students to produce an overall better design and avoid mistakes they had made in the initial design completed in the first phase of the capstone project. In addition, they were able to assess their team capabilities and training needs, and thus learn the required technologies earlier, which was reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES (cscpconf)
This document discusses using social media technologies to promote student engagement in a software project management course. It describes the course and objectives of enhancing communication. It discusses using Facebook for 4 years, then switching to WhatsApp based on student feedback, and finally introducing Slack to enable personalized team communication. Surveys found students engaged and satisfied with all three tools, though less familiar with Slack. The conclusion is that social media promotes engagement but familiarity with the tool also impacts satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC (cscpconf)
Using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that can be used to provide responses to the questions asked by a user, based on facts or rules stored in a knowledge base, and they can generate answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper therefore provides an overview of what question answering is, its system architecture, and its possible relationship with and differences from fuzzy logic, as well as the previous related research with respect to the approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS (cscpconf)
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming techniques for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English or to build pronunciation dictionaries for less-known sparse languages. The precision measurement experiments show 88.9% accuracy.
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS (cscpconf)
In education, the use of electronic (E) examination systems is not a novel idea, as E-examination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC (cscpconf)
African Buffalo Optimization (ABO) is one of the most recent swarm-intelligence-based metaheuristics. The ABO algorithm is inspired by the buffalo's behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and a probability model to generate binary solutions. In the second version (called LBABO) they use logical operators to operate on the binary solutions. Computational results on instances of two knapsack problems (KP and MKP) show the effectiveness of the proposed algorithms and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN (cscpconf)
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistent presence on a compromised host. Amongst the various DDNS techniques, the Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach's feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmically generated domain names. Findings from this study show that domain names made up of English characters “a-z” that achieve a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C... (cscpconf)
The document proposes a blockchain-based digital currency and streaming platform called GoMAA to address issues of piracy in the online music streaming industry. Key points:
- GoMAA would use a digital token on the iMediaStreams blockchain to enable secure dissemination and tracking of streamed content. Content owners could control access and track consumption of released content.
- Original media files would be converted to a Secure Portable Streaming (SPS) format, embedding watermarks and smart contract data to indicate ownership and enable validation on the blockchain.
- A browser plugin would provide wallets for fans to collect GoMAA tokens as rewards for consuming content, incentivizing participation and addressing royalty discrepancies by recording
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM (cscpconf)
This document discusses the importance of verb suffix mapping in discourse translation from English to Telugu. It explains that after anaphora resolution, the verbs must be changed to agree with the gender, number, and person features of the subject or anaphoric pronoun. Verbs in Telugu inflect based on these features, while verbs in English only inflect based on number and person. Several examples are provided that demonstrate how the Telugu verb changes based on whether the subject or pronoun is masculine, feminine, neuter, singular or plural. Proper verb suffix mapping is essential for generating natural and coherent translations while preserving the context and meaning of the original discourse.
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T... (cscpconf)
In this paper, based on the definition of the conformable fractional derivative, the functional variable method (FVM) is proposed to seek the exact traveling wave solutions of two higher-dimensional space-time fractional KdV-type equations in mathematical physics, namely the (3+1)-dimensional space-time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-dimensional space-time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony (GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave solutions, have many potential applications in mathematical physics and engineering. The simplicity and reliability of the proposed method are verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEW (cscpconf)
The document discusses automated penetration testing and provides an overview. It compares manual and automated penetration testing, noting that automated testing allows for faster, more standardized and repeatable tests but has limitations in developing new exploits. It also reviews some current automated penetration testing methodologies and tools, including those using HTTP/TCP/IP attacks, linking common scanning tools, a Python-based tool targeting databases, and one using POMDPs for multi-step penetration test planning under uncertainty. The document concludes that automated testing is more efficient than manual for known vulnerabilities but cannot replace manual testing for discovering new exploits.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK (cscpconf)
Since the mid-1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing attention from neuroscientists and computer scientists, since it opens a new window to explore the functional network of the human brain with relatively high resolution. The BOLD technique provides an almost accurate state of the brain. Past research proves that neurological diseases damage brain network interactions, protein-protein interactions and gene-gene interactions, and a number of neurological research papers also analyse the relationships among the damaged parts. Using computational methods, especially machine learning techniques, we can produce such classifications. In this paper we used the OASIS fMRI dataset of patients affected with Alzheimer's disease together with a normal patients' dataset. After properly processing the fMRI data, we use the processed data to form classifier models using SVM (Support Vector Machine), KNN (K-nearest neighbour) and Naïve Bayes. We also compare the accuracy of our proposed method with existing methods. In the future, we will try other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN... (cscpconf)
The document proposes a new validation method for fuzzy association rules based on three steps: (1) applying the EFAR-PN algorithm to extract a generic base of non-redundant fuzzy association rules using fuzzy formal concept analysis, (2) categorizing the extracted rules into groups, and (3) evaluating the relevance of the rules using structural equation modeling, specifically partial least squares. The method aims to address issues with existing fuzzy association rule extraction algorithms such as large numbers of extracted rules, redundancy, and difficulties with manual validation.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA (cscpconf)
In many applications of data mining, class imbalance is noticed when examples in one class are overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the class imbalance. Further, the presence of within-class imbalance, where classes are composed of multiple sub-concepts with different numbers of examples, also affects the performance of the classifier. In this paper, we propose an oversampling technique that handles between-class and within-class imbalance simultaneously and also takes into consideration the generalization ability in data space. The proposed method is based on two steps: performing Model Based Clustering with respect to the classes to identify the sub-concepts, and then computing the separating hyperplane based on equal posterior probability between the classes. The proposed method is tested on 10 publicly available data sets, and the results show that it is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH (cscpconf)
Data collection is an essential, but manpower-intensive procedure in ecological research. An algorithm was developed by the author which incorporated two important computer vision techniques to automate data cataloging for butterfly measurements. Optical Character Recognition is used for character recognition and Contour Detection is used for image processing. Proper pre-processing is first done on the images to improve accuracy. Although there are limitations to Tesseract's detection of certain fonts, overall it can successfully identify words in basic fonts. Contour detection is an advanced technique that can be utilized to measure an image. Shapes and mathematical calculations are crucial in determining the precise location of the points on which to draw the body and forewing lines of the butterfly. Overall, 92% accuracy was achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI... (cscpconf)
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of city services including energy, transportation, health, and much more. They generate massive volumes of structured and unstructured data on a daily basis. Also, social networks, such as Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart cities, with social network users acting as social sensors. Such large and complex datasets are difficult to manage with conventional data management tools and methods. To become valuable, this massive amount of data, known as 'big data', needs to be processed and comprehended to hold the promise of supporting a broad range of urban and smart city functions, including among others transportation, water and energy consumption, pollution surveillance, and smart city governance. In this work, we investigate how social media analytics helps to analyze smart city data collected from various social media sources, such as Twitter and Facebook, to detect various events taking place in a smart city and to identify the importance of events and the concerns of citizens regarding some events. A case scenario analyses the opinions of users concerning the traffic in the three largest cities in the UAE.
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE (cscpconf)
The anonymity of social networks makes them attractive for hate speech, masking criminal activities online and posing a challenge to the world, and in particular to Ethiopia. With this ever-increasing volume of social media data, hate speech identification becomes a challenge in aggravating conflict between citizens of nations. With the high rate of production, it has become difficult to collect, store and analyze such big data using traditional detection methods. This paper proposes the application of Apache Spark in hate speech detection to reduce these challenges. The authors developed an Apache Spark based model to classify Amharic Facebook posts and comments into hate and not hate, employing Random Forest and Naïve Bayes for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83% accuracy. The proposed method achieves promising results with the unique features of Spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT (cscpconf)
This article presents Part-of-Speech tagging for Nepali text using a General Regression Neural Network (GRNN). The corpus is divided into two parts, viz. training and testing. The network is trained and validated on both the training and testing data. It is observed that 96.13% of words are correctly tagged on the training set, whereas 74.38% of words are tagged correctly on the testing data set using GRNN. The result is compared with the traditional Viterbi algorithm based on the Hidden Markov Model. The Viterbi algorithm yields 97.2% and 40% classification accuracy on the training and testing data sets respectively. The GRNN-based POS tagger is more consistent than the traditional Viterbi decoding technique.
mining applications such as community extraction, relation detection, and entity disambiguation. In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query. Efficient estimation of semantic similarity between words is critical for various natural language processing tasks such as Word Sense Disambiguation (WSD), textual entailment and automatic text summarization. In a dictionary, the semantic similarity between words is already encoded, but on the Web it becomes a challenging task. For example, “apple” is frequently associated with computers on the Web. However, this sense of “apple” is not listed in most general-purpose thesauri or dictionaries. A user who searches for apple on the Web may be interested in this sense of “apple”, and not “apple” as a “fruit”.
New words are added to the Web every day, and existing words acquire multiple meanings, i.e. become polysemous, so manually maintaining these words is a very difficult task. We have proposed a Pattern Retrieval Algorithm to estimate the semantic similarity between words or entities using Web search engines. Due to the vastness of the Web, it is impossible to analyze each document separately; hence Web search engines provide the perfect interface to this vast information. A Web search engine gives two important pieces of information about the documents searched: page counts and web snippets. The page count of a query term gives an estimate of the number of documents or web pages that contain the given query term. A web snippet appears below each searched document and is a brief window of text extracted from around the query term in the document.
The page count of the conjunctive query of two objects is widely accepted as a relatedness measure between them. For example, the page count of the query apple AND computer in Google is 977,000,000, whereas the same for banana AND computer is only 60,200,000 [as on 20 December 2012]. The more than 16 times larger page count for apple AND computer indicates that apple is more semantically similar to computer than banana is. Page counts have two drawbacks: they ignore the positions at which the two words appear in a document, so the two words may appear in the same document without being related at all; and they aggregate over the polysemous senses of the query term, so a word such as Dhruv contributes page counts both for Dhruv as the star of fortune and for Dhruv as the name of a helicopter.
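As a quick check, the quoted page counts give the "16 times" figure directly:

$$\frac{977{,}000{,}000}{60{,}200{,}000} \approx 16.2$$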
Processing snippets is a possible alternative for measuring semantic similarity, but it has the drawback of requiring a large number of web pages to be downloaded, which consumes time. Moreover, all search engines rank their results, so only the top-ranked pages yield readily processed snippets; there is no guarantee that all the information we need is present in the top-ranked snippets.
Motivation: The search results returned by the most popular search engines are not satisfactory. Because of the vast number of documents and the high growth rate of the Web, it is time consuming to analyze each document separately, and it is not uncommon for search engines to return many Web page links that have nothing to do with the user's need. One of the most important uses of semantic similarity is in information retrieval: the main problem is to retrieve all the documents that are semantically related to the term queried by the user. Web search engines provide an efficient interface to this vast information, and page counts and snippets are two useful information sources provided by most of them. Hence, accurately measuring the semantic similarity between words is a very challenging task.
Contribution: We propose a Pattern Retrieval Algorithm to find the supervised semantic similarity measure between words by combining both the page count method and the web snippets method. Four association measures, including variants of Web Dice, Web Overlap Ratio, Web Jaccard, and WebPMI, are used to find the semantic similarity between words in the page count method using web search engines. The proposed approach aims to improve the correlation values, Precision, Recall, and F-measures, compared to the existing methods.
Organization: The remainder of the paper is organized as follows: Section 2 reviews the related work on semantic similarity measures between words, Section 3 gives the problem definition, Section 4 gives the architecture of the system, and Section 5 explains the proposed algorithm. The implementation and the results of the system are described in Section 6, and conclusions are presented in Section 7.
2. RELATED WORK
Semantic similarity between words has always been a challenging problem in data mining. Nowadays, the World Wide Web (WWW) has become a huge collection of data and documents, with information available for every single user query.
Mehran Sahami et al., [1] propose a novel method for measuring the similarity between short text snippets by leveraging web search results to provide greater context for the short texts; the method captures more of the semantic context of the snippets rather than simply measuring their term-wise similarity. Hsin-Hsi Chen et al., [2] proposed a web search with double checking model to explore the web as a live corpus. Instead of simple web page counts and complex web page collection, the novel Web Search with Double Checking (WSDC) model is used to analyze snippets.
Rudi L. Cilibrasi et al., [3] propose that words and phrases acquire meaning from the way they are used in society and from their relative semantics to other words and phrases. They present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity, applicable to all search engines and databases, and introduce the notions underpinning the approach: Kolmogorov complexity, information distance, the compression-based similarity metric, and a technical description of the Google distribution and the Normalized Google Distance (NGD).
Dekang Lin et al., [4] propose that bootstrapping semantics from text is one of the greatest challenges in natural language learning, and define a word similarity measure based on the distributional pattern of words. Jian Pei et al., [5] proposed a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. Jiang et al., [6] combine a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data.
Philip Resnik et al., [7]-[8] present a measure of semantic similarity in an is-a taxonomy based on the notion of information content. Bollegala et al., [9] proposed a method which exploits the page counts and text snippets returned by a Web search engine. Ming Li et al., [10] proposed a metric based on the non-computable notion of Kolmogorov distance, called the similarity metric: a general mathematical theory of similarity that uses no background knowledge or features specific to an application area.
Ann Gledson et al., [11] describe a simple web-based similarity measure which relies on page counts only; it can be utilized to measure the similarity of entire sets of words in addition to word pairs, can use any web-service-enabled search engine, and is a distributional similarity measure that uses internet search counts and extends to calculating the similarity within word groups. T. Hughes et al., [12] proposed a method that applies random walk Markov chain theory to measuring lexical semantic relatedness. Dekang Lin et al., [13] present an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model. Vincent Schickel-Zuber et al., [14] present a novel approach that allows similarities to be asymmetric while still using only information contained in the structure of the ontology.
These literature surveys prove that semantic similarity measures play an important role in information retrieval, relation extraction, community mining, document clustering and automatic meta-data extraction. Thus there is a need for a more efficient system to find the semantic similarity between words.
3. PROBLEM DEFINITION
Given two words A and B, we model the problem of measuring the semantic similarity between A and B as one of constructing a function semanticsim(A, B) that returns a value in the range 0 to 1. If A and B are highly similar (e.g. synonyms), we expect the semantic similarity value to be closer to 1; otherwise, we expect it to be closer to 0. We define numerous features that express the similarity between A and B using page counts and snippets retrieved from a web search engine for the two words. Using this feature representation of words, we train a two-class Support Vector Machine (SVM) to classify synonymous and non-synonymous word pairs. Our objectives are:
i) To find the semantic similarity between two words and improve the correlation value.
ii) To improve the Precision, Recall and F-measure metrics of the system.
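Expressed as code, the problem statement reduces to the following contract; this is only a sketch with hypothetical names, not an interface taken from the paper.

/** Returns a similarity score in the range [0, 1]; values close to 1 indicate synonymy. */
public interface SemanticSimilarity {
    double semanticSim(String wordA, String wordB);
}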
4. SYSTEM ARCHITECTURE
The outline of the proposed method for finding the semantic similarity using web search engine
results is as shown in Figure 1.
Figure. 1 System Architecture
When a query q is submitted to a search engine, web-snippets, which are brief summaries of the
search results, are returned to the user. First, we need to query the word-pair in a search engine
for example say we query “cricket” and “sport” in Google search engine. We get the page counts
5. Computer Science & Information Technology (CS & IT) 5
of the word-pair along with the page counts for individual words i.e. H (cricket), H(sport),
H(cricket AND sport). These page counts are used to find the co-occurrence measures such as
Web-Jaccard, Web-Overlap, Web-Dice and Web-PMI and store these values for future
references. We collect the snippets from the web search engine results. Snippets are collected
only for the query X and Y. Similarly, we collect both snippets and page counts for 200 word
pairs. Now we need to extract patterns from the collected snippets using our proposed algorithm
and to find the frequency of occurrence of these patterns.
The chi-square statistical method is used to select the good patterns from the top 200 patterns of interest using the pattern frequencies. After that, we integrate these top 200 patterns with the computed co-occurrence measures. If a pattern of the word pair exists in the set of good patterns, then we select that good pattern together with its frequency of occurrence in the patterns of the word pair; otherwise we set the frequency to 0. Hence we get a feature vector with 204 values, i.e. the top 200 patterns and the four co-occurrence measure values. We use Sequential Minimal Optimization (SMO) support vector machines (SVM) to find the optimal combination of page-counts-based similarity scores and top-ranking patterns. The SVM is trained to classify synonymous word pairs and non-synonymous word pairs. We select synonymous word pairs and non-synonymous word pairs and convert the output of the SVM into a posterior probability. We define the semantic similarity between two words as the posterior probability that they belong to the synonymous-words (positive) class.
5. ALGORITHM
The proposed Pattern Retrieval Algorithm used to measure the semantic similarity between words is shown in Table 1. Given two words A and B, we query a web search engine using the wildcard query A * * * * * B and download the snippets. The * operator matches one word or none in a web page. Therefore, our wildcard query retrieves snippets in which A and B appear within a window of seven words. Because a search engine snippet contains 20 words on average, and includes two fragments of text selected from a document, we assume that the seven-word window is sufficient to cover most relations between two words in snippets. The algorithm described in Table 1 shows how the patterns and their frequencies are retrieved.
The pattern retrieval algorithm described above yields numerous unique patterns, of which about 80% occur fewer than 10 times. It is impossible to train a classifier with so many sparse patterns. We must therefore measure the confidence of each pattern as an indicator of synonymy: because most of the patterns have a frequency of less than 10, it is very difficult to identify the significant patterns directly, so we compute the confidence of each pattern to arrive at the significant ones. We compute the chi-square value to find the confidence of each pattern.
The chi-square value is calculated using the formula given below:

$$\chi^2 = \frac{(P + N)\,\bigl(p_v (N - n_v) - n_v (P - p_v)\bigr)^2}{P\,N\,(p_v + n_v)(P + N - p_v - n_v)} \qquad (1)$$

where P and N are the total frequencies of synonymous word-pair patterns and non-synonymous word-pair patterns, and p_v and n_v are the frequencies of the pattern v retrieved from snippets of synonymous and non-synonymous word pairs respectively.
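As a minimal illustration of Equation (1), the following Java method computes the confidence of a single pattern from its frequencies; the method and variable names are assumptions of this sketch and simply mirror the symbols of the equation.

/**
 * Chi-square confidence of a pattern v (Equation 1).
 * bigP, bigN : total frequencies of synonymous / non-synonymous word-pair patterns
 * pv, nv     : frequencies of pattern v in synonymous / non-synonymous snippets
 */
public static double chiSquare(double bigP, double bigN, double pv, double nv) {
    double numerator = (bigP + bigN) * Math.pow(pv * (bigN - nv) - nv * (bigP - pv), 2);
    double denominator = bigP * bigN * (pv + nv) * (bigP + bigN - pv - nv);
    return denominator == 0.0 ? 0.0 : numerator / denominator; // guard against empty patterns
}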
Table 1. Pattern Retrieval Algorithm [PRA]

Input: Given a set WS of word-pairs
Step 1: Read each snippet S, remove all the non-ASCII characters and store it in the database.
Step 2: for each snippet S do
            if word is same as A then
                Replace A by X
            end if
            if word is same as B then
                Replace B by Y
            end if
        end for
Step 3: for each snippet S do
            if X ∈ S then
                goto Step 4
            end if
            if Y ∈ S then
                goto Step 9
            end if
        end for
Step 4: if Y or number of words > max. length L then
            stop the sequence seq.
        end if
Step 5: for each seq do
            Perform stemming operation.
        end for
Step 6: Form the sub-sequences of the sequence such that each sub-sequence contains [X . . . Y . .].
Step 7: for each subseq do
            if subseq is same as an existing pattern and unique then
                list_pat = list_pat + subseq
                freq_pat = freq_pat + 1
            end if
        end for
Step 8: if length exceeds L then
            Discard the pattern until you find an X or Y.
        end if
Step 9: if you encounter Y then
            goto Step 4.
        end if
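To make the table concrete, the following is a compact Java sketch of the masking and sub-sequence counting at the heart of Steps 1-7; snippet retrieval, stemming (Step 5) and the length cut-off handling of Steps 8-9 are omitted, and all identifiers are assumptions of this sketch.

import java.util.HashMap;
import java.util.Map;

public final class PatternExtractor {

    private static final int MAX_WINDOW = 7; // A and B must co-occur within seven words

    /** Masks word A as X and word B as Y, then counts every X ... Y sub-sequence. */
    public static Map<String, Integer> extractPatterns(String snippet, String a, String b) {
        Map<String, Integer> freqPat = new HashMap<>();
        // Step 1: drop non-ASCII characters before tokenizing.
        String[] words = snippet.replaceAll("[^\\x00-\\x7F]", " ").trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {         // Step 2: mask the word pair
            if (words[i].equalsIgnoreCase(a)) words[i] = "X";
            if (words[i].equalsIgnoreCase(b)) words[i] = "Y";
        }
        for (int i = 0; i < words.length; i++) {         // Steps 3-7: collect X ... Y patterns
            if (!words[i].equals("X")) continue;
            StringBuilder pattern = new StringBuilder("X");
            for (int j = i + 1; j < words.length && j - i < MAX_WINDOW; j++) {
                pattern.append(' ').append(words[j]);
                if (words[j].equals("Y")) {              // pattern closed: record its frequency
                    freqPat.merge(pattern.toString(), 1, Integer::sum);
                    break;
                }
            }
        }
        return freqPat;
    }
}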
6. IMPLEMENTATION AND RESULTS
6.1. Page-count-based Co-occurrence Measures
We compute four popular co-occurrence measures, namely Jaccard, Overlap (Simpson), Dice, and Pointwise Mutual Information (PMI), to compute semantic similarity using page counts.
The Web Jaccard coefficient between words (or multi-word phrases) A and B is defined as:

$$\mathrm{WebJaccard}(A, B) = \begin{cases} 0 & \text{if } H(A \cap B) \le c \\ \dfrac{H(A \cap B)}{H(A) + H(B) - H(A \cap B)} & \text{otherwise} \end{cases} \qquad (2)$$

Web Overlap, a natural modification of the Overlap (Simpson) coefficient, is defined as:

$$\mathrm{WebOverlap}(A, B) = \begin{cases} 0 & \text{if } H(A \cap B) \le c \\ \dfrac{H(A \cap B)}{\min\bigl(H(A), H(B)\bigr)} & \text{otherwise} \end{cases} \qquad (3)$$

WebDice is defined as:

$$\mathrm{WebDice}(A, B) = \begin{cases} 0 & \text{if } H(A \cap B) \le c \\ \dfrac{2\,H(A \cap B)}{H(A) + H(B)} & \text{otherwise} \end{cases} \qquad (4)$$

WebPMI is defined as:

$$\mathrm{WebPMI}(A, B) = \begin{cases} 0 & \text{if } H(A \cap B) \le c \\ \log_2 \dfrac{H(A \cap B)/N}{\bigl(H(A)/N\bigr)\bigl(H(B)/N\bigr)} & \text{otherwise} \end{cases} \qquad (5)$$

where H(·) denotes the page count returned for a query, c is a threshold used to suppress spurious co-occurrences between rare words, and N is the number of documents indexed by the search engine.
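For concreteness, the following is a minimal Java sketch of the four page-count-based computations above; the class name, the threshold value c and the index size N are illustrative assumptions of this sketch rather than values taken from our implementation.

public final class CooccurrenceMeasures {

    private static final double C = 5;     // assumed co-occurrence threshold c
    private static final double N = 1e10;  // assumed number of indexed documents

    // ha = H(A), hb = H(B), hab = H(A AND B), obtained from a search engine.
    public static double webJaccard(double ha, double hb, double hab) {
        return hab <= C ? 0.0 : hab / (ha + hb - hab);
    }

    public static double webOverlap(double ha, double hb, double hab) {
        return hab <= C ? 0.0 : hab / Math.min(ha, hb);
    }

    public static double webDice(double ha, double hb, double hab) {
        return hab <= C ? 0.0 : 2.0 * hab / (ha + hb);
    }

    public static double webPmi(double ha, double hb, double hab) {
        if (hab <= C) return 0.0;
        // log2(x) = ln(x) / ln(2)
        return Math.log((hab / N) / ((ha / N) * (hb / N))) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double ha = 1_000_000, hb = 2_000_000, hab = 500_000; // placeholder page counts
        System.out.printf("WebJaccard = %.4f%n", webJaccard(ha, hb, hab));
    }
}

Each measure degenerates to 0 when the conjunctive page count does not exceed the threshold, exactly as in Equations (2)-(5).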
We have implemented this in the Java programming language and used Eclipse as an extensible open-source IDE (Integrated Development Environment) [15]. We query for A AND B, collect 500 snippets for each word pair (A, B), and store them in the database.
Using the Pattern Retrieval Algorithm, we retrieve a huge number of patterns and select only the top 200 patterns based on their chi-square values; these are called the good patterns. We then compare the patterns generated by a given word pair with each of the top 200 patterns. If a pattern extracted for the particular word pair is among the good patterns, we store that good pattern with a unique ID together with the frequency of the pattern generated by the given word pair. If a pattern does not match, then we store it with a unique ID and with its frequency set to 0 in the table.
To the same table we add the four values of the Web-Jaccard coefficient, Web-Overlap, Web-Dice coefficient and Web-PMI, which gives a table having 204 rows of unique ID, frequency and word-pair ID. We then normalize the frequency values by dividing the value in each tuple by the sum of all the frequency values. This 204-dimension vector is called the feature vector for the given word pair. The feature vectors of all the word pairs are converted into a .CSV (Comma Separated Values) file. The generated .CSV file is fed to the SVM classifier built into the Weka software [16]. This classifies the values and gives a similarity score between 0 and 1 for each word pair.
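For concreteness, a minimal sketch of this classification step using the Weka API is shown below; the file name is hypothetical, and the use of SMO's logistic calibration to obtain posterior probabilities is an assumption consistent with the description above.

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public final class SimilarityClassifier {

    public static void main(String[] args) throws Exception {
        // Load the 204-dimension feature vectors (hypothetical file name).
        Instances data = DataSource.read("feature_vectors.csv");
        data.setClassIndex(data.numAttributes() - 1); // last attribute: synonymous / non-synonymous

        SMO smo = new SMO();
        // Fit logistic models to the SVM outputs so that posterior probabilities are available
        // (the -M option; named setBuildCalibrationModels in recent Weka releases).
        smo.setBuildLogisticModels(true);
        smo.buildClassifier(data);

        // The posterior probability of the synonymous class is the similarity score;
        // which index holds that class depends on how the class attribute is defined.
        double[] dist = smo.distributionForInstance(data.instance(0));
        System.out.println("Similarity score: " + dist[0]);
    }
}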
6.2. Test Data
In order to test our system, we selected the standard Miller-Charles dataset, which is having 28
word-pairs. The proposed algorithm outperforms by 89.8 percent of correlation value, as
illustrated in Table 2.
Table 2. Comparison of correlation values of PRA with existing methods

Method             Web Jaccard   Web Dice   Web Overlap   Web PMI   Bollegala Method   PRA (Proposed)
Correlation Value  0.26          0.27       0.38          0.55      0.87               0.898

Figure 2 shows the comparison of the correlation value of our PRA with the existing methods.

Figure 2: Comparison of correlation value of PRA with existing methods.

The success of a search engine algorithm lies in its ability to retrieve information for a given query. There are two ways in which one might consider the return of results to be successful: either you obtain very accurate results, or you find many results which have some
connection with the search query. In information retrieval, these are termed precision and recall,
respectively [17].
Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. In simpler terms, high recall means that an algorithm returned most of the relevant results, while high precision means that an algorithm returned more relevant results than irrelevant ones.
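As a check on the figures reported below, the F-measure is the harmonic mean of precision and recall; for the synonymous class this gives:

$$F = \frac{2PR}{P + R} = \frac{2 \times 0.9 \times 0.947}{0.9 + 0.947} \approx 0.923$$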
Table 3. Precision, Recall and F-measure values for both synonymous and non-synonymous classes

Class            Precision   Recall   F-Measure
Synonymous       0.9         0.947    0.923
Non-synonymous   0.833       0.714    0.769

In this paper, the F-measure is computed from the precision and recall evaluation metrics. The results are better than those of the previous algorithms; Table 4 shows the comparison of the Precision, Recall and F-measure improvement of the proposed algorithm.

Table 4. Comparison of Precision, Recall and F-measure values of PRA with the previous method

Method      Precision   Recall   F-Measure
Bollegala   0.7958      0.804    0.7897
PRA         0.9         0.947    0.923
7. CONCLUSIONS
Semantic similarity measures between words play an important role in information retrieval, natural language processing and various tasks on the web. We have proposed a Pattern Retrieval Algorithm to extract the numerous semantic relations that exist between two words, and the four word co-occurrence measures were computed using page counts. We integrate the patterns and co-occurrence measures to generate a feature vector. These feature vectors are fed to a two-class SVM to classify the data into synonymous and non-synonymous classes. We compute the posterior probability for each word pair, which is the similarity score for that word pair. The proposed algorithm achieves a correlation value of 0.898, outperforming the existing methods, and the Precision, Recall and F-measure values are improved compared to previous methods.
REFERENCES
[1] Sahami M & Heilman T, (2006) “A Web-based Kernel Function for Measuring the Similarity of
Short Text Snippets”, 15th International Conference on World Wide Web, pp. 377-386.
[2] Chen H, Lin M & Wei Y, (2006) “Novel Association Measures using Web Search with Double
Checking”, International Committee on Computational Linguistics and the Association for
Computational Linguistics, pp. 1009-1016.
[3] Cilibrasi R & Vitanyi P, (2007) “The Google Similarity Distance”, IEEE Transactions on Knowledge
and Data Engineering, Vol. 19, No. 3, pp. 370-383.
[4] Lin D, (1998) “Automatic Retrieval and Clustering of Similar Words”, International Committee on
Computational Linguistics and the Association for Computational Linguistics, pp. 768-774.
[5] Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U & Hsu M, (2004) “Mining
Sequential Patterns by Pattern-Growth: The PrefixSpan Approach”, IEEE Transactions on Knowledge
and Data Engineering, Vol. 16, No. 11, pp. 1424-1440.
[6] Jiang J J & Conrath D W, (1997) “Semantic Similarity based on Corpus Statistics and Lexical
Taxonomy”, International Conference on Research in Computational Linguistics.
[7] Resnik P, (1995) “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”, 14th
International Joint Conference on Artificial Intelligence, Vol. 1, pp. 24-26.
[8] Resnik P, (1999) “Semantic Similarity in a Taxonomy: An Information based Measure and its
Application to problems of Ambiguity in Natural Language”, Journal of Artificial Intelligence
Research, Vol. 11, pp. 95-130.
[9] Bollegala D, Matsuo Y & Ishizuka M, (2011) “A Web Search Engine-based Approach to Measure
Semantic Similarity between Words”, IEEE Transactions on Knowledge and Data Engineering,
Vol. 23, No. 7, pp. 977-990.
[10] Li M, Chen X, Li X, Ma B & Vitanyi P M B, (2004) “The Similarity Metric”, IEEE Transactions
on Information Theory, Vol. 50, No. 12, pp. 3250-3264.
[11] Gledson A & Keane J, (2008) “Using Web-Search Results to Measure Word-Group Similarity”,
22nd International Conference on Computational Linguistics, pp. 281-28.
[12] Hughes T & Ramage D, (2007) “Lexical Semantic Relatedness with Random Graph Walks”,
Conference on Empirical Methods in Natural Language Processing and Conference on Computational
Natural Language Learning (EMNLP-CoNLL 2007), pp. 581-589.
[13] Lin D, (1998) “An Information-Theoretic Definition of Similarity”, 15th International Conference
on Machine Learning, pp. 296-304.
[14] Schickel-Zuber V & Faltings B, (2007) “OSS: A Semantic Similarity Function Based on Hierarchical
Ontologies”, International Joint Conference on Artificial Intelligence, pp. 551-556.
[15] http://onjava.com/onjava/2002/12/11/eclipe.html
[16] www.cs.waikato.ac.nz/ml/weka/
[17] Pushpa C N, Thriveni J, Venugopal K R & L M Patnaik, (2011) “Enhancement of F-measure for
Web People Search using Hashing Technique”, International Journal on Information Processing
(IJIP), Vol. 5, No. 4, pp. 35-44.
Authors
Pushpa C N has completed Bachelor of Engineering in Computer Science and
Engineering from Bangalore University, Master of Technology in VLSI Design and
Embedded Systems from Visvesvaraya Technological University. She has 13 years of
teaching experience. Presently she is working as Assistant Professor in Department of
Computer Science and Engineering at UVCE, Bangalore and pursuing her Ph.D in
Semantic Web.
Thriveni J has completed Bachelor of Engineering, Masters of Engineering and
Doctoral Degree in Computer Science and Engineering. She has 4 years of industrial
experience and 16 years of teaching experience. Currently she is an Associate
Professor in the Department of Computer Science and Engineering, University
Visvesvaraya College of Engineering, Bangalore. Her research interests include
Networks, Data Mining and Biometrics.
Venugopal K R is currently the Principal, University Visvesvaraya College of
Engineering, Bangalore University, Bangalore. He obtained his Bachelor of
Engineering from University Visvesvaraya College of Engineering. He received his
Masters degree in Computer Science and Automation from Indian Institute of
Science Bangalore. He was awarded Ph.D. in Economics from Bangalore University
and Ph.D. in Computer Science from Indian Institute of Technology, Madras. He has
a distinguished academic career and has degrees in Electronics, Economics, Law,
Business Finance, Public Relations, Communications, Industrial Relations, Computer
Science and Journalism. He has authored 31 books on Computer Science and
Economics, which include Petrodollar and the World Economy, C Aptitude, Mastering C, Microprocessor
Programming, Mastering C++ and Digital Circuits and Systems. During his three decades of service at
VCE he has over 250 research papers to his credit. His research interests include Computer Networks,
Wireless Sensor Networks, Parallel and Distributed Systems, Digital Signal Processing and Data Mining.
L M Patnaik is an Honorary Professor at the Indian Institute of Science, Bangalore. During
the past 35 years of his service at the Institute he has over 700 research publications
in refereed International Journals and refereed International Conference Proceedings.
He is a Fellow of all the four leading Science and Engineering Academies in India;
Fellow of the IEEE and the Academy of Science for the Developing World. He has
received twenty national and international awards; notable among them is the IEEE
Technical Achievement Award for his significant contributions to High Performance
Computing and Soft Computing. His areas of research interest have been Parallel and
Distributed Computing, Mobile Computing, CAD for VLSI circuits, Soft Computing and Computational
Neuroscience.