The web has become a resourceful tool for almost all domains today. Search engines prominently use
the inverted-index technique to locate the web pages containing a user's query. The performance of an
inverted index fundamentally depends on searching for the keyword in the list maintained by the search
engine, and this text matching is done with a string matching algorithm. It is important for any string
matching algorithm to quickly locate the occurrences of a user-specified pattern in a large text. In this
paper a new string matching algorithm for keyword searching is proposed. The proposed algorithm relies
on a new technique based on pattern length and an FML (First-Middle-Last) character match. The
proposed algorithm is analysed and implemented, and extensive testing and comparisons are done against
Boyer-Moore, Naïve, Improved Naïve, Horspool and Zhu-Takaoka. The results show that the proposed
algorithm takes less time than the other existing algorithms.
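The First-Middle-Last idea described above can be sketched as a cheap three-character pre-filter applied before a full window comparison. This is a hypothetical reconstruction for illustration only; the paper's exact use of pattern length is not reproduced here.

```python
def fml_search(text, pattern):
    """Find all occurrences of pattern in text, pre-filtering each
    candidate window by its First, Middle and Last characters before
    doing a full comparison (illustrative FML-style check)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    first, mid, last = pattern[0], pattern[m // 2], pattern[-1]
    hits = []
    for i in range(n - m + 1):
        # The cheap three-character test rejects most windows early.
        if text[i] == first and text[i + m // 2] == mid and text[i + m - 1] == last:
            if text[i:i + m] == pattern:  # full verification
                hits.append(i)
    return hits
```

For example, `fml_search("abracadabra", "abra")` finds the occurrences at positions 0 and 7 while skipping the full comparison for most other windows.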
Multi Similarity Measure based Result Merging Strategies in Meta Search Engine
Result merging is the key component of a Meta Search Engine. Meta Search Engines provide a
uniform query interface for Internet users to search for information. Depending on users' needs,
they select relevant sources, map user queries onto the target search engines, and subsequently
merge the results. The effectiveness of a Meta Search Engine is closely related to the result
merging algorithm it employs. In this paper, we propose a Meta Search Engine with two distinct
steps: (1) searching through surface and deep search engines, and (2) ranking the results with
the designed ranking algorithm. Initially, the query given by the user is passed to the deep and
surface search engines. The proposed method uses two distinct algorithms for ranking the search
results: a concept similarity based method and a cosine similarity based method. Once the
results from the various search engines are ranked, the proposed Meta Search Engine merges them
into a single ranked list. Finally, experimentation is carried out to demonstrate the efficiency
of the proposed visible and invisible web-based Meta Search Engine in merging the relevant
pages. TSAP is used as the evaluation criterion, and the algorithms are evaluated on that basis.
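The cosine similarity based ranking step mentioned above can be illustrated with a minimal sketch over raw term-frequency vectors; the paper's exact weighting scheme is not specified here, so this is only an assumption of the simplest variant.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two texts, using raw term-frequency
    vectors over whitespace tokens (a minimal, illustrative variant)."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

A result page would be scored by its similarity to the query text; identical texts score 1.0 and texts with no shared terms score 0.0.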
Extracting and Reducing the Semantic Information Content of Web Documents to ...
This document discusses various techniques for semantic document retrieval and summarization. It begins by introducing the challenges of semantic search and techniques like word sense disambiguation that aim to improve search relevance. It then discusses using ontologies and semantic networks to reduce the semantic content of documents in order to support semantic document retrieval. Finally, it proposes using relationship-based data cleaning techniques to disambiguate references between entities by analyzing their features and relationships.
This document describes a system called UProRevs that aims to personalize web search results based on a user's profile and interests.
The system works as a filter that takes the results from a normal search engine like Google and re-ranks them based on their relevance to the user's profile. It generates user profiles based on information provided during registration, and updates the profiles over time based on the user's feedback on search results.
The system calculates relevance scores for search results by comparing the keywords in each web page to those in the user's profile. Results are displayed along with their relevance scores. As the user provides feedback, their profile is updated, allowing the system to continuously improve the personalization of search results.
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
This paper analyses the features of Web Search Engines, Vertical Search Engines and Meta Search
Engines, and proposes a Meta Search Engine for searching and retrieving documents across multiple
domains on the World Wide Web (WWW). A web search engine searches for information in the WWW. A
Vertical Search Engine provides the user with results for queries on a specific domain. Meta
Search Engines send the user's search queries to various search engines and combine the search
results. This paper introduces intelligent user interfaces for selecting the domain, category and
search engines for the proposed Multi-Domain Meta Search Engine. An intelligent user interface is
also designed to take the user query and send it to the appropriate search engines. A few
algorithms are designed to combine the results from the various search engines and to display them.
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
This document discusses an enhanced web usage mining system using fuzzy clustering and collaborative filtering recommendation algorithms. It aims to address challenges with existing recommender systems like producing low quality recommendations for large datasets. The system architecture uses fuzzy clustering to predict future user access based on browsing behavior. Collaborative filtering is then used to produce expected results by combining fuzzy clustering outputs with a web database. This approach aims to provide users with more relevant recommendations in a shorter time compared to other systems.
PageRank algorithm and its variations: A Survey report
This document provides an overview and comparison of PageRank algorithms. It begins with a brief history of PageRank, developed by Larry Page and Sergey Brin as part of the Google search engine. It then discusses variants like Weighted PageRank and PageRank based on Visits of Links (VOL), which incorporate additional factors like link popularity and user visit data. The document also gives a basic introduction to web mining concepts and categorizes web mining into content, structure, and usage types. It concludes with a comparison of the original PageRank algorithm and its variations.
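The original PageRank computation that the survey compares against its variants can be sketched as a short power iteration over a link graph. This is a generic illustration, not the survey's own code; dangling pages are simply given no outgoing share here.

```python
def pagerank(links, d=0.85, iters=50):
    """Basic PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    d: damping factor (0.85 is the value commonly cited for Google).
    Returns a dict of page -> rank score."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            share = pr[p] / len(outs) if outs else 0.0
            for q in outs:
                nxt[q] += d * share  # each outlink receives an equal share
        pr = nxt
    return pr
```

On the two-page graph `{"a": ["b"], "b": ["a"]}` the iteration converges to equal ranks of 0.5 for both pages, as symmetry requires. Variants like Weighted PageRank replace the equal `share` with link-popularity weights, and VOL-based PageRank weights by user visit counts.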
Intelligent Semantic Web Search Engines: A Brief Survey
The World Wide Web (WWW) allows people to share information (data) from large database repositories globally. The amount of information has grown to billions of documents across these repositories. To search this information we need specialized tools, known generically as search engines. Although many search engines are available today, retrieving meaningful information remains difficult. To overcome this problem and let search engines retrieve meaningful information intelligently, semantic web technologies are playing a major role. In this paper we present a survey of the search engine generations and the role of search engines in the intelligent web and semantic search technologies.
Context Based Indexing in Search Engines Using Ontology: Review
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
Search engine optimization (SEO) is the process of improving the visibility of a website in organic search engine results. SEO involves editing website content and code to increase relevance to keywords and remove barriers to search engines indexing pages. Promoting backlinks is another SEO tactic. Effective SEO requires changes to HTML and content on a site. Black hat SEO uses spammy techniques like keyword stuffing and link farms which can degrade search results and the user experience.
This document discusses methods for measuring semantic similarity between words. It begins by discussing how traditional lexical similarity measurements do not consider semantics. It then discusses several existing approaches that measure semantic similarity using web search engines and text snippets. These approaches calculate word co-occurrence statistics from page counts and analyze lexical patterns extracted from snippets. Pattern clustering is used to group semantically similar patterns. The approaches are evaluated using datasets and metrics like precision and recall. Finally, the document proposes a new method that combines page count statistics, lexical pattern extraction and clustering, and support vector machines to measure semantic similarity.
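The page-count side of these co-occurrence approaches can be illustrated with a WebJaccard-style coefficient computed from hit counts for the two words and their conjunctive query. The counts in the example below are hypothetical, not taken from any search engine.

```python
def jaccard_from_counts(count_p, count_q, count_pq):
    """WebJaccard-style similarity from search-engine page counts:
    hits(P), hits(Q) and hits(P AND Q). Returns a value in [0, 1]."""
    denom = count_p + count_q - count_pq
    return count_pq / denom if denom else 0.0
```

With hypothetical counts hits(P)=100, hits(Q)=50 and hits(P AND Q)=25, the coefficient is 25 / (100 + 50 - 25) = 0.2. The methods surveyed combine several such page-count statistics with features extracted from text snippets.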
COLLABORATIVE BIBLIOGRAPHIC SYSTEM FOR REVIEW/SURVEY ARTICLES
This paper proposes a Bibliographic system that intends to exchange bibliographic information of
survey/review articles by relying on Web service technology. It allows researchers and university
students to interact with the system via a single, platform-independent standard named Web
services to add, search and retrieve bibliographic information of review articles in various
science and technology fields, and to build up a dedicated database for these articles in each
field. Additionally, different implementation scenarios of the proposed system are presented and
described, and the rich features offered by such a system are studied. This paper explains the
proposed system using the computing area, due to the existence of a detailed taxonomy of that
area, which allows defining the system and the functionalities and features it provides. However,
the proposed system is not confined to the computing area; it can support any other science and
technology area without any need to modify the system.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
In this study a comprehensive evaluation of two supervised feature selection methods for
dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component
Analysis (PCA). These are gauged against unsupervised techniques such as fuzzy feature clustering
using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative
efficiency of the two supervised techniques against the unsupervised fuzzy technique while
reducing the feature space. It is found that clustering using FCM leads to better accuracy in
classifying documents compared with algorithms like LSI and PCA. The results show that clustering
of features improves the accuracy of document classification.
Search engines are designed to help users find information on the internet and other systems by reducing search times and limiting results to relevant information. They work by querying information sources, ranking results by relevance, and indexing accumulated data. The most popular search engines today are Google, Yahoo, MSN, AOL, and Ask. Search engines have changed how people research by providing convenient access to information online and saving users time compared to traditional library research.
This document summarizes various approaches to web search personalization that have been proposed by researchers. It discusses 12 different approaches that use techniques such as building user profiles based on browsing history and click data, constructing personalized ontologies and taxonomies, re-ranking search results based on learned user interests, and updating user profiles based on implicit feedback during web browsing. The document concludes that personalized search is needed to better tailor search results to individual users and their contexts, and that continued research on innovative personalization techniques is important as the amount of online information grows.
Classification-based Retrieval Methods to Enhance Information Discovery on th...
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within is a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
Building efficient and effective metasearch engines
This document discusses building efficient and effective metasearch engines. It describes how metasearch engines provide unified access to multiple existing search engines without maintaining their own indexes. The main challenges in building metasearch engines are database selection, document selection, and result merging. Database selection identifies relevant search engines for a query. Document selection determines which documents to retrieve from selected search engines. Result merging combines results into a single ranked list. Techniques for tackling these challenges are surveyed.
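The result-merging challenge described above can be sketched with a simple reciprocal-rank scoring scheme, similar in spirit to reciprocal rank fusion. This is one illustrative strategy among the several the survey covers, not necessarily the one any particular metasearch engine uses.

```python
def merge_results(result_lists):
    """Merge ranked result lists from several search engines into one
    list. Each URL scores 1/rank in every list that returns it, and
    the merged list is sorted by total score (highest first)."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)
```

A URL returned near the top of several engines' lists accumulates the highest score, so agreement between engines is rewarded when the lists are merged.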
A Review: Text Classification on Social Media Data
This document provides a review of different classifiers used for text classification on social media data. It discusses how social media data is often unstructured and contains users' opinions and sentiments. Various machine learning algorithms can be used to classify this social media text data, extracting meaningful information. The document focuses on describing Naive Bayes classifiers, which are commonly used for text classification tasks. It explains how Naive Bayes classifiers work by calculating the posterior probability that a document belongs to a certain class, based on applying Bayes' theorem with an independence assumption between features.
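The posterior-probability computation described above can be sketched as a multinomial Naive Bayes classifier with Laplace smoothing over word counts. The training data below is a toy example, not the review's corpus.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train multinomial Naive Bayes on (text, label) pairs.
    Returns class priors, per-class word counts, and the vocabulary."""
    class_words = defaultdict(list)
    for text, label in docs:
        class_words[label].extend(text.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    priors = {c: sum(1 for _, l in docs if l == c) / len(docs) for c in class_words}
    counts = {c: Counter(ws) for c, ws in class_words.items()}
    return priors, counts, vocab


def classify(text, priors, counts, vocab):
    """Return the class with the highest log-posterior, applying
    Bayes' theorem with the feature-independence assumption and
    add-one (Laplace) smoothing for unseen words."""
    best, best_lp = None, float("-inf")
    for c in priors:
        total = sum(counts[c].values())
        lp = math.log(priors[c])
        for w in text.lower().split():
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Training on two toy documents labelled "pos" and "neg" and classifying a new snippet picks the class whose smoothed word probabilities best explain the text.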
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The document provides an overview of the key components and objectives of an information retrieval system. It discusses how an IR system aims to minimize the time a user spends locating needed information by facilitating search generation, presenting search results in a relevant order, and processing incoming documents through normalization, indexing, and selective dissemination to users. The major measures of an IR system's effectiveness are precision and recall.
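The precision and recall measures mentioned above reduce to simple set arithmetic over the retrieved and relevant document sets, as this short sketch shows.

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved documents that are relevant;
    recall = fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, retrieving documents {1, 2, 3, 4} when {2, 4, 6} are relevant gives precision 2/4 = 0.5 and recall 2/3.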
Improving search result via search keywords and data classification similarity
This document proposes a new method to improve search results by predicting queries based on similarity between input queries and historical data of keywords and their classifications. It involves extracting keywords from queries and classifying them using named entity recognition and lexical databases. Prediction rules are generated by analyzing relationships between keywords and classifications in historical data. These rules are then used to predict additional related keywords for new queries to enhance search criteria and provide more relevant search results. The method aims to address limitations of existing search engines that rely only on raw historical query data without enrichment to predict queries.
Comparison of used metadata elements in digital libraries in Iran with Dublin...
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
High school students must be observed for their slow or quick learning abilities in order to
provide them with the best education practices. Such analysis can be performed over student
performance data. The high school student data has been obtained from schools in various regions
of Punjab, a pivotal state of India. The complete student data, and a selective dataset of almost
1300 students obtained from one school in those regions, have undergone testing using the model
proposed in this paper. The proposed model is based upon the naïve Bayes classification model for
data classification using multi-factor features obtained from the input dataset. The subjects
have been divided into two primary groups: difficult and normal. The classification algorithm has
been applied individually to the data in the various subject groups. Both of the early stage
classification events produced almost similar results, whereas the results obtained from the
classification events over the averaging factors and the floating factors told a different story
than the early stage classification. The results show that deep analysis of the data reveals
in-depth facts about the input data, and the proposed model can be considered effective.
SEMANTIC INFORMATION EXTRACTION IN UNIVERSITY DOMAIN
Today's conventional search engines hardly provide content relevant to the user's search query,
because the context and semantics of the user's request are not analysed to the full extent.
Hence the need for semantic web search (SWS) arises. SWS is an upcoming area of web search which
combines Natural Language Processing and Artificial Intelligence.
The objective of the work done here is to design, develop and implement a semantic search engine,
SIEU (Semantic Information Extraction in University Domain), confined to the university domain.
SIEU uses an ontology as a knowledge base for the information retrieval process. It is not a mere
keyword search; it is one layer above what Google or any other search engine retrieves by
analysing just the keywords. Here the query is analysed both syntactically and semantically.
The developed system retrieves web results more relevant to the user query through keyword
expansion. The results obtained are accurate enough to satisfy the request made by the user, and
the level of accuracy is enhanced since the query is analysed semantically. The system will be of
great use to developers and researchers who work on the web. The Google results are re-ranked and
optimized to provide the relevant links; for ranking, an algorithm has been applied which fetches
more apt results for the user query.
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine submits user queries to multiple traditional search engines including Google, Yahoo, Bing and Ask. It then uses a crawler and modified page ranking algorithm to analyze and rank the results from the different search engines. The top results are then generated and displayed to the user, aimed to be more relevant than results from individual search engines. The meta search engine was implemented using technologies like PHP, MySQL and utilizes components like a graphical user interface, query formulator, metacrawler and redundant URL eliminator.
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine queries multiple traditional search engines like Google, Yahoo, Bing and Ask simultaneously using a single user query. It then ranks the retrieved results using a new two phase ranking algorithm called modified ranking that considers page relevance and popularity. The goal of the new meta search engine is to produce more efficient search results compared to traditional search engines. It includes components like a graphical user interface, query formulator, metacrawler, redundant URL eliminator and modified ranking algorithm to retrieve and rank results.
International Conference On Computer Science And Technology
ICGCET 2019 | 5th International Conference on Green Computing and Engineering Technologies. The conference will be held from 7th to 9th September 2019 in Morocco.
The conference aims to promote the work of researchers, scientists, engineers and students from across the world on advancement in electronic and computer systems.
Quest Trail: An Effective Approach for Construction of Personalized Search En...
This document discusses developing a personalized search engine for software development organizations. It proposes using semantic analysis and genetic algorithms to personalize search results. Semantic analysis resolves ambiguity in queries by understanding their meaning, while genetic algorithms use machine learning to better understand user preferences over time. Quest analysis is also used to identify the goal or task behind a user's search by analyzing search logs at the quest level rather than query or session levels. Together these approaches aim to increase search relevance for users in software organizations by creating group profiles based on domain or project rather than individual user profiles.
A Survey on approaches of Web Mining in Varied Areas
There has been a lot of research in recent years on efficient web searching. Several papers have proposed algorithms that use user feedback sessions to evaluate the performance of inferring user search goals. When information is retrieved, the user clicks on a particular URL; based on the click rate, ranking is done automatically by clustering the feedback sessions. Web search engines have made enormous contributions to the web and society. They make finding information on the web quick and easy. However, they are far from optimal. A major deficiency of generic search engines is that they follow the "one size fits all" model and are not adaptable to individual users.
A novel method to search information through multi agent search and retrieval
The document proposes a novel method for searching information through multi-agent search and retrieval that uses both content and context-based search. It describes a system that accepts two text inputs, processes them according to whether desktop or internet search is selected, and provides relevant results through indexing and multi-agents, with content search performed on Hadoop for increased performance. The system aims to provide faster and more accurate search results by filtering out irrelevant results.
This document summarizes various approaches to web search personalization that have been proposed by researchers. It discusses 12 different approaches that use techniques such as building user profiles based on browsing history and click data, constructing personalized ontologies and taxonomies, re-ranking search results based on learned user interests, and updating user profiles based on implicit feedback during web browsing. The document concludes that personalized search is needed to better tailor search results to individual users and their contexts, and that continued research on innovative personalization techniques is important as the amount of online information grows.
Classification-based Retrieval Methods to Enhance Information Discovery on th...IJMIT JOURNAL
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting them with information that most adequately it’s their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use. The ubiquitous use of the Web has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to accommodate the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents. They typically do not perform any task or topic-based analysis of queries using other available resources, and do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
Building efficient and effective metasearch enginesunyil96
This document discusses building efficient and effective metasearch engines. It describes how metasearch engines provide unified access to multiple existing search engines without maintaining their own indexes. The main challenges in building metasearch engines are database selection, document selection, and result merging. Database selection identifies relevant search engines for a query. Document selection determines which documents to retrieve from selected search engines. Result merging combines results into a single ranked list. Techniques for tackling these challenges are surveyed.
A Review: Text Classification on Social Media DataIOSR Journals
This document provides a review of different classifiers used for text classification on social media data. It discusses how social media data is often unstructured and contains users' opinions and sentiments. Various machine learning algorithms can be used to classify this social media text data, extracting meaningful information. The document focuses on describing Naive Bayes classifiers, which are commonly used for text classification tasks. It explains how Naive Bayes classifiers work by calculating the posterior probability that a document belongs to a certain class, based on applying Bayes' theorem with an independence assumption between features.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The document provides an overview of the key components and objectives of an information retrieval system. It discusses how an IR system aims to minimize the time a user spends locating needed information by facilitating search generation, presenting search results in a relevant order, and processing incoming documents through normalization, indexing, and selective dissemination to users. The major measures of an IR system's effectiveness are precision and recall.
Improving search result via search keywords and data classification similarityConference Papers
This document proposes a new method to improve search results by predicting queries based on similarity between input queries and historical data of keywords and their classifications. It involves extracting keywords from queries and classifying them using named entity recognition and lexical databases. Prediction rules are generated by analyzing relationships between keywords and classifications in historical data. These rules are then used to predict additional related keywords for new queries to enhance search criteria and provide more relevant search results. The method aims to address limitations of existing search engines that rely only on raw historical query data without enrichment to predict queries.
Comparison of used metadata elements in digital libraries in iran with dublin...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...ijcsa
The high school students must be observed for their slow learning or quick learning abilities to provide
them with the best education practices. Such analysis can be perfectly performed over the student
performance data. The high school student data has been obtained from the schools from the various
regions in Punjab, a pivotal state of India. The complete student data and the selective data of almost 1300
students obtained from one school in the regions has been undergone the test using the proposed model in
this paper. The proposed model is based upon the naïve bayes classification model for the data
classification using the multi-factor features obtained from the input dataset. The subject groups have been
divided into the two primary groups: difficult and normal. The classification algorithm has been applied
individually over data grouped in the various subject groups. Both of the early stage classification events
have produced the almost similar results, whereas the results obtained from the classification events over
the averaging factors and the floating factors told the different story than the early stage classification. The
proposed model results have shown that the deep analysis of the data tells the in-depth facts from the input
data. The proposed model can be considered as the effectiv
SEMANTIC INFORMATION EXTRACTION IN UNIVERSITY DOMAINcscpconf
Today’s conventional search engines hardly do provide the essential content relevant to the
user’s search query. This is because the context and semantics of the request made by the user
is not analyzed to the full extent. So here the need for a semantic web search arises. SWS is
upcoming in the area of web search which combines Natural Language Processing and
Artificial Intelligence.
The objective of the work done here is to design, develop and implement a semantic search
engine- SIEU(Semantic Information Extraction in University Domain) confined to the
university domain. SIEU uses ontology as a knowledge base for the information retrieval
process. It is not just a mere keyword search. It is one layer above what Google or any other
search engines retrieve by analyzing just the keywords. Here the query is analyzed both
syntactically and semantically.
The developed system retrieves the web results more relevant to the user query through keyword
expansion. The results obtained here will be accurate enough to satisfy the request made by the
user. The level of accuracy will be enhanced since the query is analyzed semantically. The
system will be of great use to the developers and researchers who work on web. The Google results are re-ranked and optimized for providing the relevant links. For ranking an algorithm has been applied which fetches more apt results for the user query
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine submits user queries to multiple traditional search engines including Google, Yahoo, Bing and Ask. It then uses a crawler and modified page ranking algorithm to analyze and rank the results from the different search engines. The top results are then generated and displayed to the user, aimed to be more relevant than results from individual search engines. The meta search engine was implemented using technologies like PHP, MySQL and utilizes components like a graphical user interface, query formulator, metacrawler and redundant URL eliminator.
An Intelligent Meta Search Engine for Efficient Web Document Retrievaliosrjce
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine queries multiple traditional search engines like Google, Yahoo, Bing and Ask simultaneously using a single user query. It then ranks the retrieved results using a new two phase ranking algorithm called modified ranking that considers page relevance and popularity. The goal of the new meta search engine is to produce more efficient search results compared to traditional search engines. It includes components like a graphical user interface, query formulator, metacrawler, redundant URL eliminator and modified ranking algorithm to retrieve and rank results.
International conference On Computer Science And technologyanchalsinghdm
ICGCET 2019 | 5th International Conference on Green Computing and Engineering Technologies. The conference will be held on 7th September - 9th September 2019 in Morocco. International Conference On Engineering Technology
The conference aims to promote the work of researchers, scientists, engineers and students from across the world on advancement in electronic and computer systems.
Quest Trail: An Effective Approach for Construction of Personalized Search En...Editor IJCATR
This document discusses developing a personalized search engine for software development organizations. It proposes using semantic analysis and genetic algorithms to personalize search results. Semantic analysis resolves ambiguity in queries by understanding their meaning, while genetic algorithms use machine learning to better understand user preferences over time. Quest analysis is also used to identify the goal or task behind a user's search by analyzing search logs at the quest level rather than query or session levels. Together these approaches aim to increase search relevance for users in software organizations by creating group profiles based on domain or project rather than individual user profiles.
`A Survey on approaches of Web Mining in Varied Areasinventionjournals
There has been lot of research in recent years for efficient web searching. Several papers have proposed algorithm for user feedback sessions, to evaluate the performance of inferring user search goals. When the information is retrieved, user clicks on a particular URL. Based on the click rate, ranking will be done automatically, clustering the feedback sessions. Web search engines have made enormous contributions to the web and society. They make finding information on the web quick and easy. However, they are far from optimal. A major deficiency of generic search engines is that they follow the ‘‘one size fits all’’ model and are not adaptable to individual users.
A novel method to search information through multi agent search and retrieIAEME Publication
The document proposes a novel method for searching information through multi-agent search and retrieval that uses both content and context-based search. It describes a system that accepts two text inputs, processes them according to whether desktop or internet search is selected, and provides relevant results through indexing and multi-agents, with content search performed on Hadoop for increased performance. The system aims to provide faster and more accurate search results by filtering out irrelevant results.
Semantic Search Engine using OntologiesIJRES Journal
Nowadays the volume of the information on the Web is increasing dramatically. Facilitating users to get useful information has become more and more important to information retrieval systems. While information retrieval technologies have been improved to some extent, users are not satisfied with the low precision and recall. With the emergence of the Semantic Web, this situation can be remarkably improved if machines could “understand” the content of web pages. The existing information retrieval technologies can be classified mainly into three classes.The traditional information retrieval technologies mostly based on the occurrence of words in documents. It is only limited to string matching. However, these technologies are of no use when a search is based on the meaning of words, rather than onwards themselves.Search engines limited to string matching and link analysis. The most widely used algorithms are the PageRank algorithm and the HITS algorithm. The PageRank algorithm is based on the number of other pages pointing to the Web page and the value of the pages pointing to it. Search engines like Google combine information retrieval techniques with PageRank. In contrast to the PageRank algorithm, the HITS algorithm employs a query dependent ranking technique. In addition to this, the HITS algorithm produces the authority and the hub score. The widespread availability of machine understandable information on the Semantic Web offers which some opportunities to improve traditional search. If machines could “understand” the content of web pages, searches with high precision and recall would be possible.
Design Issues for Search Engines and Web Crawlers: A ReviewIOSR Journals
This document provides a review of design issues for search engines and web crawlers. It discusses how search engines use web crawlers to collect web documents for storage and indexing from the growing World Wide Web. The three main parts of a search engine are the crawler, indexer, and query engine. Crawler-based search engines create listings automatically using algorithms while human-powered directories rely on human organization. Designing efficient search engines and crawlers faces challenges from the diversity of web documents and changing user behaviors. Web crawlers prioritize URLs to download pages and extract new links to update search engine databases.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Recommendation generation by integrating sequential pattern mining and semanticseSAT Journals
Abstract As the Internet usage keeps increasing, the number of web sites and hence the number of web pages also keeps increasing. A recommendation system can be used to provide personalized web service by suggesting the pages that are likely to be accessed in future. Most of the recommendation systems are based on association rule mining or based on keywords. Using the association rule mining the prediction rate is less as it doesn’t take into account the order of access of the web pages by the users. The recommendation systems that are key-word based provides lesser relevant results. This paper proposes a recommendation system that uses the advantages of sequential pattern mining and semantics over the association rule mining and keyword based systems respectively. Keywords: Sequential Pattern Mining, Taxonomy, Apriori-All, CS-Mine, Semantic, Clustering
CONTENT AND USER CLICK BASED PAGE RANKING FOR IMPROVED WEB INFORMATION RETRIEVALijcsa
Search engines today are retrieving more than a few thousand web pages for a single query, most of which
are irrelevant. Listing results according to user needs is, therefore, a very real necessity. The challenge lies
in ordering retrieved pages and presenting them to users in line with their interests. Search engines,
therefore, utilize page rank algorithms to analyze and re-rank search results according to the relevance of
the user’s query by estimating (over the web) the importance of a web page. The proposed work
investigates web page ranking methods and recently-developed improvements in web page ranking.
Further, a new content-based web page rank technique is also proposed for implementation. The proposed
technique finds out how important a particular web page is by evaluating the data a user has clicked on, as
well as the contents available on these web pages. The results demonstrate the effectiveness of the proposed
page ranking technique and its efficiency.
The document discusses search engines and web crawlers. It provides information on how search engines work by using web crawlers to index web pages and then return relevant results when users search. It also compares major search engines like Google, Yahoo, MSN, Ask Jeeves, and Live Search based on factors like market share, database size and freshness, ranking algorithms, and treatment of spam. Google is highlighted as having the largest market share and best algorithms for determining natural vs artificial links.
Intelligent Semantic Web Search Engines: A Brief Survey dannyijwest
The World Wide Web (WWW) allows the people to share the information (data) from the large database repositories globally. The amount of information grows billions of databases. We need to search the information will specialize tools known generically search engine. There are many of search engines available today, retrieving meaningful information is difficult. However to overcome this problem in search engines to retrieve meaningful information intelligently, semantic web technologies are playing a major role. In this paper we present survey on the search engine generations and the role of search engines in intelligent web and semantic search technologies.
A search engine is a software system that searches the World Wide Web for information and presents search results on search engine results pages (SERPs). Search engines work by using web crawlers to index web pages, then searching their indexes to provide relevant results for user queries. They offer operators like Boolean logic to refine searches. The usefulness of search engines depends on how relevant their results are, and they employ various ranking algorithms to provide the most relevant pages first. Metasearch engines simultaneously query multiple other search engines and aggregate their results.
Semantic Information Retrieval Using Ontology in University Domain dannyijwest
Today’s conventional search engines hardly do provide the essential content relevant to the user’s search
query. This is because the context and semantics of the request made by the user is not analyzed to the full
extent. So here the need for a semantic web search arises. SWS is upcoming in the area of web search
which combines Natural Language Processing and Artificial Intelligence. The objective of the work done
here is to design, develop and implement a semantic search engine- SIEU(Semantic Information
Extraction in University Domain) confined to the university domain. SIEU uses ontology as a knowledge
base for the information retrieval process. It is not just a mere keyword search. It is one layer above what
Google or any other search engines retrieve by analyzing just the keywords. Here the query is analyzed
both syntactically and semantically. The developed system retrieves the web results more relevant to the
user query through keyword expansion. The results obtained here will be accurate enough to satisfy the
request made by the user. The level of accuracy will be enhanced since the query is analyzed semantically.
The system will be of great use to the developers and researchers who work on web. The Google results are
re-ranked and optimized for providing the relevant links. For ranking an algorithm has been applied which
fetches more apt results for the user query.
Effective Performance of Information Retrieval on Web by Using Web Crawling dannyijwest
The document describes the EPOW (Effective Performance of Web Crawler) architecture, which aims to improve the performance of web crawlers. The EPOW crawler uses multiple downloaders in parallel and queues URLs to prioritize downloading. It analyzes downloaded pages to find new relevant URLs to add to the queue. The goal is to maximize download speed while minimizing overhead, keeping the crawled data fresh by periodically revisiting pages based on change frequency.
Comparable Analysis of Web Mining Categoriestheijes
Web Data Mining is the current field of analysis which is a combination of two research area known as Data Mining and World Wide Web. Web Data Mining research associates with various research diversities like Database, Artificial Intelligence and Information redeem. The mining techniques are categorized into various categories namely Web Content Mining, Web Structure Mining and Web Usage Mining. In this work, analysis of mining techniques are done. From the analysis it has been concluded that Web Content Mining has unstructured or semi- structure view of data whereas Web Structure Mining have linked structure and Web Usage Mining mainly includes interaction.
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSZac Darcy
A Conversational Agent for the Web of Data, Journal of Web Semantics, Volume 37–38,
2016, Pages 64-85, ISSN 1570-8268.
[4] J. M. Kleinberg, (1999), Authoritative sources in a hyperlinked environment, Journal of the ACM
(JACM), 46(5), 604-632.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd, (1999), The PageRank citation ranking: Bringing
order to the web. Technical Report, Stanford InfoLab.
[6] S. Chakrabarti, (2003), Min
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMSIJwest
Web is a wide, various and dynamic environment in which different users publish their documents. Webmining is one of data mining applications in which web patterns are explored. Studies on web mining can be categorized into three classes: application mining, content mining and structure mining. Today, internet has found an increasing significance. Search engines are considered as an important tool to respond users’ interactions. Among algorithms which is used to find pages desired by users is page rank algorithm which ranks pages based on users’ interests. However, as being the most widely used algorithm by search engines including Google, this algorithm has proved its eligibility compared to similar algorithm, but considering growth speed of Internet and increase in using this technology, improving performance of this algorithm is considered as one of the web mining necessities. Current study emphasizes on Ant Colony algorithm and marks most visited links based on higher amount of pheromone. Results of the proposed algorithm indicate high accuracy of this method compared to previous methods. Ant Colony Algorithm as one of the swarm intelligence algorithms inspired by social behavior of ants can be effective in modeling social behavior of web users. In addition, application mining and structure mining techniques can be used simultaneously to improve page ranking performance.
Identifying Important Features of Users to Improve Page Ranking Algorithms dannyijwest
Increase in number of ontologies on Semantic Web and endorsement of OWL as language of discourse for the Semantic Web has lead to a scenario where research efforts in the field of ontology engineering may be applied for making the process of ontology development through reuse a viable option for ontology developers. The advantages are twofold as when existing ontological artefacts from the Semantic Web are reused, semantic heterogeneity is reduced and help in interoperability which is the essence of Semantic Web. From the perspective of ontology development advantages of reuse are in terms of cutting down on cost as well as development life as ontology engineering requires expert domain skills and is time taking process. We have devised a framework to address challenges associated with reusing ontologies from the Semantic Web. In this paper we present methods adopted for extraction and integration of concepts across multiple ontologies. We have based extraction method on features of OWL language constructs and context to extract concepts and for integration a relative semantic similarity measure is devised. We also present here guidelines for evaluation of ontology constructed. The proposed methods have been applied on concepts from food ontology and evaluation has been done on concepts from domain of academics using Golden Ontology Evaluation Method with satisfactory outcomes
TEXT ANALYZER
1. International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.1, No.1, April 2011
G. Arumugam#, M. Thangaraj$, R. Sasirekha*
# Prof. & Head, Department of Computer Science, Madurai Kamaraj University, Madurai – 21.
gurusamyarumugam@gmail.com
$ Associate Prof., Department of Computer Science, Madurai Kamaraj University, Madurai – 21.
thangarajmku@yahoo.com
* Research Assistant, SSE Project, Department of Computer Science, Madurai Kamaraj University, Madurai – 21.
sasirekhars@gmail.com
ABSTRACT
The web has become a resourceful tool for almost all domains today. Search engines prominently use the inverted indexing technique to locate web pages containing the user's query. The performance of an inverted index fundamentally depends on searching for the keyword in the list maintained by the search engine. Text matching is done with the help of a string matching algorithm, and it is important for any string matching algorithm to quickly locate the occurrences of a user-specified pattern in a large text. In this paper a new string matching algorithm for keyword searching is proposed. The proposed algorithm relies on a new technique based on pattern length and an FML (First-Middle-Last) character match. The algorithm is analysed, implemented, and extensively tested against Boyer-Moore, Naïve, Improved Naïve, Horspool and Zhu-Takaoka. The results show that the proposed algorithm takes less time than the other existing algorithms.
KEYWORDS
Web searching, String matching, Syntactic IR
1. INTRODUCTION
The Internet is potentially the world's largest knowledge base, and finding information in it simply by surfing is very difficult. Search engines emerged and developed quickly against this background, and a search engine is a very important tool for obtaining information on the Internet. The quantity of information on the Internet is increasing exponentially day by day: according to a survey, the web crossed 110 million sites in March 2007 [6][8], and as of November 2009 about 20,340,000,000 web pages had been crawled [11]. With this exponentially growing information size, the effective utilization of information has become a major focus, and search engine developers have begun to pay attention to the quality and relevance of search results. There are three types of search engines, each with a different searching technique [10].
1.1 Contents Search Engines
These search engines are directory-oriented: they offer content browsing and direct searching services, and user queries are limited to certain contents. Because this kind of search engine incorporates human intelligence, the information retrieved is accurate and the navigation quality is good; the defects are that manual intervention is needed, information is incomplete, and information updating lags behind.
1.2 Robot Search Engines
These search engines provide full-text querying services. A robot program, called a spider or robot, automatically collects information on the Internet using breadth-first, depth-first, or other strategies; the collected information is stored in a database and indexed by an indexer. The querying machine then searches the indexed database based on the user's request and returns the matching results to the user. The merits of this kind of search engine are that no manual intervention is needed, the information collected is vast, and updating is timely; the defect is that the returned results are excessive and contain many irrelevant hits, so users must pick out the relevant information themselves.
1.3 Meta Search Engines
They do not crawl the Internet themselves and hold no data of their own. Instead, they submit the user's query to many search engines at the same time, then unite the results, remove duplicates, and re-rank them before returning a unified result list to the user. The merit of this kind of search engine is that it can provide comparatively general and correct information in a short time; the defects are that it does not fully use the functions of the underlying search engines, and users still need to filter the results. A search engine navigates efficiently through the information on the Internet, offering retrieval services to users. Its working can usually be divided into information gathering, indexing, query service, and ranking, and the performance of the indexing module plays a major role in overall performance. Conventional keyword searching algorithms have been used to meet the response-time requirements of recent search engines [6][8], so research on this topic has been carried out, resulting in a fast retrieval keyword searching algorithm. A searching algorithm helps retrieve information from the pool of websites available on the Internet. The two major approaches to Information Retrieval (IR) are syntactic IR and semantic IR.
In syntactic IR, search terms are represented as arbitrary sequences of characters and information retrieval is performed through the computation of string similarity. The search procedure used by these search engines is principally based on syntactic matching of document and query representations, and such engines are known to suffer in general from low precision [4]. In text retrieval, full-text search refers to a technique for searching a computer-stored document or database: the search engine examines all of the words in every stored document as it tries to match the search words supplied by the user. Full-text searching techniques became common in online bibliographic databases, and most web sites and application programs now provide full-text search capabilities. Some web search engines, such as AltaVista, employ full-text search techniques, while others index only a portion of the web pages examined by their indexing systems. Text search functionality takes a set of keywords and locates them in a document or a database containing words; search strings provided by users are usually context based. Semantic technologies, by contrast, represent meaning using ontologies and provide reasoning through the relationships, rules, logic, and conditions represented in those ontologies.
2. GLOBAL SCENARIO
A web search engine is a tool designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. A search engine operates in the following order: web crawling, indexing, and searching. When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matched web pages according to some criteria, usually with a short summary containing the document's title and sometimes parts of the text.
2.1. Web Crawling
Web search engines work by storing information about many web pages, which they retrieve from the WWW. These pages are retrieved by a web crawler, an automated web browser that follows every link it sees [3][7]. The contents of each page are then analysed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Metadata about the web pages is stored in an index database for use in later queries.
2.2. Indexing
Some search engines, such as Google, store all or part of the source page as well as information about the web page, whereas others, such as AltaVista, store every word of every page they find [3][7]. Such a cached page always holds the actual indexed text, so it can be very useful when the content of the current page has been updated and the search terms no longer appear in it. The increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data no longer available elsewhere.
2.3. Searching
When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others [1]. Most search engines employ methods to rank the results so as to provide the best results first. How an engine decides which pages match best, and in what order the results should appear, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.
2.4. Searching Technique
At the front end, the browser connects to the web server when a user submits a search request. The web server finds the matched documents in a large-scale indexed database, lists the indexes of these documents, and returns the result to the user. At the back end of the search engine, the administrator extracts the web pages' indexed information and saves it, along with each page's URL, in the indexed database. The indexed database must be updated continuously to keep up with the dynamic changes of the Internet [10]. The workflow of the searching mechanism is shown in the diagram [Fig: 2.4.1] below.
Fig: 2.4.1
3. RELATED WORK
3.1 String Matching Algorithm
A string matching algorithm is used to find the occurrences of a pattern P of length m in a large text T. Many techniques and algorithms have been designed to solve this problem [2]. These algorithms have many applications, including information retrieval, database querying, and DNA and protein sequence analysis, so the efficiency of string matching has great impact. Basically, a string matching algorithm uses the pattern to scan the text: it takes a window of text equal in length to the pattern, aligns the left ends of the pattern and text, checks whether the pattern occurs there, and then shifts the pattern to the right, repeating this procedure until the right end of the pattern reaches the right end of the text. Two types of matching are currently in use: exact string matching [12][15], which finds all occurrences of the pattern in the text, and approximate string matching, which finds approximate matches of a pattern in a text and measures the closeness of each match [4][9].
The most obvious approach to the string matching problem is the Brute-Force algorithm, also called the "naïve algorithm". This algorithm is simple and easy to implement, and works well mostly for binary string search. It compares the pattern P with the text T; if a mismatch occurs, the pattern is shifted one position to the right and checked against T again. This algorithm performs too many comparisons over the same text and hence takes more searching time; it is unacceptably slow.
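The naïve scan described above can be sketched as follows (an illustrative Python sketch for exposition only; the paper's own implementation used Java and JSP):

```python
def naive_search(text, pattern):
    """Brute-force scan: try every alignment, shifting one position on mismatch."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        # compare the window text[i:i+m] against the pattern character by character
        if text[i:i + m] == pattern:
            positions.append(i)
    return positions
```

Every alignment re-compares characters already seen at the previous alignment, which is exactly the redundancy the later algorithms avoid.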
The Rabin-Karp string-searching algorithm calculates a hash value for the pattern and for each m-character subsequence of the text to be compared. If the hash values are unequal, the algorithm calculates the hash value of the next m-character sequence; if they are equal, it performs a brute-force comparison between the m-character sequences. In this way there is only one comparison per text subsequence, and character matching is only needed when the hash values match. Because there are many more possible strings than small hash values, some strings must be assigned the same hash; this means that when the hash values match the strings might still differ, so they must be verified, which can take a long time for long substrings.
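A minimal Python sketch of this rolling-hash idea follows (the base and modulus are illustrative choices, not values prescribed by the paper):

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Rolling-hash search: compare hashes first, verify characters on a hash hit."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(base, m - 1, mod)              # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):                     # hashes of the pattern and the first window
        p_hash = (base * p_hash + ord(pattern[i])) % mod
        t_hash = (base * t_hash + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # brute-force verification only when the hashes collide
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                      # roll the window one character to the right
            t_hash = (base * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % mod
    return hits
```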
Knuth, Morris and Pratt [13] discovered the first linear-time string-matching algorithm through a tight analysis of the naïve algorithm. The Knuth-Morris-Pratt algorithm keeps the information that
the naïve approach wastes, gathered during the scan of the text. The saved quantity turns out to be the length of the longest proper prefix of fewer than x characters of the pattern that also appears as a suffix of the first x characters of the pattern. If there is no such prefix, then we know that we can skip over x characters. It is important to note that the text itself need not be re-examined: if we have already matched x characters of the text, then we know what that text is in terms of the pattern. In the worst case the Knuth-Morris-Pratt algorithm examines all the characters in the text and pattern at least once.
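The prefix-reuse idea above can be sketched in Python (an illustrative sketch, not the authors' code): the `fail` table records, for each prefix of the pattern, the longest proper prefix that is also a suffix, so the scan never moves backwards in the text.

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt search using a precomputed failure (prefix) function."""
    m = len(pattern)
    fail = [0] * m                         # fail[i]: longest proper prefix-suffix of pattern[:i+1]
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]                # fall back in the pattern, never in the text
        if c == pattern[k]:
            k += 1
        if k == m:                         # full match ending at position i
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits
```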
The Boyer-Moore algorithm scans the characters of the pattern from right to left, beginning with the rightmost character. During the testing of a possible placement of pattern P against text T, a mismatch of text character T[i] = c with the corresponding pattern character P[j] is handled as follows:
• If c is not contained anywhere in P, then shift the pattern P completely past T[i].
• Otherwise, shift P until an occurrence of character c in P is aligned with T[i].
The major drawback is the significant pre-processing time needed to set up the tables.
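The bad-character rule just described can be sketched as follows (an illustrative Python sketch; note the full Boyer-Moore algorithm also uses a second, good-suffix shift table, which is omitted here):

```python
def boyer_moore_bad_char(text, pattern):
    """Boyer-Moore restricted to the bad-character heuristic described above."""
    n, m = len(text), len(pattern)
    last = {c: i for i, c in enumerate(pattern)}   # rightmost position of each pattern char
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            hits.append(s)                 # full match at shift s
            s += 1
        else:
            c = text[s + j]                # the mismatching text character
            # align the rightmost occurrence of c with the mismatch position,
            # or jump completely past T[i] when c does not occur in P
            s += max(1, j - last.get(c, -1))
    return hits
```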
Some of the exact string matching algorithms used to solve the problem of searching for a pattern in text are Boyer-Moore [5], Horspool [14], Zhu-Takaoka, Naïve, Naïve FLC, and Naïve FML [4]. Although the Horspool algorithm has a better worst-case running time than Boyer-Moore, the latter is known to be extremely efficient in practice [2]. Many papers have been published that deal with exact pattern matching and introduce variants of the Boyer-Moore algorithm.
3.2 Common Issues of Existing Algorithms
Some of the common issues identified in the existing algorithms are listed below:
• Case sensitiveness is not considered.
• Character-wise searching is performed.
• Special characters and spaces are not eliminated.
• Pre-processed text is not formatted.
To overcome these issues and to speed up searching, a new algorithm is proposed. Spaces, special characters, and stop words are not considered while searching. Pre-processing the text helps to minimize the size of the search data; the reduced data is then stored in a database, where the searching is actually done.
4. PROPOSED WORK
The searching process of the text analyser tool is done by the proposed single-pattern matching algorithm, which is fast compared to existing single-pattern matching algorithms. Comparisons based on time complexity are given in this paper.
Figure 4.1: Text Analyser Flow Diagram
4.1 Site Map
This is the first phase of the text analyser tool. Before the searching process starts, pre-processing is done. Using a web crawler, a sitemap is developed: the user gives a website and a search query, and the website is then crawled to fetch all the hyperlinks available in it. Web crawlers are programs that use the graph structure of the web to move from page to page; they are used to retrieve web pages and store them in a local repository.
4.2 Pre processing
The pre-processing phase foresees the creation of a repository of the information gathered from the web during the first phase. During this phase, the documents collected by the web crawlers are pre-processed to convert the semi-structured data into a structured format. When parsing a webpage to extract content, it is necessary to remove commonly used words, or stop words, such as "it", "can", "then", etc.; this process is called stop listing. In addition to stop word elimination, stemming is done to normalize words to their root form, e.g. "connected", "connecting" and "connection" are all reduced to the root "connect". This is done by the Porter stemming algorithm. Finally, to avoid type conflicts while searching, all upper case letters are converted into lower case.
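A minimal sketch of this pipeline is shown below. The stop word set and the crude suffix stripping are illustrative stand-ins only: the paper uses the full Porter stemming algorithm, which applies far more careful rules than this toy version.

```python
import re

# Hypothetical stop word list for illustration; real systems use a much larger one.
STOP_WORDS = {"it", "can", "then", "the", "is", "a", "to", "of", "and"}

def preprocess(text):
    """Lowercase, strip non-letters, drop stop words, then apply toy suffix stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())        # removes digits and punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop listing
    stemmed = []
    for t in tokens:
        for suf in ("ing", "ion", "ed", "s"):            # toy stand-in for Porter rules
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed
```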
4.3 Pattern Matching
Once pre-processing is done, pattern matching is performed to find word matches. This starts by checking the length of the pattern against the length of each text word; if the lengths match, further pattern matching is done with the proposed pattern matching algorithm given below.
4.3.1 Proposed Pattern Matching Algorithm
• If the first character of the pattern matches the first character of the text word, then the middle characters are compared; if they match, the last character of the pattern is compared with the text.
• If all three comparisons succeed, then the remaining characters of the pattern are compared with the text from left to right.
• If a mismatch occurs in any comparison, the next occurrence of the pattern's first character is located in the text and the search restarts from that position.
• Other words are skipped without comparison.
4.3.2 Features of proposed algorithm
The following are the salient features of the proposed algorithm:
• Case insensitive.
• Reduced time complexity.
• Calculating the length of the words helps to skip unwanted words.
• The number of comparisons, which determines the processing time, drops rapidly after the FML match.
• By building the site map, all the hyperlinks can be searched.
4.3.3 Algorithm for keyword searching
Algorithm PMA (String w, String p)
if (length(w) == length(p))
{
  if (w[0] == p[0] && w[mid] == p[mid] && w[last] == p[last])
  {
    Procedure1(1, mid - 1);
    Procedure1(mid + 1, last - 1);
  }
}

Procedure1 (int a, int b)
{
  for i = a to b do
    if (w[i] != p[i])
      return False;
  return True;
}
4.3.4 Example
Text: Due to this huge size of data, web search engines are becoming increasingly important as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, also known as robots or spiders. Crawlers traverse the web by following hyperlinks and storing downloaded pages in a large database that is later indexed for efficient execution of user queries. They are the essential components of all search engines and becoming increasingly important in data mining and other indexing applications crowlar.
Pre-processed Text: huge size data web search engine important locate relevant inform. search
engine massive collection web page acquire web crawler robots spiders Crawler traverse web
follow hyperlink storing downloaded page database later index efficient execution user query
essential components search engine become increase important data mine index applicant
crowlar.
8. International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.1, No.1, April 2011
16
Pattern: Crawler
Step 1: length of the pattern = 7
Step 2: selects words from text of length 7
Text: Engines primary engines crawler crawler storing queries engines crowlar
Pattern: Crawler
Step 3: performs FML evaluation
Text: crawler crawler crowlar
Pattern: Crawler
Step 4: character matching is done
Text: Crawler Crawler
Pattern: Crawler
Step 5: Result: 2 words
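The steps above can be reproduced with the following Python sketch of the proposed check (an illustrative re-implementation under stated assumptions, not the authors' Java/JSP code; the middle index is taken as length // 2):

```python
def fml_match(word, pattern):
    """Proposed check: length filter, then First-Middle-Last characters, then the rest."""
    w, p = word.lower(), pattern.lower()       # the algorithm is case insensitive
    if len(w) != len(p):
        return False                           # length filter skips the word entirely
    mid, last = len(p) // 2, len(p) - 1
    if not (w[0] == p[0] and w[mid] == p[mid] and w[last] == p[last]):
        return False                           # FML condition fails: skip the word
    # FML passed: compare only the remaining a-3 characters
    return w[1:mid] == p[1:mid] and w[mid + 1:last] == p[mid + 1:last]

def search(tokens, pattern):
    """Count tokens that fully match the pattern."""
    return sum(fml_match(t, pattern) for t in tokens)
```

On the worked example's length-7 words, "crowlar" passes the FML condition (c, w, r all agree with "crawler") but fails the remaining comparison, so exactly 2 words are reported.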
4.3.5 Analysis
Words whose length equals the length of the pattern are taken for searching; the remaining words are skipped. Suppose there are x tokens of the same length: the FML (First-Middle-Last) character check is performed on them, and n of the tokens satisfy the FML condition. For each such word the algorithm does not compare the first, middle, and last characters again, since they already matched in the previous step, so it compares a - 3 characters per word. This means the algorithm takes n(a - 3) character comparisons for searching, where n is the number of tokens satisfying the FML condition and a is the length of a token.
5. PERFORMANCE EVALUATION
5.1 Comparisons with Existing Algorithm
The algorithm was implemented using Java and JSP and tested with different file sizes. The performance of the proposed algorithm is measured by comparing it with existing algorithms such as Boyer-Moore, Zhu-Takaoka, Horspool, Naïve, Naïve FL, and Naïve FML. The comparison chart is shown below; it shows the reduced time complexity of the proposed algorithm.
Fig 5.1 A Sample Execution Time
Fig 5.1 shows the experimental results of the tested algorithms. The average execution time of the given keywords searched by the various algorithms is plotted in the graph: execution time in milliseconds is denoted on the y-axis and the algorithms taken for comparison are denoted on the x-axis. The plotted values are obtained by finding the average searching time of words in a file. The number of character comparisons can be used as a measuring factor, since this factor affects time complexity: when the number of character comparisons decreases, complexity also decreases.
5.2 FML Condition Evaluation
After the preprocessing phase, a tokenized text file is generated and passed to the pattern matching phase. The proposed pattern matching algorithm checks the length of the given search keyword against the length of each word in the tokenized text file. If the lengths do not match, the next word in the file is checked. Once the length of the keyword matches a word, the First, Middle and Last characters are checked. This condition helps to skip a large number of words in the tokenized text file without checking all the characters of each word. By checking only 3 characters of a word, it is found that if a word returns true for the FML condition, then it is almost or exactly the same as the word being searched. This is shown in the graphs Fig 5.2.1, Fig 5.2.2 and Fig 5.2.3. The character comparison count, which is the major measuring factor for time complexity, is reduced with the help of this condition. To show the efficiency of the FML condition, a survey was done by collecting all words of lengths 10, 9, 8, 7, 6 and 5 starting with A to Z in a text file and checking the average number of words that satisfy the FML condition. The graphs below show only words of lengths 10, 9 and 8.
Fig-5.2.1: No. of words with length 10 satisfying the FML condition
Fig-5.2.1 shows that, averaged over starting letters a-z, 52 words of length 10 satisfy the FML condition among the 20,397 words of length 10.
Fig-5.2.2: No. of words with length 9 satisfying the FML condition
Fig-5.2.2 shows that, averaged over starting letters a-z, 65 words of length 9 satisfy the FML condition among the 26,280 words of length 9.
Fig-5.2.3: No. of words with length 8 satisfying the FML condition
Fig-5.2.3 shows that, averaged over starting letters a-z, 68 words of length 8 satisfy the FML condition among the 27,845 words of length 8. This demonstrates the efficiency of the FML condition: the number of words returning true is very small compared to the total number of words. Hence the FML condition is checked before character-by-character matching.
6. ACKNOWLEDGEMENTS
This paper is part of the SSE project, funded by the National Technical Research Organisation, New Delhi.
7. CONCLUSIONS
To facilitate fast retrieval, preprocessing is performed on the collected information, and the FML condition is used to retrieve the needed words from a large set of words. A new exact pattern matching algorithm is proposed. All the algorithms, including the proposed algorithm, Naïve, Boyer-Moore, Horspool, and Zhu-Takaoka, have been implemented. The compared results and charts demonstrate the performance enhancement and improved time complexity of the proposed algorithm.
8. REFERENCES
[1] Hai Dong, Farookh Khadeer Hussain, Elizabeth Chang, (2008) "State of the art in metadata abstraction crawlers", IEEE Computer Society.
[2] Musbah Aqel & Ibrahiem M., (2007) "Multiple skip multiple pattern matching algorithm", IAENG International Journal.
[3] Qing Cheng Li, Shan Lin, Zhen Hua Dong, (2008) "Research on web information mining by using crawler techniques", IEEE Computer Society.
[4] Rami H. Mansi and Jehad Q. Odeh, (2009) "Improving the Naïve string matching algorithm", Medwell Journals.
[5] Robert S. Boyer & J Strother Moore, (1977) "A fast string searching algorithm", Communications of the ACM.
[6] Vishal Gupta, Lal Quan, (2008) "A keywords searching for fast search engines", Ghaziabad, IEEE Computer Society.
[7] Xiang Peisu, Tian Ke, Huang Qinzhen, (2008) "A framework of deep web crawler", IEEE Computer Society.
[8] Xiaoming Song, Jianhua Feng, Guoling Li, Qin Hong, (2008) "LCA based keyword search for effective retrieval of 'information unit' from web pages", IEEE Computer Society.
[9] Yan Li, Xin Zhong Chen, Bing Ru Yang, (2002) "Research on web mining based intelligent search engine", IEEE Computer Society.
[10] Zu-Kuan Wei, Jiang-Ping Du, (2008) "Research on several key issues about search engines", IEEE Computer Society.
[11] http://wiki.answers.com/q/how_many_websites_are_there_in_the_www
[12] Zu-Hao Pan, Richard Chia-Tung Lee, (2008) "An approximate string matching based on exact string matching with constant pattern length", Department of Computer Science and Information Engineering, National Chi Nan University.
[13] Knuth, D.E., Morris, J.H., Pratt, V.R., (1977) "Fast pattern matching in strings", SIAM Journal on Computing.
[14] Horspool, R.N., (1980) "Practical fast searching in strings", Software-Practice & Experience 10(6), pp. 501-506.
[15] K-W Liu, Richard Chia-Tung Lee, (2008) "Some results on exact string matching algorithms", Department of Computer Science and Information Engineering, National Chi Nan University.