The document discusses various techniques for web crawling and focused web crawling. It describes the functions of web crawlers including web content mining, web structure mining, and web usage mining. It also discusses different types of crawlers and compares algorithms for focused crawling such as decision trees, neural networks, and Naive Bayes. The goal of focused crawling is to improve precision and download only relevant pages through relevancy prediction.
IDENTIFYING IMPORTANT FEATURES OF USERS TO IMPROVE PAGE RANKING ALGORITHMS (IJwest)
The web is a vast, varied and dynamic environment in which different users publish their documents. Web mining is a data mining application in which web patterns are explored. Studies on web mining can be categorized into three classes: usage mining, content mining and structure mining. Today, the internet has taken on increasing significance, and search engines are an important tool for responding to users' interactions. Among the algorithms used to find the pages users desire is the PageRank algorithm, which ranks pages based on users' interests. Although it is the most widely used algorithm in search engines, including Google, and has proved its merit compared to similar algorithms, the growth of the Internet and the increasing use of this technology make improving its performance one of the necessities of web mining. The current study builds on the Ant Colony algorithm and marks the most visited links by their higher amount of pheromone. Results of the proposed algorithm indicate the high accuracy of this method compared to previous methods. The Ant Colony algorithm, a swarm intelligence algorithm inspired by the social behavior of ants, can be effective in modeling the social behavior of web users. In addition, usage mining and structure mining techniques can be used simultaneously to improve page ranking performance.
Comparable Analysis of Web Mining Categories (theijes)
Web data mining is a current field of analysis combining two research areas, data mining and the World Wide Web. Web data mining research draws on various disciplines such as databases, artificial intelligence and information retrieval. The mining techniques are categorized into Web Content Mining, Web Structure Mining and Web Usage Mining. In this work, these mining techniques are analyzed. From the analysis it is concluded that Web Content Mining deals with an unstructured or semi-structured view of data, Web Structure Mining deals with link structure, and Web Usage Mining mainly concerns interaction.
HIGWGET - A Model for Crawling Secure Hidden WebPages (ijdkp)
Conventional search engines on the internet are active in finding appropriate information, but they face constraints in obtaining the information sought from different sources. Web crawlers are directed along particular paths of the web and are limited in moving toward other paths, which are protected or at times restricted out of concern over threats. It is possible to build a web crawler with the ability to penetrate paths of the web not reachable by the usual crawlers, so as to get a better answer in terms of information, time and relevancy for a given search query. The proposed web crawler is designed to attend Hyper Text Transfer Protocol Secure (HTTPS) websites, including web pages that require authentication to view and index.
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca... (IJSRD)
This document discusses an enhanced approach for detecting user behavior through country-wise local search. It begins with an abstract describing the development of the web and challenges in the field. It then discusses various techniques for web mining including web usage mining, web content mining, and web structure mining. It also discusses sequential pattern mining algorithms and procedures for recommendation systems. The key contribution is proposing a new local search algorithm for country-wise search to make searching more efficient based on local results.
A detail survey of page re ranking various web features and techniques (ijctet)
This document discusses techniques for page re-ranking on websites based on user behavior analysis. It describes how web usage mining involves analyzing web server logs to extract patterns in user behavior. Common techniques discussed for page re-ranking include Markov models, data mining approaches like clustering and association rule mining, and analyzing linked web page structures. The goal is to better understand user interests and predict future page access to improve information retrieval and optimize website design.
This document discusses web crawling and summarizes key aspects. It begins by outlining the basic process of web crawling, which involves maintaining a frontier of unvisited URLs, fetching pages from the frontier, extracting links, and adding new links to the frontier. The document notes that while some crawlers exhaustively crawl the web, others use preferential or topical crawling to focus on specific topics or applications. It discusses challenges in evaluating crawlers and comparing their performance.
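The frontier loop sketched in that summary maps directly onto a few lines of code. Below is a minimal breadth-first version in Python, assuming the third-party requests and beautifulsoup4 packages; the seed URL and page limit are illustrative, not taken from any of the papers above.

```python
# Minimal frontier-based crawler sketch: fetch, extract links, enqueue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    frontier = deque([seed_url])   # unvisited URLs
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()   # FIFO order gives breadth-first crawling
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(page.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])   # resolve relative links
            if link not in visited:
                frontier.append(link)
    return visited
```

A preferential or topical crawler differs only in replacing the FIFO frontier with a priority queue ordered by a relevance estimate for each link.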
Implemenation of Enhancing Information Retrieval Using Integration of Invisib... (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
IRJET - Re-Ranking of Google Search Results (IRJET Journal)
This document summarizes a research paper that proposes a hybrid personalized re-ranking approach to search results. It models a user's search interests using a conceptual user profile containing categories and concepts extracted from clicked results and a concept hierarchy. The user profile contains two types of documents - taxonomy documents representing general interests and viewed documents representing specific interests. A hybrid re-ranking process then semantically integrates the user's general and specific interests from their profile with search engine rankings to improve result relevance.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
An Intelligent Meta Search Engine for Efficient Web Document Retrieval (iosrjce)
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine queries multiple traditional search engines like Google, Yahoo, Bing and Ask simultaneously using a single user query. It then ranks the retrieved results using a new two phase ranking algorithm called modified ranking that considers page relevance and popularity. The goal of the new meta search engine is to produce more efficient search results compared to traditional search engines. It includes components like a graphical user interface, query formulator, metacrawler, redundant URL eliminator and modified ranking algorithm to retrieve and rank results.
This document discusses web structure mining and various algorithms used for it. It begins with an abstract describing web mining and how structure mining analyzes the hyperlink structure between documents. It then provides an overview of the different types of web mining (content, structure, usage) and describes structure mining in more detail. The document focuses on structure mining algorithms like PageRank, HITS, Weighted PageRank, Distance Rank and others. It explains how each algorithm works and its advantages/disadvantages for analyzing the link structure of a website.
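Since several of the listed surveys center on PageRank, a compact sketch of its standard power-iteration form may help; the damping factor of 0.85 is the conventional choice, and the four-page graph is a made-up example rather than data from the survey.

```python
# Standard PageRank by power iteration over an adjacency dict.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Illustrative four-page graph (hypothetical, for demonstration only).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```

Variants such as Weighted PageRank and Distance Rank change how the rank mass is divided among out-links, but keep this same iteration structure.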
Multi Similarity Measure based Result Merging Strategies in Meta Search Engine (IDES Editor)
Result merging is the key component of a meta search engine. Meta search engines provide a uniform query interface for Internet users to search for information: depending on users' needs, they select relevant sources, map user queries onto the target search engines, and subsequently merge the results. The effectiveness of a meta search engine is closely related to the result merging algorithm it employs. In this paper, we propose a meta search engine with two distinct steps: (1) searching through surface and deep search engines, and (2) ranking the results with the designed ranking algorithm. Initially, the query given by the user is submitted to the deep and surface search engines. The proposed method uses two distinct algorithms for ranking the search results: a concept similarity based method and a cosine similarity based method. Once the results from the various search engines are ranked, the proposed meta search engine merges them into a single ranked list. Finally, experimentation is carried out to demonstrate the efficiency of the proposed visible and invisible web-based meta search engine in merging the relevant pages, with TSAP used as the evaluation criterion for the algorithms.
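To illustrate the cosine-similarity ranking step that abstract mentions, here is a small sketch that scores snippets against the query with term-frequency vectors and merges per-engine lists by best score; the whitespace tokenization and max-score merge rule are simplifying assumptions, not the paper's exact algorithm.

```python
# Cosine similarity between a query and result snippets, then a merged ranking.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rank_and_merge(query, result_lists):
    q = Counter(query.lower().split())
    scored = {}
    for results in result_lists:              # one list per search engine
        for url, snippet in results:
            s = cosine(q, Counter(snippet.lower().split()))
            scored[url] = max(s, scored.get(url, 0.0))  # keep best score
    return sorted(scored, key=scored.get, reverse=True)
```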
Odam an optimized distributed association rule mining algorithm (synopsis) (Mumbai Academisc)
This document proposes ODAM, an optimized distributed association rule mining algorithm. It aims to discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Modern organizations have geographically distributed data stored locally at each site, making centralized data mining infeasible due to high communication costs. Distributed data mining emerged to address this challenge. ODAM reduces communication costs compared to previous distributed ARM algorithms by mining patterns across distributed databases without requiring data consolidation.
A survey on Design and Implementation of Clever Crawler Based On DUST Removal (IJSRD)
Nowadays, the World Wide Web has become a popular medium for searching information, business, trading and so on. A well-known problem faced by web crawlers is the existence of a large fraction of distinct URLs that correspond to pages with duplicate or near-duplicate contents; an estimated 29% of web pages are duplicates. Such URLs, commonly named DUST, represent an important problem for search engines. To deal with this problem without fetching page contents, the proposed methods learn normalization rules that transform all duplicate URLs into the same canonical form; a challenging aspect of this strategy is deriving a set of rules that is both general and precise. DUSTER is a new approach to detecting and eliminating redundant content: when crawling the web, it takes advantage of a multi-sequence alignment strategy to learn rewriting rules able to transform a URL into another URL that is likely to have the same content. This alignment strategy can achieve a reduction in the number of duplicate URLs 54% larger than previous methods.
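A feel for URL normalization rules can be had from a short sketch; the specific rules below (lowercasing the host, dropping the default port, stripping session parameters) are common textbook examples, not the rules DUSTER actually learns from alignments.

```python
# Illustrative URL canonicalization: map DUST variants to one canonical form.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(url, drop_params=("sessionid", "sid")):
    p = urlparse(url)
    host = p.netloc.lower()
    if host.endswith(":80") and p.scheme == "http":
        host = host[:-3]                      # drop the default port
    query = urlencode([(k, v) for k, v in parse_qsl(p.query)
                       if k.lower() not in drop_params])
    path = p.path or "/"
    return urlunparse((p.scheme, host, path, "", query, ""))

# Both variants normalize to the same canonical URL.
print(canonicalize("http://Example.com:80/a?sessionid=42&x=1"))
print(canonicalize("http://example.com/a?x=1"))
```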
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ... (inventionjournals)
This document discusses an enhanced web usage mining system using fuzzy clustering and collaborative filtering recommendation algorithms. It aims to address challenges with existing recommender systems like producing low quality recommendations for large datasets. The system architecture uses fuzzy clustering to predict future user access based on browsing behavior. Collaborative filtering is then used to produce expected results by combining fuzzy clustering outputs with a web database. This approach aims to provide users with more relevant recommendations in a shorter time compared to other systems.
The document discusses web crawlers, which are programs that download web pages to help search engines index websites. It explains that crawlers use strategies like breadth-first search and depth-first search to systematically crawl the web. The architecture of crawlers includes components like the URL frontier, DNS lookup, and parsing pages to extract links. Crawling policies determine which pages to download and when to revisit pages. Distributed crawling improves efficiency by using multiple coordinated crawlers.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A Novel Data Extraction and Alignment Method for Web Databases (IJMER)
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB (IJDKP)
The cost of acquiring training data instances for the induction of data mining models is one of the main concerns in real-world problems. The web is a comprehensive source for many types of data which can be used for data mining tasks, but its distributed and dynamic nature dictates solutions which can handle these characteristics. In this paper, we introduce an automatic method for topical data acquisition from the web. We propose a new type of topical crawler that uses a hybrid link context extraction method to acquire on-topic web pages with minimum bandwidth usage and at the lowest cost. The new link context extraction method, called Block Text Window (BTW), combines a text window method with a block-based method, overcoming the challenges of each method by using the advantages of the other. Experimental results show the superiority of BTW in comparison with state-of-the-art automatic topical web data acquisition methods on standard metrics.
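As a rough illustration of link-context extraction, the sketch below takes a fixed window of words around each anchor within its enclosing block element; this is a simplified stand-in for BTW, whose actual combination of text windows and page blocks is more involved.

```python
# Simplified link-context extraction: words around each anchor within its block.
from bs4 import BeautifulSoup

def link_contexts(html, window=10):
    soup = BeautifulSoup(html, "html.parser")
    contexts = []
    for a in soup.find_all("a", href=True):
        anchor = a.get_text(" ", strip=True)
        block = a.find_parent(["p", "div", "li", "td"]) or a.parent
        words = block.get_text(" ", strip=True).split()
        # Center a window of `window` words on the anchor text when possible.
        try:
            i = words.index(anchor.split()[0]) if anchor else 0
        except ValueError:
            i = 0
        lo = max(0, i - window // 2)
        contexts.append((a["href"], " ".join(words[lo:lo + window])))
    return contexts
```

A topical crawler would score each such context against the target topic and use the score to prioritize the corresponding URL in the frontier.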
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses search engines and web crawlers. It provides information on how search engines work by using web crawlers to index web pages and then return relevant results when users search. It also compares major search engines like Google, Yahoo, MSN, Ask Jeeves, and Live Search based on factors like market share, database size and freshness, ranking algorithms, and treatment of spam. Google is highlighted as having the largest market share and best algorithms for determining natural vs artificial links.
IRJET - Deep Web Crawling Efficiently using Dynamic Focused Web Crawler (IRJET Journal)
This document proposes a focused semantic web crawler to efficiently access valuable and relevant deep web content in two stages. The first stage fetches relevant websites, while the second performs a deep search within sites using cosine similarity to rank pages. Deep web content, estimated at over 500 times the size of the surface web, is difficult for search engines to index as it is dynamic. The proposed crawler aims to address this using adaptive learning and storing patterns to become more efficient at locating deep web information.
The majority of computer and mobile phone enthusiasts use the web for search activity. Web search engines are used for the searching; the results a search engine returns are provided to it by a software module known as the web crawler. The size of the web is increasing round-the-clock, and the principal problem is searching this huge database for specific information; stating whether a web page is relevant to a search topic is a dilemma. This paper proposes a crawler called the "PDD crawler", which follows both a link-based and a content-based approach. This crawler follows a completely new crawling strategy to compute the relevance of a page: it analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the maximum content and the highest relevance.
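The tag-weighting idea behind the PDD crawler can be sketched as follows; the weight table and keyword counting are illustrative assumptions, since the paper's exact weights are not reproduced here.

```python
# Tag-weighted page relevance: keywords found in prominent tags count more.
from bs4 import BeautifulSoup

# Hypothetical weights: prominent tags contribute more to the page score.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "meta": 3.0, "a": 2.0, "p": 1.0}

def page_weight(html, keywords):
    soup = BeautifulSoup(html, "html.parser")
    total = 0.0
    for tag, w in TAG_WEIGHTS.items():
        for el in soup.find_all(tag):
            text = (el.get("content", "") if tag == "meta"
                    else el.get_text(" ", strip=True)).lower()
            total += w * sum(text.count(k.lower()) for k in keywords)
    return total
```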
The document describes a proposed framework for a metacrawler that retrieves and ranks web documents from multiple search engines based on user queries. The metacrawler fetches results from different search engines in parallel using a web crawler. It eliminates duplicate URLs and ranks the pages using an improved PageRank algorithm to reduce topic drift. The ranked results are then clustered to group similar pages to help users easily find relevant information. An evaluation of the metacrawler shows it achieves better retrieval effectiveness and relevance ratios compared to individual search engines.
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR... (ijmech)
Public transport now makes life more and more convenient, and the number of vehicles in large and medium-sized cities is growing rapidly. In order to take full advantage of social resources and protect the environment, regional end-to-end public transport services are established by analyzing online travel data. Computer programs for processing web pages are necessary for accessing large amounts of carpool data. In this paper, web crawlers are designed to capture travel data from several large service sites. In order to maximize access to traffic data, a breadth-first algorithm is used, and the carpool data are saved in a structured form. The paper thereby provides a convenient method of data collection for the program.
DESIGN AND IMPLEMENTATION OF CARPOOL DATA ACQUISITION PROGRAM BASED ON WEB CR... (ijmech)
The document describes a web crawler designed to collect carpool data from websites. It begins with an introduction to the need for efficient carpool data collection and issues with existing methods. It then details the design and implementation of the web crawler program. Key aspects summarized are:
1) The web crawler uses a breadth-first search algorithm to crawl links across multiple pages and maximize data collection. It filters URLs to remove duplicates and irrelevant links.
2) It analyzes pages using the BeautifulSoup library to extract relevant text data and links. It stores cleaned data in a structured format.
3) The program architecture involves crawling URLs, cleaning the URL list, and then crawling pages to extract carpool data fields using BeautifulSoup functions.
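Step 3, field extraction with BeautifulSoup, might look like the sketch below; the CSS class names and record fields are hypothetical, since the markup of the carpool sites is not shown.

```python
# Hypothetical carpool-record extraction with BeautifulSoup; the CSS class
# names ("record", "origin", "dest", "time") are illustrative only.
from bs4 import BeautifulSoup

def extract_records(html):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("div", class_="record"):
        records.append({
            "origin": row.find(class_="origin").get_text(strip=True),
            "dest": row.find(class_="dest").get_text(strip=True),
            "time": row.find(class_="time").get_text(strip=True),
        })
    return records   # structured rows ready to save to a database or CSV
```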
Intelligent Web Crawling (WI-IAT 2013 Tutorial) - Denis Shestakov
<<< Slides can be found at http://www.slideshare.net/denshe/intelligent-crawling-shestakovwiiat13 >>>
-------------------
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple website-backup program to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. We start with background on web crawling and the structure of the Web. We then discuss different crawling strategies and describe adaptive web crawling techniques leading to better overall crawl performance. We finally overview some of the challenges in web crawling by presenting such topics as collaborative web crawling, crawling the deep Web and crawling multimedia content. Our goals are to introduce the intelligent systems community to the challenges in web crawling research, present intelligent web crawling approaches, and engage researchers and practitioners on open issues and research problems. Our presentation could be of interest to the web intelligence and intelligent agent technology communities, as it particularly focuses on the usage of intelligent/adaptive techniques in the web crawling domain.
-------------------
Data mining in web search engine optimization (BookStoreLib)
This document presents a proposed approach for optimizing web search by incorporating user feedback to improve result rankings. The approach uses keyword analysis on the user query to initially retrieve and rank relevant web pages. It then analyzes user responses like likes/dislikes and visit counts to update the page rankings. Experimental results on sample education queries show how page rankings change as user responses increase likes for certain pages. The approach aims to provide more useful search results by better reflecting individual user preferences.
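A minimal version of that feedback-driven re-ranking could look like this; the blending weights and the score formula are assumptions for illustration, not the approach's published formula.

```python
# Re-rank pages by blending base relevance with user feedback signals.
def rerank(pages, alpha=0.5, beta=0.3, gamma=0.2):
    # Each page: (url, base_score, likes, dislikes, visits). The weights
    # alpha/beta/gamma are hypothetical, chosen only to show the mechanism.
    def score(p):
        url, base, likes, dislikes, visits = p
        return alpha * base + beta * (likes - dislikes) + gamma * visits
    return sorted(pages, key=score, reverse=True)

results = [("a.edu", 0.9, 2, 0, 10), ("b.edu", 0.8, 15, 1, 40)]
print([url for url, *_ in rerank(results)])   # feedback can promote b.edu
```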
International conference On Computer Science And technology (anchalsinghdm)
ICGCET 2019 | 5th International Conference on Green Computing and Engineering Technologies. The conference will be held on 7th September - 9th September 2019 in Morocco. International Conference On Engineering Technology
The conference aims to promote the work of researchers, scientists, engineers and students from across the world on advancement in electronic and computer systems.
http://lab.lvduit.com:7850/~lvduit/crawl-the-web/
Special thanks to Hatforrent, csail.mit.edu, Amitavroy, ...
Contact us at dafculi.xc@gmail.com or duyet2000@gmail.com
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs (ijsrd.com)
With the exponential growth of the World Wide Web, there is so much information overload that it has become hard to find data according to need. Web usage mining, a part of web mining, deals with the automatic discovery of user navigation patterns from web logs. This paper presents an overview of web mining and also derives navigation patterns from classification and clustering algorithms for web usage mining. Web usage mining contains three important tasks, namely data preprocessing, pattern discovery and pattern analysis based on the discovered patterns. The paper also contains a comparative study of web mining techniques.
Recommendation generation by integrating sequential pattern mining and semantics (eSAT Journals)
Abstract: As Internet usage keeps increasing, the number of web sites, and hence the number of web pages, also keeps increasing. A recommendation system can provide personalized web service by suggesting the pages that are likely to be accessed in the future. Most recommendation systems are based on association rule mining or on keywords. With association rule mining the prediction rate is lower, as it does not take into account the order in which users access web pages, while keyword-based recommendation systems provide less relevant results. This paper proposes a recommendation system that uses the advantages of sequential pattern mining and semantics over association-rule and keyword-based systems respectively. Keywords: Sequential Pattern Mining, Taxonomy, Apriori-All, CS-Mine, Semantic, Clustering.
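The advantage of sequential patterns over unordered association rules (access order matters) shows up even in a toy sketch; the bigram counting below is a deliberately simple stand-in for the Apriori-All and CS-Mine algorithms named in the keywords.

```python
# Toy sequential-pattern miner: count ordered page transitions in sessions
# and recommend the most frequent successor of the current page.
from collections import Counter

def next_page_model(sessions):
    transitions = Counter()
    for pages in sessions:
        for a, b in zip(pages, pages[1:]):    # order preserved, unlike
            transitions[(a, b)] += 1          # unordered association rules
    return transitions

sessions = [["home", "news", "sports"], ["home", "news", "tech"],
            ["home", "sports"]]
model = next_page_model(sessions)
# Recommend the most common page that follows "home".
candidates = {b: c for (a, b), c in model.items() if a == "home"}
print(max(candidates, key=candidates.get))    # -> "news"
```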
A Novel Interface to a Web Crawler using VB.NET Technology (IOSR Journals)
This document describes the design of a web crawler interface created using VB.NET technology. It discusses the components and architecture of web crawlers, including the seed URLs, frontier, parser, and performance metrics used to evaluate crawlers. The high-level design of the crawler simulator is presented as an algorithm, and screenshots of the VB.NET user interface for the crawler are shown. The crawler was tested on the website www.cdlu.edu.in using different crawling algorithms like breadth-first and best-first, and the results were stored in an MS Access database.
Web crawlers, also known as spiders, are programs that systematically browse the World Wide Web to download pages for search engines and indexes. Crawlers start with a list of URLs and identify links on pages to add to the list to visit recursively. Effective crawlers require flexibility, high performance, fault tolerance, and maintainability. Crawling strategies include breadth-first, repetitive, targeted, and random walks. Selection, revisit, politeness, and parallelization policies help crawlers efficiently gather relevant information from the dynamic web. Distributed crawling employs multiple computers to index large portions of the internet in parallel.
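The politeness policy mentioned there usually begins with honoring robots.txt, which Python's standard library handles directly; the site URL and user-agent string below are placeholders.

```python
# Politeness check with the standard-library robots.txt parser.
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
    time.sleep(1.0)   # crawl delay between requests to the same host
    ...               # fetch the page here
```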
This document describes an intelligent meta search engine that was developed to efficiently retrieve relevant web documents. The meta search engine submits user queries to multiple traditional search engines including Google, Yahoo, Bing and Ask. It then uses a crawler and modified page ranking algorithm to analyze and rank the results from the different search engines. The top results are then generated and displayed to the user, aimed to be more relevant than results from individual search engines. The meta search engine was implemented using technologies like PHP, MySQL and utilizes components like a graphical user interface, query formulator, metacrawler and redundant URL eliminator.
International Journal of Computational Engineering Research (IJCER) is dedicated to protecting personal information and will make every reasonable effort to handle collected information appropriately. All information collected, as well as related requests, will be handled as carefully and efficiently as possible in accordance with IJCER standards for integrity and objectivity.
Web Page Recommendation Using Web Mining (IJERA Editor)
On the World Wide Web various kinds of content are generated in huge amounts, so giving relevant results to users through web recommendation has become an important part of web applications. Different kinds of web recommendations are made available to users every day, including images, video, audio, query suggestions and web pages. In this paper we aim at providing a framework for web page recommendation: 1) first we describe the basics of web mining and the types of web mining; 2) we detail each web mining technique; 3) we propose the architecture for personalized web page recommendation.
A machine learning approach to web page filtering using ... (butest)
This document describes a machine learning approach to web page filtering that combines content and structural analysis. The proposed approach represents web pages with features extracted from content, such as terms and phrases, and from links. These features are used as input for machine learning algorithms like neural networks and support vector machines to classify pages. An experiment compares this approach to keyword-based and lexicon-based filtering, finding the proposed approach generally performs better, especially with few training examples. The approach could benefit topic-specific search engines and other applications.
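A minimal version of such a classifier, concatenating content features with link features, might look like this with scikit-learn; the tiny training set and the two link-feature columns are placeholders, not the experiment's data.

```python
# Page filtering with an SVM over combined content and link features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import numpy as np

texts = ["web crawler search engine index", "cheap pills buy now"]
link_feats = np.array([[12, 30], [1, 2]])     # [in-links, out-links], made up
labels = [1, 0]                               # 1 = relevant, 0 = not

vec = TfidfVectorizer()
X_text = vec.fit_transform(texts).toarray()
X = np.hstack([X_text, link_feats])           # content + structure features
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X[:1]))                     # sanity check on training data
```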
Enhance Crawler For Efficiently Harvesting Deep Web Interfaces (rahulmonikasharma)
The scenario on the web is changing quickly and the size of web resources is rising, so efficiency has become a challenging problem for crawling such data. Hidden web content is data that cannot be indexed by search engines because it stays behind searchable web interfaces. The proposed system aims to develop a framework for a focused crawler that efficiently gathers hidden web interfaces. First, the crawler performs site-based searching to obtain center pages with the help of web search tools, avoiding visits to an excessive number of pages. To get more specific results, the proposed crawler ranks websites, giving high priority to the ones more relevant to a given search. The crawler accomplishes fast in-site searching by watching for more relevant links with adaptive link ranking. We have also incorporated a spell checker to ensure correct input, and we apply reverse searching with incremental site prioritizing for wide-ranging coverage of hidden web sites.
A Review on Pattern Discovery Techniques of Web Usage Mining (IJERA Editor)
In recent years, with the development of Internet technology, the growth of the World Wide Web has exceeded all expectations. A lot of information is available in different formats, and retrieving interesting content has become a very difficult task. One possible approach to this problem is Web Usage Mining (WUM), an important application of web mining. Extracting the hidden knowledge in the log files of a web server, recognizing the various interests of web users, and discovering customer behavior on a site are typical applications of web usage mining. In this paper we provide an updated, focused survey of web usage mining techniques.
Sekhon final 1_ppt
1. SRI GURU GRANTH SAHIB WORLD UNIVERSITY
FATEHGARH SAHIB – 140406
JULY, 2014
Guided By: Ms. Shruti Aggarwal, Assistant Professor (CSE Department)
Presented By: Prabhjit Singh Sekhon, Roll No: 12012238, M.Tech (CSE)
3. Data mining is the computational process of discovering patterns in large data sets. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
5. Functions of data mining can be classified as:
Class Description: It can be used to describe individual classes and concepts.
Association Analysis: It is the discovery of association rules that define attribute-value conditions.
Classification: It is the process of finding a set of models that describe and distinguish data classes or concepts by using class labels.
Cluster Analysis: Objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity.
Outlier Analysis: Outliers are data objects that do not comply with the general behavior or model of the data.
Evolution Analysis: It describes and models trends for objects whose behavior changes over time.
6. Pre-processing: Data is collected, irrelevant data items are removed, and the data is converted into an appropriate form.
Processing: Algorithms are chosen to develop and evaluate models, and the models are applied to the data.
Post-processing: The results of the models are presented; if irrelevant data remains, the steps are repeated.
8. There are various types of mining:
Text Mining: Deriving high-quality information from large bodies of text.
Visual Mining: Discovering patterns hidden in visual data, which may be converted from analog to digital form.
Audio Mining: The content of an audio signal can be analysed and searched.
Spatial Mining: Finding patterns in data with respect to geography.
Web Mining: Discovering patterns from the web.
Video Mining: The process of automatically extracting relevant video content and structure, such as moving objects, spatial correlations of features, object activity, video events and video structure patterns.
9. Advantages
Predicting future trends and customer purchase habits.
Helping with decision making.
Improving company revenue and lowering costs.
Market basket analysis.
Fraud detection.
Disadvantages
User privacy/security concerns.
High cost at the implementation stage.
Possible misuse of information.
Possible inaccuracy of data.
10. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.
There are three general classes of information that can be discovered by web mining:
Web Activity from server logs and Web browser activity tracking.
Web Graph from links between pages, people and other data.
Web Content for the data found on Web pages and inside of documents.
12. Web Content Mining is the process of extracting useful information from the contents of web documents. There are two approaches to content mining: the agent-based approach and the database approach.
Agent-Based Approach: This approach concentrates on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information.
Database Approach: This approach is used for retrieving semi-structured data from the web.
13. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:
Extracting patterns from hyperlinks in the web: A hyperlink is a structural component that connects a web page to a different location.
Mining the document structure: Analysis of the tree-like structure of a page, such as its HTML tag hierarchy.
14. Web usage mining is the process of extracting useful information from server logs, i.e., finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data.
15. A Search Engine Spider is a program that most search engines use to find what's new on the Internet.
The program starts at a website and follows every hyperlink on each page, so we can say that everything on the web will eventually be found and spidered as the so-called "spider" crawls from one website to another.
16. Crawler | Description
GNU Wget | A command-line crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
DataparkSearch | A crawler and search engine released under the GNU General Public License.
GRUB | An open-source distributed search crawler that Wikia Search used to crawl the web.
HTTrack | Uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
17. FAST Crawler is a distributed crawler used by Fast Search & Transfer.
Googlebot is used by the Google search engine; it is written in C++ and Python.
Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.
19. Information on the Web changes over time.
Knowledge of this ‘rate of change’ is crucial, as it allows us to
estimate how often a search engine should visit each portion of
the Web in order to maintain a fresh index.
20. Recognizing the relevance or importance of sites:
Define a scoring function for relevance, s_ξ(u), where ξ is the set of parameters, u is the URL, and s_ξ(u) gives the estimated relevance of u.
21. Indexing the web is a challenge due to its growing and dynamic nature.
A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly.
Distributing the crawling activity splits the load, decreases hardware requirements, and at the same time increases the overall download speed and reliability, as the sketch below illustrates.
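To make the load splitting concrete, here is a minimal, hedged sketch of parallelised fetching with worker threads, using only Python's standard library; the seed URLs and worker count are hypothetical placeholders, not part of the slides.

```python
# A minimal sketch of parallelised page fetching (assumed setup, not the
# thesis's actual crawler). Seed URLs below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Each worker downloads one page; a failure is reported, not fatal.
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.read()
    except OSError:
        return url, None

seeds = ["http://example.com/a", "http://example.com/b"]  # hypothetical
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, body in pool.map(fetch, seeds):
        print(url, "downloaded" if body else "failed")
```

With several workers drawing from a shared frontier, the download rate scales with the number of workers until bandwidth or politeness limits dominate.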
22. Focused web crawling visits unvisited URLs and checks whether they are relevant to the search topic before downloading them.
It avoids irrelevant documents.
It reduces network traffic.
23. There are basically three steps involved in the web crawling process (sketched below):
The search crawler starts by crawling the pages of your site.
Then it continues indexing the words and content of the site.
Finally it visits the links (web page addresses or URLs) that are found on your site.
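The following is a minimal single-threaded sketch of this crawl-index-follow loop, assuming only Python's standard library; "http://example.com" is a placeholder seed, and the in-memory dict stands in for a real index.

```python
# A hedged sketch of the three-step crawl loop: fetch a page, index its
# content, then follow its links. Not a production crawler.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier, seen, index = [seed], set(), {}
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)                  # step 1: crawl the next page
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        index[url] = html                      # step 2: index the content
        parser = LinkParser()
        parser.feed(html)                      # step 3: extract and follow links
        frontier.extend(urljoin(url, link) for link in parser.links)
    return index

crawl("http://example.com")  # hypothetical seed
```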
24. The Web crawler is the outcome of a combination of policies:
Selection policy: States which pages to download. Given the current size of the Web, even large search engines cover only a portion of the publicly available part.
Re-visit policy: States when to check for changes to the pages. The most-used cost functions are freshness and age.
Politeness policy: States how to avoid overloading sites; a single crawler performing multiple requests per second and/or downloading large files can overwhelm a server, so requests are spaced out (see the sketch after this list).
Parallelization policy: A parallel crawler is a crawler that runs multiple processes in parallel.
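As a concrete illustration of the politeness policy, here is a hedged sketch that enforces a minimum delay between successive requests to the same host; the one-second delay is an assumed value, not one prescribed by the slides.

```python
# A sketch of a politeness policy: wait before hitting the same host again.
import time
from urllib.parse import urlparse

last_hit = {}       # host -> time of the last request to that host
MIN_DELAY = 1.0     # seconds between requests to one host (assumed value)

def polite_wait(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)   # avoid overloading the server
    last_hit[host] = time.time()
```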
25. A new learning-based approach is used, with the Naïve Bayes classifier as the base prediction model, to improve relevance prediction in focused Web crawlers.
This learning-based focused crawling approach uses four relevance attributes to predict relevance; if dynamic updating of the training dataset is additionally allowed, the prediction accuracy is further boosted [1].
26. The intelligent crawling method involves looking for specific features in a page to rank the candidate links.
These features include the page content, the URL names of referred Web pages, and the nature of the parent and sibling pages.
It is a generic framework in that it allows the user to specify the relevant criteria [2]; a hypothetical feature-combination sketch follows.
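The sketch below shows one way such features could be combined into a single link score. The weights and the particular features are purely illustrative assumptions, not the actual scoring rule of [2].

```python
# A hypothetical feature-combination score for a candidate link: anchor-text
# match, URL-token match, and the parent page's relevance. Weights are made up.
def link_score(anchor_text, url, parent_relevance, topic_words,
               w_anchor=0.4, w_url=0.2, w_parent=0.4):
    anchor_hit = any(w in anchor_text.lower() for w in topic_words)
    url_hit = any(w in url.lower() for w in topic_words)
    return (w_anchor * anchor_hit +
            w_url * url_hit +
            w_parent * parent_relevance)   # parent relevance assumed in [0, 1]
```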
27. Fish-Search is an early crawler that prioritizes unvisited URLs on a queue for a specific search goal.
The Fish-Search approach assigns priority values (1 or 0) to candidate pages using simple keyword matching.
One of the disadvantages of Fish-Search is that all relevant pages are assigned the same priority value 1 based on keyword matching [3].
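A minimal sketch of this binary priority assignment, under the simple keyword-matching rule the slide describes:

```python
# Fish-Search style binary priority: 1 if any query keyword appears, else 0.
def fish_priority(page_text, keywords):
    text = page_text.lower()
    return 1 if any(k.lower() in text for k in keywords) else 0
```

Note that every matching page ties at priority 1, which is exactly the flat-ranking drawback noted above.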
28. Shark-Search is a modified version of Fish-Search in which the Vector Space Model (VSM) is used.
The priority values (graded, rather than just 1 and 0) are computed based on the priority values of parent pages, the page content, and the anchor text [3].
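The graded scores come from VSM similarity rather than exact keyword matching. A hedged sketch of the underlying computation, using raw term-frequency vectors and cosine similarity (a simplification of the model the slide names):

```python
# VSM scoring sketch: term-frequency vectors compared by cosine similarity,
# yielding graded scores in [0, 1] instead of Fish-Search's binary 0/1.
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine("focused web crawler design", "web crawler for focused search"))
```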
44. A decision tree is an important tool for machine learning. It is used as a predictive model to map observations about an item to conclusions about the item's target value.
Tree structures comprise leaves and branches.
Leaves represent class labels, and branches represent conjunctions of features that lead to those class labels.
In decision analysis, a decision tree can be used to explicitly and visually represent decisions and decision making.
The resulting classification tree can be an input for decision making.
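A minimal sketch of decision-tree classification of URLs as relevant or irrelevant, assuming scikit-learn is available; the four features per URL and the labels are invented for illustration, not the thesis's actual attribute set.

```python
# Decision-tree induction sketch: branches over (hypothetical) URL features
# lead to a class-label leaf (1 = relevant, 0 = irrelevant).
from sklearn.tree import DecisionTreeClassifier

X = [[0.9, 0.8, 0.7, 0.6],   # hypothetical feature vectors for training URLs
     [0.1, 0.2, 0.0, 0.3],
     [0.8, 0.9, 0.6, 0.7],
     [0.2, 0.1, 0.3, 0.0]]
y = [1, 0, 1, 0]             # hypothetical relevance labels

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0.7, 0.6, 0.8, 0.5]]))  # predicted class label
```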
45. A neural network (NN) is an always-learning model.
A neuron usually receives many inputs, called the input vector (an analogy of dendrites), and the weighted sum of them forms the output (an analogy of the synapse).
The inputs are weighted, and the sum is passed to the activation function.
The brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons (nerve cells) with massive interconnections between them.
It is estimated that there are on the order of 10 billion neurons in the human cortex, and 60 trillion connections.
The net result is that the brain is an enormously efficient structure.
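The single-neuron computation described above, as a short sketch: a weighted sum of the input vector, plus a bias, passed through an activation function. The weights and inputs are illustrative values.

```python
# One artificial neuron: weighted sum of inputs -> activation function.
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def neuron(inputs, weights, bias):
    s = sum(i * w for i, w in zip(inputs, weights)) + bias  # weighted sum
    return sigmoid(s)                                       # activation

print(neuron([0.5, 0.3, 0.9], [0.4, -0.2, 0.7], bias=0.1))
```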
46. The Naïve Bayes algorithm is based on probabilistic learning and classification.
It assumes that each feature is independent of the others.
This algorithm has proved efficient compared with many other approaches, even though its simple independence assumption rarely holds in realistic cases.
It exploits the fact that relevant pages are likely to link to other relevant pages; therefore, the relevance of a page a to a topic t, pointed to by a page b, is estimated by the relevance of page b to the topic t.
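A hedged sketch of the independence assumption at work: the score for a class is the class prior multiplied by the per-feature likelihoods, treated as independent. The probability tables below are invented for illustration.

```python
# Naive Bayes scoring sketch: P(class | features) is proportional to
# P(class) * product of P(feature | class), assuming feature independence.
def naive_bayes_score(features, prior, likelihood):
    # likelihood[f] = P(feature f present | class); values here are made up
    score = prior
    for f in features:
        score *= likelihood.get(f, 1e-6)   # small floor for unseen features
    return score

rel = naive_bayes_score(["crawler", "focus"], 0.5,
                        {"crawler": 0.8, "focus": 0.7})
irr = naive_bayes_score(["crawler", "focus"], 0.5,
                        {"crawler": 0.1, "focus": 0.2})
print("relevant" if rel > irr else "irrelevant")
```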
47. With the advent of the WWW and the increase in the size of the web, searching for relevant information on the web has become a challenging task. The result of any search is millions of documents ranked by a search engine.
The ranking depends on keyword count, keyword density, link count and other proprietary algorithms that are specific to each search engine.
A user does not have an easy way to reduce those millions of documents by defining a context or refining the meaning of the keyword.
The documents appearing in the search result are ranked according to an algorithm characteristic of each search engine, so the results may in all probability not be sorted according to a user's interest.
Due to the above-mentioned problems and the keyword approach used by most search engines, the amount of time and effort required to find the right information is directly proportional to the amount of information on the web. Data mining is a valid solution to this problem.
Crawlers are important data mining tools which traverse the internet to retrieve web pages that are relevant to a topic.
48. To understand the working of content block segmentation from previous research.
To find attributes for seed URLs and their child URLs.
To classify the URLs according to their weight or score in the weight table.
To prepare the full training data set and maintain the keyword table.
To classify the relevancy of new unseen URLs using Decision Tree Induction, a Neural Network, and a Naïve Bayes Classifier.
To find the Harvest ratio or Precision rate for overall performance evaluation.
50. Precision rate is basically the parameter considered to form the results; the performance of all the algorithms is measured on the basis of precision rate.
Precision rate: It estimates the fraction of crawled pages that are relevant. We depend on multiple classifiers to make this relevance judgement.
Precision Rate = No. of relevant pages / Total downloaded pages
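The computation itself is direct; a tiny worked example with made-up counts (the figures below are not results from the thesis):

```python
# Precision rate as defined above, on hypothetical counts.
relevant_pages = 180       # hypothetical number of relevant crawled pages
total_downloaded = 240     # hypothetical total downloaded pages
precision_rate = relevant_pages / total_downloaded
print(f"Precision rate: {precision_rate:.2f}")   # 0.75
```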
52. The name MATLAB stands for MATrix LABoratory.
MATLAB is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write.
MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and a programming environment.
53. MATLAB is an excellent tool because it has sophisticated data structures, contains built-in editing and debugging tools, and supports object-oriented programming.
It also has easy-to-use graphics commands that make visualization of results immediately available.
There are toolboxes (specific applications) for signal processing, symbolic computation, control theory, simulation, optimization, and several other fields of applied science and engineering.
56. Our work was to compare the accuracy of the three prime classifiers. We can conclude that, in terms of classification accuracy, the neural network leads decision tree induction and Naive Bayes by a big margin.
Moreover, for more complex computing we need to improve the memory of a particular classifier by training it, and in this context as well the NN's dominance prevails over DTI and NB.
57. The approach can be applied to large databases for relevancy prediction.
Support Vector Machines could further be used.
58. [1] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", Computer Networks, vol. 31, no. 11-16, pp. 1623-1640, 1999.
[2] J. Li, K. Furuse, and K. Yamaguchi, "Focused Crawling by Exploiting Anchor Text Using Decision Tree", in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, Chiba, Japan, 2005.
[3] J. Rennie and A. McCallum, "Using Reinforcement Learning to Spider the Web Efficiently", in Proceedings of the 16th International Conference on Machine Learning (ICML-99), pp. 335-343, 1999.