“Web crawler”
 

© All Rights Reserved
    “Web crawler” Presentation Transcript

    • “Web Crawler” Ranjit R. Banshpal
    • Overview
      • OBJECTIVE
      • INTRODUCTION
      • PROBLEM STATEMENT
      • ARCHITECTURE OF WEB CRAWLER
      • APPROACHES FOR CRAWLING PROCESS
      • POLICIES USED
      • UTILITIES OF WEB CRAWLER
      • CONCLUSION
      • SCOPE FOR FUTURE
      • REFERENCES
    • Objective
      • Internet users and accessible web pages.
      • Hypertext system.
      • Web crawlers are among the most crucial components of search engines, and optimizing them would greatly improve search efficiency.
    • Introduction Programs that exploit the graph structures of the web to move from page to page. Program that browses the World Wide Web in a methodical, automated manner. Search Engines: Most crucial components Improves the searching efficiency. 4
    • Literature survey paper 1: “Distributed Ontology-Driven Focused Crawling”
      • Vertical search technologies.
      • Focused crawling.
      • Ontological structure.
      • The web crawler architecture uses URL scoring functions, a scheduler, a DOM parser, and a page ranker to download web pages.
    • Literature survey paper 2: “Efficient Focused Crawling based on Best First Search”
      • Seeks out pages that are relevant to given keywords.
      • A focused crawler analyzes links that are likely to be most relevant.
      • The “best-first” search strategy is identified as a “focused crawler”.
      • A focused crawler has two main components: (i) finding specific web pages; (ii) proceeding from seed pages.
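
A best-first crawl can be sketched with a priority queue in which the most promising link is always expanded next. This is only an illustration under stated assumptions, not the method of the surveyed paper: the `relevance` score (fraction of keywords found in a link's anchor text), the `PAGES` data, and all function names are hypothetical.

```python
import heapq


def relevance(text, keywords):
    """Fraction of keywords present in `text` (a deliberately simple score)."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k.lower() in words) / len(keywords)


def best_first_crawl(seeds, pages, keywords, max_pages=10):
    """Visit pages in order of estimated relevance.

    `pages` maps a URL to a list of (anchor_text, target_url) pairs and
    stands in for downloading and parsing real pages.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, target in pages.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-relevance(anchor, keywords), target))
    return visited


# Hypothetical link data: anchor text and target for each outgoing link.
PAGES = {
    "seed": [
        ("football scores", "sports"),
        ("focused crawler tutorial", "crawl"),
        ("weather", "weather"),
    ],
    "crawl": [("ontology crawler", "onto")],
}

visited = best_first_crawl(["seed"], PAGES, ["focused", "crawler"])
print(visited)
```

Links whose anchor text matches more keywords are visited before unrelated ones, so a small `max_pages` budget is spent on the relevant part of the graph first.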
    • Literature survey paper 3: “Design of an Ontology based Adaptive Crawler for Hidden Web”
      • Deep web / invisible web / hidden web.
      • Accessing the deep web using ontology.
      • Downloads relevant hidden web pages.
    • Literature survey paper 4: “URL Rule Based Focused Crawlers”
      • Use of URL regular expressions.
      • Retrieving topic-specific pages.
      • To find topic-specific information, the crawler needs to crawl only a small part of the data, and so uses fewer server resources.
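
Filtering URLs with regular expressions before downloading them might look like the sketch below. The patterns, the `example.com` URLs, and the `should_crawl` helper are assumptions made for illustration; the actual rules used in the surveyed paper are not given in the slides.

```python
import re

# Hypothetical rules: crawl only dated news articles on one site.
ALLOW = [re.compile(r"^https?://example\.com/news/\d{4}/[\w-]+\.html$")]

# Hypothetical rules: skip session URLs and binary files.
DENY = [
    re.compile(r"[?&]sessionid="),
    re.compile(r"\.(jpg|png|pdf)$", re.IGNORECASE),
]


def should_crawl(url):
    """Return True if `url` matches an allow rule and no deny rule."""
    if any(pattern.search(url) for pattern in DENY):
        return False
    return any(pattern.search(url) for pattern in ALLOW)


candidates = [
    "http://example.com/news/2013/focused-crawling.html",
    "http://example.com/about.html",
    "http://example.com/news/2013/photo.jpg",
]
selected = [url for url in candidates if should_crawl(url)]
print(selected)
```

Because the decision is made from the URL alone, rejected pages are never downloaded, which is where the saving in server resources comes from.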
    • Literature survey paper 5: “A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree”
      • Represents data in a hierarchical DOM tree.
      • The DOM tree is a structural representation of HTML pages.
      • Uses the concept of ontology.
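
To make the DOM-tree idea concrete, the sketch below builds a small tree from an HTML string with Python's standard `html.parser` module. The `Node`, `TreeBuilder`, and `outline` names and the sample HTML are hypothetical; this shows the hierarchical representation only, not the topic-scoring method of the surveyed paper.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element in the tree, with its tag, text, and children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects as the parser reports tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def outline(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


HTML = "<html><body><h1>Crawlers</h1><div><p>Focused</p><p>Deep web</p></div></body></html>"
builder = TreeBuilder()
builder.feed(HTML)
print("\n".join(outline(builder.root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the page, which is the hierarchy a topic-specific crawler could inspect.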
    • Problem statement
      • The most prominent challenge for current web crawlers is selecting the important pages to download.
      • A crawler cannot download all pages from the web.
      • It is important for the crawler “to select the pages and to visit ‘important’ pages first by prioritizing the URLs in the queue properly.”
      • It should also minimize the load on the crawled websites while parallelizing the crawling process.
    • Functional diagram of web crawler
    • Approaches for crawling process
      • Broadly, there are two types of crawler:
      • Priory: follows a defined path.
      • A priory: does not follow a specific path.
    • Policies used
      • A selection policy that states which pages to download.
      • A politeness policy that states how to avoid overloading web sites.
      • A parallelization policy that states how to coordinate distributed web crawlers.
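
One common way to implement a politeness policy is to honor a site's `robots.txt` and space out requests to it. The sketch below uses Python's standard `urllib.robotparser`. The `SitePoliteness` class, the `example-bot` user agent, the sample `robots.txt`, and the one-second default delay are assumptions for illustration; the slides do not specify how the policy is enforced.

```python
from urllib.robotparser import RobotFileParser


class SitePoliteness:
    """Tracks what may be fetched from one site, and how soon."""

    def __init__(self, robots_txt, user_agent="example-bot", default_delay=1.0):
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        # Parse robots.txt text directly; a real crawler would download it.
        self.robots.parse(robots_txt.splitlines())
        delay = self.robots.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self.next_slot = 0.0  # earliest time the next request may start

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch `url`."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_time(self, now):
        """Seconds to wait before the next request, reserving that slot."""
        ready = max(now, self.next_slot)
        self.next_slot = ready + self.delay
        return ready - now


# Hypothetical robots.txt for the site being crawled.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

policy = SitePoliteness(ROBOTS_TXT)
print(policy.allowed("http://example.com/public/a.html"))
print(policy.allowed("http://example.com/private/a.html"))
```

Passing the current time into `wait_time` (for example from `time.monotonic()`) keeps the class easy to test; a crawler would sleep for the returned number of seconds before each request.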
    • Utilities of web crawler
      • Gather pages from the web.
      • Support a search engine.
      • Perform data mining.
      • Improve sites (web site analysis).
    • Conclusion
      • The number of extracted documents was reduced.
      • Links were analyzed, and a great many irrelevant web pages were deleted.
      • Crawling time was reduced.
      • Once a great many irrelevant web pages were deleted, the crawling load was reduced.
    • References
      • Rodrigo Campos, Oscar Rojas, Mauricio Marín, Marcelo Mendoza, “Distributed Ontology-Driven Focused Crawling”, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 10666192/12 © 2012 IEEE. DOI 10.1109/PDP.2013.23.
      • Sunita Rawat, D. R. Patil, “Efficient Focused Crawling based on Best First Search”. 978-1-4673-4529-3/12 © 2012 IEEE.
      • Manvi, Ashutosh Dixit, Komal Kumar Bhatia, “Design of an Ontology based Adaptive Crawler for Hidden Web”. 978-0-7695-4958-3/13 © 2013 IEEE. DOI 10.1109/CSNT.2013.140.
      • Xiaolin Zheng, Tao Zhou, Zukun Yu, Deren Chen, “URL Rule Based Focused Crawlers”, IEEE International Conference on e-Business Engineering. 978-07695-3395-7/08 © 2008 IEEE. DOI 10.1109/ICEBE.2008.61.
      • Yuekui Yang, Yajun Du, Yufeng Hai, Zhaoqiong Gao, “A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree”, 2009 Asia-Pacific Conference on Information Processing.