“Web crawler”

Published in: Education, Technology
Transcript

  • 1. “Web Crawler” by Ranjit R. Banshpal
  • 2. Overview: Objective, Introduction, Problem Statement, Architecture of Web Crawler, Approaches for the Crawling Process, Policies Used, Utilities of Web Crawler, Conclusion, Scope for the Future, References.
  • 3. Objective: The web is a hypertext system serving an enormous number of Internet users and accessible web pages. Web crawlers are among the most crucial components of search engines, and their optimization would have a great effect on improving searching efficiency.
  • 4. Introduction: A web crawler is a program that exploits the graph structure of the web to move from page to page, browsing the World Wide Web in a methodical, automated manner. Crawlers are among the most crucial components of search engines, where they improve searching efficiency.
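The page-to-page traversal described above amounts to a breadth-first walk over the web graph. As a minimal sketch, the `fetch_links` callback and the toy graph below are illustrative stand-ins for what a real crawler would do (download the page over HTTP and extract its `<a>` links):

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal of the web graph from a seed URL.

    fetch_links(url) must return the list of URLs linked from `url`;
    in a real crawler it would download and parse the page.
    """
    frontier = deque([seed])   # URLs waiting to be visited
    visited = set()            # URLs already crawled
    order = []                 # pages in the order they were crawled
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order

# Toy web graph standing in for real HTTP fetches.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl("a", lambda u: toy_web.get(u, [])))  # → ['a', 'b', 'c', 'd']
```

The `visited` set is what keeps the crawler from looping forever on cycles such as a → c → a.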
  • 5. Literature survey, paper 1: “Distributed Ontology-Driven Focused Crawling.” Covers vertical search technologies, focused crawling, and ontological structure. The proposed web crawler architecture uses URL scoring functions, a scheduler, a DOM parser, and a page ranker to download web pages.
  • 6. Literature survey, paper 2: “Efficient Focused Crawling based on Best First Search.” A focused crawler seeks out pages relevant to given keywords and analyzes the links that are likely to be most relevant; the “best first” search strategy characterizes such a focused crawler. It has two main components: (i) finding specific web pages, and (ii) proceeding from seed pages.
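A best-first crawl keeps the frontier in a priority queue instead of a FIFO queue, always expanding the most promising known URL. The sketch below is a generic illustration of that strategy, not the paper's exact algorithm; the toy site and the URL-keyword scoring function are assumptions for the example:

```python
import heapq

def best_first_crawl(seed, fetch_links, score, max_pages=10):
    """Best-first focused crawl: always expand the most promising URL.

    fetch_links(url) returns a page's outgoing links; score(url) is a
    relevance estimate (higher = more relevant). heapq is a min-heap,
    so priorities are stored negated.
    """
    frontier = [(0.0, seed)]
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return order

# Toy site: URLs containing "crawler" are judged on-topic.
pages = {
    "seed": ["news", "crawler-tutorial"],
    "news": ["sports"],
    "crawler-tutorial": ["crawler-faq"],
}
score = lambda url: 1.0 if "crawler" in url else 0.0
print(best_first_crawl("seed", lambda u: pages.get(u, []), score))
```

Note how the on-topic branch (crawler-tutorial, crawler-faq) is fully explored before the off-topic one, even though "news" was discovered first.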
  • 7. Literature survey, paper 3: “Design of an Ontology based Adaptive Crawler for Hidden Web.” Addresses the deep web (also called the invisible or hidden web), accessing it using an ontology, and downloading relevant hidden web pages.
  • 8. Literature survey, paper 4: “URL Rule Based Focused Crawlers.” Uses URL regular expressions to retrieve topic-specific pages. To search topic-specific information, the crawler needs to crawl only a small part of the data and therefore uses fewer server resources.
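A URL-rule filter of this kind can be as simple as a list of regular expressions applied before a URL ever enters the frontier. The patterns and URLs below are hypothetical examples, not rules from the paper:

```python
import re

# Hypothetical topic rules: only URLs matching one of these patterns
# are enqueued, so most of the site is never downloaded.
RULES = [
    re.compile(r"/category/web-crawling/"),
    re.compile(r"/tag/search-engine/"),
]

def is_topic_url(url):
    """Return True if any rule's regular expression matches the URL."""
    return any(rule.search(url) for rule in RULES)

urls = [
    "http://example.com/category/web-crawling/intro",
    "http://example.com/about",
    "http://example.com/tag/search-engine/faq",
]
print([u for u in urls if is_topic_url(u)])
```

Filtering on the URL alone is cheap because it happens before any download, which is exactly why it saves server resources.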
  • 9. Literature survey, paper 5: “A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree.” Represents data as a hierarchical DOM-tree (the structural representation of an HTML page) and uses the concept of ontology.
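To make the DOM-tree idea concrete, the sketch below tracks the open-tag stack while parsing, recording the tree path of each text node. It uses only the standard-library `html.parser` and is a crude stand-in for a full DOM parser (it assumes well-nested HTML):

```python
from html.parser import HTMLParser

class DomPaths(HTMLParser):
    """Record the DOM-tree path (e.g. 'html/body/h1') of every text node."""
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.paths = []    # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.pop()   # simplistic; assumes well-nested HTML

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))

p = DomPaths()
p.feed("<html><body><h1>Crawlers</h1><p>A focused crawler.</p></body></html>")
print(p.paths)
```

The resulting paths give each piece of text a position in the page hierarchy, which is the kind of structural signal a topic-specific crawler can score against.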
  • 10. Problem statement: The most prominent challenge for current web crawlers is the selection of important pages for downloading, since no crawler can download all pages from the web. It is therefore important for the crawler to select pages and to visit “important” pages first, by prioritizing the URLs in the queue properly, while minimizing the load on the crawled websites through parallelization of the crawling process.
  • 11. Functional diagram of web crawler (figure).
  • 12. Approaches for the crawling process: Broadly, there are two types of crawler: those that follow an a priori defined path, and those that do not follow a specific path.
  • 13. Policies used: a selection policy that states which pages to download; a politeness policy that states how to avoid overloading web sites; and a parallelization policy that states how to coordinate a distributed web crawl.
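The politeness policy is typically implemented as two checks before each request: obey the site's robots.txt rules, and throttle requests to the same host. The sketch below uses the standard-library `urllib.robotparser`; the robots.txt content and the one-second delay are made-up examples:

```python
import time
from urllib import robotparser

# Example robots.txt rules (a made-up site policy).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def polite_allowed(url, last_hit, delay=1.0):
    """Check robots rules, then enforce a fixed per-host crawl delay."""
    if not rp.can_fetch("*", url):
        return False               # disallowed by robots.txt
    wait = delay - (time.monotonic() - last_hit)
    if wait > 0:
        time.sleep(wait)           # politeness: throttle the request rate
    return True

print(rp.can_fetch("*", "http://example.com/private/x"))  # False
print(rp.can_fetch("*", "http://example.com/public"))     # True
```

Real crawlers usually track `last_hit` per host and also honour any Crawl-delay directive the site publishes.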
  • 14. Utilities of web crawlers: gathering pages from the web, supporting a search engine, performing data mining, and improving sites (web site analysis).
  • 15. Conclusion: The number of extracted documents was reduced. Links were analyzed and a great deal of irrelevant web pages were deleted; as a result, both crawling time and crawling load are reduced.
  • 16. References
    [1] R. Campos, O. Rojas, M. Marín, M. Mendoza, “Distributed Ontology-Driven Focused Crawling,” 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, IEEE, 2013. DOI: 10.1109/PDP.2013.23.
    [2] S. Rawat, D. R. Patil, “Efficient Focused Crawling based on Best First Search,” IEEE, 2012.
    [3] Manvi, A. Dixit, K. K. Bhatia, “Design of an Ontology based Adaptive Crawler for Hidden Web,” IEEE, 2013. DOI: 10.1109/CSNT.2013.140.
    [4] X. Zheng, T. Zhou, Z. Yu, D. Chen, “URL Rule Based Focused Crawlers,” IEEE International Conference on e-Business Engineering, 2008. DOI: 10.1109/ICEBE.2008.61.
    [5] Y. Yang, Y. Du, Y. Hai, Z. Gao, “A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree,” Asia-Pacific Conference on Information Processing, 2009.