  1. Web Crawling. Submitted by: Govind Raj, Registration no: 1001227464, Information Technology
  2. Beginning: A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository.
  3. Web Crawling? What is Web crawling? What are the uses of Web crawling? What types of crawlers exist?
  4. Web Crawling: A Web crawler (also known as a Web spider, Web robot, or, especially in the FOAF community, a Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.
  5. What crawlers are: Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web. The role of crawlers is to collect Web content.
  6. Basic crawler operation: begin with known "seed" pages; fetch and parse them; extract the URLs they point to; place the extracted URLs on a queue; fetch each URL on the queue and repeat.
  7. Traditional Web Crawler (architecture diagram)
  8. Beginning with a Web crawler, the basic algorithm: REPEAT { pick up the next URL; connect to the server; GET the URL; when the page arrives, get its links (optionally do other stuff) }
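The loop on this slide can be sketched in Python using only the standard library. This is a minimal illustration, not a production crawler: the `crawl` and `LinkParser` names are invented here, the fetch function is injected so the sketch stays testable, and a real crawler would also need robots.txt handling, politeness delays, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns an HTML string or None."""
    queue = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid re-crawling
    pages = {}                # url -> page content (the "local repository")
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # pick up the next URL
        html = fetch(url)              # connect to the server, GET the URL
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)              # when the page arrives, get its links
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)     # place extracted URLs on the queue
    return pages
```

In a real deployment `fetch` would wrap an HTTP client; injecting it keeps the loop itself free of network concerns.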
  9. 9. Uses for crawling:Complete web search engine Search Engine = Crawler + Indexer/Searcher /(Lucene) + GUI ∗ Find stuff ∗ Gather stuff ∗ Check stuff
  10. Several types of crawlers: Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit. Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness. Focused crawlers attempt to crawl pages pertaining to some topic or theme, while minimizing the number of off-topic pages that are collected.
  11. URL normalization: Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner.
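One possible normalization scheme can be sketched with Python's `urllib.parse`. The specific rules chosen here (lowercasing the scheme and host, dropping default ports, removing fragments, ensuring a non-empty path) are common choices for illustration, not the single canonical algorithm:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so syntactic variants map to one canonical form:
    lowercase scheme/host, drop default ports, strip fragments, keep query."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only if it is not the scheme's default.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path or "/"          # empty path becomes "/"
    # Fragment ("" below) is dropped: it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With this, `HTTP://Example.COM:80/a#frag` and `http://example.com/a` normalize to the same string, so the crawler's seen-set treats them as one resource.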
  12. The challenges of Web crawling: There are three important characteristics of the Web that make crawling it very difficult: its large volume; its fast rate of change; dynamic page generation.
  13. Examples of Web crawlers: Yahoo! Slurp: Yahoo Search's crawler. Msnbot: Microsoft's Bing web crawler. Googlebot: Google's web crawler. WebCrawler: used to build the first publicly available full-text index of a subset of the Web. World Wide Web Worm: used to build a simple index of document titles and URLs. WebFountain: a distributed, modular crawler written in C++. Slug: a Semantic Web crawler.
  14. Web 3.0 Crawling: Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in the Semantic Web and Website Parse Template concepts. Web 3.0 crawling and indexing technologies will be based on clever human-machine associations.
  15. Distributed Web Crawling: a distributed computing technique whereby search engines employ many computers to index the Internet via web crawling. The idea is to spread the required computation and bandwidth across many computers and networks. Types of distributed web crawling: 1. Dynamic assignment; 2. Static assignment.
  16. Dynamic Assignment: With this approach, a central server assigns new URLs to different crawlers dynamically, which allows the central server to balance the load of each crawler. Configurations of crawling architectures with dynamic assignment include: a small crawler configuration, with a central DNS resolver, central per-site queues, and distributed downloaders; and a large crawler configuration, in which the DNS resolver and the queues are also distributed.
  17. Static Assignment: Here a fixed rule, stated from the beginning of the crawl, defines how to assign new URLs to the crawlers. A hashing function can be used to transform URLs into a number that corresponds to the index of the corresponding crawling process. To reduce the overhead of exchanging URLs between crawling processes when links cross from one website to another, the exchange should be done in batches.
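The hashing rule for static assignment can be illustrated in a few lines of Python. Hashing the host (rather than the full URL) is one common design choice, assumed here because it keeps all URLs from one site on the same crawler process, which minimizes the cross-process URL exchange the slide describes; the `assign_crawler` name is illustrative.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Static assignment: map a URL's host to a crawler index in
    [0, num_crawlers). Hashing the host keeps a whole site on one
    process, so intra-site links need no inter-process exchange."""
    host = urlsplit(url).hostname or ""
    # A stable hash (unlike Python's randomized hash()) so every
    # process computes the same assignment.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Only links that point to a different host produce a URL that must be forwarded to another crawler, and those forwards can be buffered and sent in batches.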
  18. Conclusion: Web crawlers are an important component of search engines. High-performance crawling processes are basic building blocks of various Web services. It is not a trivial matter to set up such systems: 1. the data manipulated by these crawlers covers a wide area; 2. it is crucial to preserve a good balance between random-access memory and disk accesses.
  19. References: http://en.wikipedia.org/wiki/Web_crawling • www.cs.cmu.edu/~spandey • www.cs.odu.edu/~fmccown/research/lazy/crawling-policiesht06.ppt • http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ • www.grub.org • www.filesland.com/companies/Shettysoft-com/web-crawler.html • www.ciw.cl/recursos/webCrawling.pdf • www.openldap.org/conf/odd-wien-2003/peter.pdf
  20. Thank You For Your Attention