Webcrawler
Transcript

  • 1. Web Crawling. Submitted by: Govind Raj, Registration no.: 1001227464, Information Technology.
  • 2. Beginning: A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository.
  • 3. Web crawling: What is Web crawling? What are the uses of Web crawling? What are the types of crawling?
  • 4. Web Crawling: A Web crawler (also known as a Web spider, Web robot, or, especially in the FOAF community, a Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.
  • 5. What crawlers are: Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web. The role of crawlers is to collect Web content.
  • 6. Basic crawler operation: begin with known "seed" pages; fetch and parse them; extract the URLs they point to; place the extracted URLs on a queue; fetch each URL on the queue and repeat. (A minimal code sketch follows.)
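A minimal Python sketch of this seed-and-queue loop. The use of the requests and BeautifulSoup libraries and the max_pages cap are illustrative assumptions, not part of the original slides.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    queue = deque(seeds)              # frontier: URLs waiting to be fetched
    seen = set(seeds)                 # never enqueue the same URL twice
    while queue and len(seen) < max_pages:
        url = queue.popleft()         # pick up the next URL
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                  # skip unreachable pages and move on
        soup = BeautifulSoup(page.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)              # place extracted URL on the queue
    return seen
```

The `seen` set is what keeps the loop from revisiting a URL; the slide's "repeat" step is the `while` loop over the queue.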
  • 7. Traditional Web Crawler (architecture diagram).
  • 8. Beginning with a Web crawler, the basic algorithm is: { Pick up the next URL; Connect to the server; GET the URL; When the page arrives, get its links (optionally do other stuff); REPEAT }
  • 9. Uses for crawling: building a complete web search engine (Search Engine = Crawler + Indexer/Searcher (e.g. Lucene) + GUI); finding stuff; gathering stuff; checking stuff.
  • 10. Several types of crawlers: Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit. Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness. Focused crawlers attempt to crawl pages pertaining to some topic or theme, while minimizing the number of off-topic pages that are collected.
  • 11. URL normalization: Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner. (A sketch follows.)
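A minimal sketch of a few common normalization rules (lowercasing the scheme and host, dropping default ports and fragments, resolving dot-segments), using Python's standard urllib.parse. The rule set chosen here is an illustrative assumption; real crawlers apply many more rules.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop the default port for the scheme (80 for http, 443 for https).
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # Resolve "." and ".." segments; treat an empty path as the root.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if path == ".":
        path = "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # "" drops the fragment

# Both spellings map to the same canonical form:
assert normalize("HTTP://Example.com:80/a/../b#top") == normalize("http://example.com/b")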
  • 12. The challenges of Web crawling: There are three important characteristics of the Web that make crawling it very difficult: its large volume, its fast rate of change, and dynamic page generation.
  • 13. Examples of Web crawlers: Yahoo! Slurp (Yahoo Search crawler); Msnbot (Microsoft's Bing web crawler); Googlebot (Google's web crawler); WebCrawler (used to build the first publicly available full-text index of a subset of the Web); World Wide Web Worm (used to build a simple index of document titles and URLs); WebFountain (distributed, modular crawler written in C++); Slug (semantic web crawler).
  • 14. Web 3.0 crawling: Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in the Semantic Web and Website Parse Template concepts. Web 3.0 crawling and indexing technologies will be based on clever human-machine associations.
  • 15. Distributed Web crawling: A distributed computing technique whereby search engines employ many computers to index the Internet via web crawling. The idea is to spread the required computation and bandwidth across many computers and networks. There are two types of distributed web crawling: 1. Dynamic assignment; 2. Static assignment.
  • 16. Dynamic assignment: With this approach, a central server assigns new URLs to different crawlers dynamically, which allows the central server to balance the load of each crawler. Crawling architectures with dynamic assignment come in two configurations: a small crawler configuration, in which there is a central DNS resolver and central queues per Web site, with distributed downloaders; and a large crawler configuration, in which the DNS resolver and the queues are also distributed. (A sketch follows.)
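A minimal sketch of the dynamic-assignment idea: a central dispatcher hands the next URL to whichever crawler asks, so load balances itself. The Dispatcher class and its method names are hypothetical, for illustration only.

```python
from collections import deque

class Dispatcher:
    """Central server that assigns new URLs to crawlers on demand (hypothetical)."""

    def __init__(self, seeds):
        self.frontier = deque(seeds)   # central queue of URLs to crawl
        self.assigned = {}             # url -> crawler id, for bookkeeping

    def next_url(self, crawler_id):
        """Called by an idle crawler; returns its next URL, or None if done."""
        if not self.frontier:
            return None
        url = self.frontier.popleft()
        self.assigned[url] = crawler_id  # idle crawlers ask more often, balancing load
        return url

    def report_links(self, urls):
        """Crawlers push newly discovered links back to the central queue."""
        for url in urls:
            if url not in self.assigned and url not in self.frontier:
                self.frontier.append(url)

dispatcher = Dispatcher(["http://example.com/"])
url = dispatcher.next_url(crawler_id=0)    # crawler 0 pulls work when idle
```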
  • 17. Static assignment: Here a fixed rule, stated at the beginning of the crawl, defines how to assign new URLs to the crawlers. A hashing function can be used to transform URLs into a number that corresponds to the index of the responsible crawling process. To reduce the overhead due to the exchange of URLs between crawling processes when links cross from one website to another, the exchange should be done in batches. (A sketch follows.)
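A minimal sketch of static assignment by hashing. Hashing the URL's host rather than the full URL is an assumption made here so that all pages of a site stay with one crawler; the `outgoing` dictionary illustrates the batched exchange of cross-site links the slide mentions.

```python
import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 4   # number of crawling processes (illustrative)

def owner(url):
    """Fixed rule: hash the host to the index of the responsible crawler."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

# Batch cross-site links per destination crawler to cut exchange overhead.
outgoing = {}                                  # crawler id -> list of URLs
for link in ["http://a.com/x", "http://b.org/y", "http://a.com/z"]:
    outgoing.setdefault(owner(link), []).append(link)
# Each list in `outgoing` is sent to its crawler in one batched exchange.
```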
  • 18. Conclusion: Web crawlers are an important component of search engines. High-performance Web crawling processes are basic components of various Web services. Setting up such systems is not trivial: 1. the data manipulated by these crawlers cover a wide area; 2. it is crucial to preserve a good balance between random-access memory and disk accesses.
  • 19. References: http://en.wikipedia.org/wiki/Web_crawling; www.cs.cmu.edu/~spandey; www.cs.odu.edu/~fmccown/research/lazy/crawling-policiesht06.ppt; http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/; www.grub.org; www.filesland.com/companies/Shettysoft-com/web-crawler.html; www.ciw.cl/recursos/webCrawling.pdf; www.openldap.org/conf/odd-wien-2003/peter.pdf
  • 20. Thank You For Your Attention
