Introduction to Web Crawlers
Presented by Rehab
  1. Introduction to Web Crawlers. Presented by Rehab.
  2. Introduction
     • Definition: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
     • Uses: gathering pages from the Web, supporting a search engine, performing data mining, and so on.
  3. Why Crawlers?
     • The Internet has a wide expanse of information.
     • Finding relevant information requires an efficient mechanism.
     • Web crawlers provide that mechanism to the search engine.
  4. Features of a Crawler
     • Must provide:
       • Robustness: resilience to spider traps.
       • Politeness: respect for rules about which pages can be crawled and which cannot (see the robots.txt sketch after this slide).
     • Should provide:
       • Distribution
       • Scalability
       • Performance and efficiency
       • Quality
       • Freshness
       • Extensibility
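A minimal sketch of the politeness check using Python's standard urllib.robotparser; the host, the page path, and the "MyCrawler" user-agent string are illustrative placeholders, not from the slides:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt once per host
# (hypothetical host: example.com).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch a given page.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```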
  5. How Does a Web Crawler Work?
  6. Architecture of a Crawler
     [Diagram] The crawl pipeline: the URL Frontier feeds Fetch, which resolves hostnames via DNS and retrieves pages from the WWW; fetched pages flow through Parse, Content Seen?, URL Filter, and Dup URL Elim before extracted URLs re-enter the frontier. Supporting stores: Doc Fingerprints, robots templates, and the URL set.
  7. Architecture of a Crawler
     • URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL frontier, and the crawler begins by taking a URL from the seed set.
     • DNS: domain name service resolution; looks up the IP address for a domain name.
     • Fetch: generally uses the HTTP protocol to fetch the page at the URL.
     • Parse: the fetched page is parsed; text (images, videos, etc.) and links are extracted. (A minimal fetch-parse loop is sketched below.)
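The frontier-fetch-parse cycle described above can be sketched with the Python standard library alone; the seed URL is a hypothetical placeholder, and the dedup and filtering stages of the next slide are deliberately left out here:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

frontier = deque(["https://example.com/"])  # hypothetical seed set

while frontier:
    url = frontier.popleft()            # take a URL from the frontier
    with urlopen(url) as resp:          # DNS lookup + HTTP fetch
        html = resp.read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(html)                # parse: extract links
    for link in extractor.links:
        # Resolve relative links; filtering and dedup (next slide) omitted.
        frontier.append(urljoin(url, link))
```

Without the Content Seen? and Dup URL Elim stages, this loop would revisit the same pages indefinitely, which is exactly why the architecture includes them.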
  8. Architecture of a Crawler
     • Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
     • URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g., by robots.txt). Extracted URLs must also be normalized: a page such as en.wikipedia.org/wiki/Main_Page contains relative links like <a href="/wiki/Wikipedia:General_disclaimer">Disclaimers</a>, which must be expanded to absolute URLs.
     • Dup URL Elim: each URL is checked against the URL set so duplicates are eliminated. (Both tests are sketched below.)
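A rough sketch of the Content Seen? and Dup URL Elim tests: the SHA-1 hash stands in for whatever fingerprint scheme a real crawler would use, and the Wikipedia URLs mirror the slide's example:

```python
import hashlib
from urllib.parse import urldefrag, urljoin

seen_fingerprints = set()   # backs the "Content Seen?" test
seen_urls = set()           # backs the "Dup URL Elim" test

def content_fingerprint(page_text: str) -> str:
    # Hash the page body so identical content at different URLs is caught.
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()

def normalize(base_url: str, href: str) -> str:
    # Expand a relative link against the fetched page; drop any #fragment.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

# The slide's example: a relative "Disclaimers" link found on Main_Page.
url = normalize("https://en.wikipedia.org/wiki/Main_Page",
                "/wiki/Wikipedia:General_disclaimer")
if url not in seen_urls:
    seen_urls.add(url)      # admit each normalized URL to the frontier once
```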
  9. Architecture of a Crawler
     Other issues:
     • Housekeeping tasks:
       • Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds).
       • Checkpointing: a snapshot of the crawler's state (e.g., the URL frontier) is committed to disk (every few hours).
     • Priority of URLs in the URL frontier, based on:
       • Change rate.
       • Quality.
     • Politeness: avoid repeated fetch requests to a host within a short time span; otherwise the crawler may be blocked. (A sketch follows below.)
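Politeness and checkpointing can both be sketched in a few lines; the two-second delay and the state-file name are arbitrary choices for illustration:

```python
import json
import time

MIN_DELAY = 2.0     # illustrative gap (seconds) between hits to one host
last_fetch = {}     # host -> time of the most recent request to that host

def polite_wait(host: str) -> None:
    """Block until MIN_DELAY has elapsed since the last fetch from host."""
    elapsed = time.monotonic() - last_fetch.get(host, float("-inf"))
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_fetch[host] = time.monotonic()

def checkpoint(frontier, path="crawler_state.json"):
    """Commit a snapshot of the crawler's state (the URL frontier) to disk."""
    with open(path, "w") as f:
        json.dump(list(frontier), f)
```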
  10. Thank You for Your Attention