WebCrawler
Presentation Transcript

  • Introduction to Web Crawlers. Presented by Rehab
  • INTRODUCTION  Definition: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)  Utilities: gather pages from the Web to support a search engine, perform data mining, and so on.
  • WHY CRAWLERS? The Internet has a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that mechanism to the search engine.
  • FEATURES OF A CRAWLER  Must provide:  Robustness: resilience to spider traps (sites that generate endless pages)  Politeness: respect policies about which pages can be crawled and which cannot (see the robots.txt sketch below)  Should provide:  Distributed operation  Scalability  Performance and efficiency  Quality  Freshness  Extensibility
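
For the politeness requirement, crawlers consult a site's robots.txt before fetching. Below is a minimal sketch using Python's standard-library robotparser; the user-agent name "MyCrawler/1.0" and the example URLs are assumptions for illustration, not from the slides.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()   # fetch and parse the site's robots.txt

    # Ask whether our (hypothetical) crawler may fetch a given page.
    if rp.can_fetch("MyCrawler/1.0", "https://en.wikipedia.org/wiki/Main_Page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")
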
  • HOW DOES A WEB CRAWLER WORK?
  • ARCHITECTURE OF A CRAWLER [Architecture diagram: pages flow from the WWW through DNS resolution, Fetch, Parse, a Content Seen? test, a URL Filter, and Dup URL Elim, with surviving URLs fed back into the URL Frontier; supporting data structures include the Doc Fingerprint store, robots templates, and the URL set]
  • ARCHITECTURE OF A CRAWLER URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set. DNS: Domain Name System resolution; look up the IP address for a domain name. Fetch: generally uses the HTTP protocol to fetch the URL. Parse: the page is parsed; text (along with images, videos, etc.) and links are extracted. (A minimal sketch of this loop follows below.)
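
The fetch/parse loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, with a plain FIFO queue standing in for the URL Frontier; the seed URL is an assumption, and DNS caching, robots.txt checks, and content fingerprinting (covered on the next slides) are omitted.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect href attributes from <a> tags while parsing a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=10):
        frontier = deque(seed_urls)        # URL Frontier, initialized with the seed set
        seen_urls = set(seed_urls)
        while frontier and max_pages > 0:
            url = frontier.popleft()       # take the next URL from the frontier
            try:                           # Fetch: retrieve the page over HTTP
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue                   # fetch failed; move on to the next URL
            parser = LinkExtractor()
            parser.feed(html)              # Parse: extract links from the page text
            for link in parser.links:
                absolute = urljoin(url, link)   # resolve relative links
                if absolute not in seen_urls:
                    seen_urls.add(absolute)
                    frontier.append(absolute)   # new URLs go back into the frontier
            max_pages -= 1

    crawl(["https://example.com/"])
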
  • ARCHITECTURE OF A CRAWLER Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page. URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g., by robots.txt). URLs should also be normalized: a relative link such as <a href="disclaimer">Disclaimers</a> found on en.wikipedia.org/wiki/MainPage must be expanded to an absolute URL. Dup URL Elim: the URL is checked for duplicate elimination.
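
The slide leaves the fingerprint and normalization details open. One simple, assumed choice is a SHA-256 digest of the page text for the Content Seen? test, and urljoin for expanding relative links; production systems often use near-duplicate fingerprints such as shingles or SimHash instead.

    import hashlib
    from urllib.parse import urljoin

    seen_fingerprints = set()

    def content_seen(page_text):
        """Content Seen?: return True if identical content was already fetched."""
        fp = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
        if fp in seen_fingerprints:
            return True
        seen_fingerprints.add(fp)
        return False

    # URL normalization: expand a relative link against the page it appears on.
    base = "https://en.wikipedia.org/wiki/MainPage"   # page URL from the slide
    print(urljoin(base, "disclaimer"))   # -> https://en.wikipedia.org/wiki/disclaimer
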
  • ARCHITECTURE OF A CRAWLER Other issues:  Housekeeping tasks:  Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds)  Checkpointing: a snapshot of the crawler's state (the URL frontier) is committed to disk (every few hours)  Priority of URLs in the URL frontier:  Change rate  Quality  Politeness:  Avoid repeated fetch requests to a host within a short time span  Otherwise the crawler may be blocked (a sketch follows below)
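
Per-host politeness and checkpointing can be sketched as below; the two-second delay, the pickle format, and the file name are illustrative assumptions, not prescribed by the slides.

    import pickle
    import time
    from urllib.parse import urlparse

    MIN_DELAY = 2.0        # seconds between fetches to the same host (assumed value)
    last_fetch_time = {}   # host -> time of the most recent fetch

    def wait_politely(url):
        """Sleep if the URL's host was contacted too recently."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - last_fetch_time.get(host, 0.0)
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)
        last_fetch_time[host] = time.monotonic()

    def checkpoint(frontier, path="crawler_state.pkl"):
        """Commit a snapshot of the crawler's state (the URL frontier) to disk."""
        with open(path, "wb") as f:
            pickle.dump(list(frontier), f)
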
  • Thank You For Your Attention