Information Retrieval 07

1. Information Retrieval

2. Web crawling is the process by which we gather pages from the Web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. A web crawler is sometimes referred to as a spider.

3. Begin with known “seed” URLs; fetch and parse them; extract the URLs they point to; place the extracted URLs on a queue; fetch each URL on the queue and repeat.
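
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are assumptions made for the example, and `fetch` is passed in as a parameter so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch(url)` is a hypothetical callable returning the page's HTML
    as a string, or None if the page could not be retrieved.
    Returns a dict mapping each fetched URL to its outgoing links.
    """
    frontier = deque(seeds)   # queue of URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue          # skip pages that failed to download
        parser = LinkExtractor()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _ = urldefrag(urljoin(url, href))
            links.append(absolute)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        pages[url] = links
    return pages
```

Passing a dictionary's `get` method as `fetch` is enough to try the loop on a small in-memory site; a real crawler would substitute an HTTP client and add error handling.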

4. Features a crawler must provide: robustness, politeness.
   Features a crawler should provide: distributed operation, scalability, performance and efficiency, quality, freshness, extensibility.
   Book (Chapter 20, page 443)
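
Of the features above, politeness is the easiest to illustrate in code. The sketch below assumes that politeness means two things: honouring a host's robots.txt rules and leaving a minimum delay between requests to the same host. The class `PolitenessGate`, its methods, and the default delay are hypothetical names and values chosen for the example; only `urllib.robotparser.RobotFileParser` is a real standard-library API. Treating a host with no known rules as allowed is also an assumption.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Tracks robots.txt rules and per-host request spacing."""

    def __init__(self, user_agent, min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay   # seconds between hits on one host
        self.clock = clock           # injectable so it can be tested
        self.rules = {}              # host -> RobotFileParser
        self.last_fetch = {}         # host -> time of last request

    def set_robots(self, host, robots_txt):
        """Register the text of a host's robots.txt file."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True if robots.txt permits the URL (or no rules are known)."""
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be hit again."""
        last = self.last_fetch.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, url):
        """Note that a request to this URL's host was just made."""
        self.last_fetch[urlsplit(url).netloc] = self.clock()
```

A crawl loop would call `allowed(url)` before fetching, sleep for `wait_time(url)` seconds, and call `record_fetch(url)` afterwards.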
