Web Crawlers


Understanding the basics of how a web crawler works.

Published in: Technology


  1. WEB CRAWLERS. Presented at: Indies Services
  2. Contents
       • What is a web crawler?
       • How does it work?
       • Why use it?
       • Challenges faced
       • Coding crawlers
       • Possible uses for us
  3. What are crawlers?
       • A crawler is a computer program.
       • Also known as ants, automatic indexers, bots, web spiders, web robots, or web scutters.
       • It searches the web for pages and for the links on those pages.
       • The term covers any type of automated search or listing.
       • A crawler identifies itself through the User-Agent header of its HTTP request.
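The identification point above can be illustrated with a minimal sketch using Python's standard `urllib.request`. The crawler name and contact URL are hypothetical and chosen only for illustration; the request is built but not sent.

```python
import urllib.request

# Hypothetical crawler name and contact URL, for illustration only.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.com/bot-info)"


def build_request(url: str) -> urllib.request.Request:
    """Return a request object that announces the crawler via User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Site owners can then recognise the crawler in their logs and, if they wish, address it by name in robots.txt.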
  4. How it works
  5. Basic algorithm for a crawler
       1. Remove a URL from the unvisited URL list.
       2. Determine the IP address of its host name.
       3. Download the corresponding document.
       4. Extract any links contained in it.
       5. If an extracted URL is new, add it to the list of unvisited URLs.
       6. Process the downloaded document.
       7. Go back to step 1.
  6. The process (flowchart)
       • Initialize the URL list with starting URLs (seeds).
       • Crawling loop: is the list over? [Yes]: stop. [No]: pick a URL from the URL list.
       • [No more URL]: stop. [URL]: parse the page.
       • [New URL]: add the URL to the URL list, then continue the loop.
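The basic algorithm on slide 5 and the loop on slide 6 can be sketched in Python roughly as follows. This is a minimal illustration, not the presenters' code: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are assumptions, "processing" a page is reduced to storing it, and host name resolution and downloading are delegated to a caller-supplied `fetch` function so the sketch runs against an in-memory stand-in for the web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Crawl breadth-first from the seeds; fetch(url) returns HTML or None."""
    unvisited = deque(seeds)           # the unvisited URL list
    seen = set(seeds)                  # every URL ever queued
    pages = {}
    while unvisited and len(pages) < max_pages:
        url = unvisited.popleft()      # step 1: remove a URL from the list
        html = fetch(url)              # steps 2-3: resolve host and download
        if html is None:               # unreachable or missing page
            continue
        parser = LinkExtractor()
        parser.feed(html)              # step 4: extract links
        for href in parser.links:
            link = urljoin(url, href)  # make relative links absolute
            if link not in seen:       # step 5: queue only new URLs
                seen.add(link)
                unvisited.append(link)
        pages[url] = html              # step 6: "process" (here: store)
    return pages                       # step 7 is the while loop itself


# Hypothetical three-page site used instead of real network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

Passing `fetch` as a parameter keeps the loop independent of how pages are downloaded; a real crawler would supply a function that performs the HTTP request.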
  7. Uses of crawlers
       • Search engines: list out URLs and keep page information up to date.
       • Manipulating the web graph.
  8. Uses of crawlers
       • Automated maintenance tasks: checking for broken internal links and validating HTML code.
  9. Uses of crawlers
       • Linguistics: textual search (which words are common today).
       • Market research: determining trends.
       • Getting certain types of information from the web: email addresses (spamming), images (special image searches), and meta tag information.
  10. Challenges faced
       • What pages should it download? The web is very large, so downloads must be prioritized.
       • How to determine useful and unique links? URLs with GET parameters (internal links) and URL normalization.
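URL normalization, mentioned above, means mapping different spellings of the same address to one canonical form so the crawler does not visit a page twice. The sketch below uses Python's standard `urllib.parse`; the function name `normalize_url` and the specific rules (lowercase scheme and host, drop default ports and fragments, sort GET parameters, drop any user information) are assumptions for illustration. Sorting parameters in particular is not safe for every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is None or DEFAULT_PORTS.get(scheme) == parts.port:
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"           # an empty path means the root page
    # Sort GET parameters so their order does not create "new" URLs.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped


print(normalize_url("HTTP://Example.COM:80/page?b=2&a=1#top"))
print(normalize_url("http://example.com/page?a=1&b=2"))
```

Both calls print the same string, so a crawler that stores normalized URLs in its "seen" set treats the two spellings as one page.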
  11. Challenges: crawling policies
       • Selection policy (download the most relevant pages)
       • Re-visit policy (when to check for changes in a page)
       • Politeness policy (robots exclusion / robots.txt protocol)
       • Parallelization policy (coordinating parallel crawlers when listing new URLs)
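As one illustration of the politeness policy, Python's standard library ships `urllib.robotparser`, which reads robots.txt rules. The sketch below parses a hypothetical robots.txt from a string rather than downloading it; the rules, the `ExampleCrawler` agent name, and the helper `make_robots_checker` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_robots_checker(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


AGENT = "ExampleCrawler"
robots = make_robots_checker(ROBOTS_TXT)

allowed = robots.can_fetch(AGENT, "http://example.com/index.html")
blocked = robots.can_fetch(AGENT, "http://example.com/private/data.html")
delay = robots.crawl_delay(AGENT)      # seconds to wait between requests
print(allowed, blocked, delay)
```

A polite crawler would call `can_fetch` before every download and sleep for the crawl delay between requests to the same host.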
  12. Coding crawlers
       • Common languages: PHP, Python, Perl, Java, or any other server-side scripting language.
       • Logic used:
           • Get the URLs.
           • Search for unique URLs in the list.
           • Download the page, or get information from a particular page.
           • Process that information.
  13. Possible uses for us
       • Maintaining coding standards: checking for proper code in a page.
       • Getting rid of unwanted or deprecated data: images or files that are no longer used.
       • Providing customized search within a particular site.
  14. Thanks
       http://www.indies.co.in
       http://www.indieswebs.com