Web Crawler

Published in: Technology, Design

  1. Web crawler
     Allan@eSobi.com
  2. Agenda
     • Foreword to web crawlers
       – HTML parser
     • Practice
       – Feed crawler
     • Prototype demo
     • Conclusion
  3. HTML Parser
     • HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
     • First clean up the mess and bring order to tags, attributes, and ordinary text.
  4. Well-known Parsers
     • Access the information using standard XML interfaces.
     • HtmlCleaner
     • HtmlParser
     • NekoHTML
  5. Parser inner structure
     • HTML scanner
       – Pre-processing action
     • Tag balancer
       – Reorders individual elements
       – Produces well-formed XML
     • Extraction
     • Transformation
  6. Example
  7. Extraction
     • Text extraction – for example, as input for text search engine databases
     • Link extraction – for crawling through web pages or harvesting email addresses
     • Screen scraping – for programmatic data input from web pages
  8. Extraction
     • Resource extraction – collecting images or sound
     • A browser front end – the preliminary stage of page display
     • Link checking – ensuring links are valid
     • Site monitoring – checking for page differences beyond simplistic diffs
  9. Transformation
     • URL rewriting – modifying some or all links on a page
     • Site capture – moving content from the web to local disk
     • Censorship – removing offending words and phrases from pages
  10. Transformation
      • HTML cleanup – correcting erroneous pages
      • Ad removal – excising URLs referencing advertising
      • Conversion to XML – moving existing web pages to XML
  11. Practice
      • Feed Crawler
        – HTML
          • Bloglines, Feedage
        – XML
          • RssMountain
        – JSON
          • Google AJAX Feed API
      • Prototype
        – Demo
  12. Conclusion
      • Page search, image search, news search, blog search, feed search ...
      • Fault tolerance in text processing
      • Text mining on the web
      • Q&A
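
The link extraction and text extraction steps listed in the slides can be sketched with Python's standard-library `html.parser`, which keeps going when tags are left unclosed. This is only a minimal sketch, not the approach of the Java parsers named in the deck (HtmlCleaner, HtmlParser, NekoHTML), and it does not balance tags or produce well-formed XML. The class name `LinkAndTextExtractor`, the helper `extract`, and the example URL are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Collects absolute link URLs and visible text from possibly ill-formed HTML."""

    # Content of these elements is not visible page text, so it is skipped.
    SKIP_TAGS = {"script", "style"}

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url      # used to resolve relative hrefs
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Link extraction: store the URL in absolute form.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text extraction: keep non-empty text outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (links, text) for one page; hypothetical helper for this sketch."""
    parser = LinkAndTextExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

A crawler could call `extract(page_html, page_url)` for each fetched page, queue the returned links for later visits, and hand the text to a search index.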

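
For the XML case of the feed crawler in the Practice slide, one possible sketch is to read item titles and links from a feed with Python's standard-library `xml.etree.ElementTree`. It assumes a plain RSS 2.0 document with un-namespaced `item`, `title`, and `link` elements; Atom feeds and ill-formed XML are not handled. The function name `parse_rss_items` is hypothetical, and this is not the prototype shown in the talk.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return a list of {"title", "link"} dicts for each <item> in an RSS 2.0 feed.

    Assumes well-formed XML; ET.ParseError is raised otherwise.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items
```

Because real feeds are often as dirty as real HTML, a crawler would likely wrap this call in error handling and fall back to a tolerant parser when `ET.ParseError` is raised.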