Web Crawler

Published in: Technology, Design

  1. Web crawler
     Allan@eSobi.com
  2. Agenda
     • Forward of Web Crawler
       – HTML Parser
     • Practice
       – Feed Crawler
     • Prototype demo
     • Conclusion
  3. HTML Parser
     • HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
     • The first step is to clean up the mess and bring order to tags, attributes, and ordinary text.
  4. Well-known Parsers
     • Access the information using standard XML interfaces.
     • HtmlCleaner
     • HtmlParser
     • NekoHTML
  5. Parser inner structure
     • HTML scanner
       – Pre-processing action
     • Tag balancer
       – Reorders individual elements
       – Produces well-formed XML
     • Extraction
     • Transformation
  6. Example
  7. Extraction
     • Text extraction
       – for example, as input for text search engine databases
     • Link extraction
       – for crawling through web pages or harvesting email addresses
     • Screen scraping
       – for programmatic data input from web pages
  8. Extraction
     • Resource extraction
       – collecting images or sound
     • A browser front end
       – the preliminary stage of page display
     • Link checking
       – ensuring links are valid
     • Site monitoring
       – checking for page differences beyond simplistic diffs
  9. Transformation
     • URL rewriting
       – modifying some or all links on a page
     • Site capture
       – moving content from the web to local disk
     • Censorship
       – removing offending words and phrases from pages
  10. Transformation
      • HTML cleanup
        – correcting erroneous pages
      • Ad removal
        – excising URLs referencing advertising
      • Conversion to XML
        – moving existing web pages to XML
  11. Practice
      • Feed Crawler
        – HTML: Bloglines, Feedage
        – XML: RssMountain
        – JSON: Google AJAX Feed API
      • Prototype
        – Demo
  12. Conclusion
      • Page search, image search, news search, blog search, feed search ...
      • Fault tolerance in text processing
      • Text mining on the web
      • Q&A
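
The link extraction use case from slide 7 can be sketched in a few lines. This is a minimal illustration that uses Python's standard `html.parser` module rather than any of the Java parsers named on slide 4; the `LinkExtractor` class, the `extract_links` helper, and the example.com URLs are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately dirty input: unclosed tags, upper-case names, unquoted attribute.
dirty = '<p>Unclosed paragraph <a href="/feeds">Feeds<br><A HREF=about.html>About'
print(extract_links(dirty, "http://example.com/blog/"))
```

A crawler would feed the returned URLs into its queue of pages to visit. Because the parser is event-based, it does not need the input to be well-formed before links can be collected.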
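
The tag balancer stage from slide 5 can be approximated as follows. This is a simplified sketch, not the algorithm used by HtmlCleaner, HtmlParser, or NekoHTML: it only closes tags that were left open and drops stray closing tags, and it does not reorder elements. The `TagBalancer` class, the `balance` helper, the `VOID` set, and the `<root>` wrapper element are assumptions made for the example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content; emitted as self-closing tags.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element opened after the matching start tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def balance(html):
    balancer = TagBalancer()
    balancer.feed(html)
    balancer.close()
    while balancer.stack:  # close whatever is still open at end of input
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "<root>" + "".join(balancer.out) + "</root>"


dirty = "<ul><li>One<li>Two <b>bold</ul><p>Tail &amp; more<br>"
tree = ET.fromstring(balance(dirty))  # parses only if the output is well-formed
print("".join(tree.itertext()))
```

Note that the second `<li>` ends up nested inside the first instead of becoming its sibling. A production balancer knows which elements close each other implicitly; this sketch does not.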
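
For the XML branch of the feed crawler on slide 11, reading a feed might look like the sketch below. It assumes an RSS 2.0 document and uses Python's standard `xml.etree.ElementTree`; the `SAMPLE_RSS` data and the `parse_rss` helper are made up for illustration and do not come from the prototype shown in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return the channel title and a list of item titles and links."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    items = [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in channel.findall("item")
    ]
    return {"title": channel.findtext("title", default=""), "items": items}


feed = parse_rss(SAMPLE_RSS)
for entry in feed["items"]:
    print(entry["title"], entry["link"])
```

Unlike the HTML case, a strict XML parser rejects ill-formed input outright, so a real feed crawler would likely need a fallback path for broken feeds.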
