Web Crawler

Published in: Technology, Design
Transcript

  • 1. Web crawler Allan@eSobi.com
  • 2. Agenda
      • Forward of Web Crawler
          – HTML Parser
      • Practice
          – Feed Crawler
      • Prototype demo
      • Conclusion
  • 3. HTML Parser
      • HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
      • The first step is to clean up the mess and bring order to tags, attributes, and ordinary text.
  • 4. Well-known Parsers
      • Access the information using standard XML interfaces.
      • HtmlCleaner
      • HtmlParser
      • NekoHTML
  • 5. Parser inner structure
      • HTML scanner
          – Pre-processing action
      • Tag balancer
          – Reorders individual elements
          – Produces well-formed XML
      • Extraction
      • Transformation
  • 6. Example
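A minimal sketch of the tag-balancer step from slide 5, written in Python with only the standard library. It is not the code of HtmlCleaner, HtmlParser, or NekoHTML; the class name `TagBalancer`, the helper `balance`, and the short `VOID_TAGS` list are assumptions made for illustration. The idea is to track open elements on a stack, close whatever is still open when an end tag arrives out of order, and drop stray end tags.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emits HTML as balanced, XML-style markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")  # self-close void elements
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start tag: drop it
        # Close every element opened after `tag`, then `tag` itself.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, <, > for XML


def balance(html):
    parser = TagBalancer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    while parser.open_tags:  # close whatever the page left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For example, `balance("<p><b>bold<i>both</b> tail")` returns `<p><b>bold<i>both</i></b> tail</p>`. The sketch ignores comments, doctypes, scripts, and duplicate attributes, all of which a real cleaner has to handle.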
  • 7. Extraction
      • Text extraction
          – for use as input for text search engine databases, for example
      • Link extraction
          – for crawling through web pages or harvesting email addresses
      • Screen scraping
          – for programmatic data input from web pages
  • 8. Extraction (continued)
      • Resource extraction
          – collecting images or sound
      • A browser front end
          – the preliminary stage of page display
      • Link checking
          – ensuring links are valid
      • Site monitoring
          – checking for page differences beyond simplistic diffs
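Link extraction, the use case a crawler depends on most, can be sketched with Python's standard `html.parser` and `urllib.parse` modules. The class `LinkExtractor` and the function `extract_links` are hypothetical names chosen for this example, and the URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would push the returned URLs onto its queue of pages to visit, typically after filtering out links it has already seen.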
  • 9. Transformation
      • URL rewriting
          – modifying some or all links on a page
      • Site capture
          – moving content from the web to local disk
      • Censorship
          – removing offending words and phrases from pages
  • 10. Transformation (continued)
      • HTML cleanup
          – correcting erroneous pages
      • Ad removal
          – excising URLs referencing advertising
      • Conversion to XML
          – moving existing web pages to XML
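Two of the transformations above, URL rewriting and ad removal, can be combined in one pass over the page. The following is a sketch under stated assumptions: `Rewriter`, `rewrite`, and the `AD_HOSTS` set are hypothetical names, `ads.example.com` is a placeholder host, and only `<img>` elements are removed so that no orphaned end tags are left behind.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical blocklist of advertising hosts (assumption for this example).
AD_HOSTS = {"ads.example.com"}


class Rewriter(HTMLParser):
    """Hypothetical helper: makes links absolute and drops ad images."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        rewritten = []
        for name, value in attrs:
            if name in ("href", "src") and value:
                value = urljoin(self.base_url, value)  # URL rewriting
                if tag == "img" and urlparse(value).hostname in AD_HOSTS:
                    return  # ad removal: skip the whole <img> tag
            rewritten.append((name, value))
        attr_text = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in rewritten
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> like <img ...> so no </img> is emitted.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def rewrite(html, base_url):
    parser = Rewriter(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The sketch drops comments and doctypes and does not treat script or style content specially, so it is only suitable for simple markup.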
  • 11. Practice
      • Feed Crawler
          – HTML
              • Bloglines, Feedage
          – XML
              • RssMountain
          – JSON
              • Google AJAX Feed API
      • Prototype
          – Demo
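For the XML case, a feed crawler eventually has to read entries out of a feed document. The slides do not show how the prototype does this; the following is a minimal sketch that assumes an RSS 2.0 feed already downloaded into a string, with `parse_rss_items` as a hypothetical function name.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items
```

Unlike the HTML found on the Web, a feed that is not well-formed XML makes `ET.fromstring` raise `ParseError`, so a crawler would need to catch that error or clean the input first.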
  • 12. Conclusion
      • Page search, image search, news search, blog search, feed search ...
      • Fault tolerance in text processing
      • Text mining on the web
      • Q&A
