Web crawler
Allan@eSobi.com
Agenda
• Foreword of Web Crawler
– HTML Parser

• Practice
– Feed Crawler

• Prototype demo
• Conclusion
HTML Parser
• HTML found on the Web is usually
dirty, ill-formed, and unsuitable for
further processing.
• The first step is to clean up the mess
and bring order to the tags, attributes,
and ordinary text.
Well-known Parsers
• Access the information using
standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
Parser inner structure
• HTML scanner

– Pre-processing action

• Tag balancer

– Reorders individual elements
– Produces well-formed XML

• Extraction
• Transformation
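The scanner and tag balancer described above are implemented by Java toolkits such as NekoHTML and HtmlCleaner. As a rough sketch of the same idea — not the actual library code — the following Python uses only the standard library's tokenizer to rebalance dirty HTML into well-formed XML:

```python
from html.parser import HTMLParser

class TagBalancer(HTMLParser):
    """Scanner + tag balancer sketch: tokenizes dirty HTML and emits
    well-formed XML, closing tags the author left open."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.out, self.stack = [], []

    def _fmt(self, attrs):
        return "".join(f' {k}="{v}"' for k, v in attrs if v is not None)

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:                      # emit as self-closing XML
            self.out.append(f"<{tag}{self._fmt(attrs)}/>")
            return
        if tag in {"li", "p"} and self.stack and self.stack[-1] == tag:
            self.handle_endtag(tag)               # HTML implies the end tag
        self.out.append(f"<{tag}{self._fmt(attrs)}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:                 # stray close tag: drop it
            return
        while self.stack:                         # close intervening tags
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def result(self):
        while self.stack:                         # balance anything left open
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)

b = TagBalancer()
b.feed("<ul><li>one<li>two</ul>")                 # <li> tags never closed
print(b.result())                                 # <ul><li>one</li><li>two</li></ul>
```

Real balancers handle far more cases (tables, misnesting, entities), but the principle — reorder and close elements until the output is well-formed — is the same.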
Example
Extraction
• Text extraction
– for use as input to text search engine
databases, for example

• Link extraction
– for crawling through web pages or harvesting
email addresses

• Screen scraping
– for programmatic data input from web pages
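The text- and link-extraction cases above can be done in a single pass over the page. A minimal sketch with Python's standard library (illustrative, not the deck's crawler code):

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """One pass over a page: collect visible text (for a search index)
    and href targets (for the crawl frontier)."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

p = LinkAndTextExtractor()
p.feed('<p>Read the <a href="/docs">docs</a> and <a href="/faq">FAQ</a>.</p>')
print(p.links)                  # ['/docs', '/faq']
print(" ".join(p.text))
```

The extracted links feed the crawl queue; the text goes to the indexer.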
Extraction

• Resource extraction

– collecting images or sound

• A browser front end

– the preliminary stage of page display

• Link checking

– ensuring links are valid

• Site monitoring

– checking for page differences beyond
simplistic diffs
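"Beyond simplistic diffs" can mean fingerprinting only the visible text, so markup churn or a rotated ad slot does not count as a page change. A small sketch of that idea, assuming the text-only normalization is acceptable for the site being monitored:

```python
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keep only the visible text chunks of a page."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def content_fingerprint(html: str) -> str:
    """Hash the visible text only, so a pure markup change
    does not register as a content difference."""
    p = TextOnly()
    p.feed(html)
    return hashlib.sha256(" ".join(p.parts).encode()).hexdigest()

old = '<div class="v1"><p>Hello world</p></div>'
new = '<section id="v2"><p>Hello world</p></section>'   # markup changed only
print(content_fingerprint(old) == content_fingerprint(new))  # True
```

Storing one fingerprint per monitored page makes change detection a cheap hash comparison on each crawl.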
Transformation
• URL rewriting
– modifying some or all links on a page

• Site capture
– moving content from the web to local disk

• Censorship
– removing offending words and phrases from
pages
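URL rewriting and site capture both need to re-emit the page with its links transformed — for capture, typically by absolutizing relative links against the page's base URL. A hedged stdlib sketch (real tools must also handle entities, comments, and CSS URLs):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkRewriter(HTMLParser):
    """Re-emit a page, rewriting every href/src against a base URL,
    e.g. to absolutize links before saving a captured page."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for k, v in attrs:
            if k in ("href", "src") and v:
                v = urljoin(self.base, v)         # relative -> absolute
            parts.append(f' {k}="{v}"' if v is not None else f" {k}")
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

r = LinkRewriter("http://example.com/a/")
r.feed('<a href="b.html">next</a>')
print("".join(r.out))   # <a href="http://example.com/a/b.html">next</a>
```

The same re-emit loop, with a different attribute transform, covers censorship-style text substitution as well.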
Transformation
• HTML cleanup
– correcting erroneous pages

• Ad removal
– excising URLs referencing advertising

• Conversion to XML
– moving existing web pages to XML
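Ad removal — "excising URLs referencing advertising" — usually comes down to a host blocklist check. A minimal sketch; the blocklist below is hypothetical and real filters are far larger:

```python
from urllib.parse import urlparse

# Hypothetical blocklist of advertising hosts (illustrative only).
AD_HOSTS = {"doubleclick.net", "ads.example.net"}

def is_ad(url: str) -> bool:
    """True if the URL's host is a listed ad host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)

links = ["http://example.com/page",
         "http://doubleclick.net/x",
         "http://ads.example.net/banner.gif"]
print([u for u in links if not is_ad(u)])   # ['http://example.com/page']
```

Combined with the link-rewriting pass, this lets the crawler drop ad elements while re-emitting the cleaned page.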
Practice
• Feed Crawler
– HTML

• Bloglines, Feedage

– XML

• RssMountain

– JSON

• Google AJAX Feed API

• Prototype
– Demo
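Whatever the source — Bloglines, Feedage, RssMountain, or the Google AJAX Feed API — a feed crawler ultimately parses RSS/Atom XML into (title, link) entries. A generic stdlib sketch for RSS 2.0, with an inline demo feed rather than a live fetch:

```python
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Demo feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def parse_rss(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(parse_rss(rss))
```

Because feeds are (supposed to be) well-formed XML, no tag balancing is needed here; the HTML-cleaning machinery is only for the HTML-backed sources.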
Conclusion
• Page search, image search, news
search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the Web
• Q&A
