Web crawler
Allan@eSobi.com
Agenda
• Foreword of Web Crawler
– HTML Parser

• Practice
– Feed Crawler

• Prototype demo
• Conclusion
HTML Parser
• HTML found on the Web is usually
dirty, ill-formed, and unsuitable for
further processing.
• The first step is to clean up the mess
and bring order to the tags, attributes,
and ordinary text.
Well-known Parsers
• Access the information using
standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
Parser inner structure
• HTML scanner

– Pre-processing action

• Tag balancer

– Reorders individual elements
– Produces well-formed XML

• Extraction
• Transformation
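The scanner and tag balancer described above are implemented by Java toolkits such as NekoHTML and HtmlCleaner. As a rough sketch of the same idea — not the actual library code — the following Python uses only the standard library's tokenizer to rebalance dirty HTML into well-formed XML:

```python
from html.parser import HTMLParser

class TagBalancer(HTMLParser):
    """Scanner + tag balancer sketch: tokenizes dirty HTML and emits
    well-formed XML, closing tags the author left open."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.out, self.stack = [], []

    def _fmt(self, attrs):
        return "".join(f' {k}="{v}"' for k, v in attrs if v is not None)

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:                      # emit as self-closing XML
            self.out.append(f"<{tag}{self._fmt(attrs)}/>")
            return
        if tag in {"li", "p"} and self.stack and self.stack[-1] == tag:
            self.handle_endtag(tag)               # HTML implies the end tag
        self.out.append(f"<{tag}{self._fmt(attrs)}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:                 # stray close tag: drop it
            return
        while self.stack:                         # close intervening tags
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def result(self):
        while self.stack:                         # balance anything left open
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)

b = TagBalancer()
b.feed("<ul><li>one<li>two</ul>")                 # <li> tags never closed
print(b.result())                                 # <ul><li>one</li><li>two</li></ul>
```

Real balancers handle far more cases (tables, misnesting, entities), but the principle — reorder and close elements until the output is well-formed — is the same.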
Example
Extraction
• Text extraction
– for use as input to text search engine
databases, for example

• Link extraction
– for crawling through web pages or harvesting
email addresses

• Screen scraping
– for programmatic data input from web pages
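The text- and link-extraction cases above can be done in a single pass over the page. A minimal sketch with Python's standard library (illustrative, not the deck's crawler code):

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """One pass over a page: collect visible text (for a search index)
    and href targets (for the crawl frontier)."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

p = LinkAndTextExtractor()
p.feed('<p>Read the <a href="/docs">docs</a> and <a href="/faq">FAQ</a>.</p>')
print(p.links)                  # ['/docs', '/faq']
print(" ".join(p.text))
```

The extracted links feed the crawl queue; the text goes to the indexer.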
Extraction

• Resource extraction

– collecting images or sound

• A browser front end

– the preliminary stage of page display

• Link checking

– ensuring links are valid

• Site monitoring

– checking for page differences beyond
simplistic diffs
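"Beyond simplistic diffs" can mean fingerprinting only the visible text, so markup churn or a rotated ad slot does not count as a page change. A small sketch of that idea, assuming the text-only normalization is acceptable for the site being monitored:

```python
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keep only the visible text chunks of a page."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def content_fingerprint(html: str) -> str:
    """Hash the visible text only, so a pure markup change
    does not register as a content difference."""
    p = TextOnly()
    p.feed(html)
    return hashlib.sha256(" ".join(p.parts).encode()).hexdigest()

old = '<div class="v1"><p>Hello world</p></div>'
new = '<section id="v2"><p>Hello world</p></section>'   # markup changed only
print(content_fingerprint(old) == content_fingerprint(new))  # True
```

Storing one fingerprint per monitored page makes change detection a cheap hash comparison on each crawl.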
Transformation
• URL rewriting
– modifying some or all links on a page

• Site capture
– moving content from the web to local disk

• Censorship
– removing offending words and phrases from
pages
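URL rewriting and site capture both need to re-emit the page with its links transformed — for capture, typically by absolutizing relative links against the page's base URL. A hedged stdlib sketch (real tools must also handle entities, comments, and CSS URLs):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkRewriter(HTMLParser):
    """Re-emit a page, rewriting every href/src against a base URL,
    e.g. to absolutize links before saving a captured page."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for k, v in attrs:
            if k in ("href", "src") and v:
                v = urljoin(self.base, v)         # relative -> absolute
            parts.append(f' {k}="{v}"' if v is not None else f" {k}")
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

r = LinkRewriter("http://example.com/a/")
r.feed('<a href="b.html">next</a>')
print("".join(r.out))   # <a href="http://example.com/a/b.html">next</a>
```

The same re-emit loop, with a different attribute transform, covers censorship-style text substitution as well.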
Transformation
• HTML cleanup
– correcting erroneous pages

• Ad removal
– excising URLs referencing advertising

• Conversion to XML
– moving existing web pages to XML
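Ad removal — "excising URLs referencing advertising" — usually comes down to a host blocklist check. A minimal sketch; the blocklist below is hypothetical and real filters are far larger:

```python
from urllib.parse import urlparse

# Hypothetical blocklist of advertising hosts (illustrative only).
AD_HOSTS = {"doubleclick.net", "ads.example.net"}

def is_ad(url: str) -> bool:
    """True if the URL's host is a listed ad host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)

links = ["http://example.com/page",
         "http://doubleclick.net/x",
         "http://ads.example.net/banner.gif"]
print([u for u in links if not is_ad(u)])   # ['http://example.com/page']
```

Combined with the link-rewriting pass, this lets the crawler drop ad elements while re-emitting the cleaned page.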
Practice
• Feed Crawler
– HTML

• Bloglines, Feedage

– XML

• RssMountain

– JSON

• Google AJAX Feed API

• Prototype
– Demo
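Whatever the source — Bloglines, Feedage, RssMountain, or the Google AJAX Feed API — a feed crawler ultimately parses RSS/Atom XML into (title, link) entries. A generic stdlib sketch for RSS 2.0, with an inline demo feed rather than a live fetch:

```python
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Demo feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def parse_rss(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(parse_rss(rss))
```

Because feeds are (supposed to be) well-formed XML, no tag balancing is needed here; the HTML-cleaning machinery is only for the HTML-backed sources.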
Conclusion
• Page search, image search, news
search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the Web
• Q&A
