This document discusses techniques for discovering structured information from web sites. It presents three main contributions:
1. A method for extracting logical lists, that is, web lists whose structured data is split across multiple web pages.
2. An approach for automatically extracting sitemaps from web sites.
3. A technique for clustering web pages based on intra-page and extra-page features.
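
To make the notion of a logical list in the first contribution concrete, here is a minimal sketch of how list fragments spread over several pages might be stitched back into one list. It is an illustration only, not the method the document proposes. The `Page` structure, the `next_url` field, the `assemble_logical_list` function, and the example URLs are all hypothetical, and it assumes the pages have already been fetched and parsed and that each fragment links to the next one.

```python
from typing import Dict, List, NamedTuple, Optional


class Page(NamedTuple):
    """Hypothetical parsed page: one fragment of a list plus a pagination link."""
    items: List[str]          # list items found on this page
    next_url: Optional[str]   # assumed "next page" link; None on the last page


def assemble_logical_list(pages: Dict[str, Page], start_url: str) -> List[str]:
    """Follow pagination links from start_url and concatenate the list fragments."""
    logical_list: List[str] = []
    seen = set()  # guards against pagination cycles
    url: Optional[str] = start_url
    while url is not None and url in pages and url not in seen:
        seen.add(url)
        page = pages[url]
        logical_list.extend(page.items)
        url = page.next_url
    return logical_list


# Toy site: one product list split across three pages (made-up data).
site = {
    "/products?page=1": Page(["Item A", "Item B"], "/products?page=2"),
    "/products?page=2": Page(["Item C", "Item D"], "/products?page=3"),
    "/products?page=3": Page(["Item E"], None),
}

full_list = assemble_logical_list(site, "/products?page=1")
```

In this toy setting the logical list is simply the concatenation of the per-page fragments in link order. On real sites, discovering which lists on which pages belong together is the hard part, and the sketch does not address it.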