2. Contents: Introduction, Wrappers, Clustering, System Description, Working, Types, Advantages and Disadvantages, Conclusion
3. Introduction: STAVIES is a system for information extraction through automatic web wrapper generation using clustering techniques.
4. STAVIES is used for: Automatic information discovery. Extraction of structured web data.
5. WRAPPERS: A wrapper is a piece of software that extracts the useful information from web data sources. The extracted data is referred to as Structural Tokens.
6. Categories of Wrappers: Site-specific wrappers: Extract information from a single web page or a family of web pages. Generic wrappers: Can be applied to almost any page, regardless of its structure.
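To make the site-specific category concrete, here is a minimal sketch of a hand-written wrapper. It is not part of STAVIES: the class name `PriceWrapper`, the helper `extract_prices`, and the `<span class="price">` markup are all hypothetical and stand in for one particular site's template.

```python
from html.parser import HTMLParser


class PriceWrapper(HTMLParser):
    """Hypothetical site-specific wrapper: keeps text found inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.tokens.append(data.strip())


def extract_prices(html):
    """Run the wrapper over one page and return the extracted tokens."""
    wrapper = PriceWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.tokens


page = '<div><span class="price">$10</span><span class="name">A</span><span class="price">$12</span></div>'
print(extract_prices(page))
```

The rule is tied to one template, so it stops working as soon as the site renames the class or restructures the page. That fragility is what generic, automatically generated wrappers aim to avoid.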
7. CLUSTERING: The process of organizing an input data set in such a way that data points in the same cluster are more similar to each other than to points in different clusters.
8. Quality Evaluation Measures: Cluster Compactness: Evaluates how well the subsets of the input are redistributed by the clustering system, compared with the whole input set. Cluster Separation: Indicates the overall dissimilarity among the output clusters.
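The two measures can be illustrated on one-dimensional data. The formulas below are assumptions chosen for illustration, not the exact definitions used by STAVIES: compactness is taken as the mean within-cluster variance divided by the variance of the whole input set (lower is more compact), and separation as the mean distance between cluster centroids (higher is more separated).

```python
from itertools import combinations
from statistics import mean, pvariance


def compactness(clusters):
    """Mean within-cluster variance relative to the variance of all points (assumed formula)."""
    all_points = [p for cluster in clusters for p in cluster]
    total_variance = pvariance(all_points)
    if total_variance == 0:
        return 0.0
    return mean(pvariance(cluster) for cluster in clusters) / total_variance


def separation(clusters):
    """Mean absolute distance between cluster centroids (assumed formula)."""
    if len(clusters) < 2:
        raise ValueError("separation needs at least two clusters")
    centroids = [mean(cluster) for cluster in clusters]
    return mean(abs(a - b) for a, b in combinations(centroids, 2))


clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(compactness(clusters), separation(clusters))
```

A good clustering under these assumed definitions has a small compactness value and a large separation value, as in the two well-separated groups above.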
9. System Description: Two modules: 1. Transformation module. 2. Extraction module.
10. Phases: Preparation Phase: 1. Validation, correction, and XHTML generation. 2. Tree transformation and terminal node selection.
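The second step of the preparation phase can be sketched as follows. This is an illustration only, under the assumption that the page has already been corrected into well-formed XHTML without XML namespaces. The function name `terminal_nodes` and the choice to record each leaf as a (tag path, text) pair are hypothetical.

```python
import xml.etree.ElementTree as ET


def terminal_nodes(xhtml):
    """Parse well-formed XHTML into a tree and return (tag path, text) for each text-bearing leaf."""
    root = ET.fromstring(xhtml)
    found = []

    def walk(node, path):
        path = path + [node.tag]
        children = list(node)
        # A terminal node here is an element with no child elements and non-empty text.
        if not children and node.text and node.text.strip():
            found.append(("/".join(path), node.text.strip()))
        for child in children:
            walk(child, path)

    walk(root, [])
    return found


page = "<html><body><ul><li>A</li><li>B</li></ul><p>Footer</p></body></html>"
for path, text in terminal_nodes(page):
    print(path, text)
```

Keeping the path alongside the text gives the later segmentation phase a structural feature to compare nodes on.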
11. • Segmentation Phase: 1. Node comparison. 2. Hierarchical clustering. 3. Cluster evaluation and target area discovery. 4. Boundary selection.
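The first three segmentation steps can be sketched on tag paths. Everything here is an assumption made for illustration and not the STAVIES algorithm itself: nodes are compared by the fraction of their tag path they share, clusters are merged bottom-up with single linkage until the best similarity falls below a hypothetical `threshold`, and the target area is simply taken to be the largest cluster. Boundary selection is not covered.

```python
def path_similarity(a, b):
    """Node comparison: shared path prefix length divided by the longer path length."""
    parts_a, parts_b = a.split("/"), b.split("/")
    shared = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(parts_a), len(parts_b))


def cluster_paths(paths, threshold=0.8):
    """Agglomerative (hierarchical) clustering with single linkage, stopped at a threshold."""
    clusters = [[p] for p in paths]
    while len(clusters) > 1:
        best = None  # (similarity, index_i, index_j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(path_similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def target_area(clusters):
    """Cluster evaluation stand-in: pick the largest cluster as the target area."""
    return max(clusters, key=len)


paths = ["html/body/ul/li", "html/body/ul/li", "html/body/ul/li", "html/body/p", "html/head/title"]
print(target_area(cluster_paths(paths)))
```

In this toy input the three repeated list items end up in one cluster, which mirrors the intuition that repeated, structurally similar nodes mark the data-rich region of a template page.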
12. • Information Retrieval Phase: 1. Information Extraction component.
14. Experimental Results:
15. Types: OMINI, MDR
16. Advantages: Executes in less than 0.4 sec. No human assistance is required. High performance.
17. Disadvantage: Hard to apply to free text and non-template pages.
18. Conclusion: STAVIES saves time and effort. It was tested successfully on more than 63,000 HTML pages from 50 different web data sources.