Contents: Introduction Wrappers Clustering System Description Working Types Advantages and Disadvantages Conclusion
Introduction: STAVIES is a system for information extraction through automatic web wrapper generation using clustering techniques.
STAVIES is used in: Automatic Information Discovery. Extraction of structured web data.
WRAPPERS: A wrapper is a piece of software that extracts useful information from web data sources. The extracted data are referred to as Structural Tokens.
Categories of Wrappers: Site-specific wrappers: Extract information from a single web page or a family of web pages. Generic wrappers: Can be applied to almost any page, regardless of its structure.
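The difference between the two wrapper categories can be illustrated with a minimal sketch. It is not the STAVIES implementation: the class name `TextTokenParser`, the function `extract_tokens`, and the `item` class attribute in the sample page are all hypothetical. The site-specific call relies on knowing the page layout in advance, while the generic call keeps every text token.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped when tracking nesting.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TextTokenParser(HTMLParser):
    """Collects text tokens, optionally only inside elements of one class."""

    def __init__(self, target_class=None):
        super().__init__()
        self.target_class = target_class  # None means "generic": keep all text
        self.depth_in_target = 0          # nesting depth inside a target element
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth_in_target or self.target_class in classes:
            self.depth_in_target += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth_in_target:
            self.depth_in_target -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and (self.target_class is None or self.depth_in_target):
            self.tokens.append(text)


def extract_tokens(html, target_class=None):
    """Site-specific when target_class is given, generic otherwise."""
    parser = TextTokenParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical sample page; the "item" class is an assumed site convention.
PAGE = ('<html><body><h1>Shop</h1><div class="item"><span>Pen</span>'
        '<span>2 EUR</span></div><p>Footer</p></body></html>')

site_specific = extract_tokens(PAGE, "item")  # depends on the known layout
generic = extract_tokens(PAGE)                # makes no layout assumption
```

The site-specific variant stops working as soon as the site renames the class, which is the kind of brittleness that automatic wrapper generation tries to avoid.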
CLUSTERING: Process of organizing the input data set so that data points in the same cluster are more similar to each other than to data points in different clusters.
Quality Evaluation Measures: Cluster Compactness: Evaluates how well the subsets of the input are redistributed by the clustering system, compared with the whole input set. Cluster Separation: Indicates the overall dissimilarity among the output clusters.
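The two measures can be made concrete with a small sketch. The formulas below are one common formulation and are an assumption here, since the slides do not give them: compactness is taken as the average within-cluster variance divided by the variance of the whole input set, and separation as the average Gaussian similarity between cluster centroids, with `sigma` a hypothetical width parameter. Under these definitions, lower values are better for both measures.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def variance(points):
    """Mean squared Euclidean distance of the points to their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points) / len(points)


def compactness(clusters):
    """Average cluster variance relative to the variance of the whole input.

    Assumes the input points are not all identical (non-zero total variance).
    """
    all_points = [p for cluster in clusters for p in cluster]
    total = variance(all_points)
    return sum(variance(c) for c in clusters) / (len(clusters) * total)


def separation(clusters, sigma=1.0):
    """Average Gaussian similarity between centroids of distinct clusters.

    Assumes at least two clusters. Values near 0 mean well-separated clusters.
    """
    cents = [centroid(c) for c in clusters]
    k = len(cents)
    total = sum(
        math.exp(-math.dist(cents[i], cents[j]) ** 2 / (2 * sigma ** 2))
        for i in range(k) for j in range(k) if i != j
    )
    return total / (k * (k - 1))


# Two partitions of the same four points: a sensible one and a poor one.
good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
```

For the sample data, the `good` partition scores lower than the `bad` one on both measures, which matches the intuition that its clusters are tight and far apart.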
System Description: Two modules: 1. Transformation module. 2. Extraction module.
Phases: Preparation Phase: 1. Validation, correction and XHTML generation. 2. Tree transformation and terminal node selection.
• Segmentation Phase: 1. Node comparison. 2. Hierarchical clustering. 3. Cluster evaluation and target area discovery. 4. Boundary selection.
• Information Retrieval Phase: 1. Information Extraction component.
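The three phases can be sketched end to end on a toy page. This is an illustrative simplification, not the STAVIES algorithm: the shared-prefix similarity in `path_similarity`, the single-linkage merge rule, the `threshold` value, and the choice of the largest cluster as the target area are all assumptions made for the example. The input is assumed to be already valid XHTML without namespaces, so the validation and correction step is skipped.

```python
import xml.etree.ElementTree as ET


def terminal_nodes(xhtml):
    """Preparation: build the tree and keep leaf elements that carry text.

    Text that follows a child element (mixed content) is ignored here.
    """
    root = ET.fromstring(xhtml)
    found = []

    def walk(elem, path):
        path = path + [elem.tag]
        children = list(elem)
        if not children and elem.text and elem.text.strip():
            found.append((tuple(path), elem.text.strip()))
        for child in children:
            walk(child, path)

    walk(root, [])
    return found


def path_similarity(a, b):
    """Node comparison: shared root-path prefix, normalized by the longer path."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def cluster(nodes, threshold=0.75):
    """Hierarchical (agglomerative, single-linkage) clustering of the nodes."""
    clusters = [[n] for n in nodes]

    def link(c1, c2):
        return max(path_similarity(p, q) for p, _ in c1 for q, _ in c2)

    while len(clusters) > 1:
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical sample page with a repeated record structure.
PAGE = ("<html><body><h1>Shop</h1><ul>"
        "<li><b>Pen</b><i>2 EUR</i></li>"
        "<li><b>Ink</b><i>5 EUR</i></li>"
        "</ul><p>Footer</p></body></html>")

clusters = cluster(terminal_nodes(PAGE))
target_area = max(clusters, key=len)                   # assumed selection rule
extracted = sorted(text for _, text in target_area)    # information retrieval
```

On this page the four list-item tokens end up in one cluster, while the heading and the footer stay in clusters of their own, so the largest cluster corresponds to the repeated record area.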