FivaTech: Page-Level Web Data Extraction from Template Pages
ICDM Workshops 2007
Reporter: Che-Min Liao
FivaTech is a page-level data extraction system which deduces the data schema and templates for the input pages generated from a CGI program.
The Deep Web refers to World Wide Web content that is not part of the surface Web, that is, the part indexed by search engines.
Dynamic Web Pages
Such pages share the same template, since they are generated by plugging data values into a predefined template. The key to automatic extraction is whether the template can be deduced automatically.
In this paper, we focus on page-level extraction tasks and propose a new approach, called FivaTech.
The FivaTech Approach
The proposed approach, FivaTech, contains two modules: tree merging and schema detection.
The tree merging module merges all input DOM trees at the same time into a structure called the fixed/variant pattern tree.
Multiple Tree Merging Algorithm
Peer Node Recognition
As each tag/node actually denotes a tree, we can use a 2-tree matching algorithm to compute whether two nodes with the same tag are similar. We adopt Yang's algorithm. A more serious problem is score normalization.
A typical way to compute a normalized score is to take the ratio of the number of pairs in the mapping to the maximum size of the two trees.
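The matching and the typical normalization described above can be sketched as follows. This is a minimal illustration, assuming that "Yang's algorithm" refers to the classic simple tree matching dynamic program; the `Node` class and the function names are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM-like node: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum order-preserving mapping between trees a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # M[i][j] = best mapping size between the first i children of a
    # and the first j children of b.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


def normalized_score(a: Node, b: Node) -> float:
    """Typical normalization: mapping pairs over the larger tree size."""
    return simple_tree_matching(a, b) / max(size(a), size(b))


a = Node("tr", [Node("td"), Node("td")])
b = Node("tr", [Node("td"), Node("td"), Node("td")])
print(normalized_score(a, b))  # 3 mapped pairs / max(3, 4) nodes
```

Dividing by the larger tree means that a small tree matched entirely inside a much larger one still receives a low score, which is one way to see why normalization is singled out as a problem.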
Tree Merging Score Algorithm
Example
Given the two matched trees A and B shown in Figure 6, where tr_1 through tr_6 are six similar data records, we assume that the number of mapping pairs between any two different subtrees tr_i and tr_j is 6. Assume also that the size of every tr_i is approximately 10.
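As a worked illustration using only the numbers assumed in this example (not a figure taken from the slides), the typical ratio-based normalization gives the following score for any pair of records:

```latex
\mathrm{score}(tr_i, tr_j)
  = \frac{\text{number of mapping pairs}}{\max(|tr_i|, |tr_j|)}
  \approx \frac{6}{\max(10, 10)} = 0.6
```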
Peer Matrix Alignment
After peer node recognition, all peer subtrees will be given the same symbol.
After alignment, each row (ignoring empty columns) either has the same symbol in every column or consists of text (or <img>) nodes with variant text (or SRC attribute) values.
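The row condition above can be checked mechanically. The sketch below assumes a hypothetical matrix representation in which each cell is either `None` (an empty column) or a `(symbol, kind)` pair, with `kind` being `"tag"`, `"text"`, or `"img"`. It only tests whether a matrix is already aligned; it does not perform the alignment itself.

```python
from typing import List, Optional, Tuple

# Hypothetical cell type: (symbol, kind), or None for an empty column.
Cell = Optional[Tuple[str, str]]


def row_is_aligned(row: List[Cell]) -> bool:
    """True if the non-empty cells share one symbol, or are all text
    nodes or all <img> nodes (whose values are allowed to vary)."""
    cells = [c for c in row if c is not None]
    if len({symbol for symbol, _ in cells}) <= 1:
        return True
    kinds = {kind for _, kind in cells}
    return kinds == {"text"} or kinds == {"img"}


def matrix_is_aligned(matrix: List[List[Cell]]) -> bool:
    """True if every row of the peer matrix satisfies the row condition."""
    return all(row_is_aligned(row) for row in matrix)


peer_matrix = [
    [("a", "tag"), ("a", "tag"), None],                 # same symbol, one empty column
    [("t1", "text"), ("t2", "text"), ("t3", "text")],   # variant text values
]
print(matrix_is_aligned(peer_matrix))
```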
Matrix Alignment Algorithm
Pattern Mining
This pattern mining step is designed to handle set-typed data, where multiple values occur.
We detect every consecutive repetitive pattern and merge its occurrences (deleting all occurrences except the first), proceeding from short patterns to long ones.
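The merging of consecutive repeats can be sketched on a flat list of symbols. This is a simplified illustration, not the pattern mining algorithm from the slides: the function name `merge_repeats` is hypothetical, and the bookkeeping a full implementation would need (for example, recording which occurrences were merged) is left out.

```python
from typing import List, Sequence


def merge_repeats(symbols: Sequence[str]) -> List[str]:
    """Collapse consecutive repeated patterns, shortest patterns first,
    keeping only the first occurrence of each run."""
    seq = list(symbols)
    length = 1
    while length <= len(seq) // 2:
        out: List[str] = []
        i = 0
        while i < len(seq):
            pattern = seq[i:i + length]
            j = i + length
            # Advance j past every further consecutive copy of the pattern.
            while len(pattern) == length and seq[j:j + length] == pattern:
                j += length
            if j > i + length:
                out.extend(pattern)  # keep only the first occurrence
                i = j
            else:
                out.append(seq[i])
                i += 1
        seq = out
        length += 1
    return seq


print(merge_repeats(["a", "b", "b", "a", "b", "b", "c"]))
```

In this input, the length-1 pass first reduces each `b b` run to a single `b`, which exposes the repeated `a b` pattern that the length-2 pass then merges. This is the reason for working from short patterns to long ones.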
Pattern Mining Algorithm
Optional Node Merging
After the mining step, we are able to detect optional nodes based on their occurrence vectors.
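A minimal sketch of this detection follows. It assumes, since the slides do not define it here, that an occurrence vector holds a 1 for each occurrence in which a node appears and a 0 otherwise, and that a node is optional when its vector is not all ones. The function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence


def is_optional(occurrence_vector: Sequence[int]) -> bool:
    """Assumed rule: a node is optional if it is missing (0) somewhere."""
    return any(bit == 0 for bit in occurrence_vector)


def optional_nodes(vectors: Dict[str, Sequence[int]]) -> List[str]:
    """Names of the nodes whose occurrence vectors mark them as optional."""
    return [name for name, vec in vectors.items() if is_optional(vec)]


# Hypothetical vectors over three occurrences.
vectors = {"a": (1, 1, 1), "b": (1, 0, 1), "c": (0, 1, 0)}
print(optional_nodes(vectors))
```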
Schema Detection
Detecting the structure of a Web site includes two tasks:
Identifying the schema.
Defining the template for each type constructor of this schema.
Identifying the Schema
Recognize the order of the set-type and optional data.
Schema of Example-2
Defining the Template
Templates can be obtained by segmenting the pattern tree at the reference nodes defined below:
For any k-order type constructor <τ_1, τ_2, τ_3, …, τ_k> at node n, where every type τ_i is located at a node n_i (i = 1, 2, …, k):
The template P will be the null template, or the one containing its reference node if it is the first data type in the schema tree.
If τ_i is a type constructor, then C_i will be the template that includes node n_i, and the respective insertion position will be 0.
If τ_i is of basic type, then C_i will be the template that is under n and includes the reference node of n_i, or null if no such template exists. If C_i is not null, the respective insertion position will be the distance of n_i to the rightmost path of C_i.
Template C_{i+1} will be the one that has the rightmost reference node inside n, or null otherwise.
Templates of Example-2
T(τ_1) = (T_1, (T_2, Φ), 0)
T(τ_2) = (Φ, (T_3, Φ), 0)
T(τ_3) = (Φ, (T_4, T_5, T_21), (0, 0))
T(τ_4) = (Φ, (T_6, T_7, Φ), (0, 0))
T(τ_13) = (Φ, (T_20, Φ), 2)
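One possible way to hold such triples in code is sketched below. The class and field names are hypothetical, `None` stands for the null template Φ, and the well-formedness check encodes only a pattern observed in the triples listed above (one more child template than insertion positions), not a rule stated in the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TemplateTriple:
    """Hypothetical container for T(τ) = (P, (C_1, ...), (i_1, ...))."""
    parent: Optional[str]                # P; None stands for Φ
    children: Tuple[Optional[str], ...]  # child templates; None stands for Φ
    positions: Tuple[int, ...]           # insertion positions

    def is_well_formed(self) -> bool:
        # Observed in the listed examples: one more child than positions.
        return len(self.children) == len(self.positions) + 1


# Two of the triples listed above, transcribed into this representation.
t_tau1 = TemplateTriple("T1", ("T2", None), (0,))
t_tau3 = TemplateTriple(None, ("T4", "T5", "T21"), (0, 0))
print(t_tau1.is_well_formed(), t_tau3.is_well_formed())
```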
Experiments
FivaTech as a schema extractor
FivaTech as an SRR (Search Result Record) extractor
Conclusion
FivaTech has much higher precision than EXALG.
FivaTech is comparable with other record-level extraction systems like ViPER and MSE.