The Problem of Peer Node Recognition

FivaTech: Schema & Template Discovery. Reporter: Che-Min Liao

Introduction FivaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program.

The FivaTech Approach The proposed approach, FivaTech, contains two modules: tree merging and schema detection.

Peer Node Recognition As each tag/node actually denotes a tree, we can use a 2-tree matching algorithm to compute whether two nodes with the same tag are similar. We adopt Yang's algorithm. A more serious problem is score normalization. A typical way to compute a normalized score is the ratio of the number of pairs in the mapping to the maximum size of the two trees.
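The matching and normalization described above can be sketched roughly as follows. This is a minimal sketch, assuming the recursive dynamic-programming form of simple tree matching usually attributed to Yang. The `Node` class and the function names are hypothetical and not taken from FivaTech itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a tag plus an ordered list of children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum order-preserving mapping between a and b.

    Assumed form: roots with different tags match nothing; otherwise the
    children are aligned with an LCS-style dynamic program.
    """
    if a.tag != b.tag:
        return 0
    k, n = len(a.children), len(b.children)
    m = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            m[i][j] = max(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1] + w)
    return m[k][n] + 1  # +1 counts the matched roots


def normalized_score(a: Node, b: Node) -> float:
    """Matched pairs divided by the size of the larger tree."""
    return simple_tree_matching(a, b) / max(size(a), size(b))


row_a = Node("tr", [Node("td"), Node("td")])
row_b = Node("tr", [Node("td"), Node("td"), Node("td")])
score = normalized_score(row_a, row_b)
```

Here `row_a` has 3 nodes and `row_b` has 4, and 3 pairs can be matched, so the score is 3/4. Whether two nodes count as peers would then depend on a similarity threshold, which is not stated on the slide.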

Optional Node Merging After the mining step, we are able to detect optional nodes based on the occurrence vectors.
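One way to picture the occurrence-vector test is sketched below. It assumes that each input page has already been reduced to a list of peer-node labels, and that a node is optional when its vector contains both 1s and 0s. The data layout and the function names are assumptions made for illustration only.

```python
from typing import Dict, List


def occurrence_vectors(pages: List[List[str]]) -> Dict[str, List[int]]:
    """Map each label to a 0/1 vector: does it occur in page i?"""
    labels: List[str] = []
    for page in pages:
        for label in page:
            if label not in labels:
                labels.append(label)  # keep first-seen order
    return {lab: [1 if lab in page else 0 for page in pages] for lab in labels}


def optional_nodes(vectors: Dict[str, List[int]]) -> List[str]:
    """Labels that appear in some pages but are missing from others."""
    return [lab for lab, vec in vectors.items() if 0 in vec and 1 in vec]


# Hypothetical input: peer-node labels of three pages.
pages = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"]]
vectors = occurrence_vectors(pages)
optional = optional_nodes(vectors)
```

In this example node `b` has the vector `[1, 0, 1]` and is the only one reported as optional.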

Identifying the Schema Recognize the tuple type. Recognize the order of the set type and optional data.

Defining the Template Templates can be obtained by segmenting the pattern tree at reference nodes defined below:

The Example of Template
T(τ1) = (T1, (T2, Φ), 0)
T(τ2) = (Φ, (T3, Φ), 0)
T(τ3) = (Φ, (T4, T5, T21), (0,0))
T(τ4) = (Φ, (T6, T7, Φ), (0,0))
…
T(τ13) = (Φ, (T20, Φ), 2)
