FivaTech: Schema & Template Discovery
Reporter: Che-Min Liao
Introduction
FivaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program. Its two modules are tree merging and schema detection.
Problem Formulation
The FivaTech Approach
The proposed approach, FivaTech, contains two modules: tree merging and schema detection.
Peer Node Recognition
As each tag/node actually denotes a tree, we can use a 2-tree matching algorithm to compute whether two nodes with the same tag are similar. We adopt Yang's algorithm. A more serious problem is score normalization. A typical way to compute a normalized score is the ratio of the number of pairs in the mapping to the maximum size of the two trees.
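The normalized score described above can be sketched in Python as follows. This is a minimal illustration, assuming that "Yang's algorithm" refers to the classic simple tree matching recurrence (a dynamic program over ordered children); the `Node` class and the function names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of matched node pairs in a maximum order-preserving mapping
    between a and b (assumed form of the matching step)."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # M[i][j] = best match count using the first i children of a
    # and the first j children of b.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


def normalized_score(a: Node, b: Node) -> float:
    """Matched pairs divided by the size of the larger tree."""
    return simple_tree_matching(a, b) / max(size(a), size(b))
```

For example, a `tr` node with two `td` children compared against a `tr` node with three `td` children yields 3 matched pairs over a maximum size of 4, i.e. a score of 0.75. Two nodes would then be treated as peers when the score exceeds some threshold, which the slides do not specify.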
Tree Merging Score Algorithm
Example
Peer Matrix Alignment
Pattern Mining
Optional Node Merging
After the mining step, we are able to detect optional nodes based on the occurrence vectors.
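A rough sketch of this detection step, under an assumption the slides do not spell out: each node's occurrence vector holds one entry per aligned occurrence, 1 if the node is present there and 0 if it is absent. The function name and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple


def find_optional_nodes(occurrence: Dict[str, Tuple[int, ...]]) -> List[str]:
    """Return the nodes whose occurrence vector mixes 1s and 0s,
    i.e. nodes that appear in some occurrences but not in all of them
    (assumed definition of an optional node)."""
    return [
        name
        for name, vector in occurrence.items()
        if 0 in vector and 1 in vector
    ]


# Illustrative data only: three nodes observed across three occurrences.
vectors = {
    "a": (1, 1, 1),  # always present: mandatory
    "b": (1, 0, 1),  # missing once: optional
    "c": (0, 1, 0),  # present once: optional
}
optional = find_optional_nodes(vectors)
```

Here `optional` contains `"b"` and `"c"`. How such nodes are then merged (for instance, grouping adjacent optional nodes that share the same vector) is not covered by this sketch.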
The Example of Pattern Tree
Identifying the Schema
Recognize the tuple type.
Recognize the order of the set type and optional data.
Defining the Template
Templates can be obtained by segmenting the pattern tree at the reference nodes defined below:
The Example of Schema
The Example of Template
T(τ1) = (T1, (T2, Φ), 0)
T(τ2) = (Φ, (T3, Φ), 0)
T(τ3) = (Φ, (T4, T5, T21), (0,0))
T(τ4) = (Φ, (T6, T7, Φ), (0,0))
…
T(τ13) = (Φ, (T20, Φ), 2)

The Problem of Peer Node Recognition
