FivaTech : Schema & Template Discovery Reporter : Che-Min Liao
The Problem of Peer Node Recognition

  1. FivaTech: Schema & Template Discovery (Reporter: Che-Min Liao)
  2. Introduction
     - FivaTech is a page-level data extraction system which deduces the data schema and templates for the input pages generated from a CGI program.
       - Tree Merging
       - Schema Detection
  3. Problem Formulation
  4. Problem Formulation
  5. The FivaTech Approach
     - The proposed approach FivaTech contains two modules:
       - Tree merging
       - Schema detection
  6. Peer Node Recognition
     - Since each tag/node actually denotes a tree, we can use a 2-tree matching algorithm to compute whether two nodes with the same tag are similar.
       - We adopt Yang's algorithm.
     - A more serious problem is score normalization.
       - A typical way to compute a normalized score is the ratio of the number of pairs in the mapping to the maximum size of the two trees.
  7. Tree Merging Score Algorithm
  8. Example
  9. Peer Matrix Alignment
  10. Pattern Mining
  11. Optional Node Merging
     - After the mining step, we are able to detect optional nodes based on the occurrence vectors.
  12. The Example of Pattern Tree
  13. Identifying the Schema
     - Recognize tuple type.
     - Recognize order of the set type and optional data.
  14. Defining the Template
     - Templates can be obtained by segmenting the pattern tree at reference nodes defined below:
  15. The Example of Schema
  16. The Example of Template
     - T(τ1) = (T1, (T2, Φ), 0)
     - T(τ2) = (Φ, (T3, Φ), 0)
     - T(τ3) = (Φ, (T4, T5, T21), (0,0))
     - T(τ4) = (Φ, (T6, T7, Φ), (0,0))
     - …
     - T(τ13) = (Φ, (T20, Φ), 2)
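
The peer node test described on the Peer Node Recognition slide can be sketched roughly as follows. This is a minimal illustration, not the FivaTech implementation: it assumes that "Yang's algorithm" refers to the classic simple tree matching dynamic program, that the size of a tree is its total node count, and that the normalized score is the number of matched pairs divided by the larger tree size. The `Node` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum order-preserving top-down mapping."""
    if a.tag != b.tag:
        return 0  # roots with different tags cannot be mapped
    m, n = len(a.children), len(b.children)
    # table[i][j]: best match count between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def normalized_score(a: Node, b: Node) -> float:
    """Matched pairs divided by the size of the larger tree (assumed normalization)."""
    return simple_tree_matching(a, b) / max(tree_size(a), tree_size(b))


# Two small table rows that differ by one trailing cell.
t1 = Node("tr", [Node("td"), Node("td", [Node("a")])])
t2 = Node("tr", [Node("td"), Node("td", [Node("a")]), Node("td")])
print(normalized_score(t1, t2))  # 4 matched pairs / max(4, 5) nodes = 0.8
```

Two nodes with the same tag would then be treated as peers when this score exceeds some threshold; the threshold value is not given in the slides.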
