FivaTech: Page-Level Web Data Extraction from Template Pages, ICDM Workshops 2007. Reporter: Che-Min Liao
1212 regular meeting


  1. FivaTech: Page-Level Web Data Extraction from Template Pages, ICDM Workshops 2007. Reporter: Che-Min Liao
  2. Abstract <ul><li>FivaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program. </li></ul><ul><ul><li>Tree Merging </li></ul></ul><ul><ul><li>Schema Detection </li></ul></ul>
  3. Outline <ul><li>Introduction </li></ul><ul><li>Problem formulation </li></ul><ul><li>The FivaTech approach </li></ul><ul><li>Data schema detection </li></ul><ul><li>Experiments </li></ul><ul><li>Conclusion </li></ul>
  4. Introduction <ul><li>The Deep Web refers to World Wide Web content that is not part of the surface Web, which is the part indexed by search engines. </li></ul><ul><ul><li>Dynamic content </li></ul></ul><ul><ul><li>Unlinked content </li></ul></ul><ul><ul><li>Private Web </li></ul></ul><ul><ul><li>Limited access content </li></ul></ul><ul><ul><li>Scripted content </li></ul></ul><ul><ul><li>Non-HTML/text content </li></ul></ul>
  5. Dynamic Web Pages <ul><li>Such pages share the same template, since they are generated by plugging data values into a predefined template. </li></ul><ul><li>The key to automatic extraction is whether we can deduce the template automatically. </li></ul><ul><ul><li>EXALG (page-level) </li></ul></ul><ul><ul><li>DEPTA (record-level) </li></ul></ul><ul><li>In this paper, we focus on page-level extraction tasks and propose a new approach, called FivaTech. </li></ul>
  6. Problem Formulation
  7. Problem Formulation
  8. Problem Formulation
  9. Problem Formulation
  10. Problem Formulation
  11. Problem Formulation
  12. Problem Formulation
  13. The FivaTech Approach <ul><li>The proposed approach, FivaTech, contains two modules: </li></ul><ul><ul><li>Tree merging </li></ul></ul><ul><ul><li>Schema detection </li></ul></ul>
  14. Tree Merging <ul><li>It merges all input DOM trees at the same time into a structure called a fixed/variant pattern tree. </li></ul><ul><ul><li>Peer node recognition </li></ul></ul><ul><ul><li>Peer matrix alignment </li></ul></ul><ul><ul><li>Pattern mining </li></ul></ul><ul><ul><li>Optional node merging </li></ul></ul>
  15. Multiple Tree Merging Algorithm
  16. Peer Node Recognition <ul><li>As each tag/node actually denotes a tree, we can use a 2-tree matching algorithm to compute whether two nodes with the same tag are similar. </li></ul><ul><ul><li>We adopt Yang’s algorithm. </li></ul></ul><ul><li>A more serious problem is score normalization. </li></ul><ul><ul><li>A typical way to compute a normalized score is the ratio of the number of pairs in the mapping to the maximum size of the two trees. </li></ul></ul>
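The typical normalization described on this slide can be sketched as follows. This is a minimal illustration, assuming that Yang's algorithm refers to the dynamic-programming procedure commonly known as simple tree matching; the `Node` class and the function names are hypothetical, and the sketch does not reproduce FivaTech's own tree merging score.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag name plus ordered children."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of pairs in a maximum order-preserving mapping between a and b."""
    if a.tag != b.tag:
        return 0  # roots with different tags cannot be mapped
    k, n = len(a.children), len(b.children)
    # m[i][j]: best mapping between the first i children of a and first j of b
    m = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            m[i][j] = max(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1] + w)
    return m[k][n] + 1  # +1 counts the matched pair of roots


def normalized_score(a: Node, b: Node) -> float:
    """Mapping pairs divided by the maximum size of the two trees."""
    return simple_tree_matching(a, b) / max(size(a), size(b))


# Two rows with the same tag but a different number of cells.
row_a = Node("tr", [Node("td"), Node("td"), Node("td")])
row_b = Node("tr", [Node("td"), Node("td")])
print(simple_tree_matching(row_a, row_b), normalized_score(row_a, row_b))
```

In this toy case the mapping has 3 pairs (the two roots plus two cells) and the larger tree has 4 nodes, so the normalized score is 0.75. Two nodes would then be treated as peers when the score exceeds some threshold, whose value is not given in the slides.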
  17. Yang’s Algorithm
  18. Tree Merging Score Algorithm
  19. Example: Given the two matched trees A and B as shown in Figure 6, where tr1 to tr6 are six similar data records, we assume that the number of mapping pairs between any two different subtrees tr_i and tr_j is 6. Assume also that the size of every tr_i is approximately 10.
  20. Peer Matrix Alignment <ul><li>After peer node recognition, all peer subtrees will be given the same symbol. </li></ul><ul><li>An aligned peer matrix </li></ul><ul><ul><li>Each row either has the same symbol in every column (except for empty columns), or consists of text nodes with variant text values (respectively, <img> nodes with variant SRC attribute values). </li></ul></ul>
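The aligned-matrix condition above can be checked with a short sketch. The cell encoding is an assumption made only for illustration: `None` marks an empty column, variant text nodes are strings prefixed with `text:`, variant `<img>` nodes are strings prefixed with `img:`, and every other string is a peer symbol. The alignment algorithm itself (slides 21 and 22) is not reproduced here.

```python
from typing import Optional, Sequence

Cell = Optional[str]  # hypothetical cell encoding, see the lead-in


def is_aligned_row(row: Sequence[Cell]) -> bool:
    """True if the row satisfies the aligned-row condition."""
    cells = [c for c in row if c is not None]  # ignore empty columns
    if len(set(cells)) <= 1:
        return True  # same symbol in every non-empty column
    # Otherwise the row must consist only of variant text or only of variant <img>.
    return (all(c.startswith("text:") for c in cells)
            or all(c.startswith("img:") for c in cells))


def is_aligned_matrix(matrix: Sequence[Sequence[Cell]]) -> bool:
    """A peer matrix is aligned when every row is aligned."""
    return all(is_aligned_row(row) for row in matrix)


aligned = [
    ["a", "a", "a"],
    ["b", None, "b"],                       # empty column is allowed
    ["text:$5", "text:$7", "text:$9"],      # variant text values
]
unaligned = [
    ["a", "b", "a"],                        # mixed symbols in one row
]
print(is_aligned_matrix(aligned), is_aligned_matrix(unaligned))
```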
  21. Matrix Alignment Algorithm
  22. getShiftColumn Function
  23. Example
  24. Pattern Mining <ul><li>This pattern mining step is designed to handle set-typed data, where multiple values occur. </li></ul><ul><li>We detect every consecutive repetitive pattern and merge its occurrences (by deleting all occurrences except for the first one), proceeding from the shortest pattern length to the longest. </li></ul>
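The merging of consecutive repeats can be sketched as below. This is a simplified illustration rather than the algorithm from the slides: it works on a flat list of symbols, the function name `merge_repeats` is hypothetical, and it does not record which merged patterns should be marked as set types.

```python
def merge_repeats(symbols: list[str]) -> list[str]:
    """Collapse consecutive repeated patterns, shortest pattern length first."""
    seq = list(symbols)
    length = 1
    while length <= len(seq) // 2:
        out: list[str] = []
        i = 0
        while i < len(seq):
            pattern = seq[i:i + length]
            j = i + length
            # Advance j past every immediately following copy of the pattern.
            while len(pattern) == length and seq[j:j + length] == pattern:
                j += length
            if j > i + length:
                out.extend(pattern)  # keep only the first occurrence
                i = j
            else:
                out.append(seq[i])   # no repeat starts here; move on by one
                i += 1
        seq = out
        length += 1
    return seq


print(merge_repeats(["a", "b", "b", "a", "b", "b", "c"]))
```

For the input above, the pass with length 1 collapses each `b b` into `b`, giving `a b a b c`, and the pass with length 2 collapses `a b a b` into `a b`, giving `a b c`.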
  25. Pattern Mining Algorithm
  26. Example
  27. Optional Node Merging <ul><li>After the mining step, we are able to detect optional nodes based on the occurrence vectors. </li></ul>
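A minimal sketch of detection from occurrence vectors follows. It assumes that an occurrence vector holds one 0/1 entry per aligned column and that a node is optional when its vector contains a 0. Grouping adjacent optional nodes that share the same vector is an additional assumption made for illustration; the slides do not spell out the merging rule, and `optional_groups` is a hypothetical name.

```python
from itertools import groupby

# Each node is (symbol, occurrence_vector); the vector has one 0/1 entry
# per aligned column (assumed encoding).
NodeOcc = tuple[str, tuple[int, ...]]


def optional_groups(nodes: list[NodeOcc]) -> list[tuple[list[str], bool]]:
    """Return (symbols, is_optional) groups in document order."""
    groups: list[tuple[list[str], bool]] = []
    key = lambda n: (tuple(n[1]), not all(n[1]))
    for (_vec, optional), items in groupby(nodes, key=key):
        symbols = [s for s, _ in items]
        if optional:
            # Adjacent optional nodes with identical vectors form one group.
            groups.append((symbols, True))
        else:
            # Nodes present in every column stay as separate mandatory nodes.
            groups.extend(([s], False) for s in symbols)
    return groups


nodes = [
    ("a", (1, 1, 1)),
    ("b", (1, 0, 1)),
    ("c", (1, 0, 1)),
    ("d", (1, 1, 1)),
    ("e", (0, 1, 1)),
]
print(optional_groups(nodes))
```

Here `b` and `c` are missing from the same column, so they are reported together as one optional group, while `e` forms a separate optional group.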
  28. Example-1
  29. Example-2
  30. Example-2
  31. Schema Detection <ul><li>Detecting the structure of a Web site includes two tasks: </li></ul><ul><ul><li>Identifying the schema. </li></ul></ul><ul><ul><li>Defining the template for each type constructor of this schema. </li></ul></ul>
  32. Identifying the Schema <ul><li>Recognize the tuple type. </li></ul><ul><li>Recognize the order of the set type and optional data. </li></ul>
  33. Schema of Example-2
  34. Defining the Template <ul><li>Templates can be obtained by segmenting the pattern tree at the reference nodes defined below: </li></ul>
  35. Defining the Template <ul><li>For any k-order type constructor < τ_1, τ_2, τ_3, …, τ_k > at node n, where every type τ_i is located at a node n_i (i = 1, 2, …, k): </li></ul><ul><ul><li>The template P will be the null template, or the one containing its reference node if it is the first data type in the schema tree. </li></ul></ul><ul><ul><li>If τ_i is a type constructor, then C_i will be the template that includes node n_i, and the respective insertion position will be 0. </li></ul></ul><ul><ul><li>If τ_i is of basic type, then C_i will be the template that is under n and includes the reference node of n_i, or null if no such template exists. </li></ul></ul><ul><ul><li>If C_i is not null, the respective insertion position will be the distance of n_i to the rightmost path of C_i. </li></ul></ul><ul><ul><li>Template C_{i+1} will be the one that has the rightmost reference node inside n, or null otherwise. </li></ul></ul>
  36. Templates of Example-2 <ul><li>T(τ_1) = (T_1, (T_2, Φ), 0) </li></ul><ul><li>T(τ_2) = (Φ, (T_3, Φ), 0) </li></ul><ul><li>T(τ_3) = (Φ, (T_4, T_5, T_21), (0,0)) </li></ul><ul><li>T(τ_4) = (Φ, (T_6, T_7, Φ), (0,0)) </li></ul><ul><li>… </li></ul><ul><li>T(τ_13) = (Φ, (T_20, Φ), 2) </li></ul>
  37. Experiments <ul><li>FivaTech as a schema extractor </li></ul><ul><li>FivaTech as an SRR (Search Result Record) extractor </li></ul>
  38. FivaTech as a schema extractor
  39. FivaTech as an SRR extractor
  40. Conclusion <ul><li>FivaTech has much higher precision than EXALG. </li></ul><ul><li>FivaTech is comparable with other record-level extraction systems like ViPER and MSE. </li></ul>
