1212 regular meeting
Transcript

  • 1. FivaTech : Page-Level Web Data Extraction from Template Pages ICDM Workshops 2007 Reporter : Che-Min Liao
  • 2. Abstract
    • FivaTech is a page-level data extraction system which deduces the data schema and templates for the input pages generated from a CGI program.
      • Tree Merging
      • Schema Detection
  • 3. Outline
    • Introduction
    • Problem formulation
    • The FivaTech approach
    • Data schema detection
    • Experiments
    • Conclusion
  • 4. Introduction
    • The Deep Web refers to World Wide Web content that is not part of the surface Web, i.e. the part that is indexed by search engines.
      • Dynamic content
      • Unlinked content
      • Private Web
      • Limited access content
      • Scripted content
      • Non HTML/text content
  • 5. Dynamic Web Pages
    • Such pages share the same template, since they are generated from a predefined template by plugging in data values.
    • The key to automatic extraction is whether we can deduce the template automatically.
      • EXALG (page-level)
      • DEPTA (record-level)
    • In this paper, we focus on page-level extraction tasks and propose a new approach, called FivaTech.
  • 6. Problem Formulation
  • 7. Problem Formulation
  • 8. Problem Formulation
  • 9. Problem Formulation
  • 10. Problem Formulation
  • 11. Problem Formulation
  • 12. Problem Formulation
  • 13. The FivaTech Approach
    • The proposed approach FivaTech contains two modules :
      • Tree merging
      • Schema detection
  • 14. Tree Merging
    • It merges all input DOM trees at the same time into a structure called a fixed/variant pattern tree.
      • Peer node recognition
      • Peer matrix alignment
      • Pattern mining
      • Optional node merging
  • 15. Multiple Tree Merging Algorithm
  • 16. Peer Node Recognition
    • As each tag/node actually denotes a tree, we can use a two-tree matching algorithm to compute whether two nodes with the same tag are similar.
      • We adopt Yang’s algorithm
    • A more serious problem is score normalization.
      • A typical way to compute a normalized score is the ratio of the number of pairs in the mapping to the maximum size of the two trees.
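The matching and the "typical" normalization described on this slide can be sketched as follows. This is a minimal illustration, assuming that Yang's algorithm is the dynamic-programming procedure commonly known as Simple Tree Matching and that a tree is a `(tag, children)` tuple. The names `size`, `simple_tree_matching` and `normalized_score` are hypothetical, and the normalization FivaTech itself finally uses may differ.

```python
def size(tree):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = tree
    return 1 + sum(size(child) for child in children)


def simple_tree_matching(a, b):
    """Count the node pairs in a maximum order-preserving top-down mapping."""
    if a[0] != b[0]:          # different root tags: nothing can be mapped
        return 0
    ca, cb = a[1], b[1]
    # m[i][j] = best mapping size between the first i children of a
    # and the first j children of b.
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_matching(ca[i - 1], cb[j - 1]),
            )
    return m[len(ca)][len(cb)] + 1   # +1 for the matched roots


def normalized_score(a, b):
    """Mapping size divided by the size of the larger tree."""
    return simple_tree_matching(a, b) / max(size(a), size(b))


# Two table rows that differ by one extra cell.
row_a = ("tr", [("td", []), ("td", [])])
row_b = ("tr", [("td", []), ("td", []), ("td", [])])
score = normalized_score(row_a, row_b)
```

With this normalization, two trees that differ only in how many similar records they contain can score low even though they are structurally alike, which is the problem the slide refers to.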
  • 17. Yang’s Algorithm
  • 18. Tree Merging Score Algorithm
  • 19. Example
    • For example, given the two matched trees A and B as shown in Figure 6, where tr1─tr6 are six similar data records, we assume that the number of mapping pairs between any two different subtrees tr_i and tr_j is 6. Assume also that the size of every tr_i is approximately 10.
  • 20. Peer Matrix Alignment
    • After peer node recognition, all peer subtrees will be given the same symbol.
    • An aligned peer matrix
      • Each row either has the same symbol in every non-empty column, or consists of text (<img>) nodes with varying text (SRC attribute, respectively) values.
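The alignment condition above can be checked with a short sketch. It assumes a representation that is not in the slides: an empty cell is `None`, an element node is its peer symbol (a string), and a data-bearing leaf is a `(kind, value)` tuple with kind `"text"` or `"img"`. The function names are hypothetical.

```python
def row_is_aligned(row):
    """True if the non-empty cells share one symbol, or are all leaves of one kind."""
    cells = [cell for cell in row if cell is not None]
    if not cells:
        return True
    if all(isinstance(cell, tuple) for cell in cells):
        # Text (or <img>) leaves may carry different values, but must be the same kind.
        return len({kind for kind, _value in cells}) == 1
    return all(cell == cells[0] for cell in cells)


def matrix_is_aligned(matrix):
    """A peer matrix is aligned when every one of its rows is aligned."""
    return all(row_is_aligned(row) for row in matrix)


aligned = [
    ["a", "a", None],                                 # same symbol, one empty column
    [("text", "x"), ("text", "y"), ("text", "z")],    # variant text values
]
not_aligned = aligned + [["b", "c", "b"]]             # two different symbols in one row
```

This only tests the final condition; it does not perform the shifting that the alignment algorithm uses to reach it.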
  • 21. Matrix Alignment Algorithm
  • 22. getShiftColumn Function
  • 23. Example
  • 24. Pattern Mining
    • This pattern mining step is designed to handle set-typed data, where multiple values occur.
    • We detect every consecutive repetitive pattern and merge the repeats (by deleting all occurrences except the first one), working from small pattern lengths to large ones.
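The merging rule can be sketched on a flat list of peer symbols. This is a simplified illustration, not the slide's Pattern Mining Algorithm: the function name `merge_repeats` is hypothetical, and the sketch only collapses repeats without marking the surviving pattern as set-typed.

```python
def merge_repeats(symbols):
    """Collapse consecutive repeated patterns, shortest pattern length first."""
    seq = list(symbols)
    length = 1
    while length <= len(seq) // 2:
        i = 0
        while i + 2 * length <= len(seq):
            if seq[i:i + length] == seq[i + length:i + 2 * length]:
                # Keep the first occurrence and delete the adjacent repeat;
                # stay at i in case further repeats follow.
                del seq[i + length:i + 2 * length]
            else:
                i += 1
        length += 1
    return seq


merged_pairs = merge_repeats(["a", "b", "a", "b", "c", "c"])
merged_nested = merge_repeats(list("aabab"))
```

Processing short patterns first matters: in the second call, the length-1 repeat `a a` is collapsed before the length-2 pattern `a b` becomes visible as a repeat.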
  • 25. Pattern Mining Algorithm
  • 26. Example
  • 27. Optional Node Merging
    • After the mining step, we are able to detect optional nodes based on their occurrence vectors.
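A minimal sketch of this test, assuming that an occurrence vector holds one bit per occurrence (1 if the node is present, 0 if it is absent) and that a node counts as optional when any bit is 0. The names and the dictionary layout are hypothetical, and the step that merges neighbouring optional nodes is not shown.

```python
def is_optional(occurrence_vector):
    """A node is treated as optional if it is missing from at least one occurrence."""
    return any(bit == 0 for bit in occurrence_vector)


def optional_nodes(vectors):
    """Return the names of optional nodes, in their original order."""
    return [name for name, vector in vectors.items() if is_optional(vector)]


vectors = {
    "a": (1, 1, 1),   # present everywhere: mandatory
    "b": (1, 0, 1),   # missing once: optional
    "c": (0, 0, 1),   # missing twice: optional
}
found = optional_nodes(vectors)
```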
  • 28. Example-1
  • 29. Example-2
  • 30. Example-2
  • 31. Schema Detection
    • Detecting the structure of a Web site includes two tasks :
      • Identifying the schema.
      • Defining the template for each type constructor of this schema.
  • 32. Identifying the Schema
    • Recognize tuple type
    • Recognize order of the set type and optional data.
  • 33. Schema of Example-2
  • 34. Defining the Template
    • Templates can be obtained by segmenting the pattern tree at reference nodes defined below :
  • 35. Defining the Template
    • For any k-order type constructor <τ_1, τ_2, τ_3, …, τ_k> at node n, where every type τ_i is located at a node n_i (i = 1, 2, …, k):
      • The template P will be the null template or the one containing its reference node if it is the first data type in the schema tree.
      • If τ_i is a type constructor, then C_i will be the template that includes node n_i and the respective insertion position will be 0.
      • If τ_i is of basic type, then C_i will be the template that is under n and includes the reference node of n_i, or null if no such template exists.
      • If C_i is not null, the respective insertion position will be the distance of n_i to the rightmost path of C_i.
      • Template C_{i+1} will be the one that has the rightmost reference node inside n, or null otherwise.
  • 36. Templates of Example-2
    • T(τ_1) = (T_1, (T_2, Φ), 0)
    • T(τ_2) = (Φ, (T_3, Φ), 0)
    • T(τ_3) = (Φ, (T_4, T_5, T_21), (0, 0))
    • T(τ_4) = (Φ, (T_6, T_7, Φ), (0, 0))
    • T(τ_13) = (Φ, (T_20, Φ), 2)
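The triples listed on this slide can be held in a small record type. This is a sketch under assumptions: the field names are hypothetical, `None` stands for the null template Φ, and the rule that a constructor with k insertion positions has k+1 child templates is inferred from the listed examples rather than stated on the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Template:
    parent: Optional[str]                 # P; None stands for the null template
    children: Tuple[Optional[str], ...]   # the child templates C
    positions: Tuple[int, ...]            # one insertion position per data type

    @property
    def order(self) -> int:
        """Order k of the type constructor, taken from the number of positions."""
        return len(self.positions)

    def is_consistent(self) -> bool:
        """Check the inferred rule: k positions go with k+1 child templates."""
        return len(self.children) == len(self.positions) + 1


# Three of the triples from the slide, written with None for the null template.
t1 = Template("T1", ("T2", None), (0,))
t3 = Template(None, ("T4", "T5", "T21"), (0, 0))
t13 = Template(None, ("T20", None), (2,))
```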
  • 37. Experiments
    • FivaTech as a schema extractor
    • FivaTech as an SRR (Search Result Record) extractor
  • 38. FivaTech as a schema extractor
  • 39. FivaTech as an SRR Extractor
  • 40. Conclusion
    • FivaTech has much higher precision than EXALG
    • FivaTech is comparable with other record-level extraction systems like ViPER and MSE.