
ExAlg Overview


Overview of the ExAlg algorithm for extracting structured data from web pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu . Presentation by Alessandro Manfredi & Marco Bontempi {a.n0on3,bontempi}@gmail.com

Published in: Technology

ExAlg Overview

  1. 1. Overview of ExAlg: Extracting Structured Data from Web Pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com
  2. 2. Introduction. WWW: billions of unstructured HTML pages. Many are dynamically generated from an underlying structured source (e.g. a relational database). Information is hard to query.
  3. 3. Real pages
  4. 4. Pages Structure
  5. 5. The goal. Fully automatic extraction of structured data, in order to pose complex queries over the data. Assigning semantic meaning to the extracted data is not an aim. Avoid human input (error prone and time consuming, especially when templates change frequently).
  6. 6. Challenges. Distinguish template from data: syntactically, nothing tells text that is part of the template apart from data. Deduce the schema of the encoded information: complex schemas, arbitrary levels of nesting.
  7. 7. Page generation
  8. 8. Data. Structured data conforms to a common Schema or Type. Types: β (Basic Type: string of tokens), where a token is a word or HTML tag (or bit, or char); Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti; Set { T }, where all elements are instances of type T.
  9. 9. Template. Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g. with A = <html><body><b> Book :</b>, B = by, C = <b>Cost :</b>, D = </body></html>, E = G = ε (empty string), F = _ (space), H = and. For x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>, λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD, where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’. Resulting page: <html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>
  10. 10. Values to extract
  11. 11. ExAlg aims to ...
  12. 12. Template extracted
  13. 13. Roles Different Roles!
  14. 14. Token Occurrence Vectors OV = <1,1,1,1>
  15. 15. Token Occurrence Vectors OV = <1,2,1,0>
  16. 16. Equivalence classes OV = <1,2,1,0> OV = <1,2,1,0> OV = <1,2,1,0> OV = <1,2,1,0>
  17. 17. Assumption. For real pages, an equivalence class of large size and support is usually derived from the same template portion. Size: # of tokens in the equivalence class. Support (of a token): # of pages in which the token occurs (where its OV entry is not 0). Tokens generated by data usually occur infrequently and therefore do not occur in an EQ that occurs very often (well ...)
  18. 18. EQs abstraction <b> Reviewer Name </b> <em> John </em> <b> Reviewer Name </b> <em> β </em>
  19. 19. LFEQs. Equivalence Class: a maximal set of tokens having the same occurrence-vector. LFEQ: Large and Frequently occurring EQuivalence class. Large = Size ≥ SizeThres; Frequently occurring = Support ≥ SupThres. SizeThres & SupThres are user-given parameters. Intuition: almost always, LFEQs are formed by tokens associated with the same type constructor in the template used to create the input pages.
  20. 20. How ExAlg Works
  21. 21. ExAlg: Modules. Input → Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes) → Stage 2: ConstTemp (Construct Template), ExVal (Extract Values) → Template, Schema, Values.
  22. 22. [Diagram: Stage 1 overview] Page generation: an instance of the Schema (β; Type < .., .., ..>τ; { T }τ; τ: Type Constructor) is encoded by λ(TS,x) with the (unknown) Template into the input pages. Steps: (1) differentiating token roles using parse-tree path; find LFEQs based on τ-association and token roles; handle invalid LFEQs (either not ordered or not correctly nested); (2) differentiating token roles using Equivalence Classes. Result: LFEQs & pages with differentiated tokens, leading to the (constructed) Template.
  23. 23. Template Construction. Recursively consider the EQ ‘e’ between consecutive tokens i and j, starting from the leading and trailing empty strings. Def.: e is ‘empty’ iff i and j always occur contiguously. If e is empty, i and j are single template tokens. If e is not empty, there are either basic types (coming from the database) or a more complex unknown type.
  24. 24. Data Extraction. Once the template has been deduced, data extraction is trivial and is therefore not covered.
  25. 25. Heuristic Assumption Summary A1 - A large number of template-tokens have unique roles, allowing the formation of EQ and subsequent differentiation A2 - A large number of tokens derive from type constructors. Each τ is instantiated many times in Ps, so LFEQs can be recognized A3 - There is no ‘regularity’ in encoded data that leads to the formation of invalid EQs A4 - There are separators around data values
  26. 26. Experiments. ExAlg has been tested on input page sets that fit the previous assumptions as well as on sets that do not. Input collections were gathered from the RISE repository, from the sites of related works such as ROADRUNNER and IEPAD, and from various well-known sites, e.g. eBay, DBLP, Google, and Sigmod Anthology.
  27. 27. Procedure. Schemas are generated manually. Values encoded in tag attributes have not been considered (e.g. urls, image src, etc.). Aggregated data has been considered a single token (e.g. dates such as “Nov 8 2008”).
  28. 28. Evaluation. For each leaf, the system's behavior is: Correct, if the extracted value is the same as the one extracted by the manually defined schema; Partially Correct, if the extracted value matches two or more leaves in the manually defined schema; Incorrect, if the extracted value is neither correct nor partially correct, i.e. the extracted value is a portion of the corresponding value in the manually defined schema or is otherwise misaligned.
  29. 29. Evaluation Correct Correct Partially Correct
  30. 30. Results. When A3 holds, none of the schema attributes are classified as incorrect. When A1 and A2 also hold, all of the schema attributes are classified as correct. Partially correct elements can be refined with very little human interaction.
  31. 31. Heuristic Assumption Summary A1 - A large number of template-tokens have unique roles, allowing the formation of EQ and subsequent differentiation A2 - A large number of tokens derive from type constructors. Each τ is instantiated many times in Ps, so LFEQs can be recognized A3 - There is no ‘regularity’ in encoded data that leads to the formation of invalid EQs A4 - There are separators around data values
  32. 32. Some Numbers. Acc. = accuracy of the system over all of the collections. Norm. Acc. = accuracy normalized with respect to the # of attributes in each collection.
  33. 33. Conclusions. EQs and token roles are used to discover the template. Failures are limited when some assumptions do not hold.
  34. 34. Open Problems. Lower bounds for the complexity of some modules' algorithms are not given. Failure for some given configurations, e.g. <ol> <li>β</li> <li>β</li> <li>β</li> </ol> <ol> <li>β</li> <li>β</li> </ol> ...: these lists do not belong to the same EQ, so the list elements (all of them, for each list) are extracted as different attributes.
  35. 35. More Details
  36. 36. Definitions It’s a long way to the top ( if you want rock ‘n roll )
  37. 37. Recall on Structured Data. Structured data conforms to a common Schema or Type. Types: β (Basic Type: string of tokens), where a token is a word or HTML tag (or bit, or char); Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti; Set { T }, where all elements are instances of type T.
  38. 38. Example. Set of pages describing books: each page contains the book title, a set of authors (first and last name), and the cost of the book. Schema of the encoded data: S₁ = < β , { <β,β>t₃ }t₂ , β >t₁, where t₁ and t₃ are tuple constructors and t₂ is a set constructor.
  39. 39. Schemas and Trees. S₁ = < β , { <β,β>t₃ }t₂ , β >t₁ drawn as a tree: the root < >t₁ has children β, { }t₂ and β; { }t₂ has the child < >t₃, whose children are β and β. Sub-trees are also schemas, called sub-schemas of the original schema.
  40. 40. Recall: Template. Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g. T₁(τ₁) = <A,B,C,D>, T₁(τ₂) = H, T₁(τ₃) = <E,F,G>, where {A..H} are strings: A = <html><body><b> Book :</b>, B = by, C = <b>Cost :</b>, D = </body></html>, E = G = ε (empty string), F = _ (space), H = and. For x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>, λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD, where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’. Resulting page: <html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b>$30.00 </body></html>
  41. 41. Encoding. The encoding λ(T,x) of an instance x of a schema S, given a template T for S, is defined recursively in terms of the encodings of the sub-values of x: if x is of basic type β, λ(T,x) = x; if x is a tuple < x1, ..., xn >τt with T(τt) = < C1, ..., Cn+1 >, λ(T,x) = C1 λ(T,x1) C2 ... Cn λ(T,xn) Cn+1; if x is a set { e1, ..., em }τs with T(τs) = ς, λ(T,x) = λ(T,e1) ς λ(T,e2) ς ... ς λ(T,em).
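The recursive encoding can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the class names `TupleValue` and `SetValue`, the constructor labels `t1`..`t3`, and the dictionary representation of the template are assumptions made for the example. The symbolic strings A..H reproduce the book example from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass
class TupleValue:
    constructor: str            # hypothetical label of the tuple constructor
    items: List["Value"]


@dataclass
class SetValue:
    constructor: str            # hypothetical label of the set constructor
    elements: List["Value"]


Value = Union[str, TupleValue, SetValue]
# Assumed representation: tuple constructor -> (C1, ..., Cn+1), set constructor -> separator.
Template = Dict[str, Union[str, Sequence[str]]]


def encode(template: Template, x: Value) -> str:
    """Return the encoding lambda(T, x) as a plain string."""
    if isinstance(x, str):                      # basic type: the value itself
        return x
    if isinstance(x, TupleValue):               # C1 x1 C2 ... Cn xn Cn+1
        strings = template[x.constructor]
        if len(strings) != len(x.items) + 1:
            raise ValueError("a tuple of n items needs n+1 template strings")
        out = [strings[0]]
        for item, following in zip(x.items, strings[1:]):
            out.append(encode(template, item))
            out.append(following)
        return "".join(out)
    if isinstance(x, SetValue):                 # e1 S e2 S ... S em
        separator = template[x.constructor]
        return separator.join(encode(template, e) for e in x.elements)
    raise TypeError(f"unsupported value: {x!r}")


# Book example with the strings A..H kept symbolic.
T1: Template = {"t1": ("A", "B", "C", "D"), "t2": "H", "t3": ("E", "F", "G")}
x1 = TupleValue("t1", [
    "t",
    SetValue("t2", [TupleValue("t3", ["f1", "l1"]),
                    TupleValue("t3", ["f2", "l2"])]),
    "c",
])
encoded = encode(T1, x1)
print(encoded)  # AtBEf1Fl1GHEf2Fl2GCcD
```

Replacing the symbolic strings with the actual HTML fragments would produce the book page itself; where exactly whitespace belongs in each fragment is not stated in the slides and would have to be chosen.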
  42. 42. Page generation
  43. 43. Simpler (infix) notation. Template: < A * B { < E * F * G > }H C * D >, with A = <html><body><b> Book :</b>, B = by, C = <b>Cost :</b>, D = </body></html>, E = G = ε (empty string), F = _ (space), H = and, where * is a wildcard for β (Basic Type) and { § }H is ‘ § (H §)+ ’. Assumption: the same values in a tuple occur in the same relative position with respect to the values of other attributes in all the pages (e.g. the book name always occurs before the list of the authors and the price).
  44. 44. Optionals and Disjunctions. ( T )? = { T }τ such that | { T }τ | is 0 or 1. ( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ
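Once a template is known in this infix form, extracting values amounts to pattern matching. The sketch below turns the book template into a regular expression. It is an illustration under assumptions: the exact whitespace inside A, B, C, D and H is chosen for the example, first and last names are assumed to be single whitespace-free words, and the set is matched with `*` so that a page with a single author also matches.

```python
import re

# Template strings; the whitespace placement is an assumption of this sketch.
A = "<html><body><b> Book :</b> "
B = " by "
C = " <b>Cost :</b> "
D = " </body></html>"
F = " "        # separator between first and last name
H = " and "    # separator between set elements

# One <E * F * G> element: first name, F, last name (E and G are empty).
AUTHOR = r"(?:\S+" + re.escape(F) + r"\S+)"

PATTERN = re.compile(
    re.escape(A) + r"(?P<title>.+?)" + re.escape(B)
    + r"(?P<authors>" + AUTHOR + r"(?:" + re.escape(H) + AUTHOR + r")*)"
    + re.escape(C) + r"(?P<cost>.+?)" + re.escape(D)
)


def extract(page):
    """Return (title, [(first, last), ...], cost), or None if the page does not match."""
    match = PATTERN.fullmatch(page)
    if match is None:
        return None
    authors = [tuple(a.split(F)) for a in match.group("authors").split(H)]
    return match.group("title"), authors, match.group("cost")


page = ("<html><body><b> Book :</b> C Programming Language by "
        "Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>")
result = extract(page)
print(result)
```

A regular expression is enough here because the example schema has a single level of nesting; deeper nesting would call for a recursive matcher.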
  45. 45. Problem Model & Formulation
  46. 46. EXTRACT Problem. “Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1,...,xn}, deduce the template T and the values from the pages alone.” It is an instance of the regular grammar inference problem from positive examples. N.B. Inference “in the limit” ( not Σ* ) from positive examples is undecidable. However, humans rarely have any ambiguity in picking the right template from real template-generated web pages, and the goal is to solve the EXTRACT problem for real web pages, i.e. produce the template and values that would be considered correct by a human.
  47. 47. Tokens. A template-token is an occurrence of a token in a template. Each page-token is created from either a template-token or a value-token. A token itself is different from its occurrences. Two page-tokens are said to have the same role iff they are generated from the same template-token.
  48. 48. What the Algorithm Does
  49. 49. ExAlg Stages. Stage 1 (ECGM: Equivalence Class Generation Module): discover tokens associated with the same type constructor by finding equivalence classes. Occurrence Vector: the occurrence-vector of a token t is defined as the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi. Equivalence Class: an equivalence class is a maximal set of tokens having the same occurrence-vector. Stage 2 (Analysis Module): use the above set to deduce the template.
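The two definitions above translate directly into code. The following is a minimal sketch, assuming pages are already tokenized (here simply by splitting on whitespace, which is an assumption of the example and not ExAlg's tokenizer); the function names are hypothetical.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token t to <f1, ..., fn>, where fi is its count in page i."""
    counts = [Counter(page) for page in pages]
    vocabulary = set()
    for page_counts in counts:
        vocabulary.update(page_counts)
    return {t: tuple(c[t] for c in counts) for t in vocabulary}


def equivalence_classes(vectors):
    """Group tokens by occurrence vector; each group is a maximal set."""
    classes = defaultdict(set)
    for token, ov in vectors.items():
        classes[ov].add(token)
    return dict(classes)


# Toy input: three "pages" of reviewer entries, tokenized on whitespace.
pages = [
    "<b> Reviewer </b> <em> John </em>".split(),
    "<b> Reviewer </b> <em> Mary </em> <b> Reviewer </b> <em> Ann </em>".split(),
    "<b> Reviewer </b> <em> Lee </em>".split(),
]
vectors = occurrence_vectors(pages)
classes = equivalence_classes(vectors)
print(vectors["<b>"])       # (1, 2, 1)
print(classes[(0, 1, 0)])   # the two names that occur only in the second page
```

In this toy input the five template-like tokens end up in one class with vector (1, 2, 1), while each data token falls into a small, low-support class.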
  50. 50. ExAlg: Modules. Input → Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes) → Stage 2: ConstTemp (Construct Template), ExVal (Extract Values) → Template, Schema, Values.
  51. 51. Stage 1
  52. 52. LFEQs Large and Frequently occurring EQuivalence classes Large: high number of tokens ( gt SizeThres ) Frequently occurring: whose tokens occur in a large number of input pages ( gt SupThres ) Intuition : Almost always, LFEQs are formed by tokens associated with the same type constructor in the (unknown) template used to create the input pages
  53. 53. Towards FindEq. Tokens associated with the same type constructor tend to occur in the same equivalence class. A token has a unique role if all the occurrences of the token in the pages are generated by a single template-token. Tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class.
  54. 54. Valid EQs. If all tokens of an equivalence class ε have unique roles and are associated with the same type constructor τj of S, ExAlg assumes that ε is derived from τj. Only in this case is the equivalence class valid. (The type constructors are not known ex ante; ExAlg has to find them.)
  55. 55. Assumption. For real pages, an equivalence class of large size and support is usually valid (*). Size: # of tokens in the equivalence class. Support (of a token): # of pages in which the token occurs. E.g. for a token with o.v. <0,1,1,1,0,0,2> the support is 4. Tokens generated by value-tokens usually occur infrequently and therefore do not occur in LFEQs.
  56. 56. Deducing template. Since LFEQs typically consist only of tokens associated with the same type constructor in the (unknown) input template, ExAlg uses LFEQs to deduce the template and the schema. Two main obstacles: the base assumption (*) is heuristic (no guarantee); and even if an LFEQ is valid, it only contains tokens with unique roles, and therefore carries only partial information about the template used to generate the pages.
  57. 57. EQs Properties. An equivalence class is ordered. The span of each occurrence of ε is subdivided into (m-1) positions (Pos). Posk is the position between tk and tk+1, in which page-tokens may occur.
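Size and support are simple to compute from an occurrence vector, and filtering equivalence classes by the two thresholds gives the LFEQs. The sketch below is illustrative: the function names are hypothetical, and it uses strict "greater than" comparisons, which is one of the two readings found in these slides (another slide states the thresholds with ≥).

```python
def support(occurrence_vector):
    """Number of pages in which the token occurs (non-zero entries)."""
    return sum(1 for f in occurrence_vector if f > 0)


def select_lfeqs(classes, size_thres, sup_thres):
    """Keep equivalence classes whose size and support exceed the thresholds.

    `classes` maps an occurrence vector to the set of tokens sharing it.
    Strict comparison is an assumption of this sketch.
    """
    return {
        ov: tokens
        for ov, tokens in classes.items()
        if len(tokens) > size_thres and support(ov) > sup_thres
    }


# The example from the slide: o.v. <0,1,1,1,0,0,2> has support 4.
print(support((0, 1, 1, 1, 0, 0, 2)))  # 4

# Hand-written toy classes over three pages.
toy_classes = {
    (1, 2, 1): {"<b>", "Reviewer", "</b>", "<em>", "</em>"},
    (0, 1, 0): {"Mary", "Ann"},
    (1, 0, 0): {"John"},
}
lfeqs = select_lfeqs(toy_classes, size_thres=2, sup_thres=2)
print(sorted(lfeqs))  # [(1, 2, 1)]
```

With such small thresholds only the template-like class survives; for real page collections the thresholds are user-given parameters, as the slides note.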
  58. 58. EQs Properties A pair of equivalence classes ε1 and ε2 is nested if the span of all occurrences of ε2 is within the Pos(p) of some occurrence of ε1 A set of equivalence classes {ε1,...,εn} is nested if every pair of equivalence classes of the set is nested.
  59. 59. Invalid EQs. Input → Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes) → Stage 2: ConstTemp (Construct Template), ExVal (Extract Values) → Template, Schema, Values.
  60. 60. HandInv. Handling invalid LFEQs, if any. Typically, invalid LFEQs are either not ordered or not nested with respect to other LFEQs. HandInv detects the existence of invalid LFEQs using violations of the ordered and nesting properties. It discards some of the LFEQs completely and breaks others into smaller LFEQs. Output: an ordered set of nested (and, with high probability, valid) LFEQs.
  61. 61. Core Concept: Token Roles
  62. 62. Differentiating roles. Typically an LFEQ only contains tokens with unique roles, so not all template-tokens can be discovered just using LFEQs. “Differentiating roles of tokens” is used in EXALG to discover a greater number of template-tokens. Different token roles are identified through different contexts, chosen such that the occurrences of a token in different contexts necessarily have different roles.
  63. 63. Roles Different Roles!
  64. 64. Differentiating roles: First Technique. This technique uses the HTML formatting of the input pages. The occurrence-path of a page-token is the path from the root to the page-token in the parse-tree. In practice, two page-tokens with different occurrence-paths have different roles. Equivalently, all page-tokens generated by a template-token have the same occurrence-path.
  65. 65. Differentiating roles: Second Technique. Use valid equivalence classes ε. The role of an occurrence of a token t outside the span of any occurrence of ε is different from the role of an occurrence within the span of some occurrence of ε. Further, the role of an occurrence of t within the Pos(l) of some occurrence of ε is different from the role of an occurrence of t within the Pos(m) of some occurrence of ε, with l≠m.
  66. 66. DTokens. Different roles mean different contexts. Def. dtoken: refers to both a token and a context, identified by differentiation. A dtoken is a token in a given context. EXALG aims to work with equivalence classes defined using dtokens.
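The occurrence-path idea can be illustrated with Python's standard `html.parser`. This is a simplified sketch, not the DiffFormat module: the class and function names are hypothetical, only word tokens (not the tags themselves) receive a path, and void elements written without a closing slash (such as `<br>`) are not handled specially.

```python
from html.parser import HTMLParser


class OccurrencePathCollector(HTMLParser):
    """Record, for every word token, the path of open tags from the root."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.tokens = []      # list of (word, occurrence_path)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.open_tags)
        for word in data.split():
            self.tokens.append((word, path))


def occurrence_paths(html):
    collector = OccurrencePathCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# The same word under two different paths gets two different contexts.
sample = "<html><body><b>Name</b><p>Name</p></body></html>"
paths = occurrence_paths(sample)
print(paths)  # [('Name', 'html/body/b'), ('Name', 'html/body/p')]
```

Treating the pair (word, path) as the unit of counting, instead of the word alone, is what separates the two occurrences of "Name" in the sample.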
  67. 67. Stage 1 Summary
  68. 68. [Diagram: Stage 1 summary] Page generation: an instance of the Schema (β; Type < .., .., ..>τ; { T }τ; τ: Type Constructor) is encoded by λ(TS,x) with the (unknown) Template into the input pages. Steps: (1) differentiating token roles using parse-tree path; find LFEQs based on τ-association and token roles; handle invalid LFEQs (either not ordered or not correctly nested); (2) differentiating token roles using Equivalence Classes. Result: dtokens, leading to the (constructed) Template.
  69. 69. ECGM Summary. Input: a set of pages P. Output: a set of LFEQs (made of dtokens) and the pages P represented as strings of dtokens. NB: dtokens are marked.
  70. 70. Modules. The DiffForm sub-module differentiates roles of tokens in P (first technique). Sub-modules FindEq, HandInv and DiffEq iterate in a loop until no new dtokens are formed. FindEq needs two parameters: SizeThres and SupThres. Equivalence classes with size and support greater than SizeThres and SupThres, respectively, are considered LFEQs.
  71. 71. Modules. Input → Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes) → Stage 2: ConstTemp (Construct Template), ExVal (Extract Values) → Template, Schema, Values.
  72. 72. Stage 2
  73. 73. PosString(ε,p). For any EQ ε, Def.: Pos(p) is ‘empty’ if dtokens dp and dp+1 always occur contiguously. An equivalence class is defined to be ‘empty’ if all of its Pos() are empty. For an occurrence of EQ ε and a non-empty Pos(p) of ε, PosString(ε,p) is the concatenation of the tokens and EQs ε’ in Pos(p), excluding the EQs within the span of ε’.
  74. 74. Construct Template. For every non-empty EQ εi, ConstTemp recursively constructs a template Tεi, and the template Tεi,p corresponding to each non-empty Pos p of εi. Its output is the template for the root-EQ. The root-EQ occurrence vector is always <1,1,...>; this is ensured by prepending and appending more than SizeThres dummy tokens to the beginning and end of each page.
  75. 75. Template Tε. Tεi = < Tεis1, ..., Tεisn ><Ci1,Ci2,...,Cin+1>. To construct the template Tεi,p, ConstTemp checks if the set of strings PosString(εi,p) has some recognizable pattern. Patterns used by the ConstTemp prototype (pattern → Tεi,p): (1) εj εj ... → { Tεj }∈; (2) εj S εj S ... εj → { Tεj }S; (3) εj OR εk → Tεj | Tεk; (4) ∈ OR εj → (Tεj)?; (5) string of dtokens and empty EQs → Tβ; (6) -unknown- → Tβ.
  76. 76. Template extracted
  77. 77. Original work: http://infolab.stanford.edu/~arvind/papers/extract-sigmod03.pdf
  78. 78. END. Overview of ExAlg: Extracting Structured Data from Web Pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com
