ExAlg Overview

Overview of the ExAlg algorithm for extracting structured data from web pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu. Presentation by Alessandro Manfredi & Marco Bontempi {a.n0on3,bontempi}@gmail.com

ExAlg Overview Presentation Transcript

  • Overview of ExAlg: Extracting Structured Data from Web Pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu. Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com
  • Introduction
    WWW: billions of unstructured HTML pages.
    Many are dynamically generated from an underlying structured source (e.g. a relational database).
    The information is hard to query.
  • Real pages
  • Pages Structure
  • The goal
    Fully automatic extraction of structured data, in order to pose complex queries over the data.
    Assigning semantic meaning to the extracted data is not an aim.
    Avoid human input (error prone and time consuming, especially with frequent template changes).
  • Challenges
    Distinguish template from data: syntactically, nothing tells text that is part of the template apart from data.
    Deduce the schema of the encoded information: complex schemas, arbitrary levels of nesting.
  • Page generation
  • Data Structure
    Conforms to a common Schema or Type.
    Types:
    β (Basic Type: string of tokens). Token: word or HTML tag (or bit, or char).
    Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti.
    Set { T }, where all elements are instances of type T.
  • Template
    Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g.
    A = <html><body><b> Book :</b>
    B = by
    C = <b>Cost :</b>
    D = </body></html>
    E, G = ε (empty string)
    F = _ (space)
    H = and
    x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
    λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
    where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’.
    Resulting page: <html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>
  • Values to extract
  • ExAlg aims to ...
  • Template extracted
  • Roles Different Roles!
  • Token Occurrence Vectors OV = <1,1,1,1>
  • Token Occurrence Vectors OV = <1,2,1,0>
  • Equivalence classes (figure: four tokens, each with OV = <1,2,1,0>)
  • Assumption
    For real pages, an equivalence class of large size and support is usually derived from the same template portion.
    Size: # of tokens in the equivalence class.
    Support (of a token): # of pages in which the token occurs (where its OV entry is not 0).
    Tokens generated by data usually occur infrequently and therefore do not occur in an EQ that occurs very often (well ...).
  • EQs abstraction
    <b> Reviewer Name </b> <em> John </em> becomes <b> Reviewer Name </b> <em> β </em>
  • LFEQs
    Equivalence Class: a maximal set of tokens having the same occurrence-vector.
    LFEQ: Large and Frequently occurring EQuivalence class.
    Large = Size ≥ SizeThres; Frequently occurring = Support ≥ SupThres.
    SizeThres & SupThres are user-given parameters.
    Intuition: almost always, LFEQs are formed by tokens associated with the same type constructor in the template used to create the input pages.
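The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization and the three toy pages are assumptions made for the example, and the thresholds use the ≥ comparison written on this slide.

```python
from collections import Counter, defaultdict

def occurrence_vectors(pages):
    """Map each token to its occurrence vector <f1, ..., fn> over the pages."""
    counts = [Counter(page) for page in pages]
    tokens = set().union(*pages)
    return {tok: tuple(c[tok] for c in counts) for tok in tokens}

def equivalence_classes(vectors):
    """Group tokens that share exactly the same occurrence vector."""
    groups = defaultdict(set)
    for tok, ov in vectors.items():
        groups[ov].add(tok)
    return dict(groups)

def find_lfeqs(classes, size_thres, sup_thres):
    """Keep the classes that are large (size) and frequent (support)."""
    result = {}
    for ov, toks in classes.items():
        support = sum(1 for f in ov if f > 0)  # pages where the tokens occur
        if len(toks) >= size_thres and support >= sup_thres:
            result[ov] = toks
    return result

# Hypothetical toy pages, tokenized on whitespace.
pages = [
    "<html> <b> Name </b> John <b> Name </b> Ann </html>".split(),
    "<html> <b> Name </b> Bob </html>".split(),
    "<html> <b> Name </b> Sue <b> Name </b> Tom <b> Name </b> Ann </html>".split(),
]

vectors = occurrence_vectors(pages)
classes = equivalence_classes(vectors)
lfeqs = find_lfeqs(classes, size_thres=2, sup_thres=3)
for ov, toks in sorted(lfeqs.items()):
    print(ov, sorted(toks))
```

In this toy input the data tokens Sue and Tom happen to share the vector (0, 0, 1) and so form an equivalence class of size 2, but its support is only 1, so the support threshold filters it out.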
  • How ExAlg Works
  • ExAlg: Modules
    Input
    Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes)
    Stage 2: ConstTemp (Construct Template), ExVal (Extract Values)
    Output: Template, Schema, Values
  • Schema: β (Basic Type), < .., .., .. >τ, { T }τ, where τ is a Type Constructor.
    An instance of the Schema is encoded with the (unknown) Template, λ(TS,x), giving the input pages.
    (1) Differentiating token roles using the parse-tree path.
    Find LFEQs based on τ-association and token roles.
    Handle invalid LFEQs (either not ordered or not correctly nested).
    (2) Differentiating token roles using Equivalence Classes.
    Result: LFEQs & pages with differentiated tokens, from which the Template is constructed.
  • Template Construction
    Recursively consider the EQ ‘e’ between consecutive tokens i and j, starting from the leading and trailing empty strings.
    Def.: e is ‘empty’ iff i and j always occur contiguously.
    If e is empty, i and j are single template tokens.
    If e is not empty, there are either basic types (coming from the database) or a more complex unknown type between them.
  • Data Extraction
    Once the template has been deduced, data extraction is trivial and is therefore not covered.
  • Heuristic Assumption Summary
    A1 - A large number of template-tokens have unique roles, allowing the formation of EQs and subsequent differentiation.
    A2 - A large number of tokens derive from type constructors. Each τ is instantiated many times in Ps, so LFEQs can be recognized.
    A3 - There is no ‘regularity’ in encoded data that leads to the formation of invalid EQs.
    A4 - There are separators around data values.
  • Experiments
    ExAlg has been tested on input page sets that fit the previous assumptions as well as on sets that do not.
    Input collections have been gathered from the RISE repository, from the sites of related works such as ROADRUNNER and IEPAD, and from various well-known sites, e.g. eBay, DBLP, Google and Sigmod Anthology.
  • Procedure
    Schemas are generated manually.
    Values encoded in tag attributes (e.g. URLs, image src, etc.) have not been considered.
    Aggregated data has been considered a single token (e.g. dates such as “Nov 8 2008”).
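As a rough illustration of why extraction becomes easy once the template is known, the sketch below hand-translates the book template from the earlier example into a regular expression. It is not the ExVal module: the `extract` function, the single-space token separator and the sample page string are assumptions, and it would break on titles containing " by " or on names with more than two words.

```python
import re

# Template strings from the book example (E and G are empty, F is a space).
A = "<html><body><b> Book :</b>"
B = "by"
C = "<b>Cost :</b>"
D = "</body></html>"
F = " "
H = "and"

# < A * B { < E * F * G > }H C * D >: each "*" becomes a lazy capture group.
PAGE_RE = re.compile(" ".join(
    [re.escape(A), "(.*?)", re.escape(B), "(.*?)", re.escape(C), "(.*?)", re.escape(D)]
))

def extract(page):
    """Return (title, [(first, last), ...], cost), or None if the page does not match."""
    match = PAGE_RE.fullmatch(page)
    if match is None:
        return None
    title, authors, cost = match.groups()
    # The set is encoded with separator H; each tuple with separator F.
    pairs = [tuple(author.split(F)) for author in authors.split(f" {H} ")]
    return title, pairs, cost

page = f"{A} C Programming Language {B} Brian Kernighan {H} Dennis Ritchie {C} $30.00 {D}"
print(extract(page))
```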
  • Evaluation
    For each leaf, the system's behaviour is:
    Correct, if the extracted value is the same as the one extracted by the manually defined schema.
    Partially Correct, if the extracted value matches two or more leaves in the manually defined schema.
    Incorrect, if the extracted value is neither correct nor partially correct, i.e. it is a portion of a value in the manually defined schema or is misaligned in some other way.
  • Evaluation (figure: example leaves labelled Correct, Correct, Partially Correct)
  • Results
    When A3 holds, none of the schema attributes is classified as incorrect.
    When A1 and A2 also hold, all of the schema attributes are classified as correct.
    Elements classified as partially correct can be refined with very little human interaction.
  • Some Numbers
    Acc. = accuracy of the system over all of the collections.
    Norm. Acc. = accuracy normalized with respect to the # of attributes of each collection.
  • Conclusions
    EQs and token roles are used to discover the template.
    Limited failures when some assumptions do not hold.
  • Open Problems
    Lower bounds for the complexity of some modules' algorithms are not given.
    Failure for some given configurations, e.g.
    <ol> <li>β</li> <li>β</li> <li>β</li> </ol> <ol> <li>β</li> <li>β</li> </ol> ...
    The lists do not belong to the same EQ, so the list elements (all of them, for each list) are extracted as different attributes.
  • More Details
  • Definitions It’s a long way to the top ( if you want rock ‘n roll )
  • Recall on Structured Data
    Conforms to a common Schema or Type.
    Types:
    β (Basic Type: string of tokens). Token: word or HTML tag (or bit, or char).
    Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti.
    Set { T }, where all elements are instances of type T.
  • Example
    Set of pages describing books: each page contains the book title, a set of authors (first and last name) and the cost of the book.
    Schema of the encoded data: S₁ = < β , { <β,β>t₃ }t₂ , β >t₁
    t₁ and t₃ are tuple constructors, t₂ is a set constructor.
  • Schemas and Trees
    S₁ = < β , { <β,β>t₃ }t₂ , β >t₁
    Tree: the root < >t₁ has children β, { }t₂ and β; { }t₂ has the single child < >t₃; < >t₃ has children β and β.
    Sub-trees are also schemas, called sub-schemas of the original schema.
  • Recall: Template
    Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g.
    T₁(τ₁) = <A,B,C,D>, T₁(τ₂) = H, T₁(τ₃) = <E,F,G>, where {A..H} are strings:
    A = <html><body><b> Book :</b>
    B = by
    C = <b>Cost :</b>
    D = </body></html>
    E, G = ε (empty string)
    F = _ (space)
    H = and
    x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
    λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
    where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’.
    Resulting page: <html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>
  • Encoding
    The encoding λ(T,x) of an instance x of a schema S, given a template T for S, is defined recursively in terms of the encodings of the sub-values of x:
    if x is of basic type β, λ(T,x) = x
    elif x is a tuple < x1, ..., xn >τt, λ(T,x) = C1 λ(T,x1) C2 ... λ(T,xn) Cn+1
    elif x is a set { e1, ..., em }τs, λ(T,x) = λ(T,e1) ς λ(T,e2) ς ... ς λ(T,em)
    where T(τt) = < C1, ..., Cn+1 > and T(τs) = ς
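A schema and its tree view can be modelled directly as a small recursive data structure. The sketch below is only illustrative: the class names `Basic`, `TupleType` and `SetType` and the helper functions are invented for this example, and constructor names are written `t1`, `t2`, `t3` in plain ASCII.

```python
from dataclasses import dataclass

@dataclass
class Basic:
    """Basic type β (a string of tokens)."""

@dataclass
class TupleType:
    name: str        # type constructor, e.g. "t1"
    children: list   # ordered component types

@dataclass
class SetType:
    name: str        # type constructor, e.g. "t2"
    child: object    # type of every element of the set

def show(schema):
    """Render a schema in the bracket notation used on the slides."""
    if isinstance(schema, Basic):
        return "β"
    if isinstance(schema, TupleType):
        return "<" + ",".join(show(c) for c in schema.children) + ">" + schema.name
    return "{" + show(schema.child) + "}" + schema.name

def sub_schemas(schema):
    """Yield the schema itself and every sub-tree (pre-order)."""
    yield schema
    if isinstance(schema, TupleType):
        for child in schema.children:
            yield from sub_schemas(child)
    elif isinstance(schema, SetType):
        yield from sub_schemas(schema.child)

# S1 = < β , { <β,β>t3 }t2 , β >t1
S1 = TupleType("t1", [Basic(), SetType("t2", TupleType("t3", [Basic(), Basic()])), Basic()])

print(show(S1))
constructors = [show(s) for s in sub_schemas(S1) if not isinstance(s, Basic)]
print(constructors)
```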
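The recursive definition can be checked against the book example with a short sketch. The value classes `TupleVal` and `SetVal`, the `encode` function and the use of single letters for the template strings are assumptions made for illustration; tuples are encoded as C1 x1 C2 ... xn Cn+1 and sets with the separator placed between consecutive elements, which is what the example λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD requires.

```python
from dataclasses import dataclass

@dataclass
class TupleVal:
    constructor: str   # name of the tuple constructor, e.g. "t1"
    items: list        # sub-values x1, ..., xn

@dataclass
class SetVal:
    constructor: str   # name of the set constructor, e.g. "t2"
    items: list        # elements e1, ..., em

def encode(template, x):
    """Return the encoding of x as a list of strings, to be concatenated."""
    if isinstance(x, str):                  # basic type β: the value itself
        return [x]
    if isinstance(x, TupleVal):             # C1 x1 C2 ... xn Cn+1
        strings = template[x.constructor]
        out = [strings[0]]
        for sub, following in zip(x.items, strings[1:]):
            out += encode(template, sub)
            out.append(following)
        return out
    if isinstance(x, SetVal):               # e1 S e2 S ... S em
        separator = template[x.constructor]
        out = []
        for i, element in enumerate(x.items):
            if i > 0:
                out.append(separator)
            out += encode(template, element)
        return out
    raise TypeError(f"unsupported value: {x!r}")

# Symbolic template T1 and instance x1 from the slides.
T1 = {"t1": ["A", "B", "C", "D"], "t2": "H", "t3": ["E", "F", "G"]}
x1 = TupleVal("t1", [
    "t",
    SetVal("t2", [TupleVal("t3", ["f1", "l1"]), TupleVal("t3", ["f2", "l2"])]),
    "c",
])

encoded = "".join(encode(T1, x1))
print(encoded)  # AtBEf1Fl1GHEf2Fl2GCcD
```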
  • Page generation
  • Simpler (infix) notation
    Template: < A * B { < E * F * G > }H C * D >
    where
    A = <html><body><b> Book :</b>
    B = by
    C = <b>Cost :</b>
    D = </body></html>
    E, G = ε (empty string)
    F = _ (space)
    H = and
    * is a wildcard for β (Basic Type)
    { § }H is ‘ § (H §)+ ’
    Assumption: the same values in a tuple occur in the same relative position with respect to the values of other attributes in all the pages (e.g. the book name always occurs before the list of the authors and the price).
  • Optionals and Disjunctions
    ( T )? = { T }τ such that | { T }τ | is 0 or 1
    ( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ
  • Problem Model & Formulation
  • EXTRACT Problem
    “Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1,...,xn}, deduce the template T and the values alone.”
    It is an instance of the regular grammar inference problem from positive examples.
    N.B. Inference “in the limit” ( not Σ* ) with positive examples is undecidable.
    However, humans rarely have any ambiguity in picking the right template from real template-generated web pages, and the goal is to solve the EXTRACT problem for real web pages, i.e. to produce the template and values that would be considered correct by a human.
  • Tokens
    A template-token is an occurrence of a token in a template.
    Each page-token is created from either a template-token or a value-token.
    A token itself is different from its occurrences.
    Two page-tokens are said to have the same role iff they are generated from the same template-token.
  • What the Algorithm Does
  • ExAlg Stages
    Stage 1 (ECGM: Equivalence Class Generation Module): discover tokens associated with the same type constructor by finding equivalence classes.
    Occurrence Vector: the occurrence-vector of a token t is defined as the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi.
    Equivalence Class: an equivalence class is a maximal set of tokens having the same occurrence-vector.
    Stage 2 (Analysis Module): use the above set to deduce the template.
  • ExAlg: Modules
    Input
    Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes)
    Stage 2: ConstTemp (Construct Template), ExVal (Extract Values)
    Output: Template, Schema, Values
  • Stage 1
  • LFEQs
    Large and Frequently occurring EQuivalence classes.
    Large: high number of tokens (greater than SizeThres).
    Frequently occurring: the tokens occur in a large number of input pages (greater than SupThres).
    Intuition: almost always, LFEQs are formed by tokens associated with the same type constructor in the (unknown) template used to create the input pages.
  • Towards FindEq
    Tokens associated with the same type constructor tend to occur in the same equivalence class.
    A token has a unique role if all the occurrences of the token in the pages are generated by a single template-token.
    Tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class.
  • Valid EQs
    If all tokens of an equivalence class ε have unique roles and are associated with the same type constructor τj of S, ExAlg assumes that ε is derived from τj.
    Only in this case is the equivalence class valid.
    (The type constructors are not known ex ante; ExAlg has to find them.)
  • Assumption
    For real pages, an equivalence class of large size and support is usually valid (*).
    Size: # of tokens in the equivalence class.
    Support (of a token): # of pages in which the token occurs. E.g. for a token with o.v. <0,1,1,1,0,0,2> the support is 4.
    Tokens generated by value-tokens usually occur infrequently and therefore do not occur in LFEQs.
  • Deducing template
    Since LFEQs typically consist only of tokens associated with the same type constructor in the (unknown) input template, ExAlg uses LFEQs to deduce the template and the schema.
    Two main obstacles:
    The base assumption (*) is heuristic (no guarantee).
    Even if an LFEQ is valid, it only contains tokens with unique roles, and therefore carries only partial information about the template used to generate the pages.
  • EQs Properties
    An equivalence class is ordered.
    The span of each occurrence of ε, with tokens t1, ..., tm, is subdivided into (m-1) positions (Pos).
    Pos(k) holds the page-tokens that may occur between tk and tk+1.
  • EQs Properties
    A pair of equivalence classes ε1 and ε2 is nested if the spans of all occurrences of ε2 lie within the Pos(p) of some occurrence of ε1.
    A set of equivalence classes {ε1,...,εn} is nested if every pair of equivalence classes of the set is nested.
  • Invalid EQs
    Input
    Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes)
    Stage 2: ConstTemp (Construct Template), ExVal (Extract Values)
    Output: Template, Schema, Values
  • HandInv
    Handles invalid LFEQs, if any.
    Typically, invalid LFEQs are either not ordered or not nested with respect to other LFEQs.
    HandInv detects the existence of invalid LFEQs through violations of the ordered and nesting properties.
    It discards some of the LFEQs completely and breaks others into smaller LFEQs.
    Output: an ordered set of nested (and, with high probability, valid) LFEQs.
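A possible way to test the ordered and nested properties on tokenized pages is sketched below. The slides do not spell out the definition of "ordered", so this sketch assumes it means that, within each page, the tokens of the class always appear as repetitions of one fixed sequence t1 ... tm. The function names and toy pages are hypothetical.

```python
def occurrences(eq_order, page):
    """Return one list of token indices per occurrence of an ordered class.

    Raises ValueError if the class tokens do not appear in the page as
    repetitions of eq_order (the assumed meaning of 'ordered').
    """
    members = set(eq_order)
    m = len(eq_order)
    idx = [i for i, tok in enumerate(page) if tok in members]
    groups = [idx[k:k + m] for k in range(0, len(idx), m)]
    for group in groups:
        if [page[i] for i in group] != list(eq_order):
            raise ValueError("class is not ordered in this page")
    return groups

def is_nested(outer_order, inner_order, pages):
    """True if every occurrence of the inner class lies inside the same
    Pos(p) of some occurrence of the outer class."""
    positions = set()
    for page in pages:
        outer = occurrences(outer_order, page)
        for occ in occurrences(inner_order, page):
            start, end = occ[0], occ[-1]      # span of this occurrence
            hit = None
            for o in outer:
                for p in range(len(o) - 1):   # Pos(p+1) lies between o[p] and o[p+1]
                    if o[p] < start and end < o[p + 1]:
                        hit = p + 1
            if hit is None:
                return False
            positions.add(hit)
    return len(positions) <= 1

pages = [
    "<html> <b> Name </b> John <b> Name </b> Ann </html>".split(),
    "<html> <b> Name </b> Bob </html>".split(),
]
html_eq = ["<html>", "</html>"]
name_eq = ["<b>", "Name", "</b>"]

print(is_nested(html_eq, name_eq, pages))  # True
print(is_nested(name_eq, html_eq, pages))  # False
```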
  • Core Concept: Token Roles
  • Differentiating roles
    Typically an LFEQ only contains tokens with unique roles, so not all template-tokens can be discovered using LFEQs alone.
    “Differentiating roles of tokens” is used in EXALG to discover a greater number of template-tokens.
    Roles are differentiated by identifying contexts such that the occurrences of a token in different contexts necessarily have different roles.
  • Roles Different Roles!
  • Differentiating roles: First Technique
    This technique uses the HTML formatting of the input pages.
    The occurrence-path of a page-token is the path from the root to the page-token in the parse-tree.
    In practice, two page-tokens with different occurrence-paths have different roles.
    Equivalently, all page-tokens generated by a template-token have the same occurrence-path.
  • Differentiating roles: Second Technique
    Use valid equivalence classes ε.
    The role of an occurrence of a token t outside the span of any occurrence of ε is different from the role of an occurrence within the span of some occurrence of ε.
    Further, the role of an occurrence of t within the Pos(l) of some occurrence of ε is different from the role of an occurrence of t within the Pos(m) of some occurrence of ε, with l≠m.
  • DTokens
    Different roles mean different contexts.
    Def. dtoken: refers to both a token and a context, identified by differentiation.
    A dtoken is a token in a given context.
    EXALG aims to work with equivalence classes defined using dtokens.
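The first technique can be approximated with Python's standard `html.parser` module. This is a sketch under simplifying assumptions, not the DiffFormat module: it assumes well-nested HTML without void elements such as `<br>`, splits text on whitespace, ignores attributes, and represents the occurrence-path as a slash-joined string of ancestor tags. The class name `PathTokenizer` is hypothetical.

```python
from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    """Collect (token, occurrence-path) pairs while parsing a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root first
        self.dtokens = []   # (token, path of ancestor tags)

    def handle_starttag(self, tag, attrs):
        self.dtokens.append((f"<{tag}>", "/".join(self.stack)))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.dtokens.append((f"</{tag}>", "/".join(self.stack)))

    def handle_data(self, data):
        for word in data.split():
            self.dtokens.append((word, "/".join(self.stack)))

parser = PathTokenizer()
parser.feed("<html><body><b>Name</b><i>Name</i></body></html>")
parser.close()

# The same word under different paths is treated as two different roles.
name_paths = [path for tok, path in parser.dtokens if tok == "Name"]
print(name_paths)  # ['html/body/b', 'html/body/i']
```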
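The second technique and the notion of a dtoken can be illustrated with a small sketch. It assumes a single valid, ordered equivalence class whose tokens are given in order, and labels every other token with the position of that class in which it occurs (0 meaning outside any occurrence). The `differentiate` function and the `(token, context)` pair representation are illustrative choices, not EXALG's data structures.

```python
def differentiate(page, eq_order):
    """Turn a token list into dtokens (token, context).

    Tokens of the class get the context "eq"; any other token gets the
    index k of the Pos(k) it falls in, or 0 outside every occurrence.
    """
    m = len(eq_order)
    index = {tok: k for k, tok in enumerate(eq_order, start=1)}
    pos = 0
    dtokens = []
    for tok in page:
        if tok in index:
            k = index[tok]
            pos = k if k < m else 0   # after the last class token we are outside
            dtokens.append((tok, "eq"))
        else:
            dtokens.append((tok, pos))
    return dtokens

page = "<h1> Price </h1> <p> Price : 30 </p>".split()
dtokens = differentiate(page, ["<h1>", "</h1>"])

# "Price" inside Pos(1) of the class and "Price" outside it become two dtokens.
price_contexts = [ctx for tok, ctx in dtokens if tok == "Price"]
print(price_contexts)  # [1, 0]
```

Re-running an occurrence-vector computation on such pairs instead of raw tokens is what lets the two uses of the same word fall into different equivalence classes.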
  • Stage 1 Summary
  • Schema: β (Basic Type), < .., .., .. >τ, { T }τ, where τ is a Type Constructor.
    An instance of the Schema is encoded with the (unknown) Template, λ(TS,x), giving the input pages.
    (1) Differentiating token roles using the parse-tree path.
    Find LFEQs based on τ-association and token roles.
    Handle invalid LFEQs (either not ordered or not correctly nested).
    (2) Differentiating token roles using Equivalence Classes.
    Result: dtokens, from which the Template is constructed.
  • ECGM Summary
    Input: a set of pages P.
    Output: a set of LFEQs (made of dtokens) and the pages P represented as strings of dtokens.
    NB: dtokens are marked.
  • Modules
    The DiffForm sub-module differentiates the roles of tokens in P (first technique).
    The sub-modules FindEq, HandInv and DiffEq iterate in a loop until no new dtokens are formed.
    FindEq needs two parameters: SizeThres and SupThres. Equivalence classes with size and support greater than SizeThres and SupThres are considered LFEQs.
  • Modules
    Input
    Stage 1: DiffFormat (Differentiate Roles using format), FindEquiv (Find Equivalence Classes), HandleInv (Handle Invalid Equivalence Classes), DiffEq (Differentiate Roles Using Eq Classes)
    Stage 2: ConstTemp (Construct Template), ExVal (Extract Values)
    Output: Template, Schema, Values
  • Stage 2
  • PosString(ε,p)
    For any EQ ε, Pos(p) is defined to be ‘empty’ if dtokens dp and dp+1 always occur contiguously.
    An equivalence class is defined to be ‘empty’ if all of its Pos() are empty.
    For an occurrence of an EQ ε and a non-empty Pos(p) of ε, PosString(ε,p) is the concatenation of the tokens and EQs ε’ in Pos(p), but not the EQs within the span of ε’.
  • Construct Template
    For every non-empty EQ εi, ConstTemp recursively constructs a template Tεi, and a template Tεi,p corresponding to each non-empty Pos p of εi.
    Its output is the template for the root-EQ.
    The occurrence vector of the root-EQ is always <1,1,...,1>; this can be guaranteed by prepending and appending more than SizeThres dummy tokens to the beginning and end of each page.
  • Template Tε
    Tεi = < Tεi,s1, ..., Tεi,sn ><Ci1,Ci2,...,Cin+1>
    To construct the template Tεi,p, ConstTemp checks whether the set of strings PosString(εi,p) has some recognizable pattern.
    Patterns used by the ConstTemp prototype (pattern → Tεi,p):
    1. εj εj ... → { Tεj }∈
    2. εj S εj S ... εj → { Tεj }S
    3. εj OR εk → Tεj | Tεk
    4. ∈ OR εj → (Tεj)?
    5. string of dtokens and empty EQs → Tβ
    6. -unknown- → Tβ
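The ‘empty’ test for positions reduces to checking whether consecutive class tokens are always adjacent. The sketch below assumes an ordered class given as a token list and pages given as token lists; `empty_positions` is a hypothetical helper, and the reviewer-name pages are made-up examples in the spirit of the earlier EQs abstraction slide.

```python
def empty_positions(eq_order, pages):
    """Return a list of m-1 booleans: entry p-1 is True iff Pos(p) is empty,
    i.e. the p-th and (p+1)-th class tokens are adjacent in every occurrence."""
    members = set(eq_order)
    m = len(eq_order)
    empty = [True] * (m - 1)
    for page in pages:
        idx = [i for i, tok in enumerate(page) if tok in members]
        for k in range(0, len(idx), m):          # one slice per occurrence
            occ = idx[k:k + m]
            for p in range(len(occ) - 1):
                if occ[p + 1] != occ[p] + 1:     # something sits in between
                    empty[p] = False
    return empty

pages = [
    "<b> Name </b> <em> John </em>".split(),
    "<b> Name </b> <em> Ann Lee </em> <b> Name </b> <em> Bob </em>".split(),
]
eq_order = ["<b>", "Name", "</b>", "<em>", "</em>"]

flags = empty_positions(eq_order, pages)
print(flags)  # [True, True, True, False]
```

Only the position between `<em>` and `</em>` is non-empty here, which is where a basic value β would be placed in the constructed template.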
  • Template extracted
  • Original work: http://infolab.stanford.edu/~arvind/papers/extract-sigmod03.pdf
  • END. Overview of ExAlg: Extracting Structured Data from Web Pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu. Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com