Information Extraction from Semi-Structured Web Pages
A Thesis Submitted in Partial Fulfillment for the Award of the Degree of Doctor of Philosophy (Ph.D.)
By Mohammed Kayed
Faculty of Science, Beni-Suef University, Egypt, 2007
Outline: Introduction; Part 1 (Chapters 2-3): A Survey and a Comparative Analysis for IE Systems; Part 2 (Chapters 4-7): FiVaTech: A New Page-Level Web Data Extraction Approach; Conclusion
Introduction. Information Extraction is a key to Web information integration: it transforms Web pages into a program-friendly format, such as a relational database.
Introduction (cont.). IE is applied in many Web applications; the data extraction problem is especially important for applications that interact with search engines.
Definitions. An IE task is defined by its input and its extraction target: a free-text IE task vs. a template-Web-pages IE task, and a record-level vs. a page-level IE task. Wrapper induction systems are systems that generate wrappers for a given IE task.
IE from Free Texts: a free-text IE task, specified by its input and its output.
IE from Template Pages: a semi-structured page containing a list of data records.
Introduction (cont.). Many IE systems have been developed: Minerva, AutoSlog, LIEP, TSIMMIS, WebOQL, W4F, XWRAP, Rapier, SRV, Whisk, NoDoSE, DEByE, Wien, Stalker, Softmealy, IEPAD, OLERA, DeLa, RoadRunner, Exalg, Depta, ViPER, ViNTs, MSE.
Low effort and satisfaction of the user's requirements on one side, high performance and a general solution on the other: a survey and a comparative analysis are therefore a necessity.
Part 1 of the Thesis: surveys the major IE systems; a taxonomy of IE systems from the users' viewpoint; three dimensions for comparison (task domain, techniques used, automation degree); criteria for each dimension.
Part 2 of the Thesis: FiVaTech — template pages generated from the deep Web; page-level Web data extraction; an unsupervised approach; works for singleton and multiple-record pages; detects the schema and (tree) templates automatically.
Part I: A Survey of Web Information Extraction Systems
Related Work: Taxonomies
- Time (MUCs): MUC, post-MUC
- Automation degree (Hsu and Dung): hand-crafted, special language, heuristic-based, WI approaches
- Automation degree (Chang and Kuo): need programmers, annotation examples, annotation-free, semi-supervised
- Extraction rules (Kushmerick): finite-state, relational learning
- Usability (Kuhlins): commercial, noncommercial
- Techniques (Laender): special languages, HTML-aware, NLP-based, WI tools, modeling-based, ontology-based
- Input & extraction rules (Muslea): free text (syntactic/semantic rules), WI tools (delimiter-based rules), online documents (delimiters, syntactic/semantic)
- Output target (Sarawagi): record-level, page-level, site-level
Three Dimensions for Comparing IE Tools (Survey, cont.)
- Automation Degree — "the degree of automation for IE systems": programmer-involved, learning-based, or annotation-free approaches.
- Techniques — "the performance of IE systems": regular expression rules vs. Prolog-like logic rules; pattern mining, deterministic FST, or probabilistic models.
- Task Domain — "why an IE system fails to handle some Web sites of particular structures": input (free text, semi-structured), output targets (record-level, page-level, site-level), task difficulties.
Task Domain: Criteria — page type; non-HTML support (NHS); extraction level; extraction target variation (missing/multi-valued attributes, multi-ordering attributes, nested data); template variations (variant formats, common formats); un-tokenized attributes (UTA).
Techniques: Criteria — scan passes; extraction rule types; features used; learning algorithms; tokenization schemes. Automation Degree: Criteria — user expertise; page-fetching support; output / API support; applicability; limitation.
Task Domain:  What are semi-structured pages?
Automation Degree:  Four approaches Manually-constructed IE tools Supervised IE systems Semi-supervised IE systems Unsupervised IE systems
Survey (cont.). Manually constructed: TSIMMIS, Minerva, WebOQL, W4F, XWrap. Users program a wrapper by hand using a general programming language (e.g., Perl) or specially designed languages.
Survey (cont.). Supervised: SRV, Rapier, Wien, Whisk, NoDoSE, Softmealy, Stalker, DEByE. General users, rather than programmers, can be trained to use the labeling GUI, which reduces the cost of wrapper generation.
Survey (cont.). Semi-supervised: IEPAD, OLERA, Thresher. Although they require no labeled training pages, post-effort from the user is required to choose the target pattern.
Survey (cont.). Unsupervised: DeLa, RoadRunner, Exalg, Depta, ViPER, MSE. They use no labeled training examples and need no user interaction to generate a wrapper.
Dimension 1:  Task Domain
Dimension 2: Techniques
Tool | Scan Pass | Extraction Rule Type | Features Used | Learning Algorithm | Tokenization Scheme
Minerva | Single | Regular exp. | HTML tags / literal words | None | Manually
TSIMMIS | Single | Regular exp. | HTML tags / literal words | None | Manually
WebOQL | Single | Regular exp. | Hypertree | None | Manually
W4F | Single | Regular exp. | DOM tree path addressing | None | Tag level
XWRAP | Single | Context-free | DOM tree | None | Tag level
RAPIER | Multiple | Logic rules | Syntactic/semantic | ILP (bottom-up) | Word level
SRV | Multiple | Logic rules | Syntactic/semantic | ILP (top-down) | Word level
WHISK | Single | Regular exp. | Syntactic/semantic | Set covering (top-down) | Word level
NoDoSE | Single | Regular exp. | HTML tags / literal words | Data modeling | Word level
DEByE | Multiple | Regular exp. | HTML tags / literal words | Data modeling | Word level
WIEN | Single | Regular exp. | HTML tags / literal words | Ad-hoc (bottom-up) | Word level
STALKER | Multiple | Regular exp. | HTML tags / literal words | Ad-hoc (bottom-up) | Word level
SoftMealy | Both | Regular exp. | HTML tags / literal words | Ad-hoc (bottom-up) | Word level
IEPAD | Single | Regular exp. | HTML tags | Pattern mining, string alignment | Multi-level
OLERA | Single | Regular exp. | HTML tags | String alignment | Multi-level
DeLa | Single | Regular exp. | HTML tags | Pattern mining | Tag level
RoadRunner | Single | Regular exp. | HTML tags | String alignment | Tag level
EXALG | Single | Regular exp. | HTML tags / literal words | Equivalent class and role differentiation by DOM tree path | Word level
DEPTA | Single | Tag tree | HTML tag tree | Pattern mining, string comparison, partial tree alignment | Tag level
ViPER | Single | Tag tree | Visual features / HTML tag tree | Pattern mining, global string alignment by divide and conquer | Tag level
MSE | Single | Tag tree | Visual features / HTML tag tree | Pattern mining with visual features | Tag level
Dimension 3: Automation Degree
Tool | User Expertise | Fetch Support | Output / API Support | Applicability | Limitation
Minerva | Programming | No | XML | High | Not restricted
TSIMMIS | Programming | No | Text | High | Not restricted
WebOQL | Programming | No | Text | High | Not restricted
W4F | Programming | Yes | XML | Medium | Not restricted
XWRAP | Programming | Yes | XML | Medium | Not restricted
RAPIER | Labeling | No | Text | Medium | Not restricted
SRV | Labeling | No | Text | Medium | Not restricted
WHISK | Labeling | No | Text | Medium | Not restricted
NoDoSE | Labeling | No | XML, OEM | Medium | Not restricted
DEByE | Labeling | Yes | XML, SQL DB | Medium | Not restricted
WIEN | Labeling | No | Text | Medium | Not restricted
STALKER | Labeling | No | Text | Medium | Not restricted
SoftMealy | Labeling | Yes | XML, SQL DB | Medium | Not restricted
IEPAD | Post labeling, pattern selection | No | Text | Low | Multiple-records page
OLERA | Partial labeling | No | XML | Low | Not restricted
DeLa | No interaction | Yes | Text | Low | Multiple-records page, more than one page
RoadRunner | No interaction | Yes | XML | Low | More than one page
EXALG | No interaction | No | Text | Low | More than one page
DEPTA | Pattern selection | No | SQL DB | Low | Multiple-records pages
ViPER | No interaction | No | SQL DB | Low | Multiple-records pages
MSE | No interaction | No | -- | Low | More than one page
Relationship Among Dimensions. Template pages allow a higher automation degree than other inputs. Semantic features are required for manual systems where the input has fewer common tags.
Overall Comparison. Practitioner: wants highly effective techniques, i.e., high recall and high precision; IE systems can then only be compared by their applicability. Semi-supervised and unsupervised systems have low applicability, while manual and supervised systems have high applicability. Researcher: which technique should be applied when tailoring current systems to a new task domain? Unsupervised techniques are hard to extend to free texts and non-template pages, whereas supervised approaches can be extended to a new task domain by adding new features.
Part II: FiVaTech: A Page-Level Web Data Extraction Approach
Problem Formulation for Data Extraction from Template Pages
Page Generation Model A Web page is generated by embedding data values  x  (taken from a Database) into a predefined template T. All data instances of the database conform to a common schema.
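To make the page generation model concrete, here is a minimal Python sketch under assumed inputs (the template string, slot names, and product values are hypothetical, not from the thesis): every page of a site is the same template T instantiated with a different data instance x.

```python
# A minimal sketch of the page generation model: data values x are embedded
# into a predefined template T. Template, slots, and values are hypothetical.
def generate_page(template: str, values: dict) -> str:
    page = template
    for slot, value in values.items():
        page = page.replace("{" + slot + "}", value)
    return page

template_T = "<html><body><b>{title}</b> <span>{price}</span></body></html>"
x1 = {"title": "Digital Camera", "price": "$99"}
x2 = {"title": "MP3 Player", "price": "$49"}

print(generate_page(template_T, x1))  # two pages, same template, different data
print(generate_page(template_T, x2))
```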
Schema. A data schema can be one of the following types: a basic type (β); a tuple type <T_1, T_2, …, T_n>; or a set type {T}. An optional type (T)? is a tuple (not a set) type with cardinality 0 or 1 for every instantiation. A disjunctive type (T_1 | T_2 | … | T_k) is a k-tuple <{T_1}, {T_2}, …, {T_k}> in which the sum of the cardinalities of the k sets equals one for every instantiation.
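As a rough illustration only (these classes are not the thesis' data structures), the five type constructors above can be modeled as follows:

```python
# A rough sketch of the schema types above (basic, tuple, set, optional,
# disjunction). Class names and fields are illustrative, not from the thesis.
from dataclasses import dataclass
from typing import List

class SchemaType:                  # common base for all type constructors
    pass

@dataclass
class Basic(SchemaType):           # basic type (beta): a leaf data value
    name: str

@dataclass
class Tuple(SchemaType):           # tuple type <T1, T2, ..., Tn>
    components: List[SchemaType]

@dataclass
class Set(SchemaType):             # set type {T}: zero or more instances of T
    element: SchemaType

@dataclass
class OptionalType(SchemaType):    # optional (T)?: cardinality 0 or 1
    element: SchemaType

@dataclass
class Disjunction(SchemaType):     # (T1 | T2 | ... | Tk): exactly one alternative
    alternatives: List[SchemaType]

# Example: a list of products, each with a name, a price, and an optional discount.
product = Tuple([Basic("name"), Basic("price"), OptionalType(Basic("discount"))])
schema = Set(product)
print(schema)
```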
String Templates. Dealing with a Web page (template & data) as a sequence of strings is not a good solution, because both templates and Web pages have a tree-like structure. Our proposed page creation model therefore treats both the template and the Web page as trees. In EXALG's string templates: if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings; if τ is a set constructor, T(τ) is a single string S_τ.
Tree Templates. T_1 ⊕_i T_2 is the new tree that results from appending tree T_2 to the i-th node (counted from the reference point) on the rightmost path of tree T_1.
Tree Template: Encoding 1. We define the encoding λ(T, x) for a type τ and its instance x as follows. If τ is of a basic type β, then λ(T, x) is a node containing x. If τ is an n-tuple, then T(τ) = [(S_1, …, S_{n+1}), (i_1, …, i_n), (j_1, …, j_n)]; if x = (x_1, …, x_n), then λ(T, x) is produced by S_1 ⊕_{i_1} λ(T, x_1) ⊕_{j_1} S_2 ⊕_{i_2} λ(T, x_2) ⊕_{j_2} S_3 … ⊕_{i_n} λ(T, x_n) ⊕_{j_n} S_{n+1}. If τ is a set constructor, then T(τ) = P; if x = {e_1, e_2, …, e_m}, then λ(T, x) is the tree obtained by inserting the m subtrees λ(T, e_1), λ(T, e_2), …, λ(T, e_m) as siblings at the leaf node on the rightmost path of P.
Example for Encoding 1: T(τ_1) = [A]; T(τ_2) = [(B ⊕_0 C, D ⊕_0 E, F, H, ε, ε, K), (0,0,0,0,0,0), (2,1,0,0,1,6)]; T(τ_3) = [(G, ε), 0, 0]; T(τ_4) = [(I, ε), 1, 0]; T(τ_5) = [(J, ε), 2, 0].
Tree Template: Encoding 2. We merge the set type with the tuple type and use the tuple template for sets as well, so that all type constructors have the same template format; we call this merged template an n-order set. The encoding λ(T, x) for a type τ and its instance x is defined as: if τ is an n-tuple constructor, then T(τ) = [P, (C_1, …, C_{n+1}), (i_1, …, i_n)]; if x = (x_1, …, x_n), then λ(T, x) is the tree produced by inserting the n+1 ordered subtrees C_1 ⊕_{i_1} λ(T, x_1), C_2 ⊕_{i_2} λ(T, x_2), …, C_n ⊕_{i_n} λ(T, x_n), and C_{n+1} as siblings at the leaf node on the rightmost path of the template P.
Example for Encoding 2: T(w_1) = [A, (B, ε), 0]; T(w_2) = [ε, (C, D, K), (0,0)]; T(w_3) = [ε, (E, F, ε), (0,0)]; T(w_4) = [ε, (ε, H, ε, ε), (0,0,0)]; T(w_5) = [ε, (ε, ε, ε), (0,0)]; T(τ_3) = [ε, (G, ε), 0]; T(τ_4) = [ε, (I, ε), 1]; T(τ_5) = [ε, (J, ε), 2].
Problem Formulation. Definition: given a set of n DOM trees, DOM_i = λ(T, x_i) (1 ≤ i ≤ n), created from some unknown template T and values {x_1, …, x_n}, deduce the template and the values from the set of DOM trees alone. We call this problem page-level information extraction. If a single page (n = 1) that contains tuple constructors is given as input, the problem is to deduce the template for the schema inside the tuple constructors; we call this a record-level information extraction task.
Multiple Tree Merging for FiVaTech
FiVaTech System Overview Given some DOM trees (Web pages) as input, we try to merge all DOM trees at the same time into a single tree called a  fixed/variant pattern tree .  From this pattern tree, we can recognize variant leaf nodes for basic-typed data and mine repetitive nodes for set-typed data.
The pattern tree collects almost all of the required information.
Fixed/Variant Tree Construction The tree merging algorithm.
Fixed/Variant Tree Construction (cont.). Assumption: basic-type data usually occur at the leaf nodes. We use a top-down approach for multiple tree merging, going from the tree level to the string level: level-by-level (multiple) string alignment that considers both missing data and multiple-valued data. At each internal node n (starting from the root), the algorithm collects all first-level child nodes of the input trees (the subtrees matched with the subtree at n) in a matrix and conducts four steps: peer node recognition, matrix alignment, pattern mining, and optional & disjunctive node identification (a structural sketch of this recursion follows below).
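A structural sketch of this recursion, with heavily simplified steps (only a tag-equality stand-in for peer recognition is shown; alignment, pattern mining, and optional/disjunctive detection are omitted), might look like:

```python
# A structural sketch (not the thesis algorithm) of the top-down merge:
# at each internal node, the first-level children of all peer subtrees are
# collected, grouped into peers, and each peer group is merged recursively.
from dataclasses import dataclass, field
from typing import List
from itertools import groupby

@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)

def merge_trees(peers: List[Node]) -> Node:
    """Merge the subtrees rooted at a list of peer nodes into one pattern node."""
    pattern = Node(peers[0].tag)
    # Collect all first-level children of the peer subtrees (the "peer matrix").
    matrix = [child for p in peers for child in p.children]
    # Step 1 (simplified peer recognition): treat children with the same tag as peers.
    # Steps 2-4 (alignment, pattern mining, optional/disjunctive detection) are omitted.
    matrix.sort(key=lambda n: n.tag)
    for _, group in groupby(matrix, key=lambda n: n.tag):
        pattern.children.append(merge_trees(list(group)))
    return pattern

# Two pages with the same template: a <div> with one vs. two <span> records.
page1 = Node("div", [Node("b"), Node("span")])
page2 = Node("div", [Node("b"), Node("span"), Node("span")])
print(merge_trees([page1, page2]))  # one pattern node per distinct child tag
```

The real algorithm replaces the tag-equality grouping with the matching-score-based peer recognition, matrix alignment, and mining steps described in the following slides.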
Peer matrix M → aligned peer matrix → aligned list → list after mining → pattern tree.
Step 1: Peer Node Recognition — the matching score normalization algorithm.
A matching score example. Assume that each tr_i (i = 1, 2, 3, 4) has 6 mappings with tr_j (j = 5, 6), and that the size of every tr_i is 10. Depta: 15/43 ≈ 0.35. FiVaTech: (1.0 + 0.6 + 0.6 + 0.6 + 0.6) / 5 = 0.68, and 0.68 + (1 / Average(43, 23)) ≈ 0.71.
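A small sketch that reproduces the FiVaTech numbers in this example; the way the average peer score and the 1/average-size term are combined is inferred from the slide's arithmetic, not quoted from the thesis:

```python
# Reproduces the matching-score example above. Combining the average pairwise
# score with a 1/average-size normalization term is assumed from the slide.
def fivatech_score(pairwise_scores, size_a, size_b):
    base = sum(pairwise_scores) / len(pairwise_scores)     # average peer score
    return base + 1.0 / ((size_a + size_b) / 2.0)          # size normalization

scores = [1.0, 0.6, 0.6, 0.6, 0.6]
print(fivatech_score(scores, 43, 23))   # ~0.71, vs. Depta's 15/43 ~ 0.35
```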
A Fixed Template Tree. Two trees may have similar structures but different functions. A fixed template tree is a tree that is part of the template; a matching score equal to 1 can be used to identify such fixed template trees.
Step 2:  Peer Matrix Alignment The peerMatrixAlignment algorithm.
Peer Matrix Alignment (cont.). Span(n_rc) is the maximum number of different nodes (without repetition) between any two consecutive occurrences of n_rc in its column c, plus one. Shifting a node n_rc in M is based on the following rules: R1: select a node with checkSpan_r(n_rc) = -1; R2: checkSpan_r(n_rc) = 1 and M[r][c] = M[r_down][c′]. If R1 and R2 both fail, divide the row into two parts, P_1 and P_2.
Peer Matrix Alignment (cont.). The spans of a, b, c, d, and e are 0, 3, 3, 3, and 0, respectively.
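A sketch of the span computation on an assumed column that reproduces these values (only the span definition comes from the previous slide; the column contents are hypothetical):

```python
# Computes span(n) for every node symbol in one column of the peer matrix:
# the maximum number of different nodes between two consecutive occurrences
# of n, plus one; 0 if n does not repeat. The column below is hypothetical.
def span(symbol, column):
    positions = [i for i, s in enumerate(column) if s == symbol]
    if len(positions) < 2:
        return 0
    return max(len(set(column[p1 + 1:p2]))
               for p1, p2 in zip(positions, positions[1:])) + 1

column = ["a", "b", "c", "d", "b", "c", "d", "b", "c", "d", "e"]
print({s: span(s, column) for s in "abcde"})   # {'a': 0, 'b': 3, 'c': 3, 'd': 3, 'e': 0}
```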
Peer Matrix Alignment (cont.) The function  alignmentResult   handles the problem of different functionalities by a clustering algorithm.
Peer Matrix Alignment (cont.). The clustering algorithm. The principle here: "just as the nodes of each row in the matrix M have the same structure, they should also have the same functionality."
Step 3:  Frequent Pattern Mining A Formal Description of a Repetitive Pattern.
Step 3: Tandem Repeat Mining — the frequent mining algorithm.
An example of tandem repeat (frequent pattern) mining.
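A minimal sketch of tandem repeat collapsing on an aligned node list; the example lists and the shortest-pattern-first greedy strategy are illustrative assumptions, not the thesis' mining algorithm:

```python
# Collapses tandem repeats in an aligned node list: a consecutive pattern that
# repeats at least twice is replaced by a single occurrence (a set-type node).
# The input lists and the shortest-pattern-first strategy are assumptions.
def collapse_tandem_repeats(nodes):
    result, i = [], 0
    while i < len(nodes):
        collapsed = False
        for length in range(1, (len(nodes) - i) // 2 + 1):
            pattern = nodes[i:i + length]
            count = 1
            while nodes[i + count * length:i + (count + 1) * length] == pattern:
                count += 1
            if count > 1:                      # tandem repeat found
                result.extend(pattern)         # keep one occurrence only
                i += count * length
                collapsed = True
                break
        if not collapsed:
            result.append(nodes[i])
            i += 1
    return result

print(collapse_tandem_repeats(["td", "tr", "tr", "tr", "td"]))   # ['td', 'tr', 'td']
print(collapse_tandem_repeats(["b", "br", "b", "br", "hr"]))     # ['b', 'br', 'hr']
```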
Step 4: Optional Node Merging. The occurrence vectors: a and e have (1,1,1); b and c have (1,1,1,1,1,1); d has (1,0,1,1,0,1), so d is optional.
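A sketch of the occurrence-vector test, using the vectors quoted above: any node whose vector contains a zero is marked optional.

```python
# A node whose occurrence vector contains a 0 (it is missing in some pages or
# records) is marked as optional; vectors are taken from the example above.
def is_optional(occurrence_vector):
    return 0 in occurrence_vector

vectors = {
    "a": (1, 1, 1),
    "e": (1, 1, 1),
    "b": (1, 1, 1, 1, 1, 1),
    "c": (1, 1, 1, 1, 1, 1),
    "d": (1, 0, 1, 1, 0, 1),
}
print({node: is_optional(v) for node, v in vectors.items()})   # only 'd' is optional
```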
A Running Example
A Running Example: the constructed fixed/variant pattern tree. The next step is identifying tuples.
Schema Detection
Data Schema Detection. Marking nodes as k-tuples: nodes with only one child that are not marked as set or optional types need not be marked as 1-tuples; nodes with more than one branch are marked as k-tuples (k ≥ 1) if k calls of the function schemaDetection return true.
Data Schema Detection (cont.) The schema  S  is the pattern tree after excluding all tag nodes that have no types.
Reference Node Identification. A reference node r is identified in the following cases: r is of a tuple type (table_2, td_4, div, span_1, span_2, span_3); the next (right) node of r is a node of type τ, where τ is either β (td_2, a, strike, br_4, br_6, "Delivery:"), a set type {} (table_1, tr_1), or a virtual node (br_2, "Features:"); or r is a leaf node on the rightmost path of a k-tuple or k-order set and is not of any type (td_5, br_3) — we call such an r a rightmost reference node.
Template Identification. Templates are identified by segmenting the pre-order traversal of the trees (skipping basic-type nodes) at every reference node.
Data Schema Detection. Given a k-tuple or k-order set <τ_1, τ_2, …, τ_k> at node n, where every type τ_i (1 ≤ i ≤ k) is located at a node n_i, the template has the form [P, (C_1, C_2, …, C_{k+1})]. For template P: if τ is the first data type in the schema tree, then P is the template containing its reference node, and ε otherwise. For template C_i: if τ_i is a tuple type, then C_i is the template that includes node n_i, and the respective insertion position is 0; if τ_i is of set or basic type, then C_i is the template under n that includes the reference node of n_i, or ε if no such template exists; if C_i is not null, the respective insertion position is the distance from n_i to the rightmost path of C_i. Template C_{k+1} is the template that has the rightmost reference node inside n, or ε otherwise.
Data Schema Detection (cont.). T(τ_1) = (T_1, (T_2, ε), 0); T(τ_2) = (ε, (T_3, ε), 0); T(τ_3) = (ε, (T_4, T_5, T_18), (0,0)); T(τ_4) = (ε, (T_6, T_7, ε), (0,0)); T(τ_5) = (ε, (T_8, T_11, ε, ε, ε), (1,0,0,0)); T(τ_6) = (ε, (T_9, T_10), 0); T(τ_7) = (ε, (ε, ε, ε), (0,0)); T(τ_8) = (ε, (T_12, ε), 1); T(τ_9) = (ε, (T_13, ε), 0); T(τ_10) = (ε, (T_14, ε), 2); T(τ_11) = (ε, (T_15, ε), 1).
FiVaTech vs. Depta: Depta mines before it aligns; it relies on the tag tree rather than the DOM tree; its partial alignment depends on the order in which trees are matched with the seed tree; it cannot differentiate data sections that contain noisy data from data sections that contain relevant data; and it cannot handle singleton pages. FiVaTech vs. EXALG: FiVaTech uses a different token granularity than EXALG; EXALG assumes that HTML tags are part of the data and that a pair of valid equivalence classes is nested.
Experiments: FiVaTech as a Schema Extractor — the comparison with the EXALG schema. Dataset: 9 Web sites from the EXALG home page.
Columns: Site | N | Manual (A_m, O_m, {}) | EXALG (A_e, O_e, {}, c, Incorr.: i, n) | FiVaTech (A_e, O_e, {}, c, Incorr.: i, n)
Amazon (Cars) | 21 | 13, 0, 5 | 15, 0, 5, 11, 4, 2 | 8, 1, 4, 8, 0, 0
Amazon (Pop) | 19 | 5, 0, 1 | 5, 0, 1, 5, 0, 0 | 5, 0, 1, 5, 0, 0
MLB | 10 | 7, 0, 4 | 7, 0, 4, 7, 0, 0 | 6, 0, 1, 6, 0, 1
RPM | 20 | 6, 1, 3 | 6, 1, 3, 6, 0, 0 | 5, 0, 3, 5, 0, 1
UEFA (Teams) | 20 | 9, 0, 0 | 9, 0, 0, 9, 0, 0 | 9, 0, 0, 9, 0, 0
UEFA (Play) | 20 | 2, 0, 1 | 4, 2, 1, 2, 2, 0 | 2, 0, 0, 2, 0, 0
E-Bay | 50 | 22, 3, 0 | 28, 2, 0, 18, 10, 4 | 20, 5, 0, 19, 1, 3
Netflix | 50 | 29, 9, 6 | 37, 2, 1, 25, 12, 4 | 34, 12, 7, 29, 5, 0
US Open | 32 | 35, 13, 10 | 42, 4, 10, 33, 9, 2 | 33, 14, 11, 33, 0, 2
Total | 242 | 128, 26, 25 | 153, 11, 23, 116, 37, 12 | 122, 32, 20, 116, 6, 7
Recall: EXALG 90.6%, FiVaTech 90.6%. Precision: EXALG 75.8%, FiVaTech 95.1%.
Experiments (cont.): FiVaTech as an SRR Extractor. To recognize the data sections of a Web site, FiVaTech identifies the set of nodes n_SRRs that are the outermost set-type nodes, i.e., the path from a node n_SRRs to the root of the schema tree contains no other set-type node. A special case occurs when an identified node n_SRRs in the schema tree has only one child node of another set type: this means the data records of the section are presented in more than one column of the Web page, and FiVaTech still catches the data (a sketch of this test follows below).
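A sketch of this test on a schema tree (the node class and the example schema are illustrative assumptions):

```python
# Finds the SRR section nodes: set-type nodes whose path to the schema root
# contains no other set-type node. The node class and the example schema tree
# are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaNode:
    name: str
    node_type: str                      # "set", "tuple", "basic", ...
    children: List["SchemaNode"] = field(default_factory=list)

def outermost_set_nodes(node: SchemaNode) -> List[SchemaNode]:
    if node.node_type == "set":
        return [node]                   # do not descend: inner sets are nested data
    result = []
    for child in node.children:
        result.extend(outermost_set_nodes(child))
    return result

# html > body > { a data section as a set of records, plus a basic footer }
schema = SchemaNode("html", "tuple", [
    SchemaNode("body", "tuple", [
        SchemaNode("records", "set", [SchemaNode("record", "tuple")]),
        SchemaNode("footer", "basic"),
    ]),
])
print([n.name for n in outermost_set_nodes(schema)])   # ['records']
```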
Experiments (cont.): FiVaTech as an SRR Extractor. Dataset: 11 Web sites from Testbed Ver. 1.02 (419 actual SRRs, 92 actual attributes).
Step 1 (SRR extraction) and Step 2 (alignment):
Measure | SRR extraction: Depta | SRR extraction: FiVaTech | Alignment: Depta | Alignment: FiVaTech
#Extracted | 248 | 409 | 93 | 91
#Correct | 226 | 401 | 45 | 82
Recall | 53.9% | 95.7% | 48.9% | 89.1%
Precision | 91.1% | 98.0% | 48.4% | 90.1%
Comparison on two further datasets (693 actual SRRs in TBDW, 1242 in MSE [55]):
Measure | TBDW: ViPER | TBDW: FiVaTech | MSE [55]: MSE | MSE [55]: FiVaTech
#Extracted | 686 | 690 | 1281 | 1260
#Correct | 676 | 672 | 1193 | 1186
Recall | 97.6% | 97.0% | 96.1% | 95.5%
Precision | 98.5% | 97.4% | 93.1% | 94.1%
Conclusions & Future Work
Conclusions. We surveyed contemporary IE tools in the literature and compared them along three dimensions: the task domain, the techniques used, and the automation degree. A set of criteria was proposed to compare and evaluate the surveyed systems, and a global comparison based on the criteria of the three dimensions was made.
Conclusions (cont.). We proposed a novel Web data extraction approach, called FiVaTech, that merges multiple DOM trees simultaneously and deduces the schema and the template of template-based Web pages. We presented a new model for dynamic page creation and a new data structure, the fixed/variant pattern tree, from which the schema and template of the input Web site can easily be deduced.
Evaluation From the 3 Dimensions. Dimension 1 — Page type: Template; NHS: No; Extraction level: Page-level; Extraction target variation: Yes; VF: No; CT: By order; UTA: No. Dimension 2 — Scan pass: Single; Extraction rule type: Tag tree; Learning algorithm: Multiple tree merging, pattern mining, string alignment; Tokenization scheme: Tag-level. Dimension 3 — User expertise: No interaction; Fetch support: No; Output: XML; Applicability: Low; Limitation: More than one page.
Future Work. Extend the analysis to string contents inside text nodes and consider situations where multiple templates are used for the same data (i.e., we plan to make FiVaTech support the two criteria UTA and VF of the first dimension). Enlarge FiVaTech into a complete information integration system by developing new page-fetching and schema-mapping techniques. We also plan to use FiVaTech in the creation of virtual Web services to speed up the realization of SOA (service-oriented architecture).