DIADEM           domain-centric intelligent automated
                 data extraction methodology




                          OPAL:
        A Passe-partout for Web
                          Forms
                                                        Xiaonan Guo
  DIADEM Group, Department of Computer Science, Oxford University
 joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio
                                               Orsi, Christian Schallhart
DIADEM ›❯ OPAL


A Scenario...
     Looking for a house?
           Too many websites to check?
           Tired of filling in every search form?




WWW, Lyon, Apr 20, 2012


A Unique Interpretation



  [Figure: domain schema for real-estate search forms; every node carries a "purpose" attribute]
     (A) Real-Estate Web Form, composed with AND / OR nodes from:
            Price Segment               1..1
            Geographic Segment          1..1
            Property-Feature Segment    1..*
            Property-Contract Segment   0..1
            Search-Option Segment       0..1
            Form-Buttons Segment        1..1
         Combined Form                  purpose.{combined}
     (B) Price Segment, composed with XOR / OR nodes from:
            Currency Element            0..1
               Currency Label           0..1
               Currency Input Field     1..1
            Price Element               1..1   priceType.{min}
            Price Element               1..1   priceType.{max}
            Price Element               1..1   priceType.{apx}
            Price Element               1..1   priceType.{range}
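The cardinalities in the schema figure can be read as constraints on how many child segments of each type a form may contain. The following is a minimal sketch of such a check, not OPAL-TL and not OPAL's implementation: the dictionary layout and the helper `violations` are hypothetical, and treating the four listed segments as direct AND-children of the Real-Estate Web Form is an assumption read off diagram (A).

```python
from collections import Counter

# (min, max) occurrences per child type; None would mean unbounded ("*").
# Assumption: these four segments hang directly under the AND node in diagram (A).
REAL_ESTATE_FORM = {
    "Price Segment": (1, 1),
    "Property-Contract Segment": (0, 1),
    "Search-Option Segment": (0, 1),
    "Form-Buttons Segment": (1, 1),
}


def violations(child_types, constraints):
    """Return one message per child type whose count breaks its cardinality."""
    counts = Counter(child_types)
    problems = []
    for seg_type, (lo, hi) in constraints.items():
        n = counts[seg_type]
        if n < lo:
            problems.append(f"{seg_type}: found {n}, need at least {lo}")
        elif hi is not None and n > hi:
            problems.append(f"{seg_type}: found {n}, allowed at most {hi}")
    return problems


if __name__ == "__main__":
    ok = ["Price Segment", "Form-Buttons Segment"]
    bad = ["Price Segment", "Price Segment"]
    print(violations(ok, REAL_ESTATE_FORM))
    print(violations(bad, REAL_ESTATE_FORM))
```

A form model that produces an empty list satisfies these constraints; any message marks a spot where a repair step would have to restructure the model.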




OPAL Overview
  Ontology-based Pattern Analysis with Logic
  multi-scope, domain-independent analysis
     combines visual, textual, and structural features
  domain-aware form interpretation
     parameterizes domain knowledge with OPAL-TL
     classifies form fields and repairs the domain form model
  template language OPAL-TL
     allows domain parameterization
     allows natural access to domain knowledge


OPAL Overview

  [Figure: bottom-up grouping of classified form elements]
     AreaBranch Element (x5)                   grouped into  Area-Branch Segment, then Geographic Segment
     Price Element(min), Price Element(max)    grouped into  Price Segment
     Geographic Segment, Price Segment, ...    grouped into  Real-Estate Form
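The grouping in this figure (elements into segments, segments into the form) can be illustrated with a small bottom-up pass. This is only a sketch under stated assumptions, not OPAL's algorithm: the `LEVELS` table, the `(type, payload)` node tuples, and the function names are hypothetical, and adjacency is the only grouping criterion used here.

```python
from itertools import groupby

# Hypothetical grouping rules, one dict per pass: child type -> parent type.
LEVELS = [
    {"AreaBranch Element": "Area-Branch Segment", "Price Element": "Price Segment"},
    {"Area-Branch Segment": "Geographic Segment"},
    {"Geographic Segment": "Real-Estate Form", "Price Segment": "Real-Estate Form"},
]


def group_level(nodes, rules):
    """Wrap each run of adjacent nodes that share a parent type into one parent node."""
    grouped = []
    for parent, run in groupby(nodes, key=lambda node: rules.get(node[0])):
        run = list(run)
        if parent is None:
            grouped.extend(run)           # no rule at this level: keep nodes as they are
        else:
            grouped.append((parent, run))  # new inner node with the run as children
    return grouped


def build_tree(fields):
    """fields: list of (type, name) leaves in document order."""
    nodes = list(fields)
    for rules in LEVELS:
        nodes = group_level(nodes, rules)
    return nodes


if __name__ == "__main__":
    leaves = [("AreaBranch Element", f"area{i}") for i in range(5)]
    leaves += [("Price Element", "min"), ("Price Element", "max")]
    tree = build_tree(leaves)
    print(tree[0][0], [child[0] for child in tree[0][1]])
```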


OPAL Evaluation
  Domain-awareness experiment
     100 real-estate forms, 100 used-car forms   >98%
  Domain-independence experiment: ICQ
     5 domains, 100 web forms                    >95%
  Domain-independence experiment: Tel8
     8 domains, 477 web forms                    >95%


Form Filling with OPAL
  [Figure: diagram relating a Webpage to the Master Form, with the steps "labeling" and "interpretation"]
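A minimal sketch of filling one concrete form from a master form, assuming free-text values are copied directly and select-box values are matched approximately. The field dictionaries (`name`, `concept`, `kind`, `options`) and both function names are hypothetical, and `difflib` string similarity merely stands in for whatever matching OPAL actually uses.

```python
import difflib


def fill_field(master_value, field):
    """Return the value to enter into `field`, or None if nothing suitable is found."""
    if field["kind"] == "text":
        return master_value  # free text is used as-is
    if field["kind"] == "select":
        # Compare case-insensitively, but return the option as it is spelled on the page.
        by_lower = {option.lower(): option for option in field["options"]}
        hits = difflib.get_close_matches(
            master_value.lower(), list(by_lower), n=1, cutoff=0.6
        )
        return by_lower[hits[0]] if hits else None
    return None


def fill_form(master_form, interpreted_fields):
    """Map each interpreted field to a value taken from the master form."""
    filled = {}
    for field in interpreted_fields:
        value = master_form.get(field["concept"])
        if value is None:
            continue  # the master form has no value for this concept
        choice = fill_field(value, field)
        if choice is not None:
            filled[field["name"]] = choice
    return filled


if __name__ == "__main__":
    master = {"location": "Oxford", "min_price": "200000"}
    fields = [
        {"name": "loc", "concept": "location", "kind": "select",
         "options": ["London", "Oxfordshire", "Cambridge"]},
        {"name": "pmin", "concept": "min_price", "kind": "text"},
    ]
    print(fill_form(master, fields))
```

The `concept` key plays the role of the domain type assigned during interpretation; without that typing the master form could not be aligned with the page's fields.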
OPAL: A Passe-partout for Web Forms
DIADEM: domain-centric intelligent automated data extraction methodology

Authors
  Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart
Digital Home
  diadem.cs.ox.ac.uk/opal/
  diadem-opal@cs.ox.ac.uk

OPAL: Automated Form Understanding for the Deep Web
  OPAL combines multi-scope domain-independent analysis with domain knowledge
    - multi-scope domain-independent analysis: field, segment, and layout scope
    - integration of three scopes yields more robust, simpler heuristics than single scope
    - strict preference for disambiguation for quality and performance reasons
  Domain knowledge on top of the domain-independent analysis for
    - classifying form fields and segments according to the domain ontology
    - verifying and repairing the form model to be consistent with domain constraints
  Datalog-based template language for easy definition of domain knowledge

OPAL: A Passe-partout for the Web
  Automatic form filling based on domain-specific master form
    - automatic (approximate) matching of master form values to values of concrete fields
    - visualization of form and segment concepts
    - automatically detects forms of the given domain and fills them
    - works on nearly any page of the domain
  OPAL outperforms previous approaches even without domain knowledge
    - with domain knowledge we achieve nearly perfect accuracy in multiple domains

Segment Scope
  Segmentation: find the “logical” structure of the form
    - segmentation tree with only fields as leaves and form segments s as inner nodes
      such that s has at least degree two and all fields in s are style-equivalent
    - segmentation labeling: distribute labels
      • to fields, if there is a regular structure
      • to segments, if single, prefix label
    - style-equivalence: two nodes n and n’
      • if same class or same type and CSS style
    - complexity: linear in document size and depth
  [Figure 4: Example for Segment Scope Labeling]
  [Figure 5: Layout Scope Labeling]
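The segment-scope label distribution (labels go to fields when the structure is regular, and a single leading label goes to the segment) can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it only compares the number of text-node groups with the number of fields, and the tuple encoding of children and the name `distribute_labels` are hypothetical.

```python
def distribute_labels(children):
    """Distribute text labels inside one segment.

    children: list of ("text", str) or ("field", str) in document order.
    Returns (segment_label, {field: [labels]}); both are empty when the
    structure is not regular in the simplified sense used here.
    """
    groups, fields, current = [], [], []
    for kind, value in children:
        if kind == "text":
            current.append(value)       # extend the current run of text nodes
        else:
            fields.append(value)
            if current:                  # a field closes the current text group
                groups.append(current)
                current = []
    if current:
        groups.append(current)

    segment_label = None
    if len(groups) == len(fields) + 1:
        # One extra group: treat the first one as the label of the segment itself.
        segment_label, groups = groups[0], groups[1:]
    if len(groups) != len(fields):
        return None, {}                  # irregular: leave labeling to another scope
    return segment_label, dict(zip(fields, groups))


if __name__ == "__main__":
    regular = [("text", "Min price"), ("field", "pmin"),
               ("text", "Max price"), ("field", "pmax")]
    prefixed = [("text", "Price"), ("field", "pmin"), ("text", "min"),
                ("field", "pmax"), ("text", "max")]
    print(distribute_labels(regular))
    print(distribute_labels(prefixed))
```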
[Figure: analysis pipeline. Trees: DOM tree, Segment tree, Layout tree, Schema tree. Scopes: Field Scope, Segment Scope, Layout Scope, Domain Scope. Outputs: Fields & Labels, Segments & Labels, Visual Labels, Form Model.]

Layout Scope
  Visual heuristics: find labels in visual proximity of a field
    - but: not “overshadowed” by another field
    - but: preference for labels preceding a field in reading order
    - t visible from a point p if north-west of p
    - f overshadows f’ for t if t visible from
      • bottom-right corner of f and f, f’ unaligned
      • bottom-left corner of f and f, f’ aligned
      • bottom-right corner of f, f, f’ aligned and f has no other visual label

Form Filling: A Passe-partout for the Web
  Master form serves as template for filling individual forms (passe-partout)
    - currently: free text values used directly for free text inputs
    - but using approximate matching for filling select boxes
  ➊ URL

Experiments
  [Figure (a): plot with axis ticks 0.94, 0.955, 0.97, 0.985, 1; data not recoverable]

Applications of OPAL
  form filling
    - ‘key’ to the web
    - all types of web automation need it
  Search
  Data Extraction

[...] n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, lines 17-19).

Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.

3.3 Layout Scope

At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, [...]

[...] w-nw-n-ne-e(t, f′) or (2) f and f′ are aligned and (i) w(t, f′) or (ii) nw-n(t, f′) and there is a text node t′ not overshadowed by another field with ne-e(t′, f′) and w-nw-n(t′, f).

To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field.

The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

4. FORM INTERPRETATION

There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain-specific peculiarities, such as “guide price”, “current offers in excess”, or payment periods in real estate. OPAL’s domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F′, t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S.

OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.

Thank You
                                                                                                                                ➋ Visualization                                                                       ➋                                      UK Real Estate (100) UK Used Car (100)                  ICQ (98)             Tel-8 (436)
                                                                                                                                                                                                                                                                                                                                                             Assistive
Domain Scope: OPAL-TL by any other field. To this end, OPAL                                                4.1 Schema Design: OPAL-TL
                                                                                                                                  Controller                                                                                                    (b)                                                                                                          Devices
    if they are not overshadowed                                                                                              1                                     TEMPLATE basic_concept<C,A> { concept<C>(N)( N@A{d,e,p} }
                                                                                                              OPAL provides a template language, OPAL - TL, for easily speci-                                                                           1
OPAL-TL extendsaDatalog¬,≠ with templates and annotation the DOM nodes:
    constructs layout tree from the CSS box labels of queries
                                                                                                                  domain schemata reusing common Web page and their con- segment<C,A> { ➌
                                                                                                                                                                                                                                                (c)
                                                                                                                                                                      2

                                                                                                          fying 2          A           4        A        ➌ concepts      TEMPLATE concept_by_                    concept<C>(N)( N@A{e,p} }            0.98
— templates = common9. The layout forms, e.g.,given DOM P is a tuple
            D EFINITION          pattern in web tree of a range specification
                                                                                                          straints as well as concept templates. To implement aconcept_minmax<C,C ,A> {
                                                                                                                                                A
                                                                                                                                                                      4
                                                                                                                                                                         TEMPLATE
                                                                                                                                                                                      new do-                                                   (d)
                                                                                                                                                                                                                                                  0.96                                       (e)
— rules = , , w, nw, nfor domain-specific element andN is thetypes DOM
        (NP constraints , ne, e, se, s, sw, aligned) where segment set of                                B        3        A                                                                            M
                                                                     P                                    main, we only need to provide (1) a set of annotators implementing
                                                                                                                                                                      6 concept<CM >(N1 )  ( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),             0.94
                                                                                                                                                                                                                                                (f)
        nodes from P, , w, nw, n, . . . the “belongs to” (containment), west,
   TEMPLATE segment<C>{                                                                                  CisLabela and isValuea and (2) an OPAL ➍ Visualization of the do-
                                                                                                                           B                             - TL specification   N1 @A{e,d},(concept<C>(N2 ) _ N2 @A{e,d})
                                                                                                                                                                                                                 ➍                                    0.92
                                                                                                                                                                                                  Ⓐ
                                                                                                                                                                      8 concept<CM >(N (
 2
        north-west, north, . . . relations from RCR [12], and aligned(x, y) interpretations. The classification yields a and structural constraints. 21)),Nchild(N1 ,G),child(N2 ,G),follows(N2N2 @A1 {d})
     segment<C>(G)( outlier<C>(G),child(N1 ,G),¬ child(N2 ,G),
                                                                            Repairing form                main types and their classification
                                                                                                           Figure 6: Example Form Labeling
                                                                                                                                                                                                                                   ,N1 ),
       ¬(concept<C>(N2 ) _ segment<C>(N2 )) }                                                                                                                Ⓐ Field concept<C>(N              2 @range_connector{e,d},¬(A1     A,                     0.9    (a) web page                                                      (b) page scope
        holds if x and y have the same height and are horizontally aligned.— annotation queries:is, however,TL extends Datalogun-
                                                                            form interpretation F, that labels annotated with certain type templates and predefined predi-
                                                                                                              OPAL -   not necessarily a model with                  10 concept<CM >(N1 )
                                                                                                                                                                                                  Ⓑ
                                                                                                                                                                                           ( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),                        Airfare              Auto              Book                Job             US R.E.

                                                                                     and may contain direct only or also group labels? derived from ex- Segment@A{e,p},N2 @A{e,p}, (N1 @min{e,p},N2 @max{e,p})
                                                                                                                                                             Ⓑ
 4
                                                                            der S,direct?either provided by human domain experts or We
                                                                                         are look for violations ofconvenient querying of annotations and DOM nodes. An                                                                                                                                                                                                     Multi-modal Input
   TEMPLATE segment_range<C,CM > {                                            •                           cates for structural constraints.                                  N1
     segment<C>(G)( outlier<C>(G),concept<CM >(N1 ),concept<CM >(N2 ), layout the ternalat fields and segments and the segment hierarchy current OPAL automatically (N @max{e,p},N @min{e,p})
            We call w, nw, . . . the neighbour relations. The adapt tree is of           types sources such as DBPedia and Freebase. The
                                                                                                                                                                                                                                                        Figure 4: Holbrook Moran form and page-scope labeling
                                                                                                                                                                                                                                                                                                                                                                            (Touch-Screens)
 6                                                                                                                                                         filled     12         _    1             2                                                                              field             segment         layout        domain
                                                                              • proper? look for labels (‘price’) or also values (‘£10’)
        most quadratic in size of a given DOM P and canof•F with version contains a of labels below to artefacts for common domain labeling F and a DOM
                                                                             be computed majority describedprogramconstruct a form
                                                                                          the rewriting rules large set of such is executed against a form
                                                                                                          OPAL - TL
       N1 6= N2 ,child(N1 ,G),child(N2 ,G) }
                                                                                                                                                           according to master
 8
                                                                            modelexclusive? as price,Relationstheof the type P strati-
                                                                                     compliant with S. OPALlocation,are date. F and a are mapped in the obvious way to OPAL - OPAL - TL classification templates
                                                                                                          P. performsorfrom   rewriting in
   TEMPLATE O(|P|2 ). For convenience, we write, e.g., w-nw-n to denote the
        in segment_with_unique<C,U> {                                                    types such                                                        form                       Figure 7:                                                   1           Real-estate                                    1                                                                                  Web Automation
                                                                            fied manner to guarantee termination only use child (descendant, resp.) for the child (descen-
                                                                                                                     and introduces at most n new
                                                                                            D EFINITION TL. We ain the form.
     segment<C>(G)( outlier<C>(G),child(N1 ,G), concept<U>(N1 ,G),         — template: limited second-order, same data complexity                                                                                                                                                                                                                                                                 & Testing
10
        union of the relations w, nw, and n.                                segments where n is the number  11. of fields form labeling F on a DOM P and an
                                                                                                                 Given
       ¬ child(N2 ,G),N1 6= N2 ,¬(concept<C>(N2 )_segment<C>(N2 )) . }                                                                                   ➎ Master formsibling order                                                             0.99

            In cultures with left-to-right reading direction, we observe a strong schema there isOPAL - TL annotationtqueryextend document and an example, the following template defines a family of con-
                                                                              • template variablesIfquantify over predicates or We is an expres-
                                                                               (1) Under Segmentation: L, resp.) relation intype such
                                                                                         annotation       dant, an a segment n with F. types                               As                                                                                                                 Precision                                        domain
12
                                                                                                                                                           provides passe-                                                                      0.98
                                                                            that Cinstantiation reduceschildto F: follows(X,Yfirst-order variable, Rfollowing (f (X), f (Y )) 2 the domain type D to a node N whenever N
                                                                                                                                                                        straints that associate                                                                 Used-car
   TEMPLATE outlier<C>{                                                       • T (t) requires additional to Datalog where X is ,a . .) for X,Y 2 F, if
                                                                                                          from P segments
                                                                                         sion of the form: X@A{d, p, e} of type t1 . ,tk 62                                                                                                                                                   Recall                                           layout
        preference for placing labels in the w-nw-n region)) } child-T (n), we try to partition the children of n into k + 1 partitions
                                                  0 ,P),¬(segment<C>(G0 from a field. How-
                                                                                                                                                                                                                                                                                                        0.8
                                                                                         A 2 A, and d, P and e are annotation modifiers. An annotation for filling
                                                                                                                                                           partout      is labeled by an exclusive direct and proper annotation of type A.                                                    F-score                                          segment
                                                                                                           p,
        ever, forms often have many fields interspersed with field. .labels and that Piwithandi )no otherholds. .for } F X T (t). between X and Y in document or-
14   outlier<C>(G)( root(G)_child(G,P),child(G                             INSTANTIATE basic_concept<D,A> using node in        {<RADIUS, occurs
                                                                                                                                            radius>}                                                                                            0.97
                                                                                                                                                                                                                                                                            0.6              0.7             0.8            0.9              1 field
                                                                            P1 , . , Pk ,query X@Aµ |= CT (t {d, p, n [ {t1 , . ,tk all C 2 J Aµ K with individual forms
                                                                                         Pn such              µ ✓ and Pe} (X,Y ), |=
        segment labels. Thusstructural constraints
                 Figure 8: OPAL - TL we have to carefully consider overshadowing. a new segment node as
                                                                            For each Pi we add
                                                                                                          der; adjacentchild of n,ifclassify it (f (X), f (Y )) 2 P or vice versa.
                                                                                                                                           Rnext-sibling                 TEMPLATE basic_concept<D,A> { concept<D>(N) ( N@A{e,d,l} }             0.96
                                                                                                                                                                                                       Figure 3: OPAL Interface
                                                                                               @Aµ nodes 2 P : Allowµ (n)  Matchµ labell 0} (X)) µ (A)
        Intuitively, for a field f , a visible text node t is overshadowedJby all K = {nassigned wePiabbreviate (A) 6= (f  Blockas l(X).
                                                                                                          Finally, to from n to that segment. /                                                                                                 0.95                                                     0.6
                                                                            with ti , and move
                                                                                In practice, few cases of multiple under segmentations occur at the                     A template tpl is instantiated to produce a family of rules where                Real estate
                                                                                                                                                                                                                                                         Real-estate       Used car
                                                                                                                                                                                                                                                                           Used-car                                Real-estate Used-car
 first two templates). It is0 the only template 0with two concept tem-                            0
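As a rough illustration of "instantiation reduces to Datalog", the sketch below expands a template body into one plain rule per binding tuple, in the spirit of INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}. It is not OPAL's implementation: the function name instantiate, the ASCII arrow "<=", the string-based rule representation, and the second binding <PRICE, price> are assumptions made for illustration only.

```python
import re

def instantiate(template_body, params, bindings):
    """Expand a template body into one rule string per binding tuple.

    template_body: rule text that mentions the template parameters
    params:        template parameter names, e.g. ["C", "A"]
    bindings:      tuples of concrete names, one tuple per instantiation
    """
    # Match parameters only as whole words, so the "c" inside
    # "concept" or the modifier letters "d,e,p" are left untouched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, params)) + r")\b")
    rules = []
    for values in bindings:
        if len(values) != len(params):
            raise ValueError("binding arity does not match template parameters")
        mapping = dict(zip(params, values))
        rules.append(pattern.sub(lambda m: mapping[m.group(1)], template_body))
    return rules

# Body of the basic_concept template, with "<=" standing in for the rule arrow.
BASIC_CONCEPT = "concept<C>(N) <= N@A{d,e,p}"

# <RADIUS, radius> comes from the poster; <PRICE, price> is a made-up binding.
rules = instantiate(BASIC_CONCEPT, ["C", "A"],
                    [("RADIUS", "radius"), ("PRICE", "price")])
for rule in rules:
    print(rule)
```

Each resulting string contains no template variables any more, which is the sense in which a family of ordinary rules is produced from a single template.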
OPAL: A Passe-partout for Web Forms
DIADEM: domain-centric intelligent automated data extraction methodology

Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart
Digital Home: diadem.cs.ox.ac.uk/opal/
diadem-opal@cs.ox.ac.uk

Segment Scope
Segmentation: find the "logical" structure of the form
- segmentation tree with only fields as leaves and form segments s as inner nodes, such that s has at least degree two and all fields in s are style-equivalent
- segmentation labeling: distribute labels
  • to fields, if there is a regular structure
  • to segments, if single, prefix label
- style-equivalence: two nodes n and n' are style-equivalent if they have the same class, or the same type and CSS style
- complexity: linear in document size and depth

Figure 4: Example for Segment Scope Labeling
Figure 5: Layout Scope Labeling

OPAL: Automated Form Understanding for the Deep Web
OPAL combines multi-scope domain-independent analysis with domain knowledge
- multi-scope domain-independent analysis: field, segment, and layout scope
- integration of three scopes yields more robust, simpler heuristics than single scope
- strict preference for disambiguation for quality and performance reasons
Domain knowledge on top of the domain-independent analysis for
- classifying form fields and segments according to the domain ontology
- verifying and repairing the form model to be consistent with domain constraints
- Datalog-based template language for easy definition of domain knowledge

OPAL: A Passe-partout for the Web
Automatic form filling based on domain-specific master form
- automatic (approximate) matching of master form values to values of concrete fields
- visualization of form and segment concepts
- automatically detects forms of the given domain and fills them
- works on nearly any page of the domain
OPAL outperforms previous approaches even without domain knowledge
- with domain knowledge we achieve nearly perfect accuracy in multiple domains
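The approximate matching of master form values to the values of concrete fields can be sketched as follows. This is only an illustration under stated assumptions: the poster does not say which similarity measure OPAL uses, so Python's standard difflib is swapped in, and the field dictionaries, the function name fill_field, and the 0.6 cutoff are hypothetical.

```python
from difflib import get_close_matches

def fill_field(field, master_value, cutoff=0.6):
    """Pick the value to enter into one concrete form field.

    Free-text inputs receive the master form value unchanged; select
    boxes receive the option most similar to it, or None when no
    option is similar enough.
    """
    if field["kind"] == "text":
        return master_value
    # Compare case-insensitively, but return the option as it is spelled
    # on the page.
    by_lowercase = {option.lower(): option for option in field["options"]}
    best = get_close_matches(master_value.lower(), list(by_lowercase),
                             n=1, cutoff=cutoff)
    return by_lowercase[best[0]] if best else None

# Hypothetical fields of a real-estate search form.
location = {"kind": "text"}
property_type = {"kind": "select",
                 "options": ["Any", "Flat / Apartment",
                             "Detached house", "Bungalow"]}

print(fill_field(location, "Oxford"))
print(fill_field(property_type, "detached"))
```

Returning None for a select box with no sufficiently similar option leaves that field at its default, which is one plausible choice when the master form has no counterpart value on a given site.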
DOM tree            Segment tree          Layout tree      Schema tree
Fields & Labels     Segments & Labels     Visual Labels    Form Model
Field Scope         Segment Scope         Layout Scope     Domain Scope

Field Scope
Visual heuristics: find labels in visual proximity of a field

Thank You
Booth No. 2

n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19).

Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and

w-nw-n-ne-e(t, f′) or (2) f and f′ are aligned and (i) w(t, f′) or (ii) nw-n(t, f′) and there is a text node t′ not overshadowed by another field with ne-e(t′, f′) and w-nw-n(t′, f).

To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field.

The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

4. FORM INTERPRETATION
                                                                                                             There is no straightforward relationship between form fields for
                                                                                                                                 T4
       white circles DOM “overshadowed” segment tree. The numbers in-
                 — but: not nodes not in the by another field
                                                                                                        T2
                                                                                                          domain concepts, such as location or price, and their structure within
       dicate which text nodes are assigned as labels tofield in reading order
                 — but: preference for labels preceding a which segments or                                    F2
                                                                                                          a form. Even seemingly domain-independent concepts, such as
       fields. E.g., for the left hand segment, we observe a regular struc-
                                                                                                          price,Toften exhibit domain NE specific peculiarities, such as “guide
       ture of (text node+, field)+ and thuspwe assign the i-th group of
                        — t visible from a point if north-west of p                                               1      NW      N
                                                                                                          price”, “current offers in excess”, or payment periods in real es-
       text nodes to the i-th field. For the for t if hand segment (4), we find T3
                        — f overshadows f’ right t visible from                                            F3              W F1         E
                                                                                                          tate. OPAL’s domain schemata allow us to cover these specifics.
       a subsegment (5) • bottom-right corner of f and f, f’ unaligned node
                           and field 8 that is already labeled with text
                                                                                                          We recall from Section 2 that a form model (F 0 , t) for a schema S
                                                                                                                         SW      S      SE
       8 in the field scope. •Thus 8 is ignored and fonly f, f’ aligned
                                 bottom-left corner of and one text node re-
                                                                                                          is derived from a form labeling F by extending F with types and
                                                                                                          restructuring its inner nodes to fit the structuralFilling: Aof S.
                                                                                                                                                   Form constraints Passe-partout for the Web                                                   Experiments                                                                                                Applications of OPAL
       mains directly in 4, which becomes the segmentof f, f, f'In 5, weand f
                                  •    bottom-right corner label. aligned find
       one more text node group than fields and visualconsider the first text
                                       has no other thus label
                                                                                                                                                                                                                                                                                                                                                            form filling
                                                                                                             OPAL performs form interpretation of a form labeling Fas template for filling individual forms (passe-partout)
                                                                                                                                                    Master form serves in two                                                                                                                                                                                                     Search
       node group as a segment label. The remaining nodes have a regular                                                                                                                                                                                 1
                                                                                                          steps: (1) the classification of nodes in — according free text values used directly for free text inputs
                                                                                                                                                     F currently: to the domain                                                                                                                                                                             —‘key’ to the web                    Data Extraction
       structure (field, text node+)+ and get assigned accordingly.                                                                                                                                                                                0.985

                                                                                Layout Scope              types T to obtain a partial typing tP . Thisbut using approximate matching for filling select boxes
                                                                                                                                                    — step relies on the anno-                                                                                                                                                                              — all types of web

       3.3 Layout Scope                                                                                   tation schema L and its typing of labels in F; (2) the model repair                                                                         0.97                                                                                                     automation need it
                                                                                                          where the segmentation structure derived in the segmentation scope
       At layout scope, we further refine the form labeling for each                                                                                ➊ URL
                                                                                                          (Section 3.2) is aligned with the structure constraints of S.
                                                                                                                                                                                         ➊                                                        0.955


    form field not yet labelled in field or segment scope, by exploring                                                                                                                                                                           (a)0.94
    the visible text nodes in the west, north-west, or north quadrant,
                                                                                                                                ➋ Visualization                                                                       ➋                                      UK Real Estate (100) UK Used Car (100)                  ICQ (98)             Tel-8 (436)
                                                                                                                                                                                                                                                                                                                                                             Assistive
Domain Scope: OPAL-TL by any other field. To this end, OPAL                                                4.1 Schema Design: OPAL-TL
                                                                                                                                  Controller                                                                                                    (b)                                                                                                          Devices
    if they are not overshadowed                                                                                              1                                     TEMPLATE basic_concept<C,A> { concept<C>(N)( N@A{d,e,p} }
                                                                                                              OPAL provides a template language, OPAL - TL, for easily speci-                                                                           1
OPAL-TL extendsaDatalog¬,≠ with templates and annotation the DOM nodes:
    constructs layout tree from the CSS box labels of queries
                                                                                                                  domain schemata reusing common Web page and their con- segment<C,A> { ➌
                                                                                                                                                                                                                                                (c)
                                                                                                                                                                      2

                                                                                                          fying 2          A           4        A        ➌ concepts      TEMPLATE concept_by_                    concept<C>(N)( N@A{e,p} }            0.98
— templates = common9. The layout forms, e.g.,given DOM P is a tuple
            D EFINITION          pattern in web tree of a range specification
                                                                                                          straints as well as concept templates. To implement aconcept_minmax<C,C ,A> {
                                                                                                                                                A
                                                                                                                                                                      4
                                                                                                                                                                         TEMPLATE
                                                                                                                                                                                      new do-                                                   (d)
                                                                                                                                                                                                                                                  0.96                                       (e)
— rules = , , w, nw, nfor domain-specific element andN is thetypes DOM
        (NP constraints , ne, e, se, s, sw, aligned) where segment set of                                B        3        A                                                                            M
                                                                     P                                    main, we only need to provide (1) a set of annotators implementing
                                                                                                                                                                      6 concept<CM >(N1 )  ( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),             0.94
                                                                                                                                                                                                                                                (f)
        nodes from P, , w, nw, n, . . . the “belongs to” (containment), west,
   TEMPLATE segment<C>{                                                                                  CisLabela and isValuea and (2) an OPAL ➍ Visualization of the do-
                                                                                                                           B                             - TL specification   N1 @A{e,d},(concept<C>(N2 ) _ N2 @A{e,d})
                                                                                                                                                                                                                 ➍                                    0.92
                                                                                                                                                                                                  Ⓐ
                                                                                                                                                                      8 concept<CM >(N (
 2
        north-west, north, . . . relations from RCR [12], and aligned(x, y) interpretations. The classification yields a and structural constraints. 21)),Nchild(N1 ,G),child(N2 ,G),follows(N2N2 @A1 {d})
     segment<C>(G)( outlier<C>(G),child(N1 ,G),¬ child(N2 ,G),
                                                                            Repairing form                main types and their classification
                                                                                                           Figure 6: Example Form Labeling
                                                                                                                                                                                                                                   ,N1 ),
       ¬(concept<C>(N2 ) _ segment<C>(N2 )) }                                                                                                                Ⓐ Field concept<C>(N              2 @range_connector{e,d},¬(A1     A,                     0.9    (a) web page                                                      (b) page scope
        holds if x and y have the same height and are horizontally aligned.— annotation queries:is, however,TL extends Datalogun-
                                                                            form interpretation F, that labels annotated with certain type templates and predefined predi-
                                                                                                              OPAL -   not necessarily a model with                  10 concept<CM >(N1 )
                                                                                                                                                                                                  Ⓑ
                                                                                                                                                                                           ( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),                        Airfare              Auto              Book                Job             US R.E.

                                                                                     and may contain direct only or also group labels? derived from ex- Segment@A{e,p},N2 @A{e,p}, (N1 @min{e,p},N2 @max{e,p})
                                                                                                                                                             Ⓑ
 4
                                                                            der S,direct?either provided by human domain experts or We
                                                                                         are look for violations ofconvenient querying of annotations and DOM nodes. An                                                                                                                                                                                                     Multi-modal Input
   TEMPLATE segment_range<C,CM > {                                            •                           cates for structural constraints.                                  N1
     segment<C>(G)( outlier<C>(G),concept<CM >(N1 ),concept<CM >(N2 ), layout the ternalat fields and segments and the segment hierarchy current OPAL automatically (N @max{e,p},N @min{e,p})
            We call w, nw, . . . the neighbour relations. The adapt tree is of           types sources such as DBPedia and Freebase. The
                                                                                                                                                                                                                                                        Figure 4: Holbrook Moran form and page-scope labeling
                                                                                                                                                                                                                                                                                                                                                                            (Touch-Screens)
 6                                                                                                                                                         filled     12         _    1             2                                                                              field             segment         layout        domain
                                                                              • proper? look for labels (‘price’) or also values (‘£10’)
        most quadratic in size of a given DOM P and canof•F with version contains a of labels below to artefacts for common domain labeling F and a DOM
                                                                             be computed majority describedprogramconstruct a form
                                                                                          the rewriting rules large set of such is executed against a form
                                                                                                          OPAL - TL
       N1 6= N2 ,child(N1 ,G),child(N2 ,G) }
                                                                                                                                                           according to master
 8
                                                                            modelexclusive? as price,Relationstheof the type P strati-
                                                                                     compliant with S. OPALlocation,are date. F and a are mapped in the obvious way to OPAL - OPAL - TL classification templates
                                                                                                          P. performsorfrom   rewriting in
   TEMPLATE O(|P|2 ). For convenience, we write, e.g., w-nw-n to denote the
        in segment_with_unique<C,U> {                                                    types such                                                        form                       Figure 7:                                                   1           Real-estate                                    1                                                                                  Web Automation
                                                                            fied manner to guarantee termination only use child (descendant, resp.) for the child (descen-
                                                                                                                     and introduces at most n new
                                                                                            D EFINITION TL. We ain the form.
     segment<C>(G)( outlier<C>(G),child(N1 ,G), concept<U>(N1 ,G),         — template: limited second-order, same data complexity                                                                                                                                                                                                                                                                 & Testing
10
        union of the relations w, nw, and n.                                segments where n is the number  11. of fields form labeling F on a DOM P and an
                                                                                                                 Given
       ¬ child(N2 ,G),N1 6= N2 ,¬(concept<C>(N2 )_segment<C>(N2 )) . }                                                                                   ➎ Master formsibling order                                                             0.99

            In cultures with left-to-right reading direction, we observe a strong schema there isOPAL - TL annotationtqueryextend document and an example, the following template defines a family of con-
                                                                              • template variablesIfquantify over predicates or We is an expres-
                                                                               (1) Under Segmentation: L, resp.) relation intype such
                                                                                         annotation       dant, an a segment n with F. types                               As                                                                                                                 Precision                                        domain
12
                                                                                                                                                           provides passe-                                                                      0.98
                                                                            that Cinstantiation reduceschildto F: follows(X,Yfirst-order variable, Rfollowing (f (X), f (Y )) 2 the domain type D to a node N whenever N
                                                                                                                                                                        straints that associate                                                                 Used-car
   TEMPLATE outlier<C>{                                                       • T (t) requires additional to Datalog where X is ,a . .) for X,Y 2 F, if
                                                                                                          from P segments
                                                                                         sion of the form: X@A{d, p, e} of type t1 . ,tk 62                                                                                                                                                   Recall                                           layout
        preference for placing labels in the w-nw-n region)) } child-T (n), we try to partition the children of n into k + 1 partitions
                                                  0 ,P),¬(segment<C>(G0 from a field. How-
                                                                                                                                                                                                                                                                                                        0.8
                                                                                         A 2 A, and d, P and e are annotation modifiers. An annotation for filling
                                                                                                                                                           partout      is labeled by an exclusive direct and proper annotation of type A.                                                    F-score                                          segment
                                                                                                           p,
        ever, forms often have many fields interspersed with field. .labels and that Piwithandi )no otherholds. .for } F X T (t). between X and Y in document or-
14   outlier<C>(G)( root(G)_child(G,P),child(G                             INSTANTIATE basic_concept<D,A> using node in        {<RADIUS, occurs
                                                                                                                                            radius>}                                                                                            0.97
                                                                                                                                                                                                                                                                            0.6              0.7             0.8            0.9              1 field
                                                                            P1 , . , Pk ,query X@Aµ |= CT (t {d, p, n [ {t1 , . ,tk all C 2 J Aµ K with individual forms
                                                                                         Pn such              µ ✓ and Pe} (X,Y ), |=
        segment labels. Thusstructural constraints
                 Figure 8: OPAL - TL we have to carefully consider overshadowing. a new segment node as
                                                                            For each Pi we add
                                                                                                          der; adjacentchild of n,ifclassify it (f (X), f (Y )) 2 P or vice versa.
                                                                                                                                           Rnext-sibling                 TEMPLATE basic_concept<D,A> { concept<D>(N) ( N@A{e,d,l} }             0.96
                                                                                                                                                                                                       Figure 3: OPAL Interface
                                                                                               @Aµ nodes 2 P : Allowµ (n)  Matchµ labell 0} (X)) µ (A)
        Intuitively, for a field f , a visible text node t is overshadowedJby all K = {nassigned wePiabbreviate (A) 6= (f  Blockas l(X).
                                                                                                          Finally, to from n to that segment. /                                                                                                 0.95                                                     0.6
                                                                            with ti , and move
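The under-segmentation rule above can be read as a search for a partition of the children of n. The following Python sketch is only an illustration under stated assumptions, not OPAL's implementation: constraints C_T are modelled as plain Python predicates over lists of child types, the search is brute force, a new segment is assumed to need at least one child, and the type names (`min_price`, `max_price`, `location`, `price_range`) are hypothetical.

```python
from itertools import product


def repair_under_segmentation(children, required, parent_ok):
    """Try to split the children of a segment n into new child segments.

    children:  list of (node_id, type) pairs directly below n
    required:  dict mapping each missing segment type t_i to a predicate
               over the list of child types placed in P_i (stand-in for C_T(t_i))
    parent_ok: predicate over the types left in P_n plus the new t_i
               (stand-in for C_T(t))
    Returns (new_segments, remaining_nodes), or None if no partition fits.
    """
    seg_types = list(required)
    k = len(seg_types)
    # Slot i < k means "move to the new segment of type t_i";
    # slot k means "stay directly below n" (partition P_n).
    for slots in product(range(k + 1), repeat=len(children)):
        parts = [[] for _ in range(k + 1)]
        for child, slot in zip(children, slots):
            parts[slot].append(child)
        if any(not parts[i] for i in range(k)):
            continue  # assumption: a new segment must not be empty
        segments_ok = all(
            required[t]([ty for _, ty in parts[i]])
            for i, t in enumerate(seg_types)
        )
        rest_types = [ty for _, ty in parts[k]] + seg_types
        if segments_ok and parent_ok(rest_types):
            new_segments = {
                t: [node for node, _ in parts[i]]
                for i, t in enumerate(seg_types)
            }
            return new_segments, [node for node, _ in parts[k]]
    return None


# Hypothetical example: a form whose schema wants a price_range segment.
children = [("f1", "min_price"), ("f2", "max_price"), ("f3", "location")]
required = {"price_range": lambda ts: sorted(ts) == ["max_price", "min_price"]}
parent_ok = lambda ts: sorted(ts) == ["location", "price_range"]
result = repair_under_segmentation(children, required, parent_ok)
```

Here `result` groups `f1` and `f2` under a new `price_range` segment and leaves `f3` below n. The exhaustive search is exponential in the number of children, so it only serves to make the partition condition concrete.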
In practice, few cases of multiple under segmentations occur at the [...]

A template tpl is instantiated to produce a family of rules where [...]

[...] first two templates). It is the only template with two concept tem[...]
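To make the idea of template instantiation concrete, the sketch below expands a rule template by textual substitution of its parameters, as in INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}. It is a minimal Python illustration under assumptions: the function `instantiate` is hypothetical, `<=` stands in for the rule arrow, the modifier set is copied from the first Figure 7 template, and real OPAL-TL instantiation may work differently.

```python
import re


def instantiate(template_body, params, bindings):
    """Expand a rule template once per binding tuple.

    template_body: rule text containing the parameter names
    params:        template parameter names, e.g. ["D", "A"]
    bindings:      list of value tuples, one rule is produced per tuple
    """
    # Match parameters only as whole tokens, so N, d, e, p are untouched.
    pattern = r"\b(" + "|".join(map(re.escape, params)) + r")\b"
    rules = []
    for values in bindings:
        mapping = dict(zip(params, values))
        rules.append(
            re.sub(pattern, lambda m: mapping[m.group(1)], template_body)
        )
    return rules


body = "concept<D>(N) <= N@A{d,e,p}"
rules = instantiate(body, ["D", "A"], [("RADIUS", "radius")])
```

Each binding yields one ordinary rule with no template parameters left, which matches the poster's remark that instantiation reduces to Datalog.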

OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)

  • 1. DIADEM: domain-centric intelligent automated data extraction methodology. OPAL: A Passe-partout for Web Forms. Xiaonan Guo, DIADEM Group, Department of Computer Science, Oxford University. Joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio Orsi, Christian Schallhart.
  • 2. DIADEM ›❯ OPAL. A scenario... Looking for a house? Too many websites to check? Tired of filling every search form? WWW, Lyon, Apr 20, 2012.
  • 3-9. A Scenario...
  • 10. A Unique Interpretation
  • 11. A Unique Interpretation. [Figure: domain schema. (A) Real-Estate Web Form (purpose: combined), composed of Price, Property-Contract, Search-Option, Form-Buttons, Geographic, and Property-Feature segments. (B) Price Segment, composed of an optional Currency Element (Currency Label, Currency Input Field) and Price Elements with priceType apx, range, min, and max.]
  • 12. A Unique Interpretation. [Same figure, with the label "OPAL" added.]
  • 13. A Unique Interpretation. [Same figure, with the labels "OPAL" and "Master Form" added.]
  • 15. OPAL Overview. Ontology-based Pattern Analysis with Logic.
  • 16. OPAL Overview. Multi-scope domain-independent analysis: combines visual, textual, and structural features.
  • 17. OPAL Overview. Domain-aware form interpretation: parameterizes domain knowledge with OPAL-TL; classifies form fields and repairs the domain form model.
  • 18. OPAL Overview. Template language OPAL-TL: allows domain parameterization; allows natural access to domain knowledge.
  • 20-27. OPAL Overview. [Figure, built up over eight slides: five AreaBranch Elements are grouped and labelled Area-Branch Segment and Geographic Segment; Price Element (min) and Price Element (max) are grouped into a Price Segment; the whole is labelled Real-Estate Form.]
  • 28-29. OPAL Evaluation.
      - Domain-awareness experiment: 100 real-estate forms, 100 used-car forms: >98%
      - Domain-independence experiment (ICQ): 5 domains, 100 web forms: >95%
      - Domain-independence experiment (Tel-8): 8 domains, 477 web forms: >95%
  • 31-35. Form Filling with OPAL. [Diagram, built up over five slides, with the elements Webpage, Master Form, labeling, and interpretation.]
  • 36. OPAL: A Passe-partout for Web Forms (poster). DIADEM: domain-centric intelligent automated data extraction methodology. Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart. Digital home: diadem.cs.ox.ac.uk/opal/. Contact: diadem-opal@cs.ox.ac.uk. [Sponsor logos.]

    OPAL: Automated Form Understanding for the Deep Web
    OPAL combines multi-scope domain-independent analysis with domain knowledge.
      - Multi-scope domain-independent analysis: field, segment, and layout scope.
      - Integration of three scopes yields more robust, simpler heuristics than a single scope.
      - Strict preference for disambiguation, for quality and performance reasons.
    Domain knowledge on top of the domain-independent analysis for
      - classifying form fields and segments according to the domain ontology;
      - verifying and repairing the form model to be consistent with domain constraints.
    Datalog-based template language for easy definition of domain knowledge.
    OPAL outperforms previous approaches even without domain knowledge.
      - With domain knowledge we achieve nearly perfect accuracy in multiple domains.

    OPAL: A Passe-partout for the Web
    Automatic form filling based on a domain-specific master form.
      - Automatic (approximate) matching of master form values to values of concrete fields.
      - Visualization of form and segment concepts.
      - Automatically detects forms of the given domain and fills them.
      - Works on nearly any page of the domain.

    Labeling
    [Figure: DOM tree, segment tree, layout tree, and schema tree. Field Scope yields Fields & Labels, Segment Scope yields Segments & Labels, Layout Scope yields Visual Labels, Domain Scope yields the Form Model.]

    Segment Scope
    Segmentation: find the "logical" structure of the form.
      - Segmentation tree with only fields as leaves and form segments s as inner nodes, such that s has at least degree two and all fields in s are style-equivalent.
      - Segmentation labeling: distribute labels to fields, if there is a regular structure; to segments, if there is a single prefix label.
      - Style-equivalence: two nodes are style-equivalent if they have the same class, or the same type and CSS style.
      - Complexity: linear in document size and depth.
    [Figure 4: Example for Segment Scope Labeling.]
    Paper excerpt: […] n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field that is itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as label of the segment, lines 17-19).
    Figure 4 illustrates the segment scope labeling, with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left-hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right-hand segment (4), we find a subsegment (5) and field 8, which is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.

    Layout Scope
    Visual heuristics: find labels in visual proximity of a field.
      - But: not "overshadowed" by another field.
      - But: preference for labels preceding a field in reading order.
      - t is visible from a point p if it is north-west of p.
      - f overshadows f′ for t if t is visible from
          • the bottom-right corner of f, and f, f′ are unaligned;
          • the bottom-left corner of f, and f, f′ are aligned;
          • the bottom-right corner of f, and f, f′ are aligned and f has no other visual label.
    [Figure 5: Layout Scope. Fields F1-F3, text nodes T1-T4, and the compass regions NW, N, NE, W, E, SW, S, SE.]
    Paper excerpt (3.3 Layout Scope): At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes.
    Definition 9. The layout tree of a given DOM P is a tuple (N_P, ⊑, w, nw, n, ne, e, se, s, sw, aligned) where N_P is the set of DOM nodes from P; ⊑, w, nw, n, ... are the "belongs to" (containment), west, north-west, north, ... relations from RCR [12]; and aligned(x, y) holds if x and y have the same height and are horizontally aligned.
    We call w, nw, ... the neighbour relations. The layout tree is at most quadratic in the size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n.
    In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f, a visible text node t is overshadowed by […] w-nw-n-ne-e(t, f′) or (2) f and f′ are aligned and (i) w(t, f′) or (ii) nw-n(t, f′) and there is a text node t′ not overshadowed by another field with ne-e(t′, f′) and w-nw-n(t′, f).
    To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3 […] only T1 is not overshadowed, as there is no other text node south-east or south from T3 not overshadowed by another field.
    The layout scope labeling is then produced as follows: for each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

    Domain Scope: OPAL-TL
    OPAL-TL extends Datalog¬,≠ with templates and annotation queries.
      - Templates = common patterns in web forms, e.g., range specification.
      - Rules = constraints for domain-specific element and segment types.
      - Annotation queries: labels annotated with a certain type.
          • direct? direct only, or also group labels?
          • proper? look for labels ("price") or also values ("£10")?
          • exclusive?
      - Templates: limited second-order, same data complexity.
          • Template variables quantify over predicates.
          • Instantiation reduces to Datalog.
    Classification: look at fields and segments and the segment hierarchy.
    Repairing form interpretations: look for violations of structural constraints and adapt the form model with rewriting rules.
    [Figure 7: OPAL-TL classification templates and Figure 8: OPAL-TL structural constraints. Template names: basic_concept<C,A>, concept_by_segment<C,A>, concept_minmax<C,C_M,A>, segment<C>, segment_range<C,C_M>, segment_with_unique<C,U>, outlier<C>. The only rule body legible in the transcript is TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }.]
    Paper excerpt (4. Form Interpretation): There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain-specific peculiarities, such as "guide price", "current offers in excess", or payment periods in real estate. OPAL's domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F′, t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P, which relies on the annotation schema Λ and its typing of labels in F; (2) the model repair, where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.
    Paper excerpt (4.1 Schema Design: OPAL-TL): OPAL provides a template language, OPAL-TL, for easily specifying domain schemata, reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints. […] either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date.
    OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and a DOM P. Relations from F and P are mapped in the obvious way to OPAL-TL. […]
    Definition 11. Given a form labeling F on a DOM P and an annotation schema Λ, an OPAL-TL annotation query is an expression of the form X@A{d, p, e}, where X is a first-order variable, A ∈ A, and d, p, and e are annotation modifiers. […]
    As an example, the following template defines a family of constraints that associate the domain type D to a node N whenever N is labeled by an exclusive, direct, and proper annotation of type A: TEMPLATE basic_concept<D,A> { concept<D>(N) ⇐ N@A{e,d,p} }, instantiated with INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}. A template tpl is instantiated to produce a family of rules where […]
    Model repair: […] OPAL performs rewriting in a stratified manner to guarantee termination and introduces at most n new segments, where n is the number of fields in the form. (1) Under-segmentation: if there is a segment n with type t such that C_T(t) requires additional child segments of type t_1, ..., t_k ∉ child-T(n), we try to partition the children of n into k + 1 partitions P_1, ..., P_k, P_n such that […]. For each P_i we add a new segment node as child of n, classify it with t_i, and move all nodes assigned to P_i from n to that segment. In practice, few cases of multiple under-segmentations occur at the […]

    Form Filling: A Passe-partout for the Web
    The master form serves as a template for filling individual forms (passe-partout).
      - Currently: free-text values are used directly for free-text inputs,
      - but approximate matching is used for filling select boxes.
    [Figure 3: OPAL Interface.]

    Applications of OPAL
    Form understanding is the "key" to the web: all types of web automation need it. Search; Data Extraction; Visualization; Assistive Devices; Multi-modal Input (Touch-Screens); Web Automation & Testing.

    Experiments
    [Charts: results on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436); per-domain results for Airfare, Auto, Book, Job, and US R.E.; field, segment, layout, and domain scopes compared for real-estate and used-car; axis labels include Precision and F-score.]

    Thank You

    […] first two templates). It is the only template with two concept tem-
  • 37. DIADEM ›❯ OPAL OPAL: A Passe-partout for Web Forms
    DIADEM: domain-centric intelligent automated data extraction methodology
    Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart
    Digital Home: diadem.cs.ox.ac.uk/opal/
    Contact: diadem-opal@cs.ox.ac.uk

    OPAL: Automated Form Understanding for the Deep Web
    OPAL combines multi-scope domain-independent analysis with domain knowledge
    - multi-scope domain-independent analysis: field, segment, and layout scope
    - integration of three scopes yields more robust, simpler heuristics than single scope
    - strict preference for disambiguation for quality and performance reasons
    Domain knowledge on top of the domain-independent analysis for
    - classifying form fields and segments according to the domain ontology
    - verifying and repairing the form model to be consistent with domain constraints
    Datalog-based template language for easy definition of domain knowledge

    OPAL: A Passe-partout for the Web
    Automatic form filling based on domain-specific master form
    - automatic (approximate) matching of master form values to values of concrete fields
    - visualization of form and segment concepts
    - automatically detects forms of the given domain and fills them
    - works on nearly any page of the domain
    OPAL outperforms previous approaches even without domain knowledge
    - with domain knowledge we achieve nearly perfect accuracy in multiple domains

    [Figure: analysis pipeline. Scopes: Field Scope, Segment Scope, Layout Scope, Domain Scope. Trees: DOM tree, Segment tree, Layout tree, Schema tree. Outputs: Fields & Labels, Segments & Labels, Visual Labels, Form Model.]

    Segment Scope
    Segmentation: Find "logical" structure of the form
    - segmentation tree with only fields as leaves and form segments s as inner nodes such that s has
      • at least degree two and
      • all fields in s are style-equivalent
    - segmentation labeling: distribute labels
      • to fields, if there is a regular structure
      • to segments, if single, prefix label
    - style-equivalence: two nodes are style-equivalent
      • if same class or same type and CSS style
    - complexity: linear in document size and depth

    [...] n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field that is itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19).

    Figure 4 illustrates the segment scope labeling with triangles denoting segments, diamonds fields, black circles text nodes, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.

    [Figure 4: Example for Segment Scope Labeling]

    Layout Scope
    Visual heuristics: Find labels in visual proximity of a field
    - but: not "overshadowed" by another field
    - but: preference for labels preceding a field in reading order
    - t visible from a point p if north-west of p
    - f overshadows f' for t if t visible from
      • bottom-right corner of f and f, f' unaligned
      • bottom-left corner of f and f, f' aligned
      • bottom-right corner of f, f, f' aligned and f has no other visual label

    3.3 Layout Scope
    At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes:

    DEFINITION 9. The layout tree of a given DOM P is a tuple consisting of N_P, the set of DOM nodes from P, the "belongs to" (containment), west (w), north-west (nw), north (n), ne, e, se, s, and sw relations from RCR [12], and aligned, where aligned(x, y) holds if x and y have the same height and are horizontally aligned. We call w, nw, ... the neighbour relations. The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n.

    In cultures with left-to-right reading direction, we observe a strong layout preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f, a visible text node t is overshadowed by [...] w-nw-n-ne-e(t, f') or (2) f and f' are aligned and (i) w(t, f') or (ii) nw-n(t, f') and there is a text node t' not overshadowed by another field with ne-e(t', f') and w-nw-n(t', f).

    To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field.

    The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

    [Figure 5: Layout Scope Labeling]
    [Figure 6: Example Form Labeling. (a) web page, (b) page scope]
    [Figure 4: Holbrook Moran form and page-scope labeling]

    4. FORM INTERPRETATION
    There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain specific peculiarities, such as "guide price", "current offers in excess", or payment periods in real estate. OPAL's domain schemata allow us to cover these specifics.
    We recall from Section 2 that a form model (F′, τ) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing τ_P. This step relies on the annotation schema Λ and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.

    4.1 Schema Design: OPAL-TL
    OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints. Annotators are either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date.

    DEFINITION 11. OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and a DOM P. F and P are mapped in the obvious way to [...]. We only use child (descendant, resp.) for the child (descendant, resp.) relation in F. We extend document order to F: follows(X, Y) holds for X, Y ∈ F if (f(X), f(Y)) ∈ R_following from P and no other node in F occurs between X and Y in document order; adjacent(X, Y) holds if (f(X), f(Y)) ∈ R_next-sibling from P or vice versa. An annotation query is an expression of the form X@A{d, p, e} where X is a first-order variable, A ∈ A, and d, p, and e are annotation modifiers.

    As an example, the following template defines a family of constraints that associate the domain type D to a node N whenever N is labeled by an exclusive direct and proper annotation of type A:
      TEMPLATE basic_concept<D,A> { concept<D>(N) ⇐ N@A{e,d,l} }
      INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}
    A template tpl is instantiated to produce a family of rules [...] first two templates). It is the only template with two concept tem[...]

    Model repair
    The classification yields a form interpretation F that is, however, not necessarily a model under S. We look for violations of the constraints at fields and segments and in the segment hierarchy, and construct a form model compliant with S according to the rewriting rules described below. OPAL performs rewriting in a stratified manner to guarantee termination and introduces at most n new segments, where n is the number of fields in the form.
    (1) Under Segmentation: If there is a segment n with type t such that C_T(t) requires additional segments of type t1, ..., tk ∉ child-T(n), we try to partition the children of n into k + 1 partitions P1, ..., Pk, Pn such that [...] holds. For each Pi we add a new segment node as child of n, classify it with ti, and move all nodes assigned to Pi from n to that segment. In practice, few cases of multiple under segmentations occur at the [...]

    Domain Scope: OPAL-TL
    OPAL-TL extends Datalog¬,≠ with templates and annotation queries
    - templates = common pattern in web forms, e.g., a range specification
    - rules = for domain-specific element and segment types
    - annotation queries: labels annotated with certain type
      • direct? direct only or also group labels?
      • proper? look for labels ('price') or also values ('£10')
      • exclusive?
    - template: limited second-order, same data complexity
      • template variables quantify over predicates
      • instantiation reduces to Datalog

    Figure 7: OPAL-TL classification templates
      TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
      TEMPLATE concept_by_segment<C,A> { [...] }
      TEMPLATE concept_minmax<C,CM,A> {
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
                          N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
        concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1), [...]
                          N2@range_connector{e,d}, [...]
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
                          N1@A{e,p}, N2@A{e,p},
                          ((N1@min{e,p}, N2@max{e,p}) ∨ (N1@max{e,p}, N2@min{e,p})) }

    Figure 8: OPAL-TL structural constraints
      TEMPLATE segment<C> {
        segment<C>(G) ⇐ outlier<C>(G), child(N1,G), ¬ child(N2,G),
                        ¬(concept<C>(N2) ∨ segment<C>(N2)) }
      TEMPLATE segment_range<C,CM> {
        segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), concept<CM>(N2),
                        N1 ≠ N2, child(N1,G), child(N2,G) }
      TEMPLATE segment_with_unique<C,U> {
        segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G),
                        ¬ child(N2,G), N1 ≠ N2, ¬(concept<C>(N2) ∨ segment<C>(N2)) }
      TEMPLATE outlier<C> {
        outlier<C>(G) ⇐ root(G) ∨ child(G,P), child(G′,P), ¬(segment<C>(G′)) }

    Form Filling: A Passe-partout for the Web
    Master form serves as template for filling individual forms (passe-partout)
    - currently: free text values used directly for free text inputs
    - but using approximate matching for filling select boxes
    [Figure 3: OPAL Interface]

    Experiments
    [Plots: precision, recall, and F-score. Datasets: UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436). Domains: Airfare, Auto, Book, Job, US R.E.; real-estate and used-car.]

    Applications of OPAL
    Form filling
    - 'key' to the web
    - all types of web automation need it
    Search, Data Extraction, Assistive Devices, Multi-modal Input (Touch-Screens), Web Automation & Testing

    Thank You
    Booth No. 2
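
The segment scope labeling described on the poster (assign the i-th text node group to the i-th field, and treat a single surplus leading group as the segment label) can be sketched as follows. This is a minimal illustration, not the OPAL implementation: the flat `(kind, value)` input format and the function name `label_segment` are assumptions made for the example, and nested segments and labels already assigned in field scope are ignored.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input format: the children of one segment in document order,
# each either ("text", content) or ("field", field_name).
Item = Tuple[str, str]


def label_segment(items: List[Item]) -> Tuple[Optional[List[str]], Dict[str, List[str]]]:
    """Return (segment_label, field_labels) for one segment.

    Text nodes are grouped into maximal runs between fields. If there are as
    many groups as fields, group i labels field i. If there is exactly one
    more group than fields, the first group becomes the segment label.
    Otherwise the structure is treated as irregular and nothing is assigned.
    """
    fields: List[str] = []
    groups: List[List[str]] = []
    run: List[str] = []
    for kind, value in items:
        if kind == "text":
            run.append(value)
        elif kind == "field":
            if run:  # close the text group that precedes this field
                groups.append(run)
                run = []
            fields.append(value)
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    if run:  # trailing text group after the last field
        groups.append(run)

    segment_label: Optional[List[str]] = None
    if len(groups) == len(fields) + 1:
        segment_label, groups = groups[0], groups[1:]
    if len(groups) != len(fields):
        return None, {}  # irregular structure: leave labeling to other scopes
    return segment_label, dict(zip(fields, groups))


if __name__ == "__main__":
    regular = [("text", "Min price"), ("field", "min"),
               ("text", "Max price"), ("field", "max")]
    print(label_segment(regular))
    with_prefix = [("text", "Price"), ("field", "a"), ("text", "from"),
                   ("field", "b"), ("text", "to")]
    print(label_segment(with_prefix))
```

The first call corresponds to the (text node+, field)+ pattern; the second to the case with one more text group than fields, where the leading group is taken as the segment label and the rest follows (field, text node+)+.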

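
The poster states that master form values are matched approximately when filling select boxes, without saying how. As one possible reading, the sketch below uses Python's standard `difflib` for the fuzzy comparison; the function name `pick_option`, the case folding, and the 0.6 cutoff are assumptions for illustration and are not taken from OPAL.

```python
import difflib
from typing import List, Optional


def pick_option(master_value: str, options: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Pick the select-box option closest to a master form value.

    Comparison is case-insensitive. Returns None when no option reaches the
    similarity cutoff, so the caller can leave the select box untouched.
    """
    # Map lower-cased option text back to the original option.
    lowered = {opt.lower(): opt for opt in options}
    matches = difflib.get_close_matches(
        master_value.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


if __name__ == "__main__":
    locations = ["Any", "Oxford", "Oxfordshire", "London"]
    print(pick_option("oxford", locations))
    print(pick_option("Paris", locations))
```

Free text inputs would take the master value directly, as the poster notes; only select boxes need this kind of lookup.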