OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)

Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels with fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL then interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties allows OPAL to deal effectively with many forms beyond the reach of existing form filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.

Published in: Technology, Education

  • Transcript of "OPAL: a passe-partout for web forms - WWW 2012 (Demonstration)"

    1. DIADEM: domain-centric intelligent automated data extraction methodology. OPAL: A Passe-partout for Web Forms. Xiaonan Guo, DIADEM Group, Department of Computer Science, Oxford University. Joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio Orsi, Christian Schallhart.
    2. DIADEM › OPAL: A scenario... Looking for a house? Too many websites to check? Tired of filling every search form? (WWW, Lyon, Apr 20, 2012)
    3-9. DIADEM › OPAL: A Scenario... [slides without transcribed text]
    10. DIADEM › OPAL: A Unique Interpretation
    11-13. DIADEM › OPAL: A Unique Interpretation [Diagram: (A) Real-Estate Web Form, a Combined Form (purpose {combined}) composed via AND/OR of Price, Property-Contract, Search-Option, Form-Buttons, Geographic, and Property-Feature segments, with cardinalities such as 1..1, 0..1, and 1..*; (B) Price Segment with a Currency Element (0..1) made up of a Currency Label and a Currency Input Field, and Price Elements with priceType {apx}, {range}, {min}, {max} combined via XOR/OR. Slides 12 and 13 add the labels "OPAL" and "Master Form".]
    14-18. DIADEM › OPAL: OPAL Overview
        Ontology based Pattern Analysis with Logic
        - multi-scope domain-independent analysis: combines visual, textual, and structural features
        - domain-aware form interpretation: parameterizes domain knowledge with OPAL-TL; classifies form fields and repairs the domain form model
        - template language OPAL-TL: allows domain parameterization; allows natural access to domain knowledge
    19-27. DIADEM › OPAL: OPAL Overview [Diagram, built up step by step: a list of AreaBranch Elements is labelled Area-Branch Segment and Geographic Segment; Price Element(min) and Price Element(max) are labelled Price Segment; the whole is labelled Real-Estate Form.]
    28-29. DIADEM › OPAL: OPAL Evaluation
        - Domain-awareness experiment: 100 real-estate forms, 100 used-car forms: >98%
        - Domain-independence experiment, ICQ: 5 domains, 100 web forms: >95%
        - Domain-independence experiment, Tel8: 8 domains, 477 web forms: >95%
    30. 9
    31-35. DIADEM › OPAL: Form Filling with OPAL [Diagram, built up step by step, with the labels: Webpage, Master Form, labeling, interpretation.]
    36. DIADEM › OPAL: poster "OPAL: A Passe-partout for Web Forms"
        DIADEM: domain-centric intelligent automated data extraction methodology
        Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart
        Digital Home: diadem.cs.ox.ac.uk/opal/, diadem-opal@cs.ox.ac.uk
        Sponsors: [not transcribed]

        OPAL: Automated Form Understanding for the Deep Web
        OPAL combines multi-scope domain-independent analysis with domain knowledge
        - multi-scope domain-independent analysis: field, segment, and layout scope
        - integration of three scopes yields more robust, simpler heuristics than single scope
        - strict preference for disambiguation for quality and performance reasons
        Domain knowledge on top of the domain-independent analysis for
        - classifying form fields and segments according to the domain ontology
        - verifying and repairing the form model to be consistent with domain constraints
        - Datalog-based template language for easy definition of domain knowledge
        OPAL outperforms previous approaches even without domain knowledge
        - with domain knowledge we achieve nearly perfect accuracy in multiple domains

        OPAL: A Passe-partout for the Web
        Automatic form filling based on domain-specific master form
        - automatic (approximate) matching of master form values to values of concrete fields
        - visualization of form and segment concepts
        - automatically detects forms of the given domain and fills them
        - works on nearly any page of the domain

        [Pipeline diagram labels: DOM tree, Segment tree, Layout tree, Schema tree; Fields & Labels, Segments & Labels, Visual Labels, Form Model; Field Scope, Segment Scope, Layout Scope, Domain Scope]

        Segment Scope
        Segmentation: find the "logical" structure of the form
        - segmentation tree with only fields as leaves and form segments s as inner nodes, such that s has at least degree two and all fields in s are style-equivalent
        - segmentation labeling: distribute labels
          • to fields, if there is a regular structure
          • to segments, if single, prefix label
        - style-equivalence: two nodes n and n' are style-equivalent if same class, or same type and CSS style
        - complexity: linear in document size and depth
        [Figure 4: Example for Segment Scope Labeling]
        Paper excerpt: [...] n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field that is itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17-19).
        Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.

        Layout Scope
        Visual heuristics: find labels in visual proximity of a field
        - but: not "overshadowed" by another field
        - but: preference for labels preceding a field in reading order
        - t visible from a point p if north-west of p
        - f overshadows f' for t if t visible from
          • bottom-right corner of f and f, f' unaligned
          • bottom-left corner of f and f, f' aligned
          • bottom-right corner of f, f, f' aligned and f has no other visual label
        [Figure 5: Layout Scope Labeling, showing text nodes T1-T4, fields F1-F3, and the compass regions NW, N, NE, W, E, SW, S, SE]
        Paper excerpt: 3.3 Layout Scope. At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes:
        Definition 9. The layout tree of a given DOM P is a tuple (N_P, [containment symbol lost], w, nw, n, ne, e, se, s, sw, aligned) where N_P is the set of DOM nodes from P, and the remaining components are the "belongs to" (containment), west, north-west, north, ... relations from RCR [12], and aligned(x, y) holds if x and y have the same height and are horizontally aligned.
        We call w, nw, ... the neighbour relations. The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n.
        In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f, a visible text node t is overshadowed by [...] w-nw-n-ne-e(t, f') or (2) f and f' are aligned and (i) w(t, f') or (ii) nw-n(t, f') and there is a text node t' not overshadowed by another field with ne-e(t', f') and w-nw-n(t', f).
        To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node south-east or south from T3 not overshadowed by another field.
        The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

        Domain Scope: OPAL-TL
        OPAL-TL extends Datalog¬,≠ with templates and annotation queries
        - templates = common pattern in web forms, e.g., a range specification
        - rules = constraints for domain-specific element and segment types
        - annotation queries: labels annotated with certain type
          • direct? direct only or also group labels?
          • proper? look for labels ('price') or also values ('£10')
          • exclusive? [text not recoverable]
        - template: limited second-order, same data complexity
          • template variables quantify over predicates or types
          • instantiation reduces to first-order Datalog
        Templates shown on the poster (only the legible ones; concept_by_segment, concept_minmax, and outlier are not fully recoverable):
          TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
          TEMPLATE segment<C> { segment<C>(G) ⇐ outlier<C>(G), child(N1,G), ¬ child(N2,G), ¬(concept<C>(N2) ∨ segment<C>(N2)) }
          TEMPLATE segment_range<C,CM> { segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), concept<CM>(N2), N1 ≠ N2, child(N1,G), child(N2,G) }
          TEMPLATE segment_with_unique<C,U> { segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G), ¬ child(N2,G), N1 ≠ N2, ¬(concept<C>(N2) ∨ segment<C>(N2)) }
        [Figure 8: OPAL-TL structural constraints]
        Paper excerpt: 4. FORM INTERPRETATION. There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain specific peculiarities, such as "guide price", "current offers in excess", or payment periods in real estate. OPAL's domain schemata allow us to cover these specifics.
        We recall from Section 2 that a form model (F', t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.
        4.1 Schema Design: OPAL-TL. OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints. [...]
        Definition 11. Given a form labeling F on a DOM P and an annotation schema L, an OPAL-TL annotation query is an expression of the form X@A{d, p, e}, where X is a first-order variable, A ∈ 𝒜, and d, p and e are annotation modifiers. [...]
        As an example, the following template defines a family of constraints that associate the domain type D to a node N whenever N is labeled by an exclusive direct and proper annotation of type A.
          INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}
          TEMPLATE basic_concept<D,A> { concept<D>(N) ⇐ N@A{e,d,l} }
        [Remaining excerpt text, including the model-repair rules, is not recoverable.]

        Form Filling: A Passe-partout for the Web
        - master form serves as template for filling individual forms (passe-partout)
        - currently: free text values used directly for free text inputs, but using approximate matching for filling select boxes
        [Figure 3: OPAL Interface; callout labels include URL, Visualization, Controller, Master form, Field, Segment]

        Experiments
        [Chart labels: UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436); field, segment, layout, domain; Airfare, Auto, Book, Job, US R.E.; Precision, F-score; Real-estate, Used-car]

        Applications of OPAL
        - form filling: 'key' to the web; all types of web automation need it
        - Search
        - Data Extraction
        - Assistive Devices
        - Multi-modal Input (Touch-Screens)
        - Web Automation & Testing

        Thank You
    37. DIADEM › OPAL: poster "OPAL: A Passe-partout for Web Forms" repeated from slide 36, with the added note "Booth No. 2".
types As Precision domain12 provides passe- 0.98 that Cinstantiation reduceschildto F: follows(X,Yfirst-order variable, Rfollowing (f (X), f (Y )) 2 the domain type D to a node N whenever N straints that associate Used-car TEMPLATE outlier<C>{ • T (t) requires additional to Datalog where X is ,a . .) for X,Y 2 F, if from P segments sion of the form: X@A{d, p, e} of type t1 . ,tk 62 Recall layout preference for placing labels in the w-nw-n region)) } child-T (n), we try to partition the children of n into k + 1 partitions 0 ,P),¬(segment<C>(G0 from a field. How- 0.8 A 2 A, and d, P and e are annotation modifiers. An annotation for filling partout is labeled by an exclusive direct and proper annotation of type A. F-score segment p, ever, forms often have many fields interspersed with field. .labels and that Piwithandi )no otherholds. .for } F X T (t). between X and Y in document or-14 outlier<C>(G)( root(G)_child(G,P),child(G INSTANTIATE basic_concept<D,A> using node in {<RADIUS, occurs radius>} 0.97 0.6 0.7 0.8 0.9 1 field P1 , . , Pk ,query X@Aµ |= CT (t {d, p, n [ {t1 , . ,tk all C 2 J Aµ K with individual forms Pn such µ ✓ and Pe} (X,Y ), |= segment labels. Thusstructural constraints Figure 8: OPAL - TL we have to carefully consider overshadowing. a new segment node as For each Pi we add der; adjacentchild of n,ifclassify it (f (X), f (Y )) 2 P or vice versa. Rnext-sibling TEMPLATE basic_concept<D,A> { concept<D>(N) ( N@A{e,d,l} } 0.96 Figure 3: OPAL Interface @Aµ nodes 2 P : Allowµ (n) Matchµ labell 0} (X)) µ (A) Intuitively, for a field f , a visible text node t is overshadowedJby all K = {nassigned wePiabbreviate (A) 6= (f Blockas l(X). Finally, to from n to that segment. / 0.95 0.6 with ti , and move In practice, few cases of multiple under segmentations occur at the A template tpl is instantiated to produce a family of rules where Real estate Real-estate Used car Used-car Real-estate Used-car first two templates). 
It is0 the only template 0with two concept tem- 0