OPAL: automated form understanding for the deep web - WWW 2012


Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art. For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near-perfect accuracy (> 98%).

  • OPAL = ontology-based web form pattern analysis with logic

    1. 1. DIADEM: domain-centric intelligent automated data extraction methodology. OPAL: Automated Form Understanding for the Deep Web. Xiaonan Guo, DIADEM Group, Department of Computer Science, Oxford University. Joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio Orsi, Christian Schallhart.
    2. 2. OPAL ›❯ Scenario: A scenario... Looking for a house? Too many websites to check? Tired of searching through dozens of websites?
    3. 3. OPAL ›❯ Scenario: The Diversity of Forms
    17. 17. OPAL ›❯ Scenario: Automatic Form Understanding. Form Understanding := Form Labeling + Form Interpretation. Form Labeling covers Field Scope, Segment Scope, and Layout Scope; Form Interpretation covers Domain Scope.
    19. 19. OPAL ›❯ Scenario: Previous Approaches. Form labeling scopes: Layout Scope, Segment Scope, Field Scope. Cited approaches: Dragut et al., VLDB'09; Khare et al., CIKM'09; Nguyen et al., VLDB'08. Limitations: only a subset of scopes for form labeling; only form labeling, with little or no form interpretation; where domain-specific, expensive to switch domains.
    21. 21. OPAL ›❯ Scenario: OPAL = multi-scope form labeling + flexible form interpretation. (1) Form labeling with all three scopes: more accurate, more robust, less complex. (2) Form interpretation = matching and repair for elements to concepts. (3) Template language and pattern library for defining form ontologies: more succinct, more reuse, less work. [Chart: results above 95% on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436); axis from 0.94 to 1.]
    24. 24. OPAL ›❯ Form Labeling: overview of the scopes (Field Scope, Segment Scope, and Layout Scope under Form Labeling; Domain Scope under Form Interpretation), narrowing to Field Scope.
    28. 28. OPAL ›❯ Form Labeling: Field Scope. A simple heuristic for individual fields: it assigns a text node t to a field f if the least common ancestor of t and f is an ancestor of no other field. Implemented as a linear coloring algorithm. [Chart: contribution of the field, segment, layout, and domain scopes for Real-estate and Used-car; axis from 0.6 to 1.]
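The field scope heuristic can be illustrated with a small sketch. This is not the linear coloring algorithm named on the slide; it is a naive quadratic check of the same condition. The `Node` class, `example_form`, and all names in it are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical minimal DOM node: kind is "elem", "text", or "field"."""
    kind: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, *kids: "Node") -> "Node":
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def ancestors(node: Optional[Node]) -> List[Node]:
    """Return the node itself followed by its ancestors, innermost first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def descendants(node: Node) -> Iterator[Node]:
    """Yield the node and everything below it in document order."""
    yield node
    for child in node.children:
        yield from descendants(child)


def lca(a: Node, b: Node) -> Optional[Node]:
    """Least common ancestor of two nodes in the same tree."""
    seen = {id(x) for x in ancestors(a)}
    for x in ancestors(b):
        if id(x) in seen:
            return x
    return None


def field_scope_labels(root: Node) -> Dict[str, List[str]]:
    """Assign text t to field f if lca(t, f) is an ancestor of no other field."""
    nodes = list(descendants(root))
    form_fields = [n for n in nodes if n.kind == "field"]
    texts = [n for n in nodes if n.kind == "text"]
    labels: Dict[str, List[str]] = {f.name: [] for f in form_fields}
    for t in texts:
        for f in form_fields:
            common = lca(t, f)
            others = [g for g in descendants(common)
                      if g.kind == "field" and g is not f]
            if not others:
                labels[f.name].append(t.name)
    return labels


def example_form() -> Node:
    """Two rows with one label and one field each, plus a stray form-level text."""
    row1 = Node("elem", "div").add(Node("text", "Min price"), Node("field", "min"))
    row2 = Node("elem", "div").add(Node("text", "Max price"), Node("field", "max"))
    return Node("elem", "form").add(row1, row2, Node("text", "Search properties"))
```

In this toy form, "Search properties" stays unassigned at field scope because its least common ancestor with either field is the form itself, which contains the other field.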
    30. 30. OPAL ›❯ Form Labeling: Segment Scope, on top of Field Scope.
    31. 31. OPAL ›❯ Form Labeling: Segment Scope. Builds a segment tree; groups form elements that are likely to be semantically related (e.g. fields, labels); assigns labels to fields and groups. Form fields are grouped if they occur in sequence, they have similar style, and their least common ancestor contains no other elements.
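A rough sketch of the first two grouping conditions (fields in sequence, similar style) is given below. It assumes that "similar style" can be approximated by comparing a tag name and a CSS class, which is a simplification chosen here, and it ignores the least-common-ancestor condition. `FormField`, `style_key`, and `group_into_segments` are hypothetical names.

```python
from itertools import groupby
from typing import List, NamedTuple, Tuple


class FormField(NamedTuple):
    name: str
    tag: str        # e.g. "input" or "select"
    css_class: str  # stand-in for the rendered style (an assumption)


def style_key(f: FormField) -> Tuple[str, str]:
    """Two fields count as style-equivalent here if tag and CSS class match."""
    return (f.tag, f.css_class)


def group_into_segments(fields_in_order: List[FormField]) -> List[List[str]]:
    """Group maximal runs of consecutive, style-equivalent fields.

    A run only becomes a segment if it has at least two members.
    """
    segments = []
    for _, run in groupby(fields_in_order, key=style_key):
        members = [f.name for f in run]
        if len(members) >= 2:
            segments.append(members)
    return segments


EXAMPLE_FIELDS = [
    FormField("min_price", "input", "price"),
    FormField("max_price", "input", "price"),
    FormField("bedrooms", "select", "option"),
    FormField("keywords", "input", "text"),
]
```

For `EXAMPLE_FIELDS`, only the two price inputs form a segment; the remaining fields stay ungrouped because no neighbour shares their style key.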
    33. 33. OPAL ›❯ Form Labeling: Segment Tree. [Figure 3: Example DOM and Segment Tree.] The segment tree removes layout or template artifacts that artificially break sequences of style-equivalent fields. Algorithm 3: SegmentScopeLabeling(DOM P, Form Labeling F); line 1: S ← SegmentTree(P).
    35. 35. OPAL ›❯ Form Labeling: Field & Segment Scope: Example
    37. 37. OPAL ›❯ Form Labeling: Layout Scope, on top of Segment Scope and Field Scope.
    38. 38. OPAL ›❯ Form Labeling: Layout Scope. Aligns visually related texts and fields; prefers texts in the w-nw-n (west, north-west, north) direction; considers overshadowing. [Figure: text nodes T1 to T4 and fields F1 to F3 arranged over the compass regions NW, N, NE, W, E, SW, S, SE.]
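The following sketch illustrates the idea of preferring texts to the west, north-west, or north of a field while skipping overshadowed ones. The overshadowing test here is a deliberate simplification (another field whose centre lies inside the bounding box spanned by the text and the field) and is not the definition used by OPAL. `Box` and the function names are hypothetical; coordinates are assumed to grow to the right and downwards.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    name: str
    left: float
    top: float
    right: float
    bottom: float


def w_nw_n(t: Box, f: Box) -> bool:
    """True if text box t lies west, north-west, or north of field box f."""
    west_of = t.right <= f.left
    above = t.bottom <= f.top
    same_rows = t.top < f.bottom and t.bottom > f.top
    same_cols = t.left < f.right and t.right > f.left
    return (west_of and same_rows) or (above and same_cols) or (west_of and above)


def overshadowed(t: Box, f: Box, fields: List[Box]) -> bool:
    """Simplified test: some other field sits between t and f."""
    left, top = min(t.left, f.left), min(t.top, f.top)
    right, bottom = max(t.right, f.right), max(t.bottom, f.bottom)
    for g in fields:
        if g.name == f.name:
            continue
        cx, cy = (g.left + g.right) / 2, (g.top + g.bottom) / 2
        if left <= cx <= right and top <= cy <= bottom:
            return True
    return False


def layout_scope_labels(texts: List[Box], fields: List[Box]) -> Dict[str, List[str]]:
    """Collect, per field, the w-nw-n texts that are not overshadowed."""
    return {
        f.name: [t.name for t in texts
                 if w_nw_n(t, f) and not overshadowed(t, f, fields)]
        for f in fields
    }


# One row laid out as: "Min" [min] "Max" [max]
ROW_TEXTS = [Box("Min", 0, 0, 30, 20), Box("Max", 100, 0, 130, 20)]
ROW_FIELDS = [Box("min", 40, 0, 90, 20), Box("max", 140, 0, 190, 20)]
```

In the example row, "Min" is also west of the `max` field, but the `min` field lies in between, so only "Max" is kept as the label of `max`.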
    39. 39. OPAL ›❯ Form Labeling: Layout Scope Example
    42. 42. OPAL ›❯ Form Labeling: Layout Scope, Segment Scope, Field Scope.
    43. 43. OPAL ›❯ Form Labeling: Evaluation on the ICQ and Tel-8 benchmarks. Domain-independence experiment, ICQ: web query interfaces in 5 domains, 100 query interfaces (20 in each domain). Domain-independence experiment, Tel-8: web query interfaces in 8 domains, 477 in total.
    46. 46. OPAL ›❯ Form Labeling: ICQ by Domain. [Chart: results per domain (Airfare, Auto, Book, Job, US R.E.), axis from 0.9 to 1; OPAL above 98%, shown against Dragut et al., VLDB'09.]
    48. 48. OPAL ›❯ Form Labeling: Tel-8 by Domain. [Chart: results per domain (Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies, Records), axis from 0.9 to 0.99; above 95%.]
    50. 50. OPAL ›❯ Form Interpretation: Domain Scope, on top of Form Labeling (Layout Scope, Segment Scope, Field Scope).
    59. 59. OPAL ›❯ Form Interpretation: [Diagram built up over slides 51 to 59: the parts of a form are annotated step by step as AreaBranch Elements, Area-Branch Segment, Geographic Segment, Price Element (min), Price Element (max), Price Segment, and Real-Estate Form.]
    60. 60. OPAL ›❯ Form Interpretation: Form interpretation in OPAL = annotation + classification + repair. Annotation: straightforward entity recognition. Classification: maps fields to concepts based on annotations. Repair: enforces structural constraints. Together: high-precision disambiguation for form classification.
    61. 61. OPAL ›❯ Form Interpretation: OPAL Ontology. What does an OPAL ontology consist of? Annotation: annotation types, e.g. "number", "price period" (p.c.m.). Classification: concepts, e.g. "price field", "location field". Repair: constraints, e.g. "R.E. form must have a location field". Constraints come in two kinds: classification constraints (between annotations and concepts) and structural constraints (between concepts).
    62. 62. OPAL ›❯ Form Interpretation: OPAL Ontology. Reducing the effort of designing an OPAL ontology by exploiting ubiquitous patterns of web forms, e.g., "range" over a numeric type, dependent types. Halevy et al., VLDB'08 identify 7 such patterns, but these patterns must be instantiated for a specific domain, e.g., "price range", "radius" linked with "location", "postcode" with "city", … OPAL's solution: templates for these patterns.
    63. 63. OPAL ›❯ Form Interpretation: Form Patterns Example. Small set of ubiquitous patterns: ranges, dates, options, etc. Ontology by instantiation.
    69. 69. OPAL ›❯ Form Interpretation: OPAL-TL. Patterns are specified using OPAL-TL. OPAL-TL = Datalog + Annotation Queries + Templates.
    70. 70. OPAL ›❯ Form Interpretation: Annotation Queries. Annotation types come in two varieties: "label" types and "value" types, e.g. "price" vs. "€120". Disambiguation uses a fixed precedence on annotation types: e.g., a field with value "lowest price first" receives both a "price" and an "order-by" annotation, but "order-by" < "price".
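A minimal sketch of precedence-based disambiguation follows. It assumes that "order-by" < "price" means "order-by" takes precedence over "price", which matches the slide's example but is an interpretation. The keyword lists and function names are made up for illustration and are not OPAL's annotators.

```python
from typing import List, Optional, Set

# Earlier entries take precedence over later ones (assumed reading of "<").
PRECEDENCE: List[str] = ["order-by", "price"]

# Toy keyword-based annotators; a stand-in for real entity recognition.
KEYWORDS = {
    "price": ["price", "€", "£"],
    "order-by": ["first", "sort", "order by"],
}


def annotate(text: str) -> Set[str]:
    """Return every annotation type whose keywords occur in the text."""
    lowered = text.lower()
    return {
        annotation
        for annotation, words in KEYWORDS.items()
        if any(word in lowered for word in words)
    }


def exclusive_annotation(types: Set[str]) -> Optional[str]:
    """Pick the single annotation type with the highest precedence, if any."""
    for annotation in PRECEDENCE:
        if annotation in types:
            return annotation
    return None
```

With these toy annotators, "lowest price first" receives both types, and the precedence order resolves it to "order-by", while a plain "€120" stays a "price".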
    71. 71. OPAL ›❯ Form Interpretation: Annotation Queries. [Figure 6: Example Form Labeling.] N@A{d, p, e}: "all nodes N having labels annotated with A". Modifiers: d = direct (labels associated with the node itself); p = proper (from labels, but not from field values); e = exclusive (labels considering precedence). Paper excerpt visible in the background (truncated at both ends): "... are either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date. DEFINITION 11. Given a form labeling F on a DOM P and a ..."
    77. 77. OPAL ›❯ Form Interpretation: Annotation Queries. [Figure 6: Example Form Labeling, with nodes 1 to 4 carrying annotations A, B, and C; callouts mark a segment, a field, a value label, and a proper label.] Assuming B < A: N@A{} = 2, 3, 4; N@A{d, e} = 2, 4; N@A{p, e} = 2, 3.
    79. 79. OPAL ›❯ Form Interpretation: OPAL-TL Example. Price range: two successive fields in the same group, at least one of "price" type, with a range connector in between.
        TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
        TEMPLATE concept_by_segment<C,A> { concept<C>(N) ⇐ N@A{e,p} }
        TEMPLATE concept_minmax<C,C_M,A> {
          concept<C_M>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
          concept<C_M>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1), concept<C>(N1), N2@range_connector{e,d}, ¬(A1 A, N2@A1{d})
          concept<C_M>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,p}, N2@A{e,p}, (N1@min{e,p}, N2@max{e,p}) ∨ (N1@max{e,p}, N2@min{e,p})
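The price-range pattern described in the slide's callouts (two successive fields in the same group, at least one of "price" type, a range connector in between) can be mimicked in a few lines of Python. This is only a loose paraphrase of the callouts, not a translation of the OPAL-TL templates: it assumes the first field is the minimum and the second the maximum, and `Item`, `classify_range`, and the example data are hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Item(NamedTuple):
    name: str
    kind: str                    # "field" or "text"
    annotations: FrozenSet[str]  # annotation types attached to this item


def classify_range(group: List[Item], concept: str = "price") -> Dict[str, str]:
    """Find field / range-connector / field triples and type them as min and max."""
    result: Dict[str, str] = {}
    for i in range(len(group) - 2):
        first, middle, second = group[i], group[i + 1], group[i + 2]
        is_pattern = (
            first.kind == "field"
            and second.kind == "field"
            and middle.kind == "text"
            and "range_connector" in middle.annotations
            and (concept in first.annotations or concept in second.annotations)
        )
        if is_pattern:
            # Assumption: document order encodes min before max.
            result[first.name] = concept + "_min"
            result[second.name] = concept + "_max"
    return result


PRICE_GROUP = [
    Item("from", "field", frozenset({"price"})),
    Item("to_text", "text", frozenset({"range_connector"})),
    Item("to", "field", frozenset()),
]
```

Note that the second field carries no annotation of its own; it is typed purely from its position next to a price field and a range connector, which is the point of the pattern.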
    80. 80. OPAL ›❯ Form Interpretation: OPAL Repair. Enforces structural constraints by repairing 4 types of violations: under-segmentation (missing segments), over-segmentation (superfluous segments), under-classification (missing types), and over-classification (superfluous types).
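As a small illustration of the two classification-related violation kinds, the sketch below only detects missing and superfluous types against a constraint such as "R.E. form must have a location field"; it does not perform any repair and does not cover segmentation. The function name and the example sets are hypothetical.

```python
from typing import Dict, Iterable, List


def classification_violations(
    found: Iterable[str],
    required: Iterable[str],
    allowed: Iterable[str],
) -> Dict[str, List[str]]:
    """Report required types that are missing and found types that are not allowed."""
    found_set = set(found)
    return {
        "missing": sorted(set(required) - found_set),
        "superfluous": sorted(found_set - set(allowed)),
    }


# Hypothetical real-estate constraints for illustration only.
REQUIRED_TYPES = {"location"}
ALLOWED_TYPES = {"location", "price", "bedrooms"}
```

A form typed with "price" and "mileage" would be flagged as missing "location" (under-classification) and carrying a superfluous "mileage" (over-classification).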
    81. 81. OPAL ›❯ Evaluation: OPAL evaluation across Form Labeling (Field Scope, Segment Scope, Layout Scope) and Form Interpretation (Domain Scope).
    82. 82. OPAL ›❯ Evaluation: Datasets. Domain-awareness experiment: UK real estate and used car domains, 100 web forms sampled for each domain. Domain-independence experiment, ICQ: web query interfaces in 5 domains, 100 query interfaces (20 in each domain). Domain-independence experiment, Tel-8: web query interfaces in 8 domains, 477 in total.
    83. 83. OPAL ›❯ Evaluation: Evaluation Summary. [Chart: precision, recall, and F-score on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436); axis from 0.94 to 1.]
    84. 84. OPAL ›❯ Evaluation: Scope Contribution. [Chart: contribution of the field, segment, layout, and domain scopes for Real-estate and Used-car; axis from 0.6 to 1.]
    86. 86. OPAL ›❯ Evaluation: OPAL Performance (incl. browser). [Chart: total computation time in seconds (0 to 200) against number of nodes (0 to 8000); more than 70% below 50 s.]
    87. 87. OPAL ›❯ Conclusion: OPAL conclusion. Multi-scope, domain-independent form labeling: combines visual, textual, and structural features; expands form labeling with field, segment, page scope. Domain-aware form interpretation: classifies form fields and repairs the domain form model; disambiguates form classification (up to 10% gain in precision). Template language OPAL-TL: provides a library of ubiquitous form patterns; significantly speeds up domain instantiation.
    88. 88. OPAL ›❯ Conclusion: Future work. Integrate form understanding and record extraction (e.g., AMBER): "site scope". Semi-automatic ontology creation for new domains: tailor ontology learning approaches to forms. Integrate form constraint extraction from probing and JavaScript analysis (ProFound demo).
    89. 89. DIADEM ›❯ OPAL poster: "OPAL: A Passe-partout for Web Forms". DIADEM: domain-centric intelligent automated data extraction methodology. Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart. Digital home: diadem.cs.ox.ac.uk/opal/. Contact: diadem-opal@cs.ox.ac.uk.

        OPAL: Automated Form Understanding for the Deep Web
          • OPAL combines multi-scope domain-independent analysis with domain knowledge.
          • Multi-scope domain-independent analysis: field, segment, and layout scope.
          • Integration of three scopes yields more robust, simpler heuristics than a single scope.
          • Strict preference for disambiguation for quality and performance reasons.
          • Domain knowledge on top of the domain-independent analysis for classifying form fields and segments according to the domain ontology, and for verifying and repairing the form model to be consistent with domain constraints.
          • Datalog-based template language for easy definition of domain knowledge.
          • OPAL outperforms previous approaches even without domain knowledge; with domain knowledge we achieve nearly perfect accuracy in multiple domains.

        OPAL: A Passe-partout for the Web
          • Automatic form filling based on a domain-specific master form.
          • Automatic (approximate) matching of master form values to values of concrete fields.
          • Visualization of form and segment concepts.
          • Automatically detects forms of the given domain and fills them.
          • Works on nearly any page of the domain.

        Segment Scope (segmentation: find the "logical" structure of the form)
          • Segmentation tree with only fields as leaves and form segments s as inner nodes, such that s has at least degree two and all fields in s are style-equivalent.
          • Segmentation labeling: distribute labels to fields, if there is a regular structure; to segments, if there is a single, prefix label.
          • Style-equivalence: two nodes n and n' are style-equivalent if same class or same type and CSS style.
          • Complexity: linear in document size and depth.
          • Paper excerpt: "... n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field that is itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19)."
          • Paper excerpt: "Figure 4 illustrates the segment scope labeling with triangles denoting segments, diamonds fields, black circles text nodes, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly."

        Layout Scope (visual heuristics: find labels in visual proximity of a field)
          • But: not "overshadowed" by another field.
          • But: preference for labels preceding a field in reading order.
          • t is visible from a point p if it is north-west of p.
          • f overshadows f' for t if t is visible from the bottom-right corner of f and f, f' are unaligned, or from the bottom-left corner of f and f, f' are aligned, …
          • Paper excerpt (3.3 Layout Scope): "At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes." Definition 9 describes the layout tree of a DOM P as a tuple of the set N_P of DOM nodes from P, the "belongs to" (containment) relation, the neighbour relations w, nw, n, ne, e, se, s, sw (west, north-west, north, ...) from RCR [12], and aligned(x, y), which holds if x and y have the same height and are horizontally aligned. "The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n. In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing."
          • Paper excerpt: "To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed ... The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments."

        Domain Scope: OPAL-TL
          • OPAL-TL extends Datalog with templates and annotation queries.
          • Templates = common patterns in web forms, e.g., range specification.
          • Rules for domain-specific element and segment types and constraints.
          • Annotation queries: labels annotated with a certain type. Direct: direct only, or also group labels? Proper: look for labels ("price") or also values ("£10")? Exclusive: considering precedence.
          • Templates: limited second-order, same data complexity; template variables quantify over predicates or types; instantiation reduces to Datalog.
          • Paper excerpt (4. Form Interpretation): "There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain specific peculiarities, such as "guide price", "current offers in excess", or payment periods in real estate. OPAL's domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F', t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S."
          • Paper excerpt (4.1 Schema Design: OPAL-TL): "OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints. OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes."

        Form Filling: A Passe-partout for the Web
          • The master form serves as a template for filling individual forms (passe-partout).
          • Currently: free text values are used directly for free text inputs, with approximate matching for filling select boxes.

        Applications of OPAL
          • Search ("key" to the web), data extraction, assistive devices, multi-modal input (touch-screens), web automation and testing; all types of web automation need it.

        Experiments
          • [Charts: precision, recall, and F-score on UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436); ICQ results by domain (Airfare, Auto, Book, Job, US R.E.); scope contribution (field, segment, layout, domain) for Real-estate and Used-car.]

        Thank You
