SlideShare a Scribd company logo
1 of 1
Download to read offline
OPAL: A Passe-partout for Web Forms 
DIADEM data extraction methodology 
domain-centric intelligent automated 
b-node 
6 7 7 8 
DOM tree 
Field Scope 
TT 4 2 
Fields & Labels 
NW NE 
Segment tree 
N 
Segment Scope 
Segments & Labels 
Layout tree 
Layout Scope 
Visual Labels 
Domain Scope 
Form Model 
Schema tree 
Master form serves as template for filling individual forms (passe-partout) 
— currently: free text values used directly for free text inputs 
— but using approximate matching for filling select boxes 
" 
(N1@min{e,p},N2@max{e,p}) 
# 
Figure 3: OPAL Interface 
1 
0.985 
0.97 
0.955 
(a) 
(b) 
(c) 
(d) 
(f) 
(e) 
(a) web page (b) page scope 
1 
0.98 
0.96 
0.94 
0.92 
Figure 4: Holbrook Moran form and page-scope labeling 
1 
0.99 
field segment layout domain 
Precision 
0.98 
Recall 
0.97 F-score 
0.96 
0.95 
Real-estate Used-car 
1 
0.8 
domain 
layout 
segment 
Real-estate Used-car 
0.6 
field 
1 1 
1 
2 
2 3 
3 
4 
4 
5 
5 8 
6 
Figure 4: Example for Segment Scope Labeling 
ns as representative for s (f(s) = ns). For each segment with reg-ular 
interleaving of text nodes and field or segment nodes, we use 
those text nodes as labels for these nodes, preserving any already 
assigned labels and fields (from field scope). In detail, we iterate 
over all descendants c of each segment in document order, skip-ping 
any nodes that are descendants of another segment or field 
itself contained in n (line 13). In the iteration, we collect all field or 
segment nodes in Nodes, and all sets of text nodes between field or 
segment nodes in Labels, except those text nodes already assigned 
as labels in field scope (line 14), as we assume that these are outliers 
in the regular structure of the segment. We assign the i-th text node 
group to the i-th field, if the two lists have the same size (possibly 
using the first text node as labels of the segment, line 17–19). 
Figure 4 illustrates the segment scope labeling with triangles 
denoting text nodes, diamonds fields, black circles segments, and 
white circles DOM nodes not in the segment tree. The numbers in-dicate 
which text nodes are assigned as labels to which segments or 
fields. E.g., for the left hand segment, we observe a regular struc-ture 
of (text node+, field)+ and thus we assign the i-th group of 
text nodes to the i-th field. For the right hand segment (4), we find 
a subsegment (5) and field 8 that is already labeled with text node 
8 in the field scope. Thus 8 is ignored and only one text node re-mains 
directly in 4, which becomes the segment label. In 5, we find 
one more text node group than fields and thus consider the first text 
node group as a segment label. The remaining nodes have a regular 
structure (field, text node+)+ and get assigned accordingly. 
3.3 Layout Scope 
At layout scope, we further refine the form labeling for each 
form field not yet labelled in field or segment scope, by exploring 
the visible text nodes in the west, north-west, or north quadrant, 
if they are not overshadowed by any other field. To this end, OPAL 
constructs a layout tree from the CSS box labels of the DOMnodes: 
DEFINITION 9. The layout tree of a given DOM P is a tuple 
(NP,,w,nw,n,ne,e, se, s,sw, aligned) where NP is the set of DOM 
nodes from P, ,w,nw,n, . . . the “belongs to” (containment), west, 
north-west, north, . . . relations  
from RCR [12], and aligned(x, y) 
holds if x and y have the same } 
height and are horizontally aligned. 
We call w,nw, . . . the neighbour relations. The layout tree is at 
most quadratic in size of a given DOM P and can be computed 
in O(|P|2). For convenience, we write, e.g., w-nw-n to denote the 
union of the relations w, nw, and n. 
In cultures with left-to-right reading direction, we observe a strong 
preference for placing labels in the w-nw-n region from a field. How-ever, 
! 
child(N2,G),N16= N2,¬(conceptC(N2)_segmentC(N2)) 
Ⓐ Field 
Ⓑ Segment 
are either provided by human domain experts or derived from ex-ternal 
sources such as DBPedia and Freebase. The current OPAL 
version contains a large set of such artefacts for common domain 
types such as price, location, or date. 
DEFINITION 11. Given a form labeling F on a DOM P and an 
annotation schema L, an OPAL-TL annotation query is an expres-sion 
forms often have many fields interspersed with field labels and 
segment labels. Thus we have to carefully consider overshadowing. 
Intuitively, for a field f , a visible text node t is overshadowed by 
F3 
F2 
T3 
T1 
W E 
S 
SE 
SW 
F1 
Figure 5: Layout Scope Labeling 
w-nw-n-ne-e(t, f 0) or (2) f and f 0 are aligned and (i) w(t, f 0) or 
(ii) nw-n(t, f 0) and there is a text node t0 not overshadowed by an-other 
field with ne-e(t0, f 0) and w-nw-n(t0, f ). 
To illustrate this overshadowing, consider the example in Fig-ure 
5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3, 
only T1 is not overshadowed, as there is no other text node that is 
south-east or south from T3 not overshadowed by another field. 
The layout scope labeling is then produced as follows: For each 
field f , we collect all text nodes t with w-nw-n(t, f ) and add them 
as labels to f if they are not overshadowed by another field and not 
contained in a segment that is no ancestor of f . The latter prevents 
assignment of labels from unrelated form segments. 
4. FORM INTERPRETATION 
There is no straightforward relationship between form fields for 
domain concepts, such as location or price, and their structure within 
a form. Even seemingly domain-independent concepts, such as 
price, often exhibit domain specific peculiarities, such as “guide 
price”, “current offers in excess”, or payment periods in real es-tate. 
TT 4 2 
F2 
NW N 
NE 
OPAL’s domain schemata allow us to cover these specifics. 
We recall from Section 2 that a form model (F0, t) for a schema S 
is derived from a form labeling F by extending F with types and 
restructuring its inner nodes to fit the structural constraints of S. 
OPAL performs form interpretation of a form labeling F in two 
steps: (1) the classification of nodes in F according to the domain 
types T to obtain a partial typing tP. This step relies on the anno-tation 
schema L and its typing of labels in F; (2) the model repair 
where the segmentation structure derived in the segmentation scope 
(Section 3.2) is aligned with the structure constraints of S. 
4.1 Schema Design: OPAL-TL 
1 
OPAL provides a template language, OPAL-TL, for easily speci-fying 
domain schemata reusing common concepts and their con-straints 
as well as concept templates. To implement a new do-main, 
we only need to provide (1) a set of annotators implementing 
2 4 
A A 
A 
A 
B 
3 
isLabela and B 
isValuea and (2) an OPAL-TL specification of the do-main 
C 
Figure types 6: Example and their Form classification Labeling 
and structural constraints. 
OPAL-TL extends Datalog with templates and predefined predi-cates 
for convenient querying of annotations and DOM nodes. An 
OPAL-TL program is executed against a form labeling F and aDOM 
P. Relations from F and P are mapped in the obvious way to OPAL-TL. 
We only use child (descendant, resp.) for the child (descen-dant, 
resp.) relation in F. We extend document and sibling order 
from P to F: follows(X,Y) for X,Y 2 F, if Rfollowing(f(X),f(Y)) 2 P and no other node in F occurs between X and Y in document or-der; 
adjacent(X,Y), if Rnext-sibling(f(X),f(Y)) 2 P or vice versa. 
Finally, we abbreviate labell(f(X)) as l(X). 
Segmentation: Find “logical” structure of the form 
— segmentation tree with only fields as leaves and 
• form segments s as inner nodes such that s has 
• at least degree two and all fields in s are style-equivalent 
— segmentation labeling: distribute labels 
• to fields, if there is a regular structure 
• to segments, if single, prefix label 
— style-equivalence: two nodes n and n’ 
• if same class or same type and CSS style 
— complexity: linear in document size and depth 
Visual heuristics: Find labels in visual proximity of a field 
— but: not “overshadowed” by another field 
— but: preference for labels preceding a field in reading order 
— t visible from a point p if north-west of p 
— f overshadows f’ for t if t visible from 
• bottom-right corner of f and f, f’ unaligned 
• bottom-left corner of f and f, f’ aligned 
• bottom-right corner of f, f, f' aligned and f 
has no other visual label 
F3 F1 
T3 
T1 
W E 
S 
SE 
SW 
TEMPLATE segmentC{ 
2 segmentC(G)(outlierC(G),child(N1,G),¬ 
! 
child(N2,G), 
¬(conceptC(N2) _ segmentC(N2)) 
4 
TEMPLATE segment_rangeC,CM { 
6 segmentC(G)(outlierC(G),conceptCM(N1),conceptCM(N2), 
N16= N2,child(N1,G),child(N2,G) } 
8 
TEMPLATE segment_with_uniqueC,U { 
10 segmentC(G)(outlierC(G),child(N1,G), conceptU(N1,G), 
¬ 
 
. } 
12 
TEMPLATE outlierC{ 
14 outlierC(G)(root(G)_child(G,P),child(G0,P),¬(segmentC(G0)) } 
Figure 8: OPAL-TL structural constraints 
Repairing form interpretations. The classification yields a 
form interpretation F, that is, however, not necessarily a model un-der 
S, and may contain violations of structural constraints. We 
adapt the types of fields and segments and the segment hierarchy 
of F with the rewriting rules described below to construct a form 
model compliant with S. OPAL performs the rewriting in a strati-fied 
manner to guarantee termination and introduces at most n new 
segments where n is the number of fields in the form. 
(1) Under Segmentation: If there is a segment n with type t such 
that CT (t) requires additional child segments of type t1, . . . ,tk62 child-T (n), we try to partition the children of n into k+1 partitions 
P1, . . . ,Pk,Pn such that Pi |= CT (ti) and Pn [ {t1, . . . ,tk} |= CT (t). 
For each Pi we add a new segment node as child of n, classify it 
with ti, and move all nodes assigned to Pi from n to that segment. 
of the form: X@A{d, p, e} where X is a first-order variable, 
A 2 A, and d, p, and e are annotation modifiers. An annotation 
query X@Aμ with μ ✓ {d, p, e} holds for all X 2 JAμ K with 
J@Aμ K = {n 2 P : Allowμ (n)Matchμ (A)6= /0}Blockμ (A) 
TEMPLATE basic_conceptC,A { conceptC(N)(N@A{d,e,p} } 
2 
TEMPLATE concept_by_segmentC,A { conceptC(N)(N@A{e,p} } 
4 
TEMPLATE concept_minmaxC,CM,A { 
6 conceptCM(N1)(child(N1,G),child(N2,G),adjacent(N1,N2), 
N1@A{e,d},(conceptC(N2) _ N2@A{e,d}) 
8 conceptCM(N2)(child(N1,G),child(N2,G),follows(N2,N1), 
conceptC(N1),N2@range_connector{e,d},¬(A1 * A, N2@A1{d}) 
10 conceptCM(N1)(child(N1,G),child(N2,G),adjacent(N1,N2), 
N1@A{e,p},N2@A{e,p}, 
12 _ (N1@max{e,p},N2@min{e,p}) 
Figure 7: OPAL-TL classification templates 
As an example, the following template defines a family of con-straints 
that associate the domain type D to a node N whenever N 
is labeled by an exclusive direct and proper annotation of type A. 
TEMPLATE basic_conceptD,A { conceptD(N) ( N@A{e,d,l} } 
A template tpl is instantiated to produce a family of rules where 
Real-estate 
Used-car 
0.6 0.7 0.8 0.9 1 
0.9 
Airfare Auto Book Job US R.E. 
0.94 
UK Real Estate (100) UK Used Car (100) ICQ (98) Tel-8 (436) 
Domain Scope: OPAL-TL 
OPAL-TL extends Datalog¬,≠ with templates and annotation queries 
— templates = common pattern in web forms, e.g., range specification 
— rules = constraints for domain-specific element and segment types 
— annotation queries: labels annotated with certain type 
• direct? look for direct only or also group labels? 
• proper? look for labels (‘price’) or also values (‘£10’) 
• exclusive? majority of labels are of the type 
— template: limited second-order, same data complexity 
• template variables quantify over predicates or types 
• instantiation reduces to Datalog 
INSTANTIATE basic_conceptD,A using {RADIUS, radius} 
Form Filling: A Passe-partout for the Web Applications of OPAL 
➊ URL ➊ 
➋ Visualization 
Controller 
➋ 
➌ Web page ➌ 
➍ Visualization 
➎ Master form 
Ⓐ 
➍ 
Ⓑ 
filled automatically 
according to master 
form 
provides passe-partout 
for filling 
individual forms 
Experiments 
Search 
Data Extraction 
Assistive 
Devices 
Multi-modal Input 
(Touch-Screens) 
Web Automation 
 Testing 
form filling 
—‘key’ to the web 
— all types of web 
automation need it 
OPAL: Automated Form Understanding for the Deep Web 
OPAL combines multi-scope domain-independent analysis with domain knowledge 
— multi-scope domain-independent analysis: field, segment, and layout scope 
— integration of three scopes yields more robust, simpler heuristics than single scope 
— strict preference for disambiguation for quality and performance reasons 
OPAL: A Passe-partout for the Web 
Domain knowledge on top of the domain-independent analysis for 
— classifying form fields and segments according to the domain ontology 
— verifying and repairing the form model to be consistent with domain constraints 
— Datalog-based template language for easy definition of domain knowledge 
Automatic form filling based on domain-specific master form 
— automatic (approximate) matching of master form values to values of concrete fields 
— visualization of form and segment concepts 
— automatically detects forms of the given domain and fills them 
— works on nearly any page of the domain 
OPAL outperforms previousapproaches even without domain knowledge 
— with domain knowledge we achieve nearly perfect accuracy in multiple domains 
Digital Home 
diadem.cs.ox.ac.uk/opal/ 
Authors 
Xiaonan Guo, Jochen Kranzdorf, 
Tim Furche, Giovanni Grasso, 
Giorgio Orsi, Christian Schallhart 
Sponsors 
diadem-opal@cs.ox.ac.uk 
Segment Scope 
Layout Scope

More Related Content

Viewers also liked

RDFox Poster
RDFox PosterRDFox Poster
RDFox PosterDBOnto
 
PDQ Poster
PDQ PosterPDQ Poster
PDQ PosterDBOnto
 
PAGOdA paper
PAGOdA paperPAGOdA paper
PAGOdA paperDBOnto
 
Aggregating Semantic Annotators Paper
Aggregating Semantic Annotators PaperAggregating Semantic Annotators Paper
Aggregating Semantic Annotators PaperDBOnto
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...DBOnto
 
PAGOdA Presentation
PAGOdA PresentationPAGOdA Presentation
PAGOdA PresentationDBOnto
 
ROSeAnn Presentation
ROSeAnn PresentationROSeAnn Presentation
ROSeAnn PresentationDBOnto
 
PDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationPDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationDBOnto
 
Diadem DBOnto Kick Off meeting
Diadem DBOnto Kick Off meetingDiadem DBOnto Kick Off meeting
Diadem DBOnto Kick Off meetingDBOnto
 
SemFacet paper
SemFacet paperSemFacet paper
SemFacet paperDBOnto
 
Welcome by Ian Horrocks
Welcome by Ian HorrocksWelcome by Ian Horrocks
Welcome by Ian HorrocksDBOnto
 
DIADEM: domain-centric intelligent automated data extraction methodology Pres...
DIADEM: domain-centric intelligent automated data extraction methodology Pres...DIADEM: domain-centric intelligent automated data extraction methodology Pres...
DIADEM: domain-centric intelligent automated data extraction methodology Pres...DBOnto
 
Parallel Datalog Reasoning in RDFox Presentation
Parallel Datalog Reasoning in RDFox PresentationParallel Datalog Reasoning in RDFox Presentation
Parallel Datalog Reasoning in RDFox PresentationDBOnto
 
Query Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning PaperQuery Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning PaperDBOnto
 

Viewers also liked (14)

RDFox Poster
RDFox PosterRDFox Poster
RDFox Poster
 
PDQ Poster
PDQ PosterPDQ Poster
PDQ Poster
 
PAGOdA paper
PAGOdA paperPAGOdA paper
PAGOdA paper
 
Aggregating Semantic Annotators Paper
Aggregating Semantic Annotators PaperAggregating Semantic Annotators Paper
Aggregating Semantic Annotators Paper
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
 
PAGOdA Presentation
PAGOdA PresentationPAGOdA Presentation
PAGOdA Presentation
 
ROSeAnn Presentation
ROSeAnn PresentationROSeAnn Presentation
ROSeAnn Presentation
 
PDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationPDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentation
 
Diadem DBOnto Kick Off meeting
Diadem DBOnto Kick Off meetingDiadem DBOnto Kick Off meeting
Diadem DBOnto Kick Off meeting
 
SemFacet paper
SemFacet paperSemFacet paper
SemFacet paper
 
Welcome by Ian Horrocks
Welcome by Ian HorrocksWelcome by Ian Horrocks
Welcome by Ian Horrocks
 
DIADEM: domain-centric intelligent automated data extraction methodology Pres...
DIADEM: domain-centric intelligent automated data extraction methodology Pres...DIADEM: domain-centric intelligent automated data extraction methodology Pres...
DIADEM: domain-centric intelligent automated data extraction methodology Pres...
 
Parallel Datalog Reasoning in RDFox Presentation
Parallel Datalog Reasoning in RDFox PresentationParallel Datalog Reasoning in RDFox Presentation
Parallel Datalog Reasoning in RDFox Presentation
 
Query Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning PaperQuery Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning Paper
 

Recently uploaded

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Recently uploaded (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

OPAL: A Passe-partout for Web Forms Poster

  • 1. OPAL: A Passe-partout for Web Forms DIADEM data extraction methodology domain-centric intelligent automated b-node 6 7 7 8 DOM tree Field Scope TT 4 2 Fields & Labels NW NE Segment tree N Segment Scope Segments & Labels Layout tree Layout Scope Visual Labels Domain Scope Form Model Schema tree Master form serves as template for filling individual forms (passe-partout) — currently: free text values used directly for free text inputs — but using approximate matching for filling select boxes " (N1@min{e,p},N2@max{e,p}) # Figure 3: OPAL Interface 1 0.985 0.97 0.955 (a) (b) (c) (d) (f) (e) (a) web page (b) page scope 1 0.98 0.96 0.94 0.92 Figure 4: Holbrook Moran form and page-scope labeling 1 0.99 field segment layout domain Precision 0.98 Recall 0.97 F-score 0.96 0.95 Real-estate Used-car 1 0.8 domain layout segment Real-estate Used-car 0.6 field 1 1 1 2 2 3 3 4 4 5 5 8 6 Figure 4: Example for Segment Scope Labeling ns as representative for s (f(s) = ns). For each segment with reg-ular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skip-ping any nodes that are descendants of another segment or field itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19). Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers in-dicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular struc-ture of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node re-mains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly. 3.3 Layout Scope At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOMnodes: DEFINITION 9. The layout tree of a given DOM P is a tuple (NP,,w,nw,n,ne,e, se, s,sw, aligned) where NP is the set of DOM nodes from P, ,w,nw,n, . . . the “belongs to” (containment), west, north-west, north, . . . relations from RCR [12], and aligned(x, y) holds if x and y have the same } height and are horizontally aligned. We call w,nw, . . . the neighbour relations. The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n. In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. How-ever, ! child(N2,G),N16= N2,¬(conceptC(N2)_segmentC(N2)) Ⓐ Field Ⓑ Segment are either provided by human domain experts or derived from ex-ternal sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date. DEFINITION 11. Given a form labeling F on a DOM P and an annotation schema L, an OPAL-TL annotation query is an expres-sion forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f , a visible text node t is overshadowed by F3 F2 T3 T1 W E S SE SW F1 Figure 5: Layout Scope Labeling w-nw-n-ne-e(t, f 0) or (2) f and f 0 are aligned and (i) w(t, f 0) or (ii) nw-n(t, f 0) and there is a text node t0 not overshadowed by an-other field with ne-e(t0, f 0) and w-nw-n(t0, f ). To illustrate this overshadowing, consider the example in Fig-ure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3, only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field. The layout scope labeling is then produced as follows: For each field f , we collect all text nodes t with w-nw-n(t, f ) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f . The latter prevents assignment of labels from unrelated form segments. 4. FORM INTERPRETATION There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain specific peculiarities, such as “guide price”, “current offers in excess”, or payment periods in real es-tate. TT 4 2 F2 NW N NE OPAL’s domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F0, t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing tP. This step relies on the anno-tation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S. 4.1 Schema Design: OPAL-TL 1 OPAL provides a template language, OPAL-TL, for easily speci-fying domain schemata reusing common concepts and their con-straints as well as concept templates. To implement a new do-main, we only need to provide (1) a set of annotators implementing 2 4 A A A A B 3 isLabela and B isValuea and (2) an OPAL-TL specification of the do-main C Figure types 6: Example and their Form classification Labeling and structural constraints. OPAL-TL extends Datalog with templates and predefined predi-cates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and aDOM P. Relations from F and P are mapped in the obvious way to OPAL-TL. We only use child (descendant, resp.) for the child (descen-dant, resp.) relation in F. We extend document and sibling order from P to F: follows(X,Y) for X,Y 2 F, if Rfollowing(f(X),f(Y)) 2 P and no other node in F occurs between X and Y in document or-der; adjacent(X,Y), if Rnext-sibling(f(X),f(Y)) 2 P or vice versa. Finally, we abbreviate labell(f(X)) as l(X). Segmentation: Find “logical” structure of the form — segmentation tree with only fields as leaves and • form segments s as inner nodes such that s has • at least degree two and all fields in s are style-equivalent — segmentation labeling: distribute labels • to fields, if there is a regular structure • to segments, if single, prefix label — style-equivalence: two nodes n and n’ • if same class or same type and CSS style — complexity: linear in document size and depth Visual heuristics: Find labels in visual proximity of a field — but: not “overshadowed” by another field — but: preference for labels preceding a field in reading order — t visible from a point p if north-west of p — f overshadows f’ for t if t visible from • bottom-right corner of f and f, f’ unaligned • bottom-left corner of f and f, f’ aligned • bottom-right corner of f, f, f' aligned and f has no other visual label F3 F1 T3 T1 W E S SE SW TEMPLATE segmentC{ 2 segmentC(G)(outlierC(G),child(N1,G),¬ ! child(N2,G), ¬(conceptC(N2) _ segmentC(N2)) 4 TEMPLATE segment_rangeC,CM { 6 segmentC(G)(outlierC(G),conceptCM(N1),conceptCM(N2), N16= N2,child(N1,G),child(N2,G) } 8 TEMPLATE segment_with_uniqueC,U { 10 segmentC(G)(outlierC(G),child(N1,G), conceptU(N1,G), ¬ . } 12 TEMPLATE outlierC{ 14 outlierC(G)(root(G)_child(G,P),child(G0,P),¬(segmentC(G0)) } Figure 8: OPAL-TL structural constraints Repairing form interpretations. The classification yields a form interpretation F, that is, however, not necessarily a model un-der S, and may contain violations of structural constraints. We adapt the types of fields and segments and the segment hierarchy of F with the rewriting rules described below to construct a form model compliant with S. OPAL performs the rewriting in a strati-fied manner to guarantee termination and introduces at most n new segments where n is the number of fields in the form. (1) Under Segmentation: If there is a segment n with type t such that CT (t) requires additional child segments of type t1, . . . ,tk62 child-T (n), we try to partition the children of n into k+1 partitions P1, . . . ,Pk,Pn such that Pi |= CT (ti) and Pn [ {t1, . . . ,tk} |= CT (t). For each Pi we add a new segment node as child of n, classify it with ti, and move all nodes assigned to Pi from n to that segment. of the form: X@A{d, p, e} where X is a first-order variable, A 2 A, and d, p, and e are annotation modifiers. An annotation query X@Aμ with μ ✓ {d, p, e} holds for all X 2 JAμ K with J@Aμ K = {n 2 P : Allowμ (n)Matchμ (A)6= /0}Blockμ (A) TEMPLATE basic_conceptC,A { conceptC(N)(N@A{d,e,p} } 2 TEMPLATE concept_by_segmentC,A { conceptC(N)(N@A{e,p} } 4 TEMPLATE concept_minmaxC,CM,A { 6 conceptCM(N1)(child(N1,G),child(N2,G),adjacent(N1,N2), N1@A{e,d},(conceptC(N2) _ N2@A{e,d}) 8 conceptCM(N2)(child(N1,G),child(N2,G),follows(N2,N1), conceptC(N1),N2@range_connector{e,d},¬(A1 * A, N2@A1{d}) 10 conceptCM(N1)(child(N1,G),child(N2,G),adjacent(N1,N2), N1@A{e,p},N2@A{e,p}, 12 _ (N1@max{e,p},N2@min{e,p}) Figure 7: OPAL-TL classification templates As an example, the following template defines a family of con-straints that associate the domain type D to a node N whenever N is labeled by an exclusive direct and proper annotation of type A. TEMPLATE basic_conceptD,A { conceptD(N) ( N@A{e,d,l} } A template tpl is instantiated to produce a family of rules where Real-estate Used-car 0.6 0.7 0.8 0.9 1 0.9 Airfare Auto Book Job US R.E. 0.94 UK Real Estate (100) UK Used Car (100) ICQ (98) Tel-8 (436) Domain Scope: OPAL-TL OPAL-TL extends Datalog¬,≠ with templates and annotation queries — templates = common pattern in web forms, e.g., range specification — rules = constraints for domain-specific element and segment types — annotation queries: labels annotated with certain type • direct? look for direct only or also group labels? • proper? look for labels (‘price’) or also values (‘£10’) • exclusive? majority of labels are of the type — template: limited second-order, same data complexity • template variables quantify over predicates or types • instantiation reduces to Datalog INSTANTIATE basic_conceptD,A using {RADIUS, radius} Form Filling: A Passe-partout for the Web Applications of OPAL ➊ URL ➊ ➋ Visualization Controller ➋ ➌ Web page ➌ ➍ Visualization ➎ Master form Ⓐ ➍ Ⓑ filled automatically according to master form provides passe-partout for filling individual forms Experiments Search Data Extraction Assistive Devices Multi-modal Input (Touch-Screens) Web Automation Testing form filling —‘key’ to the web — all types of web automation need it OPAL: Automated Form Understanding for the Deep Web OPAL combines multi-scope domain-independent analysis with domain knowledge — multi-scope domain-independent analysis: field, segment, and layout scope — integration of three scopes yields more robust, simpler heuristics than single scope — strict preference for disambiguation for quality and performance reasons OPAL: A Passe-partout for the Web Domain knowledge on top of the domain-independent analysis for — classifying form fields and segments according to the domain ontology — verifying and repairing the form model to be consistent with domain constraints — Datalog-based template language for easy definition of domain knowledge Automatic form filling based on domain-specific master form — automatic (approximate) matching of master form values to values of concrete fields — visualization of form and segment concepts — automatically detects forms of the given domain and fills them — works on nearly any page of the domain OPAL outperforms previousapproaches even without domain knowledge — with domain knowledge we achieve nearly perfect accuracy in multiple domains Digital Home diadem.cs.ox.ac.uk/opal/ Authors Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart Sponsors diadem-opal@cs.ox.ac.uk Segment Scope Layout Scope