AMBER presentation

Little Knowledge Rules The Web:
Domain-Centric Result Page Extraction

Tim Furche, Georg Gottlob, Giovanni Grasso,
Giorgio Orsi, Christian Schallhart, Cheng Wang

Department of Computer Science
University of Oxford
Cheng.wang@trinity.ox.ac.uk

Outline
Adaptable Model-Based Extraction of
Result Pages (AMBER)

• System Overview
• Experiments
• Current Work

AMBER: System Overview
Needs only one clue
Very high precision & recall

Adaptable Model-Based Extraction of Result Pages

Implemented in rules
Domain-parameterized tool, currently aimed at UK real-estate

Part of DIADEM: Domain-centric Intelligent
Automated Data Extraction Methodology

Fact Generation & Annotation
• Live browser (Mozilla XUL-Runner)
• Extract DOM tree
• CSS box information
• Textual annotation with GATE (domain dep.)
– Gazetteers
– Regular expression like rules
• All represented as facts in the Page Model
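
The idea of putting DOM structure, CSS boxes, and textual annotations into a single fact base can be sketched as follows. This is only an illustration: the predicate names (`node`, `child`, `box`, `annotation`), the price regular expression, and the tiny gazetteer are assumptions standing in for the browser and GATE components, not AMBER's actual page model.

```python
import re

# Hypothetical stand-ins for GATE resources: one regex rule and one gazetteer.
PRICE_RE = re.compile(r"£\s?\d[\d,]*")
LOCATION_GAZETTEER = {"oxford", "london"}


def annotate(text):
    """Return the set of annotation labels that apply to a text fragment."""
    labels = set()
    if PRICE_RE.search(text):
        labels.add("price")
    words = (w.strip(",.").lower() for w in text.split())
    if any(w in LOCATION_GAZETTEER for w in words):
        labels.add("location")
    return labels


def page_model(nodes, boxes, texts):
    """Flatten DOM, box, and annotation information into one set of facts.

    nodes: {node_id: (tag, parent_id or None)}
    boxes: {node_id: (x, y, width, height)}
    texts: {node_id: text content}
    """
    facts = set()
    for node_id, (tag, parent_id) in nodes.items():
        facts.add(("node", node_id, tag))
        if parent_id is not None:
            facts.add(("child", parent_id, node_id))
    for node_id, (x, y, w, h) in boxes.items():
        facts.add(("box", node_id, x, y, w, h))
    for node_id, text in texts.items():
        for label in annotate(text):
            facts.add(("annotation", node_id, label))
    return facts
```

Keeping everything as flat tuples mirrors the way a rule engine would consume the page: later stages only query facts and never touch the browser again.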

Phenomenological Mapping
Fact → Attribute

• Attribute Model:
– Types & constraints
• DOM node and attribute
• Attribute Creation Constraints:
– Required Annotations
– Disallowed Annotations
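
A minimal sketch of attribute creation under required and disallowed annotations is given below. The `AttributeType` class, the function name, and the example labels are hypothetical; the only thing taken from the slide is that an attribute is created on a node when all required annotations are present and no disallowed annotation is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    name: str
    required: frozenset                  # annotations that must all be present
    disallowed: frozenset = frozenset()  # annotations that must all be absent


def create_attributes(annotations_by_node, attribute_types):
    """Return (node_id, attribute_name) pairs whose constraints are satisfied."""
    attributes = []
    for node_id, labels in sorted(annotations_by_node.items()):
        for attr in attribute_types:
            has_required = attr.required <= labels
            has_disallowed = bool(attr.disallowed & labels)
            if has_required and not has_disallowed:
                attributes.append((node_id, attr.name))
    return attributes
```

For example, a price attribute that requires a `price` annotation and disallows a `date` annotation would not be created on a node annotated with both.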

Segmentation Mapping: Identification
Attribute → Data area

• From bottom phenomena to data area
• Little knowledge rules the web → only one domain concept
(mandatory attribute)
– Price
– Location
– Title

Segmentation Mapping: Identification
• Multi data area identification
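
One way to picture identification from a single mandatory attribute is the sketch below: every node carrying the mandatory attribute (say, a price) votes for its ancestors, and the lowest ancestor covering at least two such nodes is taken as a data area. This heuristic, the threshold of two, and the function names are assumptions made for illustration; the slides do not spell out AMBER's actual rules.

```python
def ancestors(node, parent):
    """Yield the ancestors of a node, nearest first."""
    while node in parent:
        node = parent[node]
        yield node


def identify_data_areas(parent, mandatory_nodes):
    """Return data-area candidates, in order of first discovery.

    parent: {child_id: parent_id}; the root does not appear as a key.
    mandatory_nodes: nodes that carry the mandatory attribute.
    """
    # Count how many mandatory nodes sit below each ancestor.
    covered = {}
    for node in mandatory_nodes:
        for anc in ancestors(node, parent):
            covered[anc] = covered.get(anc, 0) + 1

    # For each mandatory node, pick the lowest ancestor covering >= 2 of them.
    areas = []
    for node in mandatory_nodes:
        for anc in ancestors(node, parent):
            if covered[anc] >= 2:
                if anc not in areas:
                    areas.append(anc)
                break
    return areas
```

Because each mandatory node stops at its own lowest qualifying ancestor, a page with two separate result lists yields two data areas rather than one area at the page root.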

Segmentation Mapping: Understanding
• Data area → Record
• Domain-independent
• Identify leading nodes
• Two problems
– Superfluous nodes
– Correct shift
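
The two problems named above can be illustrated with a toy segmentation over the children of a data area. In this sketch a leading node is assumed to be any child that satisfies a caller-supplied predicate, the record size is assumed to be the smallest distance between consecutive leading nodes, and `shift` is a hypothetical parameter saying how many siblings before the leading node a record starts (with `0 <= shift < size`). None of this is taken from AMBER's rules; it only shows why superfluous nodes and the choice of shift matter.

```python
def segment_records(children, is_leading, shift=0):
    """Split a data area's children into records plus superfluous nodes."""
    leading = [i for i, child in enumerate(children) if is_leading(child)]
    # Record size: smallest gap between consecutive leading nodes (1 if unknown).
    size = min((b - a for a, b in zip(leading, leading[1:])), default=1)

    records, used = [], set()
    for i in leading:
        start, end = i - shift, i - shift + size
        if start < 0 or end > len(children):
            continue  # the record would fall outside the data area
        records.append(children[start:end])
        used.update(range(start, end))

    # Children that belong to no record (for example headers and footers).
    superfluous = [c for i, c in enumerate(children) if i not in used]
    return records, superfluous
```

With children `header, title1, price1, title2, price2, footer` and prices as leading nodes, `shift=1` groups each title with its price and leaves the header and footer as superfluous, whereas `shift=0` pairs each price with the node that follows it, which is the wrong alignment.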

Experiments
[Chart: precision, recall, and F-measure (axis range 95.0% to 100.0%) for Data Area, Record, Attribute, Price, and Location]
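
For reference, the three metrics in the chart can be computed as in the sketch below. It assumes the standard set-based definitions and the balanced F-measure (harmonic mean of precision and recall); the slides do not state which F-measure variant was used, and the function names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(extracted, expected):
    """Compare extracted items against a gold standard of expected items."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall, f_measure(precision, recall)
```

Under this definition, a precision of 99.4% and a recall of 99.0% correspond to an F-measure of roughly 99.2%.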

Summary
• AMBER: Adaptable Model-Based Extraction of
Result Pages
– Domain knowledge → simple heuristics
– Using DLV → compact & easy implementation
– Understanding phase: only one domain clue →
quickly adaptable to new domains
– Very high precision (99.4%) and recall (99.0%)

Current Work
• Testing AMBER on another domain
• Integrate visual information in understanding
phase
• Use probabilistic logic programming to improve
the whole system
