Little Knowledge Rules The Web:
Domain-Centric Result Page Extraction

    Tim Furche, Georg Gottlob, Giovanni Grasso,
    Giorgio Orsi, Christian Schallhart, Cheng Wang

         Department of Computer Science
               University of Oxford
           Cheng.wang@trinity.ox.ac.uk
Result Page Understanding
Outline
Adaptable Model-Based Extraction of
Result Pages (AMBER)

• System Overview
• Experiments
• Current Work
AMBER: System Overview
Adaptable Model-Based Extraction of Result Pages

• Needs only one clue
• Very high precision & recall
• Implemented in rules
• Domain-parameterized tool, currently aimed at UK real estate
• Part of DIADEM: Domain-centric Intelligent
  Automated Data Extraction Methodology
Fact Generation & Annotation
•   Live browser (Mozilla XUL-Runner)
•   Extract DOM tree
•   CSS box information
•   Textual annotation with GATE (domain-dependent)
    – Gazetteers
    – Regular-expression-like rules
• All represented as facts in the Page Model
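A minimal sketch of the idea of turning a page into facts, assuming a static HTML string instead of a live browser. The fact names (`element`, `child`, `text`, `annotation`), the tiny gazetteer, and the price pattern are hypothetical stand-ins for the GATE resources, which the slides do not show. CSS box information is left out because it needs a rendering engine.

```python
import re
from html.parser import HTMLParser

# Hypothetical annotation resources (assumptions, not AMBER's real ones).
LOCATION_GAZETTEER = {"oxford", "london"}
PRICE_PATTERN = re.compile(r"£\s?\d[\d,]*")


class FactCollector(HTMLParser):
    """Walks well-formed HTML and records DOM and annotation facts as tuples."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self.stack = []  # ids of the currently open elements
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        node = f"n{self.next_id}"
        self.next_id += 1
        self.facts.append(("element", node, tag))
        if self.stack:
            self.facts.append(("child", self.stack[-1], node))
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag (no void elements).
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        node = self.stack[-1]
        self.facts.append(("text", node, text))
        # Regular-expression-like rule
        if PRICE_PATTERN.search(text):
            self.facts.append(("annotation", node, "price"))
        # Gazetteer lookup
        words = (w.strip(",.").lower() for w in text.split())
        if any(w in LOCATION_GAZETTEER for w in words):
            self.facts.append(("annotation", node, "location"))


def page_facts(html):
    """Return the list of facts that make up this toy page model."""
    collector = FactCollector()
    collector.feed(html)
    collector.close()
    return collector.facts


facts = page_facts(
    "<ul><li><span>£250,000</span><span>Flat in Oxford</span></li></ul>"
)
```

In a rule-based implementation, tuples like these would become the input facts over which the later mapping rules are evaluated.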
Phenomenological Mapping
Fact → Attribute

• Attribute Model:
  – Types & constraints
• DOM node and attribute
• Attribute Creation Constraints:
  – Required Annotations
  – Disallowed Annotations
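The creation constraints can be read as a simple check per DOM node: an attribute is created only if the node carries all required annotations and none of the disallowed ones. The sketch below illustrates that reading; the `AttributeType` class and the example annotation names (`price`, `phone`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    """Hypothetical attribute type with its creation constraints."""
    name: str
    required: frozenset
    disallowed: frozenset = frozenset()


def create_attributes(node_annotations, attribute_types):
    """Return (node, attribute name) pairs whose constraints are satisfied.

    node_annotations maps a node id to the set of annotation labels on it.
    """
    attributes = []
    for node, annotations in node_annotations.items():
        for attr in attribute_types:
            has_required = attr.required <= annotations
            has_disallowed = bool(attr.disallowed & annotations)
            if has_required and not has_disallowed:
                attributes.append((node, attr.name))
    return attributes


# Illustrative constraint: a price attribute needs a price annotation and
# must not sit on a node that is also annotated as a phone number.
PRICE = AttributeType(
    name="price",
    required=frozenset({"price"}),
    disallowed=frozenset({"phone"}),
)

created = create_attributes(
    {"n2": {"price"}, "n5": {"price", "phone"}, "n7": {"location"}},
    [PRICE],
)
```

Here only `n2` yields a price attribute: `n5` is rejected by the disallowed annotation and `n7` lacks the required one.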
Segmentation Mapping: Identification
Attribute → Data area

• From low-level phenomena up to the data area
• Little knowledge rules the web →
  only one domain concept needed
  (mandatory attribute)
  – Price
  – Location
  – Title
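One simple way to picture data area identification is to take the nodes that carry the mandatory attribute and find their lowest common ancestor in the DOM. The sketch below does exactly that and nothing more. It is an assumed simplification, not AMBER's actual rule set, and it handles a single data area only. The `parent` map and node names are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def data_area(mandatory_nodes, parent):
    """Lowest common ancestor of all nodes carrying the mandatory attribute.

    With a single mandatory node the node itself is returned.
    """
    common = None
    for node in mandatory_nodes:
        path = ancestors(node, parent)
        if common is None:
            common = path
        else:
            on_path = set(path)
            # Keep the bottom-up order of the first path.
            common = [n for n in common if n in on_path]
    return common[0] if common else None


# Toy DOM: two list items with a price each, plus an unrelated nav element.
parent = {
    "ul": "body", "nav": "body",
    "li1": "ul", "li2": "ul",
    "price1": "li1", "price2": "li2",
}
area = data_area(["price1", "price2"], parent)
```

For the toy DOM this selects `ul`, the smallest subtree that contains every price.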
Segmentation Mapping: Identification
• Identification of multiple data areas
Segmentation Mapping: Understanding
•   Data area → Record
•   Domain-independent
•   Identify leading nodes
•   Two problems
    – Superfluous nodes
    – Correct shift
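The two problems can be illustrated with a deliberately simplified heuristic. It assumes all records span the same number of sibling nodes and each contains exactly one mandatory node. It then tries every possible shift of the record boundaries and keeps the one under which the most records share the tag sequence of the first. Children outside the chosen records are the superfluous nodes. This is an assumed stand-in, not the rules AMBER uses.

```python
def segment_records(children, mandatory):
    """Split a data area's children into records.

    children:  tag names of the data area's children, in document order
    mandatory: sorted indices of children containing the mandatory attribute
    Returns a list of records, each a list of child indices.
    """
    if not mandatory:
        return []
    if len(mandatory) == 1:
        return [list(range(len(children)))]

    # Record size: smallest distance between two mandatory nodes.
    size = min(b - a for a, b in zip(mandatory, mandatory[1:]))

    best = None  # (score, record start indices)
    for shift in range(size):
        starts = [m - shift for m in mandatory]
        if starts[0] < 0 or starts[-1] + size > len(children):
            continue  # this shift would run outside the data area
        first = children[starts[0]:starts[0] + size]
        # Score: how many records repeat the first record's tag sequence.
        score = sum(children[s:s + size] == first for s in starts)
        if best is None or score > best[0]:
            best = (score, starts)

    if best is None:
        return []
    return [list(range(s, s + size)) for s in best[1]]


# "h1" and "div" are superfluous; each record is img, p (with price), hr.
children = ["h1", "img", "p", "hr", "img", "p", "hr", "img", "p", "hr", "div"]
records = segment_records(children, mandatory=[2, 5, 8])
```

With shift 0 the records would start at the price node and swallow the next record's image. Shift 1 yields three identical `img, p, hr` sequences and is therefore chosen.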
Experiments
[Chart: precision, recall, and F-measure (y-axis from 95.0% to 100.0%)
 for Data Area, Record, Attribute, Price, and Location]
Summary
• AMBER: Adaptable Model-Based Extraction of
  Result Pages
  – Domain knowledge → simple heuristic
  – Using DLV → compact & easy implementation
  – Understanding phase: only one domain clue →
    quickly adaptable to new domains
  – Very high precision (99.4%) and recall (99.0%)
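For reference, the F-measure that corresponds to these two figures can be worked out as below, assuming the usual balanced definition (harmonic mean of precision and recall). The slides do not state which variant was used, so the resulting value is a derived estimate rather than a reported result.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the summary.
overall = f_measure(0.994, 0.990)
```

Under that assumption the overall F-measure comes out at about 99.2%.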
Current Work
• Testing AMBER on another domain
• Integrate visual information into the understanding
  phase
• Use probabilistic logic programming to improve
  the whole system
Thanks!
