AMBER presentation

  1. Little Knowledge Rules The Web: Domain-Centric Result Page Extraction. Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang. Department of Computer Science, University of Oxford. Cheng.wang@trinity.ox.ac.uk
  2. Result Page Understanding
  3. Outline: Adaptable Model-Based Extraction of Result Pages (AMBER) • System Overview • Experiments • Current Work
  4. AMBER: System Overview. Adaptable Model-Based Extraction of Result Pages • Needs only one clue • Very high precision & recall • Implemented in rules • Domain-parameterized tool, currently aimed at UK real estate • Part of DIADEM | Domain-centric Intelligent Automated Data Extraction Methodology
  5. AMBER: System Overview
  6. Fact Generation & Annotation • Live browser (Mozilla XUL-Runner) • Extract DOM tree • CSS box information • Textual annotation with GATE (domain-dependent) – Gazetteers – Regular-expression-like rules • All represented as facts in the Page Model
  7. Phenomenological Mapping: Fact → Attribute • Attribute Model: – Types & constraints • DOM node and attribute • Attribute Creation Constraints: – Required Annotations – Disallowed Annotations
  8. Segmentation Mapping: Identification. Attribute → Data area • From bottom phenomena to data area • Little knowledge rules the web → Only one domain concept (mandatory attribute) – Price – Location – Title
  9. Segmentation Mapping: Identification • Multi data area identification
  10. Segmentation Mapping: Understanding • Data area → Record • Domain independent • Identify leading nodes • Two problems – Superfluous nodes – Correct shift
  11. Segmentation Mapping: Understanding
  12. Segmentation Mapping: Understanding
  13. Experiments [Chart: Precision, Recall, and F-measure (axis from 95.0% to 100.0%) for Data Area, Record, Attribute, Price, and Location]
  14. Summary • AMBER: Adaptable Model-Based Extraction of Result Pages – Domain knowledge → simple heuristic – Using DLV → compact & easy implementation – Understanding phase: only one domain clue → quickly adaptable to new domains – Very high precision (99.4%) and recall (99.0%)
  15. Current Work • Testing AMBER on another domain • Integrate visual information in understanding phase • Use probabilistic logic programming to improve the whole system
  16. Thanks!
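
The attribute creation constraints on the Phenomenological Mapping slide (required and disallowed annotations) can be illustrated with a minimal Python sketch. The slides state that AMBER is implemented in rules using DLV, so this is not the authors' code: the `AttributeType` class, the `create_attributes` function, and the annotation labels (`money`, `phone`, `postcode`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeType:
    """Hypothetical attribute type with its creation constraints."""
    name: str
    required: frozenset                 # annotations that must all be present
    disallowed: frozenset = frozenset() # annotations that must all be absent


def create_attributes(annotations, attribute_types):
    """Map annotated DOM nodes to attributes.

    annotations: dict from node id to the set of annotation labels on it.
    Returns a list of (node id, attribute name) pairs.
    """
    result = []
    for node_id, labels in annotations.items():
        for attr_type in attribute_types:
            has_required = attr_type.required <= labels
            has_disallowed = bool(attr_type.disallowed & labels)
            if has_required and not has_disallowed:
                result.append((node_id, attr_type.name))
    return result


# Illustrative data only; not taken from the AMBER experiments.
TYPES = [
    AttributeType("price", frozenset({"money"}), frozenset({"phone"})),
    AttributeType("location", frozenset({"postcode"})),
]
ANNOTATIONS = {
    "n1": {"money"},
    "n2": {"money", "phone"},  # rejected as a price: carries a disallowed annotation
    "n3": {"postcode"},
}
attributes = create_attributes(ANNOTATIONS, TYPES)
```

In this toy example node `n2` is annotated as money but also as a phone number, so the disallowed-annotation constraint keeps it from becoming a price attribute.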
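
The identification step goes bottom-up from nodes carrying the single mandatory attribute (for example, price) to a data area. The slides do not spell out the rule, so the following Python sketch substitutes a deliberately simple stand-in: it takes the lowest common ancestor of all mandatory-attribute nodes as the data area. The `ancestors` and `data_area` functions and the toy DOM are assumptions, and the sketch does not handle the multi data area case mentioned on the slides.

```python
def ancestors(node, parent):
    """Return the path from a node up to the root, node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def data_area(mandatory_nodes, parent):
    """Lowest common ancestor of all nodes carrying the mandatory attribute.

    parent: dict from child node id to parent node id (the root has no entry).
    Returns None when there are no mandatory nodes or no common ancestor.
    """
    if not mandatory_nodes:
        return None
    first, *rest = mandatory_nodes
    common = ancestors(first, parent)  # ordered from deepest to root
    for node in rest:
        seen = set(ancestors(node, parent))
        common = [a for a in common if a in seen]
    return common[0] if common else None


# Toy DOM: two records under "list", each containing one price node.
PARENT = {
    "body": "html", "list": "body", "ad": "body",
    "r1": "list", "r2": "list", "p1": "r1", "p2": "r2",
}
area = data_area(["p1", "p2"], PARENT)
```

Here the two price nodes `p1` and `p2` meet at `list`, which would then be passed to the understanding phase for record segmentation.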
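
The experiments report precision, recall, and F-measure. Assuming the F-measure is the usual balanced F1 score (the slides do not say which variant is used), it can be computed from the overall figures in the summary as follows. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall figures quoted in the summary slide.
overall_f1 = f_measure(0.994, 0.990)
```

With precision 99.4% and recall 99.0%, this gives an F1 of roughly 99.2%.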
