AMBER WWW 2012 (Demonstration)

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER identifies records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers, which is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: from a small seed sample, we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
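The bootstrapping idea in the abstract can be illustrated with a toy loop. This is only a sketch under strong assumptions, not AMBER's actual learning procedure: attributes are plain comma-separated strings, a phrase is learned when it shares an attribute with an already known term, and the names `split_terms` and `bootstrap` as well as the address "Foxcombe Road, Boars Hill" are made up for illustration.

```python
from typing import Iterable, List, Set


def split_terms(attribute: str) -> List[str]:
    """Split an attribute value such as 'Boars Hill, Oxford' into phrases."""
    return [part.strip() for part in attribute.split(",") if part.strip()]


def bootstrap(seed: Set[str], attributes: Iterable[str], rounds: int = 4) -> List[Set[str]]:
    """Grow a gazetteer from `seed`; return a snapshot after each round.

    Assumed learning rule (a stand-in, not AMBER's): a phrase is learned
    if it occurs in an attribute together with an already known term.
    """
    attributes = list(attributes)
    gazetteer = set(seed)
    history = []
    for _ in range(rounds):
        learned = set()
        for attribute in attributes:
            terms = split_terms(attribute)
            if any(term in gazetteer for term in terms):
                learned.update(term for term in terms if term not in gazetteer)
        gazetteer |= learned  # new terms become usable in the next round
        history.append(set(gazetteer))
    return history


snapshots = bootstrap(
    {"Oxford"},
    ["Jarn Way, Boars Hill, Oxford, Oxfordshire", "Foxcombe Road, Boars Hill"],
    rounds=2,
)
print(sorted(snapshots[1] - snapshots[0]))  # ['Foxcombe Road']
```

The second attribute contains no seed term, so its street name only becomes learnable after "Boars Hill" has been picked up in the first round. This is why several iterations can help.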

AMBER WWW 2012 (Demonstration)

  1. 1. DIADEM: domain-centric intelligent automated data extraction methodology. Automatically Learning Gazetteers from the Deep Web. Christian Schallhart, April 19th, 2012 @ WWW in Lyon. Joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang. Friday, May 11, 2012
  2. 2. AMBER: Extraction from Result Pages
  3. 3. AMBER: Extraction from Result Pages
       <offer> <price> <currency>GBP</currency> <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location> </offer>
       <offer> <price> <currency>GBP</currency> <amount>3950000</amount> </price> <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location> </offer>
       <offer> <price> <currency>GBP</currency> <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer>
  5. 5. AMBER: Extraction from Result Pages. [Chart: precision and recall for data areas, records, and attributes; axis from 97.5% to 100.0%] F1 score: >98.5%. XML output as on slide 3.
  6. 6. AMBER: Extraction from Result Pages. F1 score: >98.5%. Domain Knowledge (no per-site training). XML output as on slide 3.
  7. 7. AMBER: Extraction from Result Pages. F1 score: >98.5%. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types. XML output as on slide 3.
  8. 8. AMBER: Extraction from Result Pages. F1 score: >98.5%. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types. Gazetteers: term lists. XML output as on slide 3.
  9. 9. AMBER: Extraction from Result Pages. F1 score: >98.5%. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types; quite easy. Gazetteers: term lists. XML output as on slide 3.
  10. 10. AMBER: Extraction from Result Pages. F1 score: >98.5%. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types; quite easy. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  11. 11. AMBER: From Extraction to Learning. Leverage the repeated structure in result pages to learn new terms. Gazetteers: term lists; a lot of work!
  12. 12. AMBER: Automatically Learning Gazetteers. F1 score: >98.5%. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  13. 13. AMBER: Automatically Learning Gazetteers. Page Segmentation: clusters attribute instances; analyses repeated structures. F1 score: >98.5%. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  14. 14. AMBER: Automatically Learning Gazetteers. AMBER annotates first to integrate semantic information into its repeated structure analysis. Page Segmentation: clusters attribute instances; analyses repeated structures. F1 score: >98.5%. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  15. 15. AMBER: Automatically Learning Gazetteers. AMBER annotates first to integrate semantic information into its repeated structure analysis. Page Segmentation: clusters attribute instances; analyses repeated structures. Attribute Alignment: matches knowledge with observations. F1 score: >98.5%. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  16. 16. AMBER: Automatically Learning Gazetteers. AMBER annotates first to integrate semantic information into its repeated structure analysis. Page Segmentation: clusters attribute instances; analyses repeated structures. Attribute Alignment: matches knowledge with observations. Gazetteer Learning: turns phrases into terms. F1 score: >98.5%. Gazetteers: term lists; a lot of work! XML output as on slide 3.
  18. 18. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner; GATE annotations.
  19. 19. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner; GATE annotations. Data Area Identification: pivot node clustering. [Figure: DOM tree with node labels R, D, L, P, A, X]
  20. 20. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner; GATE annotations. Data Area Identification: pivot node clustering. [Figure: DOM tree with node labels R, D, L, P, A, X]
       A data area is a maximal DOM subtree d, which
       • contains ≥2 pivot nodes, which are
       • depth consistent (depth(n) = k±ε),
       • distance consistent (pathlen(n, n′) = k±δ),
       • continuous, such that
       • their least common ancestor is d's root.
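The data area conditions on slide 20 can be sketched roughly as follows. This is a minimal illustration, not AMBER's implementation: the `Node` class and all function names are hypothetical, "k±ε" is read as "all values lie within a window of width 2ε", distance consistency is checked between consecutive pivots only, and the continuity and maximality conditions are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a DOM node (hypothetical, not AMBER's data model)."""
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def ancestors(n: Optional[Node]) -> List[Node]:
    """Return [n, parent(n), ..., root]."""
    chain = []
    while n is not None:
        chain.append(n)
        n = n.parent
    return chain


def depth(n: Node) -> int:
    return len(ancestors(n)) - 1


def lca(a: Node, b: Node) -> Node:
    """Least common ancestor of two nodes of the same tree."""
    seen = {id(x) for x in ancestors(a)}
    for x in ancestors(b):
        if id(x) in seen:
            return x
    raise ValueError("nodes are in different trees")


def pathlen(a: Node, b: Node) -> int:
    """Number of edges on the tree path between a and b."""
    top = depth(lca(a, b))
    return (depth(a) - top) + (depth(b) - top)


def data_area_root(pivots: List[Node], eps: int = 0, delta: int = 0) -> Optional[Node]:
    """Root of the candidate data area spanned by `pivots` (document order), or None."""
    if len(pivots) < 2:
        return None  # a data area needs at least two pivot nodes
    depths = [depth(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:
        return None  # not depth consistent
    dists = [pathlen(a, b) for a, b in zip(pivots, pivots[1:])]
    if max(dists) - min(dists) > 2 * delta:
        return None  # not distance consistent
    root = pivots[0]
    for p in pivots[1:]:
        root = lca(root, p)
    return root


body = Node("body")
ul = body.add(Node("ul"))
pivots = [ul.add(Node("li")).add(Node("span")) for _ in range(3)]
print(data_area_root(pivots).tag)  # ul
```

Here three pivots sit at the same depth and at equal distance from each other, so the list element that is their least common ancestor is returned as the data area root.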
  21. 21. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner; GATE annotations. Data Area Identification: pivot node clustering. Record Segmentation: head/tail cut-off; segmentation boundary shifting. [Figure: DOM tree with node labels R, D, L, P, A, X]
  23. 23. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner; GATE annotations. Data Area Identification: pivot node clustering. Record Segmentation: head/tail cut-off; segmentation boundary shifting. [Figure: DOM tree with node labels R, D, L, P, A, X]
       A result record is a sequence of children of the data area root.
       A result record segmentation divides a data area
       • into non-overlapping records,
       • containing the same number of siblings,
       • each based on a single selected pivot node.
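The record segmentation described on slide 23 might be sketched as follows. The sketch makes several assumptions that are not in the slides: children of the data area root are represented by their tag names only, pivots occur at evenly spaced child positions, and the best boundary shift is the one that yields the fewest distinct tag sequences. The function names are hypothetical.

```python
from typing import List, Sequence


def record_size(pivot_idx: Sequence[int]) -> int:
    """Number of siblings per record, taken from the spacing of the pivots."""
    gaps = {b - a for a, b in zip(pivot_idx, pivot_idx[1:])}
    if len(gaps) != 1:
        raise ValueError("expected at least two evenly spaced pivots")
    return gaps.pop()


def segment(children: Sequence[str], pivot_idx: Sequence[int], shift: int) -> List[List[str]]:
    """One record per pivot, each starting `shift` siblings before its pivot.

    Children before the first and after the last record are cut off
    (head/tail cut-off). Returns [] if a record would leave the data area.
    """
    size = record_size(pivot_idx)
    records = []
    for p in pivot_idx:
        start = p - shift
        if start < 0 or start + size > len(children):
            return []
        records.append(list(children[start:start + size]))
    return records


def best_segmentation(children: Sequence[str], pivot_idx: Sequence[int]) -> List[List[str]]:
    """Try every boundary shift and keep the most regular segmentation.

    Regularity is measured as the number of distinct tag sequences among
    the records (an assumed criterion chosen for this sketch).
    """
    best_score, best_records = None, []
    for shift in range(record_size(pivot_idx)):
        records = segment(children, pivot_idx, shift)
        if not records:
            continue
        score = len({tuple(r) for r in records})
        if best_score is None or score < best_score:
            best_score, best_records = score, records
    return best_records


children = ["h1", "img", "div", "br", "img", "div", "br", "img", "div", "br", "p"]
print(best_segmentation(children, [2, 5, 8]))
# [['img', 'div', 'br'], ['img', 'div', 'br'], ['img', 'div', 'br']]
```

In the example, the pivots are the `div` children. Shifting each record boundary one sibling to the left gives three identical records and drops the leading `h1` and the trailing `p`.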
  24. 24. AMBER: Attribute Alignment. [Figure: DOM tree with node labels L, P, A, X]
  25. 25. AMBER: Attribute Alignment. [Figure: DOM tree with node labels L, P, A, X] The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n. The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
  26. 26. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. [Figure: DOM tree with node labels L, P, A, X; stage labelled Cleanup] The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n. The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
  27. 27. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. Attribute Disambiguation: discard ambiguous attributes with lower support. [Figure: DOM tree with node labels L, P, A, X; stages labelled Cleanup, Disambiguation] The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n. The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
  28. 28. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. Attribute Disambiguation: discard ambiguous attributes with lower support. Attribute Generalisation: add new un-annotated attributes with sufficient support. [Figure: DOM tree with node labels L, P, A, X; stages labelled Cleanup, Disambiguation, Generation] The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n. The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
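The support measure and the three alignment steps on slides 25 to 28 could look roughly like this. It is a sketch under assumptions, not AMBER's code: a record is modelled as a mapping from tag path to the set of annotation types found there, a single hypothetical threshold `min_support` serves as both the "low" and the "sufficient" support bound, and ties in disambiguation go to the pair seen first.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Record = Dict[str, Set[str]]  # tag path -> annotation types found at that path


def support(records: List[Record]) -> Dict[Tuple[str, str], float]:
    """Fraction of records having an annotation for type t at path p."""
    counts = Counter(
        (t, p) for record in records for p, types in record.items() for t in types
    )
    return {pair: n / len(records) for pair, n in counts.items()}


def align(records: List[Record], min_support: float = 0.5) -> List[Dict[str, str]]:
    """Return, per record, one attribute type for each supported tag path."""
    sup = support(records)
    # Cleanup: discard (type, path) pairs with low support.
    kept = {pair: s for pair, s in sup.items() if s >= min_support}
    # Disambiguation: per path, keep only the type with the highest support.
    best: Dict[str, Tuple[str, float]] = {}
    for (t, p), s in kept.items():
        if p not in best or s > best[p][1]:
            best[p] = (t, s)
    # Generalisation: a node at a supported path gets the type even if
    # it carried no annotation in this particular record.
    return [{p: best[p][0] for p in record if p in best} for record in records]


records = [
    {"div/b": {"PRICE"}, "div/span": {"LOCATION"}},
    {"div/b": {"PRICE"}, "div/span": {"LOCATION", "PRICE"}},
    {"div/b": {"PRICE"}, "div/span": {"LOCATION", "PRICE"}},
    {"div/b": {"PRICE"}, "div/span": set(), "div/i": {"LOCATION"}},
]
print(align(records)[3])  # {'div/b': 'PRICE', 'div/span': 'LOCATION'}
```

In the last record, the stray LOCATION annotation at `div/i` is removed by cleanup (support 0.25), while the un-annotated node at `div/span` is generalised to LOCATION (support 0.75, which beats PRICE at 0.5 on the same path).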
  29. 29. AMBER: Gazetteer Learning. Example attribute: "Oxford, Walton Street, top-floor apartment"
  30. 30. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms. Example: "Oxford, Walton Street, top-floor apartment" is split into "Oxford", "Walton Street", "top-floor apartment".
  31. 31. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes. Example: "Oxford, Walton Street, top-floor apartment" is split into "Oxford", "Walton Street", "top-floor apartment".
  32. 32. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes. Term Validation: track term relevance; discard irrelevant ones. Example: "Oxford, Walton Street, top-floor apartment" is split into "Oxford", "Walton Street", "top-floor apartment".
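Term formulation and validation, as outlined on slides 30 to 32, might be sketched like this. Several points are assumptions made for illustration: terms are split at commas and semicolons, "non-overlapping attributes" is read as attributes that contain no already known gazetteer term, the black-list entry is invented, and relevance is measured as the share of a term's occurrences that fall inside an attribute of the learned type. The slides do not specify any of these details.

```python
import re
from collections import Counter
from typing import List, Set


def formulate_terms(attribute: str, gazetteer: Set[str], blacklist: Set[str]) -> List[str]:
    """Split an attribute into candidate terms and filter them."""
    parts = [p.strip() for p in re.split(r"[,;]", attribute) if p.strip()]
    if not any(p in gazetteer for p in parts):
        return []  # assumed reading: attribute overlaps no known term
    return [p for p in parts if p not in gazetteer and p.lower() not in blacklist]


class TermValidator:
    """Tracks how often a learned term shows up inside vs. outside attributes."""

    def __init__(self, min_relevance: float = 0.5) -> None:
        self.hits: Counter = Counter()
        self.misses: Counter = Counter()
        self.min_relevance = min_relevance  # hypothetical threshold

    def observe(self, term: str, inside_attribute: bool) -> None:
        (self.hits if inside_attribute else self.misses)[term] += 1

    def relevant(self, term: str) -> bool:
        total = self.hits[term] + self.misses[term]
        return total > 0 and self.hits[term] / total >= self.min_relevance


terms = formulate_terms(
    "Oxford, Walton Street, top-floor apartment",
    gazetteer={"Oxford"},
    blacklist={"top-floor apartment"},  # invented black-list entry
)
print(terms)  # ['Walton Street']
```

A validator would then be fed with later sightings of "Walton Street", and the term would be dropped again if `relevant("Walton Street")` turns false.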
  33. 33. AMBER: Evaluation
  34. 34. AMBER: Evaluation. Learning Location from 250 pages from 150 sites (UK real estate market). Starting with a 25% sample of our full gazetteer (containing 33,243 terms).
  35. 35. AMBER: Evaluation. Learning Location from 250 pages from 150 sites (UK real estate market). Starting with a 25% sample of our full gazetteer (containing 33,243 terms): initially failed to annotate 328 locations; after 3 learning rounds, learned 265 of those (recall: 80.6%, precision: 95.1%).
  36. 36. AMBER: Evaluation. [Chart: labels and values not recoverable from the extraction]
  37. 37. AMBER: Evaluation. [Chart: labels and values not recoverable from the extraction]
  38. 38. DEMO
