AMBER WWW 2012 Poster
 

Presentation Transcript

Automatically Learning Gazetteers from the Deep Web

DIADEM: domain-centric intelligent automated data extraction methodology (diadem.cs.ox.ac.uk)

Authors: Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang
Digital Home: diadem.cs.ox.ac.uk/amber, diadem-amber@cs.ox.ac.uk

AMBER GUI [figure]

AMBER Learning Cycle

Labels in the cycle figure: URLs for analysis; domain schema concepts; seed gazetteer; webpage with identified records and attributes; learned terms with confidence values.

  • Page Retrieval: Mozilla, GATE annotations
  • Data Area Identification: pivot node (mandatory fields) clustering
  • Record Segmentation: head/tail cut off, segment boundary shifting
  • Attribute Alignment
    • Attribute Cleanup: discard attributes of low support
    • Attribute Disambiguation: discard redundant attributes of lower support
    • Attribute Generalization: add new attributes of sufficient support

A data area is a maximal DOM subtree d, which
  • contains ≥2 pivot nodes, which are
    • depth consistent (depth(n) = k±ε),
    • distance consistent (pathlen(n,n′) = k±δ),
    • continuous,
  • such that their least common ancestor is d's root.

A result record is a sequence of children of the data area root.

A result record segmentation divides a data area
  • into non-overlapping records,
  • containing the same number of siblings,
  • each based on a single selected pivot node.

The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.

The support of a type/tag path pair (t,p) is the fraction of records having an annotation for t at path p.

Annotations in the attribute alignment figure:
  • P is only allowed to appear once, thus the second P with less support is dropped.
  • X only occurs once and has too low support to be kept.
  • A has a support of 3/4 at this node and hence we add the annotation.
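The support definition above can be illustrated with a minimal sketch. The record representation (a dictionary from tag path to annotation types), the tag path strings and the `MIN_SUPPORT` threshold are assumptions made for this example only; the poster does not state how AMBER stores records or which thresholds it uses.

```python
from fractions import Fraction

def support(records, annotation_type, tag_path):
    """Fraction of records that have an annotation of annotation_type at tag_path."""
    if not records:
        return Fraction(0)
    hits = sum(1 for record in records
               if annotation_type in record.get(tag_path, set()))
    return Fraction(hits, len(records))

# Four hypothetical records; each maps a tag path (relative to the record
# root) to the set of annotation types found at that path.
records = [
    {"li/span": {"P"}, "li/span/b": {"A"}},
    {"li/span": {"P"}, "li/span/b": {"A"}, "li/em": {"X"}},
    {"li/span": {"P"}, "li/span/b": {"A"}},
    {"li/span": {"P"}},  # no A annotation in this record
]

MIN_SUPPORT = Fraction(1, 2)  # hypothetical threshold, not taken from the poster

for t, p in [("P", "li/span"), ("A", "li/span/b"), ("X", "li/em")]:
    s = support(records, t, p)
    verdict = "keep" if s >= MIN_SUPPORT else "discard"
    print(f"({t}, {p}): support {s} -> {verdict}")
```

With these made-up records, A reaches the support of 3/4 mentioned in the figure annotation, while X, which occurs only once, falls below the assumed threshold.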
AMBER Applications
  • Data extraction
  • Example generation for wrapper induction
  • Result page analysis

Part of DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology): analyzing the pages reached via OPAL to generate OXPath expressions for efficient extraction ... but usable independently of DIADEM as well.

AMBER Architecture
  • Web Access: Browser Common API (Mozilla, WebKit)
  • Annotation: GATE, Domain Gazetteers
  • Reasoning in Datalog (DLV) rules: Data Area Identification, Record Segmentation, Attribute Alignment

[Charts, AMBER Evaluation and AMBER Learning Evaluation, Real Estate (250 pages, manual; 2215 pages, automatic): precision and recall for data areas, records and attributes, and per attribute (price, detailed page, location, legal status, postcode, bedroom, property type, reception, bathroom).]

Gazetteer Learning

We inferred that this node is of type A, hence we learn its terms.

  • Term Formulation: split new attributes into terms, e.g. "Oxford, Walton Street, top-floor apartment" into "Oxford", "Walton Street" and "top-floor apartment".
  • Term Validation: track term relevance, discard irrelevant terms.

Remove terms which occur
  • in black lists,
  • in other gazetteers.

Compute confidence based on
  • support of its type/tag path pair,
  • relative size of the term within the entire attribute.
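The Gazetteer Learning steps above can be sketched roughly as follows. The poster names the two inputs of the confidence value but not how they are combined, so the product used in `confidence` is purely an assumption, as are splitting on commas, the case-insensitive comparison, and all function names and sample gazetteers.

```python
def formulate_terms(attribute_text):
    """Split a newly aligned attribute into candidate terms (assumed: split on commas)."""
    return [part.strip() for part in attribute_text.split(",") if part.strip()]

def validate_terms(terms, black_list, other_gazetteers):
    """Drop terms found in a black list or in any other gazetteer (case-insensitive)."""
    blocked = {t.lower() for t in black_list}
    for gazetteer in other_gazetteers:
        blocked.update(t.lower() for t in gazetteer)
    return [t for t in terms if t.lower() not in blocked]

def confidence(term, attribute_text, pair_support):
    """Assumed combination: support scaled by the term's share of the attribute text."""
    if not attribute_text:
        return 0.0
    relative_size = len(term) / len(attribute_text)
    return pair_support * relative_size

attribute = "Oxford, Walton Street, top-floor apartment"
property_types = {"top-floor apartment"}  # hypothetical other gazetteer
terms = validate_terms(
    formulate_terms(attribute),
    black_list={"for sale"},              # hypothetical black list
    other_gazetteers=[property_types],
)
for term in terms:
    print(term, round(confidence(term, attribute, pair_support=0.75), 3))
```

In this sketch "top-floor apartment" is removed because it already belongs to another (made-up) gazetteer, leaving "Oxford" and "Walton Street" as candidate location terms.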
Properties of the Datalog reasoning rules:
  • stratified negation
  • finite domains
  • non-recursive aggregation
  • easy integration with domain knowledge

Number of Rules
  • Data Area Identification: 11
  • Record Segmentation: 32
  • Attribute Alignment: 34

AMBER Evaluation

                 pages    records   attributes
real estate      281      2,785     14,614
(large scale)    2,215    20,723    114,714
used car         151      1,608     12,732

  • Extracts attributes with >99% precision and >98% recall

Figure 4: AMBER Evaluation on Real-Estate Domain

[Charts, Used Cars: precision and recall for data areas, records and attributes, and per attribute.]

... fillings to obtain one or, if possible, two result pages with at least two result records and compare AMBER's results with a manually annotated gold standard. Using a full gazetteer, AMBER extracts data area, records, price, detailed page link, location, legal status, postcode and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number and bathroom number, precision remains at 98%, but recall drops to 94%. The result of our evaluation proves that AMBER is able to generate human-quality examples for any web site in a ...

AMBER Learning Evaluation
  • Learning locations from 250 pages from 150 sites (UK real estate)
  • Starting with 25% of the full gazetteer (containing 33,243 terms)
  • Fails to annotate 328 of 1,484 locations
  • Saturated after 3 rounds

... miles around Oxford. AMBER uses the content of the seed gazetteer to identify the position of the known terms such as locations. In particular, AMBER identifies three potential new locations, namely "Oxford", "Witney" and "Wallingford", with confidence of 0.70, 0.62, and 0.61 respectively. Since the acceptance threshold for new ...

Table 1: Total learned instances

        unannotated instances (328)        total instances (1484)
rnd.    aligned   corr.   prec.    rec.    prec.    rec.
1       226       196     86.7%    59.2%   84.7%    81.6%
2       261       248     95.0%    74.9%   93.2%    91.0%
3       271       265     97.8%    80.6%   95.1%    93.8%
4       271       265     97.8%    80.6%   95.1%    93.8%

Table 2: Incrementally recognized instances and learned terms

rnd.    unannot.   recog.   corr.   prec.     rec.     terms
1       331        225      196     86.7%     59.2%    262
2       118        34       32      94.1%     27.1%    29
3       79         16       16      100.0%    20.3%    4
4       63         0        0       100.0%    0%       0
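As a quick check of how the percentage columns in Table 2 relate to the counts, the sketch below assumes precision = correct / recognized and recall = correct / unannotated, with precision taken as 100% when nothing is recognized (as in round 4). The function name is hypothetical. Round 1 is left out because the transcribed count of 225 recognized instances gives 196/225 ≈ 87.1% rather than the listed 86.7%.

```python
def precision_recall(unannotated, recognized, correct):
    """Per-round precision and recall from the Table 2 counts (assumed definitions)."""
    precision = correct / recognized if recognized else 1.0
    recall = correct / unannotated if unannotated else 0.0
    return precision, recall

# (unannotated, recognized, correct) for rounds 2 to 4 of Table 2
rounds = {2: (118, 34, 32), 3: (79, 16, 16), 4: (63, 0, 0)}

for rnd, counts in rounds.items():
    p, r = precision_recall(*counts)
    print(f"round {rnd}: precision {p:.1%}, recall {r:.1%}")
```

Under these assumed definitions the listed values for rounds 2 to 4 are reproduced from the counts.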