  1. An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
     Matthew Michelson & Craig A. Knoblock
     University of Southern California / Information Sciences Institute
  2. Unstructured, Ungrammatical Text
  3. Unstructured, Ungrammatical Text (annotated example attributes: Car Model, Car Year)
  4. Semantic Annotation
     Post: "02 M3 Convertible .. Absolute beauty!!!"
     Annotation:
       <Make>BMW</Make>
       <Model>M3</Model>
       <Trim>2 Dr STD Convertible</Trim>
       <Year>2002</Year>
     "Understand" & query the posts (we can query on BMW even though it is not in the post; it is implied!)
     Note: this is not extraction (we are not pulling the values out of the post).
  5. Reference Sets
     - Annotation/extraction is hard
       - Can't rely on structure (wrappers)
       - Can't rely on grammar (NLP)
     - Reference sets are the key (IJCAI 2005)
       - Match posts to reference set tuples
         - Clue to attributes in posts
         - Provides normalized attribute values when matched
  6. Reference Sets
     - Collections of entities and their attributes
       - Relational data!
     - Example: scrape make, model, trim, and year for all cars from 1990-2005
  7. Contributions
     Previously:
       - User supplies the reference set
       - User trains record linkage between reference set & posts
     Now:
       - System selects reference sets from a repository
       - Unsupervised matching between reference set & posts
  8. New Unsupervised Approach: Two Steps
     1) Unsupervised reference set chooser
        - Draws from a reference set repository that grows over time, increasing coverage
     2) Unsupervised record linkage between posts and the chosen reference set
     Together, these yield unsupervised semantic annotation.
  9. Choosing a Reference Set
     Vector space model: the set of posts is one document; each reference set is one document.
     Select the reference set most similar to the set of posts.
     Example posts:
       "FORD Thunderbird - $4700"
       "2001 White Toyota Corrolla CE Excellent Condition - $8200"
     Similarity scores: Cars 0.7, Hotels 0.4, Restaurants 0.3 (average 0.47)
     PD(Cars, Hotels) = 0.75 > T, so split here; PD(Hotels, Restaurants) = 0.33 < T
     Result: choose Cars (the only set above the split, and above the average)
  10. Choosing Reference Sets
      - Similarity: Jensen-Shannon distance & TF-IDF used in the paper's experiments
      - Percent difference as the splitting criterion
        - A relative measure
        - "Reasonable" threshold: we use 0.6 throughout
        - Selected sets must also score above the average
          - Small scores with small changes can produce a large percent difference, but such sets are not better, only relatively so
      - If two or more reference sets are selected, annotation runs iteratively
        - If two reference sets have the same schema, use the one with the higher rank
          - Eliminates redundant matching
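The chooser on the two slides above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the token-level bag-of-words treatment, and the conversion of Jensen-Shannon divergence into a similarity score are all assumptions.

```python
from collections import Counter
from math import log

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two token-frequency bags."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    jsd = 0.0
    for tok in vocab:
        p = p_counts[tok] / p_total
        q = q_counts[tok] / q_total
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * log(p / m, 2)
        if q > 0:
            jsd += 0.5 * q * log(q / m, 2)
    return jsd

def choose_reference_sets(posts, reference_sets, threshold=0.6):
    """Rank reference sets by similarity to the whole set of posts, then keep
    only those that score above the average AND sit above a percent-difference
    split larger than the threshold (the slide's PD > T criterion)."""
    post_doc = Counter(tok for p in posts for tok in p.lower().split())
    scored = []
    for name, records in reference_sets.items():
        ref_doc = Counter(tok for r in records for tok in r.lower().split())
        # turn divergence into a similarity: 1 means identical distributions
        scored.append((name, 1.0 - js_divergence(post_doc, ref_doc)))
    scored.sort(key=lambda x: x[1], reverse=True)
    avg = sum(s for _, s in scored) / len(scored)
    chosen = []
    # walk down the ranking; the last-ranked set is never chosen by this loop
    for (name, score), (_, next_score) in zip(scored, scored[1:]):
        if score <= avg:
            break
        pct_diff = ((score - next_score) / next_score
                    if next_score > 0 else float("inf"))
        chosen.append(name)
        if pct_diff > threshold:
            break  # large relative gap: everything below is dissimilar
    return chosen
```

With car posts against car, hotel, and restaurant reference sets, only the car set survives both the average cutoff and the percent-difference split.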
  11. Vector Space Matching for Semantic Annotation
      - Choosing reference sets: set of posts vs. whole reference set
      - Vector space matching: each post vs. each reference set record
      - Modified Dice similarity
        - Modification: if Jaro-Winkler > 0.95, count the token pair in (p ∩ r)
          - Captures spelling errors and abbreviations
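The modified Dice score above can be sketched like this; a minimal version, assuming a standard Jaro-Winkler definition (prefix scaling 0.1, prefix length up to 4) and greedy one-to-one token pairing, neither of which the slide specifies.

```python
def jaro(s, t):
    """Plain Jaro similarity between two strings."""
    if s == t:
        return 1.0
    match_dist = max(len(s), len(t)) // 2 - 1
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - match_dist), min(len(t), i + match_dist + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # half-transpositions: matched characters out of order
    transpositions, k = 0, 0
    for i in range(len(s)):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scaling * (1 - j)

def modified_dice(post_tokens, ref_tokens, jw_threshold=0.95):
    """Dice similarity where a post token counts as shared with a reference
    token if they are equal OR Jaro-Winkler > threshold, so spelling errors
    like 'Corrolla' vs. 'Corolla' still land in (p ∩ r)."""
    shared = 0
    remaining = list(ref_tokens)
    for p in post_tokens:
        for r in remaining:
            if p == r or jaro_winkler(p.lower(), r.lower()) > jw_threshold:
                shared += 1
                remaining.remove(r)  # each reference token pairs at most once
                break
    return 2 * shared / (len(post_tokens) + len(ref_tokens))
```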
  12. Why Dice?
      - TF-IDF with cosine similarity:
        - "City" is given more weight than "Ford" in the reference set
        - Post: "Near New Ford Expedition XLT 4WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850"
        - TF-IDF match (score 0.20): {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}
      - Jaccard similarity [(p ∩ r) / (p ∪ r)]:
        - Discounts shorter strings (and many posts are short!)
      - The example post above correctly matches {FORD, EXPEDITION, 4 Dr XLT 4WD SUV, 2005}
        - Dice: 0.32; Jaccard: 0.19
        - Dice boosts the numerator
        - If the intersection is small, Dice's denominator is almost the same as Jaccard's, so the numerator matters more
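The Dice-vs-Jaccard gap on this slide can be reproduced on token sets. The tokenization below is a simplifying assumption (lowercased, punctuation dropped), so the numbers land near, not exactly on, the slide's 0.32 and 0.19.

```python
def dice(p, r):
    """Dice: 2|p ∩ r| / (|p| + |r|). The doubled numerator rewards overlap."""
    return 2 * len(p & r) / (len(p) + len(r))

def jaccard(p, r):
    """Jaccard: |p ∩ r| / |p ∪ r|. Short posts pay a larger relative penalty."""
    return len(p & r) / len(p | r)

# The slide's post and its correct reference set record, roughly tokenized
post = set("near new ford expedition xlt 4wd with brand 22 wheels "
           "redwood city sale this weekend 26850".split())
record = set("ford expedition 4 dr xlt 4wd suv 2005".split())
```

Here the intersection is {ford, expedition, xlt, 4wd}, giving Dice 2*4/24 ≈ 0.33 versus Jaccard 4/20 = 0.20: the same ranking effect the slide reports.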
  13. Vector Space Matching for Semantic Annotation
      - The average score splits matches from non-matches, eliminating false positives
        - The threshold for matches comes from the data itself
        - Using the average assumes both good and bad matches exist (we see this in the data)
      Example posts and their candidate reference set records (avg. Dice = 0.33):
        "02 M3 Convertible .. Absolute beauty!!!" → {BMW, M3, 2 Dr STD Convertible, 2002}: 0.5
        "new 2007 altima" → {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}: 0.36
                            {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}: 0.36
        "Awesome car for sale! It's an accord, I think…" → {HONDA, ACCORD, 4 Dr LX, 2001}: 0.13 < 0.33, non-match
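The average-score split above is a one-liner in practice. A minimal sketch, assuming scored candidates arrive as (record, score) pairs:

```python
def matches_above_average(scored):
    """Keep only candidates whose score exceeds the mean score.
    Assumes the pool mixes good and bad matches, as the slide argues."""
    avg = sum(score for _, score in scored) / len(scored)
    return [(record, score) for record, score in scored if score > avg]

# The slide's example: average Dice is 0.3375, so the 0.13 Accord drops out
candidates = [
    ("{BMW, M3, 2 Dr STD Convertible, 2002}", 0.5),
    ("{NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}", 0.36),
    ("{NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}", 0.36),
    ("{HONDA, ACCORD, 4 Dr LX, 2001}", 0.13),
]
```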
  14. Vector Space Matching for Semantic Annotation
      - Attributes in agreement
        - A set of matches can be ambiguous: some attributes differ across records
        - Which record is better? All have the maximum score as matches!
          - We say none: throw away the differing attributes
          - Why not union them? In the real world, not all posts have all attributes
            - E.g., for "new 2007 altima":
              {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}: 0.36
              {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}: 0.36
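The attributes-in-agreement rule can be sketched directly: keep an attribute only if every top-scoring match assigns it the same value. A minimal version, assuming matches arrive as dicts with identical keys:

```python
def attributes_in_agreement(matched_records):
    """Given the equally top-scoring reference set matches for one post,
    keep only the attribute values that all of them agree on; the
    differing attributes (e.g. trim, below) are discarded as ambiguous."""
    if not matched_records:
        return {}
    agreed = {}
    for key in matched_records[0]:
        values = {record[key] for record in matched_records}
        if len(values) == 1:
            agreed[key] = values.pop()
    return agreed
```

For the "new 2007 altima" example, make, model, and year survive while the two conflicting trims are dropped.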
  15. Experimental Data Sets
      Reference Sets:
        Name    | Source                     | Attributes                    | Records
        Fodors  | Fodors Travel Guide        | name, address, city, cuisine  | 534
        Zagat   | Zagat Restaurant Guide     | name, address, city, cuisine  | 330
        Comics  | Comics Price Guide         | title, issue, publisher       | 918
        Hotels  | Bidding For Travel         | star rating, name, local area | 132
        Cars    | Edmunds & Super Lamb Auto  | make, model, trim, year       | 27,006
        KBBCars | Kelly Blue Book Car Prices | make, model, trim, year       | 2,777
      Posts:
        Name        | Source             | Reference Set Match      | Records
        BFT         | Bidding For Travel | Hotels                   | 1,125
        EBay        | EBay Comics        | Comics                   | 776
        Craigs List | Craigs List Cars   | Cars, KBBCars (in order) | 2,568
        Boats       | Craigs List Boats  | None                     | 1,099
  16. Results: Choosing Reference Sets (Jensen-Shannon, T = 0.6)
      BFT Posts:
        Ref. Set | Score | % Diff.
        Hotels   | 0.622 | 2.172
        Fodors   | 0.196 | 0.05
        Cars     | 0.187 | 0.248
        KBBCars  | 0.15  | 0.101
        Zagat    | 0.136 | 0.161
        Comics   | 0.117 |
        Average  | 0.234 |
      Craig's List:
        Ref. Set | Score | % Diff.
        Cars     | 0.52  | 0.161
        KBBCars  | 0.447 | 1.193
        Fodors   | 0.204 | 0.144
        Zagat    | 0.178 | 0.365
        Hotels   | 0.131 | 0.153
        Comics   | 0.113 |
        Average  | 0.266 |
      EBay Posts:
        Ref. Set | Score | % Diff.
        Comics   | 0.579 | 2.351
        Fodors   | 0.173 | 0.152
        Cars     | 0.15  | 0.252
        Zagat    | 0.12  | 0.186
        Hotels   | 0.101 | 0.170
        KBBCars  | 0.086 |
        Average  | 0.201 |
      Boat Posts:
        Ref. Set | Score | % Diff.
        Cars     | 0.251 | 0.513
        Fodors   | 0.166 | 0.144
        KBBCars  | 0.145 | 0.089
        Comics   | 0.133 | 0.025
        Zagat    | 0.13  | 0.544
        Hotels   | 0.084 |
        Average  | 0.152 |
  17. Results: Semantic Annotation
      (Phoebus is a supervised machine learning system: it has a notion of matches/non-matches in its training data)
      BFT Posts:
        Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
        Hotel Name  | 88.23  | 89.36 | 88.79     | 92.68
        Star Rating | 92.02  | 89.25 | 90.61     | 92.68
        Local Area  | 93.77  | 90.52 | 92.17     | 92.68
      EBay Posts:
        Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
        Title       | 86.08  | 91.60 | 88.76     | 88.64
        Issue       | 70.16  | 89.40 | 78.62     | 88.64
        Publisher   | 86.08  | 91.60 | 88.76     | 88.64
      Craig's List Posts (attributes-in-agreement issues):
        Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
        Make        | 93.96  | 86.35 | 89.99     | N/A
        Model       | 82.62  | 81.35 | 81.98     | N/A
        Trim        | 71.62  | 51.95 | 60.22     | N/A
        Year        | 78.86  | 91.01 | 84.50     | N/A
  18. Related Work
      - Semantic annotation
        - Rule- and pattern-based methods assume structure repeats, which is what makes rules & patterns useful; our unstructured data disallows such assumptions
        - SemTag (Dill et al. 2003): look up tokens in a taxonomy and disambiguate
          - They disambiguate one token at a time; we disambiguate using all posts during reference set selection, so we avoid ambiguities such as "is jaguar a car or an animal?" (the reference set would tell us!)
          - We don't require a carefully formed taxonomy, so we can easily exploit widely available reference sets
      - Information extraction using reference sets
        - CRAM: unsupervised extraction, but it is given the reference set and labels all tokens (no junk allowed!)
        - Cohen & Sarawagi 2004: supervised extraction; ours is unsupervised
      - Resource selection in distributed IR (the "hidden web") [survey: Craswell et al. 2000]
        - Probe queries are needed to estimate coverage when systems lack full access to the data; since we have full access to our reference sets, we don't need probe queries
  19. Conclusions
      - Unsupervised semantic annotation
        - The system can accurately query noisy, unstructured sources without human intervention
          - E.g., aggregate queries (average Honda price?) without reading all posts
        - Unsupervised selection of reference sets
          - The repository grows over time, increasing coverage
        - Unsupervised annotation
          - Competitive with a machine learning approach, but without the burden of labeling matches
          - Necessary to exploit newly collected reference sets automatically
          - Allows large-scale annotation over time, without user intervention
      - Future work
        - Unsupervised extraction
        - Collect reference sets and manage them with an information mediator