
Facet Annotation Using Reference Knowledge Bases - The Web Conference 2018 (Research Track)

Presentation of "Facet Annotation Using Reference Knowledge Bases" in the Research Track of WWW2018 (The Web Conference 2018), April 26th, Lyon, France.

ABSTRACT: Faceted interfaces are omnipresent on the web to support data exploration and filtering. A facet is a triple: a domain (e.g., Book), a property (e.g., author, language), and a set of property values (e.g., { Austen, Beauvoir, Coelho, Dostoevsky, Eco, Kerouac, Suskind, ... }, { French, English, German, Italian, Portuguese, Russian, ... }). Given a property (e.g., language), selecting one or more of its values (English and Italian) returns the domain entities (of type Book) that match the given values (the books that are written in English or Italian). To implement faceted interfaces in a way that is scalable to very large datasets, it is necessary to automate facet extraction. Prior work associates a facet domain with a set of homogeneous values, but does not annotate the facet property. In this paper, we annotate the facet property with a predicate from a reference Knowledge Base (KB) so as to maximize the semantic similarity between the property and the predicate. We define semantic similarity in terms of three new metrics: specificity, coverage, and frequency. Our experimental evaluation uses the DBpedia and YAGO KBs and shows that for the facet annotation problem, we obtain better results than a state-of-the-art approach for the annotation of web tables, modified to annotate a set of values.

For more info about our work you can check out the websites of our labs:
INSID&S Lab (UNIMIB): http://inside.disco.unimib.it/
ADVIS Lab (UIC): https://www.cs.uic.edu/bin/view/Advis/WebHome


  1. Facet Annotation Using Reference Knowledge Bases
     INSID&S Lab: Interaction and Semantics for Innovation with Data & Services
     Isabel F. Cruz, Department of Computer Science, University of Illinois at Chicago
     Matteo Palmonari, Department of Informatics, Systems and Communication, University of Milano-Bicocca
     Riccardo Porrini, Mia-Platform (7Pixel Srl) (University of Milano-Bicocca)
  2. Using and Building Faceted Interfaces
     Faceted interfaces
     •  Widespread in several domains
     •  Enhance user experience
        –  Competitive advantage for web applications, e.g., in eCommerce
     [Screenshots of eCommerce faceted interfaces (IT, SI, Global).]
  3. Using and Building Faceted Interfaces
     Faceted interfaces
     •  Widespread in several domains
     •  Enhance user experience
        –  Competitive advantage for web applications, e.g., in eCommerce
     A facet
     •  Domain, property, values
        –  <D, p, {v1, ..., vn}>
        –  <Books, book language, {English, Spanish}>
     [Screenshots of eCommerce faceted interfaces (IT, SI, Global).]
  4. Using and Building Faceted Interfaces
     Faceted interfaces
     •  Widespread in several domains
     •  Enhance user experience
        –  Competitive advantage for web applications, e.g., in eCommerce
     A facet (see the data-structure sketch after this slide)
     •  Domain, property, values
        –  <D, p, {v1, ..., vn}>
        –  <Books, book language, {English, Spanish}>
     Hard to define well-crafted facets manually
     •  Many different types of entities require many domain-specific facets
        –  E.g., 500+ product types (cellphones, books, pet supplies, ...)
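
A minimal sketch of the facet triple <D, p, {v1, ..., vn}> introduced above, written in Python for illustration (the implementation described on slide 20 is in Java); the class and field names are assumptions, not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Facet:
    domain: str              # the facet domain D, e.g., "Books"
    prop: Optional[str]      # the property label p; None when the facet is not yet annotated
    values: frozenset        # the property values {v1, ..., vn}

# the <Books, book language, {English, Spanish}> facet from the slide
books_by_language = Facet("Books", "book language", frozenset({"English", "Spanish"}))
```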
  5. Facet Extraction & Annotation
     Facet extraction
     •  Clusters of values for different characteristics
        –  Offline vs. online (on search results)
     Facet annotation
     •  Assign a property to the automatically extracted set of values
     •  Use a predicate of a KB to capture the semantics of the property
        –  Human-readable semantics
        –  Machine-readable semantics → RDFa markup of faceted interfaces
        –  Reduction of terminological entropy across domains
     Main goals: build faceted interfaces on-the-fly for effective searching; assist domain experts in creating and maintaining facets
     [Diagram: the Book facet, with the value set {Agatha Christie, Arthur C. Clarke, Paulo Coelho, Charles Dickens, ...} annotated with the predicate author and the value set {English, Italian, Portuguese, ...} annotated with language.]
  6. Facet Extraction & Creation in Comparison Shopping Platforms (CSPs)
     Real-world deployment: automated clustering of facet values (Porrini et al., CAiSE 2014)
     •  Values extracted from 3K+ data sources mapped to a CSP central schema with 500+ leaf types
     •  Clustered by similarity (used in production in an Italian CSP)
     Experts refine the facet with a GUI
     •  Add/delete values
     •  Lexical specification of the facet property
  7. Facet Annotation vs. Table Annotation
     Facet annotation (unbalanced): only the facet domain and a flat set of values are available, e.g.,
        <Book, ?p, {Agatha Christie, Arthur C. Clarke, Paulo Coelho, Charles Dickens, J. K. Rowling, J. R. R. Tolkien, Anna Sewell, ...}>
     Table annotation (balanced): subject and object cells are available in the same rows, e.g., a "Books" table
        book_id  | title                | genre            | author
        1020[…]  | The Hobbit           | Juvenile fantasy | J. R. R. Tolkien
        2013[…]  | 2001 A Space Odyssey | Science Fiction  | Arthur C. Clarke
        2016[…]  | Gone Girl            | Crime            | Gillian Flynn
     [Diagram: KB fragment with the types Book, Work, Thing, Agent, Person and author/creator assertions involving The_Hobbit, J.R.R._Tolkien, 2001_A_Space_Odyssey, and Arthur_C._Clarke.]
  8. Facet Annotation: Challenges
     •  Many predicates, e.g., 53K+ DBpedia properties
     •  Weak specification, e.g., no domain/range restrictions
     •  Ambiguous values, i.e., the same values appear in different predicates
     •  Specificity of the predicate associated with a domain-specific facet
     [Diagram: intensional semantics (domain/range) vs. extensional semantics of the author and creator predicates; the creator extension contains subjects that are not books (in red), e.g., HAL_9000, Sunjammer, Middle-earth.]
  9. Approach: Overview
     •  Index the extensions of the KB predicates using signatures
     •  Filter predicates whose signatures match the facet domain and values
     •  Rank predicates by similarity (three criteria: frequency, coverage, specificity)
     •  Output: ranked list of predicates
     (A pipeline sketch follows below.)
     [Diagram: from the KB relational assertions and types (DBpedia classes and Wikipedia categories) to the predicates' signatures; the facet <Book, ?p, {Arthur C. Clarke, Paulo Coelho, J. R. R. Tolkien, ...}> is matched against the signatures in the filtering step, and the surviving predicates (author, creator) are ranked.]
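
The overall pipeline of this slide is filter-then-rank. A minimal sketch, assuming the signatures have already been indexed per predicate; `matches` and `score` stand for the matching test and the drc similarity sketched on the following slides, and all names are illustrative.

```python
from typing import Callable

def annotate_facet(facet, signatures_by_predicate: dict, matches: Callable, score: Callable) -> list:
    # Filtering: keep, for each predicate, only the signatures that match the facet
    filtered = {q: [s for s in sigs if matches(s, facet)]
                for q, sigs in signatures_by_predicate.items()}
    # Ranking: score the predicates that survived filtering and return them best-first
    ranked = [(q, score(facet, q, sigs)) for q, sigs in filtered.items() if sigs]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```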
  10. Approach: Signatures
      Each relational assertion (s, p, o) is indexed with a signature built from the types of s and from the lexicalizations of those types and of o:
      <Subject Types' Names | Subject Types' Lexicalizations | Object Types' Names | Object's Lexicalization>
      Example assertions (subject → object) and their signatures:
      The_Hobbit → "J.R.R._Tolkien":    < {Book, Work, ...} | {book, work, fantas, novel, ...} | {Person, Agent, ...} | {j, r, r, tolkien} >
      Middle-earth → "J.R.R._Tolkien":  < {Thing, FictionalContinents, ...} | {thing, fiction, continent, ...} | {Person, Agent, ...} | {j, r, r, tolkien} >
      HAL_9000 → "Arthur_C._Clarke":    < {FictionalCharacter, Person, ...} | {fiction, charact, person, agent, ...} | {Person, Agent, ...} | {arthur, c, clarke} >
      Sunjammer → "Arthur_C._Clarke":   < {Book, Work, ...} | {book, work, short, story, ...} | {Person, Agent, ...} | {arthur, c, clarke} >
      (A data-structure sketch follows below.)
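
A sketch of the signature layout shown above; the tuple and field names are illustrative, not the authors' own.

```python
from typing import NamedTuple

class Signature(NamedTuple):
    subj_type_names: frozenset  # e.g., {"Book", "Work", ...}
    subj_type_lex: frozenset    # e.g., {"book", "work", "fantas", "novel", ...}
    obj_type_names: frozenset   # e.g., {"Person", "Agent", ...}
    obj_lex: frozenset          # e.g., tokens of the object, {"j", "r", "tolkien"}

# signature indexed for the assertion with subject The_Hobbit and object "J.R.R._Tolkien"
hobbit_signature = Signature(frozenset({"Book", "Work"}),
                             frozenset({"book", "work", "fantas", "novel"}),
                             frozenset({"Person", "Agent"}),
                             frozenset({"j", "r", "tolkien"}))
```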
  11. Approach: Filtering
      A signature matches the facet if both conditions hold:
      •  D-match: subject types vs. facet domain (on lexicalizations)
      •  R-match: object vs. at least one facet value (on lexicalizations)
      For the <Book, ?p, {Arthur C. Clarke, Paulo Coelho, J. R. R. Tolkien, ...}> facet, the first and fourth signatures of the previous slide match (✓), while the second and third fail the D-match (✗).
      (A matching sketch follows below.)
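
A minimal sketch of the matching test, assuming signatures and facet are given as plain token sets (lexicalizations here are simply lower-cased tokens; the paper also normalizes them, and the exact R-match criterion, token-set equality below, is an assumption).

```python
def d_match(subj_type_lex: set, facet_domain_lex: set) -> bool:
    # D-match: the subject types' lexicalization must overlap the facet domain's lexicalization
    return bool(subj_type_lex & facet_domain_lex)

def r_match(obj_lex: set, facet_value_lexs: list) -> bool:
    # R-match: the object's lexicalization must match at least one facet value
    return any(obj_lex == value_lex for value_lex in facet_value_lexs)

def signature_matches(subj_type_lex, obj_lex, facet_domain_lex, facet_value_lexs) -> bool:
    # a signature matches the facet only if both the D-match and the R-match hold
    return d_match(subj_type_lex, facet_domain_lex) and r_match(obj_lex, facet_value_lexs)

# the first signature of slide 10 matches the <Book, ?p, {J. R. R. Tolkien, Arthur C. Clarke, ...}> facet
print(signature_matches({"book", "work", "fantas", "novel"}, {"j", "r", "tolkien"},
                        {"book"}, [{"j", "r", "tolkien"}, {"arthur", "c", "clarke"}]))  # True
```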
  12. Approach: Ranking
      Three criteria:
      •  Coverage
      •  Weighted Frequency
      •  Specificity
      Combination:
      •  Smoothing for comparability
      •  Multiplication
  13. Ranking: Coverage and Weighted Frequency
      •  Coverage of a predicate q: the fraction of facet values matched by q
            cov(f, q) = |{ v ∈ V_f | ∃ (s, o) ∈ δ_f^q : o matches v }| / |V_f|    (9)
      •  Weighted Frequency of a predicate q: the sum of the signatures (s, o) matching the facet, each weighted by the Jaccard similarity between the lexicalizations of the subject types LT(s) and of the facet domain L(D_f), scaled down by the number of facet values |V_f|
            freq(f, q) = (1 / |V_f|) · Σ_{(s,o) ∈ δ_f^q} |LT(s) ∩ L(D_f)| / |LT(s) ∪ L(D_f)|    (10)
      Intuition: a higher coverage is estimated for predicates whose objects cover larger subsets of the facet value set V_f; a large matching extension δ_f^q provides further evidence that q and f are similar, and the more similar L(D_f) and LT(s), the stronger the evidence provided by each signature.
      (A sketch of both scores follows below.)
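
Minimal sketches of the two scores (Equations 9 and 10), assuming each matching signature is given as a pair (subject types' lexicalization, object's lexicalization) of token sets; names are illustrative.

```python
def coverage(facet_value_lexs: list, matched_sigs: list) -> float:
    # cov(f, q): fraction of facet values matched by at least one signature of predicate q
    matched = sum(1 for value_lex in facet_value_lexs
                  if any(obj_lex == value_lex for _, obj_lex in matched_sigs))
    return matched / len(facet_value_lexs)

def weighted_frequency(facet_domain_lex: set, facet_value_lexs: list, matched_sigs: list) -> float:
    # freq(f, q): each matching signature weighted by the Jaccard similarity between the
    # subject types' lexicalization LT(s) and the facet domain's lexicalization L(D_f),
    # scaled down by the number of facet values |V_f|
    total = sum(len(subj_lex & facet_domain_lex) / len(subj_lex | facet_domain_lex)
                for subj_lex, _ in matched_sigs)
    return total / len(facet_value_lexs)
```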
  14. Ranking: Specificity
      •  Matching signatures vs. non-matching signatures
         –  "more similar" in author than in creator
      [Diagram: matching (✓) and non-matching (✗) signatures of author and creator, shown by their subject and object type names; the non-matching signatures of author have types similar to the matching ones, while those of creator do not (e.g., Thing, FictionalContinents, FictionalCharacter subjects).]
  15. Ranking: Specificity
      •  Matching signatures vs. non-matching signatures
         –  "more similar" in author than in creator
      •  Compare the types occurring in the signatures rather than the signatures themselves
         –  Counting matching vs. non-matching signatures is unreliable (noisy; matching signatures are far fewer than non-matching ones)
      [Diagram as on slide 14.]
  16. Ranking: Specificity
      •  Matching signatures vs. non-matching signatures
         –  "more similar" in author than in creator
      •  Compare the types occurring in the signatures rather than the signatures themselves
         –  Counting matching vs. non-matching signatures is unreliable (noisy; matching signatures are far fewer than non-matching ones)
         –  Consider the most specific types
      [Diagram as on slide 14, with the most specific types marked.]
  17. Ranking: Specificity
      Two similarity measures:
      –  d-spec (on the domain)
      –  r-spec (on the range)
      d-spec compares the most specific subject types of the matching signatures with those of all of the predicate's signatures. The subject type set of an extension δ keeps only the minimal (most specific) types of its subjects:
            D[δ] = { t | ∃ (s, o) ∈ δ, t ∈ T(s) ∧ ∄ t' ∈ T(s), t' ⊏ t }    (4)
      The domain-specificity of q with respect to f compares D[δ_q] (full extension of q) with D[δ_f^q] (facet-induced extension) using a weighted Jaccard set similarity:
            d-spec(f, q) = w-card(D[δ_q] ∩ D[δ_f^q]) / w-card(D[δ_q] ∪ D[δ_f^q])    (5)
      [Diagram as on slide 16.]
  18. Ranking: Specificity (continued)
      The w-card (weighted cardinality) function computes the weighted sum of the depths of the types t ∈ T in the subtype graph, each depth being weighted by the inverse of the maximum depth of the descendants of t:
            w-card(T) = Σ_{t ∈ T} depth(t) / max_{t' ∈ DES[t]} depth(t')    (6)
      where DES[t] = { t' | t' ⊑ t }. Each type is thus weighted by its scaled depth; the scaling (1) keeps values in the [0, 1] interval and (2) further biases d-spec towards the most specific subject types.
      [Diagram as on slide 16.]
  19. Ranking: Specificity (continued)
      The range-specificity of q with respect to f, r-spec(f, q), adapts d-spec to consider the object types instead of the subject types. Similarly to Equation 4, the object type set of an extension δ is
            R[δ] = { t | ∃ (s, o) ∈ δ, t ∈ T(o) ∧ ∄ t' ∈ T(o), t' ⊏ t }    (7)
      and Equation 5 is adapted accordingly:
            r-spec(f, q) = w-card(R[δ_q] ∩ R[δ_f^q]) / w-card(R[δ_q] ∪ R[δ_f^q])    (8)
      (A sketch of w-card and the specificity score follows below.)
      [Diagram as on slide 16.]
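
A minimal sketch of w-card and of the specificity score (Equations 5-8), assuming the subtype graph is summarized by a depth map and a descendants map (each type counted among its own descendants, an assumption); all names are illustrative.

```python
def w_card(types: set, depth: dict, descendants: dict) -> float:
    # weighted cardinality: each type contributes its depth scaled by the inverse of the
    # maximum depth among its descendants, favoring the most specific types
    return sum(depth[t] / max(depth[d] for d in descendants[t]) for t in types)

def spec(min_types_q: set, min_types_fq: set, depth: dict, descendants: dict) -> float:
    # weighted Jaccard between the minimal types of the full extension of q and of its
    # facet-induced sub-extension; instantiated on subject types (d-spec) or object types (r-spec)
    union = w_card(min_types_q | min_types_fq, depth, descendants)
    return w_card(min_types_q & min_types_fq, depth, descendants) / union if union else 0.0

# toy subtype graph Thing > Work > Book
depth = {"Thing": 1, "Work": 2, "Book": 3}
descendants = {"Thing": {"Thing", "Work", "Book"}, "Work": {"Work", "Book"}, "Book": {"Book"}}
print(spec({"Book"}, {"Book", "Work"}, depth, descendants))  # 0.6: the extra, less specific Work lowers the score
```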
  20. Experiments: Set Up
      Knowledge Bases
      •  DBpedia 3.9: 753,958 types, 53,195 predicates, more than 96M relational assertions; two type graphs (the DBpedia ontology and the DBpedia categories extracted from Wikipedia, known to be noisy and of low quality)
      •  YAGO (2008-w40-2): 184,512 types, 89 predicates, more than 5M relational assertions
      Gold standards (122 facets, manually annotated with KB predicates)
      •  DBpedia [new!]: many properties (53K+), includes numeric facets, type hierarchy of medium depth; graded judgments (incorrect < correct < accurate)
      •  YAGO [Limaye et al., PVLDB 2010]: few properties (89), deep type hierarchy; one correct predicate per facet
      Table 1: Gold Standards' statistics (facets, domains, and approximate number of accurate/correct predicates)
         dbpedia-numbers:   8 facets,  7 domains, ~4 accurate, ~19 correct
         dbpedia-entities: 31 facets, 13 domains, ~7 accurate,  ~7 correct
         dbpedia:          39 facets, 13 domains, ~3 accurate,  ~9 correct
         yago-explicit:    83 facets, 17 domains,  1 accurate,   - correct
         yago-ambiguous:   83 facets, 10 domains,  1 accurate,   - correct
      Our approach: DRC (Domain and Range Comparison)
      Baselines
      •  Majority voting (MAJ)
      •  Maximum likelihood (MAX)
      •  Adaptation of [Venetis et al., VLDB 2011]
      Measures
      •  Mean Average Precision (MAP) – DBpedia
      •  Mean Reciprocal Rank (MRR) – YAGO
      •  Normalized Discounted Cumulative Gain (nDCG) – DBpedia/YAGO
      Implementation: Java with the Lucene library; the code repository, the gold standards, and the experimental results are publicly available.
      (Minimal sketches of MRR and nDCG follow below.)
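
Minimal sketches of two of the measures named above (reciprocal rank with a single correct predicate per facet, and nDCG with graded judgments); these are textbook formulations, not the authors' evaluation scripts.

```python
import math

def reciprocal_rank(ranking: list, correct: str) -> float:
    # used for YAGO, where the gold standard contains one correct predicate per facet;
    # MRR is the mean of this value over all facets
    return 1.0 / (ranking.index(correct) + 1) if correct in ranking else 0.0

def ndcg_at_k(ranking: list, gains: dict, k: int) -> float:
    # gains map a predicate to a graded judgment, e.g., incorrect=0, correct=1, accurate=2
    dcg = sum(gains.get(p, 0) / math.log2(i + 2) for i, p in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```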
  21. Experiments: DBpedia
      [Figure 4: Evaluation of DRC over the DBpedia dataset: (a) MAP of MAJ, MAX, and DRC on dbpedia, dbpedia-numbers, and dbpedia-entities; (b)-(d) nDCG at ranks 1-20 on each dataset.]
      Highlights
      •  DRC always achieves the best results (e.g., on dbpedia, nDCG at rank 1 of 0.81 vs. 0.77 for MAJ and 0.76 for MAX)
      •  DRC is particularly better on dbpedia-numbers, where the values are more ambiguous
      •  MAJ > MAX (the adapted state-of-the-art approach) at all ranks, in agreement with the results reported by the authors of MAX, who composed their approach with the majority-based one to achieve better results
  22. Experiments: DBpedia
      [Figure 5: effect of the specificity score on the dbpedia-numbers dataset: nDCG at ranks 1-20 for freq only, freq+cov, and the full DRC ranking function.]
      Highlights
      •  DRC always achieves the best results
      •  DRC is particularly better on dbpedia-numbers, where the values are more ambiguous
      •  MAJ > MAX (the adapted state-of-the-art approach)
      •  Specificity has an impact on the quality of the ranking: it is crucial for numeric facets, and gives a smaller but observable improvement on dbpedia-entities (strings are less ambiguous than numbers), where DRC still shows a higher nDCG than freq and freq+cov at all ranks, with an average improvement of around 4.5%
  23. Experiments: YAGO
      [Figure 6: Evaluation of DRC over the YAGO dataset: (a) MRR of MAJ, MAX, and DRC on yago-explicit and yago-ambiguous; (b)-(c) nDCG at ranks 1-5.]
      Highlights
      •  DRC always achieves the best results
      •  DRC and the specificity score are less effective than with DBpedia (in particular on yago-ambiguous)
         –  Cause: facet domains are generic and intuitive (e.g., novels), while YAGO type names are very specific (e.g., wikicategory-American-science-fiction-novels); the subject and object type sets extracted from the predicate extensions are sparser than those induced by the facets, making the specificity score less discriminative
         –  Possible solution: take types of medium depth instead of the most specific types
  24. Conclusions & Future Work
      •  Facet annotation with reference KBs
         –  Specificity-driven predicate ranking
         –  End-to-end support for domain experts in:
            •  Creating new facets [Porrini et al., CAiSE 2014]
            •  Labeling and lifting facets so as to transform lexical data into semantically rich data
      •  Future work
         –  Application to new domains (geospatial and healthcare applications)
         –  Improvement for KBs with a very deep type hierarchy
         –  Plugging the approach into a table annotation framework
  25. Contact: palmonari@disco.unimib.it
      This research was partially supported by NSF awards CNS-1646395, III-1618126, and CCF-1331800.
      EW-Shopp: Supporting Event and Weather-based Data Analytics and Marketing along the Shopper Journey (www.ew-shopp.eu).
      Bill & Melinda Gates Foundation, Grand Challenges Explorations.
  26. Additional Details
  27. More Details: Combination of Ranking Criteria
      Three criteria: Coverage, Weighted Frequency, Specificity
      Combination: smoothing for comparability, then multiplication
      The specificity, coverage, and frequency scores are aggregated into a single similarity score, the domain and range comparison drc(f, q). All scores lie in [0, 1] except the weighted frequency, which provides positive evidence in an unbounded way and could skew the overall score; each score is therefore smoothed with a logarithm (so each factor has lower bound 1) and the smoothed scores are multiplied, yielding a non-decreasing monotonic similarity score with lower bound 1 and no upper bound:
            drc(f, q) = (1 + ln(d-spec(f, q) + 1)) · (1 + ln(r-spec(f, q) + 1)) · (1 + ln(cov(f, q) + 1)) · (1 + ln(freq(f, q) + 1))    (11)
      (A sketch follows below.)
      [Diagram as on slide 9.]
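
A minimal sketch of Equation (11): each criterion is smoothed with ln(x + 1) so that the unbounded freq score cannot dominate, and the smoothed factors (each at least 1) are multiplied.

```python
import math

def drc(d_spec: float, r_spec: float, cov: float, freq: float) -> float:
    score = 1.0
    for x in (d_spec, r_spec, cov, freq):
        score *= 1.0 + math.log(x + 1.0)  # smoothing keeps every factor >= 1
    return score

print(drc(0.8, 0.9, 0.5, 12.0))  # freq may exceed 1; the logarithm keeps it from dominating
```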
