TRank ISWC2013


  1. TRank: Ranking Entity Types Using the Web of Data
     Alberto Tonon¹, Michele Catasta², Gianluca Demartini¹, Philippe Cudré-Mauroux¹, and Karl Aberer²
     ¹ eXascale Infolab, University of Fribourg, Switzerland, {alberto, demartini, phil}@exascale.info
     ² Distributed Information Systems Laboratory, EPFL, Switzerland, {firstname.lastname}@epfl.ch
     ISWC, 25 October 2013
  2. Why Entities?
     • The Web is getting entity-centric!
     • Entity-centric services (shown on the slide: Google)
  3. …and Why Types?
     • “Summarization” of texts
     • Contextual entity summaries in Web pages
     • Disambiguation of other entities
     • Diversification of search results
     Example:
     Article title: Bin Laden Relative Pleads Not Guilty in Terrorism Case
     Entities: Osama Bin Laden, Abu Ghaith, Lewis Kaplan, Manhattan
     Types: Al-Qaeda Propagandists, Kuwaiti Al-Qaeda members, Judge, Borough (New York City)
     Entity summary: Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda. Types: Al-Qaeda Propagandist, Kuwaiti Al-Qaeda members, Jihadist Organizations
  4. Entities May Have Many Types
     Types shown on the slide: Thing, Agent, Person, People, Living People, American Billionaires, American Computer Programmers, American Philanthropists, American People of Scottish Descent, Harvard University People, People from King County, People from Seattle, Windows People
  5. Our Task: Ranking Types Given a Context
     • Input: a knowledge base G, an entity e, and a context c in which e appears.
     • Output: e’s types ranked by relevance with respect to the context c.
     • Evaluation: crowdsourcing + MAP, NDCG
     Example input: G: DBpedia 3.8; e: Bill Gates; c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.»
     Example output for Bill Gates: 1. American Chief executive 2. American Computer Programmer 3. American Billionaires 4. …
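The two evaluation measures named on this slide can be sketched as follows. This is a minimal illustration, not the authors' evaluation tool: it assumes graded relevance values (for example 0 to 3) listed in ranked order, the common DCG form with a log2(rank + 1) discount, and, for average precision, binary relevance flags with every relevant type present in the ranked list. MAP is then the mean of `average_precision` over all entity/context pairs. All function names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of relevance values given in ranked order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    """Mean of precision@k over the ranks k that hold a relevant type."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

perfect = ndcg([3, 2, 1, 0])          # already in ideal order
swapped = ndcg([0, 3])                # best type ranked second
ap = average_precision([True, False, True])
```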
  6. TRank Pipeline
     1. Text extraction (BoilerPipe)
     2. Named Entity Recognition (Stanford NER), producing a list of entity labels
     3. Entity linking (inverted index: DBpedia labels ⟹ resource URIs), producing a list of entity URIs
     4. For each entity URI: type retrieval (inverted index: resource URIs ⟹ type URIs), producing a list of type URIs
     5. For each entity URI: type ranking, producing a ranked list of types
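The last three pipeline stages can be sketched with two in-memory dictionaries standing in for the inverted indices. This is only an illustration under stated assumptions: the index contents, the `dbpedia:` prefixes, and every function name are hypothetical; text extraction and NER are skipped, so the input is already a list of entity labels; and the ranker is a placeholder that sorts alphabetically.

```python
# Toy stand-ins for the two inverted indices named on the slide (hypothetical data).
LABEL_INDEX = {
    "bill gates": "dbpedia:Bill_Gates",
    "microsoft": "dbpedia:Microsoft",
}
TYPE_INDEX = {
    "dbpedia:Bill_Gates": ["Person", "American Billionaires"],
    "dbpedia:Microsoft": ["Organisation", "Company"],
}

def link_entities(labels):
    """Entity linking: look each label up in the label -> URI index; drop misses."""
    return [LABEL_INDEX[l.lower()] for l in labels if l.lower() in LABEL_INDEX]

def retrieve_types(uri):
    """Type retrieval: look the URI up in the URI -> types index."""
    return TYPE_INDEX.get(uri, [])

def rank_by_name(uri, types, context_uris):
    """Placeholder ranker; a real ranking approach would be plugged in here."""
    return sorted(types)

def trank(entity_labels, ranker=rank_by_name):
    """Run linking, type retrieval, and ranking for every linked entity."""
    uris = link_entities(entity_labels)
    return {uri: ranker(uri, retrieve_types(uri), uris) for uri in uris}

result = trank(["Bill Gates", "Paul Allen", "Microsoft"])
```

Labels without an index entry ("Paul Allen" here) are silently dropped, which is one possible design choice among several.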
  7. Type Hierarchy
     [Figure: a single type hierarchy rooted at <owl:Thing> that integrates types from DBpedia, schema.org, and YAGO. subClassOf relationships are either explicit, inferred from <owl:equivalentClass>, manually added, or taken from the PARIS ontology mapping (YAGO/DBpedia mappings).]
  8. Ranking Algorithms
     • Entity-centric
     • Hierarchy-based
     • Context-aware (featuring type hierarchy)
     • Learning to Rank
  9. Entity-Centric Ranking Approaches (An Example)
     • SAMEAS: Score(e, t) = number of URIs representing e with type t.
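A minimal sketch of the SAMEAS score, assuming the sameAs links and per-URI types are already available as dictionaries. The data, the URI prefixes, and the function name are hypothetical, and counting e's own URI together with its sameAs equivalents is an assumption about how "URIs representing e" is read.

```python
# Hypothetical sameAs links and per-URI type sets.
SAME_AS = {"dbpedia:Bill_Gates": {"yago:Bill_Gates", "freebase:Bill_Gates"}}
TYPES = {
    "dbpedia:Bill_Gates": {"Person", "American Billionaires"},
    "yago:Bill_Gates": {"Person", "American Computer Programmers"},
    "freebase:Bill_Gates": {"Person"},
}

def sameas_score(entity, type_):
    """Count the URIs representing `entity` (itself plus sameAs) that carry `type_`."""
    uris = {entity} | SAME_AS.get(entity, set())
    return sum(1 for uri in uris if type_ in TYPES.get(uri, set()))

person = sameas_score("dbpedia:Bill_Gates", "Person")
billionaire = sameas_score("dbpedia:Bill_Gates", "American Billionaires")
```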
  10. Hierarchy-Based Approaches (An Example)
     • ANCESTORS: Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of types associated with e.
     • Te often doesn’t contain all supertypes of a specific type.
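The ANCESTORS score can be sketched with a child-to-parent map. This assumes, for simplicity, that the hierarchy is a tree in which every type has at most one parent; the toy hierarchy and the function names are hypothetical. The example entity lacks the type Agent, which mirrors the remark that Te often misses some supertypes.

```python
# Hypothetical fragment of a type hierarchy: child -> parent.
PARENT = {
    "Agent": "Thing",
    "Person": "Agent",
    "American Billionaires": "Person",
}

def ancestors(type_):
    """Walk up the hierarchy and return all ancestors, nearest first."""
    found = []
    while type_ in PARENT:
        type_ = PARENT[type_]
        found.append(type_)
    return found

def ancestors_score(type_, entity_types):
    """Number of ancestors of `type_` that also appear among the entity's types."""
    return sum(1 for a in ancestors(type_) if a in entity_types)

# Agent is missing from the entity's type set on purpose.
T_e = {"Thing", "Person", "American Billionaires"}
scores = {t: ancestors_score(t, T_e) for t in T_e}
```

Under this score, more specific types with well-supported ancestor chains rank higher.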
  11. Context-Aware Ranking Approaches (An Example)
     • SAMETYPE: Score(e, t, cT) = number of times t appears among the types of every other entity in cT.
     [Figure: entity e in a context together with other entities e′ and e″; type labels shown: Person, Actor, AmericanActor, Organization, Thing.]
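A minimal sketch of the SAMETYPE score, assuming the types of the co-occurring entities are given as a dictionary. The entity names, their type sets, and the function name are hypothetical and do not reproduce the slide's figure.

```python
# Hypothetical types of the other entities found in the same textual context.
CONTEXT_TYPES = {
    "e_prime": {"Person", "Actor"},
    "e_second": {"Organization", "Thing"},
}

def sametype_score(type_, context_types):
    """Count how many other entities in the context also have `type_`."""
    return sum(1 for types in context_types.values() if type_ in types)

# Candidate types of the entity e being ranked.
candidate_types = ["Person", "Actor", "AmericanActor"]
scores = {t: sametype_score(t, CONTEXT_TYPES) for t in candidate_types}
```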
  12. Learning to Rank Entity Types
     Determine an optimal combination of all our approaches:
     • Decision trees
     • Linear regression models
     • 10-fold cross validation
  13. Avoiding SPARQL Queries with Inverted Indices and Map/Reduce
     • TRank is implemented with Hadoop and Map/Reduce.
     • All computations are done by using inverted indices:
       - Entity linking
       - Path index
       - Depth index
     • The inverted indices are publicly available at exascale.info/TRank
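The idea of building an inverted index with map and reduce steps can be sketched in plain Python. This is an in-memory simulation, not Hadoop code and not the TRank implementation: the triples, the prefixed names, and all function names are hypothetical, and the index built here (resource URI to types) is just one example of the indices the pipeline uses.

```python
from collections import defaultdict

def map_triple(triple):
    """Map step: emit (subject, type) for every rdf:type triple."""
    subject, predicate, obj = triple
    if predicate == "rdf:type":
        yield subject, obj

def reduce_types(key, values):
    """Reduce step: collapse all types emitted for one subject into a sorted list."""
    return key, sorted(set(values))

def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

triples = [
    ("dbpedia:Bill_Gates", "rdf:type", "Person"),
    ("dbpedia:Bill_Gates", "rdfs:label", "Bill Gates"),
    ("dbpedia:Bill_Gates", "rdf:type", "American Billionaires"),
    ("dbpedia:Microsoft", "rdf:type", "Company"),
]
type_index = run_mapreduce(triples, map_triple, reduce_types)
```

Once such an index exists, retrieving an entity's types is a dictionary lookup rather than a SPARQL query.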
  14. EXPERIMENTAL EVALUATION
  15. Datasets
     • 128 recent NYTimes articles split to create:
       - Entity Collection
       - Sentence Collection
       - Paragraph Collection
       - 3-Paragraphs Collection
     • Ground truth obtained by using crowdsourcing:
       - 3 workers per entity/context
       - 4 levels of relevance for each type
       - Overall cost: $190
  16. Effectiveness Evaluation
     Check our paper or contact us for a complete description of all the approaches we evaluated.
  17. Efficiency Evaluation
     • Tested efficiency on a CommonCrawl sample of 1TB:
       - 1,310,459 HTML pages
       - 23GB compressed
     • Map/Reduce on a cluster of 8 machines with 12 cores, 32GB of RAM and 3 SATA disks
     • On average, 25 min. processing time (> 100 docs per node per second)
     Share of processing time per pipeline step: Text Extraction 18.9%, NER 35.6%, Entity Linking 29.5%, Type Retrieval 9.8%, Type Ranking 6.2%
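The throughput claim can be checked from the numbers on this slide alone. The short calculation below uses only the page count, the 25-minute average, and the 8 machines; the variable names are illustrative.

```python
pages = 1_310_459        # HTML pages in the CommonCrawl sample
minutes = 25             # average processing time
nodes = 8                # machines in the cluster

seconds = minutes * 60
docs_per_second = pages / seconds            # whole cluster
docs_per_node_second = docs_per_second / nodes

print(f"{docs_per_second:.1f} docs/s overall, "
      f"{docs_per_node_second:.1f} docs/s per node")
```

The result is roughly 874 documents per second for the cluster and about 109 per node, consistent with the "> 100" figure.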
  18. Conclusions
     • New task: ranking entity types.
       - Useful for: “summarization” of Web documents, entity summaries, disambiguation.
     • Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank.
       - Hierarchy-based and learning to rank are the most effective.
     • Hadoop, Map/Reduce, and inverted indices to achieve scalability.
  19. Thank you!
     • Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank.
     • Thank you for your attention!
     • Check out B-hist at the SW Challenge!
     • Thanks to [logo] for the Travel Award!
     • TRank is open-source: https://github.com/MEM0R1ES/TRank
  20. [Slide without text]
  21. Entity-Centric Ranking Approaches
     • FREQ: Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base.
     • WIKILINK: Rank(e, t, ck) = number of e’s “neighbor entities” with type t.
     • SAMEAS: Rank(e, t, ck) = number of URIs representing e with type t.
     • LABEL: Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (thank you, Lucene).
  22. Entity-Centric Ranking Approaches
     • LABEL: Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index.
     [Figure, "Create Inverted Index": the labels of the knowledge-base entities (e1 "Tom Cruise", e2 "Bill Gates", e3 "Tom Hanks", …, eN "Osama Bin Laden") are tokenized into an inverted index: "Tom" ⟹ e1, e3; "Cruise" ⟹ e1; "Hanks" ⟹ e3; "Bill" ⟹ e2. At ranking time, Label(e) is issued as a query against the index, the results are ranked by TF-IDF, and the top 10 are kept.]
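The LABEL approach can be sketched with a token-level inverted index. This is a simplified stand-in: the slides credit Lucene with TF-IDF ranking, whereas the similarity below is only a sum of IDF weights over shared tokens and should not be read as Lucene's scoring formula. The entity ids follow the slide's figure, but the type sets and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(labels):
    """Inverted index: lower-cased label token -> set of entity ids."""
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index

def most_similar(entity, labels, index, k=10):
    """Top-k other entities by summed IDF of label tokens shared with `entity`."""
    n = len(labels)
    scores = Counter()
    for token in set(labels[entity].lower().split()):
        idf = math.log(n / len(index[token])) + 1.0
        for other in index[token]:
            if other != entity:
                scores[other] += idf
    return [e for e, _ in scores.most_common(k)]

def label_scores(entity, labels, types, index, k=10):
    """Frequency of each type among the k entities with the most similar labels."""
    counts = Counter()
    for other in most_similar(entity, labels, index, k):
        counts.update(types.get(other, ()))
    return counts

labels = {"e1": "Tom Cruise", "e2": "Bill Gates", "e3": "Tom Hanks"}
types = {"e1": {"Actor", "Person"}, "e2": {"Person"}}   # hypothetical types
index = build_index(labels)
scores_e3 = label_scores("e3", labels, types, index)
```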
  23. Hierarchy-Based Ranking Approaches
     • DEPTH: Rank(e, t, cH) = depth of t in the type hierarchy.
     • ANCESTORS: Rank(e, t, cH) = number of t’s ancestors in the type hierarchy contained in Te.
     • ANC_DEPTH: Rank(e, t, cH) = [formula not preserved in this transcript]
     • Te often doesn’t contain all supertypes of a specific type.
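A minimal sketch of the DEPTH score, assuming a tree-shaped hierarchy stored as a child-to-parent map, a root at depth 0, and that deeper (more specific) types should rank first. The toy hierarchy and function names are hypothetical, and ANC_DEPTH is left out because its formula is missing from the transcript.

```python
# Hypothetical fragment of a type hierarchy: child -> parent.
PARENT = {
    "Agent": "Thing",
    "Person": "Agent",
    "American Billionaires": "Person",
}

def depth(type_):
    """Number of subClassOf steps from `type_` up to the root (root has depth 0)."""
    steps = 0
    while type_ in PARENT:
        type_ = PARENT[type_]
        steps += 1
    return steps

def rank_by_depth(entity_types):
    """Order types from deepest to shallowest; ties broken alphabetically."""
    return sorted(entity_types, key=lambda t: (-depth(t), t))

ranking = rank_by_depth({"Thing", "Person", "American Billionaires"})
```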
  24. Context-Aware Ranking Approaches
     • The context can help produce a better ranking of types.
     • Which is the right type for Mali?
     Context: «Italy’s rebellious voters, who opted for a flamboyant billionaire and a clown, reminded us last week how deeply in crisis the Continent is. Meanwhile, France is going it virtually alone in Mali, and Britain talks openly of jumping the European ship altogether.»
     Candidate types: Landlocked Countries, Least Developed Countries, States And Territories Established In 1960, French-speaking Countries, World Trade Organization Member Economies, Country, African Union Member States, African Countries, Member States Of La Francophonie, African Union Member Economies, Populated Place, Place
  25. Context-Aware Ranking Approaches: PATH
     • Suppose we have to compute Rank(e, t, cT).
     • Consider each type t’ of each other entity e’ in c.
     • P(t) = path from the root of the type hierarchy to t.
  26. Context-Aware Ranking Approaches
     Ranking Tom Hanks’s types when co-occurring with Tom Cruise in some text.
  27. Relevance Judgments
     • Crowdsourced relevance judgments.
     • Anonymous Web users are TRank users.
     • 3 workers per entity/context.
     • Overall cost: $190
     • Pilot study on task design… mega-bubbles!
     • The number of votes a type receives is used as its relevance score.
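Turning worker votes into relevance scores can be sketched as a simple count. This assumes each worker selects one best type per entity/context pair, as the editor's notes describe; with 3 workers, a type's score then lies between 0 and 3. The function name and the example data are hypothetical.

```python
from collections import Counter

def relevance_scores(candidate_types, worker_choices):
    """Relevance of each candidate type = number of workers who selected it."""
    votes = Counter(worker_choices)
    return {t: votes[t] for t in candidate_types}

candidates = ["American Billionaires", "American Computer Programmers", "Person"]
choices = ["American Billionaires", "American Billionaires", "Person"]  # 3 workers
judgments = relevance_scores(candidates, choices)
```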

Editor's Notes

  • An entity is something that exists by itself, although it need not be of material existence
  • READ THE TYPES
  • STATE OF THE ART NER AND LINKING FOCUS IS RANKING TYPES
  • PARIS: VLDB2012 ontology alignment
    Yago super specific types
  • Entity centric
    Use only the information connected to the entity
    Context-aware
    Exploit the types of entities that co-occur in the context (e.g. Bill Gates + Microsoft vs Bill Gates + Scotland)
    Hierarchy-based
    Exploit the type hierarchy
    Learning to Rank
    Combine evidence coming from all previous approaches in an optimal way
  • we start from the node representing an entity, follow same-as links (we get other nodes representing the same entity) and we count how many “new” nodes feature the type we’re giving a score to
  • The set of types associated with an entity in a knowledge base often doesn’t contain all supertypes
  • C_T is the context given by the text
  • 10 FOLD CROSS VALID
    DECISION TREE
    REGRESSION …
    Preliminary experiments showed that … is the best-performing model
  • Use Inverted indices to AVOID SPARQL QUERIES!!
  • - Increasing granularities of context: from no-context (here is the entity, here are its types, rank them), one sentence/paragraph (rank the types of all entities in this sentence/paragraph)
    - 3 workers were asked to select the best type of each entity appearing in a given context
  • ANCESTORS is the real winner: it uses inverted indices, which are faster, and it needs no machine learning
  • Only pages with schema.org
  • HADOOP -> scalable, not efficient
  • SEMANTIC WEB SCIENCE ASSOCIATION!
  • Ck is the context given by the knowledge base