
TRank ISWC2013


Presentation of "TRank: Ranking Entity Types Using the Web of Data" at ISWC2013


  1. 1. TRank: Ranking Entity Types Using the Web of Data Alberto Tonon1, Michele Catasta2, Gianluca Demartini1, Philippe Cudré-Mauroux1, and Karl Aberer2 ISWC– 25 October 2013 1eXascale Infolab, 2Distributed Information Systems Laboratory University of Fribourg, Switzerland EPFL, Switzerland {alberto, demartini, phil} {firstname.lastname}
  2. 2. Why Entities? • The Web is getting entity-centric! • Entity-centric services Google 2
  3. 3. …and Why Types? • “Summarization” of texts Article Title Entities Bin Laden Relative Pleads Not Guilty in Terrorism Case Types Osama Bin Laden Abu Ghaith Lewis Kaplan Manhattan Al-Qaeda Propagandists Kuwaiti Al-Qaeda members Judge Borough (New York City) • Contextual entity summaries in Web pages Kuwaiti Al-Qaeda members Al-Qaeda Propagandist Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda Jihadist Organizations • Disambiguation of other entities • Diversification of search results 3
  4. 4. Entities May Have Many Types American Billionaires People from King County Thing American Philanthropists People from Seattle Windows People American Computer Programmers Agent Harvard University People Person Living People American People of Scottish Descent 4
  5. 5. Our Task: Ranking Types Given a Context • Input: a knowledge base G, an entity e, a context c in which e appears. • Output: e’s types ranked by relevance w.r.t. the context c. G: DBPedia 3.8 e: Bill Gates c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.» Bill Gates 1. American Chief executive 2. American Computer Programmer 3. American Billionaires 4. … • Evaluation: crowdsourcing + MAP, NDCG 5
  6. 6. TRank Pipeline Text extraction (BoilerPipe) Ranked list of types Named Entity Recognition (Stanford NER) Type ranking Type ranking Type ranking Type ranking List of entity labels List of type URIs foreach Entity linking (inverted index: DBpedia labels ⟹ resource URIs) Type retrieval (inverted index: resource URIs ⟹ type URIs) List of entity URIs 6
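The pipeline on this slide can be sketched as a sequence of simple functions. This is a minimal illustration only: the real system uses BoilerPipe for text extraction, Stanford NER for entity recognition, and Hadoop-backed inverted indices; every function and all data below are simplified stand-ins, not the authors' code.

```python
import re

def extract_text(html):
    # Stand-in for BoilerPipe: naively strip HTML tags.
    return re.sub(r"<[^>]+>", " ", html)

def recognize_entities(text):
    # Stand-in for Stanford NER: treat capitalized tokens as entity mentions.
    return [w for w in text.split() if w.istitle()]

def link_entities(labels, label_index):
    # Entity linking: label -> resource URI via an inverted index.
    return [label_index[l] for l in labels if l in label_index]

def retrieve_types(uris, type_index):
    # Type retrieval: resource URI -> list of type URIs.
    return {u: type_index.get(u, []) for u in uris}

# Toy indices (illustrative URIs, not the real DBpedia index contents).
label_index = {"Fribourg": "dbpedia:Fribourg"}
type_index = {"dbpedia:Fribourg": ["dbpedia-owl:City", "dbpedia-owl:Place"]}

html = "<p>The University of Fribourg is in Fribourg.</p>"
entities = recognize_entities(extract_text(html))
types = retrieve_types(link_entities(entities, label_index), type_index)
```

The per-entity type-ranking step (the "foreach" on the slide) would then run over each entry of `types`.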
  7. 7. Type Hierarchy <owl:Thing> <owl:equivalentClass> Mappings YAGO/DBpedia (PARIS) type: DBpedia Yago subClassOf relationship: explicit inferred from <owl:equivalentClass> PARIS ontology mapping manually added 7
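The integrated hierarchy combines explicit subClassOf edges with edges inferred through owl:equivalentClass mappings (produced by PARIS in the paper) plus a few manual additions. A rough sketch of the inference step, on made-up toy data:

```python
def with_inferred_edges(sub_class_of, equivalent):
    """Add a subClassOf edge a -> c whenever a -> b exists and b is
    owl:equivalentClass to c. Toy illustration, not the actual ontology code."""
    out = {t: set(parents) for t, parents in sub_class_of.items()}
    for t, parents in sub_class_of.items():
        for p in list(parents):
            for q in equivalent.get(p, []):
                out[t].add(q)
    return out

# Hypothetical edges and mapping (illustrative prefixes, not real TRank data).
sub_class_of = {"yago:AmericanBillionaires": {"yago:Billionaires"}}
equivalent = {"yago:Billionaires": ["dbo:Person"]}
merged = with_inferred_edges(sub_class_of, equivalent)
```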
  8. 8. Ranking Algorithms • Entity-centric • Hierarchy-based • Context-aware (featuring the type hierarchy) • Learning to Rank 8
  9. 9. Entity-Centric Ranking Approaches (An Example) • SAMEAS Score(e, t) = number of URIs representing e with type t. 9
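Following the slide's definition, SAMEAS counts how many of the URIs representing e (its owl:sameAs aliases) are asserted to have type t. A minimal sketch on invented toy data:

```python
def sameas_score(entity_uris, types_of, t):
    """SAMEAS: number of the entity's alias URIs asserted to have type t."""
    return sum(t in types_of.get(u, ()) for u in entity_uris)

# Toy knowledge base: one entity known under three alias URIs
# (illustrative identifiers, not actual dataset contents).
aliases = ["dbpedia:Bill_Gates", "freebase:m.017nt", "yago:Bill_Gates"]
types_of = {
    "dbpedia:Bill_Gates": {"yago:AmericanBillionaires", "dbo:Person"},
    "freebase:m.017nt": {"dbo:Person"},
    "yago:Bill_Gates": {"yago:AmericanBillionaires"},
}
```

Here `sameas_score(aliases, types_of, "dbo:Person")` counts two of the three aliases.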
  10. 10. Hierarchy-Based Approaches (An Example) • ANCESTORS Score(e, t) = number of t’s ancestors in the type hierarchy contained in Te. Te often doesn’t contain all super types of a specific type 10
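The ANCESTORS score from this slide counts how many of t's ancestors in the type hierarchy already appear in the entity's type set Te. A sketch with a toy hierarchy (the real one merges DBpedia and YAGO); note how a missing intermediate type in Te lowers the score, which is the incompleteness issue the slide points out:

```python
def type_ancestors(t, super_of):
    """All (transitive) super types of t in a DAG of subClassOf edges."""
    out, stack = set(), [t]
    while stack:
        for p in super_of.get(stack.pop(), []):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def ancestors_score(t, entity_types, super_of):
    """ANCESTORS: number of t's ancestors contained in Te."""
    return len(type_ancestors(t, super_of) & set(entity_types))

# Hypothetical hierarchy and type set.
super_of = {"yago:AmericanBillionaires": ["yago:Billionaires"],
            "yago:Billionaires": ["dbo:Person"],
            "dbo:Person": ["owl:Thing"]}
Te = {"yago:AmericanBillionaires", "dbo:Person"}  # yago:Billionaires missing
```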
  11. 11. Context-Aware Ranking Approaches (An Example) • SAMETYPE Score(e, t, cT) = number of times t appears among the types of every other entity in cT. Context Actor AmericanActor e Organization e'' Thing Person Actor e' 11
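SAMETYPE, as defined on the slide, scores a type t for entity e by how often t occurs among the types of the *other* entities detected in the same context. A small sketch mirroring the slide's Actor example (entity names and type URIs are illustrative):

```python
from collections import Counter

def sametype_score(t, entity, context_entity_types):
    """SAMETYPE: occurrences of t among the types of every other
    entity in the context. context_entity_types maps each detected
    entity to its set of type URIs."""
    counts = Counter()
    for other, types in context_entity_types.items():
        if other != entity:
            counts.update(types)
    return counts[t]

# Toy context: e is an actor mentioned alongside another actor and an org.
ctx = {
    "e":  {"yago:Actor", "yago:AmericanActor"},
    "e1": {"yago:Actor", "dbo:Person"},
    "e2": {"dbo:Organisation"},
}
```

Here `yago:Actor` gets boosted for `e` because the co-occurring entity `e1` shares it.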
  12. 12. Learning to Rank Entity Types Determine an optimal combination of all our approaches: • Decision trees • Linear regression models • 10-fold cross validation 12
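At prediction time, a learned linear model reduces to a weighted combination of the individual approaches' scores. The weights below are made up purely for illustration; the paper actually fits decision trees and linear regression models with 10-fold cross validation.

```python
def combined_score(features, weights):
    """Linear combination of per-approach scores (a learned model would
    supply the weights; these are hypothetical)."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"SAMEAS": 0.4, "ANCESTORS": 0.3, "SAMETYPE": 0.3}   # invented
features = {"SAMEAS": 2.0, "ANCESTORS": 1.0, "SAMETYPE": 1.0}  # invented
score = combined_score(features, weights)  # 0.4*2.0 + 0.3*1.0 + 0.3*1.0 = 1.4
```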
  13. 13. Avoiding SPARQL Queries with Inverted Indices and Map/Reduce • TRank is implemented with Hadoop and Map/Reduce. • All computations are done by using inverted indices: – Entity linking – Path index – Depth index • The inverted indices are publicly available. 13
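The point of the inverted indices is that lookups become plain key accesses instead of SPARQL queries. A toy sketch of building one (TRank builds its indices with Hadoop over the full knowledge base; the builder and data here are simplified stand-ins):

```python
def build_inverted_index(pairs):
    """Build key -> sorted list of values from (key, value) pairs,
    e.g. entity label -> resource URIs, or resource URI -> type URIs."""
    index = {}
    for k, v in pairs:
        index.setdefault(k, set()).add(v)
    return {k: sorted(vs) for k, vs in index.items()}

# Illustrative label-to-URI pairs.
label_to_uri = build_inverted_index([
    ("Bill Gates", "dbpedia:Bill_Gates"),
    ("Bill Gates", "yago:Bill_Gates"),
])
```

A lookup is then `label_to_uri["Bill Gates"]`, with no graph traversal at query time.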
  15. 15. Datasets • 128 recent NYTimes articles split to create: – Entity Collection – Sentence Collection – Paragraph Collection – 3-Paragraphs Collection • Ground-truth obtained by using crowdsourcing – 3 workers per entity/context – 4 levels of relevance for each type – Overall cost: $190 15
  16. 16. Effectiveness Evaluation Check our paper or contact us for a complete description of all the approaches we evaluated 16
  17. 17. Efficiency Evaluation • Tested efficiency on a CommonCrawl sample of 1TB – 1,310,459 HTML pages – 23GB compressed • Map/Reduce on a cluster of 8 machines, each with 12 cores, 32GB of RAM, and 3 SATA disks • On average, 25 min. processing time (> 100 docs per node per second) • Time breakdown: Text Extraction 18.9%, NER 35.6%, Entity Linking 29.5%, Type Retrieval 9.8%, Type Ranking 6.2% 17
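The throughput claim on this slide is consistent with the stated numbers, which can be checked directly:

```python
# Figures taken from the slide: 1,310,459 pages in ~25 minutes on 8 nodes.
pages = 1_310_459
seconds = 25 * 60
nodes = 8

docs_per_node_per_sec = pages / (seconds * nodes)
# ~109 docs per node per second, matching "> 100 docs per node per second".
```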
  18. 18. Conclusions • New task: ranking entity types. – Useful for: “summarization” of Web documents, entity summaries, disambiguation. • Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank. – Hierarchy-based and learning to rank are the most effective. • Hadoop, Map/Reduce, and inverted indices to achieve scalability. 18
  19. 19. Grazie! • Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available. Thank you for your attention! Thanks for the Travel Award! TRank is open source! https://github.com/MEM0R1ES/TRank Check out B-hist at the SW Challenge! 19