TRank ISWC2013



Presentation of "TRank: Ranking Entity Types Using the Web of Data" at ISWC2013

  • An entity is something that exists by itself, although it need not be of material existence
  • PARIS (VLDB 2012): ontology alignment. YAGO has very specific types
  • Entity-centric: use only the information connected to the entity. Context-aware: exploit the types of entities that co-occur in the context (e.g. Bill Gates + Microsoft vs. Bill Gates + Scotland). Hierarchy-based: exploit the type hierarchy. Learning to rank: combine evidence from all previous approaches in an optimal way
  • We start from the node representing an entity, follow sameAs links to reach other nodes representing the same entity, and count how many of these “new” nodes feature the type we are scoring
  • The set of types associated with an entity in a knowledge base often doesn’t contain all super types
  • C_T is the context given by the text
  • 10-fold cross validation; decision tree; regression … Preliminary experiments showed that is the best performing model
  • Use inverted indices to avoid SPARQL queries!
  • Increasing granularities of context: from no context (here is the entity, here are its types, rank them) to one sentence/paragraph (rank the types of all entities in this sentence/paragraph). 3 workers were asked to select the best type of each entity appearing in a given context
  • ANCESTORS is the real winner, since it uses inverted indices, which are faster, and needs no machine learning
  • Only pages with schema.org
  • HADOOP -> scalable, not efficient
  • Transcript of "TRank ISWC2013"

    1. TRank: Ranking Entity Types Using the Web of Data. Alberto Tonon (1), Michele Catasta (2), Gianluca Demartini (1), Philippe Cudré-Mauroux (1), and Karl Aberer (2). ISWC, 25 October 2013. (1) eXascale Infolab, University of Fribourg, Switzerland, {alberto, demartini, phil}@exascale.info. (2) Distributed Information Systems Laboratory, EPFL, Switzerland, {firstname.lastname}@epfl.ch
    2. Why Entities? • The Web is getting entity-centric! • Entity-centric services (example shown: Google)
    3. …and Why Types? • “Summarization” of texts. Article title: “Bin Laden Relative Pleads Not Guilty in Terrorism Case”. Entities: Osama Bin Laden; Abu Ghaith; Lewis Kaplan; Manhattan. Types: Al-Qaeda Propagandists; Kuwaiti Al-Qaeda members; Judge; Borough (New York City). • Contextual entity summaries in Web pages. Text: “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”. Types shown: Kuwaiti Al-Qaeda members; Al-Qaeda Propagandists; Jihadist Organizations. • Disambiguation of other entities • Diversification of search results
    4. Entities May Have Many Types. American Billionaires; People from King County; Thing; American Philanthropists; People from Seattle; Windows People; American Computer Programmers; Agent; Harvard University People; Person; Living People; American People of Scottish Descent
    5. Our Task: Ranking Types Given a Context • Input: a knowledge base G, an entity e, a context c in which e appears. • Output: e’s types ranked by relevance with respect to the context c. Example: G: DBpedia 3.8; e: Bill Gates; c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.» Ranked types for Bill Gates: 1. American Chief executive 2. American Computer Programmer 3. American Billionaires 4. … • Evaluation: crowdsourcing + MAP, NDCG
    6. TRank Pipeline. Text extraction (BoilerPipe) → Named Entity Recognition (Stanford NER) → list of entity labels → Entity linking (inverted index: DBpedia labels ⟹ resource URIs) → list of entity URIs → for each entity: Type retrieval (inverted index: resource URIs ⟹ type URIs) → list of type URIs → Type ranking → ranked list of types
    7. Type Hierarchy. [Figure: type hierarchy rooted at <owl:Thing>, with <owl:equivalentClass> links and YAGO/DBpedia mappings (PARIS).] Legend. Type: DBpedia; schema.org; YAGO. subClassOf relationship: explicit; inferred from <owl:equivalentClass>; PARIS ontology mapping; manually added
    8. Ranking Algorithms • Entity-centric • Hierarchy-based • Context-aware (featuring type hierarchy) • Learning to rank
    9. Entity-Centric Ranking Approaches (An Example) • SAMEAS: Score(e, t) = number of URIs representing e with type t.
    10. Hierarchy-Based Approaches (An Example) • ANCESTORS: Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of types of e. Note: Te often doesn’t contain all super types of a specific type.
    11. Context-Aware Ranking Approaches (An Example) • SAMETYPE: Score(e, t, cT) = number of times t appears among the types of every other entity in cT. [Figure: a context containing entities e, e' and e'', with the types Thing, Person, Actor, AmericanActor and Organization.]
    12. Learning to Rank Entity Types. Determine an optimal combination of all our approaches: • Decision trees • Linear regression models • 10-fold cross validation
    13. Avoiding SPARQL Queries with Inverted Indices and Map/Reduce • TRank is implemented with Hadoop and Map/Reduce. • All computations are done by using inverted indices: – Entity linking – Path index – Depth index • The inverted indices are publicly available at exascale.info/TRank
    15. Datasets • 128 recent NYTimes articles split to create: – Entity Collection – Sentence Collection – Paragraph Collection – 3-Paragraphs Collection • Ground truth obtained by using crowdsourcing: – 3 workers per entity/context – 4 levels of relevance for each type – Overall cost: $190
    16. Effectiveness Evaluation. Check our paper or contact us for a complete description of all the approaches we evaluated.
    17. Efficiency Evaluation • Tested efficiency on a CommonCrawl sample of 1TB: – 1,310,459 HTML pages – 23GB compressed • Map/Reduce on a cluster of 8 machines with 12 cores, 32GB of RAM and 3 SATA disks • On average, 25 min. processing time (> 100 docs per node per second) • Share of processing time per step: Text Extraction 18.9%; NER 35.6%; Entity Linking 29.5%; Type Retrieval 9.8%; Type Ranking 6.2%
    18. Conclusions • New task: ranking entity types. – Useful for: “summarization” of Web documents, entity summaries, disambiguation. • Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank. – Hierarchy-based and learning to rank are the most effective. • Hadoop, Map/Reduce, and inverted indices to achieve scalability.
    19. Grazie! • Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank. Thank you for your attention! Thanks to [logo] for the Travel Award! TRank is open source! https://github.com/MEM0R1ES/TRank Check out B-hist at the SW Challenge!
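
The three example scores from slides 9 to 11 (SAMEAS, ANCESTORS, SAMETYPE) can be sketched in a few lines of Python. This is a minimal illustration, not the TRank implementation: the function names, the dictionary-based data structures, the toy URIs and the single-parent (tree-shaped) hierarchy are all assumptions made for the example, and plain in-memory dictionaries stand in for the inverted indices mentioned on slide 13.

```python
# Hypothetical toy data; none of these URIs or type assignments come from the slides.
# parent_of assumes each type has at most one parent (a tree-shaped hierarchy).
parent_of = {
    "AmericanActor": "Actor",
    "Actor": "Person",
    "Person": "Thing",
    "Organization": "Thing",
}

# Types attached to each URI that represents the same entity e (reached via sameAs links).
types_of = {
    "dbpedia:e": {"Actor", "Person"},
    "yago:e": {"Actor", "AmericanActor"},
    "freebase:e": {"Person"},
}

# Types of every entity appearing in the textual context.
context_types = {
    "e": {"Actor", "AmericanActor"},
    "e1": {"Actor", "Person"},
    "e2": {"Organization"},
}


def sameas_score(same_as_uris, types_of, t):
    """SAMEAS: number of URIs representing the entity that carry type t."""
    return sum(1 for uri in same_as_uris if t in types_of.get(uri, set()))


def ancestors_score(entity_types, parent_of, t):
    """ANCESTORS: number of t's ancestors in the hierarchy that are also in Te."""
    count = 0
    seen = set()  # guards against accidental cycles in the toy hierarchy
    node = parent_of.get(t)
    while node is not None and node not in seen:
        seen.add(node)
        if node in entity_types:
            count += 1
        node = parent_of.get(node)
    return count


def sametype_score(entity, context_types, t):
    """SAMETYPE: number of other entities in the context that have type t."""
    return sum(
        1
        for other, types in context_types.items()
        if other != entity and t in types
    )


if __name__ == "__main__":
    uris = ["dbpedia:e", "yago:e", "freebase:e"]
    te = {"AmericanActor", "Actor", "Thing"}  # Te deliberately omits Person
    print(sameas_score(uris, types_of, "Actor"))           # 2
    print(ancestors_score(te, parent_of, "AmericanActor"))  # 2
    print(sametype_score("e", context_types, "Actor"))      # 1
```

In the toy run, Te omits Person, which mirrors the remark on slide 10 that an entity's type set often lacks some super types: ANCESTORS only counts the ancestors that are actually present in Te.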