Linked Data Entity Summarization (PhD defense)

On the Web, the amount of structured and Linked Data about entities is constantly growing. Descriptions of single entities often include thousands of statements and it becomes difficult to comprehend the data, unless a selection of the most relevant facts is provided. This doctoral thesis addresses the problem of Linked Data entity summarization. The contributions involve two entity summarization approaches, a common API for entity summarization, and an approach for entity data fusion.

Published in: Science

  1. KIT – The Research University in the Helmholtz Association. Institute of Applied Informatics and Formal Description Methods (AIFB), www.kit.edu. Linked Data Entity Summarization. Dipl.-Inf. Univ. Andreas Thalhammer, 08.12.2016
  2. Outline: 1. Motivation 2. Research Questions 3. Contributions a) LinkSUM (Contribution 1) b) SUMMA API (Contribution 3) 4. Related Work 5. Summary and Outlook. Andreas Thalhammer – Linked Data Entity Summarization, 03.10.2018
  3. Section 1: Motivation
  4. Information need versus availability. Information need (in the US*): More than 40% of all search queries focus on one specific entity. 579 million searches per day come from home and work devices in the US, which amounts to roughly 232 million entity searches per day (US, desktop). Information availability (Wikidata**): Wikidata covers 24.5 million entities (55% growth in the last year). 3.2 million entities have more than 10 statements (78% growth in the last year). * https://www.comscore.com/Insights/Rankings/comScore-Releases-February-2016-US-Desktop-Search-Engine-Rankings ** https://www.wikidata.org/wiki/Wikidata:Statistics
  5. Growing amount of structured data on the Web. Example: the Wikidata entry for Pulp Fiction contains about 614 facts.
  6. Naïve solution: entity presentation based on class summaries. (Source: yahoo.com)
  7. Problems of class summaries: 1. The patterns are very static and do not reflect the individual particularities of entities. 2. A pattern needs to be created for each type, and class hierarchies need to be considered. 3. Some entities have multiple (distinct) types with no clear main type. 4. Some properties can have many values for which no ranking or cut-off is defined. [Figure labels: Person, Athlete, Body builder; Arnold Schwarzenegger; Angkor Wat]
  8. Entity Summarization. Propositions: Every entity is individual. For different entities, different properties are of importance. Entities of the same type do not always have the same attributes. For each entity, a single property-value pair can be of different relevance. Solution: Focus on the individual particularities of each entity (entity summarization).
  9. Section 2: Research Questions
  10. Challenge #1. RDF data typically does not reflect importance levels in its relations. Proprietary entity summarization systems have access to large amounts of data (e.g., search queries) and infrastructure (e.g., a full Web index). Other knowledge panel providers (such as publishers) lack that information and infrastructure. (Source: google.com) RQ1: How can we effectively summarize entities with limited background information? RQ1.1: How can we use link analysis effectively in order to derive summaries of entities? RQ1.2: How can we use usage data analysis effectively in order to derive summaries of entities?
  11. Challenge #2. Providers of knowledge panels hide the original graph structure behind strongly abstracted interfaces. Standardized programmatic access is desirable but not available. (Sources: google.com, developers.google.com/knowledge-graph) RQ2: Is there a minimum set of recurring/common features of entity summarization systems that allows us to provide a generic API?
  12. Challenge #3. Different Web sources provide structured information about a single entity. The sources often cover similar information but do not provide corresponding links or vocabulary mappings. Alignments are particularly difficult because the sources typically provide data at different levels of modeling granularity. (Sources: imdb.com, wikidata.org) RQ3: How can we align duplicate/similar facts about Linked Data entities on the Web?
  13. Section 3: Contributions
  14. Overview: Research Questions and Contributions. [Overview diagram. Components: Knowledge Base(s); Input: Link Structure, Usage Data; LinkSUM (1); UBES (2); SUMMA API (3); Entity Data Fusion (4); Output: UI.] RQ1: How can we effectively summarize entities with limited background information? RQ1.1: How can we use link analysis effectively in order to derive summaries of entities? (Contribution 1) RQ1.2: How can we use usage data analysis effectively in order to derive summaries of entities? (Contribution 2) RQ2: Is there a minimum set of recurring/common features of entity summarization systems that allows us to provide a generic API? (Contribution 3) RQ3: How can we align duplicate/similar facts about Linked Data entities on the Web? (Contribution 4)
  15. Linked Data Entity Summarization: Contribution 1 (overview diagram repeated).
  16. LinkSUM. Idea: Use link analysis for selecting facts. Step 1: Select the top-k important related resources. Step 2: Select the most relevant connecting predicate.
  17. LinkSUM approach: Resource selection. Example: Pulp Fiction and Quentin Tarantino, connected via "director". Compute PageRank [5] scores (pr) of entities from the (untyped) links that occur in the textual descriptions of entities. Use "backlinks" [7] (also called "mutual links") to find strong connections (bl): [formula not preserved in this transcript]. Combine the scores: [formula not preserved in this transcript]. Example scores: dbpedia:Category:English-language_films 220.961; dbpedia:Quentin_Tarantino 13.7403; dbpedia:John_Travolta 10.5771; dbpedia:Miramax_Films 9.9398; ...
  18. LinkSUM approach: Relation selection. Problem: two resources can be connected by multiple relations (e.g., Quentin Tarantino and Pulp Fiction via "director" and "writer"). Approaches: Frequency (FRQ): the number of times the predicate is used. Exclusivity (EXC): 1 / (N + M). Description (DSC): #domain + #range + #label. Combinations of those are also possible, e.g., FRQ * EXC.
  19. LinkSUM quantitative evaluation: Dataset and measures. Reference dataset: introduced in Gunaratna et al. [3]; contains human-created summaries of 50 entities (DBpedia 3.9, outgoing relations); includes seven top-5 and seven top-10 summaries for each entity; created by 15 experts from the Semantic Web field. Similarity measure: [formula not preserved in this transcript]. Reference system: FACES (introduced in [3]).
  20. LinkSUM quantitative evaluation: Results. [Results table not preserved in this transcript.] Legend: SO: subject-object pairs (predicates not considered). SPO: full triple. SD: standard deviation. Markers in the table denote significance with respect to both LinkSUM configurations (p < 0.05) or with respect to the best LinkSUM configuration (p < 0.05). The definitions of config-1 and config-2 are not preserved in this transcript.
  21. LinkSUM qualitative evaluation: Setup. Scenario: search engine result page (SERP). 20 users, 10 entities (from the FACES dataset).
  22. LinkSUM qualitative evaluation: Results. In some cases the task is subjective. Reasons for selection: the presented related resources are relevant for the entity. Reasons for rejection: redundancy; related resources do not characterize the entity.
  23. Focus: PageRank (1). PageRank is not perfect. For example, the query PREFIX v: <http://purl.org/voc/vrank#> SELECT ?e ?r FROM <http://dbpedia.org> FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank> WHERE { ?e rdf:type dbo:Scientist; v:hasRank/v:rankValue ?r. } ORDER BY DESC(?r) LIMIT 5 returns: dbpedia:Carl_Linnaeus 551.791; dbpedia:Charles_Darwin 215.028; dbpedia:Albert_Einstein 186.549; dbpedia:Isaac_Newton 167.811; dbpedia:Sigmund_Freud 140.245.
  24. Focus: PageRank (2). Important parameters (for resources r): l(r): the set of all pages that link to r. c(r): the number of outgoing links of r. d: the damping factor. Traditional PageRank [5]: [formula not preserved in this transcript]. Variant, Weighted Links Rank (WLRank) [6]: [formula not preserved in this transcript]. Link weights (lw): relative position of a link in the article. [8]
  25. Focus: PageRank (3). Newly constructed rankings: ALL: all links from the article text and from the templates. ATL: article text links. TEL: template links. ATL-RP: article text links with WLRank and relative position. Size of input dataset (number of links): ALL 159,398,815; ATL 142,305,605; TEL 26,460,273; ATL-RP 143,056,545. Reference rankings (page-view-based): TOWR-PV: "The Open Wikipedia Ranking". SUB: SubjectiveEye3D by Paul Houle.
  26. Focus: PageRank (4). Measure: Spearman rank correlation (range: [-1, 1]). Results: [table not preserved in this transcript]. Conclusions: The poor correlation of TEL with TOWR-PV/SUB results from the small input dataset. Weighting by relative position improves the correlation with SUB. These findings are supported by [4].
  27. LinkSUM: Conclusions and impact. Conclusions: LinkSUM significantly outperforms the state of the art (reference system: FACES [3]). In entity summarization, the focus should be on selecting relevant resources, and redundancies at the object level should be avoided. LinkSUM is lightweight and can be applied in other scenarios, e.g., Web sites with semantic annotations or Semantic MediaWikis. Impact: Published and presented as a full research paper at ICWE 2016. The PageRank scores are published online and have found many adopters (e.g., the official DBpedia SPARQL endpoint includes the scores). In use at the WDAqua project (http://wdaqua.eu/).
  28. Linked Data Entity Summarization: Contribution 3 (overview diagram repeated).
  29. SUMMA API. Idea: a common API for entity summaries. Listed on the slide: quantitative evaluation, qualitative evaluation, A/B testing, combination of summary services.
  30. SUMMA API approach: Parameters and vocabulary. Parameters: URI (of the entity e): the entity needs to be identified. k (number): an upper limit of facts related to e. Multi-language support. Statement groups (e.g., biographical data). Restriction to specific properties. Multi-hop search space. SUMMA vocabulary (terms from the diagram): summa:Summary with summa:entity (rdfs:Resource), summa:topK (xsd:positiveInteger), summa:language (xsd:String), summa:fixedProperty (rdf:Property), summa:statement (rdf:Statement), summa:maxHops (xsd:positiveInteger), summa:group (summa:SummaryGroup), and summa:path. [Multi-hop example in the diagram, labels only: PF, starring, _:, actor, JT, role, VV]
  31. SUMMA API approach: RESTful interaction. Client: POST [ a :Summary; :entity dbpedia:Barack_Obama; :topK 10 ] . Server: 201 CREATED, Location: http://example.com/summary?entity=dbpedia:Barack_Obama&topK=10, body: @prefix summa: <http://purl.org/voc/summa/> . ... Client: GET http://example.com/summary?entity=dbpedia:Barack_Obama&topK=10 Server: 200 OK, body: @prefix summa: <http://purl.org/voc/summa/> . ...
  32. SUMMA API analysis: Setup. Question: Can the user interfaces be generated with data from the SUMMA API without changing their layout? Search engines: Google Knowledge Graph, Microsoft Bing Satori/Snapshots, Yahoo Knowledge. News portals (Alexa Top 25 news sites): Forbes, BBC News.
  33. SUMMA API analysis: Criteria. Features: 1. Property restriction 2. Statement groups 3. Multi-hop search space 4. Languages. Five entities: Spain (country), Dirk Nowitzki (person/athlete), Ramones (band), SAP (company/organization), Inglourious Basterds (movie). (Source: http://google.com)
  34. SUMMA API analysis: Results. Which features were required by the respective system? [Results table not preserved in this transcript.]
  35. SUMMA API: Conclusions and impact. Conclusions: A common API decouples the user interface from the actual entity summarization system. The vocabulary and interaction mechanism are lightweight and extensible. Reference implementations and their source code are publicly available. The empirical analysis demonstrates applicability in real-world scenarios. Impact: Published and presented as a full research paper at ICWE 2015. Best Paper Candidate at ICWE 2015. Best Demo Award at ICWE 2016.
  36. Section 4: Related Work
  37. Related Work. Who else is working on this? Google [1], Microsoft, Yahoo, etc. (RDF + lots of background data). Other researchers in the field of the Semantic Web, e.g., Cheng et al. [2] and Gunaratna et al. [3] ((only) RDF data). What distinguishes the presented work from theirs? LinkSUM is a lightweight and effective approach. UBES is the first approach that uses usage data for entity summarization. SUMMA API: first and currently only API definition that enables the exchange of entity summaries. Entity Data Fusion: first approach that focuses on general alignment of structured entity data on the Web.
  38. Section 5: Summary and Outlook
  39. Summary. We provided contributions to Linked Data entity summarization. Impact was created at the levels of research and dataset/system adoption. Combination with entity linking is possible. The addressed problem is highly relevant for search and question answering engines.
  40. Outlook. Full integration of the entity data fusion approach. Addressing literal values. Personalized/contextualized summaries of entities. Abstract entity summarization.
  41. Questions?
  42. Publications. Contribution 1: Andreas Thalhammer, Nelia Lasierra, Achim Rettinger: LinkSUM: Using Link Analysis to Summarize Entity Data. In Web Engineering: 16th International Conference, ICWE 2016, Proceedings, vol. 9671 of Lecture Notes in Computer Science, pages 244–261. Springer, 2016. Andreas Thalhammer and Achim Rettinger: Browsing DBpedia Entities with Summaries. In The Semantic Web: ESWC 2014 Satellite Events, Lecture Notes in Computer Science, pages 511–515. Springer, 2014. Andreas Thalhammer and Achim Rettinger: PageRank on Wikipedia: Towards General Importance Scores for Entities. In The Semantic Web: ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 – June 2, 2016, Revised Selected Papers, pages 227–240. Springer, 2016. Contribution 2: Andreas Thalhammer, Ioan Toma, Antonio J. Roa-Valverde, Dieter Fensel: Leveraging Usage Data for Linked Data Movie Entity Summarization. In Proceedings of the 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD'12), 2012. Andreas Thalhammer, Magnus Knuth, Harald Sack: Evaluating Entity Summarization Using a Game-Based Ground Truth. In International Semantic Web Conference (2), vol. 7650, pages 350–361. Springer, 2012. Contribution 3: Antonio Roa-Valverde, Andreas Thalhammer, Ioan Toma, and Miguel-Angel Sicilia: Towards a formal model for sharing and reusing ranking computations. In Proceedings of the 6th International Workshop on Ranking in Databases, in conjunction with VLDB 2012. Andreas Thalhammer and Steffen Stadtmüller: SUMMA: A Common API for Linked Data Entity Summaries. In P. Cimiano, F. Frasincar, G.-J. Houben, and D. Schwabe, editors, Engineering the Web in the Big Data Era, vol. 9114, pages 430–446. Springer, 2015. Andreas Thalhammer, Achim Rettinger: ELES: Combining Entity Linking and Entity Summarization. In Web Engineering: 16th International Conference, ICWE 2016, Proceedings, vol. 9671 of Lecture Notes in Computer Science, pages 547–550. Springer, 2016. Contribution 4: Andreas Thalhammer, Steffen Thoma, Andreas Harth: Entity-Centric Claim Reconciliation in Web Data. Submitted to WWW 2017.
  43. References. [1] A. Singhal. Introducing the knowledge graph: things, not strings. http://goo.gl/kH1NKq, 2012. [2] G. Cheng, T. Tran, and Y. Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th Int. Conf. on The Semantic Web, Vol. Part I, ISWC'11. Springer, 2011. [3] K. Gunaratna, K. Thirunarayan, and A. P. Sheth. FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In Proc. of the 29th AAAI Conf. on Artificial Intelligence, Austin, Texas, USA, 2015. [4] D. Dimitrov, P. Singer, F. Lemmerich, M. Strohmaier. What Makes a Link Successful on Wikipedia? https://arxiv.org/abs/1611.02508 [5] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 1998. [6] R. Baeza-Yates and E. Davis. Web Page Ranking Using Link Attributes. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. '04, pages 328–329, New York, NY, USA, 2004. ACM. [7] J. Waitelonis and H. Sack. Towards exploratory video search using linked data. Multimedia Tools and Applications, 59:645–672, 2012. 10.1007/s11042-011-0733-1. [8] Drawing by Felipe Micaroni Lalli (micaroni@gmail.com).
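
The two LinkSUM steps from slides 16 to 18 can be illustrated with a minimal Python sketch. The formulas for bl and for the combined score are not preserved in this transcript, so the sketch assumes a linear combination of the normalized PageRank score and a 0/1 backlink indicator; the weight `alpha`, the function names, and the toy data are hypothetical, and the toy scores reuse the numbers from slide 17 purely as sample values. Relation selection is shown with the FRQ measure only.

```python
from collections import Counter

# Toy link structure (who links to whom) and sample scores; all illustrative.
LINKS = {
    "Pulp_Fiction": {"Quentin_Tarantino", "John_Travolta", "Category:English-language_films"},
    "Quentin_Tarantino": {"Pulp_Fiction"},
    "John_Travolta": {"Pulp_Fiction"},
}
PAGERANK = {
    "Category:English-language_films": 220.961,
    "Quentin_Tarantino": 13.7403,
    "John_Travolta": 10.5771,
}
TRIPLES = [
    ("Pulp_Fiction", "director", "Quentin_Tarantino"),
    ("Pulp_Fiction", "writer", "Quentin_Tarantino"),
    ("Reservoir_Dogs", "director", "Quentin_Tarantino"),
    ("Pulp_Fiction", "starring", "John_Travolta"),
]


def select_resources(entity, links, pagerank, k, alpha=0.5):
    """Step 1: rank the resources linked from `entity` and keep the top k.

    Assumed score (not taken from the slides):
    alpha * pr(o) / max_pr + (1 - alpha) * bl(e, o),
    where bl(e, o) is 1 if o links back to e and 0 otherwise.
    """
    candidates = links.get(entity, set())
    if not candidates:
        return []
    max_pr = max(pagerank.get(o, 0.0) for o in candidates) or 1.0

    def score(o):
        backlink = 1.0 if entity in links.get(o, set()) else 0.0
        return alpha * pagerank.get(o, 0.0) / max_pr + (1 - alpha) * backlink

    return sorted(candidates, key=score, reverse=True)[:k]


def select_relation(entity, resource, triples):
    """Step 2: among the predicates connecting the pair, pick the most frequent (FRQ)."""
    frequency = Counter(p for _, p, _ in triples)
    connecting = {p for s, p, o in triples if s == entity and o == resource}
    if not connecting:
        return None
    return max(sorted(connecting), key=lambda p: frequency[p])


top = select_resources("Pulp_Fiction", LINKS, PAGERANK, k=2)
summary = [("Pulp_Fiction", select_relation("Pulp_Fiction", o, TRIPLES), o) for o in top]
```

In this toy setting the category page has by far the highest PageRank score but no backlink, so the mutually linked resources are preferred. This is the kind of effect the backlink signal is meant to have, under the assumed weighting.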

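The PageRank parameters on slide 24 and the Spearman measure on slide 26 can be tried out on a toy graph. This sketch assumes the non-normalized formulation from Brin and Page [5], pr(r) = (1 - d) + d * sum of pr(p) / c(p) over all p in l(r), with a fixed number of iterations; the WLRank link weights are not included because their formula is not preserved in this transcript. The graph, the page-view numbers, and the function names are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate pr(r) = (1 - d) + d * sum(pr(p) / c(p) for p in l(r))."""
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    # l(r): the pages that link to r; c(p): the number of outgoing links of p.
    incoming = {r: [p for p, targets in out_links.items() if r in targets] for r in nodes}
    out_degree = {p: len(out_links.get(p, ())) for p in nodes}
    pr = {r: 1.0 for r in nodes}
    for _ in range(iterations):
        pr = {
            r: (1 - d) + d * sum(pr[p] / out_degree[p] for p in incoming[r])
            for r in nodes
        }
    return pr


def spearman(scores_a, scores_b):
    """Spearman rank correlation of two score dicts over the same keys (no ties)."""
    keys = list(scores_a)

    def ranks(scores):
        ordered = sorted(keys, key=lambda key: scores[key], reverse=True)
        return {key: position for position, key in enumerate(ordered)}

    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(keys)
    squared = sum((rank_a[key] - rank_b[key]) ** 2 for key in keys)
    return 1 - 6 * squared / (n * (n * n - 1))


# Toy graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
scores = pagerank(GRAPH)
page_views = {"A": 100, "B": 10, "C": 500}  # made-up reference ranking
correlation = spearman(scores, page_views)
```

The comparison mirrors the setup on slides 25 and 26 in miniature: a link-based ranking is correlated with a page-view-based reference ranking, and a value close to 1 means the two orderings agree.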
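
The RESTful interaction on slide 31 can be sketched from the client side. The example below only builds the Turtle payload for the POST request and the URL for the follow-up GET request; it performs no network calls. The `summa:` namespace and the `example.com` endpoint are taken from the slide, the DBpedia resource namespace is the standard one, and the function names are hypothetical.

```python
from urllib.parse import urlencode

SUMMA_NS = "http://purl.org/voc/summa/"
DBPEDIA_NS = "http://dbpedia.org/resource/"


def summary_request_body(entity_local_name, top_k):
    """Build the Turtle payload that asks the server for a top-k summary."""
    return (
        f"@prefix summa: <{SUMMA_NS}> .\n"
        f"@prefix dbpedia: <{DBPEDIA_NS}> .\n"
        f"[ a summa:Summary ;\n"
        f"  summa:entity dbpedia:{entity_local_name} ;\n"
        f"  summa:topK {top_k} ] .\n"
    )


def summary_url(base_url, entity, top_k):
    """Build the URL of the created summary resource, as in the Location header."""
    # Keep ':' unescaped so the result matches the form shown on the slide.
    query = urlencode({"entity": entity, "topK": top_k}, safe=":")
    return f"{base_url}?{query}"


body = summary_request_body("Barack_Obama", 10)
url = summary_url("http://example.com/summary", "dbpedia:Barack_Obama", 10)
```

A client would POST `body` to the summarization service, read the `Location` header of the `201 CREATED` response, and then GET that URL to retrieve the summary in the SUMMA vocabulary. Optional parameters such as `summa:language` or `summa:maxHops` could be added to the payload in the same way.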