Summarizing Semantic Data

Tutorial presented at CCKS'16, Beijing, 20/09/2016.

Published in: Science

  1. 1. Summarizing Semantic Data Gong Cheng National Key Laboratory for Novel Software Technology Nanjing University, China Websoft
  2. 2. What is semantic data? • Entity • Class • Property • Attribute • Relation
  3. 3. What is semantic data? • Entity • Class • Property • Attribute • Relation Datasets
  4. 4. Semantic datasets on the Web
  5. 5. What is semantic data summarization? Why? 1. Summarizing entity descriptions (a.k.a. entity summarization)
  6. 6. What is semantic data summarization? Why? 2. Summarizing entity associations [Figure: example graph connecting Alice and Bob through article-A, paper-A to paper-D, AAAI, and IJCAI, with edges labeled firstAuthor, secondAuthor, reviewer, chair, inProcOf, cites, and extends]
  7. 7. What is semantic data summarization? Why? 3. Summarizing semantic datasets
  8. 8. Two types of summaries • Extractive methods • summary = a subset of data • summarization = ranking and selection • Abstractive methods (a.k.a. non-extractive methods) • summary = a high-level abstraction of data • summarization = a more complex process
  9. 9. Outline of this talk • Summarizing entity descriptions • Summarizing entity associations • Summarizing semantic datasets • Summarizing ontologies (if time permits)
  10. 10. Outline of this talk • Summarizing entity descriptions • Summarizing entity associations • Summarizing semantic datasets
  11. 11. Summarizing entity descriptions • Extractive methods (summary = a subset of property-value pairs) • Metrics for ranking property-value pairs • Intrinsic metrics • Extrinsic metrics • Structures for combining metrics • Abstractive methods • Not known yet [Table: example entity description, as property: value pairs] name: Leonardo da Vinci; type: Person; type: Artist; dateOfBirth: 1452-04-15; creates: Mona Lisa; creates: Lady with an Ermine; knownFor: Mona Lisa; influenced: Richard Feynman; …
  12. 12. Intrinsic metrics 1. Frequency 2. Centrality 3. Informativeness 4. Diversity
  13. 13. Intrinsic metrics (1): frequency • Frequency of property • Frequency in the dataset • Frequency among entities of the same type • Frequency in this entity description • Frequency in the ontology (i.e., richness of definition) [Table: the Leonardo da Vinci entity description, shown next to another description that uses influenced and one that uses type: Artist and creates]
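
The property-frequency metrics on this slide can be illustrated with a minimal sketch. The toy dataset, the entity names other than Leonardo da Vinci, and both function names are assumptions made for illustration; the slides do not prescribe an implementation.

```python
from collections import Counter

# Hypothetical toy dataset: entity -> list of (property, value) pairs.
DATASET = {
    "Leonardo da Vinci": [
        ("name", "Leonardo da Vinci"), ("type", "Person"), ("type", "Artist"),
        ("dateOfBirth", "1452-04-15"), ("creates", "Mona Lisa"),
        ("creates", "Lady with an Ermine"), ("knownFor", "Mona Lisa"),
    ],
    "Vincent van Gogh": [
        ("name", "Vincent van Gogh"), ("type", "Person"), ("type", "Artist"),
        ("creates", "The Starry Night"),
    ],
    "Richard Feynman": [
        ("name", "Richard Feynman"), ("type", "Person"), ("type", "Scientist"),
    ],
}


def property_frequency_in_dataset(dataset):
    """Count, for each property, how many entities use it at least once."""
    counts = Counter()
    for pairs in dataset.values():
        counts.update({prop for prop, _ in pairs})
    return counts


def property_frequency_in_description(pairs):
    """Count how many times each property occurs in one entity description."""
    return Counter(prop for prop, _ in pairs)


dataset_freq = property_frequency_in_dataset(DATASET)
local_freq = property_frequency_in_description(DATASET["Leonardo da Vinci"])
```

Whether a frequent property should rank higher or lower depends on the summarization goal, so the counts are left unnormalized here.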
  14. 14. Intrinsic metrics (1): frequency • Frequency of property value • Frequency in the dataset (note: entities in text) • Frequency in this entity description (note: indirect relations may also be counted) [Table: the Leonardo da Vinci entity description, shown next to other descriptions whose values mention Mona Lisa and Lady with an Ermine]
  15. 15. Intrinsic metrics (1): frequency • Frequency of property-value pair • Frequency among similar entities • Frequency in the dataset (why not?) [Table: the Leonardo da Vinci entity description, shown next to a similar entity that shares type: Artist and influenced: Richard Feynman]
  16. 16. Intrinsic metrics (2): centrality • Centrality of property value • Within the dataset: (weighted) PageRank • On the Web: authority of datasets referencing it [Table: the Leonardo da Vinci entity description]
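
As a rough illustration of value centrality within a dataset, the sketch below runs plain (unweighted) PageRank by power iteration. The graph, the "Louvre" node, and the parameter defaults are assumptions; the weighted variants mentioned on the slides are not reproduced here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict: node -> list of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: an edge points from an entity to a value it references.
GRAPH = {
    "Leonardo da Vinci": ["Mona Lisa", "Lady with an Ermine"],
    "Louvre": ["Mona Lisa"],
    "Mona Lisa": [],
    "Lady with an Ermine": [],
}
scores = pagerank(GRAPH)
```

In this toy graph "Mona Lisa" is referenced by two entities and therefore scores higher than "Lady with an Ermine".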
  17. 17. Intrinsic metrics (2): centrality • Centrality of property-value pair • PageRank, weighted by inverse Google distance [Cheng et al., ISWC’11] [Table: the Leonardo da Vinci entity description] [Figure: graph whose nodes are property-value pairs such as name: Leonardo da Vinci, type: Person, creates: Mona Lisa, …]
  18. 18. Intrinsic metrics (3): informativeness • Informativeness of property-value pair • Self-information of property-value pair [Cheng et al., ISWC’11] • Depth of class [Table: the Leonardo da Vinci entity description, shown next to a description with type: Person and type: Scientist] [Figure: class hierarchy over Person, Artist, and Scientist]
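
Self-information can be sketched as the negative log of how often a property-value pair occurs across entity descriptions. The estimate below (fraction of descriptions containing the pair) and the toy data are assumptions for illustration, not necessarily the exact estimator of the cited paper.

```python
import math


def self_information(pair, descriptions):
    """-log2 of the fraction of entity descriptions that contain the pair."""
    containing = sum(1 for pairs in descriptions if pair in pairs)
    if containing == 0:
        raise ValueError("pair does not occur in any description")
    return -math.log2(containing / len(descriptions))


# Hypothetical descriptions, each a set of (property, value) pairs.
DESCRIPTIONS = [
    {("type", "Person"), ("type", "Artist"), ("creates", "Mona Lisa")},
    {("type", "Person"), ("type", "Artist")},
    {("type", "Person"), ("type", "Scientist")},
    {("type", "Person")},
]
common = self_information(("type", "Person"), DESCRIPTIONS)        # in 4 of 4
rarer = self_information(("type", "Artist"), DESCRIPTIONS)         # in 2 of 4
rarest = self_information(("creates", "Mona Lisa"), DESCRIPTIONS)  # in 1 of 4
```

A pair shared by every entity carries no information (0 bits), while a pair held by one entity in four carries 2 bits.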
  19. 19. Intrinsic metrics (4): diversity • Diversity of properties • To avoid common properties • To avoid properties having similar values [Table: the Leonardo da Vinci entity description]
  20. 20. Intrinsic metrics (4): diversity • Diversity of property-value pairs [Cheng et al., JoWS’15, WWW’15] • Similarity between text: string-based, word-based • Similarity between numbers • Semantic similarity: reasoning-based [Table: the Leonardo da Vinci entity description] [Figure: class hierarchy over Person, Artist, and Scientist] type:Artist ⇒ type:Person
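
One way to picture diversity is a greedy pass over a ranked list that skips pairs whose values resemble an already selected value. The sketch below uses word-based Jaccard similarity; the greedy strategy, the threshold, and the function names are assumptions and differ from the optimization-based formulations in the cited papers.

```python
def word_jaccard(a, b):
    """Word-based Jaccard similarity between two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_diverse(ranked_pairs, k, threshold=0.5):
    """Walk the ranking and skip pairs whose value resembles a selected value."""
    selected = []
    for prop, value in ranked_pairs:
        if len(selected) >= k:
            break
        if all(word_jaccard(value, v) < threshold for _, v in selected):
            selected.append((prop, value))
    return selected


# Hypothetical ranking, best first.
RANKED = [
    ("creates", "Mona Lisa"),
    ("knownFor", "Mona Lisa"),
    ("creates", "Lady with an Ermine"),
    ("type", "Artist"),
]
summary = select_diverse(RANKED, k=2)
```

Here "knownFor: Mona Lisa" is skipped because its value duplicates "creates: Mona Lisa", so the second slot goes to "creates: Lady with an Ermine".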
  21. 21. Extrinsic metrics 1. Using external knowledge 2. Context-based
  22. 22. Extrinsic metrics (1): using external knowledge • Using domain knowledge • Certain properties are known to be important. • Using indicators on the Web • Search engine hits • Bidirectional links in Wikipedia • Using user feedback • User clicks [Table: the Leonardo da Vinci entity description]
  23. 23. Extrinsic metrics (2): context-based • Entity search results • context = query • solution: query relevance [Cheng et al., IJSWIS’09]
  24. 24. Extrinsic metrics (2): context-based • Entities in a document • context = contents of the document • solution: Class Vector Model [Cheng et al., WWW’15] [Table: the Leonardo da Vinci entity description] Example document: “… The Starry Night, from MoMA’s collection, reminds us of some work painted by Leonardo da Vinci. ...” [Figure: The Starry Night has type: Painting, so vector(context) = {Painting}; property-value pairs of the entity carry class vectors such as {Painting} and {Artist}]
  25. 25. Extrinsic metrics (2): context-based • Co-summarization • context = other entities • solution: • difference from other entities[Cheng et al., WWW’15] (for entity linking) • similarity with other entities[Cheng et al., JoWS’15] (for entity coreference resolution)
  26. 26. Structures for combining metrics 1. Result combination [Figure: five candidate items ranked separately by Metric A, Metric B, and Metric C (rankings as extracted: 5 1 3 2 4; 5 2 4 1 3; 5 1 2 4 3), then merged into one summary]
  27. 27. Structures for combining metrics 1. Result combination (cont.) Ranked by Metric A Ties broken by Metric B
  28. 28. Structures for combining metrics 2. Arithmetic combination α*MetricA + β*MetricB
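
The two simplest combination structures, ranking by Metric A with ties broken by Metric B and the weighted sum of both metrics, can be sketched as follows. The metric values, weights, and function names are made up for illustration.

```python
def rank_with_tie_break(items, metric_a, metric_b):
    """Rank by Metric A (descending); break ties with Metric B (descending)."""
    return sorted(items, key=lambda x: (-metric_a[x], -metric_b[x]))


def rank_by_weighted_sum(items, metric_a, metric_b, alpha=0.5, beta=0.5):
    """Rank by the arithmetic combination alpha*MetricA + beta*MetricB."""
    return sorted(
        items,
        key=lambda x: alpha * metric_a[x] + beta * metric_b[x],
        reverse=True,
    )


# Hypothetical scores for four properties.
ITEMS = ["name", "type", "creates", "dateOfBirth"]
METRIC_A = {"name": 3, "type": 3, "creates": 2, "dateOfBirth": 1}
METRIC_B = {"name": 0.2, "type": 0.9, "creates": 0.8, "dateOfBirth": 0.1}

tie_broken = rank_with_tie_break(ITEMS, METRIC_A, METRIC_B)
weighted = rank_by_weighted_sum(ITEMS, METRIC_A, METRIC_B, alpha=0.1, beta=0.9)
```

With tie-breaking, Metric B only matters between "name" and "type"; with the weighted sum and a large beta, it can move "creates" above "name". In practice the two metrics would usually be normalized to a common scale before being summed.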
  29. 29. Structures for combining metrics • e.g., combinatorial optimization • Quadratic Knapsack Problem [Cheng et al., JoWS’15] • Quadratic Multidimensional Knapsack Problem [Cheng et al., WWW’15] [Figure labels: length constraint; similarity with and difference from other entities; inverse similarity; diagonal: informativeness; one entity; the other entity]
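
To make the knapsack view concrete, the sketch below solves a tiny quadratic knapsack instance by exhaustive search. It assumes a profit matrix whose diagonal holds each pair's informativeness and whose off-diagonal entries reward dissimilar pairs, plus a character-length budget; these numbers, the matrix layout, and the brute-force solver are illustrative assumptions, not the algorithm of the cited papers.

```python
from itertools import combinations


def best_summary(profit, length, capacity):
    """Exhaustively solve a tiny quadratic knapsack instance.

    profit[i][i] is the individual profit of item i; profit[i][j] (i < j)
    is the extra profit earned when items i and j are selected together.
    """
    n = len(length)
    best_value, best_subset = 0.0, ()
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if sum(length[i] for i in subset) > capacity:
                continue  # violates the length constraint
            value = sum(profit[i][i] for i in subset)
            value += sum(profit[i][j] for i, j in combinations(subset, 2))
            if value > best_value:
                best_value, best_subset = value, subset
    return best_subset, best_value


PAIRS = ["type: Artist", "creates: Mona Lisa", "knownFor: Mona Lisa"]
LENGTH = [len(p) for p in PAIRS]  # 12, 18, 19 characters
PROFIT = [
    [2.0, 1.0, 1.0],  # hypothetical informativeness and pairwise bonuses
    [0.0, 3.0, 0.1],  # the two "Mona Lisa" pairs are similar: small bonus
    [0.0, 0.0, 2.5],
]
chosen, value = best_summary(PROFIT, LENGTH, capacity=40)
```

Although the two "Mona Lisa" pairs have the highest individual profits and fit the budget together, their low joint bonus makes "type: Artist" plus "creates: Mona Lisa" the better summary. Exhaustive search is exponential and only suitable for toy inputs.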
  30. 30. Structures for combining metrics • e.g., weighted PageRank [Cheng et al., ISWC’11] [Table: the Leonardo da Vinci entity description] [Figure: graph of property-value pairs (name: Leonardo da Vinci, type: Person, creates: Mona Lisa, …); labels: probability of jumping, probability of following edges, inverse Google distance, informativeness]
  31. 31. Structures for combining metrics 3. Machine Learning • Decision trees • Linear regression
  32. 32. Structures for combining metrics 4. Complex combinations • Result combination + arithmetic combination • Machine learning + arithmetic combination
  33. 33. Outline of this talk • Summarizing entity descriptions • Summarizing entity associations • Summarizing semantic datasets
  34. 34. Summarizing entity associations • Extractive methods • Finding and ranking associations between two entities (summary = a subset of paths) • Path finding and filtering • Intrinsic and extrinsic metrics for ranking paths • Structures for combining metrics • Finding and ranking associations between multiple entities (summary = a subset of subgraphs) • Abstractive methods • Ranking association patterns • Hierarchically organizing association patterns [Figure: the Alice-Bob example graph]
  35. 35. Finding associations between two entities • Path finding • Dijkstra or A* • Bidirectional breadth-first search (bi-BFS): O(Δ^d) → O(Δ^(d/2)) • Schema-based performance optimization [Figure: the Alice-Bob example graph and its schema, with classes Paper, Person, Conference and relations inProcOf, cites, extends]
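
Bidirectional BFS can be sketched as two level-by-level searches, one from each entity, that stop when they touch; each side then only needs to reach about half the path length. The undirected edge list below is a hypothetical simplification of the example graph, and the function returns a single shortest path rather than all associations.

```python
from collections import deque


def bidirectional_bfs(adjacency, source, target):
    """Return one shortest path between source and target, or None."""
    if source == target:
        return [source]
    parents_fwd, parents_bwd = {source: None}, {target: None}
    frontier_fwd, frontier_bwd = deque([source]), deque([target])

    def expand(frontier, parents, other_parents):
        # Expand one full BFS level; return a meeting node if the searches touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    frontier.append(neighbor)
        return None

    while frontier_fwd and frontier_bwd:
        # Grow the smaller frontier to keep the two searches balanced.
        if len(frontier_fwd) <= len(frontier_bwd):
            meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        else:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:  # walk forward to the target
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


# Hypothetical undirected edges; relation labels are omitted for brevity.
EDGES = [
    ("Alice", "paper-A"), ("paper-A", "AAAI"), ("AAAI", "Bob"),
    ("Alice", "paper-B"), ("paper-B", "paper-C"),
    ("paper-C", "paper-D"), ("paper-D", "Bob"),
]
ADJACENCY = {}
for u, v in EDGES:
    ADJACENCY.setdefault(u, []).append(v)
    ADJACENCY.setdefault(v, []).append(u)

path = bidirectional_bfs(ADJACENCY, "Alice", "Bob")
```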
  36. 36. Finding associations between two entities • Path filtering • By length • By entities, classes, relations • By keywords [Figure: the Alice-Bob example graph]
  37. 37. Ranking associations between two entities • Intrinsic metrics • Frequency • Centrality • Informativeness • Diversity • Length • Conformity • Extrinsic metrics • Using external knowledge • Context-based • Structures for combining metrics
  38. 38. Intrinsic metrics: frequency, centrality, diversity, length • Property frequency • Degree centrality • Diverse relations • Length [Figure: the Alice-Bob example graph]
  39. 39. Intrinsic metrics: informativeness • Informativeness • Data-based informativeness: inverse relation frequency • Schema-based informativeness: depth of class/relation [Figure: the Alice-Bob example graph]
  40. 40. Intrinsic metrics: conformity • Conformity to schema [Figure: the Alice-Bob example graph and its schema, with classes Paper, Person, Conference and relations inProcOf, cites, extends]
  41. 41. Extrinsic metrics • Using external knowledge • Explicit: user-defined weights • Implicit: user’s Web browsing history • Context-based • Query relevance [Figure: the Alice-Bob example graph]
  42. 42. Finding and ranking associations between multiple entities • association = a size-constrained connected subgraph (size = number of other entities) 3 associations via 2 other entities
  43. 43. Finding and ranking associations between multiple entities • association = a size-constrained connected subgraph (size = diameter)[Cheng et al., ISWC’16] 3 associations having a diameter of 3
  44. 44. Finding and ranking associations between multiple entities • Subgraph finding • n-directional breadth-first search • Distance-based performance optimization[Cheng et al., ISWC’16]
  45. 45. Finding and ranking associations between multiple entities • Subgraph ranking (based on entity ranking) • PageRank • Query relevance • Number of short paths • Random walk with restart
  46. 46. Finding and ranking associations between multiple entities • association = a Steiner tree (size-unconstrained, weight-minimized)
  47. 47. Abstractive methods • Association pattern [Cheng et al., ISWC’14] [Figure: two associations (paper-A, secondAuthor, inProcOf, conf-A, reviewer; paper-B, firstAuthor, inProcOf, conf-B, chair) abstracted into one pattern (Paper, author, inProcOf, Conference, role)]
  48. 48. Abstractive methods • Association pattern [Cheng et al., ISWC’16] [Figure: associations and the patterns that abstract them]
  49. 49. Ranking association patterns • Metrics • Frequency • Informativeness • Diversity • Structures for combining metrics [Figure: example pattern over Paper, author, inProcOf, Conference, role]
  50. 50. Metrics: frequency • frequency = occurrences of canonical code [Cheng et al., ISWC’16] [Figure: two patterns compared for isomorphism; canonical code as extracted: eq 1 r1 C1 r2 C2 r3 eq 2 $ r4 eq 3 $$$$ (when T=e)]
  51. 51. Metrics: frequency • frequency = occurrences of canonical code[Cheng et al., ISWC’16] ? Solution: using query entities as proxies for classes to be ordered
  52. 52. Hierarchically organizing association patterns • subClassOf/subPropertyOf → subPatternOf [Zhang et al., JIST’13] [Figure: the pattern (Paper, author, inProcOf, Conference, role) shown with the patterns (Demo, author, inProcOf, Conference, reviewer) and (Poster, author, inProcOf, Conference, chair)]
  53. 53. Outline of this talk • Summarizing entity descriptions • Summarizing entity associations • Summarizing semantic datasets
  54. 54. Summarizing semantic datasets • Extractive methods (summary = a subset of triples) • Centrality • Abstractive methods 1. Inferred schema 2. Flat partitioning 3. Hierarchical grouping
  55. 55. Extractive methods • Triple ranking (based on entity ranking) • Centrality: degree, PageRank [Figure: the Alice-Bob example graph]
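
Triple ranking based on entity ranking can be sketched with degree centrality: score each triple by the degrees of its subject and object and keep the top k. The triples, the additive scoring rule, and the function name are assumptions for illustration.

```python
from collections import Counter


def rank_triples_by_degree(triples, k):
    """Score each triple by the summed degree of its subject and object."""
    degree = Counter()
    for subject, _, obj in triples:
        degree[subject] += 1
        degree[obj] += 1
    ranked = sorted(
        triples, key=lambda t: degree[t[0]] + degree[t[2]], reverse=True
    )
    return ranked[:k]


# Hypothetical triples loosely based on the example graph.
TRIPLES = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-A", "inProcOf", "AAAI"),
    ("paper-A", "secondAuthor", "Bob"),
    ("paper-B", "cites", "paper-A"),
    ("paper-C", "extends", "paper-B"),
]
summary = rank_triples_by_degree(TRIPLES, k=2)
```

Ties keep their input order because Python's sort is stable; swapping the degree counts for PageRank scores would give the other variant named on the slide.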
  56. 56. Abstractive methods (1): inferred schema • summary = a graph-structured (sub-)schema inferred from data (grouping entities by classes) [Figure: the Alice-Bob example graph and its schema, with classes Paper, Person, Conference and relations inProcOf, cites, extends]
  57. 57. Abstractive methods (1): inferred schema • Metrics for ranking classes and properties • Frequency • Centrality [Figure: the Alice-Bob example graph and its schema, with classes Paper, Person, Conference and relations inProcOf, cites, extends]
  58. 58. Abstractive methods (2): flat partitioning • summary = entity partitions connected by relations • partitioning by shared classes (= inferred schema) • partitioning by shared attributes • partitioning by shared paths (a.k.a. bisimulation) [Figure: the Alice-Bob example graph and its schema, with classes Paper, Person, Conference and relations inProcOf, cites, extends]
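
Partitioning by shared classes can be sketched by grouping entities on their exact set of classes and lifting every relation to the groups it connects. The type assignments, the relations, and the function name below are hypothetical.

```python
from collections import defaultdict


def partition_by_classes(types, relations):
    """Group entities by their exact set of classes and lift relations to groups."""
    block_of = {entity: frozenset(classes) for entity, classes in types.items()}
    blocks = defaultdict(set)
    for entity, block in block_of.items():
        blocks[block].add(entity)
    # One summary edge per distinct (source block, relation, target block).
    summary_edges = {
        (block_of[s], p, block_of[o])
        for s, p, o in relations
        if s in block_of and o in block_of
    }
    return dict(blocks), summary_edges


TYPES = {
    "Alice": {"Person"}, "Bob": {"Person"},
    "paper-A": {"Paper"}, "paper-B": {"Paper"},
    "AAAI": {"Conference"},
}
RELATIONS = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-A", "inProcOf", "AAAI"),
    ("paper-B", "cites", "paper-A"),
    ("paper-B", "firstAuthor", "Bob"),
]
blocks, edges = partition_by_classes(TYPES, RELATIONS)
```

Four data-level relations collapse into three summary edges, since both firstAuthor triples connect the Paper group to the Person group. Partitioning by shared attributes would follow the same shape with a different grouping key.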
  59. 59. Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16] • summary = a hierarchical grouping of entities • identified by property-value pairs • connected by relations [Figure: a hierarchical grouping of entities, with relations connecting sibling groups]
  60. 60. Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16] • Metrics for choosing groups (i.e., property-value pairs) • Coverage of data → large subgroups • Height of hierarchy → moderate-sized subgroups • Cohesion within groups → informative property-value pairs • Overlap between groups → controllable overlap • Homogeneity of groups → different values of the same property [Figure: a hierarchical grouping of entities, with relations connecting sibling groups]
  61. 61. Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16] • Combining metrics by combinatorial optimization (formulated as a multidimensional knapsack problem) • maximizing moderateness of each subgroup • maximizing cohesion within each subgroup • disallowing large overlap between subgroups • selecting ≤k subgroups • (optionally) disallowing different properties
  62. 62. Concluding remarks • Research • More application scenarios are to be identified. • New applications may promote new metrics. • More benchmarks are needed for evaluation. • Practice • Handy tools for semantic data summarization are missing. The 2016 ENtity Summarization Evaluation Campaign (ENSEC 2016) http://km.aifb.kit.edu/ws/sumpre2016/challenge.html
  63. 63. Papers on summarizing entity descriptions • Gong Cheng, Danyun Xu, Yuzhong Qu. Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking. (WWW'15) • Gong Cheng, Danyun Xu, Yuzhong Qu. C3D+P: A Summarization Method for Interactive Entity Resolution. (JoWS’15) • Gong Cheng, Thanh Tran, Yuzhong Qu. RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. (ISWC’11) • Gong Cheng, Yuzhong Qu. Searching Linked Objects with Falcons: Approach, Implementation and Evaluation. (IJSWIS’09)
  64. 64. Papers on summarizing entity associations • Gong Cheng, Daxin Liu, Yuzhong Qu. Efficient Algorithms for Association Finding and Frequent Association Pattern Mining. (ISWC'16) • Gong Cheng, Yanan Zhang, Yuzhong Qu. Explass: Exploring Associations between Entities via Top-K Ontological Patterns and Facets. (ISWC’14) • Yanan Zhang, Gong Cheng, Yuzhong Qu. Towards Exploratory Relationship Search: A Clustering-based Approach (JIST’13)
  65. 65. Papers on summarizing semantic datasets • Gong Cheng, Cheng Jin, Yuzhong Qu. HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization. (IJCAI’16)
  66. 66. Ontology • Terms • Publication • Paper • Conference • title • inProc • Term descriptions • SubClassOf(Paper, Publication) • SubClassOf(Paper, DataExactCardinality(1, title)) • ObjectPropertyDomain(inProc, Paper) • ObjectPropertyRange(inProc, Conference)
  67. 67. Summarizing ontologies: an application
  68. 68. Summarizing ontologies • Extractive methods 1. Ranking terms (summary = a subset of terms) 2. Ranking term descriptions (summary = a subset of term descriptions) 3. Ranking subgraphs (summary = a subgraph) • Abstractive methods • Not known yet
  69. 69. Extractive methods (1): ranking terms • Intrinsic metrics 1. Frequency 2. Centrality 3. Diversity 4. Simplicity • Extrinsic metrics 1. Using external knowledge 2. Context-based
  70. 70. Intrinsic metrics (1): frequency • Schema-based frequency • Data-based frequency
  71. 71. Intrinsic metrics (2): centrality • Middleness in the hierarchy • Degree • Betweenness • PageRank [Figure: term graph over Paper, Publication, title, inProc, Conference; class hierarchy over Publication, Paper, Book, Article, Poster]
  72. 72. Intrinsic metrics (3): diversity • Coverage of hierarchy [Figure: class hierarchy over Publication, Paper, Book, Article, Poster]
  73. 73. Intrinsic metrics (4): simplicity • Number of words in the name of a term Paper vs. PaperPublishedAtCCKS2016
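
Counting the words in a term name can be sketched with a regular expression that splits camel-case identifiers. The tokenization rule (capitalized words, lowercase runs, acronyms, digit runs) is an assumption; the slides do not specify one.

```python
import re

# Acronyms, capitalized words, lowercase runs, and digit runs each count as a word.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def words_in_name(term):
    """Split a camel-case term name into its words."""
    return _WORD.findall(term)


def rank_by_simplicity(terms):
    """Order terms from simplest (fewest words) to most complex."""
    return sorted(terms, key=lambda t: len(words_in_name(t)))


simple = words_in_name("Paper")
complex_name = words_in_name("PaperPublishedAtCCKS2016")
ordered = rank_by_simplicity(["PaperPublishedAtCCKS2016", "inProc", "Paper"])
```

Under this rule "Paper" has one word and "PaperPublishedAtCCKS2016" has five, so the shorter name ranks as simpler.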
  74. 74. Extrinsic metrics • Using external knowledge • Search engine hits • Personalization (e.g., spreading activation) • Context-based • Query relevance [Figure: term graph over Paper, Publication, title, inProc, Conference]
  75. 75. Extractive methods (2): ranking term descriptions • Graph representation of term descriptions 1. Description graph 2. Term-description graph • Ranking term descriptions • Intrinsic metrics • Extrinsic metrics
  76. 76. Graph representation (1): description graph [Zhang et al., WWW’07] [Figure: graph whose nodes are the term descriptions] SubClassOf(Paper, Publication) SubClassOf(Paper, DataExactCardinality(1, title)) ObjectPropertyDomain(inProc, Paper) ObjectPropertyRange(inProc, Conference)
  77. 77. Graph representation (2): term-description graph [Zhang et al., JCST’09; Cheng et al., JIST’11] [Figure: graph connecting the term descriptions to the terms Paper, Publication, title, inProc, Conference] SubClassOf(Paper, Publication) SubClassOf(Paper, DataExactCardinality(1, title)) ObjectPropertyDomain(inProc, Paper) ObjectPropertyRange(inProc, Conference)
  78. 78. Ranking term descriptions • Intrinsic metrics • Frequency • Centrality • Diversity • Cohesion/coherence • Extrinsic metrics • Query relevance SubClassOf(Paper, Publication) SubClassOf(Paper, DataExactCardinality(1, title)) ObjectPropertyDomain(inProc, Paper) ObjectPropertyRange(inProc, Conference)
  79. 79. Papers on summarizing ontologies • Weiyi Ge, Gong Cheng, Huiying Li, Yuzhong Qu. Incorporating Compactness to Generate Term-association View Snippets for Ontology Search. (IP&M’13) • Gong Cheng, Feng Ji, Shengmei Luo, Weiyi Ge, Yuzhong Qu. BipRank: Ranking and Summarizing RDF Vocabulary Descriptions. (JIST’11) • Xiang Zhang, Gong Cheng, Weiyi Ge, Yuzhong Qu. Summarizing Vocabularies in the Global Semantic Web. (JCST’09) • Xiang Zhang, Gong Cheng, Yuzhong Qu. Ontology Summarization Based on RDF Sentence Graph. (WWW’07)
