Kalpa Gunaratna's Ph.D. dissertation defense: April 19 2017
The processing of structured and semi-structured content on the Web has been gaining attention with the rapid progress in the Linking Open Data project and the development of commercial knowledge graphs. Knowledge graphs capture domain-specific or encyclopedic knowledge in the form of a data layer and add rich and explicit semantics on top of the data layer to infer additional knowledge. The data layer of a knowledge graph represents entities and their descriptions. The semantic layer on top of the data layer is called the schema (ontology), where relationships of the entity descriptions, their classes, and the hierarchy of the relationships and classes are defined. Today, there exist large knowledge graphs in the research community (e.g., encyclopedic datasets like DBpedia and Yago) and corporate world (e.g., Google knowledge graph) that encapsulate a large amount of knowledge for human and machine consumption. Typically, they consist of millions of entities and billions of facts describing these entities. While it is good to have this much knowledge available on the Web for consumption, it leads to information overload, and hence proper summarization (and presentation) techniques need to be explored.
In this dissertation, we focus on creating both comprehensive and concise entity summaries at: (i) the single entity level and (ii) the multiple entity level. To summarize a single entity, we propose a novel approach called FACeted Entity Summarization (FACES) that considers importance, which is computed by combining popularity and uniqueness, and diversity of facts getting selected for the summary. We first conceptually group facts using semantic expansion and hierarchical incremental clustering techniques and form facets (i.e., groupings) that go beyond syntactic similarity. Then we rank both the facts and facets using Information Retrieval (IR) ranking techniques to pick the highest ranked facts from these facets for the summary. The important and unique contribution of this approach is that because of its generation of facets, it adds diversity into entity summaries, making them comprehensive. For creating multiple entity summaries, we simultaneously process facts belonging to the given entities using combinatorial optimization techniques. In this process, we maximize diversity and importance of facts within each entity summary and relatedness of facts between the entity summaries. The proposed approach uniquely combines semantic expansion, graph-based relatedness, and combinatorial optimization techniques to generate relatedness-based multi-entity summaries.
Complementing the entity summarization approaches, we introduce a novel approach using light Natural Language Processing (NLP) techniques to enrich knowledge graphs by adding type semantics to literals.