
Improving Topic Modeling with Knowledge Graph Embeddings

Topic modeling techniques have been applied in many scenarios in recent years, spanning textual content as well as many other data sources. Existing research in this field continuously tries to improve the accuracy and coherence of the results. Some recent works propose new methods that bring the semantic relations between words into the topic modeling process, by employing vector embeddings over knowledge bases.

In our recent paper presented at the AAAI Spring Symposium 2019, held at Stanford University, we studied how knowledge graph embeddings affect topic modeling performance on textual content. In particular, the objective of the work is to determine which aspects of knowledge graph embeddings have a significant, positive impact on the accuracy of the extracted topics.
We improve the state of the art by integrating advanced graph embedding approaches (specifically designed for knowledge graphs) into the topic extraction process. We also study how the knowledge base can be expanded with dataset-specific relations between words.
We implemented the method and validated it with a set of experiments covering 2 variants of the knowledge base, 7 embedding methods, and 2 methods for incorporating the embeddings into the topic modeling framework, also considering different parametrizations of topic number and embedding dimensionality.
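To give a concrete flavor of what "incorporating embeddings into a topic modeling framework" can mean, here is a minimal Python sketch of one naive strategy. It is our illustration under stated assumptions, not the paper's exact method; the arrays phi and emb are hypothetical stand-ins for real LDA output and knowledge-graph embeddings.

    import numpy as np

    # Stand-ins: phi is a topic-word matrix from any LDA implementation
    # (topics x vocabulary); emb holds knowledge-graph embeddings for the
    # same vocabulary (vocabulary x dimension).
    rng = np.random.default_rng(0)
    phi = rng.dirichlet(np.ones(1000), size=20)
    emb = rng.normal(size=(1000, 100))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)

    # One naive incorporation strategy: smooth each topic's word
    # distribution with embedding similarity, so that semantically
    # related words reinforce each other within a topic.
    sim = np.maximum(emb @ emb.T, 0.0)                        # clipped cosine similarity
    phi_smoothed = phi @ sim
    phi_smoothed /= phi_smoothed.sum(axis=1, keepdims=True)   # renormalize rows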
Besides the specific technical results, the work also aims at showing the potential of integrating statistical methods with knowledge-centric methods. The full extent of the impact of these techniques remains to be explored in future work.
The details of the work are reported in the paper, available online, and in the slides, transcribed below and also available on SlideShare.







  1. Improving Topic Modeling with Knowledge Graph Embeddings. Marco Brambilla, Birant Altinel. marco.brambilla@polimi.it, @marcobrambi
  2. Formalizing new knowledge is hard. Only high-frequency knowledge emerges: the long-tail challenge.
  3. Key: feature selection. To extract novel knowledge, it is crucial to find the appropriate way to describe the source content. Features can be syntactic (user profiles, tags, hashtags, bag-of-words) or semantic (entity extraction, semantic features on images).
  4. Topics as new features: why not? • Topic Model: a statistical model used to discover the abstract «latent» topics of a given content • Example usage areas include information retrieval, classification, collaborative filtering, and more • The most well-known topic model is LDA (the slide shows its plate notation)
  5. Topic Modeling • Topic modeling has been relatively successful using purely statistical approaches • It is an unsupervised method of representing a corpus as a set of topics (each document as a distribution over topics)
  6. Edwin Chen, "Introduction to Latent Dirichlet Allocation" (2011). Given the sentences: 1. I like to eat broccoli and bananas. 2. I ate a banana and spinach smoothie for breakfast. 3. Chinchillas and kittens are cute. 4. My sister adopted a kitten yesterday. 5. Look at this cute hamster munching on a piece of broccoli. LDA might produce something like: • Sentences 1 and 2: 100% Topic A • Sentences 3 and 4: 100% Topic B • Sentence 5: 60% Topic A, 40% Topic B • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (A runnable LDA sketch on these sentences follows the transcript.)
  7. Objective • Improve the state of the art of topic modeling by integrating embedding methods over knowledge graphs • Explore possible extensions of the knowledge graph to create a better structure for the knowledge embedding process • Further explore the parametrization to clarify the effects of the most relevant parameters on topic modeling
  8. Background (1): Representation Learning • The process of encoding knowledge into low-dimensional vectors • Used for machine learning / deep learning tasks over graphs • Can be supervised or unsupervised • Text embedding is representation learning that encodes textual content into vectors of real numbers • Graph embeddings do the same for network models
  9. Background (2): Embedding Nodes. Find an embedding of nodes into d dimensions so that "similar" nodes in the graph have embeddings that are close together. (The slide shows an input graph mapped to an output embedding space.)
  10. Background (3): Knowledge Graphs • An ontological representation of collected, structured, and organized information as a collective knowledge source; it describes real-world entities and the relations between them • Examples: DBpedia, Freebase, WordNet, Google Knowledge Graph • WordNet is an online lexical database for the English language in which words are linked by semantic relations. (A short WordNet exploration sketch follows the transcript.)
  11. Embedding Methods on Knowledge Graphs • TransE (2013): uses addition as the translation operator • TransH (2014): extends TransE by modeling relations as hyperplanes • DistMult (2014): uses multiplication as the translation operator • PTransE (2015): extends TransE with paths of multiple relations • TransR (2015): extends TransE by creating separate semantic spaces for entities and relations • HolE (2016): uses circular correlation as the translation operator • Analogy (2017): optimizes the representations with the analogical properties of entities and relations. (A TransE scoring sketch follows the transcript.)
  12. Embedding Methods Comparison. The slide shows a table comparing the methods' scoring functions; legend: d = embedding dimension, ne = number of entities, nr = number of relations, h = head entity, r = relation, t = tail entity, wr = vector representation of r, p = path.
  13. Related Work: KGE-LDA • A knowledge-based topic model • Combines LDA with entity embeddings obtained from knowledge graphs using TransE • Proposes two models (Model A and Model B) for incorporating the embeddings into the topic model
  14. Our Experiment • Text corpus: the 20-Newsgroups (20NG) public dataset, with 18,800 documents and 21K distinct words • Graph: WordNet18, with 115K triples, 40K entities, and 18 types of relations
  15. Parameter Exploration and Evaluation • Topic number • Embedding dimension • Topic coherence: a quantitative measure that evaluates topic models by the coherence of their topics • Document classification scores: the accuracy of document classification using the topic model's output features. (A coherence-evaluation sketch follows the transcript.)
  16. Embedding Methods Comparison – (some) Results
  17. Embedding Methods Comparison – (some) Results
  18. Extending the Graph • Dependency relations in sentences constitute meaningful semantics by themselves • The knowledge graph is merged with the dependency graph
  19. Knowledge Graph Extension • The semantic relations in the knowledge graph are merged with the syntactic dependency relations obtained from sentences. (A dependency-parsing sketch follows the transcript.)
  20. Our Experiment (extended graph) • Base graph: WordNet18, with 115K triples, 40K entities (only 9K in common with the dataset vocabulary), and 18 types of relations • Extended graph: 815K triples and 55 types of relations
  21. Knowledge Graph Extension – Results
  22. Further parameter exploration (embedding size = 100)
  23. Execution Time: Embedding Dimension and Topics • The slide reports runtime with respect to topic number, embedding dimension, and incorporation model • The topic number has a higher impact on runtime than the embedding dimension
  24. Conclusions • First attempt to systematically integrate knowledge bases in topic analysis • A content-based approach to extend the knowledge graph, transforming it into a domain-specific network in order to improve the embeddings • Extended parametrization (topic number and embedding dimension). Future work: • Grid search for parameter optimization • Improvement of the knowledge-graph extension process
  25. Thanks! Questions? Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it, http://datascience.deib.polimi.it, http://home.deib.polimi.it/marcobrambi
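The sketches below are referenced from the transcript above. They are illustrative only: minimal Python sketches under stated assumptions, not the implementation used in the paper.

First, the LDA example from slide 6 can be approximated with gensim (assumed here as a stand-in LDA implementation; a real run would also remove stop words and tune the priors):

    from gensim import corpora, models

    docs = [
        "i like to eat broccoli and bananas",
        "i ate a banana and spinach smoothie for breakfast",
        "chinchillas and kittens are cute",
        "my sister adopted a kitten yesterday",
        "look at this cute hamster munching on a piece of broccoli",
    ]
    tokenized = [d.split() for d in docs]
    dictionary = corpora.Dictionary(tokenized)
    bows = [dictionary.doc2bow(t) for t in tokenized]

    # Two topics, as in the slide; passes and random_state keep results
    # on this tiny corpus reasonably stable across runs.
    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                          passes=50, random_state=0)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)               # top words per topic
    for bow in bows:
        print(lda.get_document_topics(bow))  # per-document topic mixture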
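Slide 10 mentions WordNet as a knowledge source. A minimal sketch with NLTK (assuming the nltk package and its WordNet data are available) shows the kind of semantic relations that become knowledge-graph triples such as those in WordNet18:

    import nltk
    nltk.download("wordnet", quiet=True)  # one-time data download
    from nltk.corpus import wordnet as wn

    # Relations such as hypernymy yield triples like (dog, hypernym, canine);
    # edges of this kind are what a WordNet-derived graph is built from.
    dog = wn.synsets("dog")[0]
    print(dog.definition())
    print([h.name() for h in dog.hypernyms()])  # e.g. ['canine.n.02', 'domestic_animal.n.01']
    print([l.name() for l in dog.lemmas()])     # synonym lemmas in the synset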
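Slide 11 lists TransE, whose translation idea is simple enough to show directly: if a triple (h, r, t) holds, the embeddings should satisfy h + r ≈ t. A minimal numpy sketch of the scoring function (not a full training loop):

    import numpy as np

    def transe_score(h, r, t):
        # TransE plausibility: a smaller distance between h + r and t
        # means the triple is more likely to hold.
        return np.linalg.norm(h + r - t)  # L2 norm; L1 is also common

    rng = np.random.default_rng(0)
    d = 50
    h, r = rng.normal(size=d), rng.normal(size=d)
    t_true = h + r + 0.01 * rng.normal(size=d)   # a triple that "holds"
    t_corrupt = rng.normal(size=d)               # a corrupted tail entity
    assert transe_score(h, r, t_true) < transe_score(h, r, t_corrupt)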
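Slide 15 uses topic coherence as an evaluation measure. With gensim, coherence can be computed directly from a trained model, reusing lda, tokenized, and dictionary from the LDA sketch above; c_v is one common coherence measure, not necessarily the one used in the paper:

    from gensim.models import CoherenceModel

    cm = CoherenceModel(model=lda, texts=tokenized,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())  # higher is better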
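Slides 18 and 19 extend the knowledge graph with syntactic dependency relations. A minimal spaCy sketch (assuming the en_core_web_sm model is installed) shows how each sentence yields (head, relation, dependent) triples that can be merged with WordNet's semantic ones:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("My sister adopted a kitten yesterday.")

    # Every dependency arc is a candidate graph edge between two words.
    for tok in doc:
        if tok.dep_ != "ROOT":
            print((tok.head.text, tok.dep_, tok.text))
    # e.g. ('adopted', 'nsubj', 'sister'), ('adopted', 'dobj', 'kitten'), ...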
