
Text Mining using LDA with Context

We use various kinds of metadata to improve and enrich text document clustering with an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented and evaluated, and the software is available on GitHub.

These are the slides of an invited talk I gave on September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/

Published in: Science

  1. Text Mining Using LDA with Context. Christoph Kling, Steffen Staab. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany & Web and Internet Science Group · ECS · University of Southampton, UK
  2. Text Mining Documents (Text Mining Using LDA with Context, slide 2/68, Steffen Staab)
     - Documents are PDFs, emails, tweets, Flickr photo tags, CVs, ...
     - Documents consist of a bag of words and metadata: author(s), timestamp, geolocation, publisher, booktitle, device, ...
     - Example topics: Chinese food (dimsum, duck, eggs, ...), Vegan food (vegan, tofu, ...), Breakfast (eggs, ham, ...)
     - Objective: cluster, categorize, and explain
  3. Latent Dirichlet Allocation (LDA)
  4. Latent Dirichlet Allocation (LDA)
     - Document-topic distributions
     - Topic-word distributions
     - K topics, M documents
     - Each document m ∈ {1, ..., M} has length N_m
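The quantities named on this slide are those of the standard LDA generative process. The following is a minimal sketch in plain Python, assuming symmetric Dirichlet priors. The function names and the hyperparameter values `alpha` and `beta` are illustrative and do not come from the slides.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_topics, vocab_size, doc_lengths, alpha=0.5, beta=0.1, seed=0):
    """Generate word ids for M = len(doc_lengths) documents under LDA.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Topic-word distributions: one distribution over the vocabulary per topic.
    topic_word = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    doc_topic, docs = [], []
    for length in doc_lengths:  # document m has length N_m
        # Document-topic distribution of this document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(length):
            # Pick a topic for the word slot, then a word from that topic.
            topic = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(range(vocab_size), weights=topic_word[topic])[0]
            words.append(word)
        doc_topic.append(theta)
        docs.append(words)
    return topic_word, doc_topic, docs


topic_word, doc_topic, docs = generate_corpus(num_topics=3, vocab_size=10, doc_lengths=[5, 8])
```

Inference runs in the opposite direction: given only `docs`, it recovers estimates of `topic_word` and `doc_topic`.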
  5. Use Metadata to Help Topic Prediction
     - Improve topic detection → morning times may help to improve the breakfast topic
     - Describe dependencies: metadata ↔ topics → the breakfast topic occurs during morning hours
     - Example topics: Chinese food (dimsum, duck, eggs, ...), Vegan food (vegan, tofu, ...), Breakfast (eggs, ham, ...)
  6. Use Metadata to Help Topic Prediction
     - Improve topic detection → morning times may help to improve the breakfast topic
     - Describe dependencies: metadata ↔ topics → the breakfast topic occurs during morning hours
     - Usage
       - Autocompletion → from words to words
       - Prediction of search queries → from metadata to words → from words to metadata
     - Example topics: Chinese food (dimsum, duck, eggs, ...), Vegan food (vegan, tofu, ...), Breakfast (eggs, ham, ...)
  7. Structures of Metadata Spaces
     - Nominal
     - Ordinal
     - Cyclic
     - Spherical
     - Networked (figure labels: Nejdl, Staab, Kling)
  8. Challenges for Using Metadata for Text Mining
     - Generalizing the text mining model: creating a special text mining model for every dataset with its kind of metadata spaces is impractical → we need flexible models!
  9. Challenges for Using Metadata for Text Mining
     - Generalizing the text mining model
     - Efficiency of the text mining model: rich metadata → complex models → complex inference, slow convergence of samplers → analysis of big datasets impossible
  10. Challenges for Using Metadata for Text Mining
      - Generalizing the text mining model
      - Efficiency of the text mining model
      - Explaining the result
        - Importance of metadata → learn how to weight metadata → exclude irrelevant metadata (improves efficiency!)
        - Complex dependencies & complex probability functions → learned parameters incomprehensible → reduced usefulness for data analysis / visualisation → no sanity checks on parameters
  11. Topic Models for Arbitrary Metadata
  12. Topic Models for Arbitrary Metadata
      - Predict document-topic distributions using metadata
        → Gaussian Process Regression Topic Model (Agovic & Banerjee, 2012)
        → Dirichlet-Multinomial Regression Topic Model (Mimno & McCallum, 2012)
        → Structural Topic Model (logistic normal regression) (Roberts et al., 2013)
  13. Topic Models for Arbitrary Metadata
      - Predict document-topic distributions using metadata
        → Gaussian Process Regression Topic Model
        → Dirichlet-Multinomial Regression Topic Model
        → Structural Topic Model (logistic normal regression)
      - Regression input: metadata
      - Regression output: topic distribution
  14. Topic Models for Arbitrary Metadata: Dirichlet-multinomial regression (figure labels: metadata, document-topic distributions)
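As a rough illustration of how Dirichlet-multinomial regression maps metadata to a topic prediction, the sketch below uses the log-linear form from Mimno & McCallum's model, where each document's Dirichlet parameter for topic k is exp(x · λ_k). The feature layout, the weight values, and all function names are assumptions made for this example only.

```python
import math


def dmr_dirichlet_prior(features, topic_weights):
    """Per-document Dirichlet parameters: alpha_k = exp(x . lambda_k)."""
    return [
        math.exp(sum(x * w for x, w in zip(features, weights)))
        for weights in topic_weights
    ]


def expected_topic_distribution(alpha):
    """Mean of Dirichlet(alpha), used here as the prior topic prediction."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical document features: [bias, is_morning].
features = [1.0, 1.0]
# Hypothetical regression weights, one vector per topic.
topic_weights = [[0.0, 0.0], [0.0, math.log(3.0)]]

alpha = dmr_dirichlet_prior(features, topic_weights)
prediction = expected_topic_distribution(alpha)
```

With these made-up weights, a morning document gets about three times as much prior mass on the second topic as on the first.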
  15. Topic Models for Arbitrary Metadata: Gaussian process regression (figure labels: metadata, document-topic distributions)
  16. Topic Models for Arbitrary Metadata: Logistic normal regression (figure labels: metadata, document-topic distributions)
  17. Topic Models for Arbitrary Metadata
      - Alternating inference:
        - Estimate topics
        - Estimate regression model
        - Use prediction for re-estimating topics
        - Re-estimate regression model with new topics
        - ...
  19. Topic Models for Arbitrary Metadata
      - Applicable to a wide range of metadata!
      - Estimation of regression parameters is relatively expensive
      - Learned parameters have no natural interpretation
      - The alternating process of parameter estimation is expensive
  20. Topic Models for Arbitrary Metadata
      - Dirichlet-multinomial and logistic-normal regression do not support complex input data (e.g. geographical data, temporal cycles, ...)
      - Gaussian process regression topic models are very powerful with the right kernel function ... but require expert knowledge for kernel selection and efficient inference!
  21. Hierarchical Multi-Dirichlet Process Topic Models: The Idea
  22. Topic Prediction (figure: topic probability over metadata, e.g. time; documents, e.g. emails)
  23. Dirichlet-Multinomial Regression (figure: topic probability over metadata, e.g. time)
  24. Gaussian Process Regression (figure: topic probability over metadata, e.g. time)
  25. Cluster-Based Prediction (figure: topic probability over metadata, e.g. time)
  27. Cluster-Based Prediction (figure: topic probability over metadata, e.g. time, repeated across several panels)
  29. Idea
      - Two-step model:
        1) Cluster similar documents
        2) Learn topics for clusters and documents simultaneously
           - Learn topic distributions of document clusters
           - Use cluster-topic distributions for topic prediction
  30. Performance, Complex Metadata
      - Cluster documents for each metadata
  32. Performance, Complex Metadata
      - Cluster documents for each metadata
        + nominal, ordinal, cyclic, spherical data
        + any data which can be clustered!
  33. Performance, Complex Metadata
      - Metadata clusters are associated with topics (figure label: German Beer Party)
  34. Mixture of Metadata Predictions
      - Metadata clusters are associated with topics (figure label: German Beer Party)
      - The topic prediction for a single document is a mixture of the predictions of its metadata clusters
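The mixture on this slide can be pictured as a weighted average of cluster-topic distributions. The sketch below is a deliberately simplified illustration and not the HMDP inference itself. The cluster names, topic order, weights, and the function `mix_cluster_predictions` are all hypothetical.

```python
def mix_cluster_predictions(cluster_topic_dists, weights):
    """Weighted mixture of the topic distributions of a document's metadata clusters."""
    if len(cluster_topic_dists) != len(weights):
        raise ValueError("need exactly one weight per metadata cluster")
    total = sum(weights)
    num_topics = len(cluster_topic_dists[0])
    return [
        sum(w * dist[k] for w, dist in zip(weights, cluster_topic_dists)) / total
        for k in range(num_topics)
    ]


# Hypothetical document that falls into one time cluster and one location cluster.
# Topic order (made up): [party, work].
time_cluster = [0.8, 0.2]
geo_cluster = [0.4, 0.6]

# The time metadata is given three times the weight of the location metadata.
prediction = mix_cluster_predictions([time_cluster, geo_cluster], weights=[3.0, 1.0])
```

Here the prediction is approximately `[0.7, 0.3]`: it lies closer to the time cluster because that metadata carries more weight.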
  35. Smoothing of HMDP
  36. Cluster-Based Prediction vs. Outliers and Noisy Data (figure: topic probability over metadata, e.g. time)
  37. Adjacency Smoothing
      - Naive approach: the smoothed value of a cluster is the mean of the cluster and its adjacent clusters
      - Repeat n times
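The naive adjacency smoothing described here can be sketched in a few lines. This example assumes one scalar per cluster (for instance the probability of a single topic) and a ring of seven weekday clusters as the adjacency structure. Both choices, and the function name, are illustrative.

```python
def adjacency_smooth(values, neighbours, rounds=1):
    """Replace each cluster value by the mean of itself and its adjacent clusters."""
    current = list(values)
    for _ in range(rounds):
        current = [
            (current[i] + sum(current[j] for j in neighbours[i])) / (1 + len(neighbours[i]))
            for i in range(len(current))
        ]
    return current


# Cyclic metadata: each weekday cluster is adjacent to the previous and next day.
num_days = 7
ring = [[(i - 1) % num_days, (i + 1) % num_days] for i in range(num_days)]

# A single outlier cluster with a made-up value.
raw = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
once = adjacency_smooth(raw, ring, rounds=1)
twice = adjacency_smooth(raw, ring, rounds=2)
```

Each additional round spreads the outlier's mass one step further along the ring, so the number of rounds acts as the smoothing strength.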
  38. Smoothing topics associated with metadata clusters
      - Documents receive topics from their own and neighboring metadata clusters
  39. Performance, Complex Metadata
      - Smooth topics associated with metadata clusters
  40. Metadata structures: nominal, ordinal, cyclic, spherical, networked
  41. Smoothing
      - Smoothing strength is learned during inference
        - Similar clusters → stronger smoothing
        - Dissimilar clusters → weaker smoothing
      - Alternatively, the smoothing strength can be predefined by the user
  42. Metadata Weighting in HMDPs
  43. Feature Weighting
      - One variable, η, governs the influence of a metadata cluster on documents
      - If η < threshold, ignore the variable
  44. Metadata Weighting
      - The importance of metadata is learned during inference, answering the question: what percentage of the topics is explained by a given metadata (e.g. time, geographical coordinates, ...)? → Interpretable parameter!
      - Metadata with a low weight can be removed during inference
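The thresholding rule can be illustrated as follows. This is a minimal sketch that assumes the learned weights are available as a dictionary and that the remaining weights are renormalised after pruning. The renormalisation step, the threshold value, and the weights themselves are assumptions made for this example and are not results from the talk.

```python
def prune_metadata(weights, threshold):
    """Drop metadata whose weight is below the threshold and renormalise the rest."""
    kept = {name: w for name, w in weights.items() if w >= threshold}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {name: w / total for name, w in kept.items()}


# Made-up weights for three hypothetical metadata types.
weights = {"time": 0.6, "geo": 0.38, "device": 0.02}
active = prune_metadata(weights, threshold=0.05)
```

After pruning, only `time` and `geo` remain, so later iterations no longer need to process the `device` clusters.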
  45. Example Application
  46. Dataset
      - Linux Kernel Mailing List: 3,400,000 emails with timestamps and mailing list ID
  47. Dataset
      - Linux Kernel Mailing List: 3,400,000 emails with timestamps and mailing list ID
        - Timeline
        - Yearly cycle
        - Weekly cycle
        - Daily cycle
        - Mailing list
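To make the temporal metadata spaces concrete, the sketch below derives one cluster id per space from an email timestamp. The bin sizes (month for the timeline and the yearly cycle, weekday for the weekly cycle, hour for the daily cycle) and the function name are assumptions. The slides do not state which granularity was used.

```python
from datetime import datetime, timezone


def temporal_clusters(sent_at):
    """Map a timestamp to one cluster id per temporal metadata space."""
    return {
        "timeline": (sent_at.year, sent_at.month),  # ordinal: position on the timeline
        "yearly_cycle": sent_at.month,              # cyclic: 1..12
        "weekly_cycle": sent_at.weekday(),          # cyclic: Monday=0 .. Sunday=6
        "daily_cycle": sent_at.hour,                # cyclic: 0..23
    }


example = temporal_clusters(datetime(2016, 9, 8, 10, 30, tzinfo=timezone.utc))
```

The cyclic spaces wrap around: hour 23 is adjacent to hour 0, and December is adjacent to January.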
  48. Topics
  50. Topics
      - Professional topics:
      - Hobbyist topics:
  51. Topics
      - Metadata weighting:
  52. Topics
      - Metadata weighting: can be removed during inference
  53. Efficient Inference in HMDP
  54. Hierarchical Multi-Dirichlet Process Topic Model (HMDP) (figure labels: cluster-topic distributions, document-topic distributions, metadata)
  55. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
      - Inference: nearly completely collapsed inference!
  56. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
      - We only need to learn
        - Global topic distribution
        - Topic assignments to words
  57. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
      - We only need to learn
        - Global topic distribution
        - Topic assignments to words
        - Dirichlet parameters
  58. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
      - Approximations:
        - Variational
        - Practical
        - Stochastic → low memory consumption → online inference
  59. Parameters of HMDP
      - Cluster-topic distributions: how many documents of a cluster contain topic x?
  60. Parameters of HMDP
      - Cluster-topic distributions: how many documents of a cluster contain topic x?
      - Metadata weights: how many of the topics of documents are explained by metadata x?
  61. Parameters of HMDP
      - Cluster-topic distributions: how many documents of a cluster contain topic x?
      - Metadata weights: how many of the topics of documents are explained by metadata x?
      - Dirichlet process scaling parameters: how many pseudo-counts do we add to the topic distributions?
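One common way to read a scaling parameter as pseudo-counts is the usual Dirichlet-style smoothed estimate, in which `scale` pseudo-observations distributed according to a base distribution are added to the observed topic counts. The sketch below shows that reading only. It is an assumption about the interpretation, and the function name and numbers are made up.

```python
def smooth_with_pseudo_counts(topic_counts, base_distribution, scale):
    """Estimate (n_k + scale * base_k) / (n + scale) for every topic k."""
    total = sum(topic_counts) + scale
    return [
        (count + scale * base) / total
        for count, base in zip(topic_counts, base_distribution)
    ]


# Ten observed topic assignments, smoothed towards a uniform base with 10 pseudo-counts.
estimate = smooth_with_pseudo_counts([8, 2], [0.5, 0.5], scale=10.0)
```

A larger `scale` pulls the estimate towards the base distribution, while a smaller one lets the observed counts dominate.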
  62. Properties of HMDP
      - Interpretable parameters
      - Simultaneous inference of topics and metadata-topic dependencies
      - Efficient online inference
  63. Comparison of Topic Models for Arbitrary Metadata
  64. Comparison: Gaussian Process Topic Model, the “perfect” model
      - Can cope with arbitrary metadata
      - Models dependencies between metadata
      - Parameter learning is very expensive
      - Kernel selection and inference require expert knowledge
      - Parameters of Gaussian processes are hard to interpret
  65. Comparison: Dirichlet-Multinomial Regression Topic Model, the “straightforward” model
      - Can cope with many kinds of metadata
      - Parameter learning is cheaper than for Gaussian processes but still expensive (due to alternating inference and repeated distance calculations)
      - Cannot cope with complex metadata (e.g. geographical, cyclic, ...)
      - Does not model dependencies between metadata
      - Regression weights of Dirichlet-multinomial regression are hard to interpret
  66. Comparison: Hierarchical Multi-Dirichlet Process Topic Model, the “fast” model
      - Can cope with arbitrary metadata
      - Fast inference (simultaneously for topics and topic predictions)
      - All parameters have natural interpretations as probabilities or pseudo-counts
      - Requires a (simple) pre-clustering of documents
      - Does not model dependencies between metadata
  67. THANK YOU FOR YOUR ATTENTION!
