Text Mining Using LDA with Context
Christoph Kling, Steffen Staab
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany &
Web and Internet Science Group · ECS · University of Southampton, UK
Text Mining Documents
Documents are
- PDFs, emails, tweets, Flickr photo tags, CVs, ...
Documents consist of
- a bag of words
- metadata
  - author(s)
  - timestamp
  - geolocation
  - publisher
  - booktitle
  - device
  - ...
[Figure: example word clusters: "Chinese food" (dimsum, duck, eggs, ...), "Vegan food" (vegan, tofu, ...), "Breakfast" (eggs, ham, ...)]
Objective: cluster, categorize & explain
Latent Dirichlet Allocation (LDA)
[Figure: LDA graphical model]
- Document-topic distributions
- Topic-word distributions
- K topics, M documents
- Each document m ∈ {1, ..., M} has length N_m
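For reference, the standard LDA generative process in this notation (a textbook sketch; the symbols θ_m, φ_k, α, β follow the usual LDA convention and are assumed here, not taken from the slides):

$$
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && k = 1,\dots,K \\
\theta_m &\sim \mathrm{Dirichlet}(\alpha) && m = 1,\dots,M \\
z_{m,n} &\sim \mathrm{Categorical}(\theta_m) && n = 1,\dots,N_m \\
w_{m,n} &\sim \mathrm{Categorical}(\varphi_{z_{m,n}})
\end{aligned}
$$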
Use Metadata to Help Topic Prediction
- Improve topic detection
  → morning timestamps may help to improve detection of the breakfast topic
- Describe dependencies: metadata ↔ topics
  → the breakfast topic occurs during morning hours
- Usage
  - Autocompletion
    → from words to words
  - Prediction of search queries
    → from metadata to words
    → from words to metadata
Structures of Metadata Spaces
- Nominal
- Ordinal
- Cyclic
- Spherical
- Networked
[Figure: examples for each structure; the network example connects Nejdl, Staab, and Kling]
Challenges for Using Metadata for Text Mining
- Generalizing the Text Mining Model
  Creating a special text mining model for every dataset and its particular metadata spaces is impractical
  → we need flexible models!
- Efficiency of the Text Mining Model
  Rich metadata
  → complex models
  → complex inference, slow convergence of samplers
  → analysis of big datasets becomes impossible
- Explaining the Result
  Importance of metadata
  → learn how to weight metadata
  → exclude irrelevant metadata (improves efficiency!)
  Complex dependencies & complex probability functions
  → learned parameters become incomprehensible
  → reduced usefulness for data analysis / visualisation
  → no sanity checks on parameters
Topic Models for Arbitrary Metadata
- Predict document-topic distributions using metadata
  → Gaussian Process Regression Topic Model (Agovic & Banerjee, 2012)
  → Dirichlet-Multinomial Regression Topic Model (Mimno & McCallum, 2012)
  → Structural Topic Model (logistic normal regression) (Roberts et al., 2013)
- Regression input: metadata
- Regression output: topic distribution
[Figures: model diagrams for each variant, mapping metadata (regression input) to document-topic distributions (regression output): Dirichlet-multinomial regression, Gaussian process regression, and logistic normal regression]
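For concreteness, the document-level priors of two of these variants can be sketched as follows (standard formulations, with x_m denoting the metadata vector of document m; the notation is assumed here, not taken from the slides). Dirichlet-multinomial regression:

$$ \alpha_{m,k} = \exp\!\big(x_m^{\top} \lambda_k\big), \qquad \theta_m \sim \mathrm{Dirichlet}(\alpha_{m,1}, \dots, \alpha_{m,K}) $$

Logistic normal regression (Structural Topic Model):

$$ \eta_m \sim \mathcal{N}\big(x_m^{\top} \Gamma, \Sigma\big), \qquad \theta_{m,k} = \frac{\exp(\eta_{m,k})}{\sum_{k'} \exp(\eta_{m,k'})} $$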
- Alternating inference (a sketch in code follows below):
  - Estimate topics
  - Estimate the regression model
  - Use the prediction for re-estimating topics
  - Re-estimate the regression model with the new topics
  - ...
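A minimal, self-contained sketch of this alternating scheme (an assumed structure for illustration, not the implementation of any of the cited models): a toy collapsed Gibbs sampler for the topic assignments with document-specific Dirichlet priors, alternated with a ridge regression from metadata to log topic proportions, whose predictions set the priors for the next round.

```python
# Toy alternating inference: Gibbs sampling of topics <-> regression from metadata.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta):
    """One collapsed Gibbs sweep over all tokens (toy LDA sampler)."""
    K, V = nkw.shape
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[m][i]
            ndk[m, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            p = (ndk[m] + alpha[m]) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[m][i] = k
            ndk[m, k] += 1; nkw[k, w] += 1; nk[k] += 1

def alternating_inference(docs, X, K=5, V=50, beta=0.01, rounds=10, sweeps=20):
    M = len(docs)
    alpha = np.full((M, K), 0.1)                       # initial symmetric priors
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
    ndk = np.zeros((M, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[m, z[m][i]] += 1; nkw[z[m][i], w] += 1; nk[z[m][i]] += 1
    for _ in range(rounds):
        for _ in range(sweeps):                        # 1) estimate topics
            gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta)
        theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
        reg = Ridge(alpha=1.0).fit(X, np.log(theta))   # 2) estimate regression model
        alpha = np.exp(reg.predict(X))                 # 3) prediction -> new priors
    return ndk, nkw, alpha

# Toy usage: 20 random documents over a 50-word vocabulary,
# each with a single scalar metadatum.
docs = [list(rng.integers(0, 50, size=30)) for _ in range(20)]
X = rng.normal(size=(20, 1))
ndk, nkw, alpha = alternating_inference(docs, X)
print(alpha.round(2))
```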
- Applicable to a wide range of metadata!
- Estimation of the regression parameters is relatively expensive
- Learned parameters have no natural interpretation
- The alternating process of parameter estimation is expensive
- Dirichlet-multinomial and logistic-normal regression do not support complex input data (e.g. geographical data, temporal cycles, ...)
- Gaussian process regression topic models are very powerful with the right kernel function
  ... but require expert knowledge for kernel selection and efficient inference!
Hierarchical Multi-Dirichlet Process Topic Models
The Idea
Topic Prediction
[Figure: topic probability over metadata (e.g. time); the points are documents, e.g. emails]
Dirichlet-Multinomial Regression
[Figure: predicted topic probability over metadata (e.g. time)]
Gaussian Process Regression
[Figure: predicted topic probability over metadata (e.g. time)]
Cluster-Based Prediction
[Figure: topic probability over metadata (e.g. time), with a separate topic distribution per metadata cluster]
Idea
- Two-step model:
  1) Cluster similar documents
  2) Learn topics for clusters and documents simultaneously
     ▪ Learn the topic distributions of document clusters
     ▪ Use the cluster-topic distributions for topic prediction
Performance, Complex Metadata
- Cluster documents for each type of metadata (see the sketch below)
  + nominal, ordinal, cyclic, spherical data
  + any data which can be clustered!
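A minimal sketch of such a pre-clustering step. The concrete choices here (k-means on geographic coordinates, fixed bins for a cyclic hour-of-day feature) are illustrative assumptions, not the clustering used by the authors.

```python
# Illustrative pre-clustering of two metadata types: each document ends up
# with one cluster index per metadata type.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_docs = 1000

# Geographic metadata: (latitude, longitude) pairs -> k-means clusters.
# (For truly spherical data one would cluster on the sphere, e.g. with Fisher
# distributions; k-means on lat/lon is only a rough stand-in here.)
coords = np.column_stack([rng.uniform(47, 55, n_docs),   # latitude
                          rng.uniform(6, 15, n_docs)])   # longitude
geo_clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(coords)

# Cyclic metadata: hour of day -> fixed 3-hour bins that wrap around midnight.
hours = rng.integers(0, 24, n_docs)
hour_clusters = hours // 3          # eight bins: 0..7

# Every document now carries one cluster index per metadata type.
doc_clusters = list(zip(geo_clusters, hour_clusters))
print(doc_clusters[:5])
```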
Mixture of Metadata Predictions
- Metadata clusters are associated with topics
  [Figure: example of a cluster-topic association; labels: German, Beer, Party]
- The topic prediction for a single document is a mixture of the predictions of its metadata clusters (a formula sketch follows below)
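A sketch of the general form of this mixture (a simplified rendering under assumed notation, not the exact specification from the paper: F metadata types with weights η_f, cluster-membership weights ζ_{f,j,m} of document m for cluster j of type f, cluster-topic distributions G_{f,j}, and a concentration parameter α):

$$ G_m \sim \mathrm{DP}\!\Big(\alpha,\; \sum_{f=1}^{F} \eta_f \sum_{j} \zeta_{f,j,m}\, G_{f,j}\Big), \qquad \sum_{f} \eta_f = 1, \quad \sum_{j} \zeta_{f,j,m} = 1 $$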
Smoothing of HMDP
Cluster-Based Prediction vs. Outliers and Noisy Data
[Figure: topic probability over metadata (e.g. time)]
Adjacency Smoothing
- Naive approach: the smoothed value of a cluster is the mean of the cluster and its adjacent clusters (see the sketch below)
- Repeat n times
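A minimal sketch of this naive smoothing (assumed representation: one topic-count vector per cluster plus an adjacency list over clusters; illustrative only, not the authors' implementation):

```python
# Naive adjacency smoothing: repeatedly average each cluster with its neighbors.
import numpy as np

def smooth(cluster_topic, neighbors, n_iterations=3):
    """Replace each cluster's topic vector by the mean of the cluster and its
    adjacent clusters, repeated n_iterations times."""
    values = np.asarray(cluster_topic, dtype=float)
    for _ in range(n_iterations):
        new_values = values.copy()
        for c, adj in enumerate(neighbors):
            new_values[c] = values[[c] + list(adj)].mean(axis=0)
        values = new_values
    return values

# Toy example: 5 clusters on a line (e.g. time bins), 3 topics each.
cluster_topic = np.array([[9, 1, 0],   # an outlier-ish cluster
                          [2, 5, 3],
                          [1, 6, 3],
                          [2, 5, 3],
                          [1, 1, 8]])
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]   # adjacency along the line
print(smooth(cluster_topic, neighbors, n_iterations=2))
```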
Smoothing Topics Associated with Metadata Clusters
- Documents receive topics from their own and from neighboring metadata clusters
[Figure: neighborhood structures for nominal, ordinal, cyclic, spherical, and networked metadata spaces]
Smoothing
- The smoothing strength is learned during inference
  Similar clusters → stronger smoothing
  Dissimilar clusters → weaker smoothing
- Alternatively, the smoothing strength can be predefined by the user
Metadata Weighting in HMDPs
Feature Weighting
- A single variable η governs the influence of a metadata cluster on the documents
- If η < threshold, ignore the variable
Metadata Weighting
- The importance of each metadata type is learned during inference, answering the question: what percentage of the topics is explained by a given metadata type (e.g. time, geographical coordinates, ...)?
  → an interpretable parameter! (see the note below)
- Metadata with a low weight can be removed during inference
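One way to read such a weight (a sketch under the mixture notation above; an assumption for illustration, not a formula from the slides): if n_{m,f} denotes the number of words in document m whose topics were drawn via the clusters of metadata type f, the learned weight of type f corresponds roughly to

$$ \hat{\eta}_f \approx \frac{\sum_m n_{m,f}}{\sum_m N_m}, $$

i.e. the fraction of all topic assignments in the corpus that are explained by that metadata type.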
Example Application
Dataset
- Linux Kernel Mailing List
  3,400,000 emails with timestamps and mailing list IDs
- Metadata used (a feature-extraction sketch follows below):
  - Timeline
  - Yearly cycle
  - Weekly cycle
  - Daily cycle
  - Mailing list
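A minimal sketch of deriving these facets from an email's timestamp and list ID (illustrative binning assumed here, not the authors' preprocessing):

```python
# Map one email to the five metadata facets listed above.
from datetime import datetime, timezone

def metadata_features(ts: datetime, list_id: str) -> dict:
    """Timeline, yearly / weekly / daily cycle, and mailing list for one email."""
    return {
        "timeline": ts.date().toordinal(),        # linear time (ordinal axis)
        "yearly_cycle": ts.timetuple().tm_yday,   # day of year, 1..366 (cyclic)
        "weekly_cycle": ts.weekday(),             # 0 = Monday .. 6 = Sunday (cyclic)
        "daily_cycle": ts.hour,                   # hour of day, 0..23 (cyclic)
        "mailing_list": list_id,                  # nominal
    }

print(metadata_features(datetime(2014, 6, 3, 8, 15, tzinfo=timezone.utc),
                        "linux-kernel"))
```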
Topics
[Figures: example topics learned from the corpus, grouped into professional topics and hobbyist topics]
Metadata weighting
[Figure: learned metadata weights; metadata with low weight can be removed during inference]
Efficient Inference in HMDP
Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
[Figure: graphical model relating metadata, cluster-topic distributions, and document-topic distributions]
Inference: nearly completely collapsed inference!
We only need to learn
- the global topic distribution
- the topic assignments to words
- the Dirichlet parameters
Approximations:
- variational
- practical
- stochastic
→ low memory consumption
→ online inference
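For intuition, online inference of this kind typically follows the stochastic variational recipe (a standard scheme, e.g. Hoffman et al.'s stochastic variational inference; stated here as an assumption about the general approach, not as this model's exact update): after processing a mini-batch at step t, a global variational parameter λ is moved towards the mini-batch estimate with a decreasing step size,

$$ \lambda^{(t)} = (1 - \rho_t)\, \lambda^{(t-1)} + \rho_t\, \hat{\lambda}_t, \qquad \rho_t = (t + \tau)^{-\kappa}, \ \kappa \in (0.5, 1], $$

so only the current mini-batch has to be kept in memory.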
Parameters of HMDP
- Cluster-topic distributions:
  how many documents of a cluster contain topic x?
- Metadata weights:
  how many of the documents' topics are explained by metadata x?
- Dirichlet process scaling parameters:
  how many pseudo-counts do we add to the topic distributions? (see the note below)
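For intuition on the pseudo-count reading (the standard Dirichlet-process posterior-predictive form; the notation is assumed here, not taken from the slides): with scaling parameter α and base distribution G_0, the probability of topic k after n_k of n observations have been assigned to it is

$$ p(k) = \frac{n_k + \alpha\, G_0(k)}{n + \alpha}, $$

so α behaves like a budget of pseudo-counts spread over the parent topic distribution G_0.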
Properties of HMDP
- Interpretable parameters
- Simultaneous inference of topics and metadata-topic dependencies
- Efficient online inference
Comparison of Topic Models for Arbitrary Metadata
Comparison
- Gaussian Process Topic Model: the "perfect" model
  - Can cope with arbitrary metadata
  - Models dependencies between metadata
  - Parameter learning is very expensive
  - Kernel selection and inference require expert knowledge
  - Parameters of Gaussian processes are hard to interpret
Comparison
- Multinomial Regression Topic Model: the "straightforward" model
  - Can cope with many kinds of metadata
  - Parameter learning is cheaper than for Gaussian processes, but still expensive (due to alternating inference and repeated distance calculations)
  - Cannot cope with complex metadata (e.g. geographical, cyclic, ...)
  - Does not model dependencies between metadata
  - Regression weights of Dirichlet-multinomial regression are hard to interpret
Comparison
- Hierarchical Multi-Dirichlet Process Topic Model: the "fast" model
  - Can cope with arbitrary metadata
  - Fast inference (simultaneously for topics and topic predictions)
  - All parameters have natural interpretations as probabilities or pseudo-counts
  - Requires a (simple) pre-clustering of documents
  - Does not model dependencies between metadata
THANK YOU FOR YOUR ATTENTION!