Efficient Clustering from Distributions over Topics

Badenes-Olmedo, Carlos
Redondo Garcia, Jose Luís
Corcho, Oscar
Ontology Engineering Group (OEG)
Universidad Politécnica de Madrid (UPM)
cbadenes@fi.upm.es
@carbadol
github.com/librairy
oeg-upm.net
Efﬁcient Clustering
from Distributions
over Topics
K-CAP 2017
Knowledge Capture

December 4th-6th, 2017

Austin, Texas, United States

2
source: emages.eventshigh.com

3
Connected Documents in a Collection

4
From Sets of Textual documents to Graphs

5
Similarity Matrix
Non-Symmetric: 31*31-31=930
Symmetric: 930/2=465

6
Solution: Partition of the Search Space

7
Use-Case: Digital Publisher
7,648 Books
68,653 Chapters
4x1.6 Ghz
16GB RAM
4x1.6 Ghz
16GB RAM
2,910,883,150
similarities
aprox 8 hours
76,301 Documents
All pairwise similarities
approx 3.5 days

8
Hypothesis
Topic 1
Topic 2
Topic 3
Topic 4
LDA
LDA for efficient space partition
Probabilis)c Topic Models (PTM) and in par)cular, on Latent Dirichlet Alloca-on (LDA)
can efficiently divide the search space and speed up the process of finding rela)ons
among documents inside big collec)ons.

9
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
Probabilistic Topic Models
source: David Blei, Probabilistic Topic Models

10
Topic Model
Similarity Valuetopic0: 0.520,
topic1: 0.327
topic2: 0.081
…
topic122: 0.182
topic1: 0.573,
topic2: 0.172
topic3: 0.136
…
topic122: 0.099
0.595
..
Topic0 Topic1 Topic2 Topic122
DOCUMENT
‘A’
DOCUMENT
‘B’
Topic-based Similarity
Jensen-Shannon Divergence

11
Scenario
[ 0.243, 0.145, 0.600, 0.022]
corpus Prob. Topic Model
Topic 1
Topic 2
Topic 3
Dirichlet Distribution
• Exponential family distribution over the simplex,  
i.e. positive vectors that sum to one

Approach 1: TDC
12
•  Instead of directly relying on the topic distribu5on’s scores, it considers
their varia5ons across consecu5ve topics inside a document’s topic
distribu5on

0.23 0.18 0.33 0.13 0.13 P =
> < = >
2 1 2 0 T =
Trends on Dirichlet-based Clustering (TDC)

13
Approach 2: RDC
Ranking on Dirichlet-based Clustering (RDC)
•  Only considering the top n topics from the ranked list of probability distribu7ons [29]
•  Based on the assump7on that the highest weighted topics have a high inﬂuence in the
rest of topics when calcula7ng distances

0.23 0.18 0.33 0.13 0.13 n=2, P =
2 0 R =
1 0 2 3 4

14
Approach 3: CRDC
Cumulative Ranking on Dirichlet-based Clustering (CRDC)

16
Experiments: Baselines
• 
1: h%p://commons.apache.org/proper/commons-math/

Similarity Metrics:
•Jensen-Shannon Divergence
•Hellinger Distance
Clustering Algorithms:

18
Experiments: Results
Precision

19
Recall

20
Number of Clusters

21
Effectiveness
JSD

22
Effectiveness
Hellinger

23
Cost
JSD

24
Cost
Hellinger

25
Efﬁciency
JSD

26
Efﬁciency
Hellinger

27
Conclusions
•  Unsupervised clustering algorithms, TDC, RDC and CRDC.
•  CRDC is a promising approach, which improves the eﬃciency
obtained by other centroid-based and density-based approaches such
as K-Means
•  Hierarchical approach for RDC algorithm was also considered but it
did not produce good results

28
Future Work
•  Hybrid methods combining some of these novel approaches with
exis8ng techniques will be performed in future work on the same line
•  Nearest neighbors as baseline

29
Use-Case: Digital Publisher
7,648 Books
68,653 Chapters
4x1.6 Ghz
16GB RAM
4x1.6 Ghz
16GB RAM
2,910,883,150
similarities
aprox 8 hours
76,301 Documents
by using CRDC
aprox 2 hours
64,635,080
similarities
threshold = 0.9
approx 3.5 days

K-CAP 2017
Knowledge Capture

December 4th-6th, 2017

Austin, Texas, United States
Badenes-Olmedo, Carlos
Redondo Garcia, Jose Luís
Corcho, Oscar
Ontology Engineering Group (OEG)
Universidad Politécnica de Madrid (UPM)
Efﬁcient Clustering
from Distributions
over Topics
cbadenes@fi.upm.es
@carbadol
github.com/librairy
oeg-upm.net

Efficient Clustering from Distributions over Topics

More Related Content

What's hot

Similar to Efficient Clustering from Distributions over Topics

More from Carlos Badenes-Olmedo

Recently uploaded

Efficient Clustering from Distributions over Topics