Thinking in clustering yueshen xu

Thinking in (Text) Clustering
(No math, don't be afraid)
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML

Software Engineering, 2017/4/13
Outline
 Background
 What can be clustered?
 Problems in K-XXX (Means/Medoid/Center…)
 Similarity Measure
 Convex and Concave
 Problems in Gaussian Mixture Model
 Problems in Matrix Factorization
 Multinomial and Sparsity
Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution
Basics, not state-of-the-art

Background
 Information Overload
We need: summarization, visualization, dimensionality reduction
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.

Background
Dimensionality Reduction (DR)
 Clustering
 Text Clustering, Webpage Clustering, Image Clustering…
 Summarization
 Document Summarization, Image Summarization…
 Factorization
 Rating Matrix Factorization, Image Non-negative Factorization
 Basic Requirements: Automatic, Applicable, Explainable
Clustering (Text)

Some Concepts
 Related Research Areas
 Dimensionality Reduction (DR)
 Text Mining
 Natural Language Processing
 Computational Linguistics
 Information Retrieval
 Artificial Intelligence
 (Text) Clustering
[Figure: overlapping research areas. Labels: Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering]
 We all know what (text) clustering is, right?
 Widely-accepted topic, since everyone knows it

What can be clustered?
Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)
Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)
Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)
Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)
Data Sample 5: (▲▼♦), (♣♠█), (■□●)

Is there anything that cannot be clustered?
Yes, but nothing that concerns us here
What can be clustered?
Anything over which a similarity measure can be defined
Matrix topology
All kinds of data can be clustered
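
A similarity measure is the only thing a clustering method needs in order to compare two items. The sketch below illustrates this on the data samples above. The function names and the choice of measures (a distance-based similarity for numeric vectors, Jaccard overlap for word lists and character strings) are illustrative assumptions, not something prescribed by the slides.

```python
import math


def euclidean_similarity(a, b):
    """Turn Euclidean distance into a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def jaccard_similarity(a, b):
    """Fraction of distinct items shared by two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Data Sample 1: numeric vectors
v1 = (1.2, 1.4, 2.234, 3.231)
v2 = (8.2, 6.4, 4.243, 5.41)
v3 = (5.234, 3.56, 4.454, 6.78)

# Data Sample 3: word lists
d1 = ("China", "modern", "people", "gov.")
d2 = ("policy", "paper", "conference", "chair")

# Data Sample 4: strings, compared through their sets of characters
s1, s2 = "aaabbbccc", "dddfffggg"

print("sim(v1, v2) =", euclidean_similarity(v1, v2))
print("sim(v2, v3) =", euclidean_similarity(v2, v3))
print("sim(d1, d2) =", jaccard_similarity(d1, d2))
print("sim(s1, s2) =", jaccard_similarity(s1, s2))
```

Once such a function exists for a data type, any similarity-based clustering method can in principle be applied to it. Whether the resulting clusters are meaningful depends on how well the measure matches the data.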

K-Means Trap
Defects of K-Means, K-Medoid, K-XXX
 How many clusters (K)?
 Where are the initial centers?
 Do the data really form a sphere?
 Do the data really follow Minkowski/Euclidean distance?
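
The question about initial centers can be made concrete with a tiny example. The sketch below is a plain Lloyd-style K-Means written for illustration. The function name `kmeans`, the four toy points, and both initializations are assumptions chosen to make the effect visible.

```python
import math


def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    centers = list(centers)
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        if updated == centers:  # no center moved: converged
            break
        centers = updated
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


# Four corners of a wide rectangle; the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, _, sse_a = kmeans(points, centers=[(0, 0.5), (4, 0.5)])
labels_b, _, sse_b = kmeans(points, centers=[(2, 0), (2, 1)])

print("init A:", labels_a, "SSE =", sse_a)
print("init B:", labels_b, "SSE =", sse_b)
```

Both runs converge immediately, but to different answers: initialization A yields the left/right split with a sum of squared errors of 1.0, while initialization B gets stuck in the bottom/top split with a sum of squared errors of 16.0. The algorithm itself gives no signal that the second result is the worse one.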

How about these?
What kind of data does K-XXX fit better?
What kind of data do methods relying on distance/similarity computation fit better?
CONVEX
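
To see why convexity matters, consider two concentric rings, a shape that no straight boundary can separate. The sketch below is an illustration under assumed settings (ring radii 1 and 5, 40 points per ring, one initial center taken from each ring), using a plain Lloyd-style K-Means with K = 2.

```python
import math


def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = list(centers)
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        if updated == centers:
            break
        centers = updated
    return labels


def ring(radius, n=40):
    """n points evenly spaced on a circle around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


inner, outer = ring(1.0), ring(5.0)
labels = kmeans(inner + outer, centers=[inner[0], outer[0]])
inner_labels, outer_labels = labels[:40], labels[40:]

print("clusters found on the inner ring:", sorted(set(inner_labels)))
print("clusters found on the outer ring:", sorted(set(outer_labels)))
```

With two centers, K-Means separates the plane along a straight line, so the outer ring ends up split across both clusters instead of being recovered as one group. The two rings are obvious to the eye, but they are not convex regions that a nearest-center rule can describe.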

Alternative
 Gaussian Mixture Model
Why Gaussian? → Central limit theorem
Is the central limit theorem always applicable in real-world cases?
1. Parameter Tuning
2. High applicability of Gaussian distribution
How to estimate parameters?
Expectation-Maximization
No closed-form solution
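
Because there is no closed-form solution, Expectation-Maximization alternates between two steps until the parameters stop changing. The sketch below is a minimal one-dimensional, two-component version. The function names, the synthetic data (two Gaussians centered at 0 and 6), the starting values, and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, iters=50):
    """EM for a 1-D Gaussian mixture; returns updated copies of the parameters."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
            weights[j] = nj / len(data)
    return mus, variances, weights


random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(6.0, 1.0) for _ in range(200)])

mus, variances, weights = em_gmm_1d(data, mus=[1.0, 4.0],
                                    variances=[1.0, 1.0], weights=[0.5, 0.5])
print("estimated means:", mus)
print("estimated weights:", weights)
```

Each point receives a soft assignment rather than a hard label, which is the main difference from K-Means. The result still depends on the starting values and on the number of components, so the parameter tuning concern above remains.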

Alternative
 Matrix Factorization
No closed-form solution
Because we are not in the department of math
SVD, PMF, NMF, Tensor Factorization…
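
In practice the factors are found iteratively. The sketch below uses stochastic gradient descent with L2 regularization on the observed entries, which is one common choice and not necessarily the variant the slides have in mind. The rating matrix, the function names, and the hyperparameters (`rank`, `lr`, `reg`, `steps`) are hypothetical.

```python
import math
import random


def factorize(R, rank=2, steps=2000, lr=0.01, reg=0.01, seed=0):
    """Approximate R by U * V^T using SGD over the observed (non-None) entries."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    U = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                if R[i][j] is None:
                    continue  # missing rating: nothing to fit
                err = R[i][j] - sum(U[i][f] * V[j][f] for f in range(rank))
                for f in range(rank):
                    u, v = U[i][f], V[j][f]
                    U[i][f] += lr * (err * v - reg * u)
                    V[j][f] += lr * (err * u - reg * v)
    return U, V


def rmse(R, U, V):
    """Root mean squared error over the observed entries only."""
    errs = [(R[i][j] - sum(u * v for u, v in zip(U[i], V[j]))) ** 2
            for i in range(len(R)) for j in range(len(R[0]))
            if R[i][j] is not None]
    return math.sqrt(sum(errs) / len(errs))


# Hypothetical user-by-item rating matrix; None marks a missing rating.
R = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]

U0, V0 = factorize(R, steps=0)  # random starting point, no training
U, V = factorize(R)
print("RMSE before training:", rmse(R, U0, V0))
print("RMSE after training: ", rmse(R, U, V))
```

The rows of `U` can then be clustered or compared directly, since each one is a low-dimensional description of the corresponding row of `R`.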

Triangle
Is there no perfect method here?
What we probably want
 No constraint on the form of data
 No assumption about data distribution
 Closed-form solution
Triangle borrowed from distributed computing

Triangle (Cont.)
I do not know whether such a method exists or not
[Figure: triangle with corners Form, Distribution, and Closed-solution. Methods placed on it: Hierarchical Clustering?, GMM/Gaussian Process, K-Means/Medoid, Matrix Factorization. Three positions are marked "impossible".]

Multinomial Distribution
Discrete Data (Text)
One document:
(0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,…)
Multinomial distribution
Clustering → Sampling (Markov Chain Monte Carlo)
Friendly to sparsity
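
Under a multinomial model, only the words that actually occur in a document contribute to its likelihood, which is why the model copes well with sparse text. The sketch below shows a single assignment step of a multinomial mixture with fixed clusters. It does not perform the MCMC sampling mentioned above. The cluster names, the toy documents, the function names, and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter


def fit_multinomial(docs, vocab, alpha=1.0):
    """Word probabilities for one cluster, with add-alpha smoothing."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def log_likelihood(doc, dist):
    """Multinomial log-likelihood up to a constant that is the same for every cluster."""
    return sum(n * math.log(dist[word]) for word, n in Counter(doc).items())


clusters = {
    "politics": [["China", "policy", "report", "meeting"],
                 ["policy", "meeting", "gov."]],
    "academia": [["paper", "conference", "chair"],
                 ["paper", "report", "conference"]],
}
vocab = sorted({w for docs in clusters.values() for doc in docs for w in doc})
models = {name: fit_multinomial(docs, vocab) for name, docs in clusters.items()}

# The non-zero entries of the example document.
new_doc = ["China", "report", "policy", "meeting", "meeting", "report"]
scores = {name: log_likelihood(new_doc, dist) for name, dist in models.items()}
assigned = max(scores, key=scores.get)
print(scores)
print("assigned to:", assigned)
```

The zeros in the document vector never enter the computation, so the cost grows with the number of distinct words in the document and not with the size of the vocabulary.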

Sparsity
Sparsity brings a lot of problems
 Also in clustering → What can we do?
➢ Ensemble Learning (Ensemble clustering)
➢ Missing values pre-filling
➢ Tuning ☺
➢ …
10000 words → 1 term
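
One way to see the problem is to measure how similar short documents look over a large vocabulary. The sketch below is an illustration only: the vocabulary size of 10000 comes from the slide, while the documents, the sparse dictionary representation, and the function names are assumptions.

```python
import math

VOCAB_SIZE = 10000  # vocabulary size taken from the slide


def sparsity(doc):
    """Fraction of vocabulary entries that are zero for this document."""
    return 1.0 - len(doc) / VOCAB_SIZE


def cosine(a, b):
    """Cosine similarity of two sparse term -> count dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Only the non-zero terms are stored.
doc_a = {"China": 1, "policy": 1}
doc_b = {"government": 1, "meeting": 2}
doc_c = {"policy": 2, "meeting": 1}

print("sparsity of doc_a:", sparsity(doc_a))
print("cosine(doc_a, doc_b):", cosine(doc_a, doc_b))
print("cosine(doc_a, doc_c):", cosine(doc_a, doc_c))
```

Documents `doc_a` and `doc_b` may well be about the same subject, yet they share no term, so their similarity is exactly zero. With very sparse vectors most pairs look like this, which leaves a distance-based clustering method with very little signal to work with.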

Reference
 My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU as a Ph.D)
 ‘Random Thoughts in Clustering’
 ‘Non-parametric Bayesian learning in discrete data’
 ‘The research of topic modeling in text mining’
 ‘Matrix factorization with user generated content’
 …, etc.
 Website
 You can download all slides of mine
➢ http://web.xidian.edu.cn/ysxu/teach.html
➢ http://liu.cs.uic.edu/yueshenxu/
➢ http://www.slideshare.net/obamaxys2011
➢ https://www.researchgate.net/profile/Yueshen_Xu

Q&A
