Thinking in (Text) Clustering
(No math, be not afraid)
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering, 2017/4/13
Outline
➢ Background
➢ What can be clustered?
➢ Problems in K-XXX (Means/Medoid/Center…)
➢ Similarity Measure
➢ Convex and Concave
➢ Problems in Gaussian Mixture Model
➢ Problems in Matrix Factorization
➢ Multinomial and Sparsity
Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution
Basics, not state-of-the-art
Background
➢ Information Overload
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
We need: Summarization, Visualization, Dimensionality Reduction
Background
Dimensionality Reduction (DR)
➢ Clustering
  ➢ Text Clustering, Webpage Clustering, Image Clustering…
➢ Summarization
  ➢ Document Summarization, Image Summarization…
➢ Factorization
  ➢ Rating Matrix Factorization, Image Non-negative Factorization
Clustering (Text)
➢ Basic Requirement: Automatic, Applicable, Explainable
Some Concepts
➢ Related Research Areas
  ➢ Dimensionality Reduction (DR)
  ➢ Text Mining
  ➢ Natural Language Processing
  ➢ Computational Linguistics
  ➢ Information Retrieval
  ➢ Artificial Intelligence
➢ (Text) Clustering
  ➢ We all know what (text) clustering is, right?
  ➢ Widely-accepted topic, since everyone knows it
[Figure: related research areas: Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering]
What can be clustered?
Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)
Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)
Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)
Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)
Data Sample 5: (▲▼♦), (♣♠█), (■□●)
Is there anything that cannot be clustered?
Yes, but not related to us
What can be clustered?
Anything over which a similarity measure can be defined
Matrix topology
All kinds of data can be clustered
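The claim that anything with a similarity measure can be clustered can be made concrete with a small sketch. The three functions below are illustrative choices, not measures prescribed by the slides: a distance-based similarity for numeric vectors (Data Sample 1), Jaccard similarity for word sets (Data Sample 3), and Jaccard similarity over character bigrams for strings (Data Sample 4). All function names are hypothetical.

```python
import math


def euclidean_similarity(a, b):
    """Similarity for numeric vectors: 1 / (1 + Euclidean distance)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def jaccard_similarity(a, b):
    """Similarity for collections of tokens: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def bigram_similarity(s, t):
    """Similarity for strings: Jaccard similarity over character bigrams."""
    def bigrams(u):
        return {u[i:i + 2] for i in range(len(u) - 1)}
    return jaccard_similarity(bigrams(s), bigrams(t))


# Data Sample 1: numeric vectors
print(euclidean_similarity((1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41)))
# Data Sample 3: word sets
print(jaccard_similarity(["China", "modern", "people", "gov."],
                         ["policy", "paper", "conference", "chair"]))
# Data Sample 4: strings
print(bigram_similarity("aaabbbccc", "dddfffggg"))
```

Once such a function exists for a data type, any clustering method that only needs pairwise similarities can be applied to it.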
K-Means Trap
Defects of K-Means, K-Medoid, K-XXX
➢ How many clusters (K)?
➢ Where are the initial centers?
➢ Do the data really form a sphere?
➢ Do the data really follow Minkowski/Euclidean distance?
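The question about initial centers can be illustrated with a minimal sketch of Lloyd's algorithm, assuming 2-D points and caller-supplied initial centers. The toy data and the helper names `kmeans` and `sse` are made up for this illustration. The same four points are clustered twice with K = 2, and only the initial centers differ.

```python
def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm on 2-D points with caller-supplied initial centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance).
            j = min(range(len(centers)),
                    key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum((x - cx) ** 2 + (y - cy) ** 2
               for (cx, cy), c in zip(centers, clusters) for x, y in c)


# Four corners of a 4 x 1 rectangle (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])  # centers start on the short sides
bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])   # centers start on the long sides

print(sse(*good), sse(*bad))
```

With the first initialization the points are split left/right and the error is 1.0. With the second the algorithm stays in a top/bottom split with error 16.0 and never leaves it, which is why the starting centers matter.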
How about these?
What kind of data does K-XXX fit better?
What kind of data do methods relying on distance-similarity computation fit better?
CONVEX
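Why convexity matters can be shown with a small sketch, assuming two concentric rings as the non-convex case. K-Means assigns every point to its nearest center, so each cluster is a convex region and cannot wrap around another cluster. The ring data, the initial centers, and the names `ring` and `kmeans_labels` are all made up for this illustration.

```python
import math


def ring(radius, n=8, offset_deg=22.5):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(math.radians(offset_deg + 360.0 * k / n)),
             radius * math.sin(math.radians(offset_deg + 360.0 * k / n)))
            for k in range(n)]


def kmeans_labels(points, centers, iterations=20):
    """Lloyd's algorithm; returns the final cluster index of every point."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(len(centers)),
                      key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
                  for x, y in points]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels


inner, outer = ring(1.0), ring(5.0)
labels = kmeans_labels(inner + outer, centers=[(-1.0, 0.0), (1.0, 0.0)])

# Count how many inner-ring and outer-ring points each cluster received.
for j in (0, 1):
    print(j, labels[:8].count(j), labels[8:].count(j))
```

In this run each cluster receives four inner and four outer points: the data are cut into a left half and a right half, and the two rings are never separated.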
Alternative
➢ Gaussian Mixture Model
Why Gaussian? → Central limit theorem
Is the central limit theorem always applicable in real-world cases?
1. Parameter tuning
2. High applicability of the Gaussian distribution
How to estimate the parameters?
➢ Expectation-Maximization
➢ No closed-form solution
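Because there is no closed-form solution, the parameters are estimated iteratively. The sketch below shows Expectation-Maximization for a one-dimensional mixture of Gaussians, under several assumptions made only for illustration: the toy data, the starting values, the fixed number of iterations, and a small variance floor. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # variance floor (an assumption) to avoid a collapsing component
            )
            weights[j] = nj / len(data)
    return means, variances, weights


# Toy data: two groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

Each iteration alternates between soft assignment (E-step) and parameter re-estimation (M-step). The result depends on the starting values, so the initialization issue seen with K-Means does not disappear here.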
Alternative
➢ Matrix Factorization
No closed-form solution
’Cause we are not in the department of math
SVD, PMF, NMF, Tensor Factorization…
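One of the listed methods, NMF, can be sketched with the standard multiplicative update rules. This is a minimal pure-Python illustration, assuming a tiny made-up document-term matrix, a random non-negative start, and a fixed iteration count. The helper names are hypothetical, and a real application would use a numerical library.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Non-negative factorization V ~ W H via multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)] for i in range(n)]
    return w, h


# Toy document-term matrix: rows are documents, columns are terms.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
w, h = nmf(v, rank=2)

# Read a cluster for each document from its largest entry in W.
print([max(range(2), key=lambda a: row[a]) for row in w])
print(frobenius_error(v, w, h))
```

The factors are found by iterating from a random start rather than by solving an equation, which is the sense of "no closed-form solution" on the slide.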
Triangle
Is there no perfect method here?
What we probably want
➢ No constraint on the form of data
➢ No assumption about the data distribution
➢ Closed-form solution
Triangle borrowed from distributed computing
Triangle (Cont.)
I do not know whether such a method exists or not
[Figure: triangle with corners Form, Distribution, and Closed-form solution; the methods shown are Hierarchical Clustering (?), GMM/Gaussian Process, K-Means/Medoid, and Matrix Factorization; the label "impossible" appears three times in the figure]
Multinomial Distribution
Discrete Data (Text)
One document:
(0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,…)
Multinomial distribution
Clustering ← Sampling (Markov Chain Monte Carlo)
Friendly to sparsity
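The sparsity-friendliness of the multinomial view can be sketched as follows. This does not implement the Markov Chain Monte Carlo sampling named on the slide; it only shows the per-document likelihood that a sampler for a multinomial mixture evaluates when it picks a cluster. The two cluster word distributions, their names, and the probability floor for unseen words are made-up assumptions.

```python
import math
from collections import Counter


def multinomial_log_likelihood(doc_counts, word_probs, floor=1e-12):
    """Log-likelihood of word counts under one cluster's word distribution.

    The multinomial coefficient is omitted: it is identical for every cluster,
    so it does not change which cluster scores highest.
    """
    return sum(n * math.log(word_probs.get(w, floor)) for w, n in doc_counts.items())


# Hypothetical word distributions for two clusters (illustrative numbers only).
clusters = {
    "politics": {"China": 0.3, "policy": 0.3, "report": 0.2, "meeting": 0.2},
    "academia": {"paper": 0.4, "conference": 0.3, "chair": 0.2, "report": 0.1},
}

# Sparse view of the document above: only the non-zero entries are stored.
doc = Counter(["China", "report", "policy", "meeting", "meeting", "report"])

scores = {name: multinomial_log_likelihood(doc, probs) for name, probs in clusters.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Only the words that actually occur contribute to the sum, so the long runs of zeros in the document vector cost nothing.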
Sparsity
Sparsity brings a lot of problems
➢ Also in clustering → What can we do?
  ➢ Ensemble Learning (Ensemble clustering)
  ➢ Missing values pre-filling
  ➢ Tuning ☺
  ➢ …
10000 words → 1 term
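One way to see why sparsity hurts clustering is to measure similarity between short documents drawn from a large vocabulary. The sketch below stores each document as a dictionary of its non-zero terms. The three documents and the vocabulary size of 10000 are illustrative assumptions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparsity(vector, vocabulary_size):
    """Fraction of vocabulary entries that are zero for this document."""
    return 1.0 - len(vector) / vocabulary_size


doc_a = {"China": 1, "policy": 1, "report": 2}
doc_b = {"government": 1, "statement": 1, "report": 1}
doc_c = {"paper": 1, "conference": 2}

print(sparsity(doc_a, 10000))
print(cosine_similarity(doc_a, doc_b))  # one shared term
print(cosine_similarity(doc_a, doc_c))  # no shared term
```

Documents that share no term get similarity exactly zero, however related they are in meaning, so most pairwise similarities in a sparse collection carry no information.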
Reference
➢ My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU as a Ph.D.)
  ➢ ‘Random Thoughts in Clustering’
  ➢ ‘Non-parametric Bayesian learning in discrete data’
  ➢ ‘The research of topic modeling in text mining’
  ➢ ‘Matrix factorization with user generated content’
  ➢ …, etc.
➢ Website
  ➢ You can download all of my slides
    ➢ http://web.xidian.edu.cn/ysxu/teach.html
    ➢ http://liu.cs.uic.edu/yueshenxu/
    ➢ http://www.slideshare.net/obamaxys2011
    ➢ https://www.researchgate.net/profile/Yueshen_Xu
Q&A
