Thinking in (Text) Clustering
(No math, be not afraid)
Yueshen Xu (lecturer)
ysxu@xidian.edu.cn / xuyueshen@163.com
Data and Knowledge Engineering Research Center
Xidian University
Text Mining & NLP & ML
Software Engineering, 2017/4/13
Outline
• Background
• What can be clustered?
• Problems in K-XXX (Means/Medoid/Center…)
• Similarity Measure
• Convex and Concave
• Problems in Gaussian Mixture Model
• Problems in Matrix Factorization
• Multinomial and Sparsity
Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution
Basics, not state-of-the-art
Background
• Information Overload
We need: summarization, visualization, dimensionality reduction
Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
Background
Dimensionality Reduction (DR)
• Clustering
  ➢ Text Clustering, Webpage Clustering, Image Clustering…
• Summarization
  ➢ Document Summarization, Image Summarization…
• Factorization
  ➢ Rating Matrix Factorization, Image Non-negative Factorization
• Basic Requirement: Automatic, Applicable, Explainable
Clustering (Text)
Some Concepts
• Related Research Areas
  ➢ Dimensionality Reduction (DR)
  ➢ Text Mining
  ➢ Natural Language Processing
  ➢ Computational Linguistics
  ➢ Information Retrieval
  ➢ Artificial Intelligence
• (Text) Clustering
  ➢ We all know what (text) clustering is, right?
  ➢ Widely accepted topic, since everyone knows it
[Figure: overlapping research areas: Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering]
What can be clustered?
Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78)
Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0)
Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK)
Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj)
Data Sample 5: (▲▼♦), (♣♠█), (■□●)
Is there anything that cannot be clustered?
Yes, but not in the cases that concern us
What can be clustered?
Anything over which a similarity measure can be defined
Matrix topology
All kinds of data can be clustered
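To make "anything over which a similarity measure can be defined" concrete, here is a minimal sketch that pairs three of the sample data types with one plausible measure each. The choice of measures (Euclidean distance, Jaccard similarity, edit distance) and all function names are illustrative assumptions, not something the slides prescribe.

```python
import math


def euclidean_distance(a, b):
    """Distance between two numeric vectors of equal length (like Data Sample 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(a, b):
    """Overlap between two groups of words treated as sets (like Data Sample 3)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance between two strings (like Data Sample 4)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Once such a function exists for a data type, any clustering method that only needs pairwise (dis)similarities can in principle be applied to it.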
K-Means Trap
Defects of K-Means, K-Medoid, K-XXX
• How many clusters (K)?
• Where are the initial centers?
• Do the data really form a sphere?
• Do the data really suit Minkowski/Euclidean distance?
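The first two questions can be seen in a few lines of code. Below is a minimal sketch of Lloyd's algorithm (the usual K-Means iteration) in which both K and the initial centers are left to the caller; the function names and the four toy points are assumptions made for illustration.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm; `centers` is the caller-chosen initialization."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: nearest center under Euclidean distance.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment


def within_cluster_sse(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))


# Toy data (hypothetical): corners of a wide rectangle, K = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, [(0, 0.5), (4, 0.5)])  # one center per short side
bad = kmeans(points, [(2, 0), (2, 1)])       # both centers in the middle
```

With the first initialization the result groups the left pair and the right pair (SSE 1.0); with the second, the iteration stops immediately at a split into bottom and top pairs (SSE 16.0). Same data, same K, different answer.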
How about these?
What kind of data does K-XXX fit better?
What kind of data do methods relying on distance-based similarity computation fit better?
CONVEX
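One way to see why convexity matters: two concentric rings are an obvious pair of clusters to the eye, yet they share the same centroid, so a centroid cannot tell them apart. The sketch below builds such rings (radii, point counts, and the two trial centers are arbitrary assumptions) and checks both observations.

```python
import math


def ring(radius, n=12):
    """n evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest(point, centers):
    """Index of the closest center under Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


inner, outer = ring(1.0), ring(5.0)

# Both rings are centered on the origin, so their centroids coincide.
gap = math.dist(centroid(inner), centroid(outer))

# Two centers split the plane along a straight line; here each side of
# that line picks up points from both rings.
centers = [(-3.0, 0.0), (3.0, 0.0)]
inner_labels = {nearest(p, centers) for p in inner}
outer_labels = {nearest(p, centers) for p in outer}
```

Nearest-center assignment always produces convex regions, which is why ring-shaped or crescent-shaped groups are out of reach for K-XXX without changing the representation or the similarity measure.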
Alternative
• Gaussian Mixture Model
Why Gaussian? → Central limit theorem
Is the central limit theorem always applicable in real-world cases?
1. Parameter Tuning
2. High applicability of Gaussian distribution
How to estimate parameters?
Expectation-Maximization
No closed-form solution
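Since there is no closed-form solution, the parameters are found by iterating. The following is a minimal sketch of Expectation-Maximization for a one-dimensional Gaussian mixture; the toy data, the starting values, and the variance floor `min_var` are assumptions added for illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_comp = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(n_comp)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted re-estimates given the responsibilities.
        for k in range(n_comp):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # floor guards against collapse
            weights[k] = nk / len(data)
    return means, variances, weights


# Toy data (hypothetical): two well-separated groups around 1 and 5.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

On this easy data the two means settle close to 1 and 5. The starting values still have to be chosen by hand, so the "where are the initial centers?" question from K-Means has not gone away.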
Alternative
• Matrix Factorization
No closed-form solution
Because we are not in the department of math
SVD, PMF, NMF, Tensor Factorization…
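As with GMM, the factors are found iteratively. Here is a minimal sketch of one member of this family, NMF with the standard multiplicative updates for squared error, written with plain lists; the helper names, the `eps` guard, and the idea of reading clusters off the largest entry per row are illustrative assumptions.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def row_clusters(w):
    """One way to read clusters off W: each row joins its largest component."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in w]
```

The result depends on the random starting point (`seed`) and on the chosen `rank`, which plays the same role as K.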
Triangle
Is there no perfect method here?
What we probably want
• No constraint on the form of data
• No assumption about data distribution
• Closed-form solution
Triangle borrowed from distributed computing
Triangle (Cont.)
I do not know whether such a method exists or not
[Figure: triangle with corners Form, Distribution, and Closed-solution; labels in the figure: Hierarchical Clustering?, GMM/Gaussian Process, K-Means/Medoid, Matrix Factorization, and three "impossible" marks]
Multinomial Distribution
Discrete Data (Text)
One document: (0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0,meeting,0,0,0,0,report,0,….)
Multinomial distribution
Clustering → Sampling (Markov Chain Monte Carlo)
Friendly to sparsity
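A small sketch of why the multinomial view is friendly to sparsity: only the terms that actually occur in a document enter the likelihood, so the long runs of zeros cost nothing. The two cluster-level term distributions below are made-up numbers for illustration, and the code scores a document against fixed distributions; it does not perform the sampling step named on the slide.

```python
import math
from collections import Counter


def term_counts(tokens):
    """Sparse bag of words: only terms that occur are stored."""
    return Counter(tokens)


def multinomial_log_likelihood(counts, term_probs):
    """log P(counts | term_probs) under a multinomial; absent terms add nothing."""
    n = sum(counts.values())
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_coeff + sum(c * math.log(term_probs[t]) for t, c in counts.items())


# The non-zero entries of the example document.
doc = term_counts(["China", "report", "policy", "meeting", "meeting", "report"])

# Hypothetical per-cluster term distributions (each sums to 1).
clusters = {
    "politics": {"China": 0.3, "report": 0.2, "policy": 0.3, "meeting": 0.2},
    "sports": {"China": 0.05, "report": 0.05, "policy": 0.05, "meeting": 0.05,
               "match": 0.8},
}

# Assign the document to the cluster under which it is most likely.
best = max(clusters, key=lambda name: multinomial_log_likelihood(doc, clusters[name]))
```

Here the document lands in the "politics" cluster. In a full model the cluster distributions are unknown and would be inferred together with the assignments.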
Sparsity
Sparsity brings a lot of problems
• Also in clustering → What can we do?
➢ Ensemble Learning (Ensemble clustering)
➢ Pre-filling missing values
➢ Tuning ☺
➢ …
10000 words → 1 term
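One concrete problem, sketched under the assumption that documents are stored as term-to-weight dictionaries and compared with cosine similarity (neither is stated on the slide): two very short documents about the same subject can share no term at all, so their similarity is exactly zero.

```python
import math


def density(doc, vocabulary_size):
    """Fraction of the vocabulary that actually appears in the document."""
    return len(doc) / vocabulary_size


def sparse_cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical one-term documents over a 10000-word vocabulary.
doc_a = {"policy": 1}
doc_b = {"government": 1}
similarity = sparse_cosine(doc_a, doc_b)
fill = density(doc_a, 10000)
```

A distance-based method sees these two documents as completely unrelated, which is the kind of failure that pre-filling or ensembles try to soften.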
Reference
• My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU as a Ph.D)
  ➢ ‘Random Thoughts in Clustering’
  ➢ ‘Non-parametric Bayesian learning in discrete data’
  ➢ ‘The research of topic modeling in text mining’
  ➢ ‘Matrix factorization with user generated content’
  ➢ …, etc.
• Website
  ➢ You can download all slides of mine
    ➢ http://web.xidian.edu.cn/ysxu/teach.html
    ➢ http://liu.cs.uic.edu/yueshenxu/
    ➢ http://www.slideshare.net/obamaxys2011
    ➢ https://www.researchgate.net/profile/Yueshen_Xu
Q&A

More Related Content

What's hot

(Hierarchical) topic modeling
(Hierarchical) topic modeling (Hierarchical) topic modeling
(Hierarchical) topic modeling
Yueshen Xu
 
Interactive Learning of Bayesian Networks
Interactive Learning of Bayesian NetworksInteractive Learning of Bayesian Networks
Interactive Learning of Bayesian Networks
NTNU
 
Utilizing Graph Theory to Model Forensic Examination
Utilizing Graph Theory to Model Forensic ExaminationUtilizing Graph Theory to Model Forensic Examination
Utilizing Graph Theory to Model Forensic Examination
AM Publications,India
 
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
ijcoa
 
Data visualization
Data visualizationData visualization
Data visualization
Baijayanti Chakraborty
 
Seminar_Koga_Yuki_v2.pdf
Seminar_Koga_Yuki_v2.pdfSeminar_Koga_Yuki_v2.pdf
Seminar_Koga_Yuki_v2.pdf
IkedaYuki
 
Argumentation Trails and Topic Maps
Argumentation Trails and Topic MapsArgumentation Trails and Topic Maps
Argumentation Trails and Topic Maps
Lutz Maicher
 
Collnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainCollnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainHan Woo PARK
 
Collnet _Conference_Turkey
Collnet _Conference_TurkeyCollnet _Conference_Turkey
Collnet _Conference_TurkeyGohar Feroz Khan
 
Maths concept map
Maths concept mapMaths concept map
Maths concept map
tamara hope
 
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
cscpconf
 
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
CSCJournals
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep Learning
Pramit Choudhary
 

What's hot (15)

(Hierarchical) topic modeling
(Hierarchical) topic modeling (Hierarchical) topic modeling
(Hierarchical) topic modeling
 
Resume
ResumeResume
Resume
 
Interactive Learning of Bayesian Networks
Interactive Learning of Bayesian NetworksInteractive Learning of Bayesian Networks
Interactive Learning of Bayesian Networks
 
Utilizing Graph Theory to Model Forensic Examination
Utilizing Graph Theory to Model Forensic ExaminationUtilizing Graph Theory to Model Forensic Examination
Utilizing Graph Theory to Model Forensic Examination
 
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
A Study on Transition of Logic Connectives to Induced Linked Fuzzy Relational...
 
Data visualization
Data visualizationData visualization
Data visualization
 
Seminar_Koga_Yuki_v2.pdf
Seminar_Koga_Yuki_v2.pdfSeminar_Koga_Yuki_v2.pdf
Seminar_Koga_Yuki_v2.pdf
 
Argumentation Trails and Topic Maps
Argumentation Trails and Topic MapsArgumentation Trails and Topic Maps
Argumentation Trails and Topic Maps
 
Collnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainCollnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domain
 
Collnet _Conference_Turkey
Collnet _Conference_TurkeyCollnet _Conference_Turkey
Collnet _Conference_Turkey
 
Maths concept map
Maths concept mapMaths concept map
Maths concept map
 
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
FIRST-ORDER MATHEMATICAL FUZZY LOGIC WITH HEDGES
 
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
Automated Education Propositional Logic Tool (AEPLT): Used For Computation in...
 
algorithms
algorithmsalgorithms
algorithms
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep Learning
 

Similar to Thinking in clustering yueshen xu

Futuristic knowledge management ppt bec bagalkot mba
Futuristic knowledge management ppt bec bagalkot mbaFuturistic knowledge management ppt bec bagalkot mba
Futuristic knowledge management ppt bec bagalkot mba
Babasab Patil
 
Geometric Deep Learning
Geometric Deep Learning Geometric Deep Learning
Geometric Deep Learning
PetteriTeikariPhD
 
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
Christoph Lange
 
Machine Learning basics
Machine Learning basicsMachine Learning basics
Machine Learning basics
NeeleEilers
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
Dr. Abdul Ahad Abro
 
Automatically Answering And Generating Machine Learning Final Exams
Automatically Answering And Generating Machine Learning Final ExamsAutomatically Answering And Generating Machine Learning Final Exams
Automatically Answering And Generating Machine Learning Final Exams
Richard Hogue
 
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
Leveraging Flat Files from the Canvas LMS Data Portal at K-StateLeveraging Flat Files from the Canvas LMS Data Portal at K-State
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
Shalin Hai-Jew
 
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
cscpconf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
Dr Arash Najmaei ( Phd., MBA, BSc)
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
tuxette
 
Application of discrete mathematics in IT
Application of discrete mathematics in ITApplication of discrete mathematics in IT
Application of discrete mathematics in IT
ShahidAbbas52
 
T OWARDS A S YSTEM D YNAMICS M ODELING M E- THOD B ASED ON DEMATEL
T OWARDS A  S YSTEM  D YNAMICS  M ODELING  M E- THOD B ASED ON  DEMATELT OWARDS A  S YSTEM  D YNAMICS  M ODELING  M E- THOD B ASED ON  DEMATEL
T OWARDS A S YSTEM D YNAMICS M ODELING M E- THOD B ASED ON DEMATEL
ijcsit
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
Shanmugasundaram M
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and Keras
Jie He
 
Dms introduction Sharmila Chidaravalli
Dms introduction Sharmila ChidaravalliDms introduction Sharmila Chidaravalli
Dms introduction Sharmila Chidaravalli
SharmilaChidaravalli
 
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
Beat Signer
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics Corporation
Rich Heimann
 
Big Data Conference
Big Data ConferenceBig Data Conference
Big Data ConferenceDataTactics
 
Irmac presentation for website
Irmac presentation for websiteIrmac presentation for website
Irmac presentation for website
Frank Barnes
 

Similar to Thinking in clustering yueshen xu (20)

Futuristic knowledge management ppt bec bagalkot mba
Futuristic knowledge management ppt bec bagalkot mbaFuturistic knowledge management ppt bec bagalkot mba
Futuristic knowledge management ppt bec bagalkot mba
 
Geometric Deep Learning
Geometric Deep Learning Geometric Deep Learning
Geometric Deep Learning
 
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
Bringing Mathematics To the Web of Data: the Case of the Mathematics Subject ...
 
Machine Learning basics
Machine Learning basicsMachine Learning basics
Machine Learning basics
 
Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
MODEL_FOR_SEMANTICALLY_RICH_POINT_CLOUD.pdf
MODEL_FOR_SEMANTICALLY_RICH_POINT_CLOUD.pdfMODEL_FOR_SEMANTICALLY_RICH_POINT_CLOUD.pdf
MODEL_FOR_SEMANTICALLY_RICH_POINT_CLOUD.pdf
 
Automatically Answering And Generating Machine Learning Final Exams
Automatically Answering And Generating Machine Learning Final ExamsAutomatically Answering And Generating Machine Learning Final Exams
Automatically Answering And Generating Machine Learning Final Exams
 
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
Leveraging Flat Files from the Canvas LMS Data Portal at K-StateLeveraging Flat Files from the Canvas LMS Data Portal at K-State
Leveraging Flat Files from the Canvas LMS Data Portal at K-State
 
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Application of discrete mathematics in IT
Application of discrete mathematics in ITApplication of discrete mathematics in IT
Application of discrete mathematics in IT
 
T OWARDS A S YSTEM D YNAMICS M ODELING M E- THOD B ASED ON DEMATEL
T OWARDS A  S YSTEM  D YNAMICS  M ODELING  M E- THOD B ASED ON  DEMATELT OWARDS A  S YSTEM  D YNAMICS  M ODELING  M E- THOD B ASED ON  DEMATEL
T OWARDS A S YSTEM D YNAMICS M ODELING M E- THOD B ASED ON DEMATEL
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and Keras
 
Dms introduction Sharmila Chidaravalli
Dms introduction Sharmila ChidaravalliDms introduction Sharmila Chidaravalli
Dms introduction Sharmila Chidaravalli
 
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
Course Review - Lecture 13 - Introduction to Databases (1007156ANR)
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics Corporation
 
Big Data Conference
Big Data ConferenceBig Data Conference
Big Data Conference
 
Irmac presentation for website
Irmac presentation for websiteIrmac presentation for website
Irmac presentation for website
 

More from Yueshen Xu

Context aware service recommendation
Context aware service recommendationContext aware service recommendation
Context aware service recommendation
Yueshen Xu
 
Course review for ir class 本科课件
Course review for ir class 本科课件Course review for ir class 本科课件
Course review for ir class 本科课件
Yueshen Xu
 
Semantic web 本科课件
Semantic web 本科课件Semantic web 本科课件
Semantic web 本科课件
Yueshen Xu
 
Recommender system slides for undergraduate
Recommender system slides for undergraduateRecommender system slides for undergraduate
Recommender system slides for undergraduate
Yueshen Xu
 
推荐系统 本科课件
 推荐系统 本科课件 推荐系统 本科课件
推荐系统 本科课件
Yueshen Xu
 
Text classification 本科课件
Text classification 本科课件Text classification 本科课件
Text classification 本科课件
Yueshen Xu
 
Text clustering (information retrieval, in chinese)
Text clustering (information retrieval, in chinese)Text clustering (information retrieval, in chinese)
Text clustering (information retrieval, in chinese)
Yueshen Xu
 
Non parametric bayesian learning in discrete data
Non parametric bayesian learning in discrete dataNon parametric bayesian learning in discrete data
Non parametric bayesian learning in discrete data
Yueshen Xu
 
聚类 (Clustering)
聚类 (Clustering)聚类 (Clustering)
聚类 (Clustering)
Yueshen Xu
 
Yueshen xu cv
Yueshen xu cvYueshen xu cv
Yueshen xu cv
Yueshen Xu
 
徐悦甡简历
徐悦甡简历徐悦甡简历
徐悦甡简历
Yueshen Xu
 
Learning to recommend with user generated content
Learning to recommend with user generated contentLearning to recommend with user generated content
Learning to recommend with user generated content
Yueshen Xu
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Yueshen Xu
 
Summary on the Conference of WISE 2013
Summary on the Conference of WISE 2013Summary on the Conference of WISE 2013
Summary on the Conference of WISE 2013
Yueshen Xu
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
Yueshen Xu
 
Acoustic modeling using deep belief networks
Acoustic modeling using deep belief networksAcoustic modeling using deep belief networks
Acoustic modeling using deep belief networks
Yueshen Xu
 
Summarization for dragon star program
Summarization for dragon  star programSummarization for dragon  star program
Summarization for dragon star programYueshen Xu
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
Yueshen Xu
 
Aggregation computation over distributed data streams
Aggregation computation over distributed data streamsAggregation computation over distributed data streams
Aggregation computation over distributed data streams
Yueshen Xu
 
Analysis on tcp ip protocol stack
Analysis on tcp ip protocol stackAnalysis on tcp ip protocol stack
Analysis on tcp ip protocol stack
Yueshen Xu
 

More from Yueshen Xu (20)

Context aware service recommendation
Context aware service recommendationContext aware service recommendation
Context aware service recommendation
 
Course review for ir class 本科课件
Course review for ir class 本科课件Course review for ir class 本科课件
Course review for ir class 本科课件
 
Semantic web 本科课件
Semantic web 本科课件Semantic web 本科课件
Semantic web 本科课件
 
Recommender system slides for undergraduate
Recommender system slides for undergraduateRecommender system slides for undergraduate
Recommender system slides for undergraduate
 
推荐系统 本科课件
 推荐系统 本科课件 推荐系统 本科课件
推荐系统 本科课件
 
Text classification 本科课件
Text classification 本科课件Text classification 本科课件
Text classification 本科课件
 
Text clustering (information retrieval, in chinese)
Text clustering (information retrieval, in chinese)Text clustering (information retrieval, in chinese)
Text clustering (information retrieval, in chinese)
 
Non parametric bayesian learning in discrete data
Non parametric bayesian learning in discrete dataNon parametric bayesian learning in discrete data
Non parametric bayesian learning in discrete data
 
聚类 (Clustering)
聚类 (Clustering)聚类 (Clustering)
聚类 (Clustering)
 
Yueshen xu cv
Yueshen xu cvYueshen xu cv
Yueshen xu cv
 
徐悦甡简历
徐悦甡简历徐悦甡简历
徐悦甡简历
 
Learning to recommend with user generated content
Learning to recommend with user generated contentLearning to recommend with user generated content
Learning to recommend with user generated content
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
 
Summary on the Conference of WISE 2013
Summary on the Conference of WISE 2013Summary on the Conference of WISE 2013
Summary on the Conference of WISE 2013
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
Acoustic modeling using deep belief networks
Acoustic modeling using deep belief networksAcoustic modeling using deep belief networks
Acoustic modeling using deep belief networks
 
Summarization for dragon star program
Summarization for dragon  star programSummarization for dragon  star program
Summarization for dragon star program
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Aggregation computation over distributed data streams
Aggregation computation over distributed data streamsAggregation computation over distributed data streams
Aggregation computation over distributed data streams
 
Analysis on tcp ip protocol stack
Analysis on tcp ip protocol stackAnalysis on tcp ip protocol stack
Analysis on tcp ip protocol stack
 

Recently uploaded

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 

Recently uploaded (20)

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 

Thinking in clustering yueshen xu

  • 1. Thinking in (Text) Clustering (No math, be not afraid). Yueshen Xu (lecturer), ysxu@xidian.edu.cn / xuyueshen@163.com. Data and Knowledge Engineering Research Center, Xidian University. Text Mining & NLP & ML
  • 2. Software Engineering, 2017/4/13. Outline: Background; What can be clustered?; Problems in K-XXX (Means/Medoid/Center…); Similarity Measure; Convex and Concave; Problems in Gaussian Mixture Model; Problems in Matrix Factorization; Multinomial and Sparsity. Keywords: Clustering, K-Means/Medoid, Similarity Computation, GMM, MF, Multinomial Distribution. Basics, not state-of-the-art
  • 3. Background: Information overload. We need: summarization, visualization, dimensionality reduction. Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
  • 4. Background: Dimensionality Reduction (DR). Clustering: text clustering, webpage clustering, image clustering… Summarization: document summarization, image summarization… Factorization: rating matrix factorization, image non-negative factorization. Basic requirements: automatic, applicable, explainable. Clustering (Text)
  • 5. Some Concepts. Related research areas: Dimensionality Reduction (DR), Text Mining, Natural Language Processing, Computational Linguistics, Information Retrieval, Artificial Intelligence, (Text) Clustering. Diagram labels: Information Retrieval, Computational Linguistics, Natural Language Processing, LSA/Topic Model, Text Mining, DR, Data Mining, Artificial Intelligence, Machine Learning, Machine Translation, (Text) Clustering. We all know what (text) clustering is, right? A widely accepted topic, since everyone knows it
  • 6. What can be clustered? Data Sample 1: (1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41), (5.234, 3.56, 4.454, 6.78). Data Sample 2: (1), (0), (1), (0), (1), (1), (1), (0), (1), (0). Data Sample 3: (China, modern, people, gov.), (policy, paper, conference, chair), (report, solution, UN, UK). Data Sample 4: (aaabbbccc), (dddfffggg), (hhhiiiijjj). Data Sample 5: (▲▼♦), (♣♠█), (■□●)
  • 7. Is there anything that cannot be clustered? Yes, but not related to us. What can be clustered? Anything over which a similarity measure can be defined. Matrix topology. All kinds of data can be clustered
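The claim that anything with a similarity measure can be clustered can be made concrete with a small sketch. The three functions below are illustrative choices, not taken from the slides: a distance-based score for numeric vectors (as in Data Sample 1), Jaccard overlap for word sets (Data Sample 3), and a normalized edit distance for strings (Data Sample 4). The function names are hypothetical.

```python
import math


def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + math.dist(a, b))


def jaccard_similarity(a, b):
    """Share of distinct tokens that two word lists have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_similarity(s, t):
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0


print(euclidean_similarity((1.2, 1.4, 2.234, 3.231), (8.2, 6.4, 4.243, 5.41)))
print(jaccard_similarity(["China", "modern", "people", "gov."],
                         ["policy", "paper", "conference", "chair"]))
print(string_similarity("aaabbbccc", "dddfffggg"))
```

Any of these scores could feed a similarity-based clustering method; which one is appropriate depends on the data, which is the point the slide makes.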
  • 8. K-Means Trap. Defects of K-Means, K-Medoid, K-XXX: How many clusters (K)? Where are the initial centers? Do the data really form a sphere? Do the data really follow Minkowski/Euclidean distance?
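The question about initial centers can be illustrated with a minimal sketch of the standard Lloyd iteration for K-Means. The four data points and both sets of starting centers are made up for illustration, and the helper names `assign` and `kmeans` are hypothetical.

```python
import math


def assign(points, centers):
    """Label each point with the index of its nearest center (Euclidean)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    centers = [tuple(float(x) for x in c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(xs) / len(members) for xs in zip(*members))
                if members else old
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    # Sum of squared distances from each point to its assigned center.
    sse = sum(math.dist(p, centers[k]) ** 2 for p, k in zip(points, labels))
    return labels, centers, sse


# Corners of a wide rectangle: the natural grouping is left pair vs. right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, [(0, 0), (4, 1)])
bad = kmeans(points, [(2, 0), (2, 1)])
print("good start:", good[0], "SSE =", good[2])
print("bad start: ", bad[0], "SSE =", bad[2])
```

With the second start the algorithm stops immediately at a stable but much worse partition (bottom pair vs. top pair), so the result depends on where the centers begin. A common mitigation is to run several random restarts and keep the lowest-error result.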
  • 9. How about these? What kind of data do K-XXX methods fit better? What kind of data do methods that rely on distance-similarity computation fit better? CONVEX
  • 11. Alternative → Gaussian Mixture Model. Why Gaussian → central limit theorem. Is the central limit theorem always applicable in real-world cases? 1. Parameter tuning. 2. High applicability of the Gaussian distribution. How to estimate parameters? Expectation-Maximization. No closed-form solution
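Since the mixture has no closed-form solution, Expectation-Maximization alternates two steps that are each easy to compute. Below is a minimal sketch for a one-dimensional mixture of two Gaussians. The data, the choice of two components, the initialization at the minimum and maximum, and the variance floor are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_iter=100, var_floor=1e-6):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, var_floor)
    weights = [0.5, 0.5]
    means = [min(data), max(data)]   # assumed initialization
    variances = [grand_var, grand_var]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean, and variance from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, var_floor)  # guard against collapse
    return weights, means, variances


data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 4.8, 5.2]
weights, means, variances = fit_gmm_1d(data)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

The sketch ignores numerical issues that a real implementation would handle (for example, working in log space when densities underflow), and like K-Means it still requires the number of components to be chosen in advance.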
  • 12. Alternative → Matrix Factorization. No closed-form solution. 'Cause we are not in the department of math. SVD, PMF, NMF, Tensor Factorization…
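As one illustration of factorization without a closed-form solution, the sketch below runs non-negative matrix factorization (NMF) with the well-known multiplicative update rules for squared error. The toy document-term matrix, the rank, the seed, and the helper names are assumptions for illustration; the slides do not specify an algorithm.

```python
import random


def matmul(a, b):
    """Plain-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Rows are documents, columns are terms; two blocks of co-occurring terms.
v = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
w, h = nmf(v, rank=2)
clusters = [max(range(2), key=lambda k: row[k]) for row in w]
print("reconstruction error:", frobenius_error(v, w, h))
print("cluster per document:", clusters)
```

Reading a cluster off the largest entry in each row of W is one simple convention. The result depends on the random initialization, which is another face of the missing closed-form solution.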
  • 13. Triangle. Is there no perfect method here? What we probably want: no constraint on the form of data; no assumption about the data distribution; a closed-form solution. Triangle borrowed from distributed computing
  • 14. Triangle (Cont.). I do not know whether such a method exists or not. Triangle labels: Form; Distribution; Closed-form solution; Hierarchical Clustering?; GMM/Gaussian Process; K-Means/Medoid; impossible; Matrix Factorization; impossible; impossible
  • 15. Multinomial Distribution. Discrete data (text). One document: (0,0,0,China,0,0,0,0,0,0,0,report,0,0,0,0,0,0,0,0,0,policy,0,0,0,0,0,0,0,meeting,0,0,0 meeting,0,0,0,0,report,0,….). Multinomial distribution. Clustering → Sampling. Markov Chain Monte Carlo. Friendly to sparsity
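One way to see why a multinomial model is friendly to sparsity: the likelihood of a document only involves the words that actually occur in it, so the zeros cost nothing. The sketch below scores a document against two hypothetical clusters and picks the more likely one. The example documents, the add-one smoothing, and the function names are assumptions, and the sampling (MCMC) step mentioned on the slide is not covered here.

```python
import math
from collections import Counter


def estimate_word_probs(docs, vocabulary, alpha=1.0):
    """Smoothed multinomial parameters of one cluster from its documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


def multinomial_log_likelihood(doc, word_probs, floor=1e-12):
    """Log-probability of a document's word counts under one cluster.

    The multinomial coefficient is omitted because it is the same for every
    cluster and so does not affect which cluster scores highest.
    Only words present in the document are visited.
    """
    return sum(count * math.log(word_probs.get(word, floor))
               for word, count in Counter(doc).items())


cluster_a = [["China", "policy", "report"], ["policy", "meeting", "report"]]
cluster_b = [["paper", "conference", "chair"], ["conference", "paper", "paper"]]
vocabulary = {word for doc in cluster_a + cluster_b for word in doc}

probs = {
    "A": estimate_word_probs(cluster_a, vocabulary),
    "B": estimate_word_probs(cluster_b, vocabulary),
}

new_doc = ["China", "report", "meeting", "meeting"]
scores = {name: multinomial_log_likelihood(new_doc, p) for name, p in probs.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

In a full mixture-of-multinomials clustering, the cluster parameters would themselves be unknown and estimated iteratively, for example by EM or by sampling.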
  • 16. Sparsity. Sparsity brings a lot of problems, also in clustering. What can we do? Ensemble learning (ensemble clustering); pre-filling missing values; tuning ☺; … 10000 words → 1 term
  • 17. Reference. My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU as a Ph.D): 'Random Thoughts in Clustering'; 'Non-parametric Bayesian learning in discrete data'; 'The research of topic modeling in text mining'; 'Matrix factorization with user generated content'; …, etc. Website (you can download all of my slides): http://web.xidian.edu.cn/ysxu/teach.html ; http://liu.cs.uic.edu/yueshenxu/ ; http://www.slideshare.net/obamaxys2011 ; https://www.researchgate.net/profile/Yueshen_Xu