SlideShare a Scribd company logo
1 of 31
Download to read offline
Data Clustering with R 
Yanchang Zhao 
http://www.RDataMining.com 
30 September 2014 
1 / 30
Outline 
Introduction 
The k-Means Clustering 
The k-Medoids Clustering 
Hierarchical Clustering 
Density-based Clustering 
Online Resources 
2 / 30
Data Clustering with R 1 
I k-means clustering with kmeans() 
I k-medoids clustering with pam() and pamk() 
I hierarchical clustering 
I density-based clustering with DBSCAN 
1Chapter 6: Clustering, in book R and Data Mining: Examples and Case 
Studies. http://www.rdatamining.com/docs/RDataMining.pdf 
3 / 30
Outline 
Introduction 
The k-Means Clustering 
The k-Medoids Clustering 
Hierarchical Clustering 
Density-based Clustering 
Online Resources 
4 / 30
k-means clustering 
set.seed(8953) 
iris2 <- iris 
iris2$Species <- NULL 
(kmeans.result <- kmeans(iris2, 3)) 
## K-means clustering with 3 clusters of sizes 38, 50, 62 
## 
## Cluster means: 
## Sepal.Length Sepal.Width Petal.Length Petal.Width 
## 1 6.850 3.074 5.742 2.071 
## 2 5.006 3.428 1.462 0.246 
## 3 5.902 2.748 4.394 1.434 
## 
## Clustering vector: 
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2... 
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3... 
## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3... 
## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1... 
## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3... 
## 
## Within cluster sum of squares by cluster: 
## [1] 23.88 15.15 39.82 
5 / 30
Results of k-Means Clustering 
Check clustering result against class labels (Species) 
table(iris$Species, kmeans.result$cluster) 
## 
## 1 2 3 
## setosa 0 50 0 
## versicolor 2 0 48 
## virginica 36 0 14 
I Class setosa" can be easily separated from the other clusters 
I Classes versicolor" and virginica" are to a small degree 
overlapped with each other. 
6 / 30
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster) 
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], 
col = 1:3, pch = 8, cex = 2) # plot cluster centers 
4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 
2.0 2.5 3.0 3.5 4.0 
Sepal.Length 
Sepal.Width 
7 / 30
Outline 
Introduction 
The k-Means Clustering 
The k-Medoids Clustering 
Hierarchical Clustering 
Density-based Clustering 
Online Resources 
8 / 30
The k-Medoids Clustering 
I Dierence from k-means: a cluster is represented with its 
center in the k-means algorithm, but with the object closest 
to the center of the cluster in the k-medoids clustering. 
I more robust than k-means in presence of outliers 
I PAM (Partitioning Around Medoids) is a classic algorithm for 
k-medoids clustering. 
I The CLARA algorithm is an enhanced technique of PAM by 
drawing multiple samples of data, applying PAM on each 
sample and then returning the best clustering. It performs 
better than PAM on larger data. 
I Functions pam() and clara() in package cluster 
I Function pamk() in package fpc does not require a user to 
choose k. 
9 / 30
Clustering with pamk() 
library(fpc) 
pamk.result - pamk(iris2) 
# number of clusters 
pamk.result$nc 
## [1] 2 
# check clustering against actual species 
table(pamk.result$pamobject$clustering, iris$Species) 
## 
## setosa versicolor virginica 
## 1 50 1 0 
## 2 0 49 50 
Two clusters: 
I setosa 
I a mixture of versicolor and virginica 
10 / 30
layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page 
plot(pamk.result$pamobject) 
clusplot(pam(x = sdata, k = k, diss = diss)) 
−3 −1 0 1 2 3 4 
−2 −1 0 1 2 3 
Component 1 Component 2 
Silhouette plot of pam(x = sdata, k = n = 150 2  clusters  Cj 
Average silhouette width : 0.69 
0.0 0.2 0.4 0.6 0.8 1.0 
Silhouette width si 
These two components explain 95.81 % of the point variability. 
j : nj | aveiÎCj  si 
1 : 51 | 0.81 
2 : 99 | 0.62 
layout(matrix(1)) # change back to one graph per page 11 / 30
I The left chart is a 2-dimensional clusplot (clustering plot) 
of the two clusters and the lines show the distance between 
clusters. 
I The right chart shows their silhouettes. A large si (almost 1) 
suggests that the corresponding observations are very well 
clustered, a small si (around 0) means that the observation 
lies between two clusters, and observations with a negative si 
are probably placed in the wrong cluster. 
I Since the average Si are respectively 0.81 and 0.62 in the 
above silhouette, the identi
ed two clusters are well clustered. 
12 / 30
Clustering with pam() 
# group into 3 clusters 
pam.result - pam(iris2, 3) 
table(pam.result$clustering, iris$Species) 
## 
## setosa versicolor virginica 
## 1 50 0 0 
## 2 0 48 14 
## 3 0 2 36 
Three clusters: 
I Cluster 1 is species setosa and is well separated from the 
other two. 
I Cluster 2 is mainly composed of versicolor, plus some cases 
from virginica. 
I The majority of cluster 3 are virginica, with two cases from 
versicolor. 
13 / 30
layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page 
plot(pam.result) 
clusplot(pam(x = iris2, k = 3)) 
−3 −2 −1 0 1 2 3 
−3 −2 −1 0 1 2 
Component 1 Component 2 
Silhouette plot of pam(x = iris2, k = 3) 
n = 150 3  clusters  Cj 
j : nj | aveiÎCj  si 
1 : 50 | 0.80 
2 : 62 | 0.42 
3 : 38 | 0.45 
0.0 0.2 0.4 0.6 0.8 1.0 
Silhouette width si 
These two components explain 95.81 % of the point variability. 
Average silhouette width : 0.55 
layout(matrix(1)) # change back to one graph per page 14 / 30
Results of Clustering 
I In this example, the result of pam() seems better, because it 
identi
es three clusters, corresponding to three species. 
I Note that we cheated by setting k = 3 when using pam(), 
which is already known to us as the number of species. 
15 / 30
Outline 
Introduction 
The k-Means Clustering 
The k-Medoids Clustering 
Hierarchical Clustering 
Density-based Clustering 
Online Resources 
16 / 30
Hierarchical Clustering of the iris Data 
set.seed(2835) 
# draw a sample of 40 records from the iris data, so that the 
# clustering plot will not be over crowded 
idx - sample(1:dim(iris)[1], 40) 
irisSample - iris[idx, ] 
# remove class label 
irisSample$Species - NULL 
# hierarchical clustering 
hc - hclust(dist(irisSample), method = ave) 
# plot clusters 
plot(hc, hang = -1, labels = iris$Species[idx]) 
# cut tree into 3 clusters 
rect.hclust(hc, k = 3) 
# get cluster IDs 
groups - cutree(hc, k = 3) 
17 / 30
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
setosa 
versicolor 
versicolor 
versicolor 
virginica 
virginica 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
versicolor 
virginica 
virginica 
virginica 
virginica 
virginica 
virginica 
virginica 
virginica 
0 1 2 3 4 
Cluster Dendrogram 
hclust (*, average) 
dist(irisSample) 
Height 
18 / 30
Outline 
Introduction 
The k-Means Clustering 
The k-Medoids Clustering 
Hierarchical Clustering 
Density-based Clustering 
Online Resources 
19 / 30
Density-based Clustering 
I Group objects into one cluster if they are connected to one 
another by densely populated area 
I The DBSCAN algorithm from package fpc provides a 
density-based clustering for numeric data. 
I Two key parameters in DBSCAN: 
I eps: reachability distance, which de
nes the size of 
neighborhood; and 
I MinPts: minimum number of points. 
I If the number of points in the neighborhood of point  is no 
less than MinPts, then  is a dense point. All the points in its 
neighborhood are density-reachable from  and are put into 
the same cluster as . 
I Can discover clusters with various shapes and sizes 
I Insensitive to noise 
I The k-means algorithm tends to
nd clusters with sphere 
shape and with similar sizes. 
20 / 30
Density-based Clustering of the iris data 
library(fpc) 
iris2 - iris[-5] # remove class tags 
ds - dbscan(iris2, eps = 0.42, MinPts = 5) 
# compare clusters with original class labels 
table(ds$cluster, iris$Species) 
## 
## setosa versicolor virginica 
## 0 2 10 17 
## 1 48 0 0 
## 2 0 37 0 
## 3 0 3 33 
I 1 to 3: identi
ed clusters 
I 0: noises or outliers, i.e., objects that are not assigned to any 
clusters 
21 / 30
plot(ds, iris2) 
Sepal.Length 
2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 
4.5 5.5 6.5 7.5 
2.0 2.5 3.0 3.5 4.0 
Sepal.Width 
Petal.Length 
1 2 3 4 5 6 7 
4.5 5.5 6.5 7.5 
0.5 1.0 1.5 2.0 2.5 
1 2 3 4 5 6 7 
Petal.Width 
22 / 30
plot(ds, iris2[c(1, 4)]) 
4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 
0.5 1.0 1.5 2.0 2.5 
Sepal.Length 
Petal.Width 
23 / 30
plotcluster(iris2, ds$cluster) 
0 1 
1 
1 
1 
1 1 
1 1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
1 
11 
1 
1 
1 
1 
1 
1 
0 
1 
1 
1 
1 
1 
1 
1 
1 
2 
2 
2 2 
2 
2 
2 
2 2 
2 
2 
0 
2 
2 
2 
0 
2 
0 
2 
0 
2 
2 
2 
3 3 
0 
2 
3 
2 
0 
3 
2 
2 
2 
2 
2 
0 
2 
2 
3 
2 
0 
2 
0 
2 
2 
2 
2 
0 
0 
2 
0 
3 
3 
3 
3 
3 
3 
0 
0 
0 
0 
0 
3 
3 
3 
0 
3 
3 
0 
0 
33 
0 
3 
0 
3 
3 
3 
0 
0 
0 
3 
3 
0 
0 
3 
3 
3 
3 
3 
3 
3 
3 
3 
−8 −6 −4 −2 0 2 
−2 −1 0 1 2 3 
dc 1 
dc 2 
24 / 30
Prediction with Clustering Model 
I Label new data, based on their similarity with the clusters 
I Draw a sample of 10 objects from iris and add small noises 
to them to make a new dataset for labeling 
I Random noises are generated with a uniform distribution 
using function runif(). 
# create a new dataset for labeling 
set.seed(435) 
idx - sample(1:nrow(iris), 10) 
# remove class labels 
new.data - iris[idx,-5] 
# add random noise 
new.data - new.data + matrix(runif(10*4, min=0, max=0.2), 
nrow=10, ncol=4) 
# label new data 
pred - predict(ds, iris2, new.data) 
25 / 30
Results of Prediction 
table(pred, iris$Species[idx]) # check cluster labels 
## 
## pred setosa versicolor virginica 
## 0 0 0 1 
## 1 3 0 0 
## 2 0 3 0 
## 3 0 1 2 
26 / 30

More Related Content

What's hot

Data Mining: Concepts and techniques: Chapter 13 trend
Data Mining: Concepts and techniques: Chapter 13 trendData Mining: Concepts and techniques: Chapter 13 trend
Data Mining: Concepts and techniques: Chapter 13 trendSalah Amean
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Mohammed Musah
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methodsKrish_ver2
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression TreesHemant Chetwani
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)Learnbay Datascience
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Marina Santini
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and ldaSuresh Pokharel
 

What's hot (20)

Data Mining: Concepts and techniques: Chapter 13 trend
Data Mining: Concepts and techniques: Chapter 13 trendData Mining: Concepts and techniques: Chapter 13 trend
Data Mining: Concepts and techniques: Chapter 13 trend
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Confusion Matrix Explained
Confusion Matrix ExplainedConfusion Matrix Explained
Confusion Matrix Explained
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
K means
K meansK means
K means
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Clustering
ClusteringClustering
Clustering
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and lda
 

Viewers also liked

Iris data analysis example in R
Iris data analysis example in RIris data analysis example in R
Iris data analysis example in RDuyen Do
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data MiningYanchang Zhao
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with RYanchang Zhao
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with RShareThis
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With RJahnab Kumar Deka
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Revolution Analytics
 
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)オラクルエンジニア通信
 
DIstinguish between Parametric vs nonparametric test
 DIstinguish between Parametric vs nonparametric test DIstinguish between Parametric vs nonparametric test
DIstinguish between Parametric vs nonparametric testsai prakash
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

Viewers also liked (20)

Iris data analysis example in R
Iris data analysis example in RIris data analysis example in R
Iris data analysis example in R
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data Mining
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with R
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
R refcard-data-mining
R refcard-data-miningR refcard-data-mining
R refcard-data-mining
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)
R de Hadoop (Oracle R Advanced Analytics for Hadoopご説明資料)
 
time series analysis
time series analysistime series analysis
time series analysis
 
DIstinguish between Parametric vs nonparametric test
 DIstinguish between Parametric vs nonparametric test DIstinguish between Parametric vs nonparametric test
DIstinguish between Parametric vs nonparametric test
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data mining
Data miningData mining
Data mining
 

Similar to Data Clustering with R

RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rYanchang Zhao
 
Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...
 Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ... Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...
Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...Gota Morota
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RDr. Volkan OBAN
 
Pattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierPattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierNayem Nayem
 
Preparation Data Structures 11 graphs
Preparation Data Structures 11 graphsPreparation Data Structures 11 graphs
Preparation Data Structures 11 graphsAndres Mendez-Vazquez
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in RSujaAldrin
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringishmecse13
 
Clustering and Visualisation using R programming
Clustering and Visualisation using R programmingClustering and Visualisation using R programming
Clustering and Visualisation using R programmingNixon Mendez
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RFayan TAO
 
Data Visualization using base graphics
Data Visualization using base graphicsData Visualization using base graphics
Data Visualization using base graphicsRupak Roy
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptSubrata Kumer Paul
 
Lecture8 clustering
Lecture8 clusteringLecture8 clustering
Lecture8 clusteringsidsingh680
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS TECSI FEA USP
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.pptSueMiu
 
Suppression of grating lobes
Suppression of grating lobesSuppression of grating lobes
Suppression of grating lobeskavindrakrishna
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfVIKASGUPTA127897
 

Similar to Data Clustering with R (20)

RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...
 Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ... Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...
Garge, Nikhil et. al. 2005. Reproducible Clusters from Microarray Research: ...
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with R
 
Pattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifierPattern Recognition - Designing a minimum distance class mean classifier
Pattern Recognition - Designing a minimum distance class mean classifier
 
Preparation Data Structures 11 graphs
Preparation Data Structures 11 graphsPreparation Data Structures 11 graphs
Preparation Data Structures 11 graphs
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Clustering and Visualisation using R programming
Clustering and Visualisation using R programmingClustering and Visualisation using R programming
Clustering and Visualisation using R programming
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with R
 
Data Visualization using base graphics
Data Visualization using base graphicsData Visualization using base graphics
Data Visualization using base graphics
 
ClusterAnalysis
ClusterAnalysisClusterAnalysis
ClusterAnalysis
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.ppt
 
Lecture8 clustering
Lecture8 clusteringLecture8 clustering
Lecture8 clustering
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS
 
11 clusadvanced
11 clusadvanced11 clusadvanced
11 clusadvanced
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.ppt
 
Suppression of grating lobes
Suppression of grating lobesSuppression of grating lobes
Suppression of grating lobes
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdf
 

More from Yanchang Zhao

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisYanchang Zhao
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rYanchang Zhao
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationYanchang Zhao
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rYanchang Zhao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 

More from Yanchang Zhao (8)

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-r
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisation
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-r
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 

Recently uploaded

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 

Recently uploaded (20)

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Data Clustering with R

  • 1. Data Clustering with R Yanchang Zhao http://www.RDataMining.com 30 September 2014 1 / 30
  • 2. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 2 / 30
  • 3. Data Clustering with R 1 I k-means clustering with kmeans() I k-medoids clustering with pam() and pamk() I hierarchical clustering I density-based clustering with DBSCAN 1Chapter 6: Clustering, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf 3 / 30
  • 4. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 4 / 30
  • 5. k-means clustering set.seed(8953) iris2 <- iris iris2$Species <- NULL (kmeans.result <- kmeans(iris2, 3)) ## K-means clustering with 3 clusters of sizes 38, 50, 62 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 6.850 3.074 5.742 2.071 ## 2 5.006 3.428 1.462 0.246 ## 3 5.902 2.748 4.394 1.434 ## ## Clustering vector: ## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2... ## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3... ## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3... ## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1... ## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3... ## ## Within cluster sum of squares by cluster: ## [1] 23.88 15.15 39.82 5 / 30
  • 6. Results of k-Means Clustering Check clustering result against class labels (Species) table(iris$Species, kmeans.result$cluster) ## ## 1 2 3 ## setosa 0 50 0 ## versicolor 2 0 48 ## virginica 36 0 14 I Class setosa" can be easily separated from the other clusters I Classes versicolor" and virginica" are to a small degree overlapped with each other. 6 / 30
  • 7. plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster) points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2) # plot cluster centers 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 2.0 2.5 3.0 3.5 4.0 Sepal.Length Sepal.Width 7 / 30
  • 8. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 8 / 30
  • 9. The k-Medoids Clustering I Dierence from k-means: a cluster is represented with its center in the k-means algorithm, but with the object closest to the center of the cluster in the k-medoids clustering. I more robust than k-means in presence of outliers I PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering. I The CLARA algorithm is an enhanced technique of PAM by drawing multiple samples of data, applying PAM on each sample and then returning the best clustering. It performs better than PAM on larger data. I Functions pam() and clara() in package cluster I Function pamk() in package fpc does not require a user to choose k. 9 / 30
  • 10. Clustering with pamk() library(fpc) pamk.result - pamk(iris2) # number of clusters pamk.result$nc ## [1] 2 # check clustering against actual species table(pamk.result$pamobject$clustering, iris$Species) ## ## setosa versicolor virginica ## 1 50 1 0 ## 2 0 49 50 Two clusters: I setosa I a mixture of versicolor and virginica 10 / 30
  • 11. layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page plot(pamk.result$pamobject) clusplot(pam(x = sdata, k = k, diss = diss)) −3 −1 0 1 2 3 4 −2 −1 0 1 2 3 Component 1 Component 2 Silhouette plot of pam(x = sdata, k = n = 150 2 clusters Cj Average silhouette width : 0.69 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width si These two components explain 95.81 % of the point variability. j : nj | aveiÎCj si 1 : 51 | 0.81 2 : 99 | 0.62 layout(matrix(1)) # change back to one graph per page 11 / 30
  • 12. I The left chart is a 2-dimensional clusplot (clustering plot) of the two clusters and the lines show the distance between clusters. I The right chart shows their silhouettes. A large si (almost 1) suggests that the corresponding observations are very well clustered, a small si (around 0) means that the observation lies between two clusters, and observations with a negative si are probably placed in the wrong cluster. I Since the average Si are respectively 0.81 and 0.62 in the above silhouette, the identi
  • 13. ed two clusters are well clustered. 12 / 30
  • 14. Clustering with pam() # group into 3 clusters pam.result - pam(iris2, 3) table(pam.result$clustering, iris$Species) ## ## setosa versicolor virginica ## 1 50 0 0 ## 2 0 48 14 ## 3 0 2 36 Three clusters: I Cluster 1 is species setosa and is well separated from the other two. I Cluster 2 is mainly composed of versicolor, plus some cases from virginica. I The majority of cluster 3 are virginica, with two cases from versicolor. 13 / 30
  • 15. layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page plot(pam.result) clusplot(pam(x = iris2, k = 3)) −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Component 1 Component 2 Silhouette plot of pam(x = iris2, k = 3) n = 150 3 clusters Cj j : nj | aveiÎCj si 1 : 50 | 0.80 2 : 62 | 0.42 3 : 38 | 0.45 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width si These two components explain 95.81 % of the point variability. Average silhouette width : 0.55 layout(matrix(1)) # change back to one graph per page 14 / 30
  • 16. Results of Clustering I In this example, the result of pam() seems better, because it identi
  • 17. es three clusters, corresponding to three species. I Note that we cheated by setting k = 3 when using pam(), which is already known to us as the number of species. 15 / 30
  • 18. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 16 / 30
  • 19. Hierarchical Clustering of the iris Data set.seed(2835) # draw a sample of 40 records from the iris data, so that the # clustering plot will not be over crowded idx - sample(1:dim(iris)[1], 40) irisSample - iris[idx, ] # remove class label irisSample$Species - NULL # hierarchical clustering hc - hclust(dist(irisSample), method = ave) # plot clusters plot(hc, hang = -1, labels = iris$Species[idx]) # cut tree into 3 clusters rect.hclust(hc, k = 3) # get cluster IDs groups - cutree(hc, k = 3) 17 / 30
  • 20. setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa versicolor versicolor versicolor virginica virginica versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor virginica virginica virginica virginica virginica virginica virginica virginica 0 1 2 3 4 Cluster Dendrogram hclust (*, average) dist(irisSample) Height 18 / 30
  • 21. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 19 / 30
  • 22. Density-based Clustering I Group objects into one cluster if they are connected to one another by densely populated area I The DBSCAN algorithm from package fpc provides a density-based clustering for numeric data. I Two key parameters in DBSCAN: I eps: reachability distance, which de
  • 23. nes the size of neighborhood; and I MinPts: minimum number of points. I If the number of points in the neighborhood of point is no less than MinPts, then is a dense point. All the points in its neighborhood are density-reachable from and are put into the same cluster as . I Can discover clusters with various shapes and sizes I Insensitive to noise I The k-means algorithm tends to
  • 24. nd clusters with sphere shape and with similar sizes. 20 / 30
  • 25. Density-based Clustering of the iris data library(fpc) iris2 - iris[-5] # remove class tags ds - dbscan(iris2, eps = 0.42, MinPts = 5) # compare clusters with original class labels table(ds$cluster, iris$Species) ## ## setosa versicolor virginica ## 0 2 10 17 ## 1 48 0 0 ## 2 0 37 0 ## 3 0 3 33 I 1 to 3: identi
  • 26. ed clusters I 0: noises or outliers, i.e., objects that are not assigned to any clusters 21 / 30
  • 27. plot(ds, iris2) Sepal.Length 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 4.5 5.5 6.5 7.5 2.0 2.5 3.0 3.5 4.0 Sepal.Width Petal.Length 1 2 3 4 5 6 7 4.5 5.5 6.5 7.5 0.5 1.0 1.5 2.0 2.5 1 2 3 4 5 6 7 Petal.Width 22 / 30
  • 28. plot(ds, iris2[c(1, 4)]) 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 0.5 1.0 1.5 2.0 2.5 Sepal.Length Petal.Width 23 / 30
  • 29. plotcluster(iris2, ds$cluster) 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 0 2 0 2 0 2 2 2 3 3 0 2 3 2 0 3 2 2 2 2 2 0 2 2 3 2 0 2 0 2 2 2 2 0 0 2 0 3 3 3 3 3 3 0 0 0 0 0 3 3 3 0 3 3 0 0 33 0 3 0 3 3 3 0 0 0 3 3 0 0 3 3 3 3 3 3 3 3 3 −8 −6 −4 −2 0 2 −2 −1 0 1 2 3 dc 1 dc 2 24 / 30
  • 30. Prediction with Clustering Model I Label new data, based on their similarity with the clusters I Draw a sample of 10 objects from iris and add small noises to them to make a new dataset for labeling I Random noises are generated with a uniform distribution using function runif(). # create a new dataset for labeling set.seed(435) idx - sample(1:nrow(iris), 10) # remove class labels new.data - iris[idx,-5] # add random noise new.data - new.data + matrix(runif(10*4, min=0, max=0.2), nrow=10, ncol=4) # label new data pred - predict(ds, iris2, new.data) 25 / 30
  • 31. Results of Prediction table(pred, iris$Species[idx]) # check cluster labels ## ## pred setosa versicolor virginica ## 0 0 0 1 ## 1 3 0 0 ## 2 0 3 0 ## 3 0 1 2 26 / 30
  • 32. Results of Prediction table(pred, iris$Species[idx]) # check cluster labels ## ## pred setosa versicolor virginica ## 0 0 0 1 ## 1 3 0 0 ## 2 0 3 0 ## 3 0 1 2 Eight(=3+3+2) out of 10 objects are assigned with correct class labels. 26 / 30
  • 33. plot(iris2[c(1, 4)], col = 1 + ds$cluster) points(new.data[c(1, 4)], pch = +, col = 1 + pred, cex = 3) + 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 0.5 1.0 1.5 2.0 2.5 Sepal.Length Petal.Width + + + + + + + + + 27 / 30
  • 34. Outline Introduction The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-based Clustering Online Resources 28 / 30
  • 35. Online Resources I Chapter 6: Clustering, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining.pdf I R Reference Card for Data Mining http://www.rdatamining.com/docs/R-refcard-data-mining.pdf I Free online courses and documents http://www.rdatamining.com/resources/ I RDataMining Group on LinkedIn (7,000+ members) http://group.rdatamining.com I RDataMining on Twitter (1,700+ followers) @RDataMining 29 / 30
  • 36. The End Thanks! Email: yanchang(at)rdatamining.com 30 / 30