Data Clustering with R
Yanchang Zhao
http://www.RDataMining.com
R and Data Mining Course
Beijing University of Posts and Telecommunications,
Beijing, China
July 2019
1 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
2 / 62
What is Data Clustering?
Data clustering is to partition data into groups, where the
data in the same group are similar to one another and the data
from different groups are dissimilar [Han and Kamber, 2000].
To segment data into clusters so that the intra-cluster
similarity is maximized and the inter-cluster similarity is
minimized.
The groups obtained are a partition of data, which can be
used for customer segmentation, document categorization,
etc.
3 / 62
Data Clustering with R †
Partitioning Methods
k-means clustering: stats::kmeans() ∗
and
fpc::kmeansruns()
k-medoids clustering: cluster::pam() and fpc::pamk()
Hierarchical Methods
Divisive hierarchical clustering: DIANA, cluster::diana(),
Agglomerative hierarchical clustering: cluster::agnes(),
stats::hclust()
Density based Methods
DBSCAN: fpc::dbscan()
Cluster Validation
Packages clValid, cclust, NbClust
∗ package name::function name()
† Chapter 6 - Clustering, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
4 / 62
The Iris Dataset - I
The iris dataset [Frank and Asuncion, 2010] consists of 50
samples from each of three classes of iris flowers. There are five
attributes in the dataset:
sepal length in cm,
sepal width in cm,
petal length in cm,
petal width in cm, and
class: Iris Setosa, Iris Versicolour, and Iris Virginica.
Detailed description of the dataset can be found at the UCI
Machine Learning Repository ‡.
‡ https://archive.ics.uci.edu/ml/datasets/Iris
5 / 62
The Iris Dataset - II
Below we have a look at the structure of the dataset with str().
## the IRIS dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....
150 observations (records, or rows) and 5 variables (or
columns)
The first four variables are numeric.
The last one, Species, is categorical (called a “factor” in R)
and has three levels.
6 / 62
The Iris Dataset - III
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Wid...
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0....
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0....
## Median :5.800 Median :3.000 Median :4.350 Median :1....
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1....
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1....
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2....
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
7 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
8 / 62
Partitioning clustering - I
Partitioning the data into k groups first and then trying to
improve the quality of clustering by moving objects from one
group to another
k-means [Alsabti et al., 1998, Macqueen, 1967]: randomly
selects k objects as cluster centers and assigns other objects
to the nearest cluster centers, and then improves the
clustering by iteratively updating the cluster centers and
reassigning the objects to the new centers.
k-medoids [Huang, 1998]: a variation of k-means for
categorical data, where the medoid (i.e., the object closest to
the center), instead of the centroid, is used to represent a
cluster.
PAM and CLARA [Kaufman and Rousseeuw, 1990]
CLARANS [Ng and Han, 1994]
9 / 62
Partitioning clustering - II
The result of partitioning clustering is dependent on the
selection of initial cluster centers and it may result in a local
optimum instead of a global one. (Improvement: run k-means
multiple times with different initial centers and then choose
the best clustering result.)
Tends to result in sphere-shaped clusters with similar sizes
Sensitive to outliers
Non-trivial to choose an appropriate value for k
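As a minimal sketch (not part of the original slides), the multiple-initialisation
improvement mentioned above can be obtained directly with the nstart argument of
kmeans(), which runs the algorithm with several random starts and keeps the solution
with the lowest total within-cluster sum of squares:
## run k-means with 25 random initialisations and keep the best solution
set.seed(1)
km <- kmeans(iris[, -5], centers = 3, nstart = 25)
km$tot.withinss # total within-cluster sum of squares of the best run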
10 / 62
k-Means Algorithm
k-means: a classic partitioning method for clustering
First, it selects k objects from the dataset, each of which
initially represents a cluster center.
Each object is assigned to the cluster to which it is most
similar, based on the distance between the object and the
cluster center.
The means of clusters are computed as the new cluster
centers.
The process iterates until the criterion function converges.
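The four steps above can be sketched in a few lines of R. The function below is
only an illustration (not from the slides) and assumes no cluster becomes empty
during the iterations; in practice one would use kmeans() instead.
simple.kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  ## step 1: select k objects as initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max.iter)) {
    ## step 2: assign each object to the nearest center
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    ## step 3: recompute the centers as the cluster means
    new.centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    ## step 4: stop when the centers no longer change
    if (all(abs(new.centers - centers) < 1e-8)) break
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}
set.seed(1)
simple.kmeans(iris[, -5], 3)$centers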
11 / 62
k-Means Algorithm - Criterion Function
A typical criterion function is the squared-error criterion, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2 ,    (1)

where E is the sum of squared errors, p is a point, and m_i is the
center of cluster C_i.
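For a result returned by kmeans(), this criterion E is simply the total
within-cluster sum of squares; a minimal check (not from the slides):
km <- kmeans(iris[, -5], centers = 3)
km$tot.withinss # E, the sum of squared errors over all clusters
sum(km$withinss) # the same value, summed over the per-cluster errors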
12 / 62
k-means clustering
## k-means clustering
## set a seed for random number generation to make the results reproducible
set.seed(8953)
## make a copy of iris data
iris2 <- iris
## remove the class label, Species
iris2$Species <- NULL
## run kmeans clustering to find 3 clusters
kmeans.result <- kmeans(iris2, 3)
## print the clustering result
kmeans.result
13 / 62
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3...
## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3...
## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1...
## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3...
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"...
## [5] "tot.withinss" "betweenss" "size" "iter" ...
## [9] "ifault" 14 / 62
Results of k-Means Clustering
Check clustering result against class labels (Species)
table(iris$Species, kmeans.result$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14
Class “setosa” can be easily separated from the other clusters
Classes “versicolor” and “virginica” overlap with each other to
a small degree.
15 / 62
plot(iris2[, c("Sepal.Length", "Sepal.Width")],
col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex=2) # plot cluster centers
[Figure: scatter plot of Sepal.Length vs Sepal.Width, with points coloured by cluster and the cluster centers marked with asterisks.]
16 / 62
k-means clustering with estimation of k and multiple initialisations
kmeansruns() in package fpc [Hennig, 2014]
calls kmeans() to perform k-means clustering
initializes the k-means algorithm several times with random
points from the data set as means
estimates the number of clusters by the Calinski-Harabasz index
or the average silhouette width
17 / 62
library(fpc)
kmeansruns.result <- kmeansruns(iris2)
kmeansruns.result
## K-means clustering with 3 clusters of sizes 62, 50, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.901613 2.748387 4.393548 1.433871
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1...
## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1...
## [91] 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3...
## [121] 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1...
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
18 / 62
The k-Medoids Clustering
Difference from k-means: a cluster is represented with its
center in the k-means algorithm, but with the object closest
to the center of the cluster in the k-medoids clustering.
more robust than k-means in the presence of outliers
PAM (Partitioning Around Medoids) is a classic algorithm for
k-medoids clustering.
The CLARA algorithm is an enhancement of PAM, which draws
multiple samples of the data, applies PAM to each sample and
then returns the best clustering. It performs better than PAM
on larger data.
Functions pam() and clara() in package cluster
[Maechler et al., 2016]
Function pamk() in package fpc does not require a user to
choose k.
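As a small illustrative sketch (not part of the original slides), clara() can be
applied to the iris data in the same way as pam(); here iris2 is the iris data
without the Species column, as used earlier:
library(cluster)
## CLARA: PAM applied to multiple samples, keeping the best clustering
clara.result <- clara(iris2, k = 3, samples = 10)
table(clara.result$clustering, iris$Species)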
19 / 62
Clustering with pam()
## clustering with PAM
library(cluster)
# group into 3 clusters
pam.result <- pam(iris2, 3)
# check against actual class label
table(pam.result$clustering, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
Three clusters:
Cluster 1 is species “setosa” and is well separated from the
other two.
Cluster 2 is mainly composed of “versicolor”, plus some cases
from “virginica”.
The majority of cluster 3 are “virginica”, with two cases from
“versicolor”.
20 / 62
plot(pam.result)
[Figure: clusplot of pam(x = iris2, k = 3), where the first two components explain 95.81 % of the point variability, and the corresponding silhouette plot with average silhouette width 0.55; per-cluster silhouette widths: cluster 1 (n = 50): 0.80, cluster 2 (n = 62): 0.42, cluster 3 (n = 38): 0.45.]
21 / 62
The left chart is a 2-dimensional “clusplot” (clustering plot)
of the three clusters and the lines show the distance between
clusters.
The right chart shows their silhouettes. A large si (almost 1)
suggests that the corresponding observations are very well
clustered, a small si (around 0) means that the observation
lies between two clusters, and observations with a negative si
are probably placed in the wrong cluster.
The silhouette width of cluster 1 is 0.80, which means it is well
clustered and separated from the other clusters. The other two
have relatively low silhouette widths (0.42 and 0.45) and overlap
with each other somewhat.
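The same silhouette information can be inspected numerically with silhouette()
from package cluster; a brief sketch (not from the slides):
library(cluster)
sil <- silhouette(pam.result)
summary(sil) # per-cluster and overall average silhouette widths
mean(sil[, "sil_width"]) # overall average silhouette width (0.55 here)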
22 / 62
Clustering with pamk()
library(fpc)
pamk.result <- pamk(iris2)
# number of clusters
pamk.result$nc
## [1] 2
# check clustering against actual class label
table(pamk.result$pamobject$clustering, iris$Species)
##
## setosa versicolor virginica
## 1 50 1 0
## 2 0 49 50
Two clusters:
“setosa”
a mixture of “versicolor” and “virginica”
23 / 62
plot(pamk.result)
[Figure: clusplot of pam(x = sdata, k = k, diss = diss) with two clusters, where the first two components explain 95.81 % of the point variability, and the corresponding silhouette plot with average silhouette width 0.69; per-cluster silhouette widths: cluster 1 (n = 51): 0.81, cluster 2 (n = 99): 0.62.]
24 / 62
Results of Clustering
In this example, the result of pam() seems better, because it
identifies three clusters, corresponding to three species.
Note that we cheated by setting k = 3 when using pam(),
which is already known to us as the number of species.
25 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
26 / 62
Hierarchical Clustering - I
With the hierarchical clustering approach, a hierarchical
decomposition of the data is built in either a bottom-up
(agglomerative) or a top-down (divisive) way.
Generally a dendrogram is generated and a user may select to
cut it at a certain level to get the clusters.
27 / 62
Hierarchical Clustering - II
28 / 62
Hierarchical Clustering Algorithms
With agglomerative clustering, every single object is taken as
a cluster, and then the two nearest clusters are iteratively
merged into bigger clusters, until the expected number of
clusters is obtained or only one cluster is left.
AGNES [Kaufman and Rousseeuw, 1990]
Divisive clustering works in an opposite way, which puts all
objects in a single cluster and then divides the cluster into
smaller and smaller ones.
DIANA [Kaufman and Rousseeuw, 1990]
BIRCH [Zhang et al., 1996]
CURE [Guha et al., 1998]
ROCK [Guha et al., 1999]
Chameleon [Karypis et al., 1999]
29 / 62
Hierarchical Clustering - Distance Between Clusters
In hierarchical clustering, there are four different methods to
measure the distance between clusters:
Centroid distance is the distance between the centroids of two
clusters.
Average distance is the average of the distances between
every pair of objects from two clusters.
Single-link distance, a.k.a. minimum distance, is the distance
between the two nearest objects from two clusters.
Complete-link distance, a.k.a. maximum distance, is the
distance between the two objects which are the farthest from
each other from two clusters.
30 / 62
Hierarchical Clustering of the iris Data
## hierarchical clustering
set.seed(2835)
# draw a sample of 40 records from the iris data, so that the
# clustering plot will not be overcrowded
idx <- sample(1:dim(iris)[1], 40)
iris3 <- iris[idx, ]
# remove class label
iris3$Species <- NULL
# hierarchical clustering
hc <- hclust(dist(iris3), method = "ave")
# plot clusters
plot(hc, hang = -1, labels = iris$Species[idx])
# cut tree into 3 clusters
rect.hclust(hc, k = 3)
# get cluster IDs
groups <- cutree(hc, k = 3)
31 / 62
[Figure: cluster dendrogram of the 40 sampled records, produced by hclust (*, "average") on dist(iris3), with the tree cut into three clusters; the leaves are labelled with the actual species.]
32 / 62
Agglomeration Methods of hclust
hclust(d, method = "complete", members = NULL)
method = "ward.D" or "ward.D2": Ward’s minimum
variance method aims at finding compact, spherical
clusters [R Core Team, 2015].
method = "complete": complete-link distance; finds similar
clusters.
method = "single": single-link distance; adopts a “friends
of friends” clustering strategy.
method = "average": average distance
method = "centroid": centroid distance
method = "median":
method = "mcquitty":
33 / 62
DIANA
DIANA [Kaufman and Rousseeuw, 1990]: divisive hierarchical
clustering
Constructs a hierarchy of clusterings, starting with one large
cluster containing all observations.
Divides clusters until each cluster contains only a single
observation.
At each stage, the cluster with the largest diameter is
selected. (The diameter of a cluster is the largest dissimilarity
between any two of its observations.)
To divide the selected cluster, the algorithm first looks for its
most disparate observation (i.e., which has the largest average
dissimilarity to other observations in the selected cluster).
This observation initiates the “splinter group”. In subsequent
steps, the algorithm reassigns observations that are closer to
the “splinter group” than to the “old party”. The result is a
division of the selected cluster into two new clusters.
34 / 62
DIANA
## clustering with DIANA
library(cluster)
diana.result <- diana(iris3)
plot(diana.result, which.plots = 2, labels = iris$Species[idx])
35 / 62
[Figure: dendrogram of diana(x = iris3) with leaves labelled by species; divisive coefficient = 0.93.]
36 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
37 / 62
Density-Based Clustering
The rationale of density-based clustering is that a cluster is
composed of well-connected dense regions, while objects in
sparse areas are removed as noise.
DBSCAN is a typical density-based clustering algorithm,
which works by expanding clusters to their dense
neighborhood [Ester et al., 1996].
Other density-based clustering techniques:
OPTICS [Ankerst et al., 1999] and
DENCLUE [Hinneburg and Keim, 1998]
The advantage of density-based clustering is that it can filter
out noise and find clusters of arbitrary shapes (as long as they
are composed of connected dense regions).
38 / 62
DBSCAN [Ester et al., 1996]
Group objects into one cluster if they are connected to one
another by a densely populated area
The DBSCAN algorithm from package fpc provides a
density-based clustering for numeric data.
Two key parameters in DBSCAN:
eps: reachability distance, which defines the size of
neighborhood; and
MinPts: minimum number of points.
If the number of points in the neighborhood of point α is no
less than MinPts, then α is a dense point. All the points in its
neighborhood are density-reachable from α and are put into
the same cluster as α.
Can discover clusters with various shapes and sizes
Insensitive to noise
39 / 62
Density-based Clustering of the iris data
## Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
ds
## dbscan Pts=150 MinPts=5 eps=0.42
## 0 1 2 3
## border 29 6 10 12
## seed 0 42 27 24
## total 29 48 37 36
40 / 62
Density-based Clustering of the iris data
# compare clusters with actual class labels
table(ds$cluster, iris$Species)
##
## setosa versicolor virginica
## 0 2 10 17
## 1 48 0 0
## 2 0 37 0
## 3 0 3 33
1 to 3: identified clusters
0: noise or outliers, i.e., objects that are not assigned to any
cluster
41 / 62
plot(ds, iris2)
[Figure: pairwise scatter-plot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with points coloured by DBSCAN cluster.]
42 / 62
plot(ds, iris2[, c(1, 4)])
[Figure: scatter plot of Sepal.Length vs Petal.Width, with points coloured by DBSCAN cluster.]
43 / 62
plotcluster(iris2, ds$cluster)
[Figure: projection of the data onto discriminant coordinates (dc 1 vs dc 2) by plotcluster(), with each point labelled by its DBSCAN cluster (0 = noise).]
44 / 62
Prediction with Clustering Model
Label new data based on their similarity to the clusters
Draw a sample of 10 objects from iris and add small random
noise to them to make a new dataset for labeling
Random noise is generated from a uniform distribution
using function runif().
## cluster prediction
# create a new dataset for labeling
set.seed(435)
idx <- sample(1:nrow(iris), 10)
# remove class labels
new.data <- iris[idx,-5]
# add random noise
new.data <- new.data + matrix(runif(10*4, min=0, max=0.2),
nrow=10, ncol=4)
# label new data
pred <- predict(ds, iris2, new.data)
45 / 62
Results of Prediction
table(pred, iris$Species[idx]) # check cluster labels
##
## pred setosa versicolor virginica
## 0 0 0 1
## 1 3 0 0
## 2 0 3 0
## 3 0 1 2
Eight (= 3 + 3 + 2) out of the 10 new objects are assigned to
clusters that match their actual species.
46 / 62
plot(iris2[, c(1, 4)], col = 1 + ds$cluster)
points(new.data[, c(1, 4)], pch = "+", col = 1 + pred, cex = 3)
[Figure: scatter plot of Sepal.Length vs Petal.Width coloured by DBSCAN cluster, with the 10 new (predicted) points marked with “+”.]
47 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
48 / 62
Cluster Validation
silhouette() compute or extract silhouette information
(cluster)
cluster.stats() compute several cluster validity statistics
from a clustering and a dissimilarity matrix (fpc)
clValid() calculate validation measures for a given set of
clustering algorithms and number of clusters (clValid)
clustIndex() calculate the values of several clustering
indexes, which can be independently used to determine the
number of clusters existing in a data set (cclust)
NbClust() provide 30 indices for cluster validation and
determining the number of clusters (NbClust)
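For example, cluster.stats() from package fpc can compute a few validity
statistics for the earlier k-means result; a minimal sketch (not from the
slides), assuming kmeans.result and iris2 from the k-means example are
still available:
library(fpc)
d <- dist(iris2)
stats <- cluster.stats(d, kmeans.result$cluster)
stats$avg.silwidth # average silhouette width
stats$dunn # Dunn index
stats$within.cluster.ss # total within-cluster sum of squares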
49 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
50 / 62
Further Readings - Clustering
A brief overview of various approaches to clustering
Yanchang Zhao, et al. ”Data Clustering.” In Ferraggine et al. (Eds.), Handbook
of Research on Innovations in Database Technologies and Applications, Feb
2009. http://yanchang.rdatamining.com/publications/
Overview-of-Data-Clustering.pdf
Cluster Analysis & Evaluation Measures
https://en.wikipedia.org/wiki/Cluster_analysis
Detailed review of algorithms for data clustering
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review.
ACM Computing Surveys, 31(3), 264-323.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue
Software, San Jose, CA, USA.
http://citeseer.ist.psu.edu/berkhin02survey.html.
A comprehensive textbook on data mining
Han, J., & Kamber, M. (2000). Data mining: concepts and techniques. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
51 / 62
Further Readings - Clustering with R
Data Mining Algorithms In R: Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering
Data Mining Algorithms In R: k-Means Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means
Data Mining Algorithms In R: k-Medoids Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_(PAM)
Data Mining Algorithms In R: Hierarchical Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Hierarchical_Clustering
Data Mining Algorithms In R: Density-Based Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
52 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based clustering
Cluster Validation
Further Readings and Online Resources
Exercises
53 / 62
Exercise - I
Clustering cars based on road test data
mtcars: the Motor Trend Car Road Tests data, comprising
fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74
models) [R Core Team, 2015]
A data frame with 32 observations on 11 variables:
1. mpg: fuel consumption (Miles/gallon)
2. cyl: Number of cylinders
3. disp: Displacement (cu.in.)
4. hp: Gross horsepower
5. drat: Rear axle ratio
6. wt: Weight (lb/1000)
7. qsec: 1/4 mile time
8. vs: V engine or straight engine
9. am: Transmission (0 = automatic, 1 = manual)
10. gear: Number of forward gears
11. carb: Number of carburetors
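A possible starting point for this exercise (an illustrative sketch only, not a
prescribed solution) is to standardise the variables, which are on very different
scales, and then try k-means and hierarchical clustering:
cars <- scale(mtcars) # standardise variables measured on different scales
set.seed(1)
km <- kmeans(cars, centers = 3, nstart = 25)
table(km$cluster)
hc <- hclust(dist(cars), method = "ward.D2")
plot(hc, hang = -1)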
54 / 62
Exercise - II
To cluster states of US
state.x77: statistics of the 50 states of
US [R Core Team, 2015]
a matrix with 50 rows and 8 columns
1. Population: population estimate as of July 1, 1975
2. Income: per capita income (1974)
3. Illiteracy: illiteracy (1970, percent of population)
4. Life Exp: life expectancy in years (1969–71)
5. Murder: murder and non-negligent manslaughter rate per
100,000 population (1976)
6. HS Grad: percent high-school graduates (1970)
7. Frost: mean number of days with minimum temperature below
freezing (1931–1960) in capital or large city
8. Area: land area in square miles
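Similarly, a possible starting point (an illustrative sketch only) for clustering
the states is to scale state.x77 before computing distances:
states <- scale(state.x77) # the eight variables are on very different scales
hc <- hclust(dist(states), method = "average")
plot(hc, hang = -1, cex = 0.7)
groups <- cutree(hc, k = 4) # k = 4 is an arbitrary starting choice
table(groups)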
55 / 62
Exercise - Questions
Which attributes to use?
Are the attributes at the same scale?
Which clustering techniques to use?
Which clustering algorithms to use?
How many clusters to find?
Are the clustering results good or not?
56 / 62
Online Resources
Book titled R and Data Mining: Examples and Case Studies
http://www.rdatamining.com/docs/RDataMining-book.pdf
R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
Free online courses and documents
http://www.rdatamining.com/resources/
RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
Twitter (3,300+ followers)
@RDataMining
57 / 62
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
58 / 62
How to Cite This Work
Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-123-96963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
59 / 62
References I
Alsabti, K., Ranka, S., and Singh, V. (1998).
An efficient k-means clustering algorithm.
In Proc. the First Workshop on High Performance Data Mining, Orlando, Florida.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999).
OPTICS: ordering points to identify the clustering structure.
In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data,
pages 49–60, New York, NY, USA. ACM Press.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).
A density-based algorithm for discovering clusters in large spatial databases with noise.
In KDD, pages 226–231.
Frank, A. and Asuncion, A. (2010).
UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
http://archive.ics.uci.edu/ml.
Guha, S., Rastogi, R., and Shim, K. (1998).
CURE: an efficient clustering algorithm for large databases.
In SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data,
pages 73–84, New York, NY, USA. ACM Press.
Guha, S., Rastogi, R., and Shim, K. (1999).
ROCK: A robust clustering algorithm for categorical attributes.
In Proceedings of the 15th International Conference on Data Engineering, 23-26 March 1999, Sydney,
Australia, pages 512–521. IEEE Computer Society.
Han, J. and Kamber, M. (2000).
Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
60 / 62
References II
Hennig, C. (2014).
fpc: Flexible procedures for clustering.
R package version 2.1-9.
Hinneburg, A. and Keim, D. A. (1998).
An efficient approach to clustering in large multimedia databases with noise.
In KDD, pages 58–65.
DENCLUE.
Huang, Z. (1998).
Extensions to the k-means algorithm for clustering large data sets with categorical values.
Data Min. Knowl. Discov., 2(3):283–304.
Karypis, G., Han, E.-H., and Kumar, V. (1999).
Chameleon: hierarchical clustering using dynamic modeling.
Computer, 32(8):68–75.
Kaufman, L. and Rousseeuw, P. J. (1990).
Finding groups in data. an introduction to cluster analysis.
Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York:
Wiley, 1990.
Macqueen, J. B. (1967).
Some methods of classification and analysis of multivariate observations.
In the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2016).
cluster: Cluster Analysis Basics and Extensions.
R package version 2.0.4 — For new features, see the ’Changelog’ file (in the package source).
61 / 62
References III
Ng, R. T. and Han, J. (1994).
Efficient and effective clustering methods for spatial data mining.
In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155,
San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
R Core Team (2015).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996).
BIRCH: an efficient data clustering method for very large databases.
In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data,
pages 103–114, New York, NY, USA. ACM Press.
62 / 62

More Related Content

What's hot

Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examplesDennis
 
Clustering and Visualisation using R programming
Clustering and Visualisation using R programmingClustering and Visualisation using R programming
Clustering and Visualisation using R programmingNixon Mendez
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Guy Lebanon
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RDr. Volkan OBAN
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)Dataspora
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyrRomain Francois
 
Table of Useful R commands.
Table of Useful R commands.Table of Useful R commands.
Table of Useful R commands.Dr. Volkan OBAN
 
Data handling in r
Data handling in rData handling in r
Data handling in rAbhik Seal
 
Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)Ram Narasimhan
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in REshwar Sai
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientistsLambda Tree
 

What's hot (20)

Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
R programming language
R programming languageR programming language
R programming language
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 
Clustering and Visualisation using R programming
Clustering and Visualisation using R programmingClustering and Visualisation using R programming
Clustering and Visualisation using R programming
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with R
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
 
Language R
Language RLanguage R
Language R
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
 
R language introduction
R language introductionR language introduction
R language introduction
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
Dplyr and Plyr
Dplyr and PlyrDplyr and Plyr
Dplyr and Plyr
 
Table of Useful R commands.
Table of Useful R commands.Table of Useful R commands.
Table of Useful R commands.
 
Data handling in r
Data handling in rData handling in r
Data handling in r
 
Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in R
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
 

Similar to RDataMining slides-clustering-with-r

Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in RSujaAldrin
 
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0theijes
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyijpla
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniquestalktoharry
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataKathleneNgo
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2IAEME Publication
 

Similar to RDataMining slides-clustering-with-r (20)

ClusterAnalysis
ClusterAnalysisClusterAnalysis
ClusterAnalysis
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
K means report
K means reportK means report
K means report
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
My8clst
My8clstMy8clst
My8clst
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
08 clustering
08 clustering08 clustering
08 clustering
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
 

More from Yanchang Zhao

RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rYanchang Zhao
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rYanchang Zhao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data MiningYanchang Zhao
 

More from Yanchang Zhao (10)

RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-r
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-r
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data Mining
 

Recently uploaded

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Recently uploaded (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

RDataMining slides-clustering-with-r

  • 1. Data Clustering with R Yanchang Zhao http://www.RDataMining.com R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China July 2019 1 / 62
  • 2. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 2 / 62
  • 3. What is Data Clustering? Data clustering is to partition data into groups, where the data in the same group are similar to one another and the data from different groups are dissimilar [Han and Kamber, 2000]. To segment data into clusters so that the intra-cluster similarity is maximized and that the inter-cluster similarity is minimized. The groups obtained are a partition of data, which can be used for customer segmentation, document categorization, etc. 3 / 62
  • 4. Data Clustering with R † Partitioning Methods k-means clustering: stats::kmeans() ∗ and fpc::kmeansruns() k-medoids clustering: cluster::pam() and fpc::pamk() Hierarchical Methods Divisive hierarchical clustering: DIANA, cluster::diana(), Agglomerative hierarchical clustering: cluster::agnes(), stats::hclust() Density based Methods DBSCAN: fpc::dbscan() Cluster Validation Packages clValid, cclust, NbClust ∗ package name::function name() † Chapter 6 - Clustering, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 4 / 62
  • 5. The Iris Dataset - I The iris dataset [Frank and Asuncion, 2010] consists of 50 samples from each of three classes of iris flowers. There are five attributes in the dataset: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm, and class: Iris Setosa, Iris Versicolour, and Iris Virginica. Detailed desription of the dataset can be found at the UCI Machine Learning Repository ‡. ‡ https://archive.ics.uci.edu/ml/datasets/Iris 5 / 62
  • 6. The Iris Dataset - II Below we have a look at the structure of the dataset with str(). ## the IRIS dataset str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0... ## $ Species : Factor w/ 3 levels "setosa","versicolor",.... 150 observations (records, or rows) and 5 variables (or columns) The first four variables are numeric. The last one, Species, is categoric (called as “factor” in R) and has three levels of values. 6 / 62
  • 7. The Iris Dataset - III summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Wid... ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.... ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.... ## Median :5.800 Median :3.000 Median :4.350 Median :1.... ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.... ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.... ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.... ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ## 7 / 62
  • 8. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 8 / 62
  • 9. Partitioning clustering - I Partitioning the data into k groups first and then trying to improve the quality of clustering by moving objects from one group to another k-means [Alsabti et al., 1998, Macqueen, 1967]: randomly selects k objects as cluster centers and assigns other objects to the nearest cluster centers, and then improves the clustering by iteratively updating the cluster centers and reassigning the objects to the new centers. k-medoids [Huang, 1998]: a variation of k-means for categorical data, where the medoid (i.e., the object closest to the center), instead of the centroid, is used to represent a cluster. PAM and CLARA [Kaufman and Rousseeuw, 1990] CLARANS [Ng and Han, 1994] 9 / 62
  • 10. Partitioning clustering - II The result of partitioning clustering depends on the selection of initial cluster centers and may be a local optimum instead of a global one. (Improvement: run k-means multiple times with different initial centers and then choose the best clustering result; see the sketch below.) Tends to result in sphere-shaped clusters of similar sizes Sensitive to outliers Non-trivial to choose an appropriate value for k 10 / 62
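As a quick illustration of the multiple-restart improvement mentioned above, kmeans() can perform the restarts itself via its nstart argument and keeps the run with the smallest total within-cluster sum of squares. A minimal sketch, assuming iris2 is the iris data with the Species column removed (as prepared on the later slides):
## compare a single random start with the best of 25 random starts
iris2 <- iris[, -5]                                   # drop the class label
set.seed(8953)
km.single <- kmeans(iris2, centers = 3)               # one random initialisation
km.multi  <- kmeans(iris2, centers = 3, nstart = 25)  # best of 25 initialisations
c(single = km.single$tot.withinss, multi = km.multi$tot.withinss)
The restart with the smaller tot.withinss is the better local optimum; with well-separated data the two values will often coincide.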
  • 11. k-Means Algorithm k-means: a classic partitioning method for clustering First, it selects k objects from the dataset, each of which initially represents a cluster center. Each object is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster center. The means of clusters are computed as the new cluster centers. The process iterates until the criterion function converges. 11 / 62
  • 12. k-Means Algorithm - Criterion Function A typical criterion function is the squared-error criterion, defined as $E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$, (1) where $E$ is the sum of squared error, $p$ is a point, and $m_i$ is the center of cluster $C_i$. 12 / 62
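To make the criterion concrete, the sketch below computes E directly from a k-means result and checks it against the tot.withinss component that kmeans() reports (a small illustration, assuming the iris2 data and the kmeans.result object created on the next slide):
## squared-error criterion E computed by hand
centers <- kmeans.result$centers[kmeans.result$cluster, ]  # center of each point's cluster
E <- sum(rowSums((iris2 - centers)^2))
E                            # equals ...
kmeans.result$tot.withinss   # ... the value reported by kmeans()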
  • 13. k-means clustering
## k-means clustering
## set a seed for random number generation to make the results reproducible
set.seed(8953)
## make a copy of iris data
iris2 <- iris
## remove the class label, Species
iris2$Species <- NULL
## run kmeans clustering to find 3 clusters
kmeans.result <- kmeans(iris2, 3)
## print the clustering result
kmeans.result
13 / 62
  • 14. ## K-means clustering with 3 clusters of sizes 38, 50, 62 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 6.850000 3.073684 5.742105 2.071053 ## 2 5.006000 3.428000 1.462000 0.246000 ## 3 5.901613 2.748387 4.393548 1.433871 ## ## Clustering vector: ## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2... ## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3... ## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3... ## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1... ## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3... ## ## Within cluster sum of squares by cluster: ## [1] 23.87947 15.15100 39.82097 ## (between_SS / total_SS = 88.4 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss"... ## [5] "tot.withinss" "betweenss" "size" "iter" ... ## [9] "ifault" 14 / 62
  • 15. Results of k-Means Clustering Check clustering result against class labels (Species)
table(iris$Species, kmeans.result$cluster)
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
Class "setosa" can be easily separated from the other clusters Classes "versicolor" and "virginica" overlap with each other to a small degree. 15 / 62
  • 16.
plot(iris2[, c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
# plot cluster centers
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
[Figure: scatter plot of Sepal.Length vs Sepal.Width, points coloured by cluster, with the three cluster centers marked.]
16 / 62
  • 17. k-means clustering with estimating k and initialisations kmeansruns() in package fpc [Hennig, 2014]: calls kmeans() to perform k-means clustering; initialises the k-means algorithm several times with random points from the data set as means; estimates the number of clusters by the Calinski-Harabasz index or the average silhouette width 17 / 62
  • 18. library(fpc) kmeansruns.result <- kmeansruns(iris2) kmeansruns.result ## K-means clustering with 3 clusters of sizes 62, 50, 38 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.901613 2.748387 4.393548 1.433871 ## 2 5.006000 3.428000 1.462000 0.246000 ## 3 6.850000 3.073684 5.742105 2.071053 ## ## Clustering vector: ## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2... ## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1... ## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1... ## [91] 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3... ## [121] 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1... ## ## Within cluster sum of squares by cluster: ## [1] 39.82097 15.15100 23.87947 ## (between_SS / total_SS = 88.4 %) ## ## Available components: 18 / 62
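The number of clusters that kmeansruns() settles on is returned in its bestk component, and crit holds the criterion value for each candidate k. A short sketch, assuming the kmeansruns.result object above (component names as documented in the fpc package):
kmeansruns.result$bestk   # estimated number of clusters
kmeansruns.result$crit    # criterion value for each k tried
## rerun with the average silhouette width as the criterion
kmeansruns(iris2, krange = 2:6, criterion = "asw")$bestk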
  • 19. The k-Medoids Clustering Difference from k-means: a cluster is represented by its center in the k-means algorithm, but by the object closest to the center of the cluster in k-medoids clustering. more robust than k-means in the presence of outliers PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering. The CLARA algorithm extends PAM by drawing multiple samples of the data, applying PAM to each sample and then returning the best clustering; it scales better than PAM to larger data. Functions pam() and clara() in package cluster [Maechler et al., 2016] Function pamk() in package fpc does not require a user to choose k. 19 / 62
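The slides below demonstrate pam() and pamk(); for completeness, a hedged sketch of clara() on the same data is given here (the samples value is an illustrative choice, not one from the original slides):
## CLARA: PAM applied to multiple samples of the data
library(cluster)
clara.result <- clara(iris2, k = 3, samples = 50)
table(clara.result$clustering, iris$Species)
On a dataset as small as iris, clara() and pam() normally give very similar clusterings; the sampling only pays off on much larger data.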
  • 20. Clustering with pam()
## clustering with PAM
library(cluster)
# group into 3 clusters
pam.result <- pam(iris2, 3)
# check against actual class label
table(pam.result$clustering, iris$Species)
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
Three clusters: Cluster 1 is species "setosa" and is well separated from the other two. Cluster 2 is mainly composed of "versicolor", plus some cases from "virginica". The majority of cluster 3 are "virginica", with two cases from "versicolor". 20 / 62
  • 21.
plot(pam.result)
[Figure: left, clusplot of the three clusters; the two components explain 95.81 % of the point variability. Right, silhouette plot with average silhouette width 0.55 (cluster 1: n = 50, 0.80; cluster 2: n = 62, 0.42; cluster 3: n = 38, 0.45).]
21 / 62
  • 22. The left chart is a 2-dimensional "clusplot" (clustering plot) of the three clusters, and the lines show the distance between clusters. The right chart shows their silhouettes. A large si (close to 1) suggests that the corresponding observations are very well clustered, a small si (around 0) means that the observation lies between two clusters, and observations with a negative si are probably placed in the wrong cluster. The silhouette width of cluster 1 is 0.80, which means it is well clustered and separated from the other clusters. The other two have relatively low silhouette widths (0.42 and 0.45) and overlap with each other to some degree. 22 / 62
  • 23. Clustering with pamk()
library(fpc)
pamk.result <- pamk(iris2)
# number of clusters
pamk.result$nc
## [1] 2
# check clustering against actual class label
table(pamk.result$pamobject$clustering, iris$Species)
##     setosa versicolor virginica
##   1     50          1         0
##   2      0         49        50
Two clusters: "setosa"; a mixture of "versicolor" and "virginica" 23 / 62
  • 24.
plot(pamk.result)
[Figure: left, clusplot of the two clusters; the two components explain 95.81 % of the point variability. Right, silhouette plot with average silhouette width 0.69 (cluster 1: n = 51, 0.81; cluster 2: n = 99, 0.62).]
24 / 62
  • 26. Results of Clustering In this example, the result of pam() seems better, because it identifies three clusters, corresponding to the three species. Note that we cheated by setting k = 3 when calling pam(), since the number of species is already known to us; a sketch of letting pamk() search over a range of k follows. 25 / 62
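One way to avoid hard-coding k is to let pamk() search over a range of cluster counts and pick the best by a validity criterion. A hedged sketch (the krange and criterion values below are illustrative; with the average silhouette width criterion pamk() may still prefer two clusters on this data):
library(fpc)
pamk.result2 <- pamk(iris2, krange = 2:6, criterion = "asw")
pamk.result2$nc                                   # number of clusters selected
table(pamk.result2$pamobject$clustering, iris$Species)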
  • 27. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 26 / 62
  • 28. Hierarchical Clustering - I With the hierarchical clustering approach, a hierarchical decomposition of the data is built in either a bottom-up (agglomerative) or top-down (divisive) way. Generally a dendrogram is generated, and a user may choose to cut it at a certain level to obtain the clusters. 27 / 62
  • 30. Hierarchical Clustering Algorithms With agglomerative clustering, every single object is taken as a cluster, and then the two nearest clusters are iteratively merged to build bigger clusters until the expected number of clusters is obtained or only one cluster is left. AGNES [Kaufman and Rousseeuw, 1990] Divisive clustering works in the opposite way: it puts all objects in a single cluster and then divides the cluster into smaller and smaller ones. DIANA [Kaufman and Rousseeuw, 1990] BIRCH [Zhang et al., 1996] CURE [Guha et al., 1998] ROCK [Guha et al., 1999] Chameleon [Karypis et al., 1999] 29 / 62
  • 31. Hierarchical Clustering - Distance Between Clusters In hierarchical clustering, there are four different methods to measure the distance between clusters: Centroid distance is the distance between the centroids of two clusters. Average distance is the average of the distances between every pair of objects from two clusters. Single-link distance, a.k.a. minimum distance, is the distance between the two nearest objects from two clusters. Complete-link distance, a.k.a. maximum distance, is the distance between the two objects which are the farthest from each other from two clusters. 30 / 62
  • 32. Hierarchical Clustering of the iris Data
## hierarchical clustering
set.seed(2835)
# draw a sample of 40 records from the iris data, so that the
# clustering plot will not be overcrowded
idx <- sample(1:dim(iris)[1], 40)
iris3 <- iris[idx, ]
# remove class label
iris3$Species <- NULL
# hierarchical clustering
hc <- hclust(dist(iris3), method = "ave")
# plot clusters
plot(hc, hang = -1, labels = iris$Species[idx])
# cut tree into 3 clusters
rect.hclust(hc, k = 3)
# get cluster IDs
groups <- cutree(hc, k = 3)
31 / 62
  • 34. Agglomeration Methods of hclust hclust(d, method = "complete", members = NULL) method = "ward.D" or "ward.D2": Ward’s minimum variance method aims at finding compact, spherical clusters [R Core Team, 2015]. method = "complete": complete-link distance; finds similar clusters. method = "single": single-link distance; adopts a “friends of friends” clustering strategy. method = "average": average distance (UPGMA) method = "centroid": centroid distance (UPGMC) method = "median": median distance (WPGMC) method = "mcquitty": McQuitty’s weighted average distance (WPGMA) A short sketch comparing several of these methods follows. 33 / 62
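To see how the agglomeration method changes the tree, the sketch below builds a dendrogram for several linkage choices on the same 40-record sample used above (a small illustration; the sample iris3 and idx come from the earlier slide):
## compare agglomeration methods on the same sample
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  hc.m <- hclust(dist(iris3), method = m)
  plot(hc.m, hang = -1, labels = iris$Species[idx], main = m)
}
par(mfrow = c(1, 1))
Single linkage tends to produce long "chained" clusters, while complete and average linkage give more compact groups.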
  • 35. DIANA DIANA [Kaufman and Rousseeuw, 1990]: divisive hierarchical clustering Constructs a hierarchy of clusterings, starting with one large cluster containing all observations. Divides clusters until each cluster contains only a single observation. At each stage, the cluster with the largest diameter is selected. (The diameter of a cluster is the largest dissimilarity between any two of its observations.) To divide the selected cluster, the algorithm first looks for its most disparate observation (i.e., the one with the largest average dissimilarity to the other observations in the selected cluster). This observation initiates the “splinter group”. In subsequent steps, the algorithm reassigns observations that are closer to the “splinter group” than to the “old party”. The result is a division of the selected cluster into two new clusters. 34 / 62
  • 36. DIANA
## clustering with DIANA
library(cluster)
diana.result <- diana(iris3)
plot(diana.result, which.plots = 2, labels = iris$Species[idx])
35 / 62
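To obtain flat clusters from the DIANA hierarchy, the diana object can be converted to an hclust tree and cut, just as with agglomerative clustering. A minimal sketch, assuming the diana.result object above:
## cut the divisive hierarchy into 3 clusters
diana.groups <- cutree(as.hclust(diana.result), k = 3)
table(diana.groups, iris$Species[idx])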
  • 38. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 37 / 62
  • 39. Density-Based Clustering The rationale of density-based clustering is that a cluster is composed of well-connected dense regions, while objects in sparse areas are treated as noise. DBSCAN is a typical density-based clustering algorithm, which works by expanding clusters into their dense neighborhoods [Ester et al., 1996]. Other density-based clustering techniques: OPTICS [Ankerst et al., 1999] and DENCLUE [Hinneburg and Keim, 1998] The advantage of density-based clustering is that it can filter out noise and find clusters of arbitrary shapes (as long as they are composed of connected dense regions). 38 / 62
  • 40. DBSCAN [Ester et al., 1996] Groups objects into one cluster if they are connected to one another by a densely populated area The DBSCAN algorithm from package fpc provides density-based clustering of numeric data. Two key parameters in DBSCAN: eps: reachability distance, which defines the size of the neighborhood; and MinPts: minimum number of points. If the number of points in the neighborhood of point α is no less than MinPts, then α is a dense point. All the points in its neighborhood are density-reachable from α and are put into the same cluster as α. Can discover clusters with various shapes and sizes Insensitive to noise See the sketch below for one common way to choose eps. 39 / 62
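Choosing eps is often the fiddly part. One common heuristic, shown here as a hedged sketch using base R only (0.42 is simply the value used on the next slide, not something the heuristic is guaranteed to produce), is to plot each point's distance to its MinPts-th nearest neighbour and pick a value near the "knee" of the curve:
## distance to the 5th nearest neighbour for every point (MinPts = 5)
iris2 <- iris[-5]                                 # iris data without the class label
d <- as.matrix(dist(iris2))
knn.dist <- apply(d, 1, function(x) sort(x)[6])   # [1] is the point itself
plot(sort(knn.dist), type = "l", ylab = "distance to 5th nearest neighbour")
abline(h = 0.42, lty = 2)                         # eps chosen on the next slide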
  • 41. Density-based Clustering of the iris data
## Density-based Clustering
library(fpc)
iris2 <- iris[-5]  # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
ds
## dbscan Pts=150 MinPts=5 eps=0.42
##         0  1  2  3
## border 29  6 10 12
## seed    0 42 27 24
## total  29 48 37 36
40 / 62
  • 42. Density-based Clustering of the iris data
# compare clusters with actual class labels
table(ds$cluster, iris$Species)
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33
1 to 3: identified clusters 0: noise or outliers, i.e., objects not assigned to any cluster 41 / 62
  • 43.
plot(ds, iris2)
[Figure: scatter plot matrix of the four iris variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), points marked by DBSCAN cluster.]
42 / 62
  • 44.
plot(ds, iris2[, c(1, 4)])
[Figure: scatter plot of Sepal.Length vs Petal.Width, points marked by DBSCAN cluster.]
43 / 62
  • 46. Prediction with Clustering Model Label new data based on their similarity with the clusters. Draw a sample of 10 objects from iris and add small noise to them to make a new dataset for labeling. Random noise is generated from a uniform distribution using function runif().
## cluster prediction
# create a new dataset for labeling
set.seed(435)
idx <- sample(1:nrow(iris), 10)
# remove class labels
new.data <- iris[idx, -5]
# add random noise
new.data <- new.data + matrix(runif(10 * 4, min = 0, max = 0.2), nrow = 10, ncol = 4)
# label new data
pred <- predict(ds, iris2, new.data)
45 / 62
  • 48. Results of Prediction
table(pred, iris$Species[idx])  # check cluster labels
## pred setosa versicolor virginica
##    0      0          0         1
##    1      3          0         0
##    2      0          3         0
##    3      0          1         2
Eight (= 3 + 3 + 2) out of 10 objects are assigned to clusters of the correct species. 46 / 62
  • 49.
plot(iris2[, c(1, 4)], col = 1 + ds$cluster)
points(new.data[, c(1, 4)], pch = "+", col = 1 + pred, cex = 3)
[Figure: scatter plot of Sepal.Length vs Petal.Width coloured by DBSCAN cluster, with the 10 new observations marked as "+" in their predicted cluster colours.]
47 / 62
  • 50. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 48 / 62
  • 51. Cluster Validation silhouette() compute or extract silhouette information (cluster) cluster.stats() compute several cluster validity statistics from a clustering and a dissimilarity matrix (fpc) clValid() calculate validation measures for a given set of clustering algorithms and number of clusters (clValid) clustIndex() calculate the values of several clustering indexes, which can be independently used to determine the number of clusters existing in a data set (cclust) NbClust() provide 30 indices for cluster validation and determining the number of clusters (NbClust) 49 / 62
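As a small illustration of two of these functions, the sketch below computes silhouette information for the k-means result from the earlier slides and a few validity statistics with cluster.stats() (component names as documented in the cluster and fpc packages; kmeans.result and iris2 are assumed to exist from the k-means slides):
library(cluster)
library(fpc)
d <- dist(iris2)
sil <- silhouette(kmeans.result$cluster, d)
summary(sil)                                   # per-cluster average silhouette widths
cs <- cluster.stats(d, kmeans.result$cluster)
cs$avg.silwidth                                # overall average silhouette width
cs$dunn                                        # Dunn index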
  • 52. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 50 / 62
  • 53. Further Readings - Clustering A brief overview of various approaches to clustering Yanchang Zhao, et al. ”Data Clustering.” In Ferraggine et al. (Eds.), Handbook of Research on Innovations in Database Technologies and Applications, Feb 2009. http://yanchang.rdatamining.com/publications/ Overview-of-Data-Clustering.pdf Cluster Analysis & Evaluation Measures https://en.wikipedia.org/wiki/Cluster_analysis Detailed reviews of algorithms for data clustering Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323. Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue Software, San Jose, CA, USA. http://citeseer.ist.psu.edu/berkhin02survey.html. A comprehensive textbook on data mining Han, J., & Kamber, M. (2000). Data mining: concepts and techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 51 / 62
  • 54. Further Readings - Clustering with R Data Mining Algorithms In R: Clustering https: //en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering Data Mining Algorithms In R: k-Means Clustering https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/ Clustering/K-Means Data Mining Algorithms In R: k-Medoids Clustering https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/ Clustering/Partitioning_Around_Medoids_(PAM) Data Mining Algorithms In R: Hierarchical Clustering https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/ Clustering/Hierarchical_Clustering Data Mining Algorithms In R: Density-Based Clustering https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/ Clustering/Density-Based_Clustering 52 / 62
  • 55. Contents Introduction Data Clustering with R The Iris Dataset Partitioning Clustering The k-Means Clustering The k-Medoids Clustering Hierarchical Clustering Density-Based clustering Cluster Validation Further Readings and Online Resources Exercises 53 / 62
  • 56. Exercise - I Clustering cars based on road test data mtcars: the Motor Trend Car Road Tests data, comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models) [R Core Team, 2015] A data frame with 32 observations on 11 variables: 1. mpg: fuel consumption (miles/gallon) 2. cyl: Number of cylinders 3. disp: Displacement (cu.in.) 4. hp: Gross horsepower 5. drat: Rear axle ratio 6. wt: Weight (1000 lbs) 7. qsec: 1/4 mile time 8. vs: Engine shape (0 = V-shaped, 1 = straight) 9. am: Transmission (0 = automatic, 1 = manual) 10. gear: Number of forward gears 11. carb: Number of carburetors 54 / 62
  • 57. Exercise - II To cluster the states of the US state.x77: statistics of the 50 states of the US [R Core Team, 2015] a matrix with 50 rows and 8 columns 1. Population: population estimate as of July 1, 1975 2. Income: per capita income (1974) 3. Illiteracy: illiteracy (1970, percent of population) 4. Life Exp: life expectancy in years (1969–71) 5. Murder: murder and non-negligent manslaughter rate per 100,000 population (1976) 6. HS Grad: percent high-school graduates (1970) 7. Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city 8. Area: land area in square miles 55 / 62
  • 58. Exercise - Questions Which attributes to use? Are the attributes at the same scale? Which clustering techniques to use? Which clustering algorithms to use? How many clusters to find? Are the clustering results good or not? 56 / 62
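One possible starting point for the exercises, shown only as a hedged sketch (the choice of linkage method and of four clusters is arbitrary and is exactly what the questions above ask you to justify): the attributes of state.x77 are on very different scales, so standardise them before computing distances.
## standardise, then cluster the US states hierarchically
df <- scale(state.x77)
hc <- hclust(dist(df), method = "ave")
plot(hc, hang = -1)
rect.hclust(hc, k = 4)       # try different k and validate the results
groups <- cutree(hc, k = 4)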
  • 59. Online Resources Book titled R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining-book.pdf R Reference Card for Data Mining http://www.rdatamining.com/docs/RDataMining-reference-card.pdf Free online courses and documents http://www.rdatamining.com/resources/ RDataMining Group on LinkedIn (27,000+ members) http://group.rdatamining.com Twitter (3,300+ followers) @RDataMining 57 / 62
  • 61. How to Cite This Work Citation Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf. BibTex @BOOK{Zhao2012R, title = {R and Data Mining: Examples and Case Studies}, publisher = {Academic Press, Elsevier}, year = {2012}, author = {Yanchang Zhao}, pages = {256}, month = {December}, isbn = {978-0-123-96963-7}, keywords = {R, data mining}, url = {http://www.rdatamining.com/docs/RDataMining-book.pdf} } 59 / 62
  • 62. References I
Alsabti, K., Ranka, S., and Singh, V. (1998). An efficient k-means clustering algorithm. In Proc. the First Workshop on High Performance Data Mining, Orlando, Florida.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 49–60, New York, NY, USA. ACM Press.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231.
Frank, A. and Asuncion, A. (2010). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
Guha, S., Rastogi, R., and Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. In SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 73–84, New York, NY, USA. ACM Press.
Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: a robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, 23-26 March 1999, Sydney, Australia, pages 512–521. IEEE Computer Society.
Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
60 / 62
  • 63. References II
Hennig, C. (2014). fpc: Flexible procedures for clustering. R package version 2.1-9.
Hinneburg, A. and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58–65. (DENCLUE)
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304.
Karypis, G., Han, E.-H., and Kumar, V. (1999). Chameleon: hierarchical clustering using dynamic modeling. Computer, 32(8):68–75.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics, Applied Probability and Statistics. New York: Wiley.
Macqueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.4. For new features, see the ’Changelog’ file (in the package source).
61 / 62
  • 64. References III
Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 103–114, New York, NY, USA. ACM Press.
62 / 62