2. What is Clustering?
Clustering is a data segmentation technique that partitions data into several
groups based on their similarity.
The grouping is done by a statistical procedure, and the smaller groups formed
from the full data set are known as clusters.
3. What is Cluster Analysis?
A method of summarizing data, similar to factor analysis: cases are grouped into
“clusters” with other similar cases.
It is a way of grouping data into meaningful groups, or clusters.
4. Types of Cluster Analysis
There are two types of cluster analysis, R type and Q type:
R Type - to what extent do the variables covary across the cases?
Q Type - to what extent do the cases covary across the variables?
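The distinction can be illustrated with a small sketch on the built-in mtcars data, where rows are cases (cars) and columns are variables. It assumes hierarchical clustering via hclust() and uses 1 - |correlation| as the distance between variables; both are illustrative choices, as are the names case_groups and var_groups and the choice of 3 groups.

```r
x <- scale(mtcars)   # rows = cases (cars), columns = variables

# Q type: group the cases (rows) by their distance across the variables
case_groups <- cutree(hclust(dist(x)), k = 3)

# R type: group the variables (columns) by how strongly they covary
# across the cases, using 1 - |correlation| as an assumed distance
var_dist   <- as.dist(1 - abs(cor(x)))
var_groups <- cutree(hclust(var_dist), k = 3)

table(case_groups)                     # how many cars fall in each group
split(names(var_groups), var_groups)   # which variables are grouped together
```

Transposing the roles of rows and columns is all that separates the two analyses here.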
5. Where is it used?
It is used where the input data is very large and we are tasked with finding
similar subsets that can be analysed in several ways.
For example, a marketing company can categorise its customers by economic
background, age and several other factors in order to sell its products more
effectively.
6. K-Means clustering in R
K-means is one of the most popular partitioning algorithms for cluster analysis,
and R provides it as the kmeans() function. It is an unsupervised learning
algorithm that groups observations by their similarity. The number of clusters
is specified in advance, and similar observations should end up in the same
cluster. The algorithm assigns each observation to a cluster and also finds the
centroid of each cluster.
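The assign-then-update idea described above can be sketched by hand. This is a minimal Lloyd-style illustration, not how R's kmeans() is implemented (its default is the Hartigan-Wong algorithm); the function name simple_kmeans and its arguments are hypothetical.

```r
# Minimal Lloyd-style k-means sketch (illustrative only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Use k randomly chosen rows as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance of every row to every centroid
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                numeric(nrow(x)))
    new_cluster <- max.col(-d, ties.method = "first")  # nearest centroid per row
    if (all(new_cluster == cluster)) break             # assignments are stable
    cluster <- new_cluster
    # Update step: move each centroid to the mean of its current members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed, only so the random start is repeatable
res <- simple_kmeans(mtcars[, c("mpg", "cyl", "wt")], k = 3)
table(res$cluster)
```

In practice the built-in kmeans() is preferable; the sketch only makes the two alternating steps visible.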
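7. Code for K-means Clustering in R
# Cluster three mtcars variables into 3 groups
mydata <- mtcars[, c('mpg', 'cyl', 'wt')]
clusters <- kmeans(mydata, 3)
# Set the plot margins (previous settings are saved in kmeanPlot), then plot by cluster
kmeanPlot <- par(mar = c(5.1, 4.1, 0, 1))
plot(mydata, col = clusters$cluster)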
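Because kmeans() starts from randomly chosen centres, repeated runs of the code above may label or even group the cars differently. A hedged sketch of a more repeatable call follows; the seed value and nstart = 25 are arbitrary illustrative choices, not part of the original code.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility
mydata <- mtcars[, c("mpg", "cyl", "wt")]

# nstart = 25 tries 25 random starts and keeps the best solution
clusters <- kmeans(mydata, centers = 3, nstart = 25)

clusters$size            # number of observations in each cluster
clusters$centers         # centroid (mean of each variable) per cluster
clusters$tot.withinss    # total within-cluster sum of squares
```

The cluster, centers, size and tot.withinss components are all part of the object that kmeans() returns.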
8. Points to keep in mind
• k-means clustering is a flat clustering technique: it produces a single
partition with k clusters
• it requires the user to choose the number of clusters at the beginning
• k-means clustering is usually much faster than hierarchical clustering,
especially on large data sets
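The "flat versus hierarchical" point can be seen in a short sketch: a hierarchical fit yields one tree that can be cut at any number of clusters afterwards, whereas k-means must be told k up front and returns only that partition. The use of hclust() with its default settings and the values of k are illustrative assumptions.

```r
x <- scale(mtcars)

# Hierarchical: build the tree once, then cut it at different levels
hc <- hclust(dist(x))
table(cutree(hc, k = 3))
table(cutree(hc, k = 5))

# k-means: k is fixed in advance and a single flat partition is returned
set.seed(1)  # arbitrary seed for a repeatable random start
km <- kmeans(x, centers = 3)
table(km$cluster)
```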
9. Using the mtcars dataset
# Clean and normalize the data
data(mtcars)
mydata <- na.omit(mtcars)   # delete rows with missing values
mydata <- scale(mydata)     # standardize variables
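# Determine number of clusters
# For one cluster, the within-groups sum of squares is the total sum of squares
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15)
  wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Check the plot for an "elbow" where adding more clusters stops helping much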
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
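# Visualize the clustering results
library(cluster)
# Plot only the original variables, so an appended cluster column is left out
clusplot(mydata[, colnames(mtcars)], fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)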
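Since kmeans() depends on random starting centres, the steps above can give a different elbow curve and different clusters on each run. The following is a minimal sketch of a more repeatable variant, assuming a fixed seed and several random starts; the seed, nstart = 25, the range 1:10 and the names scaled, ks and cluster_profile are illustrative choices, not part of the original code. It also reports cluster means in the original units, which are easier to read than means of standardized values.

```r
set.seed(123)                       # arbitrary seed, only for reproducibility
scaled <- scale(na.omit(mtcars))    # same preparation as above

# Within-groups sum of squares for k = 1..10, best of 25 random starts each
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(scaled, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# Fit the 5-cluster solution and summarise it in the original units
fit <- kmeans(scaled, centers = 5, nstart = 25)
cluster_profile <- aggregate(na.omit(mtcars),
                             by = list(cluster = fit$cluster), FUN = mean)
print(cluster_profile[, c("cluster", "mpg", "cyl", "wt")])
```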