Methods of Unsupervised Learning
[ISLR.2013.Ch10-10]
Theodore Grammatikopoulos∗
Tue 6th Jan, 2015
Abstract
Here we discuss Unsupervised Learning Methods, a set of statistical tools mostly
intended to reveal interesting structure in the attributes of what we consider
a feature space, X1, X2, ..., Xp, measured on n observations. Unlike the
methods we examined in previous articles, there is no response variable to
predict. More specifically, we are going to discuss two particular types of
unsupervised learning: (1) Principal Component Analysis (PCA) and (2) Clustering.
## OTN License Agreement: Oracle Technology Network - Developer
## Oracle Distribution of R version 3.0.1 (--) Good Sport
## Copyright (C) The R Foundation for Statistical Computing
## Platform: x86_64-unknown-linux-gnu (64-bit)
∗ e-mail: tgrammat@gmail.com
1 Principal Component Analysis
Here we examine Principal Component Analysis (PCA) on the USArrests data set, which
is part of the base R distribution. The rows of the data set correspond to the 50 US
states, and the columns give the number of arrests per 100,000 residents for the
crimes of Murder, Assault, and Rape. The percentage of each state's population living
in urban areas (UrbanPop) is also given.
states <- row.names(USArrests)
Attribs <- names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "Florida"
## [10] "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa"
## [16] "Kansas" "Kentucky" "Louisiana"
## [19] "Maine" "Maryland" "Massachusetts"
## [22] "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska"
## [28] "Nevada" "New Hampshire" "New Jersey"
## [31] "New Mexico" "New York" "North Carolina"
## [34] "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island"
## [40] "South Carolina" "South Dakota" "Tennessee"
## [43] "Texas" "Utah" "Vermont"
## [46] "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
Attribs
## [1] "Murder" "Assault" "UrbanPop" "Rape"
To get a rough overview of the USArrests data set, we calculate the mean of each attribute (column):
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
Note that, on average, there are roughly three times as many rapes as murders, and
more than eight times as many assaults as rapes. To find the variances of the data set:
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97 6945.17 209.52 87.73
Not surprisingly, the variables also have vastly different variances. Note, for example,
the variance of the UrbanPop variable, which measures the percentage of each state's
population living in an urban area: it is not comparable with the variance of the number
of rapes per 100,000 individuals. If we failed to scale the variables before performing
PCA, the resulting principal components would be strongly driven by the Assault variable,
since it has by far the largest mean and variance. Thus, it is important to standardize
the variables to have mean zero and standard deviation one before performing PCA. In
fact, we can standardize the variables and perform the principal component analysis in
one step by using the prcomp{stats}() function.
PCA.out <- prcomp(USArrests, scale = TRUE)
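As a quick sanity check (a minimal sketch, not part of the original analysis; PCA.manual is a hypothetical name), standardizing the data manually with scale() and running prcomp() on the result should reproduce the same loadings, up to the sign ambiguity that is inherent to principal components:
# Hypothetical check: manual standardization followed by PCA
PCA.manual <- prcomp(scale(USArrests))
all.equal(abs(PCA.manual$rotation), abs(PCA.out$rotation))  # expected: TRUE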
The output of prcomp() contains a number of useful quantities, namely:
names(PCA.out)
## [1] "sdev" "rotation" "center" "scale" "x"
One can find the means and standard deviations of the variables that were used for
scaling prior to implementing PCA by calling:
PCA.out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
PCA.out$scale
## Murder Assault UrbanPop Rape
## 4.356 83.338 14.475 9.366
The rotation matrix provides the principal component loadings; each column of
PCA.out$rotation contains the corresponding principal component loading vector∗.
PCA.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5359 0.4182 -0.3412 0.64923
## Assault -0.5832 0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780 0.13388
## Rape -0.5434 -0.1673 0.8178 0.08902
Note that there are four principal components. This is expected, since in a data set
with n observations and p variables there are in general min(n − 1, p) informative principal
components. In addition, the principal component scores for every state of the USArrests
data set are returned by typing
PCA.out$x
which is a 50 × 4 matrix. To plot a scaled biplot of the first two principal components
and the loadings of the attributes of the feature space (Figure 1 below):
par(mfrow = c(1, 1), mar = c(2, 2, 2, 2), oma = c(0, 0, 5, 0))
biplot(PCA.out, scale = 0, col = c("blue", "red"),
       xlab = "1st Principal Component", ylab = "2nd Principal Component")
title("First Two Principal Components & Loading Vectors\nfor Crimes in US States [USArrests]",
      outer = TRUE)
To calculate the Proportion of Variance Explained (PVE) by each principal component:
PCA.var <- (PCA.out$sdev)^2
PCA.PVE <- (PCA.var/sum(PCA.var))
∗ The PCA.out$rotation component is also called the rotation matrix because, when we matrix-multiply
the data matrix X by PCA.out$rotation, we obtain the data coordinates in the rotated coordinate system.
These coordinates are the principal component scores.
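As a quick numerical check of this relation (a minimal sketch; PCA.out is the object fitted above, and X.scaled is a hypothetical name for the standardized data):
# The principal component scores should equal the scaled data times the rotation matrix
X.scaled <- scale(USArrests)
max(abs(X.scaled %*% PCA.out$rotation - PCA.out$x))  # ~ 0, up to floating-point error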
Figure 1: Biplot of the first two Principal Components and the Loading Vectors
for crimes in the US states [USArrests]. The blue state names represent the scores
for the first two principal components. The red arrows indicate the first two principal
component loading vectors (with axes on the bottom and left).
PCA.PVE
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
Note that the first two principal components in the analysis above explain most of the
variance in the data: 62% is explained by the first principal component and 24.7% by
the second.
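The same proportions can also be read directly from the importance matrix returned by summary() on a prcomp object, as is done for the NCI60 data in the next section (a minimal cross-check using the object fitted above):
# PVE as reported by summary(); should match PCA.PVE up to rounding
summary(PCA.out)$importance["Proportion of Variance", ]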
We can plot the PVE by each component, as well as the cumulative PVE, as follows:
par(mfrow = c(1, 2), oma = c(0, 0, 4, 0))
plot(PCA.PVE, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained (PVE)",
     ylim = c(0, 1), type = "b", main = "PVE vs Principal Component", pch = 8)
plot(cumsum(PCA.PVE), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained (PVE)",
     ylim = c(0, 1), type = "b", main = "Cumulative PVE vs Principal Component", pch = 8)
This is shown in Figure 2 below.
Figure 2: Proportion of Variance Explained (PVE) and Cumulative Proportion of Variance
Explained (Cumulative PVE) versus the number of Principal Components in use [USArrests].
2 A Real Example - NCI60 Micro-array data (Expression
levels on 6830 genes from 64 cancer cell lines)
Unsupervised learning methods are often used in the analysis of genomic data, where the
plethora of attributes in the feature space is of little use without good exploratory
data analysis. In this section we illustrate these techniques on the NCI60 cancer cell line
micro-array data, which consists of 6,830 gene expression measurements on 64 cancer cell
lines.
library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data
Each cell line is labeled with a cancer type. What we want to examine here is whether there
are specific patterns of gene expression associated with particular cancer types, i.e.
whether the appearance of a particular cancer type can be explained by a specific pattern
of gene expression.
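Before applying any method, it is useful to confirm the dimensions of the data matrix and to see how many cell lines fall under each cancer type (a minimal exploratory sketch; the expected dimensions follow from the description above):
dim(nci.data)   # expected: 64 rows (cell lines) by 6830 columns (genes)
table(nci.labs) # number of cell lines per cancer type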
2.1 PCA on the NCI60{ISLR} data set
We scale the attributes of the feature space to have standard deviation one and perform
principal component analysis by using the prcomp() function.
PCA.NCI60 <- prcomp(nci.data, scale = TRUE)
A summary of the result can be printed out as follows:
PCA.NCI60.printout <- summary(PCA.NCI60)
PCA.NCI60.printout$importance[, 1:10]
## PC1 PC2 PC3 PC4
## Standard deviation 27.8535 21.48136 19.82046 17.03256
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115
## PC5 PC6 PC7 PC8
## Standard deviation 15.97181 15.72108 14.47145 13.54427
## Proportion of Variance 0.03735 0.03619 0.03066 0.02686
## Cumulative Proportion 0.31850 0.35468 0.38534 0.41220
## PC9 PC10
## Standard deviation 13.14400 12.73860
## Proportion of Variance 0.02529 0.02376
## Cumulative Proportion 0.43750 0.46126
Truncating the result to the first 10 principal components, it is already evident that a
considerable proportion of the variance has been explained. However, it is more
informative to plot the Proportion of Variance Explained (PVE) of each principal component
(a scree plot), as well as the cumulative PVE.
PCA.NCI60.PVE <- 100 * (PCA.NCI60$sdev)^2 / sum((PCA.NCI60$sdev)^2)
par(mfrow = c(1, 2), oma = c(0, 0, 4, 0))
plot(PCA.NCI60.PVE, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained (PVE)",
     type = "b", main = "PVE vs Principal Component", cex = 0.6, col = "blue")
plot(cumsum(PCA.NCI60.PVE), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained (PVE)",
     type = "b", main = "Cumulative PVE vs Principal Component", cex = 0.6, col = "red")
From the scree plot shown in Figure 3, we see that while the first seven principal
components explain a substantial proportion of the variance in the NCI60{ISLR} data set
(roughly 40%), there is a steep decrease in the PVE of further principal components. More
specifically, there is an elbow in the plot after approximately the seventh principal
component. This suggests that there may be little benefit in examining more than
seven or so principal components (though even examining seven principal components may
be difficult).
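The approximate figure of 40% quoted above can be checked directly from the cumulative PVE vector (a minimal sketch; PCA.NCI60.PVE is the object computed above, already expressed in percent):
# Cumulative PVE after the first seven principal components (roughly 40%)
round(cumsum(PCA.NCI60.PVE)[7], 1)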
Next, we plot a scatter plot matrix of the score vectors of the first seven principal
components. The observations (cell lines) corresponding to a given cancer type are
plotted in the same color, so that we can easily see to what extent the observations
within a cancer type are similar to each other. We first define a simple function that
assigns a color to each of the 64 cell lines, based on the cancer type to which it corresponds.
Figure 3: Proportion of Variance Explained (PVE) and Cumulative Proportion of Variance
Explained (Cumulative PVE) versus the number of Principal Components in use [NCI60{ISLR}].
CancerType_col <- function(vec) {
cols <- rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
Then we produce the scatter plot matrix (Figure 4):
pairs(PCA.NCI60$x[, 1:7], col = CancerType_col(nci.labs), pch = 20)
Figure 4: Scatter plots of the score vectors of the first seven principal components of
the [NCI60{ISLR}] data set.
Or, if we want to concentrate only on the first four principal components (Figure 5):
pairs(PCA.NCI60$x[, 1:4], col = CancerType_col(nci.labs), pch = 20)
2.2 Clustering on the NCI60{ISLR} data set
We now try to investigate whether or not the observations of the NCI60{ISLR} data set
cluster into distinct types of cancer. To do so we will use the method of hierarchical
clustering.
Figure 5: Scatter plots of the score vectors of the first four principal components of
the [NCI60{ISLR}] data set.
Since hierarchical clustering is sensitive to the scales of the attributes in use, we first
standardize these variables to have mean zero and standard deviation one
nci.data.sd <- scale(nci.data)
and perform hierarchical clustering using Complete, Single, and Average Linkage respectively.
The dissimilarity measure is chosen to be the Euclidean distance.
nci.data.sd.dist <- dist(nci.data.sd, method = "euclidean")
dev.new()
opar <- par(no.readonly = TRUE)   # save the current graphical parameters
par(mfrow = c(3, 1), mar = c(2, 2, 1, 1), oma = c(0, 0, 3, 0))
plot(hclust(nci.data.sd.dist, method = "complete"), labels = nci.labs,
     main = "Complete Linkage", xlab = "", ylab = "", sub = "")
plot(hclust(nci.data.sd.dist, method = "average"), labels = nci.labs,
     main = "Average Linkage", xlab = "", ylab = "", sub = "")
plot(hclust(nci.data.sd.dist, method = "single"), labels = nci.labs,
     main = "Single Linkage", xlab = "", ylab = "", sub = "")
par(opar)                         # restore the saved parameters
The results are shown in Figure 6 below. It is evident that the choice of linkage greatly
affects the results obtained. Typically, single linkage tends to yield trailing clusters:
very large clusters onto which individual observations attach one by one. On the other
hand, complete and average linkage tend to yield more balanced, attractive clusters. For
these reasons, complete and average linkage are generally preferred to single linkage.
Clearly, cell lines within a single cancer type do tend to cluster together, although the
clustering is not perfect. In the analysis below we are going to use only complete-linkage
hierarchical clustering.
More specifically, we build the hierarchical clustering again and assign the result to a
local variable. The dissimilarity measure in use is the Euclidean distance, while the
maximal inter-cluster dissimilarity is taken into account (complete linkage).
hclust.out <- hclust(nci.data.sd.dist, method = "complete")
and cut the resulting dendrogram at a height that returns a particular number of
clusters, say four.
hclust.clusters <- cutree(hclust.out, 4)
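Equivalently, the tree can be cut at an explicit height rather than at a target number of clusters; cutting at roughly h = 139, the height marked in Figure 7 below, should reproduce the same four groups (a minimal sketch; hclust.clusters.h is a hypothetical name):
# Cutting by height instead of by number of clusters
hclust.clusters.h <- cutree(hclust.out, h = 139)
table(hclust.clusters, hclust.clusters.h)  # expected, given Figure 7: each cluster maps to one group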
Figure 6: The NCI60{ISLR} cancer cell line micro-array data, clustered with average,
complete, and single linkage, and using Euclidean distance as the dissimilarity measure.
Complete and average linkage tend to yield evenly sized clusters whereas single linkage
tends to yield extended clusters to which single leaves are fused one by one.
table(hclust.clusters, nci.labs)
## nci.labs
## hclust.clusters BREAST CNS COLON K562A-repro K562B-repro
## 1 2 3 2 0 0
## 2 3 2 0 0 0
## 3 0 0 0 1 1
## 4 2 0 5 0 0
## nci.labs
## hclust.clusters LEUKEMIA MCF7A-repro MCF7D-repro MELANOMA
## 1 0 0 0 8
## 2 0 0 0 0
## 3 6 0 0 0
## 4 0 1 1 0
## nci.labs
## hclust.clusters NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 8 6 2 8 1
## 2 1 0 0 1 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
There are some clear patterns. All of the LEUKEMIA cell lines fall in cluster 3, while the
BREAST cell lines are spread out over three different clusters, 1, 2, and 4. We can also
plot the dendrogram produced above and draw a horizontal line at the height at which
these four clusters are produced.
par(mfrow = c(1, 1))
plot(hclust.out, labels = nci.labs, hang = 0.1, main = "Complete Linkage",
     xlab = "", ylab = "", sub = "")
abline(h = 139, col = "red", lty = "dashed")
Figure 7: The NCI60{ISLR} cancer cell line micro-array data, clustered with complete
linkage, and Euclidean distance as the dissimilarity measure. A horizontal dashed red line
has been added at the height (h = 139) where four distinct clusters emerge.
It is important to note that K-means clustering and hierarchical clustering with the
dendrogram cut to obtain the same number of clusters can yield very different results.
Indeed, let us now perform K-means clustering on the NCI60 micro-array data set and
compare its results with those obtained before.
set.seed(343)
km.out <- kmeans(nci.data.sd, 4, nstart = 20)
km.clusters <- km.out$cluster
table(km.clusters, hclust.clusters)
## hclust.clusters
## km.clusters 1 2 3 4
## 1 0 0 8 0
## 2 9 0 0 0
## 3 11 0 0 9
## 4 20 7 0 0
Obviously, the four clusters obtained using hierarchical clustering and K-means clustering
differ. Cluster 1 in K-means clustering is identical to cluster 3 in hierarchical
clustering. However, the other clusters differ: for instance, cluster 4 in K-means clustering
contains a portion of the observations assigned to cluster 1 by hierarchical clustering,
as well as all of the observations assigned to cluster 2 by hierarchical clustering.
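For completeness, the K-means assignment can also be cross-tabulated against the cancer type labels, exactly as was done for the hierarchical clusters above (a minimal sketch using the objects already defined):
# How the K-means clusters relate to the actual cancer types
table(km.clusters, nci.labs)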