This document discusses two methods of unsupervised learning: principal component analysis (PCA) and clustering. It applies PCA and clustering to cancer microarray gene expression data (NCI60) to explore patterns in the data without a response variable. PCA of the NCI60 data finds the first seven principal components explain 40% of the variance. Scatter plots of the first seven principal components show cancer types cluster together, though imperfectly. Hierarchical clustering with complete linkage also tends to cluster cell lines within a single cancer type together.
Methods of Unsupervised Learning (Article 10 - Practical Exercises)
[ISLR.2013.Ch10-10]
Theodore Grammatikopoulos∗
Tue 6th Jan, 2015
Abstract
Here we discuss unsupervised learning methods, a set of statistical tools mostly
intended to reveal interesting information about the attributes of what we consider
a feature space, X1, X2, . . . , Xp, measured on n observations. Unlike the methods
we examined in previous articles, there is no response variable for which to make a
prediction. More specifically, we discuss two particular types of unsupervised
learning: (1) Principal Component Analysis (PCA) and (2) Clustering.
∗ e-mail: tgrammat@gmail.com
1 Principal Component Analysis
Here we examine Principal Component Analysis (PCA) on the USArrests data set, which
is part of the base R distribution. The rows of the data set correspond to the 50 US
states, and the columns give the number of arrests per 100,000 residents for the crimes
of Murder, Assault and Rape, respectively. The UrbanPop column also records the
percentage of each state's population living in urban areas.
states <- row.names(USArrests)
Attribs <- names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "Florida"
## [10] "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa"
## [16] "Kansas" "Kentucky" "Louisiana"
## [19] "Maine" "Maryland" "Massachusetts"
## [22] "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska"
## [28] "Nevada" "New Hampshire" "New Jersey"
## [31] "New Mexico" "New York" "North Carolina"
## [34] "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island"
## [40] "South Carolina" "South Dakota" "Tennessee"
## [43] "Texas" "Utah" "Vermont"
## [46] "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
Attribs
## [1] "Murder" "Assault" "UrbanPop" "Rape"
To get a rough overview of the USArrests data set, we calculate the mean of each attribute.
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
Note that there are, on average, three times as many rapes as murders, and more than
eight times as many assaults as rapes. To find the variances of the data set:
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97 6945.17 209.52 87.73
Not surprisingly, the variables also have vastly different variances. Note, for example, the
variance of the UrbanPop variable, which measures the percentage of each state's population
living in an urban area: it is not comparable with the variance of the number of
rapes per 100,000 individuals. If we failed to scale the variables before performing PCA,
the resulting principal components would be strongly driven by the Assault variable,
since it has by far the largest mean and variance. Thus, it is important to standardize
the variables to have mean zero and standard deviation one before performing PCA. In
fact, we can standardize the variables and perform the principal component analysis
in one step by using the prcomp{stats}() function.
PCA.out <- prcomp(USArrests, scale = TRUE)
The output of prcomp() contains a number of useful quantities:
names(PCA.out)
## [1] "sdev" "rotation" "center" "scale" "x"
One can find the means and standard deviations of the variables that were used for
scaling prior to implementing PCA by calling:
PCA.out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
PCA.out$scale
## Murder Assault UrbanPop Rape
## 4.356 83.338 14.475 9.366
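To see concretely why scaling matters here, we can fit an unscaled PCA for comparison. The following is a sketch for illustration only, not part of the analysis proper: without scaling, the first loading vector is dominated almost entirely by Assault, the variable with by far the largest variance.

```r
# Unscaled PCA for comparison: prcomp() centers but does not scale by default
PCA.unscaled <- prcomp(USArrests, scale = FALSE)

# The first loading vector is dominated by Assault,
# the variable with by far the largest variance
round(PCA.unscaled$rotation[, 1], 3)
abs(PCA.unscaled$rotation["Assault", 1]) > 0.9  # TRUE
```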
The rotation matrix provides the principal component loadings; each column of
PCA.out$rotation contains the corresponding principal component loading vector∗.
PCA.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5359 0.4182 -0.3412 0.64923
## Assault -0.5832 0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780 0.13388
## Rape -0.5434 -0.1673 0.8178 0.08902
Note that there are four principal components. This is expected, since in a data set
with n observations and p variables there are in general min(n − 1, p) informative principal
components. In addition, the principal component scores for every state of the USArrests
data set are returned as a 50 × 4 matrix by typing
PCA.out$x
To plot a scaled diagram of the first two principal components and the loadings of the
attributes of the feature space (Figure 1 below):
par(mfrow = c(1, 1), mar = c(2, 2, 2, 2), oma = c(0, 0, 5, 0))
biplot(PCA.out, scale = 0, col = c("blue", "red"),
       xlab = "1st Principal Component", ylab = "2nd Principal Component")
title("First Two Principal Components & Loading Vectors\nfor Crimes in US States [USArrests]",
      outer = TRUE)
To calculate the Proportion of Variance Explained (PVE) by each principal component:
PCA.var <- (PCA.out$sdev)^2
PCA.PVE <- (PCA.var/sum(PCA.var))
∗ The PCA.out$rotation component is also called the rotation matrix because, when we matrix-multiply
the data matrix X by PCA.out$rotation, we obtain the coordinates of the data in the rotated
coordinate system. These coordinates are the principal component scores.
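This relationship is easy to verify numerically; the following sketch recomputes the scores by hand and compares them with PCA.out$x:

```r
PCA.out <- prcomp(USArrests, scale = TRUE)

# Standardize the data exactly as prcomp(..., scale = TRUE) does internally
X.scaled <- scale(USArrests, center = TRUE, scale = TRUE)

# Matrix-multiplying by the rotation matrix yields the principal component scores
scores.by.hand <- X.scaled %*% PCA.out$rotation
all.equal(unname(scores.by.hand), unname(PCA.out$x))  # TRUE
```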
Figure 1: Biplot diagram of the first two Principal Components and the Loading Vectors
for crimes in US states [USArrests]. The blue state names represent the scores
for the first two principal components. The red arrows indicate the first two principal
component loading vectors (with axes on the top and right).
PCA.PVE
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
Note that the first two principal components in the analysis above explain most of the
variance in the data: 62.0% is explained by the first principal component and 24.7% by
the second.
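As a cross-check, the same proportions can be read directly from summary(), which stores them (rounded) in its importance matrix; a minimal sketch:

```r
PCA.out <- prcomp(USArrests, scale = TRUE)
PCA.var <- PCA.out$sdev^2
PCA.PVE <- PCA.var / sum(PCA.var)

# summary() reports the same proportions in its importance matrix
imp <- summary(PCA.out)$importance
all.equal(unname(imp["Proportion of Variance", ]), unname(PCA.PVE),
          tolerance = 1e-4)  # TRUE
```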
We can plot the PVE of each component, as well as the cumulative PVE, as follows:
par(mfrow = c(1, 2), oma = c(0, 0, 4, 0))
plot(PCA.PVE, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained (PVE)",
     ylim = c(0, 1), type = "b", main = "PVE vs Principal Component", pch = 8)
plot(cumsum(PCA.PVE), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained (PVE)",
     ylim = c(0, 1), type = "b", main = "Cumulative PVE vs Principal Component", pch = 8)
This is shown in Figure 2 below.
Figure 2: Proportion of Variance Explained (PVE) and Cumulative PVE versus the
number of principal components in use [USArrests].
2 A Real Example - NCI60 Micro-array Data (Expression
Levels on 6,830 Genes from 64 Cancer Cell Lines)
Unsupervised learning methods are often used in the analysis of genomic data, where the
plethora of attributes in the feature space is of little help without good exploratory
data analysis. In this section we illustrate these techniques on the NCI60 cancer cell line
micro-array data, which consists of 6,830 gene expression measurements on 64 cancer cell
lines.
library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data
Each cell line is labeled with a cancer type. What we want to examine here is whether
there are specific patterns of gene expression associated with particular cancer types,
i.e. whether the appearance of a particular cancer type can be explained by a specific
pattern of gene expression.
2.1 PCA on the NCI60{ISLR} data set
We scale the attributes of the feature space to have standard deviation one and perform
principal component analysis using the prcomp() function.
PCA.NCI60 <- prcomp(nci.data, scale = TRUE)
A summary can be printed out as follows:
PCA.NCI60.printout <- summary(PCA.NCI60)
PCA.NCI60.printout$importance[, 1:10]
## PC1 PC2 PC3 PC4
## Standard deviation 27.8535 21.48136 19.82046 17.03256
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115
## PC5 PC6 PC7 PC8
## Standard deviation 15.97181 15.72108 14.47145 13.54427
## Proportion of Variance 0.03735 0.03619 0.03066 0.02686
## Cumulative Proportion 0.31850 0.35468 0.38534 0.41220
## PC9 PC10
## Standard deviation 13.14400 12.73860
## Proportion of Variance 0.02529 0.02376
## Cumulative Proportion 0.43750 0.46126
Truncating the output at the first 10 principal components, it is already evident that a
considerable proportion of the variance has been explained. However, it is more
informative to plot the Proportion of Variance Explained (PVE) for each principal
component (the scree plot), as well as the cumulative PVE.
PCA.NCI60.PVE <- 100 * (PCA.NCI60$sdev)^2 / sum((PCA.NCI60$sdev)^2)
par(mfrow = c(1, 2), oma = c(0, 0, 4, 0))
plot(PCA.NCI60.PVE, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained (PVE)",
     type = "b", main = "PVE vs Principal Component", cex = 0.6, col = "blue")
plot(cumsum(PCA.NCI60.PVE), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained (PVE)",
     type = "b", main = "Cumulative PVE vs Principal Component", cex = 0.6, col = "red")
From the scree plot shown in Figure 3, we see that while the first seven principal components
explain a substantial proportion of the variance in the NCI60{ISLR} data set,
i.e. about 40%, there is a steep decrease in the PVE of further principal components.
More specifically, we can discern an elbow in the plot after approximately the seventh
principal component. This suggests that there may be little benefit in examining more
than seven or so principal components (though even examining seven principal components
may be difficult).
Next, we plot a scatter plot matrix between the score vectors of the first seven principal
components. The observations (cell lines) corresponding to a given cancer type will be
plotted in the same color, so that we can easily recognize to what extent the observations
within a cancer type are similar to each other. We first define a simple function that
assigns a color to each of the 64 lines, based on the cancer type to which it corresponds.
Figure 3: Proportion of Variance Explained (PVE) and Cumulative PVE versus the
number of principal components in use [NCI60{ISLR}].
CancerType_col <- function(vec) {
cols <- rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
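The helper simply assigns one colour of a rainbow palette per distinct label, so identical cancer types always share a colour. A toy call (with made-up labels, purely for illustration) shows the behaviour:

```r
CancerType_col <- function(vec) {
    cols <- rainbow(length(unique(vec)))
    return(cols[as.numeric(as.factor(vec))])
}

# Toy labels: repeated types must map to the same colour
demo.labs <- c("BREAST", "CNS", "BREAST", "RENAL")
demo.cols <- CancerType_col(demo.labs)
demo.cols[1] == demo.cols[3]  # TRUE: both BREAST entries share one colour
demo.cols[1] == demo.cols[2]  # FALSE: different types get different colours
```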
Then we produce the scatter plot matrix (Figure 4):
pairs(PCA.NCI60$x[, 1:7], col = CancerType_col(nci.labs), pch = 20)
Figure 4: Scatter plots of the score vectors of the first seven principal components of
the [NCI60{ISLR}] data set.
Or, if we want to concentrate only on the first four principal components (Figure 5):
pairs(PCA.NCI60$x[, 1:4], col = CancerType_col(nci.labs), pch = 20)
2.2 Clustering on the NCI60{ISLR} data set
We now try to investigate whether or not the observations of the NCI60{ISLR} data set
cluster into distinct types of cancer. To do so we will use the method of hierarchical
clustering.
Figure 5: Scatter plots of the score vectors of the first four principal components of
the [NCI60{ISLR}] data set.
Since hierarchical clustering is sensitive to the scales of the attributes in use, we first
standardize these variables to have mean zero and standard deviation one
nci.data.sd <- scale(nci.data)
and perform hierarchical clustering using Complete, Single and Average Linkage respec-
tively. The dissimilarity measure is chosen to be the Euclidean Distance.
nci.data.sd.dist <- dist(nci.data.sd, method = "euclidean")
dev.new()
opar <- par(no.readonly = TRUE)
par(mfrow = c(3, 1), mar = c(2, 2, 1, 1), oma = c(0, 0, 3, 0))
plot(hclust(nci.data.sd.dist, method = "complete"), labels = nci.labs,
     main = "Complete Linkage", xlab = "", ylab = "", sub = "")
plot(hclust(nci.data.sd.dist, method = "average"), labels = nci.labs,
     main = "Average Linkage", xlab = "", ylab = "", sub = "")
plot(hclust(nci.data.sd.dist, method = "single"), labels = nci.labs,
     main = "Single Linkage", xlab = "", ylab = "", sub = "")
par(opar)
The results are shown in Figure 6 below. It is evident that the choice of linkage greatly
affects the results obtained. Typically, single linkage tends to yield trailing clusters:
very large clusters onto which individual observations attach one by one. On the other
hand, complete and average linkage tend to yield more balanced, attractive clusters. For
these reasons, complete and average linkage are generally preferred to single linkage.
Clearly, cell lines within a single cancer type do tend to cluster together, although the
clustering is not perfect. In the analysis below we use only complete linkage
hierarchical clustering.
More specifically, we build the hierarchical clustering again and pass the result to a
local variable. The dissimilarity measure in use is the Euclidean distance, whereas the
maximal inter-cluster dissimilarity is taken into account (complete linkage).
hclust.out <- hclust(nci.data.sd.dist, method = "complete")
and cut the resulting tree at a height that returns a particular number of clusters,
let's say four.
hclust.clusters <- cutree(hclust.out, 4)
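Note that cutree() accepts either a number of clusters k or a cutting height h. A minimal sketch on the built-in USArrests data (used here only so the example runs without the ISLR package) shows that any height between the fourth- and third-largest merge heights yields the same four clusters:

```r
hc <- hclust(dist(scale(USArrests)), method = "complete")

# Cut by number of clusters ...
clusters.k <- cutree(hc, k = 4)

# ... or by height: any h between the 4th- and 3rd-largest merge heights gives 4 clusters
h4 <- mean(rev(hc$height)[3:4])
clusters.h <- cutree(hc, h = h4)

all(clusters.k == clusters.h)  # TRUE
```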
Figure 6: The NCI60{ISLR} cancer cell line micro-array data, clustered with average,
complete, and single linkage, and using Euclidean distance as the dissimilarity measure.
Complete and average linkage tend to yield evenly sized clusters whereas single linkage
tends to yield extended clusters to which single leaves are fused one by one.
table(hclust.clusters, nci.labs)
## nci.labs
## hclust.clusters BREAST CNS COLON K562A-repro K562B-repro
## 1 2 3 2 0 0
## 2 3 2 0 0 0
## 3 0 0 0 1 1
## 4 2 0 5 0 0
## nci.labs
## hclust.clusters LEUKEMIA MCF7A-repro MCF7D-repro MELANOMA
## 1 0 0 0 8
## 2 0 0 0 0
## 3 6 0 0 0
## 4 0 1 1 0
## nci.labs
## hclust.clusters NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 8 6 2 8 1
## 2 1 0 0 1 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
There are some clear patterns. All the LEUKEMIA cell lines fall in cluster 3, while the
BREAST cell lines are spread out over three different clusters: 1, 2 and 4. We can also
plot the dendrogram produced above and draw a horizontal line at the height at which
these four clusters are produced.
par(mfrow = c(1, 1))
plot(hclust.out, labels = nci.labs, hang = 0.1, main = "Complete Linkage",
     xlab = "", ylab = "", sub = "")
abline(h = 139, col = "red", lty = "dashed")
Figure 7: The NCI60{ISLR} cancer cell line micro-array data, clustered with complete
linkage, and Euclidean distance as the dissimilarity measure. A horizontal dashed red line
has been added at the height (h = 139) where four distinct clusters emerge.
It is important to note that K-means clustering and hierarchical clustering with the
dendrogram cut to obtain the same number of clusters can yield very different results. Indeed,
let us now perform a K-means clustering using the NCI60 micro-array data set and com-
pare its results with the one obtained before.
set.seed(343)
km.out <- kmeans(nci.data.sd, 4, nstart = 20)
km.clusters <- km.out$cluster
table(km.clusters, hclust.clusters)
## hclust.clusters
## km.clusters 1 2 3 4
## 1 0 0 8 0
## 2 9 0 0 0
## 3 11 0 0 9
## 4 20 7 0 0
Obviously, the four clusters obtained using hierarchical clustering and K-means
clustering differ. Cluster 1 in the K-means clustering is identical to cluster 3 in the
hierarchical clustering. However, the other clusters differ: for instance, cluster 4 in the
K-means clustering contains a portion of the observations assigned to cluster 1 by
hierarchical clustering, as well as all of the observations assigned to cluster 2 by
hierarchical clustering.
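The same kind of disagreement appears on other data sets too. A minimal sketch on the built-in USArrests data (used here only so the example runs without the ISLR package) cross-tabulates the two labelings:

```r
set.seed(1)
usa.sd <- scale(USArrests)

# Four clusters from complete-linkage hierarchical clustering ...
hc.clusters <- cutree(hclust(dist(usa.sd), method = "complete"), k = 4)

# ... and four clusters from K-means with multiple random starts
km.clusters <- kmeans(usa.sd, centers = 4, nstart = 20)$cluster

# Off-diagonal mass in the cross-tabulation shows the two partitions differ
table(km.clusters, hc.clusters)
```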