Cluster Analysis: NBA Sports Data
Kathlene Ngo and Gareth Williams
Presented on May 21, 2019
Abstract
Clustering is one of the main tasks in unsupervised machine learning: methods whose algorithms infer structure from a dataset without reference to, or prior knowledge of, labeled outcomes. Unlike their supervised counterparts, unsupervised machine learning methods cannot be directly applied to regression or classification problems, because the output values are unknown and the algorithms therefore cannot be trained on them. Instead, unsupervised learning uses techniques to learn the structure of the dataset. In this paper, we discuss two traditional methods of clustering: K-Means and Spectral.
Contents
1 Introduction to Machine Learning
2 What is Clustering?
2.1 Definition
2.2 Theory of Computation: NP-Hard
2.3 Our NBA Player Dataset: What does our data look like?
3 K-Means Clustering
3.1 History
3.2 What is K-means
3.3 What if our data is more than 3 dimensions?
3.4 Guessing Your Centroids
4 Graphing our Data onto 2-Dimensions
4.1 Initial Dataset
4.2 Method 1: Linear Discriminant Analysis
4.2.1 Analysis of the LDA Graph using K-Means Clustering
4.3 Method 2: Principal Component Analysis
4.3.1 Analysis of the PCA Graph using K-Means Clustering
5 Spectral Clustering
5.1 History
5.2 What is Spectral Clustering
5.3 Steps for Building a Spectral Clustering Algorithm
Step 1: Similarity Graph
Step 2: Project Data onto Low-Dimensional Space
Step 3: Create clusters
5.4 Analysis of the PCA Graph using Spectral Clustering
6 Advantages/Disadvantages of K-Means vs. Spectral clustering
7 K-Means Algorithm
References
1 Introduction to Machine Learning
Supervised machine learning uses prior knowledge of output values for the sample data. Popular examples of supervised learning are classification and regression, and common algorithms include support vector machines, artificial neural networks, and linear regression.
Unsupervised machine learning, on the other hand, does not have labeled outcomes; hence, its
goal is to infer the natural structure present within a dataset. The most common task within unsupervised
learning is data clustering, finding a coherent structure within our data without using any explicit labels.
We will be demonstrating two algorithms of data clustering in our NBA problem: K-Means clustering
and Spectral clustering.
2 What is Clustering?
2.1 Definition
Suppose we are given a large amount of arbitrary data. As future data analysts, we want to extract information from this big dataset. How exactly do we organize and categorize such information? Clustering, a machine learning technique for grouping data points, is one that many modern data scientists familiarize themselves with. Given a set of data points from a dataset, we can use clustering algorithms to "classify" each data point into a specific group. In theory, data points in the same group should be similar to one another, while data points in different groups should be dissimilar. Similarity can be thought of as sharing the same properties or features, depending on the context of the data and what we are specifically studying.
There are two different criteria for data clustering:
• Compactness, e.g., k-means, mixture models
• Connectivity, e.g., spectral clustering
Definition 1.1.1: Compactness generalizes the notion of a subset of Euclidean space being closed and bounded; in other words, points that lie close to each other fall in the same cluster and are "compact" around the cluster's center.
Definition 1.1.2: Connectivity is one of the basic concepts of graph theory related to network flow: it asks for the minimum number of elements (edges or nodes) that must be removed or cut to separate the remaining nodes into isolated subgraphs. In other words, points that are connected by edges or adjacent to each other are put in the same cluster.
Definition 1.1.3: Mixture models are probabilistic models for representing the sub-populations within an overall population, without requiring that the dataset identify which sub-population each individual data point belongs to.
Figure 1: Different types of clustering normally specialize in one of the two criteria.
The goal is to find homogeneous groups of data points based on the degree of similarity and dissimilarity of their attributes/features. Most existing clustering methods are specialized to a single criterion, which makes them unsuitable for datasets with mixed characteristics; this is why modern data scientists are researching multiple-objective clustering algorithms.
However, we will be using two traditional clustering methods that each focus only on a single criterion to study a given basketball player dataset: K-Means clustering and Spectral clustering.
2.2 Theory of Computation: NP-Hard
Data clustering can be viewed as a relaxation of an NP-Hard graph partitioning problem: how do we partition a graph into two clusters? We use the min-cut, partitioning the vertices into two sets A and B such that the total weight of the edges connecting vertices in A to vertices in B is minimized:

cut(A, B) = Σ_{i∈A, j∈B} w_ij  (1)
Note: This problem is easy to solve, but the result is often not a good partition, since min-cut tends to isolate vertices. Thus, unwanted cuts with weights smaller than the "ideal cut" will occur.
Figure 2: Less than Ideal Cut
Therefore, we want to normalize the cut so that A and B are similar in size:

Ncut(A, B) = cut(A, B) · (1/vol(A) + 1/vol(B))  (2)

vol(A) = Σ_{i∈A} d_i  (3)

d_i = Σ_{j=1}^{n} w_ij  (4)

Here |A| is the number of vertices of A, and vol(A) measures the size of A by summing the weights of all edges attached to vertices in A.
This is NP-Hard, or in other words, computationally difficult to solve exactly. We must use heuristic algorithms (i.e., clustering) to converge quickly to a local optimum.
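To make equations (1) through (4) concrete, here is a small Python sketch (a made-up 4-node weighted graph, not our NBA data) that computes cut(A, B) and Ncut(A, B) straight from the definitions:

```python
# Toy illustration of cut(A, B) and Ncut(A, B); the weight matrix is made up.
W = [
    [0.0, 2.0, 0.1, 0.0],  # symmetric weights, w_ij = weight of edge (i, j)
    [2.0, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
]

def cut(A, B, W):
    # Equation (1): total weight of edges crossing from A to B.
    return sum(W[i][j] for i in A for j in B)

def vol(A, W):
    # Equations (3)-(4): sum of the degrees d_i over all vertices in A.
    return sum(sum(W[i]) for i in A)

A, B = {0, 1}, {2, 3}
c = cut(A, B, W)                                  # only the two 0.1 edges cross
ncut = c * (1.0 / vol(A, W) + 1.0 / vol(B, W))    # equation (2)
print(c, ncut)
```

The weak crossing edges (weight 0.1) give a small Ncut, which is exactly what the normalized cut criterion rewards.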
2.3 Our NBA Player Dataset: What does our data look like?
Figure 3: Golden State Warriors and Toronto Raptors.
We have a 145 x 27 matrix containing all the players' names and attributes such as Age, Games played, Minutes played, Free Throws, Free Throw Percentage, etc. Figure 3 above is a snippet of the data we'll be clustering! (See the References for a link to the complete dataset.)
3 K-Means Clustering
3.1 History
K-Means was first proposed by Hugo Steinhaus in 1956 and was not really utilized until 1957, when Stuart Lloyd used it as a technique for pulse-code modulation. The original algorithm then branched off into different variants like k-means++, k-medians, k-medoids, Gaussian mixture models, HG-means, etc.
The problem that we will present and solve with the K-Means algorithm would feel at home among mathematicians with a passion for sports gambling, or in the plot of the movie Moneyball.
3.2 What is K-means
K-Means is a simple and easy way to partition a given data set into a certain number k of clusters or groups. It is a method of vector quantization that is highly popular for cluster analysis in data mining. The criterion that K-Means focuses on is Compactness (see Definition 1.1.1 in Section 2.1). First, we need to understand some of the fundamentals.
The main function of K-Means is to define k centroids, with each cluster containing a single centroid at its center. Using an iterative process, new centroids are computed and the points reassigned repeatedly until we have found stabilized centers. In some cases, the centroids never stop changing, and we must impose a maximum number of iterations to halt the algorithm within a reasonable margin of error. K-Means is also a method for finding local optima of an objective function:
min Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i^{(j)} − m_j||²  (5)

Our goal is to minimize this error using the Euclidean distance function, which is a defining feature of the K-Means algorithm. Here x_i^{(j)} is our data, where i indexes the data point and j the centroid it is being compared to; m_j denotes the j-th centroid.
Note: Although it can be proven theoretically that K-Means always terminates once the centroids stop moving, in reality the program does not always stop running. This may be due to computer limitations when dealing with very large numbers or extremely resource-intensive data (beyond the 16th significant digit, floating-point data becomes more and more inaccurate).
Note: Initial conditions are very important. Basically, we don't want to be naive and place centroids on top of one another or in questionable locations. Two good techniques for placing the initial centroids are: (1) make your initial guesses near the origin, or (2) place them slightly outside the data.
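To make the objective in equation (5) concrete, here is a tiny Python sketch on made-up 1-D data (not the NBA set), evaluating the sum of squared distances for a fixed assignment:

```python
# Made-up 1-D data: two obvious groups and a guessed centroid for each.
points = [1.0, 1.2, 5.0, 5.4]
centroids = [1.1, 5.2]
assign = [0, 0, 1, 1]          # point i is assigned to centroid assign[i]

# Equation (5): sum over points of the squared distance to their centroid.
objective = sum((x - centroids[j]) ** 2 for x, j in zip(points, assign))
print(objective)
```

K-Means searches for the centroid locations (and assignments) that minimize this quantity.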
3.3 What if our data is more than 3 dimensions?
Then we would have to use a dimension-reduction method to bring our data down to our ideal number of dimensions. PCA and LDA are examples of good methods to use. This will be discussed in detail in a later section.
3.4 Guessing Your Centroids
If we are lucky and the data is separable, we can ideally draw a hyperplane/line where all the similar
group data is on its own side of the hyperplane/line. This is very convenient visually because we can
place our centroids very close to the actual center of the data.
However, if our data is not obviously separable, then we have to make an educated guess about where the centroids are and how many centroids actually exist. This is one of the most difficult parts of K-Means clustering. Why is picking the number of centroids difficult? Suppose I gave you our NBA data and asked, "Who would be the next best pick if Kevin Durant got injured or left the Golden State Warriors?" If you had no knowledge of the NBA or sports in general, like myself, this would be very difficult to answer.
Thus, having knowledge about your data, the who, what, and where of it, is extremely useful.
We will first try to use K-Means clustering to analyze the current Top 7 NBA teams and their players
to categorize players and determine the best athletes.
4 Graphing our Data onto 2-Dimensions
4.1 Initial Dataset
Going back to our dataset (See Figure 3 in Section 2.3), recall that we have a 145 x 27 matrix. It
contains all players’ names and attributes such as: Age, Games played, Minutes played, Free Throws,
Free Throw Percentage, etc.
This is 27-dimensional data. In order to visualize our information, we need to reduce these 27 dimensions down to 2.
1. For our purposes of K-Means clustering, we will use Linear Discriminant Analysis (LDA) to reduce dimensions and graph. This method helps categorize players according to their positions/skills.
2. For the purposes of Spectral Clustering, we will use Principal Component Analysis (PCA) to reduce dimensions and graph. This method will separate the average players from the extremely skilled players.
Using both of these methods will help give us a mathematical decision on who the best pick players are
without knowing anything prior about the sport.
4.2 Method 1: Linear Discriminant Analysis
Linear Discriminant Analysis is a method that uses a linear combination of features (attributes, qualities, etc.) to separate the data into classes, groups, and/or events. It is used in statistics, pattern recognition, and machine learning. To avoid getting side-tracked from our problem, the most important things to know about Linear Discriminant Analysis are that it is a dimension-reduction method and that it is similar to Principal Component Analysis. However, Linear Discriminant Analysis attempts to find the differences between classes, while Principal Component Analysis does not consider those differences. Consider the Figure 4: LDA Plot Method below.
Figure 4: Players plotted using Linear Discriminant Analysis method.
Using the Linear Discriminant Analysis method on our 27-dimensional data (via a built-in MATLAB function), our data is now in 2 dimensions. What do we do from here? We plug our 2-dimensional data into K-Means to obtain the number of clusters and their locations. The detailed K-Means algorithm that we used is given in Section 7, K-Means Algorithm.
Figure 5: K-Means Results (K-Means applied to Figure 4)
Figure 4 is the resulting graph after using LDA to reduce the 27-dimensional data to 2 dimensions. Figure 5 above shows the result of applying the K-Means algorithm to that data: it shows where the centers of the clusters are located, and the circled areas are approximately where each player falls within each cluster.
How is this Useful?
Suppose your favorite team gets a new rookie player and the only information that you can find on
Google is his (Age, Games played, Minutes played, Free Throws, Free Throw Percentage, ...etc). We are
able to go through this process again, and the rookie will land in one of the clusters in Figure 5.
What does this tell us? It shows: (1) which players the rookie is similar to, (2) statistics on the shared features within the cluster he lands in, and (3) whether he will be an outstanding player. The closer he is to the players within his cluster, the more similar they are; conversely, the farther apart two players are, the more dissimilar they are. These features could be skillset, position, ranking, etc.
4.2.1 Analysis of the LDA Graph using K-Means Clustering
With the data collected, refined, graphed, and clustered, it can now serve as potential training data for numerous types of prediction algorithms (both unsupervised and supervised).
Suppose some friends start a competitive fantasy draft: Everyone picks a player to form any team.
However, they can only play the positions they realistically play, and no repeated picks can occur.
Suppose all your friends are huge sports fans, so they start picking their favorite players and who they
think are the best athletes. Since we already have our training data, we could build an algorithm that
can predict who the best player will be for each position and then print out all of next season's best picks. Hence, this is how NBA gambling can become very competitive.
From Figure 6, we can infer that the orange cluster contains Forward-position players. Kevin Durant is known to be an excellent Small Forward and Power Forward, and his data point appears on the edge of his cluster. Note: Outliers in each cluster represent the stronger players for that basketball position, because their 27-dimensional statistics far exceed those of the other players within their respective clusters. This observation will also be confirmed with spectral clustering!
Figure 6: Zoomed-In Cluster
4.3 Method 2: Principal Component Analysis
Principal Component Analysis (PCA) is a dimension-reduction method used to project high-dimensional data into 2 or 3 dimensions.
The Principal Component Analysis method is important because the K-Means algorithm is most useful when you can visualize the data clustering; anything above 3 dimensions cannot be visualized directly. The Principal Component Analysis method can reduce high-dimensional data to fewer dimensions while still retaining most of the information in the large dataset.
Principal Component Analysis method is basically a procedure to transform a number of correlated
variables into a smaller number of uncorrelated variables called Principal Components. This method
is very similar to the Singular Value Decomposition Method (SVD) in that the very first Singular Value
of the matrix carries the most weight compared to all the rest.
σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0 (6)
The same is also true for the principal components, in that the first component will carry the most
information from the dataset.
Traditionally, the Principal Component Analysis method is performed on a square symmetric matrix: a sum-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix can be used, and the difference between these choices is merely a scale factor. However, for the sake of simplicity, we will use the Singular Value Decomposition to obtain our Principal Components.
Given the data X, SVD(X) = UΣV^T, where Σ is our singular value matrix (a diagonal matrix) and U, V are orthogonal matrices; thus,

X^T X = (UΣV^T)^T (UΣV^T)  (7)
= VΣ^T U^T UΣV^T  (8)
= VΣ^T ΣV^T  (9)
= VΣ²V^T  (10)
= VΛV^T  (11)

where Λ is our square diagonal matrix holding all our eigenvalues in order of importance:

λ1 ≥ λ2 ≥ ... ≥ λn  (12)

λi ≥ 0 for all i  (13)
Since Λ is diagonal, this gives us the eigen-decomposition of the data's covariance structure. This is one way of obtaining Principal Components, and thankfully the Principal Component Analysis method comes pre-built in most computing languages.
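The derivation above suggests how PCA can be computed via the SVD in practice. Below is a minimal Python/NumPy sketch (the paper itself uses MATLAB's built-ins; the 5 x 3 data matrix here is made up):

```python
# PCA via SVD in NumPy; the 5x3 data matrix is made up for illustration.
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

Xc = X - X.mean(axis=0)                  # center each attribute (column)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values arrive ordered s1 >= s2 >= ... >= 0 (equation (6)), so the
# first principal component carries the most information.
Z = Xc @ Vt.T[:, :2]                     # project onto the first 2 components
print(S)
print(Z.shape)
```

Per equations (7)-(11), the eigenvalues of Xc^T Xc are exactly the squared singular values of the centered data.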
4.3.1 Analysis of the PCA Graph using K-Means Clustering
Running the PCA method on all 7 teams yields a very interesting graph. How many clusters can you spot? One? Two? Seventy? (See Figure 7 below.)
Figure 7: PCA Method for all 7 teams
Evidently, one cluster doesn't tell us anything. So, how about two? Let the number of centroids be k = 2. Clearly, applying K-Means to this graph is not very useful.
At first glance, we can already see that the data is not separable with a hyperplane. This is one example of the limitations of K-Means, because the algorithm focuses on Compactness. In the next section, we will use Spectral Clustering, which clusters by the criterion Connectivity (see Definition 1.1.2 in Section 2.1), to analyze the PCA graph.
Figure 8: K-means failing on PCA Graph
5 Spectral Clustering
5.1 History
Spectral Clustering has a long history and was mainly popularized by Jianbo Shi and Jitendra Malik in "Normalized Cuts and Image Segmentation," and by Ng, Jordan, and Weiss in "On Spectral Clustering: Analysis and an Algorithm" (Advances in Neural Information Processing Systems). Since Spectral Clustering encompasses a wide variety of algorithms, it is difficult to pinpoint a single origin/creator of this clustering method.
5.2 What is Spectral Clustering
Spectral Clustering classifies a dataset by treating the data points as the nodes of a graph. It is therefore closely related to the graph partitioning problem (see Section 2.2: NP-Hard). The nodes are mapped into low dimensions so that they can be easily segregated to form clusters. The criterion that Spectral Clustering focuses on is Connectivity (see Definition 1.1.2 in Section 2.1).
5.3 Steps for Building a Spectral Clustering Algorithm:
1. Create a similarity graph.
2. Project our given data onto a low-dimensional space.
3. Create the clusters.
Step 1: Similarity Graph
The main function of Spectral Clustering is to compute a similarity graph. We first create an undirected graph G = (V, E) with vertex set V = {v1, v2, ..., vn} and represent it with an adjacency matrix whose entries are the similarities between each pair of vertices/nodes. There are 3 common ways to build the similarity graph:
1. The ε-neighborhood graph: Connect all points whose pairwise distances are smaller than ε. Since the connected distances are all on the same scale (at most ε), the edges are left unweighted; thus, this is an undirected, unweighted graph.
2. K-Nearest Neighbors: Attach an edge from each node to its k nearest neighbors in the space (the result is not overly sensitive to the choice of k). After connecting the appropriate vertices/nodes, the edges are weighted by the similarity of the adjacent points.
3. Fully connected graph: Connect all points with each other and weight all edges by similarity w_ij. This graph models the local neighborhood relationships, so similarity functions such as the Gaussian similarity function can be used:

w(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))  (14)

where σ is the width of the neighborhoods, a free tuning parameter to adjust similarity.
Note: There are other similarity functions such as the polynomial kernel, dynamic similarity kernel,
and the inverse multi-quadric kernel.
Figure 9:
Thus, we are able to create an adjacency matrix A for any of our chosen similarity graphs. Note: the entries of A will be 0 or 1 if the graph is unweighted.
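Step 1 can be sketched in Python/NumPy (this is an assumed sketch, not the authors' code): a fully connected Gaussian-similarity matrix per equation (14), plus an unweighted ε-neighborhood variant, on four made-up 2-D points:

```python
# Building similarity graphs for four made-up 2-D points.
import numpy as np

pts = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
sigma, eps = 1.0, 0.5

# Pairwise squared Euclidean distances.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)

# Fully connected graph with Gaussian weights, equation (14).
W = np.exp(-d2 / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)                 # no self-loops

# Epsilon-neighborhood variant: unweighted 0/1 adjacency matrix.
A_eps = ((np.sqrt(d2) < eps) & (d2 > 0)).astype(int)
print(A_eps)
```

With these points, only the two nearby pairs fall within ε of each other, so A_eps splits the graph into two connected pairs.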
Step 2: Project Data onto Low-Dimensional Space
As noted in Figure 10, some data points in the same circular cluster have a greater Euclidean distance
between them compared to points in different clusters.
Figure 10: Circles Data
Hence, we want to project any observations made into low-dimensional space. We can do this by
computing the Graph Laplacian.
L = D − A, (15)
where A is the adjacency matrix and D is the degree matrix with diagonal entries:

d_i = Σ_{j : (i,j)∈E} w_ij  (16)

Thus, we get the following Laplacian matrix:

L_ij = d_i if i = j;  −w_ij if (i, j) ∈ E;  0 if (i, j) ∉ E  (17)
From the Graph Laplacian, we are able to find the eigenvalues and eigenvectors to embed data points
into low-dimensional space.
Using linear algebra, we get the eigenvalue equation:

Lv = λv  (18)

where v is an eigenvector of L corresponding to eigenvalue λ.
Thus, we get eigenvalues {λ1,λ2,λ3,...,λn} where 0 = λ1≤λ2≤...≤λn and eigenvectors {v1,v2,...,vn}.
1. If L has eigenvalue 0 with multiplicity k, then the undirected graph G = (V, E) has k connected components (subgraphs).
2. If G = (V, E) is connected, then λ1 = 0 and λ2 > 0, and λ2 is the algebraic connectivity of G. This second-smallest eigenvalue is called the Fiedler Value; it approximates the minimum graph cut needed to separate the graph into 2 connected components.
3. The ordered list 0 = λ1 ≤ λ2 ≤ ... ≤ λn is called the Spectrum of the Laplacian, which tells us a great deal about the graph: the sparsity of a graph cut, the number of connected components, and whether the graph is bipartite.
Now, we’re going to use those eigenvectors and eigenvalues to cluster the data.
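As a quick NumPy check of point 1 above (a made-up 4-node graph, not the NBA data): a graph with two connected components has Laplacian eigenvalue 0 with multiplicity 2.

```python
import numpy as np

# Unweighted graph with two components: edges (0,1) and (2,3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                  # equation (15)

evals = np.linalg.eigvalsh(L)              # ascending: 0 = l1 <= l2 <= ...
n_components = int(np.sum(np.isclose(evals, 0.0)))
print(evals, n_components)
```

The multiplicity of eigenvalue 0 recovers the number of connected components without ever traversing the graph.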
Step 3: Create clusters
Creating 2 Clusters: For this tricky step, we assign a value to each vertex. Take the second-smallest eigenvalue λ2 and its corresponding eigenvector v2 (the Fiedler vector), and assign to each vertex the corresponding entry of that eigenvector. Then we cluster the vertices based on whether their value is > 0 or ≤ 0. Hence, each element of the Fiedler vector tells us which cluster its vertex belongs to.
Example: Suppose v2 = [x1, x2, x3, x4, x5, x6]. We assign x1 to the first vertex, x2 to the second, and so forth. Assume that x1, x2, x3 > 0 and x4, x5, x6 ≤ 0. Then we split the vertices so that all vertices with assigned value > 0 are in one cluster and all vertices with value ≤ 0 are in another. Therefore, we end up with vertices 1, 2, 3 in one cluster and vertices 4, 5, 6 in the second cluster.
Hint: Look for large gaps between consecutive eigenvalues to guess the number of clusters! If you guess that there are k clusters, the eigenvectors associated with the first k − 1 non-zero eigenvalues should give information on how to cut the data into k clusters.
This method of creating clusters works perfectly for 2 clusters. However, it becomes too difficult when there are many clusters (k ≫ 2).
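The 2-cluster sign split can be sketched in Python/NumPy on a made-up 6-node graph (two triangles joined by one weak edge); the sign of each Fiedler-vector entry assigns its vertex to a cluster:

```python
import numpy as np

# Two triangles (vertices 0-2 and 3-5) joined by a single weak edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0                # strong intra-triangle edges
A[2, 3] = A[3, 2] = 0.1                    # weak bridge between the triangles

L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)           # eigenvalues in ascending order
fiedler = evecs[:, 1]                      # eigenvector for lambda_2

labels = (fiedler > 0).astype(int)         # sign split: > 0 vs <= 0
print(labels)
```

The two triangles land in different clusters; which one gets label 1 depends on the (arbitrary) sign convention of the eigenvector.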
——————————————————————————————————————————
Creating k Clusters: For k clusters, we first normalize our Laplacian and then perform K-means to
group our data points into k clusters.
Normalizing the Laplacian matrix:

L_norm = D^{−1/2} L D^{−1/2}  (19)

This is the normalized Laplacian used by Ng, Jordan, and Weiss.
1. For k clusters, compute the first k eigenvectors of the normalized Laplacian.
2. Form a new matrix with those eigenvectors as its columns.
3. Every vertex is then represented by the corresponding row of this new matrix; these rows form the feature vectors of the vertices.
4. Use K-Means to cluster these feature vectors into our k clusters {C1, C2, ..., Ck}.
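The k-cluster recipe above can be sketched as follows (Python/NumPy; the toy graph and the naive deterministic k-means initialization are our own assumptions, not the authors' MATLAB code):

```python
import numpy as np

def spectral_clusters(A, k, iters=20):
    d = A.sum(axis=1)
    L = np.diag(d) - A
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = d_inv_sqrt @ L @ d_inv_sqrt       # equation (19)
    _, evecs = np.linalg.eigh(L_norm)          # ascending eigenvalues
    F = evecs[:, :k]                           # rows = vertex feature vectors
    # Mini k-means on the rows of F, with a naive deterministic
    # initialization (k-means++ would be more robust in practice).
    centers = F[np.linspace(0, len(F) - 1, k).astype(int)]
    for _ in range(iters):
        labels = ((F[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([F[labels == j].mean(axis=0) for j in range(k)])
    return labels

# A toy graph: two triangles (vertices 0-2 and 3-5) joined by one weak edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1

labels = spectral_clusters(A, k=2)
print(labels)
```

On this graph the two triangles come out as the two clusters, which is the connectivity behavior K-Means on the raw coordinates would miss.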
5.4 Analysis of the PCA Graph using Spectral Clustering
Compared with Figure 8 (K-Means failing on the PCA graph), we can clearly see that Spectral Clustering gives us much better information (see Figure 12)! The strongest players across all positions lie in the outer cluster, while the more average players are contained in the inner cluster. Players such as Kevin Durant, Stephen Curry, Damian Lillard, Paul George, Klay Thompson, J.J. Redick, CJ McCollum, Lou Williams, Russell Westbrook, and Steven Adams are found in the outer cluster.
Figure 11: Zoomed-in on PCA Graph
We can also apply Spectral clustering to a single NBA team; below, we chose the Golden State Warriors. However, we realized that observations made on a single team may not be as useful with Spectral clustering, because each player on a team plays a different position. (See Figure 13.)
Figure 12: Spectral Clustering on PCA Graph of the 7 Teams
Figure 13: Spectral Clustering on PCA Graph of the Golden State Warriors Team
6 Advantages/Disadvantages of K-Means vs. Spectral clustering
For our NBA dataset:
• Using the LDA Graph with K-Means clustering was conclusive.
• Using the PCA Graph with K-Means clustering was inconclusive.
• Using the PCA Graph with Spectral clustering was conclusive.
For K-Means clustering (compactness), the density of data points is the main factor driving the clustering. While it is useful for mixture models (see Definition 1.1.3 in Section 2.1), it is ineffective when applied to spiral or circular data, as it relies on the Euclidean distance between data points.
Spectral clustering (connectivity) does not rely on strong assumptions about the statistics of the clusters. Methods like K-Means assume that the points assigned to a cluster are spherical about the cluster's centroid, which is a very strong assumption. However, the disadvantage of Spectral clustering is that it is computationally expensive for large datasets, due to computing the Laplacian's eigenvalues and eigenvectors and then running K-Means on them. The K-Means step in the final stage of Spectral clustering also means the resulting clusters are not always the same: they depend on the initial choice of centroids.
7 K-Means Algorithm
This is the K-Means algorithm that we used in both K-Means clustering and Spectral clustering for
k clusters.
Given a data set of n points X = {X1, X2, ..., Xn}, suppose we know that k clusters will form (k < n). Let m = {m1, m2, ..., mk} be the initial guesses for the k cluster centers. Remember, our goal is to minimize ||X_i^{(j)} − m_j|| (the distance of each point to its cluster center) over all k clusters.
Pseudo code:
1. Input the data we are clustering and the number of clusters k.
2. Give an initial guess for the location of each cluster center.
3. Compute the Euclidean distance from each data point to each centroid.
4. Record/assign each data point to its closest centroid.
5. Sum the data points assigned to each cluster and divide by the number of points summed (these are the new centroid locations for the next iteration).
6. Repeat steps 3 through 5 until the centroids stop moving.
Figure 14 below shows the pseudo code in MATLAB form:
Figure 14: Homemade K-Means algorithm
Explanation of the code: The first i, j loop in the MATLAB code computes the Euclidean distance function; it records the distance of every point from each centroid.

Distance(i, j) = ||X_i − m_j||²  (20)
The second i, j loop in the MATLAB code compares the distance of each point to each centroid against the minimum of all the recorded distances, and labels each point with the centroid it is closest to. This lets us build an easy-to-reference list pointing to the data points we want.
The third i, j loop sums up all the data points that share the same centroid label from the previous loop, then divides each sum by the number of data points summed. Lastly, we take those results and set them as the new centroids for the next iteration. Rinse and repeat until the centroids stabilize or the maximum iteration is reached.
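Since the MATLAB code in Figure 14 is not reproduced here, the loops described above can be sketched in Python/NumPy (an assumed translation, not the authors' code; the 2-D data and initial centroids are made up):

```python
import numpy as np

def k_means(X, centroids, max_iter=100, tol=1e-9):
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Distance of every point to every centroid (the first loop).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Label each point with its closest centroid (the second loop).
        labels = dist.argmin(axis=1)
        # New centroid = mean of the points carrying its label (the third loop).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids, atol=tol):  # centroids stopped moving
            break
        centroids = new
    return labels, centroids

# Made-up 2-D data with an initial guess of two centroids.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centers = k_means(X, [[0.0, 0.5], [4.0, 4.0]])
print(labels, centers)
```

The empty-cluster guard (keeping the old centroid when no points are assigned to it) is one common way to keep the mean update well-defined.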
References
[1] D. Doty. Theory of Computation: ECS 120 Lecture Notes. 2019.
[2] G. Calafiore, L. El Ghaoui. Optimization Models. Cambridge University Press, 2014.
[3] J. De Loera. "Math for Data Analytics." Notes, Lecture from University of California, Davis, CA,
April 2019.
[4] L. Elden. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.
[5] U. von Luxburg. "A Tutorial on Spectral Clustering." Statistics and Computing, Springer, 2007.