Cluster Analysis: NBA Sports Data
Kathlene Ngo and Gareth Williams
Presented on May 21, 2019
Abstract
Clustering is one of the main tasks in unsupervised machine learning, the family of methods whose algorithms infer structure from a dataset without reference to, or prior knowledge of, labeled outcomes. Unlike
its supervised counterpart, unsupervised machine learning methods cannot be directly applied to
regression or classification problems. This is due to the lack of knowledge about the output values;
hence, it would be impossible to train the algorithms. Unsupervised learning uses techniques to
learn the structure of the dataset. In this paper, we will be discussing two traditional methods of
clustering: K-Means and Spectral.
Contents
1 Introduction to Machine Learning
2 What is Clustering?
2.1 Definition
2.2 Theory of Computation: NP-Hard
2.3 Our NBA Player Dataset: What does our data look like?
3 K-Means Clustering
3.1 History
3.2 What is K-means
3.3 What if our data has more than 3 dimensions?
3.4 Guessing Your Centroids
4 Graphing our Data onto 2-Dimensions
4.1 Initial Dataset
4.2 Method 1: Linear Discriminant Analysis
4.2.1 Analysis of the LDA Graph using K-Means Clustering
4.3 Method 2: Principal Component Analysis
4.3.1 Analysis of the PCA Graph using K-Means Clustering
5 Spectral Clustering
5.1 History
5.2 What is Spectral Clustering
5.3 Steps for Building a Spectral Clustering Algorithm
Step 1: Similarity Graph
Step 2: Project Data onto Low-Dimensional Space
Step 3: Create Clusters
5.4 Analysis of the PCA Graph using Spectral Clustering
6 Advantages/Disadvantages of K-Means vs. Spectral Clustering
7 K-Means Algorithm
References
1 Introduction to Machine Learning
Supervised machine learning uses prior knowledge of output values for the sample data. A popular
example of supervised learning is classification; common supervised algorithms include support vector
machines, artificial neural networks, and linear regression.
Unsupervised machine learning, on the other hand, does not have labeled outcomes; hence, its
goal is to infer the natural structure present within a dataset. The most common task within unsupervised
learning is data clustering, finding a coherent structure within our data without using any explicit labels.
We will be demonstrating two algorithms of data clustering in our NBA problem: K-Means clustering
and Spectral clustering.
2 What is Clustering?
2.1 Definition
Let us be given a large amount of arbitrary data. As future data analysts, we want to obtain information from this big dataset. How exactly do we organize and categorize such information? Clustering is a machine learning technique for grouping data points, and one that most modern data scientists are familiar with. Given a set of data points from a dataset, we can use clustering algorithms to "classify" each data point into a specific group. In theory, data points in the same group should be similar to one another, and data points in different groups should be dissimilar. Similarity can be thought of as having the same properties or features, depending on the context of the data and what we are specifically studying.
There are two different criteria for data clustering:
• Compactness, e.g., k-means, mixture models
• Connectivity, e.g., spectral clustering
Definition 1.1.1: Compactness is the property that generalizes the notion of a subset of Euclidean space being closed and bounded; in other words, points that lie close to each other fall in the same cluster and are "compact" around the cluster's center.
Definition 1.1.2: Connectivity is one of the basic concepts of graph theory related to network flow: it asks for the minimum number of elements (edges or nodes) that need to be removed or cut to separate the remaining nodes into isolated subgraphs. In other words, points that are connected by edges or next to each other are put in the same cluster.
Definition 1.1.3: Mixture models are probabilistic models for representing the sub-populations within an overall population, without requiring that the dataset identify to which sub-population each individual data point belongs.
Figure 1: Different types of clustering normally specialize in mainly one of the two criteria.
The goal is to find homogeneous groups of data points based on the degree of similarity and dis-
similarity of their attributes/features. Most clustering methods that exist are specialized to a single
criterion. Hence, such methods would be unsuitable for datasets with different characteristics. And
therefore, modern data scientists are researching multiple-objective clustering algorithms.
However, we will be using two traditional clustering methods that each focus only on a single crite-
rion to study a given basketball player dataset: K-Means clustering and Spectral clustering.
2.2 Theory of Computation: NP-Hard
Data clustering can be viewed as a relaxation of an NP-Hard graph-partitioning problem. How do we partition a graph into two clusters? We use a min-cut: partition the vertices into two sets, A and B, such that the total weight of the edges connecting vertices in A to vertices in B is minimized.
cut(A, B) = \sum_{i \in A,\, j \in B} w_{ij}    (1)
Note: The min-cut itself is rather easy to solve, but it does not give a good partition in general because it often isolates individual vertices. Thus, unwanted cuts with weights smaller than the "ideal cut" will occur.
Figure 2: Less than Ideal Cut
Therefore, we want to normalize the cut to make A and B similar in size.
Ncut(A, B) = cut(A, B) \left( \frac{1}{vol(A)} + \frac{1}{vol(B)} \right)    (2)

vol(A) = \sum_{i \in A} d_i    (3)

d_i = \sum_{j=1}^{n} w_{ij}    (4)
|A| = number of vertices of A;
vol(A) measures the size of A by summing the weights of all edges attached to vertices in A.
This is NP-Hard, or in other words, computationally difficult to solve exactly. We must use heuristic algorithms (i.e., clustering) to converge quickly to a local optimum.
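To make these definitions concrete, here is a minimal MATLAB sketch (our own illustration with a made-up adjacency matrix, not data from this report) that evaluates equations (1) through (4) for one candidate partition.

% Toy weighted adjacency matrix for 4 vertices (symmetric, zero diagonal).
W = [0 3 1 0;
     3 0 0 1;
     1 0 0 4;
     0 1 4 0];
A = [1 2];                          % candidate partition: vertices in A
B = setdiff(1:size(W,1), A);        % the remaining vertices form B
cutAB  = sum(sum(W(A, B)));         % equation (1): total edge weight crossing the cut
d      = sum(W, 2);                 % equation (4): vertex degrees
volA   = sum(d(A));                 % equation (3): volume of A
volB   = sum(d(B));
ncutAB = cutAB * (1/volA + 1/volB); % equation (2): normalized cut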
2.3 Our NBA Player Dataset: What does our data look like?
Figure 3: Golden State Warriors and Toronto Raptors.
We have a 145 x 27 matrix containing all the players’ names and certain attributes such as: Age,
Games played, Minutes played, Free Throws, Free Throw Percentage, etc. Figure 3 above is a snippet of the data we'll be clustering! (Check the References for a link to the complete dataset.)
3 K-Means Clustering
3.1 History
K-Means was first proposed by Hugo Steinhaus in 1956 and was not really utilized until 1957, when Stuart Lloyd used it as a technique for pulse-code modulation. The original algorithm then branched off into different variants such as k-means++, k-medians, k-medoids, Gaussian mixture models, HG-means, etc.
The kind of problem we present and solve with the K-Means algorithm, ranking athletes from their raw statistics, was shaped by mathematicians with a passion for sports gambling and is closely related to the plot of the movie Moneyball.
3.2 What is K-means
K-Means is a simple and easy way to classify a given data set into a certain number k of clusters or groups. It is a method of vector quantization that is highly popular for cluster analysis in data mining. The criterion that K-Means focuses on is Compactness (see Definition 1.1.1 in Section 2.1). First, we need to understand some of the fundamentals.
The main function of K-Means is to define k centroids, with each cluster containing a single centroid at its center. Using an iterative process, the centroids are recomputed and the points reassigned repeatedly until the centers stabilize. In some cases, the centroids never stop changing, and we must impose a maximum number of iterations to halt the algorithm within a reasonable margin of error. K-Means is thus a method for finding local optima of an objective function:
\min \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - m_j \|^2    (5)

Our goal is to minimize this error using the Euclidean distance function, which is the core of the K-Means algorithm. Here x_i^{(j)} is our data, where i indexes the data point and j indexes the centroid it is being compared to, and m_j denotes the jth centroid.
Note: Although it can be proven theoretically that K-Means always terminates once the centroids stop moving, in practice the program does not always stop running. This can be due to computer limitations when dealing with very large numbers or extremely resource-intensive data (beyond roughly the 16th significant digit, the values become increasingly inaccurate).
Note: Initial conditions are very important. Basically, we do not want to be naive and place centroids on top of one another or in questionable locations. Two good techniques for placing the initial centroids are: (1) give your initial guess near the origin, or (2) place them slightly outside the data.
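For reference, the iterative procedure described above is also available as a built-in function, kmeans, in MATLAB's Statistics and Machine Learning Toolbox. A minimal sketch of calling it on toy 2-dimensional data; the data and the choice k = 3 are purely illustrative:

rng(1);                                  % fix the random seed for repeatability
X = [randn(50, 2); randn(50, 2) + 5];    % toy data with two well-separated groups
k = 3;                                   % our guess for the number of clusters
[idx, C] = kmeans(X, k);                 % idx: cluster label per point, C: centroids
gscatter(X(:, 1), X(:, 2), idx);         % plot the points colored by cluster
hold on; plot(C(:, 1), C(:, 2), 'kx', 'MarkerSize', 12); hold off;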
3.3 What if our data has more than 3 dimensions?
Then we have to use a dimension-reduction method to bring our data down to our ideal number of dimensions. PCA and LDA are examples of good methods to use. This will be discussed in detail in a later section.
3.4 Guessing Your Centroids
If we are lucky and the data is separable, we can ideally draw a hyperplane/line where all the similar
group data is on its own side of the hyperplane/line. This is very convenient visually because we can
place our centroids very close to the actual center of the data.
However, if our data is not obviously separable, then we have to make an educated guess about where the centroids are and how many centroids actually exist. This is one of the most difficult parts
of K-Means clustering. Why is picking the number of centroids difficult? Suppose I gave you our NBA
data and told you “who would be the next best pick if Kevin Durant got injured or left the Golden State
Warriors Team?” If you had no knowledge about the NBA or Sports in general, like myself, then this
would be very difficult to answer.
Thus, having knowledge about your data is extremely useful: the who, what, and where behind the numbers.
We will first try to use K-Means clustering to analyze the current Top 7 NBA teams and their players
to categorize players and determine the best athletes.
4 Graphing our Data onto 2-Dimensions
4.1 Initial Dataset
Going back to our dataset (See Figure 3 in Section 2.3), recall that we have a 145 x 27 matrix. It
contains all players’ names and attributes such as: Age, Games played, Minutes played, Free Throws,
Free Throw Percentage, etc.
This is 27-dimensional data. In order to visualize our information, we need to turn these 27 dimen-
sions into 2 dimensions.
1. For K-Means clustering, we will use Linear Discriminant Analysis (LDA) to reduce the dimension and graph the data. This combination helps to categorize players according to their positions/skills.
2. For Spectral Clustering, we will use Principal Component Analysis (PCA) to reduce the dimension and graph the data. This combination will separate the average players from the extremely skilled players.
Using both of these methods will help us make a mathematically grounded decision about which players are the best picks, without knowing anything about the sport beforehand.
4.2 Method 1: Linear Discriminant Analysis
Linear Discriminant Analysis is method using a linear combination of features attributes, quali-
ties, etc. in order to separate the data into classes, groups, and/or events. It is used in statistics, pattern
recognition, and machine learning. To avoid getting side-tracked from our problem, the most important
thing to know about the Linear Discriminant Analysis is that it is a dimensional reduction method
and that it is similar to the Principal Component Analysis method. However, the Linear Dis-
criminant Analysis attempts to find the differences between classes and the Principle Component
Analysis doesn’t consider the difference. Consider the Figure 4: LDA Plot Method below.
Figure 4: Players plotted using Linear Discriminant Analysis method.
Using the Linear Discriminant Analysis method on our 27-dimensional data (via a built-in MATLAB function), our data is now in 2 dimensions. Now, what do we do from here? We plug our 2-dimensional data into K-Means to find how many clusters there are and where they are located. The detailed K-Means algorithm that we used is in Section 7, K-Means Algorithm.
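MATLAB's discriminant-analysis tools are classifier-oriented, so as an illustration of the reduction step itself, the sketch below performs a plain Fisher-LDA projection to 2 dimensions (our own illustration rather than the exact built-in call behind Figure 4). It assumes X is the 145 x 27 stats matrix and y is a vector of position labels that we supply ourselves.

function Z = lda_project(X, y)
    % Project X onto the two most discriminative directions given labels y.
    classes = unique(y);
    d  = size(X, 2);
    mu = mean(X, 1);
    Sw = zeros(d);                                   % within-class scatter
    Sb = zeros(d);                                   % between-class scatter
    for c = 1:numel(classes)
        Xc  = X(y == classes(c), :);
        muc = mean(Xc, 1);
        Sw  = Sw + (Xc - muc)' * (Xc - muc);
        Sb  = Sb + size(Xc, 1) * (muc - mu)' * (muc - mu);
    end
    [V, D] = eig(pinv(Sw) * Sb);                     % discriminant directions
    [~, order] = sort(real(diag(D)), 'descend');
    W = real(V(:, order(1:2)));                      % keep the top two directions
    Z = X * W;                                       % 145-by-2 projected data
end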
Figure 5: K-Means Results (K-Means applied to Figure 4)
Figure 4 above is the resulting graph after using LDA to reduce the 27-dimensional data to 2 dimensions. We next use the K-Means algorithm to cluster that data, as shown in Figure 5 above, which marks where the center of each cluster is located after applying K-Means. The circled areas show approximately where each player falls within each cluster.
How is this Useful?
Suppose your favorite team gets a new rookie player and the only information you can find on Google is his Age, Games played, Minutes played, Free Throws, Free Throw Percentage, etc. We can run this process again, and the rookie will land in one of the clusters in Figure 5.
What exactly does this tell us? It gives us the following information: (1) which players the rookie is similar to, (2) statistics on the shared features within the cluster he lands in, and (3) whether he will be an outstanding player. The closer he is to the players within his cluster, the more similar they are; conversely, the farther apart two players are, the more dissimilar they are. These features could be skillset, position, ranking, etc.
4.2.1 Analysis of the LDA Graph using K-Means Clustering
With the data collected, refined, graphed, and clustered, it can now serve as potential training data for numerous types of predictive algorithms (both unsupervised and supervised).
Suppose some friends start a competitive fantasy draft: Everyone picks a player to form any team.
However, they can only play the positions they realistically play, and no repeated picks can occur.
Suppose all your friends are huge sports fans, so they start picking their favorite players and who they
think are the best athletes. Since we already have our training data, we could build an algorithm that
can predict who the best player will be for each position and then print out all the next season’s best
picks. Hence, this is how NBA gambling can become very competitive.
From Figure 6, we can infer that the orange cluster involves Forward position players. Kevin
Durant is known to be an excellent Small Forward and Power Forward. His data point appears on the edge
of his cluster. Note: Outliers in each cluster represent the stronger players for that basketball position
because their 27-dimensional statistics are insane compared to other players within their respective
clusters. This observation will also be proven with spectral clustering!
Figure 6: Zoomed-In Cluster
4.3 Method 2: Principal Component Analysis
Principal Component Analysis (PCA) is a dimension-reduction method used to project high-dimensional data down to 2 or 3 dimensions.
The Principal Component Analysis method is important because the K-Means algorithm is most useful when you can visualize the data clustering; anything above 3 dimensions is not practical to visualize. The Principal Component Analysis method reduces high-dimensional data to a smaller number of dimensions while retaining as much of the information in the large dataset as possible.
Principal Component Analysis method is basically a procedure to transform a number of correlated
variables into a smaller number of uncorrelated variables called Principal Components. This method
is very similar to the Singular Value Decomposition Method (SVD) in that the very first Singular Value
of the matrix carries the most weight compared to all the rest.
σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0 (6)
The same is also true for the principal components, in that the first component will carry the most
information from the dataset.
Traditionally, the Principal Component Analysis method is performed on a square symmetric matrix. A sum-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix can be used; the difference between these choices is merely a scaling factor. For the sake of simplicity, we will use the Singular Value Decomposition to obtain our Principal Components.
Given the data X, SVD(X) = U \Sigma V^T, where \Sigma is the diagonal matrix of singular values and U, V are orthogonal matrices; thus,

X^T X = (U \Sigma V^T)^T (U \Sigma V^T)    (7)
      = V \Sigma^T U^T U \Sigma V^T    (8)
      = V \Sigma^T \Sigma V^T    (9)
      = V \Sigma^2 V^T    (10)
      = V \Lambda V^T    (11)

where \Lambda is a square diagonal matrix (the Lambda matrix) that holds all the eigenvalues of X^T X in order of importance:

\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_n    (12)
\lambda_n \geq 0 for all n \in Z^+    (13)
Since X^T X is, up to a scaling factor, the covariance matrix of the centered data, the columns of V give the principal directions. This is one way of obtaining the Principal Components, and thankfully the Principal Component Analysis method comes pre-built in most programming languages.
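The derivation above translates directly into a few lines of MATLAB. A minimal sketch (assuming X already holds our 145 x 27 numeric stats): center the data, take the SVD, and use the first two right singular vectors as the principal directions.

Xc = X - mean(X, 1);              % center each of the 27 columns
[U, S, V] = svd(Xc, 'econ');      % Xc = U*S*V', singular values in descending order
scores = Xc * V(:, 1:2);          % coordinates along the first two principal components
plot(scores(:, 1), scores(:, 2), 'o');
xlabel('PC 1'); ylabel('PC 2');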
4.3.1 Analysis of the PCA Graph using K-Means Clustering
Running the PCA method on all 7 teams yields a very interesting graph. How many clusters can you spot? One? Two? Seventy? (See Figure 7 below.)
Figure 7: PCA Method for all 7 teams
Evidently, one cluster doesn't tell us anything. So, how about two? Let the number of centroids be k = 2. Clearly, applying K-Means to the PCA graph is not very useful here.
At first glance, we can already tell that the data is not separable with a hyperplane. This is one example of the limitations of K-Means, because the algorithm focuses on Compactness.
In the next section, we will use Spectral Clustering to cluster by the criterion Connectivity (See
Definition 1.1.2 in Section 2.1) to analyze the PCA graph.
Figure 8: K-means failing on PCA Graph
5 Spectral Clustering
5.1 History
Spectral Clustering has a long history and was mainly popularized by Jianbo Shi and Jitendra Malik
in "Normalized Cuts and Image Segmentation," and in "On Spectral Clustering: Analysis and an Algo-
rithm" in Advances in Neural Information Processing Systems by Ng, Jordan, and Weiss. Since Spectral
Clustering has a greater variety of different algorithms, it is rather difficult to pinpoint the origin/creator
of this clustering method.
5.2 What is Spectral Clustering
Spectral Clustering classifies a dataset by treating the data points as the nodes of a graph. It is therefore essentially a graph-partitioning problem (see Section 2.2: NP-Hard). The nodes are mapped into a low-dimensional space where they can easily be segregated to form clusters. The criterion that Spectral Clustering focuses on is Connectivity (see Definition 1.1.2 in Section 2.1).
5.3 Steps for Building a Spectral Clustering Algorithm:
1. Create a similarity graph.
2. Project our given data onto a low-dimensional space.
3. Create the clusters.
Step 1: Similarity Graph
The main step of Spectral Clustering is to compute a similarity graph. We first create an undirected graph G = (V, E) with vertex set V = {v1, v2, ..., vn} and represent it with an adjacency matrix whose entries are the similarities between pairs of vertices/nodes. To compute the similarity function, there are 3 common ways:
1. The ε-neighborhood graph: Connect all points whose pairwise distances are smaller than ε. Since the retained distances are all on roughly the same scale (at most ε), the edges are left unweighted; this gives an undirected, unweighted graph.
2. K-Nearest Neighbors: Attach an edge from each node to its k nearest neighbors in the space (the results are usually not very sensitive to the choice of k). After connecting the appropriate vertices/nodes, the edges are weighted by the similarity of the adjacent points.
3. Fully connected graph: Connect all points with each other and weight every edge by the similarity wij. For this graph to model local neighborhood relationships, the similarity function itself must decay with distance; hence, similarity functions such as the Gaussian similarity function are used:
w(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)    (14)

where σ is the width of the neighborhoods, a free tuning parameter used to adjust the similarity.
Note: There are other similarity functions such as the polynomial kernel, dynamic similarity kernel,
and the inverse multi-quadric kernel.
Figure 9:
Thus, we are able to create an adjacency matrix A for whichever similarity graph we choose. Note: The entries of the adjacency matrix A will be 0 or 1 if the graph is unweighted.
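As an illustration of Step 1, the following sketch builds both the fully connected Gaussian-similarity matrix of equation (14) and an ε-neighborhood adjacency matrix in base MATLAB; the values of sigma and epsilon are purely illustrative tuning choices.

% X is n-by-d; D2(i,j) holds the squared Euclidean distance between rows i and j.
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');
D2 = max(D2, 0);                         % guard against tiny negative round-off
sigma = 1;
W = exp(-D2 / (2 * sigma^2));            % fully connected graph, Gaussian weights (14)
W(1:size(W, 1)+1:end) = 0;               % remove self-loops
epsilon = 1;
A = double(sqrt(D2) < epsilon);          % epsilon-neighborhood graph, unweighted
A(1:size(A, 1)+1:end) = 0;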
Step 2: Project Data onto Low-Dimensional Space
As noted in Figure 10, some data points in the same circular cluster have a greater Euclidean distance
between them compared to points in different clusters.
Figure 10: Circles Data
Hence, we want to project the observations into a low-dimensional space. We can do this by computing the Graph Laplacian:
L = D - A,    (15)

where A is the adjacency matrix and D is the diagonal degree matrix with entries

d_i = \sum_{j:(i,j) \in E} w_{ij}    (16)
Thus, the entries of the Laplacian matrix are:

L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } (i,j) \in E \\ 0 & \text{if } (i,j) \notin E \end{cases}    (17)
From the Graph Laplacian, we are able to find the eigenvalues and eigenvectors to embed data points
into low-dimensional space.
Using linear algebra, we solve the eigenvalue equation

Lv = \lambda v,    (18)

where v is an eigenvector of L corresponding to the eigenvalue λ.
Thus, we get eigenvalues {λ1,λ2,λ3,...,λn} where 0 = λ1≤λ2≤...≤λn and eigenvectors {v1,v2,...,vn}.
1. If L has the eigenvalue 0 with multiplicity k (that is, k linearly independent eigenvectors for eigenvalue 0), then the undirected graph G = (V, E) has k connected components (subgraphs).
2. If G = (V, E) is connected, then λ1 = 0 and λ2 > 0, and λ2 is the algebraic connectivity of G. This second-smallest eigenvalue is called the Fiedler Value; it approximates the minimum graph cut needed to separate the graph into 2 connected components.
3. The ordered list 0 = λ1 ≤ λ2 ≤ ... ≤ λn is called the Spectrum of the Laplacian, which tells us a great deal about the graph: the sparsity of a graph cut, the number of connected components, and whether the graph is bipartite.
Now, we’re going to use those eigenvectors and eigenvalues to cluster the data.
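Continuing the sketch from Step 1 (again our own illustration), the Laplacian of equations (15) through (17) and its spectrum can be computed as follows, where W is the weighted adjacency matrix built earlier.

d = sum(W, 2);                      % vertex degrees
L = diag(d) - W;                    % unnormalized Laplacian, L = D - A
[V, E] = eig(L);                    % L is symmetric, so the eigenvalues are real
[lambda, order] = sort(diag(E));    % 0 = lambda(1) <= lambda(2) <= ... <= lambda(n)
V = V(:, order);
fiedler = V(:, 2);                  % eigenvector for the second-smallest eigenvalue
gaps = diff(lambda);                % large gaps hint at the number of clusters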
Step 3: Create clusters
Creating 2 Clusters: For this tricky step, we are going to assign a value to each vertex. Using the second-smallest eigenvalue λ2 (the Fiedler value) and its corresponding eigenvector v2 (the Fiedler vector), we assign to each vertex the corresponding element of that eigenvector. Then, we cluster the vertices based on whether their value is > 0 or ≤ 0. Hence, each element of the Fiedler eigenvector tells us which cluster its vertex belongs to.
Example: Suppose v2 = [x1, x2, x3, x4, x5, x6]. We assign the first vertex to x1, the second vertex to x2, and so forth. Assume that x1, x2, x3 > 0 and x4, x5, x6 ≤ 0. Then we split the vertices so that all vertices with assigned value > 0 are in one cluster, and all vertices with value ≤ 0 are in another. We end up with two clusters: vertices 1, 2, 3 in one cluster and vertices 4, 5, 6 in the other.
Hint: Look for large gaps between consecutive eigenvalues to guess the number of clusters! If you guess that there are k clusters, the eigenvectors associated with the first k-1 non-zero eigenvalues should give information on how to cut the data into k clusters.
This method of creating clusters works well for 2 clusters; a short sketch is given below. However, it becomes unwieldy when there are many clusters (k >> 2).
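In code, the two-way split is a one-liner once the Fiedler vector from the sketch above is available:

clusterA = find(fiedler >  0);      % vertices with positive Fiedler entries
clusterB = find(fiedler <= 0);      % the remaining vertices form the second cluster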
——————————————————————————————————————————
Creating k Clusters: For k clusters, we first normalize our Laplacian and then perform K-means to
group our data points into k clusters.
Normalizing the Laplacian matrix:

L_{norm} = D^{-1/2} L D^{-1/2}    (19)
This normalized Laplacian is the form used by Ng, Jordan, and Weiss.
1. For k clusters, compute the first k eigenvectors of the Normalized Laplacian.
2. Stack the vectors side by side to form a new matrix with those vectors as its columns.
3. Every vertex is then represented by the corresponding row of this new matrix; these rows form the feature vectors of the vertices.
4. Use K-Means to cluster these feature vectors into the k clusters {C1, C2, ..., Ck}, as sketched below.
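Putting the k-cluster recipe together, here is a minimal sketch of the steps listed above (the row normalization in the middle is the extra step used by Ng, Jordan, and Weiss; kmeans is again the toolbox function, though the homemade algorithm in Section 7 would work as well). The choice k = 3 is illustrative.

k = 3;
d = sum(W, 2);
Dhalf = diag(1 ./ sqrt(d));                    % D^(-1/2)
Lnorm = eye(size(W, 1)) - Dhalf * W * Dhalf;   % D^(-1/2) L D^(-1/2), equation (19)
Lnorm = (Lnorm + Lnorm') / 2;                  % enforce exact symmetry numerically
[V, E] = eig(Lnorm);
[~, order] = sort(diag(E));
U = V(:, order(1:k));                          % first k eigenvectors as columns
U = U ./ sqrt(sum(U.^2, 2));                   % normalize each row (Ng-Jordan-Weiss step)
labels = kmeans(U, k);                         % cluster the rows into {C1, ..., Ck}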
5.4 Analysis of the PCA Graph using Spectral Clustering
Comparing with Figure 8 (K-Means failing on the PCA graph), we can clearly see that Spectral Clustering gives us much better information (see Figure 12)! The strongest players across all positions lie in the outer cluster, while the more average players are contained in the inner cluster. Players such as Kevin Durant, Stephen Curry, Damian Lillard, Paul George, Klay Thompson, J.J. Redick, CJ McCollum, Lou Williams, Russell Westbrook, Steven Adams, etc. are found in the outer cluster.
Figure 11: Zoomed-in on PCA Graph
We can also apply Spectral Clustering to a single NBA team: below, we chose the Golden State Warriors. However, we realized that observations made on a single team may not be as useful under Spectral Clustering, because each player on a team plays a different position. (See Figure 13.)
Figure 12: Spectral Clustering on PCA Graph of the 7 Teams
Figure 13: Spectral Clustering on PCA Graph of the Golden State Warriors Team
6 Advantages/Disadvantages of K-Means vs. Spectral clustering
For our NBA dataset:
• Using the LDA Graph with K-Means clustering was conclusive.
• Using the PCA Graph with K-Means clustering was inconclusive.
• Using the PCA Graph with Spectral Clustering was conclusive.
For K-Means clustering (compactness), the density of data points is the main factor pushing the
clustering of data. While it is useful for mixture models (See Definition 1.1.3 from Section 2.1), it is
ineffective when applied to spiral or circular data as it relies on the Euclidean distance between data
points.
Spectral clustering (connectivity) does not rely on strong assumptions about the statistics of the clusters. Methods like K-Means assume that the points assigned to a cluster are distributed spherically around the cluster's centroid, which is a very strong assumption. The disadvantage of Spectral clustering, however, is that it is computationally expensive for large datasets, mainly because of computing the Laplacian's eigenvalues and eigenvectors and then running K-Means on them. The K-Means step at the end of Spectral clustering also means that the resulting clusters are not always the same; they depend on the initial choice of centroids.
7 K-Means Algorithm
This is the K-Means algorithm that we used in both K-Means clustering and Spectral clustering for
k clusters.
Given a data set of n points X = {X_1, X_2, ..., X_n}, suppose we know k clusters will form (k < n). Let m = {m_1, m_2, ..., m_k} be the initial guesses for the cluster centers. Remember, our goal is to minimize \|X_i^{(j)} - m_j\|, the distance from each point to its assigned centroid, across all k clusters.
Pseudo code:
1. Input the data we are clustering and the chosen number of clusters k.
2. Give an initial guess for the location of each cluster center.
3. Run the Euclidean distance function between each data point and each centroid.
4. Record/assign each data point to its closest centroid.
5. Sum the data points assigned to each centroid and divide by the number of points summed (these are the new centroid locations for the next iteration).
6. Repeat steps 3, 4, and 5 until the centroids stop moving between consecutive iterations.
Figure 14 below shows the pseudo code in MATLAB form:
Figure 14: Homemade K-Means algorithm
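Since Figure 14 is an image, the sketch below is a reconstruction of the same three-loop structure described by the pseudo code (the variable names are illustrative); the explanation that follows walks through each loop.

function [m, label] = homemade_kmeans(X, m, maxIter)
    % X: n-by-d data matrix; m: k-by-d initial centroid guesses.
    n = size(X, 1);
    k = size(m, 1);
    for iter = 1:maxIter
        % First i,j loop: distance of every point from each centroid, equation (20).
        dist = zeros(n, k);
        for i = 1:n
            for j = 1:k
                dist(i, j) = norm(X(i, :) - m(j, :))^2;
            end
        end
        % Second i,j loop: label each point with its closest centroid.
        label = zeros(n, 1);
        for i = 1:n
            best = Inf;
            for j = 1:k
                if dist(i, j) < best
                    best = dist(i, j);
                    label(i) = j;
                end
            end
        end
        % Third loop: sum the points sharing each label and divide by their count.
        m_old = m;
        for j = 1:k
            members = X(label == j, :);
            if ~isempty(members)
                m(j, :) = sum(members, 1) / size(members, 1);
            end
        end
        if isequal(m, m_old)        % centroids stopped moving
            break
        end
    end
end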
Explanation of the code: The first i,j loop in the MATLAB code implements the Euclidean distance function; it records the distance of every point from each centroid:

Distance(i, j) = \|X_i - m_j\|^2    (20)
The second i,j loop in the MATLAB code compares the distance of each point to each centroid against the minimum of all the recorded distances, and then labels each point with the centroid it is closest to. This gives us an easy-to-reference list pointing to the data points we want.
The third i,j loop sums up all the data points that share the same centroid label assigned in the previous loop, then divides each sum by the number of data points summed. Lastly, those results become the new centroids for the next iteration. Rinse and repeat until the centroids stop moving or the maximum iteration count is reached.
References
[1] D. Doty. Theory of Computation: ECS 120 Lecture Notes. 2019.
[2] G. Calafiore, L. El Ghaoui. Optimization Models. Cambridge University Press, 2014.
[3] J. De Loera. "Math for Data Analytics." Notes, Lecture from University of California, Davis, CA,
April 2019.
[4] L. Elden. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.
[5] U. von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, Springer, 2007.

More Related Content

What's hot

Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsCSCJournals
 
On the High Dimentional Information Processing in Quaternionic Domain and its...
On the High Dimentional Information Processing in Quaternionic Domain and its...On the High Dimentional Information Processing in Quaternionic Domain and its...
On the High Dimentional Information Processing in Quaternionic Domain and its...IJAAS Team
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using ClusteringDessy Amirudin
 
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...IJECEIAES
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...IJERA Editor
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
 
Feed forward neural network for sine
Feed forward neural network for sineFeed forward neural network for sine
Feed forward neural network for sineijcsa
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Gesture recognition system
Gesture recognition systemGesture recognition system
Gesture recognition systemeSAT Journals
 
Application of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsApplication of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsEditor IJCATR
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...csandit
 
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...IRJET Journal
 
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...csandit
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 

What's hot (20)

Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster Results
 
On the High Dimentional Information Processing in Quaternionic Domain and its...
On the High Dimentional Information Processing in Quaternionic Domain and its...On the High Dimentional Information Processing in Quaternionic Domain and its...
On the High Dimentional Information Processing in Quaternionic Domain and its...
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using Clustering
 
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ...
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
report
reportreport
report
 
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
50120140505013
5012014050501350120140505013
50120140505013
 
Feed forward neural network for sine
Feed forward neural network for sineFeed forward neural network for sine
Feed forward neural network for sine
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Gesture recognition system
Gesture recognition systemGesture recognition system
Gesture recognition system
 
Application of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsApplication of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA Guards
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
 
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
IRJET-An Effective Strategy for Defense & Medical Pictures Security by Singul...
 
M2R Group 26
M2R Group 26M2R Group 26
M2R Group 26
 
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S...
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 

Similar to Mat189: Cluster Analysis with NBA Sports Data

CS583-unsupervised-learning.ppt learning
CS583-unsupervised-learning.ppt learningCS583-unsupervised-learning.ppt learning
CS583-unsupervised-learning.ppt learningssuserb02eff
 
CS583-unsupervised-learning.ppt
CS583-unsupervised-learning.pptCS583-unsupervised-learning.ppt
CS583-unsupervised-learning.pptHathiramN1
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptxGandhiMathy6
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_TrushitaTrushita Redij
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reductionShatakirti Er
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)nlt2390
 
IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...IRJET Journal
 
image_classification.pptx
image_classification.pptximage_classification.pptx
image_classification.pptxtayyaba977749
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKSara Parker
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
Nonlinear image processing using artificial neural
Nonlinear image processing using artificial neuralNonlinear image processing using artificial neural
Nonlinear image processing using artificial neuralHưng Đặng
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction modelsMuthu Kumaar Thangavelu
 

Similar to Mat189: Cluster Analysis with NBA Sports Data (20)

Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
 
CS583-unsupervised-learning.ppt learning
CS583-unsupervised-learning.ppt learningCS583-unsupervised-learning.ppt learning
CS583-unsupervised-learning.ppt learning
 
CS583-unsupervised-learning.ppt
CS583-unsupervised-learning.pptCS583-unsupervised-learning.ppt
CS583-unsupervised-learning.ppt
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Aina_final
Aina_finalAina_final
Aina_final
 
IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...IRJET-  	  Finding Dominant Color in the Artistic Painting using Data Mining ...
IRJET- Finding Dominant Color in the Artistic Painting using Data Mining ...
 
image_classification.pptx
image_classification.pptximage_classification.pptx
image_classification.pptx
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORK
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Data Science Machine
Data Science Machine Data Science Machine
Data Science Machine
 
Nonlinear image processing using artificial neural
Nonlinear image processing using artificial neuralNonlinear image processing using artificial neural
Nonlinear image processing using artificial neural
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 

Recently uploaded

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 

Recently uploaded (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 

Mat189: Cluster Analysis with NBA Sports Data

  • 1. Cluster Analysis: NBA Sports Data Kathlene Ngo and Gareth Williams Presented on May 21, 2019 Abstract Clustering is one of the main tasks in unsupervised machine learning, methods that use algo- rithms inferring from a dataset without reference or prior knowledge of labeled outcomes. Unlike its supervised counterpart, unsupervised machine learning methods cannot be directly applied to regression or classification problems. This is due to the lack of knowledge about the output values; hence, it would be impossible to train the algorithms. Unsupervised learning uses techniques to learn the structure of the dataset. In this paper, we will be discussing two traditional methods of clustering: K-Means and Spectral. Contents 1 Introduction to Machine Learning 2 2 What is Clustering? 2 2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Theory of Computation: NP-Hard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.3 Our NBA Player Dataset: What does our data look like? . . . . . . . . . . . . . . . . . . 3 3 K-Means Clustering 4 3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 What is K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3 What if our data is more than 3 dimensions? . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.4 Guessing Your Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Graphing our Data onto 2-Dimensions 5 4.1 Initial Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2 Method 1: Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2.1 Analysis of the LDA Graph using K-Means Clustering . . . . . . . . . . . . . . . . 6 4.3 Method 2: Principle Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.3.1 Analysis of the PCA Graph using K-Means Clustering . . . . . . . . . . . . . . . . 8 5 Spectral Clustering 9 5.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.2 What is Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.3 Steps for Building a Spectral Clustering Algorithm: . . . . . . . . . . . . . . . . . . . . . . 9 Step 1: Similarity Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Step 2: Project Data onto Low-Dimensional Space . . . . . . . . . . . . . . . . . . . . . . 10 Step 3: Create clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.4 Analysis of the PCA Graph using Spectral Clustering . . . . . . . . . . . . . . . . . . . . 11 6 Advantages/Disadvantages of K-Means vs. Spectral clustering 12 7 K-Means Algorithm 13 References 14 1
  • 2. 1 Introduction to Machine Learning Supervised machine learning uses prior knowledge of output values for the sample data. A popular example of supervised learning is classification and the more common algorithms include support vector machines, artificial neural networks, and linear regression. Unsupervised machine learning, on the other hand, does not have labeled outcomes; hence, its goal is to infer the natural structure present within a dataset. The most common task within unsupervised learning is data clustering, finding a coherent structure within our data without using any explicit labels. We will be demonstrating two algorithms of data clustering in our NBA problem: K-Means clustering and Spectral clustering. 2 What is Clustering? 2.1 Definition Let us be given a large amount of arbitrary data. As future data analysts, we want to obtain informa- tion from this big dataset. How exactly do we organize and categorize such information? Clustering is a Machine Learning technique that involves the grouping of data points that many modern data scientists familarize themselves with. Given a set of data points from a dataset, we can use clustering algorithms to "classify" each data point into specific groups. In theory, data points in the same group must be similar to one another, and thus, data points in different groups must be dissimilar to one another. Similarity can be thought of as having same properties or features depending on what context the data is in and what we are specifically studying. There are two different criteria for data clustering: • Compactness, e.g., k-means, mixture models • Connectivity, e.g., spectral clustering Definition 1.1.1: Compactness is the property that generalizes the notion of a subset of Euclidean space being closed and bounded, in order words—points that lie close to each other fall in the same cluster and are "compact" around the cluster’s center. Definition 1.1.2: Connectivity is one of the basic concepts of graph theory related to network flow: it asks for the minimum number of elements (edges or nodes) needed to be removed or cut to separate the remaining nodes into isolated subgraphs. Or, in order words—points that are connected by edges or next to each other are put in the same cluster. Definition 1.1.3: Mixture models are probabilistic models for representing the sub-populations within an overall population that do not require that the dataset must identity to which sub-population each individual data point belongs. Figure 1: Different types of clustering normally specializes in mainly one of the two criteria. The goal is to find homogeneous groups of data points based on the degree of similarity and dis- similarity of their attributes/features. Most clustering methods that exist are specialized to a single criterion. Hence, such methods would be unsuitable for datasets with different characteristics. And therefore, modern data scientists are researching multiple-objective clustering algorithms. However, we will be using two traditional clustering methods that each focus only on a single crite- rion to study a given basketball player dataset: K-Means clustering and Spectral clustering. 2
2.2 Theory of Computation: NP-Hard

Data clustering can be viewed as a relaxation of an NP-Hard problem: graph partitioning. How do we partition a graph into two clusters? We can use the minimum cut to partition the vertices into two sets (A and B) such that the weight of the edges connecting vertices in A to vertices in B is minimal:

    cut(A, B) = \sum_{i \in A, j \in B} w_{ij}        (1)

Note: The minimum cut is rather easy to compute, but it does not give a good partition because it often isolates single vertices; unwanted cuts with weights smaller than the "ideal" balanced cut will occur.

Figure 2: Less than Ideal Cut

Therefore, we want to normalize the cut so that A and B are similar in size:

    Ncut(A, B) = cut(A, B) \left( \frac{1}{vol(A)} + \frac{1}{vol(B)} \right)        (2)

    vol(A) = \sum_{i \in A} d_i        (3)

    d_i = \sum_{j=1}^{n} w_{ij}        (4)

Here |A| is the number of vertices of A, and vol(A) is the size of A obtained by summing the weights of all edges attached to vertices in A. Minimizing the normalized cut is NP-Hard, or in other words, computationally difficult to execute. We must use heuristic algorithms (i.e., clustering) that converge quickly to a local optimum.

2.3 Our NBA Player Dataset: What does our data look like?

Figure 3: Golden State Warriors and Toronto Raptors.

We have a 145 x 27 matrix containing all the players' names and certain attributes such as Age, Games played, Minutes played, Free Throws, Free Throw Percentage, etc. Figure 3 above is a snippet of the data we'll be clustering! (Check the References for a link to the complete dataset.)
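As a quick illustration (a toy example of our own, not drawn from the NBA data), the MATLAB snippet below evaluates cut(A, B) and Ncut(A, B) from equations (1)-(4) on a small hand-made weighted graph; the graph and the partition are arbitrary and only meant to show how the definitions fit together.

```matlab
% Toy illustration: cut(A,B) and Ncut(A,B) for a small weighted graph.
% The 6-node graph and the candidate partition below are made up.
W = [0 1 1 0   0 0;
     1 0 1 0   0 0;
     1 1 0 0.1 0 0;
     0 0 0.1 0 1 1;
     0 0 0   1 0 1;
     0 0 0   1 1 0];            % symmetric weighted adjacency matrix

A = [1 2 3];  B = [4 5 6];      % candidate partition of the vertices

d      = sum(W, 2);             % degrees d_i = sum_j w_ij, equation (4)
cutAB  = sum(sum(W(A, B)));     % cut(A,B): weight crossing the partition, equation (1)
volA   = sum(d(A));             % vol(A), equation (3)
volB   = sum(d(B));
NcutAB = cutAB * (1/volA + 1/volB);   % normalized cut, equation (2)

fprintf('cut = %.2f, Ncut = %.3f\n', cutAB, NcutAB);
```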
3 K-Means Clustering

3.1 History

K-Means was first proposed by Hugo Steinhaus in 1956 and was not really utilized until 1957, when Stuart Lloyd used it as a technique for pulse-code modulation. The original algorithm later branched off into different variants such as k-means++, k-medians, k-medoids, Gaussian mixture models, HG-means, etc. The kind of problem we will be presenting and solving with the K-Means algorithm is the sort invented by mathematicians with a passion for sports gambling, and it is closely related to the plot of the movie Moneyball.

3.2 What is K-means

K-Means is a simple and easy way to classify a given data set into a certain number k of clusters or groups. It is a method of vector quantization that is highly popular for cluster analysis in data mining. The criterion that K-Means focuses on is Compactness (see Definition 1.1.1 in Section 2.1).

First, we need to understand some of the fundamentals. The main idea of K-Means is to define k centroids, with each cluster containing a single centroid at the center of that particular cluster. Using an iterative process, the centroids are relocated and the points reassigned repeatedly until the centers stabilize. In some cases the centroids never stop changing, and we must impose a maximum number of iterations to halt the algorithm within a reasonable margin of error. This is also a method for finding local optima of the objective function:

    \min \sum_{j=1}^{k} \sum_{i=1}^{n} || x_i^{(j)} - m_j ||^2        (5)

Our goal is to minimize this error using the Euclidean distance function, which is a defining feature of the K-Means algorithm. Here x_i^{(j)} is our data, with i indexing the data point and j the centroid it is compared to, and m_j is the j-th centroid.

Note: Although it can be proven theoretically that K-Means always terminates once the centroids stop moving, in practice the program does not always stop running. This may be due to computer limitations when dealing with very large numbers or extremely resource-intensive data (beyond roughly the 16th significant digit, the values become more and more inaccurate).

Note: Initial conditions are very important. Basically, we don't want to be naive and place centroids on top of one another or in questionable locations. Two good techniques for placing the initial centroids are: (1) place your initial guesses near the origin, or (2) place them slightly outside the data.

3.3 What if our data is more than 3 dimensions?

Then we have to use a dimension-reduction method to bring our data down to the desired number of dimensions. PCA and LDA are examples of good methods to use; they are discussed in detail in Section 4.

3.4 Guessing Your Centroids

If we are lucky and the data is separable, we can ideally draw a hyperplane/line such that all the data of a similar group lies on its own side of the hyperplane/line. This is very convenient visually because we can place our centroids very close to the actual centers of the data. However, if our data is not obviously separable, then we have to make an educated guess about where the centroids are and how many centroids actually exist. This is one of the most difficult parts of K-Means clustering.

Why is picking the number of centroids difficult? Suppose I gave you our NBA data and asked you, "Who would be the next best pick if Kevin Durant got injured or left the Golden State Warriors?" If you had no knowledge about the NBA or sports in general, like myself, then this would be very difficult to answer.
Thus, having knowledge about your data is extremely useful: the who, what, and where in our data. We will first use K-Means clustering to analyze the current Top 7 NBA teams and their players, in order to categorize the players and determine the best athletes.
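As a small aside on the initialization strategies just described (a toy sketch of our own, assuming MATLAB's Statistics and Machine Learning Toolbox; the data below is made up), the built-in kmeans function lets us either supply explicit starting centroids or run several random restarts and keep the best result. Either option helps guard against the poor initial placements discussed above.

```matlab
% Toy illustration (not our NBA data): the effect of initial centroids.
% Requires the Statistics and Machine Learning Toolbox for kmeans.
rng(1);                                   % reproducible toy data
X = [randn(50,2); randn(50,2) + 6];       % two well-separated blobs

% (1) Supply an explicit initial guess, e.g. slightly outside the data.
init = [min(X) - 1; max(X) + 1];          % one centroid per row
[idx1, C1] = kmeans(X, 2, 'Start', init);

% (2) Or let kmeans try several random initializations and keep the best.
[idx2, C2] = kmeans(X, 2, 'Replicates', 10);

disp(C1); disp(C2);                       % the recovered cluster centers
```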
4 Graphing our Data onto 2-Dimensions

4.1 Initial Dataset

Going back to our dataset (see Figure 3 in Section 2.3), recall that we have a 145 x 27 matrix. It contains all the players' names and attributes such as Age, Games played, Minutes played, Free Throws, Free Throw Percentage, etc.

This is 27-dimensional data. In order to visualize our information, we need to turn these 27 dimensions into 2 dimensions.

1. For the purposes of K-Means clustering, we will use Linear Discriminant Analysis (LDA) to reduce the dimension and graph the data. This approach helps categorize players according to their positions/skills.

2. For the purposes of Spectral clustering, we will use Principal Component Analysis (PCA) to reduce the dimension and graph the data. This approach separates the average players from the extremely skilled players.

Using both of these methods will help give us a mathematical decision on who the best picks are without knowing anything about the sport beforehand.

4.2 Method 1: Linear Discriminant Analysis

Linear Discriminant Analysis is a method that uses a linear combination of features (attributes, qualities, etc.) in order to separate the data into classes, groups, and/or events. It is used in statistics, pattern recognition, and machine learning. To avoid getting side-tracked from our problem, the most important thing to know about Linear Discriminant Analysis is that it is a dimension-reduction method similar to Principal Component Analysis. However, Linear Discriminant Analysis explicitly tries to find the differences between classes, while Principal Component Analysis does not take class differences into account. Consider Figure 4 below.

Figure 4: Players plotted using the Linear Discriminant Analysis method.

Applying the Linear Discriminant Analysis method to our 27-dimensional data (using a built-in MATLAB function), our data is now in 2 dimensions. What do we do from here? We then feed the 2-dimensional data into K-Means to locate the clusters and their centers. The detailed K-Means algorithm we used is given in Section 7, K-Means Algorithm.
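For readers who want to see what the dimension reduction looks like in code, here is a minimal sketch of a two-dimensional Fisher LDA projection. It is our own simplified illustration, not the built-in MATLAB routine we actually used; the names X, y, and lda_project2d are placeholders, and numeric class labels (for example, encoded player positions) and an invertible within-class scatter matrix are assumed.

```matlab
% Minimal sketch: project an n-by-p feature matrix X onto the two most
% discriminative directions, given a vector y of numeric class labels.
function Z = lda_project2d(X, y)
    classes = unique(y);
    mu = mean(X, 1);                           % overall mean
    p  = size(X, 2);
    Sw = zeros(p);  Sb = zeros(p);             % within/between-class scatter
    for t = 1:numel(classes)
        Xc  = X(y == classes(t), :);
        muc = mean(Xc, 1);
        Sw  = Sw + (Xc - muc)' * (Xc - muc);
        Sb  = Sb + size(Xc, 1) * (muc - mu)' * (muc - mu);
    end
    [V, D]   = eig(Sb, Sw);                    % generalized eigenproblem Sb*v = lambda*Sw*v
    [~, ord] = sort(real(diag(D)), 'descend'); % largest generalized eigenvalues first
    W = V(:, ord(1:2));                        % keep the top 2 discriminant directions
    Z = X * W;                                 % n-by-2 projection for plotting / K-Means
end
```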
Figure 5: K-Means results (K-Means applied to Figure 4)

Figure 4 above is the resulting graph after using LDA to reduce the 27-dimensional data to 2 dimensions. We next apply the K-Means algorithm to cluster the data, as shown in Figure 5 above, which marks where the center of each cluster is located after running K-Means. The circled areas are approximately where the players fall within each cluster.

How is this useful? Suppose your favorite team gets a new rookie player and the only information you can find on Google is his Age, Games played, Minutes played, Free Throws, Free Throw Percentage, etc. We can run this process again and the rookie will land in one of the clusters in Figure 5. What exactly does this tell us? It gives us the following information: (1) which players the rookie is similar to, (2) statistics on the shared features within the cluster he lands in, and (3) whether he is likely to be an outstanding player. The closer he is to the players within his cluster, the more similar they are; conversely, the farther apart two players are, the more dissimilar they are. These features could be skill set, position, ranking, etc.

4.2.1 Analysis of the LDA Graph using K-Means Clustering

With the data collected, refined, graphed, and clustered, it can now serve as potential training data for numerous types of predictive algorithms (both unsupervised and supervised). Suppose some friends start a competitive fantasy draft: everyone picks players to form a team, but players can only play the positions they realistically play, and no repeated picks are allowed. Suppose all your friends are huge sports fans, so they start picking their favorite players and whoever they think are the best athletes. Since we already have our training data, we could build an algorithm that predicts who the best player will be for each position and then prints out all of next season's best picks.
Hence, this is how NBA gambling can become very competitive.

From Figure 6, we can infer that the orange cluster contains Forward-position players. Kevin Durant is known to be an excellent Small Forward and Power Forward, and his data point appears on the edge of his cluster.

Note: Outliers in each cluster represent the stronger players for that basketball position, because their 27-dimensional statistics far exceed those of the other players within their respective clusters. This observation will also be confirmed with Spectral clustering!

Figure 6: Zoomed-In Cluster

4.3 Method 2: Principal Component Analysis

Principal Component Analysis (PCA) is a dimension-reduction method used to bring high-dimensional data down to 2 or 3 dimensions. PCA matters here because the K-Means algorithm is most useful when you can visualize the resulting clusters; anything above 3 dimensions would not be practical to plot. Thus, the Principal Component Analysis method can reduce high-dimensional data into fewer dimensions while retaining as much of the information in the large dataset as possible.

Principal Component Analysis is essentially a procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components. The method is very similar to the Singular Value Decomposition (SVD) in that the very first singular value of the matrix carries the most weight compared to all the rest:

    \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0        (6)

The same is true of the principal components: the first component carries the most information from the dataset. Traditionally, the Principal Component Analysis method is performed on a square symmetric matrix. A sum-of-squares-and-cross-products matrix, a covariance matrix, or a correlation matrix can be used; the differences between these choices amount only to scaling. For the sake of simplicity, we will use the Singular Value Decomposition to obtain our principal components. Given the data X, write its SVD as X = U \Sigma V^T, where \Sigma is the diagonal matrix of singular values and U, V are orthogonal matrices; thus,

    X^T X = (U \Sigma V^T)^T (U \Sigma V^T)        (7)
          = V \Sigma^T U^T U \Sigma V^T        (8)
          = V \Sigma^T \Sigma V^T        (9)
          = V \hat{\Sigma}^2 V^T        (10)
          = V \Lambda V^T        (11)

where \Lambda is a square diagonal matrix holding all our eigenvalues in order of importance:

    \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n        (12)

    \lambda_n \ge 0 \quad \forall n \in \mathbb{Z}^+        (13)

Since X^T X (which, for centered data, is proportional to the covariance matrix) diagonalizes as V \Lambda V^T, the columns of V give the principal components. This is one way of obtaining the principal components, and thankfully the Principal Component Analysis method is pre-built in most computing languages.
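The derivation above translates almost directly into code. The sketch below is our own illustration, where X is a placeholder for a numeric version of our 145 x 27 player matrix with the name column removed; it computes a 2-dimensional PCA projection via the SVD and reports how much variance each component carries.

```matlab
% Minimal sketch: PCA via the SVD, reducing an n-by-p data matrix X
% to 2 dimensions for plotting and clustering.
Xc        = X - mean(X, 1);        % center each column (feature)
[U, S, V] = svd(Xc, 'econ');       % Xc = U*S*V', singular values on diag(S)
scores    = Xc * V(:, 1:2);        % project onto the first 2 principal components
explained = diag(S).^2 / sum(diag(S).^2);   % fraction of variance per component

scatter(scores(:,1), scores(:,2), 20, 'filled');
xlabel('PC 1'); ylabel('PC 2');
title('Players projected onto the first 2 principal components');
```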
4.3.1 Analysis of the PCA Graph using K-Means Clustering

Running the PCA method on all 7 teams yields a very interesting graph. How many clusters can you spot? One? Two? Seventy? (See Figure 7 below.)

Figure 7: PCA Method for all 7 teams

Evidently, one cluster doesn't tell us anything. So how about two? Let the number of centroids be k = 2. Clearly, clustering this PCA graph with K-Means is not very useful. At first glance, we can already tell that the data is not separable by a hyperplane. This is one example of the limitations of K-Means, because the algorithm focuses on Compactness. In the next section, we will use Spectral clustering, which clusters by the Connectivity criterion (see Definition 1.1.2 in Section 2.1), to analyze the PCA graph.

Figure 8: K-Means failing on PCA Graph
5 Spectral Clustering

5.1 History

Spectral clustering has a long history and was mainly popularized by Jianbo Shi and Jitendra Malik in "Normalized Cuts and Image Segmentation," and by Ng, Jordan, and Weiss in "On Spectral Clustering: Analysis and an Algorithm" (Advances in Neural Information Processing Systems). Since spectral clustering encompasses a wide variety of algorithms, it is rather difficult to pinpoint the origin/creator of this clustering method.

5.2 What is Spectral Clustering

Spectral clustering classifies a dataset by treating the data points as nodes of a graph. It is therefore closely related to the graph partitioning problem (see Section 2.2: NP-Hard). The nodes are mapped into a low-dimensional space in which they can easily be segregated to form clusters. The criterion that Spectral clustering focuses on is Connectivity (see Definition 1.1.2 in Section 2.1).

5.3 Steps for Building a Spectral Clustering Algorithm:

1. Create a similarity graph.
2. Project our given data onto a low-dimensional space.
3. Create the clusters.

Step 1: Similarity Graph

The first task in Spectral clustering is to compute a similarity graph. We create an undirected graph G = (V, E) with vertex set V = {v_1, v_2, ..., v_n} and represent it with an adjacency matrix whose entries are the pairwise similarities between vertices/nodes. There are three common ways to build the similarity graph:

1. The \epsilon-neighborhood graph: connect all points whose pairwise distances are smaller than \epsilon. Since all connected distances are on the same scale (at most \epsilon), the edges are left unweighted, giving an undirected, unweighted graph.

2. K-nearest neighbors: attach an edge from each node to its k nearest neighbors in the space (the result is not very sensitive to the choice of k). After connecting the appropriate vertices/nodes, the edges are weighted by the similarity of the adjacent points.

3. Fully connected graph: connect all points with each other and weight every edge by the similarity w_ij. Because this graph should still model local neighborhood relationships, similarity functions such as the Gaussian similarity function are used (a small sketch of this construction appears at the end of this step):

    w(x_i, x_j) = \exp\left( -\frac{||x_i - x_j||^2}{2\sigma^2} \right)        (14)

where \sigma is the width of the neighborhoods, a free tuning parameter that adjusts the similarity.

Note: There are other similarity functions, such as the polynomial kernel, the dynamic similarity kernel, and the inverse multi-quadric kernel.

Figure 9: Thus, we are able to create an adjacency matrix A for any of our chosen similarity graphs.

Note: The entries of the adjacency matrix A will be 0 or 1 if the graph is unweighted.
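As a concrete illustration of the fully connected construction (a sketch of our own; X and sigma are placeholder names, not variables from our actual code), the Gaussian similarity of equation (14) can be evaluated for all pairs of points at once:

```matlab
% Minimal sketch: fully connected similarity graph with Gaussian weights.
% X is an n-by-p data matrix; sigma is the neighborhood-width parameter.
sigma  = 1.0;
sqdist = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');   % pairwise ||xi - xj||^2
A      = exp(-sqdist / (2 * sigma^2));                   % weighted adjacency matrix
A(1:size(A,1)+1:end) = 0;                                % zero the diagonal (no self-loops)
```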
Step 2: Project Data onto Low-Dimensional Space

As Figure 10 shows, some data points in the same circular cluster can have a greater Euclidean distance between them than points in different clusters.

Figure 10: Circles Data

Hence, we want to project the observations into a low-dimensional space in which connectivity, rather than raw distance, determines cluster membership. We can do this by computing the graph Laplacian

    L = D - A,        (15)

where A is the adjacency matrix and D is the degree matrix with degrees

    d_i = \sum_{j : (i,j) \in E} w_{ij}        (16)

Thus, the entries of the Laplacian matrix are:

    L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } (i,j) \in E \\ 0 & \text{if } (i,j) \notin E \end{cases}        (17)

From the graph Laplacian, we can find the eigenvalues and eigenvectors used to embed the data points into a low-dimensional space. From linear algebra, we have the eigenvalue equation

    L v = \lambda v        (18)

where v is an eigenvector of L corresponding to the eigenvalue \lambda. Thus we obtain eigenvalues {\lambda_1, \lambda_2, ..., \lambda_n} with 0 = \lambda_1 \le \lambda_2 \le ... \le \lambda_n and eigenvectors {v_1, v_2, ..., v_n}.

1. If 0 is an eigenvalue of L with multiplicity k, then the undirected graph G = (V, E) has exactly k connected components (subgraphs).

2. If G = (V, E) is connected, then \lambda_1 = 0 and \lambda_2 > 0, and \lambda_2 is the algebraic connectivity of G. This smallest non-zero eigenvalue is called the Fiedler value; it approximates the minimum graph cut needed to separate the graph into 2 connected components.

3. The ordered list 0 = \lambda_1 \le \lambda_2 \le ... \le \lambda_n is called the spectrum of the Laplacian, and it tells us a great deal about the graph: the sparsity of a graph cut, the number of connected components, and whether the graph is bipartite.

Now we are going to use those eigenvectors and eigenvalues to cluster the data.
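Before moving on to Step 3, here is a small sketch (our own illustration, continuing from the adjacency matrix A built in Step 1; variable names are placeholders) of computing the graph Laplacian and its spectrum:

```matlab
% Minimal sketch: graph Laplacian of a weighted adjacency matrix A,
% its sorted spectrum, and the Fiedler vector.
d = sum(A, 2);                 % degrees d_i = sum_j w_ij
D = diag(d);                   % degree matrix
L = D - A;                     % unnormalized graph Laplacian, equation (15)

[V, Lam]   = eig(L);           % columns of V are eigenvectors, Lam is diagonal
[lam, ord] = sort(diag(Lam));  % sort so that 0 = lambda_1 <= lambda_2 <= ...
V = V(:, ord);

disp(lam');                    % inspect the spectrum (look for gaps between eigenvalues)
fiedler = V(:, 2);             % eigenvector of lambda_2: the Fiedler vector when G is connected
```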
Step 3: Create clusters

Creating 2 Clusters: For this tricky step, we are going to assign a value to each vertex. Using the smallest non-zero eigenvalue \lambda_2 and its corresponding eigenvector v_2 (the Fiedler vector), we assign to each vertex the corresponding entry of v_2. Then we cluster the vertices based on whether their value is > 0 or \le 0. Hence, each element of the Fiedler eigenvector tells us which cluster its vertex belongs to.

Example: Suppose v_2 = [x_1, x_2, x_3, x_4, x_5, x_6]. We assign the first vertex to x_1, the second vertex to x_2, and so forth. Assume that x_1, x_2, x_3 > 0 and x_4, x_5, x_6 \le 0. Then we split the vertices so that all vertices with assigned value > 0 are in one cluster, and all vertices with value \le 0 are in another. Therefore, we end up with two clusters: vertices 1, 2, 3 in one cluster and vertices 4, 5, 6 in the second cluster.

Hint: Look for big gaps between consecutive eigenvalues to guess the number of clusters! If you guess that there are k clusters, the eigenvectors associated with the first k-1 non-zero eigenvalues should give information on how to cut the data into k clusters.

This method of creating clusters is perfect for 2 clusters. However, it becomes unwieldy when there are many clusters (k >> 2).

Creating k Clusters: For k clusters, we first normalize our Laplacian and then perform K-Means to group the data points into k clusters. The Laplacian matrix is normalized as

    L_{norm} = D^{-1/2} L D^{-1/2}        (19)

This normalized Laplacian is the one used in the algorithm of Ng, Jordan, and Weiss.

1. For k clusters, compute the first k eigenvectors of the normalized Laplacian.
2. Stack these eigenvectors side by side to form a new matrix with the vectors as its columns.
3. Every vertex is then represented by the corresponding row of this new matrix; these rows form the feature vectors of the vertices.
4. Use K-Means to cluster these feature vectors into the k clusters {C_1, C_2, ..., C_k} (a compact sketch of the whole procedure is given below).
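Putting Steps 1-3 together, the sketch below is our own compact illustration of the k-cluster procedure, not the exact code behind Figures 12 and 13. It assumes MATLAB's Statistics and Machine Learning Toolbox for kmeans, assumes the graph has no isolated vertices, and includes the row-normalization step from the Ng, Jordan, and Weiss algorithm, which the numbered list above does not spell out.

```matlab
% Minimal sketch: normalized spectral clustering into k clusters.
% A is the weighted adjacency matrix from Step 1.
function idx = spectral_cluster(A, k)
    d     = sum(A, 2);
    Dih   = diag(1 ./ sqrt(d));             % D^(-1/2); assumes no isolated vertices
    L     = diag(d) - A;                     % unnormalized Laplacian
    Lnorm = Dih * L * Dih;                   % normalized Laplacian, equation (19)
    Lnorm = (Lnorm + Lnorm') / 2;            % enforce symmetry against round-off

    [V, Lam] = eig(Lnorm);
    [~, ord] = sort(diag(Lam));              % smallest eigenvalues first
    U = V(:, ord(1:k));                      % first k eigenvectors as columns

    U = U ./ sqrt(sum(U.^2, 2));             % normalize rows (Ng-Jordan-Weiss step)
    idx = kmeans(U, k);                      % cluster the row feature vectors
end
```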
5.4 Analysis of the PCA Graph using Spectral Clustering

Comparing with Figure 8 (K-Means failing on the PCA graph), we can clearly see that Spectral clustering gives us much better information (see Figure 12)! The strongest players across all positions lie in the outer cluster, while the more average players are contained in the inner cluster. Players such as Kevin Durant, Stephen Curry, Damian Lillard, Paul George, Klay Thompson, J.J. Redick, CJ McCollum, Lou Williams, Russell Westbrook, Steven Adams, etc. are found in the outer cluster.

Figure 11: Zoomed-in on PCA Graph

We can also apply Spectral clustering to a single NBA team; below, we chose the Golden State Warriors. However, we realized that the observations made on a single team may not be as useful under Spectral clustering, because each player on a team plays a different position. (See Figure 13.)

Figure 12: Spectral Clustering on PCA Graph of the 7 Teams

Figure 13: Spectral Clustering on PCA Graph of the Golden State Warriors Team

6 Advantages/Disadvantages of K-Means vs. Spectral clustering

For our NBA dataset:

• Using the LDA graph with K-Means clustering was conclusive.
• Using the PCA graph with K-Means clustering was inconclusive.
• Using the PCA graph with Spectral clustering was conclusive.

For K-Means clustering (compactness), the density of the data points is the main factor driving the clustering. While it is useful for mixture models (see Definition 1.1.3 in Section 2.1), it is ineffective when applied to spiral or circular data because it relies on the Euclidean distance between data points.

Spectral clustering (connectivity) does not rely on strong assumptions about the statistics of the clusters. Methods like K-Means assume that the points assigned to a cluster are distributed spherically around the cluster's centroid, which is a very strong assumption. The disadvantage of Spectral clustering, however, is that it is computationally expensive for large datasets, since it requires computing the Laplacian's eigenvalues and eigenvectors and then running K-Means on them. The K-Means step at the end of Spectral clustering also means that the resulting clusters are not always the same; they depend on the initial choice of centroids.
7 K-Means Algorithm

This is the K-Means algorithm that we used, both for K-Means clustering and within Spectral clustering, to form k clusters. We are given a data set of n points X = {X_1, X_2, ..., X_n} and we know that k clusters will form (k < n). Let m = {m_1, m_2, ..., m_k} be the initial guesses for the k cluster centers. Remember, our goal is to minimize ||X_i^{(j)} - m_j||, the distance of each point to its assigned centroid, over all k clusters (see equation (5)).

Pseudo code:

1. Input the data we are clustering and the number of clusters k.
2. Give an initial guess for the location of each cluster center.
3. Compute the Euclidean distance from each data point to each centroid.
4. Record/assign each data point to its closest centroid.
5. Sum the data points assigned to each cluster and divide by the number of points summed; these become the new centroid locations for the next iteration.
6. Repeat steps 3, 4, and 5 until the centroids stop moving between consecutive iterations.
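The following is a compact MATLAB restatement of the pseudocode above. It is a sketch under our own variable names (X, m, maxIter), not a verbatim copy of the routine in Figure 14, which is shown and explained below.

```matlab
% Compact restatement of the pseudocode: X is an n-by-p data matrix,
% m is a k-by-p matrix of initial centroid guesses.
function m = kmeans_sketch(X, m, maxIter)
    [n, ~] = size(X);
    k = size(m, 1);
    for iter = 1:maxIter
        % Step 3: Euclidean distance of every point to every centroid.
        Distance = zeros(n, k);
        for i = 1:n
            for j = 1:k
                Distance(i, j) = norm(X(i, :) - m(j, :))^2;
            end
        end
        % Step 4: assign each point to its closest centroid.
        [~, label] = min(Distance, [], 2);
        % Step 5: new centroid = mean of the points assigned to each cluster.
        mOld = m;
        for j = 1:k
            members = X(label == j, :);
            if ~isempty(members)
                m(j, :) = mean(members, 1);
            end
        end
        % Step 6: stop once the centroids stop moving.
        if max(abs(m(:) - mOld(:))) < 1e-10
            break;
        end
    end
end
```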
Figure 14 below shows this pseudocode in MATLAB form:

Figure 14: Homemade K-Means algorithm

Explanation of the code: The first i, j for-loop in the MATLAB code implements the Euclidean distance function; it records the distance of every point from each centroid:

    Distance(i, j) = ||X_i - m_j||^2        (20)

The second i, j loop compares each point's recorded distances, finds the minimum over all centroids, and labels the point with the centroid it is closest to. This gives us an easy-to-reference list mapping each data point to its cluster.

The third i, j loop sums up all the data points that carry the same centroid label assigned in the previous loop and then divides each sum by the number of data points it contains. Lastly, we take those results and set them as the new centroids for the next iteration. Rinse and repeat until the centroids stop moving or the maximum number of iterations is reached.

References

[1] D. Doty. Theory of Computation: ECS 120 Lecture Notes. 2019.

[2] G. Calafiore and L. El Ghaoui. Optimization Models. Cambridge University Press, 2014.

[3] J. De Loera. "Math for Data Analytics." Lecture notes, University of California, Davis, CA, April 2019.

[4] L. Elden. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.

[5] U. von Luxburg. "A Tutorial on Spectral Clustering." Statistics and Computing. Springer, 2007.