Legal Analytics Course - Class 9 - Clustering Algorithms (K-Means & Hierarchical Clustering) - Professor Daniel Martin Katz + Professor Michael J Bommarito

Class 9
K-Means & Hierarchical Clustering
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com

Clustering -
The Basic Idea
access more at legalanalyticscourse.com

Adapted from Slides By
Victor Lavrenko and Nigel Goddard
@ University of Edinburgh
Take A Look At These 12

Age   Sex      Species
72    Female   Human
3     Female   Horse
36    Male     Human
21    Male     Human
67    Male     Human
29    Female   Human
54    Male     Human
44    Male     Human
50    Male     Human
42    Female   Human
6     Male     Dog
7     Female   Human

Task = Can We Determine to Which
Group the Agent Belongs?
Clustering (Unsupervised Learning)
f(agent) = Cluster (Group?)

How did we arrive at these clusters?

Clustering-
Some High Level Points

Clustering is a Method of Grouping Similar Objects
“Similar” is the Key Idea (but it is a slippery concept)
Clustering is typically Unsupervised Learning

There are a variety of methods used in this area
(Agglomerative versus Divisive Methods)

Remember real data is n-dimensional
(which makes implementation / accuracy challenging)

What makes two (or more) objects ‘similar’ ?

As humans, we often place
objects into categories, groups, etc.

this is often done without
an explicit model
(just our mental model(s), etc.)

Example Via: Piyush Rai
Similarity is a Slippery Concept

in clustering, we are interested in trying
to formalize the idea of ‘similarity’

A typical approach is to project
n-dimensional data into
a unidimensional ‘similarity index’
f(dimension 1, dimension 2, dimension 3, ..., dimension n) = similarity index
where f( ) is a similarity or distance function
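
As a minimal sketch of such a function, assuming numeric features and plain Euclidean distance (one choice among many; the function names and sample vectors below are illustrative and not from the slides):

```python
import math

def euclidean_distance(a, b):
    """Collapse two n-dimensional points into a single distance value."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_index(a, b):
    """Map distance into (0, 1]: identical points score 1, distant points approach 0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

p = [0.0, 0.0, 0.0]
q = [3.0, 4.0, 0.0]
d = euclidean_distance(p, q)  # 5.0
s = similarity_index(p, q)    # 1 / (1 + 5)
```

The 1 / (1 + distance) mapping is only one hypothetical way to turn a distance into a similarity score; any decreasing function of distance would serve the same purpose.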

unidimensional similarity spectrum
everything in its own cluster
(i.e. everyone is a special snowflake)
everything in one cluster
0% similarity threshold
100% similarity threshold
The groupings become interesting as we slide across this spectrum
The hard question is where to stop as we move from left to right

The Heavy Lifting is to
develop/apply the optimal
similarity/distance function
for the substantive problem at issue

Different similarity criteria can
lead to different clusterings

Goal for Any Clustering Method:
Achieve High Within Cluster Similarity
Achieve Low Cross Cluster Similarity

We Want to Develop a Notion
of Distance Between Objects
Similarity is inversely related to distance
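
One way to make these two goals concrete, assuming Euclidean distance and using average pairwise distance as the yardstick (both are illustrative choices, as are the function names and toy points):

```python
import math
from itertools import combinations, product

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster.
    Assumes at least one cluster holds two or more points."""
    pairs = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_cross_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a, b in product(c1, c2)]
    return sum(pairs) / len(pairs)

# Two groupings of the same six points.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

print(mean_within_distance(good), mean_cross_distance(good))
print(mean_within_distance(bad), mean_cross_distance(bad))
```

Since similarity is inversely related to distance, the better grouping is the one with the small within-cluster distance and the large cross-cluster distance.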

K-Means
and
H-Clust

K Means and
Hierarchical Clustering
are the Most Popular Approaches
Used in Clustering

K-Means

K Means
How do we find the clusters in the data shown below?
We select K clusters in advance
Iteratively seek to minimize the sum of
squared distances

K Means Optimization
We start with K clusters with unknown centers
We are attempting to minimize the sum of squared distances
(i.e. the objective function shown below)
The Tricky Part is that this minimization problem
cannot be solved analytically
(finding the global minimum is NP-hard in general)
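
In standard notation (the symbols are an assumption here and may differ from those in the Flach text), the sum of squared distances for a partition of the data into clusters D_1, ..., D_K with centers mu_1, ..., mu_K can be written as:

```latex
\[
\min_{D_1,\dots,D_K} \; \sum_{j=1}^{K} \sum_{x \in D_j} \lVert x - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert D_j \rvert} \sum_{x \in D_j} x
\]
```

Each point contributes its squared distance to the center of the cluster it is assigned to, and each center is the mean of the points in its cluster.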

Stuart Lloyd proposed a simple heuristic solution
“Lloyd’s algorithm” aka “k-means” is a good candidate solution
K Means Optimization
(from the Flach text, page 248)
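
A minimal sketch of Lloyd's heuristic, assuming Euclidean distance and initial centers drawn at random from the data (the initialization rule, function names, and toy points are assumptions for illustration, not taken from the Flach text):

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: k data points as centers
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # nothing moved, so the assignments are stable
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = lloyd_kmeans(points, k=2)
print(sorted(centers))
```

The loop alternates the same two steps the slides walk through: assign points to the nearest center, then recalculate each center.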

K-Means
where k = 2
Adapted from Example by Piyush Rai
Initialization Step
First Iteration - Assigning Points
First Iteration - Recalculate the Center of the Cluster
Second Iteration - Assigning Points
Second Iteration - Recalculate the Center of the Cluster
Third Iteration - Assigning Points
Third Iteration - Recalculate the Center of the Cluster

K Means Clustering
Fast Method But May Converge to a Local Minimum
Should repeat from different starting conditions
(must then figure out the best heuristic to find the global minimum)
An Important Weakness is that it is often not clear what value of K to use
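
A rough sketch of the restart idea, assuming random initial centers and keeping whichever run ends with the lowest sum of squared distances (the helper names, the number of restarts, and the toy points are all hypothetical):

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's heuristic from a random start; returns (sse, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[j]
                       for j, pts in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return sse, centers

def kmeans_restarts(points, k, n_starts=20, seed=0):
    """Repeat from different starting conditions and keep the lowest-SSE run."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_starts)),
               key=lambda run: run[0])

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

# SSE generally shrinks as K grows, so SSE alone cannot choose K.
sse_by_k = {k: kmeans_restarts(points, k)[0] for k in range(1, 5)}
print(sse_by_k)
```

Restarting lowers the risk of reporting a poor local minimum but does not guarantee the global one, and the choice of K still has to come from the substantive problem or a separate heuristic.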

K-Means Clustering
some helpful videos
https://www.youtube.com/watch?v=Qqg4Fklxqh0
https://www.youtube.com/watch?v=0MQEt10e4NM
https://www.youtube.com/watch?v=4shfFAArxSc

H-Clust

Partitions can be visualized using a tree structure (a dendrogram)
Does not need the number of clusters as input
Possible to view partitions at different levels of granularity
(i.e., can refine/coarsen clusters) using different K
Description Via: Piyush Rai

http://scaledinnovation.com/analytics/trees/dendrograms.html

Agglomerative versus Divisive Methods
Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all
observations start in one cluster, and splits are
performed recursively as one moves down the
hierarchy.

Agglomerative Methods
Divisive Methods
The dendrogram memorializes the
splits or order of agglomeration

Groups within Groups within Groups ...

“(1) Start by assigning each item to a cluster, so that if you have
N items, you now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as
the distances (similarities) between the items they contain.
(2) Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one cluster less.
(3) Compute distances (similarities) between the new cluster and
each of the old clusters.
(4) Repeat steps 2 and 3 until all items are clustered into a single
cluster of size N. (*)”
S. C. Johnson (1967): "Hierarchical Clustering Schemes" Psychometrika, 32:241-254
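
The four steps above can be sketched as follows, assuming Euclidean distance between items and a closest-pair (single-linkage) rule for step 3; both are assumptions chosen for illustration, and the function names and sample items are hypothetical:

```python
import math
from itertools import combinations

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Assumed rule for step 3: distance between the closest pair of members."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerate(items, cluster_distance=single_link):
    # Step 1: every item starts in its own cluster.
    clusters = [[item] for item in items]
    merges = []  # merge history, i.e. the information a dendrogram draws
    # Step 4: repeat until a single cluster of size N remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        height = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        # Step 3: distances to the new cluster are recomputed on the next pass.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

items = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
for left, right, height in agglomerate(items):
    print(left, "+", right, "at distance", height)
```

This version recomputes every cluster-to-cluster distance on each pass, which keeps the code short but is slow for large N; it is meant only to mirror the steps, not to be an efficient implementation.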

There are a variety of different approaches to Step 3
(3) Compute distances (similarities) between the new
cluster and each of the old clusters.
single-linkage clustering
complete-linkage clustering
average-linkage clustering
centroid linkage clustering
(see pages 253-258 of Flach)
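
As a hedged illustration of how these four rules differ, here is a small sketch that assumes Euclidean distance and uses the common textbook definitions (the function names and example clusters are illustrative and not taken from Flach):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Distance between the two closest members."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    """Distance between the two farthest members."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid(c):
    return tuple(sum(coords) / len(c) for coords in zip(*c))

def centroid_linkage(c1, c2):
    """Distance between the cluster means."""
    return dist(centroid(c1), centroid(c2))

left = [(0.0, 0.0), (2.0, 0.0)]
right = [(5.0, 0.0), (9.0, 0.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(left, right))
```

The same pair of clusters can look close under one rule and far apart under another, which is why the choice made in step 3 changes the resulting hierarchy.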

some helpful videos
https://www.youtube.com/watch?v=zygVdmlS-YA
https://www.youtube.com/watch?v=2z5wwyv0Zk4

Implementation in R

https://www.youtube.com/watch?v=M9jb6KrBlPc

https://www.youtube.com/watch?v=sAtnX3UJyN0

https://www.youtube.com/watch?v=v3k8WEOVSYw

Clustering -
E-Discovery

E-Discovery is simply
information retrieval + context

Trying to locate documents:
relevant v. not-relevant
privileged v. not-privileged

Pre-Clustering Documents
Can Aid in the Review Process

http://edu.cluster-text.com/

download the movie here
(it is a .wmv file, which might require an additional download to run on a Mac)

Mapping the Case Space
(Using Citation Networks to Extract
Distance Functions for Clustering Documents)

http://www.slideshare.net/Danielkatz/sinks-method-paper-presentation-duke-political-networks-conference-2010
