Class 9
K-Means & Hierarchical Clustering
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
Clustering -
The Basic Idea
access more at legalanalyticscourse.com
Adapted from Slides By
Victor Lavrenko and Nigel Goddard
@ University of Edinburgh
Take A Look at These 12
72 | Female | Human
3 | Female | Horse
36 | Male | Human
21 | Male | Human
67 | Male | Human
29 | Female | Human
54 | Male | Human
44 | Male | Human
50 | Male | Human
42 | Female | Human
6 | Male | Dog
7 | Female | Human
Task = Can We Determine to Which
Group the Agent Belongs?
Clustering (Unsupervised Learning)
f( )
Group?
Cluster
How did we arrive at these clusters?
Clustering-
Some High Level Points
Clustering is Unsupervised Learning
There are a variety of methods used in this area
(Agglomerative versus Divisive Methods)
Remember real data is n-dimensional
(which makes implementation / accuracy challenging)
“Similar” is the Key Idea (but it is a slippery concept)
Clustering is a Method of Grouping Similar Objects
Clustering is typically Unsupervised Learning
The Science of Similarity
What makes two (or more) objects ‘similar’ ?
As humans, we often place
objects into categories, groups, etc.
this is often done without
an explicit model
(just our mental model(s), etc.)
Example Via: Piyush Rai
Similarity is a Slippery Concept
in clustering, we are interested in trying
to formalize the idea of ‘similarity’
A typical approach is to project
n-dimensional data into
a unidimensional ‘similarity index’
[Figure: dimension 1, dimension 2, dimension 3, ..., dimension n -> f( ), a similarity or distance function -> similarity index]
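A minimal sketch of such a projection, assuming Euclidean distance as f( ) and the conversion 1 / (1 + distance) as the similarity index. Both are illustrative choices rather than anything the slides prescribe, and the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Collapse two n-dimensional points into one distance value."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_index(a, b):
    """Assumed conversion: distance 0 maps to similarity 1.0,
    and similarity shrinks toward 0 as distance grows."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Two illustrative 3-dimensional points (made-up values).
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

d = euclidean_distance(p, q)
s = similarity_index(p, q)
print(d, s)
```

Any other distance function could be swapped in for `euclidean_distance`; the point is only that n dimensions are reduced to a single number that can be compared across pairs.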
everything in its own cluster
(i.e. everyone is a special snowflake)
everything in one cluster
unidimensional similarity spectrum
the groupings become interesting as we slide across this spectrum
0% similarity threshold
the hard question is where to stop as we move from left to right
100% similarity threshold
The Heavy Lifting is to
develop/apply the optimal
similarity/distance function
for the substantive problem at issue
Different similarity criteria can
lead to different clusterings
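As a small illustration of this point, the sketch below compares two common distance functions (Euclidean and Manhattan) on made-up points. The points and the helper `nearest` are hypothetical; the example only shows that the "closest" neighbor, and hence any grouping built on it, can depend on the criterion chosen.

```python
def euclidean(a, b):
    """Straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(query, candidates, distance):
    """Return the name of the candidate closest to query under distance."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))

query = (0.0, 0.0)
candidates = {"a": (3.0, 3.0), "b": (0.0, 5.0)}

# Euclidean: a is about 4.24 away, b is 5.0 away, so a is nearer.
nearest_euclidean = nearest(query, candidates, euclidean)
# Manhattan: a is 6.0 away, b is 5.0 away, so b is nearer.
nearest_manhattan = nearest(query, candidates, manhattan)
print(nearest_euclidean, nearest_manhattan)
```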
Goal for Any Clustering Method:
Achieve High Within Cluster Similarity
Achieve Low Cross Cluster Similarity
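One rough way to check this goal is to compare the average distance between points that share a cluster with the average distance between points in different clusters. The sketch below assumes Euclidean distance and uses made-up points; the function names are hypothetical and this is not a standard named metric.

```python
from itertools import combinations, product

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_within_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [pair for c in clusters for pair in combinations(c, 2)]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def mean_cross_distance(clusters):
    """Average distance over all pairs of points in different clusters."""
    pairs = [pair
             for c1, c2 in combinations(clusters, 2)
             for pair in product(c1, c2)]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# The same six points grouped two ways.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, mean_within_distance(clusters), mean_cross_distance(clusters))
```

Since similarity is inversely related to distance, a small within-cluster distance together with a large cross-cluster distance corresponds to high within-cluster similarity and low cross-cluster similarity.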
We Want to Develop a Notion
of Distance Between Objects
Similarity is inversely related to distance
K-Means
and
H-Clust
K Means and
Hierarchical Clustering
are the Most Popular Approaches
Used in Clustering
K-Means
K Means
How do we find the clusters in the data shown below?
We select K clusters in advance
Iteratively seek to minimize the sum of
squared distances
K Means Optimization
We start with K clusters with unknown centers
We are attempting to minimize the sum of squared distances
(i.e. the objective function shown below)
The Tricky Part is that this minimization problem
cannot be solved analytically
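For reference, the standard k-means objective (the within-cluster sum of squared distances) can be written as follows. The notation is assumed here: D_1, ..., D_K are the K clusters and mu_j is the center of cluster D_j.

```latex
\[
  \min_{D_1,\dots,D_K;\ \mu_1,\dots,\mu_K}
  \sum_{j=1}^{K} \sum_{x \in D_j} \lVert x - \mu_j \rVert^{2}
\]
```

If the assignment of points to clusters is held fixed, the best center for each cluster is simply the mean of its points. If the centers are held fixed, the best assignment sends each point to its nearest center. The difficulty is that both must be chosen jointly, and the number of possible assignments grows combinatorially with the number of points.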
Stuart Lloyd proposed a simple heuristic solution
“Lloyd’s algorithm” aka “k-means” is a good candidate solution
K Means Optimization
from Flach Text, Page 248
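A minimal pure-Python sketch of Lloyd's heuristic, under stated assumptions: Euclidean geometry, initial centers drawn at random from the data points, and a stop rule of "assignments no longer change". The function names, the seed parameter, and the sample points are illustrative and are not taken from the Flach text.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialization (assumed scheme): pick k distinct data points as centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    sse = sum(squared_distance(p, centers[lab])
              for p, lab in zip(points, labels))
    return centers, labels, sse

# Two visibly separated groups of made-up points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, sse = lloyd_kmeans(points, k=2)
print(centers, labels, sse)
```

Each pass alternates the two steps that are easy on their own (assign points, then recompute centers), which is why the procedure is fast but only guaranteed to reach a local minimum of the objective.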
K-Means
a visual example
where k = 2
Adapted from Example by Piyush Rai
initialization step
First Iteration - Assigning Points
First Iteration - Recalculate the Center of the Cluster
Second Iteration - Assigning Points
Second Iteration - Recalculate the Center of the Cluster
Third Iteration - Assigning Points
Third Iteration - Recalculate the Center of the Cluster
K Means Clustering
Fast Method But Can Converge to a Local Minimum
Should repeat from different starting conditions
(must then figure out the best heuristic to find the global min)
An Important Weakness is that it is often not clear what value of K to use
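Both points can be sketched in a few lines: rerun the heuristic from several random starting conditions and keep the run with the lowest sum of squared distances, then compare that best value across candidate values of K. This is a simplified illustration under assumptions (random data-point initialization, 20 restarts, made-up points, hypothetical function names), not a recommended procedure for choosing K.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's heuristic; returns (sse, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    sse = sum(squared_distance(p, centers[lab])
              for p, lab in zip(points, labels))
    return sse, labels

def best_of_restarts(points, k, n_restarts=20, seed=0):
    """Repeat from different starting conditions; keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

# Three made-up groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
          (1.0, 9.0), (1.5, 8.5), (0.5, 8.0)]

sse_by_k = {k: best_of_restarts(points, k)[0] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

One common informal heuristic is to look for the value of K after which the printed SSE stops dropping sharply (often called the "elbow"). The SSE keeps shrinking as K grows, so it cannot pick K on its own.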
K-Means Clustering
some helpful videos
https://www.youtube.com/watch?v=Qqg4Fklxqh0
https://www.youtube.com/watch?v=0MQEt10e4NM
https://www.youtube.com/watch?v=4shfFAArxSc
H-Clust
Hierarchical Clustering
Partitions can be visualized using a tree structure (a dendrogram)
Does not need the number of clusters as input
Possible to view partitions at different levels of granularity
(i.e., can refine/coarsen clusters) using different K
Description Via: Piyush Rai
http://scaledinnovation.com/analytics/trees/dendrograms.html
Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all
observations start in one cluster, and splits are
performed recursively as one moves down the
hierarchy.
Agglomerative versus Divisive Methods
Agglomerative
Methods
Divisive
Methods
dendrogram
memorializes the
splits or order of agglomeration
Groups within Groups within Groups ...
Hierarchical Clustering
“(1) Start by assigning each item to a cluster, so that if you have
N items, you now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as
the distances (similarities) between the items they contain.
(2) Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one cluster less.
(3) Compute distances (similarities) between the new cluster and
each of the old clusters.
(4) Repeat steps 2 and 3 until all items are clustered into a single
cluster of size N. (*)”
S. C. Johnson (1967): "Hierarchical Clustering Schemes" Psychometrika, 32:241-254
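The four steps above can be sketched as follows. This is a naive illustration, assuming Euclidean distance between items and single linkage (the distance between the closest members) as the cluster distance. It recomputes cluster distances from the members on every pass rather than updating a distance matrix, and the function names and sample items are hypothetical.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Assumed cluster distance: closest pair of members."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerate(items, cluster_distance=single_link):
    # Step 1: every item starts in its own cluster.
    clusters = [[item] for item in items]
    merges = []  # record of (left cluster, right cluster, distance)
    # Step 4: repeat until a single cluster of size N remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Step 3: distances to the new cluster are recomputed on the
        # next pass, directly from its members.
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return merges

# Made-up one-dimensional items.
items = [(1.0,), (2.0,), (6.0,), (7.0,), (20.0,)]
merges = agglomerate(items)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The list of merges, read in order, is the information a dendrogram draws: which clusters were joined and at what distance.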
Hierarchical Clustering
There are a variety of different approaches to Step 3
(3) Compute distances (similarities) between the new
cluster and each of the old clusters.
single-linkage clustering
complete-linkage clustering
average-linkage clustering
centroid linkage clustering
(see pages 253-258 of Flach)
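The four linkage rules listed above differ only in how they turn item-to-item distances into a cluster-to-cluster distance. A minimal sketch, assuming Euclidean distance between items and using made-up clusters; the function names are illustrative.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pairwise(c1, c2):
    """All item-to-item distances between two clusters."""
    return [euclidean(a, b) for a in c1 for b in c2]

def single_linkage(c1, c2):
    """Distance between the two closest members."""
    return min(pairwise(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the two farthest members."""
    return max(pairwise(c1, c2))

def average_linkage(c1, c2):
    """Mean of all member-to-member distances."""
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the two cluster means."""
    return euclidean(centroid(c1), centroid(c2))

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

for rule in (single_linkage, complete_linkage,
             average_linkage, centroid_linkage):
    print(rule.__name__, rule(a, b))
```

Because the rules can rank the same pair of clusters differently, the choice of linkage can change the order of merges and therefore the shape of the resulting tree.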
Hierarchical Clustering
some helpful videos
https://www.youtube.com/watch?v=zygVdmlS-YA
https://www.youtube.com/watch?v=2z5wwyv0Zk4
Implementation in R
https://www.youtube.com/watch?v=M9jb6KrBlPc
https://www.youtube.com/watch?v=sAtnX3UJyN0
https://www.youtube.com/watch?v=v3k8WEOVSYw
Clustering -
E-Discovery
E-Discovery is simply
information retrieval + context
relevant v. not-relevant
privileged v. not-privileged
Trying to locate documents
Pre-Clustering Documents
Can Aid in the Review Process
http://edu.cluster-text.com/
download the movie here
(it is a .wmv, which might require an additional download to run on a Mac)
Mapping the Case Space
(Using Citation Networks to Extract
Distance Functions for Clustering Documents)
http://www.slideshare.net/Danielkatz/sinks-method-paper-presentation-duke-political-networks-conference-2010
Legal Analytics
Class 9 - K-Means & Hierarchical Clustering
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com
michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com
more content available at legalanalyticscourse.com