Legal Analytics Course - Class 9 - Clustering Algorithms (K-Means & Hierarchical Clustering) - Professor Daniel Martin Katz + Professor Michael J Bommarito

Published in: Law

  1. 1. Class 9 K-Means & Hierarchical Clustering Legal Analytics Professor Daniel Martin Katz Professor Michael J Bommarito II legalanalyticscourse.com
  2. 2. Clustering - The Basic Idea access more at legalanalyticscourse.com
  3. 3. Take A Look: These 12 (Adapted from Slides by Victor Lavrenko and Nigel Goddard @ University of Edinburgh)
  4. 4. 72, Female, Human; 3, Female, Horse; 36, Male, Human; 21, Male, Human; 67, Male, Human; 29, Female, Human; 54, Male, Human; 44, Male, Human; 50, Male, Human; 42, Female, Human; 6, Male, Dog; 7, Female, Human
  5. 5. Task = Can We Determine to Which Group the Agent Belongs? Clustering (Unsupervised Learning): f( ), Cluster, Group?
  6. 6. Clustering (Unsupervised Learning): Cluster, f( ), Group?
  8. 8. How did we arrive at these clusters?
  9. 9. Clustering - Some High Level Points
  10. 10. Clustering is Unsupervised Learning
  11. 11. “Similar” is the Key Idea (but it is a slippery concept); Clustering is a Method of Grouping Similar Objects; Clustering is typically Unsupervised Learning
  12. 12. There are a variety of methods used in this area (Agglomerative versus Divisive Methods); “Similar” is the Key Idea (but it is a slippery concept); Clustering is a Method of Grouping Similar Objects; Clustering is typically Unsupervised Learning
  13. 13. There are a variety of methods used in this area (Agglomerative versus Divisive Methods); Remember real data is n-dimensional (which makes implementation / accuracy challenging); “Similar” is the Key Idea (but it is a slippery concept); Clustering is a Method of Grouping Similar Objects; Clustering is typically Unsupervised Learning
  14. 14. The Science of Similarity
  15. 15. What makes two (or more) objects ‘similar’?
  16. 16. As humans, we often place objects into categories, groups, etc.
  17. 17. This is often done without an explicit model (just our mental model(s), etc.)
  18. 18. Similarity is a Slippery Concept (Example via Piyush Rai)
  19. 19. In clustering, we are interested in trying to formalize the idea of ‘similarity’
  20. 20. A typical approach is to project n-dimensional data into a unidimensional ‘similarity index’: dimension 1, dimension 2, dimension 3, ..., dimension n are passed through f( ), the similarity or distance function, to yield the similarity index
  21. 21. Unidimensional similarity spectrum, running from "everything in its own cluster" (i.e. everyone is a special snowflake) to "everything in one cluster"
  22. 22. Unidimensional similarity spectrum, running from "everything in its own cluster" (i.e. everyone is a special snowflake) to "everything in one cluster", with endpoints labeled 0% similarity threshold and 100% similarity threshold; the groupings become interesting as we slide across this spectrum; the hard question is where to stop as we move from left to right
  23. 23. The Heavy Lifting is to develop/apply the optimal similarity/distance function for the substantive problem at issue
  24. 24. Different similarity criteria can lead to different clusterings
  25. 25. Goal for Any Clustering Method: Achieve High Within Cluster Similarity; Achieve Low Cross Cluster Similarity
  26. 26. We Want to Develop a Notion of Distance Between Objects; Similarity is inversely related to distance
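
The idea on slide 26 that similarity falls as distance grows can be made concrete with a small sketch. It assumes Euclidean distance and the conversion `1 / (1 + d)`; both are illustrative choices made here, not something the slides prescribe, and the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Assumption: 1 / (1 + d) is only one of many possible conversions.
    """
    return 1.0 / (1.0 + euclidean_distance(a, b))

p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
print(euclidean_distance(p, q))             # 5.0
print(similarity(p, q) > similarity(p, r))  # True: q is closer to p than r is
```

Swapping in a different distance (for example, one suited to text or categorical data) changes which objects count as similar, which is the point of slide 24.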
  27. 27. K-Means and H-Clust
  28. 28. K Means and Hierarchical Clustering are the Most Popular Approaches Used in Clustering
  29. 29. K-Means
  30. 30. K Means: How do we find the clusters in the data shown below? We select K clusters in advance; Iteratively seek to minimize the sum of squared distances
  31. 31. K Means Optimization: We start with K clusters with unknown centers; We are attempting to minimize the sum of squared distances (i.e. the objective function shown below); The Tricky Part is that this minimization problem cannot be solved analytically
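
The objective function that slide 31 refers to did not survive in this transcript. The standard k-means objective, which is presumably what the slide displayed, can be written as below. The symbols ($x_i$ for a data point, $C_j$ for the j-th cluster, $\mu_j$ for its center) are notation chosen here, not taken from the slides.

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

The inner sum measures how tightly cluster $C_j$ sits around its center, and the outer sum adds this up over all K clusters.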
  32. 32. K Means Optimization: Stuart Lloyd proposed a simple heuristic solution; “Lloyd’s algorithm” aka “k-means” is a good candidate solution (from Flach Text Page 248)
  33. 33. K-Means: a visual example
  34. 34. K-Means where k = 2: initialization step (Adapted from Example by Piyush Rai)
  35. 35. K-Means where k = 2: First Iteration - Assigning Points (Adapted from Example by Piyush Rai)
  36. 36. K-Means where k = 2: First Iteration - Recalculate the Center of the Cluster (Adapted from Example by Piyush Rai)
  37. 37. K-Means where k = 2: Second Iteration - Assigning Points (Adapted from Example by Piyush Rai)
  38. 38. K-Means where k = 2: Second Iteration - Recalculate the Center of the Cluster (Adapted from Example by Piyush Rai)
  39. 39. K-Means where k = 2: Third Iteration - Assigning Points (Adapted from Example by Piyush Rai)
  40. 40. K-Means where k = 2: Third Iteration - Recalculate the Center of the Cluster (Adapted from Example by Piyush Rai)
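
The assign-then-recalculate loop walked through on slides 34 to 40 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's own implementation: the toy points, the starting centers, and the function names are invented here, and an empty cluster simply keeps its old center.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_points(points, centers):
    """Assignment step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def recalculate_centers(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:  # assumption: an empty cluster keeps its old center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def lloyd(points, centers, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_points(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = recalculate_centers(points, labels, centers)
    return labels, centers

# Hypothetical toy data with k = 2 and deliberately poor starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = lloyd(points, centers=[(1.0, 1.0), (2.0, 1.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Even though both starting centers sit inside the lower-left group, the second center is pulled toward the upper-right points after the first update, and the assignments settle on the next pass.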
  41. 41. K Means Clustering: Fast Method But May Converge to a Local Minimum; Should repeat from different starting conditions (must then figure best heuristic to find global min); Important Weakness is that it is often not clear what value of K to use
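
One common way to act on the advice to repeat from different starting conditions is to run the algorithm from several random initializations and keep the run with the lowest sum of squared distances. The sketch below assumes that heuristic; the function names, the number of restarts, and the fixed seed are illustrative choices, and the slides do not say which heuristic the authors prefer.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen starting points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return sse, labels, centers

def kmeans_restarts(points, k, n_starts=20, seed=0):
    """Run several random starts and keep the lowest sum of squared distances."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_starts)), key=lambda r: r[0])

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
best_sse, best_labels, best_centers = kmeans_restarts(points, k=2)
print(round(best_sse, 3), best_labels)
```

Restarts reduce the risk of a poor local minimum but do not guarantee the global one, and they do nothing to settle the separate question of which K to use.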
  42. 42. K-Means Clustering, some helpful videos: https://www.youtube.com/watch?v=Qqg4Fklxqh0 https://www.youtube.com/watch?v=0MQEt10e4NM https://www.youtube.com/watch?v=4shfFAArxSc
  43. 43. H-Clust
  44. 44. Hierarchical Clustering: Partitions can be visualized using a tree structure (a dendrogram); Does not need the number of clusters as input; Possible to view partitions at different levels of granularity (i.e., can refine/coarsen clusters) using different K (Description via Piyush Rai)
  45. 45. http://scaledinnovation.com/analytics/trees/dendrograms.html
  46. 46. Agglomerative versus Divisive Methods. Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
  47. 47. Agglomerative Methods; Divisive Methods
  48. 48. Agglomerative Methods; Divisive Methods; the dendrogram memorializes the splits or order of agglomeration
  49. 49. Groups within Groups within Groups ...
  58. 58. Hierarchical Clustering: “(1) Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters [be] the same as the distances (similarities) between the items they contain. (2) Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less. (3) Compute distances (similarities) between the new cluster and each of the old clusters. (4) Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.” S. C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika, 32:241-254
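
Steps (1) to (4) quoted on slide 58 can be sketched directly. The version below is a naive illustration rather than an efficient or authoritative implementation: it assumes Euclidean distance and single linkage for step 3, recomputes cluster distances from the members on every pass instead of updating a distance matrix, and uses invented toy points and function names.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def single_link(c1, c2, points):
    """Distance between two clusters = distance between their closest members."""
    return min(euclid(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, cluster_distance=single_link):
    """Start with N singleton clusters and merge until one cluster remains."""
    clusters = [(i,) for i in range(len(points))]  # step 1: one item per cluster
    merges = []                                    # merge order, i.e. the dendrogram
    while len(clusters) > 1:                       # step 4: repeat until one cluster
        # step 2: find the closest pair of clusters
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: cluster_distance(pair[0], pair[1], points),
        )
        merges.append((a, b, cluster_distance(a, b, points)))
        # step 3 is implicit: distances to the merged cluster are recomputed next pass
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
for a, b, d in agglomerate(points):
    print(a, b, round(d, 3))
```

The recorded merge list is the information a dendrogram draws: which clusters joined, in what order, and at what distance. Cutting that list at a chosen distance yields a flat partition without fixing the number of clusters in advance.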
  59. 59. Hierarchical Clustering: There are a variety of different approaches to Step 3, "(3) Compute distances (similarities) between the new cluster and each of the old clusters": single-linkage clustering, complete-linkage clustering, average-linkage clustering, centroid linkage clustering (see pages 253-258 of Flach)
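
The four linkage rules named on slide 59 differ only in how they turn member-to-member distances into one cluster-to-cluster distance. The sketch below shows the usual definitions side by side, assuming Euclidean distance; the function names and the two toy clusters are illustrative.

```python
import math

def dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def single_linkage(A, B):
    """Closest pair of members, one from each cluster."""
    return min(dist(p, q) for p in A for q in B)

def complete_linkage(A, B):
    """Farthest pair of members, one from each cluster."""
    return max(dist(p, q) for p in A for q in B)

def average_linkage(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in A for q in B) / (len(A) * len(B))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def centroid_linkage(A, B):
    """Distance between the two cluster means."""
    return dist(centroid(A), centroid(B))

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(A, B), 3))
```

For any pair of clusters the single-linkage value is never larger than the average-linkage value, which is never larger than the complete-linkage value, so the choice of rule can change which pair is merged next.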
  60. 60. Hierarchical Clustering, some helpful videos: https://www.youtube.com/watch?v=zygVdmlS-YA https://www.youtube.com/watch?v=2z5wwyv0Zk4
  61. 61. Implementation in R
  62. 62. https://www.youtube.com/watch?v=M9jb6KrBlPc
  63. 63. https://www.youtube.com/watch?v=sAtnX3UJyN0
  64. 64. https://www.youtube.com/watch?v=v3k8WEOVSYw
  65. 65. Clustering - E-Discovery
  66. 66. E-Discovery is simply information retrieval + context
  67. 67. Trying to locate documents: relevant v. not-relevant; privileged v. not-privileged
  68. 68. Pre-Clustering Documents Can Aid in the Review Process
  69. 69. http://edu.cluster-text.com/
  70. 70. download the movie here (it's a .wmv, which might require an additional download to run on a Mac)
  72. 72. Mapping the Case Space (Using Citation Networks to Extract Distance Functions for Clustering Documents)
  73. 73. http://www.slideshare.net/Danielkatz/sinks-method-paper-presentation-duke-political-networks-conference-2010
  74. 74. Legal Analytics Class 9 - K-Means & Hierarchical Clustering. daniel martin katz: blog | ComputationalLegalStudies; corp | LexPredict; twitter | @computational; site | danielmartinkatz.com. michael j bommarito: blog | ComputationalLegalStudies; corp | LexPredict; twitter | @mjbommar; site | bommaritollc.com. more content available at legalanalyticscourse.com
