# Cluster Analysis

AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.


### Cluster Analysis

1. Summer School “Achievements and Applications of Contemporary Informatics, Mathematics and Physics” (AACIMP 2011), August 8–20, 2011, Kiev, Ukraine. Cluster Analysis. Erik Kropat, University of the Bundeswehr Munich, Institute for Theoretical Computer Science, Mathematics and Operations Research, Neubiberg, Germany.
2. The Knowledge Discovery Process
3. [Diagram: the knowledge discovery pipeline] Raw Data → Pre-Processing (standardizing; missing values / outliers) → Preprocessed Data → Data Mining (patterns, clusters, correlations; automated classification; outlier / anomaly detection; association rule learning) → Patterns → Pattern Evaluation → Knowledge (strategic planning).
4. Clustering
5. Clustering is a tool for data analysis which solves classification problems. Problem: Given n observations, split them into K similar groups. Question: How can we define “similarity”?
6. Similarity: A cluster is a set of entities which are alike, and entities from different clusters are not alike.
7. Distance: A cluster is an aggregation of points such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.
8. Density: Clusters may be described as connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.
9. Min-Max Problem. Homogeneity: objects within the same cluster should be similar to each other. Separation: objects in different clusters should be dissimilar from each other. Maximize the distance between clusters; minimize the distance between objects within a cluster (similarity ⇔ distance).
10. Types of Clustering: Clustering divides into Hierarchical Clustering (agglomerative or divisive) and Partitional Clustering.
11. Similarity and Distance
12. Distance Measures. A metric on a set G is a function d: G × G → R⁺ that satisfies the following conditions: (D1) d(x, y) = 0 ⇔ x = y (identity); (D2) d(x, y) = d(y, x) ≥ 0 for all x, y ∈ G (symmetry and non-negativity); (D3) d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ G (triangle inequality).
13. Examples. Minkowski distance: d_r(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^r )^{1/r}, r ∈ [1, ∞), x, y ∈ R^n. For r = 1 this is the Manhattan distance; for r = 2 the Euclidean distance.
14. Euclidean Distance: d_2(x, y) = ( Σ_{i=1}^{n} (x_i − y_i)^2 )^{1/2}, x, y ∈ R^n. Example: for x = (1, 1) and y = (4, 3), d_2(x, y) = ( (1 − 4)^2 + (1 − 3)^2 )^{1/2} = √13.
15. Manhattan Distance: d_1(x, y) = Σ_{i=1}^{n} |x_i − y_i|, x, y ∈ R^n. Example: for x = (1, 1) and y = (4, 3), d_1(x, y) = |1 − 4| + |1 − 3| = 3 + 2 = 5.
16. Maximum Distance: d_∞(x, y) = max_{1 ≤ i ≤ n} |x_i − y_i|, x, y ∈ R^n. Example: for x = (1, 1) and y = (4, 3), d_∞(x, y) = max(3, 2) = 3.
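
These three distances are easy to check numerically. Below is a minimal Python sketch (assuming NumPy is available; the function names are illustrative), using the points x = (1, 1) and y = (4, 3) from the slides:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance d_r; r = 1 is Manhattan, r = 2 is Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def maximum(x, y):
    """Maximum (Chebyshev) distance d_inf."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 1.0])
y = np.array([4.0, 3.0])

print(minkowski(x, y, 1))  # 5.0        (Manhattan)
print(minkowski(x, y, 2))  # 3.6055...  (Euclidean, sqrt(13))
print(maximum(x, y))       # 3.0        (maximum distance)
```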
17. Similarity Measures. A similarity function on a set G is a function S: G × G → R that satisfies the following conditions: (S1) S(x, y) ≥ 0 for all x, y ∈ G (non-negativity); (S2) S(x, y) ≤ S(x, x) for all x, y ∈ G (auto-similarity); (S3) S(x, y) = S(x, x) ⇔ x = y for all x, y ∈ G (identity). The value of the similarity function is greater when two points are closer.
18. Similarity Measures. There are many different definitions of similarity. Often the following condition is also required: (S4) S(x, y) = S(y, x) for all x, y ∈ G (symmetry).
19. Hierarchical Clustering
20. Dendrogram. [Figure: cluster dendrogram, Euclidean distance with complete linkage; gross national product of EU countries – agriculture (1993). Source: www.isa.uni-stuttgart.de/lehre/SAHBD]
21. Hierarchical Clustering creates a hierarchy of clusters of the set G. Agglomerative clustering: clusters are successively merged together. Divisive clustering: clusters are recursively split.
22. Agglomerative Clustering: merge the two clusters with the smallest distance between them. Step 0: {e1}, {e2}, {e3}, {e4} (4 clusters); Step 1: {e1, e2}, {e3}, {e4} (3 clusters); Step 2: {e1, e2, e3}, {e4} (2 clusters); Step 3: {e1, e2, e3, e4} (1 cluster).
23. Divisive Clustering: choose a cluster that optimally splits into two clusters according to a given criterion. Step 0: {e1, e2, e3, e4} (1 cluster); Step 1: {e1, e2}, {e3, e4} (2 clusters); Step 2: {e1, e2}, {e3}, {e4} (3 clusters); Step 3: {e1}, {e2}, {e3}, {e4} (4 clusters).
24. Agglomerative Clustering
25. Input: n objects G = {e1, ..., en} represented by p-dimensional feature vectors x_1, ..., x_n ∈ R^p, where x_i = (x_{i1}, x_{i2}, x_{i3}, ..., x_{ip}) collects the p feature values of object i.
26. Example I: An online shop collects data from its customers. For each of the n customers there exists a p-dimensional feature vector.
27. Example II: In a clinical trial, laboratory values of a large number of patients are gathered. For each of the n patients there exists a p-dimensional feature vector.
28. Agglomerative Algorithms. Begin with the disjoint clustering C_1 = { {e1}, {e2}, ..., {en} }. Iterate: find the most similar pair of clusters and merge them into a single cluster. Terminate when all objects are in one cluster, C_n = { {e1, e2, ..., en} }. This produces a sequence of clusterings (C_i)_{i=1,...,n} of G with C_{i−1} ⊂ C_i for i = 2, ..., n.
29. What is the distance d(A, B) between two clusters A and B? Different answers lead to various hierarchical clustering algorithms.
30. Agglomerative Hierarchical Clustering. There exist many metrics to measure the distance between clusters; they lead to particular agglomerative clustering methods: Single-Linkage Clustering, Complete-Linkage Clustering, Average-Linkage Clustering, the Centroid Method, ...
31. Single-Linkage Clustering (nearest-neighbor method). The distance between the clusters A and B is the minimum distance between the elements of each cluster: d(A, B) = min { d(a, b) | a ∈ A, b ∈ B }.
32. Single-Linkage Clustering. Advantage: it can detect very long and even curved clusters and can be used to detect outliers. Drawback: the chaining phenomenon, in which clusters that are very distant from each other may be forced together due to single elements being close to each other.
33. Complete-Linkage Clustering (furthest-neighbor method). The distance between the clusters A and B is the maximum distance between the elements of each cluster: d(A, B) = max { d(a, b) | a ∈ A, b ∈ B }.
34. Complete-Linkage Clustering tends to find compact clusters of approximately equal diameter, avoids the chaining phenomenon, and cannot be used for outlier detection.
35. Average-Linkage Clustering. The distance between the clusters A and B is the mean distance between the elements of each cluster: d(A, B) = (1 / (|A| · |B|)) Σ_{a ∈ A, b ∈ B} d(a, b).
36. Centroid Method. The distance between the clusters A and B is the (squared) Euclidean distance of the cluster centroids.
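
The four cluster distances above translate directly into code. A minimal sketch (NumPy assumed; the helper names are illustrative), where `a` and `b` are arrays holding the points of clusters A and B, one point per row:

```python
import numpy as np

def pairwise(a, b):
    """Matrix of Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_linkage(a, b):
    return pairwise(a, b).min()    # d(A,B) = minimum pairwise distance

def complete_linkage(a, b):
    return pairwise(a, b).max()    # d(A,B) = maximum pairwise distance

def average_linkage(a, b):
    return pairwise(a, b).mean()   # d(A,B) = mean over all |A|*|B| pairs

def centroid_distance(a, b):
    # squared Euclidean distance between the two cluster centroids
    return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
```

An agglomerative algorithm as on slide 28 would then repeatedly merge the pair of clusters that minimizes whichever of these distances is chosen.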
37. [Figure: overview of the cluster distances d(A, B) used by single linkage, complete linkage, average linkage, and the centroid method.]
38. Bioinformatics application: hierarchical clustering of gene-expression data. Alizadeh et al., Nature 403 (2000), pp. 503–511.
39. Exercise. [Map showing the four cities Berlin, Kiev, Paris, and Odessa.]
40. Exercise. The following table shows the distances (in km) between 4 cities:

|        | Kiev | Odessa | Berlin | Paris |
|--------|------|--------|--------|-------|
| Kiev   | –    | 440    | 1200   | 2000  |
| Odessa | 440  | –      | 1400   | 2100  |
| Berlin | 1200 | 1400   | –      | 900   |
| Paris  | 2000 | 2100   | 900    | –     |

Determine a hierarchical clustering with the single-linkage method.
41. Solution (Single Linkage), Step 0: clustering {Kiev}, {Odessa}, {Berlin}, {Paris}. The distances between the clusters are given by the table above.
42. Step 0 (continued): the minimal distance is 440, between Kiev and Odessa ⇒ merge the clusters {Kiev} and {Odessa}; distance value: 440.
43. Step 1: clustering {Kiev, Odessa}, {Berlin}, {Paris}. Distances between clusters: {Kiev, Odessa}–{Berlin} 1200, {Kiev, Odessa}–{Paris} 2000, {Berlin}–{Paris} 900.
44. Step 1 (continued): the minimal distance is 900 ⇒ merge the clusters {Berlin} and {Paris}; distance value: 900.
45. Step 2: clustering {Kiev, Odessa}, {Berlin, Paris}. The remaining distance {Kiev, Odessa}–{Berlin, Paris} is 1200 ⇒ merge the two clusters; distance value: 1200.
46. Step 3: clustering {Kiev, Odessa, Berlin, Paris}.
47. Solution (Single Linkage), hierarchy: Kiev and Odessa join at distance 440, Berlin and Paris join at distance 900, and the two resulting clusters join at distance 1200, giving 4 → 3 → 2 → 1 clusters.
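
The worked example can be verified mechanically. A minimal sketch using SciPy's `scipy.cluster.hierarchy.linkage` (assuming SciPy is installed), with the objects ordered Kiev, Odessa, Berlin, Paris:

```python
from scipy.cluster.hierarchy import linkage

# Objects 0..3 correspond to Kiev, Odessa, Berlin, Paris.
# Condensed distance matrix: upper triangle of the table, row by row.
d = [440, 1200, 2000, 1400, 2100, 900]

# Each row of Z lists the two merged clusters, the merge distance,
# and the size of the newly formed cluster.
Z = linkage(d, method="single")
print(Z)
# Merge distances 440, 900, 1200 reproduce the three steps above.
```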
48. Divisive Clustering
49. Divisive Algorithms. Begin with one cluster C_1 = { {e1, e2, ..., en} }. Iterate: choose a cluster C_f that optimally splits into two clusters C_i and C_j according to a given criterion. Terminate when all objects are in disjoint clusters, C_n = { {e1}, {e2}, ..., {en} }. This produces a sequence of clusterings (C_i)_{i=1,...,n} of G with C_i ⊃ C_{i+1} for i = 1, ..., n−1.
50. Partitional Clustering – Minimal Distance Methods
51. Partitional Clustering aims to partition n observations into K clusters. The number of clusters K and an initial partition are given; the initial partition is considered “not optimal” and is iteratively repartitioned until a final partition is reached. [Figure: initial and final partition for K = 2.]
52. Partitional Clustering. Differences from hierarchical clustering: the number of clusters is fixed, and an object can change its cluster. The initial partition is obtained randomly or by applying a hierarchical clustering algorithm in advance. The number of clusters can be estimated by specialized methods (e.g., the silhouette) or, again, by applying a hierarchical clustering algorithm in advance.
53. Partitional Clustering – Methods. In this course we introduce the minimal distance methods K-Means and Fuzzy-c-Means.
54. K-Means
55. K-Means aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean: find K cluster centroids µ_1, ..., µ_K that minimize the objective function J = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(µ_i, x)^2.
56. [Figure: a set G split into clusters C1, C2, C3 with their centroids marked.]
57. K-Means (Minimal Distance Method). Given: n objects, K clusters. 1. Determine an initial partition. 2. Calculate the cluster centroids. 3. For each object, calculate the distances to all cluster centroids. 4. If the distance to the centroid of another cluster is smaller than the distance to the current cluster centroid, assign the object to that other cluster. 5. If any cluster was repartitioned: GOTO 2; else: STOP.
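
A minimal NumPy sketch of this loop, assuming Euclidean distances and a random initial partition (illustrative only; empty clusters are not handled):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Minimal-distance K-Means: alternate centroid update and reassignment.

    X: (n, p) data matrix. This sketch does not guard against empty clusters.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))        # step 1: initial partition
    for _ in range(max_iter):
        # step 2: centroid of each cluster
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])
        # step 3: distance of every object to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        # step 4: assign each object to the nearest centroid
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 5: stop when stable
            break
        labels = new_labels
    return labels, centroids
```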
58. Example: [Figure: an initial partition with centroids marked, and the final partition after repartitioning.]
59. Exercise: [Figure: an initial partition with centroids marked; determine the final partition.]
60. K-Means does not determine the globally optimal partition; the final partition obtained by K-Means depends on the initial partition.
61. Hard Clustering / Soft Clustering. Hard clustering: each object is a member of exactly one cluster (e.g., K-Means). Soft clustering: each object has a fractional membership in all clusters (e.g., Fuzzy-c-Means).
62. Fuzzy-c-Means
63. Fuzzy Clustering vs. Hard Clustering. When clusters are well separated, hard clustering (K-Means) makes sense. In many cases, however, clusters are not well separated, and hard clustering then assigns borderline objects to a cluster in an arbitrary manner.
64. Fuzzy Set Theory was introduced by Lotfi Zadeh in 1965. An object can belong to a set with a degree of membership between 0 and 1. Classical set theory is the special case of fuzzy set theory in which membership values are restricted to either 0 or 1.
65. Fuzzy Clustering is based on fuzzy logic and fuzzy set theory. Objects can belong to more than one cluster: each object belongs to all clusters with some weight (degree of membership) between 0 and 1.
66. Hard Clustering (K-Means): the number K of clusters is given and each object is assigned to exactly one cluster. Example partition matrix (rows: clusters, columns: objects):

|    | e1 | e2 | e3 | e4 |
|----|----|----|----|----|
| C1 | 0  | 1  | 0  | 0  |
| C2 | 1  | 0  | 0  | 0  |
| C3 | 0  | 0  | 1  | 1  |

67. Fuzzy Clustering (Fuzzy-c-Means): the number c of clusters is given and each object has a fractional membership in all clusters; there is no strict subdivision into clusters. Example membership matrix:

|    | e1  | e2  | e3  | e4  |
|----|-----|-----|-----|-----|
| C1 | 0.8 | 0.2 | 0.1 | 0.0 |
| C2 | 0.2 | 0.2 | 0.2 | 0.0 |
| C3 | 0.0 | 0.6 | 0.7 | 1.0 |
| Σ  | 1   | 1   | 1   | 1   |
68. Fuzzy-c-Means. Membership matrix U = (u_{ik}) ∈ [0, 1]^{c × n}. The entry u_{ik} denotes the degree of membership of object k in cluster i (rows: clusters 1, ..., c; columns: objects 1, ..., n).
69. Restrictions on the membership matrix: 1. All weights for a given object e_k must add up to 1: Σ_{i=1}^{c} u_{ik} = 1 for k = 1, ..., n. 2. Each cluster contains, with non-zero weight, at least one object, but does not contain, with a weight of one, all objects: 0 < Σ_{k=1}^{n} u_{ik} < n for i = 1, ..., c.
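
Both restrictions are easy to verify for the membership matrix of slide 67. A small sketch (NumPy assumed):

```python
import numpy as np

# Membership matrix from the fuzzy clustering example
# (rows: clusters C1..C3, columns: objects e1..e4).
U = np.array([[0.8, 0.2, 0.1, 0.0],
              [0.2, 0.2, 0.2, 0.0],
              [0.0, 0.6, 0.7, 1.0]])

c, n = U.shape
print(np.allclose(U.sum(axis=0), 1.0))                     # each column sums to 1
print(np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < n)))   # 0 < each row sum < n
```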
70. Fuzzy-c-Means. Vector of prototypes (cluster centroids): V = (v_1, ..., v_c)^T. Remark: the cluster centroids and the membership matrix are initialized randomly and afterwards iteratively optimized.
71. Fuzzy-c-Means, algorithm: 1. Select an initial fuzzy partition U = (u_{ik}), i.e. assign values to all u_{ik}. 2. Repeat: 3. compute the centroid of each cluster using the fuzzy partition; 4. update the fuzzy partition U = (u_{ik}); 5. until the centroids do not change. Another stopping criterion: the change in the u_{ik} is below a given threshold.
72. K-Means and Fuzzy-c-Means both attempt to minimize the sum of the squared errors (SSE). In K-Means: SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(v_i, x)^2. In Fuzzy-c-Means: SSE = Σ_{i=1}^{c} Σ_{k=1}^{n} u_{ik}^m · dist(v_i, x_k)^2, where m ∈ [1, ∞] is a parameter (the fuzzifier) that determines the influence of the weights.
73. Computing the cluster centroids: for each cluster i = 1, ..., c the centroid is defined by v_i = ( Σ_{k=1}^{n} u_{ik}^m x_k ) / ( Σ_{k=1}^{n} u_{ik}^m )   (V). This extends the definition of centroids in K-Means: all points are considered, and the contribution of each point to a centroid is weighted by its membership degrees.
74. Update of the fuzzy partition (membership matrix): minimizing the SSE subject to the constraints leads to the update formula u_{ik} = 1 / Σ_{s=1}^{c} ( dist(v_i, x_k)^2 / dist(v_s, x_k)^2 )^{1/(m−1)}   (U).
75. Fuzzy-c-Means. Initialization: determine (randomly) the matrix U of membership grades and the matrix V of cluster centroids. Iteration: update the matrix V of cluster centroids with (V) and the matrix U of membership grades with (U), until the cluster centroids are stable or the maximum number of iterations is reached.
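
Putting the pieces together, here is a minimal NumPy sketch of the whole algorithm, implementing the centroid formula (V) and the membership update (U) above; the fuzzifier m = 2 and the stopping test on the change in the u_{ik} are assumptions of this sketch:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy-c-Means: alternate the updates (V) and (U).

    X: (n, p) data matrix; returns the membership matrix U (c x n)
    and the centroid matrix V (c x p).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                        # columns must sum to 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)            # formula (V)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # dist(v_i, x_k)^2
        d2 = np.maximum(d2, 1e-12)            # guard against division by zero
        # formula (U): u_ik = 1 / sum_s (d2_ik / d2_sk)^(1/(m-1))
        U_new = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < tol:     # change in u_ik below threshold
            return U_new, V
        U = U_new
    return U, V
```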
76. Fuzzy-c-Means depends on the Euclidean metric ⇒ spherical clusters. Other metrics can be applied to obtain different cluster shapes; a fuzzy covariance matrix (Gustafson/Kessel 1979) ⇒ ellipsoidal clusters.
77. Cluster Validity Indexes
78. Cluster Validity Indexes. Fuzzy-c-Means requires the number of clusters as input. Question: how can we determine the “optimal” number of clusters? Idea: determine the cluster partition for a given number of clusters, then evaluate the partition by a cluster validity index. Method: calculate the cluster validity index for every candidate number of clusters, then determine the optimal number. Note: cluster validity indexes usually do not depend on the clustering algorithm.
79. Partition Coefficient (Bezdek 1981): PC(c) = (1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_{ik}^2, for 2 ≤ c ≤ n−1. Optimal number of clusters c∗: PC(c∗) = max_{2 ≤ c ≤ n−1} PC(c).
80. Partition Entropy (Bezdek 1974): PE(c) = −(1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_{ik} log₂ u_{ik}, for 2 ≤ c ≤ n−1. Optimal number of clusters c∗: PE(c∗) = min_{2 ≤ c ≤ n−1} PE(c). Drawback of PC and PE: only the degrees of membership are considered; the geometry of the data set is neglected.
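
Both indexes are one-liners given a membership matrix U. A minimal sketch (NumPy assumed; `fuzzy_c_means` refers to the earlier sketch):

```python
import numpy as np

def partition_coefficient(U):
    """PC(c) = (1/n) * sum_i sum_k u_ik^2; the optimal c maximizes PC."""
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U):
    """PE(c) = -(1/n) * sum_i sum_k u_ik * log2(u_ik); the optimal c minimizes PE."""
    safe = np.where(U > 0, U, 1.0)   # convention: 0 * log2(0) = 0
    return -(U * np.log2(safe)).sum() / U.shape[1]

# Typical use: run Fuzzy-c-Means for each candidate c and compare the indexes.
# for c in range(2, 8):
#     U, V = fuzzy_c_means(X, c)
#     print(c, partition_coefficient(U), partition_entropy(U))
```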
81. Fukuyama–Sugeno Index (Fukuyama/Sugeno 1989): FS(c) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_{ik}^m dist(v_i, x_k)^2 (compactness of the clusters) − Σ_{i=1}^{c} Σ_{k=1}^{n} u_{ik}^m dist(v_i, v̄)^2 (separation of the clusters), where v̄ = (1/c) Σ_{i=1}^{c} v_i. Optimal number of clusters c∗: FS(c∗) = min_{2 ≤ c ≤ n−1} FS(c).
82. Application
83. Data Mining and Decision Support Systems – Landslide Events (UniBw, Geoinformatics Group: W. Reinhardt, E. Nuhn). Spatial data mining / early-warning systems for landslide events; fuzzy clustering approaches (feature weighting). Data: measurements (pressure values, tension, deformation vectors) and simulations (finite-element model).
84. Hard Clustering: data → partition. Problem: uncertain data from measurements and simulations.
85. Fuzzy Clustering: data → fuzzy clusters → fuzzy partition.
86. Fuzzy Clustering
87. Feature Weighting. Nuhn/Kropat/Reinhardt/Pickl: Preparation of complex landslide simulation results with clustering approaches for decision support and early warning. Submitted to the Hawaii International Conference on System Sciences (HICSS 45), Grand Wailea, Maui, 2012.
88. Thank you very much!