Statistical Clustering


Clustering using Ward’s ESS, with a review of methods and concepts

Published in: Technology

  1. 1. Nearest Neighbor based approaches to Multivariate Data Analysis Tim Hare
  2. 2. We can measure a multivariate item’s similarity to the other (n) items via its distance from those ITEMS in variable (p) space <ul><li>Distance measures dissimilarity: a small distance means high similarity (the two terms tend to get used interchangeably) </li></ul><ul><ul><li>We can use Euclidean distance ( others to discuss if we have time ) </li></ul></ul><ul><ul><li>Distance (similarity) searching works regardless of dimension </li></ul></ul>
  3. 3. Nearest Neighbor Searching Locate the nearest multivariate neighbors in p-space <ul><li>Compute the distance from the target to all items </li></ul><ul><ul><li>Retain all those within some distance criterion (d), or </li></ul></ul><ul><ul><li>Retain the closest items up to some upper limit, say (k). </li></ul></ul><ul><li>Uses? </li></ul><ul><ul><li>Fill in missing variable values with the weighted MEAN of the k most similar items </li></ul></ul><ul><ul><li>Predict the future value of a variable for the current record from what happened in past records with similar antecedent variables? </li></ul></ul><ul><li>What else can we do? </li></ul>
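The search and the missing-value use above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names, the inverse-distance weighting, and the toy data are all assumptions made for the example.

```python
import math

def nearest_neighbors(target, items, k=None, d=None):
    """Return (distance, index) pairs sorted by distance to target.

    d keeps only items within that distance; k caps the number kept.
    """
    ranked = sorted((math.dist(target, item), i) for i, item in enumerate(items))
    if d is not None:
        ranked = [(dist, i) for dist, i in ranked if dist <= d]
    if k is not None:
        ranked = ranked[:k]
    return ranked

def impute(target, items, values, k=3):
    """Weighted mean of `values` over the k nearest items.

    Inverse-distance weights are an assumption; the slide only says "weighted".
    """
    neighbors = nearest_neighbors(target, items, k=k)
    weights = [1.0 / (dist + 1e-9) for dist, _ in neighbors]
    total = sum(w * values[i] for w, (_, i) in zip(weights, neighbors))
    return total / sum(weights)

# Hypothetical toy data: four items in p=2 space and one extra variable.
items = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
values = [10.0, 20.0, 30.0, 100.0]
near = nearest_neighbors((0.0, 0.1), items, k=2)
estimate = impute((0.0, 0.1), items, values, k=2)
```

The estimate is pulled toward the value of the closest item because its weight is largest.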
  4. 4. Clustering Approaches <ul><li>Hierarchical Clustering: </li></ul><ul><ul><li>Agglomerative OR Divisive </li></ul></ul><ul><ul><li>we can group items (where distance = similarity) </li></ul></ul><ul><ul><li>we can group variables (where correlation = similarity) </li></ul></ul><ul><ul><ul><li>Can use correlation coefficients for continuous random variables </li></ul></ul></ul><ul><ul><ul><li>Can use Binary Weighting Schemes </li></ul></ul></ul><ul><ul><ul><ul><li>for presence or absence of certain characteristics (0,1 component values of item vector) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>See P674-678 Johnson & Wichern </li></ul></ul></ul></ul><ul><li>Non-Hierarchical Clustering </li></ul><ul><ul><li>Divisive only (?) </li></ul></ul><ul><ul><li>we group items only (where distance = similarity) . </li></ul></ul><ul><li>Statistical Clustering </li></ul><ul><ul><li>More recent </li></ul></ul><ul><ul><li>Based on density estimates and mixture density estimation </li></ul></ul><ul><ul><li>SAS appears to offer nonparametric density-estimate-based clustering via PROC MODECLUS. </li></ul></ul><ul><ul><li>Parametric density-estimate-based clustering is discussed in Johnson & Wichern (Section 12.5) </li></ul></ul><ul><ul><li>The R language appears to offer parametric density-estimate-based statistical clustering via MCLUST (P705). </li></ul></ul>
  5. 5. Non-Hierarchical Divisive <ul><li>The simple K-means Non-Hierarchical Divisive Process </li></ul><ul><ul><li>Pick K*, our initial “seed” number of clusters, which we hope is the true number of clusters K </li></ul></ul><ul><ul><li>Carry out RANDOM SAMPLING of the data set to establish “seed” Centroids (average location of cluster members) </li></ul></ul><ul><ul><li>Go through the list of items and reassign those that are closer to a “competing” cluster’s Centroid </li></ul></ul><ul><ul><li>Calculate updated Centroid values and repeat the process until no reassignments of items take place. </li></ul></ul><ul><li>GOOD: Membership of items in clusters is fluid and can change at ANY time. </li></ul><ul><li>BAD: Since K-means relies on RANDOM SAMPLING you may not pick up on RARE groups, so they may not appear as distinct clusters: K* < K </li></ul><ul><li>BAD: In simple K-means, K is fixed in advance; however, newer methods that iteratively adapt K seem to be available (P696, Johnson & Wichern). </li></ul><ul><li>For examples see P696-701 Johnson & Wichern </li></ul>
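The four K-means steps above can be sketched as follows. This is a simplified illustration in Python rather than SAS; the function name, the `seed` and `max_iter` parameters, and the toy data are assumptions for the example.

```python
import math
import random

def kmeans(items, k, seed=0, max_iter=100):
    """Simple K-means: random seed centroids, then reassign until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(items, k)              # RANDOM SAMPLING of seeds
    labels = [None] * len(items)
    for _ in range(max_iter):
        # Assign every item to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(item, centroids[c]))
            for item in items
        ]
        if new_labels == labels:                  # no reassignments: stop
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [item for item, lab in zip(items, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three items each.
items = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(items, k=2)
```

Because the seeds are sampled at random, a rare group may receive no seed at all, which is the K* < K risk noted on the slide.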
  6. 6. Hierarchical Agglomerative Clustering <ul><li>The Hierarchical Agglomerative Process </li></ul><ul><ul><li>Estimate distance (similarity) between items </li></ul></ul><ul><ul><ul><li>create a “distance matrix” of each item vs every other item </li></ul></ul></ul><ul><ul><li>Assign those with distance below some threshold to common clusters to create a mix of initial clusters and residual items </li></ul></ul><ul><ul><li>Increase the distance threshold for selection and repeat the process. </li></ul></ul><ul><ul><li>Keep track of the distance at which clusters/items were merged: a merge at close proximity is better than one at a large distance </li></ul></ul><ul><ul><li>Result? Eventually all items will be assigned to a single cluster </li></ul></ul><ul><ul><li>What is the correct number of clusters? </li></ul></ul><ul><ul><ul><li>Onus on user to analyze the data in a number of ways, similar to Factor Analysis </li></ul></ul></ul><ul><ul><ul><li>We must build a case in order to decide which representation best reflects the real data structure </li></ul></ul></ul><ul><ul><ul><li>We’ll need to use various metrics or surrogate markers of success in this regard, because the data are typically high-dimensional </li></ul></ul></ul><ul><ul><ul><li>“Stress test” our solution by alternative approaches: do they produce the same results? </li></ul></ul></ul>
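A minimal Python sketch of the agglomerative loop follows. It uses the common "merge the closest pair, record the merge height" formulation, which is an assumption and not necessarily the threshold-stepping variant described above; the closest-pair rule between clusters and the toy data are also assumptions.

```python
import math

def closest_pair_distance(a, b):
    """Distance between two clusters, taken as their closest pair of items."""
    return min(math.dist(x, y) for x in a for y in b)

def agglomerate(items, linkage=closest_pair_distance):
    """Merge the two closest clusters until one remains.

    Returns a history of (merge height, clusters remaining after the merge).
    """
    clusters = [[item] for item in items]         # every item starts alone
    history = []
    while len(clusters) > 1:
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
        history.append((height, len(clusters)))
    return history

# Hypothetical one-dimensional toy data.
history = agglomerate([(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)])
heights = [h for h, _ in history]
```

A sharp rise in the recorded merge heights is one informal cue for where to stop merging.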
  7. 7. Distance is not enough to deal with objects that have dimension themselves: “LINKAGE” <ul><li>Clusters of items have “VOLUME” -- they aren’t points </li></ul><ul><li>The distance between, say, two bags of marbles is hard to specify </li></ul><ul><ul><li>Measure distance from an estimate of the center: the Centroid ? </li></ul></ul><ul><ul><li>Measure from inner edge closest point? </li></ul></ul><ul><ul><li>Measure from outer edge farthest point? </li></ul></ul><ul><li>“ LINKAGE” specifies how we use DISTANCE in CLUSTERING </li></ul><ul><ul><li>In SAS distance and Linkage are often combined in a “METHOD” </li></ul></ul>
  8. 8. SINGLE vs COMPLETE linkage (PROC CLUSTER METHOD=SINGLE / METHOD=COMPLETE) <ul><li>Single: d(A,B) = min d(a,b) and Complete: d(A,B) = max d(a,b), taken over all pairs with a in A and b in B </li></ul><ul><li>where A,B = clusters </li></ul>CHAINING during single linkage clustering: one of the few ways to delineate non-ellipsoidal clusters, but it can be misleading in that items on opposite ends of a cluster are likely to be quite different. [Figure: min and max distances between clusters S and Q; resulting clusters under single linkage]
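The two definitions differ only in taking the minimum or the maximum over all cross-cluster pairs, as this small Python sketch shows. The function names and the chain-like toy data are assumptions for illustration.

```python
import math

def single_linkage(A, B):
    """Smallest distance between any item of A and any item of B."""
    return min(math.dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Largest distance between any item of A and any item of B."""
    return max(math.dist(a, b) for a in A for b in B)

# Hypothetical chain of evenly spaced items split into two clusters.
A = [(0.0,), (1.0,), (2.0,)]
B = [(3.0,), (4.0,)]
d_single = single_linkage(A, B)      # neighbors at the join are close
d_complete = complete_linkage(A, B)  # the far ends are much further apart
```

On chain-shaped data single linkage sees the two clusters as adjacent while complete linkage reports the full end-to-end span, which is the chaining effect described above.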
  9. 9. AVERAGE linkage (PROC CLUSTER METHOD=AVERAGE) <ul><li>d(A,B) = [ Σ_i Σ_j d(a_i, b_j) ] / (M×N), where M and N are the numbers of items in A and B </li></ul><ul><li>As one would expect, less influenced by outliers than SINGLE or COMPLETE </li></ul>Example with M = N = 3: [ d(A1,B1)+d(A1,B2)+d(A1,B3) + d(A2,B1)+d(A2,B2)+d(A2,B3) + d(A3,B1)+d(A3,B2)+d(A3,B3) ]/9
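As a hedged Python sketch of the average-linkage formula, using plain Euclidean distance between items (an assumption; the toy clusters are also made up for the example):

```python
import math

def average_linkage(A, B):
    """Mean of all M x N pairwise distances between items of A and items of B."""
    total = sum(math.dist(a, b) for a in A for b in B)
    return total / (len(A) * len(B))

# Hypothetical clusters forming the corners of a 4-by-3 rectangle.
A = [(0.0, 0.0), (0.0, 3.0)]
B = [(4.0, 0.0), (4.0, 3.0)]
d_avg = average_linkage(A, B)   # pairs are 4, 5, 5, 4 apart
```

Every pair contributes equally, so a single extreme item shifts the result less than it would under the min or max rules.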
  10. 10. Ward’s Method (PROC CLUSTER METHOD=WARD) <ul><li>Ward’s Method: Error Sum of Squares (ESS) </li></ul><ul><ul><li>ESS(k) = sum of squared differences between the Centroid (cluster average) of cluster k and each of its members </li></ul></ul><ul><ul><li>ESS = Σ ESS(k), summed over clusters k = 1 to K </li></ul></ul><ul><li>For example, </li></ul><ul><ul><li>a large increase in ESS on a merge, or at the end of a run, is an indication of a bad match or bad result. </li></ul></ul><ul><ul><li>As we lose clusters by agglomeration, ESS goes up. </li></ul></ul><ul><ul><li>In the final single cluster, ESS is at MAX. </li></ul></ul><ul><ul><li>At the initial state (every item its own cluster), ESS=0 </li></ul></ul><ul><ul><li>At intermediate stages we like to see cluster mergers that don’t increase ESS much. </li></ul></ul><ul><ul><li>ESS is used to decide which two clusters to merge: at each merge operation, search for the pair giving the smallest ESS increase. </li></ul></ul><ul><li>Dividing the increase in Ward’s ESS at a merge by the total SS (TSS) gives (or is at least closely related to, with respect to normalization) the semi-partial R-squared (_SPRSQ_) reported by PROC CLUSTER </li></ul><ul><li>Certain assumptions are associated with Ward’s method (MVN, equal spherical covariance matrices, equal sampling probabilities), so data normalization + verification is required. </li></ul><ul><li>You could think of Ward’s Method as an ANOVA: if we keep the null, the two clusters really aren’t distinct and so can be merged. If we reject the null, the question is how different the clusters are, and if they are TOO different, we don’t want to merge them. </li></ul><ul><li>Notice we’re not explicitly choosing a measure of DISTANCE or LINKAGE: this makes a nice contrast to distance/linkage approaches, allowing us to “stress test” our final results. </li></ul>
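The ESS bookkeeping can be illustrated with a short Python sketch. It is a brute-force version for small data, not how PROC CLUSTER computes it; the function names and the toy clusters are assumptions.

```python
def ess(cluster):
    """Error sum of squares: squared deviations of members from their centroid."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(item, centroid))
        for item in cluster
    )

def ward_increase(a, b):
    """Increase in total ESS if clusters a and b were merged."""
    return ess(a + b) - ess(a) - ess(b)

def best_merge(clusters):
    """Return (ESS increase, i, j) for the pair whose merge raises ESS least."""
    return min(
        (ward_increase(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )

# Hypothetical starting state: three singleton clusters, so total ESS is 0.
clusters = [[(0.0, 0.0)], [(2.0, 0.0)], [(10.0, 0.0)]]
increase, i, j = best_merge(clusters)
```

Here the first two items merge first because joining them raises ESS by only 2, against 50 and 32 for the other pairs.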
  11. 11. SAS options for Data Normalization <ul><li>PROC ACECLUS </li></ul><ul><ul><li>Approximate Covariance Estimation for CLUStering </li></ul></ul><ul><ul><li>normalizes data by an estimate (based upon sampling) of the within-cluster covariance matrix </li></ul></ul><ul><ul><li>Usually start with a range of values for the PROPORTION of data sampled, with 0<p<1; runs ranging from 0.01 to 0.5 are useful (we’ll use p=0.03) </li></ul></ul><ul><ul><li>Useful in conjunction with Ward’s Method </li></ul></ul><ul><li>PROC STDIZE </li></ul><ul><ul><li>z-transforms, etc </li></ul></ul>
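For the z-transform option, a minimal Python sketch of what standardizing one variable means is shown below. It illustrates the idea only and does not reproduce PROC STDIZE; using the sample standard deviation (n - 1) is an assumption. The five values are the first Birth values from the SAS data step in this deck.

```python
import statistics

def z_transform(column):
    """Center a variable on its mean and scale by its sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)      # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in column]

birth = [24.7, 12.5, 13.4, 12.0, 11.6]
z_birth = z_transform(birth)
```

After the transform each variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculation just because of its units.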
  12. 12. PROC ACECLUS output from Poverty Data set (p=3): QQ-PLOTS to check MVN on the transformed variables (can1, can2, can3), which is needed for Ward’s method. Rq(can1)=0.951, Rq(can2)=0.981, Rq(can3)=0.976, where n=97 and the critical point is RqCP=0.9895 at α=0.1. All three values fall below the critical point, so normality is not strictly supported at this level. A more thorough investigation would involve outlier detection and removal as well as data transform testing (BOX-COX)
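The Rq statistic is the correlation between the ordered data and the matching standard normal quantiles. A hedged Python sketch follows; the (j - 0.5)/n plotting positions are an assumption about the convention used, and the two samples are synthetic.

```python
from statistics import NormalDist, fmean

def rq(sample):
    """Correlation between sorted data and standard normal quantiles (QQ-plot)."""
    xs = sorted(sample)
    n = len(xs)
    # Plotting positions (j - 0.5) / n are an assumed convention.
    qs = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    mx, mq = fmean(xs), fmean(qs)
    sxq = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxq / (sxx * sqq) ** 0.5

# Synthetic samples: exact normal quantiles vs a strongly skewed sequence.
normal_like = [NormalDist(5.0, 2.0).inv_cdf((j - 0.5) / 20) for j in range(1, 21)]
skewed = [2.0 ** j for j in range(20)]
rq_normal = rq(normal_like)
rq_skewed = rq(skewed)
```

A value close to 1 means the QQ-plot is nearly a straight line; the computed value is then compared against a tabulated critical point for the sample size.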
  13. 13. Minimal code needed for a cluster analysis [Code screenshot not reproduced. Annotations: generate a data set with only the resulting number of clusters we wish to examine, for use in PLOTTING if needed; sampling proportion: try values from 0.01 to 0.5]
  14. 14. PROC TREE output: how many clusters do we think are appropriate? (Distance criterion and value at time of merger on horizontal axis) [Figure panels: Ward’s (?); Average]
  15. 16. Pseudo-F Statistic Plot Interpretation
  16. 17. Pseudo-T2 Statistic Plot Interpretation
  17. 18. Comparison of CCC, Pseudo-F, Pseudo-T2 under different clustering runs varying distance, linkage and normalization. If we didn’t have a low-dimensional variable set (p=3) it would be impossible to build a case on AVERAGE and SINGLE linkage. [Figure panels: Euclidean dist, AVG linkage, ACECLUS normalized (?); Ward linkage, ACECLUS normalized (what we want to see); Single linkage, ACECLUS normalized (?)]
  18. 19. Birth Rate vs Death Rate Notice the evidence for the known bias in Ward toward equal numbers of observations per cluster, whereas with AVG the process allows us to have some small clusters in the lower right. The maximum-likelihood (METHOD=EML) method in PROC CLUSTER produces similar results to Ward’s method, but with a slight bias in the opposite direction, toward clusters of unequal sizes. [Figure panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm]
  19. 20. Birth Rate vs Infant-Death Rate [Figure panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm]
  20. 21. Death Rate vs Infant-Death Rate [Figure panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm]
  21. 22. Lessons learned? <ul><li>Since we used a low-dimensional data set we can judge our success to some degree: </li></ul><ul><ul><li>But how would it be with a dimension p=20? </li></ul></ul><ul><ul><li>CCC, Pseudo-F, and Pseudo-T2 become critical </li></ul></ul><ul><ul><li>Try different linkage/distance approaches </li></ul></ul><ul><ul><li>Try different normalization approaches </li></ul></ul><ul><li>Consider the possibility that your data may need more variable space to be clustered, or a different variable space: our particular p=3 were sufficient here to distinguish countries, but perhaps other/additional variables would allow better clustering? </li></ul><ul><ul><li>Could try clustering with and without PCA in advance, perhaps? </li></ul></ul><ul><li>Keep in mind that certain methods have certain ASSUMPTIONS: for example, Ward’s method assumes an MVN density mixture, along with others. It was necessary to use appropriate normalization (ACECLUS) and verification to speak to these assumptions. </li></ul><ul><li>These appear to be “diffuse, poorly differentiated” clusters, and we had our only real success with Ward’s Method, in that we would NOT have been able to interpret higher-dimensional data in the other instances via the critical metrics CCC, Pseudo-F, and Pseudo-T2 </li></ul><ul><li>That said, the result appears to be stable across two very different clustering approaches (Ward’s Method and Average Linkage), so that’s encouraging. Yet we would not KNOW this from the CCC, Pseudo-F, and Pseudo-T2 alone! </li></ul>
  22. 23. Q & A
  23. 24. Here’s an example of the risk of “bad” Hierarchical Agglomerative clustering early on: small run on 8 items shows us divergence in cluster membership. If the final cluster number were 4, then we’d have different results from these two runs. Which would be best? Slight difference in clustering with a robust approach but bad approaches can result in significant differences that will not be undone as Hierarchical Agglomerative clustering proceeds.
  24. 25. MVN and outlier sensitivity of Ward’s linkage: Test on a small 4 item sample to show the effect of clustering with ACECLUS normalization (left) and NO normalization (right) under Ward’s linkage method: clustering is somewhat different.
  25. 26. METHOD=WARD in PROC CLUSTER (P692-693, Johnson & Wichern)
  26. 27. Ward’s + Aceclus <ul><li>Ward’s method assumes MVN data and is also sensitive to OUTLIERS. </li></ul><ul><ul><li>It would be good to use QQ-Plots </li></ul></ul><ul><ul><li>rQ tests </li></ul></ul><ul><ul><li>identify and remove outliers </li></ul></ul><ul><li>Ward also assumes we have equal spherical covariance matrices and equal sampling probabilities </li></ul><ul><ul><li>we can’t inspect our CLUSTER var-cov (S) matrices before clustering, as the clusters don’t exist yet </li></ul></ul><ul><ul><li>we can however use PROC ACECLUS to produce a SPHERICAL within-cluster covariance matrix (it estimates the within-cluster covariance using a small sample of ~3%, which is specified in code) </li></ul></ul><ul><ul><li>We can also go BACK after our clustering is done and inspect clusters, then repeat the analysis under NEW assumptions, perhaps. </li></ul></ul><ul><ul><li>We also don’t necessarily have equal SAMPLING probabilities within the CLUSTER, and we might inspect and redo the analysis on this basis as well. </li></ul></ul><ul><li>Also, we can’t assume VARIABLES within the data set have equal variance, so we need either a Z-transform or ACECLUS or some other normalization. </li></ul>
  27. 29. We need a stopping criterion: what is the best number of clusters to use? We don’t want too few clusters and/or a RISE in SPRSQ. [Figure annotations: large jump in SPRSQ; small increase in SPRSQ; intermediate increase in SPRSQ]
  28. 30. How to interpret the PROC CLUSTER RAW output: the cluster NAME and PARENT cluster columns can be interpreted as noted below: Bulgaria+Czechoslovakia → C3; FormerEGermany+C3 → C2; Albania+C2 → C1
  29. 31. SPRSQ: SAS Cluster Output <ul><li>_DIST_ = Euclidean distance between the means of the last clusters joined </li></ul><ul><li>_HEIGHT_ = user-specified distance or other measure of similarity used in the clustering method </li></ul><ul><li>_SPRSQ_ = decrease in the proportion of variance accounted for by the joining of two clusters to form the current cluster: we don’t want to account for LESS variance ( SPRSQ is, I believe, the increase in Ward’s ESS / TSS ). See P692-693 Johnson & Wichern, and SAS PROC CLUSTER </li></ul>
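Under the reading given above (SPRSQ as the increase in ESS at a merge divided by TSS, which the slide itself hedges), the quantity can be sketched in Python. The function names and the three-item toy data are assumptions for illustration.

```python
def ess(cluster):
    """Sum of squared deviations of cluster members from their centroid."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(item, centroid))
        for item in cluster
    )

def sprsq(a, b, all_items):
    """Assumed definition: ESS increase from merging a and b, over total SS."""
    return (ess(a + b) - ess(a) - ess(b)) / ess(all_items)

# Hypothetical data: two close items and one distant item.
items = [(0.0,), (2.0,), (10.0,)]
first = sprsq([items[0]], [items[1]], items)              # merge the close pair
second = sprsq([items[0], items[1]], [items[2]], items)   # merge in the distant item
```

The first merge costs little variance while the second shows the kind of large jump that suggests stopping before it; over a full run the values sum to 1.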
  30. 32. How to interpret the Proc Tree RAW output: focus on CLUSTER & CLUSTERNAME Cluster 1 event forms CL3, Cluster 2 event adds FEG, Cluster 3 event adds Albania
  31. 33. Prior to clustering we’ll use PROC ACECLUS to generate normalized variables: Can1~BirthRate, Can2~DeathRate, Can3~InfantDeathRate
  32. 34. True Distance* Measures between Items are preferable in Clustering** but not always possible (e.g. binary variables) <ul><li>d(s,q) = distance between points s & q </li></ul><ul><li>We want meaningful p-dimensional (p=#variables) measurements for pairs of items </li></ul><ul><li>A TRUE measure of distance satisfies: </li></ul><ul><ul><li>d(s,q)=d(q,s) ( symmetric: order not important ) </li></ul></ul><ul><ul><li>d(s,q)>0 if s≠q </li></ul></ul><ul><ul><li>d(s,q)=0 if s=q </li></ul></ul><ul><ul><li>d(s,q)≤d(s,r)+d(r,q) ( triangle inequality ) </li></ul></ul><ul><li>Can use binary variable scoring when a true distance is not possible </li></ul><ul><li>*P37, Johnson & Wichern : **P674 Johnson & Wichern </li></ul>
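The four conditions can be checked numerically on a finite set of points, as in this Python sketch. Passing such a check on a sample is only evidence, not a proof; the function name, tolerance, and test points are assumptions.

```python
import itertools
import math

def is_true_distance(d, points, tol=1e-12):
    """Check symmetry, positivity, identity and triangle inequality on points."""
    for s, q in itertools.product(points, repeat=2):
        if abs(d(s, q) - d(q, s)) > tol:          # symmetry
            return False
        if s == q and abs(d(s, q)) > tol:         # d(s,q) = 0 if s = q
            return False
        if s != q and d(s, q) <= 0:               # d(s,q) > 0 if s != q
            return False
    for s, q, r in itertools.product(points, repeat=3):
        if d(s, q) > d(s, r) + d(r, q) + tol:     # triangle inequality
            return False
    return True

def squared_euclidean(s, q):
    return sum((a - b) ** 2 for a, b in zip(s, q))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
euclid_ok = is_true_distance(math.dist, points)
squared_ok = is_true_distance(squared_euclidean, points)
```

Euclidean distance passes, while squared Euclidean distance fails the triangle inequality on these points (100 > 25 + 25), which shows that a useful dissimilarity need not be a true distance.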
  33. 35. Mahalanobis Distance <ul><li>In clustering we typically don’t have knowledge of S, the covariance matrix needed to compute the distance between observations X and Y. </li></ul><ul><li>Also known as Statistical Distance (P673, Johnson & Wichern), I believe. </li></ul>
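The formula from this slide did not survive extraction. For reference, the standard textbook form of the Mahalanobis distance between observations x and y, with S the covariance matrix mentioned above, is given below; the notation is an assumption and may differ from the original slide.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} S^{-1} (\mathbf{x} - \mathbf{y})}
```

When S is the identity matrix this reduces to ordinary Euclidean distance.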
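  34. 36. Minkowski Distance: d(x,y) = [ Σ_i |x_i − y_i|^m ]^(1/m). With m=1 this is the sum of absolute differences, or “City Block” distance; with m=2 it is the square root of the sum of squared differences, or Euclidean distance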
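A short Python sketch of the Minkowski family, using the standard definition (the function name and example points are assumptions):

```python
def minkowski(x, y, m):
    """Minkowski distance: (sum of |x_i - y_i|^m) raised to the power 1/m."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Hypothetical pair of points forming a 3-4-5 right triangle.
x, y = (0.0, 0.0), (3.0, 4.0)
city_block = minkowski(x, y, 1)   # 3 + 4
euclidean = minkowski(x, y, 2)    # sqrt(9 + 16)
```

Larger values of m give progressively more weight to the single largest coordinate difference.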
  35. 37. SAS CODE for Clustering <ul><li>title ''; </li></ul><ul><li>data PovertyAll; </li></ul><ul><li>input Birth Death InfantDeath Country $20. @@; </li></ul><ul><li>datalines; </li></ul><ul><li>24.7 5.7 30.8 Albania 12.5 11.9 14.4 Bulgaria </li></ul><ul><li>13.4 11.7 11.3 Czechoslovakia 12 12.4 7.6 Former_E._Germany </li></ul><ul><li>11.6 13.4 14.8 Hungary 14.3 10.2 16 Poland </li></ul><ul><li>13.6 10.7 26.9 Romania 14 9 20.2 Yugoslavia </li></ul><ul><li>17.7 10 23 USSR 15.2 9.5 13.1 Byelorussia_SSR </li></ul><ul><li>13.4 11.6 13 Ukrainian_SSR 20.7 8.4 25.7 Argentina </li></ul><ul><li>46.6 18 111 Bolivia 28.6 7.9 63 Brazil </li></ul><ul><li>23.4 5.8 17.1 Chile 27.4 6.1 40 Columbia </li></ul><ul><li>32.9 7.4 63 Ecuador 28.3 7.3 56 Guyana </li></ul><ul><li>34.8 6.6 42 Paraguay 32.9 8.3 109.9 Peru </li></ul><ul><li>18 9.6 21.9 Uruguay 27.5 4.4 23.3 Venezuela </li></ul><ul><li>29 23.2 43 Mexico 12 10.6 7.9 Belgium </li></ul><ul><li>13.2 10.1 5.8 Finland 12.4 11.9 7.5 Denmark </li></ul><ul><li>13.6 9.4 7.4 France 11.4 11.2 7.4 Germany </li></ul><ul><li>10.1 9.2 11 Greece 15.1 9.1 7.5 Ireland </li></ul><ul><li>9.7 9.1 8.8 Italy 13.2 8.6 7.1 Netherlands </li></ul><ul><li>14.3 10.7 7.8 Norway 11.9 9.5 13.1 Portugal </li></ul><ul><li>10.7 8.2 8.1 Spain 14.5 11.1 5.6 Sweden </li></ul><ul><li>12.5 9.5 7.1 Switzerland 13.6 11.5 8.4 U.K. </li></ul><ul><li>14.9 7.4 8 Austria 9.9 6.7 4.5 Japan </li></ul><ul><li>14.5 7.3 7.2 Canada 16.7 8.1 9.1 U.S.A. 
</li></ul><ul><li>40.4 18.7 181.6 Afghanistan 28.4 3.8 16 Bahrain </li></ul><ul><li>42.5 11.5 108.1 Iran 42.6 7.8 69 Iraq </li></ul><ul><li>22.3 6.3 9.7 Israel 38.9 6.4 44 Jordan </li></ul><ul><li>26.8 2.2 15.6 Kuwait 31.7 8.7 48 Lebanon </li></ul><ul><li>45.6 7.8 40 Oman 42.1 7.6 71 Saudi_Arabia </li></ul><ul><li>29.2 8.4 76 Turkey 22.8 3.8 26 United_Arab_Emirates </li></ul><ul><li>42.2 15.5 119 Bangladesh 41.4 16.6 130 Cambodia </li></ul><ul><li>21.2 6.7 32 China 11.7 4.9 6.1 Hong_Kong </li></ul><ul><li>30.5 10.2 91 India 28.6 9.4 75 Indonesia </li></ul><ul><li>23.5 18.1 25 Korea 31.6 5.6 24 Malaysia </li></ul><ul><li>36.1 8.8 68 Mongolia 39.6 14.8 128 Nepal </li></ul><ul><li>30.3 8.1 107.7 Pakistan 33.2 7.7 45 Philippines </li></ul><ul><li>17.8 5.2 7.5 Singapore 21.3 6.2 19.4 Sri_Lanka </li></ul><ul><li>22.3 7.7 28 Thailand 31.8 9.5 64 Vietnam </li></ul><ul><li>35.5 8.3 74 Algeria 47.2 20.2 137 Angola </li></ul><ul><li>48.5 11.6 67 Botswana 46.1 14.6 73 Congo </li></ul><ul><li>38.8 9.5 49.4 Egypt 48.6 20.7 137 Ethiopia </li></ul><ul><li>39.4 16.8 103 Gabon 47.4 21.4 143 Gambia </li></ul><ul><li>44.4 13.1 90 Ghana 47 11.3 72 Kenya </li></ul><ul><li>44 9.4 82 Libya 48.3 25 130 Malawi </li></ul><ul><li>35.5 9.8 82 Morocco 45 18.5 141 Mozambique </li></ul><ul><li>44 12.1 135 Namibia 48.5 15.6 105 Nigeria </li></ul><ul><li>48.2 23.4 154 Sierra_Leone 50.1 20.2 132 Somalia </li></ul><ul><li>32.1 9.9 72 South_Africa 44.6 15.8 108 Sudan </li></ul><ul><li>46.8 12.5 118 Swaziland 31.1 7.3 52 Tunisia </li></ul><ul><li>52.2 15.6 103 Uganda 50.5 14 106 Tanzania </li></ul><ul><li>45.6 14.2 83 Zaire 51.1 13.7 80 Zambia </li></ul><ul><li>41.7 10.3 66 Zimbabwe </li></ul><ul><li>; </li></ul><ul><li>run; </li></ul><ul><li>/* Normalize: writes canonical variables can1-can3 to AceAll */ </li></ul><ul><li>proc aceclus data=PovertyAll out=AceAll p=.03 noprint; </li></ul><ul><li>var Birth Death InfantDeath; </li></ul><ul><li>run; </li></ul><ul><li>title ''; </li></ul><ul><li>ods graphics on; </li></ul><ul><li>/* Cluster the ACECLUS output (AceAll), which is where can1-can3 exist */ </li></ul><ul><li>proc cluster data=AceAll method=average ccc pseudo 
print=15 outtree=TreePovertyAll; </li></ul><ul><li>var can1 can2 can3; </li></ul><ul><li>id country; </li></ul><ul><li>format country $12.; </li></ul><ul><li>run; </li></ul><ul><li>ods graphics off; </li></ul><ul><li>goptions vsize=8in hsize=6.4in htext=0.9pct htitle=3pct; </li></ul><ul><li>axis1 order=(0 to 1 by 0.2); </li></ul><ul><li>/* Cut the tree at 5 clusters and save the assignments in New */ </li></ul><ul><li>proc tree data=TreePovertyAll out=New nclusters=5 </li></ul><ul><li>haxis=axis1 horizontal; </li></ul><ul><li>height _SPRSQ_; </li></ul><ul><li>copy can1 can2 can3; </li></ul><ul><li>id country; </li></ul><ul><li>run; </li></ul><ul><li>proc print data=New; </li></ul><ul><li>run; </li></ul><ul><li>/* Pairwise scatter plots of the canonical variables, colored by cluster */ </li></ul><ul><li>proc sgplot data=New; </li></ul><ul><li>scatter y=can2 x=can1 / datalabel=country group=cluster; </li></ul><ul><li>run; </li></ul><ul><li>proc sgplot data=New; </li></ul><ul><li>scatter y=can3 x=can1 / datalabel=country group=cluster; </li></ul><ul><li>run; </li></ul><ul><li>proc sgplot data=New; </li></ul><ul><li>scatter y=can3 x=can2 / datalabel=country group=cluster; </li></ul><ul><li>run; </li></ul>