Sawtooth 2012 what's in a label

Sawtooth Conference 2012, Orlando, Florida
What's in a label? The business value of hard versus soft clustering
by Nicole Huyghe and Anita Prinzie

Published in: Business, Technology, Education

  1. 1. What’s in a Label? Business value of “soft” vs “hard” cluster ensembles. solutions-2, Nicole Huyghe & Anita Prinzie
  2. 2. Answers the who and the why
  3. 3. Theme 1, Theme 2, Theme 3, ..., Theme 9, Theme 10: Cluster Ensemble
  4. 4. HARD OR SOFT CLUSTER ENSEMBLE
  5. 5. Stability, Integrity, Accuracy, Size
  6. 6. Stability. Similarity Index (Lange et al., 2004): indicates the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
  7. 7. Cluster Integrity - Heterogeneity. Total separation of clusters: based on the distance between cluster centers.
  8. 8. Cluster Integrity - Homogeneity. Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
  9. 9. Accuracy. (Figure: Reality versus Prediction segment labels.) Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment, corrected for the expected level of agreement.
  10. 10. Size. Uniformity deviation: average deviation of each segment's size from the uniform segment size (1/number of segments).
  11. 11. Rheumatism, Software journey, Osteoporosis
  12. 12. Stability, Heterogeneity: H>S, H>S. Accuracy, Homogeneity: S>H, H>S, H>S, S>H, S>H
  13. 13. LC gives smaller segments. Rheumatism: Soft LC, Soft CCEA, Hard LC, Hard CCEA. Software journey. Osteoporosis: Soft LC, Soft CCEA, Hard LC, Hard CCEA.
  14. 14. MIXED EVIDENCE
  15. 15. Fixed Factors x 10 100 100 100 100
  16. 16. Stability: SOFT is better. High confidence, Low confidence; Strong similarity, Weak similarity. Sim. Index soft > hard; Sim. Index hard > soft.
  17. 17. Homogeneity: SOFT is better. High confidence, Low confidence; Strong similarity, Weak similarity. Scatter hard > soft.
  18. 18. Heterogeneity: HARD is better. High confidence, Low confidence; Strong similarity, Weak similarity. Tot. Sep. soft > hard.
  19. 19. Size: HARD is better. High confidence, Low confidence; Strong similarity, Weak similarity. Uni. dev. soft > hard.
  20. 20. HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS
  21. 21. Anita Prinzie, Nicole Huyghe. anita@solutions2.be, www.solutions2.be. Do we cause rising questions?
  22. 22. References
      • Fred, A.L.N. and Jain, A.K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.
      • Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.
      • Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.
      • Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 193-218.
      • Nieweglowski, L. (2007), clv package, R software.
      • Martin, A., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.
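
The slides define the stability, accuracy and size measures (slides 6, 9 and 10) only in words. The following is a minimal Python sketch of how those three could be computed from segment labels. It is not the authors' implementation, and the function names are hypothetical. Two points are assumptions rather than facts from the slides: the Similarity Index is read literally as the share of all observation pairs that sit together in both clusterings (the exact normalization used in the study is not stated), and the uniformity deviation uses the absolute deviation over non-empty segments.

```python
from collections import Counter
from itertools import combinations
from math import comb


def similarity_index(labels_c, labels_c_prime):
    """Share of observation pairs that are in the same cluster in both clusterings.

    Assumption: the denominator is the number of all pairs; the slides do not
    state the normalization that was actually used.
    """
    pairs = list(combinations(range(len(labels_c)), 2))
    together_in_both = sum(
        1
        for i, j in pairs
        if labels_c[i] == labels_c[j] and labels_c_prime[i] == labels_c_prime[j]
    )
    return together_in_both / len(pairs)


def adjusted_rand_index(real, predicted):
    """Adjusted Rand Index (Hubert and Arabie, 1985); needs at least two observations."""
    n = len(real)
    # Pair counts within contingency-table cells, real segments, predicted segments.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    sum_real = sum(comb(c, 2) for c in Counter(real).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_real * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_real + sum_pred) / 2
    if maximum == expected:
        # Degenerate case (e.g. one segment on both sides); returning 1.0 is a
        # convention chosen for this sketch.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def uniformity_deviation(labels):
    """Mean absolute deviation of segment shares from 1 / number of segments.

    Assumption: absolute deviation, counting only non-empty segments.
    """
    counts = Counter(labels)
    k = len(counts)
    n = len(labels)
    return sum(abs(c / n - 1 / k) for c in counts.values()) / k
```

For example, `adjusted_rand_index(real, predicted)` does not depend on how the segments are numbered, so a prediction that recovers the real segments under different labels still scores 1.0, and `uniformity_deviation` is 0.0 when all segments are equally large.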
