Sawtooth 2012: What's in a Label?

Sawtooth Conference 2012, Orlando, Florida
What's in a label? The business value of hard versus soft clustering
by Nicole Huyghe and Anita Prinzie

  1. What’s in a Label? Business value of “soft” vs “hard” cluster ensembles. solutions-2, Nicole Huyghe & Anita Prinzie
  2. Answers the who and the why
  3. Theme 1, Theme 2, Theme 3, ..., Theme 9, Theme 10: Cluster Ensemble
  4. HARD OR SOFT CLUSTER ENSEMBLE
  5. Stability, Integrity, Accuracy, Size
  6. Stability. Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
  7. Cluster Integrity - Heterogeneity. Total separation of clusters: based on the distance between cluster centers.
  8. Cluster Integrity - Homogeneity. Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
  9. Accuracy. [Figure: segment assignments, Reality vs. Prediction.] Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment, corrected for the expected level of agreement.
  10. Size. Uniformity deviation: average deviation of each segment's size from the uniform segment size (1/number of segments).
  11. Rheumatism, Software journey, Osteoporosis
  12. Stability Heterogeneity H>S H>S Accuracy Homogeneity S>H H>S H>S S>H S>H
  13. LC gives smaller segments. [Chart panels: Rheumatism, Software journey, Osteoporosis; series: Soft LC, Soft CCEA, Hard LC, Hard CCEA.]
  14. MIXED EVIDENCE
  15. Fixed Factors x 10 100 100 100 100
  16. Stability: SOFT is better. [Grid: High/Low confidence by Strong/Weak similarity; cells: Sim. Index soft > hard, Sim. Index hard > soft.]
  17. Homogeneity: SOFT is better. [Grid: High/Low confidence by Strong/Weak similarity; cell: Scatter hard > soft.]
  18. Heterogeneity: Hard is better. [Grid: High/Low confidence by Strong/Weak similarity; cell: Tot. Sep. soft > hard.]
  19. Size: Hard is better. [Grid: High/Low confidence by Strong/Weak similarity; cell: Uni. dev. soft > hard.]
  20. HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS
  21. Anita Prinzie, Nicole Huyghe. anita@solutions2.be, www.solutions2.be. do we cause rising questions
  22. References
     • Fred, A. and Jain, A. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.
     • Lange, T., Roth, V., Braun, L. and Buhmann, J.M. (2004), Stability-based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.
     • Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.
     • Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 193-218.
     • Nieweglowski, L. (2007), CLV package, R software.
     • Martin, A., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.
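
The evaluation criteria on slides 6 to 10 can be sketched in code. The Python below is a minimal, standard-library-only illustration and is not the authors' implementation (the references point to R packages). Several details are assumptions, since the slides give only verbal definitions: the similarity index is normalized by the total number of pairs, scatter uses the Euclidean norm of the per-dimension population variances, and uniformity deviation uses absolute deviations of segment shares from 1/k. All function names are hypothetical. Total separation is left out because the slide does not state its formula.

```python
from collections import Counter
from itertools import combinations
from math import comb, sqrt


def similarity_index(labels_a, labels_b):
    """Share of observation pairs co-clustered in both labelings.

    Assumption: normalized by the total number of pairs.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    both = sum(
        1
        for i, j in pairs
        if labels_a[i] == labels_a[j] and labels_b[i] == labels_b[j]
    )
    return both / len(pairs)


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index: pair agreement corrected for chance."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def _variance_norm(points):
    """Euclidean norm of the per-dimension population variance vector."""
    n = len(points)
    variances = []
    for d in range(len(points[0])):
        mean = sum(p[d] for p in points) / n
        variances.append(sum((p[d] - mean) ** 2 for p in points) / n)
    return sqrt(sum(v * v for v in variances))


def scatter(points, labels):
    """Average ratio of cluster variance to dataset variance (lower = more compact)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = _variance_norm(points)
    return sum(_variance_norm(c) for c in clusters.values()) / (len(clusters) * total)


def uniformity_deviation(labels):
    """Average absolute deviation of segment shares from the uniform share 1/k."""
    sizes = Counter(labels)
    n, k = len(labels), len(sizes)
    return sum(abs(count / n - 1 / k) for count in sizes.values()) / k
```

For a hard ensemble the labels are the assigned segments; for a soft ensemble one would first assign each respondent to the segment with the highest membership probability, which is again an assumption about how the comparison was set up.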