“A Comparative Study between Clustering Algorithms”
Pattern Discovery for Categorical Cross-Cultural Data in
the Market Research Domain
September, 2015
Supervisor: Professor Plamen Angelov
Reviewer: Professor Nigel Davies
Industry Partner: Bonamy Finch
Author: Ahmed Hamada
INDUSTRY PARTNER
50+ customers
THE CHALLENGE
Cross-cultural attitudinal segmentation studies that use rating scales are a
serious challenge in the market research domain. Unlike clustering on
demographics, respondents share many views, so the boundaries between
segments are fuzzy. Producing clusters that are both meaningful (realistically
reflecting the respondent segments) and geometrically well formed is a further
demand in this domain.
GAP ANALYSIS
• 76% of the studies used K-means as the partitioning method for their segmentation.
• 93% of the segmentation studies used Euclidean distance.
• More than 60% of the examined market research studies did not include
evaluation criteria for the developed clusters.
Source: a survey of 243 market segmentation publications in the tourism
domain (Dolnicar, 2003).
K-MEANS PROBLEMS
Data Dimensionality
• In high-dimensional data, distances between points become relatively
uniform, so the concept of a point's nearest neighbour becomes meaningless.
Dissimilarity Measure
• K-means is not just about distances; it also requires computing a mean, and
there is no reasonable mean for categorical data.
Non-Convex Shaped Clusters
• In Euclidean space, an object is convex if, for every pair of points within
the object, every point on the straight line segment that joins them is also
within the object. K-means struggles with clusters that are not convex.
Local Minima
• Differentiating the objective function with respect to the centroids finds
only a local minimum. Running the algorithm from more initial points improves
the chance of reaching the global minimum.
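The data dimensionality point can be illustrated with a small simulation. This is a minimal sketch, not part of the study: it draws uniform random points (an assumption made purely for illustration) and measures the relative contrast, (farthest - nearest) / nearest, between one query point and the rest. The function name `relative_contrast` and its parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest Euclidean distance
    from one random query point to n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the number of dimensions grows, which is
    # what "distances become relatively uniform" means in practice.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A contrast close to zero means the nearest and farthest neighbours are almost equally far away, so a nearest-centroid assignment carries little information.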
EXPERIMENTS
PARTITIONING METHODS
• K-means (Euclidean distance)
• K-modes (matching measure)
• Kernel K-means (non-convex shaped clusters)
HIERARCHICAL METHODS
• ROCK (arbitrary shaped clusters)
Configurations tested:
1. K-means on raw data
2. K-means on standardised rows
3. MCA on raw data + K-means
4. Kernel K-means on raw data
5. Kernel K-means on standardised rows
6. K-modes on raw data
7. ROCK on raw data
21 experiments (7 configurations × 3 cluster counts)
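Two of the ingredients named on this slide can be sketched in a few lines. The slides do not define "standardised rows", so the sketch assumes it means z-scoring each respondent's own ratings (subtract the row mean, divide by the row standard deviation). The matching measure is assumed to be the simple count of mismatching answers, as commonly used with K-modes. Both function names are hypothetical.

```python
import statistics


def standardise_row(ratings):
    """Z-score one respondent's ratings using that respondent's own
    mean and (population) standard deviation. Assumed definition."""
    mean = statistics.fmean(ratings)
    sd = statistics.pstdev(ratings)
    if sd == 0:
        # A respondent who gives the same rating everywhere has no spread.
        return [0.0] * len(ratings)
    return [(r - mean) / sd for r in ratings]


def matching_dissimilarity(a, b):
    """Simple matching measure: number of questions answered differently."""
    return sum(1 for x, y in zip(a, b) if x != y)


if __name__ == "__main__":
    # Same response pattern, different use of the scale.
    print(standardise_row([1, 2, 3]))
    print(standardise_row([3, 4, 5]))
    print(matching_dissimilarity([1, 5, 3], [1, 4, 3]))
```

Under this assumption, two respondents who express the same pattern at different ends of the scale end up with identical standardised rows, which is one way to reduce scale-use differences across cultures before clustering.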
DETERMINING THE NUMBER OF CLUSTERS
[Figure: Gap statistic for 10 clusters]
[Figure: Within sum of squares for 10 clusters]
Candidate models: 5, 6 and 7 clusters
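The within sum of squares curve can be sketched as follows. This is an illustrative, standard-library implementation of Lloyd's K-means with random restarts, not the code used in the study; `kmeans_wss`, `wss_curve` and their parameters are hypothetical names. The gap statistic is not shown here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_wss(points, k, n_init=10, max_iter=100, seed=0):
    """Best within-cluster sum of squares over n_init random starts."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        centroids = [list(p) for p in rng.sample(points, k)]
        for _ in range(max_iter):
            # Assignment step: each point joins its nearest centroid.
            labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
            # Update step: each centroid moves to the mean of its members.
            new_centroids = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                new_centroids.append(mean_point(members) if members
                                     else centroids[j])
            if new_centroids == centroids:
                break
            centroids = new_centroids
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        best = min(best, wss)
    return best


def wss_curve(points, k_max, **kwargs):
    """Within sum of squares for k = 1 .. k_max."""
    return {k: kmeans_wss(points, k, **kwargs) for k in range(1, k_max + 1)}


if __name__ == "__main__":
    demo = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
            [20, 0], [20, 1], [21, 0]]
    for k, wss in wss_curve(demo, 4).items():
        print(k, round(wss, 2))
```

The number of clusters is usually read off where the curve stops dropping sharply. The restarts (`n_init`) matter because a single run can stop in a poor local minimum.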
7-CLUSTER MODEL GEOMETRICAL COMPARISON
Algorithm                             Within cluster sum of squares   Cluster closeness index
K-means                               117,604                         21%
K-means on standardised rows          87,232                          18%
MCA + K-means                         1,644                           59%
Kernel K-means                        283,904                         0.04%
Kernel K-means on standardised rows   224,892                         0.05%
INTERNAL MEASURES COMPARISON
[Figure: two bar charts comparing the 5-, 6- and 7-cluster models. First panel: Dunn index (axis 0 to 0.4). Second panel: Silhouette measure (axis -0.1 to 0.2).]
Dunn index data labels: 0.102, 0.05, 0.09, 0.08, 0.07; 0.125, 0.05, 0.1, 0, 0.05; 0.109, 0.05, 0.1, 0.08, 0.05
Silhouette measure data labels: 0.05, 0.03, -0.02, -0.01, -0.01; 0.05, 0.03, -0.02, 0.01, 0.01; 0.04, 0.03, -0.03, -0.01, 0.02
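Both internal measures can be computed directly from points and cluster labels. The sketch below assumes Euclidean distance and one common Dunn index variant (smallest distance between points in different clusters divided by the largest distance between points in the same cluster); the slides do not state which variant was used. The function names are hypothetical, and both functions expect at least two clusters.

```python
import math
from itertools import combinations


def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def dunn_index(points, labels):
    """Min between-cluster distance / max within-cluster distance.
    Assumes at least one cluster has two distinct points."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    return min(between) / max(within)


if __name__ == "__main__":
    pts = [[0], [1], [10], [11]]
    labs = [0, 0, 1, 1]
    print(round(mean_silhouette(pts, labs), 4), dunn_index(pts, labs))
```

Silhouette values near 0 or below, like those reported on this slide, indicate overlapping clusters, which is consistent with the fuzzy segment boundaries described earlier in the deck.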
INDUSTRY EVALUATION
Columns 5, 6 and 7 give the number of clusters in the model; "-" marks a cluster that does not exist in that model.
                                   K-means on standardised rows   Kernel K-means on standardised rows
Measure                 Cluster    5      6      7                5      6      7
Response Bias Freedom   1          79%    86%    79%              70%    59%    58%
                        2          81%    77%    67%              93%    61%    71%
                        3          90%    61%    79%              77%    64%    75%
                        4          72%    81%    71%              74%    79%    83%
                        5          80%    70%    75%              79%    67%    67%
                        6          -      71%    71%              -      61%    79%
                        7          -      -      71%              -      -      79%
Reportability           1          71%    62%    67%              62%    76%    71%
                        2          38%    19%    19%              90%    24%    19%
                        3          19%    29%    81%              48%    81%    71%
                        4          43%    52%    29%              71%    33%    62%
                        5          62%    52%    43%              10%    33%    43%
                        6          -      71%    57%              -      33%    43%
                        7          -      -      62%              -      -      52%
5-CLUSTER MODEL SCATTER PLOT MATRIX FOR THE FIRST 4 VARIABLES
[Figure: two panels, K-means on standardised rows and Kernel K-means on standardised rows]
CONCLUSION
1. The results showed that standardising the respondents (row
standardisation) produced better segments from a pragmatic point
of view.
2. In the overall evaluation, the 5-cluster models built with K-means
and kernel K-means on standardised rows yielded more meaningful
segments than the other methods.
3. The ROCK algorithm and the application of MCA followed by K-means
were not suitable for multiscale categorical data and produced
meaningless clusters.
FURTHER RESEARCH
• Evaluate the stability of the classification accuracy using different
algorithms.
• Study other clustering methods available in the literature.
• Evaluate the same algorithms on various cross-cultural multiscale
data sets and test whether multi-scaled data (i.e. Likert scale)
produce better clusters from a geometrical point of view.
• Evaluate the clustering algorithms on response scales of a different
type from the multi-point biased response scales used here.
Thank You
