SlideShare a Scribd company logo
1 of 17
Clustering Analysis -
Partitioning method
Presented by-kamalakshi Deshmukh
M.E(Comp)
Roll No-ME101
Guided by- Prof.Harmeet kaur khanuja
Marathwada Mitra Mandal College of Engineering(MMCOE) PUNE
Cluster Analysis 1-04-2017 1
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
Cluster Analysis 1-04-2017 2
Classification Vs Clustering
Cluster Analysis 1-04-2017 3
Requirements of Clustering in Data
Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Cluster Analysis 1-04-2017 4
Methods of Clustering
Cluster Analysis 1-04-2017 5
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 k-means : ( centroid is mean )Each cluster is represented by the
center of the cluster
 k-medoids or PAM (Partition around medoids): Each cluster is
represented by one of the objects in the cluster
Cluster Analysis 1-04-2017 6
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean
point, of the cluster)
 Assign each object to the cluster with the nearest seed
point
 Go back to Step 2, stop when no more new assignment
Cluster Analysis 1-04-2017 7
How the K-Mean Clustering
algorithm works?
1-04-2017Cluster Analysis 8
Comments on the K-Means Method
 Strength: Relatively efficient: O(tkn), where n is #
objects, k is # clusters, and t is # iterations. Normally,
k, t << n.
 Weakness
 Applicable only when mean is defined
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
Cluster Analysis 1-04-2017 9
What is the problem of k-Means
Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially distort the
distribution of the data.
 K-Medoids: Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally located
object in a cluster.
Cluster Analysis 1-04-2017 10
PAM (Partitioning Around Medoids)
1-04-2017Cluster Analysis 11
Typical k-medoids algorithm (PAM)
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
K=2
Arbitrary
choose k
object as
initial
medoids
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
Assign
each
remainin
g object
to
nearest
medoids
Randomly select a
nonmedoid object,Oramdom
Compute
total cost of
swapping
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
Swapping O
and Oramdom
If quality is
improved.
Do loop
Until no
change
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
What is the problem with PAM?
 Pam is more robust than k-means in the presence of noise and outliers
because a medoid is less influenced by outliers or other extreme values
than a mean
 Pam works efficiently for small data sets but does not scale well for
large data sets.
 O(k(n-k)2
) for each iteration
where n is number of data elements ,k is number of clusters
 Sampling based method,
CLARA(Clustering LARge Applications) of sample size s, O(ks2+k(n-k))
CLARANS : Randomized re-sampling improving efficiency+quality
1-04-2017Cluster Analysis 13
CLARA (Clustering Large Applications)
It draws multiple samples of the data set, applies PAM on each sample,
and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a
good clustering of the whole data set if the sample is biased
1-04-2017Cluster Analysis 14
Examples of Clustering
Applications
 Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation
database
 Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
 City-planning: Identifying groups of houses according to their house type,
value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be clustered
along continent faults
1-04-2017Cluster Analysis 15
References
 Books
1. Data mining concepts and techniques, Jawai Han,
Michelline Kamber, Jiran Pie, Morgan Kaufmann
Publishers, 3rd Edition.
 Weblinks
1. http://stackoverflow.com/questions/5064928/difference-between-classificatio
2. https://en.wikipedia.org/wiki/Cluster_analysis
1-04-2017Cluster Analysis 16
Thank you!
1-04-2017Cluster Analysis 17

More Related Content

What's hot (20)

Clustering
ClusteringClustering
Clustering
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Clustering, k-means clustering
Clustering, k-means clusteringClustering, k-means clustering
Clustering, k-means clustering
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
What is cluster analysis
What is cluster analysisWhat is cluster analysis
What is cluster analysis
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
Clustering
ClusteringClustering
Clustering
 
K means clustering
K means clusteringK means clustering
K means clustering
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Decision tree
Decision treeDecision tree
Decision tree
 

Similar to Cluster analysis

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
chap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxchap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxDiaaMustafa2
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseMohaiminur Rahman
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478IJRAT
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
Feature Subset Selection for High Dimensional Data Using Clustering TechniquesFeature Subset Selection for High Dimensional Data Using Clustering Techniques
Feature Subset Selection for High Dimensional Data Using Clustering TechniquesIRJET Journal
 

Similar to Cluster analysis (20)

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
chap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxchap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptx
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
My8clst
My8clstMy8clst
My8clst
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
 
41 125-1-pb
41 125-1-pb41 125-1-pb
41 125-1-pb
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
Feature Subset Selection for High Dimensional Data Using Clustering TechniquesFeature Subset Selection for High Dimensional Data Using Clustering Techniques
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
 

Recently uploaded

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 

Recently uploaded (20)

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 

Cluster analysis

  • 1. Clustering Analysis - Partitioning method Presented by-kamalakshi Deshmukh M.E(Comp) Roll No-ME101 Guided by- Prof.Harmeet kaur khanuja Marathwada Mitra Mandal College of Engineering(MMCOE) PUNE Cluster Analysis 1-04-2017 1
  • 2. What is Cluster Analysis?  Cluster: a collection of data objects  Similar to one another within the same cluster  Dissimilar to the objects in other clusters  Cluster analysis  Grouping a set of data objects into clusters  Clustering is unsupervised classification: no predefined classes Cluster Analysis 1-04-2017 2
  • 4. Requirements of Clustering in Data Mining  Scalability  Ability to deal with different types of attributes  Discovery of clusters with arbitrary shape  Minimal requirements for domain knowledge to determine input parameters  Able to deal with noise and outliers  Insensitive to order of input records  High dimensionality  Incorporation of user-specified constraints  Interpretability and usability Cluster Analysis 1-04-2017 4
  • 5. Methods of Clustering Cluster Analysis 1-04-2017 5
  • 6. Partitioning Algorithms: Basic Concept  Partitioning method: Construct a partition of a database D of n objects into a set of k clusters  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion  k-means : ( centroid is mean )Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster Cluster Analysis 1-04-2017 6
  • 7. The K-Means Clustering Method  Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2, stop when no more new assignment Cluster Analysis 1-04-2017 7
  • 8. How the K-Mean Clustering algorithm works? 1-04-2017Cluster Analysis 8
  • 9. Comments on the K-Means Method  Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Weakness  Applicable only when mean is defined  Need to specify k, the number of clusters, in advance  Unable to handle noisy data and outliers Cluster Analysis 1-04-2017 9
  • 10. What is the problem of k-Means Method?  The k-means algorithm is sensitive to outliers !  Since an object with an extremely large value may substantially distort the distribution of the data.  K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. Cluster Analysis 1-04-2017 10
  • 11. PAM (Partitioning Around Medoids) 1-04-2017Cluster Analysis 11
  • 12. Typical k-medoids algorithm (PAM) 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrary choose k object as initial medoids 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Assign each remainin g object to nearest medoids Randomly select a nonmedoid object,Oramdom Compute total cost of swapping 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Swapping O and Oramdom If quality is improved. Do loop Until no change 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 13. What is the problem with PAM?  Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean  Pam works efficiently for small data sets but does not scale well for large data sets.  O(k(n-k)2 ) for each iteration where n is number of data elements ,k is number of clusters  Sampling based method, CLARA(Clustering LARge Applications) of sample size s, O(ks2+k(n-k)) CLARANS : Randomized re-sampling improving efficiency+quality 1-04-2017Cluster Analysis 13
  • 14. CLARA (Clustering Large Applications) It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output Strength: deals with larger data sets than PAM Weakness:  Efficiency depends on the sample size  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased 1-04-2017Cluster Analysis 14
  • 15. Examples of Clustering Applications  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  Land use: Identification of areas of similar land use in an earth observation database  Insurance: Identifying groups of motor insurance policy holders with a high average claim cost  City-planning: Identifying groups of houses according to their house type, value, and geographical location  Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults 1-04-2017Cluster Analysis 15
  • 16. References  Books 1. Data mining concepts and techniques, Jawai Han, Michelline Kamber, Jiran Pie, Morgan Kaufmann Publishers, 3rd Edition.  Weblinks 1. http://stackoverflow.com/questions/5064928/difference-between-classificatio 2. https://en.wikipedia.org/wiki/Cluster_analysis 1-04-2017Cluster Analysis 16