Clustering on Database Systems 
Vahid Mirjalili 
Michigan State University
Clustering 
• Partitioning data into groups 
Items in the same group should have higher similarity to each 
other than items from different groups 
• A similarity/dissimilarity measure 
• Examples:
– Clustering patients in a hospital
– Genomic clustering
– Hand-written character recognition
A. Jain, “Data Clustering: 50 years beyond K-means”
Clustering vs. Classification 
[Diagram: predictive modeling tasks: supervised learning, unsupervised learning, reinforcement learning]
• Classification is supervised
– Class labels are provided
– Learn a classifier to predict class labels of novel/unseen data
• Clustering is unsupervised or semi-supervised
– No class labels are given
– Understand the structure underlying your data
Clustering Approaches
• Probability-based
– Assumes statistical independence among features
– Updating and storing clusters is inefficient
• Distance-based
– Assumes direct access to all data points
– Hierarchical clustering: O(N^2), and does not give the best clustering
Distance-Based Clustering Algorithms 
• K-means and its variants (k-medoids, kernel k-means, fuzzy c-means, …)
• Density-based methods (DBSCAN)
• Hierarchical methods
Challenges 
• Unknown number of clusters (from 1 to N)
[Figure: input data and its clusterings with K=2 and K=6]
• You always get some output as clusters. Are they really distinct clusters?
A. Jain, “Data Clustering: 50 years beyond K-means”
Challenges 
• Clusters with different shapes, sizes and densities
– Shapes: globular, linear vs. non-linear
A. Jain, “Data Clustering: 50 years beyond K-means”
Standard K-Means Algorithm 
• Choose the initial cluster centroids randomly
• An iterative algorithm:
1. Assignment step: assign each data point to the cluster whose mean is closest (smallest distance)
2. Update step: update the mean (centroid) of each cluster
Distance: squared Euclidean distance
$$\mathrm{dist}(x, \mu) = \sum_{j=1}^{d} (x_j - \mu_j)^2$$
Centroid: mean of feature vectors
$$\mu_C = \frac{1}{N_C} \sum_{i \in C} X_i$$
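The two steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation behind the slides; the names `squared_euclidean` and `kmeans`, the `max_iter` cap, and the rule that an empty cluster keeps its previous centroid are assumptions made for the example.

```python
import random


def squared_euclidean(x, mu):
    """Squared Euclidean distance between a point x and a centroid mu."""
    return sum((xj - mj) ** 2 for xj, mj in zip(x, mu))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain in-memory k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct data points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

Every iteration scans all points, which is what becomes expensive once the data no longer fits in main memory.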
Problem in Database-oriented 
Clustering 
• Little memory is available relative to the size of the dataset → the data does not fit in main memory
• High I/O cost
• Too many iterations must be avoided
RKM: An Efficient Disk-based KMeans 
Method 
• Find the initial centroids by $\mu_c = \mu_{all} \pm \sigma_{all}\, r / d$
• Only 3 iterations:
– Assign every $L = \sqrt{N}$ points to their nearest centroids
– Update the cluster centroids
• Minor efficiency tricks:
– Keep track of LS, SS and $N_c$ for each cluster during assignment → update step: $\mu_c = LS_c / N_c$
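A rough sketch of this scheme follows, under stated assumptions rather than as the published RKM code. It assumes the block size is about the square root of N, that the initial centroids are the global mean perturbed by a random fraction of the global standard deviation, and that the sufficient statistics are reset at the start of each pass. The function name `rkm_sketch` is hypothetical.

```python
import math
import random


def rkm_sketch(points, k, passes=3, seed=0):
    """Incremental k-means in the spirit of RKM (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    L = max(1, math.isqrt(n))  # assumed block size: about sqrt(N) points

    # Global mean and standard deviation per dimension.
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    std = [math.sqrt(sum((p[j] - mean[j]) ** 2 for p in points) / n)
           for j in range(d)]

    # Assumed initialization: mean +/- std * r / d with r uniform in [0, 1).
    centroids = [
        [mean[j] + rng.choice((-1, 1)) * std[j] * rng.random() / d
         for j in range(d)]
        for _ in range(k)
    ]

    for _ in range(passes):
        # Sufficient statistics per cluster: linear sum, squared sum, count.
        LS = [[0.0] * d for _ in range(k)]
        SS = [[0.0] * d for _ in range(k)]
        Nc = [0] * k
        for i, p in enumerate(points, start=1):
            # Assignment: nearest centroid by squared Euclidean distance.
            c = min(range(k),
                    key=lambda q: sum((p[j] - centroids[q][j]) ** 2
                                      for j in range(d)))
            Nc[c] += 1
            for j in range(d):
                LS[c][j] += p[j]
                SS[c][j] += p[j] ** 2  # kept for variances; unused below
            # Update every L points and at the end of the pass: LS / Nc.
            if i % L == 0 or i == n:
                for q in range(k):
                    if Nc[q] > 0:
                        centroids[q] = [LS[q][j] / Nc[q] for j in range(d)]
    return centroids, Nc


points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9), (1.0, 0.4),
          (9.0, 9.5), (9.8, 10.1), (10.0, 9.0), (10.5, 10.4), (9.4, 9.9)]
centroids, counts = rkm_sketch(points, k=2)
```

Because the centroids move several times within a single pass, later points in that pass are already assigned against improved centroids, which is how the number of full scans can be kept small.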
Implementation of RKM: 
storing data matrices 
• D: input dataset
• P_j: cluster j (for j in [1..k])
• M_j, Q_j, N_j: linear sum, squared sum, cluster size
• C_j, R_j, W_j: centroids, variances, weights (accessed during update step)
$$C_j = M_j / N_j$$
$$R_j = Q_j / N_j - M_j M_j^t / N_j^2$$
$$W_j = N_j / \sum_{l=1..k} N_l$$
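As a small worked example, the centroids, variances and weights can be derived from the stored sums alone, without touching the data again. The sketch below treats the variance per dimension (the diagonal of R_j) and assumes every cluster is non-empty; the function name `cluster_summaries` is made up for illustration.

```python
def cluster_summaries(M, Q, N):
    """Derive centroids C, per-dimension variances R and weights W."""
    total = sum(N)
    C, R, W = [], [], []
    for Mj, Qj, Nj in zip(M, Q, N):
        # C_j = M_j / N_j
        C.append([m / Nj for m in Mj])
        # Diagonal of R_j = Q_j / N_j - M_j M_j^t / N_j^2
        R.append([q / Nj - (m / Nj) ** 2 for m, q in zip(Mj, Qj)])
        # W_j = N_j / sum of all cluster sizes
        W.append(Nj / total)
    return C, R, W


# Two one-dimensional clusters: {1, 3} and {10}.
M = [[4.0], [10.0]]     # linear sums: 1 + 3, 10
Q = [[10.0], [100.0]]   # squared sums: 1 + 9, 100
N = [2, 1]
C, R, W = cluster_summaries(M, Q, N)
# C == [[2.0], [10.0]], R == [[1.0], [0.0]], W is approximately [2/3, 1/3]
```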
RKM avoids local minima: 
split large clusters 
• Only performed if size of a cluster is less than 
a user-defined threshold 
1. Remove the centroid of the small cluster 
2. Find the largest cluster (largest Wj) 
3. Randomly choose two centroids for the largest cluster (using Cj and Rj)
4. Reassign the items of small and large clusters
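The steps above could look roughly like the sketch below. The slide does not say how the two new centroids are drawn from Cj and Rj, so the rule used here (the large cluster's centroid plus or minus a random fraction of its standard deviation) is an assumption, as are the names `split_if_small` and `min_size`. Reassigning the items (step 4) is left to the next assignment step.

```python
import math
import random


def split_if_small(C, R, N, min_size, rng):
    """Re-seed the smallest cluster inside the largest one if it is too small."""
    C = [list(c) for c in C]
    small = min(range(len(N)), key=lambda j: N[j])
    big = max(range(len(N)), key=lambda j: N[j])  # largest N_j has largest W_j
    if N[small] >= min_size or small == big:
        return C  # nothing to split
    # Assumed rule: new centroids at C_big -/+ r * sqrt(R_big), r in [0, 1).
    center = C[big]
    offset = [rng.random() * math.sqrt(max(v, 0.0)) for v in R[big]]
    C[small] = [c - o for c, o in zip(center, offset)]  # replaces removed centroid
    C[big] = [c + o for c, o in zip(center, offset)]
    return C


rng = random.Random(0)
old_C = [[0.0], [100.0]]
new_C = split_if_small(old_C, R=[[4.0], [0.0]], N=[99, 1], min_size=5, rng=rng)
unchanged = split_if_small(old_C, R=[[4.0], [0.0]], N=[99, 1], min_size=1, rng=rng)
```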
RKM vs. Standard K-means: 
Random Dataset
RKM vs. Standard K-means: 
Initial Cluster Centroids 
K = 3
Cluster assignment: results after one pass over all the data
• Standard K-means: many iterations needed
• RKM: 2 more iterations
RKM: Database design 
• Relational schema for sparse data representation: D(pid, inx, value)
• For the other matrices: one I/O per matrix row, to minimize I/O
[Table: matrix access in the E step (assignment step) and the M step (update step)]
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
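To make the sparse schema concrete, here is a small sketch using Python's built-in `sqlite3` module. Only the table shape D(pid, inx, value) comes from the slide; the column types, the primary key, the 1-based meaning of `inx` as a dimension index, and the helper `load_point` are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE D (pid INTEGER, inx INTEGER, value REAL, "
    "PRIMARY KEY (pid, inx))"
)

# Dense points keyed by point id; zero entries are simply not stored.
dense = {1: [0.0, 2.5, 0.0], 2: [1.0, 0.0, 0.0], 3: [0.0, 0.0, 4.0]}
rows = [
    (pid, inx, v)
    for pid, vec in dense.items()
    for inx, v in enumerate(vec, start=1)  # assumed 1-based dimension index
    if v != 0.0
]
conn.executemany("INSERT INTO D VALUES (?, ?, ?)", rows)


def load_point(conn, pid, d):
    """Rebuild one dense point of dimension d from its sparse rows."""
    vec = [0.0] * d
    query = "SELECT inx, value FROM D WHERE pid = ?"
    for inx, value in conn.execute(query, (pid,)):
        vec[inx - 1] = value
    return vec


stored = conn.execute("SELECT COUNT(*) FROM D").fetchone()[0]
point_1 = load_point(conn, 1, d=3)
```

Nine dense values shrink to three stored rows here, which is the reason for choosing a (pid, inx, value) layout when most entries are zero.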
Performance Comparison 
• RKM (disk-based) 
• Memory-based:
– Standard K-means
– Scalable K-means
Quantization error:
$$\mathrm{Quan.error}(C) = \sum_{j=1..k} \sum_{i \in P_j} \mathrm{dist}(x_i, C_j)$$
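The quantization error can be computed directly from a clustering result. The sketch below assumes squared Euclidean distance, as on the K-means slide, and represents the partition P_j as a list of labels; the function name `quantization_error` is chosen for illustration.

```python
def quantization_error(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        sum((xj - cj) ** 2 for xj, cj in zip(x, centroids[lab]))
        for x, lab in zip(points, labels)
    )


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centroids = [(1.0, 0.0), (10.0, 0.0)]
error = quantization_error(points, labels, centroids)  # 1 + 1 + 0 = 2.0
```

A lower value means the points sit closer to their centroids, so it gives a single number for comparing the methods on the same dataset and the same k.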
Time Complexity of RKM 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Time Complexity 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Conclusion 
• RKM resolves some of the limitations of K-means
• RKM limits disk access (I/O)
• The final clustering is reached in 3 iterations
• On large datasets, RKM outperforms standard K-means
• Other limitations of K-means clustering still remain
Read more … 
General implementation in IPython notebook: 
http://goo.gl/YZScH9 
http://www.vahidmirjalili.com
