SlideShare a Scribd company logo
1 of 22
Clustering on Database Systems 
Vahid Mirjalili 
Michigan State University
Clustering 
• Partitioning data into groups 
Items in the same group should have higher similarity to each 
other than items from different groups 
• A similarity/dissimilarity measure 
• Examples: 
 Clustering patients in a hospital 
 Genomic clustering 
 Hand-written character recognition 
A. Jain, “Data Clustering: 50 years beyond K-means”
Clustering vs. Classification 
Reinforcement 
learning 
Predictive Modeling Tasks 
Unsupervised 
Learning 
• Classification is supervised 
Supervised 
Learning 
– class labels are provided; 
– learn a classifier to predict class labels of novel/unseen 
data 
• Clustering is unsupervised or semi-supervised; 
– No class label is give 
– Understand the structure underlying your data
Clustering Approaches 
 Probability-based 
– Assuming statistical independence among features 
– Inefficient updating and storing clusters 
 Distance-based 
– Assuming direct access to all data points 
– Hierarchical clustering: O(N2), not giving the best clustering
Distance-Based Clustering Algorithms 
• kmeans and its variants (kmedoids, kernel 
kmeans, fuzzy c-means, …) 
• Density based methods (DBSCAN) 
• Hierarchical methods
Challenges 
• Unknown number of clusters (from 1 to N) 
Input data K=2 K=6 
You always get some 
output as clusters 
Are they really distinct 
clusters? 
A. Jain, “Data Clustering: 50 years beyond K-means”
Challenges 
• Clusters with different shapes, sizes and 
densities 
Shapes: globular shape, linear vs. non-linear 
shapes 
A. Jain, “Data Clustering: 50 years beyond K-means”
Standard K-Means Algorithm 
• Find initial Cluster centroids randomly 
• An iterative algorithm 
1. Assignment step: assign each data point the 
cluster whose mean is closest (smallest distance) 
2. Update step: update the mean (centroid) of each 
cluster 
Distance: squared Euclidean distance 
( , )  
dist x   x  
 
j j  1  
 
Centroid: mean of feature vectors  
 
 
i C 
 
i 
C 
X 
N 
2 
  
1 
d 
j
Standard K-Means Algorithm
Problem in Database-oriented 
Clustering 
• Low memory available compared to size of 
dataset  data doesn’t fit in main memory 
• High I/O 
• Necessary to avoid too many iterations
RKM: An Efficient Disk-based KMeans 
Method 
• Find the initial centroids by 
• Only 3 iterations: 
r d c all      / 
– Assign every L points to nearest centroids; 
– Update the cluster centroids 
• Minor efficiency tricks: 
N L  
– Keep track of LS, SS and Nc for each cluster during 
assignment  update step: 
c c   LS / N
Implementation of RKM: 
storing data matrices 
• D  input dataset 
• Pj 
 cluster j (for j in [1..k]) 
• Mj, Qj, Nj 
 Linear Sum, Squared Sum, cluster 
size 
• Cj, Rj, Wj 
 Centroids, Variances, Weights 
(accessed during update step) 
C  
M / 
N 
j j j 
R  Q / N  
M M / 
N 
 
 
 
j j l 
l k 
j 
t 
j j j j j 
W N N 
1.. 
2 
/
RKM avoids local minima: 
split large clusters 
• Only performed if size of a cluster is less than 
a user-defined threshold 
1. Remove the centroid of the small cluster 
2. Find the largest cluster (largest Wj) 
3. Randomly choose two centroids for the largest 
cluster (using Cj, and Rj) 
4. Reassign the items of small and large clusters
RKM vs. Standard K-means: 
Random Dataset
RKM vs. Standard K-means: 
Initial Cluster Centroids 
K = 3
Cluster assignment: 
Results after one pass over all the data 
Many iterations needed 2 more iterations
RKM: Database design 
• Relational schema for sparse data 
representation: D(pid, inx, value) 
• For other matrices: doing 1 I/O per matrix row 
to minimize I/O 
Matrix access 
E step (assignment step) 
M step (update step) 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Performance Comparison 
• RKM (disk-based) 
• Memory based: 
– Standard K-means 
– Scalable K-means 
  
  
C dist x C 
Quan.error( )  
( , ) 
j k i P 
i j 
j 
1..
Time Complexity of RKM 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Time Complexity 
Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for 
Relational Databases”
Conclusion 
• RKM resolve some of the limitations of K-means 
• RKM limits disk access (I/O) 
• Final clustering is achieved with 3 iterations 
• On large datasets RKM outperforms standard K-means 
• Other limitations of K-means clustering still 
remain
Read more … 
General implementation in IPython notebook: 
http://goo.gl/YZScH9 
http://www.vahidmirjalili.com

More Related Content

What's hot

Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...
Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...
Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...Edureka!
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
Network defenses
Network defensesNetwork defenses
Network defensesG Prachi
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)PyData
 
Key mechanism of mobile ip
Key mechanism of mobile ip Key mechanism of mobile ip
Key mechanism of mobile ip priya Nithya
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningMohamed Loey
 
0 1 knapsack using naive recursive approach and top-down dynamic programming ...
0 1 knapsack using naive recursive approach and top-down dynamic programming ...0 1 knapsack using naive recursive approach and top-down dynamic programming ...
0 1 knapsack using naive recursive approach and top-down dynamic programming ...Abhishek Singh
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
K MEANS CLUSTERING.pptx
K MEANS CLUSTERING.pptxK MEANS CLUSTERING.pptx
K MEANS CLUSTERING.pptxkibriaswe
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityRushali Deshmukh
 
Knapsack Problem (DP & GREEDY)
Knapsack Problem (DP & GREEDY)Knapsack Problem (DP & GREEDY)
Knapsack Problem (DP & GREEDY)Ridhima Chowdhury
 
Transport layer (computer networks)
Transport layer (computer networks)Transport layer (computer networks)
Transport layer (computer networks)Fatbardh Hysa
 
0 1 knapsack using branch and bound
0 1 knapsack using branch and bound0 1 knapsack using branch and bound
0 1 knapsack using branch and boundAbhishek Singh
 

What's hot (20)

Broadband isdn and atm
Broadband  isdn and atmBroadband  isdn and atm
Broadband isdn and atm
 
Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...
Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...
Tkinter Python Tutorial | Python GUI Programming Using Tkinter Tutorial | Pyt...
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Network defenses
Network defensesNetwork defenses
Network defenses
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Transport layer protocols : TCP and UDP
Transport layer protocols  : TCP and UDPTransport layer protocols  : TCP and UDP
Transport layer protocols : TCP and UDP
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)
 
Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...
Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...
Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...
 
Key mechanism of mobile ip
Key mechanism of mobile ip Key mechanism of mobile ip
Key mechanism of mobile ip
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
0 1 knapsack using naive recursive approach and top-down dynamic programming ...
0 1 knapsack using naive recursive approach and top-down dynamic programming ...0 1 knapsack using naive recursive approach and top-down dynamic programming ...
0 1 knapsack using naive recursive approach and top-down dynamic programming ...
 
Sandbox
SandboxSandbox
Sandbox
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
K MEANS CLUSTERING.pptx
K MEANS CLUSTERING.pptxK MEANS CLUSTERING.pptx
K MEANS CLUSTERING.pptx
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
 
Virtual Private Network
Virtual Private NetworkVirtual Private Network
Virtual Private Network
 
Knapsack Problem (DP & GREEDY)
Knapsack Problem (DP & GREEDY)Knapsack Problem (DP & GREEDY)
Knapsack Problem (DP & GREEDY)
 
Transport layer (computer networks)
Transport layer (computer networks)Transport layer (computer networks)
Transport layer (computer networks)
 
0 1 knapsack using branch and bound
0 1 knapsack using branch and bound0 1 knapsack using branch and bound
0 1 knapsack using branch and bound
 

Viewers also liked

Absolute and Relative Clustering
Absolute and Relative ClusteringAbsolute and Relative Clustering
Absolute and Relative ClusteringToshihiro Kamishima
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptbutest
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
Malware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveMalware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveChong-Kuan Chen
 
Statistical classification: A review on some techniques
Statistical classification: A review on some techniquesStatistical classification: A review on some techniques
Statistical classification: A review on some techniquesGiorgos Bamparopoulos
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Longhow Lam
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101Renato Jovic
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture modelsVu Pham
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarAndrew Morgan
 
Machine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksMachine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksAnna Förster
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine LearningAmazon Web Services
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognitionHưng Đặng
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 

Viewers also liked (15)

Absolute and Relative Clustering
Absolute and Relative ClusteringAbsolute and Relative Clustering
Absolute and Relative Clustering
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.ppt
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
Malware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning PerspectiveMalware Detection - A Machine Learning Perspective
Malware Detection - A Machine Learning Perspective
 
Statistical classification: A review on some techniques
Statistical classification: A review on some techniquesStatistical classification: A review on some techniques
Statistical classification: A review on some techniques
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)
 
Azure Machine Learning 101
Azure Machine Learning 101Azure Machine Learning 101
Azure Machine Learning 101
 
K-means, EM and Mixture models
K-means, EM and Mixture modelsK-means, EM and Mixture models
K-means, EM and Mixture models
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinar
 
Machine Learning for Body Sensor Networks
Machine Learning for Body Sensor NetworksMachine Learning for Body Sensor Networks
Machine Learning for Body Sensor Networks
 
Introduction to pattern recognition
Introduction to pattern recognitionIntroduction to pattern recognition
Introduction to pattern recognition
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognition
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to Clustering on database systems rkm

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptxNANDHINIS900805
 

Similar to Clustering on database systems rkm (20)

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
 

Recently uploaded

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 

Recently uploaded (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 

Clustering on database systems rkm

  • 1. Clustering on Database Systems Vahid Mirjalili Michigan State University
  • 2. Clustering • Partitioning data into groups Items in the same group should have higher similarity to each other than items from different groups • A similarity/dissimilarity measure • Examples:  Clustering patients in a hospital  Genomic clustering  Hand-written character recognition A. Jain, “Data Clustering: 50 years beyond K-means”
  • 3. Clustering vs. Classification Reinforcement learning Predictive Modeling Tasks Unsupervised Learning • Classification is supervised Supervised Learning – class labels are provided; – learn a classifier to predict class labels of novel/unseen data • Clustering is unsupervised or semi-supervised; – No class label is give – Understand the structure underlying your data
  • 4. Clustering Approaches  Probability-based – Assuming statistical independence among features – Inefficient updating and storing clusters  Distance-based – Assuming direct access to all data points – Hierarchical clustering: O(N2), not giving the best clustering
  • 5. Distance-Based Clustering Algorithms • kmeans and its variants (kmedoids, kernel kmeans, fuzzy c-means, …) • Density based methods (DBSCAN) • Hierarchical methods
  • 6. Challenges • Unknown number of clusters (from 1 to N) Input data K=2 K=6 You always get some output as clusters Are they really distinct clusters? A. Jain, “Data Clustering: 50 years beyond K-means”
  • 7. Challenges • Clusters with different shapes, sizes and densities Shapes: globular shape, linear vs. non-linear shapes A. Jain, “Data Clustering: 50 years beyond K-means”
  • 8. Standard K-Means Algorithm • Find initial Cluster centroids randomly • An iterative algorithm 1. Assignment step: assign each data point the cluster whose mean is closest (smallest distance) 2. Update step: update the mean (centroid) of each cluster Distance: squared Euclidean distance ( , )  dist x   x   j j  1   Centroid: mean of feature vectors    i C  i C X N 2   1 d j
  • 10. Problem in Database-oriented Clustering • Low memory available compared to size of dataset  data doesn’t fit in main memory • High I/O • Necessary to avoid too many iterations
  • 11. RKM: An Efficient Disk-based KMeans Method • Find the initial centroids by • Only 3 iterations: r d c all      / – Assign every L points to nearest centroids; – Update the cluster centroids • Minor efficiency tricks: N L  – Keep track of LS, SS and Nc for each cluster during assignment  update step: c c   LS / N
  • 12. Implementation of RKM: storing data matrices • D  input dataset • Pj  cluster j (for j in [1..k]) • Mj, Qj, Nj  Linear Sum, Squared Sum, cluster size • Cj, Rj, Wj  Centroids, Variances, Weights (accessed during update step) C  M / N j j j R  Q / N  M M / N    j j l l k j t j j j j j W N N 1.. 2 /
  • 13. RKM avoids local minima: split large clusters • Only performed if size of a cluster is less than a user-defined threshold 1. Remove the centroid of the small cluster 2. Find the largest cluster (largest Wj) 3. Randomly choose two centroids for the largest cluster (using Cj, and Rj) 4. Reassign the items of small and large clusters
  • 14. RKM vs. Standard K-means: Random Dataset
  • 15. RKM vs. Standard K-means: Initial Cluster Centroids K = 3
  • 16. Cluster assignment: Results after one pass over all the data Many iterations needed 2 more iterations
  • 17. RKM: Database design • Relational schema for sparse data representation: D(pid, inx, value) • For other matrices: doing 1 I/O per matrix row to minimize I/O Matrix access E step (assignment step) M step (update step) Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 18. Performance Comparison • RKM (disk-based) • Memory based: – Standard K-means – Scalable K-means     C dist x C Quan.error( )  ( , ) j k i P i j j 1..
  • 19. Time Complexity of RKM Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 20. Time Complexity Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
  • 21. Conclusion • RKM resolve some of the limitations of K-means • RKM limits disk access (I/O) • Final clustering is achieved with 3 iterations • On large datasets RKM outperforms standard K-means • Other limitations of K-means clustering still remain
  • 22. Read more … General implementation in IPython notebook: http://goo.gl/YZScH9 http://www.vahidmirjalili.com