mlcourse.ai. Clustering
Yury Kashnitskiy, Dmitry Ignatov
Higher School of Economics
November 16, 2018
(Higher School of Economics) Clustering 16.11.2018 1 / 24
Plan
1 Clustering
Problem formulation
Applications
2 Clustering methods
k-Means
Hierarchical methods
Agglomerative clustering
Density-based methods
Problem formulation
The main task of cluster analysis is to group instances into subgroups (clusters) of
similar ones.
These groups can be
Partitions
Hierarchies
Fuzzy partitions
Biclusters
Mixtures of distributions
Applications
Biology and medicine
Gene expression analysis
Tomography clustering
Humanities and social sciences
Sociology and anthropology
Psychology
Technical systems
Telemetry
Image segmentation
Marketing
Customer segmentation
Subgroup behavioral analysis
Text analytics
News clustering
Social networks
Community detection
How to measure dissimilarity of instances
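Instances $x \in \mathbb{R}^m$ are represented as a feature matrix, one row per instance.
\[
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\iff
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\]
Minkowski distance
\[
d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{\frac{1}{p}}
\]
Cosine distance
\[
d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}
\]
Hamming distance
\[
d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]
\]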
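The three distances above can be sketched in plain Python. This is only an illustration: the function names and the sample vectors are made up here, and the cosine distance assumes both vectors are non-zero.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def hamming(x, y):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, p=1))           # Manhattan distance
print(minkowski(x, y, p=2))           # Euclidean distance
print(cosine_distance(x, y))          # orthogonal vectors
print(hamming("karolin", "kathrin"))  # categorical features as characters
```

The Hamming distance only compares features for equality, so it also works for categorical data where subtraction is meaningless.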
k-Means
k-Means is an iterative algorithm that splits data into k clusters.
The centroid $c_j$ of a cluster $C_j$ is its center of mass, that is, the arithmetic mean of its instances:
\[
c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
\]
The objective is the sum of squared distances between the instances and the
centroids of the clusters to which these instances belong:
\[
J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2
\]
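A minimal sketch of the centroid and the objective J(C), assuming d is the Euclidean distance and that a clustering is given as a list of clusters, each a list of points. The helper names and the toy data are illustrative only.

```python
def centroid(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def objective(clusters):
    """J(C): sum over clusters of squared Euclidean distances to the centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)
    return total

clusters = [
    [[0.0, 0.0], [2.0, 0.0]],              # centroid (1, 0)
    [[5.0, 5.0], [5.0, 7.0], [5.0, 9.0]],  # centroid (5, 7)
]
print(centroid(clusters[0]))
print(objective(clusters))
```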
The algorithm
Input: the data and the number of clusters k (a hyperparameter)
Output: a partition of the data into k clusters
* * *
1. Initialization: choose k points as the initial centroids.
2. Update clusters: given the k centroids, each instance is assigned to its
nearest centroid. All instances assigned to a centroid $c_j$
(j = 1 . . . k) form the cluster $C_j$.
3. Update centroids: for each cluster $C_j$, a new centroid is computed as the
mean of all instances in this cluster.
Steps 2-3 are repeated until convergence, that is, until the assignments stop changing.
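The steps above can be sketched as follows. This is a plain-Python illustration rather than a reference implementation. It assumes Euclidean distance, initial centroids sampled at random from the data, and an empty cluster keeping its previous centroid. The names `k_means`, `seed` and `max_iter` are choices made for this example.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct instances as the initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

On data with poorly separated groups the result depends on the initial centroids, so in practice the algorithm is usually run several times with different seeds and the run with the lowest J is kept.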
k-Means. Example
Clustering quality and the number of clusters
Elbow method
For each k we can calculate J(C).
We then look for the k beyond which further increasing k no longer decreases J “too much”.
Formally, we look for the k that minimizes the following D(k):
\[
D(k) = \frac{|J(k) - J(k + 1)|}{|J(k - 1) - J(k)|}
\]
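As a small worked example of the criterion, the sketch below computes D(k) from a table of objective values and picks the minimizer. The J values are hypothetical numbers chosen for illustration, not results from the lecture, and the function name `elbow` is likewise made up.

```python
def elbow(J):
    """J maps k to the objective value; returns the k minimizing D(k) and all D values."""
    D = {}
    for k in sorted(J):
        # D(k) needs both neighbours and a non-zero denominator.
        if k - 1 in J and k + 1 in J and J[k - 1] != J[k]:
            D[k] = abs(J[k] - J[k + 1]) / abs(J[k - 1] - J[k])
    best_k = min(D, key=D.get)
    return best_k, D

# Hypothetical objective values for k = 1..6
J = {1: 4000.0, 2: 1800.0, 3: 400.0, 4: 350.0, 5: 320.0, 6: 300.0}
best_k, D = elbow(J)
print(best_k)
print(D)
```

Here J drops sharply up to k = 3 and only slowly afterwards, so D(3) is the smallest ratio.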
[Figure: a scatter plot of the sample data and the "Elbow Method" plot of J against k for k = 2, ..., 10.]
Silhouette
The silhouette of an instance $x_i$ in a cluster C is
\[
s(i) = \frac{b_i - a_i}{\max(a_i, b_i)},
\]
where $a_i$ is the mean distance from $x_i$ to all other instances in C, and $b_i$
is the mean distance from $x_i$ to the instances of the nearest other cluster.
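A minimal sketch of the silhouette of a single instance, assuming Euclidean distance, at least two clusters, and the common convention that an instance alone in its cluster gets a silhouette of 0. The function name and the toy data are illustrative.

```python
import math

def silhouette(i, data, labels):
    """Silhouette s(i) of instance i for a given cluster labelling."""
    own = labels[i]
    same, other = [], {}
    for j, label in enumerate(labels):
        if j == i:
            continue
        d = math.dist(data[i], data[j])
        if label == own:
            same.append(d)
        else:
            other.setdefault(label, []).append(d)
    if not same:  # singleton cluster (convention assumed here)
        return 0.0
    a = sum(same) / len(same)                            # mean distance inside own cluster
    b = min(sum(ds) / len(ds) for ds in other.values())  # nearest other cluster
    return (b - a) / max(a, b)

data = [[0.0], [1.0], [10.0], [11.0]]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups
for labels in (good, bad):
    scores = [silhouette(i, data, labels) for i in range(len(data))]
    print(sum(scores) / len(scores))
```

Values close to 1 indicate that an instance sits well inside its cluster, and negative values suggest it would fit better in another one.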
Silhouette
Acceptable number of clusters
Bad number of clusters
Hierarchical methods
From a feature matrix we can move to a pairwise distance matrix.
\[
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\Rightarrow
\begin{pmatrix}
d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\
d(x_2, x_1) & d(x_2, x_2) & \cdots & d(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n)
\end{pmatrix}
\]
Since $d(x_i, x_i) = 0$ and $d(x_i, x_j) = d(x_j, x_i)$, it is enough to keep the upper triangle:
\[
\begin{pmatrix}
0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\
  & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\
  &   & \ddots & \cdots & \vdots \\
  &   &   & 0 & d(x_{n-1}, x_n) \\
  &   &   &   & 0
\end{pmatrix}
\]
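The move from a feature matrix to a pairwise distance matrix can be sketched as below. Only the upper triangle is computed, and the lower triangle is filled in by symmetry. The Euclidean distance and the function names are assumptions made for this example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(X, dist=euclidean):
    """Full n-by-n distance matrix built from n*(n-1)/2 distance evaluations."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays zero
    for i in range(n):
        for j in range(i + 1, n):      # upper triangle only
            D[i][j] = D[j][i] = dist(X[i], X[j])
    return D

X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in pairwise_distances(X):
    print(row)
```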
Agglomerative clustering
Sequential merging of similar clusters:
0. Start with each instance in its own cluster.
1. Find the two closest clusters.
2. Merge them.
Repeat steps 1-2 until all instances are in the same cluster.
How to define distance between clusters?
Linkage
1. Single Linkage
\[
d(A, B) = \min_{x \in A,\, y \in B} d(x, y)
\]
2. Complete Linkage
\[
d(A, B) = \max_{x \in A,\, y \in B} d(x, y)
\]
3. Average Linkage
\[
d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)
\]
4. Weighted Average Linkage
Let cluster A be the union of clusters q and p. Then
\[
d(A, B) = \frac{d(p, B) + d(q, B)}{2}
\]
5. Centroid Linkage
\[
d(A, B) = \| c_A - c_B \|_2
\]
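Four of the linkages above can be written directly from their definitions, as in the sketch below. It assumes Euclidean distance between instances and clusters given as lists of points, and the function names are illustrative. Weighted average linkage is left out because it is defined recursively through the two clusters that were merged, so it needs the merge history rather than just the points.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(A, B, dist=euclidean):
    """Distance between the closest pair of instances."""
    return min(dist(x, y) for x, y in product(A, B))

def complete_linkage(A, B, dist=euclidean):
    """Distance between the farthest pair of instances."""
    return max(dist(x, y) for x, y in product(A, B))

def average_linkage(A, B, dist=euclidean):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(x, y) for x, y in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B):
    """Euclidean distance between the cluster centroids."""
    c_a = [sum(col) / len(A) for col in zip(*A)]
    c_b = [sum(col) / len(B) for col in zip(*B)]
    return euclidean(c_a, c_b)

A = [[0.0], [2.0]]
B = [[5.0], [9.0]]
for linkage in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(linkage.__name__, linkage(A, B))
```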
Merging clusters can be depicted with a dendrogram.
Let us take a look at a 1D sample: { 1, 2, 3, 7, 10, 12, 25, 29 }
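[Figure: dendrogram of the sample. Horizontal axis: objects 1, 2, 3, 7, 10, 12, 25, 29. Vertical axis: cluster distances, from 0 to 25. Clusters A, B and C are marked, with the distance between clusters A and B indicated.]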
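The merge order that a dendrogram depicts can be reproduced with a naive agglomerative loop on the 1D sample. The slide does not say which linkage the figure uses, so single linkage is an assumption here, and the function names are illustrative.

```python
def single_linkage(A, B):
    """Smallest absolute difference between a point of A and a point of B."""
    return min(abs(x - y) for x in A for y in B)

def agglomerative(points, linkage):
    """Return the merges as (cluster_a, cluster_b, distance), in merge order."""
    clusters = [[p] for p in points]  # step 0: every instance is its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Step 2: merge them.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

sample = [1, 2, 3, 7, 10, 12, 25, 29]
merges = agglomerative(sample, single_linkage)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Each merge distance is the height at which the two branches join in the dendrogram. Cutting the tree at a chosen height gives a flat partition.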
Density-based methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
DBSCAN algorithm
All points can be divided into core points (elements of dense regions), border points,
and noise (the formal definitions are skipped here).
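The three kinds of points can be illustrated with the definitions commonly used for DBSCAN, which the slides skip. A point is a core point if at least M points, itself included, lie within distance Eps of it. A border point is not a core point but lies within Eps of one. Everything else is noise. The sketch assumes Euclidean distance, and the function names and toy data are illustrative.

```python
import math

def region_query(data, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

def classify_points(data, eps, m):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbors = [region_query(data, i, eps) for i in range(len(data))]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= m}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):  # close to a dense region
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

data = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
print(classify_points(data, eps=1.5, m=4))
```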
DBSCAN. Example
Hyperparameters: M = 4 (the minimum number of points required in a neighborhood), Eps > 0 (the neighborhood radius)
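A compact sketch of the clustering procedure itself, under the same assumptions as the usual textbook formulation. Clusters are grown from core points by repeatedly adding everything within Eps, only core points expand a cluster further, and points never reached are noise. Euclidean distance, the label -1 for noise, and the brute-force neighbor search are choices made for this example.

```python
import math
from collections import deque

def dbscan(data, eps, m):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(data)
    # Brute-force neighbor lists, quadratic in n.
    neighbors = [
        [j for j in range(n) if math.dist(data[i], data[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or len(neighbors[i]) < m:
            continue  # already assigned, or not a core point
        cluster += 1  # start a new cluster from this core point
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= m:  # only core points expand the cluster
                    queue.extend(neighbors[j])
    return [-1 if label is None else label for label in labels]

data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (5, 5), (5, 6), (6, 5), (6, 6),
        (20, 20)]
print(dbscan(data, eps=1.5, m=4))
```

Unlike k-Means, the number of clusters is not given in advance. It follows from M and Eps.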
DBSCAN. Pros and cons
Pros
+ Can find clusters of any shape
+ Easy to implement
+ Can find noise in data
+ Good complexity: $O(n \log n)$ with a suitable spatial index
(otherwise $O(n^2)$)
Cons
- Requires choosing hyperparameters (M and Eps)
- Doesn’t work well when clusters differ in density
- Depends on the chosen metric
Contacts
Questions
Thanks!
Please ask your questions in the OpenDataScience Slack team.
http://ods.ai
(Higher School of Economics) Clustering 16.11.2018 24 / 24

More Related Content

What's hot

5HBC: How to Graph Implicit Relations Intro Packet!
5HBC: How to Graph Implicit Relations Intro Packet!5HBC: How to Graph Implicit Relations Intro Packet!
5HBC: How to Graph Implicit Relations Intro Packet!A Jorge Garcia
 
Tutorial1
Tutorial1Tutorial1
Tutorial1
Soon Yau Cheong
 
Data Science for Number and Coding Theory
Data Science for Number and Coding TheoryData Science for Number and Coding Theory
Data Science for Number and Coding Theory
Capgemini
 
Add Math(F5) Graph Of Function Ii 2.1
Add Math(F5) Graph Of Function Ii 2.1Add Math(F5) Graph Of Function Ii 2.1
Add Math(F5) Graph Of Function Ii 2.1roszelan
 
IRJET- Solving Quadratic Equations using C++ Application Program
IRJET-  	  Solving Quadratic Equations using C++ Application ProgramIRJET-  	  Solving Quadratic Equations using C++ Application Program
IRJET- Solving Quadratic Equations using C++ Application Program
IRJET Journal
 
Presentation of my master thesis - Image Processing
Presentation of my master thesis - Image ProcessingPresentation of my master thesis - Image Processing
Presentation of my master thesis - Image Processing
MichaelRra
 
Cmb part3
Cmb part3Cmb part3
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
Jordan Open Source Association
 
11.polynomial regression model of making cost prediction in mixed cost analysis
11.polynomial regression model of making cost prediction in mixed cost analysis11.polynomial regression model of making cost prediction in mixed cost analysis
11.polynomial regression model of making cost prediction in mixed cost analysisAlexander Decker
 
Polynomial regression model of making cost prediction in mixed cost analysis
Polynomial regression model of making cost prediction in mixed cost analysisPolynomial regression model of making cost prediction in mixed cost analysis
Polynomial regression model of making cost prediction in mixed cost analysis
Alexander Decker
 
Embeddings the geometry of relational algebra
Embeddings  the geometry of relational algebraEmbeddings  the geometry of relational algebra
Embeddings the geometry of relational algebra
Nikolaos Vasiloglou
 
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
theijes
 
Conference on theoretical and applied computer science
Conference on theoretical and applied computer scienceConference on theoretical and applied computer science
Conference on theoretical and applied computer scienceSandeep Katta
 
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...Cemal Ardil
 
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
IRJET Journal
 
Tutorial7
Tutorial7Tutorial7
Tutorial7
Soon Yau Cheong
 
Mcqs -Matrices and determinants
Mcqs -Matrices and determinantsMcqs -Matrices and determinants
Mcqs -Matrices and determinants
s9182647608y
 

What's hot (18)

5HBC: How to Graph Implicit Relations Intro Packet!
5HBC: How to Graph Implicit Relations Intro Packet!5HBC: How to Graph Implicit Relations Intro Packet!
5HBC: How to Graph Implicit Relations Intro Packet!
 
Tutorial1
Tutorial1Tutorial1
Tutorial1
 
Data Science for Number and Coding Theory
Data Science for Number and Coding TheoryData Science for Number and Coding Theory
Data Science for Number and Coding Theory
 
Add Math(F5) Graph Of Function Ii 2.1
Add Math(F5) Graph Of Function Ii 2.1Add Math(F5) Graph Of Function Ii 2.1
Add Math(F5) Graph Of Function Ii 2.1
 
IRJET- Solving Quadratic Equations using C++ Application Program
IRJET-  	  Solving Quadratic Equations using C++ Application ProgramIRJET-  	  Solving Quadratic Equations using C++ Application Program
IRJET- Solving Quadratic Equations using C++ Application Program
 
Presentation of my master thesis - Image Processing
Presentation of my master thesis - Image ProcessingPresentation of my master thesis - Image Processing
Presentation of my master thesis - Image Processing
 
Cmb part3
Cmb part3Cmb part3
Cmb part3
 
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
 
11.polynomial regression model of making cost prediction in mixed cost analysis
11.polynomial regression model of making cost prediction in mixed cost analysis11.polynomial regression model of making cost prediction in mixed cost analysis
11.polynomial regression model of making cost prediction in mixed cost analysis
 
Polynomial regression model of making cost prediction in mixed cost analysis
Polynomial regression model of making cost prediction in mixed cost analysisPolynomial regression model of making cost prediction in mixed cost analysis
Polynomial regression model of making cost prediction in mixed cost analysis
 
Embeddings the geometry of relational algebra
Embeddings  the geometry of relational algebraEmbeddings  the geometry of relational algebra
Embeddings the geometry of relational algebra
 
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
An Efficient Elliptic Curve Cryptography Arithmetic Using Nikhilam Multiplica...
 
Conference on theoretical and applied computer science
Conference on theoretical and applied computer scienceConference on theoretical and applied computer science
Conference on theoretical and applied computer science
 
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
 
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
Integral Solutions of the Ternary Cubic Equation 3(x2+y2)-4xy+2(x+y+1)=972z3
 
Tutorial7
Tutorial7Tutorial7
Tutorial7
 
Assignment 1
Assignment 1Assignment 1
Assignment 1
 
Mcqs -Matrices and determinants
Mcqs -Matrices and determinantsMcqs -Matrices and determinants
Mcqs -Matrices and determinants
 

Similar to mlcourse.ai. Clustering

A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
IRJET Journal
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
Mohaiminur Rahman
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
Frank Nielsen
 
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
The Statistical and Applied Mathematical Sciences Institute
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clusteringIAEME Publication
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clusteringprjpublications
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
arogozhnikov
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
talktoharry
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Rafael Nogueras
 
Data clustering
Data clustering Data clustering
Data clustering
GARIMA SHAKYA
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
mqasimsheikh5
 
Extracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept AnalysisExtracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept Analysis
INSA Lyon - L'Institut National des Sciences Appliquées de Lyon
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
csandit
 
Ica group 3[1]
Ica group 3[1]Ica group 3[1]
Ica group 3[1]
Apoorva Srinivasan
 
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
csandit
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
IJERA Editor
 
Second subjective assignment
Second  subjective assignmentSecond  subjective assignment
Second subjective assignment
yatheeshabodumalla
 

Similar to mlcourse.ai. Clustering (20)

A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
QMC: Undergraduate Workshop, Introduction to Monte Carlo Methods with 'R' Sof...
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
 
4 image segmentation through clustering
4 image segmentation through clustering4 image segmentation through clustering
4 image segmentation through clustering
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
 
Data clustering
Data clustering Data clustering
Data clustering
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
Extracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept AnalysisExtracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept Analysis
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
Ica group 3[1]
Ica group 3[1]Ica group 3[1]
Ica group 3[1]
 
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
E XTENDED F AST S EARCH C LUSTERING A LGORITHM : W IDELY D ENSITY C LUSTERS ,...
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
Second subjective assignment
Second  subjective assignmentSecond  subjective assignment
Second subjective assignment
 

More from Yury Kashnitsky

How to jump into Data Science
How to jump into Data ScienceHow to jump into Data Science
How to jump into Data Science
Yury Kashnitsky
 
mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0
Yury Kashnitsky
 
Benchmarking transfer learning approaches for NLP
Benchmarking transfer learning approaches for NLPBenchmarking transfer learning approaches for NLP
Benchmarking transfer learning approaches for NLP
Yury Kashnitsky
 
Gender-unbiased BERT-based Pronoun Resolution
Gender-unbiased BERT-based  Pronoun ResolutionGender-unbiased BERT-based  Pronoun Resolution
Gender-unbiased BERT-based Pronoun Resolution
Yury Kashnitsky
 
mlcourse.ai. Outro
mlcourse.ai. Outromlcourse.ai. Outro
mlcourse.ai. Outro
Yury Kashnitsky
 
Time series forecasting with ARIMA
Time series forecasting with ARIMATime series forecasting with ARIMA
Time series forecasting with ARIMA
Yury Kashnitsky
 
mlcourse.ai, introduction, course overview
mlcourse.ai, introduction, course overviewmlcourse.ai, introduction, course overview
mlcourse.ai, introduction, course overview
Yury Kashnitsky
 
Необычные модели Playboy, или про поиск аномалий в данных
Необычные модели Playboy, или про поиск аномалий в данныхНеобычные модели Playboy, или про поиск аномалий в данных
Необычные модели Playboy, или про поиск аномалий в данных
Yury Kashnitsky
 

More from Yury Kashnitsky (8)

How to jump into Data Science
How to jump into Data ScienceHow to jump into Data Science
How to jump into Data Science
 
mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0mlcourse.ai fall2019 Live Session 0
mlcourse.ai fall2019 Live Session 0
 
Benchmarking transfer learning approaches for NLP
Benchmarking transfer learning approaches for NLPBenchmarking transfer learning approaches for NLP
Benchmarking transfer learning approaches for NLP
 
Gender-unbiased BERT-based Pronoun Resolution
Gender-unbiased BERT-based  Pronoun ResolutionGender-unbiased BERT-based  Pronoun Resolution
Gender-unbiased BERT-based Pronoun Resolution
 
mlcourse.ai. Outro
mlcourse.ai. Outromlcourse.ai. Outro
mlcourse.ai. Outro
 
Time series forecasting with ARIMA
Time series forecasting with ARIMATime series forecasting with ARIMA
Time series forecasting with ARIMA
 
mlcourse.ai, introduction, course overview
mlcourse.ai, introduction, course overviewmlcourse.ai, introduction, course overview
mlcourse.ai, introduction, course overview
 
Необычные модели Playboy, или про поиск аномалий в данных
Необычные модели Playboy, или про поиск аномалий в данныхНеобычные модели Playboy, или про поиск аномалий в данных
Необычные модели Playboy, или про поиск аномалий в данных
 

Recently uploaded

Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Ashish Kohli
 
Landownership in the Philippines under the Americans-2-pptx.pptx
Landownership in the Philippines under the Americans-2-pptx.pptxLandownership in the Philippines under the Americans-2-pptx.pptx
Landownership in the Philippines under the Americans-2-pptx.pptx
JezreelCabil2
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 

mlcourse.ai. Clustering

  • 1. mlcourse.ai. Clustering. Yury Kashnitskiy, Dmitry Ignatov. Higher School of Economics. November 16, 2018
  • 2. Plan
    1. Clustering: problem formulation; applications
    2. Clustering methods: k-Means; hierarchical methods; agglomerative clustering; density-based methods
  • 4. Problem formulation
    The main task of cluster analysis is to group instances into subgroups (clusters) of similar ones. These groups can be:
    - Partitions
    - Hierarchies
    - Fuzzy partitions
    - Biclusters
    - Mixtures of distributions
  • 5. Applications
    - Biology and medicine: gene expression analysis, tomography clustering
    - Humanities and social sciences: sociology and anthropology, psychology
    - Technical systems: telemetry, image segmentation
    - Marketing: customer segmentation, subgroup behavioral analysis
    - Text analytics: news clustering
    - Social networks: community detection
  • 7. How to measure dissimilarity of instances
    Instances $x \in \mathbb{R}^m$ are represented as rows of a feature matrix:
    $$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \iff \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \cdots & \cdots & \cdots & \cdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix}$$
    Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}$
    Cosine distance: $d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}$
    Hamming distance: $d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]$
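The three distances on this slide can be sketched in plain Python. This is a minimal illustration, not code from the course: the function names are made up here, and instances are assumed to be equal-length sequences of numbers (or of arbitrary comparable values for the Hamming distance).

```python
import math


def minkowski(x, y, p=2):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (both non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Fraction of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


euclid = minkowski([0, 0], [3, 4])            # 3-4-5 triangle
manhattan = minkowski([0, 0], [3, 4], p=1)
orthogonal = cosine_distance([1, 0], [0, 1])  # perpendicular vectors
mismatch = hamming([1, 0, 1, 1], [1, 1, 1, 0])
```

Note that the cosine distance is undefined for a zero vector; the sketch does not guard against that case.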
  • 8. k-Means
    k-Means is an iterative algorithm that splits the data into k clusters. The centroid of a cluster $C_j$, denoted $c_j$, is the mean (geometric center) of its instances:
    $$c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$
    The objective is the sum of squared distances between the instances and the centroids of the clusters to which they belong:
    $$J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2$$
  • 9. k-Means: the algorithm
    Input: the data and the number of clusters k (a hyperparameter).
    Output: a partition of the data into k clusters.
    1. Initialization: choose k points as the initial centroids.
    2. Update clusters: given the k centroids, assign each instance to its nearest centroid. All instances assigned to a centroid $c_j$ ($j = 1, \dots, k$) form the cluster $C_j$.
    3. Update centroids: for each cluster $C_j$, compute the new centroid as the mean of all instances in the cluster.
    Steps 2-3 are repeated until convergence.
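The three steps can be sketched as follows. This is an illustrative pure-Python version under several assumptions that the slides do not state: instances are tuples of floats, the initial centroids are k distinct instances drawn at random, the distance is Euclidean, an empty cluster keeps its previous centroid, and convergence means the centroids stop changing. All function names are hypothetical.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_point(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    m = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(m))


def assign(data, centroids):
    """Step 2: put every instance into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for x in data:
        j = min(range(len(centroids)),
                key=lambda t: squared_distance(x, centroids[t]))
        clusters[j].append(x)
    return clusters


def k_means(data, k, max_iter=100, seed=0):
    """Return (centroids, clusters, objective J) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1 (assumed initialization)
    for _ in range(max_iter):
        clusters = assign(data, centroids)
        # Step 3: recompute centroids; an empty cluster keeps its old one.
        new_centroids = [mean_point(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    clusters = assign(data, centroids)
    objective = sum(squared_distance(x, centroids[j])
                    for j, c in enumerate(clusters) for x in c)
    return centroids, clusters, objective


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, objective = k_means(data, k=2)
```

On this toy data the two well-separated groups of three points are recovered. In general the result depends on the initialization, so in practice the algorithm is usually run several times and the partition with the lowest J is kept.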
  • 10. k-Means: example
  • 17. Clustering quality and the number of clusters: elbow method
    For each k we can calculate J(C); let J(k) denote the value obtained with k clusters. We then look for a k such that increasing it further no longer decreases J substantially. Formally, we look for the k that minimizes
    $$D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}$$
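As a small sketch of the criterion, the ratio D(k) can be computed directly from a table of objective values. The function names and the J values below are made up for illustration; they are not results from the lecture.

```python
def elbow_ratio(J):
    """Compute D(k) for every k that has both neighbours k-1 and k+1 in J.

    J is a dict mapping the number of clusters k to the objective J(k).
    """
    D = {}
    for k in sorted(J):
        if k - 1 in J and k + 1 in J:
            D[k] = abs(J[k] - J[k + 1]) / abs(J[k - 1] - J[k])
    return D


def best_k(J):
    """Return the k that minimizes D(k)."""
    D = elbow_ratio(J)
    return min(D, key=D.get)


# Hypothetical objective values: a sharp drop up to k = 3, then a plateau.
J = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 90.0, 5: 85.0}
D = elbow_ratio(J)
k_star = best_k(J)
```

Here D(2) = 300/600, D(3) = 10/300 and D(4) = 5/10, so the minimum is at k = 3. The sketch assumes J(k-1) differs from J(k); otherwise the denominator is zero.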
  • 18. Clustering quality and the number of clusters: elbow method
    [Figure: a scatter plot of the sample data and the "Elbow Method" plot of the objective J against the number of clusters k, for k from 2 to 10.]
  • 19. Clustering quality and the number of clusters: silhouette
    The silhouette of an instance $x_i$ in a cluster $C$ is
    $$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))},$$
    where $a(i)$ is the mean distance from $x_i$ to all other instances of $C$, and $b(i)$ is the mean distance from $x_i$ to the instances of the nearest other cluster.
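A minimal sketch of the silhouette of a single instance is given below. It assumes Euclidean distance, at least two clusters, and the usual reading of b(i) as the smallest mean distance to any other cluster; returning 0 for an instance that is alone in its cluster is a common convention, not something stated on the slide. The function names are illustrative.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(i, data, labels):
    """Silhouette s(i) of instance i, given one cluster label per instance."""
    own = labels[i]
    same = [euclidean(data[i], data[j])
            for j in range(len(data)) if j != i and labels[j] == own]
    if not same:
        return 0.0  # assumed convention for a single-instance cluster
    a = sum(same) / len(same)
    # b(i): mean distance to the nearest other cluster.
    b = math.inf
    for label in set(labels) - {own}:
        dists = [euclidean(data[i], data[j])
                 for j in range(len(data)) if labels[j] == label]
        b = min(b, sum(dists) / len(dists))
    return (b - a) / max(a, b)


data = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s0 = silhouette(0, data, labels)  # a = 1, b = 10.5
s1 = silhouette(1, data, labels)  # a = 1, b = 9.5
```

Values close to 1 indicate that an instance sits well inside its own cluster; averaging s(i) over all instances gives one score per candidate number of clusters.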
  • 20. Silhouette: an acceptable number of clusters
  • 21. Silhouette: a bad number of clusters
  • 22. Hierarchical methods
    From a feature matrix we can move to a pairwise distance matrix:
    $$\begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \cdots & \cdots & \cdots & \cdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix} \Rightarrow \begin{pmatrix} d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & d(x_2, x_2) & \cdots & d(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n) \end{pmatrix}$$
  • 23. Hierarchical methods
    Because $d(x_i, x_i) = 0$ and $d$ is symmetric, only the upper triangle of the distance matrix is needed:
    $$\begin{pmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ & & \ddots & \cdots & \cdots \\ & & & 0 & d(x_{n-1}, x_n) \\ & & & & 0 \end{pmatrix}$$
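The move from a feature matrix to a pairwise distance matrix can be sketched like this. The sketch assumes Euclidean distance by default and computes only the upper triangle, mirroring it into the lower one; the function names are illustrative.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(X, dist=euclidean):
    """Return the n x n matrix D with D[i][j] = dist(X[i], X[j])."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays zero
    for i in range(n):
        for j in range(i + 1, n):      # upper triangle only
            D[i][j] = D[j][i] = dist(X[i], X[j])
    return D


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distances(X)
```

Any other dissimilarity can be passed as `dist`, provided it is symmetric.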
  • 24. Hierarchical methods: agglomerative clustering
    Sequential merging of similar clusters:
    0. Start with each cluster containing a single instance.
    1. Find the two closest clusters.
    2. Merge them.
    Repeat steps 1-2 until all instances are in the same cluster.
    How do we define the distance between clusters?
  • 25. Agglomerative clustering: linkage
    1. Single linkage: $d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
    2. Complete linkage: $d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
  • 26. Agglomerative clustering: linkage
    3. Average linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)$
    4. Weighted average linkage: let cluster A be the union of clusters p and q. Then $d(A, B) = \frac{d(p, B) + d(q, B)}{2}$
    5. Centroid linkage: $d(A, B) = \lVert c_A - c_B \rVert_2$
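Four of the five linkages can be written as short functions of two clusters. This is a sketch under the assumption that clusters are lists of tuples and that the point-to-point distance is Euclidean; the names are illustrative. Weighted average linkage is left out because it depends on the merge history (the sub-clusters p and q), not only on the members of A and B.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(A, B, d=euclidean):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(x, y) for x in A for y in B)


def complete_linkage(A, B, d=euclidean):
    """Distance between the farthest pair of points."""
    return max(d(x, y) for x in A for y in B)


def average_linkage(A, B, d=euclidean):
    """Mean distance over all |A| * |B| cross-cluster pairs."""
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))


def centroid(A):
    """Coordinate-wise mean of the points of a cluster."""
    m = len(A[0])
    return tuple(sum(x[i] for x in A) / len(A) for i in range(m))


def centroid_linkage(A, B):
    """Euclidean distance between the two cluster centroids."""
    return euclidean(centroid(A), centroid(B))


A = [(1.0,), (2.0,), (3.0,)]
B = [(7.0,), (10.0,)]
results = {
    "single": single_linkage(A, B),      # |3 - 7|
    "complete": complete_linkage(A, B),  # |1 - 10|
    "average": average_linkage(A, B),    # 39 / 6
    "centroid": centroid_linkage(A, B),  # |2 - 8.5|
}
```

In one dimension the average and centroid linkages coincide when the clusters do not interleave, as here; in higher dimensions they generally differ.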
  • 27. Agglomerative clustering: dendrogram
    The merging of clusters can be depicted with a dendrogram. Let us take a look at a 1D sample: {1, 2, 3, 7, 10, 12, 25, 29}.
    [Figure: dendrogram of the sample. Horizontal axis: objects; vertical axis: cluster distances. Clusters A, B and C are marked, and the height of a merge shows the distance between the merged clusters, e.g. A and B.]
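To make the dendrogram concrete, here is a naive sketch of agglomerative clustering on the same 1D sample. It assumes single linkage with the absolute difference as the distance, which is one possible choice; the slide does not say which linkage its figure uses. The function names are illustrative.

```python
def single_linkage_1d(A, B):
    """Single linkage for clusters of numbers: closest pair of values."""
    return min(abs(x - y) for x in A for y in B)


def agglomerate(points, linkage):
    """Merge the two closest clusters until one is left.

    Returns the merges as (cluster_a, cluster_b, distance) in merge order;
    the distances are the heights of the dendrogram's junctions.
    """
    clusters = [[p] for p in points]  # step 0: one instance per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 1: find the closest pair (the first pair wins on ties).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        # Step 2: replace the two clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return merges


sample = [1, 2, 3, 7, 10, 12, 25, 29]
merges = agglomerate(sample, single_linkage_1d)
heights = [dist for _, _, dist in merges]
```

With these assumptions the merge heights are 1, 1, 2, 3, 4, 4, 13: the points 1, 2, 3 join first, then 10 and 12, and the pair {25, 29} is attached to the rest last. The naive search over all pairs makes each merge quadratic in the number of clusters, which is fine for a sketch but not for large data.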
  • 28. Density-based methods: DBSCAN
    DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
  • 29. Density-based methods: DBSCAN algorithm
    All points can be divided into elements of dense regions (core points), border points, and noise. The formal definition is skipped here.
  • 30. Density-based methods: DBSCAN example
    Hyperparameters: M = 4, Eps > 0
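Since the slides skip the formal definition, the following sketch uses the usual one as an assumption: a point is a core point if at least M points (itself included) lie within distance Eps of it, a border point if it is not core but lies within Eps of a core point, and noise otherwise. The function and parameter names are illustrative, and the distance is assumed to be Euclidean.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify_points(data, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    n = len(data)
    # Eps-neighbourhoods; each point counts as its own neighbour (assumed).
    neighbours = [[j for j in range(n) if euclidean(data[i], data[j]) <= eps]
                  for i in range(n)]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    kinds = []
    for i in range(n):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in neighbours[i]):
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# A plus-shaped group around the origin and one far-away point.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0),
        (5.0, 5.0)]
kinds = classify_points(data, eps=1.0, min_pts=4)
```

With M = 4 and Eps = 1 the centre is the only core point, its four neighbours are border points, and the isolated point is noise. A full DBSCAN would then grow clusters by connecting core points that lie within Eps of each other; the neighbourhood search here is the naive quadratic one.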
  • 34. Density-based methods: DBSCAN pros and cons
    Pros
    + Can find clusters of any shape
    + Easy to implement
    + Can find noise in the data
    + Good complexity: O(n log n) with a suitable data structure, O(n^2) otherwise
    Cons
    - Parametric (the result depends on the hyperparameters M and Eps)
    - Does not work well when clusters differ in density
    - Depends on the chosen metric
  • 35. Contacts and questions
    Thanks! Please ask your questions in the OpenDataScience Slack team: http://ods.ai