Lecture 10
Clustering
Preview
 Introduction
 Partitioning methods
 Hierarchical methods
 Model-based methods
 Density-based methods
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 Urban planning: Identifying groups of houses according to
their house type, value, and geographical location
 Seismology: Observed earthquake epicenters should be
clustered along continental faults
What Is a Good Clustering?
 A good clustering method will produce
clusters with
 High intra-class similarity
 Low inter-class similarity
 Precise definition of clustering quality is difficult
 Application-dependent
 Ultimately subjective
Requirements for Clustering
in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal domain knowledge required to determine
input parameters
 Ability to deal with noise and outliers
 Insensitivity to order of input records
 Robustness wrt high dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Similarity and Dissimilarity
Between Objects
 Same measures as used for IBL (e.g., Lp norm)
 Euclidean distance (p = 2):
   d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }
 Properties of a metric d(i,j):
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
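As a concrete illustration, here is a minimal Python sketch of the Lp distance (Euclidean for p = 2); the function name and the use of NumPy are illustrative choices, not part of the lecture:

```python
import numpy as np

def lp_distance(x_i, x_j, p=2):
    """L_p (Minkowski) distance between two objects; p=2 is Euclidean."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

print(lp_distance([0, 0], [3, 4]))        # 5.0 (Euclidean)
print(lp_distance([0, 0], [3, 4], p=1))   # 7.0 (Manhattan)
```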
Major Clustering Approaches
 Partitioning: Construct various partitions and then evaluate
them by some criterion
 Hierarchical: Create a hierarchical decomposition of the set
of objects using some criterion
 Model-based: Hypothesize a model for each cluster and
find best fit of models to data
 Density-based: Guided by connectivity and density
functions
Partitioning Algorithms
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 Given k, find a partition into k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen, 1967): Each cluster is
represented by the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman
& Rousseeuw, 1987): Each cluster is represented by
one of the objects in the cluster
K-Means Clustering
 Given k, the k-means algorithm consists of
four steps:
 Select initial centroids at random.
 Assign each object to the cluster with the
nearest centroid.
 Compute each centroid as the mean of the
objects assigned to it.
 Repeat previous 2 steps until no change.
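A minimal sketch of these four steps in Python (NumPy-based; the initialization scheme and tie-breaking are illustrative, and an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: select initial centroids at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: repeat until nothing changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```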
K-Means Clustering (contd.)
 Example: [Figure: a sequence of scatter plots of the same 2-D data set, showing object assignments and centroid positions over successive k-means iterations]
Comments on the K-Means Method
 Strengths
 Relatively efficient: O(tkn), where n is # objects, k is
# clusters, and t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum
may be found using techniques such as simulated
annealing and genetic algorithms
 Weaknesses
 Applicable only when mean is defined (what about
categorical data?)
 Need to specify k, the number of clusters, in advance
 Trouble with noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This
method does not require the number of clusters k as an
input, but needs a termination condition
[Figure: five objects a, b, c, d, e merged step by step into a single cluster (agglomerative, AGNES), or split step by step into singletons (divisive, DIANA)]
AGNES (Agglomerative Nesting)
 Produces tree of clusters (nodes)
 Initially: each object is a cluster (leaf)
 Recursively merges nodes that have the least dissimilarity
 Criteria: min distance, max distance, avg distance, center
distance
 Eventually all nodes belong to the same cluster (root)
[Figure: three scatter plots showing nearby clusters being merged as AGNES proceeds]
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level. Then each connected
component forms a cluster.
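A minimal sketch of agglomerative clustering and dendrogram cutting, using SciPy's hierarchical-clustering routines in place of AGNES (the toy data, linkage method, and cut level are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups in the plane.
X = np.array([[1, 2], [2, 2], [1, 1], [8, 8], [9, 8], [8, 9]], dtype=float)

# method='single' is the min-distance criterion; 'complete', 'average',
# and 'centroid' correspond to the other merge criteria listed above.
Z = linkage(X, method='single')   # Z encodes the dendrogram

# Cutting the dendrogram at the level of 2 clusters labels each object.
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]
```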
DIANA (Divisive Analysis)
 Inverse order of AGNES
 Start with root cluster containing all objects
 Recursively divide into subclusters
 Eventually each cluster contains a single object
[Figure: three scatter plots showing one cluster being split recursively as DIANA proceeds]
Other Hierarchical Clustering Methods
 Major weakness of agglomerative clustering methods
 Do not scale well: time complexity of at least O(n²),
where n is the total number of objects
 Can never undo what was done previously
 Integration of hierarchical with distance-based clustering
 BIRCH: uses CF-tree and incrementally adjusts the
quality of sub-clusters
 CURE: selects well-scattered points from the cluster and
then shrinks them towards the center of the cluster by a
specified fraction
BIRCH
 BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies (Zhang, Ramakrishnan & Livny, 1996)
 Incrementally construct a CF (Clustering Feature) tree
 Parameters: max diameter, max children
 Phase 1: scan DB to build an initial in-memory CF tree
(each node: #points, sum, sum of squares)
 Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weaknesses: handles only numeric data, sensitive to order
of data records.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points, \sum_{i=1}^{N} X_i
SS: square sum of the N points, \sum_{i=1}^{N} X_i^2
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
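The CF entry above can be checked directly; a small sketch (NumPy, using the five example points):

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

# CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum.
N, LS, SS = len(points), points.sum(axis=0), (points ** 2).sum(axis=0)
print(N, tuple(LS), tuple(SS))   # 5 (16.0, 30.0) (54.0, 190.0)
```

CF entries are additive: the CF of two merged sub-clusters is the component-wise sum of their CFs, which is what lets BIRCH maintain the tree incrementally in a single scan.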
CF Tree
[Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, ...) with pointers to child nodes; leaf nodes hold CF entries and are chained by prev/next pointers]
CURE (Clustering Using REpresentatives)
 CURE: non-spherical clusters, robust wrt outliers
 Uses multiple representative points to evaluate
the distance between clusters
 Stops the creation of a cluster hierarchy if a
level consists of k clusters
Drawbacks of Distance-Based
Method
 Drawbacks of square-error-based clustering methods
 Consider only one point as representative of a cluster
 Good only for convex clusters, of similar size and
density, and if k can be reasonably estimated
Cure: The Algorithm
 Draw a random sample of size s
 Partition the sample into p partitions, each of size s/p
 Partially cluster each partition into s/(pq) clusters
 Cluster partial clusters, shrinking
representatives towards centroid
 Label data on disk
Data Partitioning and Clustering
 Example: s = 50, p = 2, s/p = 25, s/(pq) = 5
[Figure: the sample split into two partitions, each partially clustered into five sub-clusters]
Cure: Shrinking Representative Points
 Shrink the multiple representative points towards the
gravity center by a fraction α
 Multiple representatives capture the shape of the cluster
[Figure: representative points of two elongated clusters before and after shrinking towards their centers]
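A minimal sketch of the shrinking step (the shrink fraction α and the points are illustrative; here the mean of the representatives stands in for the cluster's gravity center):

```python
import numpy as np

def shrink(reps, alpha=0.3):
    """Move each representative point a fraction alpha towards the
    gravity center (approximated by the mean of the representatives)."""
    reps = np.asarray(reps, dtype=float)
    center = reps.mean(axis=0)
    return reps + alpha * (center - reps)

print(shrink([[0, 0], [10, 0], [5, 9]]))
```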
Model-Based Clustering
 Basic idea: Clustering as probability estimation
 One model for each cluster
 Generative model:
 Probability of selecting a cluster
 Probability of generating an object in cluster
 Find max. likelihood or MAP model
 Missing information: Cluster membership
 Use EM algorithm
 Quality of clustering: Likelihood of test objects
Mixtures of Gaussians
 Cluster model: Normal distribution (mean, covariance)
 Assume: diagonal covariance, known variance,
same for all clusters
 Max. likelihood: mean = avg. of samples
 But what points are samples of a given cluster?
 Estimate prob. that point belongs to cluster
 Mean = weighted avg. of points, weight = prob.
 But to estimate probs. we need model
 “Chicken and egg” problem: use EM algorithm
EM Algorithm for Mixtures
 Initialization: Choose means at random
 E step:
 For all points and means, compute Prob(point|mean)
 Prob(mean|point) =
Prob(mean) Prob(point|mean) / Prob(point)
 M step:
 Each mean = Weighted avg. of points
 Weight = Prob(mean|point)
 Repeat until convergence
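A minimal sketch of this E/M loop for a one-dimensional mixture with known, shared variance, matching the assumptions on the previous slide (the data, k, σ, and fixed iteration count are illustrative; a production implementation would monitor the log-likelihood for convergence):

```python
import numpy as np

def em_gaussian_means(x, k, sigma=1.0, n_iter=50, seed=0):
    """EM for a 1-D mixture of k Gaussians with known, shared variance:
    only the means and mixing weights are estimated."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # init: means at random
    weights = np.full(k, 1.0 / k)                  # Prob(mean)
    for _ in range(n_iter):
        # E step: Prob(mean | point) via Bayes' rule.
        lik = np.exp(-(x[:, None] - means[None, :]) ** 2 / (2 * sigma ** 2))
        resp = weights * lik                       # Prob(mean) * Prob(point|mean)
        resp /= resp.sum(axis=1, keepdims=True)    # normalize by Prob(point)
        # M step: each mean = weighted average of the points.
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        weights = resp.mean(axis=0)
    return means, weights
```

In the limit σ → 0 each responsibility collapses to 0 or 1, which is the hard assignment of k-means.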
EM Algorithm (contd.)
 Guaranteed to converge to local optimum
 K-means is a special case (hard assignments, shared spherical variance)
AutoClass
 Developed at NASA (Cheeseman & Stutz, 1988)
 Mixture of Naïve Bayes models
 Variety of possible models for
Prob(attribute|class)
 Missing information: Class of each example
 Apply EM algorithm as before
 Special case of learning Bayes net with
missing values
 Widely used in practice
COBWEB
 Grows tree of clusters (Fisher, 1987)
 Each node contains:
P(cluster), P(attribute|cluster) for each attribute
 Objects presented sequentially
 Options: Add to node, new node; merge, split
 Quality measure: Category utility:
Increase in predictability of attributes/#Clusters
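For reference, Fisher's category utility can be written out as follows (the standard formulation, not reproduced from the slide): the bracketed difference is the gain in the expected number of correctly guessed attribute values from knowing the cluster, and dividing by the number of clusters K penalizes fragmenting the data.

```latex
CU(C_1,\ldots,C_K) = \frac{1}{K} \sum_{k=1}^{K} P(C_k)
  \left[ \sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_k)^2
       - \sum_{i}\sum_{j} P(A_i = V_{ij})^2 \right]
```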
A COBWEB Tree
[Figure: an example COBWEB tree, each node listing P(cluster) and the attribute probabilities P(attribute|cluster)]
Neural Network Approaches
 Neuron = Cluster = Centroid in instance space
 Layer = Level of hierarchy
 Several competing sets of clusters in each layer
 Objects sequentially presented to network
 Within each set, neurons compete to win object
 Winning neuron is moved towards object
 Can be viewed as mapping from low-level
features to high-level ones
Competitive Learning
[Figure: a layered competitive-learning network; within each layer, sets of units compete for the presented object]
Self-Organizing Feature Maps
 Clustering is also performed by having several
units competing for the current object
 The unit whose weight vector is closest to the
current object wins
 The winner and its neighbors learn by having
their weights adjusted
 SOMs are believed to resemble processing that
can occur in the brain
 Useful for visualizing high-dimensional data in
2- or 3-D space
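A minimal sketch of a one-dimensional SOM (units arranged on a line, Gaussian neighborhood; all hyperparameters are illustrative defaults, not values from the lecture):

```python
import numpy as np

def train_som(X, n_units=10, n_iter=500, lr=0.5, radius=2.0, seed=0):
    """Minimal 1-D SOM sketch; X is a float array of shape (n, d)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=n_units)]   # init unit weights from data
    grid = np.arange(n_units)                 # unit positions on the map
    for t in range(n_iter):
        x = X[rng.integers(len(X))]           # present one object
        # The unit whose weight vector is closest to the object wins.
        winner = np.linalg.norm(W - x, axis=1).argmin()
        # The winner and its map neighbors move towards the object.
        h = np.exp(-((grid - winner) ** 2) / (2 * radius ** 2))
        W += lr * (1 - t / n_iter) * h[:, None] * (x - W)
    return W
```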
Density-Based Clustering
 Clustering based on density (local cluster criterion),
such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Representative algorithms:
 DBSCAN (Ester et al., 1996)
 DENCLUE (Hinneburg & Keim, 1998)
Definitions (I)
 Two parameters:
 Eps: Maximum radius of neighborhood
 MinPts: Minimum number of points in an Eps-
neighborhood of a point
 N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts iff
 1) p belongs to N_Eps(q)
 2) q is a core point: |N_Eps(q)| ≥ MinPts
[Figure: point p directly density-reachable from core point q, with MinPts = 5 and Eps = 1 cm]
Definitions (II)
 Density-reachable:
 A point p is density-reachable from
a point q wrt. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected
 A point p is density-connected to a
point q wrt. Eps, MinPts if there is
a point o such that both p and q
are density-reachable from o wrt.
Eps and MinPts.
[Figure: left, p density-reachable from q via a chain of points; right, p and q density-connected through a common point o]
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p wrt Eps
and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database.
 Continue the process until all of the points have been
processed.
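A minimal sketch of this procedure (brute-force neighborhood queries, so O(n²); a real implementation would use a spatial index such as an R*-tree):

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=5):
    """Minimal DBSCAN sketch: labels[i] = cluster id, or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        nbrs = list(np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps))
        if len(nbrs) < min_pts:
            continue                   # p is not a core point; leave as noise
        labels[p] = cluster            # start a new cluster from core point p
        i = 0
        while i < len(nbrs):           # expand via density-reachability
            q = nbrs[i]
            if not visited[q]:
                visited[q] = True
                q_nbrs = np.flatnonzero(np.linalg.norm(X - X[q], axis=1) <= eps)
                if len(q_nbrs) >= min_pts:       # q is also a core point
                    nbrs.extend(r for r in q_nbrs if r not in nbrs)
            if labels[q] == -1:
                labels[q] = cluster    # border or core point joins the cluster
            i += 1
        cluster += 1
    return labels
```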
DENCLUE: Using Density Functions
 DENsity-based CLUstEring (Hinneburg & Keim, 1998)
 Major features
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than other algorithms
(faster than DBSCAN by a factor of up to 45)
 But needs a large number of parameters
DENCLUE (contd.)
 Uses grid cells, but only keeps information about grid cells
that actually contain data points, and manages these cells in
a tree-based access structure.
 Influence function: describes the impact of a data point
within its neighborhood.
 The overall density of the data space can be calculated as
the sum of the influence functions of all data points.
 Clusters can be determined mathematically by identifying
density attractors.
 Density attractors are local maxima of the overall
density function.
Influence Functions
 Example
Influence of a data point on a point x (Gaussian kernel):
   f_{Gaussian}(x, y) = e^{-d(x,y)^2 / (2\sigma^2)}
Overall density function (sum over all N data points):
   f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-d(x,x_i)^2 / (2\sigma^2)}
Gradient of the overall density, used for hill-climbing to attractors:
   \nabla f^D_{Gaussian}(x) = \sum_{i=1}^{N} (x_i - x) \, e^{-d(x,x_i)^2 / (2\sigma^2)}
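A minimal sketch of these two functions (NumPy; σ and the hill-climbing usage are illustrative):

```python
import numpy as np

def density(x, data, sigma=1.0):
    """Overall density at x: sum of Gaussian influences of all data points."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def density_gradient(x, data, sigma=1.0):
    """Gradient of the overall density; repeated steps x += step * gradient
    climb to a density attractor (a local maximum)."""
    diff = data - x
    d2 = np.sum(diff ** 2, axis=1)
    return (np.exp(-d2 / (2 * sigma ** 2))[:, None] * diff).sum(axis=0)
```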
Density Attractors
[Figure: a density landscape over the data space; each point hill-climbs the density function to its local maximum, its density attractor]
Center-Defined & Arbitrary Clusters
[Figure: examples of center-defined clusters and arbitrary-shape clusters found by DENCLUE]
Clustering: Summary
 Introduction
 Partitioning methods
 Hierarchical methods
 Model-based methods
 Density-based methods