Lecture #12
Clustering Techniques: Similarity Measures
Topics to be covered…
• Introduction to clustering
• Similarity and dissimilarity measures
• Clustering techniques
  • Partitioning algorithms
  • Hierarchical algorithms
  • Density-based algorithms
Introduction to Clustering
• Classification consists of assigning a class label to a set of unclassified cases.
• Supervised Classification
  • The set of possible classes is known in advance.
• Unsupervised Classification
  • The set of possible classes is not known in advance. After the grouping is done, we can try to assign a name to each class.
• Unsupervised classification is called clustering.
(Figure: Supervised Technique)
(Figure: Unsupervised Technique)
Introduction to Clustering
• Clustering is somewhat related to classification in the sense that in both cases data are grouped.
• However, there is a major difference between the two techniques.
• To understand the difference, consider a sample dataset containing marks obtained by a set of students and the corresponding grades, as shown in Table 12.1.
Introduction to Clustering
Table 12.1: Tabulation of Marks
Roll No Mark Grade
1 80 A
2 70 A
3 55 C
4 91 EX
5 65 B
6 35 D
7 76 A
8 40 D
9 50 C
10 85 EX
11 25 F
12 60 B
13 45 D
14 95 EX
15 63 B
16 88 A
Figure 12.1: Group representation of the dataset in Table 12.1 (roll numbers grouped by grade: EX = {4, 10, 14}, A = {1, 2, 7, 16}, B = {5, 12, 15}, C = {3, 9}, D = {6, 8, 13}, F = {11})
Introduction to Clustering
• It is evident that there is a simple mapping between Table 12.1 and Fig 12.1.
• The point is that the groups in Fig 12.1 are already predefined in Table 12.1. This is similar to classification, where we are given a dataset in which groups of data are predefined.
• Consider another situation, where 'Grade' is not known, but we have to make a grouping ourselves.
  • Put marks into the same group only if they differ from the other marks in that group by less than 5.
• This is similar to the "relative grading" concept, and the grades may range from A to Z. A small sketch of this rule follows.
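Below is a minimal Python sketch of one reading of this rule; the function name, the sort-then-scan strategy, and the "less than 5" interpretation are assumptions, not from the slides.

```python
# One reading of the grouping rule: after sorting, a mark joins the current
# group only if it is within 5 of the previous mark; otherwise a new group
# (a new "grade") starts.
def group_marks(marks, gap=5):
    groups = []
    for m in sorted(marks):
        if groups and m - groups[-1][-1] < gap:
            groups[-1].append(m)   # close enough: same group
        else:
            groups.append([m])     # gap of 5 or more: start a new group
    return groups

marks = [80, 70, 55, 91, 65, 35, 76, 40, 50, 85, 25, 60, 45, 95, 63, 88]
print(group_marks(marks))
# [[25], [35], [40], [45], [50], [55], [60, 63, 65], [70], [76, 80], [85, 88, 91, 95]]
```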
Introduction to Clustering
• Figure 12.2 shows another grouping by means of another simple mapping, but the difference is that this mapping is not based on predefined classes.
• In other words, this grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
• Such group making is called clustering.
Introduction to Clustering
Example 12.1: The task of clustering
To elaborate the clustering task, consider the following dataset.
Table 12.2: Life Insurance database
With a certain similarity (or likeness) defined, we can group the records on one or more attributes (the mapping thus being non-trivial).
Marital Status  Age  Income  Education       Number of children
Single          35   25000   Under Graduate  3
Married         25   15000   Graduate        1
Single          40   20000   Under Graduate  0
Divorced        20   30000   Post-Graduate   0
Divorced        25   20000   Under Graduate  3
Married         60   70000   Graduate        0
Married         30   90000   Post-Graduate   0
Married         45   60000   Graduate        5
Divorced        50   80000   Under Graduate  2
• Clustering has been used in many application domains:
  • Image analysis
  • Document retrieval
  • Machine learning, etc.
• When clustering is applied to a real-world database, many problems may arise.
1. The (best) number of clusters is not known.
  • There is no single correct answer to a clustering problem.
  • In fact, many answers may be found.
  • The exact number of clusters required is not easy to determine.
Introduction to Clustering
2. There may not be any a priori knowledge concerning the clusters.
  • It is an open issue what data should be used for clustering.
  • Unlike classification, in clustering we have no supervised learning to aid the process.
  • In this sense, clustering can be viewed as unsupervised learning.
3. Interpreting the semantic meaning of each cluster may be difficult.
  • With classification, the labeling of classes is known ahead of time. In contrast, with clustering, this may not be the case.
  • Thus, when the clustering process finishes yielding a set of clusters, the exact meaning of each cluster may not be obvious.
Definition of Clustering Problem

Definition 12.1: Clustering
Given a database D = {t1, t2, …, tn} of n tuples, the clustering problem is to define a mapping f : D → C, where each ti ∈ D is assigned to one cluster ci ∈ C. Here, C = {c1, c2, …, ck} denotes the set of clusters.

• Solving a clustering problem means devising such a mapping.
• The idea behind the mapping is that a tuple within one cluster is more like the tuples within that cluster than the tuples outside it.
Definition of Clustering Problem

• Hence, the mapping function f in Definition 12.1 may be explicitly stated as

  f : D → {c1, c2, …, ck}

where i) each ti ∈ D is assigned to one cluster ci ∈ C, and
ii) for each cluster ci ∈ C, for all t_ip, t_iq ∈ ci and any tj ∉ ci,
  similarity(t_ip, t_iq) > similarity(t_ip, tj) AND similarity(t_ip, t_iq) > similarity(t_iq, tj).
• In the field of cluster analysis, this similarity plays an important part.
• We shall now learn how the similarity (alternatively judged as "dissimilarity") between any two data objects can be measured.
Similarity and Dissimilarity Measures
• In clustering techniques, similarity (or dissimilarity) is an important measurement.
• Informally, the similarity between two objects (e.g., two images, two documents, two records) is a numerical measure of the degree to which the two objects are alike.
• Dissimilarity, on the other hand, is the opposite measure: the degree to which two objects are different.
• Both similarity and dissimilarity are also termed proximity.
• Usually, similarity and dissimilarity are non-negative numbers, ranging from zero (highly dissimilar, i.e., not similar) to some finite or infinite value (highly similar, i.e., not dissimilar).
Note:
• Frequently, the term distance is used as a synonym for dissimilarity.
• Strictly, distance refers to a special case of dissimilarity.
Proximity Measures: Single-Attribute
• Consider an object that is defined by a single attribute A (e.g., length), and suppose the attribute A has n distinct values a1, a2, …, an.
• A data structure called the "dissimilarity matrix" is used to store the collection of proximities for all pairs of the n attribute values.
• In other words, the dissimilarity matrix for an attribute A with n values is represented by an n × n matrix, as shown below (only the lower triangle needs to be stored):

  | 0                           |
  | p(2,1)  0                   |
  | p(3,1)  p(3,2)  0           |
  |   ⋮       ⋮      ⋮    ⋱     |
  | p(n,1)  p(n,2)   …   …  0   |  (n × n)

• Here, p(i,j) denotes the proximity measure between two objects with attribute values ai and aj.
• Note: the proximity measure is symmetric, that is, p(i,j) = p(j,i). A small sketch of building such a matrix follows.
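As a small illustration (not from the slides), the following Python sketch fills such a matrix from any pairwise dissimilarity function, exploiting symmetry; the example values and the absolute-difference metric are made up.

```python
# Build an n x n dissimilarity matrix from a pairwise function d; only the
# lower triangle is computed, since p(i,j) = p(j,i), and the diagonal is 0.
def dissimilarity_matrix(values, d):
    n = len(values)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            p[i][j] = p[j][i] = d(values[i], values[j])
    return p

lengths = [2.0, 3.5, 7.0]   # hypothetical values of a single attribute
for row in dissimilarity_matrix(lengths, lambda a, b: abs(a - b)):
    print(row)
```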
Proximity Calculation
• The calculation of the proximity p(i,j) differs for different types of attributes according to the NOIR typology (Nominal, Ordinal, Interval, Ratio).

Proximity calculation for nominal attributes:
• For example, for the binary attribute Gender = {Male, Female}, Male may be coded as binary 1 and Female as binary 0.
• The similarity value is 1 if the two objects contain the same attribute value, while a similarity value of 0 implies that the objects are not at all similar.

Object   Gender
Ram      Male
Sita     Female
Laxman   Male

• Here, letting p denote the similarity value, the similarities among the different objects are as follows:
  p(Ram, Sita) = 0
  p(Ram, Laxman) = 1

Note: In this case, if q denotes the dissimilarity between two objects i and j with a single binary attribute, then q(i,j) = 1 − p(i,j).
Proximity Calculation
• Now, let us focus on how to calculate proximity measures between objects that are defined by two or more binary attributes.
• Suppose the number of attributes is b. We can define a contingency table summarizing the different matches and mismatches between any two objects x and y, as follows.

Table 12.3: Contingency table with binary attributes

                 Object y
                 1      0
Object x    1    f11    f10
            0    f01    f00

Here, f11 = the number of attributes where x = 1 and y = 1,
      f10 = the number of attributes where x = 1 and y = 0,
      f01 = the number of attributes where x = 0 and y = 1,
      f00 = the number of attributes where x = 0 and y = 0.

Note: f00 + f01 + f10 + f11 = b, the total number of binary attributes.
Now, two cases may arise: symmetric and asymmetric binary attributes.
Similarity Measure with Symmetric Binary
• The similarity between two objects defined by symmetric binary attributes is measured using the symmetric binary coefficient, denoted S and defined below:

  S = (number of matching attribute values) / (total number of attributes)
or
  S = (f00 + f11) / (f00 + f01 + f10 + f11)

The dissimilarity measure, likewise, can be denoted D and defined as

  D = (number of mismatched attribute values) / (total number of attributes)
or
  D = (f01 + f10) / (f00 + f01 + f10 + f11)

Note that D = 1 − S.
Similarity Measure with Symmetric Binary
Example 12.2: Proximity measures with symmetric binary attributes
Consider the following dataset, where objects are defined with symmetric binary attributes:
Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N}

Object  Gender  Food  Caste  Education  Hobby  Job
Hari    M       V     M      L          C      N
Ram     M       N     M      I          T      N
Tomi    F       N     H      L          C      Y

S(Hari, Ram) = (1 + 2) / (1 + 2 + 1 + 2) = 0.5

A small sketch verifying these counts follows.
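Here is a hedged Python sketch of the counting behind Example 12.2. Coding the first listed value of each attribute domain as binary 1 is an assumption, but it reproduces the slide's numbers.

```python
# Contingency counts and symmetric binary coefficient for Example 12.2.
# Assumption: the first listed value of each attribute domain is coded as 1.
ONE = {"Gender": "M", "Food": "V", "Caste": "H", "Education": "L",
       "Hobby": "T", "Job": "Y"}
hari = {"Gender": "M", "Food": "V", "Caste": "M", "Education": "L",
        "Hobby": "C", "Job": "N"}
ram = {"Gender": "M", "Food": "N", "Caste": "M", "Education": "I",
       "Hobby": "T", "Job": "N"}

def counts(x, y):
    f = {"11": 0, "10": 0, "01": 0, "00": 0}
    for a in x:
        f[f"{int(x[a] == ONE[a])}{int(y[a] == ONE[a])}"] += 1
    return f

f = counts(hari, ram)                        # {'11': 1, '10': 2, '01': 1, '00': 2}
S = (f["00"] + f["11"]) / sum(f.values())    # 0.5, matching the slide
D = (f["01"] + f["10"]) / sum(f.values())    # 0.5 = 1 - S
print(f, S, D)
```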
Proximity Measure with Asymmetric Binary
• The similarity between two objects defined by asymmetric binary attributes is measured by the Jaccard coefficient, often symbolized by J and given by the following equation:

  J = (number of matching presences) / (number of attributes not involved in 00 matches)
or
  J = f11 / (f01 + f10 + f11)
Example 12.3: Jaccard Coefficient
Consider the following dataset.
Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N}
Calculate the Jaccard coefficient between Ram and Hari, assuming that all binary attributes are asymmetric and that, for each pair of values of an attribute, the first one is more frequent than the second.

Object  Gender  Food  Caste  Education  Hobby  Job
Hari    M       V     M      L          C      N
Ram     M       N     M      I          T      N
Tomi    F       N     H      L          C      Y

J(Hari, Ram) = 1 / (2 + 1 + 1) = 0.25
Note: J(Ram, Tomi) = 0 and J(Hari, Ram) = J(Ram, Hari), etc.
Proximity Measure with Asymmetric Binary
Example 12.4:
Consider the following dataset.
Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N}

Object  Gender  Food  Caste  Education  Hobby  Job
Hari    M       V     M      L          C      N
Ram     M       N     M      I          T      N
Tomi    F       N     H      L          C      Y

How can you calculate similarity if Gender, Hobby, and Job are symmetric binary attributes while Food, Caste, and Education are asymmetric binary attributes?
Obtain the similarity matrix of the objects using the Jaccard coefficient for the above.
Proximity Measure with Categorical Attribute
• A binary attribute is a special kind of nominal attribute, where the attribute has values with two states only.
• A categorical attribute, on the other hand, is a nominal attribute whose values have three or more states (e.g., Color = {Red, Green, Blue}).
• If s(x, y) denotes the similarity between two objects x and y, then

  s(x, y) = (number of matches) / (total number of attributes)

and the dissimilarity d(x, y) is

  d(x, y) = (number of mismatches) / (total number of attributes)

• If m = the number of matches and a = the number of categorical attributes with which the objects are defined, then

  s(x, y) = m / a   and   d(x, y) = (a − m) / a

A small sketch of this simple matching measure follows.
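A minimal Python sketch of simple matching, checked against two pairs of (Color, Position) tuples from Example 12.4 on the next slide; the function name is an assumption.

```python
# Simple matching for categorical attributes: s = m/a and d = (a - m)/a.
def matching(x, y):
    m = sum(xi == yi for xi, yi in zip(x, y))   # number of matches
    a = len(x)                                  # number of attributes
    return m / a, (a - m) / a                   # (similarity, dissimilarity)

print(matching(("R", "L"), ("R", "L")))   # objects 1 and 4: (1.0, 0.0)
print(matching(("R", "L"), ("B", "C")))   # objects 1 and 2: (0.0, 1.0)
```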
Proximity Measure with Categorical Attribute
Example 12.4:

Object  Color  Position  Distance
1       R      L         L
2       B      C         M
3       G      R         M
4       R      L         H

The similarity matrix considering only the Color attribute is shown below:

      1    2    3    4
1     1    0    0    1
2     0    1    0    0
3     0    0    1    0
4     1    0    0    1

Dissimilarity matrix, d = ?
Obtain the dissimilarity matrix considering both categorical attributes (i.e., Color and Position).
Proximity Measure with Ordinal Attribute
• An ordinal attribute is a special kind of categorical attribute, where the values of the attribute follow a sequence (ordering), e.g., Grade = {Ex, A, B, C} where Ex > A > B > C.
• Suppose A is an attribute of ordinal type with the set of values {a1, a2, …, an}. Let the n values of A be ordered in ascending order as a1 < a2 < … < an, and let the i-th attribute value ai be ranked i, for i = 1, 2, …, n.
• The normalized value of ai can then be expressed as

  âi = (i − 1) / (n − 1)

• Thus, the normalized values lie in the range [0..1].
• As âi is a numerical value, proximity can then be calculated using any measurement method for numerical attributes.
• For example, the proximity between two objects x and y with attribute values ai and aj can be expressed as the squared difference

  d(x, y) = (âi − âj)²

where âi and âj are the normalized values of ai and aj, respectively. A small sketch of the normalization step follows.
Proximity Measure with Ordinal Attribute
Example 12.5:
Consider the following set of records, where each record is defined by two ordinal attributes, Size = {S, M, L} and Quality = {Ex, A, B, C}, such that S < M < L and Ex > A > B > C.

Object  Size      Quality
A       S (0.0)   A (0.66)
B       L (1.0)   Ex (1.0)
C       L (1.0)   C (0.0)
D       M (0.5)   B (0.33)

• Normalized values are shown in brackets.
• Their pairwise proximity measures form a 4 × 4 symmetric matrix over A, B, C, D, with zeros on the diagonal, exactly as in the earlier dissimilarity matrix.
• Exercise: find the dissimilarity matrix when each object is defined by only one ordinal attribute, say Size (or Quality).
Proximity Measure with Interval Scale
• The measure called distance is usually used to estimate the proximity between two objects defined with interval-scaled attributes.
• We first present a generic formula for the distance d between two objects x and y in n-dimensional space. Suppose xi and yi denote the values of the i-th attribute of objects x and y, respectively:

  d(x, y) = ( Σ i=1..n |xi − yi|^r )^(1/r)

• Here, r is a parameter (typically an integer r ≥ 1).
• This distance metric is most popularly known as the Minkowski metric.
• This distance measure satisfies some well-known properties, mentioned on the next slide.
Proximity Measure with Interval Scale
Properties of the Minkowski metric:
1. Non-negativity:
   a. d(x, y) ≥ 0 for all x and y.
   b. d(x, y) = 0 only if x = y. This is also called the identity condition.
2. Symmetry:
   d(x, y) = d(y, x) for all x and y.
   This condition ensures that the order in which the objects are considered is not important.
3. Transitivity:
   d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z.
   • This condition means that the direct distance d(x, z) between objects x and z is always less than or equal to the sum of the distances between x and y and between y and z.
   • This property is also termed the Triangle Inequality.
Proximity Measure with Interval Scale
Depending on the value of r, the distance measure is named accordingly.

1. Manhattan distance (L1 norm: r = 1)
The Manhattan distance is expressed as

  d(x, y) = Σ i=1..n |xi − yi|

where |…| denotes the absolute value. This metric is alternatively termed the taxicab metric or city-block metric.

Example: x = [7, 3, 5] and y = [3, 2, 6].
The Manhattan distance is |7 − 3| + |3 − 2| + |5 − 6| = 6.

• As a special instance of the Manhattan distance, when the attribute values are binary (∈ {0, 1}), it is called the Hamming distance.
• Alternatively stated, the Hamming distance is the number of bits that differ between two objects that have only binary values (i.e., between two binary vectors).
Proximity Measure with Interval Scale
2. Euclidean distance (L2 norm: r = 2)
This metric is the usual Euclidean distance between any two points x and y in ℝⁿ:

  d(x, y) = √( Σ i=1..n (xi − yi)² )

Example: x = [7, 3, 5] and y = [3, 2, 6].
The Euclidean distance between x and y is

  d(x, y) = √((7 − 3)² + (3 − 2)² + (5 − 6)²) = √18 ≈ 4.243
Proximity Measure with Interval Scale
3. Chebyshev distance (L∞ norm: r → ∞)
This metric is defined as

  d(x, y) = max over all i of |xi − yi|

• Note the difference between the Chebyshev metric and the Manhattan distance: instead of summing the absolute differences (Manhattan), we simply take the maximum of the absolute differences (Chebyshev). Hence, L∞ ≤ L1.

Example: x = [7, 3, 5] and y = [3, 2, 6].
The Manhattan distance = |7 − 3| + |3 − 2| + |5 − 6| = 6.
The Chebyshev distance = max{|7 − 3|, |3 − 2|, |5 − 6|} = 4.

A small sketch computing all three distances follows.
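A minimal Python sketch checking the running example against the three named instances of the Minkowski metric.

```python
# Minkowski distance for a given r, plus the r -> infinity limit (Chebyshev).
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [7, 3, 5], [3, 2, 6]
print(minkowski(x, y, 1))   # Manhattan: 6.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(18) ~ 4.243
print(chebyshev(x, y))      # Chebyshev: 4
```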
Proximity Measure with Interval Scale
4. Other metrics:
a. Canberra metric:

  d(x, y) = Σ i=1..n ( |xi − yi| / (xi + yi) )^q

• where q is a real number. Usually q = 1; because the numerator of each ratio is at most its denominator, each ratio is ≤ 1, so the sum is bounded and small.
• If q < 1, it is called the fractional Canberra metric; if q > 1, the opposite relationship holds.

b. Hellinger metric:

  d(x, y) = √( Σ i=1..n (√xi − √yi)² )

This metric is then used either squared or transformed into an acceptable range using, e.g., the following transformations:
  i. d(x, y) = (1 − r(x, y)) / 2
  ii. d(x, y) = 1 − r(x, y)
where r(x, y) is the correlation coefficient between x and y.

Note: not every dissimilarity measure is a distance metric.
Proximity Measure for Ratio-Scale
The proximity between objects with ratio-scaled variables can be computed with the following steps:
1. Apply an appropriate transformation to the data to bring them onto a linear scale (e.g., a logarithmic transformation for data of the form X = Ae^(Bt)).
2. The transformed values can then be treated as interval-scaled values, and any distance measure discussed for interval-scaled variables can be applied.

Note: there are two concerns with proximity measures:
• normalization of the measured values, and
• transformation from similarity to dissimilarity measures and vice versa.
A small sketch of the two-step recipe follows.
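A hedged Python sketch of the two-step recipe; the numbers below are made up for illustration.

```python
import math

# Step 1: log-transform ratio-scaled values (of the form X = A * e^(B t)).
# Step 2: apply an interval-scale distance (Euclidean) on the log scale.
x_raw = [10.0, 100.0, 1000.0]
y_raw = [20.0, 200.0, 4000.0]
x_log = [math.log(v) for v in x_raw]
y_log = [math.log(v) for v in y_raw]
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_log, y_log)))
print(d)   # Euclidean distance on the log scale
```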
Proximity Measure for Ratio-Scale
Normalization:
• A major problem when using similarity (or dissimilarity) measures such as Euclidean distance is that large values frequently swamp small ones.
• For example, with cost attributes on very different scales, the contribution of Cost 2 and Cost 3 becomes insignificant compared to Cost 1 as far as the Euclidean distance is concerned.
• This problem can be avoided by using normalized values of all numerical attributes.
• Another normalization maps the estimated values into a normalized range, say [0, 1]. Note that if a measure s varies in the range s ∈ [0..∞), it can be normalized as

  s′ = 1 / (1 + s)
Intra-transformation:
• Transforming similarities to dissimilarities and vice versa is relatively straightforward.
• If the similarity (or dissimilarity) falls in the interval [0..1], the dissimilarity (or similarity) can be obtained as

  d = 1 − s   or   s = 1 − d

• Another approach is to define the similarity as the negative of the dissimilarity (or vice versa).
Proximity Measure with Mixed Attributes
• The previous similarity measures assume that all the attributes are of the same type. A general approach is thus needed when the attributes are of different types.
• One straightforward approach is to compute the similarity for each attribute separately and then combine these attribute similarities using a method that results in an overall similarity between 0 and 1.
• Typically, the overall similarity is defined as the average of all the individual attribute similarities.
• See the algorithm on the next slide for doing this.
Similarity Measure with Mixed Attributes
Suppose the objects are defined with attributes A1, A2, …, An.
1. For the k-th attribute (k = 1, 2, …, n), compute the similarity s_k(x, y) in the range [0, 1].
2. Compute the overall similarity between two objects using the following formula:

  similarity(x, y) = ( Σ k=1..n s_k(x, y) ) / n

3. The above formula can be modified by weighting the contribution of each attribute. If w_k is the weight for the k-th attribute, then

  w_similarity(x, y) = Σ k=1..n w_k · s_k(x, y)

such that Σ k=1..n w_k = 1.
4. The definition of the Minkowski distance can also be modified accordingly:

  d(x, y) = ( Σ k=1..n w_k |xk − yk|^r )^(1/r)

All symbols have their usual meanings. A small sketch of the combination step follows.
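A Python sketch of the combination algorithm; the per-attribute similarity values and weights below are hypothetical.

```python
# Combine per-attribute similarities s_k(x, y) in [0, 1], either as a plain
# average (step 2) or as a weighted average with weights summing to 1 (step 3).
def overall_similarity(sims, weights=None):
    if weights is None:
        return sum(sims) / len(sims)              # step 2: plain average
    assert abs(sum(weights) - 1.0) < 1e-9         # step 3: weights sum to 1
    return sum(w * s for w, s in zip(weights, sims))

sims = [1.0, 0.0, 0.66, 0.5, 0.9]
print(overall_similarity(sims))                             # 0.612
print(overall_similarity(sims, [0.4, 0.1, 0.1, 0.2, 0.2]))  # 0.746
```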
Similarity Measure with Mixed Attributes
Example 12.6:
Consider the following set of objects. Obtain the similarity matrix.
[For C: X > A > B > C]

Object  A (Binary)  B (Categorical)  C (Ordinal)  D (Numeric)  E (Numeric)
1       Y           R                X            475          10^8
2       N           R                A            10           10^-2
3       N           B                C            1000         10^5
4       Y           G                B            500          10^3
5       Y           B                A            80           1

How can cosine similarity be applied to this?
Non-Metric Similarity
• In many applications (such as information retrieval), objects are complex and contain a large number of symbolic entities (such as keywords, phrases, etc.).
• To measure the distance between such complex objects, it is often desirable to introduce a non-metric similarity function.
• Here we discuss a few such non-metric similarity measurements.

Cosine similarity
Suppose x and y denote two vectors representing two complex objects. The cosine similarity, denoted cos(x, y), is defined as

  cos(x, y) = (x · y) / (‖x‖ ‖y‖)

• where x · y denotes the vector dot product, x · y = Σ i=1..n xi·yi, with x = [x1, x2, …, xn] and y = [y1, y2, …, yn];
• ‖x‖ and ‖y‖ denote the Euclidean norms of the vectors x and y, respectively (essentially the lengths of the vectors x and y), that is

  ‖x‖ = √(x1² + x2² + … + xn²)  and  ‖y‖ = √(y1² + y2² + … + yn²)
Cosine Similarity
• In fact, cosine similarity is essentially a measure of the (cosine of the) angle between x and y.
• Thus, if the cosine similarity is 1, the angle between x and y is 0°, and in this case x and y are the same except for magnitude.
• On the other hand, if the cosine similarity is 0, the angle between x and y is 90°, and they do not share any terms.
• Consequently, the cosine similarity can be written equivalently as

  cos(x, y) = (x · y) / (‖x‖ ‖y‖) = (x/‖x‖) · (y/‖y‖) = x′ · y′

where x′ = x/‖x‖ and y′ = y/‖y‖ are unit vectors. This means that cosine similarity does not take the magnitudes of the two vectors into account when computing similarity.
• It is thus, in a way, a normalized measurement.
Non-Metric Similarity
Example 12.7: Cosine similarity
Suppose we are given two documents, each described by counts of 10 words, shown as the vectors x and y below.

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0] and y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

Thus, x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
‖x‖ = √(3² + 2² + 0 + 5² + 0 + 0 + 0 + 2² + 0 + 0) = √42 ≈ 6.48
‖y‖ = √(1² + 0 + 0 + 0 + 0 + 0 + 0 + 1² + 0 + 2²) = √6 ≈ 2.45
∴ cos(x, y) ≈ 0.31

Extended Jaccard coefficient
The extended Jaccard coefficient, denoted EJ, is defined as

  EJ = (x · y) / (‖x‖² + ‖y‖² − x · y)

• This is alternatively termed the Tanimoto coefficient and can be used to measure, for example, document similarity.
• Exercise: compute the extended Jaccard coefficient EJ for Example 12.7 above; a sketch follows.
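A short Python sketch verifying Example 12.7 and computing the extended Jaccard (Tanimoto) coefficient for the same pair of vectors.

```python
import math

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))      # x . y = 5
nx = math.sqrt(sum(a * a for a in x))       # ||x|| = sqrt(42) ~ 6.48
ny = math.sqrt(sum(b * b for b in y))       # ||y|| = sqrt(6)  ~ 2.45
print(dot / (nx * ny))                      # cos(x, y) ~ 0.31
print(dot / (nx ** 2 + ny ** 2 - dot))      # EJ = 5 / 43 ~ 0.12
```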
Pearsonโ€™s Correlation
• The correlation between two objects x and y gives a measure of the linear relationship between the attributes of the objects.
• More precisely, Pearson's correlation coefficient between two objects x and y is defined as follows:

  P(x, y) = S_xy / (S_x · S_y)

where
  S_xy = covariance(x, y) = (1/(n−1)) Σ i=1..n (xi − x̄)(yi − ȳ)
  S_x = standard deviation(x) = √( (1/(n−1)) Σ i=1..n (xi − x̄)² )
  S_y = standard deviation(y) = √( (1/(n−1)) Σ i=1..n (yi − ȳ)² )
  x̄ = mean(x) = (1/n) Σ i=1..n xi
  ȳ = mean(y) = (1/n) Σ i=1..n yi

and n is the number of attributes in x and y.
Note 1: The correlation always lies in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship, that is, xi = a·yi + b for some a and b.

Example 12.8: Pearson's correlation
Calculate the Pearson's correlation of the two vectors x and y given below.
x = [3, 6, 0, 3, 6]
y = [1, 2, 0, 1, 2]
Note: vector components can be negative values as well.

Note 2: If the correlation is 0, then there is no linear relationship between the attributes of the objects.

Example 12.9: Non-linear correlation
Verify that there is no linear relationship among the attributes of the objects x and y given below.
x = [−3, −2, −1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]
P(x, y) = 0; also note that yi = xi² for all attributes here. A small sketch checking both examples follows.
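A Python sketch of Pearson's correlation exactly as defined above, checked against Examples 12.8 and 12.9.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

print(pearson([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))                 # 1.0 (x = 3y)
print(pearson([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))  # 0.0 (y = x^2)
```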
Mahalanobis Distance
• A related issue with distance measurement is how to handle the situation when attributes do not have the same range of values.
• For example, consider a record with the two attributes Age and Income. The two attributes have very different scales, so Euclidean distance is not a suitable measure in such a situation.
• A further question is how to compute distance when there is correlation between some of the attributes, perhaps in addition to the difference in ranges of values.
• A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are (partially) correlated and/or have different ranges of values.
• The Mahalanobis distance between two objects (vectors) x and y is defined as

  M(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ

Here, Σ⁻¹ is the inverse of the covariance matrix Σ. A small numpy sketch follows.
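A hedged numpy sketch of the (squared-form) Mahalanobis distance above; the covariance matrix is estimated from a small made-up Age/Income sample.

```python
import numpy as np

# Illustrative (made-up) Age/Income data with different scales.
data = np.array([[25, 20000.0],
                 [35, 30000.0],
                 [45, 35000.0],
                 [30, 24000.0]])
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance

def mahalanobis(x, y):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ cov_inv @ diff)   # (x - y) Sigma^-1 (x - y)^T

print(mahalanobis(data[0], data[1]))
```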
Set Difference and Time Difference

Set difference
• Another non-metric dissimilarity measurement is the set difference.
• Given two sets A and B, A − B is the set of elements of A that are not in B. Thus, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅.
• We can define the distance d between two sets A and B as

  d(A, B) = |A − B|

where |A| denotes the size of set A.
Note: this measure does not satisfy the identity, symmetry, or triangle-inequality properties of a metric.
• A modified definition, however, does satisfy them:

  d(A, B) = |A − B| + |B − A|

Time difference
• This defines the distance between times of the day as follows:

  d(t1, t2) = t2 − t1          if t1 ≤ t2
  d(t1, t2) = 24 + (t2 − t1)   if t1 > t2

• Example: d(1 pm, 2 pm) = 1 hour; d(2 pm, 1 pm) = 23 hours.
A small sketch of both measures follows.
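Minimal Python sketches of the two non-metric dissimilarities above; a 24-hour clock representation for times is assumed.

```python
def set_diff(A, B):
    return len(A - B)                       # d(A, B) = |A - B|

def sym_set_diff(A, B):
    return len(A - B) + len(B - A)          # modified, symmetric variant

def time_diff(t1, t2):
    return t2 - t1 if t1 <= t2 else 24 + (t2 - t1)

print(set_diff({1, 2, 3, 4}, {2, 3, 4}))      # 1; set_diff(B, A) would be 0
print(sym_set_diff({1, 2, 3, 4}, {2, 3, 4}))  # 1
print(time_diff(13, 14))                      # d(1 pm, 2 pm) = 1 hour
print(time_diff(14, 13))                      # d(2 pm, 1 pm) = 23 hours
```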
More Related Content

What's hot

3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patternsAzad public school
ย 
Decision tree
Decision treeDecision tree
Decision treeSoujanya V
ย 
Naive bayes
Naive bayesNaive bayes
Naive bayesAshraf Uddin
ย 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Mustafa Sherazi
ย 
Solving recurrences
Solving recurrencesSolving recurrences
Solving recurrencesMegha V
ย 
Decision tree
Decision treeDecision tree
Decision treeAmi_Surati
ย 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic conceptsKrish_ver2
ย 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
ย 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitiveslavanya marichamy
ย 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysisDataminingTools Inc
ย 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityRushali Deshmukh
ย 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
ย 
Pattern recognition UNIT 5
Pattern recognition UNIT 5Pattern recognition UNIT 5
Pattern recognition UNIT 5SURBHI SAROHA
ย 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science zekeLabs Technologies
ย 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
ย 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
ย 
Data reduction
Data reductionData reduction
Data reductionkalavathisugan
ย 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
ย 

What's hot (20)

3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
ย 
Decision tree
Decision treeDecision tree
Decision tree
ย 
Naive bayes
Naive bayesNaive bayes
Naive bayes
ย 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
ย 
Solving recurrences
Solving recurrencesSolving recurrences
Solving recurrences
ย 
Decision tree
Decision treeDecision tree
Decision tree
ย 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
ย 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
ย 
Clustering
ClusteringClustering
Clustering
ย 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
ย 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
ย 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
ย 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
ย 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
ย 
Pattern recognition UNIT 5
Pattern recognition UNIT 5Pattern recognition UNIT 5
Pattern recognition UNIT 5
ย 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
ย 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
ย 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
ย 
Data reduction
Data reductionData reduction
Data reduction
ย 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
ย 

Similar to Similarity Measures (pptx)

Data mining
Data miningData mining
Data miningKani Selvam
ย 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptSamPrem3
ย 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptPalaniKumarR2
ย 
[PPT]
[PPT][PPT]
[PPT]butest
ย 
Clustering
ClusteringClustering
ClusteringNLPseminar
ย 
47 292-298
47 292-29847 292-298
47 292-298idescitation
ย 
IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292HARDIK SINGH
ย 
Application for Logical Expression Processing
Application for Logical Expression Processing Application for Logical Expression Processing
Application for Logical Expression Processing csandit
ย 
Clustering in artificial intelligence
Clustering in artificial intelligence Clustering in artificial intelligence
Clustering in artificial intelligence Karam Munir Butt
ย 
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...ijsc
ย 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysisAvijit Famous
ย 
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...Abebe Admasu
ย 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
ย 
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...IOSR Journals
ย 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221Editor Jacotech
ย 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
ย 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
ย 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringishmecse13
ย 

Similar to Similarity Measures (pptx) (20)

Data mining
Data miningData mining
Data mining
ย 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
ย 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
ย 
[PPT]
[PPT][PPT]
[PPT]
ย 
Clustering
ClusteringClustering
Clustering
ย 
47 292-298
47 292-29847 292-298
47 292-298
ย 
IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292
ย 
Application for Logical Expression Processing
Application for Logical Expression Processing Application for Logical Expression Processing
Application for Logical Expression Processing
ย 
Clustering in artificial intelligence
Clustering in artificial intelligence Clustering in artificial intelligence
Clustering in artificial intelligence
ย 
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
ย 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
ย 
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...
Google BigQuery is a very popular enterprise warehouse thatโ€™s built with a co...
ย 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
ย 
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
ย 
G0354451
G0354451G0354451
G0354451
ย 
Localization, Classification, and Evaluation.pdf
Localization, Classification, and Evaluation.pdfLocalization, Classification, and Evaluation.pdf
Localization, Classification, and Evaluation.pdf
ย 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221
ย 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
ย 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
ย 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
ย 

Recently uploaded

Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
ย 
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdfssuser54595a
ย 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
ย 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
ย 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
ย 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
ย 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
ย 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
ย 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
ย 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
ย 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
ย 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
ย 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
ย 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
ย 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
ย 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)Dr. Mazin Mohamed alkathiri
ย 

Recently uploaded (20)

Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
ย 
Model Call Girl in Tilak Nagar Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
Model Call Girl in Tilak Nagar Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”Model Call Girl in Tilak Nagar Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
Model Call Girl in Tilak Nagar Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
ย 
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAะกY_INDEX-DM_23-1-final-eng.pdf
ย 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
ย 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
ย 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
ย 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ย 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
ย 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
ย 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
ย 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
ย 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
ย 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
ย 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
ย 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
ย 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
ย 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
ย 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
ย 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
ย 
Model Call Girl in Bikash Puri Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
Model Call Girl in Bikash Puri  Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”Model Call Girl in Bikash Puri  Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
Model Call Girl in Bikash Puri Delhi reach out to us at ๐Ÿ”9953056974๐Ÿ”
ย 

Similarity Measures (pptx)

  • 2. Topics to be coveredโ€ฆ ๏‚— Introduction to clustering ๏‚— Similarity and dissimilarity measures ๏‚— Clustering techniques ๏‚— Partitioning algorithms ๏‚— Hierarchical algorithms ๏‚— Density-based algorithm 2
  • 3. Introduction to Clustering 3 ๏‚— Classification consists of assigning a class label to a set of unclassified cases. ๏‚— Supervised Classification ๏‚— The set of possible classes is known in advance. ๏‚— Unsupervised Classification ๏‚— Set of possible classes is not known. After classification we can try to assign a name to that class. ๏‚— Unsupervised classification is called clustering.
  • 6. Introduction to Clustering ๏‚— Clustering is somewhat related to classification in the sense that in both cases data are grouped. โ€ข ๏‚— However, there is a major difference between these two techniques. ๏‚— In order to understand the difference between the two, consider a sample dataset containing marks obtained by a set of students and corresponding grades as shown in Table 15.1. 6
  • 7. Introduction to Clustering Table 12.1: Tabulation of Marks 7 Roll No Mark Grade 1 80 A 2 70 A 3 55 C 4 91 EX 5 65 B 6 35 D 7 76 A 8 40 D 9 50 C 10 85 EX 11 25 F 12 60 B 13 45 D 14 95 EX 15 63 B 16 88 A Figure 12.1: Group representation of dataset in Table 15.1 15 12 5 11 14 10 4 6 13 8 16 7 1 2 3 9 B F EX D C A
  • 8. Introduction to Clustering ๏‚— It is evident that there is a simple mapping between Table 12.1 and Fig 12.1. ๏‚— The fact is that groups in Fig 12.1 are already predefined in Table 12.1. This is similar to classification, where we have given a dataset where groups of data are predefined. ๏‚— Consider another situation, where โ€˜Gradeโ€™ is not known, but we have to make a grouping. ๏‚— Put all the marks into a group if any other mark in that group does not exceed by 5 or more. ๏‚— This is similar to โ€œRelative gradingโ€ concept and grade may range from A to Z. 8
  • 9. Introduction to Clustering ๏‚— Figure 12.2 shows another grouping by means of another simple mapping, but the difference is this mapping does not based on predefined classes. ๏‚— In other words, this grouping is accomplished by finding similarities between data according to characteristics found in the actual data. ๏‚— Such a group making is called clustering.
  • 10. Introduction to Clustering Example 12.1 : The task of clustering In order to elaborate the clustering task, consider the following dataset. Table 12.2: Life Insurance database With certain similarity or likeliness defined, we can classify the records to one or group of more attributes (and thus mapping being non-trivial). 10 Martial Status Age Income Education Number of children Single 35 25000 Under Graduate 3 Married 25 15000 Graduate 1 Single 40 20000 Under Graduate 0 Divorced 20 30000 Post-Graduate 0 Divorced 25 20000 Under Graduate 3 Married 60 70000 Graduate 0 Married 30 90000 Post-Graduate 0 Married 45 60000 Graduate 5 Divorced 50 80000 Under Graduate 2
  • 11. ๏‚— Clustering has been used in many application domains: ๏‚— Image analysis ๏‚— Document retrieval ๏‚— Machine learning, etc. ๏‚— When clustering is applied to real-world database, many problems may arise. 1. The (best) number of cluster is not known. ๏‚— There is not correct answer to a clustering problem. ๏‚— In fact, many answers may be found. ๏‚— The exact number of cluster required is not easy to determine. 11 Introduction to Clustering
  • 12. 2. There may not be any a priori knowledge concerning the clusters. โ€ข This is an issue that what data should be used for clustering. โ€ข Unlike classification, in clustering, we have not supervisory learning to aid the process. โ€ข Clustering can be viewed as similar to unsupervised learning. 3. Interpreting the semantic meaning of each cluster may be difficult. โ€ข With classification, the labeling of classes is known ahead of time. In contrast, with clustering, this may not be the case. โ€ข Thus, when the clustering process is finished yielding a set of clusters, the exact meaning of each cluster may not be obvious. 12 Introduction to Clustering
  • 13. Definition of Clustering Problem 13 Given a database D = ๐‘ก1, ๐‘ก2, โ€ฆ . . , ๐‘ก๐‘› of ๐‘› tuples, the clustering problem is to define a mapping ๐‘“ โˆถ D ๐ถ, where each ๐‘ก๐‘– โˆˆ ๐ท is assigned to one cluster ๐‘๐‘– โˆˆ ๐ถ. Here, C = ๐‘1, ๐‘2, โ€ฆ . . , ๐‘๐‘˜ denotes a set of clusters. Definition 12.1: Clustering โ€ข Solution to a clustering problem is devising a mapping formulation. โ€ข The formulation behind such a mapping is to establish that a tuple within one cluster is more like tuples within that cluster and not similar to tuples outside it.
  • 14. Definition of Clustering Problem 14 โ€ข Hence, mapping function f in Definition 12.1 may be explicitly stated as ๐‘“ โˆถ D ๐‘1, ๐‘2, โ€ฆ . . , ๐‘๐‘˜ where i) each ๐‘ก๐‘– โˆˆ ๐ท is assigned to one cluster ๐‘๐‘– โˆˆ ๐ถ. ii) for each cluster ๐‘๐‘– โˆˆ ๐ถ, and for all ๐‘ก๐‘–๐‘ , ๐‘ก๐‘–๐‘ž โˆˆ ๐‘๐‘– and there exist ๐‘ก๐‘— โˆ‰ ๐‘๐‘– such that similarity (๐‘ก๐‘–๐‘ , ๐‘ก๐‘–๐‘ž ) > similarity (๐‘ก๐‘–๐‘ , ๐‘ก๐‘— ) AND similarity (๐‘ก๐‘–๐‘ž , ๐‘ก๐‘— ) โ€ข In the field of cluster analysis, this similarity plays an important part. โ€ข Now, we shall learn how similarity (this is also alternatively judged as โ€œdissimilarityโ€) between any two data can be measured.
  • 15.
  • 16. Similarity and Dissimilarity Measures 16 โ€ข In clustering techniques, similarity (or dissimilarity) is an important measurement. โ€ข Informally, similarity between two objects (e.g., two images, two documents, two records, etc.) is a numerical measure of the degree to which two objects are alike. โ€ข The dissimilarity on the other hand, is another alternative (or opposite) measure of the degree to which two objects are different. โ€ข Both similarity and dissimilarity also termed as proximity. โ€ข Usually, similarity and dissimilarity are non-negative numbers and may range from zero (highly dissimilar (no similar)) to some finite/infinite value (highly similar (no dissimilar)). Note: โ€ข Frequently, the term distance is used as a synonym for dissimilarity โ€ข In fact, it is used to refer as a special case of dissimilarity.
  • 17. Proximity Measures: Single-Attribute 17 โ€ข Consider an object, which is defined by a single attribute A (e.g., length) and the attribute A has n-distinct values ๐‘Ž1, ๐‘Ž2, โ€ฆ . . , ๐‘Ž๐‘›. โ€ข A data structure called โ€œDissimilarity matrixโ€ is used to store a collection of proximities that are available for all pair of n attribute values. โ€ข In other words, the Dissimilarity matrix for an attribute A with n values is represented by an ๐‘› ร— ๐‘› matrix as shown below. 0 ๐‘(2,1) 0 ๐‘(3,1) โ‹ฎ ๐‘(๐‘›,1) ๐‘(3,2) โ‹ฎ ๐‘(๐‘›,2) 0 โ‹ฎ โ€ฆ โ€ฆ 0 ๐‘›ร—๐‘› โ€ข Here, ๐‘(๐‘–,๐‘—) denotes the proximity measure between two objects with attribute values ๐‘Ž๐‘– and ๐‘Ž๐‘—. โ€ข Note: The proximity measure is symmetric, that is, ๐‘(๐‘–,๐‘—) = ๐‘(๐‘—,๐‘–)
  • 18. Proximity Calculation ๏‚— Proximity calculation to compute ๐‘(๐‘–,๐‘—) is different for different types of attributes according to NOIR topology. Proximity calculation for Nominal attributes: โ€ข For example, binary attribute, Gender = {Male, female} where Male is equivalent to binary 1 and female is equivalent to binary 0. โ€ข Similarity value is 1 if the two objects contains the same attribute value, while similarity value is 0 implies objects are not at all similar. 18 Object Gender Ram Male Sita Female Laxman Male โ€ข Here, Similarity value let it be denoted by ๐‘, among different objects are as follows. ๐‘ ๐‘…๐‘Ž๐‘š, ๐‘ ๐‘–๐‘ก๐‘Ž = 0 ๐‘ ๐‘…๐‘Ž๐‘š, ๐ฟ๐‘Ž๐‘ฅ๐‘š๐‘Ž๐‘› = 1 Note : In this case, if ๐‘ž denotes the dissimilarity between two objects ๐‘– ๐‘Ž๐‘›๐‘‘ ๐‘— with single binary attributes, then ๐‘ž(๐‘–,๐‘—)= 1 โˆ’ ๐‘(๐‘–,๐‘—)
  • 19. Proximity Calculation 19 โ€ข Now, let us focus on how to calculate proximity measures between objects which are defined by two or more binary attributes. โ€ข Suppose, the number of attributes be ๐‘. We can define the contingency table summarizing the different matches and mismatches between any two objects ๐‘ฅ and ๐‘ฆ, which are as follows. Object ๐‘ฅ Object y 1 0 1 ๐‘“11 ๐‘“10 0 ๐‘“01 ๐‘“00 Table 12.3: Contingency table with binary attributes Here, ๐‘“11= the number of attributes where ๐‘ฅ=1 and ๐‘ฆ=1. ๐‘“10= the number of attributes where ๐‘ฅ=1 and ๐‘ฆ=0. ๐‘“01= the number of attributes where ๐‘ฅ=0 and ๐‘ฆ=1. ๐‘“00= the number of attributes where ๐‘ฅ=0 and ๐‘ฆ=0. Note : ๐‘“00 + ๐‘“01 + ๐‘“10 + ๐‘“11 = ๐‘, the total number of binary attributes. Now, two cases may arise: symmetric and asymmetric binary attributes.
  • 20. Similarity Measure with Symmetric Binary 20 โ€ข To measure the similarity between two objects defined by symmetric binary attributes using a measure called symmetric binary coefficient and denoted as ๐’ฎ and defined below ๐’ฎ = ๐‘๐‘ข๐‘š๐‘๐‘’๐‘Ÿ ๐‘œ๐‘“ ๐‘š๐‘Ž๐‘ก๐‘โ„Ž๐‘–๐‘›๐‘” ๐‘Ž๐‘ก๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘’ ๐‘ฃ๐‘Ž๐‘™๐‘ข๐‘’๐‘  ๐‘‡๐‘œ๐‘ก๐‘Ž๐‘™ ๐‘›๐‘ข๐‘š๐‘๐‘’๐‘Ÿ ๐‘œ๐‘“ ๐‘Ž๐‘ก๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘’๐‘  or ๐’ฎ = ๐‘“00 + ๐‘“11 ๐‘“00 + ๐‘“01 + ๐‘“10 + ๐‘“11 The dissimilarity measure, likewise can be denoted as ๐’Ÿ and defined as ๐’Ÿ = ๐‘๐‘ข๐‘š๐‘๐‘’๐‘Ÿ ๐‘œ๐‘“ ๐‘š๐‘–๐‘ ๐‘š๐‘Ž๐‘ก๐‘โ„Ž๐‘’๐‘‘ ๐‘Ž๐‘ก๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘’ ๐‘ฃ๐‘Ž๐‘™๐‘ข๐‘’๐‘  ๐‘‡๐‘œ๐‘ก๐‘Ž๐‘™ ๐‘›๐‘ข๐‘š๐‘๐‘’๐‘Ÿ ๐‘œ๐‘“ ๐‘Ž๐‘ก๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘’๐‘  or ๐’Ÿ = ๐‘“01 + ๐‘“10 ๐‘“00 + ๐‘“01 + ๐‘“10 + ๐‘“11 Note that, ๐’Ÿ = 1 โˆ’ ๐’ฎ
  • 21. Similarity Measure with Symmetric Binary 21 Example 12.2: Proximity measures with symmetric binary attributes Consider the following two dataset, where objects are defined with symmetric binary attributes. Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N} Object Gender Food Caste Education Hobby Job Hari M V M L C N Ram M N M I T N Tomi F N H L C Y ๐’ฎ(Hari, Ram) = 1+2 1+2+1+2 = 0.5
• 22. Proximity Measure with Asymmetric Binary Attributes
• The similarity between two objects defined by asymmetric binary attributes is measured with the Jaccard coefficient, often symbolized by $\mathcal{J}$ and given by the following equation:

$$\mathcal{J} = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
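A corresponding Python sketch (again not from the original slides; the 0/1 encoding of the attribute values is an illustrative assumption carried over from the previous block):

```python
def jaccard_similarity(x, y):
    """J = f11 / (f01 + f10 + f11): 0-0 matches carry no information
    for asymmetric binary attributes, so they are excluded."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return f11 / (f01 + f10 + f11)

# Hari vs. Ram with the same 0/1 encoding as before:
print(jaccard_similarity([1, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0]))  # 0.25
```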
• 23. Proximity Measure with Asymmetric Binary Attributes
Example 12.3: Jaccard coefficient
Consider the following dataset:
Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N}
Calculate the Jaccard coefficient between Ram and Hari, assuming that all binary attributes are asymmetric and that, for each pair of values of an attribute, the first is more frequent than the second.

Object   Gender   Food   Caste   Education   Hobby   Job
Hari     M        V      M       L           C       N
Ram      M        N      M       I           T       N
Tomi     F        N      H       L           C       Y

$\mathcal{J}(\text{Hari}, \text{Ram}) = \frac{1}{2+1+1} = 0.25$
Note: $\mathcal{J}(\text{Ram}, \text{Tomi}) = 0$ and $\mathcal{J}(\text{Hari}, \text{Ram}) = \mathcal{J}(\text{Ram}, \text{Hari})$, etc.
• 24. Example 12.4:
Consider the following dataset:
Gender = {M, F}, Food = {V, N}, Caste = {H, M}, Education = {L, I}, Hobby = {T, C}, Job = {Y, N}

Object   Gender   Food   Caste   Education   Hobby   Job
Hari     M        V      M       L           C       N
Ram      M        N      M       I           T       N
Tomi     F        N      H       L           C       Y

How can you calculate similarity if Gender, Hobby and Job are symmetric binary attributes, while Food, Caste and Education are asymmetric binary attributes?
Obtain the similarity matrix of the objects for the above, using the Jaccard coefficient.
• 25. Proximity Measure with Categorical Attributes
• A binary attribute is a special kind of nominal attribute where the attribute has values with two states only.
• On the other hand, a categorical attribute is another kind of nominal attribute that has values with three or more states (e.g., Color = {Red, Green, Blue}).
• If $s(x, y)$ denotes the similarity between two objects $x$ and $y$, then

$$s(x, y) = \frac{\text{number of matches}}{\text{total number of attributes}}$$

and the dissimilarity $d(x, y)$ is

$$d(x, y) = \frac{\text{number of mismatches}}{\text{total number of attributes}}$$

• If $m$ = number of matches and $a$ = number of categorical attributes with which the objects are defined, then $s(x, y) = \frac{m}{a}$ and $d(x, y) = \frac{a - m}{a}$.
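A short Python sketch of this matching rule (not part of the original slides; the function name is an illustrative assumption):

```python
def categorical_similarity(x, y):
    """s(x, y) = m / a: the fraction of categorical attributes on
    which the two objects take the same value."""
    m = sum(xi == yi for xi, yi in zip(x, y))
    return m / len(x)

# Objects 1 and 4 vs. objects 1 and 2 from the example that follows
# (attributes Color and Position):
print(categorical_similarity(["R", "L"], ["R", "L"]))  # 1.0
print(categorical_similarity(["R", "L"], ["B", "C"]))  # 0.0
```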
• 26. Proximity Measure with Categorical Attributes
Example 12.4:

Object   Color   Position   Distance
1        R       L          L
2        B       C          M
3        G       R          M
4        R       L          H

The similarity matrix considering only the Color attribute is shown below (objects 1 and 4 share color R; all other pairs differ):

       1   2   3   4
1    [ 1             ]
2    [ 0   1         ]
3    [ 0   0   1     ]
4    [ 1   0   0   1 ]

Dissimilarity matrix, $d$ = ?
Obtain the dissimilarity matrix considering both the categorical attributes (i.e., Color and Position).
• 27. Proximity Measure with Ordinal Attributes
• An ordinal attribute is a special kind of categorical attribute whose values follow a sequence (ordering), e.g., Grade = {Ex, A, B, C} where Ex > A > B > C.
• Suppose A is an attribute of type ordinal with the set of values $A = \{a_1, a_2, \dots, a_n\}$. Let the n values of A be ordered in ascending order as $a_1 < a_2 < \dots < a_n$, and let the i-th attribute value $a_i$ be ranked $i$, for $i = 1, 2, \dots, n$.
• The normalized value of $a_i$ can be expressed as

$$\tilde{a}_i = \frac{i - 1}{n - 1}$$

• Thus, normalized values lie in the range $[0, 1]$.
• As $\tilde{a}_i$ is a numerical value, the proximity can then be calculated using any measurement method for numerical attributes.
• For example, the proximity measure between two objects $x$ and $y$ with attribute values $a_i$ and $a_j$ can then be expressed as $s(x, y) = (\tilde{a}_i - \tilde{a}_j)^2$, where $\tilde{a}_i$ and $\tilde{a}_j$ are the normalized values of $a_i$ and $a_j$, respectively.
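A minimal Python sketch of the rank normalization (not from the original slides; the function name is an illustrative assumption):

```python
def normalize_ordinal(value, ordered_levels):
    """Map the i-th of n ordered levels (i = 1..n) to (i - 1)/(n - 1),
    so that normalized ranks lie in [0, 1]."""
    i = ordered_levels.index(value) + 1
    n = len(ordered_levels)
    return (i - 1) / (n - 1)

sizes = ["S", "M", "L"]                # S < M < L
print(normalize_ordinal("S", sizes))   # 0.0
print(normalize_ordinal("M", sizes))   # 0.5
print(normalize_ordinal("L", sizes))   # 1.0
```

These values match the bracketed normalized ranks in Example 12.5 below.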
• 28. Proximity Measure with Ordinal Attributes
Example 12.5: Consider the following set of records, where each record is defined by two ordinal attributes, Size = {S, M, L} and Quality = {Ex, A, B, C}, such that S < M < L and Ex > A > B > C.

Object   Size      Quality
A        S (0.0)   A (0.66)
B        L (1.0)   Ex (1.0)
C        L (1.0)   C (0.0)
D        M (0.5)   B (0.33)

• Normalized values are shown in brackets.
• Their proximity measures are to be entered in the matrix below.

       A   B   C   D
A    [ 0             ]
B    [ ?   0         ]
C    [ ?   ?   0     ]
D    [ ?   ?   ?   0 ]

? Find the dissimilarity matrix when each object is defined by only one ordinal attribute, say Size (or Quality).
• 29. Proximity Measure with Interval-Scaled Attributes
• A measure called distance is usually used to estimate the proximity between two objects defined with interval-scaled attributes.
• We first present a generic formula for the distance d between two objects $x$ and $y$ in n-dimensional space. Suppose $x_i$ and $y_i$ denote the values of the i-th attribute of the objects $x$ and $y$, respectively.

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{\frac{1}{r}}$$

• Here, $r$ is any integer value.
• This distance metric is most popularly known as the Minkowski metric.
• This distance measure satisfies some well-known properties. These are mentioned in the next slide.
• 30. Proximity Measure with Interval-Scaled Attributes
Properties of the Minkowski metric:
1. Non-negativity:
   a. $d(x, y) \geq 0$ for all $x$ and $y$;
   b. $d(x, y) = 0$ only if $x = y$. This is also called the identity condition.
2. Symmetry: $d(x, y) = d(y, x)$ for all $x$ and $y$.
   This condition ensures that the order in which objects are considered is not important.
3. Transitivity: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x$, $y$ and $z$.
   • This condition has the interpretation that the least distance $d(x, z)$ between objects $x$ and $z$ is always less than or equal to the sum of the distances between the objects $x$ and $y$, and between $y$ and $z$.
   • This property is also termed the triangle inequality.
• 31. Proximity Measure with Interval-Scaled Attributes
Depending on the value of $r$, the distance measure is named accordingly.
1. Manhattan distance (L1 norm: $r = 1$)
The Manhattan distance is expressed as

$$d = \sum_{i=1}^{n} |x_i - y_i|$$

where $|\cdot|$ denotes the absolute value. This metric is also alternatively termed the taxicab metric or city-block metric.
Example: x = [7, 3, 5] and y = [3, 2, 6]. The Manhattan distance is $|7-3| + |3-2| + |5-6| = 6$.
• As a special instance of the Manhattan distance, the case where the attribute values ∈ {0, 1} is called the Hamming distance.
• Alternatively, the Hamming distance is the number of bits that are different between two objects that have only binary values (i.e., between two binary vectors).
• 32. Proximity Measure with Interval-Scaled Attributes
2. Euclidean distance (L2 norm: $r = 2$)
This metric is the same as the Euclidean distance between any two points $x$ and $y$ in $\mathbb{R}^n$:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Example: x = [7, 3, 5] and y = [3, 2, 6]. The Euclidean distance between $x$ and $y$ is
$d(x, y) = \sqrt{(7-3)^2 + (3-2)^2 + (5-6)^2} = \sqrt{18} \approx 4.243$
• 33. Proximity Measure with Interval-Scaled Attributes
3. Chebyshev distance ($L_\infty$ norm: $r \to \infty$)
This metric is defined as

$$d(x, y) = \max_{\forall i} |x_i - y_i|$$

• We may clearly note the difference between the Chebyshev metric and the Manhattan distance: instead of summing up the absolute differences (Manhattan distance), we simply take the maximum of the absolute differences (Chebyshev distance). Hence, $L_\infty \leq L_1$.
Example: x = [7, 3, 5] and y = [3, 2, 6].
The Manhattan distance = $|7-3| + |3-2| + |5-6| = 6$.
The Chebyshev distance = $\max\{|7-3|, |3-2|, |5-6|\} = 4$.
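The whole family can be computed with one small Python sketch (not from the original slides; the function name is an illustrative assumption), reproducing the worked examples above:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: d(x, y) = (sum_i |x_i - y_i|^r)^(1/r)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

x, y = [7, 3, 5], [3, 2, 6]
print(minkowski(x, y, 1))               # Manhattan (L1): 6.0
print(minkowski(x, y, 2))               # Euclidean (L2): sqrt(18) ~ 4.243
print(np.abs(np.subtract(x, y)).max())  # Chebyshev (L-infinity): 4
```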
• 34. Proximity Measure with Interval-Scaled Attributes
4. Other metrics:
a. Canberra metric:

$$d(x, y) = \sum_{i=1}^{n} \left( \frac{|x_i - y_i|}{x_i + y_i} \right)^q$$

• where $q$ is a real number. Usually $q = 1$; because the numerator of each ratio is always ≤ its denominator, each ratio is ≤ 1, so the sum is always bounded and small.
• If $q \neq 1$, it is called the fractional Canberra metric.
• If $q > 1$, the opposite relationship holds.
b. Hellinger metric:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2}$$

This metric is then used either squared or transformed via the correlation coefficient $r(x, y) \in [-1, +1]$ between $x$ and $y$, using one of the following transformations:
i. $d(x, y) = (1 - r(x, y))/2$
ii. $d(x, y) = 1 - r(x, y)$
Note: not every dissimilarity measure is a distance (metric) measure.
• 35. Proximity Measure for Ratio-Scaled Attributes
The proximity between objects with ratio-scaled variables can be computed with the following steps:
1. Apply an appropriate transformation to the data to bring it onto a linear scale (e.g., a logarithmic transformation for data of the form $X = Ae^{Bt}$).
2. The transformed values can be treated as interval-scaled values. Any distance measure discussed for interval-scaled variables can then be applied to measure the proximity.
Note: there are two concerns with proximity measures:
• normalization of the measured values, and
• inter-transformation from a similarity to a dissimilarity measure and vice versa.
• 36. Proximity Measure for Ratio-Scaled Attributes
Normalization:
• A major problem when using similarity (or dissimilarity) measures such as the Euclidean distance is that large values frequently swamp the small ones.
• For example, with cost attributes Cost 1, Cost 2 and Cost 3 where Cost 1 takes much larger values, the contributions of Cost 2 and Cost 3 are insignificant compared to Cost 1 so far as the Euclidean distance is concerned.
• This problem can be avoided if we consider the normalized values of all numerical attributes.
• Another normalization is to map the estimated values into a normalized range, say [0, 1]. Note that if a measure $s$ varies in the range $s \in [0, \infty)$, then it can be normalized as $s' = \frac{1}{1 + s}$.
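Both ideas fit in a few lines of Python (a sketch, not the lecture's prescribed procedure; min-max rescaling is one common choice of normalization):

```python
import numpy as np

def min_max_normalize(column):
    """Rescale a numeric attribute into [0, 1] so that large-valued
    attributes do not swamp small-valued ones in distance computations."""
    column = np.asarray(column, float)
    return (column - column.min()) / (column.max() - column.min())

def squash(s):
    """Normalize a measure s in [0, inf) into (0, 1]: s' = 1 / (1 + s)."""
    return 1.0 / (1.0 + s)

print(min_max_normalize([25000, 15000, 20000, 30000]))  # income-like scale
print(squash(0.0), squash(9.0))                         # 1.0 0.1
```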
• 37. Proximity Measure for Ratio-Scaled Attributes
Inter-transformation:
• Transforming similarities to dissimilarities and vice versa is also relatively straightforward.
• If the similarity (or dissimilarity) falls in the interval [0, 1], the dissimilarity (or similarity) can be obtained as
$d = 1 - s$ or $s = 1 - d$
• Another approach is to define the similarity as the negative of the dissimilarity (or vice versa).
• 38. Proximity Measure with Mixed Attributes
• The previous similarity measures assume that all the attributes are of the same type. Thus, a general approach is needed when the attributes are of different types.
• One straightforward approach is to compute the similarity for each attribute separately and then combine these similarities using a method that results in an overall similarity between 0 and 1.
• Typically, the overall similarity is defined as the average of all the individual attribute similarities.
• See the algorithm in the next slide for doing this.
• 39. Similarity Measure with Vector Objects
Suppose the objects are defined with attributes $A_1, A_2, \dots, A_n$.
1. For the k-th attribute (k = 1, 2, …, n), compute the similarity $s_k(x, y)$ in the range [0, 1].
2. Compute the overall similarity between two objects using the formula

$$\text{similarity}(x, y) = \frac{\sum_{k=1}^{n} s_k(x, y)}{n}$$

3. The above formula can be modified by weighting the contribution of each attribute. If $w_k$ is the weight of the k-th attribute, then

$$\text{w\_similarity}(x, y) = \sum_{k=1}^{n} w_k\, s_k(x, y)$$

such that $\sum_{k=1}^{n} w_k = 1$.
4. The definition of the Minkowski distance can also be modified accordingly:

$$d(x, y) = \left( \sum_{i=1}^{n} w_i\, |x_i - y_i|^r \right)^{\frac{1}{r}}$$

All symbols have their usual meanings.
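A minimal Python sketch of the weighted combination (not from the original slides; the per-attribute similarity functions below are illustrative assumptions):

```python
def combined_similarity(x, y, sims, weights=None):
    """Overall similarity of two objects with mixed attribute types:
    sims[k](x[k], y[k]) returns a per-attribute similarity in [0, 1];
    weights (summing to 1) default to the unweighted average."""
    n = len(sims)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * s(a, b) for w, s, a, b in zip(weights, sims, x, y))

# Hypothetical two-attribute objects: one categorical, one numeric in [0, 1]:
sims = [lambda a, b: 1.0 if a == b else 0.0,   # categorical match
        lambda a, b: 1.0 - abs(a - b)]          # normalized numeric
print(combined_similarity(["R", 0.2], ["R", 0.7], sims))  # 0.75
```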
• 40. Similarity Measure with Mixed Attributes
Example 12.6: Consider the following set of objects. Obtain the similarity matrix. [For C: X > A > B > C]

Object   A (Binary)   B (Categorical)   C (Ordinal)   D (Numeric)   E (Numeric)
1        Y            R                 X             475           $10^8$
2        N            R                 A             10            $10^{-2}$
3        N            B                 C             1000          $10^5$
4        Y            G                 B             500           $10^3$
5        Y            B                 A             80            1

How can cosine similarity be applied to this?
• 41. Non-Metric Similarity
• In many applications (such as information retrieval), objects are complex and contain a large number of symbolic entities (such as keywords, phrases, etc.).
• To measure the distance between complex objects, it is often desirable to introduce a non-metric similarity function.
• Here, we discuss a few such non-metric similarity measurements.
Cosine similarity
Suppose $x$ and $y$ denote two vectors representing two complex objects. The cosine similarity, denoted $\cos(x, y)$, is defined as

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

• where $x \cdot y$ denotes the vector dot product, namely $x \cdot y = \sum_{i=1}^{n} x_i y_i$, with $x = [x_1, x_2, \dots, x_n]$ and $y = [y_1, y_2, \dots, y_n]$;
• $\|x\|$ and $\|y\|$ denote the Euclidean norms of the vectors $x$ and $y$, respectively (essentially the lengths of the vectors), that is, $\|x\| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$ and $\|y\| = \sqrt{y_1^2 + y_2^2 + \dots + y_n^2}$.
• 42. Cosine Similarity
• In fact, cosine similarity is essentially a measure of the (cosine of the) angle between x and y.
• Thus, if the cosine similarity is 1, the angle between x and y is 0° and, in this case, x and y are the same except for magnitude.
• On the other hand, if the cosine similarity is 0, then the angle between x and y is 90° and they do not share any terms.
• Considering this, the cosine similarity can be written equivalently as

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = \hat{x} \cdot \hat{y}$$

where $\hat{x} = \frac{x}{\|x\|}$ and $\hat{y} = \frac{y}{\|y\|}$. This means that cosine similarity does not take the magnitudes of the two vectors into account when computing similarity.
• It is thus, in a way, a normalized measurement.
• 43. Non-Metric Similarity
Example 12.7: Cosine similarity
Suppose we are given two documents, with counts of 10 words in each shown in the form of vectors x and y as below.
x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0] and y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
Thus,
$x \cdot y$ = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
$\|x\| = \sqrt{3^2 + 2^2 + 0 + 5^2 + 0 + 0 + 0 + 2^2 + 0 + 0} = \sqrt{42} \approx 6.48$
$\|y\| = \sqrt{1^2 + 0 + 0 + 0 + 0 + 0 + 0 + 1^2 + 0 + 2^2} = \sqrt{6} \approx 2.45$
$\therefore \cos(x, y) \approx 0.31$
Extended Jaccard coefficient
The extended Jaccard coefficient is denoted $EJ$ and defined as

$$EJ = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$$

• This is also alternatively termed the Tanimoto coefficient and can be used to measure, for example, document similarity.
Compute the extended Jaccard coefficient ($EJ$) for the above Example 12.7.
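A Python sketch reproducing this example (not part of the original slides; the function names are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def extended_jaccard(x, y):
    """Tanimoto coefficient: EJ = (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (x @ y) / (x @ x + y @ y - x @ y)

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(x, y), 2))  # 0.31
print(round(extended_jaccard(x, y), 3))   # 5 / (42 + 6 - 5) ~ 0.116
```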
• 44. Pearson's Correlation
• The correlation between two objects x and y gives a measure of the linear relationship between the attributes of the objects.
• More precisely, Pearson's correlation coefficient between two objects x and y is defined as follows:

$$P(x, y) = \frac{S_{xy}}{S_x \, S_y}$$

where
$S_{xy}$ = covariance$(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
$S_x$ = standard deviation of $x$ $= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
$S_y$ = standard deviation of $y$ $= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}$
$\bar{x}$ = mean$(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\bar{y}$ = mean$(y) = \frac{1}{n} \sum_{i=1}^{n} y_i$
and n is the number of attributes in x and y.
• 45. Pearson's Correlation
Note 1: Correlation is always in the range of −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship, that is, $x_i = a \cdot y_i + b$ for some constants a and b.
Example 12.8: Pearson's correlation
Calculate the Pearson's correlation of the two vectors x and y given below.
x = [3, 6, 0, 3, 6]
y = [1, 2, 0, 1, 2]
Note: vector components can be negative values as well.
Note: if the correlation is 0, then there is no linear relationship between the attributes of the objects.
Example 12.9: Non-linear correlation
Verify that there is no linear relationship among the attributes of the objects x and y given below.
x = [−3, −2, −1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]
P(x, y) = 0; also note that $y_i = x_i^2$ for all attributes here.
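A Python sketch checking both examples (not from the original slides; the (n − 1) factors in the covariance and standard deviations cancel, which the code exploits):

```python
import numpy as np

def pearson(x, y):
    """P(x, y) = S_xy / (S_x * S_y), computed from centered vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

print(pearson([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))                 # ~1.0 (x = 3y)
print(pearson([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))  # 0.0
```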
• 46. Mahalanobis Distance
• A related issue with distance measurement is how to handle the situation when attributes do not have the same range of values.
• For example, consider a record with the two attributes Age and Income. Here, the two attributes have different scales, so the Euclidean distance is not a suitable measure to handle such a situation.
• A further question is how to compute the distance when there is correlation between some of the attributes, perhaps in addition to the difference in the ranges of values.
• A generalization of the Euclidean distance, the Mahalanobis distance, is useful when attributes are (partially) correlated and/or have different ranges of values.
• The Mahalanobis distance between two objects (vectors) x and y is defined as

$$M(x, y) = (x - y)\, \Sigma^{-1}\, (x - y)^T$$

Here, $\Sigma^{-1}$ is the inverse of the covariance matrix.
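A minimal Python sketch, assuming the covariance matrix is estimated from a sample of objects (the Age/Income records below are hypothetical and chosen only to echo the scale-mismatch example above):

```python
import numpy as np

def mahalanobis(x, y, data):
    """M(x, y) = (x - y) Sigma^{-1} (x - y)^T, where Sigma is the
    covariance matrix estimated from the data (rows = objects)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # np.cov expects variables in rows, so transpose the objects matrix:
    cov_inv = np.linalg.inv(np.cov(np.asarray(data, float).T))
    diff = x - y
    return diff @ cov_inv @ diff

# Hypothetical Age/Income records with very different scales:
data = [[25, 15000], [35, 25000], [45, 60000], [30, 90000], [50, 80000]]
print(mahalanobis([25, 15000], [35, 25000], data))
```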
• 47. Set Difference and Time Difference
Set difference
• Another non-metric dissimilarity measurement is the set difference.
• Given two sets A and B, A − B is the set of elements of A that are not in B. Thus, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅.
• We can define the distance d between two sets A and B as
$d(A, B) = |A - B|$
where $|A|$ denotes the size of set A.
Note: this measure does not satisfy the identity, symmetry and transitivity (triangle inequality) properties listed earlier.
• Another, modified definition, however, does satisfy them:
$d(A, B) = |A - B| + |B - A|$
Time difference
• It defines the distance between times of the day as follows:

$$d(t_1, t_2) = \begin{cases} t_2 - t_1 & \text{if } t_1 \leq t_2 \\ 24 + (t_2 - t_1) & \text{if } t_1 > t_2 \end{cases}$$

• Example: $d(1\ \text{pm}, 2\ \text{pm}) = 1$ hour; $d(2\ \text{pm}, 1\ \text{pm}) = 23$ hours.
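Both measures are short enough to sketch in Python (not from the original slides; times are represented as hours on a 24-hour clock, an illustrative assumption):

```python
def time_difference(t1, t2):
    """Clock distance in hours on a 24-hour day:
    t2 - t1 if t1 <= t2, otherwise 24 + (t2 - t1)."""
    return t2 - t1 if t1 <= t2 else 24 + (t2 - t1)

def set_difference_distance(a, b):
    """Symmetric variant: d(A, B) = |A - B| + |B - A|."""
    return len(a - b) + len(b - a)

print(time_difference(13, 14))                           # d(1 pm, 2 pm) = 1
print(time_difference(14, 13))                           # d(2 pm, 1 pm) = 23
print(set_difference_distance({1, 2, 3, 4}, {2, 3, 4}))  # 1
```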