Distance and Similarity Measures
Bamshad Mobasher
DePaul University
Distance or Similarity Measures
 Many data mining and analytics tasks involve comparing
objects and determining their similarities (or
dissimilarities)
 Clustering
 Nearest-neighbor search, classification, and prediction
 Characterization and discrimination
 Automatic categorization
 Correlation analysis
 Many of today's real-world applications rely on the computation of
similarities or distances among objects
 Personalization
 Recommender systems
 Document categorization
 Information retrieval
 Target marketing
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
Distance or Similarity Measures
 Measuring Distance
 In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
 Often requires the representation of objects as “feature vectors”
An Employee DB:
ID Gender Age Salary
1 F 27 19,000
2 M 51 64,000
3 M 52 100,000
4 F 33 55,000
5 M 45 45,000

Term Frequencies for Documents:
T1 T2 T3 T4 T5 T6
Doc1 0 4 0 0 0 2
Doc2 3 1 4 3 1 2
Doc3 3 0 0 0 3 0
Doc4 0 1 0 3 0 0
Doc5 2 2 2 3 1 4
Feature vector corresponding to
Employee 2: <M, 51, 64000.0>
Feature vector corresponding to Document 4:
<0, 1, 0, 3, 0, 0>
Distance or Similarity Measures
 Properties of Distance Measures:
 for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
 for any object A, dist(A, A) = 0
 dist(A, C) ≤ dist(A, B) + dist(B, C)
 Representation of objects as vectors:
 Each data object (item) can be viewed as an n-dimensional vector, where
the dimensions are the attributes (features) in the data
 Example (employee DB): Emp. ID 2 = <M, 51, 64000>
 Example (Documents): DOC2 = <3, 1, 4, 3, 1, 2>
 The vector representation allows us to compute distance or similarity
between pairs of items using standard vector operations, e.g.,
Cosine of the angle between vectors
Manhattan distance
Euclidean distance
Hamming Distance
Data Matrix and Distance Matrix
 Data matrix
 Conceptual representation of a table
Cols = features; rows = data objects
 n data points with p dimensions
 Each row in the matrix is the vector
representation of a data object
 Distance (or Similarity) Matrix
 n data points, but indicates only the
pairwise distance (or similarity)
 A triangular matrix
 Symmetric
Data matrix:
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Distance matrix:
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Proximity Measure for Nominal Attributes
 If object attributes are all nominal (categorical), then proximity
measures are used to compare objects
 A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
(a generalization of a binary attribute)
 Method 1: Simple matching (see the sketch below)
 m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
 Method 2: Convert to Standard Spreadsheet format
 For each attribute A, create M binary attributes for the M nominal states of A
 Then use standard vector-based similarity or distance metrics
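A minimal Python sketch of Method 1 (illustrative; the example attributes are hypothetical):

```python
def simple_matching_distance(obj_i, obj_j):
    """Simple matching for nominal attributes: d(i, j) = (p - m) / p,
    where m = number of matching attributes, p = total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three nominal attributes (color, size, shape):
print(simple_matching_distance(("red", "small", "round"),
                               ("red", "large", "round")))  # 1/3 = 0.333...
```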
Proximity Measure for Binary Attributes
 A contingency table for
binary data
 Distance measure for
symmetric binary variables
 Distance measure for
asymmetric binary variables
 Jaccard coefficient (similarity
measure for asymmetric
binary variables)
(Contingency table comparing objects i and j: counts of attributes where both are 1, only one is 1, or both are 0.)
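In the standard formulation, with q = # of attributes where both objects are 1, r = # where i is 1 and j is 0, s = # where i is 0 and j is 1, and t = # where both are 0, these measures are:

$$d_{\mathrm{sym}}(i, j) = \frac{r + s}{q + r + s + t} \qquad d_{\mathrm{asym}}(i, j) = \frac{r + s}{q + r + s} \qquad \mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}$$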
Normalizing or Standardizing Numeric Data
 Z-score: $z = \frac{x - \mu}{\sigma}$
 x: raw value to be standardized, μ: mean of the population,
σ: standard deviation
 the distance between the raw score and the population mean
in units of the standard deviation
 negative when the value is below the mean, “+” when above
 Min-Max Normalization: rescales each value onto [0, 1] via $x' = \frac{x - \min}{\max - \min}$
The employee DB:
ID Gender Age Salary
1 F 27 19,000
2 M 51 64,000
3 M 52 100,000
4 F 33 55,000
5 M 45 45,000

After min-max normalization (Gender encoded as F = 1, M = 0):
ID Gender Age Salary
1 1 0.00 0.00
2 0 0.96 0.56
3 0 1.00 1.00
4 1 0.24 0.44
5 0 0.72 0.32
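A short Python sketch (not from the original slides) that reproduces the normalized table:

```python
def z_score(values):
    """Z-score: z = (x - mu) / sigma, using the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max normalization: x' = (x - min) / (max - min), mapping onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [27, 51, 52, 33, 45]
salaries = [19000, 64000, 100000, 55000, 45000]
print([round(v, 2) for v in min_max(ages)])      # [0.0, 0.96, 1.0, 0.24, 0.72]
print([round(v, 2) for v in min_max(salaries)])  # [0.0, 0.56, 1.0, 0.44, 0.32]
print(round(z_score(ages)[1], 2))                # 0.94: age 51 is ~0.94 std devs above the mean
```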
Common Distance Measures for Numeric Data
 Consider two vectors
 Rows in the data matrix
 Common Distance Measures:
 Manhattan distance: $\mathrm{dist}(X, Y) = \sum_i |x_i - y_i|$
 Euclidean distance: $\mathrm{dist}(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$
 Distance can be defined as a dual of a similarity measure:
$$\mathrm{dist}(X, Y) = 1 - \mathrm{sim}(X, Y), \qquad \text{e.g., } \mathrm{sim}(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$
Example: Data Matrix and Distance Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Distance Matrix (Euclidean)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Data Matrix
Distance Matrix (Manhattan)
x1 x2 x3 x4
x1 0
x2 5 0
x3 3 6 0
x4 6 1 7 0
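A plain-Python sketch that reproduces both distance matrices above:

```python
def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1, x2, x3, x4
for dist in (euclidean, manhattan):
    print(dist.__name__)
    for i, p in enumerate(points):
        # lower triangle only, matching the distance matrices above
        print([round(dist(p, q), 2) for q in points[:i + 1]])
```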
Distance on Numeric Data:
Minkowski Distance
 Minkowski distance: A popular distance measure
 where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm)
 Note that Euclidean and Manhattan distances are special cases
 h = 1: (L1 norm) Manhattan distance
 h = 2: (L2 norm) Euclidean distance
$$d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$$

$$h = 1:\quad d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

$$h = 2:\quad d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
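A generic L-h norm in Python (a minimal sketch); h = 1 and h = 2 recover the Manhattan and Euclidean distances:

```python
def minkowski(x, y, h):
    """Minkowski (L-h norm) distance between two p-dimensional points."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))  # 5.0 (Manhattan)
print(minkowski(x1, x2, 2))  # 3.605... ≈ 3.61 (Euclidean)
```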
Vector-Based Similarity Measures
 In some situations, distance measures provide a skewed view of data
 E.g., when the data is very sparse and 0’s in the vectors are not significant
 In such cases, typically vector-based similarity measures are used
 Most common measure: Cosine similarity
 Dot product of two vectors $X = \langle x_1, x_2, \ldots, x_n \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$:
$$X \cdot Y = \sum_i x_i y_i$$
 Cosine Similarity = normalized dot product
 the norm of a vector X is: $\|X\| = \sqrt{\sum_i x_i^2}$
 the cosine similarity is:
$$\mathrm{sim}(X, Y) = \frac{X \cdot Y}{\|X\|\,\|Y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$
Vector-Based Similarity Measures
 Why divide by the norm?
 Example:
 X = <2, 0, 3, 2, 1, 4>
 ||X|| = SQRT(4+0+9+4+1+16) = 5.83
X* = X / ||X|| = <0.343, 0, 0.514, 0.343, 0.171, 0.686>
 Now, note that ||X*|| = 1
 So, dividing a vector by its norm turns it into a unit-length vector
 Cosine similarity measures the angle between two unit-length vectors (i.e., the
magnitudes of the vectors are ignored)
(Recall: $X = \langle x_1, x_2, \ldots, x_n \rangle$ and $\|X\| = \sqrt{\sum_i x_i^2}$.)
Example Application: Information Retrieval
 Documents are represented as “bags of words”
 Represented as vectors when used computationally
 A vector is an array of floating-point values (or binary values, in the case of bitmaps)
 Has direction and magnitude
 Each vector has a place for every term in collection (most are sparse)
nova galaxy heat actor film role
A 1.0 0.5 0.3
B 0.5 1.0
C 1.0 0.8 0.7
D 0.9 1.0 0.5
E 1.0 1.0
F 0.7
G 0.5 0.7 0.9
H 0.6 1.0 0.3 0.2
I 0.7 0.5 0.3
(In the table above, rows are document IDs; each row is a document vector.)
Documents & Query in n-dimensional Space
 Documents are represented as vectors in the term space
 Typically values in each dimension correspond to the frequency of the
corresponding term in the document
 Queries represented as vectors in the same vector-space
 Cosine similarity between the query and documents is often used
to rank retrieved documents
Example: Similarities among Documents
 Consider the following document-term matrix
T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2
Dot-Product(Doc2, Doc4) = <3,1,4,3,1,2,0,1> · <0,1,0,3,0,0,2,0>
= 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10
Norm (Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.4
Norm (Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74
Cosine(Doc2, Doc4) = 10 / (6.4 * 3.74) = 0.42
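The same computation as a Python sketch:

```python
def cosine(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]
print(round(cosine(doc2, doc4), 2))  # 0.42
```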
Correlation as Similarity
 In cases where mean values can differ substantially across data
objects (e.g., different users' average movie ratings), the Pearson
correlation coefficient is often the best option
 Pearson Correlation
 Often used in recommender systems based on Collaborative
Filtering
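The standard Pearson correlation between objects X and Y (with $\bar{x}$, $\bar{y}$ the means of their values) is:

$$\mathrm{sim}(X, Y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$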
Distance-Based Classification
 Basic Idea: classify new instances based on their similarity to or
distance from instances we have seen before
 also called “instance-based learning”
 Simplest form of MBR (memory-based reasoning): Rote Learning
 learning by memorization
 save all previously encountered instances; given a new instance, find one from
the memorized set that most closely “resembles” the new one; assign new
instance to the same class as the “nearest neighbor”
 more general methods try to find k nearest neighbors rather than just one
 but, how do we define “resembles?”
 MBR is “lazy”
 defers all of the real work until new instance is obtained; no attempt is made to
learn a generalized model from the training set
 less data preprocessing and model evaluation, but more work has to be done at
classification time
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
(Figure: given a test record, compute its distance to the training records, then choose the k “nearest” records.)
K-Nearest-Neighbor Strategy
 Given object x, find the k most similar objects to x
 The k nearest neighbors
 Variety of distance or similarity measures can be used to identify and rank
neighbors
 Note that this requires comparison between x and all objects in the database
 Classification:
 Find the class label for each of the k neighbors
 Use a voting or weighted voting approach to determine the majority class
among the neighbors (a combination function)
Weighted voting means the closest neighbors count more
 Assign the majority class label to x
 Prediction:
 Identify the value of the target attribute for the k neighbors
 Return the weighted average as the predicted value of the target attribute for x
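A minimal kNN classification sketch in Python (majority vote; the training data is the min-max-normalized employee table from earlier, using only Age and Salary, with labels adapted from the “Respond?” example below):

```python
from collections import Counter

def knn_classify(x, training_data, k, dist):
    """Classify x by majority vote among its k nearest neighbors.
    training_data is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(training_data, key=lambda rec: dist(x, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))

# (age, salary) min-max normalized
training = [((0.00, 0.00), "no"), ((0.96, 0.56), "yes"), ((1.00, 1.00), "yes"),
            ((0.24, 0.44), "yes"), ((0.72, 0.32), "no")]
# new customer: age 45, salary 100,000, normalized to (0.72, 1.00)
print(knn_classify((0.72, 1.00), training, k=3, dist=manhattan))  # "yes"
```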
K-Nearest-Neighbor Strategy
(Figure: the 1-, 2-, and 3-nearest neighborhoods of a record x.)
The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
Combination Functions
 Voting: the “democracy” approach
 poll the neighbors for the answer and use the majority vote
 the number of neighbors (k) is often taken to be odd in order to avoid ties
this works when the number of classes is two
if there are more than two classes, take k to be the number of classes plus 1
 Impact of k on predictions
 in general different values of k affect the outcome of classification
 we can associate a confidence level with predictions (this can be the % of
neighbors that are in agreement)
 problem is that no single category may get a majority vote
 if there are strong variations in results for different choices of k, this is an
indication that the training set is not large enough
Voting Approach - Example
ID Gender Age Salary Respond?
1 F 27 19,000 no
2 M 51 64,000 yes
3 M 52 105,000 yes
4 F 33 55,000 yes
5 M 45 45,000 no
new F 45 100,000 ?
Neighbors Answers k =1 k = 2 k = 3 k = 4 k = 5
D_man 4,3,5,2,1 Y,Y,N,Y,N yes yes yes yes yes
D_euclid 4,1,5,2,3 Y,N,N,Y,Y yes ? no ? yes
k =1 k = 2 k = 3 k = 4 k = 5
D_man yes, 100% yes, 100% yes, 67% yes, 75% yes, 60%
D_euclid yes, 100% yes, 50% no, 67% yes, 50% yes, 60%
Will a new customer
respond to solicitation?
Using the voting method without confidence
Using the voting method with a confidence
Combination Functions
 Weighted Voting: not so “democratic”
 similar to voting, but the votes of some neighbors count more
 “shareholder democracy?”
 question is which neighbor’s vote counts more?
 How can weights be obtained?
 Distance-based
closer neighbors get higher weights
“value” of the vote is the inverse of the distance (may need to add a small constant)
the weighted sum for each class gives the combined score for that class
to compute confidence, need to take weighted average
 Heuristic
weight for each neighbor is based on domain-specific characteristics of that neighbor
 Advantage of weighted voting
 introduces enough variation to prevent ties in most cases
 helps distinguish between competing neighbors
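A sketch of distance-based weighted voting (inverse-distance weights, with a small constant to avoid division by zero):

```python
from collections import defaultdict

def weighted_vote(x, training_data, k, dist, eps=1e-6):
    """Score each class by the summed inverse-distance weights of its k nearest
    neighbors; confidence is the winning class's share of the total weight."""
    neighbors = sorted(training_data, key=lambda rec: dist(x, rec[0]))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        scores[label] += 1.0 / (dist(x, features) + eps)  # closer neighbors count more
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())
```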
KNN and Collaborative Filtering
 Collaborative Filtering Example
 A movie rating system
 Ratings scale: 1 = “hate it”; 7 = “love it”
 Historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
 Karen is a new user who has rated 3 movies, but has not yet seen “Independence
Day”; should we recommend it to her?
Sally Bob Chris Lynn Karen
Star Wars 7 7 3 4 7
Jurassic Park 6 4 7 4 4
Terminator II 3 4 7 6 3
Independence Day 7 6 2 2 ?
Will Karen like “Independence Day?”
Collaborative Filtering
(k Nearest Neighbor Example)
Star Wars Jurassic Park Terminator 2 Indep. Day Average Cosine Manhattan Euclid Pearson
Sally 7 6 3 7 5.33 0.983 2 2.00 0.85
Bob 7 4 4 6 5.00 0.995 1 1.00 0.97
Chris 3 7 7 2 5.67 0.787 11 6.40 -0.97
Lynn 4 4 6 2 4.67 0.874 6 4.24 -0.69
Karen 7 4 3 ? 4.67 1.000 0 0.00 1.00
k Predicted rating
1 6
2 6.5
3 5
Example computation:
Pearson(Sally, Karen) = ( (7-5.33)*(7-4.67) + (6-5.33)*(4-4.67) + (3-5.33)*(3-4.67) )
/ SQRT( ((7-5.33)^2 + (6-5.33)^2 + (3-5.33)^2) * ((7-4.67)^2 + (4-4.67)^2 + (3-4.67)^2) ) = 0.85
k is the number of nearest neighbors used to find the average predicted rating of Karen on Indep. Day (the small table above).
Collaborative Filtering
(k Nearest Neighbor)
 In practice a more sophisticated approach is used to generate the predictions
based on the nearest neighbors
 To generate predictions for a target user a on an item i:
 ra = mean rating for user a
 u1, …, uk are the k-nearest-neighbors to a
 ru,i = rating of user u on item i
 sim(a,u) = Pearson correlation between a and u
 This is a weighted average of deviations from the neighbors’ mean
ratings (and closer neighbors count more)
$$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{k} \mathrm{sim}(a, u)\,(r_{u,i} - \bar{r}_u)}{\sum_{u=1}^{k} \mathrm{sim}(a, u)}$$
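A Python sketch of this prediction, using the data from the worked example above (means are over each user's three co-rated movies, as in that table):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) ** 0.5) * (sum((b - my) ** 2 for b in y) ** 0.5)
    return num / den

def predict(a_mean, neighbors):
    """p(a, i) = r_a + sum(sim * (r_ui - r_u)) / sum(sim) over the k neighbors.
    neighbors is a list of (sim(a, u), r_ui, u_mean) triples."""
    num = sum(s * (r - u_mean) for s, r, u_mean in neighbors)
    return a_mean + num / sum(s for s, _, _ in neighbors)

karen = [7, 4, 3]                  # Star Wars, Jurassic Park, Terminator II
bob, sally = [7, 4, 4], [7, 6, 3]  # Karen's two nearest neighbors by Pearson
neighbors = [(pearson(karen, bob), 6, 5.00),    # Bob rated Indep. Day 6
             (pearson(karen, sally), 7, 5.33)]  # Sally rated Indep. Day 7
print(round(predict(sum(karen) / 3, neighbors), 2))  # ~5.98 (vs. 6.5 from the simple k=2 average)
```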
  • 28. 28 Collaborative Filtering (k Nearest Neighbor)  In practice a more sophisticated approach is used to generate the predictions based on the nearest neighbors  To generate predictions for a target user a on an item i:  ra = mean rating for user a  u1, …, uk are the k-nearest-neighbors to a  ru,i = rating of user u on item I  sim(a,u) = Pearson correlation between a and u  This is a weighted average of deviations from the neighbors’ mean ratings (and closer neighbors count more)         k u k u u i u a i a u a sim u a sim r r r p 1 1 , , ) , ( ) , ( ) (