K Nearest Neighbor Classification
Bayes Classifier: Recap
[Figure: class posteriors P( HILSA | L), P( TUNA | L), and P( SHARK | L) plotted against the measured feature L]
Maximum A Posteriori (MAP) Rule
Distributions are assumed to belong to a particular family (e.g., Gaussian), and their
parameters are estimated from training data.
Bayes Classifier: Recap
[Figure: the same class posteriors, with a small window around the observed value L]
Approximate Maximum A Posteriori (MAP) Rule
Non-parametric (data-driven) approach: consider a small window around L,
and find which class is most populous in that window.
Nearest Neighbor Classifiers
 Basic idea:
 If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: a set of training records and a test record; compute the distance from the test record to each training record, then choose the k "nearest" records]
Basic Idea
 k-NN classification rule is to assign to a test sample
the majority category label of its k nearest training
samples
 In practice, k is usually chosen to be odd, so as to
avoid ties
 The k = 1 rule is generally called the nearest-neighbor classification rule
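To make the rule concrete, here is a minimal sketch of k-NN classification in Python (the toy data and the Euclidean metric are assumptions for illustration, not part of the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training record
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority category label among those k neighbors
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(train_X, train_y, np.array([1.1, 0.9]), k=3))  # -> "duck"
```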
Definition of Nearest Neighbor
[Figure: a test point X with its (a) 1-nearest neighbor, (b) 2-nearest neighbor, and (c) 3-nearest neighbor neighborhoods]
The k-nearest neighbors of a record x are the data points
with the k smallest distances to x
Voronoi Diagram
Properties:
1) Every point within a sample's Voronoi cell has that
sample as its nearest neighbor
2) For any sample, the nearest sample is the one whose
cell shares the closest Voronoi cell edge
Distance-weighted k-NN
 Replace the unweighted vote

\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

by the inverse-squared-distance weighted vote:

\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \frac{1}{d(x_i, x_q)^2} \, \delta(v, f(x_i))

where \delta(a, b) = 1 if a = b and 0 otherwise.
General kernel functions, such as Parzen windows, may be used
in place of the inverse-distance weight.
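A sketch of the weighted vote above, with inverse-squared-distance weights (the small `eps` guard against zero distance is an added assumption):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-12):
    """Vote for each class with weight 1 / d(x_i, x_q)^2 over the k nearest neighbors."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] ** 2 + eps)  # eps guards d = 0
    return max(votes, key=votes.get)
```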
Predicting Continuous Values
 Replace the weighted vote

\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))

by the weighted average:

\hat{f}(q) = \frac{\sum_{i=1}^{k} w_i \, f(x_i)}{\sum_{i=1}^{k} w_i}

 Note: the unweighted version corresponds to w_i = 1 for all i
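A sketch of the weighted-average prediction for continuous targets; inverse-squared-distance weights are assumed here, matching the previous slide:

```python
import numpy as np

def weighted_knn_regress(train_X, train_y, query, k=3, eps=1e-12):
    """Predict sum(w_i * f(x_i)) / sum(w_i) over the k nearest neighbors."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)  # w_i = 1 recovers the unweighted mean
    return np.sum(w * train_y[nearest]) / np.sum(w)
```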
Nearest-Neighbor Classifiers: Issues
– The value of k, the number of nearest
neighbors to retrieve
– Choice of Distance Metric to compute
distance between records
– Computational complexity
– Size of training set
– Dimension of data
Value of K
 Choosing the value of k:
 If k is too small, sensitive to noise points
 If k is too large, neighborhood may include points from
other classes
[Figure: a test point X whose neighborhood grows with k]
Rule of thumb: k = sqrt(N), where N is the number of training points.
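In practice, k is often tuned by cross-validation rather than fixed at sqrt(N); a sketch using scikit-learn (the synthetic dataset and the parameter grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Odd values of k only, to avoid ties; sqrt(400) = 20 sits mid-range
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 41, 2))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the k with the best cross-validated accuracy
```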
Distance Metrics
Distance Measure: Scale Effects
 Different features may have different measurement
scales
 E.g., patient weight in kg (range [50,200]) vs. blood
protein values in ng/dL (range [-3,3])
 Consequences
 Patient weight will have a much greater influence on the
distance between samples
 May bias the performance of the classifier
Standardization
 Transform raw feature values into z-scores:

z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}

 x_{ij} is the value for the ith sample and jth feature
 \mu_j is the average of all x_{ij} for feature j
 \sigma_j is the standard deviation of all x_{ij} over all input samples
 Range and scale of z-scores should be similar
(provided the distributions of raw feature values are alike)
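A minimal sketch of the transform; fitting \mu_j and \sigma_j on the training data and reusing them for test data is an assumption beyond the slide:

```python
import numpy as np

def standardize(train_X, test_X):
    """z_ij = (x_ij - mu_j) / sigma_j, with mu and sigma estimated per feature."""
    mu = train_X.mean(axis=0)
    sigma = train_X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (train_X - mu) / sigma, (test_X - mu) / sigma
```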
Nearest Neighbor : Dimensionality
 Problem with Euclidean measure:
 High dimensional data
 curse of dimensionality
 Can produce counter-intuitive results
 Shrinking density – sparsification effect
1 1 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 1 1
d = 1.4142

vs.

1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
d = 1.4142
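A quick check that both pairs above are the same Euclidean distance apart, even though one pair agrees on ten 1s and the other on ten 0s (illustrative only):

```python
import numpy as np

a = np.array([1,1,1,1,1,1,1,1,1,1,1,0])
b = np.array([0,1,1,1,1,1,1,1,1,1,1,1])
c = np.array([1,0,0,0,0,0,0,0,0,0,0,0])
d = np.array([0,0,0,0,0,0,0,0,0,0,0,1])

print(np.linalg.norm(a - b))  # 1.4142... (differ in 2 positions)
print(np.linalg.norm(c - d))  # 1.4142... (also differ in 2 positions)
```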
Distance for Nominal Attributes
Distance for Heterogeneous Data
Wilson, D. R. and Martinez, T. R., Improved Heterogeneous Distance Functions, Journal of
Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997
Nearest Neighbour : Computational
Complexity
 Expensive
 To determine the nearest neighbour of a query point q, must
compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
 Storage Requirements
 Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
 High Dimensional Data
 “Curse of Dimensionality”
 Required amount of training data increases exponentially with
dimension
 Computational cost also increases dramatically
 Partitioning techniques degrade to linear search in high dimension
Reduction in Computational Complexity
 Reduce size of training set
 Condensation, editing
 Use geometric data structure for high dimensional
search
Condensation: Decision Regions
Each cell contains one
sample, and every
location within the cell is
closer to that sample than
to any other sample.
A Voronoi diagram divides
the space into such cells.
Every query point will be assigned the classification of the sample within that
cell. The decision boundary separates the class regions based on the 1-NN
decision rule.
Knowledge of this boundary is sufficient to classify new points.
The boundary itself is rarely computed; many algorithms seek to retain only
those points necessary to generate an identical boundary.
Condensing
 Aim is to reduce the number of training samples
 Retain only the samples that are needed to define the decision boundary
 Decision Boundary Consistent – a subset whose nearest neighbour decision
boundary is identical to the boundary of the entire training set
 Minimum Consistent Set – the smallest subset of the training data that correctly
classifies all of the original training data
[Figure: original data, condensed data, and the minimum consistent set]
Condensing
 Condensed Nearest Neighbour (CNN)
1. Initialize subset with a single
(or K) training example
2. Classify all remaining
samples using the subset,
and transfer any incorrectly
classified samples to the
subset
3. Return to 2 until no transfers
occurred or the subset is full
•Incremental
•Order dependent
•Neither minimal nor decision
boundary consistent
•O(n³) for the brute-force method
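A sketch of the CNN loop above, classifying with 1-NN against the growing subset (scanning samples in index order is an assumption; as noted, the result is order dependent):

```python
import numpy as np

def condense(train_X, train_y):
    """Condensed Nearest Neighbour: transfer samples the current subset misclassifies."""
    keep = [0]  # 1. initialize the subset with a single training example
    changed = True
    while changed:  # 3. repeat until a full pass makes no transfers
        changed = False
        for i in range(len(train_X)):
            if i in keep:
                continue
            # 2. classify sample i with 1-NN against the subset
            dists = np.linalg.norm(train_X[keep] - train_X[i], axis=1)
            pred = train_y[keep][np.argmin(dists)]
            if pred != train_y[i]:  # transfer incorrectly classified samples
                keep.append(i)
                changed = True
    return np.array(keep)
```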
High dimensional search
 Given a point set and a nearest neighbor query point
 Find the points enclosed in a rectangle (range) around the
query
 Perform linear search for nearest neighbor only in the
rectangle
[Figure: a query point with a surrounding search rectangle]
kd-tree: data structure for range search
 Index data into a tree
 Search on the tree
 Tree construction: At each level we use a different
dimension to split
[Figure: a kd-tree partition of the plane with splits x=5, y=3, y=6, and x=6 defining cells A through E; the root splits on x<5 vs. x>=5]
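A compact sketch of kd-tree construction (cycling the split dimension by depth and splitting at the median) and a rectangular range query; the dict-based node layout is an assumption:

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Split on a different dimension at each level, at the median point."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]           # cycle through dimensions
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                   # median as the split point
    return {
        "point": points[mid],
        "left":  build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_search(node, lo, hi, depth=0, out=None):
    """Collect points inside the axis-aligned rectangle [lo, hi]."""
    if out is None:
        out = []
    if node is None:
        return out
    p = node["point"]
    if np.all(p >= lo) and np.all(p <= hi):
        out.append(p)
    axis = depth % len(p)
    if lo[axis] <= p[axis]:                  # rectangle overlaps the left half-space
        range_search(node["left"], lo, hi, depth + 1, out)
    if hi[axis] >= p[axis]:                  # rectangle overlaps the right half-space
        range_search(node["right"], lo, hi, depth + 1, out)
    return out
```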
kd-tree example
[Figure: a worked kd-tree example; successive splits at x=5, x=3, x=8, x=7, y=6, and y=2 partition the plane]
KNN: Alternate Terminologies
 Instance Based Learning
 Lazy Learning
 Case Based Reasoning
 Exemplar Based Learning