SlideShare a Scribd company logo
DATA MINING 1
Instance-based Classifiers
Dino Pedreschi, Riccardo Guidotti
a.a. 2022/2023
Slides edited from Tan, Steinbach, Kumar, Introduction to Data Mining
Instance-based Classifiers
• Instead of performing explicit generalization, compare new instances
with instances seen in training, which have been stored in memory.
• Sometimes called memory-based learning.
• Advantages
• Adapt its model to previously unseen data by storing a new instance or
throwing an old instance away.
• Disadvantages
• Lazy learner: it does not build a model explicitly.
• Classifying unknown records is relatively expensive: in the worst case, given n
training items, the complexity of classifying a single instance is O(n).
Nearest-Neighbor Classifier (K-NN)
Basic idea: If it walks like a duck, quacks
like a duck, then it’s probably a duck.
Requires three things
1. Training set of stored records
2. Distance metric to compute
distance between records
3. The value of k, the number of
nearest neighbors to retrieve
Nearest-Neighbor Classifier (K-NN)
Given a set of training records (memory),
and a test record:
1. Compute the distances from the
records in the training to the test.
2. Identify the k “nearest” records.
3. Use class labels of nearest neighbors to
determine the class label of unknown
record (e.g., by taking majority vote).
Definition of Nearest Neighbor
• K-nearest neighbors of a record x are data points that have the k
smallest distance to x.
Choosing the Value of K
• If k is too small, it is sensitive to
noise points and it can leads to
overfitting to the noise in the
training set.
• If k is too large, the neighborhood
may include points from other
classes.
• General practice k = sqrt(N) where
N is the number of samples in the
training dataset.
Nearest Neighbor Classification
Compute distance between two points:
• Euclidean distance 𝑑 𝑝, 𝑞 = ∑!(𝑝! − 𝑞!)"
Determine the class from nearest neighbors
• take the majority vote of class labels among the k nearest neighbors
• weigh the vote according to distance (e.g. weight factor, w = 1/d2)
Dimensionality and Scaling Issues
• Problem with Euclidean measure: high dimensional data can cause
curse of dimensionality.
• Solution: normalize the vectors to unit length
• Attributes may have to be scaled to prevent distance measures from
being dominated by one of the attributes.
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 10km to 200kg
• income of a person may vary from $10K to $1M
Parallel Exemplar-Based Learning System (PEBLS)
• PEBLS is a nearest-neighbor learning system (k=1) designed for
applications where the instances have symbolic feature values.
• Works with both continuous and nominal features.
• For nominal features, the distance between two nominal values is
computed using Modified Value Difference Metric (MVDM)
• 𝑑 𝑉#, 𝑉" = ∑! |
$!"
$!
−
$#"
$#
|
• Where n1 is the number of records that consists of nominal attribute
value V1 and n1i is the number of records whose target label is class i.
Distance Between Nominal Attribute Values
• d(Status=Single, Status=Married) = | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1
• d(Status=Single, Status=Divorced) = | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0
• d(Status=Married, Status=Divorced) = | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1
• d(Refund=Yes, Refund=No) = | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7
Distance Between Records
• 𝛿 𝑋, 𝑌 = 𝑤!𝑤" ∑#$%
&
𝑑(𝑋#, 𝑌#)
• Each record X is assigned a weight 𝑤! =
'!"#$%&'(
'!"#$%&'(
')##$'(
, which represents its reliability
• 𝑁!"#$%&'(
is the number of times X is used for prediction
• 𝑁!"#$%&'(
')##$'( is the number of times the prediction using X is correct
• If 𝑤! ≅ 1 X makes accurate prediction most of the time
• If 𝑤!> 1, then X is not reliable for making predictions. High 𝑤! >1 would result in
high distance, which makes it less possible to use X to make predictions.
Characteristics of Nearest Neighbor Classifiers
• Instance-based learner: makes predictions without maintaining
abstraction, i.e., building a model like decision trees.
• It is a lazy learner: classifying a test example can be expensive because
need to compute the proximity values between test and training examples.
• In contrast eager learners spend time in building the model but then the
classification is fast.
• Make their prediction on local information and for low k they are
susceptible to noise.
• Can produce wrong predictions if inappropriate distance functions and/or
preprocessing steps are performed.
k-NN for Regression
Given a set of training records
(memory), and a test record:
1. Compute the distances from the
records in the training to the test.
2. Identify the k “nearest” records.
3. Use target value of nearest
neighbors to determine the value
of unknown record (e.g., by
averaging the values).
3.4
5.4
5.7
8.7
9.7
3.9
3.8
4.1
5.1
4.3
10.3 9.2
3.2
5.9
4.7
12.1
10.8
8.7
9.7
3.4
4.2
References
• Nearest Neighbor classifiers. Chapter 5.2.
Introduction to Data Mining.
Exercises - kNN
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf

More Related Content

Similar to 09_dm1_knn_2022_23.pdf

K nearest neighbours
K nearest neighboursK nearest neighbours
K nearest neighbours
Learnbay Datascience
 
Clustering
ClusteringClustering
k-Nearest Neighbors with brief explanation.pptx
k-Nearest Neighbors with brief explanation.pptxk-Nearest Neighbors with brief explanation.pptx
k-Nearest Neighbors with brief explanation.pptx
gamingzonedead880
 
Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?
Zachary Thomas
 
Knn 160904075605-converted
Knn 160904075605-convertedKnn 160904075605-converted
Knn 160904075605-converted
rameswara reddy venkat
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 
[UMAP2013] Recommendation with Differential Context Weighting
[UMAP2013] Recommendation with Differential Context Weighting[UMAP2013] Recommendation with Differential Context Weighting
[UMAP2013] Recommendation with Differential Context Weighting
YONG ZHENG
 
K Nearest neighbour
K Nearest neighbourK Nearest neighbour
K Nearest neighbour
Learnbay Datascience
 
Machine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University ChhattisgarhMachine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University Chhattisgarh
Poorabpatel
 
Pattern recognition
Pattern recognitionPattern recognition
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
AmAn Singh
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
Pier Luca Lanzi
 
Knn
KnnKnn
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
Tuan Yang
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
Nandhini S
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
Yan Xu
 
Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1
Kaniska Mandal
 
Lec 18-19.pptx
Lec 18-19.pptxLec 18-19.pptx
Lec 18-19.pptx
vijaita kashyap
 
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Professor Lili Saghafi
 

Similar to 09_dm1_knn_2022_23.pdf (20)

K nearest neighbours
K nearest neighboursK nearest neighbours
K nearest neighbours
 
Clustering
ClusteringClustering
Clustering
 
k-Nearest Neighbors with brief explanation.pptx
k-Nearest Neighbors with brief explanation.pptxk-Nearest Neighbors with brief explanation.pptx
k-Nearest Neighbors with brief explanation.pptx
 
Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?
 
Knn 160904075605-converted
Knn 160904075605-convertedKnn 160904075605-converted
Knn 160904075605-converted
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
[UMAP2013] Recommendation with Differential Context Weighting
[UMAP2013] Recommendation with Differential Context Weighting[UMAP2013] Recommendation with Differential Context Weighting
[UMAP2013] Recommendation with Differential Context Weighting
 
K Nearest neighbour
K Nearest neighbourK Nearest neighbour
K Nearest neighbour
 
Machine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University ChhattisgarhMachine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University Chhattisgarh
 
Pattern recognition
Pattern recognitionPattern recognition
Pattern recognition
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
Knn
KnnKnn
Knn
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
 
Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1
 
Lec 18-19.pptx
Lec 18-19.pptxLec 18-19.pptx
Lec 18-19.pptx
 
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
 

Recently uploaded

The Steadfast and Reliable Bull: Taurus Zodiac Sign
The Steadfast and Reliable Bull: Taurus Zodiac SignThe Steadfast and Reliable Bull: Taurus Zodiac Sign
The Steadfast and Reliable Bull: Taurus Zodiac Sign
my Pandit
 
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
Lacey Max
 
GKohler - Retail Scavenger Hunt Presentation
GKohler - Retail Scavenger Hunt PresentationGKohler - Retail Scavenger Hunt Presentation
GKohler - Retail Scavenger Hunt Presentation
GraceKohler1
 
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
 
Call8328958814 satta matka Kalyan result satta guessing
Call8328958814 satta matka Kalyan result satta guessingCall8328958814 satta matka Kalyan result satta guessing
Call8328958814 satta matka Kalyan result satta guessing
➑➌➋➑➒➎➑➑➊➍
 
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
taqyea
 
Digital Marketing with a Focus on Sustainability
Digital Marketing with a Focus on SustainabilityDigital Marketing with a Focus on Sustainability
Digital Marketing with a Focus on Sustainability
sssourabhsharma
 
Top mailing list providers in the USA.pptx
Top mailing list providers in the USA.pptxTop mailing list providers in the USA.pptx
Top mailing list providers in the USA.pptx
JeremyPeirce1
 
2022 Vintage Roman Numerals Men Rings
2022 Vintage Roman  Numerals  Men  Rings2022 Vintage Roman  Numerals  Men  Rings
2022 Vintage Roman Numerals Men Rings
aragme
 
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
my Pandit
 
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
taqyea
 
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
BBPMedia1
 
Profiles of Iconic Fashion Personalities.pdf
Profiles of Iconic Fashion Personalities.pdfProfiles of Iconic Fashion Personalities.pdf
Profiles of Iconic Fashion Personalities.pdf
TTop Threads
 
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta MatkaDpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
➒➌➎➏➑➐➋➑➐➐Dpboss Matka Guessing Satta Matka Kalyan Chart Indian Matka
 
Registered-Establishment-List-in-Uttarakhand-pdf.pdf
Registered-Establishment-List-in-Uttarakhand-pdf.pdfRegistered-Establishment-List-in-Uttarakhand-pdf.pdf
Registered-Establishment-List-in-Uttarakhand-pdf.pdf
dazzjoker
 
Industrial Tech SW: Category Renewal and Creation
Industrial Tech SW:  Category Renewal and CreationIndustrial Tech SW:  Category Renewal and Creation
Industrial Tech SW: Category Renewal and Creation
Christian Dahlen
 
Pitch Deck Teardown: Kinnect's $250k Angel deck
Pitch Deck Teardown: Kinnect's $250k Angel deckPitch Deck Teardown: Kinnect's $250k Angel deck
Pitch Deck Teardown: Kinnect's $250k Angel deck
HajeJanKamps
 
Zodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
Zodiac Signs and Food Preferences_ What Your Sign Says About Your TasteZodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
Zodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
my Pandit
 
list of states and organizations .pdf
list of  states  and  organizations .pdflist of  states  and  organizations .pdf
list of states and organizations .pdf
Rbc Rbcua
 
The latest Heat Pump Manual from Newentide
The latest Heat Pump Manual from NewentideThe latest Heat Pump Manual from Newentide
The latest Heat Pump Manual from Newentide
JoeYangGreatMachiner
 

Recently uploaded (20)

The Steadfast and Reliable Bull: Taurus Zodiac Sign
The Steadfast and Reliable Bull: Taurus Zodiac SignThe Steadfast and Reliable Bull: Taurus Zodiac Sign
The Steadfast and Reliable Bull: Taurus Zodiac Sign
 
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
How are Lilac French Bulldogs Beauty Charming the World and Capturing Hearts....
 
GKohler - Retail Scavenger Hunt Presentation
GKohler - Retail Scavenger Hunt PresentationGKohler - Retail Scavenger Hunt Presentation
GKohler - Retail Scavenger Hunt Presentation
 
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka Result Satta Matka Guessing Satta Fix jodi Kalyan Fin...
 
Call8328958814 satta matka Kalyan result satta guessing
Call8328958814 satta matka Kalyan result satta guessingCall8328958814 satta matka Kalyan result satta guessing
Call8328958814 satta matka Kalyan result satta guessing
 
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
一比一原版新西兰奥塔哥大学毕业证(otago毕业证)如何办理
 
Digital Marketing with a Focus on Sustainability
Digital Marketing with a Focus on SustainabilityDigital Marketing with a Focus on Sustainability
Digital Marketing with a Focus on Sustainability
 
Top mailing list providers in the USA.pptx
Top mailing list providers in the USA.pptxTop mailing list providers in the USA.pptx
Top mailing list providers in the USA.pptx
 
2022 Vintage Roman Numerals Men Rings
2022 Vintage Roman  Numerals  Men  Rings2022 Vintage Roman  Numerals  Men  Rings
2022 Vintage Roman Numerals Men Rings
 
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
Unveiling the Dynamic Personalities, Key Dates, and Horoscope Insights: Gemin...
 
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
一比一原版(QMUE毕业证书)英国爱丁堡玛格丽特女王大学毕业证文凭如何办理
 
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
NIMA2024 | De toegevoegde waarde van DEI en ESG in campagnes | Nathalie Lam |...
 
Profiles of Iconic Fashion Personalities.pdf
Profiles of Iconic Fashion Personalities.pdfProfiles of Iconic Fashion Personalities.pdf
Profiles of Iconic Fashion Personalities.pdf
 
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta MatkaDpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
Dpboss Matka Guessing Satta Matta Matka Kalyan Chart Satta Matka
 
Registered-Establishment-List-in-Uttarakhand-pdf.pdf
Registered-Establishment-List-in-Uttarakhand-pdf.pdfRegistered-Establishment-List-in-Uttarakhand-pdf.pdf
Registered-Establishment-List-in-Uttarakhand-pdf.pdf
 
Industrial Tech SW: Category Renewal and Creation
Industrial Tech SW:  Category Renewal and CreationIndustrial Tech SW:  Category Renewal and Creation
Industrial Tech SW: Category Renewal and Creation
 
Pitch Deck Teardown: Kinnect's $250k Angel deck
Pitch Deck Teardown: Kinnect's $250k Angel deckPitch Deck Teardown: Kinnect's $250k Angel deck
Pitch Deck Teardown: Kinnect's $250k Angel deck
 
Zodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
Zodiac Signs and Food Preferences_ What Your Sign Says About Your TasteZodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
Zodiac Signs and Food Preferences_ What Your Sign Says About Your Taste
 
list of states and organizations .pdf
list of  states  and  organizations .pdflist of  states  and  organizations .pdf
list of states and organizations .pdf
 
The latest Heat Pump Manual from Newentide
The latest Heat Pump Manual from NewentideThe latest Heat Pump Manual from Newentide
The latest Heat Pump Manual from Newentide
 

09_dm1_knn_2022_23.pdf

  • 1. DATA MINING 1 Instance-based Classifiers Dino Pedreschi, Riccardo Guidotti a.a. 2022/2023 Slides edited from Tan, Steinbach, Kumar, Introduction to Data Mining
  • 2. Instance-based Classifiers • Instead of performing explicit generalization, compare new instances with instances seen in training, which have been stored in memory. • Sometimes called memory-based learning. • Advantages • Adapt its model to previously unseen data by storing a new instance or throwing an old instance away. • Disadvantages • Lazy learner: it does not build a model explicitly. • Classifying unknown records is relatively expensive: in the worst case, given n training items, the complexity of classifying a single instance is O(n).
  • 3. Nearest-Neighbor Classifier (K-NN) Basic idea: If it walks like a duck, quacks like a duck, then it’s probably a duck. Requires three things 1. Training set of stored records 2. Distance metric to compute distance between records 3. The value of k, the number of nearest neighbors to retrieve
  • 4. Nearest-Neighbor Classifier (K-NN) Given a set of training records (memory), and a test record: 1. Compute the distances from the records in the training to the test. 2. Identify the k “nearest” records. 3. Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote).
  • 5. Definition of Nearest Neighbor • K-nearest neighbors of a record x are data points that have the k smallest distance to x.
  • 6. Choosing the Value of K • If k is too small, it is sensitive to noise points and it can leads to overfitting to the noise in the training set. • If k is too large, the neighborhood may include points from other classes. • General practice k = sqrt(N) where N is the number of samples in the training dataset.
  • 7. Nearest Neighbor Classification Compute distance between two points: • Euclidean distance 𝑑 𝑝, 𝑞 = ∑!(𝑝! − 𝑞!)" Determine the class from nearest neighbors • take the majority vote of class labels among the k nearest neighbors • weigh the vote according to distance (e.g. weight factor, w = 1/d2)
  • 8. Dimensionality and Scaling Issues • Problem with Euclidean measure: high dimensional data can cause curse of dimensionality. • Solution: normalize the vectors to unit length • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. • Example: • height of a person may vary from 1.5m to 1.8m • weight of a person may vary from 10km to 200kg • income of a person may vary from $10K to $1M
  • 9. Parallel Exemplar-Based Learning System (PEBLS) • PEBLS is a nearest-neighbor learning system (k=1) designed for applications where the instances have symbolic feature values. • Works with both continuous and nominal features. • For nominal features, the distance between two nominal values is computed using Modified Value Difference Metric (MVDM) • 𝑑 𝑉#, 𝑉" = ∑! | $!" $! − $#" $# | • Where n1 is the number of records that consists of nominal attribute value V1 and n1i is the number of records whose target label is class i.
  • 10. Distance Between Nominal Attribute Values • d(Status=Single, Status=Married) = | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1 • d(Status=Single, Status=Divorced) = | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0 • d(Status=Married, Status=Divorced) = | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1 • d(Refund=Yes, Refund=No) = | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7
  • 11. Distance Between Records • 𝛿 𝑋, 𝑌 = 𝑤!𝑤" ∑#$% & 𝑑(𝑋#, 𝑌#) • Each record X is assigned a weight 𝑤! = '!"#$%&'( '!"#$%&'( ')##$'( , which represents its reliability • 𝑁!"#$%&'( is the number of times X is used for prediction • 𝑁!"#$%&'( ')##$'( is the number of times the prediction using X is correct • If 𝑤! ≅ 1 X makes accurate prediction most of the time • If 𝑤!> 1, then X is not reliable for making predictions. High 𝑤! >1 would result in high distance, which makes it less possible to use X to make predictions.
  • 12. Characteristics of Nearest Neighbor Classifiers • Instance-based learner: makes predictions without maintaining abstraction, i.e., building a model like decision trees. • It is a lazy learner: classifying a test example can be expensive because need to compute the proximity values between test and training examples. • In contrast eager learners spend time in building the model but then the classification is fast. • Make their prediction on local information and for low k they are susceptible to noise. • Can produce wrong predictions if inappropriate distance functions and/or preprocessing steps are performed.
  • 13. k-NN for Regression Given a set of training records (memory), and a test record: 1. Compute the distances from the records in the training to the test. 2. Identify the k “nearest” records. 3. Use target value of nearest neighbors to determine the value of unknown record (e.g., by averaging the values). 3.4 5.4 5.7 8.7 9.7 3.9 3.8 4.1 5.1 4.3 10.3 9.2 3.2 5.9 4.7 12.1 10.8 8.7 9.7 3.4 4.2
  • 14. References • Nearest Neighbor classifiers. Chapter 5.2. Introduction to Data Mining.