Instance-Based Learning

  1. Instance-Based Learning
  2. • Instance-Based Learning (Lazy Learning)
        – Learning = storing all “training” instances
        – Classification = a new instance is assigned the class of the nearest stored instances
  3. Instance-Based Learning
        It’s very similar to a desktop!
  4. Instance/Memory-Based Learning
        • Non-parametric
           – Hypothesis (assumption) complexity grows with the data
        • Memory-based learning
           – Constructs hypotheses directly from the training data itself
  5. K Nearest Neighbors
        • The key issues involved in training this model include setting:
           – the variable K
              • validation techniques (e.g. cross-validation)
           – the type of distance metric
              • Euclidean measure: Dist(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}
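A minimal sketch of the Euclidean measure above, using plain NumPy (the function name and example values are illustrative, not from the slides):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two D-dimensional pattern vectors x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Example: two 3-dimensional patterns
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # -> 5.0
```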
  6. Figure: K Nearest Neighbors example
        [Scatter plot: stored training set patterns, the input pattern to be classified, and dashed lines marking the Euclidean distance to the nearest three patterns.]
  7. Store all input data in the training set.
        For each pattern in the test set:
           Search for the K nearest patterns to the input pattern using a Euclidean distance measure.
        For classification, compute the confidence for each class as Ci / K, where Ci is the number of patterns among the K nearest patterns belonging to class i.
        The classification for the input pattern is the class with the highest confidence.
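A minimal Python sketch of this procedure, assuming the stored training patterns are rows of a NumPy array and the class labels are integers (names such as knn_classify are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Classify input pattern x using the K nearest stored training patterns."""
    # Euclidean distance from x to every stored training pattern
    dists = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
    # Indices of the K nearest stored patterns
    nearest = np.argsort(dists)[:k]
    # Confidence for class i is Ci / K, where Ci counts neighbors of class i
    counts = Counter(int(train_y[i]) for i in nearest)
    confidences = {cls: c / k for cls, c in counts.items()}
    # The predicted class is the one with the highest confidence
    return max(confidences, key=confidences.get), confidences

# Example usage with a tiny stored training set
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
label, conf = knn_classify(train_X, train_y, np.array([0.2, 0.1]), k=3)
print(label, conf)  # -> 0 {0: 0.666..., 1: 0.333...}
```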
  8. Training parameters and typical settings
        • Number of nearest neighbors
           – The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings.
           – K = 1 is a good baseline model to benchmark against.
           – A good rule of thumb is that K should be less than the square root of the total number of training patterns.
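A sketch of choosing K by cross-validation, assuming scikit-learn is available; the Iris data, the 5-fold split, and the candidate K range (1 up to the square-root rule of thumb) are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in training patterns and labels

# Candidate K values: from the K = 1 baseline up to sqrt(number of training patterns)
candidates = range(1, int(np.sqrt(len(X))) + 1)

scores = {}
for k in candidates:
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validation accuracy for this K
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```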
  9. Training parameters and typical settings
        • Input compression
           – Since KNN is very storage intensive, we may want to compress the data patterns as a preprocessing step before classification.
           – Using input compression typically results in slightly worse performance.
           – Sometimes compression improves performance, because it performs automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure.
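The slides do not name a specific compression scheme; one common choice is standardization followed by PCA, sketched here with scikit-learn (the dataset, component count, and K are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in training patterns and labels

# Standardize each input variable, then compress the 4 inputs down to 2 components
# before they reach the K-nearest-neighbor classifier.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict(X[:5]))
```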
  10. Issues
        • Distance measure
           – Most common: Euclidean
           – Better distance measures: normalize each variable by its standard deviation
           – For discrete data, Hamming distance can be used
        • Choosing K
           – Increasing K reduces variance but increases bias
        • In high-dimensional spaces, the nearest neighbor may not be very close at all!
        • KNN is a memory-based technique: it must make a pass through the data for each classification, which can be prohibitive for large data sets.
        • Indexing the data can help; for example, k-d trees (see the sketch below).
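A sketch of k-d tree indexing for the neighbor search, assuming SciPy is available (the random data, labels, and K are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_X = rng.random((10_000, 3))          # stored training patterns
train_y = rng.integers(0, 2, size=10_000)  # their class labels

# Build the k-d tree index once, then query it instead of scanning every pattern
tree = cKDTree(train_X)
dists, idx = tree.query(rng.random(3), k=5)  # 5 nearest neighbors of a new pattern

# Majority vote (confidence Ci / K) among the retrieved neighbors
labels, counts = np.unique(train_y[idx], return_counts=True)
print(labels[np.argmax(counts)])
```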
