Instance-Based Learning
V. Saranya, AP/CSE
Sri Vidya College of Engineering and Technology, Virudhunagar
• Instance-Based Learning (Lazy Learning)
– Learning = storing all “training” instances
– Classification = a new instance is assigned the classification of the stored instances nearest to it
Instance-Based Learning
It's very similar to a Desktop!!
Instance/Memory-based Learning
• Non-parametric
– Hypothesis (assumption) complexity grows with the data
• Memory-based learning
– Construct hypotheses directly from the training data itself
K Nearest Neighbors
• The key issues involved in training this model include setting
– the variable K
  • Validation techniques (e.g., cross-validation)
– the type of distance metric
  • Euclidean measure: $\mathrm{Dist}(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}$
Figure: K Nearest Neighbors Example
X: stored training set patterns
X: input pattern for classification
---: Euclidean distance measure to the nearest three patterns
• Store all input data in the training set
• For each pattern in the test set
– Search for the K nearest patterns to the input pattern using a Euclidean distance measure
– For classification, compute the confidence for each class as Ci/K (where Ci is the number of patterns among the K nearest patterns belonging to class i)
– The classification for the input pattern is the class with the highest confidence
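The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `euclidean` and `knn_classify`, the `(features, label)` pair layout, and the toy data are all assumptions made for the example. It assumes K is no larger than the number of stored patterns.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k):
    """Return (predicted_label, confidences) for one input pattern.

    train: list of (features, label) pairs, i.e. the stored training set.
    confidences maps each class i found among the K neighbors to C_i / K.
    """
    # Sort the stored patterns by distance to the query; keep the K nearest.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    counts = Counter(label for _, label in neighbors)
    confidences = {label: c / k for label, c in counts.items()}
    # Highest confidence wins; on a tie, the class met first
    # (nearest neighbor first) is returned.
    predicted = max(confidences, key=confidences.get)
    return predicted, confidences

# Hypothetical toy training set: two well-separated classes.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"),
]
predicted, conf = knn_classify(train, (1.1, 1.0), k=3)
print(predicted, conf)
```

Sorting every stored pattern for each query keeps the sketch short, but it also makes the cost of the memory-based approach visible: each classification makes a full pass over the training set.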
Training parameters and typical settings
• Number of nearest neighbors
– The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings.
– K = 1 is a good baseline model to benchmark against.
– A good rule of thumb is that K should be less than the square root of the total number of training patterns.
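One possible way to combine these settings is to score each candidate K by leave-one-out cross-validation and restrict the candidates to values below the square root of the number of training patterns. The sketch below assumes that scheme; the helper names `predict`, `loo_accuracy` and `choose_k` and the toy data are hypothetical, and other validation schemes (such as k-fold) would serve equally well. It needs Python 3.8+ for `math.dist` and `math.isqrt`, and at least two training patterns.

```python
import math
from collections import Counter

def predict(train, query, k):
    """Majority label among the k stored patterns nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each pattern using all the others."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, features, k) == label
    return hits / len(data)

def choose_k(data):
    """Score every K with K*K < N and return (best_k, scores)."""
    # isqrt(N - 1) is the largest integer whose square is below N.
    max_k = max(1, math.isqrt(len(data) - 1))
    scores = {k: loo_accuracy(data, k) for k in range(1, max_k + 1)}
    best_k = max(scores, key=scores.get)  # ties favor the smaller K
    return best_k, scores

# Hypothetical toy data: two tight clusters of five patterns each.
data = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((0.3, 0.3), "A"), ((0.2, 0.3), "A"),
    ((5.0, 5.0), "B"), ((5.1, 5.2), "B"), ((5.2, 5.1), "B"),
    ((5.3, 5.3), "B"), ((5.2, 5.3), "B"),
]
best_k, scores = choose_k(data)
print(best_k, scores)
```

The score for K = 1 in `scores` doubles as the baseline that the other settings are benchmarked against.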
Training parameters and typical settings
• Input compression
– Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification.
– Input compression usually results in slightly worse performance.
– Sometimes, however, compression improves performance because it automatically normalizes the data, which can equalize the effect of each input in the Euclidean distance measure.
Issues
• Distance measure
– Most common: Euclidean
– Better distance measures: normalize each variable by its standard deviation
– For discrete data, can use Hamming distance
• Choosing k
– Increasing k reduces variance, increases bias
• In a high-dimensional space, the nearest neighbor may not be very close at all!
• Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets.
• Indexing the data can help; for example, KD trees
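The two alternative distance measures mentioned above can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, the population standard deviation is used (the sample standard deviation would also be a reasonable choice), and variables with zero spread are simply skipped.

```python
import math
import statistics

def column_stdevs(patterns):
    """Per-variable population standard deviation over the stored patterns."""
    return [statistics.pstdev(col) for col in zip(*patterns)]

def standardized_euclidean(x, y, stdevs):
    """Euclidean distance with each variable divided by its standard deviation."""
    total = 0.0
    for a, b, s in zip(x, y, stdevs):
        if s > 0:  # a constant variable cannot separate patterns; skip it
            total += ((a - b) / s) ** 2
    return math.sqrt(total)

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

# Hypothetical patterns whose second variable has a much larger scale.
patterns = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 5000.0)]
stdevs = column_stdevs(patterns)
raw = math.dist(patterns[0], patterns[1])
scaled = standardized_euclidean(patterns[0], patterns[1], stdevs)
print(raw, scaled)
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))
```

In the raw distance the second variable dominates purely because of its units; after dividing by the standard deviation both variables contribute equally.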