K-Nearest Neighbor Algorithm
Glenn M. Bernstein
Computer Science and Engineering Department
University of California, Riverside
Riverside, CA 92521
(951) 892-8682
gbern@cs.ucr.edu
ABSTRACT
This paper examines the K-nearest neighbor algorithm. Different values of K produce different accuracy, and this paper determines the optimal value of K for two data sets. The first is the Challenger USA Space Shuttle O-Ring data set from the UCI Machine Learning Repository. The second is the El Nino data set from the UCI KDD Archive.

Categories and Subject Descriptors
I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search – heuristic methods.

General Terms
Algorithms, Management, Experimentation.

Keywords
K-nearest neighbor, cross validation.

1. INTRODUCTION
K-nearest neighbor is a useful supervised learning algorithm. Used for classification, a new instance query is assigned the majority category among its K nearest neighbors. The purpose is to classify a new object based on its attributes and a set of training samples: given a query point, we find the K training points closest to it and take a majority vote among their class labels. The K-nearest neighbor algorithm thus uses neighborhood classification to predict the value of the new query instance [1].

2. EXPERIMENTAL DESIGN
The first data set is from the UCI Machine Learning Repository. It consists of 23 instances of 4 attributes each, with no missing values. The task is to predict the number of O-rings that experience thermal distress on a flight at 31 degrees F, given data on the previous 23 shuttle flights. There are strong engineering reasons, based on the composition of O-rings, to support the judgment that failure probability may rise monotonically as temperature drops; no previous liftoff temperature was under 53 degrees F. The attributes are:
1. Number of O-rings at risk on a given flight
2. Number experiencing thermal distress
3. Launch temperature (degrees F)
4. Leak-check pressure (psi)
5. Temporal order of flight [2]

The second data set, from the UCI Knowledge Discovery Database, is significantly larger. It contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific, and is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles. It consists of 782 instances measured from May 23rd, 1998 to June 5th, 1998. The data consists of the following variables: date, latitude, longitude, zonal winds (west < 0, east > 0), meridional winds (south < 0, north > 0), relative humidity, air temperature, and sea surface temperature down to a depth of 500 meters. Missing values do exist, because not all buoys are able to measure currents, rainfall, and solar radiation, so which values are missing depends on the individual buoy; the amount of data available also depends on the buoy, as certain buoys were commissioned earlier than others [3]. Any missing value was replaced with the average of its attribute. The median could also be used: the distributions are approximately symmetric, so these two measures of central tendency are approximately the same. The attributes that contain missing values are zonal and meridional winds, humidity, and air and sea surface temperatures; the replacement values were -3.90, -0.60, 84.46, 27.57, and 28.29, respectively. Cleaning the data requires converting the file to a .csv (comma-separated value) file, then removing the periods that denote missing values and replacing each, after calculation, with the average value for that attribute. The task is to predict the buoy number given the various attributes.
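As a sketch of the cleaning step just described — assuming, as in the raw files, that a period marks a missing value and that every column is numeric — mean imputation could look like the following (the function name and toy rows are illustrative, not from the paper's code):

```python
from statistics import mean

def impute_missing(rows, missing="."):
    # Replace each missing entry with the mean of its column's
    # observed values; assumes all columns are numeric strings.
    cols = range(len(rows[0]))
    means = [mean(float(r[j]) for r in rows if r[j] != missing)
             for j in cols]
    return [[float(r[j]) if r[j] != missing else means[j] for j in cols]
            for r in rows]

rows = [["1.0", "."], [".", "4.0"], ["3.0", "6.0"]]
print(impute_missing(rows))  # → [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
```

Substituting the column median for `mean` would, as noted above, give nearly identical replacements on these approximately symmetric distributions.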
3. EXPERIMENTS
3.1 Challenger USA Space Shuttle O-Ring Data Set
Sixty percent of the data is randomly partitioned as the training set and forty percent as the validation set. The data is normalized so that every attribute is expressed in terms of standard deviations, so that the distance measure is not dominated by variables with a large scale [4]. When the K-nearest neighbor algorithm is run with all features, the best value of K is 1, with a 22.22 percent validation error: two misclassifications out of nine validation-set queries.

Table 1

The K-nearest neighbor algorithm is run without the launch temperature attribute, with the same 60/40 split. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation-set queries.

The K-nearest neighbor algorithm is run without the leak-check pressure (psi) attribute, with the same 60/40 split. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation-set queries.

Table 2

So, by the elimination of one attribute we achieve the same accuracy with reduced computation time, because all K values achieve the same accuracy as above. The attribute "number of O-rings at risk on a given flight" is not relevant because it takes the same value on every flight.

Figure 1

3.2 El Nino Data Set
Fifty-two percent of the data is randomly partitioned as the training set and forty-eight percent as the validation set. The data is normalized in the same way, so that the distance measure is not dominated by variables with a large scale [4]. When the K-nearest neighbor algorithm is run with all features, the best value of K is 1, with a 31.55 percent validation error: fifty-seven misclassifications out of one hundred eighty-three validation-set queries.
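A minimal sketch of the procedure used in these experiments — normalization to standard-deviation units, a majority vote among the K nearest training points, and a validation error count — assuming Euclidean distance and illustrative function names (this is not the paper's actual code):

```python
import math
from collections import Counter
from statistics import mean, stdev

def zscore(X):
    # Express every attribute in standard-deviation units so that no
    # large-scale variable dominates the distance measure [4].
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    # Constant columns (e.g. O-rings at risk) get sd = 1 to avoid
    # division by zero.
    sd = [stdev(c) if len(set(c)) > 1 else 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in X]

def knn_predict(train_X, train_y, query, k):
    # Majority vote among the k training points closest to the query.
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def validation_error(train_X, train_y, val_X, val_y, k):
    wrong = sum(1 for q, y in zip(val_X, val_y)
                if knn_predict(train_X, train_y, q, k) != y)
    return wrong / len(val_X)
```

Scanning several values of K with `validation_error` and keeping the one with the lowest error reproduces the model-selection step described above.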
The K-nearest neighbor algorithm is run without the latitude and longitude attributes, with 52.21 percent of the data partitioned as the training set and 47.79 percent as the validation set. The best K is again 1, with a 51.37 percent error: ninety-four misclassifications out of one hundred eighty-three validation-set queries.
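Each of these ablation runs amounts to deleting attribute columns before training; a hypothetical helper for that step:

```python
def drop_features(X, drop):
    # Remove the attribute columns whose indices appear in `drop`.
    return [[v for j, v in enumerate(row) if j not in drop] for row in X]

# e.g. dropping latitude and longitude (here, columns 1 and 2):
print(drop_features([[1, 10, 20, 3]], {1, 2}))  # → [[1, 3]]
```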
The K-nearest neighbor algorithm is run without the zonal wind and meridional wind attributes, with the same 52.21/47.79 split. The best K is again 1, with a 27.87 percent error: fifty-one misclassifications out of one hundred eighty-three validation-set queries. Eliminating the wind attributes improved the accuracy.
The K-nearest neighbor algorithm is run without the humidity attribute, with the same 52.21/47.79 split. The best K is again 1, with a 28.96 percent error: fifty-three misclassifications out of one hundred eighty-three validation-set queries.
The K-nearest neighbor algorithm is run without the air and sea surface temperature attributes, with the same 52.21/47.79 split. The best K is again 1, with a 40.98 percent error: seventy-five misclassifications out of one hundred eighty-three validation-set queries. Eliminating the temperature attributes decreases the accuracy.

The K-nearest neighbor algorithm is run without the zonal wind, meridional wind, and humidity attributes, with the same 52.21/47.79 split. The best K is again 1, with a 20.77 percent error: thirty-eight misclassifications out of one hundred eighty-three validation-set queries. So, by using five attributes we achieve the best accuracy.

Table 3

4. ACKNOWLEDGMENTS
Thanks to Professor Eamonn Keogh for his advice and guidance. Thanks to Paul Lammertsma for his open-source code, which was adapted for the Challenger USA Space Shuttle O-Ring data set.

5. CONCLUSIONS
Search provides a tool to reduce the number of attributes. In the Shuttle data set we eliminated one attribute; in the El Nino data set we eliminated more than one. Adjustment of the K value can also improve accuracy.
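The attribute subsets above were chosen by hand; the search idea mentioned in the conclusions can be sketched as a greedy backward-elimination loop, where `evaluate` is a hypothetical callback returning the validation error for a given attribute subset:

```python
def backward_eliminate(evaluate, n_features):
    # Greedy backward elimination: repeatedly drop an attribute whose
    # removal lowers the validation error, until no removal helps.
    features = set(range(n_features))
    best_err = evaluate(features)
    improved = True
    while improved and len(features) > 1:
        improved = False
        for f in sorted(features):
            err = evaluate(features - {f})
            if err < best_err:
                features, best_err = features - {f}, err
                improved = True
                break
    return features, best_err
```

This greedy search evaluates far fewer subsets than exhaustive enumeration, at the cost of possibly missing the globally best subset.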
6. REFERENCES
[1] http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html
[2] http://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring
[3] http://kdd.ics.uci.edu/databases/el_nino/el_nino.html
[4] http://www.resample.com/xlminer/help/k-NN/knn_ex.htm