K-nearest neighbor algorithm
Glenn M. Bernstein
Computer Science and Engineering Department
University of California, Riverside
Riverside, CA 92521
(951) 892-8682
gbern@cs.ucr.edu

ABSTRACT

This paper examines the K-nearest neighbor algorithm. Different values of K produce different accuracy, and this paper determines the optimal value of K for two different data sets. The first data set is the Challenger USA Space Shuttle O-Ring data set from the UCI Machine Learning Repository. The second is the El Nino data set from the UCI KDD Archive.

Categories and Subject Descriptors

I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search – heuristic methods.

General Terms

Algorithms, Management, Experimentation.

Keywords

K-nearest neighbor, cross validation.

1. INTRODUCTION

K-nearest neighbor is a useful supervised learning algorithm. When it is used for classification, a new query instance is assigned the class held by the majority of its K nearest neighbors. The purpose is to classify a new object from its attributes and a set of training samples. Given a query point, we find the K objects (training points) closest to it and classify the query by a majority vote among the classes of those K objects. In this way the K-nearest neighbor algorithm uses neighborhood classification to predict the value of a new query instance [1].
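To make the procedure concrete, the following is a minimal Python sketch of the classification rule described above: Euclidean distance to every training point, then a majority vote among the K nearest. The function name and toy data are illustrative, not taken from the paper or from the adapted open source code it mentions.

```python
# Minimal K-nearest neighbor classification sketch: Euclidean
# distance, then a majority vote among the K closest training points.
from collections import Counter
import math

def knn_classify(query, training_points, training_labels, k):
    """Return the majority label among the k training points nearest to query."""
    # Distance from the query to every training point.
    distances = [
        (math.dist(query, point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k nearest neighbors.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage (illustrative data, not from either data set), K = 3.
X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y = [0, 0, 1, 1]
print(knn_classify((0.2, 0.1), X, y, k=3))  # -> 0
```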
2. EXPERIMENTAL DESIGN

The first data set is from the UCI Machine Learning Repository. It consists of 23 instances of 4 attributes each, with no missing values. The task is to predict the number of O-rings that experience thermal distress on a flight at 31 degrees F, given data on the previous 23 shuttle flights. There are strong engineering reasons, based on the composition of O-rings, to support the judgment that failure probability may rise monotonically as temperature drops. No previous liftoff temperature was under 53 degrees F. The attribute information:

1. Number of O-rings at risk on a given flight
2. Number experiencing thermal distress
3. Launch temperature (degrees F)
4. Leak-check pressure (psi)
5. Temporal order of flight [2]

The second data set is significantly larger. It is from the UCI Knowledge Discovery in Databases (KDD) Archive and contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles. It consists of 782 instances measured from May 23rd, 1998 to June 5th, 1998. The data consists of the following variables: date, latitude, longitude, zonal winds (west < 0, east > 0), meridional winds (south < 0, north > 0), relative humidity, air temperature, and sea surface temperature down to a depth of 500 meters. Missing values exist because not all buoys are able to measure currents, rainfall, and solar radiation, so whether these values are missing depends on the individual buoy. The amount of data available also depends on the buoy, as certain buoys were commissioned earlier than others [3]. Any missing value was replaced with the mean of that attribute; the median could also be used, because the distributions are approximately symmetric and these two measures of central tendency are therefore nearly the same. The attributes that contain missing values are the zonal and meridional winds, humidity, and air and sea surface temperatures; their missing entries were replaced with -3.90, -0.60, 84.46, 27.57, and 28.29, respectively. Cleaning the data requires converting the file to a comma-separated value (.csv) file, then removing the periods that denote missing values and replacing each with the computed mean of its attribute, as sketched below. The task is to predict the buoy number given the remaining attributes.
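A minimal sketch of that cleaning step, assuming the converted file is named elnino.csv and that a bare period marks a missing entry; the file name and the column positions used below are assumptions, not details given in the paper.

```python
# Mean imputation over a converted .csv file: a bare "." is treated
# as a missing value and replaced by the column mean.
import csv

def impute_column_means(rows, numeric_cols):
    """Replace '.' entries in the given columns with that column's mean."""
    for col in numeric_cols:
        present = [float(r[col]) for r in rows if r[col] != "."]
        mean = sum(present) / len(present)
        for r in rows:
            if r[col] == ".":
                r[col] = f"{mean:.2f}"
    return rows

with open("elnino.csv", newline="") as f:  # hypothetical path
    rows = [row for row in csv.reader(f)]

# Columns 3-7 here stand in for zonal winds, meridional winds,
# humidity, air temperature, and sea surface temperature; the real
# column order depends on how the file was converted.
rows = impute_column_means(rows, numeric_cols=range(3, 8))
```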
3. EXPERIMENTS

3.1 Challenger USA Space Shuttle O-Ring Data Set

Sixty percent of the data is randomly partitioned into the training set and forty percent into the validation set. The data is normalized so that every attribute is expressed in terms of standard deviations, so that the distance measure is not dominated by variables with a large scale [4].
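A sketch of this preprocessing in Python. The paper does not say whether the normalization statistics are computed on the training partition alone, so doing so below is an assumption.

```python
# Random 60/40 train/validation split followed by z-score
# normalization, so every attribute is measured in standard deviations.
import random
import statistics

def train_validation_split(rows, train_fraction=0.6, seed=0):
    """Shuffle the rows and cut them into training and validation sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def zscore_normalize(train, valid):
    """Scale each attribute to (x - mean) / stdev using training statistics."""
    cols = list(zip(*train))
    means = [statistics.mean(c) for c in cols]
    # Guard constant columns (e.g. "O-rings at risk") against stdev = 0.
    stdevs = [statistics.stdev(c) or 1.0 for c in cols]

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, stdevs)]

    return [scale(r) for r in train], [scale(r) for r in valid]
```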
When the K-nearest neighbor algorithm is run with all of the features, the best value of K is 1, with a 22.22 percent validation error: two misclassifications out of nine validation set queries.

Table 1
The K-nearest neighbor algorithm is run without the launch temperature attribute, with the same sixty/forty partition into training and validation sets. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation set queries.

The K-nearest neighbor algorithm is run without the leak-check pressure (psi) attribute, again with a sixty/forty partition. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation set queries.

Table 2

By eliminating one attribute we achieve increased accuracy and reduced computation time, because all values of K then achieve the same accuracy as above. The variable "number of O-rings at risk on a given flight" is not relevant because it takes the same value on every flight.

Figure 1
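The best value of K reported in these experiments can be found by scanning candidate values and keeping the one with the lowest validation error, as in this sketch; best_k is a hypothetical helper that reuses the knn_classify function sketched in the introduction.

```python
# Scan candidate K values and return the one with the lowest
# validation error rate. knn_classify is the helper sketched in the
# introduction section.
def best_k(train_X, train_y, valid_X, valid_y, k_values):
    results = {}
    for k in k_values:
        # Count validation queries whose predicted label is wrong.
        errors = sum(
            knn_classify(q, train_X, train_y, k) != label
            for q, label in zip(valid_X, valid_y)
        )
        results[k] = errors / len(valid_y)  # validation error rate
    return min(results, key=results.get), results
```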
3.2 El Nino Data Set

52.21 percent of the data is randomly partitioned into the training set and 47.79 percent into the validation set. The data is normalized so that every attribute is expressed in terms of standard deviations, so that the distance measure is not dominated by variables with a large scale [4]. When the K-nearest neighbor algorithm is run with all of the features, the best value of K is 1, with a 31.55 percent validation error: 57 misclassifications out of 183 validation set queries.

The K-nearest neighbor algorithm is run without the latitude and longitude attributes, with the same 52.21/47.79 partition into training and validation sets. The best K is again 1, with a 51.37 percent error: 94 misclassifications out of 183 validation set queries.

The K-nearest neighbor algorithm is run without the zonal and meridional wind attributes, with the same partition. The best K is again 1, with a 27.87 percent error: 51 misclassifications out of 183 validation set queries. Eliminating the wind attributes improved the accuracy.

The K-nearest neighbor algorithm is run without the humidity attribute, with the same partition. The best K is again 1, with a 28.96 percent error: 53 misclassifications out of 183 validation set queries.

The K-nearest neighbor algorithm is run without the air and sea surface temperature attributes, with the same partition. The best K is again 1, with a 40.98 percent error: 75 misclassifications out of 183 validation set queries. Eliminating the temperature attributes decreases the accuracy.

Finally, the K-nearest neighbor algorithm is run without the zonal winds, meridional winds, and humidity attributes, with the same partition. The best K is again 1, with a 20.77 percent error: 38 misclassifications out of 183 validation set queries. Thus, by using only five attributes, we achieve the best accuracy. A search of this kind can be automated, as sketched below.
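The manual attribute eliminations above amount to a search, and the search the conclusions mention can be automated as a greedy backward elimination: repeatedly drop the attribute whose removal most lowers validation error. This sketch illustrates that idea under the paper's setup; it is not the paper's exact procedure, and it again reuses the knn_classify helper sketched in the introduction.

```python
# Greedy backward feature elimination driven by validation error.
def drop_column(rows, col):
    """Return the rows with the given attribute column removed."""
    return [r[:col] + r[col + 1:] for r in rows]

def backward_eliminate(train_X, train_y, valid_X, valid_y, k):
    def error(tX, vX):
        wrong = sum(knn_classify(q, tX, train_y, k) != label
                    for q, label in zip(vX, valid_y))
        return wrong / len(valid_y)

    best = error(train_X, valid_X)
    while True:
        # Validation error after removing each remaining attribute.
        candidates = [
            (error(drop_column(train_X, c), drop_column(valid_X, c)), c)
            for c in range(len(train_X[0]))
        ]
        new_err, col = min(candidates)
        # Stop when no removal helps, or only one attribute remains.
        if new_err >= best or len(train_X[0]) == 1:
            return train_X, valid_X, best
        best = new_err
        train_X = drop_column(train_X, col)
        valid_X = drop_column(valid_X, col)
```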
Table 3

4. ACKNOWLEDGMENTS

Thanks to Professor Eamonn Keogh for his advice and guidance. Thanks to Paul Lammertsma for his open source code, which was adapted for the Challenger USA Space Shuttle O-Ring data set.

5. CONCLUSIONS

Search provides us a tool to reduce the number of attributes. In the Shuttle data set we eliminated one attribute; in the El Nino data set we eliminated more than one. Adjustment of the K value can also improve accuracy.

6. REFERENCES

[1] http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html

[2] http://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring

[3] http://kdd.ics.uci.edu/databases/el_nino/el_nino.html

[4] http://www.resample.com/xlminer/help/k-NN/knn_ex.htm
