Machine Learning: LIBSVM Support Vector Machines and a K-Nearest-Neighbor Approach for Accurate Parameter Selection
Jonathan Rinfret
1. San Jose State University, Mathematics and Statistics Dept., San Jose, CA 95192-0100
2. Gavilan College, STEM Dept., Gilroy, CA 95020
Acknowledgements
Thank you to Dr. Guangliang Chen for mentoring
this project and his extensive knowledge of
machine learning and SVM. Thanks to Mr. Rey
Morales and Dr. Hope Jukl and other Gavilan
College staff. Funding for this internship was
provided by the Gavilan College STEM Grant.
Objectives
• To implement a KNN approach to finding a value of sigma to use for the hyperparameter gamma.
• To compare the value of gamma obtained by KNN with the value of gamma found by Gridsearch.
Introduction
Support Vector Machines (SVMs) are a current method for machine learning in which a program reads in a set of training data and attempts to predict the labels of a set of testing data by drawing a hyperplane between the classes of data. This study focuses on modifying an existing program, LIBSVM, to support a k-nearest-neighbor (KNN) approach for finding the hyperparameter sigma used in the formulation of the hyperplane, and compares the result with a MATLAB KNN implementation.
The hyperparameter sigma is a measure of the closeness between two or more classes of data. In our data sets, each data point corresponds to a label, or class. Sigma gives a numerical value for the spread of the classes: if sigma is large, the classes in the dataset are spread widely; if sigma is small, the classes are tightly clustered.
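Sigma enters the SVM through the RBF (Gaussian) kernel that LIBSVM offers. As a minimal sketch, assuming the common convention gamma = 1/(2σ²) (the poster does not state the exact mapping used), a sigma estimated from the data converts directly into a gamma:

```cpp
#include <cmath>
#include <vector>

// RBF kernel value between two points, parameterized by gamma.
// Identical points give exp(0) = 1; distant points decay toward 0.
double rbf_kernel(const std::vector<double>& a,
                  const std::vector<double>& b,
                  double gamma) {
    double dist2 = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        dist2 += d * d;
    }
    return std::exp(-gamma * dist2);
}

// One common sigma-to-gamma convention (an assumption here):
// gamma = 1 / (2 * sigma^2).
double gamma_from_sigma(double sigma) {
    return 1.0 / (2.0 * sigma * sigma);
}
```

With this convention, a large sigma (widely spread classes) yields a small gamma, and vice versa.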
Results
Overall, the k-nearest-neighbor method for selecting the hyperparameter gamma resulted in higher cross-validation accuracy, higher test accuracy, and greater speed than the Gridsearch method. The KNN method is much faster than the Gridsearch method as the value of C increases. Also, after comparing different values of the kth nearest neighbor, it appears that as the neighbor index k increases, the increase in gamma slows and approaches a limit.
Conclusions
From this, the modifications that were made are more successful than the standard Gridsearch method.
References
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chen, Guangliang. MATH 285: Classification with Handwritten Digits. 28 Jan. 2016. Web. 3 June 2016. <http://www.math.sjsu.edu/~gchen/Math285S16.html>.
[Figures: KNN speed vs. Gridsearch speed; difference in gamma for k values; data sets used]
The Gridsearch method finds the best values for gamma and C. However, Gridsearch may result in a much longer runtime and may not be as accurate as KNN.
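A minimal sketch of why Gridsearch gets slow: every (C, gamma) pair on the grid costs one full cross-validation run, so runtime grows with the product of the grid sizes. `cross_val_accuracy` here is a hypothetical stand-in for a cross-validation routine such as LIBSVM's, not the poster's actual code:

```cpp
#include <utility>
#include <vector>

// Exhaustive grid search over (C, gamma): evaluate every pair and
// keep the one with the highest cross-validation accuracy.
template <typename Eval>
std::pair<double, double> grid_search(const std::vector<double>& Cs,
                                      const std::vector<double>& gammas,
                                      Eval cross_val_accuracy) {
    double best_acc = -1.0;
    double best_C = Cs[0], best_g = gammas[0];
    for (double C : Cs) {
        for (double g : gammas) {
            // One full cross-validation run per grid point -- the
            // source of Gridsearch's long runtime.
            double acc = cross_val_accuracy(C, g);
            if (acc > best_acc) {
                best_acc = acc;
                best_C = C;
                best_g = g;
            }
        }
    }
    return {best_C, best_g};
}
```

The KNN approach avoids this loop entirely: it computes gamma once from the training data.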
w ∙ Φ(x) + b = 0
w ∙ Φ(x) + b = 1
w ∙ Φ(x) + b = -1
margin: 1 / ||w||
The k-nearest-neighbor method finds the best value for gamma by using the equations above.
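One way the KNN selection can be sketched: estimate sigma as the average distance from each training point to its k-th nearest neighbor. The averaging scheme here is an assumption for illustration; k = 8 follows the value used elsewhere in this poster:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Estimate sigma as the mean distance from each point to its
// k-th nearest neighbor (k is 1-based: k = 1 means the closest
// other point).
double knn_sigma(const std::vector<std::vector<double>>& X, int k) {
    double sum = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        std::vector<double> dists;
        for (std::size_t j = 0; j < X.size(); ++j) {
            if (j == i) continue;
            double d2 = 0.0;
            for (std::size_t f = 0; f < X[i].size(); ++f) {
                double diff = X[i][f] - X[j][f];
                d2 += diff * diff;
            }
            dists.push_back(std::sqrt(d2));
        }
        // Place the k-th smallest distance at index k-1 without a
        // full sort.
        std::nth_element(dists.begin(), dists.begin() + (k - 1), dists.end());
        sum += dists[k - 1];
    }
    return sum / X.size();
}
```

The resulting sigma can then be converted to gamma and passed to the SVM trainer in place of a grid search.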
KNN Method and SVM Differences
The modifications I made to LIBSVM differ from the MATLAB method for KNN. My method used bubble sort to find the 8th-nearest neighbor.
Bubble sort compares each value with the next and swaps them if the first is larger, so the smallest distances move to the front. While simple, bubble sort is very slow. To increase the speed, I made my sorting algorithm sort only the first 30 values in each row. While this slightly lowered accuracy, it greatly increased the speed of my experiments, and my accuracy was comparatively close to that of the MATLAB method.
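The truncation described above can be sketched as follows, assuming the interpretation that only the first 30 distances in each row are examined (limit = 30, k = 8 per the poster; the function name is illustrative, not the actual LIBSVM patch):

```cpp
#include <vector>

// Bubble-sort only the first `limit` entries of a row of distances
// and return the k-th smallest among them. Distances past `limit`
// are never examined, which is why this is fast but can miss the
// true k-th nearest neighbor -- the slight accuracy loss noted above.
double kth_nearest_truncated(std::vector<double> row, int k, int limit) {
    int n = (int)row.size();
    if (limit < n) n = limit;            // only the first `limit` values
    for (int i = 0; i < n - 1; ++i)
        for (int j = 0; j < n - 1 - i; ++j)
            if (row[j] > row[j + 1]) {   // ascending: nearest first
                double t = row[j];
                row[j] = row[j + 1];
                row[j + 1] = t;
            }
    return row[k - 1];
}
```

With limit covering the whole row the result is exact; with a small limit, a close neighbor stored past position `limit` is simply missed.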
Most of the datasets had higher test accuracy than validation accuracy; however, the vowel data set was reversed. I presume this data set had higher validation accuracy because it had more classes and its test and training data were similar in size.