This document discusses support vector machines (SVMs) and their application in agriculture. It begins with an introduction to SVMs, explaining that they are supervised machine learning algorithms used for classification and regression. It then covers key aspects of SVMs: how they find the optimal separating hyperplane for classification; how linearly separable and non-separable data are handled using soft-margin hyperplanes and kernels; and common kernel functions. It provides an example application of an SVM classifier for identifying pests in leaf images, and concludes with an overview of SVMs and their use in solving agricultural classification problems.
Support Vector Machine and its Application in Agriculture
First seminar on AGRICULTURAL STATISTICS
Presented by: HARISH NAYAK, G.H. (PALB 9202)
3. Flow of seminar
1. Introduction
2. SVM
3. Linear and Nonlinear separable case
4. Kernel functions
5. Case study
6. Conclusion
7. References
4. Introduction
o SVM was introduced by Vladimir Vapnik in 1995 as a kernel-based machine learning model for classification and regression tasks.
o SVM has been used as a powerful tool for solving practical binary classification problems.
5. What is SVM?
SVM is a supervised machine learning algorithm that is mainly used to classify data into different classes. Unlike most algorithms, SVM makes use of a hyperplane that acts as a decision boundary between the classes.
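As a minimal illustration (not from the original slides), the following Python sketch trains a linear SVM on made-up toy data using scikit-learn's SVC and classifies new points:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (illustrative values only)
X = np.array([[1, 2], [2, 3], [3, 3],    # class +1
              [6, 5], [7, 8], [8, 6]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# Fit a linear SVM: the learned hyperplane is the decision boundary
clf = SVC(kernel="linear")
clf.fit(X, y)

# Classify new, unseen points
print(clf.predict([[2, 2], [7, 7]]))     # expected: [ 1 -1]
```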
6. Features of SVM
1. SVM is a supervised learning algorithm (i.e., it trains on a set of labelled data).
o SVM studies the labelled training data and then classifies any new input data depending on what it learned in the training phase.
7. Cont...
2. SVM can be used for both 'classification' and 'regression' problems.
o However, it is mainly used for classification problems.
8. Cont...
3. SVM can classify non-linear data by using the 'kernel trick'.
9. Principle of SVM
o The formulation of SVM learning is based on the principle of Structural Risk Minimization (SRM) (Vapnik, 2000).
o SVM maximizes the generalization ability of a model. This is the objective of the SRM principle, which minimizes a bound on the generalization error of a model, instead of minimizing the mean squared error on the training data, as empirical risk minimization often does.
11. There are two cases in the dataset:
1. Linearly separable case: the data can be separated into two or more classes cleanly, with no overlap or intersection between classes.
2. Non-linearly separable case: the data cannot be separated into two or more classes cleanly; the classes overlap or intersect.
12. Linearly separable case
o Each data point is a pair: an input vector $x_i$ and its associated label $y_i$.
o Let the training set $X$ be
$X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} = \{(x_i, y_i)\}_{i=1}^{n}$
where $x_i \in \mathbb{R}^d$ (d dimensions) and $y_i \in \{+1, -1\}$.
13. Cont...
o For visualization purposes, we will consider 2-dimensional input, i.e., $x_i \in \mathbb{R}^2$.
o The data are linearly separable.
o There are an infinite number of hyperplanes that can separate the data.
o Fig. 1 shows several decision hyperplanes that perfectly separate the input data set.
15. Cont...
o Among all these hyperplanes, the one with maximum margin and good generalization ability is selected.
o This hyperplane is called the "optimal separating hyperplane" (Fig. 2).
17. Cont...
o The hyperplane that separates the input space is defined by the equation
$w^T x_i + b = 0$ ...(1), $\quad i = 1, \ldots, N$
o It can be fitted to correctly classify training patterns, where
a. the weight vector $w$ is normal to the hyperplane and defines its orientation,
b. $b$ is the bias term, and
c. $T$ denotes the transpose.
18. Cont...
From Equation (1), the linear classifier (decision function) is given by
$y(x) = \mathrm{sign}(w^T x + b)$ ...(2)
classifying patterns into Class 2 ($y_i = +1$ if $w^T x + b \ge 0$) and Class 1 ($y_i = -1$ if $w^T x + b < 0$).
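A one-line sketch of this decision rule in Python; the weight vector $w$ and bias $b$ are assumed to come from training, and the values below are purely illustrative:

```python
import numpy as np

# Decision function y(x) = sign(w^T x + b); w and b would normally
# come from training (the values here are illustrative only)
w = np.array([0.5, -1.0])
b = 0.25

def classify(x):
    """Return +1 (Class 2) or -1 (Class 1) for input vector x."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([2.0, 1.0])))  # 0.5*2 - 1.0*1 + 0.25 = 0.25 -> +1
```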
19. Cont...
$w^T x_i + b \ge +1$ if $y_i = +1$
$w^T x_i + b \le -1$ if $y_i = -1$, $\quad i = 1, 2, \ldots, N$
These can be combined into a single set of inequalities:
$y_i(w^T x_i + b) - 1 \ge 0$, ...(*) $\quad i = 1, 2, \ldots, N$
where $N$ is the training data size.
o The maximum-margin hyperplane is found by solving $\min_{w,b} \; w \cdot w = \|w\|^2$.
o For maximal separation, the hyperplane should be as far away as possible from the closest points of each class.
20. Cont...
Solving the equations
$w^T x_1 + b = +1$ and $w^T x_2 + b = -1$
gives
$w^T(x_1 - x_2) = 2$
$\frac{w}{\|w\|} \cdot (x_1 - x_2) = \frac{2}{\|w\|}$
where $\|w\| = \sqrt{w^T w} = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$ is the norm of the vector (Fig. 3).
21. Cont...
o The distance between the hyperplane and the training data closest to the hyperplane is called the 'margin'.
o The geometric margin of $x^+$ and $x^-$ is
$\gamma_i = \frac{1}{2}\left(\frac{w}{\|w\|} \cdot x^+ - \frac{w}{\|w\|} \cdot x^-\right) = \frac{1}{2} \cdot \frac{2}{\|w\|} = \frac{1}{\|w\|}$
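The quantity $1/\|w\|$ can be read off a trained linear SVM; a minimal sketch, assuming scikit-learn and made-up separable toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative values only)
X = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [6.0, 6.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin case
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```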
23. Cont...
o Maximizing the geometric margin means minimizing the norm of the weight vector.
o The margin-determining training vectors are called the 'support vectors'. These are the data points closest to the optimal hyperplane.
o The solution of an SVM is given by this small set of support vectors alone.
24. Cont...
o Constructing the hyperplane amounts to solving a convex Quadratic Programming (QP) problem.
o Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) complementary conditions are used to find the optimal solution.
25. o Under the conditions for optimality, the QP problem is finally obtained in the dual space of the Lagrange function:
$L(w, b; \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$ ...(3)
where the Lagrange multipliers satisfy $\alpha_i \ge 0$.
o Thus, by solving the dual QP problem, the decision function from Eq. (2) can be rewritten as
$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, x^T x_i + b \right)$ ...(4)
o Only the positive multipliers influence the classification, and their corresponding training vectors are called the support vectors.
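A library such as scikit-learn exposes exactly these dual quantities after fitting; a minimal sketch on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (illustrative values only)
X = np.array([[1, 2], [2, 1], [5, 6], [6, 5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)   # the x_i whose multipliers alpha_i > 0
print(clf.dual_coef_)         # the products alpha_i * y_i from Eq. (4)
print(clf.intercept_)         # the bias term b
```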
26. Limitation of the linearly separable case
o The learning problem presented above is valid only when the data are linearly separable, i.e., the classes in the training data set do not overlap.
o However, such problems are rare in real life.
27. Non-separable case
o In the previous case we assumed that the data are linearly separable, but in practice this is not always so.
o The classes in real data sets overlap, and then it is not possible to classify them using a linear separating hyperplane.
28. Cont...
o Cortes and Vapnik (1995) introduced a modified maximum-margin idea called "soft-margin hyperplanes".
o In other words, a linear SVM can be refitted to learn a hyperplane that tolerates a small number of non-separable training points.
o This refitting is called the soft-margin approach; it introduces slack variables $\xi_i$ for the inseparable cases.
31. Cont...
o To find a classifier with maximum margin, the algorithm presented above must be changed to allow a soft margin (Fig. 4); it is therefore necessary to introduce non-negative slack variables $\xi_i \ge 0$ into Eq. (*):
$y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N$
32. Cont...
o Thanks to the slack variables $\xi_i$, a feasible solution always exists.
o If $0 < \xi_i < 1$, the training point does not achieve the maximum margin, but it can still be correctly classified.
o $C$ is the regularization parameter.
o If $C = \infty$, no misclassification is allowed.
o For the non-linear case, however, this does not hold; the problem may be feasible only for some value $C < \infty$.
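The effect of $C$ can be seen empirically; a minimal sketch, assuming scikit-learn and synthetic overlapping data (all values illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)),
               rng.normal(1.5, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Small C -> soft, wide margin (more slack allowed);
# large C -> fewer misclassifications tolerated, narrower margin
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```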
33. o For the optimization problem, instead of the conditions of Eq. (*), the separating hyperplane should satisfy
$\min_{w, b, \xi_i} \; \frac{1}{2}\, w \cdot w + \frac{C}{2} \sum_{i=1}^{N} \xi_i^2$
such that $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\; i = 1, 2, \ldots, N$, $\; \xi_i \ge 0$, i.e.,
$w^T x_i + b \ge +1 - \xi_i$ if $y_i = +1$, $\; \xi_i \ge 0$
$w^T x_i + b \le -1 + \xi_i$ if $y_i = -1$, $\; \xi_i \ge 0$
o For the maximum soft margin, the original Lagrangian is
$L(w, b, \xi; \alpha) = \frac{1}{2}\, w \cdot w - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 + \xi_i \right] + \frac{C}{2} \sum_{i=1}^{N} \xi_i^2$
34. Kernels
o In the non-linearly separable case, the classifier may not have high generalization ability even if the hyperplane is optimally determined.
o The original input space is therefore transformed into a higher-dimensional (n-dimensional) space called the "feature space".
35. What does the kernel trick do?
o The kernel trick is a simple method whereby non-linear data are projected onto a higher-dimensional space so as to make it easier to classify the data linearly by a plane.
37. Cont...
o A kernel is a function $K$ such that for each $x, z \in X$,
$K(x, z) = \phi(x) \cdot \phi(z)$
where $\phi$ is a mapping from $X$ to a feature space $F$.
o The decision function then becomes
$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)$
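This identity can be checked numerically; the sketch below uses the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$, whose explicit 2-D feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (an example chosen for illustration, not taken from the slides):

```python
import numpy as np

# Explicit feature map of the degree-2 polynomial kernel in 2-D
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Both sides equal 121.0: K(x, z) = (x . z)^2 = phi(x) . phi(z)
print((x @ z) ** 2)       # kernel computed in the input space
print(phi(x) @ phi(z))    # dot product computed in the feature space
```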
38. o A kernel function must satisfy the following inner-product properties, for any $x, y, z \in X$ and $\alpha \in \mathbb{R}$:
1. $x \cdot x = 0$ only if $x = 0$
2. $x \cdot x > 0$ otherwise
3. $x \cdot y = y \cdot x$
4. $(\alpha x) \cdot y = \alpha (x \cdot y)$
5. $(z + x) \cdot y = z \cdot y + x \cdot y$
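Common kernels satisfying these properties (linear, polynomial, RBF) are available off the shelf; a hedged sketch, assuming scikit-learn, on XOR-like toy data that no linear hyperplane can separate:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like toy data: not separable by any linear hyperplane
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, coef0=1.0, C=10.0).fit(X, y)
    print(kernel, clf.score(X, y))   # non-linear kernels can fit XOR
```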
41. Applications of SVM
1. Face detection
2. Text and hypertext categorization
3. Classification of images
4. Bioinformatics
5. Protein fold and remote homology
detection
6. Handwriting recognition
7. Generalized predictive control (GPC)
42. Advantages of SVM
1. SVM works relatively well when there is a clear margin of separation between classes.
2. SVM is effective in high-dimensional spaces.
3. SVM is effective even when the number of dimensions is greater than the number of samples.
4. SVM is relatively memory efficient.
44. Architecture:
The proposed approach consists of stages for image acquisition, image preprocessing, image segmentation, and SVM classification; the accuracy of the infected area is calculated by the SVM classifier. The approach is implemented in MATLAB.

Pipeline: Image Acquisition → Image Preprocessing → Image Segmentation → SVM Classifier → Accuracy of Infected Area
45. a) Image acquisition
The data set contains pest-infested leaf images. The leaves are infested with whiteflies.
[Figure: a sample leaf image with whiteflies]
46. b) Image preprocessing
Contrast stretching is an image enhancement technique that improves the contrast in an image by expanding the dynamic range of the intensity values it contains.
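A minimal sketch of contrast stretching in Python (the percentile limits and synthetic image are assumptions; the original work was implemented in MATLAB):

```python
import numpy as np

def contrast_stretch(img, low_pct=2, high_pct=98):
    """Expand the dynamic range: map the [low_pct, high_pct]
    percentile band of intensities onto the full [0, 255] range."""
    lo, hi = np.percentile(img, (low_pct, high_pct))
    stretched = np.clip((img.astype(float) - lo) / (hi - lo), 0.0, 1.0)
    return (stretched * 255).astype(np.uint8)

# Example on a made-up low-contrast 8-bit image
img = np.random.default_rng(0).integers(100, 140, (64, 64), dtype=np.uint8)
out = contrast_stretch(img)
print(img.min(), img.max(), "->", out.min(), out.max())
```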
47. c) Color-based segmentation using K-means clustering
The K-means clustering algorithm is an unsupervised algorithm used to segment the region of interest from the background.
K-means clustering has many applications, ranging from unsupervised learning of neural networks to pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision.
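A minimal sketch of color-based segmentation with scikit-learn's KMeans (the cluster count and the synthetic stand-in image are assumptions; the original work used MATLAB):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_colors(img, n_clusters=3):
    """Cluster pixels by color; return a per-pixel cluster label map
    from which the region of interest can be picked out."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)

# Made-up RGB image standing in for a leaf photograph
img = np.random.default_rng(0).integers(0, 256, (32, 32, 3), dtype=np.uint8)
print(np.unique(segment_colors(img)))   # e.g. [0 1 2]
```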
48. d) SVM classifier
o A support vector machine is a powerful tool for binary classification, capable of generating a very fast classifier function following a training period.
Precision = TP / (TP + FP): the percentage of returned results that are relevant.
Recall = TP / (TP + FN): the percentage of total relevant results correctly classified by the algorithm.
Accuracy = (TP + TN) / (TP + TN + FP + FN): the proportion of correct predictions.

Confusion matrix (rows: actual class, columns: predicted class):
                       Predicted +1         Predicted -1
Actual +1              True positive        False negative
Actual -1              False positive       True negative
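These metrics follow directly from the confusion matrix; a minimal sketch using scikit-learn's metric helpers on made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true and predicted labels (+1 = infected, -1 = healthy)
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, 1, -1]

print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4
print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8
```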
50. Conclusion
o Image processing techniques play an important role in the detection of pests.
o Pests such as whiteflies, aphids, and thrips are very small in size and infest the leaves.
o The main objective is to detect the pest-infested region in the leaf image accurately.
o The multiclass SVM classifier is used to calculate the accuracy of the infected leaf region.
51. Summary
o SVM is a relatively recent algorithm proposed for solving classification problems.
o SVM can also be used for prediction purposes.
o The kernel trick is the main advantage of SVM, and the reason it has gained such importance.
o SVM can be used in many fields of science, depending on the objectives and application domain.
52. References
CERVANTES, J., GARCIA-LAMONT, F., RODRÍGUEZ-MAZAHUA, L. AND LOPEZ, A., 2020, A comprehensive survey on support vector machine classification: Applications, challenges and trends, Neurocomputing, 408:189-215.
JAKKULA, V., 2006, Tutorial on support vector machine (SVM), School of EECS, Washington State University, 37.
MOHAN KUMAR, T.L., 2013, Development of statistical models using nonlinear support vector machines, Doctoral dissertation, IARI-Indian Agricultural Statistics Research Institute, New Delhi.
RANI, R.U. AND AMSINI, P., 2016, Pest identification in leaf images using SVM classifier, Int. J. Computational Intelligence and Informatics, 6(1):248-260.
Speaker notes:
o S represents the decision function; h represents the number of data points.
o The objective of SRM is to minimize both the empirical risk (the training error) and the confidence interval (the capacity of the set of functions); thus the SRM principle defines a trade-off between the accuracy and the complexity of the approximation by minimizing over both terms.
o Weight vector: the weight associated with each input dimension.
o Bias: the distance of the hyperplane solution to the origin.
o We minimize $\|w\|^2$ in order to maximize the margin $2/\|w\|$; $\|w\|$ represents the magnitude of the weight vector.
o We change this to the dual problem using the Lagrange formulation, for two reasons: first, the constraints are replaced by Lagrange multipliers, which are much easier to handle; second, in the reformulated problem the training data appear only in the form of dot products between vectors.
o The Karush-Kuhn-Tucker (KKT) conditions play a very important role in the theory of optimization because they give the conditions for obtaining an optimal solution to a general optimization problem.
o Regularization parameter: how much you want to avoid misclassification.