Natural Language Processing of applications.pdf

Module 3
Supervised Learning Part 2:
• KNN Algorithm
• Support Vector Machine (SVM)
• Decision Tree

K-Nearest Neighbors Algorithm
• The k-nearest neighbors algorithm, also known as KNN or k-NN,
is a non-parametric, supervised learning classifier, which uses
proximity to make classifications or predictions about the
grouping of an individual data point.
• While it can be used for either regression or classification
problems, it is typically used as a classification algorithm,
working off the assumption that similar points can be found near
one another.

• For classification problems, a class label is assigned on the basis of a
majority vote—i.e. the label that is most frequently represented around
a given data point is used.
• While this is technically considered “plurality voting”, the term
“majority vote” is more commonly used in the literature.
• The distinction between these terminologies is that “majority voting”
technically requires a majority of greater than 50%, which primarily
works when there are only two categories. When you have multiple
classes—e.g. four categories, you don’t necessarily need 50% of the vote
to make a conclusion about a class; you could assign a class label with a
vote of greater than 25%.

• Regression problems use a similar concept to classification problems, but
in this case, the average of the k nearest neighbors' values is taken to make
a prediction.
• The main distinction here is that classification is used for discrete values,
whereas regression is used with continuous ones. However, before a
classification can be made, the distance must be defined.
• Euclidean distance is most commonly used, which we’ll delve into more
below.
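
The voting and averaging steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance (via `math.dist`) is assumed.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Plurality vote: the most frequent label among the k neighbors wins.

    Ties go to whichever tied label appears first among the neighbors.
    """
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k):
    """Average of the numeric targets of the k neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)


# Hypothetical toy data: two clusters for classification,
# and a one-dimensional feature for regression.
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
           ((4.0, 4.0), "b"), ((4.2, 3.9), "b")]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labeled, (1.1, 1.0), k=3))
print(knn_regress(numeric, (2.1,), k=3))
```

Both functions share the same neighbor search; only the final aggregation step differs, which mirrors the discrete-versus-continuous distinction described above.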

• It's also worth noting that the KNN algorithm is part of a family of
“lazy learning” models, meaning that it only stores the training dataset
rather than undergoing a training stage.
• This also means that all the computation occurs when a classification or
prediction is being made. Since it heavily relies on memory to store all its
training data, it is also referred to as an instance-based or memory-
based learning method.

• Evelyn Fix and Joseph Hodges are credited with the initial ideas around
the KNN model in a 1951 paper, while Thomas Cover expanded on
their concept in his research “Nearest Neighbor Pattern Classification.”
• While it’s not as popular as it once was, it is still one of the first
algorithms one learns in data science/data analytics due to its
simplicity and accuracy.
• However, as a dataset grows, KNN becomes increasingly inefficient,
compromising overall model performance.

• It is commonly used for
• simple recommendation systems,
• pattern recognition,
• data mining,
• financial market predictions,
• intrusion detection, and more.

Compute KNN: distance metrics
• To recap, the goal of the k-nearest neighbor algorithm is to identify the
nearest neighbors of a given query point, so that we can assign a class
label to that point. In order to do this, KNN has a few requirements:
Determine your distance metrics
• In order to determine which data points are closest to a given query
point, the distance between the query point and the other data points
will need to be calculated.
• These distance metrics help to form decision boundaries, which
partition query points into different regions.

• You commonly will see decision boundaries visualized with Voronoi
diagrams.

• While there are several distance measures that you can choose from,
this article will only cover the following:
Euclidean distance:
• This is the most commonly used distance measure, and it is limited to
real-valued vectors.
• This is the Cartesian (straight-line) distance between two points in a
plane or hyperplane.
• Euclidean distance can also be visualized as the length of the straight
line that joins the two points under consideration.
• This metric gives the net displacement between two states of an object.

Manhattan Distance
• This distance metric is generally used when we are interested in the
total distance traveled by the object instead of the displacement.
• This metric is calculated by summing the absolute difference between
the coordinates of the points in n-dimensions.

Minkowski Distance
• The Euclidean and the Manhattan distance are both special cases of the
Minkowski distance, $D(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$.
• When p = 2 this is the same as the formula for the Euclidean distance,
and when p = 1 we obtain the formula for the Manhattan distance.

• The metrics discussed above are the most common in Machine Learning
problems, but there are other distance metrics as well, such as the
Hamming distance, which is useful for position-by-position comparisons
between two vectors whose contents can be boolean or string values.
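
As a small sketch of the metrics above, the following Python defines the Minkowski distance once and derives the Euclidean and Manhattan distances from it, plus a Hamming distance for equal-length sequences. The function names are illustrative and not taken from any particular library.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Special case p = 2: straight-line distance."""
    return minkowski(a, b, 2)


def manhattan(a, b):
    """Special case p = 1: sum of absolute coordinate differences."""
    return minkowski(a, b, 1)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))                # 5.0
print(manhattan(p, q))                # 7.0
print(hamming("karolin", "kathrin"))  # 3
```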

How to choose the value of k for KNN Algorithm?
• The value of k is crucial in the KNN algorithm: it defines the
number of neighbors that take part in each prediction.
• The value of k should be chosen based on the input data. If the input
data has more outliers or noise, a higher value of k would be better.
• It is recommended to choose an odd value for k to avoid ties in
binary classification.
• Cross-validation methods can help in selecting the best k value for
the given dataset.
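
One way to apply cross-validation here is leave-one-out: predict each point from all the others and count the hits for each candidate k. The sketch below assumes Euclidean distance and a plurality vote; the helper names and the toy dataset (which contains one deliberately mislabeled point) are hypothetical.

```python
import math
from collections import Counter


def predict(train, query, k):
    """Plurality vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def loocv_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, features, k) == label
    return hits / len(data)


# Two clusters plus one noisy point labeled 0 inside the cluster of 1s.
data = [((1.0, 1.0), 0), ((1.5, 1.2), 0), ((0.8, 1.4), 0), ((1.2, 0.7), 0),
        ((4.0, 4.0), 1), ((4.4, 3.8), 1), ((3.9, 4.5), 1), ((4.6, 4.3), 1),
        ((4.2, 4.1), 0)]

scores = {k: loocv_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first k with the highest accuracy
print(scores, best_k)
```

On this toy data k = 1 does poorly because the noisy point is the nearest neighbor of every point in its cluster, which illustrates why a larger k tends to help with noisy data.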

Applications of the KNN Algorithm
• Data Preprocessing – While dealing with any Machine Learning
problem we first perform EDA (exploratory data analysis). If we find that
the data contains missing values, multiple imputation methods are
available. One such method is the KNN Imputer, which is quite effective
and is generally used for more sophisticated imputation.
• Pattern Recognition – KNN algorithms work very well for pattern
recognition. If you have trained a KNN model on the MNIST dataset
(Modified National Institute of Standards and Technology database: a
collection of 70,000 28 x 28 images of handwritten digits from 0 to 9,
already divided into training and testing sets) and then evaluated it, you
will have seen that the accuracy is very high.

• Recommendation Engines – The main task performed by a
KNN algorithm is to assign a new query point to a pre-existing group
that has been created using a huge corpus of data.
• This is exactly what is required in the recommender systems to assign
each user to a particular group and then provide them
recommendations based on that group’s preferences.

Advantages of the KNN Algorithm
• Easy to implement as the complexity of the algorithm is not that high.
• Adapts Easily – As per the working of the KNN algorithm it stores all
the data in memory storage and hence whenever a new example or
data point is added then the algorithm adjusts itself as per that new
example and has its contribution to the future predictions as well.
• Few Hyperparameters – The only parameters required when setting up
a KNN algorithm are the value of k and the choice of distance metric.

Disadvantages of the KNN Algorithm
• Does not scale – As noted earlier, the KNN algorithm is considered a
lazy algorithm.
• The practical consequence is that it takes a lot of computing power
as well as data storage, which makes the algorithm both
time-consuming and resource-exhausting.
• Curse of Dimensionality – Because of what is known as the peaking
phenomenon, the KNN algorithm is affected by the curse of
dimensionality: it has a hard time classifying data points properly
when the dimensionality is too high.

• Prone to Overfitting – As the algorithm is affected due to the curse of
dimensionality it is prone to the problem of overfitting as well.
• Hence generally feature selection as well as dimensionality
reduction techniques are applied to deal with this problem.

Some more details of the
KNN Algorithm


Introduction to
Support Vector Machines (SVM)

History of SVM
◼ SVM is related to statistical learning theory
◼ SVM was first introduced in 1992
◼ SVM became popular because of its success in handwritten digit
recognition
◼ 1.1% test error rate for SVM. This is the same as the error rate
of a carefully constructed neural network, LeNet 4.

Support Vector Machines (SVM)
Can be used for both kinds of supervised learning problems:
• Classification
• (SVM Classifier)
• for binary as well as multi-class classification
• Regression
• (SVM Regressor)
• Works similarly to LR but with some additional features.
• One of the most effective and popular algorithms in industry and academia.

Cost Function
In this cost function,
- “C” is the hyperparameter that controls how heavily misclassified
points are penalized, and so how many misclassifications are tolerated.
- “Σξ” (ξ is pronounced ‘xi’) is the sum of the distances of the wrongly
placed points to their marginal plane.
Together these terms determine what the best-fit line should be.
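
To make the roles of C and ξ concrete, here is a minimal sketch that evaluates a soft-margin cost of the usual form (1/2)||w||^2 + C Σ ξ_i for a given line. It assumes the common hinge definition ξ_i = max(0, 1 − y_i(w·x_i + b)), which measures how far a point falls on the wrong side of its margin; the function name, the weights, and the data are hypothetical.

```python
def soft_margin_cost(w, b, X, y, C):
    """Return (cost, slacks) for the line w.x + b = 0 on labeled data.

    cost   = 0.5 * ||w||^2 + C * sum(slacks)
    slacks = hinge-style margin violations, one per data point
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = []
    for x, yi in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - yi * score))
    return margin_term + C * sum(slacks), slacks


# Hypothetical data: the third point sits exactly on the decision line,
# so it violates its margin by 1.
X = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.5)]
y = [1, -1, -1]
w, b = (1.0, 1.0), -3.0

cost, slacks = soft_margin_cost(w, b, X, y, C=10.0)
print(slacks, cost)
```

Raising C makes each unit of slack more expensive, so the optimizer is pushed toward fewer margin violations at the price of a narrower margin.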

Pros & Cons of SVM/C
Pros
• Accuracy
• Works well on smaller cleaner datasets
• It can be more efficient because it uses a subset of training points
Cons
• Isn’t suited to larger datasets as the training time with SVMs can
be high
• Less effective on noisier datasets with overlapping classes

SVM Uses
SVM is used for
• text classification tasks such as category assignment,
• detecting spam and
• sentiment analysis.
It is also commonly used for
• image recognition challenges,
• performing particularly well in aspect-based recognition and color-
based classification.
SVM also plays a vital role in many areas of
• handwritten digit recognition, such as postal automation services.

Available Data Sets in Sklearn
Scikit-learn makes available a host of datasets for testing learning algorithms.
They come in three flavors:
Packaged Data: these small datasets are packaged with the scikit-learn
installation, and can be loaded using the tools in
sklearn.datasets.load_*
Downloadable Data: these larger datasets are available for download, and
scikit-learn includes tools which streamline this process. These tools can be
found in sklearn.datasets.fetch_*
Generated Data: there are several datasets which are generated from models
based on a random seed. These are available in the sklearn.datasets.make_*

You can explore the available dataset loaders, fetchers, and generators
using IPython's tab-completion functionality. After importing the datasets
submodule from sklearn, type
datasets.load_<TAB>
or
datasets.fetch_<TAB>
or
datasets.make_<TAB>
to see a list of available functions.

Linear Classifiers
[Figure: an input x is passed through the classifier f to give the estimated label y_est; points are marked +1 or −1.]
f(x, w, b) = sign(w · x + b)
How would you classify this data?
w: weight vector
x: data vector
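
A linear classifier of this kind is a one-liner in code. The sketch below assumes the w · x + b sign convention used on the later slides and treats a score of exactly zero as +1; the weights and test points are hypothetical.

```python
def sign(value):
    """Return +1 for non-negative scores and -1 otherwise."""
    return 1 if value >= 0 else -1


def linear_classify(x, w, b):
    """Classify x by which side of the hyperplane w.x + b = 0 it lies on."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


w, b = (1.0, -1.0), 0.5  # hypothetical weight vector and offset
print(linear_classify((2.0, 0.0), w, b))
print(linear_classify((0.0, 2.0), w, b))
```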

[Figure: the same +1 and −1 points with several candidate separating lines.]
Any of these would be fine, but which is best?

Classifier Margin
Define the margin of a linear classifier as the width that the boundary
could be increased by before hitting a datapoint.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the
maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).

f(x, w, b) = sign(w · x + b)
Support vectors are those datapoints that the margin pushes up against.

Why Maximum Margin?
[Figure: +1 and −1 points separated by the maximum-margin boundary, with the support vectors on the margin.]

How to calculate the distance from a point to a line?
◼ http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
◼ In our case, the line is w1*x1 + w2*x2 + b = 0,
◼ thus, w = (w1, w2), x = (x1, x2)
[Figure: +1 and −1 points, the line wx + b = 0, and its normal vector w.]
x: data vector
w: normal vector to the line
b: scalar offset

Estimate the Margin
• What is the distance expression for a point x to a
line wx+b= 0?
$$d(\mathbf{x}) = \frac{|\mathbf{x}\cdot\mathbf{w} + b|}{\sqrt{\lVert\mathbf{w}\rVert_2^2}} = \frac{|\mathbf{x}\cdot\mathbf{w} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$

Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both
classes as possible
• We should maximize the margin, m
• Distance between the origin and the line w^T x = −b is |b|/||w||
[Figure: Class 1 and Class 2 separated by a boundary with margin m.]
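
The point-to-hyperplane distance and the margin can be checked numerically. This is a sketch with hypothetical numbers; the margin width 2/||w|| assumes the usual scaling in which the support vectors satisfy y(w·x + b) = 1.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_width(w):
    """Width m between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = (3.0, 4.0), -5.0  # hypothetical hyperplane with ||w|| = 5
print(distance_to_hyperplane((0.0, 0.0), w, b))  # |b| / ||w||
print(distance_to_hyperplane((3.0, 4.0), w, b))
print(margin_width(w))
```

Because the margin is inversely proportional to ||w||, maximizing the margin amounts to minimizing ||w||.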

Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of
xi
• The decision boundary should classify all points correctly ⇒ yi(w^T xi + b) ≥ 1 for all i
• To see this: when y = −1, we wish (wx+b) ≤ −1; when y = 1, we wish
(wx+b) ≥ 1. For support vectors, we wish y(wx+b) = 1.
• The decision boundary can be found by solving the following
constrained optimization problem
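
For reference, the standard hard-margin formulation that matches the constraints described above can be written as follows (a sketch in the usual textbook notation):

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1 \quad \text{for all } i = 1,\dots,n
```

Minimizing ||w|| is equivalent to maximizing the margin, and the constraints require every point to lie on the correct side of its margin line.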

Next step… Optional
• Converting SVM to a form we can solve
• Dual form
• Allowing a few errors
• Soft margin
• Allowing nonlinear boundary
• Kernel functions

The Dual Problem (we ignore the derivation)
• The new objective function is in terms of αi only
• It is known as the dual problem: if we know w, we know all αi; if we
know all αi, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore:
αi ≥ 0: properties of αi when we introduce the
Lagrange multipliers
Σi αi yi = 0: the result when we differentiate the
original Lagrangian w.r.t. b
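
Written out in full, the standard hard-margin dual to which these two annotations belong would be (a sketch in textbook notation, with α_i the Lagrange multipliers):

```latex
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^{T}x_j
\quad\text{subject to}\quad
\alpha_i \ge 0,\qquad \sum_{i=1}^{n}\alpha_i y_i = 0
```

Note that the data enter only through the inner products x_i^T x_j.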

The Dual Problem
• This is a quadratic programming (QP) problem
• A global maximum over the αi can always be found
• w can be recovered by w = Σi αi yi xi

Characteristics of the Solution
• Many of the αi are zero (see next page for example)
• w is a linear combination of a small number of data points
• This “sparse” representation can be viewed as data
compression, as in the construction of a kNN classifier
• xi with non-zero αi are called support vectors (SV)
• The decision boundary is determined only by the SV
• Let tj (j=1, ..., s) be the indices of the s support vectors.
We can write w = Σ_{j=1..s} α_tj y_tj x_tj
• For testing with a new data point z
• Compute w^T z + b = Σ_{j=1..s} α_tj y_tj (x_tj^T z) + b and
classify z as class 1 if the sum is positive, and class 2
otherwise
• Note: w need not be formed explicitly

A Geometrical Interpretation
[Figure: Class 1 and Class 2 points labeled with their multipliers. Only the support vectors have non-zero values: α1=0.8, α6=1.4, α8=0.6; α2=α3=α4=α5=α7=α9=α10=0.]

Allowing errors in our solutions
• We allow “error” ξi in classification; it is based on the output of the
discriminant function w^T x + b
• Σi ξi approximates the number of misclassified samples
[Figure: Class 1 and Class 2 with a few points on the wrong side of the margin.]

Soft Margin Hyperplane
• If we minimize Σi ξi, ξi can be computed from the constraints
w^T xi + b ≥ 1 − ξi when yi = 1, w^T xi + b ≤ −1 + ξi when yi = −1, and ξi ≥ 0
• ξi are “slack variables” in optimization
• Note that ξi = 0 if there is no error for xi
• Σi ξi is an upper bound of the number of errors
• We want to minimize (1/2)||w||^2 + C Σi ξi
• C : tradeoff parameter between error and margin
• The optimization problem becomes
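
In the usual textbook notation (with ξ_i denoting the slack variables and C the tradeoff parameter), the standard soft-margin problem reads:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0
```

Setting every ξ_i to zero recovers the hard-margin problem, so the slack terms are exactly what allows a few errors.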

Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a
linear decision boundary
• How to generalize it to become nonlinear?
• Key idea: transform xi to a higher dimensional space to
“make life easier”
• Input space: the space where the points xi are located
• Feature space: the space of φ(xi) after transformation

Transforming the Data
• Computation in the feature space can be costly because
it is high dimensional
• The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue
[Figure: the mapping φ(·) sends each point in the input space to a point φ(x) in the feature space.]
Note: feature space is of higher dimension
than the input space in practice

The Kernel Trick
• Recall the SVM optimization problem
• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do
not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed
by inner products
• Define the kernel function K by K(xi, xj) = φ(xi)^T φ(xj)

An Example for φ(.) and K(.,.)
• Suppose φ(.) is given as follows
• An inner product in the feature space is
• So, if we define the kernel function as follows, there is no need to carry out φ(.)
explicitly
• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the
kernel trick
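
The idea can be checked numerically. The sketch below assumes one standard choice of mapping for two-dimensional inputs, φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2), whose inner product equals the degree-2 polynomial kernel (1 + x·y)². This particular φ is an assumption made for illustration and is not necessarily the one shown on the original slide.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D point (assumed for this example)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, y):
    """Polynomial kernel of degree 2, computed in the input space only."""
    return (1.0 + dot(x, y)) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # inner product in the 6-D feature space
implicit = kernel(x, y)         # same value without ever forming phi
print(explicit, implicit)
```

The kernel needs one 2-D inner product and a square, while the explicit route builds two 6-D vectors first; the gap grows quickly with the degree and the input dimension.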

More on Kernel Functions
• Not all similarity measures can be used as kernel functions, however
• The kernel function needs to satisfy the Mercer condition, i.e., the function is
“positive semi-definite”
• This implies that
• the n by n kernel matrix,
• in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
• This also means that the optimization problem can be solved in
polynomial time!

Examples of Kernel Functions
• Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
• Radial basis function kernel with width σ: K(x, y) = exp(−||x − y||^2 / (2σ^2))
• Closely related to radial basis function neural networks
• The feature space is infinite-dimensional
• Sigmoid with parameters κ and θ: K(x, y) = tanh(κ x^T y + θ)
• It does not satisfy the Mercer condition for all κ and θ
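
A minimal sketch of these three kernels in plain Python follows. It assumes the common forms (x·y + 1)^d, exp(−||x − y||² / (2σ²)) and tanh(κ x·y + θ); the function and parameter names are illustrative.

```python
import math


def dot(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d):
    """Polynomial kernel of degree d."""
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, sigma):
    """Radial basis function kernel with width sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, kappa, theta):
    """Sigmoid kernel; not a valid Mercer kernel for every kappa, theta."""
    return math.tanh(kappa * dot(x, y) + theta)


x, y = (1.0, 2.0), (2.0, 0.0)
print(polynomial_kernel(x, y, d=2))
print(rbf_kernel(x, y, sigma=1.0))
print(sigmoid_kernel(x, y, kappa=0.5, theta=-1.0))
```

The RBF kernel equals 1 for identical points and decays toward 0 as points move apart, with σ controlling how fast.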

Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)

Example
• Suppose we have 5 one-dimensional data points
• x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 ⇒ y1=1,
y2=1, y3=-1, y4=-1, y5=1
• We use the polynomial kernel of degree 2
• K(x,y) = (xy+1)^2
• C is set to 100
• We first find αi (i=1, …, 5) by maximizing
Σi αi − (1/2) Σi Σj αi αj yi yj (xi xj + 1)^2 subject to 0 ≤ αi ≤ 100 and Σi αi yi = 0

• By using a QP solver, we get
• α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
• Note that the constraints are indeed satisfied
• The support vectors are {x2=2, x4=5, x5=6}
• The discriminant function is
f(z) = 2.5(1)(2z+1)^2 + 7.333(−1)(5z+1)^2 + 4.833(1)(6z+1)^2 + b = 0.6667z^2 − 5.333z + b
• b is recovered by solving f(2)=1 or by f(5)=-1 or by f(6)=1, as x2 and x5 lie on
the line f(z)=1 and x4 lies on the line f(z)=−1
• All three give b=9
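
The worked example can be replayed in a few lines. The sketch below plugs the reported multipliers, the three support vectors and b = 9 into the discriminant f(z) = Σ α_i y_i K(x_i, z) + b with K(x, z) = (xz + 1)^2; because the multipliers are rounded to three decimals, the values at the support vectors come out close to ±1 rather than exactly ±1.

```python
# (x_i, y_i, alpha_i) for the three support vectors of the example
support = [(2.0, 1, 2.5), (5.0, -1, 7.333), (6.0, 1, 4.833)]
b = 9.0


def kernel(x, z):
    """Polynomial kernel of degree 2 for one-dimensional inputs."""
    return (x * z + 1.0) ** 2


def discriminant(z):
    """f(z) = sum_i alpha_i * y_i * K(x_i, z) + b over the support vectors."""
    return sum(alpha * y * kernel(x, z) for x, y, alpha in support) + b


for z in (1, 2, 4, 5, 6):
    value = discriminant(z)
    label = "class 1" if value > 0 else "class 2"
    print(z, round(value, 3), label)
```

The sign of the discriminant reproduces the labeling of all five training points, including x1 = 1 and x3 = 4, which are not support vectors.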

[Figure: value of the discriminant function plotted over the points 1, 2, 4, 5, 6; the points 1, 2 and 6 fall in class 1 and the points 4 and 5 fall in class 2.]

[Figure: polynomial features by degree, X^1 through X^6.]

Choosing the Kernel Function
• Probably the trickiest part of using SVM.

Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class
classification
• SVMLight is among the earliest implementations of SVM
• Several Matlab toolboxes for SVM are also available

Summary: Steps for Classification
• Prepare the pattern matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
• You can use the values suggested by the SVM software, or you can set apart a
validation set to determine the values of the parameters
• Execute the training algorithm and obtain the αi
• Unseen data can be classified using the αi and the support vectors

SVM regression / SVR:
• SVM regression or Support Vector Regression (SVR) is a machine learning
algorithm used for regression analysis.
• It differs from traditional linear regression in that it fits a
hyperplane to the data points in a continuous space while tolerating
small deviations, instead of penalizing every deviation from a fitted line.
• The SVR algorithm aims to find the hyperplane that keeps as many data
points as possible within a certain distance of it, called the margin.

• This approach helps to reduce the prediction error and allows SVR to
handle non-linear relationships between input variables and the target
variable using a kernel function.
• As a result, SVM regression is a powerful tool for regression tasks where
there may be complex relationships between the input variables and the
target variable.
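
The "within a certain distance" idea is usually expressed with an ε-insensitive loss: predictions that fall within ε of the target cost nothing, and only the excess beyond ε is penalized. The sketch below assumes that standard loss; the name `epsilon` for the margin width and the sample numbers are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-point loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.25, 2.0, 4.0, 3.5]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)
```

Only the third prediction lies outside the tube, so it is the only one that contributes to the loss.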

Difference between SVM/C and SVR:
• SVM/C is a classification algorithm that separates data points into
different classes with a hyperplane while minimizing the
misclassification error.
• On the other hand, SVR (Support Vector Regression) is a regression
algorithm that finds a hyperplane that best fits data points in a
continuous space while minimizing the prediction error.
• SVM/C is used for categorical target variables, while SVR is used for
continuous target variables.

Applications of SVM regression/SVR:
• SVM regression or Support Vector Regression (SVR) has a wide range
of applications in various fields.
• It is commonly used in
• finance for predicting stock prices,
• in engineering for predicting machine performance, and
• in bioinformatics for predicting protein structures.
• SVR is also used in
• natural language processing for text classification and
• sentiment analysis.
• Additionally, it is used in
• image processing for object recognition and
• in healthcare for predicting medical outcomes.
• Overall, SVM regression is a versatile algorithm that can be used in
many domains for making accurate predictions.
Summary

• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the margin and the kernel trick
• Many SVM implementations are available on the web for you to try
on your data set!
• SVR acknowledges the presence of non-linearity in the data and
provides a proficient prediction model.

DECISION TREES
• Definition 1: A decision tree is a supervised learning algorithm, which is
utilized for both classification and regression tasks. It has a hierarchical,
tree structure, which consists of a root node, branches, internal nodes
and leaf nodes.
• Definition 2: It is a tree-structured model that splits the data into smaller
subsets based on the most significant input feature, recursively dividing
the data until a pure subset is achieved. Each split of the tree results in a
decision node, and each leaf of the tree represents a prediction.
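
To illustrate what "splitting until a pure subset is achieved" means, the sketch below searches for the single threshold on one feature that gives the purest two subsets. Gini impurity is assumed as the purity measure (the definitions above do not name one), and the function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure subset, larger for mixed subsets."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split on one feature.

    Candidate thresholds are the midpoints between sorted distinct values.
    """
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each subset, across all features, until the subsets are pure or another stopping rule is met.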

Decision Tree Regression:
Decision tree regression observes the features of an object and trains a model in the
structure of a tree to predict future data and produce meaningful
continuous output. Continuous output means that the output/result is not
discrete, i.e., it is not represented just by a discrete, known set of numbers or
values.
Continuous output example: A profit prediction model that states the probable
profit that can be generated from the sale of a product.
Here, continuous values are predicted with the help of a decision tree regression
model.
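
A one-split regression tree (a "stump") shows the idea in miniature: choose the threshold that minimizes the squared error of the two subsets, then predict the mean of whichever subset a new input falls into. The squared-error criterion is an assumption here, and the unit and profit figures are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def squared_error(values):
    """Sum of squared deviations from the mean of the subset."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def fit_stump(xs, ys):
    """Fit a one-split regression tree; return a prediction function."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


units_sold = [10, 20, 30, 100, 110, 120]  # hypothetical feature
profit = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1]   # hypothetical continuous target

predict = fit_stump(units_sold, profit)
print(predict(25), predict(105))
```

Each leaf returns a continuous value (the mean of its training targets), which is what distinguishes a regression tree from a classification tree.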
