Statistical Learning for Big Data
Chapter 3.2: Supervised Learning — Classification
Classification Tasks
CLASSIFICATION
Learn functions that assign class labels to observation / feature vectors. Each observation belongs to exactly one class. The main difference from regression is the scale of the output / label.
Our Data:

  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
  5.1           3.5          1.4           0.2          setosa
  5.9           3.0          5.1           1.8          virginica

A classifier is learned from our data and applied to new data with an unknown label to predict the new class label:

  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
  5.4           3.3          3.2           1.1          ???
BINARY AND MULTICLASS TASKS
A classification task can involve two classes (binary) or more than two classes (multiclass).
[Figure: left, the Sonar data (V1 vs. V2) with the binary response classes M and R; right, the Iris data (Petal.Length vs. Petal.Width) with the multiclass response setosa, versicolor, virginica.]
BINARY CLASSIFICATION TASK - EXAMPLES
Credit risk prediction, based on personal data and transactions
Spam detection, based on textual features
Churn prediction, based on customer behavior
Predisposition for specific illness, based on genetic data
Image source: https://www.bendbulletin.com/localstate/deschutescounty/3430324-151/fact-or-fiction-polygraphs-just-an-investigative-tool
MULTICLASS TASK - IRIS
The iris dataset was introduced by the statistician Ronald Fisher and is
one of the most frequently used data sets. Originally, it was designed
for linear discriminant analysis.
[Images of the three subspecies: Setosa, Versicolor, Virginica.
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set]
MULTICLASS TASK - IRIS
150 iris flowers
Predict subspecies
Based on sepal and petal
length / width in [cm]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## ---
## 146: 6.7 3.0 5.2 2.3 virginica
## 147: 6.3 2.5 5.0 1.9 virginica
## 148: 6.5 3.0 5.2 2.0 virginica
## 149: 6.2 3.4 5.4 2.3 virginica
## 150: 5.9 3.0 5.1 1.8 virginica
MULTICLASS TASK - IRIS
[Figure: pairs plot of the iris data, colored by species. Overall correlations: Sepal.Length / Sepal.Width: −0.118, Sepal.Length / Petal.Length: 0.872, Sepal.Width / Petal.Length: −0.428, Sepal.Length / Petal.Width: 0.818, Sepal.Width / Petal.Width: −0.366, Petal.Length / Petal.Width: 0.963; within-species correlations are shown as well.]
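A pairs plot like the one summarized above can be reproduced in R in a few lines (a minimal sketch; the GGally package is assumed to be installed, and the exact panel layout may differ from the original figure):

# Pairs plot of the iris features, colored by species
library(ggplot2)
library(GGally)

ggpairs(iris, mapping = aes(colour = Species))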
Basic Definitions
CLASSIFICATION TASKS
In a classification task, we aim at predicting a discrete output
y ∈ Y = {C1, ..., Cg}
with 2 ≤ g < ∞, given data D.
In this course, we assume the classes to be encoded as
Y = {0, 1} or Y = {−1, +1} (in the binary case g = 2)
Y = {1, . . . , g} (in the multiclass case g ≥ 3)
CLASSIFICATION MODELS
We define models f : X → R^g as functions that output (continuous) scores / probabilities but not (discrete) classes. Why?
From an optimization perspective, it is much (!) easier to optimize
costs for continuous-valued functions
Scores / probabilities (for classes) contain more information than
the class labels alone
As we will see later, scores can easily be transformed into class
labels; but class labels cannot be transformed into scores
We distinguish scoring and probabilistic classifiers.
SCORING CLASSIFIERS
Construct g discriminant / scoring functions f_1, ..., f_g : X → R
Scores f_1(x), ..., f_g(x) are transformed into classes by choosing the class attaining the maximum score:

h(x) = argmax_{k ∈ {1,...,g}} f_k(x).

Consider g = 2:
A single discriminant function f(x) = f_{+1}(x) − f_{−1}(x) is sufficient (assuming the class labels {−1, +1})
Class labels are constructed by h(x) = sgn(f(x))
|f(x)| is called the confidence
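As a small illustration (a sketch with made-up scores, not the output of any particular model), turning a score matrix into class labels is a row-wise arg max:

# Hypothetical scores f_1(x), ..., f_3(x) for 3 observations and g = 3 classes
scores <- matrix(c( 2.1, -0.3,  0.5,
                   -1.0,  0.2,  1.7,
                    0.4,  0.1, -2.0),
                 nrow = 3, byrow = TRUE)
classes <- c("C1", "C2", "C3")

# h(x) = arg max_k f_k(x), applied row by row
h <- classes[max.col(scores)]
h   # "C1" "C3" "C1"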
PROBABILISTIC CLASSIFIERS
Construct g probability functions π_1, ..., π_g : X → [0, 1] with the constraint that ∑_i π_i(x) = 1
Probabilities π_1(x), ..., π_g(x) are transformed into labels by predicting the class attaining the maximum probability:

h(x) = argmax_{k ∈ {1,...,g}} π_k(x)

For g = 2, one π(x) is sufficient for labels {0, 1}

Note that:
Probabilistic classifiers can also be seen as scoring classifiers
If we want to emphasize that our model outputs probabilities, we denote the model as π : X → [0, 1]^g. When talking about models in general, we write f, comprising both probabilistic and scoring classifiers; which is meant will become clear from the context!
CLASSIFIERS: THRESHOLDING
Both scoring and probabilistic classifiers can output classes by thresholding (binary case) or by selecting the class attaining the maximum score (multiclass)
Given a threshold c: h(x) := [f(x) ≥ c]
Usually c = 0.5 for probabilistic, c = 0 for scoring classifiers
There are also versions of thresholding for the multiclass case
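A minimal sketch of thresholding in the binary case (pi_hat stands for hypothetical predicted probabilities, not the output of a real model):

pi_hat <- c(0.10, 0.48, 0.52, 0.97)    # hypothetical predicted probabilities
c_thr  <- 0.5                          # threshold c
h      <- as.integer(pi_hat >= c_thr)  # h(x) = [pi(x) >= c]
h                                      # 0 0 1 1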
DECISION REGIONS AND BOUNDARIES
A decision region for class k is the set of input points x where class k is assigned as the prediction of our model:

X_k = {x ∈ X : h(x) = k}

The points in space where the classes with maximal score are tied, and the corresponding hypersurfaces, are called decision boundaries:

{x ∈ X : ∃ i ≠ j s.t. f_i(x) = f_j(x) and f_i(x), f_j(x) ≥ f_k(x) ∀ k ≠ i, j}

In the binary case we can simplify and generalize to the decision boundary for a general threshold c:

{x ∈ X : f(x) = c}

If we set c = 0 for scores and c = 0.5 for probabilities, this is consistent with the definition above.
DECISION BOUNDARY EXAMPLES
[Figure: decision boundaries of four classifiers on the Sonar data (V1 vs. V2, classes M and R): logistic regression, naive Bayes, a CART tree, and an SVM.]
CLASSIFICATION APPROACHES
Two fundamental approaches exist to construct classifiers:
The generative approach and the discriminant approach.
They tackle the classification problem from different angles:
Discriminant approaches use empirical risk minimization
based on a suitable loss function:
“What is the best prediction for y given these x?”
Generative classification approaches assume a data-generating
process in which the distribution of the features x is different for the
various classes of the output y, and try to learn these conditional
distributions:
“Which y tends to have x like these?”
DISCRIMINANT APPROACH
The discriminant approach tries to optimize the discriminant functions
directly, usually via empirical risk minimization.
f̂ = argmin_{f ∈ H} R_emp(f) = argmin_{f ∈ H} ∑_{i=1}^{n} L( y^(i), f(x^(i)) ).
Examples:
Logistic regression (discriminant, linear)
Support vector machines (discriminant and linear; can be
extended to non-linear functions via kernels)
Neural networks (discriminant, highly non-linear beyond one layer)
GENERATIVE APPROACH
The generative approach employs Bayes’ theorem and models
p(x|y = k) by making assumptions about the structure of these
distributions:
π_k(x) = P(y = k | x) = P(x | y = k) · P(y = k) / P(x) = [ p(x | y = k) π_k ] / [ ∑_{j=1}^{g} p(x | y = j) π_j ]

(posterior = likelihood × prior / marginal likelihood)
Prior class probabilities πk are easy to estimate from the training data.
Examples:
Naïve Bayes classifier
Linear discriminant analysis (generative, linear)
Quadratic discriminant analysis (generative, non-linear)
Caution: LDA and QDA have 'discriminant' in their name, but they are generative models!
Linear Classifiers
LINEAR CLASSIFIERS
Linear classifiers are an important subclass of classification models. If
the discriminant function(s) fk (x) can be specified as linear function(s)
(possibly through a rank-preserving, monotonic transformation
g : R → R), i. e.,
g(f_k(x)) = w_k^⊤ x + b_k,
we will call the classifier a linear classifier.
LINEAR CLASSIFIERS
We can show that the decision boundary between classes i and j is a
hyperplane. For arbitrary x attaining a tie w.r.t. its scores:
f_i(x) = f_j(x)
g(f_i(x)) = g(f_j(x))
w_i^⊤ x + b_i = w_j^⊤ x + b_j
(w_i − w_j)^⊤ x + (b_i − b_j) = 0
This specifies a hyperplane separating two classes.
LINEAR VS NONLINEAR DECISION BOUNDARY
[Figure: two classifiers on the Iris data (Sepal.Length vs. Sepal.Width, classes versicolor and virginica), one with a linear and one with a nonlinear decision boundary.]
Logistic Regression
MOTIVATION
A discriminant approach for directly modeling the posterior
probabilities π(x | θ) of the labels is logistic regression.
For now, let’s focus on the binary case y ∈ {0, 1} and use empirical
risk minimization.
argmin_{θ ∈ Θ} R_emp(θ) = argmin_{θ ∈ Θ} ∑_{i=1}^{n} L( y^(i), π(x^(i) | θ) ).

A naïve approach would be to model*

π(x | θ) = θ^⊤ x.

Obviously this could result in predicted probabilities π(x | θ) ∉ [0, 1].

* Note: we suppress the intercept for a concise notation.
LOGISTIC FUNCTION
To avoid this, logistic regression “squashes” the estimated linear scores θ^⊤ x to [0, 1] through the logistic function s:

π(x | θ) = exp(θ^⊤ x) / (1 + exp(θ^⊤ x)) = 1 / (1 + exp(−θ^⊤ x)) = s(θ^⊤ x)

[Figure: the logistic function s(f) for f ∈ [−10, 10], ranging from 0 to 1.]
LOGISTIC FUNCTION
The intercept shifts s(f) horizontally: s(θ_0 + f) = exp(θ_0 + f) / (1 + exp(θ_0 + f))

[Figure: s(θ_0 + f) for θ_0 ∈ {−3, 0, 3}.]

Scaling f controls the slope and direction of s(αf) = exp(αf) / (1 + exp(αf)).

[Figure: s(αf) for α ∈ {−2, −0.3, 1, 6}.]
BERNOULLI / LOG LOSS
We need to define a loss function for the ERM approach, and we model y | x, θ as Bernoulli, i.e.,

p(y | x, θ) =  1 − π(x | θ)   if y = 0,
               π(x | θ)       if y = 1,
               0              else.
To arrive at the loss function, we consider the negative log-likelihood:
L(y, π(x | θ)) = −y ln(π(x | θ)) − (1 − y) ln(1 − π(x | θ))
Also called cross-entropy loss
Used for many other classifiers, e.g., neural networks or boosting
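The loss translates directly into R (a sketch; y is assumed to be coded 0/1 and pi to lie strictly between 0 and 1):

# Bernoulli / log loss for a label y in {0, 1} and predicted probability pi
log_loss <- function(y, pi) {
  -y * log(pi) - (1 - y) * log(1 - pi)
}

log_loss(y = 1, pi = 0.9)   # small loss: confident and correct
log_loss(y = 1, pi = 0.1)   # large loss: confident but wrong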
BERNOULLI / LOG LOSS
Heavy penalty for confidently wrong predictions
Recall: L(y, π(x | θ)) = −y ln(π(x | θ)) − (1 − y) ln(1 − π(x | θ))
[Figure: L(y, π(x)) as a function of π(x) ∈ [0, 1], one curve for y = 0 and one for y = 1; the loss grows steeply as the predicted probability of the true class approaches 0.]
LOGISTIC REGRESSION IN 1D
With one feature x ∈ R, the figure shows the data and the fitted curve x ↦ π(x).

[Figure: binary observations (y ∈ {0, 1}) over x ∈ [0, 6] together with the fitted logistic curve π(x).]
LOGISTIC REGRESSION IN 2D
Obviously, logistic regression is a linear classifier, since
π(x | θ) = s(θ^⊤ x)

and s is invertible.

[Figure: two-dimensional binary data (x1 vs. x2, response FALSE / TRUE) illustrating logistic regression in 2D.]
LOGISTIC REGRESSION IN 2D
[Figure: predicted probability (prob) plotted against the linear score, i.e., the logistic function applied to the score.]
SUMMARY OF LOGISTIC REGRESSION
Hypothesis Space:
H = { π : X → [0, 1] | π(x) = s(θ^⊤ x) }
Risk: Logistic/Bernoulli loss function.
L (y, π(x)) = −y ln(π(x)) − (1 − y) ln(1 − π(x))
Optimization: Numerical optimization, typically gradient-based
methods.
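In R, such a model can be fit with glm() (a minimal sketch on a binary subset of the iris data; the recoding into y is our own and not part of the slides):

# Binary task: versicolor vs. virginica
iris2   <- subset(iris, Species != "setosa")
iris2$y <- as.integer(iris2$Species == "virginica")

fit <- glm(y ~ Sepal.Length + Sepal.Width, data = iris2, family = binomial())

pi_hat <- predict(fit, type = "response")   # predicted probabilities pi(x)
y_hat  <- as.integer(pi_hat >= 0.5)         # thresholded class labels
table(y_hat, iris2$y)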
Multiclass Classification
FROM LOGISTIC REGRESSION...
Recall logistic regression (Y = {0, 1}): we combined the hypothesis space of linear functions, transformed by the logistic function s(z) = 1 / (1 + exp(−z)),

H = { π : X → [0, 1] | π(x) = s(θ^⊤ x) }
with the Bernoulli (logarithmic) loss:
L(y, π(x)) = −y log (π(x)) − (1 − y) log (1 − π(x)) .
...TO SOFTMAX REGRESSION
There is a straightforward generalization to the multiclass case:
Instead of a single linear discriminant function we have g linear
discriminant functions
f_k(x) = θ_k^⊤ x,   k = 1, 2, ..., g,

one for each class.
The g score functions are transformed into g probability functions by the softmax function s : R^g → R^g,

π_k(x) = s(f(x))_k = exp(θ_k^⊤ x) / ∑_{j=1}^{g} exp(θ_j^⊤ x),

instead of the logistic function for g = 2. The probabilities are well-defined: ∑_k π_k(x) = 1 and π_k(x) ∈ [0, 1] for all k.
...TO SOFTMAX REGRESSION
The softmax function is a generalization of the logistic function.
For g = 2, they are equivalent.
Instead of Bernoulli, we use the multiclass logarithmic loss

L(y, π(x)) = − ∑_{k=1}^{g} 1{y = k} log(π_k(x)).

Note that the softmax function is a smooth approximation of the arg max operation, so s((1, 1000, 2)^⊤) ≈ (0, 1, 0)^⊤ (selects the 2nd element!).
Furthermore, it is invariant to constant offsets in the input:

s(f(x) + c)_k = exp(θ_k^⊤ x + c) / ∑_{j=1}^{g} exp(θ_j^⊤ x + c) = [ exp(θ_k^⊤ x) · exp(c) ] / [ ∑_{j=1}^{g} exp(θ_j^⊤ x) · exp(c) ] = s(f(x))_k
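The shift invariance above is exactly what numerically stable implementations exploit: subtracting max(z) from all scores leaves the output unchanged but avoids overflow in exp(). A minimal sketch in R:

# Softmax: s(z)_k = exp(z_k) / sum_j exp(z_j), computed stably
softmax <- function(z) {
  z_shifted <- z - max(z)   # allowed because of the invariance to constant offsets
  exp(z_shifted) / sum(exp(z_shifted))
}

softmax(c(1, 2, 3))       # ordinary case
softmax(c(1, 1000, 2))    # approx. (0, 1, 0): the smooth arg max behavior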
LOGISTIC VS. SOFTMAX REGRESSION
                     Logistic Regression                      Softmax Regression
Y                    {0, 1}                                   {1, 2, ..., g}
Discriminant fun.    f(x) = θ^⊤ x                             f_k(x) = θ_k^⊤ x,  k = 1, 2, ..., g
Probabilities        π(x) = 1 / (1 + exp(−θ^⊤ x))             π_k(x) = exp(θ_k^⊤ x) / ∑_{j=1}^{g} exp(θ_j^⊤ x)
Loss L(y, π(x))      −y log(π(x)) − (1 − y) log(1 − π(x))     −∑_{k=1}^{g} [y = k] log(π_k(x))
                     (Bernoulli / logarithmic loss)           (multiclass logarithmic loss)
LOGISTIC VS. SOFTMAX REGRESSION
Further comments:
We can now, for instance, calculate gradients and optimize this
with standard numerical optimization software.
Softmax regression has an unusual property in that it has a
“redundant” set of parameters. If we add a fixed vector to all θk ,
the predictions do not change. I.e., our model is
“over-parameterized”, since multiple parameter vectors correspond
to exactly the same hypothesis. This also implies that the
minimizer of Remp(θ) is not unique! Hence, a numerical trick is to
set θ_g = 0 and only optimize the other θ_k, k ≠ g.
A similar approach is used in many ML models: multiclass LDA,
naive Bayes, neural networks, etc.
SOFTMAX: LINEAR DISCRIMINANT FUNCTIONS
Softmax regression gives us a linear classifier.
The softmax function s(z)_k = exp(z_k) / ∑_{j=1}^{g} exp(z_j) is a rank-preserving function, i.e., the ranks among the elements of the vector z are the same as among the elements of s(z). This is because softmax transforms all scores by taking exp(·) (rank-preserving) and divides each element by the same normalizing constant.
Thus, the softmax function has a unique inverse function s^{−1} : R^g → R^g that is also monotonic and rank-preserving.
Applying s_k^{−1} to π_k(x) = exp(θ_k^⊤ x) / ∑_{j=1}^{g} exp(θ_j^⊤ x) gives us f_k(x) = θ_k^⊤ x. Thus softmax regression is a linear classifier.
Discriminant Analysis
LINEAR DISCRIMINANT ANALYSIS (LDA)
LDA follows a generative approach
π_k(x) = P(y = k | x) = P(x | y = k) · P(y = k) / P(x) = [ p(x | y = k) π_k ] / [ ∑_{j=1}^{g} p(x | y = j) π_j ]

(posterior = likelihood × prior / marginal likelihood)
where we now have to choose a distributional form for p(x|y = k).
LINEAR DISCRIMINANT ANALYSIS (LDA)
LDA assumes that each class density (the likelihood) is modeled as a
multivariate Gaussian:
p(x | y = k) = 1 / ( (2π)^{p/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ_k)^⊤ Σ^{−1} (x − µ_k) )

with equal covariance, i.e., ∀k : Σ_k = Σ.

[Figure: simulated two-dimensional data (X1 vs. X2) with class densities sharing a common covariance matrix.]
LINEAR DISCRIMINANT ANALYSIS (LDA)
Parameters θ are estimated in a straightforward manner
Prior class probabilities: π̂_k = n_k / n, where n_k is the number of class-k observations
Likelihood parameters:

µ̂_k = (1 / n_k) ∑_{i: y^(i) = k} x^(i)                                          (1)

Σ̂ = (1 / (n − g)) ∑_{k=1}^{g} ∑_{i: y^(i) = k} (x^(i) − µ̂_k)(x^(i) − µ̂_k)^⊤    (2)
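In R, LDA is available as MASS::lda() (a short sketch on the iris data; MASS ships with the standard R distribution):

library(MASS)

fit_lda <- lda(Species ~ ., data = iris)

fit_lda$prior                      # estimated prior class probabilities pi_k
fit_lda$means                      # estimated class means mu_k

pred <- predict(fit_lda, iris)
table(pred$class, iris$Species)    # confusion matrix on the training data
head(pred$posterior)               # estimated posterior probabilities pi_k(x)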
LDA AS LINEAR CLASSIFIER
Because of the common covariance structure for all class-specific
Gaussians, the decision boundaries of LDA are linear.
[Figure: linear LDA decision boundaries on the Iris data (Petal.Length vs. Petal.Width), classes setosa, versicolor, virginica.]
LDA AS LINEAR CLASSIFIER
We can formally show that LDA is a linear classifier by showing that the
posterior probabilities can be written as linear scoring functions — up to
an isotonic / rank-preserving transformation.
π_k(x) = π_k · p(x | y = k) / p(x) = [ π_k · p(x | y = k) ] / [ ∑_{j=1}^{g} π_j · p(x | y = j) ]
Since the denominator is the same (constant) for all classes we only
need to consider
πk · p(x|y = k).
Our plan is to show that the logarithm of this can be written as a linear
function of x.
LDA AS LINEAR CLASSIFIER
π_k · p(x | y = k)
  ∝ π_k exp( −(1/2) x^⊤ Σ^{−1} x − (1/2) µ_k^⊤ Σ^{−1} µ_k + x^⊤ Σ^{−1} µ_k )
  = exp( log π_k − (1/2) µ_k^⊤ Σ^{−1} µ_k + x^⊤ Σ^{−1} µ_k ) · exp( −(1/2) x^⊤ Σ^{−1} x )
  = exp( θ_{0k} + x^⊤ θ_k ) · exp( −(1/2) x^⊤ Σ^{−1} x )
  ∝ exp( θ_{0k} + x^⊤ θ_k )

by defining θ_{0k} := log π_k − (1/2) µ_k^⊤ Σ^{−1} µ_k and θ_k := Σ^{−1} µ_k.
We have left out all constants that are common for all classes k, i.e., the normalizing constant of our Gaussians and exp( −(1/2) x^⊤ Σ^{−1} x ).
Finally, by taking the log, we can write the transformed scores as linear functions:

f_k(x) = θ_{0k} + x^⊤ θ_k
QUADRATIC DISCRIMINANT ANALYSIS (QDA)
QDA is a direct generalization of LDA, where the class specific
densities are now Gaussians with individual covariances Σk .
p(x | y = k) = 1 / ( (2π)^{p/2} |Σ_k|^{1/2} ) · exp( −(1/2) (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k) )

Parameters are estimated in a straightforward manner:

π̂_k = n_k / n, where n_k is the number of class-k observations

µ̂_k = (1 / n_k) ∑_{i: y^(i) = k} x^(i)

Σ̂_k = (1 / (n_k − 1)) ∑_{i: y^(i) = k} (x^(i) − µ̂_k)(x^(i) − µ̂_k)^⊤
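Analogously, QDA can be fit with MASS::qda() (a sketch on the iris data):

library(MASS)

fit_qda  <- qda(Species ~ ., data = iris)
pred_qda <- predict(fit_qda, iris)

table(pred_qda$class, iris$Species)   # training confusion matrix
head(pred_qda$posterior)              # estimated posterior probabilities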
QUADRATIC DISCRIMINANT ANALYSIS (QDA)
Covariance matrices can differ across classes.
Yields better data fit but also requires estimating more parameters.
[Figure: simulated two-dimensional data (X1 vs. X2) with class-specific covariance matrices.]
QUADRATIC DISCRIMINANT ANALYSIS (QDA)
π_k(x) ∝ π_k · p(x | y = k)
       ∝ π_k |Σ_k|^{−1/2} exp( −(1/2) x^⊤ Σ_k^{−1} x − (1/2) µ_k^⊤ Σ_k^{−1} µ_k + x^⊤ Σ_k^{−1} µ_k )

Taking the log of the above, we can define a discriminant function that is quadratic in x:

log π_k − (1/2) log |Σ_k| − (1/2) µ_k^⊤ Σ_k^{−1} µ_k + x^⊤ Σ_k^{−1} µ_k − (1/2) x^⊤ Σ_k^{−1} x
QUADRATIC DISCRIMINANT ANALYSIS (QDA)
[Figure: quadratic QDA decision boundaries on the Iris data (Petal.Length vs. Petal.Width), classes setosa, versicolor, virginica.]
Naïve Bayes
NAÏVE BAYES CLASSIFIER
NB is a generative multiclass classifier. Recall:
π_k(x) = P(y = k | x) = P(x | y = k) · P(y = k) / P(x) = [ p(x | y = k) π_k ] / [ ∑_{j=1}^{g} p(x | y = j) π_j ]

(posterior = likelihood × prior / marginal likelihood)

NB relies on a simple conditional independence assumption: the features are conditionally independent given the class y:

p(x | y = k) = p((x_1, x_2, ..., x_p) | y = k) = ∏_{j=1}^{p} p(x_j | y = k).
So we only need to specify and estimate the distributions
p(xj |y = k), j ∈ {1, . . . , p}, k ∈ {1, . . . , g}, which is considerably
simpler since they are univariate.
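A naive Bayes classifier of this form is implemented, e.g., in the e1071 package (a sketch; e1071 is assumed to be installed; numeric features are modeled by univariate Gaussians per class, as on the next slide):

library(e1071)

fit_nb <- naiveBayes(Species ~ ., data = iris)

fit_nb$tables$Petal.Length                      # per-class mean and sd of this feature
pred_nb <- predict(fit_nb, iris)                # predicted class labels
prob_nb <- predict(fit_nb, iris, type = "raw")  # estimated posterior probabilities
table(pred_nb, iris$Species)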
NB: NUMERICAL FEATURES
We use a univariate Gaussian for p(x_j | y = k), and estimate (µ_j, σ_j²)_k in the standard manner. Since p(x | y = k) = ∏_{j=1}^{p} p(x_j | y = k), the joint conditional density is Gaussian with diagonal but non-isotropic covariance structure, and potentially different across classes. Hence, NB is a (specific) QDA model, with quadratic decision boundary.

[Figure: quadratic naive Bayes decision boundary on simulated two-dimensional data (x1 vs. x2, classes a and b).]
NB: CATEGORICAL FEATURES
We use a categorical distribution for p(x_j | y = k) and estimate the probabilities p(x_j = m | y = k) that, in class k, our j-th feature has value m simply by counting relative frequencies, i.e.,

p(x_j | y = k) = ∏_m p(x_j = m | y = k)^{[x_j = m]} .
Because of the simple conditional independence structure it is also very
easy to deal with mixed numerical / categorical feature spaces.
LAPLACE SMOOTHING
If a given class and feature value never occur together in the training
data, then the frequency-based probability estimate will be zero.
This is a problem since it will wipe all information of the other non-zero
probabilities when they are multiplied.
A simple numerical correction is to set these zero probabilities to a small value δ > 0 to regularize against this case.
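In e1071::naiveBayes(), for example, this kind of correction is exposed via the laplace argument, which adds a pseudo-count for categorical features (a sketch; the default laplace = 0 means no smoothing, and iris has only numeric features, so the argument is shown purely for illustration):

library(e1071)

# Add-one (Laplace) smoothing for categorical features
fit_nb_smooth <- naiveBayes(Species ~ ., data = iris, laplace = 1)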
NAÏVE BAYES: APPLICATION AS SPAM FILTER
In the late 90s, Naïve Bayes became popular for e-mail spam filters
Word counts were used as features to detect spam mails (e.g.,
Viagra often occurs in spam mail)
The independence assumption implies that the occurrence of two words in a mail is not correlated
Seems naïve (Viagra is more likely to occur in the context of buy than of flower), but it requires fewer parameters, generalizes better, and often works well in practice.
k-Nearest Neighbors Classification
K-NEAREST NEIGHBORS
For each point x to predict:
Compute the k nearest neighbors N_k(x) ⊆ X in the training data
Average over the labels y of the k neighbors
For regression:

f̂(x) = (1 / |N_k(x)|) ∑_{i: x^(i) ∈ N_k(x)} y^(i)

For classification in g groups, a majority vote is used:

ĥ(x) = argmax_{ℓ ∈ {1,...,g}} ∑_{i: x^(i) ∈ N_k(x)} I(y^(i) = ℓ)

Posterior probabilities can be estimated with:

π̂_ℓ(x) = (1 / |N_k(x)|) ∑_{i: x^(i) ∈ N_k(x)} I(y^(i) = ℓ)
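In R this is provided, e.g., by class::knn() (a minimal sketch on the iris data; the class package ships with R):

library(class)

X <- iris[, 1:4]
y <- iris$Species

# 3-nearest-neighbor prediction, here evaluated on the training points themselves
pred_knn <- knn(train = X, test = X, cl = y, k = 3, prob = TRUE)

table(pred_knn, y)
head(attr(pred_knn, "prob"))   # proportion of neighbor votes for the winning class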
K-NEAREST NEIGHBORS / 2
Example on a subset of the iris data (k = 3):

[Figure: x_new and its neighborhood in the Sepal.Length / Sepal.Width plane, classes versicolor and virginica.]

      SL   SW   Species     dist
 52   6.4  3.2  versicolor  0.200
 59   6.6  2.9  versicolor  0.224
 75   6.4  2.9  versicolor  0.100
 76   6.6  3.0  versicolor  0.200
 98   6.2  2.9  versicolor  0.224
104   6.3  2.9  virginica   0.141
105   6.5  3.0  virginica   0.100
111   6.5  3.2  virginica   0.224
116   6.4  3.2  virginica   0.200
117   6.5  3.0  virginica   0.100
138   6.4  3.1  virginica   0.100
148   6.5  3.0  virginica   0.100

π̂_setosa(x_new) = 0/3 = 0%
π̂_versicolor(x_new) = 1/3 = 33%
π̂_virginica(x_new) = 2/3 = 67%
ĥ(x_new) = virginica
K-NN: FROM SMALL TO LARGE K
[Figure: k-NN decision regions on the Iris data (Sepal.Length vs. Sepal.Width) for k = 1, 5, 10, and 50.]
Small k: complex, local model; large k: smoother, more global model
K-NN AS NON-PARAMETRIC MODEL
k-NN is a lazy classifier: it has no real training step; it simply stores the complete data, which is queried during prediction∗
Hence, its parameters are the training data; there is no real compression of information∗
As the number of parameters grows with the number of training
points, we call k-NN a non-parametric model
Hence, k-NN is not based on any distributional or strong functional
assumption, and can, in theory, model data situations of arbitrary
complexity
∗ Clever data structures can compress the data so that k-NN works for Big Data (see the regression slides), but this is not an inherent property of the statistical model.
SUMMARY
Today, we saw two fundamental approaches to construct classifiers:
Discriminant approaches use empirical risk minimization
based on a suitable loss function, e. g., logistic regression,
softmax regression
Generative approaches assume there exists a data-generating
process in which the distribution of the features x is different for the
various classes of the output y, and try to learn these conditional
distributions, e. g., LDA, QDA, naïve Bayes
In addition, we discussed k-NN classification, which is a non-parametric
approach that falls into none of the above classes.
