CHAPTER 2:
Supervised Learning
Machine Learning
 Machine learning is a subset of AI, which enables the
machine to automatically learn from data, improve
performance from past experiences, and make
predictions.
 Machine learning contains a set of algorithms that work on
a huge amount of data.
 Data is fed to these algorithms to train them, and on the
basis of training, they build the model & perform a specific
task.
 These ML algorithms help solve different business problems such as regression, classification, forecasting, clustering, and association.
Types of Machine Learning
 Based on the methods and way of learning, machine learning is divided into four main types:
 Supervised Machine Learning
 Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Reinforcement Learning
Outline of this chapter
 Learning a Class from Examples
 Vapnik-Chervonenkis Dimension
 Probably Approximately Correct Learning
 Noise
 Learning Multiple Classes
 Regression
 Model Selection and Generalization
 Dimensions of a Supervised Machine Learning Algorithm
Learning a Class from Examples
 Let us say we want to learn the class, C, of a “family
car.”
 We have a set of examples of cars, and we have a
group of people that we survey to whom we show
these cars.
 The people look at the cars and label them; the cars
that they believe are family cars are positive
examples, and the other cars are negative examples.
 Class learning is finding a description that is shared
by all the positive examples and none of the negative
examples.
 Doing this, we can make a prediction: Given a car that we
have not seen before, by checking with the description
learned, we will be able to say whether it is a family car or
not.
 Or we can do knowledge extraction: This study may be
sponsored by a car company, and the aim may be to
understand what people expect from a family car.
 After some discussions with experts in the field, let us say
that we reach the conclusion that among all features a car
may have, the features that separate a family car from
other types of cars are the price and engine power.
 These two attributes are the inputs to the class recognizer.
 Note that when we decide on this particular input
representation, we are ignoring various other attributes as
irrelevant.
 Though one may think of other attributes such as seating
capacity and color that might be important for distinguishing
among car types, we will consider only price and engine
power to keep this example simple.
 Class C of a “family car”
 Prediction: Is car x a family car?
 Knowledge extraction: What do people expect from a
family car?
 Input representation:
x1: price, x2: engine power
 Output:
Positive (+) and negative (–) examples
 Training set for the class of a “family car.”
 Each data point corresponds to one example car, and the
coordinates of the point indicate the price and engine
power of that car.
 ‘+’ denotes a positive example of the class (a family car),
and
 ‘−’ denotes a negative example (not a family car); it is
another type of car.
 Let us denote price as the first input attribute x1 and engine power as the second attribute x2.
 Thus we represent each car using two numeric values.
 Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates $(x_1^t, x_2^t)$ and its type, namely positive versus negative, is given by $r^t$.
 After further discussions with the expert and the analysis of
the data, we may have reason to believe that for a car to be
a family car, its price and engine power should be in a
certain range.
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
for suitable values of p1, p2, e1, and e2.
 The above equation fixes H, the hypothesis class from which we
believe C is drawn, namely, the set of rectangles.
 The learning algorithm then finds the particular hypothesis $h \in H$, specified by a particular quadruple $(p_1^h, p_2^h, e_1^h, e_2^h)$, to approximate C as closely as possible.
Training set X

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Class C

$$(p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)$$
Hypothesis class H

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

Error of h on X

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big(h(\mathbf{x}^t) \ne r^t\big)$$
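A minimal sketch of this rectangle hypothesis and its empirical error in Python; the price and engine-power values below are invented purely for illustration.

def h(x, p1, p2, e1, e2):
    """Return 1 if x = (price, engine power) falls inside the rectangle, else 0."""
    price, power = x
    return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

def empirical_error(X, r, p1, p2, e1, e2):
    """E(h | X): the number of training examples that h misclassifies."""
    return sum(h(x, p1, p2, e1, e2) != label for x, label in zip(X, r))

# Toy training set: (price, engine power) pairs labeled 1 (family car) or 0 (other).
X = [(18, 110), (22, 130), (35, 250), (12, 70)]
r = [1, 1, 0, 0]
print(empirical_error(X, r, p1=15, p2=30, e1=90, e2=200))   # -> 0 for this h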
S, G, and the Version Space
Any h ∈ H between the most specific hypothesis S and the most general hypothesis G is consistent with the training set, and such h make up the version space (Mitchell, 1997).
 One possibility is to find the most specific hypothesis, S,
that is the tightest rectangle that includes all the positive
examples and none of the negative examples.
 The most general hypothesis, G, is the largest rectangle
we can draw that includes all the positive examples and
none of the negative examples.
 Any h∈H between S and G is a valid hypothesis with no
error, said to be consistent with the training set, and such h
make up the version space.
Candidate Elimination Algorithm
 The candidate elimination algorithm incrementally builds the
version space given a hypothesis space H and a set E of
examples.
 The examples are added one by one; each example
possibly shrinks the version space by removing the
hypotheses that are inconsistent with the example.
 The candidate elimination algorithm does this by updating
the general and specific boundary for each new example.
Terms Used:
Concept learning:
• Concept learning is the basic learning task of the machine (learning from training data).
General Hypothesis:
• Does not specify any feature values ('don't care' for every attribute).
• G = {'?', '?', '?', '?', …}: one '?' per attribute.
Specific Hypothesis:
• Specifies concrete feature values.
• S = {'pi', 'pi', 'pi', …}: the number of pi values depends on the number of attributes.
Version Space:
• The set of hypotheses lying between the general hypothesis and the specific hypothesis.
Algorithm:
Step 1: Load the data set.
Step 2: Initialize the General Hypothesis and the Specific Hypothesis.
Step 3: For each training example:
Step 4: If the example is positive:
            if attribute_value == hypothesis_value:
                do nothing
            else:
                replace the attribute value in S with '?'
                (basically generalizing it)
Step 5: If the example is negative:
            make the general hypothesis more specific.
(A runnable sketch of these updates follows the worked example below.)
Example:
 Consider the dataset given below:
Sky Temperature Humid Wind Water Forest Output
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Algorithmic steps:
Initially :
G = [[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?]]
S = [Null, Null, Null, Null, Null, Null]
For instance 1: <'sunny', 'warm', 'normal', 'strong', 'warm', 'same'> and positive output.
G1 = G
S1 = ['sunny', 'warm', 'normal', 'strong', 'warm', 'same']
For instance 2: <'sunny', 'warm', 'high', 'strong', 'warm', 'same'> and positive output.
G2 = G
S2 = ['sunny', 'warm', ?, 'strong', 'warm', 'same']
For instance 3: <'rainy', 'cold', 'high', 'strong', 'warm', 'change'> and negative output.
G3 = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?],
      [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
      [?, ?, ?, ?, ?, 'same']]
S3 = S2
For instance 4: <'sunny', 'warm', 'high', 'strong', 'cool', 'change'> and positive output.
G4 = G3
S4 = ['sunny', 'warm', ?, 'strong', ?, ?]
 Finally, by combining G4 and S4, the algorithm produces the output.
Output:
 G = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?]]
 S = ['sunny', 'warm', ?, 'strong', ?, ?]
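The updates traced above can be reproduced with a short runnable sketch (a simplified candidate elimination for conjunctive hypotheses; the function and variable names are our own).

def consistent(h, x):
    """A hypothesis classifies x as positive iff every non-'?' value matches."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = [None] * n                      # most specific hypothesis
    G = [['?'] * n]                     # set of most general hypotheses
    for x, label in examples:
        if label == 'Yes':              # positive example: generalize S, prune G
            S = [xv if sv in (None, xv) else '?' for sv, xv in zip(S, x)]
            G = [g for g in G if consistent(g, x)]
        else:                           # negative example: specialize members of G using S
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S[i] not in ('?', None, x[i]):
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            G = new_G
    return S, G

data = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'),   'Yes'),
    (('sunny', 'warm', 'high',   'strong', 'warm', 'same'),   'Yes'),
    (('rainy', 'cold', 'high',   'strong', 'warm', 'change'), 'No'),
    (('sunny', 'warm', 'high',   'strong', 'cool', 'change'), 'Yes'),
]
S, G = candidate_elimination(data)
print("S =", S)   # ['sunny', 'warm', '?', 'strong', '?', '?']
print("G =", G)   # [['sunny', '?', '?', '?', '?', '?'], ['?', 'warm', '?', '?', '?', '?']]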
Vapnik-Chervonenkis Dimension
 Let us say we have a dataset containing N points.
 These N points can be labeled in 2^N ways as positive and negative.
 Therefore, 2^N different learning problems can be defined by N data points.
 If for any of these problems, we can find a hypothesis h∈H
that separates the positive examples from the negative,
then we say H shatters N points.
 That is, any learning problem definable by N examples can
be learned with no error by a hypothesis drawn from H.
 The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, is denoted VC(H), and measures the capacity of H.
VC Dimension
 N points can be labeled in 2^N ways as +/–
 H shatters N points if, for each of these labelings, there exists an h ∈ H consistent with it; the largest such N is the VC dimension, VC(H) = N
An axis-aligned rectangle shatters at most 4 points!
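A small self-contained check of this claim (our own demo, not from the slides): for four points placed in a diamond, every one of the 2^4 labelings is realizable, because the bounding box of the positive points contains no negative point.

from itertools import product

points = [(0, 1), (1, 0), (1, 2), (2, 1)]   # four points in a diamond layout

def separable(pos, neg):
    """Can some axis-aligned rectangle contain all of pos and none of neg?"""
    if not pos:
        return True                          # a degenerate (empty) rectangle works
    x1 = min(p[0] for p in pos); x2 = max(p[0] for p in pos)
    y1 = min(p[1] for p in pos); y2 = max(p[1] for p in pos)
    return not any(x1 <= x <= x2 and y1 <= y <= y2 for x, y in neg)

shattered = all(
    separable([p for p, lab in zip(points, labels) if lab],
              [p for p, lab in zip(points, labels) if not lab])
    for labels in product([0, 1], repeat=4)
)
print(shattered)   # True: all 16 labelings are realizable, so VC(rectangles) >= 4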
Probably Approximately Correct (PAC) Learning
 How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
 Each strip is at most ε/4
 Pr that we miss a strip: 1 − ε/4
 Pr that N instances miss a strip: (1 − ε/4)^N
 Pr that N instances miss 4 strips: 4(1 − ε/4)^N
 We require 4(1 − ε/4)^N ≤ δ; using (1 − x) ≤ exp(−x),
 4 exp(−εN/4) ≤ δ and N ≥ (4/ε) ln(4/δ)
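Plugging example numbers into this bound (the values of ε and δ below are arbitrary):

import math

def pac_bound(eps, delta):
    """Smallest integer N satisfying N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_bound(0.1, 0.05))   # 176 examples suffice for eps = 0.1, delta = 0.05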
Noise
 Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to learn, and zero error may be infeasible with a simple hypothesis class (see figure 2.8).
 There are several interpretations of noise:
 There may be imprecision in recording the input attributes, which
may shift the data points in the input space.
 There may be errors in labeling the data points, which may relabel
positive instances as negative and vice versa. This is sometimes
called teacher noise.
 There may be additional attributes, which we have not taken into
account, that affect the label of an instance. Such attributes may
be hidden or latent in that they may be unobservable. The effect of
these neglected attributes is thus modeled as a random
component and is included in “noise.”
 As can be seen in figure 2.8, when there is noise, there is not a simple boundary between the positive and negative instances, and to separate them one needs a complicated hypothesis that corresponds to a hypothesis class with larger capacity.
 A rectangle can be defined by four numbers, but to define
a more complicated shape one needs a more complex
model with a much larger number of parameters.
 With a complex model one can make a perfect fit to the
data and attain zero error; see the wiggly shape in figure
2.8.
Noise and Model Complexity
Use the simpler one because
 Simpler to use (lower computational complexity)
 Easier to train (lower space complexity)
 Easier to explain (more interpretable)
 Generalizes better (lower variance - Occam's razor)
Learning Multiple Classes
 In our example of learning a family car, we have positive
examples belonging to the class family car and the
negative examples belonging to all other cars.
 This is a two-class problem.
 In the general case, we have K classes denoted as Ci, i =
1, . . . , K, and an input instance belongs to one and
exactly one of them.
Multiple Classes, Ci, i = 1, ..., K

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$$

$$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$

Train hypotheses hi(x), i = 1, ..., K:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$
 In machine learning for classification, we would like to
learn the boundary separating the instances of one class
from the instances of all other classes.
 Thus we view a K-class classification problem as K two-
class problems.
 The training examples belonging to Ci are the positive
instances of hypothesis hi and the examples of all other
classes are the negative instances of hi.
 The total empirical error takes a sum over the predictions
for all classes over all instances:
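In the notation above, this total empirical error can be written as

$$E(\{h_i\}_{i=1}^{K} \mid \mathcal{X}) = \sum_{t=1}^{N} \sum_{i=1}^{K} \mathbf{1}\big(h_i(\mathbf{x}^t) \ne r_i^t\big)$$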
Regression
 In classification, given an input, the output that is
generated is Boolean; it is a yes/no answer.
 When the output is a numeric value, what we would like
to learn is not a class, C(x) ∈ {0, 1}, but is a numeric
function.
 In machine learning, the function is not known, but we have a training set of examples drawn from it.
Regression

$$g(x) = w_1 x + w_0$$

$$g(x) = w_2 x^2 + w_1 x + w_0$$

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - g(x^t)\big]^2$$

$$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - (w_1 x^t + w_0)\big]^2$$

$$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \qquad r^t \in \mathbb{R}, \qquad r^t = f(x^t) + \varepsilon$$
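A brief sketch of fitting the linear model g(x) = w1 x + w0 by least squares; the data values below are made up purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # inputs x^t
r = np.array([2.9, 5.1, 7.0, 9.2, 10.8])     # noisy targets r^t = f(x^t) + noise

# Closed-form least-squares solution with design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, r, rcond=None)

E = np.mean((r - (w1 * x + w0)) ** 2)        # empirical error E(w1, w0 | X)
print(f"w1 = {w1:.3f}, w0 = {w0:.3f}, E = {E:.4f}")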
Model Selection & Generalization
 Learning is an ill-posed problem; data is not sufficient to
find a unique solution
 The need for inductive bias, assumptions about H
 Generalization: How well a model performs on new data
 Overfitting: H more complex than C or f
 Underfitting: H less complex than C or f
Triple Trade-Off
 There is a trade-off between three factors (Dietterich,
2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
 As N increases, E decreases.
 As c(H) increases, E first decreases and then increases.
Cross-Validation
 To estimate generalization error, we need data unseen
during training. We split the data as
 Training set (50%)
 Validation set (25%)
 Test (publication) set (25%)
 Resampling when there is little data
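A minimal sketch of such a 50/25/25 split in plain Python (the dataset here is just a placeholder).

import random

data = list(range(100))                  # placeholder dataset of 100 examples
random.seed(0)
random.shuffle(data)

n = len(data)
train      = data[: n // 2]              # 50% training set
validation = data[n // 2 : 3 * n // 4]   # 25% validation set
test       = data[3 * n // 4 :]          # 25% test (publication) set
print(len(train), len(validation), len(test))   # 50 25 25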
Dimensions of a Supervised Learner

1. Model:

$$g(\mathbf{x} \mid \theta)$$

2. Loss function:

$$E(\theta \mid \mathcal{X}) = \sum_{t} L\big(r^t, g(\mathbf{x}^t \mid \theta)\big)$$

3. Optimization procedure:

$$\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$$
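A small illustrative sketch tying the three dimensions together for a one-dimensional linear model; the function names and data are our own, chosen only to mirror the structure above.

import numpy as np

def g(x, theta):                      # 1. Model: g(x | theta) = w1*x + w0
    w1, w0 = theta
    return w1 * x + w0

def E(theta, X, r):                   # 2. Loss function: sum of squared errors on the training set
    return np.sum((r - g(X, theta)) ** 2)

def fit(X, r, lr=0.01, steps=2000):   # 3. Optimization procedure: gradient descent on E
    theta = np.zeros(2)
    for _ in range(steps):
        err = r - g(X, theta)
        grad = np.array([-2 * np.sum(err * X), -2 * np.sum(err)])
        theta = theta - lr * grad
    return theta

X = np.array([0.0, 1.0, 2.0, 3.0])
r = 2.0 * X + 1.0                     # noise-free targets with w1 = 2, w0 = 1
print(fit(X, r))                      # should approach [2.0, 1.0]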