Probabilistic Methods
Lectures 13, 14, 15, 16: Probabilistic Classifiers
EE 439 Introduction to Machine Learning
Reference: Draft of Chapter 2 of Tom Mitchell, “Machine Learning”, 2nd Edition, 2018
www.cs.cmu.edu/~tom/mlbook.html
Estimating Probabilities: Intro
❑ Many machine learning methods depend on
probabilistic approaches.
❑ The reason: when we are interested in learning some
target function f : X → Y, we can more generally learn
the probabilistic function P(Y|X).
❑ We can design algorithms that learn functions with
uncertain outcomes (e.g., predicting tomorrow’s stock
price) and that incorporate prior knowledge to guide
learning (e.g., a bias that tomorrow’s stock price is
likely to be similar to today’s price).
❑ We will discuss joint probability distributions over
many variables and show how they can be used to
calculate a target P(Y|X).
❑ We will study the problem of learning, or estimating,
probability distributions from training data.
❖ The two most common approaches:
• maximum likelihood estimation (MLE) and
• maximum a posteriori estimation (MAP).
Joint Probability Distributions
❑ The key to building probabilistic models is
❖ to define a set of random variables, and
❖ to consider the joint probability distribution over
them.
❑ Table 1 defines a joint probability distribution over
three random variables:
❖ a person’s Gender, the number of HoursWorked each week,
and their Wealth.
❑ In general, defining a joint probability distribution over
a set of discrete-valued variables involves three
simple steps:
❖ (1) define the random variables and the set of values
each can take on,
❖ (2) create one table row for each possible joint
assignment of values to the variables, and
❖ (3) assign a probability to each row, where the
probabilities across all rows must sum to 1.
❑ The joint probability distribution is central to
probabilistic inference because once we know the
joint distribution, we can answer every possible
probabilistic question that can be asked about these
variables.
❖ We can calculate conditional or joint probabilities
over any subset of variables, given their joint
distribution.
❑ The probability that any single variable will take on
any specific value.
❑ P(Gender = male) = 0.6685
❖ sum the four rows for which Gender = male.
❑ P(Wealth = rich) = 0.2393
❖ add together the probabilities for the four rows
covering the cases for which Wealth=rich.
❑ The probability that any subset of the variables will
take on a particular joint assignment.
❑ P(Wealth=rich ˄ Gender=female) = 0.0362
❖ sum the two table rows that satisfy this joint
assignment.
❑ Any conditional probability P(Y|X) = P(X ˄ Y)/P(X)
defined over subsets of the variables.
❑ P(Wealth=rich | Gender=female) = 0.0362/0.3315 =
0.1092.
❖ sum the appropriate rows to obtain the numerator
and the denominator, then divide.
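The calculations above can be automated. Below is a minimal Python sketch; the joint table here uses made-up illustrative probabilities (not the actual Table 1 values), chosen only to show the mechanics of summing rows.

# Minimal sketch: answering probability queries from a joint distribution table.
# NOTE: the probabilities below are illustrative placeholders, not Table 1.
joint = {
    # (gender, hours_worked, wealth): probability  -- must sum to 1
    ("female", "<40.5", "poor"): 0.25, ("female", "<40.5", "rich"): 0.03,
    ("female", ">40.5", "poor"): 0.04, ("female", ">40.5", "rich"): 0.01,
    ("male",   "<40.5", "poor"): 0.33, ("male",   "<40.5", "rich"): 0.10,
    ("male",   ">40.5", "poor"): 0.13, ("male",   ">40.5", "rich"): 0.11,
}

def prob(match):
    """Marginal/joint probability of all rows consistent with `match`,
    e.g. match = {"gender": "male"}."""
    idx = {"gender": 0, "hours_worked": 1, "wealth": 2}
    return sum(p for row, p in joint.items()
               if all(row[idx[k]] == v for k, v in match.items()))

def cond_prob(event, given):
    """P(event | given) = P(event AND given) / P(given)."""
    return prob({**event, **given}) / prob(given)

print(prob({"gender": "male"}))                             # P(Gender = male)
print(prob({"wealth": "rich", "gender": "female"}))         # P(rich AND female)
print(cond_prob({"wealth": "rich"}, {"gender": "female"}))  # P(rich | female)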
❑ If we know the joint probability distribution over an
arbitrary set of random variables {X1, ..., Xn}, we can
calculate the conditional and joint probability
distributions for arbitrary subsets of these variables
(e.g., P(Xn | X1, ..., Xn-1)).
❑ In theory, we can solve
❖ Any classification, regression or other function
approximation problem defined over these
variables, and furthermore, produce probabilistic
rather than deterministic predictions for any given
input to the target function.
Learning the Joint Distribution
❑ How can we learn joint distributions from observed
training data?
❑ In the example of Table 1, this is easy if we begin
with a large database containing, say, descriptions of
a million people in terms of their values for our three
variables.
❑ Given a large data set:
❖ one can easily estimate a probability for each row
in the table by calculating the fraction of database
entries (people) that satisfy the joint assignment
specified for that row.
❖ If thousands of database entries fall into each row,
we will obtain highly reliable probability estimates
using this strategy.
❑ However, it can be difficult to learn the joint
distribution due to the very large amount of training
data required.
❖ Consider how our learning problem would change
if we were to add additional variables to describe a
total of 100 Boolean features for each person in
Table 1 (e.g., we could add "do they have a
college degree?", "are they healthy?").
❖ Given 100 Boolean features, the number of rows
would now expand to 2^100, which is greater than
10^30.
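A one-line check of that count in Python:

print(2 ** 100)   # 1267650600228229401496703205376, i.e. about 1.27 x 10^30 rows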
❑ Unfortunately, even if our database describes every
single person on earth, we will not have enough data
to obtain reliable probability estimates for most rows.
❖ There are only approximately 10^10 people on
earth, which means that for most of the 2^100
rows in our table, we would have zero training
examples!
❑ This is a significant problem:
❖ real-world machine learning applications often use
many more than 100 features to describe each
example (e.g., text analysis may use millions of
features to describe the text in a given document).
❑ To successfully address the issue of learning
probabilities from available training data, we must:
❖ (1) be smart about how we estimate probability
parameters from available data, and
❖ (2) be smart about how we represent joint
probability distributions.
Estimating Probabilities
❑ Let us begin our discussion of how to estimate
probabilities with a simple example and explore two
intuitive algorithms.
❑ These two intuitive algorithms illustrate the two
primary approaches used in nearly all probabilistic
machine learning algorithms.
❑ Assume you have a coin, represented by the random
variable X. If you flip this coin, it may turn up heads
(X = 1) or tails (X = 0).
❑ The learning task is to estimate the probability that it
will turn up heads; i.e., to estimate P(X =1).
❑ We will use θ to refer to the true (but unknown)
probability of heads (i.e., P(X = 1) = θ), and use θ̂ to
refer to our learned estimate of this true θ.
❑ You gather training data by flipping the coin n times,
and observe that it turns up heads α1 times, and tails
α0 times.
n = α1 + α0
❑ What is the most intuitive approach to estimating θ =
P(X =1) from this training data?
❖ Estimate the probability by the fraction of flips that
result in heads:
θ̂ = α1 / (α1 + α0) = α1 / n
❑ This leads to our second intuitive algorithm:
❖ an algorithm that enables us to incorporate prior
assumptions along with observed training data to
produce our final estimate.
❑ Algorithm 2 allows us to express our prior
assumptions or knowledge about the coin by adding
in any number of imaginary coin flips resulting in
heads or tails.
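A minimal sketch of this idea, writing the imaginary flips as γ1 imaginary heads and γ0 imaginary tails (symbols ours): the estimate simply treats the imaginary flips as if they had actually been observed, and setting γ1 = γ0 = 0 recovers Algorithm 1.

def estimate_theta(alpha1, alpha0, gamma1=0, gamma0=0):
    """Estimate P(X = 1) for the coin.

    alpha1, alpha0: observed counts of heads and tails.
    gamma1, gamma0: imaginary (prior) heads and tails expressing prior belief;
    with gamma1 = gamma0 = 0 this is Algorithm 1 (MLE), otherwise it is
    Algorithm 2 (a smoothed, MAP-style estimate).
    """
    return (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)

print(estimate_theta(3, 2))           # MLE from 3 heads, 2 tails -> 0.6
print(estimate_theta(3, 2, 20, 30))   # same data plus 50 imaginary flips centered at 0.4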
❑ Asymptotically, as the volume of actual observed
data grows toward infinity:
❖ the influence of our imaginary data goes to zero
(the fixed number of imaginary coin flips becomes
insignificant compared to a sufficiently large
number of actual observations).
❖ In other words, Algorithm 2 behaves so that priors
have the strongest influence when observations
are scarce, and their influence gradually reduces
as observations become more plentiful.
❑ Both Algorithm 1 and Algorithm 2 are intuitively quite
compelling.
❖ These two algorithms exemplify the two most
widely used approaches to machine
learning of probabilistic models from training data.
❖ They follow two different underlying principles.
• Algorithm 1 follows a principle called
Maximum Likelihood Estimation (MLE), in
which we seek an estimate of θ that maximizes
the probability of the observed data.
• Algorithm 2 follows a different principle called
Maximum a Posteriori (MAP) estimation, in
which we seek the estimate of θ that is itself
most probable, given the observed data, plus
background assumptions about its value.
❑ Both principles have been widely used to derive and
to justify a vast range of machine learning algorithms,
from Bayesian networks, to linear regression, to
neural network learning.
❖ Our coin flip example represents just one of many such
learning problems.
❑ Here the learning task is to estimate the unknown
value of θ = P(X = 1) for a Boolean valued random
variable X, based on a sample of n values of X drawn
independently (e.g., n independent flips of a coin with
probability θ of heads).
❑ Data is generated by a random number generator.
❖ The true value of θ is 0.3, and the same sequence
of training examples is used in each plot.
❑ The experimental behavior of these two algorithms is
shown in the figure described below.
• The blue line shows the estimates produced by MLE while
the red line shows the estimates of MAP.
• Plots on the left correspond to the correct prior assumption
about the value of θ, plots on the right reflect the incorrect
prior assumption that θ is most probably 0.4.
• Plots in the top row reflect lower confidence in the prior
assumption, by including only 60 imaginary data points,
whereas bottom plots assume 120.
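A rough, self-contained Python sketch of how such an experiment could be reproduced. Assumptions (ours): true θ = 0.3, and a prior most probably at 0.4 expressed with 60 imaginary flips, split as γ1 = 24 heads and γ0 = 36 tails.

import random

random.seed(0)
theta_true = 0.3
gamma1, gamma0 = 24, 36   # 60 imaginary flips expressing a prior guess of 0.4

alpha1 = alpha0 = 0
for n in range(1, 501):
    if random.random() < theta_true:
        alpha1 += 1
    else:
        alpha0 += 1
    if n % 100 == 0:
        mle = alpha1 / (alpha1 + alpha0)
        map_est = (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)
        # Both estimates approach 0.3; the MAP estimate is pulled toward 0.4
        # early on, and the pull fades as the real flips accumulate.
        print(n, round(mle, 3), round(map_est, 3))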
Maximum Likelihood Estimation (MLE)
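For the coin example, with data D consisting of α1 heads and α0 tails, MLE chooses the θ that maximizes the probability of the observed data. A standard sketch of the derivation, in LaTeX notation:

P(\mathcal{D}\mid\theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0},
\qquad
\hat{\theta}_{MLE} = \arg\max_{\theta} P(\mathcal{D}\mid\theta)

\ln P(\mathcal{D}\mid\theta) = \alpha_1\ln\theta + \alpha_0\ln(1-\theta),
\qquad
\frac{\partial}{\partial\theta}\ln P(\mathcal{D}\mid\theta)
  = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0
\;\Rightarrow\;
\hat{\theta}_{MLE} = \frac{\alpha_1}{\alpha_1+\alpha_0}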
Maximum a Posteriori Probability Estimation (MAP)
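For the same coin example, MAP estimation maximizes the posterior P(θ | D) ∝ P(D | θ) P(θ). A standard sketch, assuming the prior is a Beta distribution whose parameters encode the γ1 imaginary heads and γ0 imaginary tails of Algorithm 2:

\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta\mid\mathcal{D})
                   = \arg\max_{\theta} P(\mathcal{D}\mid\theta)\,P(\theta)

P(\theta) = \mathrm{Beta}(\theta;\,\gamma_1+1,\,\gamma_0+1)
          \propto \theta^{\gamma_1}(1-\theta)^{\gamma_0}

P(\theta\mid\mathcal{D}) \propto \theta^{\alpha_1+\gamma_1}(1-\theta)^{\alpha_0+\gamma_0}
\;\Rightarrow\;
\hat{\theta}_{MAP} = \frac{\alpha_1+\gamma_1}{\alpha_1+\gamma_1+\alpha_0+\gamma_0}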
Reference: Chapter 4 of E. Alpaydin, “Introduction to Machine
Learning”, 3rd Edition, The MIT Press, 2014
Parametric Methods
❑ A statistic is any value that is calculated from a given
sample.
❑ In statistical inference, we make a decision using the
information provided by a sample.
❑ Our approach is parametric where we assume that
the sample is drawn from some distribution that
obeys a known model, for example, Gaussian.
❖ The advantage is that the model is defined up to a
small number of parameters — for example,
mean, variance — the sufficient statistics of the
Gaussian distribution.
❖ Once those parameters are estimated from the
sample, the whole distribution is known.
❑ Estimate the parameters of the distribution from the
given sample,
❑ Plug in these estimates to the assumed model, and
❑ Get an estimated distribution, which we then use to
make a decision.
❑ The method we use to estimate the parameters of a
distribution is maximum likelihood estimation.
❑ We start with density estimation, which is the general
case of estimating p(x).
❑ We use this for classification where the estimated
densities are the class densities, p(x|Ci), and priors,
P(Ci), to be able to calculate the posteriors, P(Ci|x),
and make our decision.
❑ In this chapter, x is one-dimensional and thus the
densities are univariate. We generalize to the
multivariate case in chapter 5.
Maximum Likelihood Estimation
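A sketch of the standard setup in the notation of the Alpaydin chapter: given an i.i.d. sample X = {x^t}, t = 1..N, drawn from a density p(x | θ), maximum likelihood picks the θ that makes the sample most likely, usually by maximizing the log likelihood:

l(\theta\mid\mathcal{X}) = p(\mathcal{X}\mid\theta) = \prod_{t=1}^{N} p(x^{t}\mid\theta)
\qquad
\mathcal{L}(\theta\mid\mathcal{X}) = \sum_{t=1}^{N} \log p(x^{t}\mid\theta)
\qquad
\hat{\theta} = \arg\max_{\theta}\,\mathcal{L}(\theta\mid\mathcal{X})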
Bernoulli Density
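For a Bernoulli variable, the standard density and its maximum likelihood estimate (a sketch in the same notation):

P(x) = p^{x}(1-p)^{1-x}, \quad x\in\{0,1\}
\qquad
\hat{p} = \frac{\sum_{t} x^{t}}{N}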
Gaussian (Normal) Density
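For a univariate Gaussian, the density and the usual maximum likelihood estimates of its parameters (a sketch):

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]
\qquad
m = \frac{\sum_{t} x^{t}}{N},
\qquad
s^{2} = \frac{\sum_{t}(x^{t}-m)^{2}}{N}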
Parametric Classification
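The core idea, sketched for univariate Gaussian class densities: apply Bayes' rule and assign x to the class with the largest posterior, equivalently the largest discriminant g_i(x):

g_i(x) = \log p(x\mid C_i) + \log \hat{P}(C_i)
       = -\frac{1}{2}\log(2\pi) - \log s_i
         - \frac{(x-m_i)^{2}}{2 s_i^{2}} + \log \hat{P}(C_i),
\qquad
\text{choose } C_i \text{ if } g_i(x) = \max_{k} g_k(x)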
An Example: Iris flower data set
❑ The Iris flower data set, or Fisher's Iris data set, is a
multivariate data set introduced by the British
statistician and biologist Ronald Fisher in 1936.
❑ The three species: Iris setosa, Iris versicolor, and
Iris virginica.
❑ Estimated class means and variances (maximum
likelihood estimates from the training sample):
❖ m1 = 4.825, s1^2 = 0.036875
❖ m2 = 6.45, s2^2 = 0.3525
❖ m3 = 6.425, s3^2 = 0.216875
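A minimal Python sketch of how these estimates would be used to classify a new measurement, assuming a single feature, Gaussian class densities, and equal priors (assumptions ours):

import math

# Class means and variances estimated above (one feature per class).
classes = {
    1: (4.825, 0.036875),
    2: (6.45,  0.3525),
    3: (6.425, 0.216875),
}

def discriminant(x, mean, var, prior=1/3):
    """g_i(x) = log p(x|C_i) + log P(C_i) for a univariate Gaussian class density."""
    return (-0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var)
            + math.log(prior))

def classify(x):
    return max(classes, key=lambda c: discriminant(x, *classes[c]))

print(classify(5.0))   # -> 1 (close to the first class mean, which has a small variance)
print(classify(6.5))   # -> 3 (classes 2 and 3 overlap heavily in this region)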
Evaluating an Estimator: Bias and Variance
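The standard quantities used to evaluate an estimator d = d(X) of a parameter θ (a sketch of the usual definitions and the mean squared error decomposition):

\mathrm{bias}_{\theta}(d) = E[d] - \theta,
\qquad
\mathrm{var}(d) = E\!\left[(d - E[d])^{2}\right]

r(d,\theta) = E\!\left[(d-\theta)^{2}\right]
            = \left(E[d]-\theta\right)^{2} + E\!\left[(d-E[d])^{2}\right]
            = \mathrm{bias}^{2} + \mathrm{variance}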
Reference: Chapter 5 of E. Alpaydin, “Introduction to Machine
Learning”, 3rd Edition, The MIT Press, 2014
Multivariate Methods
❑ We generalize our discussion of the parametric
approach to the multivariate case:
❖ multiple inputs, and an output that is either a class
variable or a continuous value (a function of these
multiple inputs).
❖ These inputs may be discrete or numeric.
❑ We will see how such functions can be learned from
a labeled multivariate sample and also how the
complexity of the learner can be fine-tuned to the
data at hand.
Multivariate Data
❑ In a multivariate setting, the sample may be viewed
as a data matrix
❑ where the d columns correspond to d variables.
❑ These are also called inputs, features, or attributes.
❑ The N rows correspond to independent and identically
distributed observations, examples, or instances.
❑ For example, in deciding on a loan application, an
observation vector is the information associated with
a customer
❖ age, marital status, yearly income, and so forth,
and we have N such past customers.
❑ Typically these variables are correlated.
❑ If they are not, there is no need for a multivariate
analysis.
❑ Our aim may be simplification, that is, summarizing
this large body of data by means of relatively few
parameters.
❑ Or our aim may be exploratory, and we may be
interested in generating hypotheses about data.
Parameter Estimation
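The parameters of a multivariate distribution and their usual sample estimators (a sketch in standard notation): the mean vector and covariance matrix, estimated by the sample mean m and sample covariance S:

\boldsymbol{\mu} = E[\mathbf{x}],
\qquad
\boldsymbol{\Sigma} = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}\right]

\mathbf{m} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{x}^{t},
\qquad
\mathbf{S} = \frac{1}{N}\sum_{t=1}^{N}(\mathbf{x}^{t}-\mathbf{m})(\mathbf{x}^{t}-\mathbf{m})^{T}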
Multivariate Normal Distribution
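The d-dimensional multivariate normal density, x ~ N_d(μ, Σ) (standard form; the quadratic term in the exponent is the squared Mahalanobis distance from x to μ):

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
  \exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]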
❑ When the variables are
independent, the major axes of the
density are parallel to the input
axes.
❑ The density becomes an ellipse if
the variances are different.
❑ The density rotates depending on
the sign of the covariance
(correlation).
Multivariate Classification
❑ An important special case is when the class-conditional
densities p(x|Ci) are assumed to be multivariate normal.
❑ The main reason for this is its analytical simplicity.
❑ While real data may not often be exactly
multivariate normal, it is a useful approximation.
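With multivariate normal class densities, the discriminant used for classification takes the following standard form (a sketch; m_i and S_i are the sample mean vector and covariance matrix estimated from the examples of class C_i, and x is assigned to the class with the largest g_i(x)):

g_i(\mathbf{x}) = \log p(\mathbf{x}\mid C_i) + \log \hat{P}(C_i)
 = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{S}_i|
   - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^{T}\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i)
   + \log \hat{P}(C_i)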
Discrete Features
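With binary features x_j ∈ {0, 1}, a standard simplification is to assume the features are independent within each class (the naive Bayes assumption). A sketch of the resulting class density and its estimates, where r_i^t = 1 if example t belongs to class C_i and 0 otherwise:

p(\mathbf{x}\mid C_i) = \prod_{j=1}^{d} p_{ij}^{\,x_j}\,(1-p_{ij})^{\,1-x_j},
\qquad p_{ij} \equiv P(x_j = 1 \mid C_i)

\hat{p}_{ij} = \frac{\sum_{t} x_j^{t}\, r_i^{t}}{\sum_{t} r_i^{t}}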
❑ Read: 6.10 An Example: Learning To Classify Text --
Tom Mitchell’s book
❑ http://www.cs.cmu.edu/afs/cs/project/theo-
11/www/naive-bayes.html