Probabilistic Methods
Lectures 13, 14, 15, 16: Probabilistic Classifiers
EE 439 Introduction to Machine Learning
Reference: Draft of Chapter 2 of Tom Mitchell, “Machine Learning”, 2nd Edition, 2018
www.cs.cmu.edu/~tom/mlbook.html
Estimating Probabilities: Intro
❑ Many machine learning methods depend on
probabilistic approaches.
❑ The reason: when we are interested in learning some
target function f : X → Y, we can more generally learn
the probabilistic function P(Y|X).
❑ We can design algorithms that learn functions with
uncertain outcomes (e.g., predicting tomorrow’s stock
price) and that incorporate prior knowledge to guide
learning (e.g., a bias that tomorrow’s stock price is
likely to be similar to today’s price).
❑ We will discuss joint probability distributions over
many variables and show how they can be used to
calculate a target P(Y|X).
❑ We will study the problem of learning, or estimating,
probability distributions from training data.
❖ The two most common approaches:
• maximum likelihood estimation (MLE) and
• maximum a posteriori estimation (MAP).
Joint Probability Distributions
❑ The key to building probabilistic models is
❖ to define a set of random variables, and
❖ to consider the joint probability distribution over
them.
❑ Table 1 defines a joint probability distribution over
three random variables:
❖ a person’s Gender, the number of HoursWorked each week,
and their Wealth.
❑ In general, defining a joint probability distribution over
a set of discrete-valued variables involves three
simple steps:
❖ (1) define the random variables and the set of values
each can take on,
❖ (2) create one table row for each possible joint
assignment of values to the variables, and
❖ (3) assign a probability to each row, where the
probabilities across all rows must sum to 1.
❑ The joint probability distribution is central to
probabilistic inference because once we know the
joint distribution, we can answer every possible
probabilistic question that can be asked about these
variables.
❖ We can calculate conditional or joint probabilities
over any subset of variables, given their joint
distribution.
❑ The probability that any single variable will take on
any specific value.
❑ P(Gender = male) = 0.6685
❖ sum the four rows for which Gender = male.
❑ P(Wealth = rich) = 0.2393
❖ add together the probabilities for the four rows
covering the cases for which Wealth=rich.
❑ The probability that any subset of the variables will
take on a particular joint assignment.
❑ P(Wealth=rich ˄ Gender=female) = 0.0362
❖ sum the two table rows that satisfy this joint
assignment.
❑ Any conditional probability P(Y|X) = P(X ˄ Y)/P(X)
defined over subsets of the variables.
❑ P(Wealth=rich | Gender=female) = 0.0362/0.3315 =
0.1092.
❖ sum the appropriate rows to obtain the numerator
and the denominator, then divide.
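The calculations above can be automated. Below is a minimal Python sketch; the joint table here uses made-up illustrative probabilities (not the actual Table 1 values), chosen only to show the mechanics of summing rows.

# Minimal sketch: answering probability queries from a joint distribution table.
# NOTE: the probabilities below are illustrative placeholders, not Table 1.
joint = {
    # (gender, hours_worked, wealth): probability  -- must sum to 1
    ("female", "<40.5", "poor"): 0.25, ("female", "<40.5", "rich"): 0.03,
    ("female", ">40.5", "poor"): 0.04, ("female", ">40.5", "rich"): 0.01,
    ("male",   "<40.5", "poor"): 0.33, ("male",   "<40.5", "rich"): 0.10,
    ("male",   ">40.5", "poor"): 0.13, ("male",   ">40.5", "rich"): 0.11,
}

def prob(match):
    """Marginal/joint probability of all rows consistent with `match`,
    e.g. match = {"gender": "male"}."""
    idx = {"gender": 0, "hours_worked": 1, "wealth": 2}
    return sum(p for row, p in joint.items()
               if all(row[idx[k]] == v for k, v in match.items()))

def cond_prob(event, given):
    """P(event | given) = P(event AND given) / P(given)."""
    return prob({**event, **given}) / prob(given)

print(prob({"gender": "male"}))                             # P(Gender = male)
print(prob({"wealth": "rich", "gender": "female"}))         # P(rich AND female)
print(cond_prob({"wealth": "rich"}, {"gender": "female"}))  # P(rich | female)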
❑ If we know the joint probability distribution over an
arbitrary set of random variables {X1, ..., Xn}, we can
calculate the conditional and joint probability
distributions for arbitrary subsets of these variables
(e.g., P(Xn | X1, ..., Xn-1)).
❑ In theory, we can solve
❖ Any classification, regression or other function
approximation problem defined over these
variables, and furthermore, produce probabilistic
rather than deterministic predictions for any given
input to the target function.
Learning the Joint Distribution
❑ How can we learn joint distributions from observed
training data?
❑ In the example of Table 1, this is easy if we begin
with a large database containing, say, descriptions of
a million people in terms of their values for our three
variables.
❑ Given a large data set:
❖ one can easily estimate a probability for each row
in the table by calculating the fraction of database
entries (people) that satisfy the joint assignment
specified for that row.
❖ If thousands of database entries fall into each row,
we will obtain highly reliable probability estimates
using this strategy.
❑ However, it can be difficult to learn the joint
distribution due to the very large amount of training
data required.
❖ Consider how our learning problem would change
if we were to add additional variables to describe a
total of 100 Boolean features for each person in
Table 1 (e.g., we could add "do they have a
college degree?", "are they healthy?").
❖ Given 100 Boolean features, the number of rows
would now expand to 2^100, which is greater than
10^30.
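A one-line check of that count in Python:

print(2 ** 100)   # 1267650600228229401496703205376, i.e. about 1.27 x 10^30 rows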
❑ Unfortunately, even if our database describes every
single person on earth, we will not have enough data
to obtain reliable probability estimates for most rows.
❖ There are only approximately 10^10 people on
earth, which means that for most of the 2^100
rows in our table, we would have zero training
examples!
❑ This is a significant problem:
❖ real-world machine learning applications often use
many more than 100 features to describe each
example (e.g., text analysis may use millions of
features to describe the text in a given document).
❑ To successfully address the issue of learning
probabilities from available training data, we must:
❖ (1) be smart about how we estimate probability
parameters from available data, and
❖ (2) be smart about how we represent joint
probability distributions.
Estimating Probabilities
❑ Let us begin our discussion of how to estimate
probabilities with a simple example and explore two
intuitive algorithms.
❑ These two intuitive algorithms illustrate the two
primary approaches used in nearly all probabilistic
machine learning algorithms.
❑ Assume you have a coin, represented by the random
variable X. If you flip this coin, it may turn up heads
(X = 1) or tails (X = 0).
❑ The learning task is to estimate the probability that it
will turn up heads; i.e., to estimate P(X =1).
❑ We will use θ to refer to the true (but unknown)
probability of heads (i.e., P(X = 1) = θ), and use θ̂ to
refer to our learned estimate of this true θ.
❑ You gather training data by flipping the coin n times,
and observe that it turns up heads α1 times, and tails
α0 times.
n = α1 + α0
❑ What is the most intuitive approach to estimating θ =
P(X =1) from this training data?
❖ Estimate the probability by the fraction of flips that
result in heads:
θ̂ = α1 / (α1 + α0) = α1 / n
❑ This leads to our second intuitive algorithm:
❖ an algorithm that enables us to incorporate prior
assumptions along with observed training data to
produce our final estimate.
❑ Algorithm 2 allows us to express our prior
assumptions or knowledge about the coin by adding
in any number of imaginary coin flips resulting in
heads or tails.
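A minimal sketch of this idea, writing the imaginary flips as γ1 imaginary heads and γ0 imaginary tails (symbols ours): the estimate simply treats the imaginary flips as if they had actually been observed, and setting γ1 = γ0 = 0 recovers Algorithm 1.

def estimate_theta(alpha1, alpha0, gamma1=0, gamma0=0):
    """Estimate P(X = 1) for the coin.

    alpha1, alpha0: observed counts of heads and tails.
    gamma1, gamma0: imaginary (prior) heads and tails expressing prior belief;
    with gamma1 = gamma0 = 0 this is Algorithm 1 (MLE), otherwise it is
    Algorithm 2 (a smoothed, MAP-style estimate).
    """
    return (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)

print(estimate_theta(3, 2))           # MLE from 3 heads, 2 tails -> 0.6
print(estimate_theta(3, 2, 20, 30))   # same data plus 50 imaginary flips centered at 0.4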
❑ Asymptotically, as the volume of actual observed
data grows toward infinity:
❖ the influence of our imaginary data goes to zero
(the fixed number of imaginary coin flips becomes
insignificant compared to a sufficiently large
number of actual observations).
❖ In other words, Algorithm 2 behaves so that priors
have the strongest influence when observations
are scarce, and their influence gradually reduces
as observations become more plentiful.
❑ Both Algorithm 1 and Algorithm 2 are intuitively quite
compelling.
❖ These two algorithms exemplify the two most
widely used approaches to machine
learning of probabilistic models from training data.
❖ They follow two different underlying principles.
• Algorithm 1 follows a principle called
Maximum Likelihood Estimation (MLE), in
which we seek an estimate of θ that maximizes
the probability of the observed data.
• Algorithm 2 follows a different principle called
Maximum a Posteriori (MAP) estimation, in
which we seek the estimate of θ that is itself
most probable, given the observed data, plus
background assumptions about its value.
❑ Both principles have been widely used to derive and
to justify a vast range of machine learning algorithms,
from Bayesian networks, to linear regression, to
neural network learning.
❖ Our coin flip example represents just one of many such
learning problems.
❑ Here the learning task is to estimate the unknown
value of θ = P(X = 1) for a Boolean valued random
variable X, based on a sample of n values of X drawn
independently (e.g., n independent flips of a coin with
probability θ of heads).
❑ Data is generated by a random number generator.
❖ The true value of θ is 0.3, and the same sequence
of training examples is used in each plot.
❑ The experimental behavior of these two algorithms is
shown in the figure described below.
• The blue line shows the estimates produced by MLE while
the red line shows the estimates of MAP.
• Plots on the left correspond to the correct prior assumption
about the value of θ, plots on the right reflect the incorrect
prior assumption that θ is most probably 0.4.
• Plots in the top row reflect lower confidence in the prior
assumption, by including only 60 imaginary data points,
whereas bottom plots assume 120.
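A rough, self-contained Python sketch of how such an experiment could be reproduced. Assumptions (ours): true θ = 0.3, and a prior most probably at 0.4 expressed with 60 imaginary flips, split as γ1 = 24 heads and γ0 = 36 tails.

import random

random.seed(0)
theta_true = 0.3
gamma1, gamma0 = 24, 36   # 60 imaginary flips expressing a prior guess of 0.4

alpha1 = alpha0 = 0
for n in range(1, 501):
    if random.random() < theta_true:
        alpha1 += 1
    else:
        alpha0 += 1
    if n % 100 == 0:
        mle = alpha1 / (alpha1 + alpha0)
        map_est = (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)
        # Both estimates approach 0.3; the MAP estimate is pulled toward 0.4
        # early on, and the pull fades as the real flips accumulate.
        print(n, round(mle, 3), round(map_est, 3))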
Maximum Likelihood Estimation (MLE)
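For the coin example, with data D consisting of α1 heads and α0 tails, MLE chooses the θ that maximizes the probability of the observed data. A standard sketch of the derivation, in LaTeX notation:

P(\mathcal{D}\mid\theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0},
\qquad
\hat{\theta}_{MLE} = \arg\max_{\theta} P(\mathcal{D}\mid\theta)

\ln P(\mathcal{D}\mid\theta) = \alpha_1\ln\theta + \alpha_0\ln(1-\theta),
\qquad
\frac{\partial}{\partial\theta}\ln P(\mathcal{D}\mid\theta)
  = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0
\;\Rightarrow\;
\hat{\theta}_{MLE} = \frac{\alpha_1}{\alpha_1+\alpha_0}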
Maximum a Posteriori Probability Estimation (MAP)
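For the same coin example, MAP estimation maximizes the posterior P(θ | D) ∝ P(D | θ) P(θ). A standard sketch, assuming the prior is a Beta distribution whose parameters encode the γ1 imaginary heads and γ0 imaginary tails of Algorithm 2:

\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta\mid\mathcal{D})
                   = \arg\max_{\theta} P(\mathcal{D}\mid\theta)\,P(\theta)

P(\theta) = \mathrm{Beta}(\theta;\,\gamma_1+1,\,\gamma_0+1)
          \propto \theta^{\gamma_1}(1-\theta)^{\gamma_0}

P(\theta\mid\mathcal{D}) \propto \theta^{\alpha_1+\gamma_1}(1-\theta)^{\alpha_0+\gamma_0}
\;\Rightarrow\;
\hat{\theta}_{MAP} = \frac{\alpha_1+\gamma_1}{\alpha_1+\gamma_1+\alpha_0+\gamma_0}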
Reference: Chapter 4 of E. Alpaydin, “Introduction to Machine
Learning”, 3rd Edition, The MIT Press, 2014
Parametric Methods
❑ A statistic is any value that is calculated from a given
sample.
❑ In statistical inference, we make a decision using the
information provided by a sample.
❑ Our approach is parametric where we assume that
the sample is drawn from some distribution that
obeys a known model, for example, Gaussian.
❖ The advantage is that the model is defined up to a
small number of parameters — for example,
mean, variance — the sufficient statistics of the
Gaussian distribution.
❖ Once those parameters are estimated from the
sample, the whole distribution is known.
❑ Estimate the parameters of the distribution from the
given sample,
❑ Plug in these estimates to the assumed model, and
❑ Get an estimated distribution, which we then use to
make a decision.
❑ The method we use to estimate the parameters of a
distribution is maximum likelihood estimation.
❑ We start with density estimation, which is the general
case of estimating p(x).
❑ We use this for classification where the estimated
densities are the class densities, p(x|Ci), and priors,
P(Ci), to be able to calculate the posteriors, P(Ci|x),
and make our decision.
❑ In this chapter, x is one-dimensional and thus the
densities are univariate. We generalize to the
multivariate case in chapter 5.
Maximum Likelihood Estimation
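A sketch of the standard setup in the notation of the Alpaydin chapter: given an i.i.d. sample X = {x^t}, t = 1..N, drawn from a density p(x | θ), maximum likelihood picks the θ that makes the sample most likely, usually by maximizing the log likelihood:

l(\theta\mid\mathcal{X}) = p(\mathcal{X}\mid\theta) = \prod_{t=1}^{N} p(x^{t}\mid\theta)
\qquad
\mathcal{L}(\theta\mid\mathcal{X}) = \sum_{t=1}^{N} \log p(x^{t}\mid\theta)
\qquad
\hat{\theta} = \arg\max_{\theta}\,\mathcal{L}(\theta\mid\mathcal{X})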
Bernoulli Density
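For a Bernoulli variable, the standard density and its maximum likelihood estimate (a sketch in the same notation):

P(x) = p^{x}(1-p)^{1-x}, \quad x\in\{0,1\}
\qquad
\hat{p} = \frac{\sum_{t} x^{t}}{N}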
Gaussian (Normal) Density
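For a univariate Gaussian, the density and the usual maximum likelihood estimates of its parameters (a sketch):

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]
\qquad
m = \frac{\sum_{t} x^{t}}{N},
\qquad
s^{2} = \frac{\sum_{t}(x^{t}-m)^{2}}{N}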
Parametric Classification
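The core idea, sketched for univariate Gaussian class densities: apply Bayes' rule and assign x to the class with the largest posterior, equivalently the largest discriminant g_i(x):

g_i(x) = \log p(x\mid C_i) + \log \hat{P}(C_i)
       = -\frac{1}{2}\log(2\pi) - \log s_i
         - \frac{(x-m_i)^{2}}{2 s_i^{2}} + \log \hat{P}(C_i),
\qquad
\text{choose } C_i \text{ if } g_i(x) = \max_{k} g_k(x)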
An Example: Iris flower data set
❑ The Iris flower data set, or Fisher's Iris data set, is a
multivariate data set introduced by the British
statistician and biologist Ronald Fisher in 1936.
❑ The three species: Iris setosa, Iris versicolor, and
Iris virginica.
❑ Estimated class means and variances (maximum
likelihood estimates from the training sample):
❖ m1 = 4.825, s1^2 = 0.036875
❖ m2 = 6.45, s2^2 = 0.3525
❖ m3 = 6.425, s3^2 = 0.216875
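A minimal Python sketch of how these estimates would be used to classify a new measurement, assuming a single feature, Gaussian class densities, and equal priors (assumptions ours):

import math

# Class means and variances estimated above (one feature per class).
classes = {
    1: (4.825, 0.036875),
    2: (6.45,  0.3525),
    3: (6.425, 0.216875),
}

def discriminant(x, mean, var, prior=1/3):
    """g_i(x) = log p(x|C_i) + log P(C_i) for a univariate Gaussian class density."""
    return (-0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var)
            + math.log(prior))

def classify(x):
    return max(classes, key=lambda c: discriminant(x, *classes[c]))

print(classify(5.0))   # -> 1 (close to the first class mean, which has a small variance)
print(classify(6.5))   # -> 3 (classes 2 and 3 overlap heavily in this region)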
Evaluating an Estimator: Bias and Variance
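The standard quantities used to evaluate an estimator d = d(X) of a parameter θ (a sketch of the usual definitions and the mean squared error decomposition):

\mathrm{bias}_{\theta}(d) = E[d] - \theta,
\qquad
\mathrm{var}(d) = E\!\left[(d - E[d])^{2}\right]

r(d,\theta) = E\!\left[(d-\theta)^{2}\right]
            = \left(E[d]-\theta\right)^{2} + E\!\left[(d-E[d])^{2}\right]
            = \mathrm{bias}^{2} + \mathrm{variance}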
Reference: Chapter 5 of E. Alpaydin, “Introduction to Machine
Learning”, 3rd Edition, The MIT Press, 2014
Multivariate Methods
❑ We generalize our discussion of the parametric
approach to the multivariate case:
❖ multiple inputs, and an output that is either a class
variable or a continuous value (a function of these
multiple inputs).
❖ These inputs may be discrete or numeric.
❑ We will see how such functions can be learned from
a labeled multivariate sample and also how the
complexity of the learner can be fine-tuned to the
data at hand.
Multivariate Data
❑ In a multivariate setting, the sample may be viewed
as a data matrix
❑ where the d columns correspond to d variables.
❑ These are also called inputs, features, or attributes.
❑ The N rows correspond to independent and identically
distributed observations, examples, or instances.
❑ For example, in deciding on a loan application, an
observation vector is the information associated with
a customer
❖ age, marital status, yearly income, and so forth,
and we have N such past customers.
❑ Typically these variables are correlated.
❑ If they are not, there is no need for a multivariate
analysis.
❑ Our aim may be simplification, that is, summarizing
this large body of data by means of relatively few
parameters.
❑ Or our aim may be exploratory, and we may be
interested in generating hypotheses about data.
Parameter Estimation
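The parameters of a multivariate distribution and their usual sample estimators (a sketch in standard notation): the mean vector and covariance matrix, estimated by the sample mean m and sample covariance S:

\boldsymbol{\mu} = E[\mathbf{x}],
\qquad
\boldsymbol{\Sigma} = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}\right]

\mathbf{m} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{x}^{t},
\qquad
\mathbf{S} = \frac{1}{N}\sum_{t=1}^{N}(\mathbf{x}^{t}-\mathbf{m})(\mathbf{x}^{t}-\mathbf{m})^{T}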
Multivariate Normal Distribution
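The d-dimensional multivariate normal density, x ~ N_d(μ, Σ) (standard form; the quadratic term in the exponent is the squared Mahalanobis distance from x to μ):

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
  \exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]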
❑ When the variables are
independent, the major axes of the
density are parallel to the input
axes.
❑ The density becomes an ellipse if
the variances are different.
❑ The density rotates depending on
the sign of the covariance
(correlation).
Multivariate Classification
❑ An important special case is when the class-conditional
densities p(x|Ci) are assumed to be multivariate normal.
❑ The main reason for this is its analytical simplicity.
❑ While real data may not often be exactly
multivariate normal, it is a useful approximation.
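With multivariate normal class densities, the discriminant used for classification takes the following standard form (a sketch; m_i and S_i are the sample mean vector and covariance matrix estimated from the examples of class C_i, and x is assigned to the class with the largest g_i(x)):

g_i(\mathbf{x}) = \log p(\mathbf{x}\mid C_i) + \log \hat{P}(C_i)
 = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{S}_i|
   - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^{T}\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i)
   + \log \hat{P}(C_i)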
Discrete Features
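With binary features x_j ∈ {0, 1}, a standard simplification is to assume the features are independent within each class (the naive Bayes assumption). A sketch of the resulting class density and its estimates, where r_i^t = 1 if example t belongs to class C_i and 0 otherwise:

p(\mathbf{x}\mid C_i) = \prod_{j=1}^{d} p_{ij}^{\,x_j}\,(1-p_{ij})^{\,1-x_j},
\qquad p_{ij} \equiv P(x_j = 1 \mid C_i)

\hat{p}_{ij} = \frac{\sum_{t} x_j^{t}\, r_i^{t}}{\sum_{t} r_i^{t}}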
❑ Read: 6.10 An Example: Learning To Classify Text --
Tom Mitchell’s book
❑ http://www.cs.cmu.edu/afs/cs/project/theo-
11/www/naive-bayes.html