Course Logistics and Introduction to Probabilistic Machine Learning
CS772A: Probabilistic Machine Learning
Piyush Rai
Course Logistics
▪ Course Name: Probabilistic Machine Learning – CS772A
▪ 2 classes each week
  ▪ Mon/Thur 18:00-19:30
  ▪ Venue: KD-101
▪ All material (readings etc.) will be posted on the course webpage (internal access)
  ▪ URL: https://web.cse.iitk.ac.in/users/piyush/courses/pml_autumn22/pml.html
▪ Q/A and announcements on Piazza. Please sign up
  ▪ URL: https://piazza.com/iitk.ac.in/firstsemester2022/cs772a
▪ If you need to contact me by email (piyush@cse.iitk.ac.in), prefix the subject line with "CS772"
▪ Auditors are welcome
Workload and Grading Policy
▪ 4 homeworks (theory + programming): 40%
  ▪ Must be typeset in LaTeX. To be uploaded on Gradescope (login details will be shared)
  ▪ Tentative dates for release of homeworks: Aug 11, Aug 29, Sept 26, Oct 20
  ▪ Each homework due roughly in 2 weeks (excluding holidays, exam weeks)
▪ 2 quizzes: 10% (tentative dates: Aug 27, Oct 22)
▪ Mid-sem exam: 20% (date as per DOAA schedule)
▪ End-sem exam: 30% (date as per DOAA schedule)
▪ (Optional) Research project (to be done in groups of 2): 20%
  ▪ If opted for, you will be exempted from appearing in the mid-sem exam
  ▪ We may suggest some topics, but the project has to be driven by you
  ▪ We will not be able to provide compute resources
  ▪ Advisable only if you have strong prior experience of working on ML research
Textbooks and Readings
▪ Textbook: No official textbook required
▪ Some books that you may use as references (freely available PDFs):
  ▪ Kevin P. Murphy, Probabilistic Machine Learning: An Introduction (PML-1), The MIT Press, 2022.
  ▪ Kevin P. Murphy, Probabilistic Machine Learning: Advanced Topics (PML-2), The MIT Press, 2022.
  ▪ Christopher M. Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2007.
▪ All these books have accompanying Python code. You are encouraged to explore it (we will also go through some)
Course Policies
▪ Policy on homeworks
  ▪ Homework solutions must be in your own words
  ▪ Must cite sources you have referred to or used in preparing your HW solutions
  ▪ No requests for deadline extension will be entertained. Plan ahead of time
  ▪ Late submissions allowed up to 72 hours with a 10% penalty per 24-hour delay
  ▪ Every student is entitled to ONE late homework submission without penalty (use it wisely)
▪ Policy on collaboration/cheating
  ▪ Punishable as per the institute's/department's rules
  ▪ Plagiarism from other sources will also lead to strict punishment
  ▪ Both copying as well as helping someone copy will be equally punishable
Course Team
▪ Instructor: Piyush Rai (piyush@cse.iitk.ac.in)
▪ TAs: Soumya Banerjee (soumyab@cse.iitk.ac.in), Avideep Mukherjee (avideep@cse.iitk.ac.in), Abhishek Jaiswal (abhijais@cse.iitk.ac.in), Abhinav Joshi (ajoshi@cse.iitk.ac.in)
Course Goals
▪ Introduce you to the foundations of probabilistic machine learning (PML)
▪ Focus will be on learning core, general principles of PML
  ▪ How to set up a probabilistic model for a given ML problem
  ▪ How to quantify uncertainty in parameter estimates and predictions
  ▪ Estimation/inference algorithms to learn the various unknowns in a model
  ▪ How to think about trade-offs (computational efficiency vs accuracy)
▪ Will also look at the above using specific examples of ML models
▪ Throughout the course, the focus will also be on contemporary/latest advances
Image source: Probabilistic Machine Learning – Advanced Topics (Murphy, 2022)
Why Probabilistic ML?
Probabilistic ML because… Uncertainty Matters!
▪ ML models ingest data and give us predictions, trends in the data, etc.
▪ The standard approach: minimize a loss function to find the optimal parameters, then use them to predict:
  $\hat{\theta} = \arg\min_{\theta} \ell(\mathcal{D}, \theta)$
  ▪ This gives a single answer, with no estimate of the uncertainty about the true $\theta$. It may be unreliable and can overfit, especially with limited training data
▪ In many applications (healthcare, autonomous driving, etc.), we also care about the parameter/predictive uncertainty
  ▪ Regression: desirable is an approach that reports large predictive uncertainty (variance) in regions where there is little/no training data
  ▪ Binary classification: desirable is an approach that reports close to uniform class probability (0.5 in this case) for inputs far away from the training data or for out-of-distribution inputs
Image source: Probabilistic Machine Learning – Advanced Topics (Murphy, 2022)
Types of Uncertainty: Model Uncertainty
▪ Model uncertainty is due to not having enough training data
▪ Also called epistemic uncertainty. Usually reducible
  ▪ Vanishes with "sufficient" training data
▪ It may concern the same model class (e.g., linear models) but with uncertainty about the weights, or not just the weights but also the model class itself (e.g., 3 different model classes with linear, polynomial, and circular decision boundaries; each model class itself will have uncertainty about its weights since there isn't enough training data)
▪ A probabilistic way to express model uncertainty: the distribution of the model parameters conditioned on the training data, $p(\theta|\mathcal{D})$, also known as the "posterior distribution" (e.g., a posterior over a two-dim parameter vector $\theta$)
  ▪ Hard to compute in general. Will study several methods to do it
Image credit: Balaji L, Dustin T, Jasper N. (NeurIPS 2020 tutorial)
Model Uncertainty via Posterior: An Illustration
▪ Consider a linear classification model for 2-dim inputs
▪ The classifier weight will be a 2-dim vector $\theta = [\theta_1, \theta_2]$
▪ Its posterior will be some 2-dim distribution $p(\theta|\mathcal{D})$
▪ Sampling from this distribution will generate 2-dim vectors
▪ Each such vector will correspond to a linear separator
▪ Thus the posterior in this case is equivalent to a distribution over linear separators
  ▪ Not every separator is equally important: the importance of separator $\theta^{(i)}$ is $p(\theta^{(i)}|\mathcal{D})$ (a small code sketch follows below)
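To make this concrete, here is a minimal Python sketch (not from the slides; the posterior parameters are made up for illustration) of treating the posterior as a distribution over linear separators: we draw samples from an assumed 2-dim Gaussian posterior and average the per-separator class probabilities for a test input.

```python
# A minimal sketch: "posterior = distribution over separators". We assume a
# hypothetical 2-dim Gaussian posterior p(theta|D); in practice it would come
# from Bayesian inference on real data.
import numpy as np

rng = np.random.default_rng(0)
post_mean = np.array([2.0, -1.0])          # assumed posterior mean (illustrative)
post_cov = np.array([[0.5, 0.2],
                     [0.2, 0.3]])          # assumed posterior covariance

S = 1000
thetas = rng.multivariate_normal(post_mean, post_cov, size=S)  # S separators

# Each sampled theta defines a linear separator theta . x = 0. Averaging the
# per-sample class probabilities weights separators by how often the posterior
# produces them, i.e., by p(theta|D).
x_test = np.array([0.5, 1.5])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
p_class1 = sigmoid(thetas @ x_test).mean()
print(f"Posterior-averaged P(y=1|x) = {p_class1:.3f}")
```

Averaging the per-sample probabilities is exactly the Monte-Carlo form of the marginalization introduced on the next slides.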
Types of Uncertainty: Data Uncertainty
▪ Data uncertainty can be due to various reasons, e.g.,
  ▪ Intrinsic hardness of labeling, class overlap
  ▪ Labeling errors/disagreements (for difficult training inputs)
  ▪ Noisy or missing features
▪ Also called aleatoric uncertainty. Usually irreducible
  ▪ Won't vanish even with infinite training data
  ▪ Note: can sometimes vanish by adding more features or switching to a more complex model
▪ A probabilistic way to express data uncertainty: $p(y|\theta, \boldsymbol{x})$, usually specified by the distribution of the data being modeled, conditioned on the model parameters and other inputs
Image source: "Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods" (H&W 2021)
Image source: "Improving machine classification using human uncertainty measurements" (Battleday et al, 2021)
Image credit: Eric Nalisnick
A Key Principle of PML: Marginalization
▪ Consider training data $\mathcal{D} = (\mathbf{X}, \mathbf{y})$
▪ Consider a model $m$ (e.g., a deep neural net with softmax outputs) of the form $p(y|\boldsymbol{x}, \theta, m)$
▪ Let $\hat{\theta}$ denote the optimal weights of model $m$, obtained by minimizing a loss function on $\mathcal{D}$
▪ The standard prediction for a new test input $\boldsymbol{x}_*$ is $p(y_*|\boldsymbol{x}_*, \hat{\theta}, m)$
  ▪ For multi-class classification, this quantity will be a vector of predicted class probabilities, but the uncertainty about $\theta$ is ignored
▪ A more robust prediction can be obtained via marginalization:
  $p(y_*|\boldsymbol{x}_*, \mathcal{D}, m) = \int p(y_*|\boldsymbol{x}_*, \theta, m)\, p(\theta|\mathcal{D}, m)\, d\theta$
  ▪ The first term in the integrand denotes the data/aleatoric uncertainty; the second term is the posterior distribution of the weights and denotes the model-weight/epistemic uncertainty
  ▪ This is prediction via marginalizing the predictions made by each value of $\theta$ over its posterior distribution: predictions by different $\theta$'s are weighted not equally but based on how likely each $\theta$ is given the data $\mathcal{D}$, i.e., $p(\theta|\mathcal{D}, m)$
  ▪ For multi-class classification, this too will be a vector of predicted class probabilities, but its computation incorporates the uncertainty in $\theta$
  ▪ Will derive the expression shortly – just a simple application of the product and chain rules of probability
  ▪ Note: $m$ is just a model identifier; it can be ignored when writing the expressions
More on Marginalization
▪ Marginalization over all weights of a single model $m$:
  $p(y_*|\boldsymbol{x}_*, \mathcal{D}, m) = \int p(y_*|\boldsymbol{x}_*, \theta, m)\, p(\theta|\mathcal{D}, m)\, d\theta$
▪ This marginalization (integral) is usually intractable and needs to be approximated, e.g., by replacing the integral with a "Monte-Carlo averaging" (will look at these ideas in more depth later; a code sketch follows below):
  $p(y_*|\boldsymbol{x}_*, \mathcal{D}, m) \approx \frac{1}{S} \sum_{i=1}^{S} p(y_*|\boldsymbol{x}_*, \theta^{(i)}, m)$
  ▪ Each $\theta^{(i)}$ is drawn i.i.d. from the distribution $p(\theta|\mathcal{D}, m)$ (haven't yet told you how to compute this quantity, but will see shortly)
▪ Marginalization can be done even over several choices of models, i.e., over all finite choices $m = 1, 2, \ldots, M$ of the model (for example, deep nets with different architectures):
  $p(y_*|\boldsymbol{x}_*, \mathcal{D}) = \sum_{m=1}^{M} p(y_*|\boldsymbol{x}_*, \mathcal{D}, m)\, p(m|\mathcal{D})$
  ▪ Like a double averaging (over all model choices, and over all weights of each model choice)
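A minimal sketch of the Monte-Carlo averaging step, under the assumption that posterior samples $\theta^{(i)} \sim p(\theta|\mathcal{D}, m)$ are already available (here they are faked from a standard Gaussian; a real inference method would supply them). The softmax model and all sizes are illustrative.

```python
# Monte-Carlo approximation of the marginalized prediction: only the final
# averaging line is the point; the samples themselves are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
num_classes, dim, S = 3, 4, 500
x_star = rng.normal(size=dim)

def predict(theta, x):
    """p(y|x, theta): softmax over class scores (theta has one row per class)."""
    scores = theta @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Stand-in posterior samples; a real method (MCMC, VI, ...) would supply these.
theta_samples = rng.normal(size=(S, num_classes, dim))

# p(y*|x*, D, m) ~= (1/S) sum_i p(y*|x*, theta^(i), m)
p_marginal = np.mean([predict(t, x_star) for t in theta_samples], axis=0)
print("Marginalized class probabilities:", np.round(p_marginal, 3))
```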
Marginalization is like Ensemble-based Averaging
▪ Recall the approximation of the marginalization:
  $p(y_*|\boldsymbol{x}_*, \mathcal{D}, m) \approx \frac{1}{S} \sum_{i=1}^{S} p(y_*|\boldsymbol{x}_*, \theta^{(i)}, m)$
▪ This is akin to computing an averaged prediction using an ensemble of weights $\{\theta^{(i)}\}_{i=1}^{S}$
  ▪ One way to get such an ensemble is to train the same model with $S$ different initializations (see the sketch below)
  ▪ May not be as good as doing proper marginalization over the posterior, but usually better than using a single estimate of the weights
▪ Ensembling is a well-known classical technique for strong predictive performance. We will also study ensembles later during this course
  ▪ Tip: if you can't design/use a full-fledged, rigorous probabilistic model, consider using an ensemble* instead
*Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles (Lakshminarayanan et al, 2017)
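A self-contained sketch of that recipe: train the same model $S$ times from different random initializations and average the predicted probabilities. The data, the logistic-regression model, and the training loop are all made up for illustration; note that for a convex model like this the members converge to similar solutions, so deep (non-convex) nets typically yield more member diversity.

```python
# Ensemble-based averaging on synthetic data: S runs of the same model from
# different inits, predictions averaged as in the Monte-Carlo formula above.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 * rng.normal(size=200) > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(seed, lr=0.1, steps=60):
    theta = np.random.default_rng(seed).normal(size=2)  # different init per member
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # logistic-loss gradient
        theta -= lr * grad
    return theta

ensemble = [train(seed) for seed in range(5)]           # S = 5 members
x_star = np.array([0.0, 3.0])
p_members = [sigmoid(theta @ x_star) for theta in ensemble]
print("Member predictions:", np.round(p_members, 3))
print("Ensemble-averaged prediction:", round(float(np.mean(p_members)), 3))
```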
Probabilistic ML: The Basic Set-up
Modeling Data Probabilistically: A Simplistic View
▪ Assume data $\mathbf{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$ generated from a probabilistic model with parameters $\theta$:
  $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N \sim p(\boldsymbol{x}|\theta)$
  ▪ This simplistic model can be drawn as a plate diagram. Note: shaded nodes = observed; unshaded nodes = unknown/unobserved
  ▪ No outputs/labels in this problem; we are just modeling the inputs (an unsupervised learning task)
▪ Goal: to estimate the unknowns ($\theta$ in this case), given the observed data $\mathbf{X}$
▪ Many ways to do this (point estimate or the posterior distribution of $\theta$); a sampling sketch of the generative story follows below
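A minimal sketch of this generative story, with an assumed Gaussian $p(\boldsymbol{x}|\theta)$ and made-up "true" parameter values; in practice $\theta$ is unknown, and estimating it from $\mathbf{X}$ is the learning problem on the following slides.

```python
# Forward-sampling x_1,...,x_N ~ p(x|theta) from a 1-dim Gaussian model.
import numpy as np

rng = np.random.default_rng(3)
theta_true = {"mean": 2.0, "std": 1.5}    # unknown in practice; assumed here
N = 100
X = rng.normal(theta_true["mean"], theta_true["std"], size=N)  # observed data
print("First few observations:", np.round(X[:5], 2))
```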
Basic Ingredients in Probabilistic ML
▪ Specification of probabilistic models requires two key ingredients: a likelihood and a prior
▪ The likelihood $p(\boldsymbol{x}|\theta)$, or the "observation model", specifies how the data is generated
  ▪ It measures data fit (or "loss") w.r.t. the given $\theta$ (will see the reason formally later)
▪ The prior distribution $p(\theta)$ specifies how likely different parameter values are a priori (a code sketch of both ingredients follows below)
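A minimal sketch of the two ingredients for a coin-flip model, using scipy.stats: a Bernoulli likelihood $p(\boldsymbol{x}|\theta)$ and a Beta prior $p(\theta)$. The data and the hyperparameter values are illustrative.

```python
# Likelihood + prior for a Bernoulli coin with bias theta.
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1])                 # observed coin flips

def log_likelihood(theta):
    # log p(X|theta) = sum_n log Bernoulli(x_n|theta): measures data fit
    return stats.bernoulli.logpmf(x, theta).sum()

prior = stats.beta(a=2.0, b=2.0)                 # p(theta): mildly favors 0.5
print("log-likelihood at theta=0.7:", round(log_likelihood(0.7), 3))
print("prior density at theta=0.7:", round(prior.pdf(0.7), 3))
```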
Parameter Estimation in Probabilistic Models
▪ A simple way: find the $\theta$ for which the observed data is most likely or most probable, by maximizing the log-likelihood, just like the loss-function-minimization approach
  ▪ This "point estimate", called the maximum likelihood estimate (MLE), however, does not provide us any information about the uncertainty in $\theta$
  ▪ Can also find a single best estimate of $\theta$ by finding the mode of the posterior, i.e., $\hat{\theta} = \arg\max_{\theta} \log p(\theta|\mathbf{X})$, called the maximum-a-posteriori (MAP) estimate, but we won't get an uncertainty estimate either (a worked example of both follows below)
▪ More desirable: estimate the full posterior distribution over $\theta$ to get the uncertainty
  ▪ This is fully Bayesian inference. In general, an intractable problem (mainly because computing the marginal $p(\mathbf{X})$ is hard), except for some simple cases (will study how to solve such problems)
▪ To make predictions, we can use the full posterior rather than a point estimate
▪ Terminology note: in this course, the word "inference" means Bayesian inference, i.e., computing the (usually approximate) posterior distribution of the model parameters. At some other places, especially in the deep learning community, "inference" usually refers to making predictions on test inputs
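Continuing the coin-flip example, a minimal sketch contrasting the MLE and the MAP estimate; both have closed forms here because the Beta prior is conjugate to the Bernoulli likelihood. Neither carries uncertainty, unlike the full posterior $\text{Beta}(a + k, b + N - k)$.

```python
# MLE vs MAP for the Beta-Bernoulli model (conjugate, hence closed-form).
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
N, k = len(x), int(x.sum())
a, b = 2.0, 2.0                                   # Beta prior hyperparameters

theta_mle = k / N                                 # argmax_theta log p(X|theta)
theta_map = (k + a - 1) / (N + a + b - 2)         # posterior mode: argmax log p(theta|X)
print(f"MLE = {theta_mle:.3f}, MAP = {theta_map:.3f}")
```

The prior pulls the MAP estimate toward 0.5 relative to the MLE; with more data the two estimates converge.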
Posterior Averaging
▪ Can now use the posterior over the parameters to compute an "averaged prediction", called the posterior predictive distribution (PPD), obtained by doing an importance-weighted averaging over the posterior:
  $p(\boldsymbol{x}_*|\mathbf{X}) = \int p(\boldsymbol{x}_*|\theta)\, p(\theta|\mathbf{X})\, d\theta$
  ▪ Here $\mathbf{X}$ is the past (training) data and $\boldsymbol{x}_*$ is the future (test) data; the posterior $p(\theta|\mathbf{X})$ tells us how important each value of $\theta$ is
▪ Proof of the PPD expression:
  $p(\boldsymbol{x}_*|\mathbf{X}) = \int p(\boldsymbol{x}_*, \theta|\mathbf{X})\, d\theta = \int p(\boldsymbol{x}_*|\theta, \mathbf{X})\, p(\theta|\mathbf{X})\, d\theta = \int p(\boldsymbol{x}_*|\theta)\, p(\theta|\mathbf{X})\, d\theta$
  where the last step assumes the observations are i.i.d. given $\theta$
▪ An approximation of the PPD, via Monte-Carlo averaging over posterior samples:
  $p(\boldsymbol{x}_*|\mathbf{X}) \approx \frac{1}{S} \sum_{i=1}^{S} p(\boldsymbol{x}_*|\theta^{(i)})$
▪ Contrast this with the plug-in predictive distribution: "plug-in" because we plug in a single value of $\theta$ to compute it (a worked comparison follows below)
▪ Note: the PPD will take different forms depending on the problem/model; e.g., for a discriminative supervised learning model, the PPD will be of the form $p(y|\boldsymbol{x}, \mathcal{D})$, where $\mathcal{D}$ is the labeled training data
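A minimal sketch of the PPD for the same Beta-Bernoulli model, comparing the exact posterior average (closed form in this conjugate case), its Monte-Carlo approximation, and the plug-in predictive that uses a single point estimate.

```python
# Exact PPD vs Monte-Carlo PPD vs plug-in predictive for the Beta-Bernoulli.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.array([1, 0, 1, 1, 0, 1])
N, k = len(x), int(x.sum())
a, b = 2.0, 2.0
posterior = stats.beta(a + k, b + N - k)          # p(theta|X), conjugate case

p_ppd_exact = (a + k) / (a + b + N)               # E[theta|X] = int theta p(theta|X) dtheta
p_ppd_mc = posterior.rvs(size=10_000, random_state=rng).mean()  # (1/S) sum theta^(i)
p_plugin = k / N                                  # plug in the MLE

print(f"Exact PPD P(x*=1|X) = {p_ppd_exact:.3f}")
print(f"Monte-Carlo PPD     = {p_ppd_mc:.3f}")
print(f"Plug-in (MLE)       = {p_plugin:.3f}")
```

With little data the PPD is pulled toward the prior, whereas the plug-in predictive can be overconfident.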
Bayesian Inference
▪ Bayesian inference can be seen in a sequential fashion
▪ Our belief about $\theta$ keeps getting updated as we see more and more data
  ▪ The posterior keeps getting updated as more and more data is observed (a code sketch follows below)
▪ Note: the updates may not be straightforward, and approximations may be needed
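A minimal sketch of sequential updating for the coin model: because of conjugacy, each round's posterior becomes the next round's prior by simply accumulating counts, and the posterior variance shrinks as data arrives. The true coin bias (0.7) and batch sizes are made up.

```python
# Sequential Bayesian updating of a Beta posterior over a coin's bias.
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 2.0                                   # initial prior Beta(a, b)
for t, batch in enumerate(rng.binomial(1, 0.7, size=(3, 20)), start=1):
    a += batch.sum()                              # add observed heads
    b += len(batch) - batch.sum()                 # add observed tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))    # Beta variance
    print(f"After batch {t}: posterior mean {mean:.3f}, variance {var:.5f}")
```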
Other Examples of Probabilistic Models of Data
Modeling Data Probabilistically: With Latent Vars
▪ Can endow generative models of data with latent variables. For example, each data point $\boldsymbol{x}_n$ can be associated with a latent variable $\boldsymbol{z}_n$
  ▪ The latent variable $\boldsymbol{z}_n$ can be used to encode some property of $\boldsymbol{x}_n$ (e.g., its cluster membership, its low-dim representation, or missing parts of $\boldsymbol{x}_n$)
▪ Such models are used in many problems, especially unsupervised learning: Gaussian mixture models, probabilistic PCA, topic models, deep generative models, etc. (a sampling sketch follows below)
▪ We will look at several of these in this course, and at ways to learn such models
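A minimal sketch of one such latent-variable generative story, a 1-dim Gaussian mixture: each latent $\boldsymbol{z}_n$ picks a cluster, then $\boldsymbol{x}_n$ is drawn from that cluster's Gaussian. All parameter values are illustrative; learning them from $\boldsymbol{x}$ alone, with $\boldsymbol{z}$ unobserved, is what inference algorithms covered later (e.g., EM) do.

```python
# Forward-sampling from a 1-dim Gaussian mixture model (latent z, observed x).
import numpy as np

rng = np.random.default_rng(6)
pi = np.array([0.3, 0.7])                 # mixing proportions p(z)
means, stds = np.array([-2.0, 3.0]), np.array([0.5, 1.0])

N = 5
z = rng.choice(2, size=N, p=pi)           # latent cluster memberships z_n
x = rng.normal(means[z], stds[z])         # observations x_n ~ p(x|z_n, theta)
print("z:", z)
print("x:", np.round(x, 2))
```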
Modeling Data Probabilistically: Other Examples
▪ Can construct various other types of models of data for different problems, e.g.,
  ▪ A simple supervised learning model
  ▪ A latent variable model for unsupervised learning
  ▪ A latent variable model for supervised learning
  ▪ A latent variable model for sequential data
▪ Any node (even if observed) we are uncertain about is modeled by a probability distribution
  ▪ These nodes become the random variables of the model
▪ The full model is specified via a joint probability distribution over all the random variables
Use probabilistic ML also because…
Helps in Sequential Decision-Making Problems
▪ Sequential decision-making: information about uncertainty can "guide" us. E.g., given our current estimate of a regression function, which training input(s) should we add next to improve its estimate the most?
  ▪ Uncertainty can help here: acquire training inputs from regions where the function is most uncertain about its current predictions (in the illustration, the blue curve is the mean of the function and the shaded region denotes the predictive uncertainty)
▪ Applications in active learning, reinforcement learning, Bayesian optimization, etc.
Can Learn Data Distribution and Generate Data
▪ Often we wish to learn the underlying probability distribution $p(\boldsymbol{x})$ of the data
▪ Useful for many tasks, e.g.,
  ▪ Outlier/novelty detection: outliers will have low probability under $p(\boldsymbol{x})$
  ▪ Can sample from this distribution to generate new "artificial" but realistic-looking data
▪ Several models, such as generative adversarial networks (GAN), variational auto-encoders (VAE), denoising diffusion models, etc., can do this
Pic credit: https://medium.com/analytics-vidhya/an-introduction-to-generative-deep-learning-792e93d1c6d4
Can Better Handle OOD Data
▪ Many modern deep neural networks (DNN) tend to be overconfident
  ▪ Especially true if the test data is "out-of-distribution (OOD)"
  ▪ Example of an overconfident model: compared to a desirable confidence map, the confidence map of a DNN may show high confidence for predictions even on inputs that are far away from the training data (e.g., some OOD test data far from the two training classes)
▪ PML models are robust and give better uncertainty estimates to flag OOD data
Image source: "Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness" (Liu et al, 2020)
Can Construct Complex Models in Modular Ways
▪ Can combine multiple simple probabilistic models to learn complex patterns
  ▪ Example: a combination of a mixture model for clustering and a probabilistic linear regression model. The result is a probabilistic nonlinear regression model, essentially a "mixture of experts" model. Can design a latent variable model to do this
▪ Can design models that jointly learn from multiple datasets and share information across the datasets using shared parameters with a prior distribution
  ▪ Example: estimating the means of $m$ datasets, assuming the means are somewhat related. Can do this jointly rather than estimating them independently. This is an example of transfer learning or multitask learning using a probabilistic approach; easy to do with shared parameters (will see details later)
Hyperparameter Estimation
▪ ML models invariably have hyperparameters, e.g., the regularization/kernel hyperparameters in linear/kernel regression, the hyperparameters of a deep neural network, etc.
▪ Can specify the hyperparameters as additional unknowns of the probabilistic model
▪ Can now estimate them, e.g., using a point estimate or a posterior distribution
▪ To find a point estimate of the hyperparameters $\alpha$, we can maximize the marginal likelihood of the data, obtained by marginalizing out $\theta$ (more on this later; a small numerical sketch follows below):
  $\hat{\alpha} = \arg\max_{\alpha} \log p(\mathbf{X}|\alpha) = \arg\max_{\alpha} \log \int p(\mathbf{X}|\theta)\, p(\theta|\alpha)\, d\theta$
  ▪ The approach of using the marginal likelihood for this has some issues; other quantities, such as the "conditional" marginal likelihood*, can be used instead (more on this later)
*Bayesian Model Selection, the Marginal Likelihood, and Generalization (Lotfi et al, 2022)
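A minimal sketch of this recipe for the Bernoulli likelihood with a symmetric $\text{Beta}(\alpha, \alpha)$ prior, where the marginal likelihood has a closed form via Beta functions, $p(\mathbf{X}|\alpha) = B(\alpha + k, \alpha + N - k)/B(\alpha, \alpha)$, so a simple grid search over $\alpha$ suffices. The data is made up.

```python
# Point-estimating a prior hyperparameter by maximizing the marginal likelihood.
import numpy as np
from scipy.special import betaln  # log of the Beta function

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
N, k = len(x), int(x.sum())

def log_marginal(alpha):
    # log p(X|alpha) = log B(alpha + k, alpha + N - k) - log B(alpha, alpha)
    return betaln(alpha + k, alpha + N - k) - betaln(alpha, alpha)

grid = np.linspace(0.1, 10.0, 200)
alpha_hat = grid[np.argmax([log_marginal(al) for al in grid])]
print(f"alpha maximizing log p(X|alpha): {alpha_hat:.2f}")
```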
Model Comparison
▪ Suppose we have a number of models $m = 1, 2, \ldots, M$ to choose from
▪ The standard way to choose the best model is cross-validation
▪ Can also compute the posterior probability of each candidate model, using Bayes rule:
  $p(m|\mathbf{X}) = \frac{p(m)\, p(\mathbf{X}|m)}{p(\mathbf{X})}$
  where $p(\mathbf{X}|m)$ is the marginal likelihood of model $m$ (may not be easy to compute exactly, but can be computed approximately)
▪ If all models are equally likely a priori ($p(m)$ is uniform), then the best model can be selected as the one with the largest marginal likelihood
  ▪ This doesn't require a separate validation set, unlike cross-validation; therefore it is also useful for doing model selection/comparison for unsupervised learning problems (a small numerical sketch follows below)
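A minimal sketch of this comparison for two candidate "models" of the same coin data, differing only in their priors; the closed-form Beta-Bernoulli marginal likelihood from the previous sketch scores each, and a uniform $p(m)$ turns the scores into $p(m|\mathbf{X})$ via Bayes rule.

```python
# Bayesian model comparison via marginal likelihoods (Beta-Bernoulli case).
import numpy as np
from scipy.special import betaln

x = np.array([1, 1, 1, 0, 1, 1, 1, 1])
N, k = len(x), int(x.sum())

models = {"fair-ish coin: Beta(20, 20)": (20.0, 20.0),
          "uninformative: Beta(1, 1)": (1.0, 1.0)}

log_ev = np.array([betaln(a + k, b + N - k) - betaln(a, b)
                   for a, b in models.values()])       # log p(X|m)
post = np.exp(log_ev - log_ev.max())
post /= post.sum()                                     # p(m|X) under uniform p(m)
for name, pm in zip(models, post):
    print(f"p(m|X) for {name}: {pm:.3f}")
```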
Non-probabilistic ML Methods?
▪ Some non-probabilistic ML methods can give probabilistic answers via heuristics
▪ This doesn't mean these methods are not useful/used, but they don't follow the PML paradigm, so we won't study them in this course
▪ Some examples which you may have seen:
  ▪ Converting distances from the hyperplane (in hyperplane classifiers) to compute class probabilities, or methods like Platt Scaling used to get class probabilities for SVMs
  ▪ Using class frequencies in nearest neighbors to compute class probabilities
  ▪ Using class frequencies at the leaves of a Decision Tree to compute class probabilities
  ▪ Soft k-means clustering to compute probabilistic cluster memberships
Tentative Outline
▪ Basics of probabilistic modeling and inference
▪ Common probability distributions
▪ Basic point estimation (MLE and MAP)
▪ Bayesian inference (simple and not-so-simple cases)
▪ Probabilistic models for regression, classification, clustering, dimensionality reduction
▪ Gaussian Processes (probabilistic modeling meets kernels)
▪ Latent Variable Models (for i.i.d., sequential, and relational data)
▪ Approximate Bayesian inference (EM, variational inference, sampling, etc.)
▪ Bayesian Deep Learning
▪ Misc topics, e.g., deep generative models, black-box inference, sequential decision-making, etc.