https://github.com/soulmachine/machine-learning-cheat-sheet
soulmachine@gmail.com
Machine Learning Cheat Sheet
Classical equations, diagrams and tricks in machine learning
February 12, 2015
©2013 soulmachine
Except where otherwise noted, this document is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license (http://creativecommons.org/licenses/by/3.0/).
Preface
This cheat sheet collects many classical equations and diagrams in machine learning, to help you quickly recall knowledge and ideas.

This cheat sheet has three significant advantages:

1. Strongly typed. Compared to programming languages, mathematical formulas are weakly typed: for example, X can be a set, a random variable, or a matrix, which makes formulas hard to interpret. In this cheat sheet, I try my best to standardize the symbols used; see the Notation section.
2. More parentheses. In machine learning, authors are prone to omitting parentheses, brackets and braces, which often makes formulas ambiguous. In this cheat sheet, I use parentheses (and brackets and braces) wherever they are needed, to make formulas easy to understand.
3. Fewer leaps in reasoning. In many books, authors omit steps that are trivial in their opinion, which often leaves readers lost partway through a derivation. Here I try to spell such steps out.

At Tsinghua University, May 2013
soulmachine
Contents

Notation

1 Introduction
  1.1 Types of machine learning
  1.2 Three elements of a machine learning model
    1.2.1 Representation
    1.2.2 Evaluation
    1.2.3 Optimization
  1.3 Some basic concepts
    1.3.1 Parametric vs non-parametric models
    1.3.2 A simple non-parametric classifier: K-nearest neighbours
    1.3.3 Overfitting
    1.3.4 Cross validation
    1.3.5 Model selection

2 Probability
  2.1 Frequentists vs. Bayesians
  2.2 A brief review of probability theory
    2.2.1 Basic concepts
    2.2.2 Multivariate random variables
    2.2.3 Bayes rule
    2.2.4 Independence and conditional independence
    2.2.5 Quantiles
    2.2.6 Mean and variance
  2.3 Some common discrete distributions
    2.3.1 The Bernoulli and binomial distributions
    2.3.2 The multinoulli and multinomial distributions
    2.3.3 The Poisson distribution
    2.3.4 The empirical distribution
  2.4 Some common continuous distributions
    2.4.1 Gaussian (normal) distribution
    2.4.2 Student's t-distribution
    2.4.3 The Laplace distribution
    2.4.4 The gamma distribution
    2.4.5 The beta distribution
    2.4.6 Pareto distribution
  2.5 Joint probability distributions
    2.5.1 Covariance and correlation
    2.5.2 Multivariate Gaussian distribution
    2.5.3 Multivariate Student's t-distribution
    2.5.4 Dirichlet distribution
  2.6 Transformations of random variables
    2.6.1 Linear transformations
    2.6.2 General transformations
    2.6.3 Central limit theorem
  2.7 Monte Carlo approximation
  2.8 Information theory
    2.8.1 Entropy
    2.8.2 KL divergence
    2.8.3 Mutual information

3 Generative models for discrete data
  3.1 Generative classifier
  3.2 Bayesian concept learning
    3.2.1 Likelihood
    3.2.2 Prior
    3.2.3 Posterior
    3.2.4 Posterior predictive distribution
  3.3 The beta-binomial model
    3.3.1 Likelihood
    3.3.2 Prior
    3.3.3 Posterior
    3.3.4 Posterior predictive distribution
  3.4 The Dirichlet-multinomial model
    3.4.1 Likelihood
    3.4.2 Prior
    3.4.3 Posterior
    3.4.4 Posterior predictive distribution
  3.5 Naive Bayes classifiers
    3.5.1 Optimization
    3.5.2 Using the model for prediction
    3.5.3 The log-sum-exp trick
    3.5.4 Feature selection using mutual information
    3.5.5 Classifying documents using bag of words

4 Gaussian Models
  4.1 Basics
    4.1.1 MLE for a MVN
    4.1.2 Maximum entropy derivation of the Gaussian *
  4.2 Gaussian discriminant analysis
    4.2.1 Quadratic discriminant analysis (QDA)
    4.2.2 Linear discriminant analysis (LDA)
    4.2.3 Two-class LDA
    4.2.4 MLE for discriminant analysis
    4.2.5 Strategies for preventing overfitting
    4.2.6 Regularized LDA *
    4.2.7 Diagonal LDA
    4.2.8 Nearest shrunken centroids classifier *
  4.3 Inference in jointly Gaussian distributions
    4.3.1 Statement of the result
    4.3.2 Examples
  4.4 Linear Gaussian systems
    4.4.1 Statement of the result
  4.5 Digression: The Wishart distribution *
  4.6 Inferring the parameters of an MVN
    4.6.1 Posterior distribution of µ
    4.6.2 Posterior distribution of Σ *
    4.6.3 Posterior distribution of µ and Σ *
    4.6.4 Sensor fusion with unknown precisions *

5 Bayesian statistics
  5.1 Introduction
  5.2 Summarizing posterior distributions
    5.2.1 MAP estimation
    5.2.2 Credible intervals
    5.2.3 Inference for a difference in proportions
  5.3 Bayesian model selection
    5.3.1 Bayesian Occam's razor
    5.3.2 Computing the marginal likelihood (evidence)
    5.3.3 Bayes factors
  5.4 Priors
    5.4.1 Uninformative priors
    5.4.2 Robust priors
    5.4.3 Mixtures of conjugate priors
  5.5 Hierarchical Bayes
  5.6 Empirical Bayes
  5.7 Bayesian decision theory
    5.7.1 Bayes estimators for common loss functions
    5.7.2 The false positive vs false negative tradeoff

6 Frequentist statistics
  6.1 Sampling distribution of an estimator
    6.1.1 Bootstrap
    6.1.2 Large sample theory for the MLE *
  6.2 Frequentist decision theory
  6.3 Desirable properties of estimators
  6.4 Empirical risk minimization
    6.4.1 Regularized risk minimization
    6.4.2 Structural risk minimization
    6.4.3 Estimating the risk using cross validation
    6.4.4 Upper bounding the risk using statistical learning theory *
    6.4.5 Surrogate loss functions
  6.5 Pathologies of frequentist statistics *

7 Linear Regression
  7.1 Introduction
  7.2 Representation
  7.3 MLE
    7.3.1 OLS
    7.3.2 SGD
  7.4 Ridge regression (MAP)
    7.4.1 Basic idea
    7.4.2 Numerically stable computation *
    7.4.3 Connection with PCA *
    7.4.4 Regularization effects of big data
  7.5 Bayesian linear regression

8 Logistic Regression
  8.1 Representation
  8.2 Optimization
    8.2.1 MLE
    8.2.2 MAP
  8.3 Multinomial logistic regression
    8.3.1 Representation
    8.3.2 MLE
    8.3.3 MAP
  8.4 Bayesian logistic regression
    8.4.1 Laplace approximation
    8.4.2 Derivation of the BIC
    8.4.3 Gaussian approximation for logistic regression
    8.4.4 Approximating the posterior predictive
    8.4.5 Residual analysis (outlier detection) *
  8.5 Online learning and stochastic optimization
    8.5.1 The perceptron algorithm
  8.6 Generative vs discriminative classifiers
    8.6.1 Pros and cons of each approach
    8.6.2 Dealing with missing data
    8.6.3 Fisher's linear discriminant analysis (FLDA) *

9 Generalized linear models and the exponential family
  9.1 The exponential family
    9.1.1 Definition
    9.1.2 Examples
    9.1.3 Log partition function
    9.1.4 MLE for the exponential family
    9.1.5 Bayes for the exponential family
    9.1.6 Maximum entropy derivation of the exponential family *
  9.2 Generalized linear models (GLMs)
    9.2.1 Basics
  9.3 Probit regression
  9.4 Multi-task learning

10 Directed graphical models (Bayes nets)
  10.1 Introduction
    10.1.1 Chain rule
    10.1.2 Conditional independence
    10.1.3 Graphical models
    10.1.4 Directed graphical model
  10.2 Examples
    10.2.1 Naive Bayes classifiers
    10.2.2 Markov and hidden Markov models
  10.3 Inference
  10.4 Learning
    10.4.1 Learning from complete data
    10.4.2 Learning with missing and/or latent variables
  10.5 Conditional independence properties of DGMs
    10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)
    10.5.2 Other Markov properties of DGMs
    10.5.3 Markov blanket and full conditionals
    10.5.4 Multinoulli learning
  10.6 Influence (decision) diagrams *

11 Mixture models and the EM algorithm
  11.1 Latent variable models
  11.2 Mixture models
    11.2.1 Mixtures of Gaussians
    11.2.2 Mixtures of multinoullis
    11.2.3 Using mixture models for clustering
    11.2.4 Mixtures of experts
  11.3 Parameter estimation for mixture models
    11.3.1 Unidentifiability
    11.3.2 Computing a MAP estimate is non-convex
  11.4 The EM algorithm
    11.4.1 Introduction
    11.4.2 Basic idea
    11.4.3 EM for GMMs
    11.4.4 EM for K-means
    11.4.5 EM for mixture of experts
    11.4.6 EM for DGMs with hidden variables
    11.4.7 EM for the Student distribution *
    11.4.8 EM for probit regression *
    11.4.9 Derivation of the Q function
    11.4.10 Convergence of the EM algorithm *
    11.4.11 Generalization of the EM algorithm *
    11.4.12 Online EM
    11.4.13 Other EM variants *
  11.5 Model selection for latent variable models
    11.5.1 Model selection for probabilistic models
    11.5.2 Model selection for non-probabilistic methods
  11.6 Fitting models with missing data
    11.6.1 EM for the MLE of an MVN with missing data

12 Latent linear models
  12.1 Factor analysis
    12.1.1 FA is a low rank parameterization of an MVN
    12.1.2 Inference of the latent factors
    12.1.3 Unidentifiability
    12.1.4 Mixtures of factor analysers
    12.1.5 EM for factor analysis models
    12.1.6 Fitting FA models with missing data
  12.2 Principal components analysis (PCA)
    12.2.1 Classical PCA
    12.2.2 Singular value decomposition (SVD)
    12.2.3 Probabilistic PCA
    12.2.4 EM algorithm for PCA
  12.3 Choosing the number of latent dimensions
    12.3.1 Model selection for FA/PPCA
    12.3.2 Model selection for PCA
  12.4 PCA for categorical data
  12.5 PCA for paired and multi-view data
    12.5.1 Supervised PCA (latent factor regression)
    12.5.2 Discriminative supervised PCA
    12.5.3 Canonical correlation analysis
  12.6 Independent Component Analysis (ICA)
    12.6.1 Maximum likelihood estimation
    12.6.2 The FastICA algorithm
    12.6.3 Using EM
    12.6.4 Other estimation principles *

13 Sparse linear models

14 Kernels
  14.1 Introduction
  14.2 Kernel functions
    14.2.1 RBF kernels
    14.2.2 TF-IDF kernels
    14.2.3 Mercer (positive definite) kernels
    14.2.4 Linear kernels
    14.2.5 Matern kernels
    14.2.6 String kernels
    14.2.7 Pyramid match kernels
    14.2.8 Kernels derived from probabilistic generative models
  14.3 Using kernels inside GLMs
    14.3.1 Kernel machines
    14.3.2 L1VMs, RVMs, and other sparse vector machines
  14.4 The kernel trick
    14.4.1 Kernelized KNN
    14.4.2 Kernelized K-medoids clustering
    14.4.3 Kernelized ridge regression
    14.4.4 Kernel PCA
  14.5 Support vector machines (SVMs)
    14.5.1 SVMs for classification
    14.5.2 SVMs for regression
    14.5.3 Choosing C
    14.5.4 A probabilistic interpretation of SVMs
    14.5.5 Summary of key points
  14.6 Comparison of discriminative kernel methods
  14.7 Kernels for building generative models

15 Gaussian processes
  15.1 Introduction
  15.2 GPs for regression
  15.3 GPs meet GLMs
  15.4 Connection with other methods
  15.5 GP latent variable model
  15.6 Approximation methods for large datasets

16 Adaptive basis function models
  16.1 AdaBoost
    16.1.1 Representation
    16.1.2 Evaluation
    16.1.3 Optimization
    16.1.4 The upper bound of the training error of AdaBoost

17 Hidden Markov Model
  17.1 Introduction
  17.2 Markov models

18 State space models

19 Undirected graphical models (Markov random fields)

20 Exact inference for graphical models

21 Variational inference

22 More variational inference

23 Monte Carlo inference

24 Markov chain Monte Carlo (MCMC) inference
  24.1 Introduction
  24.2 Metropolis-Hastings algorithm
  24.3 Gibbs sampling
  24.4 Speed and accuracy of MCMC
  24.5 Auxiliary variable MCMC *

25 Clustering

26 Graphical model structure learning

27 Latent variable models for discrete data
  27.1 Introduction
  27.2 Distributed state LVMs for discrete data

28 Deep learning

A Optimization methods
  A.1 Convexity
  A.2 Gradient descent
    A.2.1 Stochastic gradient descent
    A.2.2 Batch gradient descent
    A.2.3 Line search
    A.2.4 Momentum term
  A.3 Lagrange duality
    A.3.1 Primal form
    A.3.2 Dual form
  A.4 Newton's method
  A.5 Quasi-Newton method
    A.5.1 DFP
    A.5.2 BFGS
    A.5.3 Broyden

Glossary
List of Contributors

Wei Zhang
PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R. China, e-mail: zh3feng@gmail.com. Wrote the chapters on Naive Bayes and SVMs.

Fei Pan
Master's student at Beijing University of Technology, Beijing, P.R. China, e-mail: example@gmail.com. Wrote the chapters on K-means and AdaBoost.

Yong Li
PhD candidate at the Institute of Automation of the Chinese Academy of Sciences (CASIA), Beijing, P.R. China, e-mail: liyong3forever@gmail.com. Wrote the chapter on Logistic Regression.

Jiankou Li
PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R. China, e-mail: lijiankoucoco@163.com. Wrote the chapter on Bayes nets.
Notation
Introduction
It is very difficult to come up with a single, consistent notation to cover the wide variety of data, models and algorithms that we discuss. Furthermore, conventions differ between machine learning and statistics, and between different books and papers. Nevertheless, we have tried to be as consistent as possible. Below we summarize most of the notation used in this book, although individual sections may introduce new notation. Note also that the same symbol may have different meanings depending on the context, although we try to avoid this where possible.
General math notation
Symbol          Meaning
⌊x⌋             Floor of x, i.e., round down to nearest integer
⌈x⌉             Ceiling of x, i.e., round up to nearest integer
x ⊗ y           Convolution of x and y
x ⊙ y           Hadamard (elementwise) product of x and y
a ∧ b           Logical AND
a ∨ b           Logical OR
¬a              Logical NOT
I(x)            Indicator function: I(x) = 1 if x is true, else I(x) = 0
∞               Infinity
→               Tends towards, e.g., n → ∞
∝               Proportional to, so y = ax can be written as y ∝ x
|x|             Absolute value
|S|             Size (cardinality) of a set
n!              Factorial function
∇               Vector of first derivatives
∇²              Hessian matrix of second derivatives
≜               Defined as
O(·)            Big-O: roughly means order of magnitude
R               The real numbers
1 : n           Range (Matlab convention): 1 : n = 1,2,...,n
≈               Approximately equal to
argmax_x f(x)   Argmax: the value x that maximizes f
B(a,b)          Beta function, B(a,b) = Γ(a)Γ(b)/Γ(a+b)
B(α)            Multivariate beta function, ∏_k Γ(α_k) / Γ(∑_k α_k)
C(n,k)          n choose k, equal to n!/(k!(n−k)!)
δ(x)            Dirac delta function: δ(x) = ∞ if x = 0, else δ(x) = 0
exp(x)          Exponential function eˣ
Γ(x)            Gamma function, Γ(x) = ∫₀^∞ u^{x−1} e^{−u} du
Ψ(x)            Digamma function, Ψ(x) = (d/dx) log Γ(x)
X               A set from which values are drawn (e.g., X = R^D)
Linear algebra notation
We use boldface lower-case to denote vectors, such as x, and boldface upper-case to denote matrices, such as X. We
denote entries in a matrix by non-bold upper case letters, such as Xij.
Vectors are assumed to be column vectors, unless noted otherwise. We use (x1,··· ,xD) to denote a column vector
created by stacking D scalars. If we write X = (x1,··· ,xn), where the left hand side is a matrix, we mean to stack
the xi along the columns, creating a matrix.
Symbol          Meaning
X ≻ 0           X is a positive definite matrix
tr(X)           Trace of a matrix
det(X)          Determinant of matrix X
|X|             Determinant of matrix X
X⁻¹             Inverse of a matrix
X†              Pseudo-inverse of a matrix
Xᵀ              Transpose of a matrix
xᵀ              Transpose of a vector
diag(x)         Diagonal matrix made from vector x
diag(X)         Diagonal vector extracted from matrix X
I or I_d        Identity matrix of size d × d (ones on the diagonal, zeros elsewhere)
1 or 1_d        Vector of ones (of length d)
0 or 0_d        Vector of zeros (of length d)
||x|| = ||x||₂  Euclidean or ℓ2 norm, √(∑_{j=1}^d x_j²)
||x||₁          ℓ1 norm, ∑_{j=1}^d |x_j|
X_{:,j}         j'th column of matrix
X_{i,:}         Transpose of i'th row of matrix (a column vector)
X_{i,j}         Element (i,j) of matrix X
x ⊗ y           Tensor product of x and y
Probability notation
We denote random and fixed scalars by lower case, random and fixed vectors by bold lower case, and random and fixed matrices by bold upper case. Occasionally we use non-bold upper case to denote scalar random variables. We use p() for both discrete and continuous random variables.
Symbol          Meaning
X, Y            Random variable
P()             Probability of a random event
F()             Cumulative distribution function (CDF), also called distribution function
p(x)            Probability mass function (PMF)
f(x)            Probability density function (PDF)
F(x,y)          Joint CDF
p(x,y)          Joint PMF
f(x,y)          Joint PDF
p(X|Y)          Conditional PMF, also called conditional probability
f_{X|Y}(x|y)    Conditional PDF
X ⊥ Y           X is independent of Y
X ̸⊥ Y           X is not independent of Y
X ⊥ Y | Z       X is conditionally independent of Y given Z
X ̸⊥ Y | Z       X is not conditionally independent of Y given Z
X ∼ p           X is distributed according to distribution p
α               Parameters of a Beta or Dirichlet distribution
cov[X]          Covariance of X
E[X]            Expected value of X
E_q[X]          Expected value of X wrt distribution q
H(X) or H(p)    Entropy of distribution p(X)
I(X;Y)          Mutual information between X and Y
KL(p||q)        KL divergence from distribution p to q
ℓ(θ)            Log-likelihood function
L(θ,a)          Loss function for taking action a when the true state of nature is θ
λ               Precision (inverse variance), λ = 1/σ²
Λ               Precision matrix, Λ = Σ⁻¹
mode[X]         Most probable value of X
µ               Mean of a scalar distribution
µ               Mean of a multivariate distribution
Φ               CDF of standard normal
ϕ               PDF of standard normal
π               Multinomial parameter vector; stationary distribution of a Markov chain
ρ               Correlation coefficient
sigm(x)         Sigmoid (logistic) function, 1/(1 + e⁻ˣ)
σ²              Variance
Σ               Covariance matrix
var[x]          Variance of x
ν               Degrees of freedom parameter
Z               Normalization constant of a probability distribution
Machine learning/statistics notation
In general, we use upper case letters to denote constants, such as C, K, M, N, T, etc. We use lower case letters as dummy indices of the appropriate range, such as c = 1 : C to index classes, i = 1 : M to index data cases, j = 1 : N to index input features, k = 1 : K to index states or clusters, t = 1 : T to index time, etc.

We use x to represent an observed data vector. In a supervised problem, we use y to represent the desired output label. We use z to represent a hidden variable. Sometimes we also use q to represent a hidden discrete variable.
Symbol          Meaning
C               Number of classes
D               Dimensionality of data vector (number of features)
N               Number of data cases
N_c             Number of examples of class c, N_c = ∑_{i=1}^N I(y_i = c)
R               Number of outputs (response variables)
D               Training data, D = {(x_i, y_i) | i = 1 : N}
D_test          Test data
X               Input space
Y               Output space
K               Number of states or dimensions of a variable (often latent)
k(x,y)          Kernel function
K               Kernel matrix
H               Hypothesis space
L               Loss function
J(θ)            Cost function
f(x)            Decision function
P(y|x)          TODO
λ               Strength of ℓ2 or ℓ1 regularizer
ϕ(x)            Basis function expansion of feature vector x
Φ               Basis function expansion of design matrix X
q()             Approximate or proposal distribution
Q(θ,θ_old)      Auxiliary function in EM
T               Length of a sequence
T(D)            Test statistic for data
T               Transition matrix of a Markov chain
θ               Parameter vector
θ^(s)           s'th sample of parameter vector
θ̂               Estimate (usually MLE or MAP) of θ
θ̂_MLE           Maximum likelihood estimate of θ
θ̂_MAP           MAP estimate of θ
θ̄               Estimate (usually posterior mean) of θ
w               Vector of regression weights (called β in statistics)
b               Intercept (called ε in statistics)
W               Matrix of regression weights
x_ij            Component (i.e., feature) j of data case i, for i = 1 : N, j = 1 : D
x_i             Training case, i = 1 : N
X               Design matrix of size N × D
x̄               Empirical mean, x̄ = (1/N) ∑_{i=1}^N x_i
x̃               Future test case
x*              Feature test case
y               Vector of all training labels, y = (y₁,...,y_N)
z_ij            Latent component j for case i
Chapter 1
Introduction
1.1 Types of machine learning

Supervised learning:
  - Classification
  - Regression

Unsupervised learning:
  - Discovering clusters
  - Discovering latent factors
  - Discovering graph structure
  - Matrix completion
1.2 Three elements of a machine learning model

Model = Representation + Evaluation + Optimization¹

¹ Domingos, P. A few useful things to know about machine learning. Commun. ACM 55(10):78-87 (2012).
1.2.1 Representation
In supervised learning, a model must be represented as
a conditional probability distribution P(y|x)(usually we
call it classifier) or a decision function f(x). The set of
classifiers(or decision functions) is called the hypothesis
space of the model. Choosing a representation for a model
is tantamount to choosing the hypothesis space that it can
possibly learn.
1.2.2 Evaluation
In the hypothesis space, an evaluation function (also
called objective function or risk function) is needed to
distinguish good classifiers(or decision functions) from
bad ones.
1.2.2.1 Loss function and risk function

Definition 1.1. In order to measure how well a function fits the training data, a loss function L : Y × Y → R≥0 is defined. For training example (x_i, y_i), the loss of predicting the value ŷ is L(y_i, ŷ).

The following are some common loss functions:

1. 0-1 loss function:
   L(Y, f(X)) = I(Y ≠ f(X)) = { 1, Y ≠ f(X);  0, Y = f(X) }
2. Quadratic loss function: L(Y, f(X)) = (Y − f(X))²
3. Absolute loss function: L(Y, f(X)) = |Y − f(X)|
4. Logarithmic loss function: L(Y, P(Y|X)) = −log P(Y|X)
Definition 1.2. The risk of function f is defined as the expected loss of f:

R_exp(f) = E[L(Y, f(X))] = ∫ L(y, f(x)) P(x,y) dx dy    (1.1)

which is also called expected loss or risk function.

Definition 1.3. The risk function R_exp(f) can be estimated from the training data as

R_emp(f) = (1/N) ∑_{i=1}^N L(y_i, f(x_i))    (1.2)

which is also called empirical loss or empirical risk.
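To make Equation 1.2 concrete, here is a minimal Python sketch (the function names are ours, not standard library APIs) that evaluates the empirical risk for two of the losses listed above:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 1 when the prediction is wrong, 0 when it is right
    return (y != y_hat).astype(float)

def quadratic_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(loss, y, y_hat):
    # R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))
    return np.mean(loss(y, y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([1, 1, 1, 0])
print(empirical_risk(zero_one_loss, y, y_hat))   # 0.5
print(empirical_risk(quadratic_loss, y, y_hat))  # 0.5
```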
You can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet²:

1. They should approximate the actual loss you're trying to minimize. As noted above, the standard loss function for classification is the zero-one loss (misclassification rate), and the losses used for training classifiers are approximations of it.
2. The loss function should work with your intended optimization algorithm. This is why the zero-one loss is not used directly: it doesn't work with gradient-based optimization methods, since it doesn't have a well-defined gradient (or even a subgradient, like the hinge loss used by SVMs has).

The main algorithm that optimizes the zero-one loss directly is the old perceptron algorithm (chapter §??).
2 http://t.cn/zTrDxLO
1.2.2.2 ERM and SRM

Definition 1.4. ERM (empirical risk minimization):

min_{f∈F} R_emp(f) = min_{f∈F} (1/N) ∑_{i=1}^N L(y_i, f(x_i))    (1.3)

Definition 1.5. Structural risk:

R_srm(f) = (1/N) ∑_{i=1}^N L(y_i, f(x_i)) + λ J(f)    (1.4)

Definition 1.6. SRM (structural risk minimization):

min_{f∈F} R_srm(f) = min_{f∈F} [ (1/N) ∑_{i=1}^N L(y_i, f(x_i)) + λ J(f) ]    (1.5)
1.2.3 Optimization
Finally, we need a training algorithm(also called learn-
ing algorithm) to search among the classifiers in the the
hypothesis space for the highest-scoring one. The choice
of optimization technique is key to the efficiency of the
model.
1.3 Some basic concepts
1.3.1 Parametric vs non-parametric models
1.3.2 A simple non-parametric classifier: K-nearest neighbours

1.3.2.1 Representation

y = f(x) = argmax_c ∑_{x_i ∈ N_k(x)} I(y_i = c)    (1.6)

where N_k(x) is the set of the k points closest to point x.

A k-d tree is usually used to accelerate the search for the k nearest points; a brute-force sketch follows.
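The following Python sketch implements Equation 1.6 directly by brute force (illustrative only; the helper name is ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances from the query point x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbours, N_k(x)
    nearest = np.argsort(dists)[:k]
    # Majority vote: argmax_c sum_{i in N_k(x)} I(y_i = c)
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
```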
1.3.2.2 Evaluation
No training is needed.
1.3.2.3 Optimization
No training is needed.
1.3.3 Overfitting
1.3.4 Cross validation
Definition 1.7. Cross validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set³.

Common types of cross-validation:

1. K-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. A minimal index-splitting sketch follows this list.
2. 2-fold cross-validation. Also called simple cross-validation or the holdout method. This is the simplest variation of k-fold cross-validation, with k = 2.
3. Leave-one-out cross-validation (LOOCV). Here k = M, the number of original samples.
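The sketch below (assuming NumPy; the helper name kfold_indices is ours) shows the index bookkeeping behind k-fold cross-validation:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal folds,
    yielding (train, validation) index arrays for each fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

for train, val in kfold_indices(n=10, k=5):
    print(len(train), len(val))  # 8 2, five times
```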
1.3.5 Model selection
When we have a variety of models of different complexity (e.g., linear or logistic regression models with different-degree polynomials, or KNN classifiers with different values of K), how should we pick the right one? A natural approach is to compute the misclassification rate on the training set for each method.

³ http://en.wikipedia.org/wiki/Cross-validation_(statistics)
Chapter 2
Probability
2.1 Frequentists vs. Bayesians
What is probability? Consider a statement such as "the probability that this coin will land heads is 0.5". There are at least two interpretations.

One is called the frequentist interpretation. In this view, probabilities represent long-run frequencies of events. For example, the above statement means that, if we flip the coin many times, we expect it to land heads about half the time.

The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long-term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine-learning-oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same no matter which interpretation is adopted.
2.2 A brief review of probability theory
2.2.1 Basic concepts
We denote a random event by defining a random variable X.

Discrete random variable: X can take on any value from a finite or countably infinite set.

Continuous random variable: the value of X is real-valued.

2.2.1.1 CDF

F(x) ≜ P(X ≤ x) = { ∑_{u ≤ x} p(u)        (discrete)
                    ∫_{−∞}^x f(u) du      (continuous)    (2.1)

2.2.1.2 PMF and PDF

For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function, or PMF: a function that gives the probability that a discrete random variable is exactly equal to some value⁴. It satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.

For a continuous random variable, in the equation F(x) = ∫_{−∞}^x f(u) du, the function f(x) is called a probability density function, or PDF: a function that describes the relative likelihood of the random variable taking on a given value⁵. It satisfies the properties f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1.
2.2.2 Multivariate random variables

2.2.2.1 Joint CDF

We denote the joint CDF by F(x,y) ≜ P(X ≤ x ∩ Y ≤ y) = P(X ≤ x, Y ≤ y):

F(x,y) ≜ P(X ≤ x, Y ≤ y) = { ∑_{u ≤ x, v ≤ y} p(u,v)              (discrete)
                             ∫_{−∞}^x ∫_{−∞}^y f(u,v) du dv      (continuous)    (2.2)

Product rule:

p(X,Y) = p(X|Y) p(Y)    (2.3)

Chain rule:

p(X_{1:N}) = p(X₁) p(X₂|X₁) p(X₃|X₂,X₁) ··· p(X_N|X_{1:N−1})    (2.4)

⁴ http://en.wikipedia.org/wiki/Probability_mass_function
⁵ http://en.wikipedia.org/wiki/Probability_density_function
2.2.2.2 Marginal distribution

Marginal CDF:

F_X(x) ≜ F(x, +∞) = { ∑_{x_i ≤ x} P(X = x_i) = ∑_{x_i ≤ x} ∑_{j=1}^{+∞} P(X = x_i, Y = y_j)   (discrete)
                      ∫_{−∞}^x f_X(u) du = ∫_{−∞}^x ∫_{−∞}^{+∞} f(u,v) dv du                 (continuous)    (2.5)

F_Y(y) ≜ F(+∞, y) = { ∑_{y_j ≤ y} P(Y = y_j) = ∑_{i=1}^{+∞} ∑_{y_j ≤ y} P(X = x_i, Y = y_j)   (discrete)
                      ∫_{−∞}^y f_Y(v) dv = ∫_{−∞}^{+∞} ∫_{−∞}^y f(u,v) dv du                 (continuous)    (2.6)

Marginal PMF and PDF:

P(X = x_i) = ∑_{j=1}^{+∞} P(X = x_i, Y = y_j)  (discrete);   f_X(x) = ∫_{−∞}^{+∞} f(x,y) dy  (continuous)    (2.7)

P(Y = y_j) = ∑_{i=1}^{+∞} P(X = x_i, Y = y_j)  (discrete);   f_Y(y) = ∫_{−∞}^{+∞} f(x,y) dx  (continuous)    (2.8)
2.2.2.3 Conditional distribution

Conditional PMF:

p(X = x_i | Y = y_j) = p(X = x_i, Y = y_j) / p(Y = y_j),  if p(Y = y_j) > 0    (2.9)

The pmf p(X|Y) is called the conditional probability.

Conditional PDF:

f_{X|Y}(x|y) = f(x,y) / f_Y(y)    (2.10)
2.2.3 Bayes rule

p(Y = y | X = x) = p(X = x, Y = y) / p(X = x)
                 = p(X = x | Y = y) p(Y = y) / ∑_{y′} p(X = x | Y = y′) p(Y = y′)    (2.11)
2.2.4 Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals:

X ⊥ Y ⟺ P(X,Y) = P(X)P(Y)    (2.12)

We say X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of conditional marginals:

X ⊥ Y | Z ⟺ P(X,Y|Z) = P(X|Z)P(Y|Z)    (2.13)
2.2.5 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by F⁻¹. If F is the cdf of X, then F⁻¹(α) is the value x_α such that P(X ≤ x_α) = α; this is called the α quantile of F. The value F⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left and half on the right. The values F⁻¹(0.25) and F⁻¹(0.75) are the lower and upper quartiles.
2.2.6 Mean and variance

The most familiar property of a distribution is its mean, or expected value, denoted by µ. For discrete rvs, it is defined as E[X] ≜ ∑_{x∈X} x p(x), and for continuous rvs, it is defined as E[X] ≜ ∫_X x p(x) dx. If this integral is not finite, the mean is not defined (we will see some examples of this later).

The variance is a measure of the spread of a distribution, denoted by σ². It is defined as follows:

var[X] ≜ E[(X − µ)²]    (2.14)
       = ∫ (x − µ)² p(x) dx
       = ∫ x² p(x) dx + µ² ∫ p(x) dx − 2µ ∫ x p(x) dx
       = E[X²] − µ²    (2.15)

from which we derive the useful result

E[X²] = σ² + µ²    (2.16)

The standard deviation is defined as

std[X] ≜ √var[X]    (2.17)

This is useful since it has the same units as X itself.
2.3 Some common discrete distributions
In this section, we review some commonly used parametric distributions defined on discrete state spaces, both finite and countably infinite.
2.3.1 The Bernoulli and binomial distributions

Definition 2.1. Suppose we toss a coin only once. Let X ∈ {0,1} be a binary random variable, with probability of success, or heads, of θ. We say that X has a Bernoulli distribution. This is written as X ∼ Ber(θ), where the pmf is defined as

Ber(x|θ) ≜ θ^{I(x=1)} (1 − θ)^{I(x=0)}    (2.18)

Definition 2.2. Suppose we toss a coin n times. Let X ∈ {0,1,··· ,n} be the number of heads. If the probability of heads is θ, then we say X has a binomial distribution, written as X ∼ Bin(n,θ). The pmf is given by

Bin(k|n,θ) ≜ C(n,k) θᵏ (1 − θ)^{n−k}    (2.19)
2.3.2 The multinoulli and multinomial distributions

Definition 2.3. The Bernoulli distribution can be used to model the outcome of one coin toss. To model the outcome of tossing a K-sided die, let x = (I(x = 1),··· ,I(x = K)) ∈ {0,1}^K be a random vector (this is called dummy encoding or one-hot encoding); then we say X has a multinoulli distribution (or categorical distribution), written as X ∼ Cat(θ). The pmf is given by:

p(x) ≜ ∏_{k=1}^K θ_k^{I(x_k=1)}    (2.20)

Definition 2.4. Suppose we toss a K-sided die n times. Let x = (x₁,x₂,··· ,x_K) ∈ {0,1,··· ,n}^K be a random vector, where x_j is the number of times side j of the die occurs; then we say X has a multinomial distribution, written as X ∼ Mu(n,θ). The pmf is given by

p(x) ≜ C(n; x₁,··· ,x_K) ∏_{k=1}^K θ_k^{x_k}    (2.21)

where C(n; x₁,··· ,x_K) ≜ n! / (x₁! x₂! ··· x_K!)

The Bernoulli distribution is a special case of the binomial distribution with n = 1, and likewise the multinoulli is a special case of the multinomial with n = 1. See Table 2.1 for a summary.

Table 2.1: Summary of the multinomial and related distributions.

Name        | K | n | X
Bernoulli   | 1 | 1 | x ∈ {0,1}
Binomial    | 1 | - | x ∈ {0,1,··· ,n}
Multinoulli | - | 1 | x ∈ {0,1}^K, ∑_{k=1}^K x_k = 1
Multinomial | - | - | x ∈ {0,1,··· ,n}^K, ∑_{k=1}^K x_k = n
2.3.3 The Poisson distribution
Definition 2.5. We say that X ∈ {0,1,2,···} has a Poisson distribution with parameter λ > 0, written as X ∼ Poi(λ), if its pmf is

p(x|λ) = e^{−λ} λˣ / x!    (2.22)

The first term is just the normalization constant, required to ensure the distribution sums to 1.

The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents.
2.3.4 The empirical distribution
The empirical distribution function⁶, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. Let D = {x₁,x₂,··· ,x_N} be a sample set; it is defined as

F_N(x) ≜ (1/N) ∑_{i=1}^N I(x_i ≤ x)    (2.23)

⁶ http://en.wikipedia.org/wiki/Empirical_distribution_function
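A one-line NumPy sketch of Equation 2.23 (the helper name is ours):

```python
import numpy as np

def empirical_cdf(data, x):
    # F_N(x) = (1/N) * sum_i I(x_i <= x)
    return np.mean(np.asarray(data) <= x)

samples = np.array([1.2, 3.4, 0.5, 2.2, 5.0])
print(empirical_cdf(samples, 2.5))  # 0.6 (three of the five samples are <= 2.5)
```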
Table 2.2: Summary of the Bernoulli, binomial, multinoulli, multinomial and Poisson distributions.

Name        | Written as   | X                                | p(x)                                  | E[X] | var[X]
Bernoulli   | X ∼ Ber(θ)   | x ∈ {0,1}                        | θ^{I(x=1)}(1−θ)^{I(x=0)}              | θ    | θ(1−θ)
Binomial    | X ∼ Bin(n,θ) | x ∈ {0,1,··· ,n}                 | C(n,k) θᵏ(1−θ)^{n−k}                  | nθ   | nθ(1−θ)
Multinoulli | X ∼ Cat(θ)   | x ∈ {0,1}^K, ∑_k x_k = 1         | ∏_{k=1}^K θ_k^{I(x_k=1)}              | -    | -
Multinomial | X ∼ Mu(n,θ)  | x ∈ {0,1,··· ,n}^K, ∑_k x_k = n  | C(n; x₁,···,x_K) ∏_{k=1}^K θ_k^{x_k}  | -    | -
Poisson     | X ∼ Poi(λ)   | x ∈ {0,1,2,···}                  | e^{−λ} λˣ / x!                        | λ    | λ
2.4 Some common continuous distributions
In this section we present some commonly used univariate
(one-dimensional) continuous probability distributions.
2.4.1 Gaussian (normal) distribution
Table 2.3: Summary of the Gaussian distribution.

Written as   | f(x)                            | E[X] | mode | var[X]
X ∼ N(µ,σ²)  | (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}  | µ    | µ    | σ²
If X ∼ N(0,1), we say X follows a standard normal distribution.

The Gaussian distribution is the most widely used distribution in statistics. There are several reasons for this.

1. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
2. Second, the central limit theorem (Section TODO) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or noise.
3. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance, as we show in Section TODO; this makes it a good default choice in many cases.
4. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods, as we will see.
See (Jaynes 2003, ch 7) for a more extensive discussion
of why Gaussians are so widely used.
2.4.2 Student’s t-distribution
Table 2.4: Summary of Student's t-distribution.

Written as    | f(x)                                                              | E[X] | mode | var[X]
X ∼ T(µ,σ²,ν) | [Γ((ν+1)/2) / (√(νπ) σ Γ(ν/2))] [1 + (1/ν)((x−µ)/σ)²]^{−(ν+1)/2}  | µ    | µ    | νσ²/(ν−2)

where Γ(x) is the gamma function:

Γ(x) ≜ ∫₀^∞ t^{x−1} e^{−t} dt    (2.24)

Here µ is the mean, σ² > 0 is the scale parameter, and ν > 0 is called the degrees of freedom. See Figure 2.1 for some plots.

The variance is only defined if ν > 2. The mean is only defined if ν > 1.
As an illustration of the robustness of the Student distribution, consider Figure 2.2. We see that the Gaussian is affected a lot, whereas the Student distribution hardly changes. This is because the Student has heavier tails, at least for small ν (see Figure 2.1).

If ν = 1, this distribution is known as the Cauchy or Lorentz distribution. This is notable for having such heavy tails that the integral that defines the mean does not converge.

To ensure finite variance, we require ν > 2. It is common to use ν = 4, which gives good performance in a range of problems (Lange et al. 1989). For ν ≫ 5, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
Fig. 2.1: (a) The pdfs for N(0,1), T(0,1,1) and Lap(0, 1/√2). The mean is 0 and the variance is 1 for both the Gaussian and the Laplace. The mean and variance of the Student are undefined when ν = 1. (b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex). Nevertheless, both are unimodal.
Table 2.5: Summary of the Laplace distribution.

Written as   | f(x)                   | E[X] | mode | var[X]
X ∼ Lap(µ,b) | (1/(2b)) e^{−|x−µ|/b}  | µ    | µ    | 2b²
Fig. 2.2: Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a) No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the Gaussian is more affected by outliers than the Student and Laplace distributions.
2.4.3 The Laplace distribution

Here µ is a location parameter and b > 0 is a scale parameter; see Table 2.5 for a summary and Figure 2.1 for a plot.

Its robustness to outliers is illustrated in Figure 2.2. It also puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model, as we will see in Section TODO.
Table 2.6: Summary of the gamma distribution.

Written as  | X      | f(x)                       | E[X] | mode    | var[X]
X ∼ Ga(a,b) | x ∈ R⁺ | (bᵃ/Γ(a)) x^{a−1} e^{−xb}  | a/b  | (a−1)/b | a/b²
2.4.4 The gamma distribution

Here a > 0 is called the shape parameter and b > 0 is called the rate parameter. See Figure 2.3 for some plots.

Fig. 2.3: (a) Some Ga(a, b = 1) distributions. If a ≤ 1, the mode is at 0; otherwise the mode is > 0. As we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards. (b) An empirical pdf of some rainfall data, with a fitted gamma distribution superimposed.
2.4.5 The beta distribution
Here B(a,b) is the beta function,

B(a,b) ≜ Γ(a)Γ(b) / Γ(a+b)    (2.25)

See Figure 2.4 for plots of some beta distributions. We require a,b > 0 to ensure the distribution is integrable (i.e., to ensure B(a,b) exists). If a = b = 1, we get the uniform distribution. If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1; if a and b are both greater than 1, the distribution is unimodal.
Fig. 2.4: Some beta distributions.
2.4.6 Pareto distribution
The Pareto distribution is used to model the distribution of quantities that exhibit long tails, also called heavy tails.

As k → ∞, the distribution approaches δ(x − m). See Figure 2.5(a) for some plots. If we plot the distribution on a log-log scale, it forms a straight line of the form log p(x) = a log x + c for some constants a and c. See Figure 2.5(b) for an illustration (this is known as a power law).
Table 2.7: Summary of the beta distribution.

Written as    | X         | f(x)                           | E[X]    | mode          | var[X]
X ∼ Beta(a,b) | x ∈ [0,1] | (1/B(a,b)) x^{a−1}(1−x)^{b−1}  | a/(a+b) | (a−1)/(a+b−2) | ab/((a+b)²(a+b+1))
Table 2.8: Summary of the Pareto distribution.

Written as      | X     | f(x)                      | E[X]              | mode | var[X]
X ∼ Pareto(k,m) | x ≥ m | k mᵏ x^{−(k+1)} I(x ≥ m)  | km/(k−1) if k > 1 | m    | m²k/((k−1)²(k−2)) if k > 2
Fig. 2.5: (a) The Pareto distribution Pareto(x|m,k) for m = 1. (b) The pdf on a log-log scale.
2.5 Joint probability distributions
Given a multivariate random variable, or random vector⁷, X ∈ R^D, the joint probability distribution⁸ is a probability distribution that gives the probability that each of X₁,X₂,··· ,X_D falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables).
2.5.1 Covariance and correlation
Definition 2.6. The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related. Covariance is defined as

cov[X,Y] ≜ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]    (2.26)

Definition 2.7. If X is a D-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive definite matrix:

⁷ http://en.wikipedia.org/wiki/Multivariate_random_variable
⁸ http://en.wikipedia.org/wiki/Joint_probability_distribution
cov[X] ≜ E[(X − E[X])(X − E[X])ᵀ]    (2.27)

       = [ var[X₁]      cov[X₁,X₂]   ···  cov[X₁,X_D]
           cov[X₂,X₁]   var[X₂]      ···  cov[X₂,X_D]
           ⋮            ⋮            ⋱    ⋮
           cov[X_D,X₁]  cov[X_D,X₂]  ···  var[X_D]   ]    (2.28)
Definition 2.8. The (Pearson) correlation coefficient between X and Y is defined as

corr[X,Y] ≜ cov[X,Y] / √(var[X] var[Y])    (2.29)

A correlation matrix has the form

R ≜ [ corr[X₁,X₁]   corr[X₁,X₂]   ···  corr[X₁,X_D]
      corr[X₂,X₁]   corr[X₂,X₂]   ···  corr[X₂,X_D]
      ⋮             ⋮             ⋱    ⋮
      corr[X_D,X₁]  corr[X_D,X₂]  ···  corr[X_D,X_D] ]    (2.30)
The correlation coefficient can be viewed as a degree of linearity between X and Y; see Figure 2.6.

Uncorrelated does not imply independent. For example, let X ∼ U(−1,1) and Y = X². Clearly Y is dependent on X (in fact, Y is uniquely determined by X), yet one can show that corr[X,Y] = 0. Some striking examples of this fact are shown in Figure 2.6, which shows several data sets where there is clear dependence between X and Y, and yet the correlation coefficient is 0. A more general measure of dependence between random variables is mutual information, see Section TODO.
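The following NumPy sketch reproduces this example empirically (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)   # X ~ U(-1, 1)
y = x ** 2                             # Y is a deterministic function of X

# The sample correlation is near 0 despite perfect dependence
print(np.corrcoef(x, y)[0, 1])  # approximately 0
```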
2.5.2 Multivariate Gaussian distribution
The multivariate Gaussian, or multivariate normal (MVN), is the most widely used joint probability density function for continuous variables. We discuss MVNs in detail in Chapter 4; here we just give some definitions and plots.

The pdf of the MVN in D dimensions is defined by the following:

N(x|µ,Σ) ≜ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp[ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ]    (2.31)

where µ = E[X] ∈ R^D is the mean vector, and Σ = cov[X] is the D × D covariance matrix. The normalization constant (2π)^{D/2}|Σ|^{1/2} just ensures that the pdf integrates to 1.

Figure 2.7 plots some MVN densities in 2d for three different kinds of covariance matrices. A full covariance matrix has D(D+1)/2 parameters (we divide by 2 since Σ is symmetric). A diagonal covariance matrix has D parameters, and has 0s in the off-diagonal terms. A spherical or isotropic covariance, Σ = σ²I_D, has one free parameter.
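A direct NumPy transcription of Equation 2.31 (a sketch only; in practice a library routine such as scipy.stats' multivariate normal is preferable for numerical robustness):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) as written in Equation 2.31."""
    D = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    # (x - mu)^T Sigma^{-1} (x - mu), via a linear solve instead of an explicit inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm_const

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.3, -0.2]), mu, Sigma))
```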
2.5.3 Multivariate Student’s t-distribution
A more robust alternative to the MVN is the multivariate Student's t-distribution, whose pdf is given by

T(x|µ,Σ,ν) ≜ [Γ((ν+D)/2) / Γ(ν/2)] · (|Σ|^{−1/2} / (νπ)^{D/2}) · [1 + (1/ν)(x − µ)ᵀΣ⁻¹(x − µ)]^{−(ν+D)/2}    (2.32)

           = [Γ((ν+D)/2) / Γ(ν/2)] · (|Σ|^{−1/2} / (νπ)^{D/2}) · [1 + (x − µ)ᵀV⁻¹(x − µ)]^{−(ν+D)/2}    (2.33)

where Σ is called the scale matrix (since it is not exactly the covariance matrix) and V = νΣ. This has fatter tails than a Gaussian. The smaller ν is, the fatter the tails. As ν → ∞, the distribution tends towards a Gaussian. The distribution has the following properties:

mean = µ,   mode = µ,   Cov = (ν/(ν−2)) Σ    (2.34)
2.5.4 Dirichlet distribution
A multivariate generalization of the beta distribution is the Dirichlet distribution, which has support over the probability simplex, defined by

S_K = { x : 0 ≤ x_k ≤ 1, ∑_{k=1}^K x_k = 1 }    (2.35)

The pdf is defined as follows:

Dir(x|α) ≜ (1/B(α)) ∏_{k=1}^K x_k^{α_k − 1} I(x ∈ S_K)    (2.36)

where B(α₁,α₂,··· ,α_K) is the natural generalization of the beta function to K variables:

B(α) ≜ ∏_{k=1}^K Γ(α_k) / Γ(α₀),  where α₀ ≜ ∑_{k=1}^K α_k    (2.37)
Fig. 2.6: Several sets of (x,y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero. Source: http://en.wikipedia.org/wiki/Correlation

Figure 2.8 shows some plots of the Dirichlet when K = 3, and Figure 2.9 shows some sampled probability vectors. We see that α₀ controls the strength of the distribution (how peaked it is), and the α_k control where the peak occurs. For example, Dir(1,1,1) is a uniform distribution, Dir(2,2,2) is a broad distribution centered at (1/3,1/3,1/3), and Dir(20,20,20) is a narrow distribution centered at (1/3,1/3,1/3). If α_k < 1 for all k, we get spikes at the corners of the simplex.

For future reference, the distribution has these properties:

E[x_k] = α_k/α₀,   mode[x_k] = (α_k − 1)/(α₀ − K),   var[x_k] = α_k(α₀ − α_k) / (α₀²(α₀ + 1))    (2.38)
2.6 Transformations of random variables
If x ∼ p() is some random variable and y = f(x), what is the distribution of y? This is the question we address in this section.

2.6.1 Linear transformations

Suppose g() is a linear function:

g(x) = Ax + b    (2.39)

First, for the mean, we have

E[y] = E[Ax + b] = A E[x] + b    (2.40)

This is called the linearity of expectation.

For the covariance, we have

cov[y] = cov[Ax + b] = A Σ Aᵀ    (2.41)

where Σ = cov[x].
2.6.2 General transformations
If X is a discrete rv, we can derive the pmf for Y by simply summing up the probability mass for all the x's such that g(x) = y:

p_Y(y) = ∑_{x : g(x) = y} p_X(x)    (2.42)

If X is continuous, we cannot use Equation 2.42, since p_X(x) is a density, not a pmf, and we cannot sum up densities. Instead, we work with cdfs, and write

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = ∫_{g(x) ≤ y} f_X(x) dx    (2.43)

We can derive the pdf of Y by differentiating the cdf:

f_Y(y) = f_X(x) |dx/dy|    (2.44)

This is called the change of variables formula. We leave the proof of this as an exercise.

For example, suppose X ∼ U(−1,1), and Y = X². Then p_Y(y) = ½ y^{−1/2}.
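We can check this result by Monte Carlo, as in the following sketch (the bin width and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)  # X ~ U(-1, 1)
y = x ** 2                              # Y = X^2

# The fraction of samples in a small bin around y0, divided by the bin
# width, should approach p_Y(y0) = 0.5 * y0 ** -0.5
y0, h = 0.25, 0.01
print(np.mean((y > y0 - h / 2) & (y < y0 + h / 2)) / h)  # about 1.0
print(0.5 * y0 ** -0.5)                                  # exactly 1.0
```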
Fig. 2.7: We show the level sets for 2d Gaussians. (a) A full covariance matrix has elliptical contours. (b) A diagonal covariance matrix is an axis-aligned ellipse. (c) A spherical covariance matrix has a circular shape. (d) Surface plot for the spherical Gaussian in (c).
Fig. 2.8: (a) The Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by the triangular surface. Points on this surface satisfy 0 ≤ θ_k ≤ 1 and ∑_{k=1}^K θ_k = 1. (b) Plot of the Dirichlet density when α = (2,2,2). (c) α = (20,2,2).
(a) α = (0.1,··· ,0.1). This results in very sparse
distributions, with many 0s.
(b) α = (1,··· ,1). This results in more uniform (and
dense) distributions.
Fig. 2.9: Samples from a 5-dimensional symmetric
Dirichlet distribution for different parameter values.
2.6.2.1 Multivariate change of variables *
Let f be a function f : Rⁿ → Rⁿ, and let y = f(x). Then its Jacobian matrix J is given by

J_{x→y} ≜ ∂y/∂x = [ ∂y₁/∂x₁  ···  ∂y₁/∂xₙ
                    ⋮         ⋱   ⋮
                    ∂yₙ/∂x₁  ···  ∂yₙ/∂xₙ ]    (2.45)

|det(J)| measures how much a unit cube changes in volume when we apply f.

If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping y → x:

p_y(y) = p_x(x) |det(∂x/∂y)| = p_x(x) |det(J_{y→x})|    (2.46)
2.6.3 Central limit theorem
Given N random variables X₁,X₂,··· ,X_N, each independent and identically distributed⁹ (iid for short) with the same mean µ and variance σ², then as N → ∞,

(∑_{i=1}^N X_i − Nµ) / (√N σ) ∼ N(0,1)    (2.47)

This can also be written as

(X̄ − µ) / (σ/√N) ∼ N(0,1),  where X̄ ≜ (1/N) ∑_{i=1}^N X_i    (2.48)

⁹ http://en.wikipedia.org/wiki/Independent_identically_distributed
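A quick NumPy demonstration of the theorem, using uniform variables (so µ = 0.5 and σ² = 1/12; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 50_000

# Each row is one experiment: the mean of N iid U(0,1) draws
xbar = rng.uniform(0, 1, size=(trials, N)).mean(axis=1)
z = (xbar - 0.5) / (np.sqrt(1 / 12) / np.sqrt(N))

# The standardized means should look approximately N(0, 1)
print(z.mean().round(2), z.std().round(2))  # approximately 0.0 and 1.0
```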
2.7 Monte Carlo approximation
In general, computing the distribution of a function of an rv using the change of variables formula can be difficult. One simple but powerful alternative is as follows. First we generate S samples from the distribution, call them x1, …, xS. (There are many ways to generate such samples; one popular method for high dimensional distributions is called Markov chain Monte Carlo or MCMC; this will be explained in Chapter TODO.) Given the samples, we can approximate the distribution of f(X) by using the empirical distribution of {f(xs)}_{s=1}^S. This is called a Monte Carlo approximation¹⁰, named after a city in Europe known for its plush gambling casinos.

We can use Monte Carlo to approximate the expected value of any function of a random variable. We simply draw samples, and then compute the arithmetic mean of the function applied to the samples. This can be written as follows:

\[ \mathbb{E}[g(X)] = \int g(x) p(x)\, dx \approx \frac{1}{S} \sum_{s=1}^{S} g(x_s) \tag{2.49} \]

where xs ∼ p(X).

This is called Monte Carlo integration¹¹, and has the advantage over numerical integration (which is based on evaluating the function at a fixed grid of points) that the function is only evaluated in places where there is non-negligible probability.
9 http://en.wikipedia.org/wiki/Independent_identically_distributed
10 http://en.wikipedia.org/wiki/Monte_Carlo_method
11 http://en.wikipedia.org/wiki/Monte_Carlo_integration
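A minimal sketch of Monte Carlo integration (numpy; the target g(x) = x² with X ∼ N(0, 1) is an arbitrary example whose exact answer is Var[X] = 1):

```python
import numpy as np

# Monte Carlo approximation of E[g(X)] for X ~ N(0, 1), g(x) = x**2.
rng = np.random.default_rng(0)
for S in [100, 10_000, 1_000_000]:
    xs = rng.standard_normal(S)
    print(S, np.mean(xs ** 2))   # approaches the exact value 1 as S grows
```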
2.8 Information theory
2.8.1 Entropy
The entropy of a random variable X with distribution p, denoted by H(X) or sometimes H(p), is a measure of its uncertainty. In particular, for a discrete variable with K states, it is defined by

\[ H(X) \triangleq -\sum_{k=1}^{K} p(X = k) \log_2 p(X = k) \tag{2.50} \]

Usually we use log base 2, in which case the units are called bits (short for binary digits). If we use log base e, the units are called nats.

The discrete distribution with maximum entropy is the uniform distribution (see Section XXX for a proof). Hence for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K; in this case, H(X) = log2 K.

Conversely, the distribution with minimum entropy (which is zero) is any delta function that puts all its mass on one state. Such a distribution has no uncertainty.
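A small helper illustrating Equation 2.50 and the two extreme cases above (a numpy sketch; the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (Equation 2.50)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # 0 log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))           # 1 bit: a fair coin
print(entropy([0.25] * 4))           # log2(4) = 2 bits: uniform over 4 states
print(entropy([1.0, 0.0, 0.0, 0.0])) # 0 bits: a delta function
```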
2.8.2 KL divergence
One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows:

\[ \mathrm{KL}(p\|q) \triangleq \sum_x p(x) \log_2 \frac{p(x)}{q(x)} \tag{2.51} \]

where the sum gets replaced by an integral for pdfs¹². The KL divergence is only defined if p and q both sum to 1 and if q(x) = 0 implies p(x) = 0 for all x (absolute continuity). If the quantity 0 ln 0 appears in the formula, it is interpreted as zero, because lim_{x→0} x ln x = 0. We can rewrite this as

\[ \mathrm{KL}(p\|q) = \sum_x p(x) \log_2 p(x) - \sum_x p(x) \log_2 q(x) = -H(p) + H(p, q) \tag{2.52} \]

where H(p, q) is called the cross entropy,

\[ H(p, q) \triangleq -\sum_x p(x) \log_2 q(x) \tag{2.53} \]

12 The KL divergence is not a distance, since it is asymmetric. One symmetric version of the KL divergence is the Jensen-Shannon divergence, defined as JS(p1, p2) = 0.5 KL(p1‖q) + 0.5 KL(p2‖q), where q = 0.5p1 + 0.5p2.
One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook. Hence the regular entropy H(p) = H(p, p), defined in Section 2.8.1, is the expected number of bits if we use the true model, so the KL divergence is the difference between these. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution q to encode the data instead of the true distribution p.

The "extra number of bits" interpretation should make it clear that KL(p‖q) ≥ 0, and that the KL is only equal to zero if q = p. We now give a proof of this important result.
Theorem 2.1. (Information inequality) KL(p||q) ≥
0 with equality iff p = q.
One important consequence of this result is that the
discrete distribution with the maximum entropy is the uni-
form distribution.
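A minimal numerical illustration of Equation 2.51 and Theorem 2.1 (a numpy sketch; the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """KL(p||q) in bits for discrete distributions (Equation 2.51).
    Assumes q[i] > 0 wherever p[i] > 0 (absolute continuity)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                 # 0 log 0 is taken to be 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.4, 0.4, 0.2])
q = np.array([1 / 3, 1 / 3, 1 / 3])
print(kl(p, q))  # > 0, per the information inequality
print(kl(p, p))  # = 0, since KL is zero iff the two distributions are equal
```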
2.8.3 Mutual information
Definition 2.9. Mutual information, or MI, is defined as follows:

\[ I(X;Y) \triangleq \mathrm{KL}\left( P(X,Y) \,\|\, P(X)P(Y) \right) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \tag{2.54} \]

We have I(X;Y) ≥ 0 with equality iff P(X,Y) = P(X)P(Y). That is, the MI is zero iff the variables are independent.
To gain insight into the meaning of MI, it helps to re-express it in terms of joint and conditional entropies. One can show that the above expression is equivalent to the following:

\[ I(X;Y) = H(X) - H(X|Y) \tag{2.55} \]
\[ = H(Y) - H(Y|X) \tag{2.56} \]
\[ = H(X) + H(Y) - H(X,Y) \tag{2.57} \]
\[ = H(X,Y) - H(X|Y) - H(Y|X) \tag{2.58} \]

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y; see Fig. 2.10¹³.

13 http://en.wikipedia.org/wiki/Mutual_information
Fig. 2.10: Individual H(X),H(Y), joint H(X,Y), and
conditional entropies for a pair of correlated subsystems
X,Y with mutual information I(X;Y).
Intuitively, we can interpret the MI between X and Y as
the reduction in uncertainty about X after observing Y, or,
by symmetry, the reduction in uncertainty about Y after
observing X.
A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as

\[ \mathrm{PMI}(x, y) \triangleq \log \frac{p(x,y)}{p(x)p(y)} \tag{2.59} \]

This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI of X and Y is just the expected value of the PMI. Interestingly, we can rewrite the PMI as follows:

\[ \mathrm{PMI}(x, y) = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)} \tag{2.60} \]

This is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).
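A small sketch computing MI directly from a joint pmf via Equation 2.54 (numpy; the two example joints, perfectly dependent bits and independent bits, are arbitrary):

```python
import numpy as np

def mutual_information(pxy):
    """MI in bits from a joint pmf given as a 2d array (Equation 2.54)."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # 0 log 0 is taken to be 0
    return np.sum(pxy[mask] * np.log2((pxy / (px * py))[mask]))

# Perfectly correlated bits: I(X;Y) = H(X) = 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
# Independent bits: the joint factorizes, so MI = 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```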
Chapter 3
Generative models for discrete data
3.1 Generative classifier
\[ p(y = c|x, \theta) = \frac{p(y = c|\theta)\, p(x|y = c, \theta)}{\sum_{c'} p(y = c'|\theta)\, p(x|y = c', \theta)} \tag{3.1} \]

This is called a generative classifier, since it specifies how to generate the data using the class conditional density p(x|y = c) and the class prior p(y = c). An alternative approach is to directly fit the class posterior, p(y = c|x); this is known as a discriminative classifier.
3.2 Bayesian concept learning
Psychological research has shown that people can learn
concepts from positive examples alone (Xu and Tenen-
baum 2007).
We can think of learning the meaning of a word as equivalent to concept learning, which in turn is equivalent to binary classification. To see this, define f(x) = 1 if x is an example of the concept C, and f(x) = 0 otherwise. Then the goal is to learn the indicator function f, which just defines which elements are in the set C.
3.2.1 Likelihood
\[ p(D|h) \triangleq \left( \frac{1}{\mathrm{size}(h)} \right)^{N} = \left( \frac{1}{|h|} \right)^{N} \tag{3.2} \]

This crucial equation embodies what Tenenbaum calls the size principle, which means the model favours the simplest (smallest) hypothesis consistent with the data. This is more commonly known as Occam's razor¹⁴.
3.2.2 Prior
The prior is decided by humans, not machines, so it is subjective. The subjectivity of the prior is controversial. For example, a child and a math professor will reach different answers. In fact, they presumably not only have different priors, but also different hypothesis spaces. However, we can finesse that by defining the hypothesis space of the child and the math professor to be the same, and then setting the child's prior weight to be zero on certain advanced concepts. Thus there is no sharp distinction between the prior and the hypothesis space.

However, the prior is the mechanism by which background knowledge can be brought to bear on a problem. Without this, rapid learning (i.e., from small sample sizes) is impossible.

14 http://en.wikipedia.org/wiki/Occam%27s_razor
3.2.3 Posterior
The posterior is simply the likelihood times the prior, normalized:

\[ p(h|D) \triangleq \frac{p(D|h)\, p(h)}{\sum_{h' \in \mathcal{H}} p(D|h')\, p(h')} = \frac{\mathbb{I}(D \in h)\, p(h)}{\sum_{h' \in \mathcal{H}} \mathbb{I}(D \in h')\, p(h')} \tag{3.3} \]

where I(D ∈ h) is 1 iff (if and only if) all the data are in the extension of the hypothesis h.
In general, when we have enough data, the posterior p(h|D) becomes peaked on a single concept, namely the MAP estimate, i.e.,

\[ p(h|D) \to \delta_{\hat{h}^{MAP}}(h) \tag{3.4} \]

where ĥ^MAP is the posterior mode,

\[ \hat{h}^{MAP} \triangleq \arg\max_h p(h|D) = \arg\max_h p(D|h)\, p(h) = \arg\max_h \left[ \log p(D|h) + \log p(h) \right] \tag{3.5} \]
Since the likelihood term depends exponentially on N, and the prior stays constant, as we get more and more data, the MAP estimate converges towards the maximum likelihood estimate or MLE:

\[ \hat{h}^{MLE} \triangleq \arg\max_h p(D|h) = \arg\max_h \log p(D|h) \tag{3.6} \]

In other words, if we have enough data, we see that the data overwhelms the prior.
3.2.4 Posterior predictive distribution
The concept of posterior predictive distribution¹⁵ is normally used in a Bayesian context, where it makes use of the entire posterior distribution of the parameters given the observed data to yield a probability distribution over an interval rather than simply a point estimate.

\[ p(\tilde{x}|D) \triangleq \mathbb{E}_{h|D}\left[ p(\tilde{x}|h) \right] = \begin{cases} \sum_h p(\tilde{x}|h)\, p(h|D) \\ \int p(\tilde{x}|h)\, p(h|D)\, dh \end{cases} \tag{3.7} \]

This is just a weighted average of the predictions of each individual hypothesis and is called Bayes model averaging (Hoeting et al. 1999).
3.3 The beta-binomial model
3.3.1 Likelihood
Given X ∼ Bin(θ), the likelihood of D is given by
p(D|θ) = Bin(N1|N,θ) (3.8)
3.3.2 Prior
\[ \mathrm{Beta}(\theta|a, b) \propto \theta^{a-1}(1 - \theta)^{b-1} \tag{3.9} \]
The parameters of the prior are called hyper-
parameters.
3.3.3 Posterior
\[ p(\theta|D) \propto \mathrm{Bin}(N_1|N_1 + N_0, \theta)\, \mathrm{Beta}(\theta|a, b) = \mathrm{Beta}(\theta|N_1 + a, N_0 + b) \tag{3.10} \]

Note that updating the posterior sequentially is equivalent to updating in a single batch. To see this, suppose we have two data sets Da and Db with sufficient statistics N1a, N0a and N1b, N0b. Let N1 = N1a + N1b and N0 = N0a + N0b be the sufficient statistics of the combined datasets. In batch mode we have

\[ p(\theta|D_a, D_b) \propto p(D_b|\theta)\, p(\theta|D_a) \]
\[ = \mathrm{Bin}(N_1^b|\theta, N_1^b + N_0^b)\, \mathrm{Beta}(\theta|N_1^a + a, N_0^a + b) \quad \text{(combining Equations 3.10 and 2.19)} \]
\[ = \mathrm{Beta}(\theta|N_1^a + N_1^b + a, N_0^a + N_0^b + b) \]

15 http://en.wikipedia.org/wiki/Posterior_predictive_distribution
This makes Bayesian inference particularly well-suited
to online learning, as we will see later.
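A tiny sketch of this batch-vs-sequential equivalence (plain Python; the hyper-parameters and counts are arbitrary):

```python
# Sequential vs batch conjugate updating of a Beta(a, b) prior with Bernoulli data.
a, b = 2.0, 2.0                      # prior hyper-parameters
Da = dict(N1=3, N0=7)                # first batch: 3 heads, 7 tails
Db = dict(N1=5, N0=5)                # second batch

# Batch: condition on everything at once.
batch = (a + Da["N1"] + Db["N1"], b + Da["N0"] + Db["N0"])

# Sequential: yesterday's posterior is today's prior.
a1, b1 = a + Da["N1"], b + Da["N0"]
seq = (a1 + Db["N1"], b1 + Db["N0"])

print(batch, seq)                    # identical: (10.0, 14.0) (10.0, 14.0)
```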
3.3.3.1 Posterior mean and mode
From Table 2.7, the posterior mean is given by

\[ \bar{\theta} = \frac{a + N_1}{a + b + N} \tag{3.11} \]

The mode is given by

\[ \hat{\theta}_{MAP} = \frac{a + N_1 - 1}{a + b + N - 2} \tag{3.12} \]

If we use a uniform prior, then the MAP estimate reduces to the MLE,

\[ \hat{\theta}_{MLE} = \frac{N_1}{N} \tag{3.13} \]
We will now show that the posterior mean is a convex combination of the prior mean and the MLE, which captures the notion that the posterior is a compromise between what we previously believed and what the data is telling us.
3.3.3.2 Posterior variance
The mean and mode are point estimates, but it is useful to know how much we can trust them. The variance of the posterior is one way to measure this. The variance of the Beta posterior is given by

\[ \mathrm{var}(\theta|D) = \frac{(a + N_1)(b + N_0)}{(a + N_1 + b + N_0)^2 (a + N_1 + b + N_0 + 1)} \tag{3.14} \]

We can simplify this formidable expression in the case that N ≫ a, b, to get

\[ \mathrm{var}(\theta|D) \approx \frac{N_1 N_0}{N^3} = \frac{\hat{\theta}_{MLE}(1 - \hat{\theta}_{MLE})}{N} \tag{3.15} \]
3.3.4 Posterior predictive distribution
So far, we have been focusing on inference of the un-
known parameter(s). Let us now turn our attention to pre-
diction of future observable data.
Consider predicting the probability of heads in a single future trial under a Beta(a, b) posterior. We have

\[ p(\tilde{x}|D) = \int_0^1 p(\tilde{x}|\theta)\, p(\theta|D)\, d\theta = \int_0^1 \theta\, \mathrm{Beta}(\theta|a, b)\, d\theta = \mathbb{E}[\theta|D] = \frac{a}{a + b} \tag{3.16} \]
3.3.4.1 Overfitting and the black swan paradox
Let us now derive a simple Bayesian solution to the problem. We will use a uniform prior, so a = b = 1. In this case, plugging in the posterior mean gives Laplace's rule of succession

\[ p(\tilde{x}|D) = \frac{N_1 + 1}{N_1 + N_0 + 2} \tag{3.17} \]

This justifies the common practice of adding 1 to the empirical counts, normalizing, and then plugging them in, a technique known as add-one smoothing. (Note that plugging in the MAP parameters would not have this smoothing effect, since the mode becomes the MLE if a = b = 1, see Section 3.3.3.1.)
3.3.4.2 Predicting the outcome of multiple future
trials
Suppose now we were interested in predicting the number of heads, x̃, in M future trials. This is given by

\[ p(\tilde{x}|D) = \int_0^1 \mathrm{Bin}(\tilde{x}|M, \theta)\, \mathrm{Beta}(\theta|a, b)\, d\theta \tag{3.18} \]
\[ = \binom{M}{\tilde{x}} \frac{1}{B(a, b)} \int_0^1 \theta^{\tilde{x}} (1 - \theta)^{M - \tilde{x}}\, \theta^{a-1} (1 - \theta)^{b-1}\, d\theta \tag{3.19} \]

We recognize the integral as the normalization constant for a Beta(a + x̃, M − x̃ + b) distribution. Hence

\[ \int_0^1 \theta^{\tilde{x}} (1 - \theta)^{M - \tilde{x}}\, \theta^{a-1} (1 - \theta)^{b-1}\, d\theta = B(\tilde{x} + a, M - \tilde{x} + b) \tag{3.20} \]
Thus we find that the posterior predictive is given by the following, known as the (compound) beta-binomial distribution:

\[ Bb(x|a, b, M) \triangleq \binom{M}{x} \frac{B(x + a, M - x + b)}{B(a, b)} \tag{3.21} \]

This distribution has the following mean and variance:

\[ \mathrm{mean} = M \frac{a}{a + b}, \quad \mathrm{var} = \frac{Mab}{(a + b)^2} \cdot \frac{a + b + M}{a + b + 1} \tag{3.22} \]
This process is illustrated in Figure 3.1. We start with a Beta(2, 2) prior, and plot the posterior predictive density after seeing N1 = 3 heads and N0 = 17 tails. Figure 3.1(b) plots a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer tails, spreading its probability mass more widely, and is therefore less prone to overfitting and black-swan type paradoxes.
Fig. 3.1: (a) Posterior predictive distribution after seeing N1 = 3, N0 = 17. (b) Plug-in approximation based on the MAP estimate.
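A minimal sketch reproducing this comparison numerically (assuming scipy.stats is available, whose betabinom implements Equation 3.21; the numbers follow the Figure 3.1 setup, and M = 10 is chosen arbitrarily):

```python
import numpy as np
from scipy.stats import betabinom, binom

# Posterior predictive for the number of heads in M future trials (Equation 3.21)
# vs. a plug-in Binomial using the MAP estimate, after N1=3, N0=17 under a Beta(2,2) prior.
a, b, N1, N0, M = 2, 2, 3, 17, 10
ap, bp = a + N1, b + N0                    # posterior hyper-parameters
theta_map = (ap - 1) / (ap + bp - 2)       # posterior mode (Equation 3.12)

ks = np.arange(M + 1)
bayes = betabinom.pmf(ks, M, ap, bp)
plugin = binom.pmf(ks, M, theta_map)
print(bayes.round(3))    # heavier tails: probability mass spread more widely
print(plugin.round(3))
```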
3.4 The Dirichlet-multinomial model
In the previous section, we discussed how to infer the probability that a coin comes up heads. In this section, we generalize these results to infer the probability that a die with K sides comes up as face k.
3.4.1 Likelihood
Suppose we observe N dice rolls, D = {x1, x2, …, xN}, where xi ∈ {1, 2, …, K}. The likelihood has the form

\[ p(D|\theta) = \binom{N}{N_1 \cdots N_K} \prod_{k=1}^{K} \theta_k^{N_k}, \quad \text{where } N_k = \sum_{i=1}^{N} \mathbb{I}(x_i = k) \tag{3.23} \]

almost the same as Equation 2.21.
3.4.2 Prior
\[ \mathrm{Dir}(\theta|\alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, \mathbb{I}(\theta \in S_K) \tag{3.24} \]
3.4.3 Posterior
\[ p(\theta|D) \propto p(D|\theta)\, p(\theta) \tag{3.25} \]
\[ \propto \prod_{k=1}^{K} \theta_k^{N_k}\, \theta_k^{\alpha_k - 1} = \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1} \tag{3.26} \]
\[ = \mathrm{Dir}(\theta|\alpha_1 + N_1, \cdots, \alpha_K + N_K) \tag{3.27} \]
From Equation 2.38, the MAP estimate is given by

\[ \hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K} \tag{3.28} \]

If we use a uniform prior, αk = 1, we recover the MLE:

\[ \hat{\theta}_k = \frac{N_k}{N} \tag{3.29} \]
3.4.4 Posterior predictive distribution
The posterior predictive distribution for a single multinoulli trial is given by the following expression:

\[ p(X = j|D) = \int p(X = j|\theta)\, p(\theta|D)\, d\theta \tag{3.30} \]
\[ = \int p(X = j|\theta_j) \left[ \int p(\theta_{-j}, \theta_j|D)\, d\theta_{-j} \right] d\theta_j \tag{3.31} \]
\[ = \int \theta_j\, p(\theta_j|D)\, d\theta_j = \mathbb{E}[\theta_j|D] = \frac{\alpha_j + N_j}{\alpha_0 + N} \tag{3.32} \]

where θ−j are all the components of θ except θj.
The above expression avoids the zero-count problem.
In fact, this form of Bayesian smoothing is even more im-
portant in the multinomial case than the binary case, since
the likelihood of data sparsity increases once we start par-
titioning the data into many categories.
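A minimal sketch of Equation 3.32 and the zero-count problem it avoids (numpy; the prior and counts are arbitrary):

```python
import numpy as np

# Posterior predictive for a K-sided die under a Dir(alpha) prior (Equation 3.32):
# p(X = j | D) = (alpha_j + N_j) / (alpha_0 + N).
alpha = np.ones(6)                     # uniform Dirichlet prior, K = 6
counts = np.array([5, 0, 3, 1, 0, 1])  # observed rolls; two faces never seen

pred = (alpha + counts) / (alpha.sum() + counts.sum())
mle = counts / counts.sum()
print(pred.round(3))  # Bayesian smoothing: every face gets nonzero mass
print(mle.round(3))   # the MLE assigns zero probability to unseen faces
```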
3.5 Naive Bayes classifiers
Assume the features are conditionally independent given the class label; then the class conditional density has the following form:

\[ p(x|y = c, \theta) = \prod_{j=1}^{D} p(x_j|y = c, \theta_{jc}) \tag{3.33} \]

The resulting model is called a naive Bayes classifier (NBC).
The form of the class-conditional density depends on
the type of each feature. We give some possibilities below:
• In the case of real-valued features, we can use the Gaussian distribution: p(x|y = c, θ) = ∏_{j=1}^D N(xj|µjc, σ²jc), where µjc is the mean of feature j in objects of class c, and σ²jc is its variance.
• In the case of binary features, xj ∈ {0, 1}, we can use the Bernoulli distribution: p(x|y = c, θ) = ∏_{j=1}^D Ber(xj|µjc), where µjc is the probability that feature j occurs in class c. This is sometimes called the multivariate Bernoulli naive Bayes model. We will see an application of this below.
• In the case of categorical features, xj ∈ {aj1, aj2, …, ajSj}, we can use the multinoulli distribution: p(x|y = c, θ) = ∏_{j=1}^D Cat(xj|µjc), where µjc is a histogram over the Sj possible values for xj in class c.
Obviously we can handle other kinds of features, or
use different distributional assumptions. Also, it is easy to
mix and match features of different types.
3.5.1 Optimization
We now discuss how to train a naive Bayes classifier. This
usually means computing the MLE or the MAP estimate
for the parameters. However, we will also discuss how to
compute the full posterior, p(θ|D).
3.5.1.1 MLE for NBC
The probability for a single data case is given by

\[ p(x_i, y_i|\theta) = p(y_i|\pi) \prod_j p(x_{ij}|\theta_j) = \prod_c \pi_c^{\mathbb{I}(y_i = c)} \prod_j \prod_c p(x_{ij}|\theta_{jc})^{\mathbb{I}(y_i = c)} \tag{3.34} \]
Hence the log-likelihood is given by

\[ \log p(D|\theta) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{j=1}^{D} \sum_{c=1}^{C} \sum_{i: y_i = c} \log p(x_{ij}|\theta_{jc}) \tag{3.35} \]

where Nc ≜ ∑i I(yi = c) is the number of feature vectors in class c.
We see that this expression decomposes into a series
of terms, one concerning π, and DC terms containing the
θjcs. Hence we can optimize all these parameters sepa-
rately.
From Equation 3.29, the MLE for the class prior is given by

\[ \hat{\pi}_c = \frac{N_c}{N} \tag{3.36} \]
The MLE for θjcs depends on the type of distribution
we choose to use for each feature.
In the case of binary features, xj ∈ {0, 1}, xj|y = c ∼ Ber(θjc), hence

\[ \hat{\theta}_{jc} = \frac{N_{jc}}{N_c} \tag{3.37} \]

where Njc ≜ ∑i I(xij = 1, yi = c) is the number of times feature j occurs in class c.
In the case of categorical features, xj ∈ {aj1, aj2, …, ajSj}, xj|y = c ∼ Cat(θjc), hence

\[ \hat{\theta}_{jc} = \left( \frac{N_{j1c}}{N_c}, \frac{N_{j2c}}{N_c}, \cdots, \frac{N_{jS_jc}}{N_c} \right)^{T} \tag{3.38} \]

where Njkc ≜ ∑_{i=1}^N I(xij = ajk, yi = c) is the number of times feature j takes value ajk in class c.
3.5.1.2 Bayesian naive Bayes
Use a Dir(α) prior for π.
In the case of binary features, use a Beta(β0,β1) prior
for each θjc; in the case of categorical features, use a
Dir(α) prior for each θjc. Often we just take α = 1 and
β = 1, corresponding to add-one or Laplace smoothing.
3.5.2 Using the model for prediction
The goal is to compute

\[ y = f(x) = \arg\max_c p(y = c|x, \theta) = \arg\max_c p(y = c|\theta) \prod_{j=1}^{D} p(x_j|y = c, \theta) \tag{3.39} \]

We can estimate the parameters using MLE or MAP; the posterior predictive density is then obtained by simply plugging in the parameters θ̄ (MLE) or θ̂ (MAP). Or we can use BMA, and just integrate out the unknown parameters.
3.5.3 The log-sum-exp trick
When using generative classifiers of any kind, computing the posterior over class labels using Equation 3.1 can fail due to numerical underflow. The problem is that p(x|y = c) is often a very small number, especially if x is a high-dimensional vector. This is because we require that ∑x p(x|y) = 1, so the probability of observing any particular high-dimensional vector is small. The obvious solution is to take logs when applying Bayes rule, as follows:

\[ \log p(y = c|x, \theta) = b_c - \log\left( \sum_{c'} e^{b_{c'}} \right) \tag{3.40} \]

where bc ≜ log p(x|y = c, θ) + log p(y = c|θ).
We can factor out the largest term, and just represent the remaining numbers relative to that. For example,

\[ \log(e^{-120} + e^{-121}) = \log\left( e^{-120}(1 + e^{-1}) \right) = \log(1 + e^{-1}) - 120 \tag{3.41} \]

In general, we have

\[ \log \sum_c e^{b_c} = \log\left[ \left( \sum_c e^{b_c - B} \right) e^{B} \right] = \log\left( \sum_c e^{b_c - B} \right) + B \tag{3.42} \]
where B ≜ max{bc}.
This is called the log-sum-exp trick, and is widely
used.
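A minimal sketch of the trick (numpy; it reproduces the numbers from Equation 3.41):

```python
import numpy as np

def log_sum_exp(b):
    """Compute log(sum(exp(b))) stably (Equation 3.42)."""
    B = np.max(b)                           # factor out the largest term
    return np.log(np.sum(np.exp(b - B))) + B

b = np.array([-120.0, -121.0])
print(log_sum_exp(b))              # ~ -119.6867, matching Equation 3.41
print(np.log(np.sum(np.exp(b))))   # also fine here, but the naive form
                                   # underflows to -inf once the b_c drop
                                   # below roughly -745 in float64
```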
3.5.4 Feature selection using mutual
information
Since an NBC is fitting a joint distribution over potentially many features, it can suffer from overfitting. In addition, the run-time cost is O(D), which may be too high for some applications.

One common approach to tackling both of these problems is to perform feature selection, to remove irrelevant features that do not help much with the classification problem. The simplest approach to feature selection is to evaluate the relevance of each feature separately, and then take the top K, where K is chosen based on some tradeoff between accuracy and complexity. This approach is known as variable ranking, filtering, or screening.
One way to measure relevance is to use the mutual information (Section 2.8.3) between feature Xj and the class label Y:

\[ I(X_j; Y) = \sum_{x_j} \sum_{y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)} \tag{3.43} \]

If the features are binary, it is easy to show that the MI can be computed as follows:

\[ I_j = \sum_c \left[ \theta_{jc} \pi_c \log \frac{\theta_{jc}}{\theta_j} + (1 - \theta_{jc}) \pi_c \log \frac{1 - \theta_{jc}}{1 - \theta_j} \right] \tag{3.44} \]

where πc = p(y = c), θjc = p(xj = 1|y = c), and θj = p(xj = 1) = ∑c πc θjc.
3.5.5 Classifying documents using bag of
words
Document classification is the problem of classifying
text documents into different categories.
3.5.5.1 Bernoulli product model
One simple approach is to represent each document as a binary vector, which records whether each word is present or not, so xij = 1 iff word j occurs in document i, otherwise xij = 0. We can then use the following class conditional density:

\[ p(x_i|y_i = c, \theta) = \prod_{j=1}^{D} \mathrm{Ber}(x_{ij}|\theta_{jc}) = \prod_{j=1}^{D} \theta_{jc}^{x_{ij}} (1 - \theta_{jc})^{1 - x_{ij}} \tag{3.45} \]
This is called the Bernoulli product model, or the bi-
nary independence model.
3.5.5.2 Multinomial document classifier
However, ignoring the number of times each word occurs in a document loses some information (McCallum and Nigam 1998). A more accurate representation counts the number of occurrences of each word. Specifically, let xi be a vector of counts for document i, so xij ∈ {0, 1, …, Ni}, where Ni is the number of terms in document i (so ∑_{j=1}^D xij = Ni). For the class conditional densities, we can use a multinomial distribution:

\[ p(x_i|y_i = c, \theta) = \mathrm{Mu}(x_i|N_i, \theta_c) = \frac{N_i!}{\prod_{j=1}^{D} x_{ij}!} \prod_{j=1}^{D} \theta_{jc}^{x_{ij}} \tag{3.46} \]

where we have implicitly assumed that the document length Ni is independent of the class. Here θjc is the probability of generating word j in documents of class c; these parameters satisfy the constraint that ∑_{j=1}^D θjc = 1 for each class c.
Although the multinomial classifier is easy to train and easy to use at test time, it does not work particularly well for document classification. One reason for this is that it does not take into account the burstiness of word usage. This refers to the phenomenon that most words never appear in any given document, but if they do appear once, they are likely to appear more than once, i.e., words occur in bursts.

The multinomial model cannot capture the burstiness phenomenon. To see why, note that Equation 3.46 has the form θjc^{xij}, and since θjc ≪ 1 for rare words, it becomes increasingly unlikely to generate many of them. For more frequent words, the decay rate is not as fast. To see why intuitively, note that the most frequent words are function words which are not specific to the class, such as and, the, and but; the chance of the word and occurring is pretty much the same no matter how many times it has previously occurred (modulo document length), so the independence assumption is more reasonable for common words. However, since rare words are the ones that matter most for classification purposes, these are the ones we want to model the most carefully.
3.5.5.3 DCM model
Various ad hoc heuristics have been proposed to improve
the performance of the multinomial document classifier
(Rennie et al. 2003). We now present an alternative class
conditional density that performs as well as these ad hoc
methods, yet is probabilistically sound (Madsen et al.
2005).
Suppose we simply replace the multinomial class conditional density with the Dirichlet Compound Multinomial or DCM density, defined as follows:

\[ p(x_i|y_i = c, \alpha) = \int \mathrm{Mu}(x_i|N_i, \theta_c)\, \mathrm{Dir}(\theta_c|\alpha_c)\, d\theta_c = \frac{N_i!}{\prod_{j=1}^{D} x_{ij}!} \frac{B(x_i + \alpha_c)}{B(\alpha_c)} \tag{3.47} \]
(This equation is derived in Equation TODO.) Surprisingly this simple change is all that is needed to capture the burstiness phenomenon. The intuitive reason for this is as follows: After seeing one occurrence of a word, say word j, the posterior counts on θj get updated, making another occurrence of word j more likely. By contrast, if θj is fixed, then the occurrences of each word are independent. The multinomial model corresponds to drawing a ball from an urn with K colors of balls, recording its color, and then replacing it. By contrast, the DCM model corresponds to drawing a ball, recording its color, and then replacing it with one additional copy; this is called the Polya urn.
Using the DCM as the class conditional density gives
much better results than using the multinomial, and has
performance comparable to state of the art methods, as de-
scribed in (Madsen et al. 2005). The only disadvantage is
that fitting the DCM model is more complex; see (Minka
2000e; Elkan 2006) for the details.
Chapter 4
Gaussian Models
In this chapter, we discuss the multivariate Gaussian or multivariate normal (MVN), which is the most widely used joint probability density function for continuous variables. It will form the basis for many of the models we will encounter in later chapters.
4.1 Basics
Recall from Section 2.5.2 that the pdf for an MVN in D dimensions is defined by the following:

\[ \mathcal{N}(x|\mu, \Sigma) \triangleq \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right] \tag{4.1} \]
The expression inside the exponent is the Mahalanobis distance between a data vector x and the mean vector µ. We can gain a better understanding of this quantity by performing an eigendecomposition of Σ. That is, we write Σ = UΛUᵀ, where U is an orthonormal matrix of eigenvectors satisfying UᵀU = I, and Λ is a diagonal matrix of eigenvalues. Using the eigendecomposition, we have that

\[ \Sigma^{-1} = U^{-T} \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^{T} = \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^{T} \tag{4.2} \]
where ui is the i'th column of U, containing the i'th eigenvector. Hence we can rewrite the Mahalanobis distance as follows:

\[ (x - \mu)^{T} \Sigma^{-1} (x - \mu) = (x - \mu)^{T} \left( \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^{T} \right) (x - \mu) \tag{4.3} \]
\[ = \sum_{i=1}^{D} \frac{1}{\lambda_i} (x - \mu)^{T} u_i u_i^{T} (x - \mu) \tag{4.4} \]
\[ = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i} \tag{4.5} \]

where yi ≜ uiᵀ(x − µ). Recall that the equation for an ellipse in 2d is

\[ \frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1 \tag{4.6} \]
Hence we see that the contours of equal probability density of a Gaussian lie along ellipses. This is illustrated in Figure 4.1. The eigenvectors determine the orientation of the ellipse, and the eigenvalues determine how elongated it is.
Fig. 4.1: Visualization of a 2 dimensional Gaussian
density. The major and minor axes of the ellipse are
defined by the first two eigenvectors of the covariance
matrix, namely u1 and u2. Based on Figure 2.7 of
(Bishop 2006a)
In general, we see that the Mahalanobis distance corre-
sponds to Euclidean distance in a transformed coordinate
system, where we shift by µ and rotate by U.
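A quick numerical check of Equations 4.2–4.5 (a numpy sketch; Σ, µ, and the test point x are arbitrary):

```python
import numpy as np

# Mahalanobis distance via the eigendecomposition of Sigma (Equations 4.2-4.5).
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

lam, U = np.linalg.eigh(Sigma)      # Sigma = U diag(lam) U^T
y = U.T @ (x - mu)                  # rotated, centered coordinates y_i = u_i^T (x - mu)
d2_eig = np.sum(y ** 2 / lam)       # sum_i y_i^2 / lambda_i
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
print(d2_eig, d2_direct)            # the two values are identical
```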
4.1.1 MLE for a MVN
Theorem 4.1. (MLE for a MVN) If we have N iid sam-
ples xi ∼ N(µ,Σ), then the MLE for the parameters is
given by
\[ \bar{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i \triangleq \bar{x} \tag{4.7} \]
\[ \bar{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T} \tag{4.8} \]
\[ = \frac{1}{N} \left( \sum_{i=1}^{N} x_i x_i^{T} \right) - \bar{x}\bar{x}^{T} \tag{4.9} \]
4.1.2 Maximum entropy derivation of the
Gaussian *
In this section, we show that the multivariate Gaussian is the distribution with maximum entropy subject to having a specified mean and covariance (see also Section TODO). This is one reason the Gaussian is so widely used: the first two moments are usually all that we can reliably estimate from data, so we want a distribution that captures these properties, but otherwise makes as few additional assumptions as possible.

To simplify notation, we will assume the mean is zero. The pdf has the form

\[ f(x) = \frac{1}{Z} \exp\left( -\frac{1}{2} x^{T} \Sigma^{-1} x \right) \tag{4.10} \]
4.2 Gaussian discriminant analysis
One important application of MVNs is to define the class conditional densities in a generative classifier, i.e.,

\[ p(x|y = c, \theta) = \mathcal{N}(x|\mu_c, \Sigma_c) \tag{4.11} \]

The resulting technique is called (Gaussian) discriminant analysis or GDA (even though it is a generative, not discriminative, classifier; see Section TODO for more on this distinction). If Σc is diagonal, this is equivalent to naive Bayes.

We can classify a feature vector using the following decision rule, derived from Equation 3.1:

\[ y = \arg\max_c \left[ \log p(y = c|\pi) + \log p(x|\theta_c) \right] \tag{4.12} \]
When we compute the probability of x under each class conditional density, we are measuring the distance from x to the center of each class, µc, using Mahalanobis distance. This can be thought of as a nearest centroids classifier.

As an example, Figure 4.2 shows two Gaussian class-conditional densities in 2d, representing the height and weight of men and women. We can see that the features are correlated, as is to be expected (tall people tend to weigh more). The ellipses for each class contain 95% of the probability mass. If we have a uniform prior over classes, we can classify a new test vector as follows:

\[ y = \arg\min_c (x - \mu_c)^{T} \Sigma_c^{-1} (x - \mu_c) \tag{4.13} \]

Fig. 4.2: (a) Height/weight data. (b) Visualization of 2d Gaussians fit to each class. 95% of the probability mass is inside the ellipse.
4.2.1 Quadratic discriminant analysis
(QDA)
By plugging in the definition of the Gaussian density into Equation 3.1, we get

\[ p(y = c|x, \theta) = \frac{\pi_c |2\pi\Sigma_c|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_c)^{T} \Sigma_c^{-1} (x - \mu_c) \right]}{\sum_{c'} \pi_{c'} |2\pi\Sigma_{c'}|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_{c'})^{T} \Sigma_{c'}^{-1} (x - \mu_{c'}) \right]} \tag{4.14} \]
Thresholding this results in a quadratic function of x. The result is known as quadratic discriminant analysis (QDA). Figure 4.3 gives some examples of what the decision boundaries look like in 2D.
Fig. 4.3: Quadratic decision boundaries in 2D for the 2 and 3 class case.
4.2.2 Linear discriminant analysis (LDA)
We now consider a special case in which the covariance matrices are tied or shared across classes, Σc = Σ. In this case, we can simplify Equation 4.14 as follows:

\[ p(y = c|x, \theta) \propto \pi_c \exp\left( \mu_c^{T} \Sigma^{-1} x - \frac{1}{2} x^{T} \Sigma^{-1} x - \frac{1}{2} \mu_c^{T} \Sigma^{-1} \mu_c \right) \]
\[ = \exp\left( \mu_c^{T} \Sigma^{-1} x - \frac{1}{2} \mu_c^{T} \Sigma^{-1} \mu_c + \log \pi_c \right) \exp\left( -\frac{1}{2} x^{T} \Sigma^{-1} x \right) \]
\[ \propto \exp\left( \mu_c^{T} \Sigma^{-1} x - \frac{1}{2} \mu_c^{T} \Sigma^{-1} \mu_c + \log \pi_c \right) \tag{4.15} \]
Since the quadratic term xᵀΣ⁻¹x is independent of c, it will cancel out in the numerator and denominator. If we define

\[ \gamma_c \triangleq -\frac{1}{2} \mu_c^{T} \Sigma^{-1} \mu_c + \log \pi_c \tag{4.16} \]
\[ \beta_c \triangleq \Sigma^{-1} \mu_c \tag{4.17} \]

then we can write

\[ p(y = c|x, \theta) = \frac{e^{\beta_c^{T} x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^{T} x + \gamma_{c'}}} \triangleq \sigma(\eta, c) \tag{4.18} \]

where η ≜ (β1ᵀx + γ1, …, βCᵀx + γC), and σ() is the softmax activation function¹⁶, defined as follows:

\[ \sigma(q, i) \triangleq \frac{\exp(q_i)}{\sum_{j=1}^{n} \exp(q_j)} \tag{4.19} \]
When parameterized by some constant α > 0, the following formulation becomes a smooth, differentiable approximation of the maximum function:

\[ S_\alpha(x) = \frac{\sum_{j=1}^{D} x_j e^{\alpha x_j}}{\sum_{j=1}^{D} e^{\alpha x_j}} \tag{4.20} \]

Sα has the following properties:

1. Sα → max as α → ∞
2. S0 is the average of its inputs
3. Sα → min as α → −∞
Note that the softmax activation function comes from
the area of statistical physics, where it is common to use
the Boltzmann distribution, which has the same form as
the softmax activation function.
An interesting property of Equation 4.18 is that, if we
take logs, we end up with a linear function of x. (The
reason it is linear is because the xT Σ−1x cancels from the
numerator and denominator.) Thus the decision boundary
between any two classes, says c and c′, will be a straight
line. Hence this technique is called linear discriminant
analysis or LDA.
16 http://en.wikipedia.org/wiki/Softmax_activation_function
An alternative to fitting an LDA model and then de-
riving the class posterior is to directly fit p(y|x,W ) =
Cat(y|W x) for some C × D weight matrix W . This is
called multi-class logistic regression, or multinomial lo-
gistic regression. We will discuss this model in detail
in Section TODO. The difference between the two ap-
proaches is explained in Section TODO.
4.2.3 Two-class LDA
To gain further insight into the meaning of these equations, let us consider the binary case. In this case, the posterior is given by

\[ p(y = 1|x, \theta) = \frac{e^{\beta_1^{T} x + \gamma_1}}{e^{\beta_0^{T} x + \gamma_0} + e^{\beta_1^{T} x + \gamma_1}} \tag{4.21} \]
\[ = \frac{1}{1 + e^{(\beta_0 - \beta_1)^{T} x + (\gamma_0 - \gamma_1)}} \tag{4.22} \]
\[ = \mathrm{sigm}\left( (\beta_1 - \beta_0)^{T} x + (\gamma_1 - \gamma_0) \right) \tag{4.23} \]
where sigm(x) refers to the sigmoid function17.
Now

\[ \gamma_1 - \gamma_0 = -\frac{1}{2} \mu_1^{T} \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^{T} \Sigma^{-1} \mu_0 + \log(\pi_1/\pi_0) \tag{4.24} \]
\[ = -\frac{1}{2} (\mu_1 - \mu_0)^{T} \Sigma^{-1} (\mu_1 + \mu_0) + \log(\pi_1/\pi_0) \tag{4.25} \]
So if we define

\[ w = \beta_1 - \beta_0 = \Sigma^{-1} (\mu_1 - \mu_0) \tag{4.26} \]
\[ x_0 = \frac{1}{2} (\mu_1 + \mu_0) - (\mu_1 - \mu_0) \frac{\log(\pi_1/\pi_0)}{(\mu_1 - \mu_0)^{T} \Sigma^{-1} (\mu_1 - \mu_0)} \tag{4.27} \]

then we have wᵀx0 = −(γ1 − γ0), and hence

\[ p(y = 1|x, \theta) = \mathrm{sigm}\left( w^{T} (x - x_0) \right) \tag{4.28} \]
(This is closely related to logistic regression, which we
will discuss in Section TODO.) So the final decision rule
is as follows: shift x by x0, project onto the line w, and
see if the result is positive or negative.
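A minimal sketch of this decision rule, Equations 4.26–4.28 (numpy; the class means, shared covariance, priors, and test point are all arbitrary):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Two-class LDA decision rule with a shared covariance (Equations 4.26-4.28).
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi0, pi1 = 0.5, 0.5

Si = np.linalg.inv(Sigma)
w = Si @ (mu1 - mu0)                               # Equation 4.26
x0 = 0.5 * (mu1 + mu0) - (mu1 - mu0) * (
    np.log(pi1 / pi0) / ((mu1 - mu0) @ Si @ (mu1 - mu0)))  # Equation 4.27

x = np.array([1.5, 0.2])
print(sigm(w @ (x - x0)))  # p(y=1|x); classify as class 1 if > 0.5
```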
If Σ = σ²I, then w is in the direction of µ1 − µ0. So we classify the point based on whether its projection is closer to µ0 or µ1. This is illustrated in Figure 4.4. Furthermore, if π1 = π0, then x0 = ½(µ1 + µ0), which is halfway between the means. If we make π1 > π0, then x0 gets closer to µ0, so more of the line belongs to class 1 a priori. Conversely if π1 < π0, the boundary shifts right. Thus we see that the class prior, πc, just changes the decision threshold, and not the overall geometry, as we claimed above. (A similar argument applies in the multi-class case.)

Fig. 4.4: Geometry of LDA in the 2 class case where Σ1 = Σ2 = I.

17 http://en.wikipedia.org/wiki/Sigmoid_function
The magnitude of w determines the steepness of the logistic function, and depends on how well-separated the means are, relative to the variance. In psychology and signal detection theory, it is common to define the discriminability of a signal from the background noise using a quantity called d-prime:

\[ d' \triangleq \frac{\mu_1 - \mu_0}{\sigma} \tag{4.29} \]

where µ1 is the mean of the signal and µ0 is the mean of the noise, and σ is the standard deviation of the noise. If d′ is large, the signal will be easier to discriminate from the noise.
4.2.4 MLE for discriminant analysis
The log-likelihood function is as follows:

\[ \log p(D|\theta) = \sum_{c=1}^{C} \sum_{i: y_i = c} \log \pi_c + \sum_{c=1}^{C} \sum_{i: y_i = c} \log \mathcal{N}(x_i|\mu_c, \Sigma_c) \tag{4.30} \]
The MLE for each parameter is as follows:

\[ \hat{\pi}_c = \frac{N_c}{N} \tag{4.31} \]
\[ \bar{\mu}_c = \frac{1}{N_c} \sum_{i: y_i = c} x_i \tag{4.32} \]
\[ \bar{\Sigma}_c = \frac{1}{N_c} \sum_{i: y_i = c} (x_i - \bar{\mu}_c)(x_i - \bar{\mu}_c)^{T} \tag{4.33} \]
4.2.5 Strategies for preventing overfitting
The speed and simplicity of the MLE method is one of its
greatest appeals. However, the MLE can badly overfit in
high dimensions. In particular, the MLE for a full covari-
ance matrix is singular if Nc < D. And even when Nc > D,
the MLE can be ill-conditioned, meaning it is close to sin-
gular. There are several possible solutions to this problem:
• Use a diagonal covariance matrix for each class, which
assumes the features are conditionally independent;
this is equivalent to using a naive Bayes classifier (Sec-
tion 3.5).
• Use a full covariance matrix, but force it to be the same for all classes, Σc = Σ. This is an example of parameter tying or parameter sharing, and is equivalent to LDA (Section 4.2.2).
• Use a diagonal covariance matrix and force it to be shared. This is called diagonal covariance LDA, and is discussed in Section TODO.
• Use a full covariance matrix, but impose a prior and
then integrate it out. If we use a conjugate prior, this
can be done in closed form, using the results from Sec-
tion TODO; this is analogous to the Bayesian naive
Bayes method in Section 3.5.1.2. See (Minka 2000f)
for details.
• Fit a full or diagonal covariance matrix by MAP estimation. We discuss two different kinds of prior below.
• Project the data into a low dimensional subspace and
fit the Gaussians there. See Section TODO for a way
to find the best (most discriminative) linear projection.
We discuss some of these options below.
4.2.6 Regularized LDA *
4.2.7 Diagonal LDA
4.2.8 Nearest shrunken centroids classifier *
One drawback of diagonal LDA is that it depends on all of
the features. In high dimensional problems, we might pre-
fer a method that only depends on a subset of the features,
for reasons of accuracy and interpretability. One approach
is to use a screening method, perhaps based on mutual in-
formation, as in Section 3.5.4. We now discuss another
approach to this problem known as the nearest shrunken
centroids classifier (Hastie et al. 2009, p652).
4.3 Inference in jointly Gaussian
distributions
Given a joint distribution, p(x1,x2), it is useful to be able
to compute marginals p(x1) and conditionals p(x1|x2).
We discuss how to do this below, and then give some ap-
plications. These operations take O(D3) time in the worst
case. See Section TODO for faster methods.
4.3.1 Statement of the result
Theorem 4.2. (Marginals and conditionals of an MVN). Suppose x = (x1, x2) is jointly Gaussian with parameters

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} \tag{4.34} \]

Then the marginals are given by

\[ p(x_1) = \mathcal{N}(x_1|\mu_1, \Sigma_{11}), \quad p(x_2) = \mathcal{N}(x_2|\mu_2, \Sigma_{22}) \tag{4.35} \]

and the posterior conditional is given by

\[ p(x_1|x_2) = \mathcal{N}(x_1|\mu_{1|2}, \Sigma_{1|2}) \]
\[ \mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (x_2 - \mu_2) = \Sigma_{1|2} \left( \Lambda_{11} \mu_1 - \Lambda_{12} (x_2 - \mu_2) \right) \]
\[ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \Lambda_{11}^{-1} \tag{4.36} \]
Equation 4.36 is of such crucial importance in this
book that we have put a box around it, so you can eas-
ily find it. For the proof, see Section TODO.
We see that both the marginal and conditional distribu-
tions are themselves Gaussian. For the marginals, we just
extract the rows and columns corresponding to x1 or x2.
For the conditional, we have to do a bit more work. How-
ever, it is not that complicated: the conditional mean is just
a linear function of x2, and the conditional covariance is
just a constant matrix that is independent of x2. We give
three different (but equivalent) expressions for the poste-
rior mean, and two different (but equivalent) expressions
for the posterior covariance; each one is useful in different
circumstances.
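A minimal numerical sketch of Equation 4.36 with scalar blocks (numpy; the parameters and the observed x2 are arbitrary):

```python
import numpy as np

# Conditioning a jointly Gaussian vector (Equation 4.36), scalar blocks x1, x2.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])
x2 = 2.5   # observed value of the second block

S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
mu_1g2 = mu[0] + S12 / S22 * (x2 - mu[1])   # linear in x2
Sigma_1g2 = S11 - S12 / S22 * S21           # constant, independent of x2
print(mu_1g2, Sigma_1g2)

# Same conditional covariance via the precision matrix: Sigma_{1|2} = Lambda_11^{-1}.
Lam = np.linalg.inv(Sigma)
print(1.0 / Lam[0, 0])                      # equals Sigma_1g2
```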
4.3.2 Examples
Below we give some examples of these equations in ac-
tion, which will make them seem more intuitive.
4.3.2.1 Marginals and conditionals of a 2d Gaussian
4.4 Linear Gaussian systems
Suppose we have two variables, x and y. Let x ∈ R^{Dx} be a hidden variable, and y ∈ R^{Dy} be a noisy observation of x. Let us assume we have the following prior and likelihood:

\[ p(x) = \mathcal{N}(x|\mu_x, \Sigma_x) \]
\[ p(y|x) = \mathcal{N}(y|Wx + \mu_y, \Sigma_y) \tag{4.37} \]
where W is a matrix of size Dy × Dx. This is an exam-
ple of a linear Gaussian system. We can represent this
schematically as x → y, meaning x generates y. In this
section, we show how to invert the arrow, that is, how to
infer x from y. We state the result below, then give sev-
eral examples, and finally we derive the result. We will see
many more applications of these results in later chapters.
4.4.1 Statement of the result
Theorem 4.3. (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in Equation 4.37, the posterior p(x|y) is given by the following:

\[ p(x|y) = \mathcal{N}(x|\mu_{x|y}, \Sigma_{x|y}) \]
\[ \Sigma_{x|y}^{-1} = \Sigma_x^{-1} + W^{T} \Sigma_y^{-1} W \]
\[ \mu_{x|y} = \Sigma_{x|y} \left[ W^{T} \Sigma_y^{-1} (y - \mu_y) + \Sigma_x^{-1} \mu_x \right] \tag{4.38} \]

In addition, the normalization constant p(y) is given by

\[ p(y) = \mathcal{N}(y|W\mu_x + \mu_y, \Sigma_y + W \Sigma_x W^{T}) \tag{4.39} \]
For the proof, see Section 4.4.3 TODO.
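A minimal sketch of Theorem 4.3 for a scalar sensor (numpy; all parameter values are arbitrary):

```python
import numpy as np

# Posterior in a linear Gaussian system (Equation 4.38): a noisy scalar
# sensor y = W x + noise observing a hidden scalar x.
Sigma_x = np.array([[4.0]])          # prior variance of x
Sigma_y = np.array([[1.0]])          # observation noise variance
W = np.array([[1.0]])
mu_x, mu_y = np.array([0.0]), np.array([0.0])
y = np.array([3.0])

Lam_post = np.linalg.inv(Sigma_x) + W.T @ np.linalg.inv(Sigma_y) @ W
Sigma_post = np.linalg.inv(Lam_post)
mu_post = Sigma_post @ (W.T @ np.linalg.inv(Sigma_y) @ (y - mu_y)
                        + np.linalg.inv(Sigma_x) @ mu_x)
print(mu_post, Sigma_post)   # the mean shrinks y toward the prior mean,
                             # and the posterior variance is below the prior's
```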
4.5 Digression: The Wishart distribution *
4.6 Inferring the parameters of an MVN
4.6.1 Posterior distribution of µ
4.6.2 Posterior distribution of Σ *
4.6.3 Posterior distribution of µ and Σ *
4.6.4 Sensor fusion with unknown precisions
*
Chapter 5
Bayesian statistics
5.1 Introduction
Using the posterior distribution to summarize everything
we know about a set of unknown variables is at the core
of Bayesian statistics. In this chapter, we discuss this ap-
proach to statistics in more detail.
5.2 Summarizing posterior distributions
The posterior p(θ|D) summarizes everything we know
about the unknown quantities θ. In this section, we dis-
cuss some simple quantities that can be derived from a
probability distribution, such as a posterior. These sum-
mary statistics are often easier to understand and visualize
than the full joint.
5.2.1 MAP estimation
We can easily compute a point estimate of an unknown quantity by computing the posterior mean, median or mode. In Section 5.7, we discuss how to use decision theory to choose between these methods. Typically the posterior mean or median is the most appropriate choice for a real-valued quantity, and the vector of posterior marginals is the best choice for a discrete quantity. However, the posterior mode, aka the MAP estimate, is the most popular choice because it reduces to an optimization problem, for which efficient algorithms often exist. Furthermore, MAP estimation can be interpreted in non-Bayesian terms, by thinking of the log prior as a regularizer (see Section TODO for more details).
Although this approach is computationally appealing, it is important to point out that there are various drawbacks to MAP estimation, which we briefly discuss below. This will provide motivation for the more thoroughly Bayesian approach which we will study later in this chapter (and elsewhere in this book).
5.2.1.1 No measure of uncertainty
The most obvious drawback of MAP estimation, and in-
deed of any other point estimate such as the posterior
mean or median, is that it does not provide any measure of
uncertainty. In many applications, it is important to know
how much one can trust a given estimate. We can derive
such confidence measures from the posterior, as we dis-
cuss in Section 5.2.2.
5.2.1.2 Plugging in the MAP estimate can result in
overfitting
If we don't model the uncertainty in our parameters, then our predictive distribution will be overconfident. Overconfidence in predictions is particularly problematic in situations where we may be risk averse; see Section 5.7 for details.
5.2.1.3 The mode is an untypical point
Choosing the mode as a summary of a posterior distribu-
tion is often a very poor choice, since the mode is usually
quite untypical of the distribution, unlike the mean or me-
dian. The basic problem is that the mode is a point of mea-
sure zero, whereas the mean and median take the volume
of the space into account. See Figure 5.1.
How should we summarize a posterior if the mode is not a good choice? The answer is to use decision theory, which we discuss in Section 5.7. The basic idea is to specify a loss function, where L(θ, θ̂) is the loss you incur if the truth is θ and your estimate is θ̂. If we use 0-1 loss, L(θ, θ̂) = I(θ ≠ θ̂) (see Section 1.2.2.1), then the optimal estimate is the posterior mode. 0-1 loss means you only get points if you make no errors, otherwise you get nothing: there is no partial credit under this loss function! For continuous-valued quantities, we often prefer to use squared error loss, L(θ, θ̂) = (θ − θ̂)²; the corresponding optimal estimator is then the posterior mean, as we show in Section 5.7. Or we can use a more robust loss function, L(θ, θ̂) = |θ − θ̂|, which gives rise to the posterior median.
 
When Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWhen Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWilly Marroquin (WillyDevNET)
 
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...Willy Marroquin (WillyDevNET)
 
Seven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor marketSeven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor marketWilly Marroquin (WillyDevNET)
 

More from Willy Marroquin (WillyDevNET) (20)

Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
World Economic Forum : The Global Risks Report 2024
World Economic Forum : The Global Risks Report 2024World Economic Forum : The Global Risks Report 2024
World Economic Forum : The Global Risks Report 2024
 
Language Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language ModelsLanguage Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language Models
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
Data and AI reference architecture
Data and AI reference architectureData and AI reference architecture
Data and AI reference architecture
 
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
 
An Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum ProcessorAn Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum Processor
 
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROSENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
 
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
 
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
 
Deep learning-approach
Deep learning-approachDeep learning-approach
Deep learning-approach
 
WEF new vision for education
WEF new vision for educationWEF new vision for education
WEF new vision for education
 
El futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionalesEl futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionales
 
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIALASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
 
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood DetectionDeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
 
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
 
When Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWhen Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI Experts
 
Microsoft AI Platform Whitepaper
Microsoft AI Platform WhitepaperMicrosoft AI Platform Whitepaper
Microsoft AI Platform Whitepaper
 
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Ad...
 
Seven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor marketSeven facts noncognitive skills education labor market
Seven facts noncognitive skills education labor market
 

Recently uploaded

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 

Recently uploaded (20)

Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 

Machine learning-cheat-sheet

  • 1. https://github.com/soulmachine/machine-learning-cheat-sheet soulmachine@gmail.com Machine Learning Cheat Sheet Classical equations, diagrams and tricks in machine learning February 12, 2015
  • 2. ii ©2013 soulmachine Except where otherwise noted, This document is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA3.0) license (http://creativecommons.org/licenses/by/3.0/).
  • 3. Preface This cheat sheet contains many classical equations and diagrams on machine learning, which will help you quickly recall knowledge and ideas in machine learning. This cheat sheet has three significant advantages: 1. Strong typed. Compared to programming languages, mathematical formulas are weakly typed. For example, X can be a set, a random variable, or a matrix. This causes difficulty in understanding the meaning of formulas. In this cheat sheet, I try my best to standardize symbols used, see section §. 2. More parentheses. In machine learning, authors are prone to omit parentheses, brackets and braces, this usually causes ambiguity in mathematical formulas. In this cheat sheet, I use parentheses(brackets and braces) at where they are needed, to make formulas easy to understand. 3. Less thinking jumps. In many books, authors are prone to omit some steps that are trivial in his option. But it often makes readers get lost in the middle way of derivation. At Tsinghua University, May 2013 soulmachine iii
  • 4.
2.5.1 Covariance and correlation . . . 9
2.5.2 Multivariate Gaussian distribution . . . 10
2.5.3 Multivariate Student's t-distribution . . . 10
2.5.4 Dirichlet distribution . . . 10
2.6 Transformations of random variables . . . 11
2.6.1 Linear transformations . . . 11
2.6.2 General transformations . . . 11
2.6.3 Central limit theorem . . . 13
2.7 Monte Carlo approximation . . . 13
2.8 Information theory . . . 14
2.8.1 Entropy . . . 14
2.8.2 KL divergence . . . 14
2.8.3 Mutual information . . . 14
3 Generative models for discrete data . . . 17
3.1 Generative classifier . . . 17
3.2 Bayesian concept learning . . . 17
3.2.1 Likelihood . . . 17
3.2.2 Prior . . . 17
3.2.3 Posterior . . . 17
3.2.4 Posterior predictive distribution . . . 18
3.3 The beta-binomial model . . . 18
3.3.1 Likelihood . . . 18
3.3.2 Prior . . . 18
3.3.3 Posterior . . . 18
3.3.4 Posterior predictive distribution . . . 19
3.4 The Dirichlet-multinomial model . . . 19
3.4.1 Likelihood . . . 20
3.4.2 Prior . . . 20
3.4.3 Posterior . . . 20
3.4.4 Posterior predictive distribution . . . 20
3.5 Naive Bayes classifiers . . . 20
3.5.1 Optimization . . . 21
3.5.2 Using the model for prediction . . . 21
3.5.3 The log-sum-exp trick . . . 21
3.5.4 Feature selection using mutual information . . . 22
3.5.5 Classifying documents using bag of words . . . 22
4 Gaussian Models . . . 25
4.1 Basics . . . 25
4.1.1 MLE for a MVN . . . 25
4.1.2 Maximum entropy derivation of the Gaussian * . . . 26
4.2 Gaussian discriminant analysis . . . 26
4.2.1 Quadratic discriminant analysis (QDA) . . . 26
4.2.2 Linear discriminant analysis (LDA) . . . 27
4.2.3 Two-class LDA . . . 28
4.2.4 MLE for discriminant analysis . . . 28
4.2.5 Strategies for preventing overfitting . . . 29
4.2.6 Regularized LDA * . . . 29
4.2.7 Diagonal LDA . . . 29
4.2.8 Nearest shrunken centroids classifier * . . . 29
4.3 Inference in jointly Gaussian distributions . . . 29
4.3.1 Statement of the result . . . 29
4.3.2 Examples . . . 30
4.4 Linear Gaussian systems . . . 30
4.4.1 Statement of the result . . . 30
4.5 Digression: The Wishart distribution * . . . 30
4.6 Inferring the parameters of an MVN . . . 30
4.6.1 Posterior distribution of µ . . . 30
4.6.2 Posterior distribution of Σ * . . . 30
4.6.3 Posterior distribution of µ and Σ * . . . 30
4.6.4 Sensor fusion with unknown precisions * . . . 30
5 Bayesian statistics . . . 31
5.1 Introduction . . . 31
5.2 Summarizing posterior distributions . . . 31
5.2.1 MAP estimation . . . 31
5.2.2 Credible intervals . . . 32
5.2.3 Inference for a difference in proportions . . . 33
5.3 Bayesian model selection . . . 33
5.3.1 Bayesian Occam's razor . . . 33
5.3.2 Computing the marginal likelihood (evidence) . . . 34
5.3.3 Bayes factors . . . 36
5.4 Priors . . . 36
5.4.1 Uninformative priors . . . 36
5.4.2 Robust priors . . . 36
5.4.3 Mixtures of conjugate priors . . . 36
5.5 Hierarchical Bayes . . . 36
5.6 Empirical Bayes . . . 36
5.7 Bayesian decision theory . . . 36
5.7.1 Bayes estimators for common loss functions . . . 37
5.7.2 The false positive vs false negative tradeoff . . . 38
6 Frequentist statistics . . . 39
6.1 Sampling distribution of an estimator . . . 39
6.1.1 Bootstrap . . . 39
6.1.2 Large sample theory for the MLE * . . . 39
6.2 Frequentist decision theory . . . 39
6.3 Desirable properties of estimators . . . 39
6.4 Empirical risk minimization . . . 39
6.4.1 Regularized risk minimization . . . 39
6.4.2 Structural risk minimization . . . 39
6.4.3 Estimating the risk using cross validation . . . 39
6.4.4 Upper bounding the risk using statistical learning theory * . . . 39
6.4.5 Surrogate loss functions . . . 39
6.5 Pathologies of frequentist statistics * . . . 39
7 Linear Regression . . . 41
7.1 Introduction . . . 41
7.2 Representation . . . 41
7.3 MLE . . . 41
7.3.1 OLS . . . 41
7.3.2 SGD . . . 42
7.4 Ridge regression (MAP) . . . 42
7.4.1 Basic idea . . . 43
7.4.2 Numerically stable computation * . . . 43
7.4.3 Connection with PCA * . . . 43
7.4.4 Regularization effects of big data . . . 43
7.5 Bayesian linear regression . . . 43
8 Logistic Regression . . . 45
8.1 Representation . . . 45
8.2 Optimization . . . 45
8.2.1 MLE . . . 45
8.2.2 MAP . . . 45
8.3 Multinomial logistic regression . . . 45
8.3.1 Representation . . . 45
8.3.2 MLE . . . 46
8.3.3 MAP . . . 46
8.4 Bayesian logistic regression . . . 46
8.4.1 Laplace approximation . . . 47
8.4.2 Derivation of the BIC . . . 47
8.4.3 Gaussian approximation for logistic regression . . . 47
8.4.4 Approximating the posterior predictive . . . 47
8.4.5 Residual analysis (outlier detection) * . . . 47
8.5 Online learning and stochastic optimization . . . 47
8.5.1 The perceptron algorithm . . . 47
8.6 Generative vs discriminative classifiers . . . 48
8.6.1 Pros and cons of each approach . . . 48
8.6.2 Dealing with missing data . . . 48
8.6.3 Fishers linear discriminant analysis (FLDA) * . . . 50
9 Generalized linear models and the exponential family . . . 51
9.1 The exponential family . . . 51
9.1.1 Definition . . . 51
9.1.2 Examples . . . 51
9.1.3 Log partition function . . . 52
9.1.4 MLE for the exponential family . . . 53
9.1.5 Bayes for the exponential family . . . 53
9.1.6 Maximum entropy derivation of the exponential family * . . . 53
9.2 Generalized linear models (GLMs) . . . 53
9.2.1 Basics . . . 53
9.3 Probit regression . . . 53
9.4 Multi-task learning . . . 53
10 Directed graphical models (Bayes nets) . . . 55
10.1 Introduction . . . 55
10.1.1 Chain rule . . . 55
10.1.2 Conditional independence . . . 55
10.1.3 Graphical models . . . 55
10.1.4 Directed graphical model . . . 55
10.2 Examples . . . 56
10.2.1 Naive Bayes classifiers . . . 56
10.2.2 Markov and hidden Markov models . . . 56
10.3 Inference . . . 56
10.4 Learning . . . 56
10.4.1 Learning from complete data . . . 56
10.4.2 Learning with missing and/or latent variables . . . 57
10.5 Conditional independence properties of DGMs . . . 57
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties) . . . 57
10.5.2 Other Markov properties of DGMs . . . 57
10.5.3 Markov blanket and full conditionals . . . 57
10.5.4 Multinoulli Learning . . . 57
10.6 Influence (decision) diagrams * . . . 57
11 Mixture models and the EM algorithm . . . 59
11.1 Latent variable models . . . 59
11.2 Mixture models . . . 59
11.2.1 Mixtures of Gaussians . . . 59
11.2.2 Mixtures of multinoullis . . . 60
11.2.3 Using mixture models for clustering . . . 60
11.2.4 Mixtures of experts . . . 60
11.3 Parameter estimation for mixture models . . . 60
11.3.1 Unidentifiability . . . 60
11.3.2 Computing a MAP estimate is non-convex . . . 60
11.4 The EM algorithm . . . 60
11.4.1 Introduction . . . 60
11.4.2 Basic idea . . . 62
11.4.3 EM for GMMs . . . 62
11.4.4 EM for K-means . . . 64
11.4.5 EM for mixture of experts . . . 64
11.4.6 EM for DGMs with hidden variables . . . 64
11.4.7 EM for the Student distribution * . . . 64
11.4.8 EM for probit regression * . . . 64
11.4.9 Derivation of the Q function . . . 64
11.4.10 Convergence of the EM Algorithm * . . . 65
11.4.11 Generalization of EM Algorithm * . . . 65
11.4.12 Online EM . . . 66
11.4.13 Other EM variants * . . . 66
11.5 Model selection for latent variable models . . . 66
11.5.1 Model selection for probabilistic models . . . 67
11.5.2 Model selection for non-probabilistic methods . . . 67
11.6 Fitting models with missing data . . . 67
11.6.1 EM for the MLE of an MVN with missing data . . . 67
12 Latent linear models . . . 69
12.1 Factor analysis . . . 69
12.1.1 FA is a low rank parameterization of an MVN . . . 69
12.1.2 Inference of the latent factors . . . 69
12.1.3 Unidentifiability . . . 70
12.1.4 Mixtures of factor analysers . . . 70
12.1.5 EM for factor analysis models . . . 71
12.1.6 Fitting FA models with missing data . . . 71
12.2 Principal components analysis (PCA) . . . 71
12.2.1 Classical PCA . . . 71
12.2.2 Singular value decomposition (SVD) . . . 72
12.2.3 Probabilistic PCA . . . 73
12.2.4 EM algorithm for PCA . . . 74
12.3 Choosing the number of latent dimensions . . . 74
12.3.1 Model selection for FA/PPCA . . . 74
12.3.2 Model selection for PCA . . . 74
12.4 PCA for categorical data . . . 74
12.5 PCA for paired and multi-view data . . . 75
12.5.1 Supervised PCA (latent factor regression) . . . 75
12.5.2 Discriminative supervised PCA . . . 75
12.5.3 Canonical correlation analysis . . . 75
12.6 Independent Component Analysis (ICA) . . . 75
12.6.1 Maximum likelihood estimation . . . 75
12.6.2 The FastICA algorithm . . . 76
12.6.3 Using EM . . . 76
12.6.4 Other estimation principles * . . . 76
13 Sparse linear models . . . 77
14 Kernels . . . 79
14.1 Introduction . . . 79
14.2 Kernel functions . . . 79
14.2.1 RBF kernels . . . 79
14.2.2 TF-IDF kernels . . . 79
14.2.3 Mercer (positive definite) kernels . . . 79
14.2.4 Linear kernels . . . 80
14.2.5 Matern kernels . . . 80
14.2.6 String kernels . . . 80
14.2.7 Pyramid match kernels . . . 81
14.2.8 Kernels derived from probabilistic generative models . . . 81
14.3 Using kernels inside GLMs . . . 81
14.3.1 Kernel machines . . . 81
14.3.2 L1VMs, RVMs, and other sparse vector machines . . . 81
14.4 The kernel trick . . . 81
14.4.1 Kernelized KNN . . . 82
14.4.2 Kernelized K-medoids clustering . . . 82
14.4.3 Kernelized ridge regression . . . 82
14.4.4 Kernel PCA . . . 83
14.5 Support vector machines (SVMs) . . . 83
14.5.1 SVMs for classification . . . 83
14.5.2 SVMs for regression . . . 84
14.5.3 Choosing C . . . 85
14.5.4 A probabilistic interpretation of SVMs . . . 85
14.5.5 Summary of key points . . . 85
14.6 Comparison of discriminative kernel methods . . . 86
14.7 Kernels for building generative models . . . 86
15 Gaussian processes . . . 87
15.1 Introduction . . . 87
15.2 GPs for regression . . . 87
15.3 GPs meet GLMs . . . 87
15.4 Connection with other methods . . . 87
15.5 GP latent variable model . . . 87
15.6 Approximation methods for large datasets . . . 87
16 Adaptive basis function models . . . 89
16.1 AdaBoost . . . 89
16.1.1 Representation . . . 89
16.1.2 Evaluation . . . 89
16.1.3 Optimization . . . 89
16.1.4 The upper bound of the training error of AdaBoost . . . 89
17 Hidden Markov Model . . . 91
17.1 Introduction . . . 91
17.2 Markov models . . . 91
18 State space models . . . 93
19 Undirected graphical models (Markov random fields) . . . 95
20 Exact inference for graphical models . . . 97
21 Variational inference . . . 99
22 More variational inference . . . 101
23 Monte Carlo inference . . . 103
24 Markov chain Monte Carlo (MCMC) inference . . . 105
24.1 Introduction . . . 105
24.2 Metropolis Hastings algorithm . . . 105
24.3 Gibbs sampling . . . 105
24.4 Speed and accuracy of MCMC . . . 105
24.5 Auxiliary variable MCMC * . . . 105
25 Clustering . . . 107
26 Graphical model structure learning . . . 109
27 Latent variable models for discrete data . . . 111
27.1 Introduction . . . 111
27.2 Distributed state LVMs for discrete data . . . 111
28 Deep learning . . . 113
A Optimization methods . . . 115
A.1 Convexity . . . 115
A.2 Gradient descent . . . 115
A.2.1 Stochastic gradient descent . . . 115
A.2.2 Batch gradient descent . . . 115
A.2.3 Line search . . . 115
A.2.4 Momentum term . . . 116
A.3 Lagrange duality . . . 116
A.3.1 Primal form . . . 116
A.3.2 Dual form . . . 116
A.4 Newton's method . . . 116
A.5 Quasi-Newton method . . . 116
A.5.1 DFP . . . 116
A.5.2 BFGS . . . 116
A.5.3 Broyden . . . 117
Glossary . . . 119
List of Contributors

Wei Zhang
PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R.CHINA, e-mail: zh3feng@gmail.com, has written the chapters on Naive Bayes and SVM.

Fei Pan
Master at Beijing University of Technology, Beijing, P.R.CHINA, e-mail: example@gmail.com, has written the chapters on KMeans and AdaBoost.

Yong Li
PhD candidate at the Institute of Automation of the Chinese Academy of Sciences (CASIA), Beijing, P.R.CHINA, e-mail: liyong3forever@gmail.com, has written the chapter on Logistic Regression.

Jiankou Li
PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R.CHINA, e-mail: lijiankoucoco@163.com, has written the chapter on BayesNet.
Notation

Introduction

It is very difficult to come up with a single, consistent notation to cover the wide variety of data, models and algorithms that we discuss. Furthermore, conventions differ between machine learning and statistics, and between different books and papers. Nevertheless, we have tried to be as consistent as possible. Below we summarize most of the notation used in this book, although individual sections may introduce new notation. Note also that the same symbol may have different meanings depending on the context, although we try to avoid this where possible.

General math notation

Symbol | Meaning
⌊x⌋ | Floor of x, i.e., round down to nearest integer
⌈x⌉ | Ceiling of x, i.e., round up to nearest integer
x⊗y | Convolution of x and y
x⊙y | Hadamard (elementwise) product of x and y
a∧b | Logical AND
a∨b | Logical OR
¬a | Logical NOT
I(x) | Indicator function, I(x) = 1 if x is true, else I(x) = 0
∞ | Infinity
→ | Tends towards, e.g., n → ∞
∝ | Proportional to, so y = ax can be written as y ∝ x
|x| | Absolute value
|S| | Size (cardinality) of a set
n! | Factorial function
∇ | Vector of first derivatives
∇² | Hessian matrix of second derivatives
≜ | Defined as
O(·) | Big-O: roughly means order of magnitude
ℝ | The real numbers
1 : n | Range (Matlab convention): 1 : n = 1, 2, ..., n
≈ | Approximately equal to
argmax_x f(x) | Argmax: the value x that maximizes f
B(a,b) | Beta function, B(a,b) = Γ(a)Γ(b)/Γ(a+b)
B(α) | Multivariate beta function, ∏_k Γ(α_k) / Γ(∑_k α_k)
(n k) | n choose k, equal to n!/(k!(n−k)!)
δ(x) | Dirac delta function, δ(x) = ∞ if x = 0, else δ(x) = 0
exp(x) | Exponential function e^x
Γ(x) | Gamma function, Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du
Ψ(x) | Digamma function, Ψ(x) = (d/dx) log Γ(x)
X | A set from which values are drawn (e.g., X = ℝ^D)

Linear algebra notation

We use boldface lower-case to denote vectors, such as x, and boldface upper-case to denote matrices, such as X. We denote entries in a matrix by non-bold upper case letters, such as Xij. Vectors are assumed to be column vectors, unless noted otherwise. We use (x1, ..., xD) to denote a column vector created by stacking D scalars. If we write X = (x1, ..., xn), where the left hand side is a matrix, we mean to stack the xi along the columns, creating a matrix.

Symbol | Meaning
X ≻ 0 | X is a positive definite matrix
tr(X) | Trace of a matrix
det(X) | Determinant of matrix X
|X| | Determinant of matrix X
X⁻¹ | Inverse of a matrix
X† | Pseudo-inverse of a matrix
Xᵀ | Transpose of a matrix
xᵀ | Transpose of a vector
diag(x) | Diagonal matrix made from vector x
diag(X) | Diagonal vector extracted from matrix X
I or I_d | Identity matrix of size d×d (ones on the diagonal, zeros off it)
1 or 1_d | Vector of ones (of length d)
0 or 0_d | Vector of zeros (of length d)
||x|| = ||x||₂ | Euclidean or ℓ2 norm, √(∑_{j=1}^d x_j²)
||x||₁ | ℓ1 norm, ∑_{j=1}^d |x_j|
X_{:,j} | j'th column of matrix
X_{i,:} | Transpose of i'th row of matrix (a column vector)
X_{i,j} | Element (i, j) of matrix X
x⊗y | Tensor product of x and y

Probability notation

We denote random and fixed scalars by lower case, random and fixed vectors by bold lower case, and random and fixed matrices by bold upper case. Occasionally we use non-bold upper case to denote scalar random variables. Also, we use p() for both discrete and continuous random variables.

Symbol | Meaning
X, Y | Random variable
P() | Probability of a random event
F() | Cumulative distribution function (CDF), also called distribution function
p(x) | Probability mass function (PMF)
f(x) | Probability density function (PDF)
F(x,y) | Joint CDF
p(x,y) | Joint PMF
f(x,y) | Joint PDF
p(X|Y) | Conditional PMF, also called conditional probability
f_{X|Y}(x|y) | Conditional PDF
X ⊥ Y | X is independent of Y
X ̸⊥ Y | X is not independent of Y
X ⊥ Y | Z | X is conditionally independent of Y given Z
X ̸⊥ Y | Z | X is not conditionally independent of Y given Z
X ∼ p | X is distributed according to distribution p
α | Parameters of a Beta or Dirichlet distribution
cov[X] | Covariance of X
E[X] | Expected value of X
E_q[X] | Expected value of X wrt distribution q
H(X) or H(p) | Entropy of distribution p(X)
I(X;Y) | Mutual information between X and Y
KL(p||q) | KL divergence from distribution p to q
ℓ(θ) | Log-likelihood function
L(θ, a) | Loss function for taking action a when the true state of nature is θ
λ | Precision (inverse variance), λ = 1/σ²
Λ | Precision matrix, Λ = Σ⁻¹
mode[X] | Most probable value of X
µ | Mean of a scalar distribution
µ | Mean of a multivariate distribution
Φ | CDF of standard normal
ϕ | PDF of standard normal
π | Multinomial parameter vector; stationary distribution of a Markov chain
ρ | Correlation coefficient
sigm(x) | Sigmoid (logistic) function, 1/(1+e^(−x))
σ² | Variance
Σ | Covariance matrix
var[x] | Variance of x
ν | Degrees of freedom parameter
Z | Normalization constant of a probability distribution

Machine learning/statistics notation

In general, we use upper case letters to denote constants, such as C, K, M, N, T, etc. We use lower case letters as dummy indexes of the appropriate range, such as c = 1 : C to index classes, i = 1 : M to index data cases, j = 1 : N to index input features, k = 1 : K to index states or clusters, t = 1 : T to index time, etc.

We use x to represent an observed data vector. In a supervised problem, we use y or y to represent the desired output label. We use z to represent a hidden variable. Sometimes we also use q to represent a hidden discrete variable.

Symbol | Meaning
C | Number of classes
D | Dimensionality of data vector (number of features)
N | Number of data cases
N_c | Number of examples of class c, N_c = ∑_{i=1}^N I(y_i = c)
R | Number of outputs (response variables)
D | Training data, D = {(x_i, y_i) | i = 1 : N}
D_test | Test data
X | Input space
Y | Output space
K | Number of states or dimensions of a variable (often latent)
k(x,y) | Kernel function
K | Kernel matrix
H | Hypothesis space
L | Loss function
J(θ) | Cost function
f(x) | Decision function
P(y|x) | TODO
λ | Strength of ℓ2 or ℓ1 regularizer
ϕ(x) | Basis function expansion of feature vector x
Φ | Basis function expansion of design matrix X
q() | Approximate or proposal distribution
Q(θ, θ_old) | Auxiliary function in EM
T | Length of a sequence
T(D) | Test statistic for data
T | Transition matrix of Markov chain
θ | Parameter vector
θ^(s) | s'th sample of parameter vector
θ̂ | Estimate (usually MLE or MAP) of θ
θ̂_MLE | Maximum likelihood estimate of θ
θ̂_MAP | MAP estimate of θ
θ̄ | Estimate (usually posterior mean) of θ
w | Vector of regression weights (called β in statistics)
b | Intercept (called ε in statistics)
W | Matrix of regression weights
x_ij | Component (i.e., feature) j of data case i, for i = 1 : N, j = 1 : D
x_i | Training case, i = 1 : N
X | Design matrix of size N × D
x̄ | Empirical mean, x̄ = (1/N) ∑_{i=1}^N x_i
x̃ | Future test case
x_* | Feature test case
y | Vector of all training labels, y = (y_1, ..., y_N)
z_ij | Latent component j for case i
Chapter 1
Introduction

1.1 Types of machine learning

- Supervised learning
  - Classification
  - Regression
- Unsupervised learning
  - Discovering clusters
  - Discovering latent factors
  - Discovering graph structure
  - Matrix completion

1.2 Three elements of a machine learning model

Model = Representation + Evaluation + Optimization¹

1.2.1 Representation

In supervised learning, a model must be represented as a conditional probability distribution P(y|x) (usually we call it a classifier) or a decision function f(x). The set of classifiers (or decision functions) is called the hypothesis space of the model. Choosing a representation for a model is tantamount to choosing the hypothesis space that it can possibly learn.

1.2.2 Evaluation

In the hypothesis space, an evaluation function (also called an objective function or risk function) is needed to distinguish good classifiers (or decision functions) from bad ones.

1.2.2.1 Loss function and risk function

Definition 1.1. In order to measure how well a function fits the training data, a loss function L : Y × Y → ℝ≥0 is defined. For training example (x_i, y_i), the loss of predicting the value y is L(y_i, y).

The following are some common loss functions:

1. 0-1 loss function: L(Y, f(X)) = I(Y ≠ f(X)), i.e., 1 if Y ≠ f(X) and 0 if Y = f(X)
2. Quadratic loss function: L(Y, f(X)) = (Y − f(X))²
3. Absolute loss function: L(Y, f(X)) = |Y − f(X)|
4. Logarithmic loss function: L(Y, P(Y|X)) = −log P(Y|X)

Definition 1.2. The risk of function f is defined as the expected loss of f:

R_exp(f) = E[L(Y, f(X))] = ∫ L(y, f(x)) P(x,y) dx dy    (1.1)

which is also called the expected loss or risk function.

Definition 1.3. The risk function R_exp(f) can be estimated from the training data as

R_emp(f) = (1/N) ∑_{i=1}^N L(y_i, f(x_i))    (1.2)

which is also called the empirical loss or empirical risk.

You can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet²:

1. They should approximate the actual loss you're trying to minimize. The standard loss function for classification is the zero-one loss (misclassification rate), and the losses used for training classifiers are approximations of it.
2. The loss function should work with your intended optimization algorithm. That is why the zero-one loss is not used directly: it doesn't work with gradient-based optimization methods, since it doesn't have a well-defined gradient (or even a subgradient, like the hinge loss for SVMs has). The main algorithm that optimizes the zero-one loss directly is the old perceptron algorithm (chapter §??).

¹ Domingos, P. A few useful things to know about machine learning. Commun. ACM 55(10):78-87 (2012).
² http://t.cn/zTrDxLO
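To make Definition 1.3 and the four losses above concrete, here is a minimal, self-contained Python sketch. The data, the 0.5 decision threshold for the 0-1 loss, and the clipping constant are made up for illustration; this is not code from the book.

```python
import math

# Hypothetical toy data: true labels y_i and predicted scores f(x_i) in [0, 1].
ys = [1.0, 0.0, 1.0, 1.0]
preds = [0.9, 0.2, 0.4, 1.0]

def zero_one(y, yhat):
    # 0-1 loss, thresholding the score at 0.5 to get a hard prediction
    return float(y != (1.0 if yhat >= 0.5 else 0.0))

def quadratic(y, yhat):
    # (Y - f(X))^2
    return (y - yhat) ** 2

def absolute(y, yhat):
    # |Y - f(X)|
    return abs(y - yhat)

def logarithmic(y, yhat):
    # -log P(Y|X), reading yhat as P(Y = 1 | X); clip to avoid log(0)
    p = yhat if y == 1.0 else 1.0 - yhat
    return -math.log(max(p, 1e-12))

def empirical_risk(loss, ys, preds):
    # Eq. (1.2): average loss over the training data
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

for loss in (zero_one, quadratic, absolute, logarithmic):
    print(loss.__name__, empirical_risk(loss, ys, preds))
```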
1.2.2.2 ERM and SRM

Definition 1.4. ERM (Empirical risk minimization)

min_{f∈F} R_emp(f) = min_{f∈F} (1/N) ∑_{i=1}^N L(y_i, f(x_i))    (1.3)

Definition 1.5. Structural risk

R_srm(f) = (1/N) ∑_{i=1}^N L(y_i, f(x_i)) + λJ(f)    (1.4)

Definition 1.6. SRM (Structural risk minimization)

min_{f∈F} R_srm(f) = min_{f∈F} [ (1/N) ∑_{i=1}^N L(y_i, f(x_i)) + λJ(f) ]    (1.5)

1.2.3 Optimization

Finally, we need a training algorithm (also called a learning algorithm) to search among the classifiers in the hypothesis space for the highest-scoring one. The choice of optimization technique is key to the efficiency of the model.

1.3 Some basic concepts

1.3.1 Parametric vs non-parametric models

1.3.2 A simple non-parametric classifier: K-nearest neighbours

1.3.2.1 Representation

y = f(x) = argmax_c ∑_{x_i ∈ N_k(x)} I(y_i = c)    (1.6)

where N_k(x) is the set of the k points that are closest to point x; that is, KNN predicts the class that occurs most often among the k nearest neighbours. A k-d tree is usually used to accelerate the search for the k nearest points.

1.3.2.2 Evaluation

No training is needed.

1.3.2.3 Optimization

No training is needed.

1.3.3 Overfitting

1.3.4 Cross validation

Definition 1.7. Cross validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set³.

Common types of cross-validation (a code sketch combining KNN with K-fold cross-validation follows section 1.3.5):

1. K-fold cross-validation. The original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
2. 2-fold cross-validation. Also called simple cross-validation or the holdout method. This is the simplest variation of k-fold cross-validation, with k = 2.
3. Leave-one-out cross-validation (LOOCV). Here k = M, the number of original samples.

³ http://en.wikipedia.org/wiki/Cross-validation_(statistics)

1.3.5 Model selection

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one? A natural approach is to compute the misclassification rate on the training set for each method.
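As a rough illustration of Eq. (1.6) and Definition 1.7 together, the sketch below picks K for a 1-D KNN classifier by K-fold cross-validation. The synthetic data (two Gaussian-distributed classes), the fold count, and the candidate K values are arbitrary choices for the example, not prescriptions from the text.

```python
import random
from collections import Counter

random.seed(0)

def knn_predict(train, x, k):
    # Eq. (1.6): majority vote among the k nearest training points (1-D inputs)
    neighbours = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    return Counter(y for _, y in neighbours).most_common(1)[0][0]

def kfold_error(data, k_neighbours, n_folds=5):
    # Definition 1.7: average misclassification rate over the held-out folds
    data = data[:]
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, val in enumerate(folds):
        train = [xy for j, fold in enumerate(folds) if j != i for xy in fold]
        wrong = sum(knn_predict(train, x, k_neighbours) != y for x, y in val)
        errors.append(wrong / len(val))
    return sum(errors) / n_folds

# Hypothetical two-class data: class 0 centred at 0, class 1 centred at 4.
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(4, 1), 1) for _ in range(50)]
for k in (1, 5, 15):
    print("K =", k, "CV error =", kfold_error(data, k))
```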
Chapter 2
Probability

2.1 Frequentists vs. Bayesians

What is probability? Consider the statement: the probability that a coin will land heads is 0.5.

One interpretation is called the frequentist interpretation. In this view, probabilities represent long run frequencies of events. For example, the above statement means that, if we flip the coin many times, we expect it to land heads about half the time.

The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine learning oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.

2.2 A brief review of probability theory

2.2.1 Basic concepts

We denote a random event by defining a random variable X.

Discrete random variable: X can take on any value from a finite or countably infinite set.

Continuous random variable: the value of X is real-valued.

2.2.1.1 CDF

F(x) ≜ P(X ≤ x) = ∑_{u≤x} p(u) (discrete), or ∫_{−∞}^{x} f(u) du (continuous)    (2.1)

2.2.1.2 PMF and PDF

For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function, or PMF: a function that gives the probability that a discrete random variable is exactly equal to some value⁴. It satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.

For a continuous variable, in the equation F(x) = ∫_{−∞}^{x} f(u) du, the function f(x) is called a probability density function, or PDF: a function that describes the relative likelihood of the random variable taking on a given value⁵. It satisfies the properties f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1.

2.2.2 Multivariate random variables

2.2.2.1 Joint CDF

We denote the joint CDF by F(x,y) ≜ P(X ≤ x ∩ Y ≤ y) = P(X ≤ x, Y ≤ y).

F(x,y) ≜ P(X ≤ x, Y ≤ y) = ∑_{u≤x, v≤y} p(u,v) (discrete), or ∫_{−∞}^{x} ∫_{−∞}^{y} f(u,v) dv du (continuous)    (2.2)

Product rule:

p(X,Y) = P(X|Y)P(Y)    (2.3)

⁴ http://en.wikipedia.org/wiki/Probability_mass_function
⁵ http://en.wikipedia.org/wiki/Probability_density_function
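Before moving on to the chain rule, here is a quick numerical sanity check of Eqs. (2.1)-(2.3) on a made-up discrete joint distribution. The table of probabilities is hypothetical, chosen only so the arithmetic is exact.

```python
from fractions import Fraction as F

# A hypothetical joint PMF p(x, y) over X in {0, 1, 2} and Y in {0, 1}.
p = {(0, 0): F(1, 12), (0, 1): F(2, 12), (1, 0): F(3, 12),
     (1, 1): F(2, 12), (2, 0): F(1, 12), (2, 1): F(3, 12)}
assert sum(p.values()) == 1                      # a PMF sums to 1

def cdf_x(x):
    # Eq. (2.1), discrete case: F(x) = sum of p(u, v) over cells with u <= x
    return sum(v for (u, _), v in p.items() if u <= x)

def p_y(y):
    # marginal P(Y = y), summing the joint over x
    return sum(v for (_, yy), v in p.items() if yy == y)

def p_x_given_y(x, y):
    # conditional P(X = x | Y = y)
    return p[(x, y)] / p_y(y)

print(cdf_x(1))                                  # F_X(1) = P(X <= 1) = 2/3
# Product rule, Eq. (2.3): p(x, y) = P(x | y) P(y) holds for every cell.
assert all(p[(x, y)] == p_x_given_y(x, y) * p_y(y) for (x, y) in p)
```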
Chain rule:

p(X_{1:N}) = p(X_1) p(X_2|X_1) p(X_3|X_2, X_1) ··· p(X_N|X_{1:N−1})    (2.4)

2.2.2.2 Marginal distribution

Marginal CDF:

F_X(x) ≜ F(x, +∞) = ∑_{x_i≤x} P(X = x_i) = ∑_{x_i≤x} ∑_{j=1}^{+∞} P(X = x_i, Y = y_j) (discrete), or ∫_{−∞}^{x} f_X(u) du = ∫_{−∞}^{x} ∫_{−∞}^{+∞} f(u,v) dv du (continuous)    (2.5)

F_Y(y) ≜ F(+∞, y) = ∑_{y_j≤y} P(Y = y_j) = ∑_{y_j≤y} ∑_{i=1}^{+∞} P(X = x_i, Y = y_j) (discrete), or ∫_{−∞}^{y} f_Y(v) dv = ∫_{−∞}^{y} ∫_{−∞}^{+∞} f(u,v) du dv (continuous)    (2.6)

Marginal PMF and PDF:

P(X = x_i) = ∑_{j=1}^{+∞} P(X = x_i, Y = y_j) (discrete);  f_X(x) = ∫_{−∞}^{+∞} f(x,y) dy (continuous)    (2.7)

P(Y = y_j) = ∑_{i=1}^{+∞} P(X = x_i, Y = y_j) (discrete);  f_Y(y) = ∫_{−∞}^{+∞} f(x,y) dx (continuous)    (2.8)

2.2.2.3 Conditional distribution

Conditional PMF:

p(X = x_i | Y = y_j) = p(X = x_i, Y = y_j) / p(Y = y_j), if p(Y = y_j) > 0    (2.9)

The PMF p(X|Y) is called the conditional probability.

Conditional PDF:

f_{X|Y}(x|y) = f(x,y) / f_Y(y)    (2.10)

2.2.3 Bayes rule

p(Y = y | X = x) = p(X = x, Y = y) / p(X = x) = p(X = x | Y = y) p(Y = y) / ∑_{y′} p(X = x | Y = y′) p(Y = y′)    (2.11)

2.2.4 Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals, i.e.,

X ⊥ Y ⇔ P(X,Y) = P(X)P(Y)    (2.12)

We say X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of conditional marginals:

X ⊥ Y | Z ⇔ P(X,Y|Z) = P(X|Z)P(Y|Z)    (2.13)

2.2.5 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by F⁻¹. If F is the cdf of X, then F⁻¹(α) is the value of x_α such that P(X ≤ x_α) = α; this is called the α quantile of F. The value F⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values F⁻¹(0.25) and F⁻¹(0.75) are the lower and upper quartiles.

2.2.6 Mean and variance

The most familiar property of a distribution is its mean, or expected value, denoted by µ. For discrete rvs, it is defined as E[X] ≜ ∑_{x∈X} x p(x), and for continuous rvs, it is defined as E[X] ≜ ∫_X x p(x) dx. If this integral is not finite, the mean is not defined (we will see some examples of this later).

The variance is a measure of the spread of a distribution, denoted by σ². This is defined as follows:

var[X] = E[(X − µ)²]    (2.14)
       = ∫ (x − µ)² p(x) dx
       = ∫ x² p(x) dx + µ² ∫ p(x) dx − 2µ ∫ x p(x) dx
       = E[X²] − µ²    (2.15)

from which we derive the useful result

E[X²] = σ² + µ²    (2.16)
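A small worked example of Bayes rule (2.11), with invented numbers for the prior and likelihoods (a diagnostic test with a 1% base rate, 90% sensitivity, and a 5% false-positive rate):

```python
# Hypothetical inputs to Eq. (2.11).
prior = {1: 0.01, 0: 0.99}        # P(Y = y): 1% base rate of the condition
likelihood = {1: 0.9, 0: 0.05}    # P(X = 1 | Y = y): sensitivity / false positives

# Denominator of (2.11): marginalize over y', as in Eq. (2.7).
evidence = sum(likelihood[y] * prior[y] for y in (0, 1))

posterior = likelihood[1] * prior[1] / evidence
print(posterior)  # P(Y = 1 | X = 1) is about 0.154: low despite a positive test
```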
std[X] \triangleq \sqrt{var[X]}   (2.17)

This is useful since it has the same units as X itself.

2.3 Some common discrete distributions

In this section, we review some commonly used parametric distributions defined on discrete state spaces, both finite and countably infinite.

2.3.1 The Bernoulli and binomial distributions

Definition 2.1. Suppose we toss a coin only once. Let X \in \{0,1\} be a binary random variable, with probability of success or heads \theta. We say that X has a Bernoulli distribution, written X \sim Ber(\theta), where the pmf is defined as

Ber(x|\theta) \triangleq \theta^{I(x=1)}(1-\theta)^{I(x=0)}   (2.18)

Definition 2.2. Suppose we toss a coin n times. Let X \in \{0,1,\cdots,n\} be the number of heads. If the probability of heads is \theta, then we say X has a binomial distribution, written X \sim Bin(n,\theta). The pmf is given by

Bin(k|n,\theta) \triangleq \binom{n}{k}\theta^k(1-\theta)^{n-k}   (2.19)

2.3.2 The multinoulli and multinomial distributions

Definition 2.3. The Bernoulli distribution can be used to model the outcome of a single coin toss. To model the outcome of tossing a K-sided die, let x = (I(x=1),\cdots,I(x=K)) \in \{0,1\}^K be a random vector (this is called dummy encoding or one-hot encoding); then we say X has a multinoulli distribution (or categorical distribution), written X \sim Cat(\theta). The pmf is given by

p(x) \triangleq \prod_{k=1}^{K}\theta_k^{I(x_k=1)}   (2.20)

Definition 2.4. Suppose we toss a K-sided die n times. Let x = (x_1,x_2,\cdots,x_K) \in \{0,1,\cdots,n\}^K be a random vector, where x_k is the number of times side k of the die occurs; then we say X has a multinomial distribution, written X \sim Mu(n,\theta). The pmf is given by

p(x) \triangleq \binom{n}{x_1 \cdots x_K}\prod_{k=1}^{K}\theta_k^{x_k}, \quad \text{where} \quad \binom{n}{x_1 \cdots x_K} \triangleq \frac{n!}{x_1!\,x_2!\cdots x_K!}   (2.21)

The Bernoulli distribution is just a special case of the binomial distribution with n = 1, and likewise the multinoulli is a special case of the multinomial. See Table 2.1 for a summary.

Table 2.1: Summary of the multinomial and related distributions.

Name        | K | n | X
Bernoulli   | 1 | 1 | x \in \{0,1\}
Binomial    | 1 | - | x \in \{0,1,\cdots,n\}
Multinoulli | - | 1 | x \in \{0,1\}^K, \sum_{k=1}^{K} x_k = 1
Multinomial | - | - | x \in \{0,1,\cdots,n\}^K, \sum_{k=1}^{K} x_k = n

2.3.3 The Poisson distribution

Definition 2.5. We say that X \in \{0,1,2,\cdots\} has a Poisson distribution with parameter \lambda > 0, written X \sim Poi(\lambda), if its pmf is

p(x|\lambda) = e^{-\lambda}\frac{\lambda^x}{x!}   (2.22)

The first term is just the normalization constant, required to ensure the distribution sums to 1.

The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents.

2.3.4 The empirical distribution

The empirical distribution function[6], or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. Let D = \{x_1,x_2,\cdots,x_N\} be a sample set; it is defined as

F_N(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} I(x_i \le x)   (2.23)

[6] http://en.wikipedia.org/wiki/Empirical_distribution_function
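As a minimal numerical sketch of Equation 2.23 (assuming NumPy; the function name ecdf is our own), the empirical cdf can be computed directly from a sample:

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF F_N(x) = (1/N) * sum_i I(data_i <= x)  (Equation 2.23)."""
    data = np.asarray(data)
    return np.mean(data <= x)

samples = np.random.default_rng(0).normal(size=1000)
print(ecdf(samples, 0.0))  # close to 0.5 for a standard normal sample
```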
Table 2.2: Summary of the Bernoulli, binomial, multinoulli, multinomial and Poisson distributions.

Name        | Written as            | X                                             | p(x) (or p(x))                                          | E[X]    | var[X]
Bernoulli   | X \sim Ber(\theta)    | x \in \{0,1\}                                 | \theta^{I(x=1)}(1-\theta)^{I(x=0)}                      | \theta  | \theta(1-\theta)
Binomial    | X \sim Bin(n,\theta)  | x \in \{0,1,\cdots,n\}                        | \binom{n}{k}\theta^k(1-\theta)^{n-k}                    | n\theta | n\theta(1-\theta)
Multinoulli | X \sim Cat(\theta)    | x \in \{0,1\}^K, \sum_{k=1}^{K} x_k = 1       | \prod_{k=1}^{K}\theta_k^{I(x_k=1)}                      | -       | -
Multinomial | X \sim Mu(n,\theta)   | x \in \{0,1,\cdots,n\}^K, \sum_{k=1}^{K} x_k = n | \binom{n}{x_1\cdots x_K}\prod_{k=1}^{K}\theta_k^{x_k} | -       | -
Poisson     | X \sim Poi(\lambda)   | x \in \{0,1,2,\cdots\}                        | e^{-\lambda}\lambda^x/x!                                | \lambda | \lambda

2.4 Some common continuous distributions

In this section we present some commonly used univariate (one-dimensional) continuous probability distributions.

2.4.1 Gaussian (normal) distribution

Table 2.3: Summary of the Gaussian distribution.

Written as             | f(x)                                                           | E[X] | mode | var[X]
X \sim N(\mu,\sigma^2) | \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2} | \mu  | \mu  | \sigma^2

If X \sim N(0,1), we say X follows a standard normal distribution.

The Gaussian distribution is the most widely used distribution in statistics. There are several reasons for this.

1. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
2. Second, the central limit theorem (Section TODO) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or noise.
3. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance, as we show in Section TODO; this makes it a good default choice in many cases.
4. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods, as we will see.

See (Jaynes 2003, ch 7) for a more extensive discussion of why Gaussians are so widely used.

2.4.2 Student's t-distribution

Table 2.4: Summary of Student's t-distribution.

Written as                 | f(x)                                                                                                                                  | E[X] | mode | var[X]
X \sim T(\mu,\sigma^2,\nu) | \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\sigma\,\Gamma(\frac{\nu}{2})}\left[1+\frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}} | \mu  | \mu  | \frac{\nu\sigma^2}{\nu-2}

where \Gamma(x) is the gamma function:

\Gamma(x) \triangleq \int_0^{\infty} t^{x-1}e^{-t}\,dt   (2.24)

\mu is the mean, \sigma^2 > 0 is the scale parameter, and \nu > 0 is called the degrees of freedom. See Figure 2.1 for some plots. The variance is only defined if \nu > 2. The mean is only defined if \nu > 1.

As an illustration of the robustness of the Student distribution, consider Figure 2.2. We see that the Gaussian is affected a lot, whereas the Student distribution hardly changes. This is because the Student has heavier tails, at least for small \nu (see Figure 2.1).

If \nu = 1, this distribution is known as the Cauchy or Lorentz distribution. This is notable for having such heavy tails that the integral that defines the mean does not converge.

To ensure finite variance, we require \nu > 2. It is common to use \nu = 4, which gives good performance in a range of problems (Lange et al. 1989). For \nu \gg 5, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
Fig. 2.1: (a) The pdfs for N(0,1), T(0,1,1) and Lap(0,1/\sqrt{2}). The mean is 0 and the variance is 1 for both the Gaussian and the Laplace. The mean and variance of the Student are undefined when \nu = 1. (b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex). Nevertheless, both are unimodal.

Table 2.5: Summary of the Laplace distribution.

Written as        | f(x)                                            | E[X] | mode | var[X]
X \sim Lap(\mu,b) | \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right) | \mu  | \mu  | 2b^2

Fig. 2.2: Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a) No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the Gaussian is more affected by outliers than the Student and Laplace distributions.

2.4.3 The Laplace distribution

Here \mu is a location parameter and b > 0 is a scale parameter. See Figure 2.1 for a plot. Its robustness to outliers is illustrated in Figure 2.2. It also puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model, as we will see in Section TODO.
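The robustness shown in Figure 2.2 is easy to reproduce numerically. Here is a minimal sketch (assuming SciPy; norm.fit and t.fit are its standard maximum-likelihood fitters, and fixing the degrees of freedom via fdf is a SciPy convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=30)
data = np.concatenate([clean, [8.0, 8.5, 9.0]])  # add a few outliers

mu_gauss, _ = stats.norm.fit(data)      # Gaussian MLE of the location
_, mu_t, _ = stats.t.fit(data, fdf=4)   # Student fit with nu fixed to 4

print(f"Gaussian location: {mu_gauss:.2f}")  # pulled toward the outliers
print(f"Student  location: {mu_t:.2f}")      # stays near 0, barely affected
```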
Table 2.6: Summary of the gamma distribution.

Written as     | X            | f(x)                                | E[X]        | mode            | var[X]
X \sim Ga(a,b) | x \in R^+    | \frac{b^a}{\Gamma(a)}x^{a-1}e^{-xb} | \frac{a}{b} | \frac{a-1}{b}   | \frac{a}{b^2}

2.4.4 The gamma distribution

Here a > 0 is called the shape parameter and b > 0 is called the rate parameter. See Figure 2.3 for some plots.

Fig. 2.3: (a) Some Ga(a, b = 1) distributions. If a \le 1, the mode is at 0, otherwise it is > 0. As we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards. (b) An empirical pdf of some rainfall data, with a fitted gamma distribution superimposed.

2.4.5 The beta distribution

Here B(a,b) is the beta function,

B(a,b) \triangleq \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}   (2.25)

See Figure 2.4 for plots of some beta distributions. We require a,b > 0 to ensure the distribution is integrable (i.e., to ensure B(a,b) exists). If a = b = 1, we get the uniform distribution. If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1; if a and b are both greater than 1, the distribution is unimodal.

Fig. 2.4: Some beta distributions.

2.4.6 Pareto distribution

The Pareto distribution is used to model the distribution of quantities that exhibit long tails, also called heavy tails. As k \to \infty, the distribution approaches \delta(x-m). See Figure 2.5(a) for some plots. If we plot the distribution on a log-log scale, it forms a straight line, of the form \log p(x) = a\log x + c for some constants a and c. See Figure 2.5(b) for an illustration (this is known as a power law).
Table 2.7: Summary of the beta distribution.

Name              | Written as       | X           | f(x)                               | E[X]          | mode              | var[X]
Beta distribution | X \sim Beta(a,b) | x \in [0,1] | \frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1} | \frac{a}{a+b} | \frac{a-1}{a+b-2} | \frac{ab}{(a+b)^2(a+b+1)}

Table 2.8: Summary of the Pareto distribution.

Name                | Written as         | X        | f(x)                          | E[X]                             | mode | var[X]
Pareto distribution | X \sim Pareto(k,m) | x \ge m  | k\,m^k x^{-(k+1)}\,I(x \ge m) | \frac{km}{k-1} \text{ if } k > 1 | m    | \frac{m^2 k}{(k-1)^2(k-2)} \text{ if } k > 2

Fig. 2.5: (a) The Pareto distribution Pareto(x|m,k) for m = 1. (b) The pdf on a log-log scale.

2.5 Joint probability distributions

Given a multivariate random variable or random vector[7] X \in R^D, the joint probability distribution[8] is a probability distribution that gives the probability that each of X_1, X_2, \cdots, X_D falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables).

2.5.1 Covariance and correlation

Definition 2.6. The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related. Covariance is defined as

cov[X,Y] \triangleq E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]   (2.26)

Definition 2.7. If X is a D-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive definite matrix:

[7] http://en.wikipedia.org/wiki/Multivariate_random_variable
[8] http://en.wikipedia.org/wiki/Joint_probability_distribution
cov[X] \triangleq E\left[(X-E[X])(X-E[X])^T\right]   (2.27)
= \begin{pmatrix}
var[X_1] & cov[X_1,X_2] & \cdots & cov[X_1,X_D] \\
cov[X_2,X_1] & var[X_2] & \cdots & cov[X_2,X_D] \\
\vdots & \vdots & \ddots & \vdots \\
cov[X_D,X_1] & cov[X_D,X_2] & \cdots & var[X_D]
\end{pmatrix}   (2.28)

Definition 2.8. The (Pearson) correlation coefficient between X and Y is defined as

corr[X,Y] \triangleq \frac{cov[X,Y]}{\sqrt{var[X]\,var[Y]}}   (2.29)

A correlation matrix has the form

R \triangleq \begin{pmatrix}
corr[X_1,X_1] & corr[X_1,X_2] & \cdots & corr[X_1,X_D] \\
corr[X_2,X_1] & corr[X_2,X_2] & \cdots & corr[X_2,X_D] \\
\vdots & \vdots & \ddots & \vdots \\
corr[X_D,X_1] & corr[X_D,X_2] & \cdots & corr[X_D,X_D]
\end{pmatrix}   (2.30)

The correlation coefficient can be viewed as a degree of linearity between X and Y; see Figure 2.6.

Uncorrelated does not imply independent. For example, let X \sim U(-1,1) and Y = X^2. Clearly Y is dependent on X (in fact, Y is uniquely determined by X), yet one can show that corr[X,Y] = 0. Some striking examples of this fact are shown in Figure 2.6, which shows several data sets where there is clear dependence between X and Y, and yet the correlation coefficient is 0. A more general measure of dependence between random variables is mutual information, see Section TODO.

2.5.2 Multivariate Gaussian distribution

The multivariate Gaussian or multivariate normal (MVN) is the most widely used joint probability density function for continuous variables. We discuss MVNs in detail in Chapter 4; here we just give some definitions and plots.

The pdf of the MVN in D dimensions is defined by the following:

N(x|\mu,\Sigma) \triangleq \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]   (2.31)

where \mu = E[X] \in R^D is the mean vector, and \Sigma = cov[X] is the D \times D covariance matrix. The normalization constant (2\pi)^{D/2}|\Sigma|^{1/2} just ensures that the pdf integrates to 1. Figure 2.7 plots some MVN densities in 2d for three different kinds of covariance matrices. A full covariance matrix has D(D+1)/2 parameters (we divide by 2 since \Sigma is symmetric). A diagonal covariance matrix has D parameters, and has 0s in the off-diagonal terms. A spherical or isotropic covariance, \Sigma = \sigma^2 I_D, has one free parameter.

2.5.3 Multivariate Student's t-distribution

A more robust alternative to the MVN is the multivariate Student's t-distribution, whose pdf is given by

T(x|\mu,\Sigma,\nu) \triangleq \frac{\Gamma(\frac{\nu+D}{2})}{\Gamma(\frac{\nu}{2})}\frac{|\Sigma|^{-1/2}}{(\nu\pi)^{D/2}}\left[1+\frac{1}{\nu}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-\frac{\nu+D}{2}}   (2.32)
= \frac{\Gamma(\frac{\nu+D}{2})}{\Gamma(\frac{\nu}{2})}\frac{|\Sigma|^{-1/2}}{(\nu\pi)^{D/2}}\left[1+(x-\mu)^T V^{-1}(x-\mu)\right]^{-\frac{\nu+D}{2}}   (2.33)

where \Sigma is called the scale matrix (since it is not exactly the covariance matrix) and V = \nu\Sigma. This has fatter tails than a Gaussian. The smaller \nu is, the fatter the tails. As \nu \to \infty, the distribution tends towards a Gaussian. The distribution has the following properties:

mean = \mu, \quad mode = \mu, \quad cov = \frac{\nu}{\nu-2}\Sigma   (2.34)

2.5.4 Dirichlet distribution

A multivariate generalization of the beta distribution is the Dirichlet distribution, which has support over the probability simplex, defined by

S_K = \left\{x : 0 \le x_k \le 1, \sum_{k=1}^{K} x_k = 1\right\}   (2.35)

The pdf is defined as follows:

Dir(x|\alpha) \triangleq \frac{1}{B(\alpha)}\prod_{k=1}^{K} x_k^{\alpha_k-1}\,I(x \in S_K)   (2.36)

where B(\alpha_1,\alpha_2,\cdots,\alpha_K) is the natural generalization of the beta function to K variables:

B(\alpha) \triangleq \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma(\alpha_0)}, \quad \text{where} \quad \alpha_0 \triangleq \sum_{k=1}^{K}\alpha_k   (2.37)

Figure 2.8 shows some plots of the Dirichlet when K = 3, and Figure 2.9 some sampled probability vectors. We see that \alpha_0 controls the strength of the distribution (how peaked it is), and the \alpha_k control where the peak occurs.
Fig. 2.6: Several sets of (x,y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero. Source: http://en.wikipedia.org/wiki/Correlation

For example, Dir(1,1,1) is a uniform distribution, Dir(2,2,2) is a broad distribution centered at (1/3,1/3,1/3), and Dir(20,20,20) is a narrow distribution centered at (1/3,1/3,1/3). If \alpha_k < 1 for all k, we get spikes at the corners of the simplex.

For future reference, the distribution has these properties:

E[x_k] = \frac{\alpha_k}{\alpha_0}, \quad mode[x_k] = \frac{\alpha_k-1}{\alpha_0-K}, \quad var[x_k] = \frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}   (2.38)

2.6 Transformations of random variables

If X \sim p(\cdot) is some random variable, and Y = f(X), what is the distribution of Y? This is the question we address in this section.

2.6.1 Linear transformations

Suppose g(\cdot) is a linear function:

g(x) = Ax + b   (2.39)

First, for the mean, we have

E[y] = E[Ax+b] = A\,E[x] + b   (2.40)

This is called the linearity of expectation. For the covariance, we have

cov[y] = cov[Ax+b] = A\Sigma A^T, \quad \text{where } \Sigma = cov[x]   (2.41)

2.6.2 General transformations

If X is a discrete rv, we can derive the pmf for Y by simply summing up the probability mass for all the xs such that f(x) = y:

p_Y(y) = \sum_{x:\,g(x)=y} p_X(x)   (2.42)

If X is continuous, we cannot use Equation 2.42, since p_X(x) is a density, not a pmf, and we cannot sum up densities. Instead, we work with cdfs, and write

F_Y(y) = P(Y \le y) = P(g(X) \le y) = \int_{g(x) \le y} f_X(x)\,dx   (2.43)

We can derive the pdf of Y by differentiating the cdf:

f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|   (2.44)

This is called the change of variables formula. We leave the proof of this as an exercise.

For example, suppose X \sim U(-1,1), and Y = X^2. Then p_Y(y) = \frac{1}{2}y^{-\frac{1}{2}}.
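A quick numerical sanity check of this example (a sketch assuming NumPy): the cdf of Y is F_Y(y) = P(X^2 \le y) = \sqrt{y}, whose derivative is the pdf \frac{1}{2}y^{-1/2} above, so the empirical cdf of squared uniform samples should match \sqrt{y}.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x**2

# Compare the empirical CDF of Y = X^2 against F_Y(y) = sqrt(y).
grid = np.linspace(0.01, 1.0, 50)
empirical = np.array([np.mean(y <= g) for g in grid])
print(np.max(np.abs(empirical - np.sqrt(grid))))  # small, order 1e-3
```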
Fig. 2.7: We show the level sets for 2d Gaussians. (a) A full covariance matrix has elliptical contours. (b) A diagonal covariance matrix is an axis-aligned ellipse. (c) A spherical covariance matrix has a circular shape. (d) Surface plot for the spherical Gaussian in (c).

Fig. 2.8: (a) The Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by the triangular surface. Points on this surface satisfy 0 \le \theta_k \le 1 and \sum_{k=1}^{K}\theta_k = 1. (b) Plot of the Dirichlet density when \alpha = (2,2,2). (c) \alpha = (20,2,2).
Fig. 2.9: Samples from a 5-dimensional symmetric Dirichlet distribution for different parameter values. (a) \alpha = (0.1,\cdots,0.1); this results in very sparse distributions, with many 0s. (b) \alpha = (1,\cdots,1); this results in more uniform (and dense) distributions.

2.6.2.1 Multivariate change of variables *

Let f be a function f: R^n \to R^n, and let y = f(x). Then its Jacobian matrix J is given by

J_{x \to y} \triangleq \frac{\partial y}{\partial x} \triangleq
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{pmatrix}   (2.45)

|det(J)| measures how much a unit cube changes in volume when we apply f.

If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping y \to x:

p_y(y) = p_x(x)\left|det\left(\frac{\partial x}{\partial y}\right)\right| = p_x(x)\,|det(J_{y \to x})|   (2.46)

2.6.3 Central limit theorem

Given N random variables X_1, X_2, \cdots, X_N, each independent and identically distributed[9] (iid for short) with the same mean \mu and variance \sigma^2, we have (approximately, for large N)

\frac{\sum_{i=1}^{N} X_i - N\mu}{\sqrt{N}\sigma} \sim N(0,1)   (2.47)

This can also be written as

\frac{\bar{X}-\mu}{\sigma/\sqrt{N}} \sim N(0,1), \quad \text{where} \quad \bar{X} \triangleq \frac{1}{N}\sum_{i=1}^{N} X_i   (2.48)

2.7 Monte Carlo approximation

In general, computing the distribution of a function of an rv using the change of variables formula can be difficult. One simple but powerful alternative is as follows. First we generate S samples from the distribution, call them x_1, \cdots, x_S. (There are many ways to generate such samples; one popular method, for high dimensional distributions, is called Markov chain Monte Carlo or MCMC; this will be explained in Chapter TODO.) Given the samples, we can approximate the distribution of f(X) by using the empirical distribution of \{f(x_s)\}_{s=1}^{S}. This is called a Monte Carlo approximation[10], named after a city in Europe known for its plush gambling casinos.

We can use Monte Carlo to approximate the expected value of any function of a random variable. We simply draw samples, and then compute the arithmetic mean of the function applied to the samples. This can be written as follows:

E[g(X)] = \int g(x)p(x)\,dx \approx \frac{1}{S}\sum_{s=1}^{S} g(x_s)   (2.49)

where x_s \sim p(X).

This is called Monte Carlo integration[11], and has the advantage over numerical integration (which is based on evaluating the function at a fixed grid of points) that the function is only evaluated in places where there is non-negligible probability.

[9] http://en.wikipedia.org/wiki/Independent_identically_distributed
[10] http://en.wikipedia.org/wiki/Monte_Carlo_method
[11] http://en.wikipedia.org/wiki/Monte_Carlo_integration
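As a minimal sketch of Equation 2.49 (assuming NumPy), we can approximate E[X^2] for X ~ N(0,1), whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000
xs = rng.normal(size=S)        # samples x_s ~ p(X) = N(0, 1)
estimate = np.mean(xs**2)      # (1/S) * sum_s g(x_s), with g(x) = x^2
print(estimate)                # close to the exact value E[X^2] = 1
```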
2.8 Information theory

2.8.1 Entropy

The entropy of a random variable X with distribution p, denoted by H(X) or sometimes H(p), is a measure of its uncertainty. In particular, for a discrete variable with K states, it is defined by

H(X) \triangleq -\sum_{k=1}^{K} p(X = k)\log_2 p(X = k)   (2.50)

Usually we use log base 2, in which case the units are called bits (short for binary digits). If we use log base e, the units are called nats.

The discrete distribution with maximum entropy is the uniform distribution (see Section XXX for a proof). Hence for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K; in this case, H(X) = \log_2 K. Conversely, the distribution with minimum entropy (which is zero) is any delta function that puts all its mass on one state. Such a distribution has no uncertainty.

2.8.2 KL divergence

One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows:

KL(p||q) \triangleq \sum_x p(x)\log_2\frac{p(x)}{q(x)}   (2.51)

where the sum gets replaced by an integral for pdfs[12]. The KL divergence is only defined if p and q both sum to 1 and if q(x) = 0 implies p(x) = 0 for all x (absolute continuity). If the quantity 0\ln 0 appears in the formula, it is interpreted as zero, because \lim_{x \to 0} x\ln x = 0.

We can rewrite this as

KL(p||q) = \sum_x p(x)\log_2 p(x) - \sum_x p(x)\log_2 q(x) = H(p,q) - H(p)   (2.52)

where H(p,q) is called the cross entropy,

H(p,q) \triangleq -\sum_x p(x)\log_2 q(x)   (2.53)

One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook. Hence the regular entropy H(p) = H(p,p), defined in Section 2.8.1, is the expected number of bits if we use the true model, so the KL divergence is the difference between these. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution q to encode the data instead of the true distribution p.

The "extra number of bits" interpretation should make it clear that KL(p||q) \ge 0, and that the KL is only equal to zero if q = p. This important result is stated as follows.

Theorem 2.1. (Information inequality) KL(p||q) \ge 0 with equality iff p = q.

One important consequence of this result is that the discrete distribution with the maximum entropy is the uniform distribution.

2.8.3 Mutual information

Definition 2.9. Mutual information or MI is defined as follows:

I(X;Y) \triangleq KL(P(X,Y)||P(X)P(Y)) = \sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)}   (2.54)

We have I(X;Y) \ge 0 with equality iff P(X,Y) = P(X)P(Y). That is, the MI is zero iff the variables are independent.

To gain insight into the meaning of MI, it helps to re-express it in terms of joint and conditional entropies. One can show that the above expression is equivalent to the following:

I(X;Y) = H(X) - H(X|Y)   (2.55)
= H(Y) - H(Y|X)   (2.56)
= H(X) + H(Y) - H(X,Y)   (2.57)
= H(X,Y) - H(X|Y) - H(Y|X)   (2.58)

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y; see Fig. 2.10[13].

[12] The KL divergence is not a distance, since it is asymmetric. One symmetric version of the KL divergence is the Jensen-Shannon divergence, defined as JS(p_1,p_2) = 0.5\,KL(p_1||q) + 0.5\,KL(p_2||q), where q = 0.5p_1 + 0.5p_2.
[13] http://en.wikipedia.org/wiki/Mutual_information
Fig. 2.10: Individual entropies H(X), H(Y), joint entropy H(X,Y), and conditional entropies for a pair of correlated subsystems X, Y with mutual information I(X;Y).

Intuitively, we can interpret the MI between X and Y as the reduction in uncertainty about X after observing Y, or, by symmetry, the reduction in uncertainty about Y after observing X.

A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as

PMI(x,y) \triangleq \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}   (2.59)

This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI of X and Y is just the expected value of the PMI.

Interestingly, we can rewrite the PMI as follows:

PMI(x,y) = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}   (2.60)

This is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).
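A minimal sketch of these information-theoretic quantities for discrete distributions (assuming NumPy; the joint-table layout is our own choice): entropy, KL divergence, and mutual information computed directly from Equations 2.50, 2.51 and 2.54.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log2 p_k, with 0 log 0 taken as 0 (Eq. 2.50)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(p, q):
    """KL(p||q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(joint):
    """I(X;Y) = KL( p(x,y) || p(x)p(y) ) for a 2-d joint table (Eq. 2.54)."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return kl(joint.ravel(), (px * py).ravel())

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])     # correlated binary X, Y
print(mutual_information(joint))   # > 0; would be 0 for an independent joint
```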
Chapter 3 Generative models for discrete data

3.1 Generative classifier

p(y = c|x,\theta) = \frac{p(y = c|\theta)\,p(x|y = c,\theta)}{\sum_{c'} p(y = c'|\theta)\,p(x|y = c',\theta)}   (3.1)

This is called a generative classifier, since it specifies how to generate the data using the class-conditional density p(x|y = c) and the class prior p(y = c). An alternative approach is to directly fit the class posterior, p(y = c|x); this is known as a discriminative classifier.

3.2 Bayesian concept learning

Psychological research has shown that people can learn concepts from positive examples alone (Xu and Tenenbaum 2007).

We can think of learning the meaning of a word as equivalent to concept learning, which in turn is equivalent to binary classification. To see this, define f(x) = 1 if x is an example of the concept C, and f(x) = 0 otherwise. Then the goal is to learn the indicator function f, which just defines which elements are in the set C.

3.2.1 Likelihood

p(D|h) \triangleq \left(\frac{1}{size(h)}\right)^N = \left(\frac{1}{|h|}\right)^N   (3.2)

This crucial equation embodies what Tenenbaum calls the size principle, which means the model favours the simplest (smallest) hypothesis consistent with the data. This is more commonly known as Occam's razor[14].

3.2.2 Prior

The prior is decided by humans, not machines, so it is subjective. The subjectivity of the prior is controversial. For example, a child and a math professor will presumably reach different answers: they not only have different priors, but also different hypothesis spaces. However, we can finesse that by defining the hypothesis space of the child and the math professor to be the same, and then setting the child's prior weight to be zero on certain advanced concepts. Thus there is no sharp distinction between the prior and the hypothesis space.

However, the prior is the mechanism by which background knowledge can be brought to bear on a problem. Without this, rapid learning (i.e., from small sample sizes) is impossible.

3.2.3 Posterior

The posterior is simply the likelihood times the prior, normalized:

p(h|D) \triangleq \frac{p(D|h)p(h)}{\sum_{h' \in H} p(D|h')p(h')} = \frac{I(D \in h)p(h)}{\sum_{h' \in H} I(D \in h')p(h')}   (3.3)

where I(D \in h) is 1 iff (if and only if) all the data are in the extension of the hypothesis h.

In general, when we have enough data, the posterior p(h|D) becomes peaked on a single concept, namely the MAP estimate, i.e.,

p(h|D) \to \hat{h}_{MAP}   (3.4)

where \hat{h}_{MAP} is the posterior mode,

\hat{h}_{MAP} \triangleq \arg\max_h p(h|D) = \arg\max_h p(D|h)p(h) = \arg\max_h\left[\log p(D|h) + \log p(h)\right]   (3.5)

Since the likelihood term depends exponentially on N, and the prior stays constant, as we get more and more data the MAP estimate converges towards the maximum likelihood estimate or MLE:

\hat{h}_{MLE} \triangleq \arg\max_h p(D|h) = \arg\max_h \log p(D|h)   (3.6)

In other words, if we have enough data, the data overwhelms the prior.

[14] http://en.wikipedia.org/wiki/Occam%27s_razor
3.2.4 Posterior predictive distribution

The concept of a posterior predictive distribution[15] is normally used in a Bayesian context, where it makes use of the entire posterior distribution of the parameters given the observed data to yield a probability distribution over an interval, rather than simply a point estimate.

p(\tilde{x}|D) \triangleq E_{h|D}[p(\tilde{x}|h)] =
\begin{cases}
\sum_h p(\tilde{x}|h)p(h|D) \\
\int p(\tilde{x}|h)p(h|D)\,dh
\end{cases}   (3.7)

This is just a weighted average of the predictions of each individual hypothesis, and is called Bayes model averaging (Hoeting et al. 1999).

3.3 The beta-binomial model

3.3.1 Likelihood

Given X \sim Bin(\theta), the likelihood of D is given by

p(D|\theta) = Bin(N_1|N,\theta)   (3.8)

3.3.2 Prior

Beta(\theta|a,b) \propto \theta^{a-1}(1-\theta)^{b-1}   (3.9)

The parameters of the prior are called hyperparameters.

3.3.3 Posterior

p(\theta|D) \propto Bin(N_1|N_1+N_0,\theta)\,Beta(\theta|a,b) = Beta(\theta|N_1+a, N_0+b)   (3.10)

Note that updating the posterior sequentially is equivalent to updating in a single batch. To see this, suppose we have two data sets D_a and D_b with sufficient statistics N_1^a, N_0^a and N_1^b, N_0^b. Let N_1 = N_1^a+N_1^b and N_0 = N_0^a+N_0^b be the sufficient statistics of the combined data sets. In batch mode we have

p(\theta|D_a,D_b) \propto p(D_b,\theta|D_a) = p(D_b|\theta)\,p(\theta|D_a)

Combining Equations 3.10 and 2.19,

= Bin(N_1^b|\theta, N_1^b+N_0^b)\,Beta(\theta|N_1^a+a, N_0^a+b) = Beta(\theta|N_1^a+N_1^b+a, N_0^a+N_0^b+b)

This makes Bayesian inference particularly well-suited to online learning, as we will see later.

3.3.3.1 Posterior mean and mode

From Table 2.7, the posterior mean is given by

\bar{\theta} = \frac{a+N_1}{a+b+N}   (3.11)

The mode is given by

\hat{\theta}_{MAP} = \frac{a+N_1-1}{a+b+N-2}   (3.12)

If we use a uniform prior, then the MAP estimate reduces to the MLE,

\hat{\theta}_{MLE} = \frac{N_1}{N}   (3.13)

We will now show that the posterior mean is a convex combination of the prior mean and the MLE, which captures the notion that the posterior is a compromise between what we previously believed and what the data is telling us.

3.3.3.2 Posterior variance

The mean and mode are point estimates, but it is useful to know how much we can trust them. The variance of the posterior is one way to measure this. The variance of the beta posterior is given by

var(\theta|D) = \frac{(a+N_1)(b+N_0)}{(a+N_1+b+N_0)^2(a+N_1+b+N_0+1)}   (3.14)

We can simplify this formidable expression in the case that N \gg a,b, to get

var(\theta|D) \approx \frac{N_1 N_0}{N^3} = \frac{\hat{\theta}_{MLE}(1-\hat{\theta}_{MLE})}{N}   (3.15)

[15] http://en.wikipedia.org/wiki/Posterior_predictive_distribution
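A minimal sketch of this conjugate update (plain Python; the tuples hold the (a, b) pseudo-counts), confirming that sequential and batch updating agree:

```python
def beta_update(a, b, heads, tails):
    """Posterior hyper-parameters after observing the given counts (Eq. 3.10)."""
    return a + heads, b + tails

prior = (2, 2)
batch = beta_update(*prior, heads=3 + 1, tails=17 + 4)   # all data at once
seq = beta_update(*beta_update(*prior, 3, 17), 1, 4)     # two mini-batches
print(batch, seq)  # identical: (6, 23) in both cases
```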
3.3.4 Posterior predictive distribution

So far, we have been focusing on inference of the unknown parameter(s). Let us now turn our attention to prediction of future observable data.

Consider predicting the probability of heads in a single future trial under a Beta(a,b) posterior. We have

p(\tilde{x}|D) = \int_0^1 p(\tilde{x}|\theta)p(\theta|D)\,d\theta = \int_0^1 \theta\,Beta(\theta|a,b)\,d\theta = E[\theta|D] = \frac{a}{a+b}   (3.16)

3.3.4.1 Overfitting and the black swan paradox

Let us now derive a simple Bayesian solution to the problem. We will use a uniform prior, so a = b = 1. In this case, plugging in the posterior mean gives Laplace's rule of succession

p(\tilde{x}|D) = \frac{N_1+1}{N_0+N_1+2}   (3.17)

This justifies the common practice of adding 1 to the empirical counts, normalizing, and then plugging them in, a technique known as add-one smoothing. (Note that plugging in the MAP parameters would not have this smoothing effect, since the mode becomes the MLE if a = b = 1; see Section 3.3.3.1.)

3.3.4.2 Predicting the outcome of multiple future trials

Suppose now we are interested in predicting the number of heads, \tilde{x}, in M future trials. This is given by

p(\tilde{x}|D) = \int_0^1 Bin(\tilde{x}|M,\theta)\,Beta(\theta|a,b)\,d\theta   (3.18)
= \binom{M}{\tilde{x}}\frac{1}{B(a,b)}\int_0^1 \theta^{\tilde{x}}(1-\theta)^{M-\tilde{x}}\theta^{a-1}(1-\theta)^{b-1}\,d\theta   (3.19)

We recognize the integral as the normalization constant for a Beta(a+\tilde{x}, M-\tilde{x}+b) distribution. Hence

\int_0^1 \theta^{\tilde{x}}(1-\theta)^{M-\tilde{x}}\theta^{a-1}(1-\theta)^{b-1}\,d\theta = B(\tilde{x}+a, M-\tilde{x}+b)   (3.20)

Thus we find that the posterior predictive is given by the following, known as the (compound) beta-binomial distribution:

Bb(x|a,b,M) \triangleq \binom{M}{x}\frac{B(x+a, M-x+b)}{B(a,b)}   (3.21)

This distribution has the following mean and variance:

mean = M\frac{a}{a+b}, \quad var = \frac{Mab}{(a+b)^2}\frac{a+b+M}{a+b+1}   (3.22)

This process is illustrated in Figure 3.1. We start with a Beta(2,2) prior, and plot the posterior predictive density after seeing N_1 = 3 heads and N_0 = 17 tails. Figure 3.1(b) plots a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer tails, spreading its probability mass more widely, and is therefore less prone to overfitting and black-swan type paradoxes.

Fig. 3.1: (a) Posterior predictive distribution after seeing N_1 = 3, N_0 = 17. (b) MAP estimation.

3.4 The Dirichlet-multinomial model

In the previous section, we discussed how to infer the probability that a coin comes up heads. In this section,
we generalize these results to infer the probability that a die with K sides comes up as face k.

3.4.1 Likelihood

Suppose we observe N die rolls, D = \{x_1,x_2,\cdots,x_N\}, where x_i \in \{1,2,\cdots,K\}. The likelihood has the form

p(D|\theta) = \binom{N}{N_1 \cdots N_K}\prod_{k=1}^{K}\theta_k^{N_k}, \quad \text{where} \quad N_k = \sum_{i=1}^{N} I(x_i = k)   (3.23)

almost the same as Equation 2.21.

3.4.2 Prior

Dir(\theta|\alpha) = \frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_k^{\alpha_k-1}\,I(\theta \in S_K)   (3.24)

3.4.3 Posterior

p(\theta|D) \propto p(D|\theta)p(\theta)   (3.25)
\propto \prod_{k=1}^{K}\theta_k^{N_k}\theta_k^{\alpha_k-1} = \prod_{k=1}^{K}\theta_k^{N_k+\alpha_k-1}   (3.26)
= Dir(\theta|\alpha_1+N_1,\cdots,\alpha_K+N_K)   (3.27)

From Equation 2.38, the MAP estimate is given by

\hat{\theta}_k = \frac{N_k+\alpha_k-1}{N+\alpha_0-K}   (3.28)

If we use a uniform prior, \alpha_k = 1, we recover the MLE:

\hat{\theta}_k = \frac{N_k}{N}   (3.29)

3.4.4 Posterior predictive distribution

The posterior predictive distribution for a single multinoulli trial is given by the following expression:

p(X = j|D) = \int p(X = j|\theta)p(\theta|D)\,d\theta   (3.30)
= \int p(X = j|\theta_j)\left[\int p(\theta_{-j},\theta_j|D)\,d\theta_{-j}\right]d\theta_j   (3.31)
= \int \theta_j\,p(\theta_j|D)\,d\theta_j = E[\theta_j|D] = \frac{\alpha_j+N_j}{\alpha_0+N}   (3.32)

where \theta_{-j} are all the components of \theta except \theta_j.

The above expression avoids the zero-count problem. In fact, this form of Bayesian smoothing is even more important in the multinomial case than the binary case, since the likelihood of data sparsity increases once we start partitioning the data into many categories.

3.5 Naive Bayes classifiers

Assume the features are conditionally independent given the class label; then the class-conditional density has the following form:

p(x|y = c,\theta) = \prod_{j=1}^{D} p(x_j|y = c,\theta_{jc})   (3.33)

The resulting model is called a naive Bayes classifier (NBC).

The form of the class-conditional density depends on the type of each feature. We give some possibilities below:

• In the case of real-valued features, we can use the Gaussian distribution: p(x|y = c,\theta) = \prod_{j=1}^{D} N(x_j|\mu_{jc},\sigma_{jc}^2), where \mu_{jc} is the mean of feature j in objects of class c, and \sigma_{jc}^2 is its variance.
• In the case of binary features, x_j \in \{0,1\}, we can use the Bernoulli distribution: p(x|y = c,\theta) = \prod_{j=1}^{D} Ber(x_j|\mu_{jc}), where \mu_{jc} is the probability that feature j occurs in class c. This is sometimes called the multivariate Bernoulli naive Bayes model. We will see an application of this below.
• In the case of categorical features, x_j \in \{a_{j1},a_{j2},\cdots,a_{jS_j}\}, we can use the multinoulli distribution: p(x|y = c,\theta) = \prod_{j=1}^{D} Cat(x_j|\mu_{jc}), where \mu_{jc} is a histogram over the S_j possible values for x_j in class c.

Obviously we can handle other kinds of features, or use different distributional assumptions. Also, it is easy to mix and match features of different types.
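The zero-count-avoiding predictive rule in Equation 3.32 is one line of code. A minimal sketch (assuming NumPy; alpha = 1 corresponds to add-one smoothing):

```python
import numpy as np

def predictive(counts, alpha):
    """p(X = j | D) = (alpha_j + N_j) / (alpha_0 + N)  (Equation 3.32)."""
    counts = np.asarray(counts, float)
    alpha = np.broadcast_to(np.asarray(alpha, float), counts.shape)
    return (alpha + counts) / (alpha.sum() + counts.sum())

rolls = np.array([5, 3, 0, 1, 0, 1])   # counts for a 6-sided die, N = 10
print(predictive(rolls, alpha=1.0))    # no zero probabilities for unseen faces
```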
3.5.1 Optimization

We now discuss how to train a naive Bayes classifier. This usually means computing the MLE or the MAP estimate for the parameters. However, we will also discuss how to compute the full posterior, p(\theta|D).

3.5.1.1 MLE for NBC

The probability for a single data case is given by

p(x_i,y_i|\theta) = p(y_i|\pi)\prod_j p(x_{ij}|\theta_j) = \prod_c \pi_c^{I(y_i=c)}\prod_j\prod_c p(x_{ij}|\theta_{jc})^{I(y_i=c)}   (3.34)

Hence the log-likelihood is given by

\log p(D|\theta) = \sum_{c=1}^{C} N_c\log\pi_c + \sum_{j=1}^{D}\sum_{c=1}^{C}\sum_{i:y_i=c}\log p(x_{ij}|\theta_{jc})   (3.35)

where N_c \triangleq \sum_i I(y_i = c) is the number of feature vectors in class c.

We see that this expression decomposes into a series of terms, one concerning \pi, and DC terms containing the \theta_{jc}s. Hence we can optimize all these parameters separately.

From Equation 3.29, the MLE for the class prior is given by

\hat{\pi}_c = \frac{N_c}{N}   (3.36)

The MLE for the \theta_{jc}s depends on the type of distribution we choose to use for each feature.

In the case of binary features, x_j \in \{0,1\}, x_j|y = c \sim Ber(\theta_{jc}), hence

\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}   (3.37)

where N_{jc} \triangleq \sum_{i:y_i=c} I(x_{ij} = 1) is the number of times feature j occurs in class c.

In the case of categorical features, x_j \in \{a_{j1},a_{j2},\cdots,a_{jS_j}\}, x_j|y = c \sim Cat(\theta_{jc}), hence

\hat{\theta}_{jc} = \left(\frac{N_{j1c}}{N_c}, \frac{N_{j2c}}{N_c}, \cdots, \frac{N_{jS_jc}}{N_c}\right)^T   (3.38)

where N_{jkc} \triangleq \sum_{i=1}^{N} I(x_{ij} = a_{jk}, y_i = c) is the number of times feature x_j takes value a_{jk} in class c.

3.5.1.2 Bayesian naive Bayes

Use a Dir(\alpha) prior for \pi. In the case of binary features, use a Beta(\beta_0,\beta_1) prior for each \theta_{jc}; in the case of categorical features, use a Dir(\alpha) prior for each \theta_{jc}. Often we just take \alpha = 1 and \beta = 1, corresponding to add-one or Laplace smoothing.

3.5.2 Using the model for prediction

The goal is to compute

y = f(x) = \arg\max_c P(y = c|x,\theta) = \arg\max_c P(y = c|\theta)\prod_{j=1}^{D} P(x_j|y = c,\theta)   (3.39)

We can estimate the parameters using MLE or MAP, and then the posterior predictive density is obtained by simply plugging in the parameters \bar{\theta} (MLE) or \hat{\theta} (MAP). Or we can use BMA, and just integrate out the unknown parameters.

3.5.3 The log-sum-exp trick

When using generative classifiers of any kind, computing the posterior over class labels using Equation 3.1 can fail due to numerical underflow. The problem is that p(x|y = c) is often a very small number, especially if x is a high-dimensional vector. This is because we require that \sum_x p(x|y) = 1, so the probability of observing any particular high-dimensional vector is small. The obvious solution is to take logs when applying Bayes rule, as follows:

\log p(y = c|x,\theta) = b_c - \log\left(\sum_{c'} e^{b_{c'}}\right)   (3.40)

where b_c \triangleq \log p(x|y = c,\theta) + \log p(y = c|\theta).

We can factor out the largest term, and just represent the remaining numbers relative to that. For example,

\log(e^{-120}+e^{-121}) = \log(e^{-120}(1+e^{-1})) = \log(1+e^{-1})-120   (3.41)

In general, we have

\log\sum_c e^{b_c} = \log\left[\left(\sum_c e^{b_c-B}\right)e^B\right] = \log\left(\sum_c e^{b_c-B}\right)+B   (3.42)
where B \triangleq \max_c\{b_c\}. This is called the log-sum-exp trick, and is widely used; a sketch appears after Section 3.5.5.2 below.

3.5.4 Feature selection using mutual information

Since an NBC is fitting a joint distribution over potentially many features, it can suffer from overfitting. In addition, the run-time cost is O(D), which may be too high for some applications.

One common approach to tackling both of these problems is to perform feature selection, to remove irrelevant features that do not help much with the classification problem. The simplest approach to feature selection is to evaluate the relevance of each feature separately, and then take the top K, where K is chosen based on some tradeoff between accuracy and complexity. This approach is known as variable ranking, filtering, or screening.

One way to measure relevance is to use the mutual information (Section 2.8.3) between feature X_j and the class label Y:

I(X_j;Y) = \sum_{x_j}\sum_y p(x_j,y)\log\frac{p(x_j,y)}{p(x_j)p(y)}   (3.43)

If the features are binary, it is easy to show that the MI can be computed as follows:

I_j = \sum_c\left[\theta_{jc}\pi_c\log\frac{\theta_{jc}}{\theta_j} + (1-\theta_{jc})\pi_c\log\frac{1-\theta_{jc}}{1-\theta_j}\right]   (3.44)

where \pi_c = p(y = c), \theta_{jc} = p(x_j = 1|y = c), and \theta_j = p(x_j = 1) = \sum_c\pi_c\theta_{jc}.

3.5.5 Classifying documents using bag of words

Document classification is the problem of classifying text documents into different categories.

3.5.5.1 Bernoulli product model

One simple approach is to represent each document as a binary vector, which records whether each word is present or not, so x_{ij} = 1 iff word j occurs in document i, otherwise x_{ij} = 0. We can then use the following class-conditional density:

p(x_i|y_i = c,\theta) = \prod_{j=1}^{D} Ber(x_{ij}|\theta_{jc}) = \prod_{j=1}^{D}\theta_{jc}^{x_{ij}}(1-\theta_{jc})^{1-x_{ij}}   (3.45)

This is called the Bernoulli product model, or the binary independence model.

3.5.5.2 Multinomial document classifier

However, ignoring the number of times each word occurs in a document loses some information (McCallum and Nigam 1998). A more accurate representation counts the number of occurrences of each word. Specifically, let x_i be a vector of counts for document i, so x_{ij} \in \{0,1,\cdots,N_i\}, where N_i is the number of terms in document i (so \sum_{j=1}^{D} x_{ij} = N_i). For the class-conditional densities, we can use a multinomial distribution:

p(x_i|y_i = c,\theta) = Mu(x_i|N_i,\theta_c) = \frac{N_i!}{\prod_{j=1}^{D} x_{ij}!}\prod_{j=1}^{D}\theta_{jc}^{x_{ij}}   (3.46)

where we have implicitly assumed that the document length N_i is independent of the class. Here \theta_{jc} is the probability of generating word j in documents of class c; these parameters satisfy the constraint that \sum_{j=1}^{D}\theta_{jc} = 1 for each class c.

Although the multinomial classifier is easy to train and easy to use at test time, it does not work particularly well for document classification. One reason for this is that it does not take into account the burstiness of word usage. This refers to the phenomenon that most words never appear in any given document, but if they do appear once, they are likely to appear more than once, i.e., words occur in bursts.

The multinomial model cannot capture the burstiness phenomenon. To see why, note that Equation 3.46 has the form \theta_{jc}^{x_{ij}}, and since \theta_{jc} \ll 1 for rare words, it becomes increasingly unlikely to generate many of them. For more frequent words, the decay rate is not as fast.
To see why intuitively, note that the most frequent words are function words which are not specific to the class, such as "and", "the", and "but"; the chance of the word "and" occurring is pretty much the same no matter how many times it has previously occurred (modulo document length), so the independence assumption is more reasonable for common words. However, since rare words are the ones that matter most for classification purposes, these are the ones we want to model the most carefully.
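Tying Sections 3.5.2 and 3.5.3 together, here is a minimal sketch (assuming NumPy; the function and variable names are our own) of computing class posteriors in log space with the log-sum-exp trick, which is exactly what a generative document classifier must do to avoid underflow:

```python
import numpy as np

def log_posterior(log_prior, log_lik):
    """log p(y=c|x) from b_c = log p(x|y=c) + log p(y=c)  (Equation 3.40)."""
    b = log_prior + log_lik                        # one entry per class c
    B = b.max()                                    # factor out the largest term
    log_norm = B + np.log(np.sum(np.exp(b - B)))   # log-sum-exp, Equation 3.42
    return b - log_norm

log_prior = np.log(np.array([0.5, 0.5]))
log_lik = np.array([-120.0, -121.0])   # tiny p(x|y=c); naive exp() underflows
print(np.exp(log_posterior(log_prior, log_lik)))   # approx [0.731, 0.269]
```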
3.5.5.3 DCM model

Various ad hoc heuristics have been proposed to improve the performance of the multinomial document classifier (Rennie et al. 2003). We now present an alternative class-conditional density that performs as well as these ad hoc methods, yet is probabilistically sound (Madsen et al. 2005).

Suppose we simply replace the multinomial class-conditional density with the Dirichlet compound multinomial or DCM density, defined as follows:

p(x_i|y_i = c,\alpha) = \int Mu(x_i|N_i,\theta_c)\,Dir(\theta_c|\alpha_c)\,d\theta_c = \frac{N_i!}{\prod_{j=1}^{D} x_{ij}!}\frac{B(x_i+\alpha_c)}{B(\alpha_c)}   (3.47)

(This equation is derived in Equation TODO.) Surprisingly, this simple change is all that is needed to capture the burstiness phenomenon. The intuitive reason for this is as follows: after seeing one occurrence of a word, say word j, the posterior counts on j get updated, making another occurrence of word j more likely. By contrast, if \theta_j is fixed, then the occurrences of each word are independent. The multinomial model corresponds to drawing a ball from an urn with K colors of ball, recording its color, and then replacing it. By contrast, the DCM model corresponds to drawing a ball, recording its color, and then replacing it with one additional copy; this is called the Polya urn.

Using the DCM as the class-conditional density gives much better results than using the multinomial, and has performance comparable to state-of-the-art methods, as described in (Madsen et al. 2005). The only disadvantage is that fitting the DCM model is more complex; see (Minka 2000e; Elkan 2006) for the details.
Chapter 4 Gaussian Models

In this chapter, we discuss the multivariate Gaussian or multivariate normal (MVN), which is the most widely used joint probability density function for continuous variables. It will form the basis for many of the models we will encounter in later chapters.

4.1 Basics

Recall from Section 2.5.2 that the pdf for an MVN in D dimensions is defined by the following:

N(x|\mu,\Sigma) \triangleq \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]   (4.1)

The expression inside the exponent is the Mahalanobis distance between a data vector x and the mean vector \mu. We can gain a better understanding of this quantity by performing an eigendecomposition of \Sigma. That is, we write \Sigma = U\Lambda U^T, where U is an orthonormal matrix of eigenvectors satisfying U^T U = I, and \Lambda is a diagonal matrix of eigenvalues.

Using the eigendecomposition, we have that

\Sigma^{-1} = U^{-T}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^T = \sum_{i=1}^{D}\frac{1}{\lambda_i}u_i u_i^T   (4.2)

where u_i is the i'th column of U, containing the i'th eigenvector. Hence we can rewrite the Mahalanobis distance as follows:

(x-\mu)^T\Sigma^{-1}(x-\mu) = (x-\mu)^T\left(\sum_{i=1}^{D}\frac{1}{\lambda_i}u_i u_i^T\right)(x-\mu)   (4.3)
= \sum_{i=1}^{D}\frac{1}{\lambda_i}(x-\mu)^T u_i u_i^T(x-\mu)   (4.4)
= \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i}   (4.5)

where y_i \triangleq u_i^T(x-\mu). Recall that the equation for an ellipse in 2d is

\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1   (4.6)

Hence we see that the contours of equal probability density of a Gaussian lie along ellipses. This is illustrated in Figure 4.1. The eigenvectors determine the orientation of the ellipse, and the eigenvalues determine how elongated it is.

Fig. 4.1: Visualization of a 2-dimensional Gaussian density. The major and minor axes of the ellipse are defined by the first two eigenvectors of the covariance matrix, namely u_1 and u_2. Based on Figure 2.7 of (Bishop 2006a).

In general, we see that the Mahalanobis distance corresponds to Euclidean distance in a transformed coordinate system, where we shift by \mu and rotate by U.

4.1.1 MLE for a MVN

Theorem 4.1. (MLE for a MVN) If we have N iid samples x_i \sim N(\mu,\Sigma), then the MLE for the parameters is given by
\bar{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i \triangleq \bar{x}   (4.7)

\bar{\Sigma} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^T   (4.8)
= \frac{1}{N}\left(\sum_{i=1}^{N} x_i x_i^T\right) - \bar{x}\bar{x}^T   (4.9)

4.1.2 Maximum entropy derivation of the Gaussian *

In this section, we show that the multivariate Gaussian is the distribution with maximum entropy subject to having a specified mean and covariance (see also Section TODO). This is one reason the Gaussian is so widely used: the first two moments are usually all that we can reliably estimate from data, so we want a distribution that captures these properties, but otherwise makes as few additional assumptions as possible.

To simplify notation, we will assume the mean is zero. The pdf has the form

f(x) = \frac{1}{Z}\exp\left(-\frac{1}{2}x^T\Sigma^{-1}x\right)   (4.10)

4.2 Gaussian discriminant analysis

One important application of MVNs is to define the class-conditional densities in a generative classifier, i.e.,

p(x|y = c,\theta) = N(x|\mu_c,\Sigma_c)   (4.11)

The resulting technique is called (Gaussian) discriminant analysis or GDA (even though it is a generative, not discriminative, classifier; see Section TODO for more on this distinction). If \Sigma_c is diagonal, this is equivalent to naive Bayes.

We can classify a feature vector using the following decision rule, derived from Equation 3.1:

y = \arg\max_c\left[\log p(y = c|\pi) + \log p(x|\theta_c)\right]   (4.12)

When we compute the probability of x under each class-conditional density, we are measuring the distance from x to the center of each class, \mu_c, using Mahalanobis distance. This can be thought of as a nearest centroids classifier.

As an example, Figure 4.2 shows two Gaussian class-conditional densities in 2d, representing the height and weight of men and women. We can see that the features are correlated, as is to be expected (tall people tend to weigh more). The ellipses for each class contain 95% of the probability mass. If we have a uniform prior over classes, we can classify a new test vector as follows (picking the class whose centroid is closest in Mahalanobis distance):

y = \arg\min_c (x-\mu_c)^T\Sigma_c^{-1}(x-\mu_c)   (4.13)

Fig. 4.2: (a) Height/weight data. (b) Visualization of 2d Gaussians fit to each class. 95% of the probability mass is inside the ellipse.

4.2.1 Quadratic discriminant analysis (QDA)

By plugging in the definition of the Gaussian density to Equation 3.1, we can get

p(y = c|x,\theta) = \frac{\pi_c|2\pi\Sigma_c|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu_c)^T\Sigma_c^{-1}(x-\mu_c)\right]}{\sum_{c'}\pi_{c'}|2\pi\Sigma_{c'}|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu_{c'})^T\Sigma_{c'}^{-1}(x-\mu_{c'})\right]}   (4.14)
Thresholding this results in a quadratic function of x. The result is known as quadratic discriminant analysis (QDA). Figure 4.3 gives some examples of what the decision boundaries look like in 2D.

Fig. 4.3: Quadratic decision boundaries in 2D for the 2 and 3 class case.

4.2.2 Linear discriminant analysis (LDA)

We now consider a special case in which the covariance matrices are tied or shared across classes, \Sigma_c = \Sigma. In this case, we can simplify Equation 4.14 as follows:

p(y = c|x,\theta) \propto \pi_c\exp\left(\mu_c^T\Sigma^{-1}x - \frac{1}{2}x^T\Sigma^{-1}x - \frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c\right)
= \exp\left(\mu_c^T\Sigma^{-1}x - \frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c\right)\exp\left(-\frac{1}{2}x^T\Sigma^{-1}x\right)
\propto \exp\left(\mu_c^T\Sigma^{-1}x - \frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c\right)   (4.15)

Since the quadratic term x^T\Sigma^{-1}x is independent of c, it will cancel out in the numerator and denominator. If we define

\gamma_c \triangleq -\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c   (4.16)
\beta_c \triangleq \Sigma^{-1}\mu_c   (4.17)

then we can write

p(y = c|x,\theta) = \frac{e^{\beta_c^T x+\gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x+\gamma_{c'}}} \triangleq \sigma(\eta,c)   (4.18)

where \eta \triangleq (\beta_1^T x+\gamma_1, \cdots, \beta_C^T x+\gamma_C), and \sigma(\cdot) is the softmax activation function[16], defined as follows:

\sigma(q,i) \triangleq \frac{\exp(q_i)}{\sum_{j=1}^{n}\exp(q_j)}   (4.19)

When parameterized by some constant \alpha > 0, the following formulation becomes a smooth, differentiable approximation of the maximum function:

S_\alpha(x) = \frac{\sum_{j=1}^{D} x_j e^{\alpha x_j}}{\sum_{j=1}^{D} e^{\alpha x_j}}   (4.20)

S_\alpha has the following properties:

1. S_\alpha \to \max as \alpha \to \infty
2. S_0 is the average of its inputs
3. S_\alpha \to \min as \alpha \to -\infty

Note that the softmax activation function comes from the area of statistical physics, where it is common to use the Boltzmann distribution, which has the same form as the softmax activation function.

An interesting property of Equation 4.18 is that, if we take logs, we end up with a linear function of x. (The reason it is linear is because the x^T\Sigma^{-1}x cancels from the numerator and denominator.) Thus the decision boundary between any two classes, say c and c', will be a straight line. Hence this technique is called linear discriminant analysis or LDA.

[16] http://en.wikipedia.org/wiki/Softmax_activation_function
An alternative to fitting an LDA model and then deriving the class posterior is to directly fit p(y|x,W) = Cat(y|Wx) for some C \times D weight matrix W. This is called multi-class logistic regression, or multinomial logistic regression. We will discuss this model in detail in Section TODO. The difference between the two approaches is explained in Section TODO.

4.2.3 Two-class LDA

To gain further insight into the meaning of these equations, let us consider the binary case. In this case, the posterior is given by

p(y = 1|x,\theta) = \frac{e^{\beta_1^T x+\gamma_1}}{e^{\beta_0^T x+\gamma_0}+e^{\beta_1^T x+\gamma_1}}   (4.21)
= \frac{1}{1+e^{(\beta_0-\beta_1)^T x+(\gamma_0-\gamma_1)}}   (4.22)
= sigm\left((\beta_1-\beta_0)^T x+(\gamma_1-\gamma_0)\right)   (4.23)

where sigm(x) refers to the sigmoid function[17]. Now

\gamma_1-\gamma_0 = -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0 + \log(\pi_1/\pi_0)   (4.24)
= -\frac{1}{2}(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1+\mu_0) + \log(\pi_1/\pi_0)   (4.25)

So if we define

w = \beta_1-\beta_0 = \Sigma^{-1}(\mu_1-\mu_0)   (4.26)
x_0 = \frac{1}{2}(\mu_1+\mu_0) - (\mu_1-\mu_0)\frac{\log(\pi_1/\pi_0)}{(\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)}   (4.27)

then we have w^T x_0 = -(\gamma_1-\gamma_0), and hence

p(y = 1|x,\theta) = sigm(w^T(x-x_0))   (4.28)

(This is closely related to logistic regression, which we will discuss in Section TODO.) So the final decision rule is as follows: shift x by x_0, project onto the line w, and see if the result is positive or negative.

If \Sigma = \sigma^2 I, then w is in the direction of \mu_1-\mu_0. So we classify the point based on whether its projection is closer to \mu_0 or \mu_1. This is illustrated in Figure 4.4. Furthermore, if \pi_1 = \pi_0, then x_0 = \frac{1}{2}(\mu_1+\mu_0), which is halfway between the means. If we make \pi_1 > \pi_0, then x_0 gets closer to \mu_0, so more of the line belongs to class 1 a priori. Conversely if \pi_1 < \pi_0, the boundary shifts right. Thus we see that the class prior \pi_c just changes the decision threshold, and not the overall geometry, as we claimed above. (A similar argument applies in the multi-class case.)

Fig. 4.4: Geometry of LDA in the 2-class case where \Sigma_1 = \Sigma_2 = I.

The magnitude of w determines the steepness of the logistic function, and depends on how well-separated the means are, relative to the variance. In psychology and signal detection theory, it is common to define the discriminability of a signal from the background noise using a quantity called d-prime:

d' \triangleq \frac{\mu_1-\mu_0}{\sigma}   (4.29)

where \mu_1 is the mean of the signal and \mu_0 is the mean of the noise, and \sigma is the standard deviation of the noise. If d' is large, the signal will be easier to discriminate from the noise.

4.2.4 MLE for discriminant analysis

The log-likelihood function is as follows:

\log p(D|\theta) = \sum_{c=1}^{C}\sum_{i:y_i=c}\log\pi_c + \sum_{c=1}^{C}\sum_{i:y_i=c}\log N(x_i|\mu_c,\Sigma_c)   (4.30)

The MLE for each parameter is as follows:

\hat{\pi}_c = \frac{N_c}{N}   (4.31)
\bar{\mu}_c = \frac{1}{N_c}\sum_{i:y_i=c} x_i   (4.32)
\bar{\Sigma}_c = \frac{1}{N_c}\sum_{i:y_i=c}(x_i-\bar{\mu}_c)(x_i-\bar{\mu}_c)^T   (4.33)

[17] http://en.wikipedia.org/wiki/Sigmoid_function
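A minimal sketch of two-class LDA (assuming NumPy; the fit/predict layout is our own): estimate the tied covariance by MLE as in Section 4.2.4, then apply Equations 4.26-4.28.

```python
import numpy as np

def fit_lda(X0, X1):
    """Two-class LDA with a shared covariance, fit by MLE (Section 4.2.4)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Xc = np.vstack([X0 - mu0, X1 - mu1])
    Sigma = Xc.T @ Xc / len(Xc)             # tied covariance MLE
    w = np.linalg.solve(Sigma, mu1 - mu0)   # Eq. 4.26: w = Sigma^{-1}(mu1 - mu0)
    pi1 = len(X1) / (len(X0) + len(X1))
    # Eq. 4.27; note (mu1 - mu0) @ w equals (mu1-mu0)^T Sigma^{-1} (mu1-mu0).
    x0 = 0.5 * (mu0 + mu1) - (mu1 - mu0) * np.log(pi1 / (1 - pi1)) / ((mu1 - mu0) @ w)
    return w, x0

def predict_proba(w, x0, X):
    """p(y=1|x) = sigm(w^T (x - x0))  (Equation 4.28)."""
    return 1.0 / (1.0 + np.exp(-(X - x0) @ w))

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2, 2], scale=1.0, size=(100, 2))
w, x0 = fit_lda(X0, X1)
print(predict_proba(w, x0, np.array([[2.0, 2.0]])))  # close to 1
```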
4.2.5 Strategies for preventing overfitting

The speed and simplicity of the MLE method is one of its greatest appeals. However, the MLE can badly overfit in high dimensions. In particular, the MLE for a full covariance matrix is singular if N_c < D. And even when N_c > D, the MLE can be ill-conditioned, meaning it is close to singular. There are several possible solutions to this problem:

• Use a diagonal covariance matrix for each class, which assumes the features are conditionally independent; this is equivalent to using a naive Bayes classifier (Section 3.5).
• Use a full covariance matrix, but force it to be the same for all classes, \Sigma_c = \Sigma. This is an example of parameter tying or parameter sharing, and is equivalent to LDA (Section 4.2.2).
• Use a diagonal covariance matrix and force it to be shared. This is called diagonal covariance LDA, and is discussed in Section TODO.
• Use a full covariance matrix, but impose a prior and then integrate it out. If we use a conjugate prior, this can be done in closed form, using the results from Section TODO; this is analogous to the Bayesian naive Bayes method in Section 3.5.1.2. See (Minka 2000f) for details.
• Fit a full or diagonal covariance matrix by MAP estimation. We discuss two different kinds of prior below.
• Project the data into a low-dimensional subspace and fit the Gaussians there. See Section TODO for a way to find the best (most discriminative) linear projection.

We discuss some of these options below.

4.2.6 Regularized LDA *

4.2.7 Diagonal LDA

4.2.8 Nearest shrunken centroids classifier *

One drawback of diagonal LDA is that it depends on all of the features. In high-dimensional problems, we might prefer a method that only depends on a subset of the features, for reasons of accuracy and interpretability. One approach is to use a screening method, perhaps based on mutual information, as in Section 3.5.4. We now discuss another approach to this problem known as the nearest shrunken centroids classifier (Hastie et al. 2009, p652).

4.3 Inference in jointly Gaussian distributions

Given a joint distribution p(x_1,x_2), it is useful to be able to compute marginals p(x_1) and conditionals p(x_1|x_2). We discuss how to do this below, and then give some applications. These operations take O(D^3) time in the worst case. See Section TODO for faster methods.

4.3.1 Statement of the result

Theorem 4.2. (Marginals and conditionals of an MVN). Suppose x = (x_1,x_2) is jointly Gaussian with parameters

\mu = \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\\Sigma_{21} & \Sigma_{22}\end{pmatrix}, \quad \Lambda = \Sigma^{-1} = \begin{pmatrix}\Lambda_{11} & \Lambda_{12}\\\Lambda_{21} & \Lambda_{22}\end{pmatrix}   (4.34)

Then the marginals are given by

p(x_1) = N(x_1|\mu_1,\Sigma_{11}), \quad p(x_2) = N(x_2|\mu_2,\Sigma_{22})   (4.35)

and the posterior conditional is given by

p(x_1|x_2) = N(x_1|\mu_{1|2},\Sigma_{1|2})
\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2) = \Sigma_{1|2}\left(\Lambda_{11}\mu_1 - \Lambda_{12}(x_2-\mu_2)\right)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Lambda_{11}^{-1}
  (4.36)

Equation 4.36 is of such crucial importance in this book that we have put a box around it, so you can easily find it. For the proof, see Section TODO.

We see that both the marginal and conditional distributions are themselves Gaussian. For the marginals, we just extract the rows and columns corresponding to x_1 or x_2. For the conditional, we have to do a bit more work. However, it is not that complicated: the conditional mean is just a linear function of x_2, and the conditional covariance is just a constant matrix that is independent of x_2.
We give three different (but equivalent) expressions for the posterior mean, and two different (but equivalent) expressions for the posterior covariance; each one is useful in different circumstances.
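A minimal numerical sketch of Equation 4.36 (assuming NumPy; the example dimensions and index layout are our own choice):

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx2, x2):
    """p(x1 | x2 = x2) for a jointly Gaussian vector  (Equation 4.36)."""
    idx1 = np.setdiff1d(np.arange(len(mu)), idx2)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)
    mu_cond = mu[idx1] + K @ (x2 - mu[idx2])            # mu_{1|2}
    Sigma_cond = S11 - K @ Sigma[np.ix_(idx2, idx1)]    # Sigma_{1|2}
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
# Condition on x2 = 2.0: the mean of x1 shifts to 0.8, variance shrinks to 1.36.
print(condition_gaussian(mu, Sigma, idx2=np.array([1]), x2=np.array([2.0])))
```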
4.3.2 Examples

Below we give some examples of these equations in action, which will make them seem more intuitive.

4.3.2.1 Marginals and conditionals of a 2d Gaussian

4.4 Linear Gaussian systems

Suppose we have two variables, x and y. Let x \in R^{D_x} be a hidden variable, and y \in R^{D_y} be a noisy observation of x. Let us assume we have the following prior and likelihood:

p(x) = N(x|\mu_x,\Sigma_x)
p(y|x) = N(y|Wx+\mu_y,\Sigma_y)
  (4.37)

where W is a matrix of size D_y \times D_x. This is an example of a linear Gaussian system. We can represent this schematically as x \to y, meaning x generates y. In this section, we show how to invert the arrow, that is, how to infer x from y. We state the result below, then give several examples, and finally we derive the result. We will see many more applications of these results in later chapters.

4.4.1 Statement of the result

Theorem 4.3. (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in Equation 4.37, the posterior p(x|y) is given by the following:

p(x|y) = N(x|\mu_{x|y},\Sigma_{x|y})
\Sigma_{x|y}^{-1} = \Sigma_x^{-1} + W^T\Sigma_y^{-1}W
\mu_{x|y} = \Sigma_{x|y}\left[W^T\Sigma_y^{-1}(y-\mu_y) + \Sigma_x^{-1}\mu_x\right]
  (4.38)

In addition, the normalization constant p(y) is given by

p(y) = N(y|W\mu_x+\mu_y, \Sigma_y+W\Sigma_x W^T)   (4.39)

For the proof, see Section 4.4.3 TODO.

4.5 Digression: The Wishart distribution *

4.6 Inferring the parameters of an MVN

4.6.1 Posterior distribution of \mu

4.6.2 Posterior distribution of \Sigma *

4.6.3 Posterior distribution of \mu and \Sigma *

4.6.4 Sensor fusion with unknown precisions *
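Before moving on, here is a minimal sketch of the posterior update in Theorem 4.3 (assuming NumPy; the scalar example dimensions are our own choice):

```python
import numpy as np

def linear_gaussian_posterior(mu_x, Sigma_x, W, mu_y, Sigma_y, y):
    """p(x | y) for the linear Gaussian system of Eq. 4.37 (Theorem 4.3)."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sx_inv + W.T @ Sy_inv @ W)   # Eq. 4.38
    mu_post = Sigma_post @ (W.T @ Sy_inv @ (y - mu_y) + Sx_inv @ mu_x)
    return mu_post, Sigma_post

# Scalar example: x ~ N(0, 1) observed through y = x + noise, noise variance 0.5.
mu, S = linear_gaussian_posterior(np.zeros(1), np.eye(1), np.eye(1),
                                  np.zeros(1), 0.5 * np.eye(1), np.array([1.0]))
print(mu, S)  # mean 2/3 pulled toward the observation, variance shrunk to 1/3
```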
Chapter 5 Bayesian statistics

5.1 Introduction

Using the posterior distribution to summarize everything we know about a set of unknown variables is at the core of Bayesian statistics. In this chapter, we discuss this approach to statistics in more detail.

5.2 Summarizing posterior distributions

The posterior p(\theta|D) summarizes everything we know about the unknown quantities \theta. In this section, we discuss some simple quantities that can be derived from a probability distribution, such as a posterior. These summary statistics are often easier to understand and visualize than the full joint.

5.2.1 MAP estimation

We can easily compute a point estimate of an unknown quantity by computing the posterior mean, median or mode. In Section 5.7, we discuss how to use decision theory to choose between these methods. Typically the posterior mean or median is the most appropriate choice for a real-valued quantity, and the vector of posterior marginals is the best choice for a discrete quantity. However, the posterior mode, aka the MAP estimate, is the most popular choice because it reduces to an optimization problem, for which efficient algorithms often exist. Furthermore, MAP estimation can be interpreted in non-Bayesian terms, by thinking of the log prior as a regularizer (see Section TODO for more details).

Although this approach is computationally appealing, it is important to point out that there are various drawbacks to MAP estimation, which we briefly discuss below. This will provide motivation for the more thoroughly Bayesian approach which we will study later in this chapter (and elsewhere in this book).

5.2.1.1 No measure of uncertainty

The most obvious drawback of MAP estimation, and indeed of any other point estimate such as the posterior mean or median, is that it does not provide any measure of uncertainty. In many applications, it is important to know how much one can trust a given estimate. We can derive such confidence measures from the posterior, as we discuss in Section 5.2.2.

5.2.1.2 Plugging in the MAP estimate can result in overfitting

If we don't model the uncertainty in our parameters, then our predictive distribution will be overconfident. Overconfidence in predictions is particularly problematic in situations where we may be risk averse; see Section 5.7 for details.

5.2.1.3 The mode is an untypical point

Choosing the mode as a summary of a posterior distribution is often a very poor choice, since the mode is usually quite untypical of the distribution, unlike the mean or median. The basic problem is that the mode is a point of measure zero, whereas the mean and median take the volume of the space into account. See Figure 5.1.

How should we summarize a posterior if the mode is not a good choice? The answer is to use decision theory, which we discuss in Section 5.7. The basic idea is to specify a loss function, where L(\theta,\hat{\theta}) is the loss you incur if the truth is \theta and your estimate is \hat{\theta}. If we use 0-1 loss, L(\theta,\hat{\theta}) = I(\theta \ne \hat{\theta}) (see Section 1.2.2.1), then the optimal estimate is the posterior mode. 0-1 loss means you only get points if you make no errors; otherwise you get nothing: there is no partial credit under this loss function! For continuous-valued quantities, we often prefer to use squared error loss, L(\theta,\hat{\theta}) = (\theta-\hat{\theta})^2; the corresponding optimal estimator is then the posterior mean, as we show in Section 5.7. Or we can use a more robust loss function, L(\theta,\hat{\theta}) = |\theta-\hat{\theta}|, which gives rise to the posterior median.