Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")



Part VI
Linear Regression 2

Overview:
    Bayesian Regression
    Sequential Update of the Posterior
    Predictive Distribution
    Proof of the Predictive Distribution
    Predictive Distribution with Simplified Prior
    Limitations of Linear Basis Function Models
Bayesian Regression

Training Phase: training data x and training targets t are fed into a model with adjustable parameter w; fix the most appropriate w*.

Test Phase: test data x are fed into the model with the fixed parameter w* to produce a test target t.
Bayesian Regression

Bayes Theorem:

    posterior = (likelihood × prior) / normalisation

    p(w | t) = p(t | w) p(w) / p(t)

Likelihood for i.i.d. data:

    p(t | w) = ∏_{n=1}^N N(t_n | y(x_n, w), β^-1)
             = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^-1)
             = const × exp{ -(β/2) (t - Φw)^T (t - Φw) }

where we have left out the conditioning on x (always assumed) and on β, which is assumed to be constant.
How to choose a prior?

Can we find a prior for the given likelihood which
    makes sense for the problem at hand, and
    allows us to find a posterior in a 'nice' form?

An answer to the second question:

Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood     Conjugate Prior
    Bernoulli      Beta
    Binomial       Beta
    Poisson        Gamma
    Multinomial    Dirichlet

Table: Continuous likelihood distributions

    Likelihood             Conjugate Prior
    Uniform                Pareto
    Exponential            Gamma
    Normal                 Normal
    Multivariate normal    Multivariate normal
Conjugate Prior to a Gaussian Distribution

Example: The Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior ensures that the posterior distribution is also Gaussian.

Given a marginal distribution for x and a conditional Gaussian distribution for y given x in the form

    p(x) = N(x | µ, Λ^-1)
    p(y | x) = N(y | Ax + b, L^-1)

we get

    p(y) = N(y | Aµ + b, L^-1 + A Λ^-1 A^T)
    p(x | y) = N(x | Σ{A^T L (y - b) + Λµ}, Σ)

where Σ = (Λ + A^T L A)^-1.
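
A quick Monte Carlo sanity check of the marginal p(y) above can be written in a few lines of Python; the matrices below are made-up illustrative values, not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative (assumed) parameters
    mu = np.array([1.0, -1.0])                    # mean of p(x)
    Lam = np.array([[2.0, 0.3], [0.3, 1.0]])      # precision of p(x)
    A = np.array([[1.0, 2.0], [0.5, -1.0]])
    b = np.array([0.2, 0.1])
    L = 4.0 * np.eye(2)                           # precision of p(y | x)

    # Sample x ~ N(mu, Lam^-1), then y ~ N(Ax + b, L^-1)
    n = 200_000
    x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
    y = x @ A.T + b + rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)

    # Compare against the closed form p(y) = N(y | A mu + b, L^-1 + A Lam^-1 A^T)
    print(y.mean(axis=0), A @ mu + b)
    print(np.cov(y.T))
    print(np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)

The sample mean and covariance of y agree with the closed-form expressions up to Monte Carlo error.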
Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0,

    p(w) = N(w | m_0, S_0)

After having seen N training data pairs (x_n, t_n), the posterior for the given likelihood is

    p(w | t) = N(w | m_N, S_N)

where

    m_N = S_N (S_0^-1 m_0 + β Φ^T t)
    S_N^-1 = S_0^-1 + β Φ^T Φ

The posterior is Gaussian, therefore mode = mean.
The maximum posterior weight vector is w_MAP = m_N.
For an infinitely broad prior S_0 = α^-1 I with α → 0, the mean reduces to the maximum likelihood solution w_ML.
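
As a minimal sketch (not from the slides), the posterior update above translates directly into NumPy:

    import numpy as np

    def posterior(Phi, t, m0, S0, beta):
        """Mean m_N and covariance S_N of the posterior p(w | t)."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi          # S_N^-1 = S_0^-1 + beta Phi^T Phi
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # m_N = S_N (S_0^-1 m_0 + beta Phi^T t)
        return mN, SN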
Bayesian Regression

If we have not yet seen any data point (N = 0), the posterior is equal to the prior.
Sequential arrival of data points: each posterior distribution, calculated after the arrival of a data point and its target value, acts as the prior distribution for the subsequent data point.
This nicely fits a sequential learning framework.
Bayesian Regression

In the remainder, use the special simplified prior with m_0 = 0 and S_0 = α^-1 I,

    p(w | α) = N(w | 0, α^-1 I)                    (1)

The parameters of the posterior distribution p(w | t) = N(w | m_N, S_N) are now

    m_N = β S_N Φ^T t
    S_N^-1 = α I + β Φ^T Φ

For α → 0 we get

    m_N → w_ML = (Φ^T Φ)^-1 Φ^T t

The log of the posterior is the sum of the log likelihood and the log of the prior,

    ln p(w | t) = -(β/2) (t - Φw)^T (t - Φw) - (α/2) w^T w + const
Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of the prior,

    ln p(w | t) = -(β/2) (t - Φw)^T (t - Φw)    (sum-of-squares error)
                  - (α/2) w^T w                  (quadratic regulariser)
                  + const

Maximising the posterior distribution with respect to w therefore corresponds to minimising the sum-of-squares error function with the addition of a quadratic regularisation term with coefficient λ = α/β.
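
A small numerical check of this equivalence (with made-up data, purely illustrative): the posterior mean m_N coincides with the ridge regression solution at λ = α/β.

    import numpy as np

    rng = np.random.default_rng(1)
    Phi = rng.normal(size=(50, 4))        # assumed design matrix
    t = rng.normal(size=50)               # assumed targets
    alpha, beta = 2.0, 25.0
    lam = alpha / beta

    # Posterior mean with prior N(0, alpha^-1 I)
    SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t

    # Ridge regression: minimise ||t - Phi w||^2 + lam ||w||^2
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)

    print(np.allclose(mN, w_ridge))       # True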
Sequential Update of the Posterior

Example of a linear basis function model:
    Single input x, single output t.
    Linear model y(x, w) = w_0 + w_1 x.

Data creation (a sketch of this loop follows below):
    1. Choose an x_n from the uniform distribution U(x | -1, 1).
    2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = -0.3, a_1 = 0.5.
    3. Add Gaussian noise with standard deviation σ = 0.2,

        t_n ~ N(t | f(x_n, a), 0.04)

Set the precision of the Gaussian prior to α = 2.0.
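
A minimal sketch of this data-generation and sequential-update loop, assuming the simplified prior N(0, α^-1 I) and noise precision β = 1/σ²:

    import numpy as np

    rng = np.random.default_rng(0)
    a0, a1 = -0.3, 0.5
    sigma = 0.2
    alpha, beta = 2.0, 1.0 / sigma**2

    m = np.zeros(2)              # prior mean over w = (w0, w1)
    S = np.eye(2) / alpha        # prior covariance alpha^-1 I

    for n in range(20):
        x = rng.uniform(-1.0, 1.0)
        t = a0 + a1 * x + rng.normal(0.0, sigma)
        phi = np.array([1.0, x])                    # basis functions (1, x)

        # The current posterior acts as the prior for the new point
        S_inv_old = np.linalg.inv(S)
        S = np.linalg.inv(S_inv_old + beta * np.outer(phi, phi))
        m = S @ (S_inv_old @ m + beta * t * phi)

    print(m)    # approaches (a0, a1) = (-0.3, 0.5) as data accumulate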
Sequential Update of the Posterior

[Two slides of figures illustrating the sequential update of the posterior for this model.]
Predictive Distribution

In the training phase, data x and targets t are provided.
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data x and the training targets t,

    p(t | x, x, t)

This is the predictive distribution.
How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule,

    p(t | x, x, t) = ∫ p(t, w | x, x, t) dw
                   = ∫ p(t | w, x, x, t) p(w | x, x, t) dw

The test target t depends only on the test data x and the model parameter w, not on the training data and training targets:

    p(t | w, x, x, t) = p(t | w, x)

The model parameters w are learned from the training data x and the training targets t only:

    p(w | x, x, t) = p(w | x, t)

Predictive distribution:

    p(t | x, x, t) = ∫ p(t | w, x) p(w | x, t) dw
Proof of the Predictive Distribution

How to prove the predictive distribution in the general form

    p(t | x, x, t) = ∫ p(t | w, x, x, t) p(w | x, x, t) dw ?

Convert each conditional probability on the right-hand side into a quotient of joint probabilities:

    ∫ p(t | w, x, x, t) p(w | x, x, t) dw
        = ∫ [ p(t, w, x, x, t) / p(w, x, x, t) ] [ p(w, x, x, t) / p(x, x, t) ] dw
        = ∫ p(t, w, x, x, t) / p(x, x, t) dw
        = p(t, x, x, t) / p(x, x, t)
        = p(t | x, x, t)
Predictive Distribution with Simplified Prior

Find the predictive distribution

    p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw

(Remember: the conditioning on the input variables x is often suppressed to simplify the notation.)

We know the likelihood

    p(t | x, w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^-1)

and the posterior was

    p(w | t, α, β) = N(w | m_N, S_N)

where

    m_N = β S_N Φ^T t
    S_N^-1 = α I + β Φ^T Φ
Predictive Distribution with Simplified Prior

Carrying out the convolution of the two Gaussians, we get for the predictive distribution

    p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))

where the variance σ_N^2(x) is given by

    σ_N^2(x) = 1/β + φ(x)^T S_N φ(x).
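
As a sketch, the predictive mean and variance above are one line each given m_N and S_N (the helper names here are assumptions, not from the slides):

    import numpy as np

    def predictive(x_new, mN, SN, beta, phi):
        """Mean and variance of p(t | x_new, t, alpha, beta)."""
        p = phi(x_new)                       # feature vector phi(x)
        mean = mN @ p                        # m_N^T phi(x)
        var = 1.0 / beta + p @ SN @ p        # 1/beta + phi(x)^T S_N phi(x)
        return mean, var

    # e.g. with the linear basis (1, x) from the earlier example:
    # predictive(0.5, mN, SN, beta, lambda x: np.array([1.0, x]))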
Predictive Distribution with Simplified Prior

[Figures: examples with artificial sinusoidal data from sin(2πx) (green) and added noise, for N = 1, 2, 4 and 25 data points. Each plot shows the mean of the predictive distribution (red) and the region of one standard deviation around the mean (red shaded).]
Predictive Distribution with Simplified Prior

[Figures: plots of the function y(x, w) using samples from the posterior distribution over w, for N = 1, 2, 4 and 25 data points.]
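
A sketch of how such sample functions can be drawn; the values of m_N and S_N below are placeholders, not taken from the slides.

    import numpy as np
    import matplotlib.pyplot as plt

    mN = np.array([-0.3, 0.5])        # placeholder posterior mean
    SN = 0.05 * np.eye(2)             # placeholder posterior covariance

    xs = np.linspace(0.0, 1.0, 100)
    for w in np.random.default_rng(2).multivariate_normal(mN, SN, size=5):
        plt.plot(xs, w[0] + w[1] * xs)    # y(x, w) = w0 + w1 x
    plt.show()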
Limitations of Linear Basis Function Models

Basis functions φ_j(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
    Data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D. We need algorithms which place basis functions only where data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks).
    Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (neural networks).
Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. No surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 - ε?
The volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and

    (V_D(1) - V_D(1 - ε)) / V_D(1) = 1 - (1 - ε)^D
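
This fraction is quick to tabulate:

    # Shell volume fraction 1 - (1 - eps)^D for a thin shell eps = 0.1
    eps = 0.1
    for D in (1, 2, 5, 20):
        print(D, 1 - (1 - eps) ** D)
    # For D = 20, roughly 88% of the volume lies in the outer 10% shell.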
Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 - ε:

    (V_D(1) - V_D(1 - ε)) / V_D(1) = 1 - (1 - ε)^D

[Figure: volume fraction as a function of ε for D = 1, 2, 5 and 20; as D grows, the fraction approaches 1 even for small ε.]
Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.

[Figure: p(r) versus r for D = 1, 2 and 20; for large D the probability mass concentrates in a thin shell away from the origin.]
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution for various values of the dimensionality D.

Example: D = 2; assume \mu = 0, \Sigma = I:

    N(x | 0, I) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} x^T x \right)
                = \frac{1}{2\pi} \exp\left( -\frac{1}{2} (x_1^2 + x_2^2) \right)

Coordinate transformation

    x_1 = r \cos(\phi), \qquad x_2 = r \sin(\phi)

Density in the new coordinates (the Gaussian is evaluated at the point
x(r, \phi), and the change of variables contributes the factor | J |):

    p(r, \phi | 0, I) = N(x(r, \phi) | 0, I) \, | J |

where | J | = r is the determinant of the Jacobian of the given
coordinate transformation. Hence

    p(r, \phi | 0, I) = \frac{1}{2\pi} \, r \exp\left( -\frac{1}{2} r^2 \right)
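This result is easy to sanity-check by simulation. Integrating out \phi
(as done on the next slide) leaves the radial density r exp(-r^2/2), so
a histogram of the radii of many two-dimensional Gaussian samples should
match that curve. A small sketch (assuming numpy; the agreement claimed
in the comment is indicative, not exact):

    # Monte Carlo check of the D = 2 radial density r exp(-r^2 / 2).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1_000_000, 2))    # samples from N(0, I_2)
    radii = np.linalg.norm(x, axis=1)

    edges = np.linspace(0.0, 4.0, 41)
    hist, _ = np.histogram(radii, bins=edges, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    analytic = centres * np.exp(-0.5 * centres**2)

    # Typically agrees to a few parts in a thousand at this sample size.
    print(np.max(np.abs(hist - analytic)))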
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution for D = 2 (and \mu = 0, \Sigma = I):

    p(r, \phi | 0, I) = \frac{1}{2\pi} \, r \exp\left( -\frac{1}{2} r^2 \right)

Integrate over all angles \phi:

    p(r | 0, I) = \int_0^{2\pi} \frac{1}{2\pi} \, r \exp\left( -\frac{1}{2} r^2 \right) d\phi
                = r \exp\left( -\frac{1}{2} r^2 \right)

[Figure: the same plot of p(r) against r for D = 1, D = 2 and D = 20 as
on the previous slide; the D = 2 curve is exactly r \exp(-r^2 / 2).]
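The same change of variables goes through in any dimension D: the
angular integral is replaced by the surface area S_D r^{D-1} of a sphere
of radius r. The general radial density below is a standard result,
stated here for completeness; it is not derived on the slides.

    % Radial density of a D-dimensional standard Gaussian.
    \[
      p(r \mid 0, I) = \frac{S_D \, r^{D-1}}{(2\pi)^{D/2}}
                       \exp\!\left( -\frac{1}{2} r^2 \right),
      \qquad
      S_D = \frac{2 \pi^{D/2}}{\Gamma(D/2)}.
    \]
    % The mode follows from setting d/dr [ (D-1) \ln r - r^2/2 ] = 0:
    \[
      \frac{D-1}{r} - r = 0
      \quad \Longrightarrow \quad
      r^{*} = \sqrt{D-1}.
    \]

For D = 2 this recovers r \exp(-r^2/2) (since S_2 = 2\pi), and for
D = 20 the mode sits at \sqrt{19} \approx 4.4: almost all of the
probability mass lies in a thin shell far from the origin, which is the
behaviour the figure illustrates.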
