Models & Estimation

Merlise Clyde

STA721 Linear Models
Duke University

September 6, 2012




Outline




  Readings: Christensen, Chapters 1-2 and Appendix A




Models

  Take a random vector Y ∈ R^n which is observable and decompose

                              Y = µ + ε

  into µ ∈ R^n (unknown, fixed) and ε ∈ R^n, an unobservable random
  error vector.

  Usual assumptions:
      E[ε] = 0         ⇒  E[Y] = µ (mean vector)
      Cov[ε] = σ²I_n   ⇒  Cov[Y] = σ²I_n (errors are uncorrelated)
      ε ∼ N(0, σ²I)    ⇒  Y ∼ N(µ, σ²I)

  The distributional assumption allows us to write down a likelihood
  function.
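
  As a concrete illustration of the decomposition, a small simulation
  sketch (hypothetical values, not from the original slides):

      import numpy as np

      rng = np.random.default_rng(42)
      mu = np.array([1.0, 1.0, 2.0, 2.0, 2.0])   # fixed, unknown in practice
      sigma = 0.5                                 # error standard deviation
      eps = rng.normal(0.0, sigma, size=mu.size)  # unobservable eps ~ N(0, sigma^2 I)
      Y = mu + eps                                # the observable random vector
      print(Y)                                    # E[Y] = mu, Cov[Y] = sigma^2 I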

Likelihood Functions


  The likelihood function for (µ, σ²) is proportional to the sampling
  distribution of the data:

      L(µ, σ²) ∝ (2πσ²)^{-n/2} exp{ -(1/2)(Y − µ)ᵀ(Y − µ)/σ² }
               ∝ (σ²)^{-n/2} exp{ -(1/2)‖Y − µ‖²/σ² }

  Find the values µ̂ ∈ R^n and σ̂² ∈ R⁺ that maximize the likelihood
  L(µ, σ²).

  Clearly µ̂ = Y, but then σ̂² = 0, which is outside the parameter
  space.
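
  To see why, profile out σ² (a standard calculus step, sketched here;
  it is not spelled out on the original slide). For fixed µ, setting
  ∂ log L/∂σ² = 0 gives

      σ̂²(µ) = ‖Y − µ‖²/n,   so   µ̂ = Y  ⇒  σ̂² = ‖Y − Y‖²/n = 0,

  a degenerate point mass at Y. Restricting µ to a lower-dimensional
  subspace (the next slides) avoids this whenever Y does not lie in
  that subspace.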

Restrictions on µ

  Linear model: µ = Xβ for fixed X (n × p) and some β ∈ R^p,
  restricting the possible values for µ
      Oneway Anova Model: j = 1, . . . , J groups with n_j replicates in
      each group j
      “cell means”: Y_ij = µ_j + ε_ij

          µ = \begin{pmatrix}
                1_{n_1} & 0_{n_1} & \cdots  & 0_{n_1} \\
                0_{n_2} & 1_{n_2} & \cdots  & 0_{n_2} \\
                \vdots  &         & \ddots  & \vdots  \\
                0_{n_J} & \cdots  & 0_{n_J} & 1_{n_J}
              \end{pmatrix}
              \begin{pmatrix} µ_1 \\ µ_2 \\ \vdots \\ µ_J \end{pmatrix}

      1_{n_j} is a vector of n_j ones; 0_{n_j} is a vector of n_j zeros.
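
      A minimal sketch of this design matrix in code (hypothetical
      sizes, J = 3; not from the original slides):

          import numpy as np

          n = [2, 3, 2]                      # hypothetical replicates n_j, J = 3
          J = len(n)
          X_cell = np.zeros((sum(n), J))     # cell-means design matrix
          row = 0
          for j, nj in enumerate(n):
              X_cell[row:row + nj, j] = 1.0  # the 1_{n_j} block in column j
              row += nj

          beta = np.array([1.0, 2.0, 3.0])   # group means mu_1, mu_2, mu_3
          mu = X_cell @ beta                 # every observation gets its group mean
          print(mu)                          # [1. 1. 2. 2. 2. 3. 3.]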

Equivalent Model

  Linear model µ = Xβ for fixed X (n × p) and some β ∈ R^p
      Oneway Anova Model: j = 1, . . . , J groups with n_j replicates in
      each group j
      “Treatment effects”: Y_ij = µ + τ_j + ε_ij

          µ = \begin{pmatrix}
                1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots  & 0_{n_1} \\
                1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots  & 0_{n_2} \\
                \vdots  & \vdots  &         & \ddots  & \vdots  \\
                1_{n_J} & 0_{n_J} & \cdots  & 0_{n_J} & 1_{n_J}
              \end{pmatrix}
              \begin{pmatrix} µ \\ τ_1 \\ \vdots \\ τ_J \end{pmatrix}

      Equivalent means: µ_j = µ + τ_j
      Should our inference for µ depend on how we represent or
      parameterize µ? (See the sketch below.)
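
      A sketch checking the equivalence numerically (reusing X_cell and
      n from the previous snippet; the names are illustrative, not from
      the slides):

          # treatment-effects design: intercept column plus group indicators
          X_treat = np.column_stack([np.ones(sum(n)), X_cell])

          # each matrix's columns lie in the other's column space:
          B1 = np.linalg.lstsq(X_cell, X_treat, rcond=None)[0]
          B2 = np.linalg.lstsq(X_treat, X_cell, rcond=None)[0]
          print(np.allclose(X_cell @ B1, X_treat))  # True: C(X_treat) ⊆ C(X_cell)
          print(np.allclose(X_treat @ B2, X_cell))  # True: C(X_cell) ⊆ C(X_treat)

      The two parameterizations describe exactly the same set of mean
      vectors; only the coordinates β differ.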
Column Space

  Many equivalent ways to represent the same mean vector –
  inference should be independent of the coordinate system used.
      Let X_1, X_2, . . . , X_p ∈ R^n
      The set of all linear combinations of X_1, . . . , X_p is the space
      spanned by X_1, . . . , X_p, written S(X_1, . . . , X_p)
      Let X = [X_1 X_2 . . . X_p] be an n × p matrix with columns X_j;
      then the column space of X, C(X) = S(X_1, . . . , X_p), is the
      space spanned by the (column) vectors of X
      µ ∈ C(X): C(X) = {µ ∈ R^n | Xβ = µ for some β ∈ R^p}
      (also called the range of X, R(X))
      β gives the “coordinates” of µ in this space
      C(X) is a subspace of R^n

Vector spaces
  A collection of vectors V is a real vector space if the following
  conditions hold: to any pair of vectors x and y in V there
  corresponds a vector x + y ∈ V, and for all scalars α, β ∈ R:
    1   x + y = y + x (vector addition is commutative)
    2   (x + y) + z = x + (y + z) (vector addition is associative)
    3   there exists a unique vector 0 ∈ V (the origin) such that
        x + 0 = x for every vector x
    4   for every x ∈ V , there exists a unique vector −x such that
        x + (−x) = 0
    5   α(βx) = (αβ)x (multiplication by scalars is associative)
    6   1x = x for every x
    7   α(x + y) = αx + αy (multiplication by scalars is distributive
        with respect to vector addition)
    8   (α + β)x = αx + βx (multiplication by scalars is distributive
        with respect to scalar addition)


Subspaces

  Definition
  Let V be a vector space and let V0 be a set with V0 ⊆ V . V0 is a
  subspace of V if and only if V0 is a vector space.

  Theorem
  Let V be a vector space, and let V0 be a non-empty subset of V .
  If V0 is closed under vector addition and scalar multiplication, then
  V0 is a subspace of V .

  Theorem
  Let V be a vector space, and let x1 , . . . , xr be vectors in V . The
  set of all linear combinations of x1 , . . . , xr is a subspace of V .

  The column space of X, C(X), is a subspace of R^n
Basis

  Definition
  A finite set of vectors {x_i} is linearly dependent if there exist
  scalars {a_i}, not all zero, such that Σ_i a_i x_i = 0. Conversely, if
  Σ_i a_i x_i = 0 implies that a_i = 0 ∀ i, the set is linearly
  independent.

  Definition
  If V0 is a subspace of V and if {x_1, . . . , x_r} is a linearly
  independent spanning set for V0, then {x_1, . . . , x_r} is a basis
  for V0.

  Theorem
  If V0 is a subspace of V, then all bases for V0 have the same
  number of vectors.

  Can the collections of vectors in both the cell-means and the
  treatment-effects parameterizations be bases?
Rank




  Definition
  The rank of a subspace V0 is the number of elements in a basis for
  V0 and is written r(V0). Similarly, if A is a matrix, the rank of
  C(A) is called the rank of A and is written r(A).

  What is the rank of the subspace in the Oneway ANOVA model?
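
  A self-contained numerical answer (a sketch; group sizes are
  hypothetical): the cell-means columns are linearly independent, so the
  rank is J; the treatment-effects matrix has J + 1 columns, but the
  intercept column is the sum of the indicators, so the rank is still J
  and only the cell-means columns form a basis for the subspace.

      import numpy as np

      n = [2, 3, 2]; J = len(n)              # rebuild the earlier designs
      X_cell = np.zeros((sum(n), J))
      row = 0
      for j, nj in enumerate(n):
          X_cell[row:row + nj, j] = 1.0
          row += nj
      X_treat = np.column_stack([np.ones(sum(n)), X_cell])

      print(np.linalg.matrix_rank(X_cell))   # 3 = J: columns are a basis
      print(np.linalg.matrix_rank(X_treat))  # 3, not 4: 1_n = sum of indicators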




Orthogonality

  Definition
  An inner product space is a vector space V equipped with an inner
  product ⟨·, ·⟩: a mapping V × V → R. Two vectors are orthogonal,
  written x ⊥ y, if ⟨x, y⟩ = 0 (usually ⟨x, y⟩ = xᵀy)

  Definition
  Two subspaces M and N are orthogonal if x ∈ M and y ∈ N imply
  that x ⊥ y

  Definition
  Orthonormal basis: {x_1, . . . , x_r} is an orthonormal basis (ONB)
  for M if ⟨x_i, x_j⟩ = 0 for i ≠ j and ⟨x_i, x_i⟩ = 1 for all i. The
  length of x is ‖x‖ = √⟨x, x⟩; the distance between two vectors is
  ‖x − y‖
Oneway Anova

  Find an orthonormal basis for the oneway ANOVA model. Is it
  unique?




  Can also use Gram-Schmidt sequential orthogonalization
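
  A minimal Gram-Schmidt sketch in code (illustrative, not from the
  slides); applied to a one-way ANOVA design it returns one orthonormal
  basis, and reordering or rescaling the inputs shows it is not unique:

      import numpy as np

      def gram_schmidt(X, tol=1e-10):
          # sequentially orthonormalize the columns of X;
          # linearly dependent columns are dropped
          basis = []
          for v in X.T:
              w = v.astype(float)
              for q in basis:
                  w = w - (q @ w) * q      # remove the component along q
              norm = np.linalg.norm(w)
              if norm > tol:               # keep independent directions only
                  basis.append(w / norm)
          return np.column_stack(basis)

      X = np.array([[1, 1, 0], [1, 1, 0],  # hypothetical 2-group design
                    [1, 0, 1], [1, 0, 1],  # with the intercept included
                    [1, 0, 1]])
      Q = gram_schmidt(X)
      print(Q.shape)   # (5, 2): rank 2, one column was dependent
      print(Q.T @ Q)   # 2 x 2 identity: orthonormal columns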

Decomposition

  Definition
  If M and N are subspaces of V that satisfy
      M ∩ N = {0}
      M + N = V , where M + N = {z | z = x + y, x ∈ M, y ∈ N}
      (written M ⊕ N = V , a direct sum)
  then M and N are complementary spaces; N is the complement of
  M.

  Definition
  If M and N are complementary spaces (M + N = R^n) and M and N
  are orthogonal subspaces, then M is the orthogonal complement of
  N, written N⊥.

  If z ∈ R^n = M + N, then z decomposes uniquely into a part
  x ∈ M plus a part y ∈ N, and r(M) + r(N) = n.
Projections



  Definition
  P is an orthogonal projection operator onto C(X) if and only if
      u ∈ C(X) implies Pu = u (projection)
      v ⊥ C(X) implies Pv = 0 (perpendicularity)

      C(X) and C(X)⊥ are complementary spaces
      If y = u + v for y ∈ R^n, u ∈ C(X), and v ⊥ C(X), then
      Py = Pu + Pv = u
      u is the orthogonal projection of y onto C(X)



More on Projections



  Proposition
  If P is an orthogonal projection operator onto C(X), then
  C(X) = C(P).

  Theorem
  P is an orthogonal projection operator onto C(P) if and only if
      P = P² (idempotent)
      P = Pᵀ (symmetric)

  Claim: If X is n × p and r(X) = p, then P_X = X(XᵀX)⁻¹Xᵀ is an
  orthogonal projection operator onto C(X).
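
  A quick numerical sanity check of the claim (an illustrative sketch
  with a random full-column-rank X, not from the slides):

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.standard_normal((10, 3))       # n = 10, p = 3, full column rank
      P = X @ np.linalg.solve(X.T @ X, X.T)  # P_X = X (X^T X)^{-1} X^T

      print(np.allclose(P @ P, P))           # idempotent: P^2 = P
      print(np.allclose(P, P.T))             # symmetric:  P = P^T
      print(np.allclose(P @ X, X))           # fixes C(X): P_X X = X
      print(np.allclose((np.eye(10) - P) @ X, 0))  # I - P_X annihilates C(X)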



Projections
  Claim: If X is n × p and r(X) = p, then P_X = X(XᵀX)⁻¹Xᵀ is an
  orthogonal projection operator onto C(X)
      P = P² (idempotent):

          P_X² = P_X P_X = X(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ
                         = X(XᵀX)⁻¹Xᵀ
                         = P_X

      P = Pᵀ (symmetric):

          P_Xᵀ = (X(XᵀX)⁻¹Xᵀ)ᵀ
               = (Xᵀ)ᵀ((XᵀX)⁻¹)ᵀ(X)ᵀ
               = X(XᵀX)⁻¹Xᵀ       (XᵀX is symmetric, so is its inverse)
               = P_X

      C(X) = C(P_X)


Projections


  Claim: I − P_X is an orthogonal projection onto C(X)⊥
      Idempotent:

          (I − P_X)² = (I − P_X)(I − P_X)
                     = I − P_X − P_X + P_X P_X
                     = I − P_X − P_X + P_X
                     = I − P_X

      Symmetry: I − P_X = (I − P_X)ᵀ
      u ∈ C(X)⊥ ⇒ u ⊥ C(X) and (I − P_X)u = u (projection)
      If v ∈ C(X), then (I − P_X)v = v − v = 0 (perpendicularity)

Null Space

      Column space: C(X)
      N(A), the null space of A, is {u | Au = 0}
      The null space of Xᵀ is {u | Xᵀu = 0}
      N(Xᵀ) = C(X)⊥, the orthogonal complement of C(X)
      N(P_X) = C(X)⊥:

          u ∈ N(P_X) ⇔ P_X u = 0
                      ⇔ X(XᵀX)⁻¹Xᵀu = 0
                      ⇔ uᵀX(XᵀX)⁻¹Xᵀ = 0ᵀ     (transpose; P_X is symmetric)
                      ⇔ uᵀX(XᵀX)⁻¹XᵀXβ = 0 for all β ∈ R^p
                      ⇔ uᵀv = 0 for all v = Xβ ∈ C(X)


Models Again


      Y ∼ N(µ, σ² I_n) with µ ∈ C(X)
      Claim: the Maximum Likelihood Estimator (MLE) of µ is P_X Y
      Log likelihood:

          log L(µ, σ²) = −(n/2) log(σ²) − (1/2) ‖Y − µ‖²/σ²

      Decompose Y = P_X Y + (I − P_X)Y
      Use P_X µ = µ and simplify:

          ‖Y − µ‖² = ‖(I − P_X)Y‖² + ‖P_X Y − µ‖²

      (the cross term vanishes because (I − P_X)Y ⊥ C(X)), so over
      µ ∈ C(X) the norm is minimized, and the likelihood maximized,
      at µ̂ = P_X Y.
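
      A closing numerical sketch (simulated data; the design, coefficients,
      and seed are hypothetical):

          import numpy as np

          rng = np.random.default_rng(1)
          X = rng.standard_normal((50, 3))         # any full-column-rank design
          Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)

          P = X @ np.linalg.solve(X.T @ X, X.T)    # projection onto C(X)
          mu_hat = P @ Y                           # MLE of mu

          beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
          print(np.allclose(mu_hat, X @ beta_hat)) # True: same as least squares

          sigma2_hat = np.sum((Y - mu_hat) ** 2) / len(Y)
          print(sigma2_hat)                        # MLE of sigma^2 (divides by n)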


