Introduction to Statistical Machine Learning
© 2011 Christfried Webers
Statistical Machine Learning Group
NICTA and College of Engineering and Computer Science
The Australian National University
Canberra, February – June 2011
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part III: Linear Algebra

Contents:
- Basic Concepts
- Linear Transformations
- Trace
- Inner Product
- Projection
- Rank, Determinant, Trace
- Matrix Inverse
- Eigenvectors
- Singular Value Decomposition
- Directional Derivative, Gradient
- Books
Intuition

Geometry: points and lines; vector addition and scaling.
Humans have experience with 3 dimensions (less with 1 and 2, though).
Generalisation to N dimensions (possibly N → ∞):
- line → vector space V
- point → vector x ∈ V

Example: X ∈ R^{n×m}. The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic.
Vector Space V over a Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following holds:
- Associativity of addition: u + (v + w) = (u + v) + w.
- Commutativity of addition: v + w = w + v.
- Identity element of addition: there exists 0 such that v + 0 = v for all v ∈ V.
- Inverse of addition: for all v ∈ V there exists w ∈ V such that v + w = 0. The additive inverse is denoted −v.
- Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
- Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
- Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
- Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.
Matrix–Vector Multiplication

    for i in range(m):
        R[i] = 0.0
        for j in range(n):
            R[i] = R[i] + A[i, j] * V[j]

Listing 1: Code for elementwise (row-by-row) matrix–vector multiplication.

A V = R:

$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
\begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{pmatrix}
=
\begin{pmatrix}
a_{11}v_1 + a_{12}v_2 + \cdots + a_{1n}v_n\\
a_{21}v_1 + a_{22}v_2 + \cdots + a_{2n}v_n\\
\vdots\\
a_{m1}v_1 + a_{m2}v_2 + \cdots + a_{mn}v_n
\end{pmatrix}
$$
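As a quick sanity check, the row-by-row loop can be compared against numpy's built-in product. This is an illustrative sketch with made-up values, not part of the slides:

```python
import numpy as np

# Small illustrative example: the row-wise loop from Listing 1.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
V = np.array([1.0, -1.0, 2.0])
m, n = A.shape

R = np.zeros(m)
for i in range(m):          # one entry of R per row of A
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]
```

The loop should agree with `A @ V` exactly (up to floating-point rounding).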
Matrix–Vector Multiplication (columnwise)

The same product A V = R can be accumulated column by column:

    R = A[:, 0] * V[0]
    for j in range(1, n):
        R += A[:, j] * V[j]

Listing 2: Code for columnwise matrix–vector multiplication.
Matrix–Vector Multiplication (column view)

Denote the n columns of A by a_1, a_2, ..., a_n. Each a_i is a (column) vector. Then

$$
A V =
\begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}
\begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{pmatrix}
= a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = R
$$
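The column view says R is a linear combination of the columns of A, weighted by the entries of V. A small numpy sketch of this (illustrative values, not from the slides):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
V = np.array([1.0, -1.0, 2.0])

# R as a linear combination of the columns of A: sum_j V[j] * a_j
R = sum(V[j] * A[:, j] for j in range(A.shape[1]))
```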
Transpose of R

Given R = AV,

$$ R = a_1 v_1 + a_2 v_2 + \cdots + a_n v_n $$

What is R^T?

$$ R^T = v_1 a_1^T + v_2 a_2^T + \cdots + v_n a_n^T $$

This is NOT equal to A^T V^T. (In fact, A^T V^T is not even defined.)
Transpose: Reverse Order Rule

(AV)^T = V^T A^T

$$
V^T A^T =
\begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}
\begin{pmatrix} a_1^T\\ a_2^T\\ \vdots\\ a_n^T \end{pmatrix}
= v_1 a_1^T + v_2 a_2^T + \cdots + v_n a_n^T = R^T
$$
Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

$$ \operatorname{tr}\{A\} = \sum_{k=1}^{n} A_{kk} $$

Example:

$$ A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{tr}\{A\} = 1 + 5 + 9 = 15 $$

The trace is not defined for a non-square matrix.
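The 3×3 example can be reproduced in a line or two of numpy (illustrative sketch, not from the slides):

```python
import numpy as np

# The 3x3 example from the slide: entries 1..9.
A = np.arange(1, 10).reshape(3, 3)

# Trace = sum of the diagonal elements A_kk.
t = sum(A[k, k] for k in range(3))
```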
The vec(·) Operator

Define vec(X) as the vector which results from stacking all columns of a matrix X on top of each other.

Example:

$$ A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{vec}(A) = \begin{pmatrix} 1 & 4 & 7 & 2 & 5 & 8 & 3 & 6 & 9 \end{pmatrix}^T $$

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

$$ \operatorname{tr}\{X^T Y\} = \operatorname{vec}(X)^T \operatorname{vec}(Y) $$
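The identity tr{X^T Y} = vec(X)^T vec(Y) is easy to verify numerically. A hedged numpy sketch (random test matrices are my own choice; note that column-stacking corresponds to Fortran, i.e. column-major, order in numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Y = rng.standard_normal((4, 3))

# vec stacks the columns on top of each other -> column-major ("F") order.
def vec(M):
    return M.reshape(-1, order="F")

lhs = np.trace(X.T @ Y)
rhs = vec(X) @ vec(Y)
```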
Inner Product

A mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

⟨·, ·⟩ : V × V → F

- Conjugate symmetry: $\langle x, y\rangle = \overline{\langle y, x\rangle}$.
- Linearity: ⟨a x, y⟩ = a ⟨x, y⟩, and ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
- Positive-definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 only for x = 0.
Canonical Inner Products

Inner product for real numbers x, y ∈ R: ⟨x, y⟩ = x y.

Dot product between two vectors x, y ∈ R^n:

$$ \langle x, y\rangle = x^T y = \sum_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n $$

Canonical inner product for matrices X, Y ∈ R^{n×p}:

$$ \langle X, Y\rangle = \operatorname{tr}\{X^T Y\} = \sum_{k=1}^{p} (X^T Y)_{kk} = \sum_{k=1}^{p}\sum_{l=1}^{n} (X^T)_{kl}\, Y_{lk} = \sum_{k=1}^{p}\sum_{l=1}^{n} X_{lk} Y_{lk} $$
Calculations with Matrices

Denote the (i, j)-th element of a matrix X by X_{ij}.
- Transpose: (X^T)_{ij} = X_{ji}
- Product: (XY)_{ij} = \sum_k X_{ik} Y_{kj}

Proof of the linearity of the canonical matrix inner product:

$$
\begin{aligned}
\langle X + Y, Z\rangle
&= \operatorname{tr}\{(X+Y)^T Z\} = \sum_{k=1}^{p} ((X+Y)^T Z)_{kk}\\
&= \sum_{k=1}^{p}\sum_{l=1}^{n} ((X+Y)^T)_{kl}\, Z_{lk}
 = \sum_{k=1}^{p}\sum_{l=1}^{n} (X_{lk} + Y_{lk})\, Z_{lk}\\
&= \sum_{k=1}^{p}\sum_{l=1}^{n} \left( X_{lk} Z_{lk} + Y_{lk} Z_{lk} \right)
 = \sum_{k=1}^{p}\sum_{l=1}^{n} \left( (X^T)_{kl} Z_{lk} + (Y^T)_{kl} Z_{lk} \right)\\
&= \sum_{k=1}^{p} \left( (X^T Z)_{kk} + (Y^T Z)_{kk} \right)
 = \operatorname{tr}\{X^T Z\} + \operatorname{tr}\{Y^T Z\}
 = \langle X, Z\rangle + \langle Y, Z\rangle
\end{aligned}
$$
Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that P² = P.

Example: the map that sends a point (a, b) to (a − b, 0) on the x-axis:

$$ A \begin{pmatrix} a\\ b \end{pmatrix} = \begin{pmatrix} a - b\\ 0 \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & -1\\ 0 & 0 \end{pmatrix} $$

(Figure: the point (a, b) and its image (a − b, 0) on the x-axis.)
Orthogonal Projection

Orthogonality requires an inner product ⟨x, y⟩.
Choose two arbitrary vectors x and y. For an orthogonal projection P, the vectors Px and y − Py are orthogonal:

$$ 0 = \langle Px,\; y - Py\rangle = (Px)^T (y - Py) = x^T (P^T - P^T P)\, y $$

Orthogonal projection: P² = P and P^T P = P, hence P^T = P.

Example: given a unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

$$ P = u u^T $$

Proof: P^T = P, and P² x = (u u^T)(u u^T) x = u (u^T u) u^T x = u u^T x = P x.
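The rank-one projector P = u uᵀ can be checked numerically: it should be idempotent, symmetric, and leave already-projected vectors unchanged. An illustrative numpy sketch (the particular u and x are my own choice):

```python
import numpy as np

u = np.array([3.0, 4.0]) / 5.0   # a unit vector: ||u|| = 1
P = np.outer(u, u)               # P = u u^T, projects onto the line span{u}

x = np.array([2.0, 1.0])
px = P @ x                       # projection of x onto the line
```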
Orthogonal Projection onto the Column Space

Given a matrix A ∈ R^{n×p} (with full column rank, so that A^T A is invertible) and a vector x ∈ R^n: what is the closest point x̂ to x in the column space of A?

Orthogonal projection into the column space of A:

$$ \hat{x} = A (A^T A)^{-1} A^T x $$

Projection matrix:

$$ P = A (A^T A)^{-1} A^T $$

Proof:

$$ P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P $$

Orthogonal projection?

$$ P^T = \left( A (A^T A)^{-1} A^T \right)^T = A (A^T A)^{-1} A^T = P $$

(using that (A^T A)^{-1} is symmetric because A^T A is).
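A numerical check of this projector (illustrative numpy sketch with random data of my own choosing): P should be idempotent and symmetric, and the residual x − x̂ should be orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))          # tall matrix; full column rank (almost surely)
x = rng.standard_normal(5)

P = A @ np.linalg.inv(A.T @ A) @ A.T     # projection onto col(A)
x_hat = P @ x                            # closest point to x in col(A)
```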
Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, vectors v_1, ..., v_k ∈ V and scalars a_1, ..., a_k: the vectors are linearly independent if and only if

$$ a_1 v_1 + a_2 v_2 + \cdots + a_k v_k = 0 $$

has only the trivial solution a_1 = a_2 = \cdots = a_k = 0.
Rank

- The column rank of a matrix A is the maximal number of linearly independent columns of A.
- The row rank of a matrix A is the maximal number of linearly independent rows of A.
- For every matrix A, the row rank and column rank are equal; this common value is called the rank of A.
Determinant

Let S_n be the set of all permutations of the numbers {1, ..., n}. The determinant of a square matrix A is then given by

$$ \det\{A\} = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} A_{i,\sigma(i)} $$

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

Properties:
- det{AB} = det{A} det{B} for square matrices A, B.
- det{A^{-1}} = det{A}^{-1}.
- det{A^T} = det{A}.
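The three determinant properties can be verified numerically on random matrices. An illustrative numpy sketch (the test matrices are my own choice):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

dA = np.linalg.det(A)
dB = np.linalg.det(B)
```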
Matrix Inverse

Identity matrix I:

$$ I = A A^{-1} = A^{-1} A $$

The matrix inverse A^{-1} exists only for square matrices which are NOT singular.

A singular matrix has
- at least one eigenvalue equal to zero,
- determinant |A| = 0.

Inversion 'reverses the order' (like transposition):

$$ (AB)^{-1} = B^{-1} A^{-1} $$

(Only then is (AB)(AB)^{-1} = (AB) B^{-1} A^{-1} = A A^{-1} = I.)
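A quick numerical check of the reverse-order rule (illustrative numpy sketch; the random matrices are my own choice and are almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

left = np.linalg.inv(A @ B)              # (AB)^{-1}
right = np.linalg.inv(B) @ np.linalg.inv(A)  # B^{-1} A^{-1}
```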
Matrix Inverse and Transpose

What is (A^{-1})^T? We need to assume that A^{-1} exists.

Rule:

$$ (A^{-1})^T = (A^T)^{-1} $$
Useful Identity

$$ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} $$

How to analyse and prove such an equation?
- A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
- Same for C^{-1}, so let's assume C ∈ R^{m×m}.
- Therefore B ∈ R^{m×n}.
Prove the Identity

$$ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1} $$

Multiply both sides from the right by (B A B^T + C):

$$ (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T $$

Simplify the left-hand side:

$$
\begin{aligned}
(A^{-1} + B^T C^{-1} B)^{-1} \left[ B^T C^{-1} B A B^T + B^T \right]
&= (A^{-1} + B^T C^{-1} B)^{-1} \left[ B^T C^{-1} B + A^{-1} \right] A B^T\\
&= A B^T
\end{aligned}
$$
Woodbury Identity

$$ (A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1} $$

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns), and C is very wide (few rows, many columns).
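The Woodbury identity can be verified numerically in the regime the slide describes: A diagonal, B tall, C wide. An illustrative numpy sketch (shapes and values are my own choice; here D is taken as the identity for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal, trivially invertible
B = rng.standard_normal((n, k))              # tall: many rows, few columns
C = rng.standard_normal((k, n))              # wide: few rows, many columns
D = np.eye(k)

# Left-hand side: invert the full n x n matrix directly.
lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)

# Right-hand side: only a k x k inverse is needed besides the cheap A^{-1}.
Ainv = np.diag(1.0 / np.diag(A))
rhs = Ainv - Ainv @ B @ np.linalg.inv(D + C @ Ainv @ B) @ C @ Ainv
```

The payoff is that the right-hand side replaces an n×n inversion with a k×k one (k ≪ n).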
Manipulating Matrix Equations

- Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
- Don't multiply by non-square matrices.
- Don't assume a matrix inverse exists just because the matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a·0 = b·0.
Eigenvectors

For a square matrix A ∈ R^{n×n}, an eigenvector x ∈ C^n with eigenvalue λ ∈ C satisfies

$$ A x = \lambda x $$

Example:

$$ \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix} x = \lambda x, \qquad \lambda \in \{-\imath, \imath\}, \qquad x \in \left\{ \begin{pmatrix} \imath\\ 1 \end{pmatrix}, \begin{pmatrix} -\imath\\ 1 \end{pmatrix} \right\} $$

(eigenvalue −ı pairs with (ı, 1)^T, and eigenvalue ı with (−ı, 1)^T).
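The rotation-like example above can be reproduced with numpy, which confirms that a real matrix can have purely imaginary eigenvalues. An illustrative sketch (not from the slides):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

# eig returns complex eigenvalues/eigenvectors for this real matrix.
lam, X = np.linalg.eig(A)
```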
Eigenvectors (continued)

How many eigenvalue/eigenvector pairs?

A x = λ x is equivalent to

$$ (A - \lambda I) x = 0 $$

which has a non-trivial solution only for det{A − λI} = 0.
This is a polynomial of n-th order in λ, so there are at most n distinct solutions.
Real Eigenvalues

How can we enforce real eigenvalues? Let's look at matrices with complex entries, A ∈ C^{n×n}.

Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

$$ \begin{pmatrix} 1 + \imath 2 & 3 + \imath 4\\ 5 + \imath 6 & 7 + \imath 8 \end{pmatrix}^H = \begin{pmatrix} 1 - \imath 2 & 5 - \imath 6\\ 3 - \imath 4 & 7 - \imath 8 \end{pmatrix} $$

Denote the complex conjugate of a complex number λ by λ̄.
Real Eigenvalues (continued)

How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A). For an eigenvector x ∈ C^n of A, calculate

$$ x^H A x = \lambda x^H x $$

Another way to calculate x^H A x:

$$
\begin{aligned}
x^H A x &= x^H A^H x && \text{(A is Hermitian)}\\
&= (x^H A x)^H && \text{(reverse order)}\\
&= (\lambda x^H x)^H && \text{(eigenvalue)}\\
&= \bar{\lambda}\, x^H x
\end{aligned}
$$

and therefore λ = λ̄ (λ is real).

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
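A small numerical illustration of this result (the Hermitian test matrix is my own choice, not from the slides):

```python
import numpy as np

# A Hermitian matrix: real diagonal, off-diagonal entries are complex conjugates.
A = np.array([[2.0, 3.0 - 1.0j],
              [3.0 + 1.0j, 5.0]])

lam = np.linalg.eigvals(A)   # should be real up to floating-point noise
```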
Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

$$ A = U \Sigma V^T $$

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal (and zeros elsewhere).
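The decomposition can be computed and checked with numpy (illustrative sketch; note that `np.linalg.svd` returns V^T directly, and Σ is returned as a 1-D array of singular values that has to be embedded into an n×p matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 4, 3
A = rng.standard_normal((n, p))

U, s, Vt = np.linalg.svd(A)      # full matrices: U is n x n, Vt is p x p

# Embed the singular values into an n x p diagonal matrix Sigma.
Sigma = np.zeros((n, p))
Sigma[:p, :p] = np.diag(s)
```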
How to Calculate a Gradient?

Given a vector space V and a function f : V → R.
Example: X ∈ R^{n×p}, arbitrary C ∈ R^{n×n}, f(X) = tr{X^T C X}.

How to calculate the gradient of f? As stated, this is an ill-defined question: in a plain vector space there is no gradient, only a directional derivative at a point X ∈ V in a direction ξ ∈ V:

$$ Df(X)(\xi) = \lim_{h \to 0} \frac{f(X + h\xi) - f(X)}{h} $$

For the example f(X) = tr{X^T C X}:

$$
\begin{aligned}
Df(X)(\xi) &= \lim_{h \to 0} \frac{\operatorname{tr}\{(X + h\xi)^T C (X + h\xi)\} - \operatorname{tr}\{X^T C X\}}{h}\\
&= \operatorname{tr}\{\xi^T C X + X^T C \xi\} = \operatorname{tr}\{X^T (C^T + C)\,\xi\}
\end{aligned}
$$
How to Calculate a Gradient? (continued)

Given a vector space V and a function f : V → R, to calculate the gradient of f:
1. Calculate the directional derivative.
2. Define an inner product ⟨A, B⟩ on the vector space.
3. The gradient is now defined via

$$ Df(X)(\xi) = \langle \operatorname{grad} f, \xi \rangle $$

For the example f(X) = tr{X^T C X}: define the inner product ⟨A, B⟩ = tr{A^T B} and write the directional derivative as an inner product with ξ:

$$ Df(X)(\xi) = \operatorname{tr}\{X^T (C^T + C)\,\xi\} = \langle (C + C^T) X,\; \xi \rangle = \langle \operatorname{grad} f, \xi \rangle $$

so grad f = (C + C^T) X.
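The gradient formula can be checked against a finite-difference estimate of the directional derivative. An illustrative numpy sketch (random C, X and direction ξ are my own choice; the central difference is exact for this quadratic f, up to rounding):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 4, 2
C = rng.standard_normal((n, n))
X = rng.standard_normal((n, p))
xi = rng.standard_normal((n, p))        # an arbitrary direction

f = lambda M: np.trace(M.T @ C @ M)     # f(X) = tr(X^T C X)
grad = (C + C.T) @ X                    # candidate gradient under <A,B> = tr(A^T B)

# Central-difference directional derivative Df(X)(xi).
h = 1e-5
num = (f(X + h * xi) - f(X - h * xi)) / (2 * h)
ana = np.trace(grad.T @ xi)             # <grad f, xi>
```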
Books

- Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
- David C. Lay, "Linear Algebra and Its Applications", Addison Wesley, 2005.