2. (Goodfellow 2016)
About this chapter
• Not a comprehensive survey of all of linear algebra
• Focused on the subset most relevant to deep
learning
• Larger subset: e.g., Linear Algebra by Georgi Shilov
3. (Goodfellow 2016)
Scalars
• A scalar is a single number
• Integers, real numbers, rational numbers, etc.
• We denote it with italic font:
a, n, x
4. (Goodfellow 2016)
Vectors
• A vector is a 1-D array of numbers:
• Can be real, binary, integer, etc.
• Example notation for type and size:
rder. We can identify each individual number by its index in that ordering.
ypically we give vectors lower case names written in bold typeface, such
s x. The elements of the vector are identified by writing its name in italic
ypeface, with a subscript. The first element of x is x1, the second element
x2 and so on. We also need to say what kind of numbers are stored in
he vector. If each element is in R, and the vector has n elements, then the
ector lies in the set formed by taking the Cartesian product of R n times,
enoted as Rn. When we need to explicitly identify the elements of a vector,
e write them as a column enclosed in square brackets:
x =
2
6
6
6
4
x1
x2
...
xn
3
7
7
7
5
. (2.1)
We can think of vectors as identifying points in space, with each element
ving the coordinate along a different axis.
ometimes we need to index a set of elements of a vector. In this case, we
efine a set containing the indices and write the set as a subscript. For
xample, to access x1, x3 and x6, we define the set S = {1, 3, 6} and write
S. We use the sign to index the complement of a set. For example x 1 is
he vector containing all elements of x except for x1, and x S is the vector
ontaining all of the elements of x except for x1, x3 and x6.
Matrices: A matrix is a 2-D array of numbers, so each element is identified by
wo indices instead of just one. We usually give matrices upper-case variable
ames with bold typeface, such as A. If a real-valued matrix A has a height
• Vectors: A vector is an array of
order. We can identify each indiv
Typically we give vectors lower
as x. The elements of the vector
typeface, with a subscript. The fi
is x2 and so on. We also need t
the vector. If each element is in R
vector lies in the set formed by t
denoted as Rn. When we need to
5. (Goodfellow 2016)
Matrices
• A matrix is a 2-D array of numbers:
• Example notation for type and shape:
A =
2
4
A1,1 A1,2
A2,1 A2,2
A3,1 A3,2
3
5 ) A>
=
A1,1 A2,1 A3,1
A1,2 A2,2 A3,2
The transpose of the matrix can be thought of as a mirror image across the
nal.
th column of A. When we need to explicitly identify the elements of a
x, we write them as an array enclosed in square brackets:
A1,1 A1,2
A2,1 A2,2
. (2.2)
times we may need to index matrix-valued expressions that are not just
gle letter. In this case, we use subscripts after the expression, but do
onvert anything to lower case. For example, f(A)i,j gives element (i, j)
e matrix computed by applying the function f to A.
ors: In some cases we will need an array with more than two axes. In
eneral case, an array of numbers arranged on a regular grid with a
ble number of axes is known as a tensor. We denote a tensor named “A”
this typeface: A. We identify the element of A at coordinates (i, j, k)
riting Ai,j,k.
ndices and write the set as a subscript
x6, we define the set S = {1, 3, 6} and
x the complement of a set. For example
ents of x except for x1, and x S is the
of x except for x1, x3 and x6.
ray of numbers, so each element is identifi
. We usually give matrices upper-case va
h as A. If a real-valued matrix A has a
we say that A 2 Rm⇥n. We usually id
g its name in italic but not bold font, an
Column
Row
6. (Goodfellow 2016)
Tensors
• A tensor is an array of numbers, that may have
• zero dimensions, and be a scalar
• one dimension, and be a vector
• two dimensions, and be a matrix
• or more dimensions.
7. (Goodfellow 2016)
Matrix Transpose
CHAPTER 2. LINEAR ALGEBRA
A =
2
4
A1,1 A1,2
A2,1 A2,2
A3,1 A3,2
3
5 ) A>
=
A1,1 A2,1 A3,1
A1,2 A2,2 A3,2
Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the
main diagonal.
the i-th column of A. When we need to explicitly identify the elements of a
matrix, we write them as an array enclosed in square brackets:
A1,1 A1,2
A2,1 A2,2
. (2.2)
rtant operation on matrices is the transpose. The transpose of a
mirror image of the matrix across a diagonal line, called the main
ning down and to the right, starting from its upper left corner. See
graphical depiction of this operation. We denote the transpose of a
A>, and it is defined such that
(A>
)i,j = Aj,i. (2.3)
an be thought of as matrices that contain only one column. The
a vector is therefore a matrix with only one row. Sometimes we
33
ations have many useful properties that make mathematical
more convenient. For example, matrix multiplication is
A(B + C) = AB + AC. (2.6)
A(BC) = (AB)C. (2.7)
is not commutative (the condition AB = BA does not
alar multiplication. However, the dot product between two
:
x>
y = y>
x. (2.8)
matrix product has a simple form:
(AB)>
= B>
A>
. (2.9)
onstrate Eq. 2.8, by exploiting the fact that the value of
8. (Goodfellow 2016)
Matrix (Dot) Product
duct of matrices A and B is a third matrix C. In order
ned, A must have the same number of columns as B has
⇥ n and B is of shape n ⇥ p, then C is of shape m ⇥ p.
product just by placing two or more matrices together,
C = AB. (2.4)
n is defined by
Ci,j =
X
k
Ai,kBk,j. (2.5)
d product of two matrices is not just a matrix containing
dual elements. Such an operation exists and is called the
Hadamard product, and is denoted as A B.
en two vectors x and y of the same dimensionality is the
can think of the matrix product C = AB as computing
etween row i of A and column j of B.
= •m
p
m
p
n
n
Must
match
e defined, A must have the same number of columns as B has
pe m ⇥ n and B is of shape n ⇥ p, then C is of shape m ⇥ p.
atrix product just by placing two or more matrices together,
C = AB. (2.4)
ration is defined by
Ci,j =
X
k
Ai,kBk,j. (2.5)
andard product of two matrices is not just a matrix containing
ndividual elements. Such an operation exists and is called the
t or Hadamard product, and is denoted as A B.
between two vectors x and y of the same dimensionality is the
. We can think of the matrix product C = AB as computing
uct between row i of A and column j of B.
34
9. (Goodfellow 2016)
Identity Matrix
PTER 2. LINEAR ALGEBRA
2
4
1 0 0
0 1 0
0 0 1
3
5
Figure 2.2: Example identity matrix: This is I3.
A2,1x1 + A2,2x2 + · · · + A2,nxn = b2 (2
. . . (2
Am,1x1 + Am,2x2 + · · · + Am,nxn = bm. (2
Matrix-vector product notation provides a more compact representation
ations of this form.
Inverse Matrices
werful tool called matrix inversion that allows us to
for many values of A.
ersion, we first need to define the concept of an identity
x is a matrix that does not change any vector when we
at matrix. We denote the identity matrix that preserves
n. Formally, In 2 Rn⇥n, and
8x 2 Rn
, Inx = x. (2.20)
ity matrix is simple: all of the entries along the main
the other entries are zero. See Fig. 2.2 for an example.
10. (Goodfellow 2016)
Systems of Equations
t of useful properties of the matrix product here, but
that many more exist.
ear algebra notation to write down a system of linear
Ax = b (2.11)
n matrix, b 2 Rm is a known vector, and x 2 Rn is a
we would like to solve for. Each element xi of x is one
Each row of A and each element of b provide another
Eq. 2.11 as:
A1,:x = b1 (2.12)
A2,:x = b2 (2.13)
. . . (2.14)
expands to
aware that many more exist.
ough linear algebra notation to write down a system of linear
Ax = b (2.11)
a known matrix, b 2 Rm is a known vector, and x 2 Rn is a
riables we would like to solve for. Each element xi of x is one
iables. Each row of A and each element of b provide another
ewrite Eq. 2.11 as:
A1,:x = b1 (2.12)
A2,:x = b2 (2.13)
. . . (2.14)
Am,:x = bm (2.15)
tly, as:
A1,1x1 + A1,2x2 + · · · + A1,nxn = b1 (2.16)
35
11. (Goodfellow 2016)
Solving Systems of Equations
• A linear system of equations can have:
• No solution
• Many solutions
• Exactly one solution: this means multiplication by
the matrix is an invertible function
12. (Goodfellow 2016)
Matrix Inversion
• Matrix inverse:
• Solving a system using an inverse:
• Numerically unstable, but useful for abstract
analysis
identity matrix is simple: all of the entries along the main
all of the other entries are zero. See Fig. 2.2 for an example.
se of A is denoted as A 1, and it is defined as the matrix
A 1
A = In. (2.21)
Eq. 2.11 by the following steps:
Ax = b (2.22)
A 1
Ax = A 1
b (2.23)
Inx = A 1
b (2.24)
36
f the identity matrix is simple: all of the entries along the main
while all of the other entries are zero. See Fig. 2.2 for an example.
inverse of A is denoted as A 1, and it is defined as the matrix
A 1
A = In. (2.21)
solve Eq. 2.11 by the following steps:
Ax = b (2.22)
A 1
Ax = A 1
b (2.23)
Inx = A 1
b (2.24)
36
13. (Goodfellow 2016)
Invertibility
• Matrix can’t be inverted if…
• More rows than columns
• More columns than rows
• Redundant rows/columns (“linearly dependent”,
“low rank”)
14. (Goodfellow 2016)
Norms
• Functions that measure how “large” a vector is
• Similar to a distance between zero and the point
represented by the vector
is given by
||x||p =
X
i
|xi|p
!1
p
for p 2 R, p 1.
Norms, including the Lp norm, are functions mapping vect
values. On an intuitive level, the norm of a vector x measure
the origin to the point x. More rigorously, a norm is any func
the following properties:
• f(x) = 0 ) x = 0
• f(x + y) f(x) + f(y) (the triangle inequality)
• 8↵ 2 R, f(↵x) = |↵|f(x)
The L2 norm, with p = 2, is known as the Euclidean nor
Euclidean distance from the origin to the point identified by
15. (Goodfellow 2016)
• Lp
norm
• Most popular norm: L2 norm, p=2
• L1 norm, p=1:
• Max norm, infinite p:
Norms
o measure the size of a vector. In machine learning, we
ectors using a function called a norm. Formally, the L
||x||p =
X
i
|xi|p
!1
p
the Lp norm, are functions mapping vectors to non-n
ive level, the norm of a vector x measures the distan
nt x. More rigorously, a norm is any function f that
ties:
APTER 2. LINEAR ALGEBRA
ause it increases very slowly near the origin. In several machine learning
lications, it is important to discriminate between elements that are exactly
and elements that are small but nonzero. In these cases, we turn to a function
t grows at the same rate in all locations, but retains mathematical simplicity:
L1 norm. The L1 norm may be simplified to
||x||1 =
X
i
|xi|. (2.31)
L1 norm is commonly used in machine learning when the difference between
o and nonzero elements is very important. Every time an element of x moves
y from 0 by ✏, the L1 norm increases by ✏.
We sometimes measure the size of the vector by counting its number of nonzero
ments. Some authors refer to this function as the “L0 norm,” but this is incorrect
minology. The number of non-zero entries in a vector is not a norm, because
zero and elements that are small but nonzero. In these cases, we turn to a function
that grows at the same rate in all locations, but retains mathematical simplicity:
the L1 norm. The L1 norm may be simplified to
||x||1 =
X
i
|xi|. (2.31)
The L1 norm is commonly used in machine learning when the difference between
zero and nonzero elements is very important. Every time an element of x moves
away from 0 by ✏, the L1 norm increases by ✏.
We sometimes measure the size of the vector by counting its number of nonzero
elements. Some authors refer to this function as the “L0 norm,” but this is incorrect
terminology. The number of non-zero entries in a vector is not a norm, because
scaling the vector by ↵ does not change the number of nonzero entries. The L1
norm is often used as a substitute for the number of nonzero entries.
One other norm that commonly arises in machine learning is the L1 norm,
also known as the max norm. This norm simplifies to the absolute value of the
element with the largest magnitude in the vector,
||x||1 = max
i
|xi|. (2.32)
Sometimes we may also wish to measure the size of a matrix. In the context
of deep learning, the most common way to do this is with the otherwise obscure
Frobenius norm sX
16. (Goodfellow 2016)
• Unit vector:
• Symmetric Matrix:
• Orthogonal matrix:
Special Matrices and Vectors
A = A>
. (2.35)
en arise when the entries are generated by some function of
es not depend on the order of the arguments. For example,
ance measurements, with Ai,j giving the distance from point
= Aj,i because distance functions are symmetric.
ector with unit norm:
||x||2 = 1. (2.36)
vector y are orthogonal to each other if x>y = 0. If both
orm, this means that they are at a 90 degree angle to each
n vectors may be mutually orthogonal with nonzero norm.
only orthogonal but also have unit norm, we call them
ix is a square matrix whose rows are mutually orthonormal
e mutually orthonormal:
> >
1 n
machine learning algorithm in terms of arbitrary matrices,
sive (and less descriptive) algorithm by restricting some
ices need be square. It is possible to construct a rectangular
quare diagonal matrices do not have inverses but it is still
them cheaply. For a non-square diagonal matrix D, the
scaling each element of x, and either concatenating some
is taller than it is wide, or discarding some of the last
D is wider than it is tall.
is any matrix that is equal to its own transpose:
A = A>
. (2.35)
n arise when the entries are generated by some function of
s not depend on the order of the arguments. For example,
nce measurements, with Ai,j giving the distance from point
= Aj,i because distance functions are symmetric.
es often arise when the entries are generated by some function of
at does not depend on the order of the arguments. For example,
distance measurements, with Ai,j giving the distance from point
Ai,j = Aj,i because distance functions are symmetric.
is a vector with unit norm:
||x||2 = 1. (2.36)
nd a vector y are orthogonal to each other if x>y = 0. If both
ero norm, this means that they are at a 90 degree angle to each
most n vectors may be mutually orthogonal with nonzero norm.
e not only orthogonal but also have unit norm, we call them
matrix is a square matrix whose rows are mutually orthonormal
ns are mutually orthonormal:
A>
A = AA>
= I. (2.37)
41
LGEBRA
A 1
= A>
, (2.38)
17. (Goodfellow 2016)
Eigendecomposition
• Eigenvector and eigenvalue:
• Eigendecomposition of a diagonalizable matrix:
• Every real symmetric matrix has a real, orthogonal
eigendecomposition:
atrix as an array of elements.
dely used kinds of matrix decomposition is called eigen-
h we decompose a matrix into a set of eigenvectors and
square matrix A is a non-zero vector v such that multipli-
the scale of v:
Av = v. (2.39)
as the eigenvalue corresponding to this eigenvector. (One
vector such that v>A = v>, but we are usually concerned
.
or of A, then so is any rescaled vector sv for s 2 R, s 6= 0.
he same eigenvalue. For this reason, we usually only look
example of the effect of eigenvectors and eigenvalues. Here, we have
h two orthonormal eigenvectors, v(1)
with eigenvalue 1 and v(2)
with
Left) We plot the set of all unit vectors u 2 R2
as a unit circle. (Right)
of all points Au. By observing the way that A distorts the unit circle, we
cales space in direction v(i)
by i.
form a matrix V with one eigenvector per column: V = [v(1), . . . ,
, we can concatenate the eigenvalues to form a vector = [ 1, . . . ,
ndecomposition of A is then given by
A = V diag( )V 1
. (2.40)
en that constructing matrices with specific eigenvalues and eigenvec-
to stretch space in desired directions. However, we often want to
rices into their eigenvalues and eigenvectors. Doing so can help us
ain properties of the matrix, much as decomposing an integer into
rs can help us understand the behavior of that integer.
matrix can be decomposed into eigenvalues and eigenvectors. In some
ALGEBRA
on exists, but may involve complex rather than real numbers.
ook, we usually need to decompose only a specific class of
simple decomposition. Specifically, every real symmetric
posed into an expression using only real-valued eigenvectors
A = Q⇤Q>
, (2.41)
gonal matrix composed of eigenvectors of A, and ⇤ is a
eigenvalue ⇤i,i is associated with the eigenvector in column i
19. (Goodfellow 2016)
Singular Value Decomposition
• Similar to eigendecomposition
• More general; matrix need not be square
LINEAR ALGEBRA
ly applicable. Every real matrix has a singular value decomposition,
is not true of the eigenvalue decomposition. For example, if a matrix
, the eigendecomposition is not defined, and we must use a singular
osition instead.
at the eigendecomposition involves analyzing a matrix A to discover
f eigenvectors and a vector of eigenvalues such that we can rewrite
A = V diag( )V 1
. (2.42)
ular value decomposition is similar, except this time we will write A
of three matrices:
A = UDV >
. (2.43)
hat A is an m ⇥ n matrix. Then U is defined to be an m ⇥ m matrix,
m ⇥ n matrix, and V to be an n ⇥ n matrix.
20. (Goodfellow 2016)
Moore-Penrose Pseudoinverse
• If the equation has:
• Exactly one solution: this is the same as the inverse.
• No solution: this gives us the solution with the
smallest error
• Many solutions: this gives us the solution with the
smallest norm of x.
When A has more columns than r
pseudoinverse provides one of the ma
he solution x = A+y with minima
olutions.
When A has more rows than colu
n this case, using the pseudoinverse
possible to y in terms of Euclidean n
rithms for computing the pseudoinverse are not based on this defini-
er the formula
A+
= V D+
U>
, (2.47)
nd V are the singular value decomposition of A, and the pseudoinverse
onal matrix D is obtained by taking the reciprocal of its non-zero
taking the transpose of the resulting matrix.
as more columns than rows, then solving a linear equation using the
provides one of the many possible solutions. Specifically, it provides
x = A+y with minimal Euclidean norm ||x||2 among all possible
as more rows than columns, it is possible for there to be no solution.
using the pseudoinverse gives us the x for which Ax is as close as
in terms of Euclidean norm ||Ax y||2.
e Trace Operator
rator gives the sum of all of the diagonal entries of a matrix:
21. (Goodfellow 2016)
Computing the Pseudoinverse
eudoinverse allows us to make some headway in these
of A is defined as a matrix
A+
= lim
↵&0
(A>
A + ↵I) 1
A>
. (2.46)
mputing the pseudoinverse are not based on this defini-
la
A+
= V D+
U>
, (2.47)
singular value decomposition of A, and the pseudoinverse
D is obtained by taking the reciprocal of its non-zero
ranspose of the resulting matrix.
mns than rows, then solving a linear equation using the
of the many possible solutions. Specifically, it provides
Take reciprocal of non-zero entries
The SVD allows the computation of the pseudoinverse:
22. (Goodfellow 2016)
Trace
perator
sum of all of the diagonal entries of a matrix:
Tr(A) =
X
i
Ai,i. (2.48)
ful for a variety of reasons. Some operations that are
sorting to summation notation can be specified using
46
pression using many useful identities. For example, the trace
nt to the transpose operator:
Tr(A) = Tr(A>
). (2.50)
square matrix composed of many factors is also invariant to
ctor into the first position, if the shapes of the corresponding
resulting product to be defined:
Tr(ABC) = Tr(CAB) = Tr(BCA) (2.51)
Tr(
nY
i=1
F (i)
) = Tr(F (n)
n 1Y
i=1
F (i)
). (2.52)
cyclic permutation holds even if the resulting product has a
r example, for A 2 Rm⇥n and B 2 Rn⇥m, we have
23. (Goodfellow 2016)
Learning linear algebra
• Do a lot of practice problems
• Start out with lots of summation signs and indexing
into individual entries
• Eventually you will be able to mostly use matrix
and vector product notation quickly and easily