Linear algebra concepts such as vectors, matrices, and tensors are useful in machine learning because inputs and outputs are represented as vectors. A vector is an array of numbers that can represent data such as images or sound. A matrix is a 2D array used in operations such as matrix-vector and matrix-matrix multiplication. Tensors generalize matrices to higher dimensions. Other key concepts include linear dependence, eigendecomposition, singular value decomposition, and related linear algebra operations. Probability and information theory concepts are also important, including random variables, probability distributions, expectation, variance, independence, Bayes' theorem, entropy, and structured probabilistic models.
2. Why linear algebra is useful
In many machine learning and deep learning algorithms, both the input and the output are represented as vectors
A vector simply means a collection of numbers
We convert an input (such as a picture or a sound clip) into numbers
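As a minimal sketch (the pixel values below are made up for illustration), NumPy can flatten a tiny 2x2 grayscale "image" into a vector of numbers:

```python
import numpy as np

# Hypothetical 2x2 grayscale image: each entry is a pixel intensity.
image = np.array([[0.0, 0.5],
                  [0.5, 1.0]])

x = image.flatten()   # vector representation of the input
print(x)              # [0.  0.5 0.5 1. ]
print(x.shape)        # (4,)
```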
3. Cont..
Scalars: a single number or quantity whose important property is its magnitude
Ex: speed of a car (speed = 45 km/hr)
We write scalars in italics and usually denote them with lowercase variable names
Ex: let n ∈ N be the number of units
4. Cont..
Vectors: a vector is an array of numbers
Ex: let x be an input vector
x1, x2, …, xn concatenated together are called a vector
10. Cont..
Span and linear dependence
Span: the span of a set of vectors is the set of all vectors obtainable as a linear combination of the original vectors, i.e. every possible linear combination
Ex: the span of the coordinate vectors v1 = (1, 0) and v2 = (0, 1) is the entire 2D plane
14. Cont...
Linear dependence: a set of vectors is linearly dependent if at least one of the vectors can be written as a linear combination of the others
Ex: v1 = (1, 0), v2 = (0, 1), v3 = (2, 1)
v3 can be represented as the linear combination v3 = 2v1 + v2
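A quick way to check this numerically is to stack the vectors and compute the matrix rank with NumPy; a rank lower than the number of vectors signals dependence. A minimal sketch:

```python
import numpy as np

v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
v3 = np.array([2.0, 1.0])    # v3 = 2*v1 + v2

# Stack the vectors as rows; if the rank is less than the number
# of vectors, the set is linearly dependent.
M = np.vstack([v1, v2, v3])
print(np.linalg.matrix_rank(M))      # 2 < 3, so the set is dependent
print(np.allclose(v3, 2*v1 + v2))    # True
```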
15. Cont..
Norms: norms are a way of measuring the length of vectors, matrices, etc.
They estimate how big a vector or tensor is
They estimate how close one vector or tensor is to another
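A small NumPy sketch of the common norms (L1, L2, and max-norm) and of using a norm as a distance between two vectors:

```python
import numpy as np

v = np.array([3.0, -4.0])
w = np.array([0.0, 0.0])

print(np.linalg.norm(v, 1))       # L1 norm: |3| + |-4| = 7
print(np.linalg.norm(v, 2))       # L2 (Euclidean) norm: 5
print(np.linalg.norm(v, np.inf))  # max-norm: 4
print(np.linalg.norm(v - w))      # distance between v and w: 5
```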
18. Cont...
Eigendecomposition:
One of the most widely used matrix decompositions is the eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues
A vector that undergoes pure scaling, without any rotation, when the matrix is applied is known as an eigenvector: Av = λv
The scaling factor (stretch ratio) λ is known as the eigenvalue
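A minimal NumPy sketch (the matrix entries are chosen arbitrarily) that computes an eigendecomposition and verifies Av = λv:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Check A v = lambda v for the first eigenpair: the eigenvector is
# only scaled by its eigenvalue, not rotated.
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))    # True
```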
21. Cont..
Singular value decomposition
The SVD can be derived from the eigendecomposition of AᵀA and AAᵀ: A = UΣVᵀ
The matrix A can be any m×n matrix; it does not have to be square
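A short NumPy sketch of the SVD of a non-square matrix and its reconstruction:

```python
import numpy as np

# A non-square (m x n) matrix: SVD works where eigendecomposition does not.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2 x 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)        # (2, 2) (2,) (2, 3)

# Reconstruct A = U diag(s) V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
```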
23. Cont..
The Moore-Penrose pseudoinverse:
It is used when a matrix may not be invertible
If A is invertible, then the Moore-Penrose pseudoinverse is equal to the matrix inverse
The pseudoinverse is also referred to as the generalized inverse
24. Cont..
Let A be a matrix of order m × n; then the pseudoinverse of A can be computed from the SVD A = UΣVᵀ as A⁺ = VΣ⁺Uᵀ
If the columns of the matrix A are linearly independent, then the pseudoinverse of A is
A⁺ = (AᵀA)⁻¹Aᵀ
If the rows of the matrix are linearly independent, then the pseudoinverse of A is
A⁺ = Aᵀ(AAᵀ)⁻¹
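A minimal NumPy sketch (with an arbitrarily chosen tall matrix) comparing the independent-columns formula with NumPy's SVD-based np.linalg.pinv:

```python
import numpy as np

# A tall matrix with linearly independent columns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])             # 3 x 2

# Formula for independent columns: A+ = (A^T A)^(-1) A^T
A_plus_formula = np.linalg.inv(A.T @ A) @ A.T
A_plus_numpy = np.linalg.pinv(A)       # SVD-based pseudoinverse

print(np.allclose(A_plus_formula, A_plus_numpy))  # True
print(np.allclose(A_plus_numpy @ A, np.eye(2)))   # A+ A = I in this case
```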
28. Cont..
Principal component analysis:
It is a dimensionality reduction method, often used on large datasets, that transforms a large set of variables into a smaller one that still contains most of the information in the large set
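A minimal sketch of PCA via the SVD of centered data; the dataset below is synthetic and made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 100 samples, 3 features, with feature 3
# nearly a copy of feature 1 (so the data is close to 2-dimensional).
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                           # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T       # 100 x 2 representation
explained = s**2 / np.sum(s**2)
print(X_reduced.shape)          # (100, 2)
print(explained)                # fraction of variance per component
```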
32. Probability and Information Theory
Random variables:
A variable whose value is determined by a random experiment
It is defined as a mapping from the sample space to the real numbers:
X : S → R
where X is the random variable, S is the sample space, and R is the real numbers
The sample space is the set of possible outcomes of a random experiment
33. Cont..
Types of random variables
Discrete random variables: take only a countable number of distinct values
Ex: 0, 1, 2, 3, 4, …
Continuous random variables: take an infinite, uncountable set of values
Ex: interest rates of loans in a country
34. Probability and Information Theory
Probability Distributions
It gives the probability of each possible outcome of a random experiment or event
35. Probability and Information Theory
Probability mass function:
If X is a discrete random variable with distinct values x1, x2, …, xn, then the function p(x) defined as
p(xi) = P(X = xi)
is called the probability mass function
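A small sketch of the PMF of a fair die, together with an empirical estimate from simulated rolls:

```python
import numpy as np

# PMF of a fair six-sided die: p(x) = 1/6 for x = 1..6.
values = np.arange(1, 7)
pmf = np.full(6, 1/6)
print(pmf.sum())                       # 1.0, probabilities sum to one

# Empirical PMF estimated from simulated rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)
counts = np.bincount(rolls, minlength=7)[1:]
print(counts / counts.sum())           # close to 1/6 each
```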
37. Probability and Information Theory
Marginal probability:
The probability of an event irrespective of the outcome of another variable
It is simply the distribution of each individual variable on its own
38. Probability and Information Theory
For example, suppose 100 respondents name their favorite sport; we would say that the marginal distribution of sports is:
Baseball: 36
Basketball: 31
Football: 33
39. Probability and Information Theory
We could also write the marginal distribution of
sports in percentage terms (i.e. out of the total
of 100 respondents):
Baseball: 36 / 100 = 36%
Basketball: 31 / 100 = 31%
Football: 33 / 100 = 33%
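A minimal sketch of computing this marginal from a joint table; the split of the 100 respondents by a second variable (gender) is hypothetical, only the column totals come from the example above:

```python
import numpy as np

# Hypothetical joint counts: rows = gender (male, female),
# columns = favorite sport (baseball, basketball, football).
joint = np.array([[13, 15, 20],
                  [23, 16, 13]])        # totals to 100 respondents

# Marginal distribution of sports: sum over the other variable (rows).
marginal_sports = joint.sum(axis=0)
print(marginal_sports)                  # [36 31 33]
print(marginal_sports / joint.sum())    # [0.36 0.31 0.33]
```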
40. Probability and Information Theory
Conditional probability:
The probability of occurrence of an event A given that another event B related to A has already occurred:
P(A|B) = P(A ∩ B) / P(B)
41. Probability and Information Theory
The chain rule of conditional probability
For three events A, B, C the chain rule is
P(A, B, C) = P(A | B, C) P(B | C) P(C)
The chain rule can be generalized to n events A1, A2, …, An:
P(A1, A2, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) … P(An | A1, …, An−1)
42. Probability and Information Theory
Independence and conditional independence
Two random variables X and Y are said to be statistically independent if and only if
P(X, Y) = P(X) P(Y)
Ex (independent): X: throw of a die, Y: toss of a coin
Ex (not independent): X: height, Y: weight; in general, as height increases, weight increases
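A quick simulation sketch checking that P(X, Y) ≈ P(X) P(Y) for the die-and-coin example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
die = rng.integers(1, 7, size=n)     # X: throw of a die
coin = rng.integers(0, 2, size=n)    # Y: toss of a coin

# For independent X and Y, P(X=x, Y=y) should equal P(X=x) * P(Y=y).
p_x = np.mean(die == 6)
p_y = np.mean(coin == 1)
p_xy = np.mean((die == 6) & (coin == 1))
print(p_xy, p_x * p_y)               # nearly equal (~ 1/12)
```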
43. Probability and Information Theory
Independence is equivalent to saying
P(y|x) = P(y) or P(x|y) = P(x)
i.e. whether or not x happens has no effect on whether y happens
44. Probability and Information Theory
Conditional independence
Two random variables X and Y are said to be independent given Z if and only if
P(x, y | z) = P(x | z) P(y | z)
Ex (independent and conditionally independent):
X: throw of a die
Y: toss of a coin
Z: card drawn from a deck
45. Probability and Information Theory
X: height, Y: vocabulary, Z: age
Height and vocabulary are not independent on their own: if I am told a person is just 2 feet tall, it most probably means the person is a child and therefore has a small vocabulary
Without some conditioning information, X and Y are not independent; they become independent only given a particular condition
46. Probability and Information Theory
Suppose we fix age at 30: people aged 30, regardless of height, will have similar vocabularies; vocabulary no longer depends on height
This is a case where two variables are not independent but are conditionally independent (given age)
47. Probability and Information Theory
Expectation:
It gives the mean/average/expected value of a random variable under a given distribution
The distribution covers not only the values actually observed but all possible values the variable can take
Ex: expected returns on a certain investment in the market
Expected rainfall during the coming monsoon
48. Probability and Information Theory
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average value of f(x) when x is drawn from P
Denoted by E_{x∼P}[f(x)]
E_{x∼P}[f(x)] = Σ_x P(x) f(x)
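A minimal sketch of this sum for a fair die with f(x) = x²:

```python
import numpy as np

# Expectation of f(x) = x^2 for a fair die: E[f] = sum_x P(x) f(x).
x = np.arange(1, 7)
p = np.full(6, 1/6)
f = x**2

expectation = np.sum(p * f)
print(expectation)          # 91/6 ≈ 15.17
```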
49. Probability and Information Theory
Multivariate expectation (multiple variables):
For a multivariate random variable x (a vector), we take the expectation of each component separately: E[x] = (E[x1], E[x2], …, E[xn])
50. Probability and Information Theory
Linearity of expectation:
If f is a linear combination of two other functions g and h, with scalars α, β,
f(x) = α g(x) + β h(x)
then the expectation of f is
E[f] = α E[g] + β E[h]
51. Probability and Information Theory
Variance:
It gives the variation from the expected value:
Var(x) = E[(x − E[x])²]
Variance also measures the amount of fluctuation of the variable
Ex: variance in returns on a certain investment in the market (a risk measure)
53. Probability and Information Theory
Covariance:
This is for a pair of variables x and y
It measures the joint variation of the two random variables from their expected values:
Cov(x, y) = E[(x − E[x])(y − E[y])]
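A small NumPy sketch of variance and covariance on synthetic height/weight data (the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=1000)            # hypothetical heights (cm)
weight = 0.9 * height + rng.normal(0, 5, 1000)     # weight correlated with height

# Variance: expected squared deviation from the mean.
print(np.var(height))                # close to 10^2 = 100

# Covariance: positive when the variables move together.
print(np.cov(height, weight)[0, 1])  # positive, roughly 0.9 * Var(height)
```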
54. Probability and Information Theory
Useful properties of common functions
Logistic sigmoid function
The sigmoid function is a mathematical function with a characteristic 'S'-shaped curve:
σ(x) = 1 / (1 + e⁻ˣ)
Whatever input we give the sigmoid function, it produces an output in the open interval (0, 1)
55. Probability and Information Theory
Softplus function:
The output of the sigmoid function has upper and lower limits, whereas the softplus function
ζ(x) = log(1 + eˣ)
produces output in (0, +∞)
It is a smooth approximation of max(0, x) and can be used to constrain the output of a model to always be positive
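A minimal sketch of both functions; the sample inputs are arbitrary:

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve; output always in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # smooth approximation of max(0, x); output in (0, +inf)
    return np.log1p(np.exp(x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))    # [0.0067 0.5    0.9933]
print(softplus(x))   # [0.0067 0.6931 5.0067]
```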
57. Probability and Information Theory
Bayes' theorem: it helps in determining the probability of an event based on another event that has already occurred
If A and B are two events, then Bayes' theorem is given by
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the probability of event A occurring given that event B has already occurred
58. Probability and Information Theory
From the definition of conditional probability, Bayes' theorem can be derived for events as
P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)
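A minimal numeric sketch of Bayes' theorem on a hypothetical diagnostic test (all probabilities are made up for illustration):

```python
# P(D)    = 0.01   prior probability of the disease
# P(T|D)  = 0.95   test sensitivity
# P(T|~D) = 0.05   false-positive rate
p_d, p_t_d, p_t_nd = 0.01, 0.95, 0.05

# P(T) by total probability, then P(D|T) = P(T|D) P(D) / P(T)
p_t = p_t_d * p_d + p_t_nd * (1 - p_d)
p_d_t = p_t_d * p_d / p_t
print(p_d_t)    # ≈ 0.161, a positive test still leaves D fairly unlikely
```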
59. Probability and Information Theory
Information theory is a mathematical approach to the study of the coding of information, along with its quantification, storage, and communication
It provides a quantitative measure of the information contained in a message signal
Digital information is always associated with uncertainty
If the probability of occurrence of an event is very high, the information contained in that event is low
60. Probability and Information Theory
Ex: "Tomorrow, the sun will rise in the east"
If the probability of occurrence of an event is low, the information contained in the event is high
Ex: "A solar eclipse will occur today"
Consider a discrete random variable X with possible outcomes xi, i = 1, 2, …, n. The information of the event X = xi is defined as
I(xi) = −log₂ P(X = xi)
61. Probability and Information Theory
Consider a source that tosses a fair coin and produces an output equal to 1 if a head appears and 0 if a tail appears
P(1) = P(0) = 0.5
The information content of each output from the source is
I = −log₂(0.5) = 1 bit
62. Probability and Information Theory
Entropy: it tells how much information is present in an event
We can measure the amount of uncertainty in an entire probability distribution using the Shannon entropy
H(x) = −Σ_x P(x) log P(x)
H(x) is the total amount of information in an entire probability distribution
63. Probability and Information Theory
Kullback-Leibler divergence:
It is a measure of how one probability distribution P differs from a second distribution Q:
D_KL(P ‖ Q) = Σ_x P(x) log (P(x) / Q(x))
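A small sketch computing Shannon entropy and the KL divergence for two coin distributions:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits: H = -sum p log2 p
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum p log2(p / q); zero iff P == Q
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log2(p / q))

fair = np.array([0.5, 0.5])          # fair coin
biased = np.array([0.9, 0.1])        # biased coin

print(entropy(fair))                  # 1.0 bit (maximum uncertainty)
print(entropy(biased))                # ≈ 0.469 bits
print(kl_divergence(fair, biased))    # > 0, and != D_KL(biased || fair)
```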
64. Probability and Information Theory
Structured probabilistic models: a representation of the factorization of a probability distribution using a graph
A way of describing probability distributions using a graph that shows which variables interact with each other directly
Directed: graphs with directed edges
Undirected: graphs with undirected edges
66. Probability and Information Theory
Undirected graphs:
Any set of nodes that are all connected to each other is called a clique
Each clique C in an undirected model is associated with a factor φ(C), a non-negative function of the variables in that clique; the unnormalized probability of a configuration is the product of these factors
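A minimal sketch (with made-up factor tables) of an undirected model over three binary variables with cliques {a, b} and {b, c}:

```python
import numpy as np

# Each factor is a non-negative table over its clique; the values
# below are hypothetical and chosen to favour agreeing neighbours.
phi_ab = np.array([[10.0, 1.0],
                   [1.0, 10.0]])     # favours a == b
phi_bc = np.array([[10.0, 1.0],
                   [1.0, 10.0]])     # favours b == c

# Unnormalized probability: product of the clique factors.
p_tilde = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            p_tilde[a, b, c] = phi_ab[a, b] * phi_bc[b, c]

Z = p_tilde.sum()                    # partition function (normalizer)
p = p_tilde / Z
print(p[0, 0, 0], p[1, 0, 1])        # agreeing states are far more likely
```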