Linear algebra concepts such as vectors, matrices, and tensors are useful in machine learning because inputs and outputs are represented as vectors. A vector is an array of numbers that can represent data such as images or sound. A matrix is a 2D array used in operations such as matrix-vector and matrix-matrix multiplication. Tensors generalize matrices to higher dimensions. Other key concepts include linear dependence, eigendecomposition, singular value decomposition, and related linear algebra operations. Probability and information theory concepts are also important, including random variables, probability distributions, expectation, variance, independence, Bayes' theorem, entropy, and structured probabilistic models.
2. Why linear algebra is useful
In many machine learning and deep learning algorithms, both the input and the output are represented as vectors
A vector simply means a collection of numbers
We convert an input (such as a picture or a sound clip) into numbers
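As a minimal sketch (the pixel values below are made up for illustration), NumPy can flatten a tiny 2x2 grayscale "image" into a vector of numbers:

```python
import numpy as np

# Hypothetical 2x2 grayscale image: each entry is a pixel intensity.
image = np.array([[0.0, 0.5],
                  [0.5, 1.0]])

x = image.flatten()   # vector representation of the input
print(x)              # [0.  0.5 0.5 1. ]
print(x.shape)        # (4,)
```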
3. Cont..
Scalars: a single number or quantity whose important property is its magnitude
Ex: speed of a car (speed = 45 km/hr)
We write scalars in italics and usually denote them with lowercase variable names
Ex: let n ∈ N be the number of units
4. Cont..
Vectors: a vector is an array of numbers
Ex: let x be an input vector
x1, x2, …, xn concatenated together are called a vector
10. Cont..
Span and linear dependence
Span: the span of a set of vectors is the set of all vectors obtainable as a linear combination of the original vectors, i.e. every possible linear combination
Ex: the span of the coordinate vectors v1 = (1, 0) and v2 = (0, 1) is the entire 2D plane
14. Cont...
Linear dependence: a set of vectors is linearly dependent if at least one of the vectors can be written as a linear combination of the others
Ex: v1 = (1, 0), v2 = (0, 1), v3 = (2, 1)
v3 can be represented as the linear combination v3 = 2v1 + v2
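A quick way to check this numerically is to stack the vectors and compute the matrix rank with NumPy; a rank lower than the number of vectors signals dependence. A minimal sketch:

```python
import numpy as np

v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
v3 = np.array([2.0, 1.0])    # v3 = 2*v1 + v2

# Stack the vectors as rows; if the rank is less than the number
# of vectors, the set is linearly dependent.
M = np.vstack([v1, v2, v3])
print(np.linalg.matrix_rank(M))      # 2 < 3, so the set is dependent
print(np.allclose(v3, 2*v1 + v2))    # True
```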
15. Cont..
Norms: norms are a way of measuring the length of vectors, matrices, etc.
They estimate how big a vector or tensor is
They estimate how close one vector or tensor is to another
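A small NumPy sketch of the common norms (L1, L2, and max-norm) and of using a norm as a distance between two vectors:

```python
import numpy as np

v = np.array([3.0, -4.0])
w = np.array([0.0, 0.0])

print(np.linalg.norm(v, 1))       # L1 norm: |3| + |-4| = 7
print(np.linalg.norm(v, 2))       # L2 (Euclidean) norm: 5
print(np.linalg.norm(v, np.inf))  # max-norm: 4
print(np.linalg.norm(v - w))      # distance between v and w: 5
```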
18. Cont...
Eigendecomposition:
One of the most widely used matrix decompositions is the eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues
A vector that undergoes pure scaling, without any rotation, when the matrix is applied is known as an eigenvector: Av = λv
The scaling factor (stretch ratio) λ is known as the eigenvalue
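A minimal NumPy sketch (the matrix entries are chosen arbitrarily) that computes an eigendecomposition and verifies Av = λv:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Check A v = lambda v for the first eigenpair: the eigenvector is
# only scaled by its eigenvalue, not rotated.
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))    # True
```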
21. Cont..
Singular value decomposition
The SVD can be derived from the eigendecomposition of AᵀA and AAᵀ: A = UΣVᵀ
The matrix A can be any m×n matrix; it does not have to be square
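A short NumPy sketch of the SVD of a non-square matrix and its reconstruction:

```python
import numpy as np

# A non-square (m x n) matrix: SVD works where eigendecomposition does not.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2 x 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)        # (2, 2) (2,) (2, 3)

# Reconstruct A = U diag(s) V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
```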
23. Cont..
The Moore-Penrose pseudoinverse:
It is used when a matrix may not be invertible
If A is invertible, then the Moore-Penrose pseudoinverse is equal to the matrix inverse
The pseudoinverse is also referred to as the generalized inverse
24. Cont..
Let A be a matrix of order m × n; then the pseudoinverse of A can be computed from the SVD A = UΣVᵀ as A⁺ = VΣ⁺Uᵀ
If the columns of the matrix A are linearly independent, then the pseudoinverse of A is
A⁺ = (AᵀA)⁻¹Aᵀ
If the rows of the matrix are linearly independent, then the pseudoinverse of A is
A⁺ = Aᵀ(AAᵀ)⁻¹
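A minimal NumPy sketch (with an arbitrarily chosen tall matrix) comparing the independent-columns formula with NumPy's SVD-based np.linalg.pinv:

```python
import numpy as np

# A tall matrix with linearly independent columns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])             # 3 x 2

# Formula for independent columns: A+ = (A^T A)^(-1) A^T
A_plus_formula = np.linalg.inv(A.T @ A) @ A.T
A_plus_numpy = np.linalg.pinv(A)       # SVD-based pseudoinverse

print(np.allclose(A_plus_formula, A_plus_numpy))  # True
print(np.allclose(A_plus_numpy @ A, np.eye(2)))   # A+ A = I in this case
```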
28. Cont..
Principal component analysis:
It is a dimensionality reduction method, often used on large datasets, that transforms a large set of variables into a smaller one that still contains most of the information in the large set
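A minimal sketch of PCA via the SVD of centered data; the dataset below is synthetic and made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 100 samples, 3 features, with feature 3
# nearly a copy of feature 1 (so the data is close to 2-dimensional).
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                           # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T       # 100 x 2 representation
explained = s**2 / np.sum(s**2)
print(X_reduced.shape)          # (100, 2)
print(explained)                # fraction of variance per component
```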
32. Probability and Information Theory
Random variables:
A variable whose value is determined by a random experiment
It is defined as a mapping from the sample space to the real numbers:
X : S → R
where X is the random variable, S is the sample space, and R is the real numbers
The sample space is the set of possible outcomes of a random experiment
33. Cont..
Types of random variables
Discrete random variables: take only a countable number of distinct values
Ex: 0, 1, 2, 3, 4, …
Continuous random variables: take an infinite, uncountable set of values
Ex: interest rates of loans in a country
34. Probability and Information Theory
Probability Distributions
It gives the probability of each possible outcome of a random experiment or event
35. Probability and Information Theory
Probability mass function:
If X is a discrete random variable with distinct values x1, x2, …, xn, then the function p(x) defined as
p(xi) = P(X = xi)
is called the probability mass function
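A small sketch of the PMF of a fair die, together with an empirical estimate from simulated rolls:

```python
import numpy as np

# PMF of a fair six-sided die: p(x) = 1/6 for x = 1..6.
values = np.arange(1, 7)
pmf = np.full(6, 1/6)
print(pmf.sum())                       # 1.0, probabilities sum to one

# Empirical PMF estimated from simulated rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)
counts = np.bincount(rolls, minlength=7)[1:]
print(counts / counts.sum())           # close to 1/6 each
```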
37. Probability and Information Theory
Marginal probability:
The probability of an event irrespective of the outcome of another variable
It is simply the distribution of each individual variable on its own
38. Probability and Information Theory
For example, suppose 100 respondents name their favorite sport; we would say that the marginal distribution of sports is:
Baseball: 36
Basketball: 31
Football: 33
39. Probability and Information Theory
We could also write the marginal distribution of
sports in percentage terms (i.e. out of the total
of 100 respondents):
Baseball: 36 / 100 = 36%
Basketball: 31 / 100 = 31%
Football: 33 / 100 = 33%
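A minimal sketch of computing this marginal from a joint table; the split of the 100 respondents by a second variable (gender) is hypothetical, only the column totals come from the example above:

```python
import numpy as np

# Hypothetical joint counts: rows = gender (male, female),
# columns = favorite sport (baseball, basketball, football).
joint = np.array([[13, 15, 20],
                  [23, 16, 13]])        # totals to 100 respondents

# Marginal distribution of sports: sum over the other variable (rows).
marginal_sports = joint.sum(axis=0)
print(marginal_sports)                  # [36 31 33]
print(marginal_sports / joint.sum())    # [0.36 0.31 0.33]
```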
40. Probability and Information Theory
Conditional probability:
The probability of occurrence of an event A given that another event B related to A has already occurred:
P(A|B) = P(A ∩ B) / P(B)
41. Probability and Information Theory
The chain rule of conditional probability
For three events A, B, C the chain rule is
P(A, B, C) = P(A | B, C) P(B | C) P(C)
The chain rule can be generalized to n events A1, A2, …, An:
P(A1, A2, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) … P(An | A1, …, An−1)
42. Probability and Information Theory
Independence and conditional independence
Two random variables X and Y are said to be statistically independent if and only if
P(X, Y) = P(X) P(Y)
Ex (independent): X: throw of a die, Y: toss of a coin
Ex (not independent): X: height, Y: weight; in general, as height increases, weight increases
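A quick simulation sketch checking that P(X, Y) ≈ P(X) P(Y) for the die-and-coin example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
die = rng.integers(1, 7, size=n)     # X: throw of a die
coin = rng.integers(0, 2, size=n)    # Y: toss of a coin

# For independent X and Y, P(X=x, Y=y) should equal P(X=x) * P(Y=y).
p_x = np.mean(die == 6)
p_y = np.mean(coin == 1)
p_xy = np.mean((die == 6) & (coin == 1))
print(p_xy, p_x * p_y)               # nearly equal (~ 1/12)
```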
43. Probability and Information Theory
Independence is equivalent to saying
P(y|x) = P(y) or P(x|y) = P(x)
i.e. whether or not x happens has no effect on whether y happens
44. Probability and Information Theory
Conditional independence
Two random variables X and Y are said to be independent given Z if and only if
P(x, y | z) = P(x | z) P(y | z)
Ex (independent and conditionally independent):
X: throw of a die
Y: toss of a coin
Z: card drawn from a deck
45. Probability and Information Theory
X: height, Y: vocabulary, Z: age
Height and vocabulary are not independent on their own: if I am told a person is just 2 feet tall, it most probably means the person is a child and therefore has a small vocabulary
Without some conditioning information, X and Y are not independent; they become independent only given a particular condition
46. Probability and Information Theory
Suppose we fix age at 30: people aged 30, regardless of height, will have similar vocabularies; vocabulary no longer depends on height
This is a case where two variables are not independent but are conditionally independent (given age)
47. Probability and Information Theory
Expectation:
It gives the mean/average/expected value of a random variable under a given distribution
The distribution covers not only the values actually observed but all possible values the variable can take
Ex: expected returns on a certain investment in the market
Expected rainfall during the coming monsoon
48. Probability and Information Theory
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average value of f(x) when x is drawn from P
Denoted by E_{x∼P}[f(x)]
E_{x∼P}[f(x)] = Σ_x P(x) f(x)
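A minimal sketch of this sum for a fair die with f(x) = x²:

```python
import numpy as np

# Expectation of f(x) = x^2 for a fair die: E[f] = sum_x P(x) f(x).
x = np.arange(1, 7)
p = np.full(6, 1/6)
f = x**2

expectation = np.sum(p * f)
print(expectation)          # 91/6 ≈ 15.17
```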
49. Probability and Information Theory
Multivariate expectation (multiple variables):
For a multivariate random variable x (a vector), we take the expectation of each component separately: E[x] = (E[x1], E[x2], …, E[xn])
50. Probability and Information Theory
Linearity of expectation:
If f is a linear combination of two other functions g and h, with scalars α, β,
f(x) = α g(x) + β h(x)
then the expectation of f is
E[f] = α E[g] + β E[h]
51. Probability and Information Theory
Variance:
It gives the variation from the expected value:
Var(x) = E[(x − E[x])²]
Variance also measures the amount of fluctuation of the variable
Ex: variance in returns on a certain investment in the market (a risk measure)
53. Probability and Information Theory
Covariance:
This is for a pair of variables x and y
It measures the joint variation of the two random variables from their expected values:
Cov(x, y) = E[(x − E[x])(y − E[y])]
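A small NumPy sketch of variance and covariance on synthetic height/weight data (the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=1000)            # hypothetical heights (cm)
weight = 0.9 * height + rng.normal(0, 5, 1000)     # weight correlated with height

# Variance: expected squared deviation from the mean.
print(np.var(height))                # close to 10^2 = 100

# Covariance: positive when the variables move together.
print(np.cov(height, weight)[0, 1])  # positive, roughly 0.9 * Var(height)
```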
54. Probability and Information Theory
Useful properties of common functions
Logistic sigmoid function
The sigmoid function is a mathematical function with a characteristic 'S'-shaped curve:
σ(x) = 1 / (1 + e⁻ˣ)
Whatever input we give the sigmoid function, it produces an output in the open interval (0, 1)
55. Probability and Information Theory
Softplus function:
The output of the sigmoid function has upper and lower limits, whereas the softplus function
ζ(x) = log(1 + eˣ)
produces output in (0, +∞)
It is a smooth approximation of max(0, x) and can be used to constrain the output of a model to always be positive
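A minimal sketch of both functions; the sample inputs are arbitrary:

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve; output always in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # smooth approximation of max(0, x); output in (0, +inf)
    return np.log1p(np.exp(x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))    # [0.0067 0.5    0.9933]
print(softplus(x))   # [0.0067 0.6931 5.0067]
```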
57. Probability and Information Theory
Bayes' theorem: it helps in determining the probability of an event based on another event that has already occurred
If A and B are two events, then Bayes' theorem is given by
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the probability of event A occurring given that event B has already occurred
58. Probability and Information Theory
From the definition of conditional probability, Bayes' theorem can be derived for events as
P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)
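A minimal numeric sketch of Bayes' theorem on a hypothetical diagnostic test (all probabilities are made up for illustration):

```python
# P(D)    = 0.01   prior probability of the disease
# P(T|D)  = 0.95   test sensitivity
# P(T|~D) = 0.05   false-positive rate
p_d, p_t_d, p_t_nd = 0.01, 0.95, 0.05

# P(T) by total probability, then P(D|T) = P(T|D) P(D) / P(T)
p_t = p_t_d * p_d + p_t_nd * (1 - p_d)
p_d_t = p_t_d * p_d / p_t
print(p_d_t)    # ≈ 0.161, a positive test still leaves D fairly unlikely
```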
59. Probability and Information Theory
Information theory is a mathematical approach to the study of the coding of information, along with its quantification, storage, and communication
It provides a quantitative measure of the information contained in a message signal
Digital information is always associated with uncertainty
If the probability of occurrence of an event is very high, the information contained in that event is low
60. Probability and Information Theory
Ex: "Tomorrow, the sun will rise in the east"
If the probability of occurrence of an event is low, the information contained in the event is high
Ex: "A solar eclipse will occur today"
Consider a discrete random variable X with possible outcomes xi, i = 1, 2, …, n. The information of the event X = xi is defined as
I(xi) = −log₂ P(X = xi)
61. Probability and Information Theory
Consider a source that tosses a fair coin and produces an output equal to 1 if a head appears and 0 if a tail appears
P(1) = P(0) = 0.5
The information content of each output from the source is
I = −log₂(0.5) = 1 bit
62. Probability and Information Theory
Entropy: it tells how much information is present in an event
We can measure the amount of uncertainty in an entire probability distribution using the Shannon entropy
H(x) = −Σ_x P(x) log P(x)
H(x) is the total amount of information in an entire probability distribution
63. Probability and Information Theory
Kullback-Leibler divergence:
It is a measure of how one probability distribution P differs from a second distribution Q:
D_KL(P ‖ Q) = Σ_x P(x) log (P(x) / Q(x))
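A small sketch computing Shannon entropy and the KL divergence for two coin distributions:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits: H = -sum p log2 p
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum p log2(p / q); zero iff P == Q
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log2(p / q))

fair = np.array([0.5, 0.5])          # fair coin
biased = np.array([0.9, 0.1])        # biased coin

print(entropy(fair))                  # 1.0 bit (maximum uncertainty)
print(entropy(biased))                # ≈ 0.469 bits
print(kl_divergence(fair, biased))    # > 0, and != D_KL(biased || fair)
```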
64. Probability and Information Theory
Structured probabilistic models: a representation of the factorization of a probability distribution using a graph
A way of describing probability distributions using a graph that shows which variables interact with each other directly
Directed: graphs with directed edges
Undirected: graphs with undirected edges
66. Probability and Information Theory
Undirected graphs:
Any set of nodes that are all connected to each other is called a clique
Each clique C in an undirected model is associated with a factor φ(C), a non-negative function of the variables in that clique; the unnormalized probability of a configuration is the product of these factors
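A minimal sketch (with made-up factor tables) of an undirected model over three binary variables with cliques {a, b} and {b, c}:

```python
import numpy as np

# Each factor is a non-negative table over its clique; the values
# below are hypothetical and chosen to favour agreeing neighbours.
phi_ab = np.array([[10.0, 1.0],
                   [1.0, 10.0]])     # favours a == b
phi_bc = np.array([[10.0, 1.0],
                   [1.0, 10.0]])     # favours b == c

# Unnormalized probability: product of the clique factors.
p_tilde = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            p_tilde[a, b, c] = phi_ab[a, b] * phi_bc[b, c]

Z = p_tilde.sum()                    # partition function (normalizer)
p = p_tilde / Z
print(p[0, 0, 0], p[1, 0, 1])        # agreeing states are far more likely
```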