MOST IMPORTANT LINKS
Simple explanation of convolutional neural network | Deep Learning Tutorial 23 (Tensorflow & Python) - YouTube
Implementing a Neural Network from Scratch in Python · Denny's Blog (dennybritz.com)
pip install tensorflow -- to install TensorFlow on your system
Running a Jupyter server
Accessing a Jupyter notebook
Most popular DL frameworks!
PyTorch
• By Facebook
TensorFlow
• By Google
• Keras is not a full-fledged framework but rather a convenient wrapper around TensorFlow, CNTK (by Microsoft) and Theano. It just makes programming easier.
• Since TensorFlow 2.0, Keras is part of the TensorFlow library suite.
The two most popular deep learning frameworks are (a) PyTorch and (b) TensorFlow.
The slope of a vertical line is always undefined.
The slope of a horizontal line is always 0.
The beauty of a CNN is that there is no need to provide explicit filters; it detects appropriate filters automatically on its own!
Classification through a deep network (all nodes connected to all others).
A CNN is not necessarily dense, meaning all nodes are not necessarily connected to all other nodes.
We are providing thousands of photos of koalas here, so the CNN will use backpropagation to automatically generate appropriate filters. It is part of learning.
The only parameters we specify are:
- How many filters you want
- What the size of the filters will be
- No need to provide values for the filters!
CNN architecture components
• Convolution
• Padding
• Stride
• Pooling
• SoftMax
• Fully Connected NN
The forward pass of a kernel
• During the forward pass, the kernel slides across the height and width of the image, producing an image representation of each receptive region.
• This produces a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image.
• The sliding step size of the kernel is called the stride.
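The following minimal sketch (mine, not from the slides) shows a single-channel kernel sliding over an image with a given stride, producing an activation map; no padding is assumed.

import numpy as np

def conv2d(image, kernel, stride=1):
    H, W = image.shape
    f, _ = kernel.shape
    out = (H - f) // stride + 1
    amap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # receptive field at this spatial position
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            amap[i, j] = np.sum(patch * kernel)  # kernel response
    return amap

img = np.random.rand(10, 10)
print(conv2d(img, np.ones((3, 3))).shape)  # (8, 8): ((10-3)/1)+1 = 8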
Sample CNN
What is a local receptive field?
• A subset of a feature map or input image
Parameters?
• The # of parameters in a given layer is the count of its "learnable" elements.
• Parameters in general are weights that are learnt during training.
• They are weight matrices that contribute to the model's predictive power and are updated during the back-propagation process.
# of parameters in an Input Layer
• The input layer has nothing to learn; at its core, it just provides the input image's shape.
• So there are no learnable parameters here. Thus the number of parameters = 0.
# of parameters in a Convolutional Layer
• Consider a convolutional layer which takes
• "l" feature maps as the input
• "k" feature maps as output.
• The filter size is "n*m".
• Example: here the input has l=3 feature maps as inputs, k=96 feature maps as outputs, and the filter size is n=11 and m=11.
• It is important to understand that we don't simply have an 11*11 filter; we actually have an 11*11*3 filter, as our input has 3 channels.
• As the output of the first conv layer, we learn 96 different filters, whose total weight count is "n*m*l*k". Then there is a bias term for each output feature map. So the total number of parameters is "(n*m*l+1)*k".
# params = ((11 * 11 * 3) + 1) * 96 = 34,944
Formula for Output Shape
• N = input size
• F = size of filter
• P = # of padding
• S = # of strides
Output_shape = ((N - F + 2P) / S) + 1
Example: Output_shape = ((10 - 3 + 2(0)) / 1) + 1 = 7 + 1 = 8
Sample Calculation: Output_shape & # Params
# params in 1 filter = 3 x 3 + 1 = 10 (including 1 bias per filter)
# params in 5 filters = 10 * 5 = 50
Output_shape = ((10 - 3 + 2(0)) / 1) + 1 = 7 + 1 = 8, giving (8 x 8 x 5)
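As a quick sanity check, here is a small helper of my own implementing the two formulas above; the outputs match the slide's examples.

def output_size(N, F, P=0, S=1):
    # Output_shape = ((N - F + 2P) / S) + 1
    return (N - F + 2*P) // S + 1

def conv_params(n, m, l, k):
    # (n*m*l + 1) * k, the +1 being the bias per filter
    return (n*m*l + 1) * k

print(output_size(10, 3))          # 8
print(conv_params(3, 3, 1, 5))     # 50
print(conv_params(11, 11, 3, 96))  # 34,944 for the 11x11x3, 96-filter example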
Main benefits of Pooling layer
• Reduced Size
• Translation invariance
• Feature Enhancement (Max-Pooling)
• No need of training
• No learning parameters
Source:
Pooling Layer in CNN | MaxPooling in Convolutional Neural Network – YouTube.
https://www.youtube.com/watch?v=DwmGefkowCU
Benefits of Pooling: Size Reduction
Pooling = Sub-Sampling
Sub-sampling handles translation invariance
In both figures, A and B, the digit '8' is slightly shifted from the origin, but after the subsampling / max-pooling filter is applied, both resulting images are centered at the origin; however, some details are lost.
Therefore, generally speaking, pooling (min, avg) focuses on higher-level features and ignores minute details, except for max-pooling, where the features are actually enhanced.
Benefits of Pooling: Feature Enhancement
In the case of max-pooling, you take a small area of the input image and keep its most dominant (maximum) value.
You are actually selecting the strongest / brightest weight from the receptive field, which yields the most enhanced feature.
Caution: this holds only for max-pooling; it is not applicable to other forms of pooling.
Benefits of Pooling: No need of training
In a convolution layer, the weights in the filter are found by applying backpropagation. Pooling, however, is just an aggregate operation (min, max, avg), therefore no training is required.
All you need to tell the model is:
- the local receptive field (pool size)
- the value of the stride
- the type of pooling (avg, min, max)
The pooling layer is faster for this reason: there is no backpropagation involved. A minimal sketch follows.
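A minimal max-pooling sketch (my own): a window of the given size slides with the given stride and keeps the dominant value of each receptive field; nothing is learned.

import numpy as np

def max_pool2d(x, size=2, stride=2):
    H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # 4x4 -> 2x2: only the 4 dominant values survive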
Types of pooling in Keras
• Max-pooling
• Avg-pooling
• Global pooling
  • Global max-pooling
  • Global avg-pooling: simply the average of the receptive field
Usually, in the majority of cases, max-pooling is used, but sometimes avg-pooling is also used.
Global Max-pooling
• You convert an entire feature map into a single 1x1 scalar value.
• For global average pooling, you take the average of all values of an input feature map.
• For global max pooling, you take the max of all values of an input feature map.
• Where to use? In the end stage of a CNN, when you are flattening your data, you can use global max pooling as a replacement for flattening, to reduce over-fitting.
For global max pooling of an input of 4x4x3 feature maps, you get an output of 1 x 3: one value for each feature map.
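An illustrative sketch, assuming a (4,4,3) feature map: global max pooling reduces each of the 3 maps to a single scalar, giving the "1 x 3" output described above.

import numpy as np

fmap = np.random.rand(4, 4, 3)       # H x W x channels
print(fmap.max(axis=(0, 1)).shape)   # global max pooling -> (3,)
print(fmap.mean(axis=(0, 1)).shape)  # global avg pooling -> (3,)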
Disadvantages of Pooling: Location
• Translation invariance makes the location of a feature irrelevant to its detection. This is quite helpful in some classification tasks, where for example you need to identify whether the image contains a cat or not, regardless of its position/location in the input image.
• However, in some computer vision tasks the location of the feature is very important, such as in image segmentation tasks.
• Thus pooling is not used in image-segmentation tasks.
In image segmentation tasks, the location of the car is important: the features must all be in the same location where the car is present.
Disadvantages of Pooling: Information loss
• A lot of information is lost.
• For example, pooling from 4x4 = 16 values down to 2x2 = 4 values discards 75% of the values.
• However, it all depends on the application and on the information vs. computational complexity tradeoff.
LeNet-5 Architecture
(Diagram: each convolution + pooling pair is considered as one layer; the input is a monochrome/greyscale image; a flatten layer feeds the fully connected ANN.)
LeNet-5 / TensorFlow
# Adding libraries
import numpy as np
import tensorflow
from tensorflow import keras
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from keras import Sequential
from keras.datasets import mnist
# Loading dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# MNIST images are 28x28, but LeNet-5 expects 32x32x1:
# pad 2 pixels per side and add a channel axis
X_train = np.pad(X_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis]
X_test = np.pad(X_test, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis]
# Generating model through the Keras library
model_lenet5 = Sequential()
model_lenet5.add(Conv2D(6, kernel_size=(5, 5), padding='valid', activation='tanh',
                        input_shape=(32, 32, 1)))
model_lenet5.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model_lenet5.add(Conv2D(16, kernel_size=(5, 5), padding='valid', activation='tanh'))
model_lenet5.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model_lenet5.add(Flatten())
model_lenet5.add(Dense(120, activation='tanh'))
model_lenet5.add(Dense(84, activation='tanh'))
model_lenet5.add(Dense(10, activation='softmax'))
# Generating model summary
model_lenet5.summary()
How to calculate the # of learnable parameters for a convolution layer
To produce 6 feature maps at the output, 6 filters are required, each of size m x n x l. The filter is 3-dimensional because its 3rd dimension comes from the number of input channels: for an RGB (3-channel) input image, l = 3.
LeNet-5 – Parameter Estimation
Legend: fs = filter size (n x m); #f = number of filters applied; l = # of channels at input; k = # of output feature maps; p = padding; s = stride.
Output size per dimension for input size a: ((a - n + 2p) / s) + 1.
Conv layer parameters: (n x m x l + 1) * k; equivalently fs^2 * input channels * output channels + one bias per output channel, e.g. 5*5*1*6 + 6 = 156.

First layer, Conv2D, fs=(5x5), p=0, s=1, #f=6:
Input (32, 32, 1); output ((32 - 5 + 2(0)) / 1) + 1 = 27 + 1 = 28, i.e. (28 x 28 x 6)
# of learnable parameters = (5 x 5 x 1 + 1) * 6 = 156

First layer, Max-Pool, fs=(2x2), p=0, s=2:
Input (28 x 28 x 6); output ((28 - 2 + 2(0)) / 2) + 1 = 13 + 1 = 14, i.e. (14 x 14 x 6)
# of learnable parameters = 0

Second layer, Conv2D, fs=(5x5), p=0, s=1, #f=16:
Input (14 x 14 x 6); output ((14 - 5 + 2(0)) / 1) + 1 = 9 + 1 = 10, i.e. (10 x 10 x 16)
# of learnable parameters = (5 x 5 x 6 + 1) * 16 = 2,416

Second layer, Max-Pool, fs=(2x2), p=0, s=2:
Input (10 x 10 x 16); output ((10 - 2 + 2(0)) / 2) + 1 = 4 + 1 = 5, i.e. (5 x 5 x 16)
# of learnable parameters = 0

Flatten layer: (5 x 5 x 16) -> (1, 400), a 1D array; parameters = 0
First Dense layer (neurons=120): (1, 400) -> (1, 120); parameters = (input * neurons) + biases = (400 * 120) + 120 = 48,120
Second Dense layer (neurons=84): (1, 120) -> (1, 84); parameters = (120 * 84) + 84 = 10,164
Final output layer (neurons=10): (1, 84) -> (1, 10); parameters = (84 * 10) + 10 = 850

Total learnable parameters: 61,706 (about 241 KB at 4 bytes each)
Parameter Estimation – a second worked example (15x15 greyscale input)

First layer, Conv2D, fs=(9x9), p=0, s=1, #f=3:
Input (15, 15, 1); output ((15 - 9 + 2(0)) / 1) + 1 = 6 + 1 = 7, i.e. (7 x 7 x 3)
# of parameters = (9 x 9 x 1 + 1) * 3 = 82 * 3 = 246

First layer, Max-Pool, fs=(2x2), p=0, s=2:
Input (7 x 7 x 3); output floor((7 - 2 + 2(0)) / 2) + 1 = 2 + 1 = 3 (floored, since 7 - 2 is odd), i.e. (3 x 3 x 3)
# of parameters = 0

Flatten layer: (3 x 3 x 3) -> (1, 27); parameters = 0
First Dense layer (nodes=27): (1, 27) -> (1, 27); parameters = (input * nodes) + biases = (27 * 27) + 27 = 756
Final output layer (nodes=3): (1, 27) -> (1, 3); parameters = (27 * 3) + 3 = 84
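The hand calculation can be cross-checked in Keras; the sketch below (layer choices are mine, mirroring the table above) should report 246, 756 and 84 parameters for the trainable layers.

from keras import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

m = Sequential([
    Conv2D(3, kernel_size=(9, 9), strides=1, padding='valid',
           activation='tanh', input_shape=(15, 15, 1)),  # 246 params
    MaxPooling2D(pool_size=(2, 2), strides=2),           # 0 params, output (3,3,3)
    Flatten(),                                           # (27,)
    Dense(27, activation='tanh'),                        # 756 params
    Dense(3, activation='softmax'),                      # 84 params
])
m.summary()  # total = 246 + 756 + 84 = 1,086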
Object Detector types
Single-shot object detection
• Single-shot object detection uses a single pass
of the input image to make predictions about the
presence and location of objects in the image.
• It processes the entire image in a single pass, making it computationally efficient.
• It is generally less accurate than other methods, and less effective at detecting small objects.
• Such algorithms can be used to detect objects in
real time in resource-constrained environments.
• YOLO is a single-shot detector that uses a fully
convolutional neural network (CNN) to process an
image.
Two-shot object detection
• Uses two passes of the input image to make
predictions about the presence and location of objects.
• The first pass is used to generate a set of proposals or
potential object locations, and the second pass is
used to refine these proposals and make final
predictions.
• This approach is more accurate than single-shot object
detection but is also more computationally expensive.
• Generally, single-shot object detection is better suited
for real-time applications, while two-shot object
detection is better for applications where accuracy is
more important.
Object Detection Method Types
Metrics: Intersection over Union (IoU)
• Intersection over Union is a popular metric to
measure localization accuracy and calculate
localization errors in object detection models.
• To calculate the IoU between the predicted and the ground-truth bounding boxes for the same object, we take the area of overlap between the two boxes, called the "Intersection", and the total area covered by the two boxes, known as the "Union".
• The Intersection divided by the Union gives us the ratio of the overlap to the total area, providing a good estimate of how close the predicted bounding box is to the original bounding box.
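A minimal IoU sketch (my own; boxes given as (x1, y1, x2, y2) corner coordinates):

def iou(boxA, boxB):
    # intersection rectangle
    x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    union = areaA + areaB - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143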
Metrics: Average Precision (AP)
• Average Precision (AP) is calculated as the area under a precision vs. recall curve
for a set of predictions.
• Recall is the ratio of the model's true positive predictions for a class to the total number of existing (ground-truth) labels for that class.
• Precision is the ratio of true positives to the total predictions made by the model.
• Recall and precision offer a trade-off that is graphically represented as a curve by varying the classification threshold. The area under this precision vs. recall curve gives us the Average Precision per class for the model. The average of this value, taken over all classes, is called mean Average Precision (mAP).
• In object detection, precision and recall aren't computed on class predictions alone; they measure the quality of the predicted bounding boxes. An IoU value > 0.5 is taken as a positive prediction, while an IoU value < 0.5 is a negative prediction.
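A rough sketch of AP as the area under a precision vs. recall curve, using trapezoidal integration; the prediction list and ground-truth count are made-up example values, and real evaluators often use interpolated variants.

import numpy as np

tp = np.array([1, 1, 0, 1, 0, 1])  # predictions sorted by confidence; 1 = IoU > 0.5
n_ground_truth = 5

cum_tp = np.cumsum(tp)
precision = cum_tp / np.arange(1, len(tp) + 1)
recall = cum_tp / n_ground_truth
print(np.trapz(precision, recall))  # area under the P-R curve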
YOLO from Ultralytics
• You Only Look Once (YOLO)
proposes using an end-to-end
neural network that makes
predictions of bounding boxes
and class probabilities all at
once.
• It differs from the approach
taken by previous object
detection algorithms, which
repurposed classifiers to
perform detection.
• YOLO performs all of its predictions
with the help of a single fully
connected layer.
Ultralytics YOLOv8 | State-of-the-Art Vision AI
YOLO Vs. Others
YOLO Algorithm / Architecture
YOLO Algorithm for Object Detection Explained [+Examples] (v7labs.com)
YOLO History: YOLO = You Only Look Once
YOLO has outperformed the previous R-CNN, Fast R-CNN and Faster R-CNN methods for object detection.
It can make its predictions in one forward pass, which is why it is called You Only Look Once (YOLO).
What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python) - YouTube
Object Localization
YOLO: Multi-Object Detection: multi-grid and center-of-body approach
Training YOLO on multi-grid vectors
YOLO prediction
First issue with YOLO: multiple objects overlap, but the centers of the objects are not in one cell
• It may detect multiple bounding boxes for the same object (as shown).
• We don't know which box contains the person and which contains the dog.
• But every bounding box will have its own probability.
Non-Max Suppression
• We try to find the overlap area, which is the Intersection over Union.
We apply the technique of "Non-Max Suppression" (NMS) to get the two distinct boxes, as shown on the right; a minimal sketch follows.
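A minimal non-max suppression sketch (mine), reusing the iou() helper sketched earlier: keep the highest-scoring box, drop boxes that overlap it beyond a threshold, and repeat.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # discard remaining boxes that overlap the chosen box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the surviving boxes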
Second issue with YOLO: multiple objects overlap and the centers of both objects are in one cell
When one cell contains the centers of two objects, we have a representation problem.
Should we generate two separate vectors of depth 7, or should we combine them into one anchor-box vector of 14 values?
In real life, it is rare for the centers of multiple objects to fall in the same small cell. Handling situations where at most two objects have their centers in one cell is sufficient for most cases.
CNN with two anchor boxes: A solution
Neural Network Types and Data
(Figure: matching data types to ANN, CNN, and RNN architectures.)
Shifting from the sigmoid to the ReLU activation function drastically improved the computation of the gradient descent algorithm. This enabled the use of larger networks.
yhat = output; y = ground truth; the loss function finds the difference between them.
Logistic Regression cost function
Gradient Descent
• J(w, b) = cost function
• The plot of w, b and J(w, b) is a surface.
• J(w, b) is a convex function.
• The target is to find the minimum of J(w, b).
• It is not a non-convex function, which would have lots of local minima.
• This convex nature is the reason why we use this cost function.
• For the logistic cost function, due to its convexity, it is not necessary to initialize w, b at 0; you can start from any point on the surface.
Initialization of w and b
• To find a good value for the parameters, we initialize w and b to some initial value, denoted by the little red dot.
• For logistic regression, almost any initialization method works.
• Usually you initialize the values to 0.
• Random initialization also works, but people don't usually do that for logistic regression.
• Because this function is convex, no matter where you initialize, you should get to the same point or roughly the same point.
• What gradient descent does is start at that initial point and take a step in the steepest downhill direction.
• So after one step of gradient descent, you might end up there, because it's trying to take a step downhill in the direction of steepest descent, moving downhill as quickly as possible.
• That's one iteration of gradient descent. After two iterations you might step there, three iterations and so on.
We need to converge to this point here, which is the absolute minimum or global optimum.
Gradient Descent: assume one parameter 'w'
Suppose the initial w is at point 1 (positive slope):
w := w - α(+slope); because the slope is positive, the updated w will be lower, moving down the curve.
Suppose the initial w is at point 2 (negative slope):
w := w - α(-slope); because the slope is negative, the updated w will be higher, again moving down the curve.
α (alpha) = learning rate.
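A tiny sketch of the update rule on one parameter (my own example, using J(w) = (w - 3)^2):

def grad_descent_1d(w, lr=0.1, steps=100):
    for _ in range(steps):
        slope = 2 * (w - 3)   # dJ/dw for J(w) = (w - 3)^2
        w = w - lr * slope    # positive slope lowers w, negative slope raises it
    return w

print(grad_descent_1d(w=10.0))  # converges toward the minimum at w = 3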
Gradient Descent: for both parameters 'w' and 'b'
If J is a function of one variable, we use the simple derivative "d"; if J is a function of 2 or more variables, we use the partial derivative "∂".
Coding convention: in code, dw stands for ∂J/∂w and db for ∂J/∂b.
The derivative of a straight line is constant; the derivative of a (non-linear) curve is not constant.
Computation Graph
Going in reverse order (backpropagation) is the easier way to calculate derivatives: dJ/dv first, then dJ/da, then dJ/du, dJ/db and dJ/dc via the chain rule.
Computing dJ/da
Computing dJ/db and dJ/dc
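A small sketch consistent with the variables mentioned above (u, v, J, with J = 3(a + b*c)); the numeric values are my own:

a, b, c = 5.0, 3.0, 2.0
u = b * c           # u = 6
v = a + u           # v = 11
J = 3 * v           # J = 33

dJ_dv = 3.0
dJ_da = dJ_dv * 1.0         # dv/da = 1
dJ_du = dJ_dv * 1.0         # dv/du = 1
dJ_db = dJ_du * c           # du/db = c
dJ_dc = dJ_du * b           # du/dc = b
print(dJ_da, dJ_db, dJ_dc)  # 3.0 6.0 9.0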
Logistic Regression Derivatives
Derivation
Derivation of dL/dz
(Deep Learning Specialization / Neural Networks and Deep Learning - DeepLearning.AI)
Derivation of dz = a - y
ŷ (y-hat) is denoted as 'a' here; in ML code notation, dz means dL(a,y)/dz.
Derivation of dw1, dw2 and db:
dw1 = (-y/a + (1-y)/(1-a)) * a(1-a) * d/dw1 (w1x1 + w2x2 + b)
    = (a - y) * d/dw1 (w1x1 + w2x2 + b)
    = (a - y) * x1
    = x1 * dz
dw2 = (-y/a + (1-y)/(1-a)) * a(1-a) * d/dw2 (w1x1 + w2x2 + b)
    = (a - y) * x2
    = x2 * dz
db = (-y/a + (1-y)/(1-a)) * a(1-a) * d/db (w1x1 + w2x2 + b)
   = (a - y) * 1
   = dz
dw1, dw2, db and dz are all ML code notations, meaning the derivative of L(a,y) with respect to w1, w2, b and z respectively.
Computing over the entire dataset [m samples]
A major (outer) for loop runs over the m samples; a minor (inner) for loop, also required, runs over the n weights.
We need vectorization to get rid of these for loops and write efficient code. This is necessary when m is very large; a vectorized sketch follows.
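A vectorized sketch of one gradient step (my own, following the slide notation dz = a - y; X stacks the m samples as columns, b is a scalar), replacing both loops at once:

import numpy as np

def gradient_step(w, b, X, y, lr=0.1):
    # X: (n, m), y: (1, m), w: (n, 1)
    m = X.shape[1]
    a = 1 / (1 + np.exp(-(w.T @ X + b)))  # forward pass (sigmoid)
    dz = a - y                            # dL/dz for every sample at once
    dw = (X @ dz.T) / m                   # replaces the inner loop over weights
    db = np.sum(dz) / m                   # replaces the outer loop over samples
    return w - lr * dw, b - lr * db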
What is Vectorization?
GPUs and CPUs can execute parallel instructions. If you use a built-in function such as np.dot, which doesn't require explicitly implementing a for loop, numpy can exploit parallelism and your computation runs faster. Vectorization can significantly speed up your code.
# Program to demonstrate how vectorization improves computational performance
# by comparing a vector dot product (parallel implementation) vs a for-loop
# implementation (sequential execution)
# Ammar Ahmed
import time
import numpy as np

# Getting details about the underlying hardware (running on Google Colab)
import platform
print("Machine :" + str(platform.machine()))
print("Platform version :" + str(platform.version()))
print("Platform :" + str(platform.platform()))

# Generating arrays of elements
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized implementation
tic = time.time()
result = np.dot(a, b)
toc = time.time()
t1 = (toc - tic) * 1000
print("Execution time of vectorized version = " + str(t1) + " ms; Computed value: " + str(result))

# Non-vectorized / loop implementation
result = 0
tic = time.time()
for i in range(1000000):
    result += a[i] * b[i]
toc = time.time()
t2 = (toc - tic) * 1000
print("Execution time of sequential loop version = " + str(t2) + " ms; Computed value: " + str(result))
print("Time difference = " + str(t2 - t1) + " ms")  # fixed: was t2 - 21
Vector implementation using Python: numpy
Logistic regression derivatives
Python broadcasting
cal is already in the right shape for the division through broadcasting. However, .reshape makes sure you are doing it right; it can be omitted here.
Refer to the Python/NumPy documentation for the more general principles of broadcasting.
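An illustrative broadcasting sketch, in the spirit of the calories example above (the numbers are illustrative): a (3,4) matrix divided by a (1,4) row vector.

import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis=0).reshape(1, 4)  # column totals; .reshape is optional here
print(100 * A / cal)               # (3,4) / (1,4): cal is broadcast down the rows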
KEY LEARNING!
• a * b = element-wise multiplication
• np.dot(a, b) = matrix multiplication
Vectorization
Tip: "Don't use rank-1 arrays." Rank-1 arrays are neither row vectors nor column vectors, therefore matrix and vector operations are not consistent with them. Always initialize with the proper structure and size.
Use a = a.reshape((5, 1)) to convert a rank-1 array into a column vector.
a.shape == (5, 1): column vector
a.shape == (1, 5): row vector
assert(a.shape == (5, 1)) to simplify your code. ALWAYS USE COLUMN OR ROW VECTORS.
Python code: wrong implementation
(Screenshot notes: a rank-1 array is neither a row vector nor a column vector; the transpose is not applied properly to the vector, which is wrong; and the dot product is not computed as required.)
Python code: correct implementation
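A short sketch contrasting a rank-1 array with a proper column vector:

import numpy as np

a = np.random.randn(5)             # rank-1 array, shape (5,)
print(a.shape, (a.T == a).all())   # (5,) True: transposing does nothing

a = np.random.randn(5, 1)          # column vector, shape (5, 1)
print((a @ a.T).shape)             # (5, 5): a proper outer product
assert a.shape == (5, 1)           # cheap guard against shape bugs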
Why do we use numpy and not Python's math library?
Actually, we rarely use the "math" library in deep learning because the inputs of its functions are single real numbers, whereas in deep learning we mostly use matrices and vectors. This is why numpy is more useful.
Numpy version of sigmoid:
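A numpy sigmoid sketch: it works elementwise on scalars, vectors and matrices, which math.exp cannot do.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([1, 2, 3])))  # applied to every element at once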
Vectorization of an RGB image | Reshaping Arrays
image2vector: flatten / vectorize an image matrix using .reshape.
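A minimal image2vector sketch: reshape an (H, W, C) image into a single column vector.

import numpy as np

def image2vector(image):
    h, w, c = image.shape
    return image.reshape(h * w * c, 1)

print(image2vector(np.random.rand(4, 4, 3)).shape)  # (48, 1)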
Normalizing Rows
Normalization - code
Note: in normalize_rows(), you can try to print the shapes of x_norm and x, and then rerun the assessment. You'll find that they have different shapes. This is normal, given that x_norm takes the norm of each row of x: x_norm has the same number of rows but only 1 column. So how did dividing x by x_norm work? This is called broadcasting, and we'll talk about it now!
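A normalize_rows sketch matching the note above: each row of x is divided by its L2 norm, with x_norm of shape (rows, 1) broadcast across the columns.

import numpy as np

def normalize_rows(x):
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)  # shape (rows, 1)
    return x / x_norm                                  # broadcasting

x = np.array([[0.0, 3.0, 4.0],
              [1.0, 6.0, 4.0]])
print(normalize_rows(x))  # first row becomes [0, 0.6, 0.8]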
Softmax – A normalizing function
Softmax is a normalizing function used
when the algorithm needs to classify two
or more classes.
Softmax – Python Code
If you print the shapes of x_exp, x_sum and s above and rerun the assessment cell, you will see that x_sum has shape (2, 1) while x_exp and s have shape (2, 5). x_exp/x_sum works due to Python broadcasting.
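A row-wise softmax sketch matching the shapes discussed above (the sample values are illustrative):

import numpy as np

def softmax(x):
    x_exp = np.exp(x)                             # (2, 5)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)  # (2, 1)
    return x_exp / x_sum                          # broadcasting gives (2, 5)

x = np.array([[9, 2, 5, 0, 0],
              [7, 5, 0, 0, 0]])
print(softmax(x))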
Key points to remember
• np.exp(x) works for any np.array x and applies the exponential function to
every coordinate
• the sigmoid function and its gradient
• image2vector is commonly used in deep learning
• np.reshape is widely used.
• numpy has efficient built-in functions
• broadcasting is extremely useful
Implement L1 loss function
Implement L2 loss function
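Hedged sketches of both losses (yhat = predictions, y = labels; the sample values are illustrative):

import numpy as np

def L1(yhat, y):
    return np.sum(np.abs(y - yhat))

def L2(yhat, y):
    return np.sum((y - yhat) ** 2)

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1, 0, 0, 1, 1])
print(L1(yhat, y))  # 1.1
print(L2(yhat, y))  # 0.43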
Some interpreter directives
import numpy as np
import copy
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset   # course-provided helper
from public_tests import *          # course-provided tests
%matplotlib inline     # IPython directive: render plots inline in the notebook
%load_ext autoreload   # IPython directive: reload edited modules automatically
%autoreload 2
Trick to learn
ANN as simple cat detector
What is deepcopy?
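A short sketch of copy.deepcopy: it recursively copies nested objects, so mutating the original leaves the copy untouched (a shallow copy would share the inner lists).

import copy

params = {"W1": [[1, 2], [3, 4]], "b1": [0, 0]}
snapshot = copy.deepcopy(params)
params["W1"][0][0] = 99
print(snapshot["W1"][0][0])  # still 1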
Required Functions to implement
Neural Network Representation
Two-layer network, as we don't count the input layer.
The superscript [l] represents the layer number; the subscript n in a_n represents the node number within the layer.
Implementing the above four sets of equations using a loop would be very slow. We need to VECTORIZE them.
Vectorized representation
Converting to stacked matrices / column-vector notation; a sketch follows.
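A vectorized two-layer forward pass sketch (sizes are my own assumptions; X stacks the m samples as columns), replacing the slow per-node loop:

import numpy as np

n_x, n_h, n_y, m = 3, 4, 1, 5  # assumed layer sizes and batch size
X = np.random.randn(n_x, m)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))

Z1 = W1 @ X + b1            # (n_h, m); b1 is broadcast across columns
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2           # (n_y, m)
A2 = 1 / (1 + np.exp(-Z2))  # sigmoid output, shape (1, 5)
print(A2.shape)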
Convolution animations
No padding, no strides | Arbitrary padding, no strides | Half padding, no strides | Full padding, no strides | No padding, strides | Padding, strides | Padding, strides (odd)
GitHub - vdumoulin/conv_arithmetic: A technical report on convolution arithmetic in the context of deep learning
