This course provides end-to-end coverage of neural networks, CNN internals, TensorFlow and Keras basics, intuition on object detection and face recognition, and AI on Android x86.
2. Information Technology 2
Day 1
• Module 1 – Computer Vision and Neural Networks- 45 Min
10 mins break
• Module 2 – CNN Internals– 90 Min
• Q & A – 20 min
1 Hr Lunch break
• Module 3 – TensorFlow – 45 Min
• Module 4 – MNIST Dataset with NN and CNN – 30 Min
• 10 mins break
• Module 5 – Keras Framework – 30 min
• Module 6 – Short Talk on RNN- 30 Min
Day 2
• Module 7 – Classic Networks - 75 Min
10 mins break
• Module 8 – Classic Networks Programming (ResNet-50
etc.) – 30 Min
• Module 9 – Short Talk on Object Detection and Face
Recognition – 45 Min
1 Hr Lunch break
• Module 10 – Android AI with OpenVINO – 30 min
Course Topics (9-4, 9-12)
What is COMPUTER VISION (Legacy)
Computer vision is the transformation of data from a still or video camera
into either a decision or a new representation.
Decisions like “laser range finder indicates an object is 1 meter away” or
“there is a person in this scene” or “there are 14 tumor cells on this slide”
etc….
A new representation might mean turning a color image into a grayscale
image or removing camera motion from an image sequence.
Vision
• Humans think vision is easy (because it feels seamless), but the human brain divides the vision signal into many channels
that stream different kinds of information into your brain.
• Your brain has an attention system that identifies, in a task-dependent way, important parts of an image to examine
while suppressing examination of other areas.
• There is massive feedback in the visual stream that is, as yet, little understood.
• There are widespread associative inputs from muscle control sensors and all of the other senses that allow the brain
to draw on cross-associations made from years of living in the world.
• The feedback loops in the brain go back to all stages of processing, including the hardware sensors themselves (the
eyes), which mechanically control lighting via the iris and tune the reception on the surface of the retina.
Many neurons in the visual cortex have a small local receptive field, meaning they react only to visual
stimuli located in a limited region of the visual field (see next slide , in which the local receptive fields of
five neurons are represented by dashed circles). The receptive fields of different neurons may overlap,
and together they tile the whole visual field. Moreover, some neurons react only to images of horizontal
lines, while others react only to lines with different orientations (two neurons may have the same
receptive field but react to different line orientations). They also noticed that some neurons have larger
receptive fields, and they react to more complex patterns that are combinations of the lower-level
patterns. These observations led to the idea that the higher-level neurons are based on the outputs of
neighboring lower-level neurons (in next slide, notice that each neuron is connected only to a few
neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns
in any area of the visual field.
Excerpts based on the work of David H. Hubel and Torsten Wiesel,
winners of the 1981 Nobel Prize in Physiology or Medicine
A neuroscientific motivation behind
convolution
The brain possesses more than 10^11 cells (or neurons), each of
which has well over 10^4 contacts/weights (or synapses)
with other neurons. If each neuron acts as a type of
microprocessor, then we have an immense computer (total
~10^15 weights) in which all the processing elements can
operate concurrently.
Simulating vision intelligence
by deep learning !!
The hierarchy of concepts enables the computer
to learn complicated concepts by building them out
of simpler ones. If we draw a graph showing how
these concepts are built on top of each other, the
graph is deep, with many layers. For this reason,
we call this approach to AI deep learning.
What motivates us to do Computer
VISION with deep learning?
Avalanche/intrusion/landslides ?
License plate recognition – Traffic rule
violation/law-order situation
Medical Imaging
Machine vision
The computer receives a grid of numbers/cells from the camera or from disk.
For the most part, there's no built-in pattern recognition, no automatic control of
focus and aperture, no cross-associations with years of experience. It's a naïve
vision system.
Any given number within the grid shown on the previous slide has a rather large noise
component and so by itself gives us little information, but this grid of numbers is
all the computer "sees".
Our task, then, becomes to turn this noisy grid of numbers into a perception
understandable to humans.
Deep Learning using neural networks
• In deep learning, we feed millions of data instances into a network of neurons (neural networks) ,
teaching them to recognize patterns from raw inputs.
• The deep neural networks take raw inputs (such as pixel values in an image) and transform them into
useful representations, extracting higher-level features (such as shapes and edges in images) that
capture complex concepts by combining smaller and smaller pieces of information to solve challenging
tasks such as image classification.
• The networks automatically learn to build abstract representations by adapting and correcting
themselves, fitting patterns observed in the data. Here networks are trained with a feedback process
called backpropagation based on gradient descent optimization.
What is a filter/kernel and what does it do?
Notice that the vertical white lines get
enhanced while the rest gets blurred
notice that the horizontal white lines
get enhanced while the rest is blurred
out
In the deep learning literature, there is no need to define the filters manually. Instead, during training the
convolutional layer will automatically learn the most useful filters for its task, and the layers above will
learn to combine them into more complex patterns.
Filter examples include the Sobel
filter, the Scharr filter, etc.
What does it mean to convolve?
Each value in the matrix on the left corresponds to a
single pixel value, and we convolve a 3x3 filter with
the image by multiplying its values element-wise with
the overlapped patch of the original matrix, then summing
them up and adding a bias. Here the stride value is 1.
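The element-wise multiply, sum, and bias step described above can be sketched in NumPy (a minimal illustration; the function name and example filter are my own):

```python
import numpy as np

def convolve2d(image, kernel, bias=0.0, stride=1):
    """Slide a kernel over an image: element-wise multiply the
    overlapped patch, sum it up, and add a bias (valid convolution)."""
    f = kernel.shape[0]
    out_h = (image.shape[0] - f) // stride + 1
    out_w = (image.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0          # a simple 3x3 averaging filter
print(convolve2d(img, k).shape)    # (3, 3): a 5x5 input shrinks to 3x3
```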
A convolution layer transforms an input volume into an output volume of a different size. The size of the input image generally
reduces! (See Padding in upcoming slides if size reduction is a concern.)
What does a CNN do, finally?
Parameter sharing: A feature detector (such as a vertical edge detector)
that’s useful in one part of the image is probably useful in another part of
the image.
Sparsity of connections: In each layer, each output value depends only
on a small number of inputs.
Nature of images → establishes invariance (irrespective of translation).
Learning theory → establishes regularization (reducing the degrees of
freedom to the degrees of the filter size).
WHY CNN?
The main benefits of padding are the following:
– It allows you to use a CONV layer without necessarily shrinking the height and width of the
volumes. This is important for building deeper networks, since otherwise the height/width would
shrink as you go to deeper layers. An important special case is the "same" convolution, in which
the height/width is exactly preserved after one layer.
– It helps us keep more of the information at the border of an image. Without padding, very few
values at the next layer would be affected by pixels at the edges of an image.
Two Kinds of Padding
1) Valid Convolutions No Padding is done.
2) Same Convolutions Pad to keep the output size same as input size.
Padding
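The effect of the two padding modes can be checked with the standard output-size formula, floor((n + 2p − f)/s) + 1 (a small sketch; the function name is my own):

```python
import math

def conv_output_size(n, f, padding="valid", stride=1):
    """Spatial output size of a conv layer on an n x n input
    with an f x f filter."""
    if padding == "valid":
        p = 0                      # no padding: the output shrinks
    elif padding == "same":
        p = (f - 1) // 2           # pad so output size == input size (stride 1)
    else:
        raise ValueError(padding)
    return math.floor((n + 2 * p - f) / stride) + 1

print(conv_output_size(28, 3, "valid"))  # 26: the image shrinks
print(conv_output_size(28, 3, "same"))   # 28: size preserved
```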
With each layer of convolution, the output size reduces at the edges, i.e., the
output image shrinks. Hence with deeper networks, a lot of information
at the edges would be lost, so padding is needed.
In some cases the information at the edges is even crucial, and padding helps us
keep more of the information at the border of an image.
Why is Padding required?
Fire a neuron or not! To activate a neuron,
specialized functions are used, which are called
activation functions.
The purpose of the activation function is to introduce non-linearity into the network.
Real-world problems are non-linear.
Hence, to apply a non-linear mapping to the incoming data, we use a non-linear function called the activation function.
An activation function is a decision-making function that determines the presence of a particular neural
feature.
What is an activation function?
Why is an activation function needed?
A feed-forward neural network with linear
activation and any number of hidden layers
is equivalent to just a linear neural network
with no hidden layer.
Depth doesn’t contribute to
the expressiveness of the
model unless we use
nonlinear activations
between the linear layers.
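The claim above can be verified numerically: stacking two linear layers with no activation in between collapses into a single linear layer (a toy NumPy demo; all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))           # batch of 4 inputs, 5 features

W1, b1 = rng.normal(size=(5, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two linear layers with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...collapse into one linear layer with W = W1 W2 and b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True: depth added nothing
```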
1. Sigmoid (used mostly for binary classification at the output layer)
2. ReLU (used mostly at hidden layers)
3. Tanh (zero-centered; often gives better gradients than sigmoid)
4. Softmax (used for multiclass classification)
5. Leaky ReLU
6. ReLU6
Some popular types of activation
functions
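The listed activations (except softmax, covered on a later slide) can each be written in a line of NumPy; ReLU6 follows the common definition min(max(x, 0), 6):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1)
def relu(x):       return np.maximum(0.0, x)            # zero below 0
def tanh(x):       return np.tanh(x)                    # squashes to (-1, 1)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def relu6(x):      return np.minimum(np.maximum(x, 0.0), 6.0)  # ReLU capped at 6

x = np.array([-8.0, -1.0, 0.0, 1.0, 8.0])
print(relu(x))        # [0. 0. 0. 1. 8.]
print(relu6(x))       # [0. 0. 0. 1. 6.]
print(sigmoid(0.0))   # 0.5
```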
Softmax regression is a form of logistic regression that normalizes an input vector into
a vector of values that follows a probability distribution whose total sums to 1. Below is
an example for multiclass classification.
Softmax
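A numerically stable softmax that normalizes logits into a probability vector summing to 1 (a common sketch; the example logits are made up):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into class probabilities.
    Subtracting max(z) avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # e.g. scores for 3 classes
probs = softmax(logits)
print(probs.argmax())                # 0: the largest logit wins
print(probs.sum())                   # 1.0 (up to float rounding)
```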
Since we are using backpropagation, the functions we use must be
differentiable at every point.
Nonlinear activation functions must be continuous and differentiable within their range.
Why differentiable?
Important properties of activation
functions!
The rate of change (derivative) of the
loss with respect to the weights reaches
zero at the global minimum, giving the
best predictions.
HOW and WHY?
(next slide)
The pooling (POOL) layer reduces the height and width of the input. It helps speed up computation, as well as helps make
feature detectors more invariant to position in the input. The two types of pooling layers are:
Max-pooling layer: slides an (f, f) window over the input and stores the max value of the window in the output.
Average-pooling layer: slides an (f, f) window over the input and stores the average value of the window in the output.
POOLING Layer
The theoretical reason for applying pooling
is that we would like our computed
features not to care about small changes
in position in an image.
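Max pooling with an f x f window and stride s can be sketched directly (illustrative NumPy; the function name is my own):

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Slide an f x f window over x and keep the max of each window."""
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+f, j*stride:j*stride+f].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```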
If there are 10 Filters and each Filter is 3X3 , how many
weights are there to be learnt for an RGB Input image?
Exercise for CNN with Volume
(3×3×3 + 1) × 10 = 280
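The exercise can be checked with the general parameter-count formula for a conv layer, (f · f · input_depth + 1) × num_filters, where the +1 is the per-filter bias:

```python
def conv_params(f, input_depth, num_filters):
    """Weights per filter (f*f*input_depth) plus one bias, per filter."""
    return (f * f * input_depth + 1) * num_filters

# 10 filters of 3x3 over an RGB (depth-3) input:
print(conv_params(3, 3, 10))  # 280, matching the answer above
```

Note that the count is independent of the input image's height and width, which is exactly the parameter-sharing property discussed earlier.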
1. Doing convolution over the RGB image is quite important, instead of over a grayscale image.
2. In CNN over volume, the depth of the filters must be the same as the depth of the input image. The number of
filters can be customized to extract different features from the input image. If the filter depth were 1,
we would extract only grayscale information, not the RGB information, so it would not be very
useful.
3. Various kinds of filters can be stacked together to do feature extraction based on the
requirement, like edge detection in only the RED channel or in ALL channels, etc.
4. Filters are learnable.
5. A filter maps to a local receptive field on the given input image.
6. A property of CNNs is that they avoid overfitting by using a small number of parameters no matter how big
the input image is.
7. One layer of convolution is "Convolution (W * X + b) + Activation (like ReLU etc.)". From this layer, the
output goes to the next layer: convolution, pooling, FC, etc.
CNN Over volume
• Whatever your input depth, each filter produces a 2-D layer of neurons as output.
• The output depth is independent of the input depth; it depends only on the number of
filters. The following relation holds:
• The number of filters (the depth of the CNN layer) is a hyperparameter.
• Each filter has its own set of weights, enabling it to learn a different feature on the same local region
covered by the filter.
CNN Over Volume
Depth of output volume = number of filters in the convolution layer
• Tensors are the standard way of representing data in deep learning.
• Scalars are rank-0 tensors, vectors are rank-1 tensors, matrices are rank-2 tensors, and a rank-3 tensor is a rectangular
prism (cube) of numbers.
• A rank-1 tensor has a shape of dimension 1, a rank-2 tensor a shape of dimension 2, and a rank-3 tensor a shape of dimension
3.
• RGB images are represented as tensors (three-dimensional arrays), with each pixel having three values corresponding
to red, green, and blue components.
• Here computation is approached as a dataflow graph/computation graph . In this graph, nodes represent operations
(add/mul/cnn/concat/pool etc.), and edges represent data (tensors) .
• Calling a TensorFlow operation adds a description of a computation to TensorFlow’s “computation graph”.
• Variables in TensorFlow hold tensors and allow for stateful computation that modifies variables to occur.
• TensorFlow 1.x largely follows a declarative programming style.
Basics on tensorflow
• TensorFlow 2.0 follows an imperative (eager) programming style, hence you can run your model instantly.
• TensorFlow derives gradient descent optimization algorithm automatically based on the computation graph and loss
function provided by the user.
• To monitor, debug, and visualize the training process, and to streamline experiments, TensorFlow comes with
TensorBoard.
• It has support for distributed training, asynchronous computation with threading and queues, efficient I/O and data
formats, and much more.
Basics on tensorflow
In TensorFlow, a computation graph is a dataflow graph.
In a dataflow graph, the edges allow data to “flow” from one node to another in a directed manner.
Each of the graph’s nodes represents an operation.
Operations in the graph include all kinds of functions, from simple arithmetic ones such as subtraction
and multiplication to more complex ones.
What is a Computation Graph ?
The key idea behind computation graphs in
TensorFlow is that we first define what
computations should take place, and then
trigger the computation in an external
mechanism.
Writing and running programs in TensorFlow has the following steps:
• Create Tensors (variables) that are not yet executed/evaluated.
• Write operations between those Tensors.
• Initialize your Tensors.
• Create a Session.
• Run the Session. This will run the operations you'd written above.
Construct a Graph
Execute a Graph
Note: When importing TensorFlow (with import tensorflow as tf), a specific empty default
graph is formed. Additional graphs can be created (with tf.Graph()) but they need
to be set as the default graph (with graph.as_default()) for operations to be added
and executed.
The requested nodes in sess.run() are called fetches. They are the
elements of the graph we wish to compute.
Fetches
We can ask sess.run() for multiple
nodes' outputs simply by
passing a list of requested
nodes.
How TensorFlow execution works
• It starts at the requested output(s) and works backward.
• It computes the nodes that must be executed according to
their dependencies.
• The part of the graph that gets computed depends on the
output query.
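The "define first, execute later" idea, including backward dependency resolution from the fetched node, can be illustrated with a toy graph evaluator (a pedagogical sketch of the concept, not TensorFlow's actual implementation; all names are my own):

```python
class Node:
    """A graph node: an operation plus its input edges (other nodes)."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self):
        # Start at the requested node and work backward: evaluate
        # only the nodes this output actually depends on.
        args = [n.run() for n in self.inputs]
        return self.op(*args)

const = lambda v: Node(lambda: v)

# Building the graph only *describes* the computation...
a, b, c = const(2), const(3), const(5)
d = Node(lambda x, y: x * y, a, b)      # d = a * b
e = Node(lambda x, y: x + y, d, c)      # e = d + c
f = Node(lambda x: x - 1, c)            # f = c - 1 (independent of a, b, d)

# ...nothing runs until a node is fetched; fetching e touches only its ancestors.
print(e.run())   # 11
print(f.run())   # 4
```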
Cross entropy is a measure of similarity when the model outputs class probabilities.
It is a "cost" function that computes the difference between two
probability distributions.
It applies to activation functions like softmax and sigmoid, which output
probabilities, but not to ReLU, which doesn't output probabilities.
CROSS Entropy (CE)
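Cross-entropy between a one-hot true distribution and predicted probabilities can be computed as follows (a minimal NumPy sketch; the example vectors are made up):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """CE = -sum_i y_true[i] * log(y_pred[i]); eps guards against log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0., 1., 0.])          # one-hot: true class is index 1
good   = np.array([0.05, 0.90, 0.05])    # confident and correct
bad    = np.array([0.70, 0.20, 0.10])    # confident but wrong

# The better the predicted distribution matches the truth, the lower the cost.
print(cross_entropy(y_true, good) < cross_entropy(y_true, bad))  # True
```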
One-hot encoding allows the representation of categorical data to be more
expressive.
Problems may arise when there is no ordinal relationship among the labels: allowing the
representation to lean on any such relationship might be damaging to learning to
solve the problem. An example might be the labels 'dog' and 'cat'.
ONE HOT Encoding
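One-hot encoding maps each label to a vector with a single 1, so 'dog' and 'cat' carry no implied ordering (a sketch; the labels are illustrative):

```python
import numpy as np

labels = ["dog", "cat", "dog", "bird"]
classes = sorted(set(labels))              # ['bird', 'cat', 'dog']
index = {c: i for i, c in enumerate(classes)}

# One row per sample, one column per class, a single 1 per row.
one_hot = np.zeros((len(labels), len(classes)))
one_hot[np.arange(len(labels)), [index[c] for c in labels]] = 1.0
print(one_hot)
# [[0. 0. 1.]    dog
#  [0. 1. 0.]    cat
#  [0. 0. 1.]    dog
#  [1. 0. 0.]]   bird
```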
MNIST Image Classification with SOFTMAX ONLY
(not using spatial information/CNN)
An example of supervised learning
Softmax regression model will figure out,
• For each pixel in the image, which digits tend to have high (or low) values in that location. Pixel values are
correlated with digit in image.
• For instance, the center of the image will tend to be white for zeros, but black for sixes.
• Thus, a black pixel in the center of an image will be evidence against the image containing a zero, and in favor of it
containing a six.
MNIST Image Classification with SOFTMAX ONLY
(not using spatial information/CNN)
Evidence for the image containing the digit 0
Here x_i and w_i are the respective vectors for digit 0, and similarly for all other digits (1…9).
Learning in this model consists of finding weights that tell us how to accumulate evidence
for the existence of each of the digits.
Graph representation of the MNIST softmax model
The "bias term" is equivalent to
stating which digits we believe an image
to be before seeing the pixel values. If
you have seen this before, then try
adding it to the model and check the results.
7 × 7 × 64 × 1024 ≈ 3.2M vs.
28 × 28 × 64 × 1024 ≈ 51M if
no pooling is used.
CONV Layer with MNIST
DataSet
• This is a regularization trick used to force the network to distribute the learned representation
across all the neurons.
• It "turns off" a random preset fraction of the units in a layer by setting their values to zero during
training.
• The dropped-out neurons are random (different for each computation), forcing the network to learn a
representation that will work even after the dropout.
• This process is often thought of as training an "ensemble" of multiple networks, thereby increasing
generalization and preventing overfitting.
• When using the network as a classifier at test time ("inference"), there is no dropout and the full
network is used as is.
Dropout
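Inverted dropout (the commonly used variant) zeroes a random fraction of activations during training and rescales the survivors, so nothing needs to change at inference (illustrative NumPy; names are my own):

```python
import numpy as np

def dropout(a, keep_prob, rng):
    """Zero out units with probability 1-keep_prob; divide the survivors
    by keep_prob so the expected activation is unchanged."""
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones(1000)
dropped = dropout(a, keep_prob=0.8, rng=rng)
print((dropped == 0).mean())   # roughly 0.2 of the units are turned off
print(dropped.mean())          # close to 1.0: expectation is preserved
```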
Being able to go from idea to result with the least possible delay is
key to finding good models.
Other Attributes of Keras Framework
• Keras was developed to enable deep learning engineers to build and experiment with different
models very quickly.
• Keras is an even higher-level framework and provides additional abstractions.
• Keras is more restrictive than the lower-level frameworks, so some very complex models
can be implemented in TensorFlow but only with more difficulty (or not at all) in Keras.
Why keras ?
Two Kind of Models
Sequential Models
Model Class with
Functional APIs
1. Create the model by calling the function above
2. Compile the model by calling model.compile(optimizer = "...",
loss = "...", metrics = ["accuracy"])
3. Train the model on train data by calling model.fit(x = ..., y = ...,
epochs = ..., batch_size = ...)
4. Test the model on test data by calling model.evaluate(x = ..., y =
...)
Four steps
in Keras for
training and
test
Predict using model.predict(x), where x has the same shape the model
was trained on.
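The four steps map directly onto tf.keras code. Here is a minimal sketch on random data (the shapes, layer sizes, and hyperparameters are illustrative, not from the course):

```python
import numpy as np
import tensorflow as tf

# Toy data: 64 samples with 8 features, binary labels.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")

# 1. Create the model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# 2. Compile the model.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# 3. Train the model on the training data.
model.fit(x, y, epochs=2, batch_size=16, verbose=0)

# 4. Evaluate, then predict on input of the same shape it was trained on.
loss, acc = model.evaluate(x, y, verbose=0)
preds = model.predict(x, verbose=0)
print(preds.shape)  # (64, 1)
```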
Basic idea behind RNN
RNN – recurrent neural networks
Each new element in the
sequence contributes some
new information, which
updates the current state of
the model.
Markov chain model
View data sequences as “chains,” with
each node in the chain dependent in
some way on the previous node, so
that “history” is not erased but carried
on
Mathematically in
statistics and probability
Update step for RNN
tanh is the hyperbolic tangent function, which has its range in [–1, 1];
x_t and h_t are the input and state vectors.
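One update step of the standard vanilla RNN, h_t = tanh(x_t·W_x + h_{t-1}·W_h + b), can be sketched as follows (the weight names and sizes are my own, illustrative choices):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """Compute the new state from the current input and previous state;
    tanh keeps every component of h_t in [-1, 1]."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 4))          # input-to-state weights
Wh = rng.normal(size=(4, 4))          # state-to-state weights
b = np.zeros(4)

h = np.zeros(4)                       # initial state
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 inputs
    h = rnn_step(x_t, h, Wx, Wh, b)   # each element updates the state

print(h.shape)                        # (4,): state size is unchanged
```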
The problem with deeper neural networks is that they are harder to train; once the number of layers reaches a certain point, the
training error starts to rise again.
Deep networks are also harder to train due to the exploding and vanishing gradients problem.
Problems with Deeper networks?
ResNet (Residual Network), proposed by He et al. in the paper "Deep Residual Learning for Image Recognition" (2015), solves
these problems by implementing skip connections, where the output from one layer is fed to a layer deeper in the network as
below; this "shortcut" or "skip connection" allows the gradient to be directly backpropagated to earlier layers:
Skip Connection - RESNET
CONV 3x3, same convolution, with intermittent pooling layers
When using CONV 3x3 with same convolution, the dimensions of a^[l+2] and a^[l] don't change,
so adding them is not a problem. But
when using pooling layers, the dimensions of a^[l+2] and a^[l] differ; then multiply by an intermittent
matrix W_s to get matching dimensions and achieve the identity mapping, as below:
a^[l+2] = W_s · a^[l]
Why resnet ?
Convolutional block with dimension matchup for
shortcut with final layer
The CONV2D layer on the shortcut path
does not use any non-linear activation
function. Its main role is to just apply a
(learned) linear function that reduces the
dimension of the input, so that the
dimensions match up for the later
addition step.
The advantages of ResNets are:
• performance doesn't degrade with very deep networks
• cheaper to compute
• ability to train very deep networks
ResNet works because:
• the identity function is easy for a residual block to learn
• using a skip connection helps the gradient to back-propagate and thus helps
you to train deeper networks
Benefits of resnet
This leads to less computation, giving a similar output with reduced channel dimension for a given input image.
It suffers less over-fitting due to the small kernel size (1x1).
One-by-one convolution was first introduced in the paper titled "Network in Network".
1X1 CONVOLUTION or Network-IN-
Network
Depthwise separable convolution – a foundation of
MobileNet, GoogLeNet/Inception networks, and many more…
Notation: D_F = input/output feature map size, D_K = kernel size,
M = input depth, N = output depth.
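The saving can be quantified with the standard MobileNet-style cost formulas: a standard convolution costs D_K²·M·N·D_F² multiplications, while a depthwise separable one costs D_K²·M·D_F² + M·N·D_F², a ratio of 1/N + 1/D_K² (a sketch using the notation above; the example sizes are illustrative):

```python
def standard_cost(Dk, M, N, Df):
    # One Dk x Dk x M filter per output channel, at every output position.
    return Dk * Dk * M * N * Df * Df

def separable_cost(Dk, M, N, Df):
    depthwise = Dk * Dk * M * Df * Df   # one Dk x Dk filter per input channel
    pointwise = M * N * Df * Df         # 1x1 conv mixing the channels
    return depthwise + pointwise

Dk, M, N, Df = 3, 32, 64, 112
ratio = separable_cost(Dk, M, N, Df) / standard_cost(Dk, M, N, Df)
print(round(ratio, 3))                  # 0.127, i.e. 1/N + 1/Dk^2 = 1/64 + 1/9
```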
Object Detection = Image Classification + Object Localization
• Probability of whether an object exists or not
• Classes to detect in the image (car, pedestrian, motorcycle, …)
• Bounding box location numbers for each image, in the form of coordinates
HOW OD works?
Train a ConvNet on cropped images and then run the sliding-windows protocol, either:
• sequentially, with different window sizes on the given input image, leading to higher computation, or
• run the sliding window convolutionally to get all bounding boxes in one shot.
Evaluating object localization
IoU (Intersection over Union) is a measure of the overlap between two bounding
boxes.
How accurate are the detected
bounding boxes?
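IoU can be computed directly from box corners; here boxes are given as (x1, y1, x2, y2), a common but here illustrative convention:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)   # 0 if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 3))  # 0.143: small overlap (1/7)
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))            # 1.0: identical boxes
```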
Non-max suppression
• Get rid of bounding boxes with a low probability/detection
score for each detected class.
• Get rid of overlapping bounding boxes by measuring
IoU against the highest-probability boxes for each detected
class.
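The two steps, score thresholding followed by greedy IoU-based suppression, can be sketched per class as follows (self-contained; the thresholds and boxes are illustrative):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-score boxes, then greedily keep the highest-score box and
    suppress remaining boxes that overlap it above iou_thresh."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]: box 1 is suppressed by box 0
```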
Anchor boxes (for overlapping objects)
• Propose 2-5 Anchor boxes for different
sizes of objects detected
• Match IoU of Anchor Boxes with the
Ground Truth of Image and Predict that
Class in respective Anchor Box
• Anchor boxes are defined only by their
width and height
Region proposal CNN (R-CNN)
Run a semantic segmentation algorithm to
detect blobs in the image.
Here a blob is a part of the image where at least
one object can be classified.
Run a classifier only on the proposed regions
of blobs (also called ROIs, regions of
interest).
• YOLO (You Only Look Once)
• R-CNN
• Fast R-CNN
• Faster R-CNN
• SSD (Single Shot Detector)
• and many more…
Other, better methods of OD
Face embedding used for one-shot learning,
similarity functions, Siamese networks, triplet loss,
and logistic regression
(Diagram: two images x^(i) and x^(j) pass through the same network,
producing embeddings f(x^(i)) and f(x^(j)), which are compared to
produce the output ŷ.)
An encoding is a good one if:
• The encodings of two images of the same person are quite similar to each other
• The encodings of two images of different persons are very different
One example of face
embedding/encoding
Training via triplet loss uses triplets of images (A, P, N), where
• A is an "Anchor" image: a picture of a person.
• P is a "Positive" image: a picture of the same person as the Anchor image.
• N is a "Negative" image: a picture of a different person than the Anchor image.
Triplet Loss in Detail
Minimize the Loss
Function using Gradient
Descent
Margin (a hyperparameter)
Denoting max(z, 0) as [z]+
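The triplet loss, [‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + margin]+, can be computed on precomputed embeddings like so (a NumPy sketch; the example embeddings are made up):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin): push the positive
    closer to the anchor than the negative by at least the margin."""
    pos = np.sum((a - p) ** 2)   # squared distance anchor <-> positive
    neg = np.sum((a - n) ** 2)   # squared distance anchor <-> negative
    return max(0.0, pos - neg + margin)

anchor   = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])   # same person: close to the anchor
negative = np.array([0.0, 1.0, 0.0])   # different person: far away

print(triplet_loss(anchor, positive, negative))  # 0.0: constraint satisfied
```

Swapping the positive and negative makes the constraint violated and the loss positive, which is what gradient descent then minimizes.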
• Train via Triplet Loss and then Classify
• Train via Logistic Regression and then Classify
• Compare a pre-computed encoding (FaceNet) with the encoding of a new
image
Different Ways of Doing face recognition
Face verification vs. face recognition
Verification
• Input image, name/ID
• Output whether the input image is that of the claimed person
Recognition
• Has a database of K persons
• Get an input image
• Output ID if the image is any of the K persons (or “not recognized”)
Android AI Architecture on Celadon
(Architecture diagram, summarized top to bottom:
• Android applications use TensorFlow Lite via the Android NNAPI/NN Runtime (Java/JNI, C++), with CPU fallback.
• An Android NN HAL layer partitions execution by device capabilities across per-device HALs: CPU, VPU, GPU, GNA, and Vulkan.
• The OpenVINO Inference Engine API loads a plugin plus the IR (XML) and infers through its plug-in architecture: the MKL-DNN plugin for CPUs (Xeon/Core/SKL/Atom, via intrinsics), the clDNN/OpenCL plugin for GPUs (GEN), the Myriad plugin for Myriad 2/X (USB/PCI driver), the FPGA plugin (DLA), and the GNA plugin (GNA API over a PCI driver).
• The Model Optimizer converts Caffe, TensorFlow, MxNet, ONNX, and Kaldi models into the IR consumed by the Inference Engine.)
https://github.com/projectceladon
OpenCV [OpenCV] is an open source (see http://opensource.org) computer vision library available from http://opencv.org.
In 1999 Gary Bradski [Bradski], working at Intel Corporation, launched OpenCV with the hopes of accelerating computer
vision and artificial intelligence by providing a solid infrastructure for everyone working in the field.
The OpenCV library contains over 500 functions that span many areas in vision, including factory product inspection,
medical imaging, security, user interface, camera calibration, stereo vision, and robotics.
It has its own ML and DNN modules.
What is OpenCV?
OPENCV Architecture on Different OSes