6.874, 6.802, 20.390, 20.490, HST.506
Computational Systems Biology
Deep Learning in the Life Sciences
Lecture 3:
Convolutional Neural Networks
Prof. Manolis Kellis
http://mit6874.github.io
Slides credit: 6.S191, Dana Erlich, Param Vir Singh,
David Gifford, Alexander Amini, Ava Soleimany
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
1a. What do you see, and how?
Can we teach machines to see?
What do you see?
How do you see?
How can we help
computers see?
What computers ‘see’: Images as Numbers
What the computer "sees"
Levin, Image Processing & Computer Vision
An image is just a matrix of numbers in [0, 255], e.g., 1080x1080x3 for an RGB image.
Question: is this Lincoln? Washington? Jefferson? Obama?
How can the computer answer this question?
What you see
Input Image Input Image + values Pixel intensity values
(“pix-el”=picture-element)
What you both see
Can I just do classification on the 1,166,400-long (1080x1080) image vector directly?
No. Instead: exploit image spatial structure. Learn patches. Build them up
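A quick check of the arithmetic behind this question, as a rough sketch. The hidden layer width of 1000 units is an assumption made only for illustration; it does not come from the slides.

```python
# Image size from the slide: 1080 x 1080 pixels, 3 colour channels.
height, width, channels = 1080, 1080, 3
hidden_units = 1000  # hypothetical layer width, not from the slides

pixels_per_channel = height * width           # length of one flattened channel
flat_length = pixels_per_channel * channels   # length of the flattened RGB image

# A fully connected layer needs one weight per (input value, hidden unit) pair.
dense_weights = flat_length * hidden_units

print(pixels_per_channel, flat_length, dense_weights)
```

Even this single hypothetical layer would need billions of weights, which is the motivation for exploiting spatial structure instead.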
1b. Classical machine vision roots
in study of human/animal brains
Inspiration: human/animal visual cortex
• Layers of neurons: pixels, edges, shapes, primitives, scenes
• E.g. Layer 4 responds to bands w/ given slant, contrasting edges
Primitives: Neurons & action potentials
•Chemical accumulation across
dendritic connections
•Pre-synaptic axon → post-synaptic dendrite → neuronal cell body
•Each neuron receives multiple
signals from its many dendrites
•When threshold crossed, it fires
•Its axon then sends outgoing
signal to downstream neurons
•Weak stimuli ignored
•Sufficiently strong cross
activation threshold
•Non-linearity within
each neuronal level
•Neurons connected into circuits (neural networks): emergent properties, learning, memory
•Simple primitives arranged in simple, repetitive, and extremely large networks
•86 billion neurons, each connects to ~10k neurons, ~1 quadrillion (10^15) connections
Abstraction layers: edges, bars, dir., shapes, objects, scenes
LGN: Small dots
V1: Orientation,
disparity, some color
V4: Color, basic shapes,
2D/3D, curvature
VTC: Complex features
and objects (VTC: ventral temporal cortex)
•Abstraction layers ↔ visual cortex layers
•Complex concepts from simple parts, hierarchy
•Primitives of visual concepts encoded in
neuronal connection in early cortical layers
• Massive recent expanse of human brain has re-used a
relatively simple but general learning architecture
General “learning machine”, reused widely
• Hearing, taste, smell, sight, touch all re-use similar learning architecture
[Figure labels: Motor Cortex, Visual Cortex]
• Interchangeable circuitry
• Auditory cortex learns to ‘see’ if sent visual signals
• Injury area tasks shift to uninjured areas
• Not fully-general learning, but well-adapted to our world
• Humans co-opted this circuitry to many new applications
• Modern tasks accessible to any homo sapiens (<70k years)
• ML primitives not too different from animals: more to come?
[Figure: human vs. chimp brain, hardware expansion]
2a. Spatial structure
for image recognition
Using Spatial Structure
Idea: connect
patches of input to
neurons in hidden
layer.
Neuron connected to region of input.
Only “sees” these values.
Input: 2D
image.
Array of pixel
values
Using Spatial Structure
Connect patch in input layer to a single neuron in subsequent layer.
Use a sliding window to define connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
- Filter of size 4x4: 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch
This “patchy” operation is convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
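The sliding-window operation described above can be sketched in plain Python. This is a minimal illustration, assuming a toy 8x8 input and an all-ones 4x4 filter (both hypothetical). As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D list, with no padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Element-wise multiply the patch by the kernel, then add.
            row.append(sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(k) for b in range(k)
            ))
        feature_map.append(row)
    return feature_map


# Toy 8x8 input and a 4x4 filter with 16 weights (all ones here).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
kernel = [[1] * 4 for _ in range(4)]

# Shift by 2 pixels between patches, as on the slide.
feature_map = conv2d(image, kernel, stride=2)
print(len(feature_map), len(feature_map[0]))
```

The same 16 weights are reused at every position, which is the parameter sharing named in point 3.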
Fully Connected Neural Network
Fully Connected:
• Each neuron in
hidden layer
connected to all
neurons in input
layer
• No spatial information
• Many, many
parameters
Input:
• 2D image
• Vector of pixel
values
Key idea: Use spatial structure in input to inform architecture
of the network
High Level Feature Detection
Let’s identify key features in each image category
Wheels, License Plate, Headlights
Door, Windows, Steps
Nose, Eyes, Mouth
Fully Connected Neural Network
2b. Convolutions and filters
Convolution operation is element wise
multiply and add
Filter / Kernel
Producing Feature Maps
Original | Sharpen | Edge Detect | “Strong” Edge Detect
A simple pattern: Edges
How can we detect edges with a kernel?
[Figure: input image, filter, and output feature map]
(Goodfellow 2016)
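One way to see how a kernel can detect edges is a horizontal difference filter. The sketch below assumes the 1x2 kernel [-1, 1] and a toy two-row image; the exact kernel values in the figure are not recoverable from the text, so treat both as illustrative.

```python
def horizontal_edges(image):
    """Apply the 1x2 kernel [-1, 1] along each row (valid positions only)."""
    kernel = [-1, 1]
    return [
        [kernel[0] * row[j] + kernel[1] * row[j + 1]
         for j in range(len(row) - 1)]
        for row in image
    ]


# Toy image: dark region on the left, bright region on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
edges = horizontal_edges(image)
print(edges)
```

The output is zero wherever neighbouring pixels are equal and non-zero only at the vertical boundary between the dark and bright regions.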
Simple Kernels / Filters
X or X?
Image is represented as matrix of pixel values… and computers are literal!
We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.
Rohrer How do CNNs work?
There are three approaches to edge cases in
convolution
(Goodfellow 2016)
Zero Padding Controls Output Size
• Full convolution: zero pad input so output is produced whenever an output value
contains at least one input value (expands output)
• Valid-only convolution: output only when
entire kernel contained in input (shrinks output)
• Same convolution: zero pad input so output
is same size as input dimensions
x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')
• TF convolution operator takes stride and zero fill option as parameters
• Stride is distance between kernel applications in each dimension
• Padding can be SAME or VALID
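The three padding conventions can be compared by their output sizes. This is a small sketch under the usual definitions; the helper name `conv_output_size` and its `mode` argument are made up for illustration and are not part of TensorFlow.

```python
def conv_output_size(n, f, stride=1, mode="valid"):
    """Output length along one dimension for input size n and filter size f."""
    if mode == "valid":
        total_pad = 0              # kernel must fit entirely inside the input
    elif mode == "same":
        total_pad = f - 1          # output equals input size when stride is 1
    elif mode == "full":
        total_pad = 2 * (f - 1)    # every overlap of kernel and input counts
    else:
        raise ValueError("mode must be 'valid', 'same' or 'full'")
    return (n + total_pad - f) // stride + 1


for mode in ("valid", "same", "full"):
    print(mode, conv_output_size(5, 3, mode=mode))
```

For a 5-wide input and a 3-wide filter, valid shrinks the output, same preserves it, and full expands it.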
3a. Learning Visual Features
de novo
Key idea:
learn hierarchy of features
directly from the data
(rather than hand-engineering them)
Low level features Mid level features High level features
Lee+ ICML 2009
Eyes,ears,nose
Edges,dark spots Facial structure
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image
Feature Extraction with Convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
LeNet-5
32×32×1
→ conv 5×5, s = 1 → 28×28×6
→ avg pool, f = 2, s = 2 → 14×14×6
→ conv 5×5, s = 1 → 10×10×16
→ avg pool, f = 2, s = 2 → 5×5×16
→ FC 120 → FC 84 → ŷ (10 outputs)
Reminder:
Output size = (N+2P-F)/stride + 1
[LeCun et al., 1998]
This slide is taken from Andrew Ng
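The output-size reminder can be applied layer by layer to trace the LeNet-5 volumes. A minimal sketch, assuming no padding and the filter sizes and strides shown in the diagram; the layer list format is made up for illustration.

```python
def out_size(n, f, stride=1, pad=0):
    """Output size = (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, output channels or None for pooling)
layers = [
    ("conv 5x5", 5, 1, 6),
    ("avg pool", 2, 2, None),
    ("conv 5x5", 5, 1, 16),
    ("avg pool", 2, 2, None),
]

size, channels = 32, 1
shapes = [(size, size, channels)]
for name, f, s, out_ch in layers:
    size = out_size(size, f, s)
    if out_ch is not None:        # pooling keeps the channel count
        channels = out_ch
    shapes.append((size, size, channels))

print(shapes)
```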
LeNet-5
• Only 60K parameters
• As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑
• General structure:
conv->pool->conv->pool->FC->FC->output
• Different filters look at different channels
• Sigmoid and Tanh nonlinearity
[LeCun et al., 1998]
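The "only 60K parameters" figure can be roughly reproduced by counting weights and biases. This sketch assumes the simplified LeNet-5 of the diagram, where every filter in the second conv layer sees all 6 input channels; the 1998 paper uses a sparser connection scheme, so the exact total differs slightly.

```python
def conv_params(f, in_ch, out_ch):
    """Weights plus one bias per filter."""
    return (f * f * in_ch + 1) * out_ch


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


counts = {
    "conv1": conv_params(5, 1, 6),
    "conv2": conv_params(5, 6, 16),
    "fc1": dense_params(5 * 5 * 16, 120),
    "fc2": dense_params(120, 84),
    "output": dense_params(84, 10),
}
total = sum(counts.values())
print(counts, total)
```

Most of the parameters sit in the first fully connected layer, not in the convolutions.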
Backpropagation of convolution
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
3b. Convolutional Neural
Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs
Mid level features
Low level features High level features
Edges,dark spots
Conv Layer 1
Lee+ ICML 2009
Eyes,ears,nose
Conv Layer 2
Facial structure
Conv Layer 3
CNNs for Classification
1. Convolution: Apply filters to generate feature maps. (tf.keras.layers.Conv2D)
2. Non-linearity: Often ReLU. (tf.keras.activations.*)
3. Pooling: Downsampling operation on each feature map. (tf.keras.layers.MaxPool2D)
Train model with image data.
Learn weights of filters in convolutional layers.
Example – Six convolutional layers
Convolutional Layers: Local Connectivity
For a neuron in
hidden layer:
- Take inputs from patch
- Compute weighted
sum
- Apply bias
tf.keras.layers.Conv2D
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
• Take inputs from patch
• Compute weighted sum
• Apply bias
4x4 filter:
matrix of
weights wij for neuron (p,q) in hidden layer
1) applying a window of weights
2) computing linear combinations
3) activating with non-linear function
tf.keras.layers.Conv2D
CNNs: Spatial Arrangement of Output
Volume
depth
width
height
Layer Dimensions:
h × w × d
where h and w are spatial dimensions and
d (depth) = number of filters
Stride:
Filter step size
Receptive Field:
Locations in input image
that a node is path
connected to
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
Introducing Non-Linearity
Rectified Linear Unit
(ReLU)
- Apply after every convolution operation
(i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces
all negative values by zero.
- Non-linear operation
tf.keras.layers.ReLU
Karn Intuitive CNNs
Pooling
Max pooling, average pooling
1) Reduced dimensionality
2) Spatial invariance
tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)
The Rectified Linear Unit (ReLU) is a common
non-linear detector stage after convolution
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
x = tf.nn.bias_add(x, b)
x = tf.nn.relu(x)
f(x) = max(0, x)
When will we backpropagate through this?
Once it “dies” what happens to it?
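The two questions above can be explored with a tiny sketch of ReLU and its gradient. Taking the gradient at exactly zero to be 0 is a convention assumed here, not something stated on the slide.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]
print(outputs)
print(grads)
```

Gradients only flow back through units with positive input. A unit whose input is negative for every training example receives zero gradient and stops updating, which is what "dying" refers to.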
Pooling reduces dimensionality by giving up
spatial location
• max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
ksize (pooling neighborhood) and strides are ordered as
[batch, height, width, channels]
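Max pooling on a single feature map can be sketched in a few lines. This assumes a k x k window with stride k and no padding (the VALID case); the 4x4 feature map is made up for illustration.

```python
def max_pool2d(fmap, k=2, stride=2):
    """Report the maximum within each k x k neighborhood (no padding)."""
    out_h = (len(fmap) - k) // stride + 1
    out_w = (len(fmap[0]) - k) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(fmap)
print(pooled)
```

Each output value records that a strong response occurred somewhere in its 2x2 window, but not where, which is the loss of spatial location mentioned above.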
Dilated Convolution
CNNs for Classification: Feature Learning
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities
- CONV and POOL layers output high-level features of input
- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
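Expressing the output as class probabilities is typically done with a softmax over the final layer's scores. A minimal sketch, assuming three hypothetical class scores; subtracting the maximum is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs)
```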
Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
    ])
    return model
4a. Real-world feature invariance is
hard
How can computers recognize objects?
How can computers recognize objects?
Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Detect features to classify
Li/Johnson/Yeung CS231n
Feature invariance to perturbation is hard
Next-generation models
explode # of parameters
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
• ImageNet Classification with Deep Convolutional
Neural Networks - Alex Krizhevsky, Ilya Sutskever,
Geoffrey E. Hinton; 2012
• Facilitated by GPUs, highly optimized convolution
implementation and large datasets (ImageNet)
• One of the largest CNNs to date
• Has 60 million parameters, compared to 60k
parameters in LeNet-5
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
• The annual “Olympics” of computer vision.
• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.
• 2012 marked the first year where a CNN was used to
achieve a top 5 test error rate of 15.3%.
• The next best entry achieved an error of 26.2%.
AlexNet
[Krizhevsky et al., 2012]
Architecture
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
• Input: 227x227x3 images (224x224 before
padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
(N-F)/s+1 = (227-11)/4+1 = 55 ->
[55x55x96]
• Number of parameters in this layer?
(11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
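The CONV1 numbers above can be checked directly. A short sketch using only the values on the slide (227x227x3 input, 96 filters of 11x11, stride 4); the bias count is an addition not shown on the slide.

```python
def out_size(n, f, stride):
    """(N - F) / s + 1 for a convolution without padding."""
    return (n - f) // stride + 1


conv1_size = out_size(227, 11, 4)        # spatial size of the CONV1 output
conv1_weights = (11 * 11 * 3) * 96       # weights only, as on the slide
conv1_with_bias = conv1_weights + 96     # one extra bias per filter

print(conv1_size, conv1_weights, conv1_with_bias)
```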
AlexNet
[Krizhevsky et al., 2012]
• Input: 227x227x3 images (224x224 before
padding)
• After CONV1: 55x55x96
• Second layer (POOL1): 3x3 filters applied at stride 2
• Output volume size?
(N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96]
• Number of parameters in this layer?
0!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
227×227×3
→ conv 11×11, s = 4, P = 0 → 55×55×96
→ max pool 3×3, s = 2 → 27×27×96
→ conv 5×5, s = 1, P = 2 → 27×27×256
→ max pool 3×3, s = 2 → 13×13×256
→ conv 3×3, s = 1, P = 1 → 13×13×384
→ conv 3×3, s = 1, P = 1 → 13×13×384
→ conv 3×3, s = 1, P = 1 → 13×13×256
→ max pool 3×3, s = 2 → 6×6×256 → . . .
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
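The volume sizes in this diagram follow from the output-size formula with padding. A sketch that traces them, assuming the filter sizes, strides and paddings listed in the diagram; the layer names mirror the architecture list and the tuple format is made up for illustration.

```python
def out_size(n, f, stride, pad=0):
    """Output size = (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, padding, output channels or None for pooling)
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 227, 3
shapes = []
for name, f, s, p, out_ch in layers:
    size = out_size(size, f, s, p)
    channels = out_ch if out_ch is not None else channels
    shapes.append((name, size, size, channels))

flattened = size * size * channels   # values passed to the first FC layer
for shape in shapes:
    print(shape)
print(flattened)
```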
AlexNet
. . . → FC 4096 → FC 4096 → Softmax, 1000
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
AlexNet
[Krizhevsky et al., 2012]
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
[Krizhevsky et al., 2012]
• Trained on GTX 580 GPU with only 3 GB of memory.
• Network spread across 2 GPUs, half the neurons (feature
maps) on each GPU.
• CONV1, CONV2, CONV4, CONV5:
Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8:
Connections with all feature maps in preceding layer,
communication across GPUs.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
AlexNet was the coming out party for CNNs in the computer
vision community. This was the first time a model performed
so well on a historically difficult ImageNet dataset. This
paper illustrated the benefits of CNNs and backed them up
with record breaking performance in the competition.
[Krizhevsky et al., 2012]
VGGNet
• Very Deep Convolutional Networks For Large Scale
Image Recognition - Karen Simonyan and Andrew
Zisserman; 2015
• The runner-up at the ILSVRC 2014 competition
• Significantly deeper than AlexNet
• 140 million parameters
[Simonyan and Zisserman, 2014]
VGGNet
• Smaller filters
Only 3x3 CONV filters, stride 1, pad 1
and 2x2 MAX POOL , stride 2
• Deeper network
AlexNet: 8 layers
VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax
VGGNet
[Simonyan and Zisserman, 2014]
• Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers has the same effective
receptive field as one 7x7 conv layer.
• What is the effective receptive field of three 3x3 conv (stride
1) layers?
7x7
But deeper, more non-linearities
And fewer parameters: 3 * (3^2 C^2) vs. 7^2 C^2 for C channels per layer
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
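Both claims, the 7x7 effective receptive field and the parameter saving, can be checked with a few lines. A sketch assuming stride 1 throughout and C = 64 channels (the channel count is an arbitrary choice for illustration); biases are ignored, as on the slide.

```python
def stacked_receptive_field(num_layers, f=3):
    """Receptive field of num_layers stacked f x f convs with stride 1."""
    return 1 + num_layers * (f - 1)


def conv_weights(f, c):
    """Weights of one f x f conv layer with C input and C output channels."""
    return f * f * c * c


C = 64  # hypothetical channel count
receptive_field = stacked_receptive_field(3)
three_3x3 = 3 * conv_weights(3, C)   # 3 * (3^2 * C^2)
one_7x7 = conv_weights(7, C)         # 7^2 * C^2

print(receptive_field, three_3x3, one_7x7)
```

The stack covers the same 7x7 region with 27 C^2 weights instead of 49 C^2, and adds two extra non-linearities.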
VGGNet
[Simonyan and Zisserman, 2014]
VGG16:
TOTAL memory: 24M * 4 bytes ~= 96MB / image
TOTAL params: 138M parameters
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input
3x3 conv, 64
3x3 conv, 64
Pool
3x3 conv, 128
3x3 conv, 128
Pool
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
FC 4096
FC 4096
FC 1000
Softmax
[Simonyan and Zisserman, 2014]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input memory: 224*224*3=150K params: 0
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
Pool memory: 112*112*64=800K params: 0
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
Pool memory: 56*56*128=400K params: 0
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
Pool memory: 28*28*256=200K params: 0
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
Pool memory: 14*14*512=100K params: 0
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
Pool memory: 7*7*512=25K params: 0
FC 4096 memory: 4096 params: 7*7*512*4096 = 102,760,448
FC 4096 memory: 4096 params: 4096*4096 = 16,777,216
FC 1000 memory: 1000 params: 4096*1000 = 4,096,000
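Summing the per-layer parameter counts in the table reproduces the 138M total. A sketch that recomputes them from the channel sizes, ignoring biases as the table does.

```python
# (input channels, output channels) for the 13 conv layers of VGG16
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

conv_total = sum(3 * 3 * c_in * c_out for c_in, c_out in conv_channels)
fc_total = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
total = conv_total + fc_total
fc_share = fc_total / total

print(conv_total, fc_total, total, round(fc_share, 3))
```

Nearly 90% of the parameters are in the three fully connected layers, with the first one alone accounting for over 100M.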
VGGNet
[Simonyan and Zisserman, 2014]
Details/Retrospectives :
• ILSVRC’14 2nd in classification, 1st in localization
• Similar training procedure as AlexNet
• No Local Response Normalisation (LRN)
• Use VGG16 or VGG19 (VGG19 only slightly better, more
memory)
• Use ensembles for best results
• FC7 features generalize well to other tasks
• Trained on 4 Nvidia Titan Black GPUs for two to three weeks.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet
VGG Net reinforced the notion that convolutional neural
networks have to have a deep network of layers in order for
this hierarchical representation of visual data to work.
Keep it deep.
Keep it simple.
[Simonyan and Zisserman, 2014]
GoogleNet
• Going Deeper with Convolutions - Christian Szegedy et
al.; 2015
• ILSVRC 2014 competition winner
• Also significantly deeper than AlexNet
• 12x fewer parameters than AlexNet
• Focused on computational efficiency
[Szegedy et al., 2014]
GoogleNet
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)
[Szegedy et al., 2014]
GoogleNet
“Inception module”: design a good local network topology (network within
a network) and then stack these modules on top of each other
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
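Inception module (figure): the previous layer feeds four parallel branches
• 1x1 convolution
• 1x1 convolution → 3x3 convolution
• 1x1 convolution → 5x5 convolution
• 3x3 max pooling → 1x1 convolution
The branch outputs are joined by filter concatenation.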
GoogleNet
Details/Retrospectives :
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x less params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogleNet
Introduced the idea that CNN layers don’t always have to be
stacked up sequentially. With the Inception module, the
authors showed that a creative structuring of layers can
lead to improved performance and computational efficiency.
[Szegedy et al., 2014]
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and
exploding gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
[He et al., 2015]
ResNet
• ILSVRC’15 classification winner (3.57% top 5
error; humans generally hover around a 5-10% error rate)
Swept all classification and detection
competitions in ILSVRC’15 and COCO’15!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
• 56-layer model performs worse on both training and test error
-> The deeper model performs worse (not caused by overfitting)!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.
• We will use skip connections allowing us to take the activation
from one layer and feed it into another layer, much deeper
into the network.
• Use layers to fit residual F(x) = H(x) – x
instead of H(x) directly
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.
[He et al., 2015]
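The residual idea H(x) = F(x) + x can be sketched with small dense layers standing in for the conv layers. This is an illustrative simplification, not the paper's implementation: the helper names are made up, and the ReLU that the paper applies after the addition is left out so the code matches the formula above.

```python
def relu_vec(v):
    return [max(0.0, x) for x in v]


def linear(W, b, v):
    """Dense layer: one weighted sum plus bias per row of W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def residual_block(x, W1, b1, W2, b2):
    """H(x) = F(x) + x, with F = linear -> ReLU -> linear."""
    f_x = linear(W2, b2, relu_vec(linear(W1, b1, x)))
    return [f + xi for f, xi in zip(f_x, x)]   # skip connection adds the input


x = [1.0, -2.0]
zeros_W = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all weights at zero, F(x) = 0 and the block is the identity mapping.
identity_out = residual_block(x, zeros_W, zeros_b, zeros_W, zeros_b)
print(identity_out)
```

Because the identity is the default behaviour, extra blocks only need to learn a correction F(x), which is one intuition for why very deep residual networks remain trainable.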
ResNet
Shortcut / skip connection
a^{[l]} → Linear → ReLU → a^{[l+1]} → Linear → ReLU → a^{[l+2]}
$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$
$a^{[l+1]} = g(z^{[l+1]})$
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$
Without the shortcut: $a^{[l+2]} = g(z^{[l+2]})$
With the shortcut: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$
[He et al., 2015]
ResNet
Full ResNet architecture:
• Stack residual blocks
• Every residual block has two 3x3 conv layers
• Periodically, double # of filters and
downsample spatially using stride 2 (in each
dimension)
• Additional conv layer at the beginning
• No FC layers at the end (only FC 1000 to
output classes)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
ResNet is the best CNN architecture that we currently have
and a great innovation for the idea of residual learning.
Even better than human performance!
[He et al., 2015]
Accuracy comparison
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power
consumption
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Countless applications
An Architecture for Many Applications
Detection
Semantic segmentation
End-to-end robotic control
Semantic Segmentation: Fully Convolutional Networks
FCN: Fully Convolutional Network.
Network designed with all convolutional layers, with downsampling and
upsampling operations
tf.keras.layers.Conv2DTranspose
Long+ CVPR 2015
Facial Detection & Recognition
Self-Driving Cars
Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception
Raw Perception I (e.g., camera)
Coarse Maps M (e.g., GPS)
Possible Control Commands
Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation
Entire model trained end-to-end without any human labelling or annotations
Amini+ ICRA 2019
Automatic Colorization of Black and White Images
Optimizing Images
Post-processing feature optimization: illumination
Post-processing feature optimization: color curves and details
Post-processing feature optimization: color tone (warmness)
Up-scaling low-resolution images
Medicine, Biology, Healthcare
Gulshan+ JAMA 2016.
Breast Cancer Screening
Breast cancer case missed by radiologist but detected by AI
[Figure: two panels, each labeled AI and MD readers]
CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms
Semantic Segmentation: Biomedical Image Analysis
Brain tumors: Dong+ MIUA 2017
Malaria infection: Soleimany+ arXiv 2019
[Figure columns: Original, Ground Truth, Segmentation, Uncertainty]
DeepBind
[Alipanahi et al., 2015]
Predicting disease mutations
[Alipanahi et al., 2015]
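The genomics use of CNNs can be pictured as 1-D convolution over a one-hot encoded DNA sequence, where each filter behaves like a motif detector. The sketch below illustrates only that general idea and is not DeepBind's actual implementation: the filter is set by hand (a trained model would learn it), and the names one_hot, conv1d_relu, and motif_filter are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv1d_relu(x, filt):
    """Slide a (width x 4) filter along the sequence, then apply ReLU."""
    width = len(filt)
    scores = []
    for start in range(len(x) - width + 1):
        s = sum(x[start + k][c] * filt[k][c]
                for k in range(width) for c in range(4))
        scores.append(max(0.0, s))
    return scores


# Hypothetical hand-set filter that rewards the motif "TGA"
# (columns are A, C, G, T; one row per motif position)
motif_filter = [
    [-1.0, -1.0, -1.0, 1.0],   # position 0 prefers T
    [-1.0, -1.0, 1.0, -1.0],   # position 1 prefers G
    [1.0, -1.0, -1.0, -1.0],   # position 2 prefers A
]

scores = conv1d_relu(one_hot("ACTGAC"), motif_filter)
best = max(scores)                # global max-pooling over positions
best_pos = scores.index(best)     # where the motif matched best
print(scores, best, best_pos)
```

Max-pooling over positions makes the score independent of where the motif occurs in the sequence, which is the same parameter-sharing argument used for images.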
Deep Learning for Computer Vision: Summary
Foundations
• Why computer vision?
• Representing images
• Convolutions for feature extraction
CNNs
• CNN architecture
• Application to classification
• ImageNet
Applications
• Segmentation, image captioning, control
• Security, medicine, robotics

CNN Algorithm

  • 1. 6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 3: Convolutional Neural Networks Prof. Manolis Kellis http://mit6874.github.io Slides credit: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany
  • 2. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 3. 1a. What do you see, and how? Can we teach machines to see?
  • 4. What do you see?
  • 5. How do you see? How can we help computers see?
  • 6. What computers ‘see’: Images as Numbers What the computer "sees" Levin Image Processing & Computer Vision An image is just a matrix of numbers [0,255].i.e.,1080x1080x3 for an RGB image. Question: is this Lincoln?Washington? Jefferson? Obama? How can the computer answer this question? What you see Input Image Input Image + values Pixel intensity values (“pix-el”=picture-element) What you both see Can I just do classification on the 1,166400-long image vector directly? No. Instead: exploit image spatial structure. Learn patches. Build them up
  • 7. 1b. Classical machine vision roots in study of human/animal brains
  • 8. Inspiration: human/animal visual cortex • Layers of neurons: pixels, edges, shapes, primitives, scenes • E.g. Layer 4 responds to bands w/ given slant, contrasting edges
  • 9. Primitives: Neurons & action potentials •Chemical accumulation across dendritic connections •Pre-synaptic axon  post-synaptic dendrite  neuronal cell body •Each neuron receives multiple signals from its many dendrites •When threshold crossed, it fires •Its axon then sends outgoing signal to downstream neurons •Weak stimuli ignored •Sufficiently strong cross activation threshold •Non-linearity within each neuronal level •Neurons connected into circuits (neural networks): emergent properties, learning, memory •Simple primitives arranged in simple, repetitive, and extremely large networks •86 billion neurons, each connects to 10k neurons, 1 quadrillion (1012) connections
  • 10. Abstraction layers: edges, bars, dir., shapes, objects, scenes LGN: Small dots V1: Orientation, disparity, some color V4: Color, basic shapes, 2D/3D, curvature VTC: Complex features and objects(VTC: ventral temporal cortex •Abstraction layers  visual cortex layers •Complex concepts from simple parts, hierarchy •Primitives of visual concepts encoded in neuronal connection in early cortical layers
  • 11. • Massive recent expanse of human brain has re-used a relatively simple but general learning architecture General “learning machine”, reused widely • Hearing, taste, smell, sight, touch all re- use similar learning architecture Motor Cortex Visual Cortex • Interchangeable circuitry • Auditory cortex learns to ‘see’ if sent visual signals • Injury area tasks shift to uninjured areas • Not fully-general learning, but well-adapted to our world • Humans co-opted this circuitry to many new applications • Modern tasks accessible to any homo sapiens (<70k years) • ML primitives not too different from animals: more to come? human chimp Hardware expansion
  • 13. 2a. Spatial structure for image recognition
  • 14. Using Spatial Structure Input: 2D image, an array of pixel values. Idea: connect patches of input to neurons in hidden layer. Each neuron is connected to a region of the input and only “sees” these values.
  • 15. Using Spatial Structure Connect patch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
  • 16. Feature Extraction with Convolution - Filter of size 4x4: 16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This “patchy” operation is convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 17. Fully Connected Neural Network Fully Connected: • Each neuron in hidden layer connected to all neurons in input layer • No spatial information • Many, many parameters Input: • 2D image • Vector of pixel values Key idea: Use spatial structure in input to inform architecture of the network
  • 18. High Level Feature Detection Let’s identify key features in each image category: wheels, license plate, headlights; door, windows, steps; nose, eyes, mouth
  • 21. Convolution operation is element wise multiply and add Filter / Kernel
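The multiply-and-add operation can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: the function name `conv2d`, the toy image, and the horizontal-difference kernel are all assumptions chosen for the example, and (as in most deep learning libraries) the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both 2D lists) with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the patch by the kernel, then add up.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy image (assumed): dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical kernel: right pixel minus left pixel, so it fires on a vertical edge.
kernel = [[-1, 1]]
edges = conv2d(image, kernel)
```

Each output row is `[0, 1, 0]`: the filter responds only at the column where the intensity changes. Passing `stride=2` with a 4x4 kernel reproduces the "shift by 2 pixels" setup from the earlier slide.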
  • 22. Producing Feature Maps Original Sharpen Edge Detect “Strong” Edge Detect
  • 23. A simple pattern: Edges How can we detect edges with a kernel? [Figure: input image, filter (entries shown: -1, -1), output] (Goodfellow 2016)
  • 24. Simple Kernels / Filters
  • 25. X or X? Image is represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed. (Rohrer, How do CNNs work?)
  • 26. There are three approaches to edge cases in convolution
  • 27. (Goodfellow 2016) Zero Padding Controls Output Size • Full convolution: zero pad input so output is produced whenever an output value contains at least one input value (expands output) • Valid-only convolution: output only when entire kernel contained in input (shrinks output) • Same convolution: zero pad input so output is same size as input dimensions x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME') • TF convolution operator takes stride and zero fill option as parameters • Stride is distance between kernel applications in each dimension • Padding can be SAME or VALID
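The effect of the three padding modes on output size can be checked with a small helper. This is a sketch under stated assumptions: `conv_output_size` is a hypothetical name, it handles one spatial dimension, and the "same" rule follows TensorFlow's convention of ceil(n / stride).

```python
import math


def conv_output_size(n, f, stride=1, padding="valid"):
    """Output length along one dimension for input size n and kernel size f."""
    if padding == "valid":
        # Kernel must fit entirely inside the input: output shrinks.
        return (n - f) // stride + 1
    if padding == "same":
        # Zero pad so that, at stride 1, output size equals input size.
        return math.ceil(n / stride)
    if padding == "full":
        # Zero pad f-1 on each side: every overlap of kernel and input counts.
        return (n + f - 2) // stride + 1
    raise ValueError(f"unknown padding mode: {padding}")


sizes = {mode: conv_output_size(5, 3, padding=mode) for mode in ("valid", "same", "full")}
```

For a 5-pixel input and a 3-pixel kernel at stride 1, this gives 3 (valid), 5 (same), and 7 (full), matching the "shrinks", "same size", and "expands" descriptions.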
  • 30. 3a. Learning Visual Features de novo
  • 31. Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them) Low level features Mid level features High level features Lee+ ICML 2009 Eyes,ears,nose Edges,dark spots Facial structure
  • 32. Key idea: re-use parameters Convolution shares parameters Example 3x3 convolution on a 5x5 image
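To make the saving from parameter sharing concrete, the sketch below counts weights for the 3x3-on-5x5 example and compares them with a fully connected layer producing the same number of outputs. Biases are ignored and the variable names are illustrative assumptions.

```python
image_size = 5    # 5x5 input image
filter_size = 3   # 3x3 convolution filter

# Valid convolution at stride 1 produces a 3x3 feature map.
out_size = image_size - filter_size + 1

# Convolution: the same 9 weights are re-used at every output position.
shared_weights = filter_size * filter_size

# Fully connected: every one of the 9 outputs has its own weight per input pixel.
unshared_weights = (image_size * image_size) * (out_size * out_size)
```

Here the convolution needs 9 weights where the fully connected layer needs 225, and the gap grows quickly with image size.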
  • 33. Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 34. LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 35. LeNet-5 architecture: 32×32×1 input → conv 5×5, s = 1 → 28×28×6 → avg pool f = 2, s = 2 → 14×14×6 → conv 5×5, s = 1 → 10×10×16 → avg pool f = 2, s = 2 → 5×5×16 → FC 120 → FC 84 → ŷ (10 outputs). Reminder: Output size = (N+2P-F)/stride + 1 [LeCun et al., 1998] This slide is taken from Andrew Ng
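The LeNet-5 feature-map sizes follow directly from the output-size formula on this slide. The trace below is a minimal sketch; the helper `out_size` and the layer list are illustrative assumptions, with filter sizes, strides, and channel counts taken from the slide.

```python
def out_size(n, f, stride=1, pad=0):
    """Output size = (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, output channels), as listed on the slide.
layers = [
    ("conv 5x5", 5, 1, 6),
    ("avg pool", 2, 2, 6),
    ("conv 5x5", 5, 1, 16),
    ("avg pool", 2, 2, 16),
]

size = 32
shapes = [(32, 32, 1)]  # input image
for name, f, stride, channels in layers:
    size = out_size(size, f, stride)
    shapes.append((size, size, channels))
```

The resulting shapes are 28×28×6, 14×14×6, 10×10×16, and 5×5×16: height and width shrink while the channel count grows.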
  • 36. LeNet-5 • Only 60K parameters • As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑ • General structure: conv->pool->conv->pool->FC->FC->output • Different filters look at different channels • Sigmoid and Tanh nonlinearity [LeCun et al., 1998]
  • 37. Backpropagation of convolution Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
  • 40. Representation Learning in Deep CNNs Mid level features Low level features High level features Edges,dark spots Conv Layer 1 Lee+ ICML 2009 Eyes,ears,nose Conv Layer 2 Facial structure Conv Layer 3
  • 41. CNNs for Classification 1. Convolution: Apply filters to generate feature maps (tf.keras.layers.Conv2D). 2. Non-linearity: Often ReLU (tf.keras.activations.*). 3. Pooling: Downsampling operation on each feature map (tf.keras.layers.MaxPool2D). Train model with image data. Learn weights of filters in convolutional layers.
  • 42. Example – Six convolutional layers
  • 43. Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias tf.keras.layers.Conv2D
  • 44. Convolutional Layers: Local Connectivity For a neuron in hidden layer: • Take inputs from patch • Compute weighted sum • Apply bias 4x4 filter: matrix of weights w_ij for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function tf.keras.layers.Conv2D
  • 45. CNNs: Spatial Arrangement of Output Volume (depth, width, height) Layer Dimensions: h × w × d, where h and w are spatial dimensions and d (depth) = number of filters. Stride: Filter step size. Receptive Field: Locations in input image that a node is path connected to. tf.keras.layers.Conv2D(filters=d, kernel_size=(h,w), strides=s)
  • 46. Introducing Non-Linearity Rectified Linear Unit (ReLU) - Apply after every convolution operation (i.e., after convolutional layers) - ReLU: pixel-by-pixel operation that replaces all negative values by zero. - Non-linear operation tf.keras.layers.ReLU (Karn, Intuitive CNNs)
  • 47. Pooling Max pooling, average pooling 1) Reduced dimensionality 2) Spatial invariance tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)
  • 48. The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') x = tf.nn.bias_add(x, b) x = tf.nn.relu(x) f(x) = max(0, x) When will we backpropagate through this? Once it “dies” what happens to it?
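One way to reason about the two questions on this slide is to write out ReLU together with its derivative. This is a minimal scalar sketch; the function names are assumptions, and the derivative at exactly 0 is set to 0 by convention.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(z) for z in pre_activations]
gradients = [relu_grad(z) for z in pre_activations]
```

Gradients pass through unchanged only where the pre-activation is positive. A unit whose pre-activation is negative for every training input receives zero gradient, so its weights stop updating; this is what is usually meant by a "dead" ReLU.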
  • 49. Pooling reduces dimensionality by giving up spatial location • max pooling reports the maximum output within a defined neighborhood • Padding can be SAME or VALID x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME') Input tensor layout: [batch, height, width, channels]
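Max pooling over a single feature map can be sketched without any library. This example assumes VALID-style behavior (windows that would run past the border are dropped) and uses a hypothetical 4×4 feature map; `max_pool2d` is an illustrative name.

```python
def max_pool2d(feature_map, k=2, stride=2):
    """Report the maximum within each k x k neighborhood of a 2D list."""
    out_h = (len(feature_map) - k) // stride + 1
    out_w = (len(feature_map[0]) - k) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(k)
                for dj in range(k)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(feature_map)
```

The 4×4 map becomes a 2×2 map, `[[4, 2], [2, 8]]`: each output keeps the strongest response in its neighborhood but no longer records where inside the window it occurred.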
  • 51. CNNs for Classification: Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
  • 52. CNNs for Classification: Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class
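The final step, expressing the output as class probabilities, is typically done with a softmax over the fully connected layer's outputs. The sketch below assumes three hypothetical class scores; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last dense layer
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The class with the largest score gets the largest probability, and the probabilities sum to 1, so the output can be read as the model's confidence in each class.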
  • 53. Putting it all together
      import tensorflow as tf

      def generate_model():
          model = tf.keras.Sequential([
              # first convolutional layer (Keras names this argument kernel_size)
              tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
              tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
              # second convolutional layer
              tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
              tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
              # fully connected classifier
              tf.keras.layers.Flatten(),
              tf.keras.layers.Dense(1024, activation='relu'),
              tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
          ])
          return model
  • 55. 4a. Real-world feature invariance is hard
  • 56. How can computers recognize objects?
  • 57. How can computers recognize objects? Challenge: • Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. • How can we overcome this challenge? Answer: • Learn a ton of features (millions) from the bottom up • Learn the convolutional filters, rather than pre-computing them
  • 61. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 62. AlexNet • ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 • Facilitated by GPUs, highly optimized convolution implementation and large datasets (ImageNet) • One of the largest CNNs at the time • Has 60 million parameters, compared to 60k parameters in LeNet-5 [Krizhevsky et al., 2012]
  • 63. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners • The annual “Olympics” of computer vision. • Teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. • 2012 was the first year a CNN won: AlexNet achieved a top-5 test error rate of 15.3%. • The next best entry achieved an error of 26.2%.
  • 65. AlexNet [Krizhevsky et al., 2012] Architecture: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8 • Input: 227x227x3 images (224x224 before padding) • First layer: 96 11x11 filters applied at stride 4 • Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96] • Number of parameters in this layer? (11*11*3)*96 = 35K Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
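The CONV1 arithmetic, and the pooling step that follows it, can be reproduced with a short calculation. This is a sketch using the numbers on the slides; the helper name `out_size` is an assumption, and the bias count is added only to show why some references quote a slightly larger figure.

```python
def out_size(n, f, stride):
    """(N - F) / stride + 1, with no padding."""
    return (n - f) // stride + 1


# CONV1: 96 filters of size 11x11x3 at stride 4 on a 227x227x3 input.
conv1_size = out_size(227, 11, 4)
conv1_weights = (11 * 11 * 3) * 96
conv1_with_biases = conv1_weights + 96  # one bias per filter

# POOL1: 3x3 max pooling at stride 2 has no learnable parameters.
pool1_size = out_size(conv1_size, 3, 2)
pool1_params = 0
```

This gives a 55×55×96 volume after CONV1 with 34,848 weights (about 35K), and a 27×27×96 volume after POOL1.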
  • 67. AlexNet [Krizhevsky et al., 2012] Architecture: CONV1, MAX POOL1, NORM1, CONV2, MAX POOL2, NORM2, CONV3, CONV4, CONV5, MAX POOL3, FC6, FC7, FC8 • Input: 227x227x3 images (224x224 before padding) • After CONV1: 55x55x96 • Second layer (MAX POOL1): 3x3 filters applied at stride 2 • Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96] • Number of parameters in this layer? 0! (pooling has no learnable weights) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 68. AlexNet: 227×227×3 input → conv 11×11, s = 4, P = 0 → 55×55×96 → max pool 3×3, s = 2 → 27×27×96 → conv 5×5, s = 1, P = 2 → 27×27×256 → max pool 3×3, s = 2 → 13×13×256 → conv 3×3, s = 1, P = 1 → 13×13×384 → conv 3×3, s = 1, P = 1 → 13×13×384 → conv 3×3, s = 1, P = 1 → 13×13×256 → max pool 3×3, s = 2 → 6×6×256 → ... [Krizhevsky et al., 2012] This slide is taken from Andrew Ng
  • 69. AlexNet (continued): ... → FC 4096 → FC 4096 → Softmax 1000 [Krizhevsky et al., 2012] This slide is taken from Andrew Ng
  • 70. AlexNet [Krizhevsky et al., 2012] Details/Retrospectives: • first use of ReLU • used Norm layers (not common anymore) • heavy data augmentation • dropout 0.5 • batch size 128 • 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 71. AlexNet [Krizhevsky et al., 2012] • Trained on GTX 580 GPU with only 3 GB of memory. • Network spread across 2 GPUs, half the neurons (feature maps) on each GPU. • CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. • CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 72. AlexNet AlexNet was the coming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition. [Krizhevsky et al., 2012]
  • 75. VGGNet • Very Deep Convolutional Networks For Large Scale Image Recognition - Karen Simonyan and Andrew Zisserman; 2015 • The runner-up at the ILSVRC 2014 competition • Significantly deeper than AlexNet • 140 million parameters [Simonyan and Zisserman, 2014]
  • 76. VGGNet • Smaller filters Only 3x3 CONV filters, stride 1, pad 1 and 2x2 MAX POOL , stride 2 • Deeper network AlexNet: 8 layers VGGNet: 16 - 19 layers • ZFNet: 11.7% top 5 error in ILSVRC’13 • VGGNet: 7.3% top 5 error in ILSVRC’14 Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014] Input 3x3 conv, 64 3x3 conv, 64 Pool 1/2 3x3 conv, 128 3x3 conv, 128 Pool 1/2 3x3 conv, 256 3x3 conv, 256 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 FC 4096 FC 4096 FC 1000 Softmax
  • 77. VGGNet [Simonyan and Zisserman, 2014] • Why use smaller filters? (3x3 conv) Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer. • What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7 • But deeper, more non-linearities • And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
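Both claims on this slide, the 7×7 effective receptive field and the parameter saving, can be verified numerically. The sketch below assumes stride-1 layers with C input and C output channels and ignores biases; the function names and the choice C = 64 are illustrative.

```python
def stacked_receptive_field(num_layers, k):
    """Receptive field of num_layers stacked k x k convolutions at stride 1."""
    return 1 + num_layers * (k - 1)


def conv_weights(k, channels, num_layers=1):
    """Weights in num_layers k x k conv layers with `channels` in and out."""
    return num_layers * (k * k * channels * channels)


C = 64  # hypothetical channel count
rf = stacked_receptive_field(3, 3)
three_3x3 = conv_weights(3, C, num_layers=3)   # 3 * (3^2 * C^2) = 27 C^2
one_7x7 = conv_weights(7, C)                   # 7^2 * C^2 = 49 C^2
```

Three stacked 3×3 layers see the same 7×7 region as a single 7×7 layer but use 27C² weights instead of 49C², while adding two extra non-linearities.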
  • 78. VGGNet [Simonyan and Zisserman, 2014] VGG16: TOTAL memory: 24M * 4 bytes ~= 96MB / image TOTAL params: 138M parameters Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Input 3x3 conv, 64 3x3 conv, 64 Pool 3x3 conv, 128 3x3 conv, 128 Pool 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool FC 4096 FC 4096 FC 1000 Softmax
  • 79. VGG16 per-layer memory and parameters [Simonyan and Zisserman, 2014] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
      Input: memory 224*224*3 = 150K, params 0
      3x3 conv, 64: memory 224*224*64 = 3.2M, params (3*3*3)*64 = 1,728
      3x3 conv, 64: memory 224*224*64 = 3.2M, params (3*3*64)*64 = 36,864
      Pool: memory 112*112*64 = 800K, params 0
      3x3 conv, 128: memory 112*112*128 = 1.6M, params (3*3*64)*128 = 73,728
      3x3 conv, 128: memory 112*112*128 = 1.6M, params (3*3*128)*128 = 147,456
      Pool: memory 56*56*128 = 400K, params 0
      3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*128)*256 = 294,912
      3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*256)*256 = 589,824
      3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*256)*256 = 589,824
      Pool: memory 28*28*256 = 200K, params 0
      3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*256)*512 = 1,179,648
      3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*512)*512 = 2,359,296
      Pool: memory 14*14*512 = 100K, params 0
      3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
      Pool: memory 7*7*512 = 25K, params 0
      FC 4096: memory 4096, params 7*7*512*4096 = 102,760,448
      FC 4096: memory 4096, params 4096*4096 = 16,777,216
      FC 1000: memory 1000, params 4096*1000 = 4,096,000
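The per-layer figures can be summed to check the 138M total quoted for VGG16. The sketch below counts weights only, matching the slide's arithmetic (biases are ignored); the list of conv output channels and the variable names are assumptions for illustration.

```python
# Output channels of the 13 conv layers in VGG16, in order.
conv_channels = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

conv_weights = 0
in_channels = 3  # RGB input
for out_channels in conv_channels:
    conv_weights += (3 * 3 * in_channels) * out_channels
    in_channels = out_channels

# Three fully connected layers on top of the final 7x7x512 volume.
fc_weights = (7 * 7 * 512) * 4096 + 4096 * 4096 + 4096 * 1000

total_weights = conv_weights + fc_weights
fc_fraction = fc_weights / total_weights
```

The total comes to roughly 138 million weights, and the fully connected layers account for most of them, with the first FC layer alone holding about 103 million.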
  • 80. VGGNet [Simonyan and Zisserman, 2014] Details/Retrospectives : • ILSVRC’14 2nd in classification, 1st in localization • Similar training procedure as AlexNet • No Local Response Normalisation (LRN) • Use VGG16 or VGG19 (VGG19 only slightly better, more memory) • Use ensembles for best results • FC7 features generalize well to other tasks • Trained on 4 Nvidia Titan Black GPUs for two to three weeks. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 81. VGGNet VGG Net reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work. Keep it deep. Keep it simple. [Simonyan and Zisserman, 2014]
  • 83. GoogleNet • Going Deeper with Convolutions - Christian Szegedy et al.; 2015 • ILSVRC 2014 competition winner • Also significantly deeper than AlexNet • 12x fewer parameters than AlexNet • Focused on computational efficiency [Szegedy et al., 2014]
  • 84. GoogleNet • 22 layers • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure • No FC layers • Only 5 million parameters! • ILSVRC’14 classification winner (6.7% top 5 error) [Szegedy et al., 2014]
  • 85. GoogleNet “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014] Filter concatenation Previous layer 1x1 convolution 3x3 convolution 5x5 convolution 1x1 convolution 1x1 convolution 1x1 convolution 3x3 max pooling
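One reason the 1x1 convolutions in the module help efficiency is that they shrink the channel dimension before the expensive 3x3 and 5x5 filters. The sketch below counts multiplications for a 5x5 branch with and without a 1x1 reduction. All sizes (a 28×28×192 input, 32 output channels, reduction to 16 channels) are hypothetical values chosen for illustration, not figures from the slides, and "same" padding is assumed so the spatial size is unchanged.

```python
def conv_multiplies(h, w, c_in, c_out, k):
    """Multiplications for a k x k conv producing an h x w x c_out output."""
    return h * w * c_out * (k * k * c_in)


H, W, C_IN = 28, 28, 192      # hypothetical input volume
C_OUT, C_REDUCED = 32, 16     # hypothetical branch widths

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_multiplies(H, W, C_IN, C_OUT, 5)

# 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduced = (
    conv_multiplies(H, W, C_IN, C_REDUCED, 1)
    + conv_multiplies(H, W, C_REDUCED, C_OUT, 5)
)
```

Under these assumptions the direct branch costs about 120 million multiplications and the reduced branch about 12 million, roughly a tenfold saving for the same output shape.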
  • 86. GoogleNet Details/Retrospectives: • Deeper networks, with computational efficiency • 22 layers • Efficient “Inception” module • No FC layers • 12x fewer params than AlexNet • ILSVRC’14 classification winner (6.7% top 5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
  • 87. GoogleNet Introduced the idea that CNN layers didn’t always have to be stacked up sequentially. With the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency. [Szegedy et al., 2014]
  • 89. ResNet • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 • Extremely deep network – 152 layers • Deeper neural networks are more difficult to train. • Deep networks suffer from vanishing and exploding gradients. • Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. [He et al., 2015]
  • 90. ResNet • ILSVRC’15 classification winner (3.57% top 5 error; humans generally hover around a 5-10% error rate) • Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 91. ResNet • What happens when we continue stacking deeper layers on a convolutional neural network? • 56-layer model performs worse on both training and test error -> The deeper model performs worse (not caused by overfitting)! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 92. ResNet • Hypothesis: The problem is an optimization problem. Very deep networks are harder to optimize. • Solution: Use network layers to fit residual mapping instead of directly trying to fit a desired underlying mapping. • We will use skip connections allowing us to take the activation from one layer and feed it into another layer, much deeper into the network. • Use layers to fit residual F(x) = H(x) – x instead of H(x) directly Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 93. ResNet Residual Block Input x goes through conv-relu-conv series and gives us F(x). That result is then added to the original input x. Let’s call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So, instead of just computing that transformation (straight from x to F(x)), we’re computing the term that we have to add, F(x), to the input, x. [He et al., 2015]
  • 94. ResNet: shortcut / skip connection. Main path: a[l] → Linear → ReLU → a[l+1] → Linear → ReLU → a[l+2]. z[l+1] = W[l+1] a[l] + b[l+1]; a[l+1] = g(z[l+1]); z[l+2] = W[l+2] a[l+1] + b[l+2]. Plain network: a[l+2] = g(z[l+2]). With the skip connection: a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l]) [He et al., 2015]
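The skip-connection equations can be written out for small vectors to see why a residual block can easily represent the identity. This is a sketch with dense (fully connected) layers standing in for the convolutions; the function names and the toy weights are assumptions, and g is taken to be ReLU as in the diagram.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def linear(W, a, b):
    """z = W a + b for a weight matrix W (list of rows), input a, bias b."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(linear(W1, a_l, b1))       # a[l+1] = g(z[l+1])
    z_l2 = linear(W2, a_l1, b2)            # z[l+2] = W[l+2] a[l+1] + b[l+2]
    # Skip connection: add the block input before the final non-linearity.
    return relu([z + x for z, x in zip(z_l2, a_l)])


a_l = [1.0, 2.0]
zeros_W = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
# With all weights at zero, F(x) = 0 and the block passes its input through.
identity_out = residual_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b)
```

With zero weights the output equals the (non-negative) input, so adding such a block cannot make the function the network computes worse; the layers only have to learn the residual F(x) on top of that.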
  • 95. ResNet Full ResNet architecture: • Stack residual blocks • Every residual block has two 3x3 conv layers • Periodically, double # of filters and downsample spatially using stride 2 (in each dimension) • Additional conv layer at the beginning • No FC layers at the end (only FC 1000 to output classes) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 96. ResNet • Total depths of 34, 50, 101, or 152 layers for ImageNet • For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 97. ResNet Experimental Results: • Able to train very deep networks without degrading • Deeper networks now achieve lower training errors as expected [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 98. ResNet The best-performing CNN architecture in this progression, and a major innovation through the idea of residual learning. Its top-5 error is even lower than the typical human error rate! [He et al., 2015]
  • 99. Accuracy comparison Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 100. Forward pass time and power consumption Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 104. An Architecture for Many Applications Detection Semantic segmentation End-to-end robotic control
  • 105. Semantic Segmentation: Fully Convolutional Networks FCN: Fully Convolutional Network. Network designed with all convolutional layers, with downsampling and upsampling operations tf.keras.layers.Conv2DTranspose Long+ CVPR 2015
  • 106. Facial Detection & Recognition
  • 108. Self-Driving Cars: Navigation from Visual Perception Raw Perception I (e.g., camera) Coarse Maps M (e.g., GPS) Possible Control Commands Amini+ ICRA 2019
  • 109. End-to-End Framework for Autonomous Navigation Entire model trained end-to-end without any human labelling or annotations Amini+ ICRA 2019
  • 110. Automatic Colorization of Black and White Images
  • 111. Optimizing Images Post Processing Feature Optimization (Illumination) Post Processing Feature Optimization (Color Curves and Details) Post Processing Feature Optimization (Color Tone: Warmness)
  • 114. Breast Cancer Screening CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms [Figure: breast cancer case missed by radiologist but detected by AI; comparison of AI and MD readers]
  • 115. Semantic Segmentation: Biomedical Image Analysis Brain Tumors: Dong+ MIUA 2017. Malaria Infection: Soleimany+ arXiv 2019. [Figure panels: Original, Ground Truth, Segmentation, Uncertainty]
  • 119. Deep Learning for Computer Vision: Summary Foundations • Why computer vision? • Representing images • Convolutions for feature extraction CNNs • CNN architecture • Application to classification • ImageNet Applications • Segmentation, image captioning, control • Security, medicine, robotics