6.874, 6.802, 20.390, 20.490, HST.506
Computational Systems Biology
Deep Learning in the Life Sciences
Lecture 3:
Convolutional Neural Networks
Prof. Manolis Kellis
http://mit6874.github.io
Slides credit: 6.S191, Dana Erlich, Param Vir Singh,
David Gifford, Alexander Amini, Ava Soleimany
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8 -> 19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients -> fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
1a. What do you see, and how?
Can we teach machines to see?
What do you see?
How do you see?
How can we help
computers see?
What computers ‘see’: Images as Numbers
What the computer "sees"
Levin, Image Processing & Computer Vision
An image is just a matrix of numbers in [0, 255], e.g., 1080x1080x3 for an RGB image.
Question: is this Lincoln? Washington? Jefferson? Obama?
How can the computer answer this question?
What you see
Input Image Input Image + values Pixel intensity values
(“pix-el”=picture-element)
What you both see
Can I just do classification on the 1,166,400-long image vector (1080x1080 pixels per channel) directly?
No. Instead: exploit image spatial structure. Learn patches. Build them up
1b. Classical machine vision roots
in study of human/animal brains
Inspiration: human/animal visual cortex
• Layers of neurons: pixels, edges, shapes, primitives, scenes
• E.g. Layer 4 responds to bands w/ given slant, contrasting edges
Primitives: Neurons & action potentials
• Chemical accumulation across dendritic connections
• Pre-synaptic axon -> post-synaptic dendrite -> neuronal cell body
• Each neuron receives multiple signals from its many dendrites
• When threshold crossed, it fires
• Its axon then sends outgoing signal to downstream neurons
• Weak stimuli ignored
• Sufficiently strong stimuli cross activation threshold
• Non-linearity within each neuronal level
• Neurons connected into circuits (neural networks): emergent properties, learning, memory
• Simple primitives arranged in simple, repetitive, and extremely large networks
• 86 billion neurons, each connects to ~10k neurons: ~1 quadrillion (10^15) connections
Abstraction layers: edges, bars, dir., shapes, objects, scenes
LGN: Small dots
V1: Orientation,
disparity, some color
V4: Color, basic shapes,
2D/3D, curvature
VTC: Complex features and objects (VTC: ventral temporal cortex)
• Abstraction layers == visual cortex layers
• Complex concepts from simple parts, hierarchy
• Primitives of visual concepts encoded in neuronal connections in early cortical layers
• Massive recent expansion of human brain has re-used a relatively simple but general learning architecture
General “learning machine”, reused widely
• Hearing, taste, smell, sight, touch all re-use similar learning architecture
[Figure: brain regions labeled Motor Cortex and Visual Cortex]
• Interchangeable circuitry
• Auditory cortex learns to ‘see’ if sent visual signals
• Injury area tasks shift to uninjured areas
• Not fully-general learning, but well-adapted to our world
• Humans co-opted this circuitry to many new applications
• Modern tasks accessible to any homo sapiens (<70k years)
• ML primitives not too different from animals: more to come?
[Figure: human vs. chimp brain, hardware expansion]
2a. Spatial structure
for image recognition
Using Spatial Structure
Idea: connect patches of input to neurons in hidden layer.
Neuron connected to region of input. Only “sees” these values.
Input: 2D image. Array of pixel values.
Using Spatial Structure
Connect patch in input layer to a single neuron in subsequent layer.
Use a sliding window to define connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
- Filter of size 4x4: 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch
This “patchy” operation is convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
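The sliding-window operation described above can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: the function name `conv2d`, the 8x8 toy image, and the all-ones filter are assumptions chosen only to mirror the "4x4 filter, shift by 2 pixels" example. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both 2D lists) with no padding.

    The same kernel weights are reused at every position, which is the
    parameter sharing described on the slide.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the patch by the kernel, then add.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 8x8 image whose pixel value is 8 * row + col.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
# 4x4 filter: 16 weights (all ones here, purely for illustration).
kernel = [[1] * 4 for _ in range(4)]
# Shift by 2 pixels between patches.
feature_map = conv2d(image, kernel, stride=2)
```

With an 8x8 input, a 4x4 filter, and stride 2, the feature map is 3x3: (8 - 4) / 2 + 1 = 3 positions along each axis.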
Fully Connected Neural Network
Fully Connected:
• Each neuron in hidden layer connected to all neurons in input layer
• No spatial information
• Many, many parameters
Input:
• 2D image
• Vector of pixel values
Key idea: Use spatial structure in input to inform architecture of the network
High Level Feature Detection
Let’s identify key features in each image category
Wheels, License Plate, Headlights
Door, Windows, Steps
Nose, Eyes, Mouth
Fully Connected Neural Network
2b. Convolutions and filters
Convolution operation is element-wise multiply and add
Filter / Kernel
Producing Feature Maps
[Figure panels: Original, Sharpen, Edge Detect, “Strong” Edge Detect]
A simple pattern: Edges
How can we detect edges with a kernel?
[Figure: Input, Filter (values shown: -1 -1), Output]
(Goodfellow 2016)
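One way to see how a kernel detects edges is a horizontal difference filter. The sketch below is an assumption for illustration: the kernel `[[-1, 1]]` and the toy image are not taken from the slide's figure, and `correlate2d` is a hypothetical helper name.

```python
def correlate2d(image, kernel):
    """Valid (no padding, stride 1) cross-correlation of two 2D lists."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


# Toy image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Each output is (right pixel - left pixel): nonzero only where intensity changes.
edge_kernel = [[-1, 1]]
edges = correlate2d(image, edge_kernel)
```

The output is zero in flat regions and nonzero only in the column where the dark-to-bright transition occurs, which is the vertical edge.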
Simple Kernels / Filters
X or X?
Image is represented as matrix of pixel values… and computers are literal!
We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.
Rohrer, How do CNNs work?
There are three approaches to edge cases in convolution
(Goodfellow 2016)
Zero Padding Controls Output Size
• Full convolution: zero pad input so output is produced whenever the kernel overlaps at least one input value (expands output)
• Valid-only convolution: output only when entire kernel contained in input (shrinks output)
• Same convolution: zero pad input so output is same size as input dimensions
x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')
• TF convolution operator takes stride and zero-fill option as parameters
• Stride is distance between kernel applications in each dimension
• Padding can be SAME or VALID
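The effect of the three padding schemes on output size can be checked with a short calculation. This is a sketch under stated assumptions: `conv_output_size` is a hypothetical helper, and the "FULL" branch is included only for comparison, since `tf.nn.conv2d` itself accepts SAME or VALID.

```python
import math


def conv_output_size(n, f, stride=1, padding="VALID"):
    """Output length along one dimension for input length n and kernel length f."""
    if padding == "VALID":
        # Kernel must fit entirely inside the input.
        return math.ceil((n - f + 1) / stride)
    if padding == "SAME":
        # Zero padding chosen so that stride 1 preserves the input size.
        return math.ceil(n / stride)
    if padding == "FULL":
        # Kernel may overlap the input by as little as one value.
        return math.ceil((n + f - 1) / stride)
    raise ValueError("padding must be VALID, SAME, or FULL")


valid = conv_output_size(5, 3, padding="VALID")  # shrinks
same = conv_output_size(5, 3, padding="SAME")    # unchanged
full = conv_output_size(5, 3, padding="FULL")    # expands
```

For a 5-wide input and a 3-wide kernel at stride 1, the three schemes give widths of 3, 5, and 7 respectively.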
3a. Learning Visual Features
de novo
Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them)
Low level features: Edges, dark spots
Mid level features: Eyes, ears, nose
High level features: Facial structure
Lee+ ICML 2009
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
LeNet-5
Layer-by-layer dimensions:
32×32×1 input
-> conv 5×5, s = 1 -> 28×28×6
-> avg pool f = 2, s = 2 -> 14×14×6
-> conv 5×5, s = 1 -> 10×10×16
-> avg pool f = 2, s = 2 -> 5×5×16
-> FC 120 -> FC 84 -> ŷ (10 outputs)
Reminder:
Output size = (N+2P-F)/stride + 1
[LeCun et al., 1998]
This slide is taken from Andrew Ng
LeNet-5
• Only 60K parameters
• As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑
• General structure:
conv->pool->conv->pool->FC->FC->output
• Different filters look at different channels
• Sigmoid and Tanh nonlinearity
[LeCun et al., 1998]
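The LeNet-5 dimensions and the "only 60K parameters" figure can be reproduced with the output-size formula. This is a rough sketch under assumptions: every filter is taken to see all input channels and each unit has one bias (the original 1998 network used sparser connectivity between some layers), and the helper names are hypothetical.

```python
def out_size(n, f, stride=1, pad=0):
    """Output size = (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter, assuming full channel connectivity."""
    return (f * f * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


sizes = [32]
sizes.append(out_size(sizes[-1], 5))            # conv 5x5, s=1
sizes.append(out_size(sizes[-1], 2, stride=2))  # avg pool f=2, s=2
sizes.append(out_size(sizes[-1], 5))            # conv 5x5, s=1
sizes.append(out_size(sizes[-1], 2, stride=2))  # avg pool f=2, s=2

total_params = (
    conv_params(5, 1, 6)
    + conv_params(5, 6, 16)
    + fc_params(5 * 5 * 16, 120)
    + fc_params(120, 84)
    + fc_params(84, 10)
)
```

Under these assumptions the spatial sizes run 32, 28, 14, 10, 5 and the total comes to about 62K parameters, consistent with the roughly 60K quoted on the slide. Most of them sit in the first fully connected layer.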
Backpropagation of convolution
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
3b. Convolutional Neural
Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs
Conv Layer 1: Low level features (Edges, dark spots)
Conv Layer 2: Mid level features (Eyes, ears, nose)
Conv Layer 3: High level features (Facial structure)
Lee+ ICML 2009
CNNs for Classification
1. Convolution: Apply filters to generate feature maps. (tf.keras.layers.Conv2D)
2. Non-linearity: Often ReLU. (tf.keras.activations.*)
3. Pooling: Downsampling operation on each feature map. (tf.keras.layers.MaxPool2D)
Train model with image data.
Learn weights of filters in convolutional layers.
Example – Six convolutional layers
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
- Take inputs from patch
- Compute weighted sum
- Apply bias
tf.keras.layers.Conv2D
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
• Take inputs from patch
• Compute weighted sum
• Apply bias
4x4 filter: matrix of weights w_ij for neuron (p,q) in hidden layer
1) applying a window of weights
2) computing linear combinations
3) activating with non-linear function
tf.keras.layers.Conv2D
CNNs: Spatial Arrangement of Output Volume
Layer Dimensions: h × w × d
where h and w are spatial dimensions and d (depth) = number of filters
Stride: Filter step size
Receptive Field: Locations in input image that a node is path-connected to
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
Introducing Non-Linearity
Rectified Linear Unit (ReLU)
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values by zero.
- Non-linear operation
tf.keras.layers.ReLU
Karn, Intuitive CNNs
Pooling
Max pooling, average pooling
1) Reduced dimensionality
2) Spatial invariance
tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)
The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
x = tf.nn.bias_add(x, b)
x = tf.nn.relu(x)
f(x) = max(0, x)
When will we backpropagate through this?
Once it “dies” what happens to it?
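The two questions above can be explored with a tiny sketch of ReLU and its derivative. The function names are hypothetical, and using 0 as the gradient at exactly x = 0 is a common convention assumed here.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 where the unit is active, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


pre_activations = [-2.0, -0.5, 0.5, 3.0]
outputs = [relu(z) for z in pre_activations]
# Gradient flowing back through each unit, given an upstream gradient of 1.0.
backprop = [1.0 * relu_grad(z) for z in pre_activations]
```

Gradient passes through only where the pre-activation is positive. A unit whose pre-activation is negative for every training input receives zero gradient, so its incoming weights stop updating: this is the "dead" ReLU the slide alludes to.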
Pooling reduces dimensionality by giving up spatial location
• max pooling reports the maximum output within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
ksize = pooling neighborhood; input and output layout: [batch, height, width, channels]
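Max pooling on a single feature map can be sketched in a few lines. This is an illustrative version, assuming VALID-style pooling (no padding) on one 2D channel; `max_pool2d` and the toy feature map are hypothetical.

```python
def max_pool2d(fmap, k=2, stride=2):
    """Report the maximum within each k x k neighborhood of a 2D list."""
    out_h = (len(fmap) - k) // stride + 1
    out_w = (len(fmap[0]) - k) // stride + 1
    return [
        [
            max(
                fmap[i * stride + di][j * stride + dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(feature_map, k=2, stride=2)
```

The 4x4 map shrinks to 2x2, and each output keeps only the strongest response in its neighborhood, discarding where inside the neighborhood it occurred.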
Dilated Convolution
CNNs for Classification: Feature Learning
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities
- CONV and POOL layers output high-level features of input
- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
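Expressing the output as class probabilities is typically done with a softmax over the final layer's scores. The sketch below is a generic softmax, with hypothetical scores for three classes. Subtracting the maximum is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores from the last fully connected layer for 3 classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The class with the largest score receives the largest probability, and the probabilities can be compared directly across classes.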
Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer: 32 filters of size 3x3
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer: 64 filters of size 3x3
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        # 10 outputs, one probability per class
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    return model
4a. Real-world feature invariance is
hard
How can computers recognize objects?
Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Detect features to classify
Li/Johnson/Yeung CS231n
Feature invariance to perturbation is hard
Next-generation models
explode # of parameters
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
• ImageNet Classification with Deep Convolutional
Neural Networks - Alex Krizhevsky, Ilya Sutskever,
Geoffrey E. Hinton; 2012
• Facilitated by GPUs, highly optimized convolution
implementation and large datasets (ImageNet)
• One of the largest CNNs at the time
• Has 60 million parameters, compared to 60k parameters for LeNet-5
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
• The annual “Olympics” of computer vision.
• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.
• 2012 marked the first year a CNN won, achieving a top-5 test error rate of 15.3%.
• The next best entry achieved an error of 26.2%.
AlexNet
[Krizhevsky et al., 2012]
Architecture
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
• Input: 227x227x3 images (224x224 before
padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
(N-F)/s+1 = (227-11)/4+1 = 55 ->
[55x55x96]
• Number of parameters in this layer?
(11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
[Krizhevsky et al., 2012]
• Input: 227x227x3 images (224x224 before
padding)
• After CONV1: 55x55x96
• Second layer (MAX POOL1): 3x3 filters applied at stride 2
• Output volume size?
(N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96]
• Number of parameters in this layer?
0! (pooling has no learnable parameters)
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
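The CONV1 and MAX POOL1 arithmetic above can be checked directly. This is a small sketch with hypothetical helper names. The weight count excludes biases to match the slide's (11*11*3)*96 figure.

```python
def out_size(n, f, stride):
    """(N - F) / s + 1, with no padding."""
    return (n - f) // stride + 1


# CONV1: 96 filters of 11x11x3 applied at stride 4 to a 227x227x3 input.
conv1_size = out_size(227, 11, 4)
conv1_weights = (11 * 11 * 3) * 96   # weights only
conv1_with_bias = conv1_weights + 96  # one bias per filter

# MAX POOL1: 3x3 window at stride 2; depth is unchanged and nothing is learned.
pool1_size = out_size(conv1_size, 3, 2)
pool1_params = 0
```

CONV1 therefore produces a 55x55x96 volume from roughly 35K weights, and MAX POOL1 reduces it to 27x27x96 without adding any parameters.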
AlexNet
Layer-by-layer dimensions:
227×227×3 input
-> conv 11×11, s = 4, P = 0 -> 55×55×96
-> max pool 3×3, s = 2 -> 27×27×96
-> conv 5×5, s = 1, P = 2 -> 27×27×256
-> max pool 3×3, s = 2 -> 13×13×256
-> conv 3×3, s = 1, P = 1 -> 13×13×384
-> conv 3×3, s = 1, P = 1 -> 13×13×384
-> conv 3×3, s = 1, P = 1 -> 13×13×256
-> max pool 3×3, s = 2 -> 6×6×256
-> . . .
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
AlexNet
. . . -> FC 4096 -> FC 4096 -> Softmax 1000
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
AlexNet
[Krizhevsky et al., 2012]
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
[Krizhevsky et al., 2012]
• Trained on GTX 580 GPU with only 3 GB of memory.
• Network spread across 2 GPUs, half the neurons (feature
maps) on each GPU.
• CONV1, CONV2, CONV4, CONV5:
Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8:
Connections with all feature maps in preceding layer,
communication across GPUs.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
AlexNet was the coming-out party for CNNs in the computer
vision community. This was the first time a model performed
so well on the historically difficult ImageNet dataset. The
paper illustrated the benefits of CNNs and backed them up
with record-breaking performance in the competition.
[Krizhevsky et al., 2012]
VGGNet
• Very Deep Convolutional Networks For Large Scale
Image Recognition - Karen Simonyan and Andrew
Zisserman; 2015
• The runner-up at the ILSVRC 2014 competition
• Significantly deeper than AlexNet
• 140 million parameters
[Simonyan and Zisserman, 2014]
VGGNet
• Smaller filters
Only 3x3 CONV filters, stride 1, pad 1
and 2x2 MAX POOL , stride 2
• Deeper network
AlexNet: 8 layers
VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax
VGGNet
[Simonyan and Zisserman, 2014]
• Why use smaller filters? (3x3 conv)
• What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7
Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.
But deeper, more non-linearities
And fewer parameters: 3 * (3^2 C^2) vs. 7^2 C^2 for C channels per layer
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
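The receptive-field and parameter comparison can be verified numerically. The helper names are hypothetical, C = 256 is an arbitrary example value, and biases are ignored as on the slide.

```python
def stacked_receptive_field(num_layers, f):
    """Effective receptive field of `num_layers` stacked f x f, stride-1 convs.

    Each extra layer extends the field by (f - 1) pixels.
    """
    return 1 + num_layers * (f - 1)


def conv_weights(f, channels):
    """Weights of one f x f conv layer with `channels` inputs and outputs."""
    return f * f * channels * channels


C = 256
rf_three_3x3 = stacked_receptive_field(3, 3)
rf_one_7x7 = stacked_receptive_field(1, 7)
params_three_3x3 = 3 * conv_weights(3, C)  # 3 * (3^2 C^2) = 27 C^2
params_one_7x7 = conv_weights(7, C)        # 7^2 C^2 = 49 C^2
```

Both options see a 7x7 region of the input, but the stack of three 3x3 layers uses 27 C^2 weights instead of 49 C^2, roughly 45% fewer, while applying three non-linearities instead of one.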
VGGNet
[Simonyan and Zisserman, 2014]
VGG16:
TOTAL memory: 24M * 4 bytes ~= 96MB / image
TOTAL params: 138M parameters
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input
3x3 conv, 64
3x3 conv, 64
Pool
3x3 conv, 128
3x3 conv, 128
Pool
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
FC 4096
FC 4096
FC 1000
Softmax
[Simonyan and Zisserman, 2014]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input memory: 224*224*3=150K params: 0
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
Pool memory: 112*112*64=800K params: 0
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
Pool memory: 56*56*128=400K params: 0
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
Pool memory: 28*28*256=200K params: 0
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
Pool memory: 14*14*512=100K params: 0
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
Pool memory: 7*7*512=25K params: 0
FC 4096 memory: 4096 params: 7*7*512*4096 = 102,760,448
FC 4096 memory: 4096 params: 4096*4096 = 16,777,216
FC 1000 memory: 1000 params: 4096*1000 = 4,096,000
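The per-layer parameter counts in the table can be summed programmatically to recover the 138M total. This sketch assumes the VGG16 layout listed above and, like the table, counts weights only (no biases). The `config` list and variable names are illustrative.

```python
# Number of 3x3 conv filters per layer; "pool" halves the spatial size.
config = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
          512, 512, 512, "pool", 512, 512, 512, "pool"]

size, channels = 224, 3
conv_total = 0
for layer in config:
    if layer == "pool":
        size //= 2                            # 2x2 max pool, stride 2
    else:
        conv_total += 3 * 3 * channels * layer  # (3*3*C_in) * C_out
        channels = layer

# Three fully connected layers on the flattened 7x7x512 volume.
fc_total = size * size * channels * 4096 + 4096 * 4096 + 4096 * 1000
total_params = conv_total + fc_total
```

About 124M of the 138M weights sit in the fully connected layers, with the first FC layer alone accounting for roughly 103M.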
VGGNet
[Simonyan and Zisserman, 2014]
Details/Retrospectives :
• ILSVRC’14 2nd in classification, 1st in localization
• Similar training procedure as AlexNet
• No Local Response Normalisation (LRN)
• Use VGG16 or VGG19 (VGG19 only slightly better, more
memory)
• Use ensembles for best results
• FC7 features generalize well to other tasks
• Trained on 4 Nvidia Titan Black GPUs for two to three weeks.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet
VGGNet reinforced the notion that convolutional neural
networks need many layers for the hierarchical
representation of visual data to work.
Keep it deep.
Keep it simple.
[Simonyan and Zisserman, 2014]
GoogleNet
• Going Deeper with Convolutions - Christian Szegedy et
al.; 2015
• ILSVRC 2014 competition winner
• Also significantly deeper than AlexNet
• 12x fewer parameters than AlexNet
• Focused on computational efficiency
[Szegedy et al., 2014]
GoogleNet
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)
[Szegedy et al., 2014]
GoogleNet
“Inception module”: design a good local network topology (network within
a network) and then stack these modules on top of each other
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
[Figure: Inception module. Four parallel branches from the previous layer, joined by filter concatenation:
(1) 1x1 convolution
(2) 1x1 convolution -> 3x3 convolution
(3) 1x1 convolution -> 5x5 convolution
(4) 3x3 max pooling -> 1x1 convolution]
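To see why the 1x1 convolutions matter for efficiency, one can count weights for a single module with and without them. The filter counts and the 192-channel input below are illustrative assumptions, not values from the slides, and biases are ignored.

```python
def conv_weights(f, c_in, c_out):
    """Weights of one f x f convolution (biases ignored)."""
    return f * f * c_in * c_out


c_in = 192  # assumed depth of the previous layer

# With 1x1 reductions before the 3x3 and 5x5 convs and after the pooling.
with_reduction = (
    conv_weights(1, c_in, 64)                               # branch 1: 1x1
    + conv_weights(1, c_in, 96) + conv_weights(3, 96, 128)  # branch 2
    + conv_weights(1, c_in, 16) + conv_weights(5, 16, 32)   # branch 3
    + conv_weights(1, c_in, 32)                             # branch 4: after pool
)
depth_with_reduction = 64 + 128 + 32 + 32  # filter concatenation

# Naive version: 3x3 and 5x5 convs applied directly; pooling keeps all channels.
naive = (
    conv_weights(1, c_in, 64)
    + conv_weights(3, c_in, 128)
    + conv_weights(5, c_in, 32)
)
depth_naive = 64 + 128 + 32 + c_in
```

Under these assumed sizes the module with 1x1 reductions needs well under half the weights of the naive version, and its concatenated output is also shallower, which keeps the cost of the next module down.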
GoogleNet
Details/Retrospectives :
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x fewer params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogleNet
GoogleNet introduced the idea that CNN layers don’t always have
to be stacked up sequentially. With the Inception module, the
authors showed that a creative structuring of layers can lead
to improved performance and computational efficiency.
[Szegedy et al., 2014]
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and
exploding gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
[He et al., 2015]
ResNet
• ILSVRC’15 classification winner (3.57% top 5 error; humans generally hover around a 5-10% error rate)
• Swept all classification and detection competitions in ILSVRC’15 and COCO’15!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
• 56-layer model performs worse on both training and test error
-> The deeper model performs worse (not caused by overfitting)!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.
• We will use skip connections allowing us to take the activation
from one layer and feed it into another layer, much deeper
into the network.
• Use layers to fit residual F(x) = H(x) – x
instead of H(x) directly
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.
[He et al., 2015]
ResNet
Short cut/ skip connection
a^{[l]} -> Linear -> ReLU -> a^{[l+1]} -> Linear -> ReLU -> a^{[l+2]}
z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}
a^{[l+1]} = g(z^{[l+1]})
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}
Plain network: a^{[l+2]} = g(z^{[l+2]})
With skip connection: a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})
[He et al., 2015]
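The skip-connection equations can be sketched with small dense layers in place of convolutions. This is a simplified illustration, not the ResNet implementation: the layers are fully connected, g is ReLU, and all names and values are hypothetical.

```python
def relu_vec(v):
    """Apply g = ReLU element-wise."""
    return [max(0.0, x) for x in v]


def linear(W, a, b):
    """z = W a + b for a weight matrix W (list of rows) and vectors a, b."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l])."""
    a_l1 = relu_vec(linear(W1, a_l, b1))      # a[l+1] = g(z[l+1])
    z_l2 = linear(W2, a_l1, b2)               # z[l+2]
    return relu_vec([z + a for z, a in zip(z_l2, a_l)])  # add skip, then g


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]
a_l = [1.0, 2.0]  # non-negative, as it would be after a previous ReLU

# If the second layer's weights are driven to zero, F(x) = 0 and the block
# simply passes its input through.
passthrough = residual_block(a_l, identity, no_bias, zeros, no_bias)
# With identity weights, F(x) = x and the output is x + x.
doubled = residual_block(a_l, identity, no_bias, identity, no_bias)
```

Because a block with zero weights reduces to the identity, adding more residual blocks should not make the training error worse in principle, which is the intuition behind fitting F(x) rather than H(x) directly.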
ResNet
Full ResNet architecture:
• Stack residual blocks
• Every residual block has two 3x3 conv layers
• Periodically, double # of filters and
downsample spatially using stride 2 (in each
dimension)
• Additional conv layer at the beginning
• No FC layers at the end (only FC 1000 to
output classes)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
ResNet is the best CNN architecture that we currently have, and
a great innovation for the idea of residual learning.
Even better than human performance!
[He et al., 2015]
Accuracy comparison
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power
consumption
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Countless applications
An Architecture for Many Applications
Detection
Semantic segmentation
End-to-end robotic control
Semantic Segmentation: Fully Convolutional Networks
FCN: Fully Convolutional Network.
Network designed with all convolutional layers, with downsampling and upsampling operations
tf.keras.layers.Conv2DTranspose
Long+ CVPR 2015
Facial Detection & Recognition
Self-Driving Cars
Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception
Raw perception I (e.g., camera)
Coarse maps M (e.g., GPS)
Possible control commands
Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation
Entire model trained end-to-end
without any human labelling or annotations
Amini+ ICRA 2019
Automatic Colorization of Black and White Images
Optimizing Images
Post Processing Feature Optimization (Illumination)
Post Processing Feature Optimization (Color Curves and Details)
Post Processing Feature Optimization (Color Tone: Warmness)
Up-scaling low-resolution images
Medicine, Biology, Healthcare
Gulshan+ JAMA 2016.
Breast Cancer Screening
Breast cancer case missed by radiologist but detected by AI
[Figure: AI vs. MD readers]
CNN-based system outperformed expert radiologists at detecting breast
cancer from mammograms
Semantic Segmentation: Biomedical Image Analysis
Brain Tumors (Dong+ MIUA 2017)
Malaria Infection (Soleimany+ arXiv 2019)
[Figure panels: Original, Ground Truth, Segmentation, Uncertainty]
DeepBind
[Alipanahi et al., 2015]
Predicting disease mutations
[Alipanahi et al., 2015]
Deep Learning for Computer Vision: Summary
Foundations
• Why computer vision?
• Representing images
• Convolutions for feature extraction
CNNs
• CNN architecture
• Application to classification
• ImageNet
Applications
• Segmentation, image captioning, control
• Security, medicine, robotics

CNN Algorithm

  • 1. 6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 3: Convolutional Neural Networks Prof. Manolis Kellis http://mit6874.github.io Slides credit: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany
  • 3. 1a. What do you see, and how? Can we teach machines to see?
  • 4. What do you see?
  • 5. How do you see? How can we help computers see?
  • 6. What computers ‘see’: Images as Numbers. What you see: input image. What you both see: input image + values. What the computer "sees": pixel intensity values (“pix-el” = picture-element). An image is just a matrix of numbers in [0, 255], i.e., 1080x1080x3 for an RGB image (Levin, Image Processing & Computer Vision). Question: is this Lincoln? Washington? Jefferson? Obama? How can the computer answer this question? Can I just do classification on the 1,166,400-long image vector (1080x1080 pixels) directly? No. Instead: exploit image spatial structure. Learn patches. Build them up.
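To make the "images as numbers" point concrete, the short sketch below builds a placeholder array with the 1080x1080x3 shape quoted on the slide and counts its entries. The all-zeros array stands in for a real photo and is an assumption for illustration only.

```python
import numpy as np

height, width, channels = 1080, 1080, 3
# Placeholder image: every pixel intensity is an integer in [0, 255].
image = np.zeros((height, width, channels), dtype=np.uint8)

pixels_per_channel = height * width   # size of one color plane
flat = image.reshape(-1)              # the "long image vector"

print(pixels_per_channel)  # 1166400
print(flat.shape[0])       # 3499200
```

A single color plane has 1,166,400 values, and flattening all three RGB planes gives a vector three times as long, which is why feeding the raw vector to a classifier discards the spatial layout the slide proposes to exploit.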
  • 7. 1b. Classical machine vision roots in study of human/animal brains
  • 8. Inspiration: human/animal visual cortex • Layers of neurons: pixels, edges, shapes, primitives, scenes • E.g. Layer 4 responds to bands w/ given slant, contrasting edges
  • 9. Primitives: Neurons & action potentials • Chemical accumulation across dendritic connections • Pre-synaptic axon → post-synaptic dendrite → neuronal cell body • Each neuron receives multiple signals from its many dendrites • When threshold crossed, it fires • Its axon then sends outgoing signal to downstream neurons • Weak stimuli ignored • Sufficiently strong stimuli cross activation threshold • Non-linearity within each neuronal level • Neurons connected into circuits (neural networks): emergent properties, learning, memory • Simple primitives arranged in simple, repetitive, and extremely large networks • 86 billion neurons, each connects to 10k neurons, about 1 quadrillion (10^15) connections
  • 10. Abstraction layers: edges, bars, dir., shapes, objects, scenes. LGN: Small dots. V1: Orientation, disparity, some color. V4: Color, basic shapes, 2D/3D, curvature. VTC (ventral temporal cortex): Complex features and objects • Abstraction layers ↔ visual cortex layers • Complex concepts from simple parts, hierarchy • Primitives of visual concepts encoded in neuronal connections in early cortical layers
  • 11. General “learning machine”, reused widely • Massive recent expansion of human brain has re-used a relatively simple but general learning architecture • Hearing, taste, smell, sight, touch all re-use similar learning architecture • Interchangeable circuitry • Auditory cortex learns to ‘see’ if sent visual signals • Injury area tasks shift to uninjured areas • Not fully-general learning, but well-adapted to our world • Humans co-opted this circuitry to many new applications • Modern tasks accessible to any Homo sapiens (<70k years) • ML primitives not too different from animals: more to come? [Figure labels: Motor Cortex, Visual Cortex; human vs. chimp, hardware expansion]
  • 13. 2a. Spatial structure for image recognition
  • 14. Using Spatial Structure Input: 2D image. Array of pixel values. Idea: connect patches of input to neurons in hidden layer. Neuron connected to region of input. Only “sees” these values.
  • 15. Using Spatial Structure Connect patch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
  • 16. Feature Extraction with Convolution - Filter of size 4x4: 16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This “patchy” operation is convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 17. Fully Connected Neural Network Fully Connected: • Each neuron in hidden layer connected to all neurons in input layer • No spatial information • Many, many parameters Input: • 2D image • Vector of pixel values Key idea: Use spatial structure in input to inform architecture of the network
  • 18. High Level Feature Detection Let’s identify key features in each image category: Wheels, License Plate, Headlights; Door, Windows, Steps; Nose, Eyes, Mouth
  • 21. Convolution operation is element-wise multiply and add Filter / Kernel
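The multiply-and-add operation can be sketched in plain Python. This is a minimal illustration, not the lecture's code: the function name `conv2d` and the 5x5 image and 3x3 kernel values are assumptions chosen for the example. Like deep learning libraries, it slides the kernel without flipping it (strictly, a cross-correlation), and it uses no padding.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows), with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise multiply the patch by the kernel, then add up.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 binary image and 3x3 kernel, for illustration only.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)              # 3x3 feature map
strided_map = conv2d(image, kernel, stride=2)    # 2x2 feature map
print(feature_map)
print(strided_map)
```

The same nine weights are reused at every position, which is the parameter sharing described in the slides.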
  • 22. Producing Feature Maps Original Sharpen Edge Detect “Strong” Edge Detect
  • 23. A simple pattern: Edges How can we detect edges with a kernel? Input -1 -1 Filter Output (Goodfellow 2016)
  • 24. Simple Kernels / Filters
  • 25. X or X? Image is represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed. (Rohrer, How do CNNs work?)
  • 26. There are three approaches to edge cases in convolution
  • 27. (Goodfellow 2016) Zero Padding Controls Output Size • Full convolution: zero pad input so output is produced whenever an output value contains at least one input value (expands output) • Valid-only convolution: output only when entire kernel contained in input (shrinks output) • Same convolution: zero pad input so output is same size as input dimensions x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME') • TF convolution operator takes stride and zero fill option as parameters • Stride is distance between kernel applications in each dimension • Padding can be SAME or VALID
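The effect of the three padding choices on output size can be checked with a small sketch. It assumes stride 1, an odd kernel size for the "same" case, and symmetric zero padding; the helper names `padding_for` and `conv_output_size` are hypothetical and are not TensorFlow functions.

```python
def padding_for(mode, f):
    """Zeros added on each side of the input for a kernel of size f."""
    if mode == "full":
        return f - 1          # every output touches at least one input value
    if mode == "valid":
        return 0              # kernel must lie entirely inside the input
    if mode == "same":
        if f % 2 == 0:
            raise ValueError("this sketch assumes an odd kernel size")
        return (f - 1) // 2   # output keeps the input size (stride 1)
    raise ValueError(f"unknown mode: {mode}")


def conv_output_size(n, f, pad=0, stride=1):
    """Output size = (N + 2P - F) / stride + 1, using integer division."""
    return (n + 2 * pad - f) // stride + 1


n, f = 5, 3  # 5-pixel-wide input, 3-wide kernel (illustrative values)
sizes = {mode: conv_output_size(n, f, pad=padding_for(mode, f))
         for mode in ("full", "valid", "same")}
print(sizes)
```

For a 5-wide input and a 3-wide kernel this gives 7 (full, expands), 3 (valid, shrinks) and 5 (same).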
  • 30. 3a. Learning Visual Features de novo
  • 31. Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them) Low level features: edges, dark spots. Mid level features: eyes, ears, nose. High level features: facial structure. (Lee+ ICML 2009)
  • 32. Key idea: re-use parameters Convolution shares parameters Example 3x3 convolution on a 5x5 image
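To make the saving from parameter sharing concrete, the sketch below counts weights for the 3x3-on-5x5 example, comparing a convolution with a fully connected layer that produces the same 3x3 output. The function names are hypothetical, biases are ignored, and stride 1 with no padding is assumed.

```python
def conv_weight_count(kernel_size):
    """One shared kernel: the same weights are reused at every position."""
    return kernel_size * kernel_size


def dense_weight_count(input_size, output_size):
    """Fully connected: every output unit has its own weight per input pixel."""
    return (input_size * input_size) * (output_size * output_size)


input_size, kernel_size = 5, 3
output_size = input_size - kernel_size + 1   # 3x3 output, stride 1, no padding

shared = conv_weight_count(kernel_size)
unshared = dense_weight_count(input_size, output_size)
print(shared, unshared)
```

Under these assumptions the convolution needs 9 weights where the fully connected layer needs 225.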
  • 33. Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 34. LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 35. LeNet-5 [LeCun et al., 1998] 32×32×1 input → conv 5×5, s=1 → 28×28×6 → avg pool f=2, s=2 → 14×14×6 → conv 5×5, s=1 → 10×10×16 → avg pool f=2, s=2 → 5×5×16 → FC 120 → FC 84 → ŷ (10 outputs). Reminder: Output size = (N+2P-F)/stride + 1. This slide is taken from Andrew Ng
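The output-size reminder can be applied layer by layer to reproduce the LeNet-5 shapes listed on the slide. This is a small sketch; the layer list below is transcribed from the slide, and the helper name `out_size` is an assumption.

```python
def out_size(n, f, stride=1, pad=0):
    """Output size = (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, output channels or None for pooling)
lenet5_layers = [
    ("conv 5x5", 5, 1, 6),
    ("avg pool", 2, 2, None),
    ("conv 5x5", 5, 1, 16),
    ("avg pool", 2, 2, None),
]

size, channels = 32, 1
shapes = [(size, size, channels)]
for name, f, stride, out_channels in lenet5_layers:
    size = out_size(size, f, stride)
    if out_channels is not None:   # pooling keeps the channel count
        channels = out_channels
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
```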
  • 36. LeNet-5 • Only 60K parameters • As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑ (height and width shrink, number of channels grows) • General structure: conv->pool->conv->pool->FC->FC->output • Different filters look at different channels • Sigmoid and Tanh nonlinearity [LeCun et al., 1998]
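A rough parameter count shows where the "60K" figure comes from. The sketch assumes the layer sizes from the architecture slide, one bias per filter or unit, and a second conv layer connected to all 6 input channels (the original paper uses a sparser connection table, so its exact count differs slightly). The helper names are hypothetical.

```python
def conv_params(f, in_channels, out_channels):
    """Weights plus one bias per filter."""
    return (f * f * in_channels + 1) * out_channels


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


lenet5_params = {
    "conv1 (5x5, 1->6)": conv_params(5, 1, 6),
    "conv2 (5x5, 6->16)": conv_params(5, 6, 16),   # assumes full connectivity
    "fc1 (400->120)": fc_params(5 * 5 * 16, 120),
    "fc2 (120->84)": fc_params(120, 84),
    "output (84->10)": fc_params(84, 10),
}
total_params = sum(lenet5_params.values())
print(lenet5_params)
print(total_params)
```

Most of the parameters sit in the first fully connected layer rather than in the convolutions.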
  • 37. Backpropagation of convolution Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
  • 40. Representation Learning in Deep CNNs Low level features: edges, dark spots (Conv Layer 1). Mid level features: eyes, ears, nose (Conv Layer 2). High level features: facial structure (Conv Layer 3). Lee+ ICML 2009
  • 41. CNNs for Classification 1. Convolution: Apply filters to generate feature maps (tf.keras.layers.Conv2D). 2. Non-linearity: Often ReLU (tf.keras.activations.*). 3. Pooling: Downsampling operation on each feature map (tf.keras.layers.MaxPool2D). Train model with image data. Learn weights of filters in convolutional layers.
  • 42. Example – Six convolutional layers
  • 43. Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias tf.keras.layers.Conv2D
  • 44. Convolutional Layers: Local Connectivity For a neuron in hidden layer: • Take inputs from patch • Compute weighted sum • Apply bias 4x4 filter: matrix of weights wij for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function tf.keras.layers.Conv2D
  • 45. CNNs: Spatial Arrangement of Output Volume (height, width, depth) Layer Dimensions: h × w × d, where h and w are spatial dimensions and d (depth) = number of filters. Stride: Filter step size. Receptive Field: Locations in input image that a node is path connected to. tf.keras.layers.Conv2D(filters=d, kernel_size=(h,w), strides=s)
  • 46. Introducing Non-Linearity Rectified Linear Unit (ReLU) - Apply after every convolution operation (i.e., after convolutional layers) - ReLU: pixel-by-pixel operation that replaces all negative values by zero. - Non-linear operation tf.keras.layers.ReLU (Karn, Intuitive CNNs)
  • 47. Pooling Max pooling, average pooling 1) Reduced dimensionality 2) Spatial invariance tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)
  • 48. The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') x = tf.nn.bias_add(x, b) x = tf.nn.relu(x) f(x) = max(0, x) When will we backpropagate through this? Once it “dies” what happens to it?
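The two questions on this slide can be explored with a tiny sketch of a single ReLU unit. The weight, bias and inputs are made-up values, and the upstream gradient is assumed to be 1. The gradient passes through only where the pre-activation is positive; if the pre-activation is negative for every input, the weight receives zero gradient and stops updating, which is what is usually meant by a "dead" ReLU.

```python
def relu(z):
    """f(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0


# Hypothetical unit whose bias has drifted strongly negative.
w, b = 0.5, -2.0
inputs = [0.0, 0.5, 1.0]

outputs = [relu(w * x + b) for x in inputs]
# Gradient of the output with respect to w, assuming upstream gradient 1.
weight_grads = [relu_grad(w * x + b) * x for x in inputs]

print(outputs)
print(weight_grads)
```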
  • 49. Pooling reduces dimensionality by giving up spatial location • max pooling reports the maximum output within a defined neighborhood • Padding can be SAME or VALID x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME') Output Input Pooling Batch H W Input channel Neighborhood [batch, height, width, channels]
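A plain-Python sketch of max pooling on a single channel shows how spatial location is given up. The function name `max_pool2d` and the 4x4 inputs are assumptions for illustration; the sketch uses no padding, so windows that would run past the edge are dropped.

```python
def max_pool2d(x, k=2, stride=2):
    """Report the maximum within each k x k neighborhood (no padding)."""
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    return [
        [
            max(x[r * stride + i][c * stride + j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)

# Moving the strongest response within its 2x2 window leaves the output unchanged.
shifted = [
    [6, 1, 2, 4],
    [5, 1, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled_shifted = max_pool2d(shifted)
print(pooled, pooled_shifted)
```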
  • 51. CNNs for Classification: Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
  • 52. CNNs for Classification: Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class
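Expressing the output as class probabilities is commonly done with a softmax over the final fully connected layer's scores. A minimal sketch follows; the three-class scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores from the last fully connected layer for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the highest score keeps the highest probability, so softmax changes the scale of the outputs but not the predicted class.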
  • 53. Putting it all together
      import tensorflow as tf

      def generate_model():
          model = tf.keras.Sequential([
              # first convolutional layer (Conv2D takes kernel_size, not filter_size)
              tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
              tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
              # second convolutional layer
              tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
              tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
              # fully connected classifier
              tf.keras.layers.Flatten(),
              tf.keras.layers.Dense(1024, activation='relu'),
              tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
          ])
          return model
  • 55. 4a. Real-world feature invariance is hard
  • 56. How can computers recognize objects?
  • 57. How can computers recognize objects? Challenge: • Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. • How can we overcome this challenge? Answer: • Learn a ton of features (millions) from the bottom up • Learn the convolutional filters, rather than pre-computing them
  • 61. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 62. AlexNet • ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 • Facilitated by GPUs, highly optimized convolution implementation and large datasets (ImageNet) • One of the largest CNNs at the time • Has 60 million parameters, compared to the 60k parameters of LeNet-5 [Krizhevsky et al., 2012]
  • 63. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners • The annual “Olympics” of computer vision. • Teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. • 2012 was the first year a CNN took the top spot, with a top-5 test error rate of 15.3%. • The next best entry achieved an error of 26.2%.
  • 65. AlexNet [Krizhevsky et al., 2012] Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 • Input: 227x227x3 images (224x224 before padding) • First layer: 96 11x11 filters applied at stride 4 • Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96] • Number of parameters in this layer? (11*11*3)*96 = 35K Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 67. AlexNet [Krizhevsky et al., 2012] Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 • Input: 227x227x3 images (224x224 before padding) • After CONV1: 55x55x96 • Second layer (POOL1): 3x3 filters applied at stride 2 • Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96] • Number of parameters in this layer? 0! (pooling has no learnable weights) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 68. AlexNet [Krizhevsky et al., 2012] 227×227×3 input → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96 → conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256 → … This slide is taken from Andrew Ng
  • 69. AlexNet [Krizhevsky et al., 2012] … → FC 4096 → FC 4096 → Softmax 1000. This slide is taken from Andrew Ng
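The CONV1 and POOL1 calculations from the two slides above can be reproduced in a few lines. This is a sketch with hypothetical helper names; following the slides, biases are not counted.

```python
def output_size(n, f, stride, pad=0):
    """(N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1


def conv_weight_count(f, in_channels, num_filters):
    """Each filter spans f x f x in_channels inputs; biases are not counted."""
    return f * f * in_channels * num_filters


# CONV1: 96 filters of size 11x11 at stride 4 on a 227x227x3 input.
conv1_size = output_size(227, 11, stride=4)
conv1_weights = conv_weight_count(11, 3, 96)

# POOL1: 3x3 windows at stride 2; pooling only takes a max, so it learns nothing.
pool1_size = output_size(conv1_size, 3, stride=2)
pool1_weights = 0

print(conv1_size, conv1_weights, pool1_size, pool1_weights)
```

The CONV1 weight count comes out at 34,848, which the slide rounds to 35K.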
  • 70. AlexNet [Krizhevsky et al., 2012] Details/Retrospectives: • first use of ReLU • used Norm layers (not common anymore) • heavy data augmentation • dropout 0.5 • batch size 128 • 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 71. AlexNet [Krizhevsky et al., 2012] • Trained on GTX 580 GPU with only 3 GB of memory. • Network spread across 2 GPUs, half the neurons (feature maps) on each GPU. • CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. • CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 72. AlexNet AlexNet was the coming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition. [Krizhevsky et al., 2012]
  • 75. VGGNet • Very Deep Convolutional Networks For Large Scale Image Recognition - Karen Simonyan and Andrew Zisserman; 2015 • The runner-up at the ILSVRC 2014 competition • Significantly deeper than AlexNet • 140 million parameters [Simonyan and Zisserman, 2014]
  • 76. VGGNet [Simonyan and Zisserman, 2014] • Smaller filters: only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2 • Deeper network: AlexNet has 8 layers, VGGNet has 16-19 layers • ZFNet: 11.7% top-5 error in ILSVRC’13 • VGGNet: 7.3% top-5 error in ILSVRC’14. Architecture (figure): Input → 3x3 conv, 64 → 3x3 conv, 64 → Pool 1/2 → 3x3 conv, 128 → 3x3 conv, 128 → Pool 1/2 → 3x3 conv, 256 → 3x3 conv, 256 → Pool 1/2 → 3x3 conv, 512 → 3x3 conv, 512 → 3x3 conv, 512 → Pool 1/2 → 3x3 conv, 512 → 3x3 conv, 512 → 3x3 conv, 512 → Pool 1/2 → FC 4096 → FC 4096 → FC 1000 → Softmax. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 77. VGGNet [Simonyan and Zisserman, 2014] • Why use smaller filters? (3x3 conv) A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer. • What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7 • But deeper, more non-linearities • And fewer parameters: $3 \cdot (3^2 C^2) = 27C^2$ vs. $7^2 C^2 = 49C^2$ for C channels per layer. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
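Both claims on this slide (same receptive field, fewer parameters) can be checked numerically. The sketch assumes stride-1 layers with C input and C output channels each and ignores biases; the function names and the choice C = 64 are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Each extra stride-1 layer widens the receptive field by (kernel - 1)."""
    field = 1
    for _ in range(num_layers):
        field += kernel - 1
    return field


def stacked_weight_count(num_layers, kernel, channels):
    """num_layers conv layers, each mapping `channels` to `channels`, no biases."""
    return num_layers * (kernel * kernel * channels * channels)


C = 64  # illustrative channel count
three_3x3_field = stacked_receptive_field(3, kernel=3)
one_7x7_field = stacked_receptive_field(1, kernel=7)
three_3x3_weights = stacked_weight_count(3, 3, C)   # 27 * C^2
one_7x7_weights = stacked_weight_count(1, 7, C)     # 49 * C^2

print(three_3x3_field, one_7x7_field)
print(three_3x3_weights, one_7x7_weights)
```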
  • 78. VGGNet [Simonyan and Zisserman, 2014] VGG16: TOTAL memory: 24M * 4 bytes ~= 96MB / image TOTAL params: 138M parameters Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Input 3x3 conv, 64 3x3 conv, 64 Pool 3x3 conv, 128 3x3 conv, 128 Pool 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool FC 4096 FC 4096 FC 1000 Softmax
  • 79. VGG16: memory (activations) and parameters per layer [Simonyan and Zisserman, 2014] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
      Input: memory: 224*224*3=150K, params: 0
      3x3 conv, 64: memory: 224*224*64=3.2M, params: (3*3*3)*64 = 1,728
      3x3 conv, 64: memory: 224*224*64=3.2M, params: (3*3*64)*64 = 36,864
      Pool: memory: 112*112*64=800K, params: 0
      3x3 conv, 128: memory: 112*112*128=1.6M, params: (3*3*64)*128 = 73,728
      3x3 conv, 128: memory: 112*112*128=1.6M, params: (3*3*128)*128 = 147,456
      Pool: memory: 56*56*128=400K, params: 0
      3x3 conv, 256: memory: 56*56*256=800K, params: (3*3*128)*256 = 294,912
      3x3 conv, 256: memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
      3x3 conv, 256: memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
      Pool: memory: 28*28*256=200K, params: 0
      3x3 conv, 512: memory: 28*28*512=400K, params: (3*3*256)*512 = 1,179,648
      3x3 conv, 512: memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
      Pool: memory: 14*14*512=100K, params: 0
      3x3 conv, 512: memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
      3x3 conv, 512: memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
      Pool: memory: 7*7*512=25K, params: 0
      FC 4096: memory: 4096, params: 7*7*512*4096 = 102,760,448
      FC 4096: memory: 4096, params: 4096*4096 = 16,777,216
      FC 1000: memory: 1000, params: 4096*1000 = 4,096,000
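The per-layer parameter counts above can be summed programmatically to recover the roughly 138M total quoted for VGG16. This sketch follows the slide's convention of ignoring biases; the configuration list and the function name `vgg16_weight_count` are written for this example.

```python
# Numbers are output channels of 3x3 conv layers; "pool" halves height and width.
VGG16_CONFIG = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
                512, 512, 512, "pool", 512, 512, 512, "pool"]


def vgg16_weight_count(config, in_channels=3, input_size=224,
                       fc_sizes=(4096, 4096, 1000)):
    """Sum conv and fully connected weights, without biases."""
    total = 0
    channels, size = in_channels, input_size
    for item in config:
        if item == "pool":
            size //= 2                           # pooling has no parameters
        else:
            total += 3 * 3 * channels * item     # one 3x3 conv layer
            channels = item
    features = size * size * channels            # 7 * 7 * 512 after the last pool
    for width in fc_sizes:
        total += features * width
        features = width
    return total


total_weights = vgg16_weight_count(VGG16_CONFIG)
print(total_weights)
```

Nearly three quarters of the weights come from the first fully connected layer alone.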
  • 80. VGGNet [Simonyan and Zisserman, 2014] Details/Retrospectives : • ILSVRC’14 2nd in classification, 1st in localization • Similar training procedure as AlexNet • No Local Response Normalisation (LRN) • Use VGG16 or VGG19 (VGG19 only slightly better, more memory) • Use ensembles for best results • FC7 features generalize well to other tasks • Trained on 4 Nvidia Titan Black GPUs for two to three weeks. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 81. VGGNet VGG Net reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work. Keep it deep. Keep it simple. [Simonyan and Zisserman, 2014]
  • 83. GoogleNet • Going Deeper with Convolutions - Christian Szegedy et al.; 2015 • ILSVRC 2014 competition winner • Also significantly deeper than AlexNet • 12x fewer parameters than AlexNet • Focused on computational efficiency [Szegedy et al., 2014]
  • 84. GoogleNet • 22 layers • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure • No FC layers • Only 5 million parameters! • ILSVRC’14 classification winner (6.7% top 5 error) [Szegedy et al., 2014]
  • 85. GoogleNet “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other. Module structure (figure): Previous layer → four parallel branches [1x1 convolution; 1x1 convolution → 3x3 convolution; 1x1 convolution → 5x5 convolution; 3x3 max pooling → 1x1 convolution] → Filter concatenation. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
  • 86. GoogleNet Details/Retrospectives: • Deeper networks, with computational efficiency • 22 layers • Efficient “Inception” module • No FC layers • 12x fewer params than AlexNet • ILSVRC’14 classification winner (6.7% top-5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
  • 87. GoogleNet Introduced the idea that CNN layers didn’t always have to be stacked up sequentially. With the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency. [Szegedy et al., 2014]
  • 89. ResNet • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 • Extremely deep network – 152 layers • Deeper neural networks are more difficult to train. • Deep networks suffer from vanishing and exploding gradients. • Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. [He et al., 2015]
  • 90. ResNet • ILSVRC’15 classification winner (3.57% top-5 error; humans generally hover around a 5-10% error rate) • Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 91. ResNet • What happens when we continue stacking deeper layers on a convolutional neural network? • 56-layer model performs worse on both training and test error -> The deeper model performs worse (not caused by overfitting)! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 92. ResNet • Hypothesis: The problem is an optimization problem. Very deep networks are harder to optimize. • Solution: Use network layers to fit residual mapping instead of directly trying to fit a desired underlying mapping. • We will use skip connections allowing us to take the activation from one layer and feed it into another layer, much deeper into the network. • Use layers to fit residual F(x) = H(x) – x instead of H(x) directly Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 93. ResNet Residual Block Input x goes through conv-relu-conv series and gives us F(x). That result is then added to the original input x. Let’s call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So, instead of just computing that transformation (straight from x to F(x)), we’re computing the term that we have to add, F(x), to the input, x. [He et al., 2015]
  • 94. ResNet: shortcut / skip connection [He et al., 2015]. Main path: $a^{[l]}$ → Linear → ReLU → $a^{[l+1]}$ → Linear → ReLU → $a^{[l+2]}$. Layer equations: $z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$, $a^{[l+1]} = g(z^{[l+1]})$, $z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$. Without the shortcut: $a^{[l+2]} = g(z^{[l+2]})$. With the shortcut: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$
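The shortcut equations can be sketched with small dense layers in plain Python (the slides' residual blocks use convolutions; dense layers are an assumption here to keep the example short, and all names and values are illustrative). The last lines show why the identity mapping is easy to learn: with all weights and biases at zero, F(x) = 0 and the block returns its non-negative input unchanged.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(W, a, b):
    """z = W a + b for a weight matrix W (list of rows) and vectors a, b."""
    return [sum(w * x for w, x in zip(row, a)) + bias
            for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l]), with g = ReLU."""
    a_l1 = relu(linear(W1, a_l, b1))            # a[l+1] = g(z[l+1])
    z_l2 = linear(W2, a_l1, b2)                 # z[l+2]
    return relu([z + skip for z, skip in zip(z_l2, a_l)])


a_l = [1.0, 2.0]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

# With zero weights the residual F(x) is zero, so the block is the identity.
identity_out = residual_block(a_l, zero_W, zero_b, zero_W, zero_b)

# With non-zero weights the block adds a learned correction to its input.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
corrected_out = residual_block(a_l, W1, zero_b, W2, zero_b)
print(identity_out, corrected_out)
```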
  • 95. ResNet Full ResNet architecture: • Stack residual blocks • Every residual block has two 3x3 conv layers • Periodically, double # of filters and downsample spatially using stride 2 (in each dimension) • Additional conv layer at the beginning • No FC layers at the end (only FC 1000 to output classes) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 96. ResNet • Total depths of 34, 50, 101, or 152 layers for ImageNet • For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 97. ResNet Experimental Results: • Able to train very deep networks without degrading • Deeper networks now achieve lower training errors as expected [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 98. ResNet The best-performing CNN architecture covered here, and a major innovation through the idea of residual learning. Its 3.57% top-5 error is even lower than typical human error rates! [He et al., 2015]
  • 99. Accuracy comparison (figure). Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 100. Forward pass time and power consumption (figure). Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 104. An Architecture for Many Applications Detection Semantic segmentation End-to-end robotic control
  • 105. Semantic Segmentation: Fully Convolutional Networks FCN: Fully Convolutional Network. Network designed with all convolutional layers, with downsampling and upsampling operations. tf.keras.layers.Conv2DTranspose (Long+ CVPR 2015)
  • 106. Facial Detection & Recognition
  • 108. Self-Driving Cars: Navigation from Visual Perception Raw Perception I (e.g., camera) Coarse Maps M (e.g., GPS) Possible Control Commands Amini+ ICRA 2019
  • 109. End-to-End Framework for Autonomous Navigation Entire model trained end-to-end without any human labelling or annotations Amini+ ICRA 2019
  • 110. Automatic Colorization of Black and White Images
  • 111. Optimizing Images Post Processing Feature Optimization (Illumination) Post Processing Feature Optimization (Color Curves and Details) Post Processing Feature Optimization (Color Tone: Warmness)
  • 114. Breast Cancer Screening: CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms. Figure: breast cancer case missed by radiologist but detected by AI; comparison of AI vs. MD readers.
  • 115. Semantic Segmentation: Biomedical Image Analysis. Brain Tumors (Dong+ MIUA 2017). Malaria Infection (Soleimany+ arXiv 2019). Figure panels: Original, Ground Truth, Segmentation, Uncertainty.
  • 119. Deep Learning for Computer Vision: Summary Foundations • Why computer vision? • Representing images • Convolutions for feature extraction CNNs • CNN architecture • Application to classification • ImageNet Applications • Segmentation, image captioning, control • Security, medicine, robotics