Vision Transformer (ViT) Overview
• ViT splits an image into non-overlapping patches and applies self-attention.
• No convolutions used.
• Each P x P patch is flattened and embedded into a D-dimensional space.
• Fig: Patch Embedding Process
Mathematics of Patch Embedding
• Image I ∈ R^(H x W x C) → N = (H·W)/P^2 patches
• Each patch x_p ∈ R^(P^2·C) → z0_i = x_p W_e + b_e
• Add positional encoding: z0_i ← z0_i + E_pos (see the code sketch below)
• Fig: Patch + Positional Encoding
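A minimal patch-embedding sketch in PyTorch (an assumption; the slides name no framework). A convolution with kernel = stride = P is equivalent to flattening each P x P patch and applying the shared linear map W_e, b_e:

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2   # N = (H*W)/P^2
            # kernel = stride = P: flattens each patch and applies W_e, b_e
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # learnable positional encoding E_pos, one D-vector per patch
            self.pos = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

        def forward(self, x):                    # x: (B, C, H, W)
            z = self.proj(x)                     # (B, D, H/P, W/P)
            z = z.flatten(2).transpose(1, 2)     # (B, N, D)
            return z + self.pos                  # z0_i <- z0_i + E_pos

(The [CLS] token of the full ViT is omitted to keep the sketch focused on patch embedding.)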
Drawbacks of ViT
• 1. Requires large datasets (low inductive bias)
• 2. O(N^2) complexity due to full attention
• 3. No local spatial prior
• Fig: ViT Scaling Problem
Swin Transformer Overview
• Hierarchical ViT with shifted window attention
• Reduces computation while preserving locality
• Input: 224x224x3 → patchify (4x4) → 56x56x96 (see the shape check below)
• Fig: Swin Hierarchical Pipeline
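A quick shape check for the patchify stem; a 4x4 stride-4 convolution with 96 output channels is one common way to realize it (an assumption, not the only option):

    import torch
    import torch.nn as nn

    stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # 4x4 patches, dim 96
    x = torch.randn(1, 3, 224, 224)                    # 224x224x3 input
    print(stem(x).shape)   # torch.Size([1, 96, 56, 56]) -> 56x56x96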
Window-based Self-Attention
• Local attention on 7x7 windows → 49 tokens per window (sketched below)
• Attention(Q,K,V) = softmax(QK^T / sqrt(d)) V
• FLOPs per window ≈ 2 * Nw^2 * d
• Fig: Window Attention Computation
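A hedged sketch of window partitioning plus the attention formula above; window_partition and the Q = K = V shortcut (no learned projections) are illustrative simplifications, not the Swin reference code:

    import torch
    import torch.nn.functional as F

    def window_partition(x, M=7):
        # (B, H, W, d) -> (num_windows * B, M*M, d) non-overlapping windows
        B, H, W, d = x.shape
        x = x.view(B, H // M, M, W // M, M, d)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, d)

    def window_attention(x, M=7):
        windows = window_partition(x, M)        # Nw = M*M = 49 tokens each
        q = k = v = windows                     # linear projections omitted
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v    # softmax(QK^T / sqrt(d)) V

    x = torch.randn(1, 56, 56, 96)
    print(window_attention(x).shape)            # (64 windows, 49, 96)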
FLOPs Comparison
• Global ViT: full attention, O(N^2) ≈ 6.3x10^8 FLOPs
• Swin: local windows, O(M^2·N) ≈ 2.9x10^7 FLOPs
• ~20x reduction in computation (checked below).
• Fig: FLOPs Reduction Visualization
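A back-of-the-envelope check of these figures, assuming a 56x56 token grid, M = 7, d = 96, and the 2·Nw^2·d per-window estimate from the previous slide (projection FLOPs ignored):

    N, M, d = 56 * 56, 7, 96
    num_windows = N // (M * M)                  # 64 windows of 49 tokens
    flops_window = 2 * (M * M) ** 2 * d         # ~4.6e5 FLOPs per window
    flops_swin = num_windows * flops_window
    print(f"windowed attention: {flops_swin:.2e} FLOPs")          # ~2.95e7
    print(f"reduction vs global ViT: {6.3e8 / flops_swin:.0f}x")  # ~21x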
Shifted Windows
• Shift windows by (M/2, M/2) patches for cross-window information flow (see the sketch below).
• Enables overlapping receptive fields.
• Fig: Shifted vs Non-Shifted Windows
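A minimal sketch of the cyclic shift, assuming torch.roll realizes the (M/2, M/2) displacement; the attention mask Swin applies to wrapped-around regions is omitted:

    import torch

    M = 7
    shift = M // 2                                   # shift by 3 patches
    x = torch.randn(1, 56, 56, 96)                   # (B, H, W, d) features
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    # ... window attention runs on `shifted` here ...
    x = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))  # undo shift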
ConvNeXt Introduction
• Modernized CNN reintroducing convolutional inductive biases with Transformer-inspired design.
• Overcomes ViT limitations with efficient convolutions (block sketched below).
• Fig: ConvNeXt Overview
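A hedged sketch of the ConvNeXt-style block (7x7 depthwise conv, channels-last LayerNorm, inverted-bottleneck MLP); layer scale and stochastic depth from the paper are omitted for brevity:

    import torch
    import torch.nn as nn

    class ConvNeXtBlock(nn.Module):
        def __init__(self, dim=96):
            super().__init__()
            # 7x7 depthwise conv supplies the local spatial prior ViT lacks
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
            self.norm = nn.LayerNorm(dim)            # applied channels-last
            self.pwconv1 = nn.Linear(dim, 4 * dim)   # 4x expansion, as in ViT MLPs
            self.act = nn.GELU()
            self.pwconv2 = nn.Linear(4 * dim, dim)

        def forward(self, x):                        # x: (B, C, H, W)
            residual = x
            x = self.dwconv(x)
            x = x.permute(0, 2, 3, 1)                # -> (B, H, W, C)
            x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
            x = x.permute(0, 3, 1, 2)                # -> (B, C, H, W)
            return x + residual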