Vision Transformer (ViT) Overview
• ViT splits an image into non-overlapping patches and applies self-attention.
• No convolutions used.
• Each P x P patch is flattened and embedded into a D-dimensional space.
• Fig: Patch Embedding Process
Mathematics of Patch Embedding
• Image I ∈ R^(H x W x C) → N = (H·W)/P^2 patches
• Each patch x_p ∈ R^(P^2·C) → z0_i = x_p W_e + b_e
• Add positional encoding: z0_i ← z0_i + E_pos (see the code sketch below)
• Fig: Patch + Positional Encoding
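A minimal patch-embedding sketch in PyTorch (an assumption; the slides name no framework). A convolution with kernel = stride = P is equivalent to flattening each P x P patch and applying the shared linear map W_e, b_e:

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2   # N = (H*W)/P^2
            # kernel = stride = P: flattens each patch and applies W_e, b_e
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # learnable positional encoding E_pos, one D-vector per patch
            self.pos = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

        def forward(self, x):                    # x: (B, C, H, W)
            z = self.proj(x)                     # (B, D, H/P, W/P)
            z = z.flatten(2).transpose(1, 2)     # (B, N, D)
            return z + self.pos                  # z0_i <- z0_i + E_pos

(The [CLS] token of the full ViT is omitted to keep the sketch focused on patch embedding.)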
Drawbacks of ViT
• 1. Requires large datasets (low inductive bias)
• 2. O(N^2) complexity due to full attention
• 3. No local spatial prior
• Fig: ViT Scaling Problem
Swin Transformer Overview
• Hierarchical ViT with shifted window attention
• Reduces computation while preserving locality
• Input: 224x224x3 → patchify (4x4) → 56x56x96 (see the shape check below)
• Fig: Swin Hierarchical Pipeline
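A quick shape check for the patchify stem; a 4x4 stride-4 convolution with 96 output channels is one common way to realize it (an assumption, not the only option):

    import torch
    import torch.nn as nn

    stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # 4x4 patches, dim 96
    x = torch.randn(1, 3, 224, 224)                    # 224x224x3 input
    print(stem(x).shape)   # torch.Size([1, 96, 56, 56]) -> 56x56x96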
Window-based Self-Attention
• Local attention on 7x7 windows → 49 tokens per window (sketched below)
• Attention(Q,K,V) = softmax(QK^T / sqrt(d)) V
• FLOPs per window ≈ 2 * Nw^2 * d
• Fig: Window Attention Computation
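A hedged sketch of window partitioning plus the attention formula above; window_partition and the Q = K = V shortcut (no learned projections) are illustrative simplifications, not the Swin reference code:

    import torch
    import torch.nn.functional as F

    def window_partition(x, M=7):
        # (B, H, W, d) -> (num_windows * B, M*M, d) non-overlapping windows
        B, H, W, d = x.shape
        x = x.view(B, H // M, M, W // M, M, d)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, d)

    def window_attention(x, M=7):
        windows = window_partition(x, M)        # Nw = M*M = 49 tokens each
        q = k = v = windows                     # linear projections omitted
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v    # softmax(QK^T / sqrt(d)) V

    x = torch.randn(1, 56, 56, 96)
    print(window_attention(x).shape)            # (64 windows, 49, 96)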
FLOPs Comparison
• Global ViT: full attention, O(N^2) ≈ 6.3x10^8 FLOPs
• Swin: local windows, O(M^2·N) ≈ 2.9x10^7 FLOPs
• ~20x reduction in computation (checked below).
• Fig: FLOPs Reduction Visualization
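A back-of-the-envelope check of these figures, assuming a 56x56 token grid, M = 7, d = 96, and the 2·Nw^2·d per-window estimate from the previous slide (projection FLOPs ignored):

    N, M, d = 56 * 56, 7, 96
    num_windows = N // (M * M)                  # 64 windows of 49 tokens
    flops_window = 2 * (M * M) ** 2 * d         # ~4.6e5 FLOPs per window
    flops_swin = num_windows * flops_window
    print(f"windowed attention: {flops_swin:.2e} FLOPs")          # ~2.95e7
    print(f"reduction vs global ViT: {6.3e8 / flops_swin:.0f}x")  # ~21x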
Shifted Windows
• Shift windows by (M/2, M/2) patches for cross-window information flow (see the sketch below).
• Enables overlapping receptive fields.
• Fig: Shifted vs Non-Shifted Windows
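A minimal sketch of the cyclic shift, assuming torch.roll realizes the (M/2, M/2) displacement; the attention mask Swin applies to wrapped-around regions is omitted:

    import torch

    M = 7
    shift = M // 2                                   # shift by 3 patches
    x = torch.randn(1, 56, 56, 96)                   # (B, H, W, d) features
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    # ... window attention runs on `shifted` here ...
    x = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))  # undo shift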
ConvNeXt Introduction
• Modernized CNN reintroducing convolutional inductive biases with Transformer-inspired design.
• Overcomes ViT limitations with efficient convolutions (block sketched below).
• Fig: ConvNeXt Overview
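A hedged sketch of the ConvNeXt-style block (7x7 depthwise conv, channels-last LayerNorm, inverted-bottleneck MLP); layer scale and stochastic depth from the paper are omitted for brevity:

    import torch
    import torch.nn as nn

    class ConvNeXtBlock(nn.Module):
        def __init__(self, dim=96):
            super().__init__()
            # 7x7 depthwise conv supplies the local spatial prior ViT lacks
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
            self.norm = nn.LayerNorm(dim)            # applied channels-last
            self.pwconv1 = nn.Linear(dim, 4 * dim)   # 4x expansion, as in ViT MLPs
            self.act = nn.GELU()
            self.pwconv2 = nn.Linear(4 * dim, dim)

        def forward(self, x):                        # x: (B, C, H, W)
            residual = x
            x = self.dwconv(x)
            x = x.permute(0, 2, 3, 1)                # -> (B, H, W, C)
            x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
            x = x.permute(0, 3, 1, 2)                # -> (B, C, H, W)
            return x + residual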