DaViT.pdf

1. Paper review: 2023.01.19. Uploaded on arXiv: April 2022.
2. Background
   • For Transformer models, global context modelling makes the computational complexity grow quadratically with the number of tokens.
   • This limits their ability to scale up to high-resolution scenarios.
   • Local attention over spatially local windows brings linear complexity, but at the cost of global contextual information.
   • It is therefore important to design an architecture that can capture global context while maintaining efficiency.
3. Introduction
   • An effective vision transformer architecture that captures global context while maintaining computational efficiency.
   • It exploits self-attention mechanisms with both "spatial tokens" and "channel tokens".
   • With spatial tokens, the spatial dimension defines the token scope and the channel dimension defines the token feature dimension.
   • With channel tokens, this is inverted: the channel dimension defines the token scope and the spatial dimension defines the token feature dimension (a minimal sketch follows this slide).
   • For both spatial and channel tokens, tokens are further grouped along the sequence direction to keep the complexity of the entire model linear.
   • These two self-attentions complement each other.
   • Since each channel token contains an abstract representation of the entire image, channel attention naturally captures global interactions and representations: all spatial positions are taken into account when computing attention scores between channels.
   • Spatial attention refines the local representations through fine-grained interactions across spatial locations, which in turn helps the global information modelling in channel attention.
   • DaViT achieves state-of-the-art performance on four different tasks with efficient computation.
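To make the spatial/channel inversion concrete, here is a minimal PyTorch sketch of channel group attention, written for this review rather than taken from the paper's code: channels act as tokens, the flattened spatial dimension acts as the feature dimension, and channels are split into groups so the cost stays linear in the spatial size. The class name, default group count, and scaling factor are illustrative assumptions.

```python
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    """Illustrative sketch: channels as tokens, spatial positions as features."""
    def __init__(self, dim, num_groups=8):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, P, C), P = H*W spatial positions
        B, P, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)    # each (B, P, C)

        def to_groups(t):                         # (B, P, C) -> (B, groups, Cg, P)
            return t.reshape(B, P, self.num_groups, C // self.num_groups).permute(0, 2, 3, 1)

        q, k, v = map(to_groups, (q, k, v))
        # Attention scores between channels: each score aggregates over all P
        # spatial positions, which is where the global context comes from.
        scale = (C // self.num_groups) ** -0.5    # scaling choice is illustrative
        attn = (q * scale) @ k.transpose(-2, -1)  # (B, groups, Cg, Cg)
        out = attn.softmax(dim=-1) @ v            # (B, groups, Cg, P)
        out = out.permute(0, 3, 1, 2).reshape(B, P, C)
        return self.proj(out)
```

For a 14×14 feature map with 256 channels, x has shape (B, 196, 256); each group's attention matrix is only Cg×Cg, independent of the number of spatial positions, which is why every channel token can attend using information from the whole image.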
4. Spatial and Channel Dual Attention
5. Attention
   • Standard global self-attention: complexity O(2P²C + 4PC²).
   • Spatial window-based self-attention: complexity O(2PPwC + 4PC²); linear in the spatial size P.
   • Channel group attention: complexity O(6PC²); linear in the spatial size P.
   Notation: P: number of spatial tokens, Pw: spatial tokens per window, C: number of channels, Nw: number of windows, Ng: number of channel groups, Cg: channels per group, Ch: channels per head. (A small numerical comparison follows.)
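As a sanity check on these formulas, the sketch below plugs in illustrative sizes (not taken from the paper) and prints rough operation counts for the three attention variants.

```python
# Rough operation counts from the complexity formulas on this slide.
# P = spatial tokens, Pw = tokens per window, C = channels; sizes are illustrative.
def global_attn(P, C):
    return 2 * P**2 * C + 4 * P * C**2

def window_attn(P, Pw, C):
    return 2 * P * Pw * C + 4 * P * C**2

def channel_group_attn(P, C):
    return 6 * P * C**2

C, Pw = 96, 7 * 7
for P in (56 * 56, 112 * 112):               # doubling the feature-map side length
    print(f"P={P:6d}  global={global_attn(P, C):.2e}  "
          f"window={window_attn(P, Pw, C):.2e}  "
          f"channel={channel_group_attn(P, C):.2e}")
```

Quadrupling P (doubling the feature-map side) multiplies the quadratic term of global attention by 16, while the window-based and channel-group variants grow only by a factor of 4.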
6. Dual Attention Block Architecture
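The block diagram itself is not reproduced here, so the sketch below shows one plausible reading of the dual attention block based on slides 3 and 5: a spatial window attention sub-block followed by a channel group attention sub-block, each with pre-normalisation, a residual connection, and an FFN. The sub-modules are passed in as assumed, user-supplied components; the exact norm and residual placement is an assumption, not the paper's implementation.

```python
import torch.nn as nn

class DualAttentionBlockSketch(nn.Module):
    """Illustrative ordering only: spatial window attention, then channel group attention."""
    def __init__(self, dim, window_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        # window_attention and channel_attention are assumed, user-supplied modules
        # operating on (B, P, C) token tensors.
        self.window_attention = window_attention
        self.channel_attention = channel_attention
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.ffn1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                               # x: (B, P, C)
        # Spatial sub-block: fine-grained interactions across spatial locations.
        x = x + self.window_attention(self.norms[0](x))
        x = x + self.ffn1(self.norms[1](x))
        # Channel sub-block: global interactions across channel tokens.
        x = x + self.channel_attention(self.norms[2](x))
        x = x + self.ffn2(self.norms[3](x))
        return x
```

Stacking such blocks alternates local spatial refinement with global channel mixing, which is the complementarity described on slide 3.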
7. Comparisons of efficiency vs. performance
8. Results – Image Classification and Semantic Segmentation
9. Results – Object Detection
