The document introduces the H-Transformer-1D model for fast, one-dimensional hierarchical attention on sequences. It begins by discussing the self-attention mechanism in Transformers and how it has achieved state-of-the-art results across many tasks. However, self-attention has a computational complexity of O(n²), because it computes attention scores between every pair of tokens and therefore materializes an n × n matrix, which becomes a bottleneck for long sequences. The document then reviews related work that aims to reduce this complexity through techniques such as sparse attention. It proposes borrowing the H-matrix and multigrid methods from numerical analysis to hierarchically decompose the attention matrix, so that most of it can be treated at coarse resolution and the effective structure becomes sparse. The following sections explain how this idea is applied in H-Transformer-1D and how it can be implemented.
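To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of vanilla softmax attention (single head, no masking or batching, names chosen for illustration). It is not the H-Transformer-1D algorithm; it simply shows that standard attention forms an n × n score matrix, which is the O(n²) cost the paper sets out to avoid.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention: forms the full n x n score matrix,
    so both time and memory scale as O(n^2) in sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output

# Illustrative sizes only: a 1024-token sequence already needs a 1024 x 1024 matrix.
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
```

Hierarchical approaches like the one described later avoid ever building this full matrix by approximating the far-off-diagonal interactions at progressively coarser resolution.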