Big Bird - Transformers for Longer Sequences

Big Bird: Transformers for Longer Sequences
Deep Learning Paper Reading Group
NLP team: 문의현, 백지윤, 조진욱, 황경진
Presenter: 백지윤
Manzil Zaheer, Guru Guruganesh, Avinava Dubey
2020 NeurIPS

Contents
• 1 Introduction: Full attention in the Transformer

• 2 BIGBIRD Architecture

• 3 Theoretical Results about Sparse Attention Mechanism

• 4 Experiments & Results
• 5 Conclusion

1. Introduction - Transformer
Transformer's key principle: self-attention
[Figure: a softmax over the attention scores produces the weights α1, α2, α3]

[Figure: self-attention drawn as a "layer"; label a1]

Full-attention
• Drawbacks:
1) O(n^2) time and space complexity
2) As a consequence of 1), the input length (labelled Dmodel on the slides; 512 tokens) is also limited
Layer Type       Complexity per Layer   Sequential Operations   Maximum Path Length
Self-Attention   O(n^2 * d)             O(1)                    O(1)
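[Figure: Q x K attention matrix (queries Q1-Q4 against keys K1-K4) for the example sentence "나는 건물주가 되고 싶다" ("I want to become a building owner"), with the input length labelled Dmodel; every query is paired with every key, so the cost is quadratic]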
• 1) Is there a way to keep the advantages of full attention while reducing the number of inner-product computations?
• 2) Can the sparse attention from 1) also retain the advantages of full attention (expressivity, flexibility)?
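To make the quadratic cost concrete, here is a minimal sketch of full attention in plain Python. It assumes scaled dot-product scores followed by a softmax; the function names `full_attention_weights` and `count_inner_products` are made up for this illustration and do not come from the slides.

```python
import math

def full_attention_weights(queries, keys):
    """Return the n x n matrix of softmax attention weights.

    queries, keys: lists of n vectors (lists of floats), all of dimension d.
    Assumption: scaled dot-product scores, as in the original Transformer.
    """
    d = len(queries[0])
    weights = []
    for q in queries:
        # One inner product per (query, key) pair: n * n products overall.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def count_inner_products(n):
    """Number of query-key inner products full attention needs for n tokens."""
    return n * n
```

Going from 512 to 4096 tokens multiplies `count_inner_products` by 64, which is the growth the two questions above are trying to avoid.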

Related Work
• Accept the input-length limit (Dmodel; 512 tokens) and repeatedly select smaller, relevant contexts: SpanBERT, ORQA, REALM, RAG
• Reduce the quadratic computation itself: Reformer, Longformer
- Random attention
- Window attention
- Global attention
- BIGBIRD
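The three patterns listed above can be combined into a single attention mask. The sketch below is a simplified token-level version (the function name, parameter names, and defaults are all hypothetical choices for this illustration; an actual implementation may work on blocks of tokens rather than single tokens).

```python
import random

def bigbird_style_mask(n, window=1, num_global=1, num_random=1, seed=0):
    """Return an n x n 0/1 mask; mask[i][j] == 1 means query i attends to key j."""
    rng = random.Random(seed)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        # Window attention: keys within `window` positions on each side.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = 1
        # Random attention: a few randomly chosen keys per query.
        for j in rng.sample(range(n), min(num_random, n)):
            mask[i][j] = 1
    # Global attention: the first `num_global` tokens attend to every token
    # and are attended to by every token.
    for g in range(min(num_global, n)):
        for j in range(n):
            mask[g][j] = 1
            mask[j][g] = 1
    return mask
```

With fixed `window`, `num_global`, and `num_random`, each non-global row has a constant number of ones, so the total number of computed pairs grows linearly in `n` instead of quadratically.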

Big Bird
• activation: softmax
• N(i): the neighbors of token i; attention is computed only between adjacent (connected) tokens
• H: # of heads
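For reference, the symbols above fit the generalized attention of the Big Bird paper. The formula below is a reconstruction in the paper's notation, not text recovered from the slide; it assumes σ is the softmax activation and Q_h, K_h, V_h are the query, key, and value functions of head h.

```latex
\mathrm{ATTN}_D(X)_i = x_i + \sum_{h=1}^{H} \sigma\!\left( Q_h(x_i)\, K_h\!\left(X_{N(i)}\right)^{\top} \right) \cdot V_h\!\left(X_{N(i)}\right)
```

When N(i) contains every token, this reduces to ordinary full attention.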
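[Figure: sparse Q x K matrix (queries Q1-Q4 against keys K1-K4) for the tokens 나는 / 건물주가 / 되고 / 싶다 ("I want to become a building owner"); entries marked 1 are computed and entries marked 0 are skipped]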
Condition 1. The average path length between nodes (tokens) must be short
Condition 2. Notion of locality
<graph sparsification problem>

1. small average path length
• p: each edge is included independently with probability p
Erdos-Renyi model
Rationale 1. Shortest path lengths are bounded by O(log n)
“The average distances in random graphs with given expected degrees”
“Distribution of shortest path lengths in subcritical Erdos-Renyi networks”
Rationale 2. Rapid mixing time
A significant gap between the first and second eigenvalues -> rapid mixing time for random walks -> information travels quickly between any pair of nodes
“Extremal eigenvalues of critical Erdos-Renyi graphs”
“Spectral radii of sparse random matrices”
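The short-path claim can be checked empirically. The sketch below builds an Erdos-Renyi graph and measures its average shortest path length with breadth-first search; the function names and the `seed` parameter are illustrative choices, not part of the cited papers.

```python
import random
from collections import deque

def erdos_renyi(n, p, seed=0):
    """Undirected random graph: each of the n*(n-1)/2 edges is kept with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def average_path_length(adj):
    """Mean BFS distance over all ordered pairs (u, v), u != v, that are connected."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")
```

One way to use it is to call `average_path_length(erdos_renyi(n, 2 * math.log(n) / n))` for growing n (after `import math`) and compare the result with log n; the choice p = 2 log(n) / n is only an assumption that keeps the graph sparse yet likely connected.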

2. Notion of locality
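• NLP, computational biology, and similar fields deal with sequential data, so preserving the proximity of neighboring tokens is very important
• A way to measure locality among neighboring tokens: the clustering coefficient (the probability that the neighbors of a given node are connected to each other)
clustering coefficient
• v: a given node
• Kv: the degree of that node
• Nv: the number of edges among that node's neighbors
• cc(v) = 2 * Nv / (Kv * (Kv - 1))
Example: (2 * 0) / (4 * 3) = 0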
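The formula translates directly into code. In this sketch the graph is an adjacency dict of sets, which is an assumed representation, and returning 0 for nodes with fewer than two neighbors is a convention chosen here to avoid dividing by zero.

```python
def clustering_coefficient(adj, v):
    """cc(v) = 2 * Nv / (Kv * (Kv - 1)) for an undirected graph given as {node: set(neighbors)}."""
    neighbors = adj[v]
    k_v = len(neighbors)
    if k_v < 2:
        return 0.0  # convention chosen here: undefined cases count as 0
    # Nv: edges whose two endpoints are both neighbors of v, each counted once.
    n_v = sum(1 for a in neighbors for b in neighbors if a < b and b in adj[a])
    return 2 * n_v / (k_v * (k_v - 1))

# A node with four neighbors, none of them linked to each other,
# matches the arithmetic above: (2 * 0) / (4 * 3) = 0.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
```

Here `clustering_coefficient(star, 0)` gives 0.0, while every node of `triangle` gives 1.0 because its two neighbors are connected.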
small-world graphs (Watts and Strogatz)
https://uoguelph-engg3130.github.io/engg3130/lectures/lecture08.html

small-world graphs
https://uoguelph-engg3130.github.io/engg3130/lectures/lecture08.html
1. Build a graph of N nodes in which each node has w neighbors in total, w/2 on each side
ex. 10 nodes, 4 neighbors
2. Add a little randomness
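The two steps can be sketched as follows. The rewiring probability `beta` and the `seed` argument are hypothetical parameters standing in for "a little randomness"; the slides do not specify how much randomness is added.

```python
import random

def ring_lattice(n, w):
    """Step 1: n nodes on a ring, each linked to its w/2 nearest neighbors on each side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for offset in range(1, w // 2 + 1):
            u = (v + offset) % n
            adj[v].add(u)
            adj[u].add(v)
    return adj

def rewire(adj, beta, seed=0):
    """Step 2: with probability beta, move one end of each edge to a random new node."""
    rng = random.Random(seed)
    out = {v: set(nbrs) for v, nbrs in adj.items()}
    edges = [(v, u) for v in sorted(adj) for u in sorted(adj[v]) if v < u]
    for v, u in edges:
        if rng.random() < beta:
            # Candidates exclude v itself and nodes already linked to v.
            candidates = [t for t in out if t != v and t not in out[v]]
            if candidates:
                t = rng.choice(candidates)
                out[v].remove(u)
                out[u].remove(v)
                out[v].add(t)
                out[t].add(v)
    return out
```

`ring_lattice(10, 4)` reproduces the "10 nodes, 4 neighbors" example; a small `beta` keeps most local edges, which preserves clustering, while the few rewired edges act as shortcuts that shorten paths.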

Theoretical Results about Sparse Attention Mechanism

1. Universal Approximators
Universal Approximation Theorem
https://www.youtube.com/watch?v=vnkGn4r62Q8
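: the theory that a NN with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden units
Ex.
[Figure: a bump of height 3 and width 0.1 on the x axis, built as 3 * (f(x) - f(x - 0.1)) from two hidden units (input weights 1 and 1, biases b=0 and b=0.1) whose outputs are weighted 3 and -3]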
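The bump construction in the example can be written out in a few lines. This sketch assumes f is a hard threshold (step) activation, which the slide does not state; `approximate` is a hypothetical helper showing how many such bumps add up to a piece-wise constant approximation.

```python
def step(x, b=0.0):
    """Threshold activation f: 1 when x >= b, else 0 (assumed form of f)."""
    return 1.0 if x >= b else 0.0

def bump(x, left=0.0, width=0.1, height=3.0):
    """height * (f(x - left) - f(x - left - width)): `height` on [left, left + width), 0 elsewhere."""
    return height * step(x, left) - height * step(x, left + width)

def approximate(func, x, lo=0.0, hi=1.0, width=0.1):
    """Sum of bumps, each as tall as func at the left edge of its interval."""
    n_bumps = round((hi - lo) / width)
    return sum(bump(x, lo + i * width, width, func(lo + i * width)) for i in range(n_bumps))
```

With the defaults, `bump` is the slide's 3 * (f(x) - f(x - 0.1)): two hidden units with biases 0 and 0.1 and output weights 3 and -3. Shrinking `width` in `approximate` adds more hidden units and tightens the approximation.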

“Any sparse attention mechanism that contains a star graph can be a universal approximator”
• F_CD: a function space that is permutation equivariant and bounded rather than unbounded
f: [0,1]^(n x d) -> ℝ^(n x d) (n: # of tokens, d: d-dimensional embeddings)
• T_D^(H,m,q): H = # of heads, m = head size, q = hidden layer dim
• d_p(f, g): the ℓp-norm distance between f and g
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>

STEP 1. Approximate F_CD by piece-wise constant functions using Feed Forward
• F_CD: a function space that is permutation equivariant and bounded rather than unbounded
[0,1]^(n x d) -> G = {0, δ, 2δ, …, 1-δ}^(n x d)
Delta cubes
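The mapping onto the δ-grid amounts to rounding each input entry down to the nearest multiple of δ. The sketch below shows only that quantization, not the feed-forward construction from the proof; the function names are illustrative and δ is assumed to divide 1 evenly.

```python
import math

def quantize(x, delta):
    """Map x in [0, 1] to the grid {0, delta, 2*delta, ..., 1 - delta}."""
    cells = round(1 / delta)  # number of grid points per coordinate
    index = min(int(math.floor(x / delta)), cells - 1)  # x == 1 falls in the last cell
    return index * delta

def quantize_matrix(X, delta):
    """Apply quantize entry-wise to an n x d input given as a list of rows."""
    return [[quantize(x, delta) for x in row] for row in X]
```

Every input inside the same δ-cube maps to the same grid point, so a function defined on the grid is piece-wise constant on the original domain.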

STEP 2. Approximate piece-wise constant functions by modified transformers using sparse attention
• What are contextual mappings? Earlier work argued that Transformer attention acts as a universal approximator because it performs "contextual mapping"
• That earlier work proved this using the permutation-equivariant property of Transformer attention.
• Sparse attention is not full attention, so the same proof cannot be reused!
- sparse shift operator: shifts the entries that fall in a given range (the directed sparse attention graph D determines by how much)
- additional global token

STEP 3. Approximate modified transformers by original transformers using sparse attention

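2. Turing Completeness
• Turing complete: a programming language or abstract machine has the same computational power as a Turing machine
(Turing machine: a machine that operates on a special tape)
- Condition 1: it must support conditional branching (i.e. "if ~, then do what")
- Condition 2: it must have an arbitrarily large amount of memory
https://www.youtube.com/watch?v=RPQD7-AOjMI
=> Earlier papers showed that a Transformer with full attention is Turing complete. Building on them, the modified transformer is also proven to be Turing complete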
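As a small illustration of the two conditions, here is a toy Turing machine simulator. It has nothing to do with the construction used in the proofs; the transition table `FLIP_BITS`, the state names, and the `max_steps` safety cap are all made up for this example.

```python
from collections import defaultdict

def run_turing_machine(transitions, tape_input, start="scan", halt="halt",
                       blank="_", max_steps=10_000):
    """Run a one-tape machine; transitions maps (state, symbol) -> (write, move, next_state)."""
    tape = defaultdict(lambda: blank)  # Condition 2: the tape grows on demand
    for i, symbol in enumerate(tape_input):
        tape[i] = symbol
    state, head = start, 0
    for _ in range(max_steps):  # cap added only so the demo always terminates
        if state == halt:
            break
        # Condition 1: the next action branches on the current state and symbol.
        write, move, state = transitions[(state, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape)).strip(blank)

# Example machine: flip every bit, then halt at the first blank cell.
FLIP_BITS = {
    ("scan", "0"): ("1", "R", "scan"),
    ("scan", "1"): ("0", "R", "scan"),
    ("scan", "_"): ("_", "R", "halt"),
}
```

Calling `run_turing_machine(FLIP_BITS, "0110")` walks right across the tape, inverting each bit until it reads a blank.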

3. Limitations
In the worst case, sparse attention needs as many layers as the input sequence is long

NLP - Pretraining and MLM
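• ITC and ETC versions of BIGBIRD are built and pretrained (this includes predicting a random subset of masked tokens)
• Four standard datasets are used for pretraining
• Batch size is set to 32-64
• Maximum document length is increased from 512 tokens to 4096 tokens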

Genomics - pretraining and MLM

Genomics - promoter region prediction & Chromatin-Profile prediction

Conclusion
• BIGBIRD achieves SOTA on many tasks, including QA and classification
• The genomics results suggest it can also be applied to fields beyond NLP
