4. Introduction
- Patch Embedder
: Takes a discrete input sequence,
embeds each element, and chunks the
embeddings into patches of a fixed length.
- Global Module
: A large autoregressive transformer that
contextualizes patch representations
by performing self-attention over previous patches.
- Local Module
: A small transformer that takes a
contextualized patch representation from
the global model as input and autoregressively
predicts the bytes of the next patch
(see the sketch after this list).
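To make the three-module pipeline concrete, here is a minimal PyTorch-style sketch. Everything in it (module names, layer counts, dimensions) is an illustrative assumption rather than the paper's reference implementation, and causal masks plus the local model's own byte embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not the paper's settings)
V, P, D_G, D_L = 256, 8, 512, 128  # byte vocab, patch size, global/local dims

class MegabyteSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D_G)               # patch embedder
        self.global_model = nn.TransformerEncoder(      # stand-in for the large
            nn.TransformerEncoderLayer(P * D_G, nhead=8, batch_first=True),
            num_layers=2)                               # causal masking omitted
        self.to_local = nn.Linear(D_G, D_L)             # project per-byte states
        self.local_model = nn.TransformerEncoder(       # small within-patch model
            nn.TransformerEncoderLayer(D_L, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(D_L, V)                   # next-byte logits

    def forward(self, x):              # x: (B, T) byte ids, T divisible by P
        B, T = x.shape
        K = T // P                                      # number of patches
        h = self.embed(x).reshape(B, K, P * D_G)        # chunk bytes into patches
        g = self.global_model(h)                        # contextualize patches
        g = self.to_local(g.reshape(B, K, P, D_G))      # per-byte conditioning
        out = self.local_model(g.reshape(B * K, P, D_L))  # predict within patch
        return self.head(out).reshape(B, K, P, V)

logits = MegabyteSketch()(torch.randint(0, V, (2, 64)))  # shape (2, 8, 8, 256)
```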
5. Components
1. Patch Embedder
Byte sequence: x_{0..T}
Patch size: P
Embedding dimension: D_G
Patch length (number of patches): K = T/P

h_t^{embed} = E^{global-embed}_{x_t} + E^{pos}_t   <- byte embedding
Reshape the T byte embeddings into K patches of dimension P \cdot D_G
h_k^{global-in} = E^{global-pad} if k = 0, else h^{embed}_{((k-1) \cdot P):(k \cdot P)}   <- input to the global model (shifted right by one patch for autoregressive modeling)
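The embedder and the patch-shift step can be sketched in a few lines of PyTorch; the tensor names mirror the formulas above, but the sizes and the flattened P·D_G padding parameter are assumptions for illustration:

```python
import torch
import torch.nn as nn

V, P, D_G, T, B = 256, 8, 512, 64, 2       # illustrative sizes (assumptions)
embed = nn.Embedding(V, D_G)               # E^{global-embed}
pos = nn.Embedding(T, D_G)                 # E^{pos}
global_pad = nn.Parameter(torch.zeros(1, 1, P * D_G))  # E^{global-pad}, flattened

x = torch.randint(0, V, (B, T))                        # byte sequence x_{0..T}
h = embed(x) + pos(torch.arange(T))                    # h_t^{embed}
h = h.reshape(B, T // P, P * D_G)                      # K patches of dim P*D_G
# Shift right by one patch so patch k is predicted only from patches < k:
h_in = torch.cat([global_pad.expand(B, 1, -1), h[:, :-1]], dim=1)
print(h_in.shape)  # torch.Size([2, 8, 4096]) -> input to the global model
```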
9. Motivation
- Increasing Parameters for Fixed Compute
: Operating on patches rather than individual tokens allows much larger
feedforward layers for the same compute budget.
- Re-use of Established Components
: Building on the standard transformer increases the likelihood that the
architecture will inherit its desirable scaling properties.
10. Efficiency Analysis
Training Efficiency (self-attention cost)
Sequence length: T
- Vanilla transformer: O(T^2)
- Sparse Transformer: O(T^{3/2})
- Routing Transformer: O(T^{3/2})
MEGABYTE: sequence length T, patch size P, patch length K = T/P
- Global model: attention over T/P patches -> O(T^2 / P^2)
- Local model: T/P windows of attention over P bytes -> O(T \cdot P)
- MEGABYTE (overall): O(T^2 / P^2 + T \cdot P), subquadratic for 1 < P < T
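A quick back-of-the-envelope script (plain Python, constant factors ignored) shows why this matters at million-byte scale; choosing P near T^{1/3} balances the global and local terms:

```python
T = 1_000_000            # sequence length in bytes
P = round(T ** (1 / 3))  # patch size balancing the two MEGABYTE terms -> 100

vanilla = T ** 2                    # O(T^2) full self-attention
megabyte = (T / P) ** 2 + T * P     # O(T^2/P^2 + T*P)

print(f"vanilla : {vanilla:.1e}")   # 1.0e+12
print(f"megabyte: {megabyte:.1e}")  # 2.0e+08, i.e. roughly O(T^{4/3})
```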
11. Efficiency Analysis
Feedforward Layers
In the GPT-3 architecture, the quadratic self-attention computation accounts for only 1.4% of FLOPS;
most FLOPS are consumed by the feedforward layers.
Number of non-embedding parameters: m
Sequence length: T
Transformer forward pass: \approx 2mT FLOPS
MEGABYTE
- Global model parameters: m_g (applied once per patch, i.e. T/P times)
- Local model parameters: m_l (applied at every byte position)
- Forward pass: \approx 2T(m_g / P + m_l) FLOPS
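To see how this buys extra parameters for fixed compute, here is a small numeric check of the two formulas; the parameter counts below are illustrative assumptions, not the paper's configurations:

```python
T = 1_000_000               # sequence length in bytes
P = 8                       # patch size
m = 350e6                   # baseline transformer non-embedding params (assumed)
m_g, m_l = 1.3e9, 218e6     # global/local model sizes (assumed)

baseline = 2 * m * T                 # ~2mT FLOPS
megabyte = 2 * T * (m_g / P + m_l)   # global runs once per patch, local per byte

print(f"baseline: {baseline:.2e}")   # 7.00e+14
print(f"megabyte: {megabyte:.2e}")   # 7.61e+14 -> ~4x more params, similar FLOPS
```

At roughly the same forward-pass cost, the MEGABYTE configuration above spends most of its parameters in the global model, which is only applied once per patch.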
16. Conclusion
- State-of-the-art among byte-level models across a range of tasks and modalities
- Enables training large models on sequences of over 1M bytes
- Gives language modeling results competitive with subword models,
which may allow byte-level models to replace tokenization
- Limitation: experiments were run at a scale far below that of SOTA LLMs