4. Introduction
- Patch Embedder
: Takes a discrete input sequence,
embeds each element, and chunks the
embeddings into patches of a fixed length.
- Global Module
: A large autoregressive transformer that
contextualizes patch representations
by performing self-attention over previous patches.
- Local Module
: A small transformer that takes a
contextualized patch representation from
the global model as input and autoregressively
predicts the bytes of the next patch
(see the sketch after this list).
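To make the three-module pipeline concrete, here is a minimal PyTorch-style sketch. Everything in it (module names, layer counts, dimensions) is an illustrative assumption rather than the paper's reference implementation, and causal masks plus the local model's own byte embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not the paper's settings)
V, P, D_G, D_L = 256, 8, 512, 128  # byte vocab, patch size, global/local dims

class MegabyteSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D_G)               # patch embedder
        self.global_model = nn.TransformerEncoder(      # stand-in for the large
            nn.TransformerEncoderLayer(P * D_G, nhead=8, batch_first=True),
            num_layers=2)                               # causal masking omitted
        self.to_local = nn.Linear(D_G, D_L)             # project per-byte states
        self.local_model = nn.TransformerEncoder(       # small within-patch model
            nn.TransformerEncoderLayer(D_L, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(D_L, V)                   # next-byte logits

    def forward(self, x):              # x: (B, T) byte ids, T divisible by P
        B, T = x.shape
        K = T // P                                      # number of patches
        h = self.embed(x).reshape(B, K, P * D_G)        # chunk bytes into patches
        g = self.global_model(h)                        # contextualize patches
        g = self.to_local(g.reshape(B, K, P, D_G))      # per-byte conditioning
        out = self.local_model(g.reshape(B * K, P, D_L))  # predict within patch
        return self.head(out).reshape(B, K, P, V)

logits = MegabyteSketch()(torch.randint(0, V, (2, 64)))  # shape (2, 8, 8, 256)
```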
5. Components
1. Patch Embedder
Byte sequence: x_{0..T}
Patch size: P
Embedding dimension: D_G
Patch length (number of patches): K = T/P

h_t^{embed} = E^{global-embed}_{x_t} + E^{pos}_t   <- byte embedding
Reshape the T byte embeddings into K patches of dimension P \cdot D_G
h_k^{global-in} = E^{global-pad} if k = 0, else h^{embed}_{((k-1) \cdot P):(k \cdot P)}   <- input to the global model (shifted right by one patch for autoregressive modeling)
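The embedder and the patch-shift step can be sketched in a few lines of PyTorch; the tensor names mirror the formulas above, but the sizes and the flattened P·D_G padding parameter are assumptions for illustration:

```python
import torch
import torch.nn as nn

V, P, D_G, T, B = 256, 8, 512, 64, 2       # illustrative sizes (assumptions)
embed = nn.Embedding(V, D_G)               # E^{global-embed}
pos = nn.Embedding(T, D_G)                 # E^{pos}
global_pad = nn.Parameter(torch.zeros(1, 1, P * D_G))  # E^{global-pad}, flattened

x = torch.randint(0, V, (B, T))                        # byte sequence x_{0..T}
h = embed(x) + pos(torch.arange(T))                    # h_t^{embed}
h = h.reshape(B, T // P, P * D_G)                      # K patches of dim P*D_G
# Shift right by one patch so patch k is predicted only from patches < k:
h_in = torch.cat([global_pad.expand(B, 1, -1), h[:, :-1]], dim=1)
print(h_in.shape)  # torch.Size([2, 8, 4096]) -> input to the global model
```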
9. Motivation
- Increasing Parameters for Fixed Compute
: Operating on patches rather than individual tokens allows much larger
feedforward layers for the same compute budget.
- Re-use of Established Components
: Building on the standard transformer increases the likelihood that the
architecture will inherit its desirable scaling properties.
10. Efficiency Analysis
Training Efficiency (self-attention cost)
Sequence length: T
- Vanilla transformer: O(T^2)
- Sparse Transformer: O(T^{3/2})
- Routing Transformer: O(T^{3/2})
MEGABYTE: sequence length T, patch size P, patch length K = T/P
- Global model: attention over T/P patches -> O(T^2 / P^2)
- Local model: T/P windows of attention over P bytes -> O(T \cdot P)
- MEGABYTE (overall): O(T^2 / P^2 + T \cdot P), subquadratic for 1 < P < T
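A quick back-of-the-envelope script (plain Python, constant factors ignored) shows why this matters at million-byte scale; choosing P near T^{1/3} balances the global and local terms:

```python
T = 1_000_000            # sequence length in bytes
P = round(T ** (1 / 3))  # patch size balancing the two MEGABYTE terms -> 100

vanilla = T ** 2                    # O(T^2) full self-attention
megabyte = (T / P) ** 2 + T * P     # O(T^2/P^2 + T*P)

print(f"vanilla : {vanilla:.1e}")   # 1.0e+12
print(f"megabyte: {megabyte:.1e}")  # 2.0e+08, i.e. roughly O(T^{4/3})
```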
11. Efficiency Analysis
Feedforward Layers
In the GPT-3 architecture, the quadratic self-attention computation accounts for only 1.4% of FLOPS;
most FLOPS are consumed by the feedforward layers.
Number of non-embedding parameters: m
Sequence length: T
Transformer forward pass: \approx 2mT FLOPS
MEGABYTE
- Global model parameters: m_g (applied once per patch, i.e. T/P times)
- Local model parameters: m_l (applied at every byte position)
- Forward pass: \approx 2T(m_g / P + m_l) FLOPS
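To see how this buys extra parameters for fixed compute, here is a small numeric check of the two formulas; the parameter counts below are illustrative assumptions, not the paper's configurations:

```python
T = 1_000_000               # sequence length in bytes
P = 8                       # patch size
m = 350e6                   # baseline transformer non-embedding params (assumed)
m_g, m_l = 1.3e9, 218e6     # global/local model sizes (assumed)

baseline = 2 * m * T                 # ~2mT FLOPS
megabyte = 2 * T * (m_g / P + m_l)   # global runs once per patch, local per byte

print(f"baseline: {baseline:.2e}")   # 7.00e+14
print(f"megabyte: {megabyte:.2e}")   # 7.61e+14 -> ~4x more params, similar FLOPS
```

At roughly the same forward-pass cost, the MEGABYTE configuration above spends most of its parameters in the global model, which is only applied once per patch.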
16. Conclusion
- State-of-the-art among byte-level models across a range of tasks and modalities
- Enables training large models on sequences of over 1M bytes
- Gives language modeling results competitive with subword models,
which may allow byte-level models to replace tokenization
- Limitation: experiments were run at a scale far below that of SOTA LLMs