LLM에서 배우는 이미지 생성 모델 ZERO부터 학습하기 Training Large-Scale Diffusion Model from Scratch

LLM에서 배우는
이미지 생성 모델 ZERO부터
학습하기
김태훈 carpedm20
Training Large-Scale Diffusion Model from Scratch

김태훈
해킹, 슈퍼 컴퓨팅
학부 졸업 후 OpenAI 입사
엔지니어 3명으로 글로벌 40만명 유저를 모은 창업
SHIFT UP
carpedm20

김태훈
carpedm20
클라이밍 🧗, 샐러드 🥗 덕후
개발 한다고 등한시 했던 세상에 대해 알고 싶어서
<새로운 만남>이란 모임
1년 간 해커 하우스에서 지내면서 책 읽고 글 쓰고 사람 만나고
7년 이상 유의미했던 600여명과의 대화를 기록

Content
1. 현업에서 느끼는 이미지 생성 모델의 한계
2. LLM 학습에서 배우기
3. 8B Diffusion Model 학습을 하며 배운 점

원하는 그림체
디테일한 표현

하지만 이런 사례에도 불구하고

게임 회사에서 느끼는
Diffusion 모델의 한계

3712×4928
3456×3783 3712×4928

3712×4928
3712×4928
2201×2970

그림체 디테일
부족한
특색 없는

필요한 작업 시간
애매한 그림 고치기 직접 그리기
>

여전히 존재하는 수많은 문제

1. 영어의 어려움
2. 완성도 있는 그림이 잘 나오지 않는다
3. 특정 스타일로 잘 그리고 싶다
4. 같은 캐릭터 (얼굴, 코스튬) 의 다른 자세 그리기
5. 잘 그리지 못하고, 말로 표현하기 어려운 것들
• 악세서리
• 다양한 의상 재질
• 배경, 몬스터
• 빛의 방향, 구도
• 물체 간 관계 (총을 등에 매고 있거나, 양손으로 잡고 있는)

각 상황에 맞게 LoRA를 만들고
ControlNet를 상황 별로 찾는 방법엔 한계가 있다
악세서리
스타일
다양한 의상 재질
배경, 몬스터
빛의 방향, 구도
물체 간 관계

Foundation Model
= 좋은 시작점

Training Large-Scale
Diffusion Model from Scratch

Foundation Model 학습의
가장 큰 장벽

https://www.databricks.com/blog/stable-diffusion-2

어떻게 하면 효율적으로 학습할 수 있을까?

Diffusion보다 활발히 연구되고 있는
LLM 연구를 많이 참고

OpenAI에 있던 친구가 나와서 시작한 프로젝트
사실상 실패했지만 Compute이 부족한 팀에게 많은 배울 점을 공유함
*데이터에 너무 집중을 하지 않았음

OPT 팀이 실험 (+ 삽질) 한 100+개의 실험을 정리한 Logbook
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf

• 데이터가 많아진 만큼 모델 크기도 커져야 최선의 성능을 보장할 수 있다
• Compute이 제한된 상황에서 최적의 모델 및 데이터 크기를 계산할 수 있다
더 많은 데이터로 좋은 성능을 내는 작은 모델을 학습함

• 데이터가 많아진 만큼 모델 크기도 커져야 최선의 성능을 보장할 수 있다
• Compute이 제한된 상황에서 최적의 모델 및 데이터 크기를 예측할 수 있다

• PPO로 사람들이 더 선호하는 모델을 학습
• 비교가 생성보다 리소스가 적게 들기 때문
• Reward model을 학습해야 하나 DPO를 쓰면
복잡도가 줄어 듦 (less flexible)

• PPO로 사람들이 더 선호하는 모델을 학습
• 비교가 생성보다 리소스가 적게 들기 때문

• 좋은 퀄리티의 Synthetic 데이터는 WebCrawl 보다
훨씬 적은 양으로 좋은 모델을 학습 시킬 수 있다.
• 14 days on 96 A100 GPUs (천만원 이하)
ChatGPT & GPT-4

We remark that the experience gained in the process of creating the
training data for both phi-1 and phi-1.5 leads us to the conclusion that
the creation of a robust and comprehensive dataset demands more than
raw computational power:
It requires intricate iterations, strategic topic selection, and a deep
understanding of knowledge gaps to ensure quality and diversity of the
data. We speculate that the creation of synthetic datasets will become, in
the near future, an important technical skill and a central topic of research
in AI
Textbooks Are All You Need II: phi-1.5 technical report

We remark that the experience gained in the process of creating the
training data for both phi-1 and phi-1.5 leads us to the conclusion that the
creation of a robust and comprehensive dataset demands more than raw
computational power:
It requires intricate iterations, strategic topic selection, and a deep
understanding of knowledge gaps to ensure quality and diversity of the
data. We speculate that the creation of synthetic datasets will become, in
the near future, an important technical skill and a central topic of research
in AI
Textbooks Are All You Need II: phi-1.5 technical report
좋은 데이터 셋 만드는 거 개 어려움 😓
이거 님들도 필요할 텐데 🤣

LLM 연구에서 참고할 것

Learning rate schedule
모델, 배치 사이즈

안정적인 모델 학습을 위해 봐야하는 metric
Continual Learning
라지 모델 학습 시 발생하는 버그

안정적인 모델 학습을 위해 봐야하는 metric
Continual Learning
라지 모델 학습 시 발생하는 버그
좋은 Synthetic 데이터
만드는 법
모델 디자인

이 외에도
“ The final data mixtures and weights were
determined through ablations on smaller models.”
“ We stage training to alter the mixture composition during training -
increasing the weight of domain relevant data towards the end of training.”
Gemini: A Family of Highly Capable Multimodal Models

이 외에도
Gemini: A Family of Highly Capable Multimodal Models
Trained on large collection of:
Web documents, books, and code + image, audio, and video
Quality filtering for all datasets:
Heuristics + model-based classifiers
Final data mixtures/weights determined through ablation on smaller models
Increased weight of domain-relevant data towards the end of training

이 외에도
데이터가 제한된 상황에선
• 데이터를 반복 > 모델 크기 ↑
• 코드를 추가 > 기존 데이터 반복
• Perplexity 필터링 > De-duplication
Scaling Data-Constrained Language Models

LLM에서의 시행 착오를 바탕으로
Diffusion 모델 학습을 접근

• striking red hair
• horned black headpiece
• piercing blue eyes
• crimson lipstick
• sleek bodysuit
• black leather belts
• silver buckles
• gunmetal gray rifle
• imposing stance
• flowing red cape
• knee-high footwear
• dark background
• powerful posture
• detailed armor
• black fingerless gloves
• red accents
• light skin tone
• animated style
• sharp facial features
• glossy lip shine
• aggressive expression
• asymmetrical design
• thigh strap holster
• mechanical arm pieces
• crimson thigh-highs
• bright red scarf
• scarlet cape interior
• high-heeled boots
• compact shoulder pads
• reflective surfaces

Caption의 퀄리티가 최종 모델 성능에 매우 중요

DALL·E 3
1. 이미지 캡션 성능이 매우 중요
2. 5%의 짧은 캡션 95%의 디테일한 캡션으로 학습
3. T5 텍스트 인코더 사용
4. GPT-4로 사용자의 짧은 프롬프트를 upsample함
Improving Image Generation with Better Captions

• White naval cap
• Cerulean blue hair
• Military-inspired uniform
• Golden epaulettes
• Stern gaze
• Light blue eyes

GPT4-Vision
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

GPT4-Vision
You are a powerful image captioner.
Create detailed captions describing
the contents of the given image.
Include the object types and colors,
counting the objects, object actions,
precise object locations, texts,
doublechecking relative positions
between objects, etc.
Instead of describing the imaginary
content, only describing the content
one can determine confidently from
the image. Do not describe the
contents by itemizing them in list
form. Minimize aesthetic
descriptions as much as possible.

GPT4-Vision
• Naval officer's cap
• Stern gaze
• Light blue eyes
• White thigh-highs
• Naval insignia pins
• Anchored naval theme
• Golden sword handle
• High-waisted shorts
• Blue-striped white jacket
• Nautical skirt flap
• Battleship turret platforms
• Metallic heel decorations
• Flowing hair motion
• Immaculate white gloves
• Pearly skin tone
• Soft facial features
• White naval cap
• Silver belt buckle
• Delicate lace trim
• Floating hair ribbon
• Detailed jacket buttons
• Crisp ironed seams
• Intricate gold braiding
• Platform shoe design
• Elegant standing pose
• Sword sheath hanging
• Glossy shoe finish
• Contrasting shadows
• Softly lit contours
• Pastel color palette
• Graceful hand placement
• Ship deck floor
• Transparent background
• High visual fidelity
• Stylized anime character
• Minimalist backdrop
• Polished artistic rendering
<image, caption> dataset

Orca: Progressive Learning from Complex Explanation Traces of GPT-4
여러 개의 system instruction

모델이 어떤 가이드라인, 형식과 톤으로 답을 할 것인가를 정함

모델이 어떤 가이드라인, 형식과 톤으로 답을 할 것인가를 정함
Instruction Tuning
<User query, LFM response>
Explanation Tuning
<System message, User query, LFM response>
다양성, 단계별 추론을 유도

원하는 캡션을 유도할 수 있는
User instruction

원하는 캡션을 유도할 수 있는
User instruction
• Naval officer's cap
• Stern gaze
• Light blue eyes
• White thigh-highs
• Naval insignia pins
• Anchored naval theme
• Golden sword handle
• High-waisted shorts
• Blue-striped white jacket
• Nautical skirt flap
• Battleship turret platforms
• Metallic heel decorations
• Flowing hair motion
• Immaculate white gloves
• Pearly skin tone
• Soft facial features
• White naval cap

GPT4-Vision
Model Name Prompt Cost (USD) Completion Cost (USD) Max Prompt Tokens
gpt-4-vision-preview $0.00001000 $0.00003000 128000
gemini-pro-vision $0.00000020 $0.00000050 30720
Gemini-Vision
:
1 : 2
1 : 4
1 : 6
=

“ The final data mixtures were
determined through ablations
on smaller models.”
GPT4-Vision Gemini-Vision
:
1 : 2
1 : 4
1 : 6
=

ArtBook-50M Dataset
Dataset Volume Caption Misc
ArtBook-50M-2D 30M -
ArtBook-50M-3D 25M -
ArtBook-50M-AI 5M - This is not JourneyDB
ArtBook-GPT 4M GPT-4V 12 System Instruction
ArtBook-Gem 24M Gemini 12 System Instruction
ArtBook-50M 50M Raw

Data preprocessing
1. OCR로 글자 있는 사진 제거
2. Caption token 길이가 10 이하인 이미지 제거
3. Image size, aspect ratio 필터링
4. 특정 컨셉이 포함된 이미지 제거
5. Duplication removal using Milvus
6. Aesthetic score

학습을 하기 전에 몰랐던 것

Hardware Lottery in the Era of LLMs
“Not all hardware is created equal. The variance of cluster quality across
hardware providers is so high that it is literally a lottery pertaining to how
much pain one would have to go through to train good models. In short, a
hardware lottery in the era of LLMs.”
https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness

GPU가 제대로 돌아가는
서버인지 확인 필수

$ nvbandwidth -t device_to_device_memcpy_read_ce
https://github.com/NVIDIA/nvbandwidth

기본적인 H100 학습 성능 개선
• CUDA Version: 12.4
• Mixed Precision
• PyTorch 2.0, torch.compile()
• Fully Sharded Data Parallel
• 학습 데이터에 맞는 Custom Tokenizer
• PyTorch Profiler, Nsight Compute

Training Speed Error
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True
) as prof:
# 문제가 있던 코드
optimizer.step()
prof.export_chrome_trace("trace.json")

Correct CUDA Kernel execution
for Gradient Update

Analyzing and Improving
the Training Dynamics of Diffusion Models

Validate the model performance
• Validation Loss (w EMA)
• FID
• 학습 데이터의 distribution과 유사하나 leakage 막기
• CLIP Score
• Using GPT-4 (or LLaVA) as judge

Validate the model performance
• Validation Loss
• FID
• 학습 데이터의 distribution과 유사하나 leakage 막기
• CLIP Score
• Using GPT-4 (or LLaVA) as judge
• LLM에서 많이 쓰이는 방식
PIXART-Σ:Weak-to-Strong Training of Diﬀusion Transformer for 4K Text-to-Image GeneraMon

GPT4-Vision
As an AI visual assistant, you are analyzing
two specific images. Given a specific
caption, you need to judge which image
aligns with the caption more closely. Please
pay attention to the key information,
including object identities, properties,
spatial relationships, object numbers and
image style, etc.
The caption for the two images is: {caption}
Please respond me strictly in the following
format: <the first image is better> or <the
second image is better> or <The two images
are tied.>. The reason is <give your reason
here>.

GPT4-Vision
As an AI visual assistant, you are analyzing
two specific images. Given a specific
caption, you need to judge which image
aligns with the caption more closely. Please
pay attention to the key information,
including object identities, properties,
spatial relationships, object numbers and
image style, etc.
The caption for the two images is: {caption}
Please respond me strictly in the following
format: <the first image is better> or <the
second image is better> or <The two images
are tied.>. The reason is <give your reason
here>.
<the first image
is better>

PIXART-Σ:Weak-to-Strong Training of Diﬀusion Transformer for 4K Text-to-Image GeneraMon

A Log of Training
Large-Scale Diffusion Model
from Scratch

Domain-specific knowledge
Domain-specific AI

시각적 = Global
시장에 충분한 기회

시각적 = Global
관심이 적고

시각적 = Global
관심이 적고
관련 경험이 부재

회사 소개 : 개 요
회사개요
기업명
대표
설립일
게임
임직원
홈페이지
사업영역
소재지
시프트업
김형태
2013년 12월 2일
데스티니차일드 / 승리의 여신 : 니케
스텔라 블레이드
290여명
shiftup.co.kr
모바일게임 개발 및 서비스 / 콘솔 게임 개
발
서울시 강남구 서초동

SHIFT UP AI Labs
'코스피 상장' 시동 건 시프트업⋯기업가치 3조원에 쏠리는 눈
창업 10년 만에 코스피 상장을 추진하는 게임사 시프트업이 ‘제2의 크래프톤’이 될 수 있을지
업계의 관심이 쏠리고 있다.
시프트업 '니케', 누적 매출 1조원 돌파…
시프트업의 서브컬처 게임 ‘승리의 여신:니케’가 누적 매출 1조 원을 넘겼다. 2022년 11월 출시된
지 약 1년 4개월 만이다.

프로젝트 : 스텔라 블레이드
미국 경제지 포브스 “플레이스테이션 쇼케이스
작품 중 매우 놀라운 신작”으로 소개
회사소개 프로젝트 게임산업 시프트업채용
소니의 적극적인 구애로 국내 게임사 최초로 PS5
독점
3N도 도전하지 못했던 트리플 A급 액션 게임
2024 플레이스테이션 최고의 기대작 1위
출시일 발표 후 예약 구매 1위

시프트업의 성장 전략
NEXT
<승리의 여신 : 니케> 글로벌 점유율 확대 + 국내 최초의 AAA급 콘솔 IP <스텔라 블레이드> 개발 + 끊임 없는 신규 IP 제작
Project
글로벌 장기 흥행작과 AAA게임 IP를 보유한 게
임사
회사소개 프로젝트 게임산업 시프트업채용

생성 AI로
게임 제작 비용을 10배 줄이는 조직

Training
Large Image Synthesis Model
Large Multimodal Model
Video Synthesis Model
Audio Synthesis

좋은 논문 쓰기가 목표
산학 연구 기회
KAIST 인공지능 대학원
KAIST에서 전액 출자하여 기업에서
필요한 연구과제를 수행하기 위한
CCC회원사 전용 프로그램
포항공과대학교
산학협력단
OpenAI에서의
GPT-3, 강화 학습,
BlockSparse 연구 경험

저희와 함께 성장하고 싶다면
bit.ly/shiftup-ai

➡ https://bit.ly/ai-compiler-study

LLM에서 배우는 이미지 생성 모델 ZERO부터 학습하기 Training Large-Scale Diffusion Model from Scratch

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to LLM에서 배우는 이미지 생성 모델 ZERO부터 학습하기 Training Large-Scale Diffusion Model from Scratch

Similar to LLM에서 배우는 이미지 생성 모델 ZERO부터 학습하기 Training Large-Scale Diffusion Model from Scratch (10)

More from Taehoon Kim

More from Taehoon Kim (16)

LLM에서 배우는 이미지 생성 모델 ZERO부터 학습하기 Training Large-Scale Diffusion Model from Scratch