Learning from LLMs:
Training an Image Generation Model from Zero
Taehoon Kim (carpedm20)
Training Large-Scale Diffusion Model from Scratch
Taehoon Kim
Hacking, supercomputing
Joined OpenAI straight out of undergrad
Co-founded a startup that grew to 400K global users with three engineers
SHIFT UP
carpedm20
Taehoon Kim
carpedm20
Climbing 🧗 and salad 🥗 enthusiast
Ran a meetup called <새로운 만남> ("New Encounters") to learn about the world I had neglected while heads-down coding
Spent a year in a hacker house reading, writing, and meeting people
Logged conversations with some 600 people that stayed meaningful over seven-plus years
AI Compiler Study
Contents
1. The limits of image generation models as felt in production
2. Learning from LLM training
3. Lessons from training an 8B diffusion model
Diffusion models
Stable Diffusion
The desired art style
Detailed expression
But despite such examples,
the limits of diffusion models
as felt at a game company
[Game artwork samples at production resolutions such as 3456×3783, 2201×2970, and 3712×4928]
Stable Diffusion
Art style and detail: lacking, generic
Required work time: fixing an ambiguous image > drawing it yourself

Diffusion
Countless problems still remain:
1. English is a hurdle
2. Polished, production-quality images rarely come out
3. We want to draw well in a specific style
4. Drawing the same character (face, costume) in different poses
5. Things that are hard to draw and hard to put into words
• Accessories
• Varied clothing materials
• Backgrounds, monsters
• Light direction, composition
• Relations between objects (a gun slung on the back, or held in both hands)
Building a LoRA for each situation and hunting down a ControlNet case by case has its limits:
accessories, style, clothing materials, backgrounds and monsters, light direction and composition, object relations
Foundation Model
= a good starting point
Training a Large-Scale Diffusion Model from Scratch

The biggest barrier to training a foundation model
NVIDIA A100 Tensor Core GPU
https://www.databricks.com/blog/stable-diffusion-2
₩30 million per run × 10 runs = ₩300 million
How can we train it efficiently?
We drew heavily on LLM research, which is far more active than diffusion research
A project started by a friend who had left OpenAI
It effectively failed, but it shared many lessons for teams short on compute
*They did not focus enough on data
Continual Learning
A logbook documenting the 100+ experiments (and dead ends) the OPT team ran
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
• As the data grows, the model must grow with it to guarantee the best performance
• Under a fixed compute budget, the optimal model and data sizes can be predicted (a back-of-the-envelope sketch follows)
• They trained a smaller model to better performance simply by using more data
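As an illustration of that second point (my own sketch, not from the talk): with the common approximation that training FLOPs C ≈ 6·N·D for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter, the optimal sizes fall out directly.

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) using C = 6*N*D, D = r*N."""
    n = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

n, d = compute_optimal(1e23)
print(f"~{n/1e9:.0f}B params, ~{d/1e9:.0f}B tokens")  # ~29B params, ~577B tokens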
• PPO trains a model that people prefer more
• Because comparing outputs costs far less than generating them
• PPO needs a separately trained reward model; DPO drops it and reduces complexity, though it is less flexible (a sketch of the DPO objective follows)
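For reference, the DPO objective is compact enough to state in a few lines. This is a generic formulation on summed per-sequence log-probs (my own sketch, not code from the talk):

import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy's preference margin beyond the reference model's.
    Inputs are summed log-probs of the chosen/rejected completions per example."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()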
• High-quality synthetic data can train a good model with far less data than a web crawl
• 14 days on 96 A100 GPUs (under ₩10 million)
ChatGPT & GPT-4
“We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.”
Textbooks Are All You Need II: phi-1.5 technical report
Building a good dataset is damn hard 😓
You're all going to need this too 🤣
What to borrow from LLM research:
• Learning rate schedules
• Model and batch size
• Metrics to watch for stable training
• Continual learning
• Bugs that appear when training large models
• How to build good synthetic data
• Model design
Beyond that:
“The final data mixtures and weights were determined through ablations on smaller models.”
“We stage training to alter the mixture composition during training - increasing the weight of domain relevant data towards the end of training.”
Gemini: A Family of Highly Capable Multimodal Models
Gemini: A Family of Highly Capable Multimodal Models
• Trained on a large collection of web documents, books, and code, plus image, audio, and video
• Quality filtering for all datasets: heuristics + model-based classifiers
• Final data mixtures/weights determined through ablations on smaller models
• Increased weight of domain-relevant data towards the end of training

And when data is constrained:
• Repeating data beats growing the model
• Adding code beats repeating existing data
• Perplexity filtering beats de-duplication (sketch below)
Scaling Data-Constrained Language Models
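A minimal sketch of what perplexity filtering can look like in practice (my own illustration; the scoring model and the threshold are assumptions, not the paper's setup): score each document with a small language model and keep the ones a fluent model finds predictable.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    """exp(mean token cross-entropy) of `text` under the scoring LM."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    return float(lm(ids, labels=ids).loss.exp())

docs = ["..."]                                    # candidate training documents
kept = [d for d in docs if perplexity(d) < 50.0]  # threshold is illustrative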
Coming back to the point:
Foundation model = a good starting point
We approached diffusion model training armed with the trial and error from LLMs

Training
• striking red hair
• horned black headpiece
• piercing blue eyes
• crimson lipstick
• sleek bodysuit
• black leather belts
• silver buckles
• gunmetal gray rifle
• imposing stance
• flowing red cape
• knee-high footwear
• dark background
• powerful posture
• detailed armor
• black fingerless gloves
• red accents
• light skin tone
• animated style
• sharp facial features
• glossy lip shine
• aggressive expression
• asymmetrical design
• thigh strap holster
• mechanical arm pieces
• crimson thigh-highs
• bright red scarf
• scarlet cape interior
• high-heeled boots
• compact shoulder pads
• reflective surfaces
Caption quality is critical to the final model's performance
DALL·E 3
1. Image captioning quality matters enormously
2. Trained on 5% short captions and 95% detailed captions (a data-loader sketch follows)
3. Uses a T5 text encoder
4. GPT-4 upsamples the user's short prompt
Improving Image Generation with Better Captions
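The 5%/95% mix in point 2 is easy to reproduce in a data loader. A minimal sketch under my own naming (the short_caption/detailed_caption fields are assumptions, not DALL·E 3's actual pipeline):

import random

def pick_caption(example, p_short=0.05):
    """Mostly train on the detailed caption, occasionally the short one,
    so the model still responds well to terse prompts at inference time."""
    if random.random() < p_short:
        return example["short_caption"]
    return example["detailed_caption"]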
• White naval cap
• Cerulean blue hair
• Military-inspired uniform
• Golden epaulettes
• Stern gaze
• Light blue eyes
GPT4-Vision
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
You are a powerful image captioner. Create detailed captions describing the contents of the given image. Include the object types and colors, counting the objects, object actions, precise object locations, texts, doublechecking relative positions between objects, etc. Instead of describing the imaginary content, only describing the content one can determine confidently from the image. Do not describe the contents by itemizing them in list form. Minimize aesthetic descriptions as much as possible.
• Cerulean blue hair
• Military-inspired uniform
• Naval officer's cap
• Golden epaulettes
• Stern gaze
• Light blue eyes
• White thigh-highs
• Naval insignia pins
• Anchored naval theme
• Golden sword handle
• High-waisted shorts
• Blue-striped white jacket
• Nautical skirt flap
• Battleship turret platforms
• Metallic heel decorations
• Flowing hair motion
• Immaculate white gloves
• Pearly skin tone
• Soft facial features
• White naval cap
• Silver belt buckle
• Delicate lace trim
• Floating hair ribbon
• Detailed jacket buttons
• Crisp ironed seams
• Intricate gold braiding
• Platform shoe design
• Elegant standing pose
• Sword sheath hanging
• Glossy shoe finish
• Contrasting shadows
• Softly lit contours
• Pastel color palette
• Graceful hand placement
• Ship deck floor
• Transparent background
• High visual fidelity
• Stylized anime character
• Minimalist backdrop
• Polished artistic rendering
<image, caption> dataset
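A minimal sketch of collecting such <image, caption> pairs with the OpenAI Python client (v1 style). The model name matches the pricing table later in this deck; the request wiring and token limit are my assumptions:

from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a powerful image captioner. ..."  # full prompt shown above

def caption_image(image_url):
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content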
Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Multiple system instructions: the system message sets the guidelines, format, and tone the model answers with.

Instruction tuning: <User query, LFM response>
Explanation tuning: <System message, User query, LFM response>
→ induces diversity and step-by-step reasoning

Plus a user instruction that steers the model toward the captions we want
• Cerulean blue hair
• Military-inspired uniform
• Naval officer's cap
• Golden epaulettes
• Stern gaze
• Light blue eyes
• White thigh-highs
• Naval insignia pins
• Anchored naval theme
• Golden sword handle
• High-waisted shorts
• Blue-striped white jacket
• Nautical skirt flap
• Battleship turret platforms
• Metallic heel decorations
• Flowing hair motion
• Immaculate white gloves
• Pearly skin tone
• Soft facial features
• White naval cap
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
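Putting the Orca recipe and the captioner prompt together, one record of such a dataset could look like the sketch below (my own illustration; the field names and the second instruction variant are hypothetical, while the 12-instruction count matches the ArtBook table that follows):

import random

SYSTEM_INSTRUCTIONS = [
    "You are a powerful image captioner. ...",         # the prompt shown earlier
    "Describe the character's outfit and pose only.",  # hypothetical variant
    # ... 12 variants in total, as in ArtBook-GPT / ArtBook-Gem below
]

def make_record(image_url, user_query, lfm_response):
    """Explanation-tuning triple: <system message, user query, LFM response>."""
    return {
        "system": random.choice(SYSTEM_INSTRUCTIONS),
        "user": {"query": user_query, "image": image_url},
        "response": lfm_response,  # caption from GPT-4V or Gemini
    }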
GPT4-Vision vs Gemini-Vision

Model Name             Prompt Cost (USD)   Completion Cost (USD)   Max Prompt Tokens
gpt-4-vision-preview   $0.00001000         $0.00003000             128000
gemini-pro-vision      $0.00000020         $0.00000050             30720

GPT4-Vision : Gemini-Vision caption ratios tried: 1:2, 1:4, 1:6

“The final data mixtures were determined through ablations on smaller models.”
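To see why the ratio matters, here is a rough cost estimate using the per-token prices above (my arithmetic; the per-image token counts are assumptions):

def caption_cost(n_images, prompt_toks, completion_toks, prompt_price, completion_price):
    """Total USD to caption n_images at the given per-token prices."""
    return n_images * (prompt_toks * prompt_price + completion_toks * completion_price)

# Assume ~1,000 prompt tokens (instruction + image) and ~300 completion tokens.
print(caption_cost(1_000_000, 1000, 300, 0.00001000, 0.00003000))  # GPT-4V: ~$19,000
print(caption_cost(1_000_000, 1000, 300, 0.00000020, 0.00000050))  # Gemini: ~$350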
ArtBook-50M Dataset

Dataset          Volume   Caption   Misc
ArtBook-50M-2D   30M      -
ArtBook-50M-3D   25M      -
ArtBook-50M-AI   5M       -         This is not JourneyDB
ArtBook-GPT      4M       GPT-4V    12 system instructions
ArtBook-Gem      24M      Gemini    12 system instructions
ArtBook-50M      50M      Raw
Data preprocessing
1. Remove images containing text, detected with OCR
2. Remove images whose caption is 10 tokens or fewer
3. Filter on image size and aspect ratio
4. Remove images containing specific concepts
5. Duplicate removal using Milvus
6. Aesthetic score filtering
(a sketch of these filters follows)
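A minimal sketch of the per-image filters above (my own structure; ocr, tokenizer, and aesthetic_score stand in for real components such as an OCR engine and an aesthetic predictor, and all thresholds except the caption length are assumptions). Steps 4-5 (concept filtering and near-duplicate search over embeddings in Milvus) would run as separate batch passes.

def keep(image, caption, tokenizer, ocr, aesthetic_score):
    """Return True if the <image, caption> pair survives the filters."""
    if ocr(image):                                  # 1. image contains text
        return False
    if len(tokenizer(caption)) <= 10:               # 2. caption too short
        return False
    w, h = image.size                               # 3. size / aspect ratio (PIL)
    if min(w, h) < 512 or not 0.5 <= w / h <= 2.0:
        return False
    return aesthetic_score(image) >= 5.0            # 6. aesthetic threshold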
[Diagram: querying the image database for near-duplicates]
Things we didn't know before training:
GPUs are hard to get
and even after you get them, problems remain⋯
Hardware Lottery in the Era of LLMs
“Not all hardware is created equal. The variance of cluster quality across
hardware providers is so high that it is literally a lottery pertaining to how
much pain one would have to go through to train good models. In short, a
hardware lottery in the era of LLMs.”
https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness
Always verify that a server's GPUs actually run properly:
$ nvidia-smi topo -m
$ nvbandwidth -t device_to_device_memcpy_read_ce
https://github.com/NVIDIA/nvbandwidth
Basic H100 training performance improvements
• CUDA version: 12.4
• Mixed precision
• PyTorch 2.0, torch.compile()
• Fully Sharded Data Parallel
• A custom tokenizer fitted to the training data
• PyTorch Profiler, Nsight Compute
(see the wiring sketch after this list)
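A minimal sketch of how the first four items can be wired together (my own assembly, not the talk's training code; build_model(), loader, and optimizer are assumed to exist, torch.distributed must already be initialized, and whether to compile before or after FSDP wrapping varies across PyTorch versions):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                      reduce_dtype=torch.bfloat16,
                      buffer_dtype=torch.bfloat16)

model = FSDP(build_model(), mixed_precision=bf16)  # shard the 8B model across GPUs
model = torch.compile(model)                       # fuse kernels via TorchInductor

for images, captions in loader:
    loss = model(images, captions)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)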
Debugging a training speed error

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    optimizer.step()  # the code that was misbehaving

prof.export_chrome_trace("trace.json")
Correct CUDA kernel execution for the gradient update

Analyzing and Improving the Training Dynamics of Diffusion Models
Hypothesis → Experiment → Retrospective
Validating model performance
• Validation loss (with EMA)
• FID (keep the reference set close to the training distribution, but prevent leakage)
• CLIP score (a sketch follows below)
• Using GPT-4 (or LLaVA) as judge, an approach widely used for LLMs
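CLIP score is straightforward to compute with an off-the-shelf CLIP. A minimal sketch (my assumption of the setup, using the base ViT-B/32 checkpoint):

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, caption):
    """Cosine similarity between CLIP embeddings of an image and its prompt."""
    inputs = proc(text=[caption], images=image, return_tensors="pt", padding=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())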
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
GPT4-Vision

As an AI visual assistant, you are analyzing two specific images. Given a specific caption, you need to judge which image aligns with the caption more closely. Please pay attention to the key information, including object identities, properties, spatial relationships, object numbers and image style, etc.

The caption for the two images is: {caption}

Please respond me strictly in the following format: <the first image is better> or <the second image is better> or <The two images are tied.>. The reason is <give your reason here>.

→ <the first image is better>

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
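Turning many such pairwise verdicts into a single number is a small aggregation step. A sketch (my own, not the paper's code) computing a win rate for model A over model B:

from collections import Counter

def win_rate(verdicts):
    """verdicts: judge outputs following the strict format above."""
    c = Counter()
    for v in verdicts:
        if "first image is better" in v:
            c["a"] += 1
        elif "second image is better" in v:
            c["b"] += 1
        else:
            c["tie"] += 1
    decided = c["a"] + c["b"]
    return c["a"] / decided if decided else 0.5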
Summary

A Log of Training a Large-Scale Diffusion Model from Scratch
Why do image synthesis? Opportunity.
Domain-specific knowledge → Domain-specific AI
LLM vs Image Synthesis
• Visual = global
• A market with plenty of opportunity
• yet relatively little attention
• and little existing experience in the field
SHIFT UP AI Labs

Company overview
Company:    SHIFT UP
CEO:        Kim Hyung-tae
Founded:    December 2, 2013
Games:      Destiny Child / Goddess of Victory: NIKKE / Stellar Blade
Employees:  about 290
Website:    shiftup.co.kr
Business:   Mobile game development and service / console game development
Location:   Seocho-dong, Gangnam-gu, Seoul
SHIFT UP AI Labs
“Shift Up gears up for a KOSPI listing; all eyes on its ₩3 trillion valuation”: ten years after founding, the industry is watching whether Shift Up can become the next Krafton.
“Shift Up's NIKKE tops ₩1 trillion in cumulative revenue”: the subculture game Goddess of Victory: NIKKE crossed ₩1 trillion in cumulative revenue about a year and four months after its November 2022 launch.
Project: Stellar Blade
Introduced by Forbes as “a very surprising new title” among the PlayStation Showcase lineup
The first Korean studio to land a PS5 exclusive, thanks to Sony's active courtship
A AAA action game that even the 3N (Korea's big three publishers) never attempted
#1 most-anticipated PlayStation title of 2024
#1 in pre-orders after the release date announcement
Shift Up's growth strategy
NEXT: expanding the global share of Goddess of Victory: NIKKE + developing Korea's first AAA console IP, Stellar Blade + constantly creating new IP
A game company holding a long-running global hit and a AAA game IP
A pleasant Gangnam office
SHIFT UP AI Labs
A team that cuts game production costs 10× with generative AI
Training
Large Image Synthesis Model
Large Multimodal Model
Video Synthesis Model
Audio Synthesis
Our goal: writing good papers
Industry-academia research opportunities:
• KAIST Graduate School of AI: a program exclusive to CCC member companies, fully funded by KAIST, for carrying out research projects companies need
• POSTECH Industry-Academia Cooperation Foundation
Research experience from OpenAI: GPT-3, reinforcement learning, BlockSparse
If you'd like to grow with us:
bit.ly/shiftup-ai
➡ https://bit.ly/ai-compiler-study
