Computer Vision Research
Computer Graphics, Computer Vision
Computer Vision
Image Segmentation
3D Reconstruction
Image Restoration
Lip Generation
• YOLACT: Real-time Instance Segmentation
Image Segmentation
Instance Segmentation
• MODNet: Trimap-Free Portrait Matting in Real Time
• Real-Time High-Resolution Background Matting(BGMv2)
Image Segmentation
Image Matting
• PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization
• Expressive Body Capture: 3D Hands, Face, and Body from a Single Image(SMPL eXpressive)
• SMPL: A Skinned Multi-Person Linear Model
3D Reconstruction
Human Digitization
• Unsupervised Real-world Image Super Resolution via Domain-distance Aware Training(DASR)
• Designing a Practical Degradation Model for Deep Blind Image Super-Resolution(BSRGAN)
• SwinIR: Image Restoration Using Swin Transformer
• Towards Robust Blind Face Restoration with Codebook Lookup Transformer(CodeFormer)
• Analyzed the strengths and weaknesses of each model by running inference
Image Restoration
Super Resolution
• FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
• MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
• Face2Face: Real-time Face Capture and Reenactment of RGB Videos
Lip Generation
Text to Mesh
Speech to Visemes
Phonemes : the basic sound units of speech
Visemes : the visual definition of phonemes (the mouth shape for each sound)
3D Mesh
Lip Generation
• ObamaNet: Photo-realistic lip-sync from text
• Image-to-Image Translation with Conditional Adversarial Nets(Pix2Pix)
Speech to Image
[Figure: Input text "안녕하세요" ("Hello") → Char2Wav → LSTM → Output]
Lip Generation
• A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild(Wav2Lip)
Speech to Image
Lip Generation
• A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild(Wav2Lip)
• SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
• Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Speech to Image
• Data Parallelization
• Model Parallelization
Goal: reduce training time, the biggest bottleneck of DNNs, by using multiple GPUs
DNN Parallelization
• Hough Transform
• Labeling
Classical Method
• Multimodal learning
• Audio-Image
• Lip Generation
• Follow-up research on Wav2Lip
• Generative model
• Stable Diffusion
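• Improve generation quality by changing how the audio representation and the image representation are extracted
• Improve generation quality by replacing the existing GAN-based generator with other generative techniques such as diffusion models
• Overcome the limits of 2D images by estimating 3D information, producing results that are closer to reality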
• Representing Scenes as Neural Radiance Fields for View Synthesis(NeRF)
• Learning Transferable Visual Models From Natural Language Supervision(CLIP)
• Denoising Diffusion Probabilistic Models(Diffusion Model)
• High-Resolution Image Synthesis with Latent Diffusion Models
Research Plan
NeRF: Representing Scenes as
Neural Radiance Fields for View Synthesis
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,
Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng
AI Labs, Image Processing Part
3D Reconstruction vs Volume Rendering
• Reality Capture
3D Reconstruction Volume Rendering
Volume rendering: a technique that displays sampled 3D data as a 2D projection.
NeRF belongs to this category.
1) March camera rays through the scene to generate a sampled set of 3D points
2) Use those points and their corresponding 2D viewing directions as input
to the neural network to produce an output set of colors and densities
3) Use classical volume rendering techniques to accumulate those colors and densities into a 2D image.
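Step 1 above can be sketched in a few lines. This is a minimal, standard-library-only illustration that assumes the stratified sampling scheme of the NeRF paper (one random depth per evenly spaced bin); the function name `sample_points_on_ray` and the near/far values are hypothetical choices for the example, not taken from the slides.

```python
import random

def sample_points_on_ray(origin, direction, near, far, n_samples, rng=None):
    """Return depths t_i and 3D points o + t_i * d, one random depth per bin."""
    rng = rng or random.Random(0)  # fixed seed only to make the demo repeatable
    bin_size = (far - near) / n_samples
    t_vals, points = [], []
    for i in range(n_samples):
        # Pick a random depth inside the i-th bin of [near, far].
        t = near + (i + rng.random()) * bin_size
        t_vals.append(t)
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
    return t_vals, points

# A ray that starts at the camera origin and looks down the +z axis.
t_vals, points = sample_points_on_ray(
    origin=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 1.0),
    near=2.0, far=6.0, n_samples=8)
```

Each returned point, together with the ray direction, would then be the input to the network in step 2.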
NeRF Model - 1
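$F_\Theta : (X, d) \rightarrow (c, \sigma)$
$X = (x, y, z)$ : the x, y, z coordinates of the sample points that the ray passes through
$d = (\theta, \phi)$ : viewing direction
$c = (r, g, b)$ : color
$\sigma$ : density. As the density increases the object becomes opaque (things behind it are hard to see); as the density decreases the object becomes transparent.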
NeRF Model - 2
NeRF Model - 3
[Figure: network diagram. The position $X = (x, y, z)$ alone determines the density $\sigma$; $X$ together with the viewing direction $d = (\theta, \phi)$ determines the color $c = (r, g, b)$.]
NeRF Model - 4
Volume Rendering
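Notes on the volume rendering equation:
• $\sigma$ is the density term and $c$ is the RGB term.
• The density is negated and passed through exp, which means that the larger the density in front of a sample point, the smaller the weight that point receives.
• $t_i$ is, roughly speaking, sampled at random along the ray.
• $r$ denotes the camera ray.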
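For reference, the equations these notes appear to annotate are the following, written in the notation of the NeRF paper. The symbols $o$ (ray origin), $t_n$ and $t_f$ (near and far bounds) and $N$ (number of samples) come from the paper, not from this slide.

```latex
\begin{aligned}
r(t) &= o + t\,d \\
C(r) &= \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt,
\qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right) \\
t_i &\sim \mathcal{U}\!\left[\,t_n + \tfrac{i-1}{N}(t_f - t_n),\; t_n + \tfrac{i}{N}(t_f - t_n)\right]
\end{aligned}
```

$T(t)$ is the accumulated transmittance: it shrinks as more density lies between the camera and depth $t$, which is the weighting effect described in the notes.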
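• The color and density values that the model outputs along one ray go through volume rendering, which merges them into a single pixel
• The merged pixel RGB value is compared with the RGB value of the real image pixel through an MSE loss, and training proceeds by backpropagation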
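The two bullets above can be illustrated with a small sketch of the discrete compositing rule from the NeRF paper: each sample gets opacity alpha_i = 1 - exp(-sigma_i * delta_i) and weight T_i * alpha_i, where T_i is the transmittance accumulated in front of it. The names `render_ray` and `mse`, the toy inputs, and the choice to treat the last interval as effectively infinite are assumptions of this sketch, not the authors' code.

```python
import math

def render_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and RGB colors along one ray into a pixel.

    sigmas: density at each sample; colors: (r, g, b) at each sample;
    t_vals: sorted sample depths, one per sample.
    """
    n = len(sigmas)
    pixel = [0.0, 0.0, 0.0]
    weights = []
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for i in range(n):
        # Distance to the next sample; the last interval is treated as huge.
        delta = t_vals[i + 1] - t_vals[i] if i + 1 < n else 1e10
        alpha = 1.0 - math.exp(-sigmas[i] * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        for ch in range(3):
            pixel[ch] += w * colors[i][ch]
        transmittance *= 1.0 - alpha  # dense samples in front hide later ones
    return pixel, weights

def mse(pred, target):
    """Mean squared error between a rendered pixel and the real pixel."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# An opaque red sample in front of an opaque blue one renders as red.
pixel, weights = render_ray([1e3, 1e3], [(1, 0, 0), (0, 0, 1)], [0.0, 1.0])
loss = mse(pixel, [1.0, 0.0, 0.0])
```

The returned `weights` are the same per-sample weights that hierarchical sampling later reuses.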
Positional Encoding
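Applied so that the network can represent high-frequency detail.
It can loosely be thought of as a kind of data augmentation; more precisely, each input coordinate is mapped to sines and cosines of increasing frequency before it enters the network.
[Figure: $X = (x, y, z)$ and $d = (\theta, \phi)$ pass through the positional encoding before the network outputs $\sigma$ and $c = (r, g, b)$.]
X: 3 dimensions to 60 dimensions (L = 10)
d: 3 dimensions to 24 dimensions (L = 4; the direction is given as a 3D unit vector)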
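The dimension counts above follow directly from the encoding: each of the 3 coordinates produces one sine and one cosine per frequency, so 3 × 2 × L values. A minimal sketch, assuming the NeRF paper's form sin(2^k π p), cos(2^k π p) for k = 0 … L−1 (the function name `positional_encoding` and the sample inputs are hypothetical):

```python
import math

def positional_encoding(vec, num_freqs):
    """Map each coordinate p to sin(2^k*pi*p), cos(2^k*pi*p) for k < num_freqs."""
    out = []
    for p in vec:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * p
            out.append(math.sin(angle))
            out.append(math.cos(angle))
    return out

x_enc = positional_encoding([0.1, -0.4, 0.7], num_freqs=10)  # position, L = 10
d_enc = positional_encoding([0.0, 0.0, 1.0], num_freqs=4)    # unit direction, L = 4
print(len(x_enc), len(d_enc))  # 60 24
```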
Hierarchical Volume Sampling
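Normalize the weights from the coarse pass into a probability distribution and draw additional samples from it, so that more samples land where the weights are large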
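One common way to realize this step is inverse transform sampling from the piecewise-constant distribution defined by the normalized weights. The sketch below is an illustration under that assumption; the function name `sample_fine`, the small epsilon added to the weights, and the toy bins are choices made for this example.

```python
import bisect
import random

def sample_fine(bin_edges, weights, n_fine, rng=None):
    """Draw n_fine depths from the piecewise-constant PDF given by the weights.

    bin_edges has one more entry than weights (one weight per bin).
    """
    rng = rng or random.Random(0)  # fixed seed only to make the demo repeatable
    weights = [w + 1e-5 for w in weights]  # avoid division by zero
    total = sum(weights)
    pdf = [w / total for w in weights]     # normalize the weights
    cdf = [0.0]
    for p in pdf:
        cdf.append(cdf[-1] + p)
    samples = []
    for _ in range(n_fine):
        u = rng.random()  # uniform in [0, 1)
        # Find the bin whose CDF interval contains u.
        i = min(bisect.bisect_right(cdf, u) - 1, len(pdf) - 1)
        # Place the sample inside the bin in proportion to u's position.
        frac = min((u - cdf[i]) / pdf[i], 1.0)
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)

# Nearly all of the weight sits in the middle bin, so nearly all samples do too.
samples = sample_fine([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0], n_fine=1000)
```

In NeRF the new depths are merged with the coarse ones, and the fine model is evaluated on the combined set.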
Loss Function
Compute the error between the original RGB values and the RGB values
volume-rendered by the COARSE MODEL and by the FINE MODEL
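Written out, the NeRF paper's loss sums the squared error of both renderings over a batch of rays. The symbols below follow the paper and are not defined on this slide: $\mathcal{R}$ is the batch of rays, $\hat{C}_c(r)$ and $\hat{C}_f(r)$ are the coarse and fine renderings, and $C(r)$ is the ground-truth color.

```latex
\mathcal{L} = \sum_{r \in \mathcal{R}}
\left[ \left\lVert \hat{C}_c(r) - C(r) \right\rVert_2^2
     + \left\lVert \hat{C}_f(r) - C(r) \right\rVert_2^2 \right]
```

The coarse term is kept in the loss because the coarse weights decide where the fine samples are drawn.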
Drawbacks of NeRF
1. Slow speed
NeRF is slow to train and slow to render
One NeRF model can represent only one object
A single training run (200k~300k iterations) takes roughly 1~2 days
-> DeRF(21CVPR) , NeRF++, Plenoxels(22CVPR)
2. NeRF performs well only on static scenes
A scene with moving objects produces a lot of noise
-> D-NeRF(21CVPR) , Nerfies(21ICCV) , HyperNeRF
3. NeRF performs well only on images captured under the same conditions
Even for a static object, brightness, color and lighting can differ with the weather and the time of day,
and in the real world, unless the photos are taken in a studio, such data is the more common case
-> NeRV(21CVPR) , NeRD(21ICCV) , NeRF in the wild (21CVPR)
4. NeRF is not a general model
One NeRF model can produce only one object
-> GIRAFFE(21CVPR) , pixel-NeRF(21CVPR) ...
5. Requires a training set with too many viewpoints
The synthetic training dataset given to NeRF as input has 100 images
Taking 100 photos to learn a single object is inefficient
Research is needed on rendering an object from only a few photos
-> pixel-NeRF(21CVPR) , DietNeRF(21ICCV), Instant-NGP
6. Camera parameters as NeRF input
NeRF needs the intrinsic and extrinsic camera parameters in order to know the camera pose
This is too much information to expect from an ordinary user who captures an object with a smartphone camera
To address this, research either estimates the pose or learns the pose itself
-> iNeRF(21IROS) , NeRF-- , GNeRF(21ICCV) , BARF(21ICCV) , SCNeRF(21ICCV)
Semi Supervised Learning
Entropy, Relative Entropy and Mutual Information*
*Elements of Information Theory, Thomas M. Cover and Joy A. Thomas
Definition 1 : The entropy of a discrete random variable $X$ with p.m.f. $p(x)$ is defined as
$$H(X) := -\sum_x p(x) \log p(x)$$
Convention : $0 \log 0 = 0$
Remark 1 : The entropy $H(X)$ is a measure of the average uncertainty of the random variable $X$.
We can also write
$$H(X) = -\sum_x p(x) \log p(x) = -E_p[\log p(X)]$$
Lemma 1 : $H(X) \ge 0$
Proof : By definition,
$$H(X) = -\sum_x p(x) \log p(x) = \sum_x p(x)\,(-\log p(x)) \ge 0,$$
since $0 \le p(x) \le 1$ implies $-\log p(x) \ge 0$.
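A short numerical illustration of Definition 1, including the convention 0 log 0 = 0. It is a minimal sketch; the function name `entropy` and the choice of base-2 logarithms (so that the unit is bits) are assumptions of this example.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), skipping zero terms (0 log 0 = 0)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
print(fair_coin)  # 1.0
```

A fair coin is the most uncertain two-outcome variable, so the biased coin has lower entropy, and a deterministic outcome such as `[1.0, 0.0]` has entropy zero.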
Definition 2 : The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with joint p.m.f. $p(x, y)$ is defined as
$$H(X, Y) := -\sum_x \sum_y p(x, y) \log p(x, y)$$
Clearly $H(X, Y) \ge 0$
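Definition 3 : If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y|X)$ is defined as
$$\begin{aligned}
H(Y|X) &:= \sum_x p(x)\, H(Y|X = x) \\
&= -\sum_x p(x) \sum_y p(y|x) \log p(y|x) \\
&= -\sum_x \sum_y p(y|x)\, p(x) \log p(y|x) \\
&= -\sum_x \sum_y p(x, y) \log p(y|x) \\
&= -E_{p(x,y)}[\log p(Y|X)]
\end{aligned}$$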
Theorem (Chain Rule) : $H(X, Y) = H(Y|X) + H(X)$
Proof :
$$\begin{aligned}
H(X, Y) &= -\sum_x \sum_y p(x, y) \log p(x, y) \\
&= -\sum_x \sum_y p(x, y) \log\big(p(y|x)\, p(x)\big) \\
&= -\sum_x \sum_y p(x, y) \log p(y|x) - \sum_x \sum_y p(x, y) \log p(x) \\
&= H(Y|X) - \sum_x p(x) \log p(x) \\
&= H(Y|X) + H(X)
\end{aligned}$$
⇒ $H(X, Y) = H(Y|X) + H(X)$
Corollary :
$$H(X, Y, Z) = H(Y|X, Z) + H(X|Z) + H(Z)$$
Proof omitted.
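The chain rule can be checked numerically on a small joint distribution. This is a minimal sketch; the toy table `joint` and the helper names are hypothetical, and base-2 logarithms are used so that the values are in bits.

```python
import math

# Toy joint distribution p(x, y) over x in {"a", "b"} and y in {0, 1}.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.125, ("b", 1): 0.125}

def entropy(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal_x(joint):
    """p(x) = sum_y p(x, y)."""
    px = {}
    for (x, _y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px

def conditional_entropy_y_given_x(joint):
    """H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x)."""
    px = marginal_x(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)

h_xy = entropy(joint)
h_x = entropy(marginal_x(joint))
h_y_given_x = conditional_entropy_y_given_x(joint)
# Chain rule: H(X, Y) = H(Y|X) + H(X), up to floating-point rounding.
assert abs(h_xy - (h_y_given_x + h_x)) < 1e-12
```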
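Definition 4 : The Kullback-Leibler divergence or relative entropy between two probability mass functions $p(x)$ and $q(x)$ is defined as
$$D(p\|q) := \sum_x p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$$
Remark 2 : In general, $D(p\|q) \ne D(q\|p)$
Fact : $D(p\|q) \ge 0$
Proof :
$$\begin{aligned}
D(p\|q) &= \sum_x p(x) \log \frac{p(x)}{q(x)} \\
&= \sum_x p(x)\,(-1) \log \frac{q(x)}{p(x)} \\
&\ge (-1) \log \sum_x p(x)\, \frac{q(x)}{p(x)} \\
&= (-1) \log 1 = 0
\end{aligned}$$
by Jensen's inequality for the convex function $-\log$.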
Definition 5 : The cross-entropy of the p.m.f.s $p$ and $q$ is defined as
$$H_q(p) := -\sum_x p(x) \log q(x)$$
where $p(x)$ is the unknown true distribution and $q(x)$ is an approximating distribution.
Observation :
(1) $$D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = -H(p) + H_q(p)$$
(2) $0 \le D(p\|q) = H_q(p) - H(p)$, i.e. $H_q(p) \ge H(p)$
(3) Since $p(x)$ is fixed, minimizing $D(p\|q)$ ⇔ minimizing $H_q(p)$
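The three observations can be verified on a small example. This is a minimal sketch with hypothetical distributions `p` and `q`; natural logarithms are used, and `q` is assumed to be positive wherever `p` is.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H_q(p) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (fixed)
q = [0.5, 0.3, 0.2]  # approximating distribution

d_pq = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)
# Observation (1): D(p||q) = H_q(p) - H(p), up to floating-point rounding.
assert abs(d_pq - gap) < 1e-12
```

Because `entropy(p)` does not depend on `q`, lowering the cross-entropy by changing `q` lowers the KL divergence by the same amount, which is why the cross-entropy is the quantity minimized in training.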
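Definition 6 : The mutual information $I(X; Y)$ of two r.v.'s $X$ and $Y$ is defined as
$$I(X; Y) := \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = E_{p(x,y)}\left[\log \frac{p(X, Y)}{p(X)\,p(Y)}\right]$$
Observation : $I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
Proof :
$$\begin{aligned}
I(X; Y) &= \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \\
&= \sum_x \sum_y p(x, y) \log \frac{p(x|y)\,p(y)}{p(x)\,p(y)} \\
&= \sum_x \sum_y p(x, y) \log \frac{p(x|y)}{p(x)} \\
&= \sum_x \sum_y p(x, y) \log p(x|y) - \sum_x \sum_y p(x, y) \log p(x) \\
&= -H(X|Y) + H(X)
\end{aligned}$$
and, by the symmetry of the definition in $X$ and $Y$, $I(X; Y) = -H(Y|X) + H(Y)$.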
Remark 3 :
$I(X; Y)$ is the reduction in the uncertainty of $X$ due to the information of $Y$ (and of $Y$ due to the information of $X$).
Proposition :
$I(X; Y) \ge 0$, with equality ⇔ $X$ and $Y$ are independent.
Proof :
$$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = D\big(p(x, y)\,\|\,p(x)\,p(y)\big) \ge 0,$$
with equality exactly when $p(x, y) = p(x)\,p(y)$ for all $x, y$.
Corollary : $I(X; Y|Z) \ge 0$
Proof : $I(X; Y|Z) := D\big(p(x, y|z)\,\|\,p(x|z)\,p(y|z)\big) \ge 0$ because $D(p\|q) \ge 0$
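A small numerical check of the Proposition: mutual information is zero for an independent pair and positive for a dependent one. The function name `mutual_information` and the two toy joint tables are assumptions of this sketch; natural logarithms are used.

```python
import math

def mutual_information(joint):
    """I(X;Y) for a joint table where joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In the dependent table, knowing Y removes all uncertainty about X, so the mutual information equals H(X) = log 2.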
Contrastive Learning
Problem : Given a massive data set $X = \{x_1, x_2, \dots, x_T\}$ without labels, how do we learn an encoder $f_\theta(\cdot)$ (a representation) that will be used for downstream tasks such as classification or clustering?
Idea : For each data point $x \in X$,
(1) Randomly draw a positive sample $x^+$ from $x$
(2) Randomly draw $N - 1$ negative samples $\{x_j^- ;\ j = 1, 2, \dots, N - 1\}$ from different classes
(3) Choose any type of neural network and learn the encoder $f_\theta(\cdot)$
(4) Choose a score function, for example $f_\theta(x)^T f_\theta(x^+)$ and $f_\theta(x)^T f_\theta(x_j^-)$
(5) The loss function given 1 positive sample and $N - 1$ negative samples is
$$L(\theta) = -E_x\left[\log \frac{\exp\big(f_\theta(x)^T f_\theta(x^+)\big)}{\exp\big(f_\theta(x)^T f_\theta(x^+)\big) + \sum_{j=1}^{N-1} \exp\big(f_\theta(x)^T f_\theta(x_j^-)\big)}\right]$$
(6) For a sample $\{x_l\}_{l=1}^{M} \subset X := \{x_1, x_2, \dots, x_T\}$ of batch size $M$, use the empirical loss function
$$L_M(\theta) = -\frac{1}{M} \sum_{l=1}^{M} \log \frac{\exp\big(f_\theta(x_l)^T f_\theta(x_l^+)\big)}{\exp\big(f_\theta(x_l)^T f_\theta(x_l^+)\big) + \sum_{j=1}^{N-1} \exp\big(f_\theta(x_l)^T f_\theta(x_j^-)\big)}$$
and update $\theta$ by a gradient descent algorithm: $\theta = \arg\min_\theta L_M(\theta)$
This is the cross-entropy loss for an $N$-class softmax classifier, i.e. the model learns to pick out the positive sample from the $N$ samples.
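The loss in steps (5) and (6) can be sketched on embeddings that have already been encoded. This is a minimal illustration, not a training implementation: the names `contrastive_loss` and `batch_loss`, the max-subtraction for numerical stability, and the toy vectors are assumptions of this example, and the vectors stand in for the encoder outputs.

```python
import math

def dot(u, v):
    """Score function: inner product of two embeddings."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positive, negatives):
    """-log of the softmax probability assigned to the positive among N samples."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(scores)  # subtract the max before exp for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

def batch_loss(batch):
    """Empirical loss: the average over (anchor, positive, negatives) triples."""
    return sum(contrastive_loss(a, p, negs) for a, p, negs in batch) / len(batch)

negatives = [[-1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], negatives)      # positive matches
misaligned = contrastive_loss([1.0, 0.0], [-1.0, 0.0], negatives)  # positive opposes
average = batch_loss([([1.0, 0.0], [1.0, 0.0], negatives),
                      ([1.0, 0.0], [-1.0, 0.0], negatives)])
```

The loss is small when the anchor scores highest against its positive, and it equals log N when all N scores are identical, which matches the N-class softmax reading above.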
Architecture :
[Figure: input → CNN encoder $f_\theta(\cdot)$ → Loss]
Algorithm :
(0) Input : batch size $M$, network structure
(1) Randomly sample $\{x_l\}_{l=1}^{M}$ from $X := \{x_t\}_{t=1}^{T}$
(2) Randomly initialize all parameters
(3) For each data point $x_l$, randomly draw one positive sample $x_l^+$ from $x_l$ and $(N - 1)$ negative samples $x_j^-$ from different classes
(4) Compute the encoder outputs $f_\theta(x_l)$, $f_\theta(x_l^+)$, $f_\theta(x_j^-)$ for $l = 1, 2, \dots, M$
(5) Use the empirical cross-entropy loss
$$L_M(\theta) = -\frac{1}{M} \sum_{l=1}^{M} \log \frac{\exp\big(f_\theta(x_l)^T f_\theta(x_l^+)\big)}{\exp\big(f_\theta(x_l)^T f_\theta(x_l^+)\big) + \sum_{j=1}^{N-1} \exp\big(f_\theta(x_l)^T f_\theta(x_j^-)\big)}$$
and update all parameters by a gradient descent algorithm: $\theta = \arg\min_\theta L_M(\theta)$
(6) Repeat steps (1) and (3)~(5) until the updated parameters converge or change by less than a given tolerance (the initialization in step (2) is done only once)
⇒ many, many epochs

