Hierarchical Long-term Video Frame
Prediction without Supervision
Nevan Wichers (Google Brain)
Ruben Villegas, Dumitru Erhan, Honglak Lee
2021.08.01
Deep Learning Paper Reading Group, Image Processing Team
홍은기, 김병현, 김선옥, 안종식, 이찬혁
Table of Contents
1. Introduction
2. Related Works
3. Background
4. Architecture & Training Strategies
5. Experiments
6. Conclusion & Discussion
What is Video Prediction?
Input: frames x_1, ..., x_t
Output: frames x_{t+1}, ..., x_{t+T}
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
[Figure: observed input frames (In) and predicted output frames (Out)]
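Read as an interface, the task maps t observed frames to T future ones. Below is a minimal sketch of that contract, assuming a hypothetical `model` callable and illustrative tensor shapes (neither is from the paper):

```python
import numpy as np

def predict_video(model, observed: np.ndarray, T: int) -> np.ndarray:
    """Hypothetical interface: `observed` holds frames x_1..x_t with
    shape (t, H, W, C); returns x_{t+1}..x_{t+T} of shape (T, H, W, C)."""
    return model(observed, horizon=T)

# e.g. 5 observed 64x64 RGB frames in, 32 predicted frames out:
# future = predict_video(model, np.zeros((5, 64, 64, 3)), T=32)
```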
Applications
1. Autonomous driving
2. Crime prediction
3. Robotics
A Review on Deep Learning Techniques for Video Prediction (2020, Oprea et al.)
[Figure: Minority Report (2002)]
Related Works
• Patch-level prediction
• The earliest approach to video prediction
• Suffers from horizontal and vertical stripe artifacts in the predicted frames
• Frame-level prediction of realistic videos
• Finn et al. (2016) – directly predicts pixel motion and generates frames that reflect that motion
• Lotter et al. (2017) – uses an encoder-decoder architecture
Problem: as the prediction distance grows, errors are amplified because each predicted frame is fed back as input to predict the next one (see the sketch below)
-> blurriness and degradation
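The error amplification comes from the autoregressive rollout itself. A minimal sketch, assuming a hypothetical single-step `model` that maps one frame to the next:

```python
def autoregressive_rollout(model, frames, T):
    """Frame-level prediction loop: beyond the observed context,
    every input is itself a prediction, so errors in frame k
    contaminate frames k+1, k+2, ... (blur and degradation build up)."""
    history = list(frames)                  # observed frames x_1..x_t
    for _ in range(T):
        history.append(model(history[-1])) # a prediction feeds a prediction
    return history[len(frames):]            # x̂_{t+1}..x̂_{t+T}
```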
Background – Villegas et al. (2017b)
Villegas et al., 2017b, Learning to Generate Long-term Future via Hierarchical Prediction
- The predecessor of the selected paper; proposes a hierarchical approach to future frame prediction
Inference Pipeline (see the code sketch below)
1. Run pose estimation on each observed frame → extract high-level features (pose)
2. Feed the extracted pose sequence into an LSTM → predict the future pose sequence
3. Generate the future frames with a Visual Analogy Network (VAN)
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
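A minimal sketch of this three-stage inference pipeline, assuming hypothetical `pose_estimator`, `pose_lstm`, and `van` callables standing in for the trained modules of Villegas et al. (2017b):

```python
def hierarchical_predict(frames, pose_estimator, pose_lstm, van, T):
    """Sketch of the hierarchical pipeline: (1) pose estimation on
    observed frames, (2) LSTM predicts future poses, (3) the VAN
    renders each future frame from the last *observed* frame and its
    pose, so no predicted frame is ever fed back into the model."""
    poses = [pose_estimator(x) for x in frames]   # step 1: p_1..p_t
    future_poses = pose_lstm(poses, horizon=T)    # step 2: p_{t+1}..p_{t+T}
    x_t, p_t = frames[-1], poses[-1]
    return [van(x_t, p_t, p_future)               # step 3: x̂_{t+n}
            for p_future in future_poses]
```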
Background – Villegas et al. (2017b)
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
1. Run pose estimation on the observed frames x_{1:t-1} → extract high-level features (pose)
Background – Villegas et al. (2017b)
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
2. Feed the extracted pose sequence into an LSTM → predict the future pose sequence
Background – Villegas et al. (2017b)
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
3. Feed the current frame x_t, its pose p_t, and the predicted pose p_{t+n} into the VAN (Visual Analogy Network) to generate the predicted frame x_{t+n}
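The analogy behind step 3 can be written compactly. In Villegas et al. (2017b) it is implemented, roughly, as additive analogy-making in a learned feature space (f_img, f_pose, and f_dec denote the image encoder, pose encoder, and decoder; the exact form below is a reconstruction, not a quote):

```latex
p_t : p_{t+n} \;::\; x_t : \hat{x}_{t+n},
\qquad
\hat{x}_{t+n} = f_{\mathrm{dec}}\big( f_{\mathrm{pose}}(p_{t+n}) - f_{\mathrm{pose}}(p_t) + f_{\mathrm{img}}(x_t) \big)
```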
Background – Villegas et al. (2017b)
• Contributions:
1. Even at long prediction distances, frames are generated from real observed frames rather than previously predicted ones, which mitigates the degradation problem
2. High-level features capture the video's motion dynamics effectively
• Limitations:
1. Pose estimation is trained with supervised learning (ground-truth pose labels required)
2. The pose estimator, LSTM, and Visual Analogy Network are trained separately
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
Q & A
Architecture
1. Encoder
2. LSTM Predictor
3. Visual Analogy Network
Architecture
1. Encoder
Takes frame I_{t-1} as input and outputs a general feature vector e_{t-1} ∈ R^d
2. LSTM Predictor
For observed frames (t ≤ C), the encoder's output is fed into the LSTM to predict ê_t (see the sketch below)
For unobserved frames (t > C), the LSTM's own previous output is fed back into the LSTM to predict ê_t
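A sketch of the rollout described above, assuming a hypothetical `encoder` CNN and an `lstm` cell that returns (output, state); the switch at the context boundary C is the key detail:

```python
def predict_features(frames, encoder, lstm, C, T):
    """Feature-level rollout (sketch). While frames are observed
    (t <= C) the LSTM consumes encoder outputs; beyond the context
    (t > C) it consumes its own previous prediction ê."""
    state, e_hats = None, []
    observed = [encoder(x) for x in frames[:C]]   # e_1..e_C
    for t in range(C + T):
        inp = observed[t] if t < C else e_hats[-1]
        e_hat, state = lstm(inp, state)           # next predicted feature ê
        e_hats.append(e_hat)
    return e_hats                                 # predicted feature sequence
```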
VAN (Visual Analogy Network)
Reed et al., 2015, Deep Visual Analogy-Making
- A technique for analogical reasoning on images (e.g., "what would I look like wearing sunglasses?")
- Inspired by word embedding techniques such as Word2Vec and GloVe: relationships are inferred through simple vector arithmetic, i.e. addition and subtraction (see the toy example below)
Distributed Representations of Words and Phrases and their Compositionality (2013, Mikolov et al.)
https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
Queen + Kings - King = Queens
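The "Queen + Kings - King = Queens" line is ordinary vector arithmetic. A toy numpy illustration with invented 3-d embeddings (not real Word2Vec vectors):

```python
import numpy as np

# Invented embeddings: the third dimension encodes "plurality".
emb = {
    "king":   np.array([1.0, 0.0, 0.0]),
    "kings":  np.array([1.0, 0.0, 1.0]),
    "queen":  np.array([0.0, 1.0, 0.0]),
    "queens": np.array([0.0, 1.0, 1.0]),
}
analogy = emb["queen"] + emb["kings"] - emb["king"]
assert np.allclose(analogy, emb["queens"])  # the pluralization offset transfers
```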
VAN (Visual Analogy Network)
Deep Visual Analogy-Making (2015, Reed et al.)
Training Strategies
1. E2E (End-to-End Prediction)
The simple strategy: train the LSTM and VAN jointly by minimizing the L2 loss between the ground-truth and predicted frames
2. EPVA (Encoder Predictor with Visual Analogy)
Adds the constraint that the feature vector output by the encoder and the feature vector predicted by the LSTM should match, i.e. the feature-matching term sketched below
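The loss formula on the slide was an image and did not survive extraction. An EPVA-style objective consistent with the description, with α as an assumed weighting hyperparameter, would be:

```latex
\mathcal{L}_{\mathrm{EPVA}}
= \sum_{t} \lVert I_t - \hat{I}_t \rVert_2^2
+ \alpha \sum_{t} \lVert e_t - \hat{e}_t \rVert_2^2
```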
EPVA (Encoder Predictor with Visual Analogy)
Training: a loss between the real image I and the predicted image î, plus a loss between the vector e extracted from the real image and the vector ê predicted from the previous vector (see the sketch below)
Inference: the predicted vector ê is fed into the VAN
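In code form, training and inference differ only in which path to the VAN is exercised. A sketch with hypothetical `encoder`, `lstm`, and `van` modules and a generic `l2` loss; the VAN is reduced here to a decoder of ê for brevity:

```python
def epva_losses(frames, encoder, lstm, van, l2):
    """EPVA training (sketch): a feature loss between e_t and ê_t plus
    an image loss between I_t and the VAN output for ê_t. At inference
    only the ê -> van(...) path is executed."""
    es = [encoder(x) for x in frames]   # e_t from real frames
    e_hats = lstm(es)                   # ê_t predicted from previous vectors
    feat_loss = sum(l2(e, e_hat) for e, e_hat in zip(es, e_hats))
    img_loss = sum(l2(x, van(e_hat)) for x, e_hat in zip(frames, e_hats))
    return img_loss, feat_loss
```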
EPVA with Adversarial Training
1. The L2 (MSE) loss is well known to cause a 'blurriness effect'; accordingly, the feature vectors predicted by the LSTM also suffer from blurriness
2. Therefore, an adversarial loss is added between the predictor's prediction ê and the encoder's output e
3. An LSTM discriminator is introduced to judge whether a feature vector was produced by the encoder or by the predictor
Discriminator loss:
Generator loss:
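The two loss formulas above were images and are missing from the extraction. A standard GAN formulation matching the description, where D is the LSTM discriminator over feature sequences (a reconstruction, not a quote from the paper), would be:

```latex
\mathcal{L}_{D} = -\,\mathbb{E}\big[\log D(e_{1:T})\big]
                  -\,\mathbb{E}\big[\log\big(1 - D(\hat{e}_{1:T})\big)\big],
\qquad
\mathcal{L}_{G} = -\,\mathbb{E}\big[\log D(\hat{e}_{1:T})\big]
```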
Experiments – Toy Dataset
• Prediction distance: 16 frames
• Results: predictions over 1,000 frames
• Given frames: 3 frames
Experiments – Human 3.6M
• Train dataset: subjects 1, 5, 6, 7, 8
• Validation dataset: subject 9
• Test dataset: subject 11
• Frame size: 64 x 64
• Subsampling rate: 6.25 frames per sec
• Prediction distance: 32 frames
• Results: predictions over 126 frames
• Given frames: 5 frames
Experiments – Person Detector Evaluation
Object detector: MobileNet-based object detector trained on MS-COCO
Detection accuracy on ground-truth data: 0.4
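The metric idea: run the person detector on predicted frames and compare its detection rate against the 0.4 it achieves on ground-truth frames. A hypothetical sketch, assuming a `detect` function that returns a person-detection confidence per frame:

```python
def detection_accuracy(frames, detect, threshold=0.5):
    """Fraction of frames in which the (hypothetical) detector finds a
    person above `threshold`; compared against the 0.4 accuracy the
    same detector reaches on ground-truth frames."""
    hits = sum(1 for frame in frames if detect(frame) >= threshold)
    return hits / len(frames)
```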
Ablation Studies
1. No VAN (the VAN improves the quality of predictions)
• Replaced the VAN with a plain decoder that generates images from the feature vector alone
• Quality degrades beyond 32 frames
2. Hybrid objective (E2E loss + EPVA loss)
• Produces blurry frames
3. Context frames 5 -> 10
• Using 10 context frames instead of the original 5 showed no change in performance
Conclusion & Discussion
1. Joint training of the components of the hierarchical approach
2. Further performance gains from adversarial training
3. The title's 'without supervision' is only meaningful relative to the earlier paper (Villegas et al., 2017); video prediction itself is fundamentally a self-supervised learning task, since the future frames serve as their own ground truth
4. Isn't 64 x 64 rather small? A recent work (Lee et al., 2021) uses the Cityscapes (256 x 512) and KITTI (256 x 256) datasets
Revisiting Hierarchical Approach for Persistent Long-term Video Prediction (2021, Lee et al.)
Q & A