210801 hierarchical long term video frame prediction without supervision

Hierarchical Long-term Video Frame
Prediction without Supervision
Nevan Wichers (Google Brain)
Ruben Villegas, Dumitru Erhan, Honglak Lee
2021.08.01
딥러닝논문읽기모임 이미지처리팀
홍은기, 김병현, 김선옥, 안종식, 이찬혁

1. Introduction
2. Related Works
3. Background
4. Architecture & Training Strategies
5. Experiments
6. Conclusion & Discussion
2
목차

3
What is Video Prediction?
Input: x1 ... xt frames
Output: xt + 1 ... xt + T frames
Learning to Generate Long-term Future via Hierarchical Prediction (2017, Villegas et al.)
In
Out

4
Application
1. 자율주행
2. 범죄 예측
3. 로봇
A Review on Deep Learning Techniques for Video Prediction (2020, Oprea et al.)
마이너리티 리포트 (2002)

5
Related Works
• Patch-level prediction
• Video Prediction을 위한 최초의 방법
• 예측 프레임에 가로 세로 줄이 생기는 현상이 나타나는 문제
• Frame-level prediction realistic videos
• Finn et al. (2016) – 픽셀의 움직임을 직접 예측하여 해당 움직임이 반영된 프레임 생성
• Lotter et al. (2017) – encoder-decoder 구조 사용
문제: 예측 거리가 멀어질수록 예측한 프레임을 입력으로 또다시 프레임을 예측하기 때문에 에러가 증폭
-> blurriness 및 degradation 현상 발생

6
Background – Villegas et al. (2017b)
Villegas et al., 2017b, Learning to Generate Long-term Future via Hierarchical Prediction
- 선정 논문의 선행 연구로서, future frame 예측을 위해 hierarchical approach를 제안
Inference Pipeline
1. 이전 프레임 각각에 대하여 pose estimation 수행 → high-level feature (pose) 추출
2. 추출된 pose 정보 시퀀스를 LSTM에 입력 → future pose 시퀀스 예측
3. Visual Analogy Network (VAN)을 이용하여 future frame 생성

7
1. 이전 프레임 X1:t-1 대하여 pose estimation 수행 → high-level feature (pose) 추출

8
2. 추출된 pose 시퀀스 정보를 LSTM에 입력 → future pose 시퀀스 예측

9
3. 현재 프레임 xt, xt 의 pose 정보 pt, 예측된 pose 정보 pt+n을
VAN (Visual Analogy Network)에 입력하여 예측 프레임 xt+n 생성

10
• Contribution:
1. 예측 거리가 먼 경우에도, 예측된 프레임이 아닌 실제 프레임만을 가지고 프레임을 생성하기 때문에
degradation 문제 완화
2. High level feature를 통해 영상의 motion dynamics를 효과적으로 포착
• 한계:
1. Pose estimation을 supervised learning 방식으로 학습 (ground-truth 필요)
2. Pose estimator, LSTM, Visual Analogy Network를 별도로 학습

12
Architecture
1. Encoder
2. LSTM Predictor
3. Visual Analogy Network

13
Architecture
1. Encoder
현재 프레임 It 를 입력으로 하여 general feature vector et-1 ∈ Rd 출력
2. LSTM Predictor
관찰된 프레임(t ≤ C)의 경우, encoder의 출력 -> LSTM -> êt 예측
관찰되지 않은 프레임(t > C)의 경우, LSTM의 출력 -> LSTM -> êt 예측

Reed et al., 2015, Deep Visual Analogy-Making
- 이미지에 대한 Analogical reasoning을 구현하기 위한 기법 (내가 선글라스를 쓴다면?)
- Word2Vec 및 GloVe와 같은 word embedding 기법에 영감을 얻은 기법으로서 더하기 빼기와 같은 단순한 벡터 연
산을 통해 관계 유추
14
VAN (Visual Analogy Network)
Distributed Representations of Words and Phrases and their Compositionality (2013, Mikolov et al.)
https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
Queen + Kings - King = Queens

15
VAN (Visual Analogy Network)
Deep Visual Analogy-Making (2015, Reed et al.)

1. E2E (End-to-End Prediction)
단순 전략으로 LSTM 및 VAN을 함께 학습 – 실제 프레임과 예측 프레임 간의 L2 loss를 minimize
2. EPVA (Encoder Predictor with Visual Analogy)
Encoder에서 출력한 feature vector와 LSTM이 예측한 feature vector가 같도록 하는 제약 적용, 즉
16
Training Strategies

17
EPVA (Encoder Predictor with Visual Analogy)
Training: 실제 이미지 I 및 예측 이미지 î 간의 loss + 실제 이미지에서 추출한 벡터 e 및 이전 vector에서 예측한 예측 벡터 ê 간의 loss 계산
Inference: 이전 벡터에서 예측한 ê 을 VAN에 입력

18
EPVA with Adversarial Training
1. L2 loss (MSE loss)는 'blurriness effect'를 유발하는 것으로 잘 알려져 있음. 이에 따라 LSTM이 예측한 feature
vector 또한 blurriness 현상을 겪는 문제 발생
2. 따라서, predictor의 예측 ê 과 encoder의 출력 e 사이에 adversarial loss 추가
3. LSTM discriminator 도입 - feature vector가 encoder에서 생성된 것인지, predictor에서 생성한 것인지 판별
Discriminator loss:
Generator loss:

19
Experiments – Toy Dataset
• Prediction distance: 16 frames
• Results: predictions over 1,000 frames
• Given frames: 3 frames

20
Experiments – Human 3.6M
• Train dataset: subjects 1, 5, 6, 7, 8
• Validation dataset: subject 9
• Test dataset: subject 11
• Frame size: 64 x 64
• Subsampling rate: 6.25 frames per sec
• Prediction distance: 32 frames
• Results: predictions over 126 frames
• Given frames: 5 frames

21
Experiments – Person Detector Evaluation
Object Detector: MobileNet based Object Detector trained on MS-COCO
Detection accuracy on ground-truth data: 0.4

22
Ablation Studies
1. No VAN (VAN improves the quality of predictions)
• VAN을 다른 decoder로 대체 (feature vector만을 입력으로 이미지 생성)
• 32 프레임이 넘어가면서 성능 저하
2. Hybrid objective loss (E2E loss + EPVA loss)
• blurry한 프레임 생성
3. Context frame 5 -> 10
• 기존 5 프레임을 context로 사용하던 것과 달리 10 프레임을 사용하였을 때 성능 변화는 없었음

23
Conclusion & Discussion
1. Hierarchical approach의 구성요소를 joint-training
2. Adversarial training을 적용하여 추가적인 성능 개선
3. 제목의 ‘without supervision’이라는 문구는 이전 논문 (Villegas et al., 2017)의 맥락에서만 의미가 있을 듯. Video
Prediction 자체는 기본적으로 미래 프레임 자체가 정답이 되는 self-supervised learning task.
4. 64 x 64 사이즈면 다소 작은 것 아닌가?
최근 연구(Lee et al., 2021)에서는 Cityscape(256 x 512) 및 KITTI (256 x 256) 데이터셋 사용
Revisiting Hierarchical Approach for Persistent Long-term Video Prediction (2021, Lee et al.)

210801 hierarchical long term video frame prediction without supervision

Recommended

Recommended

More Related Content

Similar to 210801 hierarchical long term video frame prediction without supervision

Similar to 210801 hierarchical long term video frame prediction without supervision (20)

More from taeseon ryu

More from taeseon ryu (20)

210801 hierarchical long term video frame prediction without supervision