[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with TCN, semi-supervised training

Eye in the Sky: Real-time Drone Surveillance System
(DSS) for Violent Individuals Identification using
ScatterNet Hybrid Deep Learning Network
Amarjot Singh et al.
손규빈

Department of Industrial and Management Engineering, Korea University

Data Science & Business Analytics Lab

Index
0. Summary
1. Feature Pyramid Network
2. SHDL network - Human pose estimation
3. Support Vector Machine - Detect violent individuals
4. Aerial Violent Individual (AVI) dataset
5. Experiments

0. Summary
1. Extract human regions with FPN
2. Regress key-point coordinates from each human region with the SHDL network
3. Classify violent actions using the key-points

1. FPN: Feature Pyramid Networks
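The model that locates human regions in the “Eye in the Sky” pipeline
•When the FPN paper was written, pyramid structures were rarely used for object detection
•They were computationally expensive and used a lot of memory
•Applying the FPN structure to Faster R-CNN gives a large accuracy gain for a small increase in cost
•Runs at 6 FPS on a GPU (COCO dataset)
•FPN is one of several model families that use a pyramid structure
•Featurized image pyramid
•Single feature map
•Pyramidal feature hierarchy
•Feature Pyramid Network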

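Pyramid structure (1) Featurized image pyramid
•Builds a pyramid of the original image at multiple levels and scales, using hand-crafted features
•Features are extracted independently at each level of the pyramid
•That is, object detection is performed separately at every level
•Inefficient computation and slow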

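Pyramid structure (2) Single feature map
•The common approach: only the last feature layer produced during feature extraction is used
•CNNs are by nature fairly robust to changes in scale, position, and rotation, so the last layer alone is often sufficient
•However, because only the compressed last layer is used, performance beyond a certain level cannot be guaranteed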

Pyramid structure (3) Pyramidal feature hierarchy
•Uses a multi-scale feature representation
•Uses more information about the input image
•Object detection is performed independently on each feature-extraction layer
•A representative example is “SSD: Single Shot MultiBox Detector”

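Pyramid structure (4) Feature Pyramid Network
•This is the structure proposed in the FPN paper itself
•Whereas the previous structures stop at feature extraction, FPN also includes an upsampling stage
•During upsampling, the bottom-up feature maps are merged in through lateral connections
•Object detection is performed independently at each level
•A multi-scale feature representation that is also used more efficiently
[Figure: feature extraction (spatial info) followed by upsampling (semantic info)]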

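Detailed FPN architecture
•Bottom-up
•Essentially borrows the ResNet structure
•In the diagram, "stride" indicates the receptive field
•Top-down
•For simplicity, upsampling uses nearest-neighbor upsampling (2x size)
•The feature map is reduced in channel dimension with a 1x1 conv, then merged by element-wise addition
•Final
•A 3x3 conv is applied to produce the final feature map P
•Two 1x1 convs on P produce the class and bbox outputs
Source: github.com/hwkim94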
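The top-down merge step described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: feature maps are nested lists, the 1x1 conv produces a single output channel, and the final 3x3 conv is omitted. The function names (`upsample_nearest_2x`, `conv1x1`, `merge_top_down`) are hypothetical and do not come from the FPN paper or any library.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling of a 2D map (list of rows) by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def conv1x1(channels, weights):
    """1x1 conv mixing C input channels into one output channel (no bias)."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(wt * ch[i][j] for wt, ch in zip(weights, channels))
             for j in range(w)] for i in range(h)]


def merge_top_down(top, lateral):
    """Upsample the coarser map 2x and add the lateral map element-wise."""
    up = upsample_nearest_2x(top)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, lateral)]


# Toy example: a 2x2 top-down map merged with a 4x4 bottom-up map of 2 channels.
coarse = [[1.0, 2.0], [3.0, 4.0]]
c_a = [[1.0] * 4 for _ in range(4)]
c_b = [[2.0] * 4 for _ in range(4)]
lateral = conv1x1([c_a, c_b], [0.5, 0.25])   # every entry becomes 1.0
merged = merge_top_down(coarse, lateral)
```

In a real network the merged map would then pass through the 3x3 conv to give P, and every tensor would carry a full channel dimension.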

How FPN is applied
•RPN: a predictor head is attached to each level of the plain FPN structure
•3 anchor ratios {1:2, 1:1, 2:1} are used at each of 5 levels -> 15 anchors
•IoU threshold
•0.7 or higher: positive
•0.3 or lower: negative
•The predictor head's parameters are shared across all levels
•Pretrained on the MS COCO 80-category detection dataset
[Figure: predictor head applied to P (1x1 conv, 3x3 conv, 1x1 conv), producing Class and BBox outputs]
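As a rough illustration of the anchor setup and the IoU thresholds listed above, the sketch below builds 15 anchors (5 levels x 3 ratios) at one location and labels them against a ground-truth box. The per-level anchor areas (32² to 512²) and the reading of each ratio as height:width are assumptions, not values stated in these slides, and all function names are hypothetical.

```python
import math


def make_anchor(cx, cy, area, ratio):
    """Box (x1, y1, x2, y2) centered at (cx, cy); ratio is assumed to be height/width."""
    w = math.sqrt(area / ratio)
    h = w * ratio
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """1 = positive, 0 = negative, -1 = ignored (between the two thresholds)."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best <= neg_thr:
        return 0
    return -1


LEVEL_AREAS = [32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2, 512 ** 2]  # assumed, one per level
RATIOS = [0.5, 1.0, 2.0]                                        # {1:2, 1:1, 2:1}

anchors = [make_anchor(256, 256, a, r) for a in LEVEL_AREAS for r in RATIOS]
gt = (192, 192, 320, 320)  # a hypothetical 128x128 ground-truth box
labels = [label_anchor(a, [gt]) for a in anchors]
```

Here the 128x128 square anchor matches the box exactly and is labeled positive, the small 32x32 anchor is negative, and the elongated 128²-area anchors fall between the thresholds and are ignored.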

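2. SHDL: ScatterNet Hybrid Deep Learning Network
(1) ScatterNet: uses the architecture from the authors' earlier paper [Dual-tree wavelet ScatterNet]
•Replaces the first conv block of a CNN, the one attached to the input image (coarse to fine)
•Consists of 2 layers that extract features with DT-CWT filters
•A type of hand-crafted feature; the authors point out that CNNs are hard to tune and that finding an optimal architecture is difficult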

(1) ScatterNet: uses the architecture from the authors' earlier paper [Dual-tree wavelet ScatterNet]
•Features are extracted from the input signal x using dual-tree complex wavelets ψ_{j,r}
•j: scale; 2 scales are used
•r: rotation; 6 orientations are used: 15, 45, 75, 105, 135, and 165 degrees
•The input image is wavelet-transformed at each scale and rotation
•L2 normalization, a log transform, and smoothing are applied in that order
•The final output is a vector formed by concatenating the coefficients from each layer

(2) Regression Network: CNN architecture
•A CNN that takes the output of the preceding ScatterNet as input
•Layer configuration
•4 conv blocks, each { Convolution, ReLU, Pooling, Normalization }
•2 fully connected layers (+Dropout): 1024 and 2048 hidden units
[Figure: Scatter Network -> Conv block x4 -> Dense -> Dense]

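(2) Regression Network: training
•Regresses the (x, y) coordinates of 14 key-points -> 28 values
•Stochastic gradient descent
•Uses the PCANet framework, in which the previous layer's output serves as a prior
•Uses Tukey's biweight loss function, which is robust to outliers
$$
f(x) =
\begin{cases}
x\left(1 - \dfrac{x^2}{c^2}\right)^2 & \text{for } |x| < c \\
0 & \text{for } |x| > c
\end{cases}
$$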
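To make the outlier behaviour concrete, here is a small sketch of Tukey's biweight. The piecewise expression shown on the slide has the form of the biweight influence function (the derivative of the loss with respect to the residual), so the sketch implements both that expression and the standard loss it is the derivative of. The tuning constant c = 4.685 is the conventional default and is an assumption here; the slides do not state the value used.

```python
def tukey_psi(x, c=4.685):
    """Expression from the slide: x * (1 - x^2/c^2)^2 inside |x| < c, else 0."""
    if abs(x) < c:
        return x * (1.0 - (x / c) ** 2) ** 2
    return 0.0


def tukey_rho(x, c=4.685):
    """Standard biweight loss; its derivative with respect to x is tukey_psi."""
    if abs(x) < c:
        return (c * c / 6.0) * (1.0 - (1.0 - (x / c) ** 2) ** 3)
    return c * c / 6.0  # constant for large residuals, so outliers add no gradient


# Residuals beyond c contribute a capped loss and zero gradient.
small, large = 1.0, 50.0
loss_small, loss_large = tukey_rho(small), tukey_rho(large)
grad_small, grad_large = tukey_psi(small), tukey_psi(large)
```

Because the loss saturates at c²/6, a badly mislabeled key-point cannot dominate the update, which is the robustness property the slide refers to.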

3. Violent individual classification
Key-point values are fed to an SVM that predicts 6 classes (5 violent + 1 neutral)
•The SVM is trained on the key-point values extracted by the SHDL network
•6-class classification: 5 violent actions and 1 neutral action
•Training details
•Gaussian kernel
•C = 14
•gamma = 0.00002
•5-fold cross validation
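The sketch below shows what the Gaussian kernel with gamma = 0.00002 computes on two 28-dimensional key-point vectors, together with a simple 5-fold split. It assumes the common convention k(u, v) = exp(-gamma * ||u - v||²); the slides do not say which parameterization the authors used, and `gaussian_kernel` and `k_fold_indices` are hypothetical helper names, not the authors' code.

```python
import math


def gaussian_kernel(u, v, gamma=0.00002):
    """RBF kernel value, assuming k(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def k_fold_indices(n, k=5):
    """Return k (train_indices, val_indices) pairs using a simple interleaved split."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in range(k) if f != i for j in folds[f]]
        splits.append((train, folds[i]))
    return splits


# Two poses whose 28 coordinates (14 key-points x 2) all differ by 10 pixels.
pose_a = [50.0] * 28
pose_b = [60.0] * 28
similarity = gaussian_kernel(pose_a, pose_b)  # exp(-0.00002 * 2800)

splits = k_fold_indices(10, k=5)
```

With such a small gamma the kernel decays slowly, so poses only look dissimilar once their pixel coordinates differ substantially.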

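4. Aerial Violent Individual (AVI) Dataset
The authors built their own dataset for this task
•2,000 images (featuring 10 people)
•10,863 human instances in total
•5,124 of them (48%) are involved in violence
•5 types of violence: Punching, Stabbing, Shooting, Kicking, Strangling
•Each person is annotated with 14 key-points
•Captured by a drone at heights of 2, 4, 6, and 8 meters
•A difficult problem: distance changes with altitude, and images can be blurred by shadows and other factors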

5. Experiments
(1) Human detection with FPN reaches a high accuracy of 97.2%
•A model pretrained on the MS COCO dataset is fine-tuned
•Of the 10,863 people in the AVI dataset, 10,558 are detected -> 97.2%
(2) SHDL experimental setup
•Each human region from FPN is resized to a 120 x 80 image and normalized
•The 10,558 regions are split train:validation:test = 6:2:2
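The two numbers above can be checked with a few lines of arithmetic. The split function is a hypothetical helper; how the authors handled the remainder when 10,558 does not divide evenly is not stated, so assigning leftovers to the training set is an assumption.

```python
def split_counts(n, ratios=(6, 2, 2)):
    """Integer split of n items by ratio; leftover items go to the first part (assumed)."""
    total = sum(ratios)
    counts = [n * r // total for r in ratios]
    counts[0] += n - sum(counts)
    return counts


detected, total_people = 10558, 10863
detection_rate = detected / total_people          # about 0.9719, i.e. 97.2%
train, val, test = split_counts(detected)         # region counts for the 6:2:2 split
```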

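5. Experiments
(2) SHDL key-point regression performance
•Distance from GT: how far, in pixels, a prediction may be from the ground truth and still count as correct
•For all three kinds of key-points, accuracy is high when a distance of up to 5 pixels is allowed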
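A minimal sketch of this "distance from GT" accuracy: a predicted key-point counts as correct when it lies within d pixels of its ground-truth location. The function name and the toy coordinates are made up for illustration and do not come from the paper.

```python
import math


def accuracy_within(pred, gt, d=5.0):
    """Fraction of key-points whose prediction is within d pixels of the ground truth."""
    hits = sum(1 for (px, py), (gx, gy) in zip(pred, gt)
               if math.hypot(px - gx, py - gy) <= d)
    return hits / len(gt)


# Four hypothetical key-points: errors of about 2.2, 6.0, 3.6 and 0 pixels.
gt_points = [(10, 10), (40, 20), (60, 70), (30, 90)]
pred_points = [(12, 11), (40, 26), (58, 73), (30, 90)]

acc_d5 = accuracy_within(pred_points, gt_points, d=5.0)
acc_d2 = accuracy_within(pred_points, gt_points, d=2.0)
```

Sweeping d from small to large values gives the accuracy-versus-distance curves that this kind of evaluation plots.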

5. Experiments
(2) SHDL key-point regression performance
•Accuracy when the distance-from-GT threshold is set to d = 5
•SHDL is more accurate than the other three models
•CN: Coordinate network
•CNE: Coordinate extended network
Accuracy on the AVI dataset (d = 5):
SHDL        87.6%
CN          79.6%
CNE         80.1%
SpatialNet  83.4%

5. Experiments
(3) Violent individuals identification with SVM
•Performance compared with another model on the AVI dataset
Model  Punching  Kicking  Strangling  Shooting  Stabbing
DSS    89%       94%      85%         82%       92%
Surya  80%       84%      73%         73%       79%
•Accuracy drops as the number of people involved in violence increases
Number of violent individuals per image
       1      2      3      4      5
DSS    94.1%  90.6%  88.3%  87.8%  84.0%

3D human pose estimation in video with temporal
convolutions and semi-supervised training
Dario Pavllo et al.
손규빈

Department of Industrial and Management Engineering, Korea University

Data Science & Business Analytics Lab

Index
1. Introduction
2. Temporal Dilated Convolutional model
3. Semi-supervised approach
4. Experiments

1. Introduction
A semi-supervised 2D -> 3D mapping model that uses dilated convolutions
•Goal: 3D human pose estimation in video
•Problem formulation: mapping
•2D keypoint detection -> 3D pose estimation
•Most existing models use an RNN structure for the 2D-to-3D mapping
•Main contribution
•3D human pose estimation in video based on dilated temporal convolutions on 2D keypoint trajectories
•semi-supervised approach which exploits unlabeled video

2. Temporal dilated convolutional model
A model that estimates 3D joints from a sequence of 2D joint coordinates

Model architecture
•Input data: 243 (frames) x 34 (17 joints * 2 dims (x, y))
•4 residual blocks, dropout rate 0.25, 243 frames, filter size 3, 1024 output features
•TCN layer notation
•e.g. 2J, 3d1, 1024 => 2J input channels, conv filter size 3, dilation 1, 1024 output channels
•Because VALID (unpadded) convolutions are used, the residual and the block output differ in length at the skip connection -> the residual is sliced equally on the left and right so the shapes match
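The link between "4 residual blocks, filter size 3" and "243 frames" can be checked with a short calculation. The sketch assumes one initial layer with dilation 1 (the "3d1" layer in the notation above) followed by one dilated layer per block whose dilation grows by a factor of 3 each time; that dilation schedule is an assumption about the architecture, not something spelled out on this slide. `slice_residual` is a hypothetical helper showing the symmetric slicing.

```python
def receptive_field(kernel=3, num_blocks=4):
    """Receptive field of a stack of VALID convs with dilations kernel**0 .. kernel**num_blocks."""
    dilations = [kernel ** b for b in range(num_blocks + 1)]   # 1, 3, 9, 27, 81
    frames = 1 + sum((kernel - 1) * d for d in dilations)
    return frames, dilations


def slice_residual(residual, out_len):
    """Trim a residual sequence equally on both sides so it matches the conv output length."""
    trim = (len(residual) - out_len) // 2
    return residual[trim:trim + out_len]


frames, dilations = receptive_field()

# After the first layer a 243-frame input is 241 frames long; a dilation-3 conv
# with filter size 3 then removes 2 * 3 = 6 more frames, leaving 235.
residual = list(range(241))
aligned = slice_residual(residual, 235)
```

Under these assumptions the receptive field comes out to exactly 243 frames, and each skip connection drops as many frames on the left as on the right.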

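Normal (acausal) convolution for training
•During training, both past and future frames are used

Causal convolution for test
•At test time a real-world setting has to be assumed, so only past frames are used

Padding with replica of the boundary frames
•Padding replicates the boundary frames (the example image shows the acausal case)
•In the authors' experiments, the commonly used zero padding reportedly gave a higher loss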
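Replica padding for the acausal and causal cases can be sketched as follows. The function is a hypothetical illustration: it pads a frame sequence so that a VALID convolution stack with the given receptive field returns one output per input frame, padding both sides equally in the acausal case and only the past side in the causal case.

```python
def pad_replicate(frames, receptive_field, causal=False):
    """Pad a sequence by repeating its first and last frames."""
    total = receptive_field - 1
    left = total if causal else total // 2   # causal: all padding on the past side
    right = total - left
    return [frames[0]] * left + list(frames) + [frames[-1]] * right


clip = ["f0", "f1", "f2"]
acausal = pad_replicate(clip, receptive_field=5)
causal = pad_replicate(clip, receptive_field=5, causal=True)
```

For a receptive field of 5 the acausal result is f0, f0, f0, f1, f2, f2, f2, while the causal result places all four copies of f0 before the clip.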

Supervised and unsupervised losses are both computed and optimized jointly
•Each batch is half labeled and half unlabeled
•Supervised loss
•Uses the ground-truth 3D joints
•Unsupervised loss (+ regularizer)
•Framed as an autoencoder problem
•Encoder: the 3D pose estimator
•The reconstruction loss is computed after the predicted 3D joints are projected back to 2D
•A bone-length term is added as an L2 loss
Reconstruction error: MPJPE (Mean Per-Joint Position Error), the mean Euclidean distance between corresponding joints
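MPJPE as defined above is short enough to write out directly. This is a minimal sketch for a single pose, with joints given as (x, y, z) tuples; the function name and the toy values are illustrative only.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between corresponding predicted and ground-truth joints."""
    dists = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(dists) / len(dists)


# Two joints with errors of 3 and 4 units give an MPJPE of 3.5.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, gt)
```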

Trajectory model
•The trajectory model is a network that takes 2D poses and produces a 3D trajectory
•The trajectory is used in addition, for the paper's goal of 2D -> 3D mapping
•When unlabeled data is back-projected, the 3D trajectory is also taken into account in the reconstruction
•This allows back projection to work correctly

Loss function
•Supervised loss
•MPJPE computed against the 3D ground truth
•Global trajectory loss
•Each sample is weighted by the inverse of the ground-truth depth from the camera
•Uses the Weighted Mean Per-Joint Position Error (WMPJPE)
$$
E = \frac{1}{y_z} \lVert f(x) - y \rVert
$$
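The weighting in E can be illustrated on a toy trajectory. The sketch assumes each trajectory point is an (x, y, z) tuple with z as the ground-truth depth from the camera (y_z in the formula) and that the per-frame errors are averaged; both choices are assumptions made for illustration.

```python
import math


def wmpjpe(pred_traj, gt_traj):
    """Mean of ||f(x) - y|| / y_z over frames, where y_z is the ground-truth depth."""
    errs = [math.dist(p, g) / g[2] for p, g in zip(pred_traj, gt_traj)]
    return sum(errs) / len(errs)


# The same 1-unit error counts half as much when the subject is twice as far away.
gt_traj = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0)]
pred_traj = [(1.0, 0.0, 2.0), (1.0, 0.0, 4.0)]
weighted_error = wmpjpe(pred_traj, gt_traj)   # (1/2 + 1/4) / 2
```

Down-weighting distant subjects reflects that their absolute position is harder to regress precisely from 2D input.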

4. Experiments
(1) Datasets used: Human3.6M, HumanEva-I
•Human3.6M
•3.6 million video frames
•11 subjects (7 with 3D pose annotations)
•15 actions per subject
•HumanEva-I
•A small dataset
•3 subjects, 3 actions (Walk, Jog, Box)
•Uses 15-joint skeleton data

4. Experiments
(3) 2D pose estimation: Mask R-CNN & Cascaded Pyramid Network
•Backbone model
•Mask R-CNN with ResNet-101-FPN
•Cascaded Pyramid Network with ResNet-50
•Training procedure
•Pre-train on the MS COCO dataset
•Fine-tune on Human3.6M

4. Experiments
(4) Results - Qualitative
•Top: 2D pose overlaid on the video
•Bottom: 3D joint mapping

4. Experiments
(4) Results - Reconstruction error
Euclidean distance between predicted and ground-truth joint coordinates (MPJPE)
The model performs better in most cases, and the model [24] that outperforms it uses ground truth

4. Experiments
(4) Results
[Downsampled]
Lower error than other semi-supervised models
[All frames used]
Close to supervised performance, and reduces error by at least 14.7 mm compared with other models
