[Paper] learning video representations from correspondence proposals

Susang Kim(healess1@gmail.com)
Video Understanding(2)
Learning Video Representations from Correspondence Proposals

Video Architecture (ReCap)
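Using an ImageNet pre-trained model as the backbone, the existing architectures (a-d) are compared with
the I3D architecture (e) proposed in the paper, showing that changing the network structure improves performance.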

Kinetics Dataset (ReCap)
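Models pre-trained on ImageNet (1000 images per category, 1000 categories) perform well not only on classification
but also on object detection and segmentation. The Kinetics dataset was built on that observation.
For action recognition, a model pre-trained on Kinetics and fine-tuned on the previously used
HMDB-51 and UCF-101 datasets reached SOTA, demonstrating how important a large amount of
training data is.
Kinetics Dataset: 650,000 videos, defined around human actions
(single-person actions, person-person actions, and actions that handle objects)
(600 video clips per class, 10 seconds each)
First released with 400 classes; later versions extend the dataset to 600 and 700 classes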

HMDB-51 Dataset (ReCap)
Released at ICCV 2011: 6,849 video clips of human motion, organized into 51 action categories.
Each category contains at least 101 clips.
http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#introduction

Action Recognition: Second Paper
A paper presented at CVPR 2019 by Stanford/Adobe.
Using RGB only, it reaches SOTA with CPNet (Correspondence Proposals),
which builds correspondences between semantic features
across 2D images and temporal information.
Action Recognition: classifying what action a person is performing
in a given video (a video goes in, a predicted class comes out)

Particular objects in different images (frames) correspond to each other
https://www.youtube.com/watch?v=4IInDT_S0ow&t=57s
Correspondence in Videos

Visualization of CP Module
Visualizes the correspondence between particular features as the frames change
https://www.youtube.com/watch?v=4IInDT_S0ow&t=57s

Problems with Previous Work
We constructed a toy video dataset where
previous RGB only methods fail in
learning long-range motion. Through this
extremely simple dataset, we show the
drawbacks of previous methods and the
advantage of our architecture.
Action videos with a low frame rate and fast motion are generally
recognized poorly; CPNet improves on this.
Four classes (Up, Down, Left, Right)
A 2x2 white dot moves on a 32x32 black background
(by about 7 to 9 pixels)
Train 1000 / Validation 200
A Failing of Several Previous Methods
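
The toy dataset described above can be approximated with a few lines of NumPy. This is only a sketch under assumptions that the slides do not state: the number of frames per clip (`num_frames=4`), reading the 7 to 9 pixel displacement as a per-frame step, and the names `make_clip` and `make_split`, which are hypothetical helpers and not part of the CPNet code.

```python
import numpy as np

CLASSES = ("up", "down", "left", "right")
# (dy, dx) unit direction per class; image rows grow downward.
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_clip(label, rng, num_frames=4, size=32, dot=2):
    """Return a (num_frames, size, size) clip of a white dot moving in one direction."""
    dy, dx = DIRECTIONS[label]
    step = int(rng.integers(7, 10))   # 7, 8 or 9 pixels per frame (assumption)
    travel = step * (num_frames - 1)  # total displacement over the clip
    limit = size - dot                # largest valid top-left coordinate

    def start(direction):
        # Pick a start position so the dot never leaves the image.
        if direction > 0:
            return int(rng.integers(0, limit - travel + 1))
        if direction < 0:
            return int(rng.integers(travel, limit + 1))
        return int(rng.integers(0, limit + 1))

    y, x = start(dy), start(dx)
    clip = np.zeros((num_frames, size, size), dtype=np.float32)  # black background
    for t in range(num_frames):
        yy, xx = y + dy * step * t, x + dx * step * t
        clip[t, yy:yy + dot, xx:xx + dot] = 1.0                  # white dot
    return clip


def make_split(n, seed):
    """Generate n labelled clips with a fixed random seed."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, len(CLASSES), size=n)
    clips = np.stack([make_clip(CLASSES[i], rng) for i in labels])
    return clips, labels


train_x, train_y = make_split(1000, seed=0)  # Train 1000
val_x, val_y = make_split(200, seed=1)       # Validation 200
```

Because the dot jumps several times its own size between frames, a small local temporal kernel never sees the dot in two consecutive frames at once, which is the failure case the slide refers to.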

Correspondence Proposals Module
https://arxiv.org/pdf/1905.07853.pdf
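Starting from the purple point (a feature),
the CP module selects the k features
that are most similar to it (k-NN)
as meaningful correspondences,
so that the class can be determined
more accurately across multiple frames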

Calculate the similarity of all features point pairs
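Each of the THW positions has a C-channel feature vector. The features are paired, and each pair is given a
similarity score using the negative (-) Euclidean distance, the metric used for k-NN: s(i, j) = -||f_i - f_j||
(T: time, H: height, W: width, C: channels)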
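
As an illustration of this step, the sketch below computes the negative Euclidean distance between every pair of features with NumPy. The function name `pairwise_similarity` and the (T, H, W, C) input layout are assumptions made for the example; the CPNet code itself uses a dedicated `knn` op.

```python
import numpy as np


def pairwise_similarity(features):
    """Map (T, H, W, C) features to a (THW, THW) matrix of negative Euclidean distances."""
    t, h, w, c = features.shape
    flat = features.reshape(t * h * w, c).astype(np.float64)  # one row per feature point
    sq_norm = np.sum(flat ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once.
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * flat @ flat.T
    sq_dist = np.maximum(sq_dist, 0.0)  # clip small negative rounding errors
    return -np.sqrt(sq_dist)            # larger score means more similar


rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 3, 3, 4))  # T=2, H=3, W=3, C=4
sim = pairwise_similarity(feats)
print(sim.shape)  # (18, 18)
```

Every score is at most zero, and a feature compared with itself scores (approximately) zero, the highest possible value.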

Set the diagonal block matrices
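The similarity scores form a THW x THW matrix (pairs within the same frame T, i.e. the diagonal blocks, are set to −∞ and thus excluded).
For each of the THW features, the k features with the highest similarity scores are selected (stored as a look-up table).
The result is therefore a THW x k matrix, and each entry holds the index i of a selected feature.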
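
A minimal NumPy sketch of this masking and top-k selection is shown below. The function name `correspondence_proposals` and the frame-major ordering of the THW axis are assumptions for illustration; the random features only serve to produce a similarity matrix.

```python
import numpy as np


def correspondence_proposals(sim, t, h, w, k):
    """Return (THW, k) indices of the k most similar features taken from other frames."""
    hw = h * w
    if k > (t - 1) * hw:
        raise ValueError("k exceeds the number of features outside a single frame")
    frame_id = np.arange(t * hw) // hw                   # frame index of every feature
    same_frame = frame_id[:, None] == frame_id[None, :]  # True on the diagonal blocks
    masked = np.where(same_frame, -np.inf, sim)          # exclude same-frame pairs
    order = np.argsort(-masked, axis=1)                  # highest score first
    return order[:, :k]


t, h, w, c, k = 3, 2, 2, 5, 4
rng = np.random.default_rng(0)
flat = rng.normal(size=(t * h * w, c))
sim = -np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
nn_idx = correspondence_proposals(sim, t, h, w, k)
print(nn_idx.shape)  # (12, 4)
```

Masking with −∞ before the top-k step prevents a feature from proposing itself or its spatial neighbours in the same frame, so every proposal points to a different time step.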

Correspondence Embedding layer
For feature 𝑖₀, the k most meaningful features are found;
their index values and THW features are fed to an MLP,
which is trained to produce the output vector (CP vector)
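
The embedding step can be sketched in NumPy as follows, mirroring the concatenate, shared MLP, max-pool pattern of the TensorFlow code on the next slide. Several things here are assumptions for illustration only: the helper names `make_coord` and `cp_embedding`, the raw (t, h, w) coordinate units, the ReLU activation, the omission of batch normalization, and the random untrained weights and placeholder proposals.

```python
import numpy as np


def make_coord(t, h, w):
    """(THW, 3) array holding the (t, h, w) coordinate of every feature (hypothetical helper)."""
    grid = np.stack(
        np.meshgrid(np.arange(t), np.arange(h), np.arange(w), indexing="ij"), axis=-1
    )
    return grid.reshape(-1, 3).astype(np.float32)


def cp_embedding(features, coord, nn_idx, weights):
    """features (N, C), coord (N, 3), nn_idx (N, k) -> CP vectors of shape (N, C_out)."""
    n, k = nn_idx.shape
    feat_center = np.repeat(features[:, None, :], k, axis=1)  # (N, k, C) query feature
    feat_neighbor = features[nn_idx]                           # (N, k, C) proposed features
    coord_diff = coord[nn_idx] - coord[:, None, :]             # (N, k, 3) spatio-temporal offset
    x = np.concatenate([coord_diff, feat_center, feat_neighbor], axis=-1)
    for w_mat, b in weights:
        x = np.maximum(x @ w_mat + b, 0.0)  # shared MLP layer with ReLU (assumption)
    return x.max(axis=1)                    # max-pool over the k proposals


t, h, w, c, k = 2, 2, 2, 4, 3
n = t * h * w
rng = np.random.default_rng(0)
features = rng.normal(size=(n, c)).astype(np.float32)
coord = make_coord(t, h, w)
nn_idx = rng.integers(0, n, size=(n, k))  # placeholder proposals, not real k-NN output
dims = [2 * c + 3, 16, 8]
weights = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]
cp_vectors = cp_embedding(features, coord, nn_idx, weights)
print(cp_vectors.shape)  # (8, 8)
```

Because the final step is a max over the k proposals, the result does not depend on the order in which the proposals are listed.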

CP Module Code
import tensorflow as tf  # TensorFlow 1.x (the code uses tf.contrib)
# knn, tf_grouping, tf_util and get_coord come from the CPNet code base.

def cp_module(video, k, mlp, scope, mlp0=None, is_training=None, bn_decay=None,
              weight_decay=None, data_format='NHWC', distance='l2',
              activation_fn=None, shrink_ratio=None, freeze_bn=False):
    # Excerpt: the lines that derive net, batch_size, num_frames, new_height,
    # new_width, num_channels, num_channels_bottleneck and end_points from
    # `video` are not shown on the slide.

    # k-NN indices for every feature; the last argument is the per-frame feature count.
    nn_idx = knn.knn(net, k, new_height * new_width)
    # Repeat each query feature k times and gather its k proposed features.
    net_expand = tf.tile(tf.expand_dims(net, axis=2), [1, 1, k, 1])
    net_grouped = tf_grouping.group_point(net, nn_idx)

    # Coordinate offsets between each proposal and its query.
    coord = get_coord(tf.reshape(video, [batch_size, -1, new_height, new_width,
                                         num_channels_bottleneck]))
    coord_expand = tf.tile(tf.expand_dims(coord, axis=2), [1, 1, k, 1])
    coord_grouped = tf_grouping.group_point(coord, nn_idx)
    coord_diff = coord_grouped - coord_expand
    end_points['coord'] = {'coord': coord, 'coord_grouped': coord_grouped,
                           'coord_diff': coord_diff}

    # Shared MLP (1x1 convolutions) over [offset, query feature, proposal feature].
    net = tf.concat([coord_diff, net_expand, net_grouped], axis=-1)
    with tf.variable_scope(scope) as sc:
        for i, num_out_channel in enumerate(mlp):
            net = tf_util.conv2d(net, num_out_channel, [1, 1], padding='VALID',
                                 stride=[1, 1], bn=True, is_training=is_training,
                                 scope='conv%d' % (i), bn_decay=bn_decay,
                                 weight_decay=weight_decay,
                                 data_format=data_format, freeze_bn=freeze_bn)
    end_points['before_max'] = net
    # Max-pool over the k proposals, then restore the video layout.
    net = tf.reduce_max(net, axis=[2], keepdims=True, name='maxpool')
    end_points['after_max'] = net
    net = tf.reshape(net, [batch_size, num_frames, new_height, new_width, mlp[-1]])

    # Project back to num_channels; the batch-norm gamma is initialized to zero.
    with tf.variable_scope(scope) as sc:
        net = tf_util.conv3d(net, num_channels, [1, 1, 1], stride=[1, 1, 1],
                             bn=False, activation_fn=None,
                             weight_decay=weight_decay, scope='conv_final')
        net = tf.contrib.layers.batch_norm(
            net, center=True, scale=True,
            is_training=is_training if not freeze_bn else tf.constant(
                False, shape=(), dtype=tf.bool),
            decay=bn_decay, updates_collections=None, scope='bn_final',
            data_format=data_format,
            param_initializers={'gamma': tf.constant_initializer(0., dtype=tf.float32)},
            trainable=not freeze_bn)
    return net, end_points

CPNet Architecture in ResNet
As when applying CP modules to ResNet-101, a CP module can be
attached to the last convolution layer of a residual block (before the ReLU)

Comparison with Other Architectures
Using RGB only, CPNet outperforms two-stream networks that use optical flow, such as I3D,
and uses fewer parameters than existing RGB-based networks

Model Run Time
Spatial size 112 x 112, ResNet-34, NVIDIA GTX 1080 Ti GPU with TensorFlow and cuDNN
With a batch size of 1, the processing speed is 10.1 videos/s for a frame length of 8 and 3.9 videos/s for a
frame length of 32. The number of videos that can be proces
Expressed as a formula, the time complexity is O((THW) log(THW) · (C + k))
Run time also increases as the batch size grows

Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
