Saliency prediction is a topic undergoing intense study in computer vision, with a broad range of applications. It consists of predicting where a human observer's attention will be drawn in an image or a video. Our work builds on a deep neural network named SalGAN, which was trained on a saliency-annotated dataset of static images. In this thesis we investigate different approaches for extending SalGAN to the video domain. To this end, we use the recently proposed saliency-annotated video dataset DHF1K to train and evaluate our models. The obtained results indicate that techniques such as depth estimation or CoordConv can effectively be used as additional modalities to enhance the static-image saliency predictions obtained with SalGAN, achieving encouraging results on the DHF1K benchmark. Our work is implemented in PyTorch and is publicly available here.
Where we look
*Slides from DLCV Seminar by Kevin McGuinness
● Understand what saliency models are and how they work.
● Set a baseline model based on SalGAN on the DHF1K dataset.
● Explore complementary modalities to explicitly model time dynamics as an input for SalGAN.
Image source and paper: Pan, J., Ferrer, C.C., McGuinness, K., O'Connor, N.E., Torres, J., Sayrol, E. and Giro-i-Nieto, X., 2017. SalGAN: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081.
Image source: http://salicon.net/explore/
● General and task-free
● 10K TRAINING
● 5K VALIDATION
● 5K TEST
● Gaussian mask width: 24 pixels
Paper: Jiang, M., Huang, S., Duan, J. and Zhao, Q., 2015. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1072-1080).
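The ground-truth maps in these datasets are built by smoothing the recorded human fixation points with a Gaussian kernel of the width listed above. Below is a minimal sketch of that procedure, assuming fixations come as (row, col) pixel coordinates; the exact width-to-sigma convention used by each dataset may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=24.0):
    """Turn discrete fixation points into a continuous saliency map
    by blurring a binary fixation map with a Gaussian kernel."""
    fix_map = np.zeros((height, width), dtype=np.float32)
    for row, col in fixations:
        fix_map[int(row), int(col)] = 1.0
    sal_map = gaussian_filter(fix_map, sigma=sigma)
    if sal_map.max() > 0:          # normalize to [0, 1]
        sal_map /= sal_map.max()
    return sal_map
```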
Image source and paper: Wang, W., Shen, J., Guo, F., Cheng, M.M. and Borji, A., 2018. Revisiting video saliency: A large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Image source: DHF1K dataset.
● General and task-free
● 600 VIDEOS TRAINING
● 100 VIDEOS VALIDATION
● 300 VIDEOS TEST
● Gaussian mask width: 30 pixels
Paper: Wang, W., Shen, J., Guo, F., Cheng, M.M. and Borji, A., 2018. Revisiting video saliency: A large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Image source and paper: Jiang, L., Xu, M. and Wang, Z., 2017. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM. arXiv preprint arXiv:1709.06316.
Image source: LEDOV dataset.
● General and task-free
● 436 VIDEOS TRAINING
● 41 VIDEOS VALIDATION
● 41 VIDEOS TEST
● Gaussian mask width: 40 pixels
Transfer Learning to DHF1K
Adding extra input signals
Transfer learning to DHF1K

Metric      SalGAN (SALICON)   Fine-tuned on DHF1K
AUC_JUDD    0.872              0.880
AUC_SHUF    0.666              0.632
NSS         2.035              2.285
CC          0.379              0.420
SIM         0.267              0.339
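The fine-tuning step amounts to resuming training of the SALICON-pretrained generator on individual DHF1K frames. The snippet below is a simplified, self-contained sketch of one such training step: `SaliencyNet` and its tiny decoder are placeholders rather than the real SalGAN architecture, and the hyperparameters are illustrative, not the thesis's actual settings.

```python
import torch
import torch.nn as nn
from torchvision import models

class SaliencyNet(nn.Module):
    """SalGAN-style generator sketch: VGG-16 encoder (as in the SalGAN
    paper) plus a placeholder 1x1-conv decoder standing in for the
    real SalGAN decoder."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None)   # load pretrained weights in practice
        self.encoder = vgg.features[:30]   # conv layers up to conv5_3 (stride 16)
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 1, kernel_size=1),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SaliencyNet()
criterion = nn.BCELoss()                   # SalGAN's generator loss is BCE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One training step on a dummy batch standing in for DHF1K frames.
frames = torch.rand(2, 3, 192, 256)        # SalGAN works at 192x256 resolution
gt_maps = torch.rand(2, 1, 192, 256)
optimizer.zero_grad()
loss = criterion(model(frames), gt_maps)
loss.backward()
optimizer.step()
```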
Adding extra input signals

Metric      ACLNet (state of the art)   SalGAN (SALICON)   Train on DHF1K (RGB)   RGB + Depth   RGB + CoordConv
AUC_JUDD    0.890                       0.872              0.880                  0.895         0.866
AUC_SHUF    0.601                       0.666              0.632                  0.648         0.629
NSS         2.354                       2.035              2.285                  2.524         2.072
CC          0.434                       0.379              0.420                  0.463         0.389
SIM         0.315                       0.267              0.339                  0.351         0.304
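The extra modalities are presumably fed to the network as additional input channels: a depth map adds one channel and CoordConv (Liu et al., 2018) adds two normalized coordinate channels, so the first encoder convolution must be widened accordingly. The sketch below shows one common way to do this, copying the pretrained RGB filters and zero-initializing the new ones; the thesis may initialize the extra weights differently.

```python
import torch
import torch.nn as nn

def add_coord_channels(x):
    """Append normalized (y, x) coordinate channels (CoordConv) to an
    image batch of shape (B, C, H, W)."""
    b, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, ys, xs], dim=1)

# The first encoder conv must then accept 5 channels instead of 3.
old = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # stands in for VGG conv1_1
new = nn.Conv2d(5, 64, kernel_size=3, padding=1)
with torch.no_grad():
    new.weight[:, :3] = old.weight     # reuse the pretrained RGB filters
    new.weight[:, 3:].zero_()          # extra channels start from zero
    new.bias.copy_(old.bias)

frames = torch.rand(2, 3, 192, 256)
out = new(add_coord_channels(frames))  # shape: (2, 64, 192, 256)
```

The same widening trick applies to the RGB + Depth variant, with a 4-channel first convolution instead of 5.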
● Good environment. The project code is publicly available.
● Study of state-of-the-art models. Ranking second in the public leaderboard.
● Implementation of a PyTorch version of SalGAN with equivalent performance after fine-tuning on SALICON.
● Boost the performance of the baseline PyTorch model to predict saliency in videos. The baseline model is fine-tuned on the DHF1K dataset using RGB information, RGB + Depth, and RGB + coordinates (CoordConv).
● Optical flow
● Combine with depth and CoordConv in different streams
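For the multi-stream idea, one possible design is a dedicated encoder per modality whose feature maps are concatenated before a shared decoder. The toy sketch below only illustrates the wiring; the tiny convolutional streams stand in for full SalGAN-style encoders, and the fusion scheme is an assumption, not the thesis's final design.

```python
import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    """Toy two-stream model: separate RGB and depth encoders,
    features concatenated and fused into a single saliency map."""
    def __init__(self):
        super().__init__()
        self.rgb_stream = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.depth_stream = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, rgb, depth):
        feats = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.fuse(feats)

model = TwoStreamSaliency()
sal = model(torch.rand(1, 3, 192, 256), torch.rand(1, 1, 192, 256))  # (1, 1, 192, 256)
```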