The document provides an overview of single image crowd counting using deep learning models. It discusses three major crowd counting approaches: detection-based, regression-based, and density map estimation. Density map estimation has emerged as a promising approach that achieves high accuracy for crowded scenes while preserving spatial information. The document also describes challenges in crowd counting, various crowd counting models, loss functions, datasets used for evaluation, and recent approaches such as multi-scale models, frequency domain methods, and self-supervised pre-training.
1. Single image crowd counting using deep learning models
End semester evaluation of M.Tech. project
By
Pawar Shubham Rajebhau - 2102102013
Under the guidance of
Dr. Vivek Kanhangad
Department of Electrical Engineering
Indian Institute of Technology Indore
2. Table of contents
Introduction
Overview of crowd counting methods
Challenges
Crowd counting models
Density map estimation method
Loss functions
Dataset
Evaluation metrics
Frequency domain approach
Multi-scale model approach
Self-supervised pre-training approach
Future work
References
3. Introduction
• Single image crowd counting estimates the number of objects (people, cars, cells, etc.) in an unconstrained scene.
• It has important applications in public safety, traffic management, consumer behavior, cell counting, etc.
• Extensive research has been done in this area, particularly with the use of deep learning.
Image-based crowd counting
5. Overview of crowd counting methods
The three major crowd counting approaches are detection-based, regression-based, and density map
estimation.
Detection-based approach
• Based on computer vision techniques.
• Detects individual objects, heads, or body parts and counts the total number in the image.
• Accuracy deteriorates in crowded scenes with severe occlusions.
• Requires full identification and outlining of each object, incurring the highest labeling cost.
6. Overview of crowd counting methods
Regression-based approach
• Estimates the count by directly relating it to the image.
• Achieves higher accuracy than the detection-based approach in crowded scenes.
• Lacks spatial information and interpretability, limiting its use in localization studies.
• Does not require annotating individual objects, resulting in a lower annotation cost.
7. Overview of crowd counting methods
Density map estimation
• Recently emerged as a promising approach.
• Achieves high accuracy for crowded scenes.
• Preserves the spatial information of the people distribution.
• Requires annotating only the heads of people, resulting in a labeling cost intermediate between the detection-based and regression-based approaches.
8. Density map estimation method
• The annotated density map is a sparse binary mask.
• Each individual person is marked with a single dot on their head or forehead.
• The spatial extent of each person is not provided.
[Figure: ShanghaiTech image with its annotated density map]
9. Density map estimation method
• Sparse density maps are converted to dense maps using fixed or variable Gaussian kernels.
• This helps train the model better than directly using a sparse dot map.
Sigma value computation methods
• Fixed standard deviation (sigma) of the Gaussian kernel.
• The sigma value is set equal to the distance to the nearest neighbor.
• The sigma value is computed as the average of the distances to the three nearest neighbors, divided by 10.
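The fixed-sigma conversion can be sketched in NumPy as follows (a minimal illustration: the function name, image size, and sigma value are made up for the example; the adaptive variants above would replace the fixed sigma with a per-point value):

```python
import numpy as np

def dot_map_to_density_map(points, shape, sigma=4.0):
    """Convert dot annotations (one dot per head) into a dense density map by
    placing a normalized Gaussian kernel at each annotated point. Because each
    kernel sums to 1, the density map integrates to the person count."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float64)
    # Precompute one truncated Gaussian kernel (radius = 3 * sigma).
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize: each person contributes exactly 1
    for (px, py) in points:
        px, py = int(px), int(py)
        # Clip the kernel window at the image borders.
        y0, y1 = max(0, py - r), min(h, py + r + 1)
        x0, x1 = max(0, px - r), min(w, px + r + 1)
        ky0 = y0 - (py - r)
        ky1 = (2 * r + 1) - ((py + r + 1) - y1)
        kx0 = x0 - (px - r)
        kx1 = (2 * r + 1) - ((px + r + 1) - x1)
        density[y0:y1, x0:x1] += kernel[ky0:ky1, kx0:kx1]
    return density
```

Summing the resulting map recovers the annotated count, which is what makes density maps usable as a counting target.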
13. Loss functions
• The predicted density map is a smooth heat map, while the available ground truth is a dot map.
• The choice of loss function depends on how the ground-truth and predicted density maps are used.
• A well-chosen loss function improves performance by extracting proper supervisory information from the ground truth.
L2 loss
• Pixel-wise loss: the network adjusts each pixel value according to the L2 loss, comparing it against the corresponding pixel in the ground truth.
• The ground-truth dot map is converted to a smooth heat map using a Gaussian kernel.
• The L2 loss is sensitive to the choice of variance of the Gaussian kernel.
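The pixel-wise objective itself is a one-liner (a minimal sketch assuming both maps are same-size arrays; in training this would be evaluated on the network output against the Gaussian-smoothed ground truth):

```python
import numpy as np

def l2_loss(pred_density, gt_density):
    """Pixel-wise L2 (mean squared error) loss between the predicted density
    map and the Gaussian-smoothed ground-truth density map."""
    assert pred_density.shape == gt_density.shape
    return float(np.mean((pred_density - gt_density) ** 2))

# A perfect prediction gives zero loss; any pixel-wise deviation is penalized.
gt = np.zeros((8, 8))
gt[4, 4] = 1.0
```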
14. Loss functions
Optimal transport loss
• Treats the predicted density map and the dot map as probability distributions and uses balanced OT to match the shapes of the two distributions.
• The Sinkhorn algorithm is used to obtain the optimal transport matrix.
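The Sinkhorn iterations can be sketched in a few lines of NumPy (a toy illustration of entropy-regularized OT, not the exact balanced-OT formulation of any particular paper; the grid, cost matrix, and regularization strength eps are made up for the example):

```python
import numpy as np

def sinkhorn(a, b, C, eps=1.0, n_iter=200):
    """Sinkhorn iterations for entropy-regularized (balanced) optimal transport.
    a, b : source / target probability vectors (each sums to 1)
    C    : cost matrix, C[i, j] = cost of moving unit mass from bin i to bin j
    Returns the transport plan P whose row sums match a and column sums match b."""
    K = np.exp(-C / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):         # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: two small distributions on a 1-D grid with squared-distance cost.
x = np.arange(3, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
a = np.array([0.2, 0.5, 0.3])
b = np.array([0.3, 0.3, 0.4])
P = sinkhorn(a, b, C, eps=5.0)      # large eps: blurred but fast-converging plan
```

Smaller eps gives a sharper (closer to exact OT) plan at the cost of more iterations; in crowd counting the two distributions would be the flattened predicted density map and the ground-truth dot map.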
15. Dataset
ShanghaiTech
• The ShanghaiTech dataset contains two parts: Part A and Part B.
• Part A contains 482 images (300 for training, 182 for testing) and includes high-density crowds collected from the Internet.
• Part B contains 716 images (400 for training, 316 for testing) and is captured from busy streets in urban areas of Shanghai.
• The average resolution of images in ShanghaiTech Part A is 589×868 pixels.
• The scenes in Part B are less crowded than those in Part A.
18. Dataset
JHU-CROWD
• Contains 4,372 images with an average resolution of 1430×910 pixels.
• The dataset was collected from various geographical locations and under diverse conditions.
• Contains a total of 1.51 million dot annotations, with an average of 346 dots per image and a maximum of 25K dots.
• Includes images captured under adverse weather and various illumination conditions, ensuring improved diversity.
• Provides head-level labels such as dots, approximate bounding boxes, and blur level, as well as image-level labels such as scene type and weather condition.
21. Evaluation metrics
Mean Absolute Error (MAE)
• Determines the accuracy of the estimates.

    MAE = (1/N) · Σ_{i=1..N} |C_i^pred − C_i^gt|

Mean Square Error (MSE)
• Indicates the robustness of the estimates.

    MSE = sqrt( (1/N) · Σ_{i=1..N} (C_i^pred − C_i^gt)² )

where N = number of test images, C_i^pred = predicted count for image i, and C_i^gt = ground-truth count for image i.
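Both metrics are computed directly from per-image counts (a small sketch; MSE is taken here as the root of the mean squared error, as is standard in crowd-counting papers):

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean Absolute Error over N test images (accuracy of the estimates)."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred_counts, gt_counts):
    """(Root) Mean Square Error over N test images (robustness: large errors
    on a few images are penalized more heavily than by MAE)."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# Example: predicted vs. ground-truth counts on three test images.
print(mae([100, 210, 95], [110, 200, 95]))  # (10 + 10 + 0) / 3 ≈ 6.67
```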
22. Frequency domain approach
• Fourier transforms are used to obtain the frequency responses of the predicted density map and the ground-truth dot map.
• The dispersed spatial information in the predicted and ground-truth density maps is converted to compact information in the frequency domain.
24. Frequency domain approach
Using the CMTL model

[Pipeline: the input image is passed through the Cascaded-MTL model to produce a predicted density map. The DFT of the predicted density map is compared with the DFT of the ground-truth density map to give the frequency loss Lf, while the spatial loss L is computed directly between the two density maps.]

Loss = L + β·Lf
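A minimal sketch of the combined objective (the exact frequency loss used in the referenced work may differ; here Lf is taken as the mean squared magnitude of the DFT difference, and beta is the weighting in Loss = L + β·Lf):

```python
import numpy as np

def frequency_loss(pred_dm, gt_dm):
    """Frequency-domain loss Lf: compare the 2-D DFTs of the predicted and
    ground-truth density maps instead of their spatial values."""
    F_pred = np.fft.fft2(pred_dm)
    F_gt = np.fft.fft2(gt_dm)
    return float(np.mean(np.abs(F_pred - F_gt) ** 2))

def total_loss(pred_dm, gt_dm, beta=0.001):
    """Combined objective: spatial L2 loss plus a beta-weighted frequency loss."""
    l_spatial = float(np.mean((pred_dm - gt_dm) ** 2))
    return l_spatial + beta * frequency_loss(pred_dm, gt_dm)
```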
25. Frequency domain approach
Results (ShanghaiTech Part A)

Before:
MAE 86.7, MSE 132.0

After (varying the frequency-loss weight β):
β      | MAE   | MSE
0      | 85.78 | 130.92
0.0001 | 86.02 | 132.73
0.001  | 85.00 | 133.07
0.1    | 86.81 | 131.90
34. Multi-scale model approach
Dilated convolution
• Dilated convolution uses sparse kernels.
• This enlarges the receptive field without increasing the number of parameters or the amount of computation.
Reference: [12]
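The effect is easy to see in one dimension: a 3-tap kernel with dilation 2 covers a 5-sample receptive field using the same 3 weights (a minimal NumPy sketch; models such as CSRNet apply 2-D dilated convolutions inside the network):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """1-D dilated convolution (valid padding): kernel taps are spaced
    `dilation` samples apart, enlarging the receptive field from k to
    (k - 1) * dilation + 1 without adding parameters or computation."""
    k = len(kernel)
    span = (k - 1) * dilation + 1        # effective receptive field
    out_len = len(x) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, k, dilation=1))  # receptive field 3
print(dilated_conv1d(x, k, dilation=2))  # receptive field 5, same 3 weights
```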
35. Multi-scale model approach
Results (ShanghaiTech Part A)

Before:
MAE   | MSE    | Epoch | Crop size
65.60 | 103.13 | 435   | 256

After:
MAE   | MSE    | Epoch | Crop size | Frozen layers
64.58 | 97.80  | 390   | 256       | 9
67.13 | 96.67  | 270   | 256       | 9
67.78 | 100.06 | 125   | 256       | 9
41. References
[1] J. Wan, Z. Liu, and A. B. Chan, "A Generalized Loss Function for Crowd Counting and Localization," in CVPR, 2021, pp. 1974-1983, doi: 10.1109/CVPR46437.2021.00201.
[2] Z. Ma, X. Wei, X. Hong, and Y. Gong, "Bayesian loss for crowd count estimation with point supervision," in ICCV, 2019, pp. 6142-6151.
[3] B. Wang, H. Liu, D. Samaras, and M. H. Nguyen, "Distribution matching for crowd counting," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1595-1607.
[4] V. Sindagi and V. Patel, "CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting," in AVSS, Lecce, Italy, 2017, pp. 1-6, doi: 10.1109/AVSS.2017.8078491.
[5] H. Idrees et al., "Composition loss for counting, density map estimation and localization in dense crowds," in ECCV, 2018.
[6] L. Jiang, B. Dai, W. Wu, and C. C. Loy, "Focal frequency loss for image reconstruction and synthesis," in ICCV, 2021, pp. 13919-13929.
[7] W. Shu, J. Wan, K. C. Tan, S. Kwong, and A. B. Chan, "Crowd Counting in the Frequency Domain," in CVPR, 2022, pp. 19586-19595, doi: 10.1109/CVPR52688.2022.01900.
42. References
[8] J. Chen, K. Wang, W. Su, and Z. Wang, "SSR-HEF: Crowd counting with multi-scale semantic refining and hard example focusing."
[9] F. Liang, Y. Li, and D. Marculescu, "SupMAE: Supervised masked autoencoders are efficient vision learners," arXiv preprint arXiv:2205.14540, 2022.
[10] M. A. Khan, H. Menouar, and R. Hamila, "Revisiting crowd counting: State-of-the-art, trends, and future perspectives," Image and Vision Computing, 104597, 2022.
[11] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in CVPR, 2016, pp. 589-597.
[12] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in CVPR, 2018, pp. 1091-1100.