The document provides an overview of single image crowd counting using deep learning models. It discusses three major crowd counting approaches: detection-based, regression-based, and density map estimation. Density map estimation has emerged as a promising approach that achieves high accuracy for crowded scenes while preserving spatial information. The document also describes challenges in crowd counting, various crowd counting models, loss functions, datasets used for evaluation, and recent approaches such as multi-scale models, frequency domain methods, and self-supervised pre-training.
1. Single image crowd counting using deep learning models
End semester evaluation of M.Tech. project
By
Pawar Shubham Rajebhau - 2102102013
Under the guidance of
Dr. Vivek Kanhangad
Department of Electrical Engineering
Indian Institute of Technology Indore
2. Table of contents
Introduction
Overview of crowd counting methods
Challenges
Crowd counting models
Density map estimation method
Loss functions
Dataset
Evaluation metrics
Frequency domain approach
Multi-scale model approach
Self-supervised pre-training approach
Future work
References
3. Introduction
• Single image crowd counting estimates the number of objects (people, cars, cells, etc.) in an unconstrained scene.
• It has important applications in public safety, traffic management, consumer behavior, cell counting, etc.
• Extensive research has been done in this area, particularly with the use of deep learning.
Image-based crowd counting
5. Overview of crowd counting methods
The three major crowd counting approaches are detection-based, regression-based, and density map
estimation.
Detection-based approach
• Based on computer vision techniques.
• Detects individual objects, heads, or body parts and counts the total number in the image.
• Accuracy deteriorates in crowded scenes with severe occlusions.
• Requires full identification and outlining of each object, incurring the highest labeling cost.
6. Overview of crowd counting methods
Regression-based approach
• Estimates the count by directly relating it to the image.
• Achieves higher accuracy than the detection-based approach in crowded scenes.
• Lacks spatial information and interpretability, limiting its use in localization studies.
• Does not require annotating individual objects, resulting in a lower annotation cost.
7. Overview of crowd counting methods
Density map estimation
• Recently emerged as a promising approach.
• Achieves high accuracy for crowded scenes.
• Preserves the spatial information of the people distribution.
• Requires annotating only the heads of people, resulting in a labeling cost intermediate between the detection-based and regression-based approaches.
8. Density map estimation method
• The annotated density map is a sparse binary mask.
• Each individual person is marked with a single dot on their head or forehead.
• The spatial extent of each person is not provided.
[Figure: ShanghaiTech image with its annotated density map]
9. Density map estimation method
• Sparse density maps are converted to dense maps using fixed or variable Gaussian kernels.
• This helps train the model better than directly using a sparse dot map.
Sigma value computation methods
• Fixed standard deviation (sigma) of the Gaussian kernel.
• The sigma value is set equal to the distance to the nearest neighbor.
• The sigma value is computed as the average of the distances to the three nearest neighbors, divided by 10.
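The fixed-sigma conversion can be sketched in NumPy as follows (a minimal illustration: the function name, image size, and sigma value are made up for the example; the adaptive variants above would replace the fixed sigma with a per-point value):

```python
import numpy as np

def dot_map_to_density_map(points, shape, sigma=4.0):
    """Convert dot annotations (one dot per head) into a dense density map by
    placing a normalized Gaussian kernel at each annotated point. Because each
    kernel sums to 1, the density map integrates to the person count."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float64)
    # Precompute one truncated Gaussian kernel (radius = 3 * sigma).
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize: each person contributes exactly 1
    for (px, py) in points:
        px, py = int(px), int(py)
        # Clip the kernel window at the image borders.
        y0, y1 = max(0, py - r), min(h, py + r + 1)
        x0, x1 = max(0, px - r), min(w, px + r + 1)
        ky0 = y0 - (py - r)
        ky1 = (2 * r + 1) - ((py + r + 1) - y1)
        kx0 = x0 - (px - r)
        kx1 = (2 * r + 1) - ((px + r + 1) - x1)
        density[y0:y1, x0:x1] += kernel[ky0:ky1, kx0:kx1]
    return density
```

Summing the resulting map recovers the annotated count, which is what makes density maps usable as a counting target.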
13. Loss functions
• The predicted density map is a smooth heat map, while the available ground truth is a dot map.
• The choice of loss function depends on how the ground-truth and predicted density maps are used.
• A well-chosen loss function improves performance by extracting proper supervisory information from the ground truth.
L2 loss
• Pixel-wise loss: the network adjusts each pixel value according to the L2 loss, comparing it against the corresponding pixel in the ground truth.
• The ground-truth dot map is converted to a smooth heat map using a Gaussian kernel.
• The L2 loss is sensitive to the choice of variance of the Gaussian kernel.
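The pixel-wise objective itself is a one-liner (a minimal sketch assuming both maps are same-size arrays; in training this would be evaluated on the network output against the Gaussian-smoothed ground truth):

```python
import numpy as np

def l2_loss(pred_density, gt_density):
    """Pixel-wise L2 (mean squared error) loss between the predicted density
    map and the Gaussian-smoothed ground-truth density map."""
    assert pred_density.shape == gt_density.shape
    return float(np.mean((pred_density - gt_density) ** 2))

# A perfect prediction gives zero loss; any pixel-wise deviation is penalized.
gt = np.zeros((8, 8))
gt[4, 4] = 1.0
```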
14. Loss functions
Optimal transport loss
• Treats the predicted density map and the dot map as probability distributions and uses balanced OT to match the shapes of the two distributions.
• The Sinkhorn algorithm is used to obtain the optimal transport matrix.
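The Sinkhorn iterations can be sketched in a few lines of NumPy (a toy illustration of entropy-regularized OT, not the exact balanced-OT formulation of any particular paper; the grid, cost matrix, and regularization strength eps are made up for the example):

```python
import numpy as np

def sinkhorn(a, b, C, eps=1.0, n_iter=200):
    """Sinkhorn iterations for entropy-regularized (balanced) optimal transport.
    a, b : source / target probability vectors (each sums to 1)
    C    : cost matrix, C[i, j] = cost of moving unit mass from bin i to bin j
    Returns the transport plan P whose row sums match a and column sums match b."""
    K = np.exp(-C / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):         # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: two small distributions on a 1-D grid with squared-distance cost.
x = np.arange(3, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
a = np.array([0.2, 0.5, 0.3])
b = np.array([0.3, 0.3, 0.4])
P = sinkhorn(a, b, C, eps=5.0)      # large eps: blurred but fast-converging plan
```

Smaller eps gives a sharper (closer to exact OT) plan at the cost of more iterations; in crowd counting the two distributions would be the flattened predicted density map and the ground-truth dot map.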
15. Dataset
ShanghaiTech
• The ShanghaiTech dataset contains two parts: Part A and Part B.
• Part A contains 482 images (300 for training, 182 for testing) and includes high-density crowds collected from the Internet.
• Part B contains 716 images (400 for training, 316 for testing) and is captured from busy streets in urban areas of Shanghai.
• The average resolution of images in ShanghaiTech Part A is 589×868 pixels.
• The scenes in Part B are less crowded than those in Part A.
18. Dataset
JHU-CROWD
• Contains 4,372 images with an average resolution of 1430×910 pixels.
• The dataset was collected from various geographical locations and under diverse conditions.
• Contains a total of 1.51 million dot annotations, with an average of 346 dots per image and a maximum of 25K dots.
• Includes images captured under adverse weather and various illumination conditions, ensuring improved diversity.
• Provides head-level labels such as dots, approximate bounding boxes, and blur level, as well as image-level labels such as scene type and weather condition.
21. Evaluation metrics
Mean Absolute Error (MAE)
• Determines the accuracy of the estimates.

    MAE = (1/N) · Σ_{i=1..N} |C_i^pred − C_i^gt|

Mean Square Error (MSE)
• Indicates the robustness of the estimates.

    MSE = sqrt( (1/N) · Σ_{i=1..N} (C_i^pred − C_i^gt)² )

where N = number of test images, C_i^pred = predicted count for image i, and C_i^gt = ground-truth count for image i.
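Both metrics are computed directly from per-image counts (a small sketch; MSE is taken here as the root of the mean squared error, as is standard in crowd-counting papers):

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean Absolute Error over N test images (accuracy of the estimates)."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred_counts, gt_counts):
    """(Root) Mean Square Error over N test images (robustness: large errors
    on a few images are penalized more heavily than by MAE)."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# Example: predicted vs. ground-truth counts on three test images.
print(mae([100, 210, 95], [110, 200, 95]))  # (10 + 10 + 0) / 3 ≈ 6.67
```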
22. Frequency domain approach
• Fourier transforms are used to obtain the frequency responses of the predicted density map and the ground-truth dot map.
• The dispersed spatial information in the predicted and ground-truth density maps is converted to compact information in the frequency domain.
24. Frequency domain approach
Using the CMTL model

[Pipeline: the input image is passed through the Cascaded-MTL model to produce a predicted density map. The DFT of the predicted density map is compared with the DFT of the ground-truth density map to give the frequency loss Lf, while the spatial loss L is computed directly between the two density maps.]

Loss = L + β·Lf
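A minimal sketch of the combined objective (the exact frequency loss used in the referenced work may differ; here Lf is taken as the mean squared magnitude of the DFT difference, and beta is the weighting in Loss = L + β·Lf):

```python
import numpy as np

def frequency_loss(pred_dm, gt_dm):
    """Frequency-domain loss Lf: compare the 2-D DFTs of the predicted and
    ground-truth density maps instead of their spatial values."""
    F_pred = np.fft.fft2(pred_dm)
    F_gt = np.fft.fft2(gt_dm)
    return float(np.mean(np.abs(F_pred - F_gt) ** 2))

def total_loss(pred_dm, gt_dm, beta=0.001):
    """Combined objective: spatial L2 loss plus a beta-weighted frequency loss."""
    l_spatial = float(np.mean((pred_dm - gt_dm) ** 2))
    return l_spatial + beta * frequency_loss(pred_dm, gt_dm)
```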
25. Frequency domain approach
Results (ShanghaiTech Part A)

Before:
MAE 86.7, MSE 132.0

After (varying the frequency-loss weight β):
β      | MAE   | MSE
0      | 85.78 | 130.92
0.0001 | 86.02 | 132.73
0.001  | 85.00 | 133.07
0.1    | 86.81 | 131.90
34. Multi-scale model approach
Dilated convolution
• Dilated convolution uses sparse kernels.
• This enlarges the receptive field without increasing the number of parameters or the amount of computation.
Reference: [12]
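The effect is easy to see in one dimension: a 3-tap kernel with dilation 2 covers a 5-sample receptive field using the same 3 weights (a minimal NumPy sketch; models such as CSRNet apply 2-D dilated convolutions inside the network):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """1-D dilated convolution (valid padding): kernel taps are spaced
    `dilation` samples apart, enlarging the receptive field from k to
    (k - 1) * dilation + 1 without adding parameters or computation."""
    k = len(kernel)
    span = (k - 1) * dilation + 1        # effective receptive field
    out_len = len(x) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, k, dilation=1))  # receptive field 3
print(dilated_conv1d(x, k, dilation=2))  # receptive field 5, same 3 weights
```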
35. Multi-scale model approach
Results (ShanghaiTech Part A)

Before:
MAE   | MSE    | Epoch | Crop size
65.60 | 103.13 | 435   | 256

After:
MAE   | MSE    | Epoch | Crop size | Frozen layers
64.58 | 97.80  | 390   | 256       | 9
67.13 | 96.67  | 270   | 256       | 9
67.78 | 100.06 | 125   | 256       | 9
41. References
[1] J. Wan, Z. Liu, and A. B. Chan, "A Generalized Loss Function for Crowd Counting and Localization," in CVPR, 2021, pp. 1974-1983, doi: 10.1109/CVPR46437.2021.00201.
[2] Z. Ma, X. Wei, X. Hong, and Y. Gong, "Bayesian loss for crowd count estimation with point supervision," in ICCV, 2019, pp. 6142-6151.
[3] B. Wang, H. Liu, D. Samaras, and M. H. Nguyen, "Distribution matching for crowd counting," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1595-1607.
[4] V. Sindagi and V. Patel, "CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting," in AVSS, Lecce, Italy, 2017, pp. 1-6, doi: 10.1109/AVSS.2017.8078491.
[5] H. Idrees et al., "Composition loss for counting, density map estimation and localization in dense crowds," in ECCV, 2018.
[6] L. Jiang, B. Dai, W. Wu, and C. C. Loy, "Focal frequency loss for image reconstruction and synthesis," in ICCV, 2021, pp. 13919-13929.
[7] W. Shu, J. Wan, K. C. Tan, S. Kwong, and A. B. Chan, "Crowd Counting in the Frequency Domain," in CVPR, 2022, pp. 19586-19595, doi: 10.1109/CVPR52688.2022.01900.
42. References
[8] J. Chen, K. Wang, W. Su, and Z. Wang, "SSR-HEF: Crowd counting with multi-scale semantic refining and hard example focusing."
[9] F. Liang, Y. Li, and D. Marculescu, "SupMAE: Supervised masked autoencoders are efficient vision learners," arXiv preprint arXiv:2205.14540, 2022.
[10] M. A. Khan, H. Menouar, and R. Hamila, "Revisiting crowd counting: State-of-the-art, trends, and future perspectives," Image and Vision Computing, 104597, 2022.
[11] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in CVPR, 2016, pp. 589-597.
[12] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in CVPR, 2018, pp. 1091-1100.