1. IEEE International Conference on Interdisciplinary Approaches in Technology and
Management for Social Innovation (Hybrid)
December 21 – 23, 2022, Gwalior, India
Enabling the Change! Social Innovation
for sustainable societies
Paper Title: Human Pose Estimation: Benchmarking Deep
Learning-based Methods
Authors and Affiliation
Mayank Lovanshi and Vivek Tiwari, IIIT-Naya Raipur
Paper ID: 8399
Track No. : 3
Presented by
Mayank Lovanshi, IIIT-Naya Raipur
Content
• Introduction
• Related Work
• Methodology: Human Pose Estimation Models
• Dataset Used
• Experiment & Results
• Conclusion
• References
INTRODUCTION
• Human Pose Estimation (HPE): identifying and classifying the joints of the human body [1,2].
• HPE captures a set of coordinates for each joint (arm, head, torso, etc.), known as key points [2,3].
• The connection between two key points is known as a pair [1,2,3].
• The angles between body joints can then be extracted from these key points [2,3].
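As a minimal sketch of the angle extraction mentioned in the last bullet, the snippet below computes the angle at a middle joint from three 2D key points. The function name `joint_angle` and the sample coordinates are illustrative assumptions, not taken from the paper.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at joint b, formed by the pairs (a, b) and (b, c)."""
    # Vectors pointing from the middle joint to its two neighbours.
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("key points must not coincide with the middle joint")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Hypothetical shoulder, elbow, and wrist key points (x, y) in pixels.
elbow_angle = joint_angle((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
```

With these sample points the two limbs are perpendicular, so the elbow angle is a right angle; a fully extended arm would give an angle close to 180 degrees.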
Fig.1: Sample of Pose Estimation
Source:https://www.quickerhire.com/blogs/human-pose-estimation-for-multiple-subjects-with-machine-learning
Cont…
Fig.2: Human body modeling: a) Skeleton based, b) Contour based,
c) Volume-based
Source:https://shop62004.afacetoreframe.org/content?c=body%20pose%20estimation&id=1
Three types of human body models are used in HPE:
The skeleton-based model consists of a set of key points (joints) such as ankles, knees, shoulders, and elbows [1].
The contour-based model consists of the contour and rough width of the body, torso, and limbs [1].
The volume-based model represents 3D human bodies and poses with geometric meshes and shapes [1].
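A skeleton-based model can be sketched as a set of named key points plus the pairs that connect them. The joint names, coordinates, and the helper `limb_lengths` below are assumptions made for illustration only.

```python
import math

# Hypothetical 2D key points (x, y) in pixels for one arm.
keypoints = {
    "shoulder": (100.0, 50.0),
    "elbow": (100.0, 90.0),
    "wrist": (130.0, 130.0),
}

# Each pair connects two key points and corresponds to one limb.
pairs = [("shoulder", "elbow"), ("elbow", "wrist")]


def limb_lengths(keypoints, pairs):
    """Return the Euclidean length of every pair in the skeleton."""
    return {
        (start, end): math.dist(keypoints[start], keypoints[end])
        for start, end in pairs
    }


lengths = limb_lengths(keypoints, pairs)
```

Storing the pairs separately from the coordinates keeps the skeleton topology fixed while the key point positions change from frame to frame.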
Cont…
2D Pose Estimation: estimates the 2D spatial location of the human body's key points from visual input such as images and video [2,3].
3D Pose Estimation: locates the human joints in 3D space [2,3].
Fig.3: 2D vs 3D Pose Estimation sample
Related Work
| S.N. | Paper Title | Problem Statement | Method | Limitation |
|---|---|---|---|---|
| 1 | DeepPose: Human Pose Estimation via Deep Neural Networks, A. Toshev et al. (2014) [8] | Extract 2D/3D key point information using a deep learning-based HPE algorithm | PosePipe: an open-source deep learning model used to extract 2D/3D key points | Hard to apply to video-based datasets |
| 2 | Human Pose Estimation via Convolutional Part Heatmap Regression, A. Bulat et al. (2016) [10] | Extract human body joints using a deep learning-based approach | A Convolutional Neural Network (CNN) based approach used to identify the human pose | The CNN-based approach does not work for part-based posture identification |
| 3 | Combining local appearance and holistic view: Dual-Source Deep Neural Networks for human pose estimation, Xiaochuan Fan et al. (2018) [12] | Extract local part pose information to enhance human posture evaluation | Dual-Source Deep Convolutional Neural Network (DS-CNN) used for posture evaluation | Pose estimation results are not very accurate |
Related Work
| S.N. | Paper Title | Problem Statement | Method | Limitation |
|---|---|---|---|---|
| 4 | Deep learning based 2D human pose estimation: A survey, Q. Dang et al. (2019) [2] | Identify 2D/3D human pose using a kinematic model | Two estimation approaches are used: model-free and model-based | Does not work on RGB-D images |
| 5 | End-to-end recovery of human shape and pose, A. Kanazawa et al. (2018) [11] | Identify the human posture from joint angles and key point information | Human Mesh Recovery (HMR): an end-to-end system for generating a complete 3D/2D mesh | The 3D mesh cannot be extracted from depth RGB images or video |
| 6 | Hand Pose Estimation from RGB Images Based on Deep Learning: A Survey, Y. Liu et al. (2021) [1] | Identify the 2D human pose | DeepPose: a cascaded deep learning-based regressor | Does not extract the 3D human pose |
HUMAN POSE ESTIMATION MODELS
1. OpenPose [4]
2. ViTPose [13]
3. HRNet [6]
4. AlphaPose [5]
5. DenseNet [14]
6. EfficientPose [15,16]
7. DensePose [17]
8. Hourglass [18]
Fig. 4: Architecture of our proposed work
1. OpenPose:
• OpenPose is based on the VGG-19
convolutional neural network.
• It comprises four parts: input, part confidence
map, bipartite matching, & output image.
2. ViTPose:
• Uses non-hierarchical (plain) vision transformers as the backbone.
• The decoder consists of two deconvolution layers and one prediction layer.
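To illustrate how a part confidence map turns into key point coordinates, the sketch below takes the highest-scoring location of each map. This is a single-person simplification written for illustration: it does not reproduce OpenPose's bipartite matching step for multiple people, and the function name and toy maps are assumptions.

```python
def keypoints_from_confidence_maps(maps):
    """Return one (x, y, score) per body part from a list of 2D confidence maps.

    maps: list of 2D lists (rows of scores), one map per body part.
    """
    keypoints = []
    for part_map in maps:
        best = (0, 0, part_map[0][0])
        for y, row in enumerate(part_map):
            for x, score in enumerate(row):
                # Keep the location with the highest confidence seen so far.
                if score > best[2]:
                    best = (x, y, score)
        keypoints.append(best)
    return keypoints


# Two toy 3x4 confidence maps, each with a single clear peak.
toy_maps = [
    [[0.0, 0.1, 0.0, 0.0], [0.0, 0.0, 0.9, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.7, 0.0, 0.2, 0.0]],
]
decoded = keypoints_from_confidence_maps(toy_maps)
```

In a multi-person setting each map can contain several peaks, which is why a matching step is needed to assign the detected parts to individual people.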
Fig. 5: Image extraction through the OpenPose method
Fig. 6: Framework of the ViTPose method
3. HRNet:
• The backbone model is a convolutional neural network.
• Also used for semantic segmentation, object recognition, and image categorisation.
4. AlphaPose:
• Uses a Symmetric Spatial Transformer Network (SSTN).
• Uses a Single-Person Pose Estimator (SPPE).
Fig. 8: Image extraction through the AlphaPose method
Fig. 7: Framework of HRNet method
5. DenseNet:
• The backbone model is ResNet (CNN-based).
• Mitigates the vanishing gradient problem through dense connections, in which each layer receives the feature maps of all preceding layers.
6. EfficientPose:
• The backbone model is a Convolutional
neural network.
• It comprises two main parts: an efficient backbone and an efficient head.
Fig. 9: Layered structure of DenseNet method
Fig. 10: Architecture of the EfficientPose Methods
7. DensePose:
• Builds on the fully convolutional network design used in Dense Regression (DenseReg).
• Combines the DenseReg method with Mask R-CNN to improve pose estimation.
8. Hourglass:
• Based on tightly linked fully convolutional networks.
• The hourglass module is related to conv-deconv and encoder-decoder architectures.
Fig. 11: Architecture of the DensePose Methods
Fig. 12: Architecture of the Hourglass Methods
DATASET USED
A. COCO: [20]
• Images: 66,808
• Annotations: 273,469
• Key points: 17 key points
B. MPII: [12]
• Images: 40,000
• Annotations: 223,589
• Classes: 410 classes
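The 17 COCO key points can be made concrete with a small parsing sketch. It assumes the standard COCO annotation layout, in which the key points of one person are stored as a flat list of (x, y, v) triplets and v = 0 marks a key point that is not labeled; the helper names are hypothetical.

```python
# The 17 COCO key point names, in annotation order.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def parse_coco_keypoints(flat):
    """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into a name -> (x, y, v) dict."""
    if len(flat) != 3 * len(COCO_KEYPOINT_NAMES):
        raise ValueError("expected 17 (x, y, v) triplets")
    return {
        name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i, name in enumerate(COCO_KEYPOINT_NAMES)
    }


def labeled_keypoints(parsed):
    """Keep only key points whose visibility flag is non-zero."""
    return {name: xyv for name, xyv in parsed.items() if xyv[2] > 0}
```

Filtering on the visibility flag matters for evaluation, since unlabeled key points should not count as errors.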
Fig.13: COCO Dataset: sample images [20]
Fig.14: MPII Dataset: sample images [12]
RESULTS
The results are evaluated using the following metrics:
1. Average Precision (AP): the weighted mean of the precisions at each threshold, where the weight is the increase in recall from the prior threshold [7,8].
$$AP@\alpha = \int_{0}^{1} p(r)\,dr$$
2. Mean Average Precision (mAP): the mean of the average precision values over different IoUs [9].
$$mAP@\alpha = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
3. Percentage of Correct Key points (PCK): the percentage of predicted key points that lie within a given distance of the actual joint [10,11].
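The AP and mAP definitions above can be sketched numerically as follows. The integral is approximated by the weighted sum described in the text (precision times the increase in recall); the function names and the sample precision/recall values are illustrative assumptions.

```python
def average_precision(precisions, recalls):
    """Weighted mean of precisions, weighted by the increase in recall.

    precisions and recalls are parallel lists ordered by increasing recall.
    """
    ap = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(ap_values):
    """Mean of n AP values, matching mAP = (1/n) * sum(AP_i)."""
    return sum(ap_values) / len(ap_values)


# Hypothetical precision/recall points for one model.
ap = average_precision([1.0, 0.5], [0.5, 1.0])
m_ap = mean_average_precision([ap, 0.25])
```

Here the first threshold contributes 1.0 × 0.5 and the second 0.5 × 0.5, so the sum approximates the area under the precision-recall curve.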
RESULTS
TABLE I: BENCHMARKING WITH SOTA POSE ESTIMATION NETWORKS ON COCO DATASET BASED ON AP & mAP

| Algorithm | AP | AP0.5 | AP0.75 | APM | APL | mAP |
|---|---|---|---|---|---|---|
| OpenPose [4] | 60.5 | 83.4 | 66.4 | 55.1 | 68.1 | 65.9 |
| AlphaPose [5] | 73.3 | 89.2 | 79.1 | 69.0 | 78.6 | 77.84 |
| HRNet [6] | 77.4 | 92.6 | 84 | 73.6 | 83.7 | 82.3 |
| ViTPose-B [13] | 81.1 | 95.0 | 88.2 | 87.8 | 86.0 | 85.6 |
| DenseNet [14] | 77.1 | 93.3 | 83.6 | 72.2 | 83.6 | 82.6 |
| EfficientPose [15,16] | 70.5 | 91.1 | 79.0 | 67.3 | 76.2 | 76.1 |
| DensePose [17] | 55.8 | 83.7 | 56.3 | 42.2 | 53.8 | 61.1 |
| Hourglass [18] | 65.6 | 88.8 | 69.3 | - | - | 74.5 |
| 4×RSN-50 [19] | 78.6 | 94.6 | 86.6 | 83.3 | 75.5 | 83.8 |
RESULTS
TABLE II: BENCHMARKING WITH SOTA POSE ESTIMATION NETWORKS ON MPII DATASET BASED ON PCK OF BODY PART & AVERAGE PCK

| Algorithm | Ankle | Knee | Hip | Wrist | Elbow | Shoulder | Head | Avg. PCK |
|---|---|---|---|---|---|---|---|---|
| OpenPose [4] | 79.87 | 87.17 | 93.0 | 79.15 | 89.03 | 95.97 | 96.11 | 88.73 |
| AlphaPose [5] | 72.4 | 79.9 | 80.3 | 76.4 | 84.0 | 90.5 | 91.3 | 82.1 |
| HRNet [6] | 82.5 | 86.1 | 89.1 | 85.9 | 90.5 | 85.9 | 96.9 | 90.0 |
| ViTPose-B [13] | 88.3 | 91.9 | 92.4 | 90.1 | 93.7 | 97.4 | 97.6 | 93.4 |
| EfficientPose [15,16] | 83.9 | 87.5 | 90.3 | 87.5 | 91.7 | 96.0 | 98.2 | 91.2 |
| Hourglass [18] | 89.3 | 92.2 | 93.2 | 91.2 | 94.4 | 97.5 | 98.8 | 94.1 |
| 4×RSN-50 [19] | 86.8 | 90.6 | 92.0 | 89.9 | 93.9 | 97.3 | 98.5 | 93.0 |
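As a rough sketch of how a PCK value like those in Table II is obtained, the snippet below counts the predicted key points that fall within a distance threshold of the ground truth. The threshold is left as a plain parameter because the slides do not state how it is normalised; the function name and sample points are assumptions.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Percentage of predicted key points within `threshold` of the true joint."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        1
        for pred, true in zip(predicted, ground_truth)
        if math.dist(pred, true) <= threshold
    )
    return 100.0 * correct / len(ground_truth)


# Hypothetical predictions and ground-truth positions for one joint type.
predicted_points = [(0, 0), (10, 10), (5, 5), (0, 3)]
true_points = [(0, 1), (10, 20), (5, 5), (0, 0)]
score = pck(predicted_points, true_points, threshold=3.0)
```

Three of the four sample predictions lie within the threshold, so the score is 75 percent; running the same computation per body part gives one column of a table such as Table II.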
CONCLUSION
• This study compares human pose estimation models that yield accurate and spatially precise key point heat maps, using average precision and percentage of correct key points.
• Experimental analysis was done over two datasets: COCO and MPII.
• ViTPose-B outperformed the other models on every AP variant on the COCO dataset, which is attributed to its use of a transformer instead of a convolutional backbone.
• OpenPose underperformed on the COCO dataset (AP 60.5); only DensePose scored lower (AP 55.8).
• The Hourglass model achieved the highest average PCK on the MPII dataset and the highest PCK for every body part.
• AlphaPose had the lowest average PCK on the MPII dataset (82.1).
WORK DONE (based on the discussed work)
Task 1: Human Skeleton Pose and Spatio-Temporal Feature-based Activity
Recognition using ST-GCN
• Human activity recognition using pose estimation algorithm.
• Normalise human activity sequence with the Gaussian filter method.
• Investigate ST-GCN model for extraction of Spatial & Temporal features.
Task 2: 3D Skeleton-based Human Motion Prediction using Dynamic Multi-scale
Spatiotemporal Graph Recurrent Neural Networks
• Human motion prediction using graph recurrent neural network.
• Investigate a novel DMST-GRNN model on the multi-scale variation for the extraction of spatial &
temporal features.
• Validate human motion based on time series-based 3D sequential datasets.
REFERENCES
[1] Y. Liu, J. Jiang, and J. Sun, “Hand Pose Estimation from RGB Images Based on Deep Learning: A Survey.” 2021 IEEE 7th International Conference on
Virtual Reality (ICVR), 2021.
[2] Q. Dang, J. Yin, B. Wang, and W. Zheng, “Deep learning based 2D human pose estimation: A survey.” Tsinghua Science and Technology, vol. 24, no. 6,
pp. 663-676, 2019.
[3] M. Choudhary, V. Tiwari, and S. Jain, “Person reidentification using deep siamese network with multi-layer similarity constraints.” Multimedia Tools and Applications, pp. 1-17, 2021.
[4] D. Osokin, “Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose.” Proceedings of the 8th International Conference on Pattern
Recognition Applications and Methods, 2019.
[5] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional Multiperson Pose Estimation.” 2017 IEEE International Conference on Computer Vision (ICCV),
2017.
[6] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep High-Resolution Representation Learning for Human Pose Estimation.” 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2019.
[7] W. Li, R. Du, and S. Chen, “Skeleton-Based Spatio-Temporal UNetwork for 3D Human Pose Estimation in Video.” Sensors, vol. 22, no. 7, p. 2573, 2022.
[8] A. Toshev and C. Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks.” 2014 IEEE Conference on Computer Vision and Pattern
Recognition, 2014.
[9] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Real-time Multi-person 2D Pose Estimation Using Part Affinity Fields.” 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[10] A. Bulat and G. Tzimiropoulos, “Human Pose Estimation via Convolutional Part Heatmap Regression.” Computer Vision – ECCV 2016, pp. 717-732,
2016.
[11] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose,” 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018.
[12] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2D Human Pose Estimation: New Benchmark and State of the Art Analysis.” 2014 IEEE
Conference on Computer Vision and Pattern Recognition, 2014.
[13] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.” Computer Vision – ECCV 2022,
2022.
[14] S. W. Chu, Y. Song, J. J. Zouo, and W. Cai, “Human Pose Estimation Using Deep Convolutional Densenet Hourglass Network with Intermediate
Points Voting.” 2019 IEEE International Conference on Image Processing (ICIP), 2019.
[15] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark.” 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[16] D. Groos, H. Ramampiaro, and E. A. Ihlen, “EfficientPose: Scalable single-person pose estimation.” Applied Intelligence, vol. 51, no. 4, pp. 2518-
2533, 2020.
[17] R. A. Guler, N. Neverova, and I. Kokkinos, “DensePose: Dense Human Pose Estimation in the Wild.” 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018.
[18] T. Xu and W. Takano, “Graph Stacked Hourglass Networks for 3D Human Pose Estimation.” 2021 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2021.
[19] Y. Cai, “Learning Delicate Local Representations for Multi-person Pose Estimation.” Computer Vision – ECCV 2020.
[20] T.-Y. Lin, “Microsoft COCO: Common Objects in Context.” Computer Vision – ECCV 2014, pp. 740-755, 2014.