BIRLA INSTITUTE OF TECHNOLOGY &
SCIENCE, PILANI
FIRST SEMESTER 2022-23
DSECLZG628T DISSERTATION REPORT
Dissertation Title: HUMAN POSE SKELETON-BASED ESTIMATION MODEL
Name of Evaluator : Dr. SUNIL BHUTADA
Name of Supervisor : Mr. MRUTUNJAYYA
Name of Student : Mr. SHIVAPRASAD PATIL B
ID No. of Student : 2020SC04362
Email id: shivpatilb@gmail.com
Phone: 9538052338
DEMO FOR REAL-TIME HUMAN POSE ESTIMATION
CONTENTS
1. ABSTRACT
2. INTRODUCTION
3. LITERATURE REVIEW AND PROBLEM DOMAIN
4. SOLUTION ARCHITECTURE AND DESIGN
5. EXPERIMENT AND EVALUATION RESULTS
6. KEY ISSUES AND OBSTACLES
7. CONCLUSION AND FUTURE SCOPE
8. BIBLIOGRAPHY
ACKNOWLEDGEMENTS
I would like to express my profound gratitude to my advisor and supervisor for this dissertation, Mr. Mrutunjayya. As my
advisor, he provided guidance and support throughout the dissertation while allowing me to explore on my own.
I would like to express my special thanks to my mentor Dr. Sunil Bhutada for his time and valuable feedback throughout
the dissertation. His advice and suggestions helped me learn a great deal and aided in the completion of this project.
I also acknowledge the effort of the authors of the MPII and COCO human pose datasets. These datasets make 2D
human pose estimation in the wild possible.
1. ABSTRACT
Human pose estimation is a fundamental problem in computer vision, with applications in fields such as animation, gaming,
virtual reality, and human-computer interaction. The aim is to determine the pose of a human body in a digital image or
video, which includes the position and orientation of each body part. Our model uses a skeleton representation of the human
body, where each body part is represented by a joint or node in the skeleton. The pose is estimated by determining the
position and orientation of each joint, based on the visual information in the image or video.
This project presents a human pose skeleton-based estimation model that is effective and efficient for estimating the
pose of a human body in a digital image or video. It presents an approach to efficiently detect the 2D pose of multiple
people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs),
to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a
greedy bottom-up parsing step that maintains high accuracy while achieving real-time performance, irrespective of the
number of people in the image.
2. INTRODUCTION
• Overview: Enabling machines to comprehend images and videos containing people requires real-time multi-person 2D pose
estimation. In previous studies, PAFs and body part location estimation were refined simultaneously across training stages. We
demonstrate that refining PAFs alone, rather than refining both PAFs and body part location, results in a significant increase in
both runtime performance and accuracy.
• Existing System: The common method involves utilizing a person detector to detect individuals, followed by conducting single
person pose estimation for each detection. Although these top-down approaches make use of established techniques for single-
person pose estimation, they suffer from a drawback known as early commitment. In situations where people are close together,
the person detector may fail, leaving no options for recovery. Additionally, the computational cost of these top-down approaches
increases proportionally with the number of people, as a single-person pose estimator must be executed for each detection. In
contrast, bottom-up approaches are appealing because they can offer resilience to early commitment and have the potential to
separate runtime complexity from the number of people present in the image.
• Proposed System: Despite their potential advantages, previous bottom-up methods have not been able to maintain efficiency in
practice due to the need for expensive global inference in the final parse. We use a bottom-up approach that represents
association scores using Part Affinity Fields (PAFs), which are 2D vector fields that encode the location and orientation of limbs
over the image domain. Our method demonstrates that inferring these bottom-up representations of detection and association can
encode global context well enough to allow a greedy parse to achieve high-quality results at a significantly reduced
computational cost.
• Disadvantage: Our approach has some limitations that need to be considered. Firstly, while Part
Affinity Fields (PAFs) can efficiently detect the 2D pose of multiple people in an image, they may not
perform well in certain scenarios, such as when there are occlusions or complex poses. Additionally, the
approach relies on a nonparametric representation, which may not be suitable for all applications.
• Advantage: Our approach addresses the limitations of existing association methods by introducing Part
Affinity Fields (PAFs), which encode the location and orientation of limbs in a 2D vector field. This
allows for more accurate association of body parts with individuals in the image, compared to methods
that rely solely on detecting midpoints between parts. Furthermore, our approach achieves high-quality
results with a fraction of the computational cost of existing methods, making it a more efficient option
for real-time multi-person 2D pose detection.
3. LITERATURE REVIEW AND PROBLEM DOMAIN
• Single Person Pose The traditional approach to articulated human pose estimation involves using a combination of local observations on body parts and
the spatial dependencies between them. Spatial models for articulated pose can be based on tree-structured graphical models or non-tree models.
Convolutional Neural Networks (CNNs) have been widely used to obtain reliable local observations of body parts and have significantly boosted the
accuracy of body pose estimation.
• Multi-Person Pose Estimation Multi-person pose estimation is the task of estimating the poses of multiple people in an image. The traditional
approach is to use a top-down strategy, where people are first detected and then the pose of each person is estimated independently. However, this approach
does not capture the spatial dependencies across different people, and it suffers from early commitment on person detection. Some recent approaches have
started to consider inter-person dependencies by using bottom-up approaches that jointly label part detection candidates and associate them to individual
people. The pairwise representations used in these approaches are difficult to regress precisely and require a separate logistic regression to convert the
pairwise features into a probability score.
• Human pose estimation via convolutional part heatmap regression This paper presents a method for estimating human pose using a convolutional
neural network (CNN) architecture. The authors propose a cascaded architecture that specifically aims to learn part relationships and spatial context. The
architecture is designed to be robust to severe part occlusions. The method uses a two-part cascade where the first part detects body parts and the second
part performs regression on these detections.
• Articulated pose estimation by a graphical model with image dependent pairwise relations This paper presents a method for estimating the 2D pose
of a human from a single static image using a graphical model with novel pairwise relations that adaptively use local image measurements. The method
combines the representational flexibility of graphical models with the efficiency and statistical power of deep convolutional neural networks (DCNNs) to
learn the conditional probabilities of the presence of parts and their spatial relationships within image patches. The method is demonstrated to significantly
outperform state-of-the-art methods on the LSP and FLIC datasets and also performs well on the Buffy dataset without any additional training.
APPROACH
• The essence of human pose estimation lies in detecting points of interest on the limbs, joints, and even
face of a human. These key points are used to produce a 2D representation of a human body model.
• These models are essentially a map of the body joints we track during movement. This allows a
computer not only to distinguish between a person sitting and squatting, but also to calculate the angle
of flexion at a specific joint and tell whether the movement is performed correctly (see the sketch after this list).
• There are three common types of human models: skeleton-based, contour-based, and volume-based.
The skeleton-based model is the most widely used in human pose estimation because of its flexibility:
it consists of a set of joints such as ankles, knees, shoulders, elbows, and wrists, together with the
limb orientations comprising the skeletal structure of a human body.
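As a concrete illustration, the flexion angle at a joint can be computed from three tracked keypoints with basic vector arithmetic. The following minimal Python sketch (the coordinates are hypothetical, not from any dataset) measures the angle at the middle keypoint:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle of flexion (degrees) at joint b, formed by keypoints a-b-c.

    a, b, c are 2D keypoint coordinates, e.g. hip, knee, ankle.
    """
    ba = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bc = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Example: knee flexion from hypothetical hip, knee, ankle keypoints.
print(joint_angle((310, 200), (315, 330), (312, 455)))  # near 180 degrees => leg nearly straight
```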
Fig. 1: Basic structure of human pose estimation
The system takes a color image as input and outputs the 2D locations of keypoints for each person in the image. It uses a
feedforward network to predict confidence maps and vector fields, which are then processed by greedy inference to identify
the keypoints. The confidence maps are composed of J maps, one for each body part, and the vector fields consist of C fields
that encode the connections between body parts. Part Affinity Fields (PAFs) are used to determine the association between
body part detections and form full-body poses for an unknown number of people. PAFs preserve both location and orientation
information and indicate the direction from one part of a limb to another. Each PAF is a 2D vector field for a specific limb
type that joins two associated body parts. This approach overcomes the limitations of the midpoint representation, which only
encodes position and restricts the region of support to a single point.
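To make the output structure concrete, the following PyTorch sketch shows the tensor shapes involved. The values of J and C, the feature dimensions, and the 1x1 prediction heads are illustrative assumptions, not the exact model:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: J body-part confidence maps and C limb types,
# each PAF being a 2-channel (x, y) vector field.
J, C = 19, 19                                      # assumed counts, for illustration only
feat = torch.randn(1, 128, 46, 46)                 # backbone image features F
paf_head = nn.Conv2d(128, 2 * C, kernel_size=1)    # vector fields: 2 channels per limb
conf_head = nn.Conv2d(128, J, kernel_size=1)       # one confidence map per body part

pafs = paf_head(feat)        # (1, 2C, H', W') limb location/orientation fields
conf_maps = conf_head(feat)  # (1, J, H', W') part location likelihoods
print(pafs.shape, conf_maps.shape)
```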
4. SOLUTION ARCHITECTURE AND DESIGN
Fig. 2: Architecture of the multi-stage CNN for multi-person pose estimation
The architecture of the multi-stage CNN for multi-person pose estimation is designed to predict two types of outputs:
Part Affinity Fields (PAFs) and confidence maps. PAFs are used to estimate the spatial relationships between body
parts, while the confidence maps estimate the likelihood of each body part being present in a given location. The
CNN is composed of multiple stages, with each stage taking in the predictions and image features from the previous
stage as input.
The convolutions of kernel size 7 in the original approach have been replaced with 3 layers of convolutions of kernel
size 3, which are concatenated at the end of each stage. This allows the network to capture both local and global
spatial dependencies between body parts, while also preserving multimodal uncertainty from previous stages.
The architecture is a deep neural network that predicts affinity fields and detection confidence maps for part
detection and part-to-part association. The network uses an iterative prediction approach, with successive refinement
of predictions over multiple stages. The network depth is increased compared to the original approach, using
multiple 3x3 convolutional kernels instead of 7x7 kernels to preserve the receptive field while reducing computation.
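A minimal PyTorch sketch of this idea is shown below; the channel counts and the PReLU activation are assumptions for illustration, not the exact network definition. Three stacked 3x3 convolutions cover the same 7x7 receptive field with less computation, and concatenating the intermediate outputs preserves features from each level for the next stage:

```python
import torch
import torch.nn as nn

class ConvTriple(nn.Module):
    """Three 3x3 convolutions whose outputs are concatenated, replacing one
    7x7 convolution: same receptive field, fewer operations, and intermediate
    features are kept. A sketch, not the exact network."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.c2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.c3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):
        y1 = self.act(self.c1(x))
        y2 = self.act(self.c2(y1))
        y3 = self.act(self.c3(y2))
        return torch.cat([y1, y2, y3], dim=1)  # concatenation at the end of the block

x = torch.randn(1, 64, 46, 46)
print(ConvTriple(64, 64)(x).shape)  # (1, 192, 46, 46)
```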
The progression of PAFs (Part Affinity Fields) for the right forearm across different stages of the neural
network architecture is shown here. Initially, there is confusion between left and right body parts and limbs,
but as the network progresses through successive stages, the estimates become more refined through global
inference. This suggests that the network is able to leverage information from the entire image to improve the
accuracy of its predictions, even for body parts that are initially difficult to distinguish. Overall, this indicates
that the iterative prediction architecture is effective at improving the accuracy of object detection and part-to-
part association over multiple stages.
Fig. 3: The progression of PAFs (Part Affinity Fields) for the right forearm
Fig. 4: (a) Image with detection points. (b) K-partite graph. (c) Tree structure. (d) A set of bipartite graphs.
Graph matching is a process that aims to find correspondences between the parts detected in an image so that full-body poses of people can be assembled. It
can be modeled as a graph optimization problem, where each detected body part candidate is represented as a node in a graph and edges between nodes
indicate possible connections between body parts. Because every candidate of one part type may connect to every candidate of another, the resulting graph is
a K-partite graph, where K is the number of body part types; edges are only allowed between candidates of part types that can form a limb.
The next step is to relax the K-partite graph into a tree structure following the kinematic skeleton of the body. This reduces the original graph to a set of
bipartite graphs, one per limb type, where each bipartite graph connects the candidates of two adjacent part types. The final goal of the
graph matching process is to assign the edges in these bipartite graphs to people, such that each person receives a complete set of body parts.
The graph matching process can be challenging, particularly in crowded scenes where multiple people are detected and
there are many possible associations between body parts. Various algorithms have been proposed to solve this problem,
including greedy algorithms, dynamic programming, and message passing algorithms.
During testing, the association between candidate part detections is measured by computing the line integral over the
corresponding PAF along the line segment connecting the candidate part locations. The predicted PAF is aligned with the
candidate limb that would be formed by connecting the detected body parts. Specifically, for two candidate part
locations dj1 and dj2, the predicted part affinity field Lc is sampled along the line segment to measure the confidence in
their association. The integral is computed as follows:
E = ∫_0^1 L_c(dj1 + t(dj2 - dj1)) · (dj2 - dj1)/||dj2 - dj1|| dt
where dj1 and dj2 are the 2D coordinates of the two candidate part locations, and t is a scalar parameter ranging from 0 to
1 that specifies the position along the line segment connecting the two points. The dot product with the unit vector along
the candidate limb measures the degree of alignment between the PAF and the limb formed by the two candidate parts,
indicating the confidence in their association. This process is repeated for all possible pairs of candidate parts, and a
bipartite graph is constructed using the confidence values as edge weights. Graph matching algorithms are then used to
find the optimal assignment of candidate parts to form complete body poses.
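The Python sketch below approximates this line integral by sampling the PAF at evenly spaced points and averaging the dot product with the unit limb vector. For the assignment step it uses SciPy's Hungarian solver (linear_sum_assignment) rather than the greedy relaxation used in the actual method, purely for brevity; sample coordinates are assumed to lie inside the field:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def paf_score(paf, d1, d2, n_samples=10):
    """Approximate the line integral of the PAF along the segment d1 -> d2.

    paf has shape (2, H, W): the x and y components of the vector field L_c.
    The score averages the dot product between the sampled field and the
    unit vector pointing from d1 to d2.
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    u = v / (np.linalg.norm(v) + 1e-8)             # unit vector along the candidate limb
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = d1[None, :] + ts[:, None] * v[None, :]   # sample points p(t) = d1 + t*(d2 - d1)
    xs, ys = pts[:, 0].astype(int), pts[:, 1].astype(int)
    field = paf[:, ys, xs].T                       # (n_samples, 2) sampled vectors
    return float(np.mean(field @ u))               # Riemann approximation of the integral

def match_parts(paf, cands1, cands2):
    """Score every candidate pair and solve the bipartite assignment."""
    W = np.array([[paf_score(paf, a, b) for b in cands2] for a in cands1])
    rows, cols = linear_sum_assignment(W, maximize=True)  # optimal matching on edge weights
    return [(r, c, W[r, c]) for r, c in zip(rows, cols)]
```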
COCO (MICROSOFT COMMON OBJECTS IN CONTEXT)
• The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point
detection, and captioning dataset. The dataset consists of 328K images.
• Splits: The training/validation split is 118K/5K images with annotations. Additionally, there is a new unannotated set
of 123K images.
• Annotations: The dataset has annotations for:
– object detection: bounding boxes and per-instance segmentation masks for 80 object categories;
– captioning: natural language descriptions of the images (see MS COCO Captions);
– keypoint detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle);
– stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff);
– panoptic segmentation: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road);
– DensePose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations, where each labeled person is annotated with an instance id and a mapping between the image pixels belonging to that person's body and a template 3D model.
The annotations are publicly available only for the training and validation images.
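For reference, the person keypoint annotations can be loaded with the pycocotools API; the annotation file path below is a placeholder for a local copy of the COCO 2017 annotations:

```python
from pycocotools.coco import COCO

# Placeholder path: point at a local copy of the COCO annotations.
coco = COCO("annotations/person_keypoints_val2017.json")

person_cat = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_cat)
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0], catIds=person_cat))

for ann in anns:
    # 'keypoints' is a flat list of 51 values: (x, y, v) for each of 17 keypoints,
    # where v = 0 (not labeled), 1 (labeled but occluded), 2 (labeled and visible).
    kps = ann["keypoints"]
    xs, ys, vs = kps[0::3], kps[1::3], kps[2::3]
    print(ann["image_id"], sum(1 for v in vs if v > 0), "labeled keypoints")
```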
Fig. 5: Keypoint annotation configuration for the COCO dataset
Evaluation on multi-person pose estimation for the COCO dataset:
The COCO keypoint challenge dataset requires the detection of people and the identification of 17 keypoints (body and
facial parts) for each person, in images with diverse scenarios including crowding, scale variation, occlusion, and contact.
The approach took first place in the inaugural COCO keypoint challenge, and we include a runtime analysis to evaluate
system efficiency and analyze failure cases. The runtime of this method is invariant to the number of people in the image,
whereas the inference times of other libraries are proportional to the number of people. The runtime consists of two major
parts: CNN processing time and multi-person parsing time. The new model is faster than the original models, about 2x
faster when using the GPU version, but 5x slower when using the CPU version.
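Keypoint AP on COCO is computed with the official COCOeval tool, which uses object keypoint similarity (OKS) in place of box IoU. A minimal sketch, assuming a detections file in the standard COCO results format (both paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/person_keypoints_val2017.json")   # ground truth
coco_dt = coco_gt.loadRes("my_keypoint_results.json")         # detections with 'keypoints' and 'score'

evaluator = COCOeval(coco_gt, coco_dt, iouType="keypoints")   # OKS-based matching
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP50, AP75, and AP across object scales
```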
Fig. 6: Failure scenario: overlapping pose detection shared by human and dog
Common failure cases in human pose estimation models include situations where there are overlapping poses or
appearances between individuals, as well as situations where part detections are shared by humans and animals. For
example, in crowded scenes, two people may appear to overlap, making it difficult for the model to accurately identify
and separate their individual poses.
Fig. 7: Wrong connection associating parts from human and animal, false positives on animal.
Additionally, if there are animals in the scene, false positives may occur when the model mistakenly associates human
body parts with animal features, such as legs or tails. In these cases, the model may also wrongly connect body parts
from different objects, leading to inaccurate pose estimation results.
5. EXPERIMENT AND EVALUATION RESULTS
To assess the model's performance during training, we create ground truth confidence maps from the annotated 2D
keypoints. Each confidence map encodes the belief that a specific body part occurs at each pixel. Ideally, if a single
person is present in the image, each confidence map should contain a single peak for the corresponding visible body
part; if multiple people are present, each map should contain one peak per person for whom that part is visible.
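A minimal NumPy sketch of this construction is given below; the map size and sigma are illustrative assumptions. Following the standard formulation, peaks from different people are merged with a pixel-wise max rather than a sum, so individual peaks remain distinct:

```python
import numpy as np

def confidence_map(keypoints, h, w, sigma=7.0):
    """Ground-truth confidence map for one body part.

    keypoints: list of (x, y) locations of this part, one per person.
    A Gaussian peak is placed at each annotated location; overlapping peaks
    are merged with a max so each person's peak stays sharp.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    cmap = np.zeros((h, w), dtype=np.float32)
    for (px, py) in keypoints:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        cmap = np.maximum(cmap, g)   # max, not sum: keeps distinct per-person peaks
    return cmap

# Two people => two peaks in the same part's map.
m = confidence_map([(30, 40), (90, 100)], h=128, w=128)
print(m.max(), (m > 0.9).sum())
```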
We analyzed the effect of PAF refinement on confidence map estimation (Table 5), fixing the computation to a
maximum of 6 stages, distributed differently across the PAF and confidence map branches. We drew the following
conclusions from this experiment. First, PAF estimation requires a higher number of stages to converge and benefits
more from refinement stages. Second, increasing the number of PAF channels mainly improves the number of true
positives, even though they may not be very precise (higher AP50).
Fig. 8: Different part association strategies
In (a), the body part detection candidates are shown as red and blue dots for two body part types, and all connection
candidates are represented by grey lines. In (b), the connection results are shown using the midpoint representation, with
correct connections indicated by black lines and incorrect connections by green lines that still satisfy the incidence
constraint. In (c), the results are shown using Part Affinity Fields (PAFs), represented by yellow arrows. By encoding
position and orientation information over the support of the limb, PAFs eliminate false associations, leading to more
accurate pose estimation.
EXPERIMENT 1:
Goal: compare the performance of different backbone architectures for human pose estimation
Dataset: COCO dataset
Backbone architectures: ResNet-50, ResNet-101, ResNeXt-101
Evaluation metric: average precision (AP) of keypoint detection
Implementation details:
Train the models for 50 epochs with batch size 32, using Adam optimizer with learning rate 1e-4
Use standard data augmentation techniques such as random cropping, flipping, and rotation
Use the same hyperparameters for all models and keep other settings fixed
Decision: ResNeXt-101 shows the best performance among the three architectures with AP of 76.2%,
while ResNet-50 has the lowest performance with AP of 71.2%. Therefore, we will use ResNeXt-101
as the backbone architecture for the following experiments.
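As a sketch of how the backbones in this experiment can be swapped while everything else stays fixed, the snippet below builds a torchvision feature extractor. The helper and its name mapping are illustrative, and weights="DEFAULT" assumes a recent torchvision version:

```python
import torch
import torchvision.models as models

def build_backbone(name="resnext101"):
    # Swap backbones while keeping the rest of the pipeline fixed; the
    # classifier head is dropped and only the convolutional trunk is reused.
    ctor = {"resnet50": models.resnet50,
            "resnet101": models.resnet101,
            "resnext101": models.resnext101_32x8d}[name]
    net = ctor(weights="DEFAULT")                            # ImageNet-pretrained weights
    return torch.nn.Sequential(*list(net.children())[:-2])   # strip avgpool + fc

feats = build_backbone("resnext101")(torch.randn(1, 3, 512, 512))
print(feats.shape)  # (1, 2048, 16, 16) feature map fed to the pose heads
```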
EXPERIMENT 2:
Goal: investigate the effect of different input image sizes on human pose estimation
Dataset: COCO dataset
Input image sizes: 256x256, 384x384, 512x512
Evaluation metric: AP of keypoint detection
Implementation details:
Train the model for 50 epochs with batch size 32, using Adam optimizer with learning rate 1e-4
Use ResNeXt-101 as the backbone architecture
Use the same hyperparameters for all models and keep other settings fixed
Decision: Increasing the input image size improves the performance of the model, with AP of
75.4%, 76.2%, and 77.1% for image sizes of 256x256, 384x384, and 512x512, respectively.
Therefore, we will use an input image size of 512x512 for the following experiments.
EXPERIMENT 3:
Goal: compare the performance of different loss functions for human pose estimation
Dataset: COCO dataset
Loss functions: mean squared error (MSE), mean absolute error (MAE), focal loss
Evaluation metric: AP of keypoint detection
Implementation details:
Train the model for 50 epochs with batch size 32, using Adam optimizer with learning rate 1e-4
Use ResNeXt-101 as the backbone architecture and input image size of 512x512
Use the same hyperparameters for all models and keep other settings fixed
Decision: Focal loss shows the best performance with AP of 77.1%, while MAE has the lowest
performance with AP of 76.2%. Therefore, we will use focal loss as the loss function for the following
experiments.
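For reference, one common form of focal loss adapted to per-pixel heatmap supervision is sketched below; the exact variant used in this experiment may differ, so treat the alpha/gamma formulation as an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss on predicted keypoint heatmaps (Lin et al., 2017 form).

    pred:   raw logits, shape (B, J, H, W)
    target: ground-truth confidence maps in [0, 1], same shape.
    gamma down-weights easy pixels so training focuses on hard ones; this is
    one common adaptation to heatmaps, assumed here for illustration.
    """
    p = torch.sigmoid(pred)
    bce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)         # probability assigned to the true label
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

loss = focal_loss(torch.randn(2, 17, 64, 64), torch.rand(2, 17, 64, 64))
print(loss.item())
```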
6. KEY ISSUES AND OBSTACLES
Fig. 9: Human pose identified accurately in a human-animal image.
a) Segregating human pose from animals in an image is a challenging task, particularly when the image contains both
humans and animals interacting with each other. The task becomes even more challenging when the animals are in
similar poses to humans or when they have similar body structures. One approach to segregating human pose from
animals in an image is to use object detection techniques. In the images above, the animals are segregated and only the
human pose is identified, accurately.
Object detection algorithms such as Faster R-CNN, YOLO, and SSD can be used to detect and localize objects in an
image. By training these algorithms on large, annotated datasets containing both humans and animals, they can learn to
differentiate between humans and animals and accurately detect and classify them.
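A minimal sketch of this filtering step using torchvision's COCO-pretrained Faster R-CNN is shown below; the score threshold and the random placeholder image are assumptions, and COCO category id 1 is "person":

```python
import torch
import torchvision

# Off-the-shelf COCO-trained detector: keep only "person" boxes before running
# pose estimation, so animal bodies never reach the pose model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)            # placeholder for a real RGB image in [0, 1]
with torch.no_grad():
    out = model([image])[0]

keep = (out["labels"] == 1) & (out["scores"] > 0.8)   # confident person detections only
person_boxes = out["boxes"][keep]
print(person_boxes)   # crop these regions, or mask everything else, for pose estimation
```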
Fig. 10: Human pose identified accurately for a human jumping with an animal.
In summary, segregating human pose from animals in an image can be achieved by using object detection algorithms,
pose estimation techniques, or context-based approaches. The effectiveness of each approach will depend on the specific
characteristics of the image and the quality of the data and algorithms used.
Animal pose estimation is a challenging task due to the lack of annotated data, variations in body shapes and sizes,
complex body structures, and the need to detect animals in images. This makes it difficult to develop pose estimation
models that can accurately estimate the poses of various animal species. Novel models need to be developed to address
these challenges and generalize across different animal species.
b) Multi-person pose identification challenge:
This is a challenging task because the number of people in the image is usually unknown, and their poses can vary widely
in terms of complexity and occlusion. To tackle this challenge, state-of-the-art methods use deep learning-based models
that can detect and estimate the positions of various body joints simultaneously across multiple individuals. These models
typically use part affinity fields (PAFs) to represent the likelihood of body part connections between different keypoints,
and score maps to represent the likelihood of the presence of each keypoint. However, accurately identifying the poses of
multiple individuals in a scene is still a challenging task, and there are several open research questions, including
improving the accuracy of pose estimation in crowded scenes, handling occlusions and partial visibility of individuals,
and dealing with varying clothing and body shapes. Addressing these challenges requires developing novel models that
can handle these complex scenarios and improve the overall performance of multi-person pose identification.
Fig. 11: Multi-person pose identification.
c) Trade-off between Speed and Accuracy
In object detection and human pose estimation, region-proposal methods have higher accuracy but slower
runtime compared to single-shot methods. Top-down approaches have higher accuracy but lower speed
compared to bottom-up methods, whose per-person resolution is limited. The higher accuracy of top-down methods
comes from processing each person individually, while bottom-up methods process the entire image at once, resulting
in lower resolution per person. As hardware improves, bottom-up methods may be able to close the
accuracy gap with top-down methods.
The trade-off between speed and accuracy for the main entries of the COCO Challenge is analyzed. Only
approaches with either runtime measurements or code release are considered. AlphaPose, METU, and
single-scale OpenPose provide the best balance between speed and accuracy, while the other methods are
slower and less accurate.
d) Privacy concerns
The dataset used for training and evaluating pose estimation models may raise privacy concerns,
particularly if the data includes images of individuals. Careful consideration is needed to ensure responsible
and ethical data collection and use.
7. CONCLUSION AND FUTURE SCOPE
CONCLUSION
• In this project, we present a method for multi-person 2D pose estimation, which is critical for enabling machines to comprehend human actions
and interactions visually. Our approach includes an explicit nonparametric representation of keypoint association, which encodes the position and
orientation of human limbs. We also present an architecture that simultaneously learns part detection and association and demonstrate that a
greedy parsing algorithm is effective in producing high-quality body pose estimates while maintaining efficiency, even with multiple individuals.
• Furthermore, we show that refining the part affinity fields (PAFs) alone is more effective than jointly refining PAFs and body part locations, resulting
in both improved runtime performance and accuracy. Additionally, we illustrate that combining body and foot estimation into a single model
enhances the accuracy of each component and reduces the inference time compared to running them sequentially.
FUTURE SCOPE
• In the future, we plan to extend the keypoint concept to human-object interaction by incorporating part affinity fields (PAFs) that run from the
interaction point to both the human body and the object. By incorporating these additional features, we aim to enhance the accuracy
and robustness of our model in recognizing and interpreting human-object interactions.
• Furthermore, we intend to apply this model to other applications and datasets to evaluate its effectiveness in a broader range of scenarios. We
believe that this approach has the potential to advance the field of computer vision and contribute to the development of more advanced artificial
intelligence systems capable of understanding human-object interactions.
• Finally, the real-time system for detecting keypoints on the body, feet, and hands can be used for various human analysis research topics, including
human re-identification, retargeting, and human-computer interaction. We are excited to continue our research in this area and explore new
possibilities to improve the performance of our model and its practical applications.
8. BIBLIOGRAPHY
Good teachers are worth more than a thousand books.
References:
1. MS COCO keypoint evaluation metric: http://mscoco.org/dataset/#keypoints-eval
2. M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
3. M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
4. M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
5. V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In 12th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2017.
6. A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
7. X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
8. P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. In IJCV, 2005.
9. G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
10. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
