Human Pose Estimation Project for Computer Vision



BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
FIRST SEMESTER 2022-23
DSECLZG628T DISSERTATION Report
Dissertation Title: HUMAN POSE SKELETON-BASED ESTIMATION MODEL
Name of Evaluator: Dr. SUNIL BHUTADA
Name of Supervisor: Mr. MRUTUNJAYYA
Name of Student: Mr. SHIVAPRASAD PATIL B
ID No. of Student: 2020SC04362
Email id: [email protected]
Phone: 9538052338

DEMO for REAL-TIME HUMAN POSE ESTIMATION

Contents
1. ABSTRACT
2. INTRODUCTION
3. LITERATURE REVIEW AND PROBLEM DOMAIN
4. SOLUTION ARCHITECTURE AND DESIGN
5. EXPERIMENT AND EVALUATION RESULTS
6. KEY ISSUES AND OBSTACLES
7. CONCLUSION AND FUTURE SCOPE
8. BIBLIOGRAPHY

ACKNOWLEDGEMENTS
I would like to express my profound gratitude to my advisor and supervisor for this dissertation, Mr. Mrutunjayya. As an advisor he provided guidance and support throughout the dissertation while allowing me to explore on my own. I would also like to express my special thanks to my mentor, Dr. Sunil Bhutada, for his time and valuable feedback throughout the dissertation. His advice and suggestions helped me learn a great deal and aided in the completion of this project. Finally, I acknowledge the authors of the MPII and COCO human pose datasets; these datasets make 2D human pose estimation in the wild possible.

1. ABSTRACT
Human pose estimation is a fundamental problem in computer vision, with applications in fields such as animation, gaming, virtual reality, and human-computer interaction. The aim is to determine the pose of a human body in a digital image or video, i.e., the position and orientation of each body part. Our model uses a skeleton representation of the human body, where each body part is represented by a joint, or node, in the skeleton. The pose is estimated by determining the position and orientation of each joint from the visual information in the image or video. This project presents an effective and efficient skeleton-based model for estimating the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving real-time performance, irrespective of the number of people in the image.

2. INTRODUCTION
Overview: Enabling machines to comprehend images and videos containing people requires real-time multi-person 2D pose estimation. In previous studies, PAFs and body part locations were refined simultaneously across training stages. We demonstrate that refining PAFs alone, rather than refining both PAFs and body part locations, yields a significant increase in both runtime performance and accuracy.

Existing System: The common method uses a person detector to detect individuals, followed by single-person pose estimation for each detection. Although these top-down approaches make use of established techniques for single-person pose estimation, they suffer from a drawback known as early commitment: when people are close together, the person detector may fail, leaving no means of recovery. Additionally, the computational cost of top-down approaches grows proportionally with the number of people, as a single-person pose estimator must be run for each detection. In contrast, bottom-up approaches are appealing because they offer resilience to early commitment and can decouple runtime complexity from the number of people in the image.

Proposed System: Despite their potential advantages, previous bottom-up methods have not been able to maintain efficiency in practice due to the need for expensive global inference in the final parse. We use a bottom-up approach that represents association scores using Part Affinity Fields (PAFs), 2D vector fields that encode the location and orientation of limbs over the image domain. Our method demonstrates that inferring these bottom-up representations of detection and association encodes global context well enough to allow a greedy parse to achieve high-quality results at a significantly reduced computational cost.

Disadvantage: Our approach has limitations that need to be considered. While PAFs can efficiently detect the 2D poses of multiple people in an image, they may not perform well in certain scenarios, such as under occlusion or with complex poses. Additionally, the approach relies on a nonparametric representation, which may not be suitable for all applications.

Advantage: Our approach addresses the limitations of existing association methods by introducing Part Affinity Fields (PAFs), which encode the location and orientation of limbs as a 2D vector field. This allows more accurate association of body parts with individuals in the image than methods that rely solely on detecting midpoints between parts. Furthermore, our approach achieves high-quality results at a fraction of the computational cost of existing methods, making it a more efficient option for real-time multi-person 2D pose detection.

3. LITERATURE REVIEW AND PROBLEM DOMAIN

Single-Person Pose Estimation: The traditional approach to articulated human pose estimation combines local observations of body parts with the spatial dependencies between them. Spatial models for articulated pose can be based on tree-structured graphical models or non-tree models. Convolutional Neural Networks (CNNs) have been widely used to obtain reliable local observations of body parts and have significantly boosted the accuracy of pose estimation.

Multi-Person Pose Estimation: Multi-person pose estimation is the task of estimating the poses of multiple people in an image. The traditional approach is top-down: people are first detected, and then the pose of each person is estimated independently. However, this approach does not capture the spatial dependencies across different people, and it suffers from early commitment on person detection. Some recent approaches consider inter-person dependencies using bottom-up methods that jointly label part detection candidates and associate them with individual people. The pairwise representations used in these approaches are difficult to regress precisely and require a separate logistic regression to convert pairwise features into probability scores.

Human pose estimation via convolutional part heatmap regression: This paper presents a method for estimating human pose using a convolutional neural network (CNN) architecture. The authors propose a cascaded architecture that specifically aims to learn part relationships and spatial context and is designed to be robust to severe part occlusions. The method uses a two-part cascade, where the first part detects body parts and the second part performs regression on these detections.

Articulated pose estimation by a graphical model with image dependent pairwise relations: This paper presents a method for estimating human pose from a single static image using a graphical model with novel pairwise relations that adaptively use local image measurements. The method combines the representational flexibility of graphical models with the efficiency and statistical power of deep convolutional neural networks (DCNNs) to learn the conditional probabilities of the presence of parts and their spatial relationships within image patches. The method significantly outperforms state-of-the-art methods on the LSP and FLIC datasets and also performs well on the Buffy dataset without any training on it.

Approach: The essence of human pose estimation lies in detecting points of interest on the limbs, joints, and even the face of a human. These keypoints are used to produce a 2D representation of a human body model, essentially a map of body joints tracked during movement. This lets a computer not only distinguish between a person sitting and squatting, but also calculate the angle of flexion at a specific joint and tell whether a movement is performed correctly. There are three common types of human body models: skeleton-based, contour-based, and volume-based. The skeleton-based model is the most widely used in human pose estimation because of its flexibility: it consists of a set of joints such as ankles, knees, shoulders, elbows, and wrists, together with the limb orientations comprising the skeletal structure of a human body.
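To make the skeleton-based model concrete, here is a minimal Python sketch of such a body model. The 17 keypoint names follow the COCO convention; the limb list and the angle helper are illustrative choices for this sketch, not part of any fixed standard.

```python
import numpy as np

# The 17 keypoint names of the COCO convention.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Limbs as (parent, child) index pairs into COCO_KEYPOINTS; one common choice.
LIMBS = [
    (5, 7), (7, 9),      # left arm: shoulder-elbow, elbow-wrist
    (6, 8), (8, 10),     # right arm
    (11, 13), (13, 15),  # left leg: hip-knee, knee-ankle
    (12, 14), (14, 16),  # right leg
    (5, 6), (11, 12),    # shoulders, hips
    (5, 11), (6, 12),    # torso sides
    (0, 1), (0, 2), (1, 3), (2, 4),  # face: nose-eyes, eyes-ears
]

def limb_angle(pose, limb):
    """Angle (degrees) of one limb relative to the image x-axis.
    `pose` is an array of (x, y) joint coordinates indexed like COCO_KEYPOINTS."""
    a, b = pose[limb[0]], pose[limb[1]]
    return float(np.degrees(np.arctan2(b[1] - a[1], b[0] - a[0])))
```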

Fig. 1: Basic structure of human pose estimation. The system takes a color image as input and outputs the 2D locations of keypoints for each person in the image. A feedforward network predicts confidence maps and vector fields, which are then processed by greedy inference to identify the keypoints. The confidence maps comprise J maps, one per body part, and the vector fields comprise C fields that encode the connections between body parts. Part Affinity Fields (PAFs) determine the association between body part detections and assemble full-body poses for an unknown number of people. PAFs preserve both location and orientation information, indicating the direction from one part of a limb to the other; each PAF is a 2D vector field for a specific limb type that joins two associated body parts. This overcomes the limitations of the midpoint representation, which encodes only position and restricts the region of support to a single point.
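As an illustration of how keypoint candidates can be read out of the J confidence maps, the following sketch treats every local maximum above a threshold as a candidate. The threshold value and the 3x3 neighborhood are assumptions made for this sketch, not values taken from the method itself.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(confidence_map, threshold=0.1):
    """Extract keypoint candidates from one of the J confidence maps:
    a pixel is a candidate if it is a local maximum above `threshold`."""
    local_max = maximum_filter(confidence_map, size=3) == confidence_map
    peaks = np.argwhere(local_max & (confidence_map > threshold))
    # argwhere yields (row, col) = (y, x); return (x, y, score) triples.
    return [(int(x), int(y), float(confidence_map[y, x])) for y, x in peaks]
```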

4. SOLUTION ARCHITECTURE AND DESIGN
Fig. 2: Architecture of the multi-stage CNN for multi-person pose estimation.

The architecture of the multi-stage CNN for multi-person pose estimation predicts two types of outputs: Part Affinity Fields (PAFs), which estimate the spatial relationships between body parts, and confidence maps, which estimate the likelihood of each body part being present at a given location. The CNN is composed of multiple stages, with each stage taking the predictions and image features of the previous stage as input, so that predictions are successively refined. The 7x7 convolutions of the original approach have been replaced with three concatenated layers of 3x3 convolutions at the end of each stage; this preserves the receptive field while reducing computation, and the concatenation lets the network capture both local and global spatial dependencies between body parts while preserving multimodal uncertainty from previous stages.
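The following PyTorch sketch illustrates the idea of replacing one 7x7 convolution with three concatenated 3x3 convolutions inside a refinement stage. The module names (ConvTriplet, Stage) and the channel counts are hypothetical placeholders for this sketch, not the exact configuration of the network.

```python
import torch
import torch.nn as nn

class ConvTriplet(nn.Module):
    """Three 3x3 convolutions whose outputs are concatenated, replacing a
    single 7x7 convolution: the receptive field is preserved while the
    operation count drops, and the concatenation keeps intermediate features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.c2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.c3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):
        y1 = self.act(self.c1(x))
        y2 = self.act(self.c2(y1))
        y3 = self.act(self.c3(y2))
        return torch.cat([y1, y2, y3], dim=1)  # 3 * out_ch channels

class Stage(nn.Module):
    """One refinement stage: input is image features concatenated with the
    previous stage's prediction (in_ch = their channel sum); output is a PAF
    or confidence-map tensor with out_ch channels."""
    def __init__(self, in_ch, mid_ch, out_ch, n_blocks=5):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(n_blocks):
            blocks.append(ConvTriplet(ch, mid_ch))
            ch = 3 * mid_ch
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Conv2d(ch, mid_ch, 1), nn.PReLU(),
            nn.Conv2d(mid_ch, out_ch, 1),  # e.g. 2*C channels for PAFs
        )

    def forward(self, features, prev_pred):
        x = torch.cat([features, prev_pred], dim=1)
        return self.head(self.blocks(x))
```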

The progression of PAFs for the right forearm across the stages of the network is shown in Fig. 3. Initially there is confusion between left and right body parts and limbs, but as the network progresses through successive stages, the estimates become more refined through global inference. This shows that the network can leverage information from the entire image to improve its predictions, even for body parts that are initially difficult to distinguish, and indicates that the iterative prediction architecture is effective at improving detection and part-to-part association over multiple stages.
Fig. 3: The progression of PAFs (Part Affinity Fields) for the right forearm.

Fig. 4: (a) Image with detection points. (b) K-partite graph. (c) Tree structure. (d) A set of bipartite graphs.
Graph matching aims to find correspondences between the parts detected in an image so as to assemble the full-body poses of the people in it. It can be modeled as a graph optimization problem: each detected body part candidate is a node, and edges between nodes indicate possible connections between body parts. The resulting graph is K-partite, where K is the number of body part types and edges run between candidates of different part types. Solving the matching over this complete graph is expensive, so the problem is relaxed: the graph is reduced to a tree structure following the skeleton, which decomposes it into a set of bipartite graphs, each connecting the candidates of two adjacent part types (one limb type). The final goal of graph matching is to assign the edges in these bipartite graphs to people such that each person receives a complete set of body parts.
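A sketch of one bipartite subproblem is shown below. For simplicity it solves the assignment with the Hungarian algorithm (SciPy's linear_sum_assignment), whereas the actual method uses a greedy relaxation; the score function is a placeholder for the PAF line-integral confidence described in the next paragraph.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_limb(cands_a, cands_b, score):
    """Solve one bipartite subproblem: assign candidates of part type A to
    candidates of part type B, maximizing the total association score.
    `score(da, db)` is assumed to return the PAF line-integral confidence
    (see the paf_score sketch further below)."""
    if not cands_a or not cands_b:
        return []
    W = np.array([[score(a, b) for b in cands_b] for a in cands_a])
    rows, cols = linear_sum_assignment(-W)  # negate to maximize total weight
    return [(i, j, W[i, j]) for i, j in zip(rows, cols) if W[i, j] > 0]
```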

The graph matching process can be challenging, particularly in crowded scenes where multiple people are detected and many associations between body parts are possible. Various algorithms have been proposed to solve this problem, including greedy algorithms, dynamic programming, and message-passing algorithms. During testing, the association between candidate part detections is measured by computing the line integral of the corresponding PAF along the line segment connecting the candidate part locations; the score is high when the predicted PAF is aligned with the candidate limb that would be formed by connecting the detected body parts. Specifically, for two candidate part locations dj1 and dj2, the predicted part affinity field Lc is sampled along the line segment to measure the confidence in their association:

E = ∫_0^1 Lc(p(t)) · (dj2 - dj1) / ||dj2 - dj1|| dt,  where p(t) = (1 - t)·dj1 + t·dj2

Here dj1 and dj2 are the 2D coordinates of the two candidate part locations, and t is a scalar ranging from 0 to 1 that specifies the position along the line segment connecting them. The integral measures the alignment between the PAF and the limb formed by the two candidate parts, indicating the confidence in their association. This is repeated for all pairs of candidate parts, and a bipartite graph is constructed using the confidence values as edge weights; graph matching is then used to find the optimal assignment of candidate parts into complete body poses.
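A minimal NumPy sketch of this line integral, approximated by sampling the PAF at a fixed number of points along the segment; the sample count of 10 and the nearest-pixel lookup are assumptions made for illustration.

```python
import numpy as np

def paf_score(paf_x, paf_y, d1, d2, n_samples=10):
    """Approximate E = integral_0^1 Lc(p(t)) . (d2 - d1)/||d2 - d1|| dt by
    sampling the PAF at n_samples points along the segment from d1 to d2.
    paf_x, paf_y: the two channels of the PAF for this limb type;
    d1, d2: (x, y) coordinates of the two candidate part locations."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    u = v / norm  # unit vector along the candidate limb
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = d1[None, :] + ts[:, None] * v[None, :]
    xs = np.clip(np.round(pts[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.round(pts[:, 1]).astype(int), 0, paf_x.shape[0] - 1)
    # Dot product of the sampled PAF vectors with the limb direction,
    # averaged over the samples (a Riemann approximation of the integral).
    dots = paf_x[ys, xs] * u[0] + paf_y[ys, xs] * u[1]
    return float(dots.mean())
```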

COCO (Microsoft Common Objects in Context): The MS COCO dataset is a large-scale object detection, segmentation, keypoint detection, and captioning dataset of 328K images.
Splits: 118K training images and 5K validation images, both with annotations, plus a further 123K unannotated images.
Annotations: The dataset provides annotations for object detection (bounding boxes and per-instance segmentation masks for 80 object categories); captioning (natural-language descriptions of the images, see MS COCO Captions); keypoint detection (more than 200,000 images and 250,000 person instances labeled with 17 possible keypoints, such as left eye, nose, right hip, right ankle); stuff segmentation (per-pixel masks for 91 stuff categories such as grass, wall, sky, see MS COCO Stuff); panoptic segmentation (full scene segmentation with 80 thing categories such as person, bicycle, elephant, and a subset of 91 stuff categories such as grass, sky, road); and DensePose (more than 39,000 images and 56,000 person instances, each annotated with an instance id and a mapping between the person's image pixels and a template 3D model). The annotations are publicly available only for the training and validation images.
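Assuming the standard COCO 2017 annotation files are available locally (the path below is an assumption), a short pycocotools sketch for loading the person keypoint annotations might look as follows.

```python
from pycocotools.coco import COCO

# Assumes the COCO 2017 keypoint annotations have been downloaded to
# ./annotations (file name as distributed by the COCO project).
coco = COCO("annotations/person_keypoints_val2017.json")

person_cat = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_cat)
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=person_cat, iscrowd=None)

for ann in coco.loadAnns(ann_ids):
    # 'keypoints' is a flat list of 17 (x, y, v) triples, where v is
    # 0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible.
    kps = ann["keypoints"]
    xs, ys, vs = kps[0::3], kps[1::3], kps[2::3]
    print(ann["image_id"], sum(1 for v in vs if v > 0), "labeled keypoints")
```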

Fig. 5: Keypoint annotation configuration for the COCO dataset.
Evaluation on multi-person pose estimation for the COCO dataset: The COCO keypoint challenge requires detecting people and identifying 17 keypoints (body and facial parts) per person in images with diverse scenarios including crowding, scale variation, occlusion, and contact. The approach took first place in the inaugural COCO keypoint challenge, and we include a runtime analysis to evaluate system efficiency and analyze failure cases. The runtime of this method is invariant to the number of people in the image, whereas the inference times of other libraries grow proportionally with the number of people. The runtime has two major components: CNN processing time and multi-person parsing time. The new model is faster than the original models: 2x faster in the GPU version, but 5x slower in the CPU version.
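For reference, keypoint AP on COCO can be computed with pycocotools' COCOeval. In this sketch the results file name is hypothetical, and detections are assumed to be stored in the standard COCO results format (a JSON list of {image_id, category_id, keypoints, score} records).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/person_keypoints_val2017.json")
coco_dt = coco_gt.loadRes("my_keypoint_results.json")  # hypothetical file

ev = COCOeval(coco_gt, coco_dt, iouType="keypoints")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP, AP50, AP75, AP across scales, and AR
```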

Fig. 6: Failure scenario: overlapping pose detection shared by a human and a dog.
Common failure cases in human pose estimation include overlapping poses or appearances between individuals, as well as part detections shared between humans and animals. For example, in crowded scenes two people may overlap, making it difficult for the model to identify and separate their individual poses.

Fig. 7: Wrong connection associating parts from a human and an animal, and false positives on the animal.
If there are animals in the scene, false positives may occur when the model mistakenly associates human body parts with animal features such as legs or tails. In these cases the model may also wrongly connect body parts from different subjects, leading to inaccurate pose estimates.

5. EXPERIMENT AND EVALUATION RESULTS
To assess the model's performance during training, we create ground-truth maps from the annotated 2D keypoints. Each confidence map represents the probability that a specific body part is found at a particular pixel. Ideally, if a single person is present in the image, each confidence map has a single peak per visible body part; if multiple people are present, each map has one peak per visible body part of each individual. We analyzed the effect of PAF refinement on confidence map estimation (Table 5), fixing the computation to a maximum of 6 stages distributed differently across the PAF and confidence map branches. We drew two main conclusions from this experiment. First, PAF refinement requires a higher number of stages to converge and benefits more from refinement stages. Second, increasing the number of PAF channels mainly improves the number of true positives, even if they are not always precisely localized (higher AP50).
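A minimal sketch of how such a ground-truth confidence map can be built for one body part: each annotated person contributes a Gaussian peak, and the per-person maps are aggregated with a pixel-wise maximum so that nearby peaks remain distinct. The sigma value is an illustrative assumption.

```python
import numpy as np

def gt_confidence_map(shape, keypoints, sigma=7.0):
    """Build the ground-truth confidence map for ONE body part.
    `keypoints` is a list of (x, y) locations of that part, one per
    annotated person, on a map of size `shape` = (height, width)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cmap = np.zeros((h, w), dtype=np.float32)
    for (px, py) in keypoints:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        cmap = np.maximum(cmap, g)  # max, not sum: preserves each peak
    return cmap
```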

Fig. 8: Illustration of the different part association strategies. In (a), body part detection candidates are shown as red and blue dots for two body part types, with all connection candidates as grey lines. In (b), the connection results using the midpoint representation are shown, with correct connections as black lines and incorrect connections, which still satisfy the incidence constraint, as green lines. In (c), the results using Part Affinity Fields (PAFs) are shown as yellow arrows. By encoding position and orientation over the support of the limb, PAFs eliminate false associations, leading to more accurate pose estimation.

Experiment 1
Goal: Compare the performance of different backbone architectures for human pose estimation.
Dataset: COCO.
Backbone architectures: ResNet-50, ResNet-101, ResNeXt-101.
Evaluation metric: Average precision (AP) of keypoint detection.
Implementation details: Train each model for 50 epochs with batch size 32 using the Adam optimizer with learning rate 1e-4; use standard data augmentation (random cropping, flipping, and rotation); use the same hyperparameters for all models and keep all other settings fixed (a minimal sketch of this shared setup is shown below).
Decision: ResNeXt-101 shows the best performance of the three architectures with an AP of 76.2%, while ResNet-50 has the lowest at 71.2%. We therefore use ResNeXt-101 as the backbone for the following experiments.
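A minimal PyTorch sketch of the shared setup used across the experiments: 50 epochs, batch size 32, Adam with learning rate 1e-4, and standard augmentation. The tiny model and MSE loss are placeholders so the snippet runs, not the backbone under test, and keypoint-aware augmentation details are glossed over.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(256, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),   # NOTE: real keypoint training must also
    T.RandomRotation(30),       # swap left/right joints and rotate targets
    T.ToTensor(),
])

model = nn.Sequential(nn.Conv2d(3, 17, 3, padding=1))  # placeholder head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def train(loader, epochs=50):
    """Train on a DataLoader yielding (images, target_heatmaps) batches
    of size 32, matching the reported configuration."""
    for _ in range(epochs):
        for images, target_heatmaps in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), target_heatmaps)
            loss.backward()
            optimizer.step()
```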

Experiment 2
Goal: Investigate the effect of different input image sizes on human pose estimation.
Dataset: COCO.
Input image sizes: 256x256, 384x384, 512x512.
Evaluation metric: AP of keypoint detection.
Implementation details: Train the model for 50 epochs with batch size 32 using the Adam optimizer with learning rate 1e-4; use ResNeXt-101 as the backbone; use the same hyperparameters for all models and keep all other settings fixed.
Decision: Increasing the input image size improves performance, with APs of 75.4%, 76.2%, and 77.1% for 256x256, 384x384, and 512x512, respectively. We therefore use an input size of 512x512 for the following experiments.

Experiment 3
Goal: Compare the performance of different loss functions for human pose estimation.
Dataset: COCO.
Loss functions: Mean squared error (MSE), mean absolute error (MAE), focal loss.
Evaluation metric: AP of keypoint detection.
Implementation details: Train the model for 50 epochs with batch size 32 using the Adam optimizer with learning rate 1e-4; use ResNeXt-101 as the backbone and an input size of 512x512; use the same hyperparameters for all models and keep all other settings fixed.
Decision: Focal loss shows the best performance with an AP of 77.1%, while MAE has the lowest at 76.2%. We therefore use focal loss for the following experiments.
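As an illustration of a focal-style loss for keypoint heatmaps, the sketch below uses the CornerNet/CenterNet variant, which down-weights easy pixels so training focuses on hard ones. This names the general technique; it is not necessarily the exact formulation used in the experiment.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal-style loss for keypoint heatmaps (CornerNet/CenterNet variant).
    `pred` are sigmoid outputs, `gt` are Gaussian ground-truth maps in [0, 1];
    pixels at exact peaks (gt == 1) count as positives, everything else as
    distance-weighted negatives."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n_pos = pos.sum().clamp(min=1.0)  # normalize by the number of peaks
    return (pos_loss.sum() + neg_loss.sum()) / n_pos
```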

6. KEY ISSUES AND OBSTACLES
Fig. 9: Human pose identified accurately in an image containing both a human and animals.
a) Segregating human pose from animals: Separating human poses from animals in an image is challenging, particularly when humans and animals interact, and even more so when the animals hold human-like poses or have similar body structures. One approach is to use object detection: algorithms such as Faster R-CNN, YOLO, and SSD can detect and localize objects in an image, and when trained on large annotated datasets containing both humans and animals, they learn to differentiate between the two and to detect and classify them accurately. In the images above, the animals are filtered out and only the human pose is identified.
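One way to realize this filtering with an off-the-shelf detector is sketched below: keep only high-confidence 'person' boxes before running the pose model. The score threshold is an assumption; COCO class index 1 is 'person' for torchvision detection models.

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def person_boxes(image_tensor, score_thresh=0.8):
    """Return bounding boxes of detected people in one image
    (a float tensor of shape (C, H, W) with values in [0, 1]),
    so that animal bodies never reach the pose model."""
    out = detector([image_tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return out["boxes"][keep]
```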

Fig. 10: Human pose identified accurately for a human jumping with an animal.
In summary, segregating human pose from animals in an image can be achieved with object detection algorithms, pose estimation techniques, or context-based approaches; the effectiveness of each depends on the characteristics of the image and the quality of the data and algorithms used. Animal pose estimation itself remains challenging due to the lack of annotated data, variations in body shapes and sizes, complex body structures, and the need to first detect the animals in images. This makes it difficult to develop pose estimation models that accurately estimate the poses of various animal species; novel models are needed to address these challenges and generalize across species.

b) Multi-person pose identification: This is a challenging task because the number of people in the image is usually unknown, and their poses can vary widely in complexity and occlusion. State-of-the-art methods use deep learning models that detect and estimate the positions of body joints simultaneously across multiple individuals. These models typically use part affinity fields (PAFs) to represent the likelihood of body part connections between keypoints, and score maps to represent the likelihood of each keypoint's presence. However, accurately identifying the poses of multiple individuals remains difficult, and several research questions are open: improving accuracy in crowded scenes, handling occlusions and partial visibility, and dealing with varying clothing and body shapes. Addressing these challenges requires novel models that can handle such complex scenarios and improve the overall performance of multi-person pose identification.
Fig. 11: Multi-person pose identification.

c) Trade-off between speed and accuracy: In object detection and human pose estimation, region-proposal methods achieve higher accuracy but slower runtime than single-shot methods. Top-down approaches have higher accuracy but lower speed than bottom-up methods: they process each person individually, whereas bottom-up methods process the entire image at once, yielding lower resolution per person. As hardware improves, bottom-up methods may close the accuracy gap. Analyzing the trade-off between speed and accuracy for the main entries of the COCO Challenge (considering only approaches with runtime measurements or released code), AlphaPose, METU, and single-scale OpenPose provide the best balance between speed and accuracy, while the other methods are slower and less accurate.
d) Privacy concerns: The datasets used for training and evaluating pose estimation models may raise privacy concerns, particularly when they include images of identifiable individuals. Careful consideration is needed to ensure responsible and ethical data collection and use.

7. CONCLUSION AND FUTURE SCOPE
CONCLUSION: In this project we present a method for multi-person 2D pose estimation, a capability critical for enabling machines to comprehend human actions and interactions visually. Our approach includes an explicit nonparametric representation of keypoint association that encodes the position and orientation of human limbs. We present an architecture that jointly learns part detection and association, and demonstrate that a greedy parsing algorithm produces high-quality body pose estimates while remaining efficient even with multiple individuals. Furthermore, we show that refining the part affinity fields (PAFs) is more important than combining PAF and body part location refinement, improving both runtime performance and accuracy. We also show that combining body and foot estimation into a single model improves the accuracy of each component and reduces inference time compared with running them sequentially.
FUTURE SCOPE: In the future, we plan to extend the set of keypoints by incorporating part affinity fields from the interaction point to both the human body and the object, to improve the model's recognition and interpretation of human-object interactions. We also intend to apply the model to other applications and datasets to evaluate its effectiveness in a broader range of scenarios; we believe this approach can advance the field of computer vision and contribute to artificial intelligence systems capable of understanding human-object interactions. Finally, the real-time system for detecting keypoints on the body, feet, and hands can support various human analysis research topics, including person re-identification, retargeting, and human-computer interaction. We look forward to continuing our research in this area and exploring new ways to improve the model's performance and practical applications.

8. BIBLIOGRAPHY
"Good teachers are worth more than a thousand books."
References:
- MS COCO keypoint evaluation metric: http://mscoco.org/dataset/#keypoints-eval
- M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
- M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
- M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
- V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In IEEE FG, 2017.
- A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
- X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
- P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 2005.
- G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.