This document summarizes a thesis proposal on using deep learning for articulated human pose estimation. The proposed method uses a deep convolutional neural network (DCNN) as a front-end to extract local appearance features of body parts, combined with message passing layers to model spatial relationships between parts through pairwise constraints. This global pose model is trained end-to-end using a max-sum algorithm to maximize consistency across the entire human pose. Experimental results on standard pose estimation datasets demonstrate state-of-the-art performance.
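The max-sum step above can be sketched as dynamic programming over a chain of parts: each part has a unary appearance score from the DCNN, adjacent parts share a pairwise compatibility score, and the best joint configuration is found by passing max-messages along the chain. A minimal illustration (NumPy assumed; the proposal's actual part graph and potentials may differ):

```python
import numpy as np
from itertools import product

def max_sum_chain(unary, pairwise):
    """Max-sum dynamic programming over a chain of parts.

    unary[i][s]       : appearance score of part i in state s
    pairwise[i][s, t] : compatibility of part i in state s with part i+1 in state t
    Returns the state sequence maximizing the total score.
    """
    n, k = unary.shape
    msg = np.zeros((n, k))             # best score of a chain ending at part i, state s
    back = np.zeros((n, k), dtype=int) # backpointers for decoding
    msg[0] = unary[0]
    for i in range(1, n):
        scores = msg[i - 1][:, None] + pairwise[i - 1]  # (k, k): predecessor x current
        back[i] = scores.argmax(axis=0)
        msg[i] = unary[i] + scores.max(axis=0)
    states = [int(msg[-1].argmax())]
    for i in range(n - 1, 0, -1):
        states.append(int(back[i][states[-1]]))
    return states[::-1]
```

On a chain this is exact and runs in O(n·k²), versus O(kⁿ) for brute-force enumeration.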
[Mmlab seminar 2016] Deep learning for human pose estimation - Wei Yang
This document summarizes recent advances in deep learning approaches for human pose estimation. It describes early methods like DeepPose that used cascades of regressors. Later works introduced heatmap regression to capture spatial information. Convolutional Pose Machine and Stacked Hourglass networks further improved accuracy by incorporating stronger context modeling through deeper networks with larger receptive fields and intermediate supervision. These approaches demonstrate that both local appearance cues and modeling of global context and structure are important for accurate human pose estimation.
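The heatmap-regression idea can be made concrete with a small sketch: instead of regressing (x, y) coordinates directly, the network is trained to output a Gaussian confidence map per joint, and the joint location is recovered as the map's argmax. A minimal target-map generator (NumPy assumed; σ and names are illustrative):

```python
import numpy as np

def joint_heatmap(h, w, cx, cy, sigma=2.0):
    """Gaussian heatmap regression target centred on a joint at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

Training against such maps preserves spatial uncertainty that coordinate regression discards, which is one reason the heatmap formulation displaced DeepPose-style regression.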
Articulated human pose estimation by deep learning - Wei Yang
This document summarizes a research paper on articulated human pose estimation using deep learning techniques. It presents convolutional neural network (CNN) models for holistically regressing joint locations and locally capturing part presence and spatial relationships through deformable convolutions. For regression, different CNN architectures are evaluated on the LSP dataset, with a fully connected network achieving 60.9% mean PCP. For the deformable CNN approach, it achieves higher performance of 74.8% PCP on LSP and 91.1% on FLIC by incorporating local image patches and pairwise relationships. Future work to combine local and holistic models in an end-to-end system is discussed.
This document discusses research on human action recognition using skeleton data. It introduces issues with skeleton-based action recognition, such as variable scales, view orientations, noise and rate/intra-action variations. It then reviews previous work on skeleton-based action recognition using hand-crafted features and deep learning models. The document proposes two ensemble deep learning models called Ensemble TS-LSTM v1 and v2 that use temporal sliding LSTMs to capture short, medium and long-term dependencies from skeleton sequences for action recognition. Experimental results on standard datasets demonstrate the models outperform previous methods.
The document discusses human action recognition using spatio-temporal features. It proposes using optical flow and shape-based features to form motion descriptors, which are then classified using Adaboost. Targets are localized using background subtraction. Optical flows within localized regions are organized into a histogram to describe motion. Differential shape information is also captured. The descriptors are used to train a strong classifier with Adaboost that can recognize actions in testing videos.
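The flow-histogram descriptor described above can be sketched in a few lines: flow directions are binned, with each vector weighted by its magnitude. A hedged illustration (NumPy assumed; the bin count and normalization are common choices, not the paper's exact recipe):

```python
import numpy as np

def flow_histogram(dx, dy, bins=8):
    """Magnitude-weighted histogram of optical-flow directions."""
    mag = np.hypot(dx, dy)                    # flow magnitude per pixel
    ang = np.arctan2(dy, dx) % (2 * np.pi)    # direction mapped to [0, 2*pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    s = hist.sum()
    return hist / s if s > 0 else hist
```

Such per-region histograms, concatenated with shape cues, form the motion descriptor that the AdaBoost classifier is trained on.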
This document summarizes recent advances in human pose estimation using deep learning methods. It first discusses traditional approaches like pictorial structures. It then covers several deep learning methods including global/holistic view using joint regression, local appearance using body part detection, and combining global and local information. Other methods discussed are using motion features and pose estimation in videos. Evaluation metrics like PCP and PDJ are also introduced. The document outlines many key papers in this area and provides examples of network architectures and results.
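Of the metrics mentioned, PDJ (Percentage of Detected Joints) is easy to state precisely: a joint counts as detected when the prediction lies within a fraction of the torso diameter of the ground truth. A minimal sketch (NumPy assumed; the 0.2 fraction is a common reporting choice, not fixed by the metric):

```python
import numpy as np

def pdj(pred, gt, torso_diameter, frac=0.2):
    """Percentage of Detected Joints over (n_joints, 2) coordinate arrays."""
    d = np.linalg.norm(pred - gt, axis=1)   # per-joint error in pixels
    return float(np.mean(d <= frac * torso_diameter))
```

PCP is similar in spirit but is defined per limb (both endpoints within half the limb length), which makes it stricter for short limbs such as lower arms.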
This document discusses methods for estimating human pose from images using deep learning. It covers several approaches including SMPLIFY and SCAPE. SMPLIFY uses a CNN to detect 2D joints then fits a statistical body model called SMPL to estimate 3D pose. SCAPE is a graphics model of human shape learned from 3D scans, capturing pose and shape variability. The document reviews similarities and differences between methods, including using priors, image features, and optimization. It also discusses improving methods by making them fully automatic using detected joints rather than manual inputs.
HML: Historical View and Trends of Deep Learning - Yan Xu
The document provides a historical view and trends of deep learning. It discusses that deep learning models have evolved in several waves since the 1940s, with key developments including the backpropagation algorithm in 1986 and deep belief networks with pretraining in 2006. Current trends include growing datasets, increasing numbers of neurons and connections per neuron, and higher accuracy on tasks involving vision, NLP and games. Research trends focus on generative models, domain alignment, meta-learning, using graphs as inputs, and program induction.
CVML2011: human action recognition (Ivan Laptev) - zukun
This document provides an overview of a lecture on human action recognition. It discusses the historic motivation for studying human motion from early studies in art and biomechanics to modern applications in motion capture and video editing. It also covers challenges in human pose estimation and recent advances in appearance-based, motion-based, and space-time methods for recognizing human actions in images and videos. The lecture focuses on key approaches like pictorial structures, motion history images, and space-time features.
Object Detection, Classification, Tracking and Counting - Shounak Mitra
This document summarizes an object detection, tracking, classification, and counting project. The project involved using video from cameras to:
1) Detect objects in video frames using background subtraction and blob analysis. Kalman filters were then used to track objects across frames and reduce noise.
2) Classify objects by color and count them. Shadow detection methods like Gaussian smoothing and thresholding were also applied to filter out shadows.
3) The project aimed to synchronize object counts passing over a bridge with strain gauge and accelerometer readings, to study pedestrian impacts. The document outlines the full algorithm and issues like noise, shadows and tracking.
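The tracking step in 1) can be illustrated with a textbook constant-velocity Kalman filter that smooths noisy blob positions frame to frame. A 1-D sketch (NumPy assumed; the project's actual state model and tuning are not specified here):

```python
import numpy as np

def kalman_track(measurements, q=1e-3, r=1.0):
    """1-D constant-velocity Kalman filter over a list of noisy positions."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition: (position, velocity)
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.eye(2)                       # process noise
    R = np.array([[r]])                     # measurement noise
    x = np.array([[measurements[0]], [0.0]])
    P = np.eye(2)
    out = []
    for z in measurements:
        # predict step: roll the state forward one frame
        x = F @ x
        P = F @ P @ F.T + Q
        # update step: blend prediction with the new measurement
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out
```

Run per blob and per axis, this is the standard way to suppress the frame-to-frame jitter that raw background subtraction produces.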
Object detection is an important computer vision technique with applications in several domains, such as autonomous driving and personal and industrial robotics. The slides below cover the history of object detection from before deep learning through recent research, along with future directions and some guidelines for choosing which type of object detector to use for your own project.
This presentation covers the following topics:
1. Video Classification as a sequence of frames
2. Video Classification as a sequence of frame-blocks
3. 2D ConvNets for Videos
4. CNN + LSTM
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.
A Small Helping Hand from me to my Engineering colleagues and my other friends in need of Object Detection
https://mcv-m6-video.github.io/deepvideo-2018/
Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.
Prepared for the Master in Computer Vision Barcelona:
http://pagines.uab.cat/mcv/
The document discusses human activity recognition from video data using computer vision techniques. It describes recognizing activities at different levels from object locations to full activities. Basic activities like walking and clapping are the focus. Key steps involve tracking segmented objects across frames and comparing motion patterns to templates to identify activities through model fitting. The DEV8000 development kit and Linux are used to process video and recognize activities in real-time. Applications discussed include surveillance, sports analysis, and unmanned vehicles.
This document outlines a project to develop a system for detecting motorcyclists who are violating helmet laws using image processing and convolutional neural networks. The system is designed to detect motorbikes, determine if the rider is wearing a helmet or not, and if not, extract and recognize the license plate number. The document includes sections on the abstract, introduction, objectives, system analysis, specification, design including UML diagrams, modules, inputs/outputs, and conclusion.
Lidar for Autonomous Driving II (via Deep Learning) - Yu Huang
The document outlines research on using LiDAR data for autonomous vehicle object detection. It begins with an introduction to sensor fusion techniques using LiDAR and camera data. Several deep learning approaches for 3D object detection from LiDAR point clouds are then summarized, including methods that project the point cloud into 2D feature maps or 3D voxel grids as input to convolutional networks. Finally, techniques for exploiting HD maps and performing real-time on-device detection are discussed. The document provides an overview of the state-of-the-art in LiDAR-based object detection for autonomous driving applications.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
This document summarizes a student project on human activity recognition using smartphones. A group of 4 students submitted the project to partially fulfill requirements for a Bachelor of Technology degree in computer science and engineering. The project involved developing a system to recognize human activities using the accelerometer and gyroscope sensors in smartphones. Various machine learning algorithms were tested and evaluated on experimental data collected from smartphone sensors. The goal of the project was to create an accurate and lightweight activity recognition system for smartphones, while also exploring active learning methods to reduce the amount of labeled training data needed.
This document describes various algorithms used to build a facial emotion recognition system, including Haar cascade, HOG, Eigenfaces, and Fisherfaces. It explains how each algorithm works, such as how Haar cascade detects facial features and HOG extracts histograms of gradients. The system is trained on the CK+ dataset and uses Eigenface and Fisherface classifiers to classify emotions, achieving higher accuracy (86.54%) with Fisherfaces. It provides code snippets of key steps like cropping, resizing images, splitting data, and predicting emotions.
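The Eigenfaces step can be sketched independently of any face data: eigenfaces are simply the principal components of mean-centred, flattened face images. A minimal version (NumPy assumed; this is not the document's CK+ pipeline itself):

```python
import numpy as np

def eigenfaces(X, k):
    """Top-k eigenfaces of X with shape (n_images, n_pixels).

    Returns the mean face and a (k, n_pixels) array whose rows are the
    principal components (eigenfaces) of the centred data.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:k]

def project(X, mean, components):
    """Project faces into the k-dimensional eigenface coordinate space."""
    return (X - mean) @ components.T
```

Classification (Eigenface or Fisherface style) then operates on the low-dimensional projections rather than the raw pixels.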
The document describes a project that aims to develop a mobile application for real-time object and pose detection. The application will take in a real-time image as input and output bounding boxes identifying the objects in the image along with their class. The methodology involves preprocessing the image, then using the YOLO framework for object classification and localization. The goals are to achieve high accuracy detection that can be used for applications like vehicle counting and human activity recognition.
Recent Progress on Object Detection_20170331 - Jihong Kang
This slide provides a brief summary of recent progress on object detection using deep learning.
The concepts of selected previous works (the R-CNN series, YOLO, and SSD) and six recent papers (uploaded to arXiv between December 2016 and March 2017) are introduced in this slide.
Most of the papers focus on improving the performance of small object detection.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
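The "dropout" regularizer the abstract mentions is simple to sketch: during training, each activation is zeroed with probability p, and the survivors are rescaled so the expected activation is unchanged. This is the "inverted" variant in common use today; the original paper instead rescaled weights at test time. A NumPy illustration:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero activations with probability p during training,
    rescaling the survivors by 1/(1-p) so E[output] equals the input."""
    if not train or p == 0.0:
        return x                      # identity at test time
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask
```

Because each forward pass samples a different mask, dropout behaves like training an ensemble of thinned networks, which is why it was so effective against overfitting in the large fully-connected layers.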
YOLO (You Only Look Once) is a real-time object detection system that frames object detection as a regression problem. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. This allows YOLO to process images at over 45 frames per second while maintaining high accuracy compared to previous systems. YOLO was trained on natural images from PASCAL VOC and generalizes to new domains such as artwork without significant degradation in performance, unlike other methods that struggle with domain shift.
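The regression framing can be made concrete: YOLO divides the image into an S×S grid, and a box is encoded by its grid cell plus cell-relative centre offsets and image-relative width and height. A minimal encoder sketch (NumPy-free; the 448 input size and S=7 follow the original paper, the other names are illustrative):

```python
def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Map a box centre to its YOLO grid cell and cell-relative offsets."""
    gx, gy = cx / img_size * S, cy / img_size * S
    col, row = int(min(gx, S - 1)), int(min(gy, S - 1))
    tx, ty = gx - col, gy - row          # centre offsets within the cell
    tw, th = w / img_size, h / img_size  # sizes relative to the image
    return row, col, tx, ty, tw, th
```

The network's output tensor is trained against exactly these per-cell targets, which is what lets one forward pass produce all boxes at once.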
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Pose Machines is a method for estimating articulated human pose from images using convolutional neural networks. It uses a multi-stage approach where each stage predicts confidence maps for body parts using features from the previous stage. This allows the model to incorporate strong contextual cues between related body joints across stages. Key aspects of the approach include using intermediate supervision to address vanishing gradients, large receptive fields to capture context, and hierarchical prediction of parts to leverage top-down cues. Evaluation on public datasets shows it achieves state-of-the-art performance for articulated human pose estimation from single images.
Deformable Part Models are Convolutional Neural Networks - Wei Yang
Girshick, Ross, et al. "Deformable part models are convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Deep convolutional neural fields for depth estimation from a single image - Wei Yang
The document proposes a new method for depth estimation from a single image using deep convolutional neural fields (DCNF). DCNF formulates depth estimation as a deep continuous conditional random field (CRF) learning problem. It jointly trains a deep CNN and a graphical model, where the CNN generates unary potentials and the graphical model models pairwise potentials. This allows end-to-end training of the unary and pairwise potentials of the CRF. Experimental results on indoor and outdoor datasets show the DCNF approach outperforms previous methods for monocular depth estimation.
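A hedged sketch of the CRF objective described above, in the usual DCNF notation (y_p is the depth of superpixel p, z_p(θ) the CNN-regressed unary prediction, and R_pq a learned pairwise similarity; details may differ from the paper):

```latex
E(\mathbf{y}, \mathbf{x}) \;=\; \sum_{p} \bigl(y_p - z_p(\boldsymbol{\theta})\bigr)^2
\;+\; \sum_{(p,q)} \tfrac{1}{2}\, R_{pq}\, (y_p - y_q)^2,
\qquad
\Pr(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{\exp\{-E(\mathbf{y}, \mathbf{x})\}}{Z(\mathbf{x})}
```

Because both terms are quadratic in y, the distribution is Gaussian, so the partition function and the MAP depths have closed forms; this is what makes joint end-to-end training of the unary CNN and the pairwise potentials tractable.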
This document provides an introduction to manifold learning. It defines what a manifold is and discusses how data lies on low-dimensional manifolds even when represented in high-dimensional space. It introduces several linear and nonlinear manifold learning algorithms, including Principal Components Analysis, Multidimensional Scaling, Isomap, Locally Linear Embedding, and Laplacian Eigenmaps. For each algorithm, it provides a brief overview of the motivation, key steps, and examples of applications like super-resolution imaging.
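Of the listed algorithms, classical Multidimensional Scaling is the easiest to sketch: double-centre the squared-distance matrix and take the top eigenvectors of the resulting Gram matrix. A minimal version (NumPy assumed):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n points in k dimensions from an (n, n) matrix
    of pairwise Euclidean distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]         # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

Isomap is this same procedure applied to geodesic (shortest-path) distances over a neighbourhood graph instead of straight-line distances, which is how it unrolls curved manifolds.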
The document is a presentation about monocular human pose estimation using Bayesian networks. It includes:
- An outline with sections on introduction, approach overview, model learning, pose estimation, feature extraction, experiments and conclusions.
- Discussion of applications of human motion capture such as animation, games, medical diagnosis and visual surveillance.
- Comparison of different sensor approaches for human pose estimation including active markers, passive markers and markerless methods using cameras.
- Description of the proposed approach which uses Bayesian networks to represent the articulated human body and estimate 2D and 3D joint positions through representation, learning and inference steps.
All pose face alignment robust to occlusion - Jongju Shin
The document proposes a method for pose and occlusion robust face alignment using multiple shape models and partial inference. It introduces shape representation using point distribution models and multiple shape models to handle various poses and expressions. The method detects local features hierarchically using modified census transform and Adaboost. It then hypothesizes transformation and shape parameters using partial inference to estimate visible and invisible features. Experimental results on public databases show the method achieves accurate alignment under poses, expressions, and occlusions.
The document discusses extending the concept of pose beyond just physical positioning of the body to include digital representations and interactions. Pose is analyzed as a complex interplay between physical, digital, and social dimensions that is shaped by technologies. Extending the notion of pose allows for considering how identities are performed through various mediums and interactions online and in blended digital-physical spaces.
1. The document summarizes research on improving a single person pose recognition and tracking system using computer vision techniques. The goal is to better detect body parts and recognize poses in real-time using a single camera.
2. Key aspects of the system include using a mixture of Gaussians model for background subtraction, and a particle filter for tracking the torso and head. Hand detection is improved by combining skin color detection with the human blob silhouette.
3. The research aims to improve pose recognition performance by classifying "non-poses" - poses that are different from the predefined poses. Experiments show that increasing the dataset size and adding a "non-pose" class leads to better detection results.
Pose Method clinic held at CrossFit Ferus in Fayetteville, NC. Covers running form and technique from an efficiency and injury prevention standpoint. Programming for marathon training and interval sessions described.
Estimating Human Pose from Occluded Images (ACCV 2009) - Jia-Bin Huang
We address the problem of recovering 3D human pose from single 2D images, in which the pose estimation problem is formulated as a direct nonlinear regression from image observation to 3D joint positions. One key issue that has not been addressed in the literature is how to estimate 3D pose when humans in the scenes are partially or heavily occluded. When occlusions occur, features extracted from image observations (e.g., silhouettes-based shape features, histogram of oriented gradient, etc.) are seriously corrupted, and consequently the regressor (trained on un-occluded images) is unable to estimate pose states correctly. In this paper, we present a method that is capable of handling occlusions using sparse signal representations, in which each test sample is represented as a compact linear combination of training samples. The sparsest solution can then be efficiently obtained by solving a convex optimization problem with certain norms (such as l1-norm). The corrupted test image can be recovered with a sparse linear combination of un-occluded training images which can then be used for estimating human pose correctly (as if no occlusions exist). We also show that the proposed approach implicitly performs relevant feature selection with un-occluded test images. Experimental results on synthetic and real data sets bear out our theory that with sparse representation 3D human pose can be robustly estimated when humans are partially or heavily occluded in the scenes.
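The core idea, representing a test sample as a sparse linear combination of training samples, can be sketched with l1-regularized least squares solved by iterative soft-thresholding (ISTA); the tiny orthonormal dictionary below is purely illustrative, not from the paper:

```python
def matvec(rows, v):
    # multiply a matrix (stored as a list of rows) by a vector
    return [sum(r[j] * v[j] for j in range(len(v))) for r in rows]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

def soft(x, t):
    # soft-thresholding, the proximal operator of the l1-norm
    return max(x - t, 0.0) if x > 0 else min(x + t, 0.0)

def ista(D, y, lam=0.1, step=0.5, iters=300):
    """Solve min_c 0.5*||D c - y||^2 + lam*||c||_1 by iterative
    soft-thresholding; c is the sparse code of y over dictionary D."""
    Dt = transpose(D)
    c = [0.0] * len(D[0])
    for _ in range(iters):
        r = [a - b for a, b in zip(matvec(D, c), y)]   # residual D c - y
        g = matvec(Dt, r)                              # gradient D^T r
        c = [soft(c[j] - step * g[j], step * lam) for j in range(len(c))]
    return c

# Dictionary whose columns are two (orthonormal) training samples;
# the test sample y equals the first training sample.
D = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
y = [1.0, 0.0, 0.0]
code = ista(D, y)
print(code)  # coefficient mass concentrates on the matching training sample
```

With an orthonormal atom, the lasso solution for the matching coefficient is the soft-thresholded correlation (here 1.0 − 0.1 = 0.9), while the non-matching coefficient stays exactly zero.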
This document provides instructions for 10 poses called "Himal Poses" created by Himal Fernando. It cautions that some poses should not be attempted without training. The poses include basic poses like the Vertical and 66, as well as more advanced poses like the Stargazer, Palm Face, Chicken Dance, Can't Touch This, Intense, Devil's Horns, and Win which is described as the ultimate winning pose. The document encourages practicing the poses but cautions not to overexert and advises taking breaks between poses.
Towards Accurate Multi-person Pose Estimation in the Wild (My summary) - Abdulrahman Kerim
This presentation summarizes a paper on multi-person pose estimation using a two-stage deep learning model. The approach uses a Faster R-CNN model to detect person boxes, then applies a separate ResNet model to each box to predict keypoints. It trains on the COCO dataset and evaluates on COCO test images, achieving state-of-the-art accuracy for multi-person pose estimation. Key aspects covered include the motivation, problem definition, approach using heatmap and offset predictions, model training procedure, evaluation metrics and results.
Super resolution in deep learning era - Jaejun Yoo
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
The document summarizes the author's computer vision research from 2020 to the present. It covers areas of research including image segmentation, 3D reconstruction, image restoration, and lip generation. Specific projects are mentioned under each area, such as YOLACT and MODNet for image segmentation, PIFu and SMPL for 3D reconstruction, and Wav2Lip and SyncTalkFace for lip generation from speech. The author also outlines plans for future research directions involving multimodal learning, generative models, and representing scenes with neural radiance fields.
This document describes a machine learning project that uses support vector machine (SVM) and k-nearest neighbors (k-NN) classifiers to segment gesture phases, with the SVM using a radial basis function (RBF) kernel. The project aims to classify frames of movement data into five gesture phases (rest, preparation, stroke, hold, retraction). The SVM approach achieved 53.27% accuracy on the test data, while the k-NN approach achieved a significantly higher accuracy of 92.53%. The document provides details on the dataset, the feature extraction methods, the model selection process, and the results of applying each classifier to the test data.
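As a sketch of the k-NN side of such a pipeline (made-up 2-D features standing in for the gesture features):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples
    (Euclidean distance)."""
    dists = sorted(
        (math.dist(x, p), label) for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "frames": two clusters standing in for two gesture phases.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["rest", "rest", "rest", "stroke", "stroke", "stroke"]
print(knn_predict(train_X, train_y, (0.15, 0.1)))  # -> rest
print(knn_predict(train_X, train_y, (0.95, 1.0)))  # -> stroke
```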
Learning a nonlinear embedding by preserving class neighbourhood structure (final) - WooSung Choi
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics. 2007.
These presentation slides aim to bridge knowledge and practice in traffic flow modelling through a clear understanding of the mathematical terms in the modelling equations. I hope they contribute to improving our ability to simulate any model based on numerical methods, e.g., finite difference schemes.
All the best.
Nikhil Chandra Sarkar
1. The document discusses barriers to scaling electronic structure methods to large systems, such as the inability of sparse matrix multiplication kernels to access strong parallel scaling and entrenched data structures that limit innovation.
2. It proposes a fast, generic, and data local N-body solver approach using new mathematics that is not constrained by row-column data structures and allows a single programming model.
3. Key aspects of this approach include exploiting locality in higher dimensional product volumes through techniques like occlusion-culling, resolving identity iteratively to compress matrices by orders of magnitude, and developing optimized sparse matrix multiplication kernels.
Human action recognition with kinect using a joint motion descriptor - Soma Boubou
- We proposed a novel descriptor for the motion of skeleton joints.
- The proposed descriptor outperforms state-of-the-art descriptors such as HON4D and the one proposed by Chen et al. (2013).
- Our approach proved effective for periodic actions (e.g., waving, walking, jogging, side-boxing).
- Grouping was effective for actions with unique joint trajectories (e.g., tennis serving, side kicking).
- Grouping joints into eight groups is consistently effective for the actions of the MSR3D dataset.
Tutorial Equivariance in Imaging ICMS 23.pptx - Julián Tachella
Equivariant deep learning enables unsupervised learning of inverse problems from measurements alone by exploiting signal symmetries. The measurement operator must not be equivariant to the symmetry group in order for the underlying signal set to be uniquely identified. If the signal set has low dimensionality and the symmetry group is large, the number of measurements needed is the same as for supervised signal recovery. This approach generalizes supervised training by allowing learning from unlabeled data.
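As a toy illustration of the symmetry idea (a generic demonstration, not the tutorial's method): a circular moving-average filter commutes with cyclic shifts, i.e., it is shift-equivariant, which can be checked numerically:

```python
def roll(x, s):
    # cyclic shift of a list by s positions
    s %= len(x)
    return x[-s:] + x[:-s]

def circ_avg(x):
    # circular 3-tap moving average: a shift-equivariant operator
    n = len(x)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]

x = [1.0, 4.0, 2.0, 8.0, 5.0]
lhs = circ_avg(roll(x, 2))   # filter after shifting
rhs = roll(circ_avg(x), 2)   # shift after filtering
print(lhs == rhs)  # -> True: the operator is equivariant to cyclic shifts
```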
Paper Introduction "Density-aware person detection and tracking in crowds" - 壮 八幡
This document summarizes a paper on detecting and tracking people in crowded scenes. It proposes an energy formulation approach that leverages global scene structure and resolves all detections jointly. The approach formulates detection as an energy minimization problem involving terms for person detector confidence scores, non-overlapping detections, and crowd density estimation. It estimates crowd density using a Gaussian mixture model and learns model parameters by minimizing a mean squared error distance between annotated and estimated density maps.
1. The document discusses various machine learning classification algorithms including neural networks, support vector machines, logistic regression, and radial basis function networks.
2. It provides examples of using straight lines and complex boundaries to classify data with neural networks. Maximum margin hyperplanes are used for support vector machine classification.
3. Logistic regression is described as useful for binary classification problems by using a sigmoid function and cross entropy loss. Radial basis function networks can perform nonlinear classification with a kernel trick.
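Point 3 above can be made concrete in a few lines of plain Python (toy 1-D data, invented for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y):
    # binary cross-entropy loss for predicted probability p and label y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# 1-D logistic regression: learn w, b so sigmoid(w*x + b) separates the labels.
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]
w, b = 0.0, 0.0
for _ in range(200):
    for x, y in zip(X, Y):
        p = sigmoid(w * x + b)
        # gradient of the cross-entropy w.r.t. (w, b) is (p - y) * (x, 1)
        w -= 0.1 * (p - y) * x
        b -= 0.1 * (p - y)

loss = sum(cross_entropy(sigmoid(w * x + b), y) for x, y in zip(X, Y))
print(round(loss, 3))  # total loss is small after training
```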
We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
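In its naive form, a PCA-based anomaly score is the residual distance to the top principal subspace; a tiny pure-Python sketch (power iteration for the leading eigenvector, made-up 2-D data, not the paper's streaming algorithm):

```python
import math

def top_eigvec(C, iters=100):
    """Power iteration for the leading eigenvector of a 2x2 covariance matrix."""
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [C[0][0] * v[0] + C[0][1] * v[1],
             C[1][0] * v[0] + C[1][1] * v[1]]
        norm = math.hypot(*w)
        v = [w[0] / norm, w[1] / norm]
    return v

def pca_anomaly_scores(points):
    """Score each point by its distance to the 1-D principal subspace."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    cxy = sum(x * y for x, y in centered) / n
    C = [[sum(x * x for x, _ in centered) / n, cxy],
         [cxy, sum(y * y for _, y in centered) / n]]
    v = top_eigvec(C)
    scores = []
    for x, y in centered:
        proj = x * v[0] + y * v[1]          # component along the subspace
        rx, ry = x - proj * v[0], y - proj * v[1]
        scores.append(math.hypot(rx, ry))   # residual = anomaly score
    return scores

# Six points near the line y = x, plus one far off the subspace.
pts = [(-5, -5), (-3, -3), (-1, -1), (1, 1), (3, 3), (5, 5), (6, -6)]
scores = pca_anomaly_scores(pts)
print(scores.index(max(scores)))  # -> 6: the off-subspace point stands out
```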
-Proceedings: https://arxiv.org/abs/1804.03065
Analysis of large scale spiking networks dynamics with spatio-temporal constr... - Hassan Nasser
Recent experimental advances have made it possible to record up to several hundred neurons simultaneously in the cortex or in the retina. Analysing such data requires mathematical and numerical methods to describe the spatio-temporal correlations in population activity, which can be done with the Maximum Entropy method. Here, a crucial parameter is the product N×R, where N is the number of neurons and R the memory depth of the correlations (how far in the past the spike activity affects the current state). Standard statistical mechanics methods are limited to spatial correlation structures with R = 1 (e.g., the Ising model), whereas methods based on transfer matrices, which allow the analysis of spatio-temporal correlations, are limited to NR = 20.
In the first part of the thesis we propose a modified version of the transfer matrix method, based on a parallel version of the Monte Carlo algorithm, allowing us to go up to NR = 100.
In the second part we present EnaS, a C++ library with a graphical user interface developed for neuroscientists. EnaS offers highly interactive tools that allow users to manage data, compute empirical statistics, and model and visualize results.
Finally, in a third part, we test our method on synthetic and real data sets. The real data sets are retina recordings provided by our neuroscientist partners. Our non-extensive analysis shows the advantages of considering spatio-temporal correlations for the analysis of retina spike trains, but it also outlines the limits of Maximum Entropy methods.
For more information about the software that I co-developed with my colleagues, please visit this page:
https://enas.inria.fr/
For more information about the publications, please visit this page:
https://scholar.google.fr/citations?user=L97ZODwAAAAJ
For the thesis, please visit this link:
https://www.theses.fr/178166669
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - Eun Ji Lee
1. The document summarizes a research paper on neural image caption generation using visual attention mechanisms. It introduces attention models that allow an image captioning model to focus on salient regions of the image dynamically.
2. It describes the image captioning model which uses an LSTM decoder conditioned on an encoded image representation and a context vector. The context vector is generated by taking a weighted sum of image features, with the weights determined by an attention model.
3. It discusses two types of attention mechanisms - "hard" or stochastic attention, which selects a single image location at each time step, and "soft" or deterministic attention, which blends all locations with learned weights. The model is trained end-to-end to maximize (a variational lower bound on) the likelihood of the correct caption.
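The soft-attention context vector described in point 2 is just a softmax-weighted average of region features; a minimal sketch (made-up scores and 2-D features, not the paper's trained model):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """Blend all feature vectors with softmax weights (deterministic 'soft'
    attention); hard attention would instead sample a single index."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Three image regions with 2-D features; the attention model scored region 1 highest.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [0.1, 2.0, 0.3]
weights, context = soft_attention(features, scores)
print(weights)  # weights sum to 1; region 1 dominates the context vector
```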
Technical presentation of the gesture based NUI I developed for the Aigaio smart conference room in IIT Demokritos
Demo In Greek:
https://www.youtube.com/watch?v=5C_p7MHKA4g
This paper considers the problem of learning probability distributions through the Bellman dynamics in distributional reinforcement learning. Prior work learns a finite set of statistics of each return distribution with a neural network, but this restricts the functional form of the return distribution, giving limited expressiveness, and makes it difficult to maintain the predefined statistics. To remove these limitations, the paper proposes learning deterministic (pseudo-random) samples of the return distribution using Maximum Mean Discrepancy (MMD), a technique from hypothesis testing. By implicitly matching all moments between the return distribution and the Bellman target, the method guarantees convergence of the distributional Bellman operator, and a finite-sample analysis of the distribution approximation is provided. Experiments show that the proposed method outperforms standard distributional reinforcement learning baselines and achieves state-of-the-art results on Atari games among non-distributed agents.
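The central quantity, the squared MMD between two sample sets under a Gaussian kernel, can be sketched as follows (illustrative samples, not the paper's agent):

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimator of squared Maximum Mean Discrepancy:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    k = lambda a, b: gaussian_kernel(a, b, bandwidth)
    xx = sum(k(a, b) for a in X for b in X) / (len(X) ** 2)
    yy = sum(k(a, b) for a in Y for b in Y) / (len(Y) ** 2)
    xy = sum(k(a, b) for a in X for b in Y) / (len(X) * len(Y))
    return xx + yy - 2.0 * xy

# Pseudo-samples of two return distributions: identical vs. shifted.
returns_a = [0.0, 0.5, 1.0, 1.5]
returns_b = [3.0, 3.5, 4.0, 4.5]
print(mmd_squared(returns_a, returns_a))        # -> 0.0 for identical samples
print(mmd_squared(returns_a, returns_b) > 0.1)  # clearly separated -> large MMD
```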
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond... - Chris Rackauckas
This document discusses scientific machine learning and differentiable simulation. It begins by explaining that scientific machine learning uses both data and physical knowledge to make accurate predictions with less data. It then discusses differentiable simulation and how universal differential equations can be used to replace unknown portions of models with neural networks while preserving known physical structure. Several examples are provided of applications in various domains like epidemiology, black hole detection, earthquake engineering, and chemistry. The document emphasizes that understanding the engineering principles and numerical properties of the domain is important for applying these methods stably and efficiently.
This document presents statistical analysis of stochastic multi-robot boundary coverage. It begins by introducing the problem of stochastic boundary coverage using multiple robots and defines key terms. It then provides the problem statement of analyzing saturation probability and distributions when robots attach randomly to a closed boundary. The document proceeds to solve this problem analytically for point robots and extends the solution to finite-sized robots. It compares the analytical solutions to results from Monte Carlo simulations to validate the statistical analysis.
Artificial Intelligence Applications in Petroleum Engineering - Part I - Ramez Abdalla, M.Sc
This document discusses applications of artificial intelligence, specifically artificial neural networks and genetic algorithms, in petroleum engineering. It provides an overview of neural networks in OnePetro papers, describes the basic concepts and training processes of neural networks and genetic algorithms. It then discusses various applications of these techniques in reservoir engineering, production technologies, and oil well drilling, including reservoir characterization, modeling, well test analysis, permeability prediction, production monitoring, drilling optimization, and more. The presentation aims to explore these applications in more depth.
Background Estimation Using Principal Component Analysis Based on Limited Mem... - IJECEIAES
Given a video of M frames of size h × w, the background components of the video are the matrix elements that remain relatively constant over the M frames. In the PCA (principal component analysis) method these elements are referred to as "principal components". In video processing, background subtraction means removing the background component from the video, and PCA is used to obtain it. The method transforms the 3-dimensional video (h × w × M) into a 2-dimensional matrix (N × M), where N is a linear array of size h × w. The principal components are the dominant eigenvectors, which form the basis of an eigenspace. A limited-memory block Krylov subspace optimization is then proposed to improve the performance of the computation. The background estimate is obtained by projecting each input image (the first frame of each image sequence) onto the space spanned by the principal components. The procedure was run on a standard dataset, the SBI (Scene Background Initialization) dataset, consisting of 8 videos with resolutions between 146 × 150 and 352 × 240 and total frame counts between 258 and 500. Performance is reported with 8 metrics, in particular (averaged over the 8 videos) the percentage of error pixels (0.24%), the percentage of clustered error pixels (0.21%), the multiscale structural similarity index (0.88 out of a maximum of 1), and the running time (61.68 seconds).
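As a much-simplified stand-in for this pipeline, here is a per-pixel temporal background model (a median in place of the principal-component projection; toy 1-D "frames" invented for illustration):

```python
from statistics import median

def estimate_background(frames):
    """Per-pixel temporal median over M frames: a simple background model
    standing in for the principal-component projection described above."""
    return [median(pixels) for pixels in zip(*frames)]

def foreground_mask(frame, background, thresh=30):
    # a pixel is foreground if it deviates strongly from the background
    return [abs(p - b) > thresh for p, b in zip(frame, background)]

# Five toy 4-pixel frames: pixel 2 briefly changes (a moving object).
frames = [
    [10, 200, 50, 90],
    [12, 198, 52, 91],
    [11, 199, 180, 90],   # object passes over pixel 2
    [10, 201, 51, 89],
    [11, 200, 50, 90],
]
bg = estimate_background(frames)
print(foreground_mask(frames[2], bg))  # -> [False, False, True, False]
```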
Similar to Deep learning-for-pose-estimation-wyang-defense (20)
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx - MAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and '70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Describing and Interpreting an Immersive Learning Case with the Immersion Cub... - Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr... - Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste... - Sérgio Sacani
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars. The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically, the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec. Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution, with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The debris of the ‘last major merger’ is dynamically young - Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
The binding of cosmological structures by massless topological defects - Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is mitigated, at least in part.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
ESR spectroscopy in liquid food and beverages.pptx - PRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of food. There are various methods of treating food to preserve it, and irradiation is one of them. It is the most common and most harmless method of food preservation, as it does not alter the essential micronutrients of the food. Although irradiated food does not harm human health, quality assessment of the food is still required to provide consumers with the necessary information about it. ESR spectroscopy is a sophisticated way to investigate the quality of food and the free radicals induced during its processing. The ESR spin-trapping technique is useful for detecting highly unstable radicals in food, and the antioxidant capacity of liquid foods and beverages is mainly measured with this technique.
1. Deep Learning for Articulated
Human Pose Estimation
Thesis Proposal Defence
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
May 25, 2016
6. Traditional Methods
• Pictorial structures (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
  • Unary templates
  • Pairwise springs
• Mixture of mini-parts (Yang & Ramanan 2011)
  • Mixtures of each part
  • Unary template for each mixture type
  • Pairwise springs between mixture types of two parts
• Limitation: unable to handle large variations (e.g., foreshortening)
8. Deep Learning Based Methods
(x, y) Coordinate Regression [Toshev & Szegedy, CVPR'14]
• Holistic view
• The mapping from images to coordinates is too difficult to learn
• Inaccurate in the high-precision region
Heatmap Regression [Tompson et al., CVPR'15]
• Learning better representations
• Geometric constraints among body parts are missing in training the DCNN
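A common way to decode a joint location from a predicted heatmap is the argmax of the score map; a minimal sketch with a hand-written heatmap (not network output):

```python
def heatmap_to_joint(heatmap):
    """Return the (x, y) location of the maximum response in a 2-D heatmap
    (rows are y, columns are x)."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best:
                best, best_xy = score, (x, y)
    return best_xy

heatmap = [
    [0.01, 0.02, 0.01],
    [0.05, 0.90, 0.10],
    [0.02, 0.20, 0.03],
]
print(heatmap_to_joint(heatmap))  # -> (1, 1)
```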
9. Motivation: Global Pose Consistency Helps in Learning Better Representations
• With a CNN alone, local evidence is weak.
• With spatial constraints, global consistency helps training in both the forward and backward passes.
10. Difficulties in Modeling Spatial Constraints
Pairwise terms such as face-to-shoulder (s | f) and shoulder-to-shoulder (s | s) are modeled by convolving part heatmaps with spatial kernels [Tompson et al., NIPS'14].
• A weakly spatial histogram over body part locations is less effective for large variations.
• When the kernels are learned by convolution, the parameter space is too large and hence difficult to learn.
Tompson, Jonathan J., et al. "Joint training of a convolutional network and a graphical model for human pose estimation." NIPS. 2014.
11. Graph Models
G = (V, E)
• Vertices
  • Locations and mixture types of body parts
  • Modeled by a front-end CNN
• Edges
  • Pairwise spatial relationships between body parts
  • Modeled by message passing layers
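A minimal sketch of what the message passing layers compute at inference time, max-sum dynamic programming over a chain of parts; the parts, candidate locations, and scores below are invented for illustration:

```python
def max_sum_chain(unary, pairwise):
    """Max-sum (Viterbi) inference on a chain of parts.
    unary[t][i]: score of part t at candidate location i (from the CNN).
    pairwise[t][i][j]: compatibility of part t at i with part t+1 at j."""
    T = len(unary)
    msg, back = [list(unary[0])], []
    for t in range(1, T):
        scores, pointers = [], []
        for j in range(len(unary[t])):
            best_i = max(range(len(msg[-1])),
                         key=lambda i: msg[-1][i] + pairwise[t - 1][i][j])
            scores.append(unary[t][j] + msg[-1][best_i] + pairwise[t - 1][best_i][j])
            pointers.append(best_i)
        msg.append(scores)
        back.append(pointers)
    # Backtrack the globally consistent assignment.
    j = max(range(len(msg[-1])), key=lambda i: msg[-1][i])
    assignment = [j]
    for pointers in reversed(back):
        j = pointers[j]
        assignment.append(j)
    assignment.reverse()
    return assignment, max(msg[-1])

# Three parts (e.g., head, shoulder, elbow), two candidate locations each.
unary = [[2.0, 0.5], [0.5, 1.0], [1.0, 1.2]]
pairwise = [
    [[1.0, -1.0], [-1.0, 1.0]],   # head-shoulder compatibility
    [[0.5, 0.0], [0.0, 0.5]],     # shoulder-elbow compatibility
]
assignment, score = max_sum_chain(unary, pairwise)
print(assignment, score)  # -> [0, 0, 0] 5.0
```

Note that greedy per-part decoding would pick location 1 for the second and third parts (their unaries are higher), but the pairwise terms pull the globally consistent answer to all-zeros, which is exactly the "global consistency" argument of the preceding slides.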
32. Evaluation Metrics
Percentage of Correct Parts (PCP)
• Measures correctly localized body parts.
• A candidate body part is treated as correct if its segment endpoints lie within 50% of the length of the ground-truth segment from the annotated endpoints.
• Penalizes short limbs.
Percentage of Detected Joints (PDJ)
• Measures correctly localized joints, invariant to scale.
• A curve is computed by varying the localization precision threshold, which is normalized by the scale, defined as the distance between the left shoulder and the right hip.
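The PCP criterion above can be written directly (made-up coordinates; `pcp_correct` is an illustrative helper, not code from the thesis):

```python
import math

def pcp_correct(pred, gt):
    """A predicted limb (pair of endpoints) is correct if both endpoints lie
    within 50% of the ground-truth limb length of the annotated endpoints."""
    (p1, p2), (g1, g2) = pred, gt
    limb_len = math.dist(g1, g2)
    return (math.dist(p1, g1) <= 0.5 * limb_len and
            math.dist(p2, g2) <= 0.5 * limb_len)

gt_limb = ((0.0, 0.0), (2.0, 0.0))   # ground-truth lower arm, length 2
good = ((0.3, 0.0), (2.0, 0.4))      # both endpoints within 1.0 of the truth
bad = ((0.0, 1.5), (2.0, 0.0))       # first endpoint is 1.5 away
print(pcp_correct(good, gt_limb), pcp_correct(bad, gt_limb))  # -> True False
```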
41. Unary Term vs. Full Model
Strict PCP on the LSP dataset (VGG-LG):
              Unary   Full Model
Torso          83.4      96.5
Head           69.0      83.1
Upper Arms     53.5      78.8
Lower Arms     34.9      66.7
Upper Legs     72.2      88.7
Lower Legs     63.5      81.7
Mean           60.1      81.1
42. Tree-Structured Model vs. Loopy Model
              Tree Model   Loopy Model
Torso            96.2         96.5
Head             83.4         83.1
Upper Arms       78.7         78.8
Lower Arms       65.8         66.7
Upper Legs       87.9         88.7
Lower Legs       81.1         81.7
Mean             80.7         81.1
43. Future work
Deep Residual Learning for Human Pose Estimation
Image Dependent Graph Structure Learning
45. Residual Learning: Intuition
• A deeper model should not have higher training error than its shallower counterpart.
• One solution: identity mapping.
46. Plain Network
• H(x) is the underlying mapping.
• Expect the stacked layers (x → weight layer → ReLU → weight layer → ReLU) to approximate H(x) directly.
47. Residual Learning
• Explicitly fit a residual mapping F(x) = H(x) − x, so that H(x) = F(x) + x.
• The shortcut adds the input x back after the stacked weight layers.
• Insight: finding the optimum around zero is easier!
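The identity-mapping intuition can be checked concretely: when the residual branch F is zero, the block is exactly the identity, so stacking such blocks cannot make the mapping harder to represent. A minimal sketch with tiny hand-picked matrices (plain Python, not the thesis' architecture):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, v):
    # one weight layer: matrix-vector product
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, where F is two weight layers with a ReLU in between."""
    f = linear(W2, relu(linear(W1, x)))
    return [a + b for a, b in zip(f, x)]

x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zeros, zeros))  # -> [1.0, -2.0]: F = 0 gives the identity

W1 = [[0.5, 0.0], [0.0, 0.5]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
print(residual_block(x, W1, W2))  # -> [1.5, -2.0]
```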
51. Thank you.
Deep Learning for Articulated Human Pose Estimation
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
Committee
Prof. Xiaogang Wang (EE)
Prof. Wai-kuen Cham (EE)
Prof. Dahua Lin (IE)
52. Appendix: Number of Message Passing Layers
              1st Layer   2nd Layer   3rd Layer
Upper Arms       78.4        78.2        78.8
Lower Arms       66.3        66.3        66.7
Upper Legs       87.9        88.3        88.7
Lower Legs       80.7        81.2        81.7
Mean             80.7        80.9        81.1
53. Appendix: Independent Training vs. Joint Training
PCP on the LSP dataset:

             Torso  Head  Upper Arms  Lower Arms  Upper Legs  Lower Legs  Mean
Independent   93.0  82.1        70.6        55.4        82.1        75.3  74.2
Joint         95.0  83.5        75.0        61.9        86.9        79.8  78.6
Good afternoon. Welcome to my thesis proposal defence.
I’m Wei Yang from the IVP group. The title of this talk is deep learning for articulated human pose estimation.
So the first question is: what is articulated human pose estimation?
Given an image or a video, the goal of articulated pose estimation is to recover the joint positions of the articulated limbs of the human body, as shown in this image.
Applications of articulated human pose estimation are very broad. From recognizing activities to interactive game systems, and from creating movies to clothing recognition, human pose provides very useful information that helps solve these problems, or makes the original problems easier.
However, the pose estimation problem itself is not a trivial task. Human limbs are highly articulated and flexible, so people can appear in a wide variety of poses and body shapes.
Meanwhile, different viewpoints lead to different body shapes and foreshortening. Various clothing also leads to varied appearance of the human body. All these factors make the problem more difficult.
To solve the problem, earlier methods adopted part-based models, which divide the human body into a set of body parts, such as the head, torso, arms, and legs. In 3D space, these parts can be modeled as cylinders.
Later works, such as pictorial structures, use two-dimensional part templates and encode the spatial relationships among different body parts using springs (the edges). However, capturing the whole range of appearances with pictorial structures is still quite difficult.
Take this picture as an example: a big problem is that even projections of a simple cylinder into 2D yield many different appearances. So one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
To better handle the large variations, the mixture of mini-parts model has been proposed. Each part is clustered into several mixtures according to its appearance, and each mixture has its own unary template for detection. For example, in this image the mini-parts are tuned to represent near-vertical and near-horizontal limbs, approximating the transformations.
In implementation, the mixture of parts is obtained by clustering the relative locations of two neighboring body parts. We can see that the samples from the same cluster share similar visual appearance.
Recently, state-of-the-art performance on pose estimation has been achieved by deep learning methods.
DeepPose [26] estimates the (x, y) locations of the body parts with a regressor in a holistic manner. The regressor is based on a deep convolutional neural network, and its expressive power is strong. However, the mapping from raw images to (x, y) coordinates is too difficult to learn, so this method suffers from inaccuracy in the high-precision region.
CNN-based heatmap regression models have shown the potential to learn better representations. However, geometric constraints between body parts are usually missing in the training stage. As a consequence, these methods may produce imperfect heat maps during training.
For example, these methods may produce many high-response regions on the heads of unannotated people, and the error will be backpropagated to update the model parameters. However, this is inappropriate.
Since the local evidence is weak, we should consider the global consistency of the whole human body. This could be done by considering the geometric relationships between body parts during the training stage.
A natural way to model spatial constraints is to use convolutions. Once the spatial kernels have been learned, one can use them to enforce global pose consistency. These kernels can be computed by creating a histogram of joint a's locations over the training set, given that the adjacent joint b is located at the kernel center. They can also be learned with the standard backpropagation algorithm. However, this method has two limitations.
First, it is difficult for these kernels to handle large variations, especially for highly articulated parts such as arms and legs. Second, the kernels should be large enough to cover sufficient context; hence the parameter space is very large and the parameters are difficult to learn.
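As an illustration of the kernel-as-spatial-prior idea above, here is a minimal NumPy sketch that cross-correlates one part's score map with a spatial kernel to obtain a prior for its neighbor's location (toy sizes; the actual kernels would be learned or estimated from training histograms):

```python
import numpy as np

def spatial_prior(score_map, kernel):
    """Cross-correlate a part's score map with a spatial kernel.

    The kernel can be read as a histogram of a neighbor's location
    relative to this part (the part sits at the kernel center).
    Output has the same size as the input ('same' zero padding).
    """
    H, W = score_map.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(score_map, ((ph, ph), (pw, pw)))
    out = np.zeros_like(score_map)
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

# Toy example: a part detected at (2, 2); the kernel places all its
# mass one pixel to the left of center, so the resulting prior peaks
# one pixel to the right of the detection.
score = np.zeros((5, 5))
score[2, 2] = 1.0
kernel = np.zeros((3, 3))
kernel[1, 0] = 1.0
```

In the real model the same operation is one channel of a convolution layer, so the kernels can be trained by backpropagation.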
In this proposal, we propose to incorporate the CNN and the expressive mixture of parts model into an end-to-end framework. This enables us to predict the body part locations with the consideration of global pose configurations during the training stage.
We formulate the human pose estimation problem using a graph model G = (V, E). V denotes the vertices, which specify the positions and the mixture types of body parts; the vertices are modeled by a front-end CNN in our framework. The edges model the pairwise spatial relationships between body parts: a node sends a message to each of its neighbors and receives messages from each neighbor (indicated by arrows).
Here is an illustration of the proposed framework.
It can be viewed as two components: a front-end DCNN for learning feature representations of body parts, followed by a softmax layer and a logarithm layer.
The second component is the message passing layers, which conduct inference and learning on the mixture of parts with deformation constraints between parts. Specifically, each message passing layer performs one iteration of the message passing algorithm in a forward pass. Finally, the final score map of each body part is computed by taking the maximum value over mixture types.
Given an image I, the full score of a pose configuration is defined by this equation.
l_i is the (x, y) location of part i.
t_i is the mixture type of part i.
The full score consists of a unary term and a pairwise term. The unary term models the part appearance and is denoted by phi. Its parameters theta are learned by the front-end CNN followed by a softmax layer and a logarithm layer.
The pairwise terms model the spatial relationships between body parts. We use standard quadratic deformation constraints to model this term, which will be discussed later.
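The slide's equation is not reproduced in this transcript; a full score of the standard mixture-of-parts form, consistent with the phi/theta/beta/psi notation used here, would read:

```latex
F(l, t \mid I) \;=\; \sum_{i \in V} \phi(l_i, t_i \mid I; \theta)
\;+\; \sum_{(i,j) \in E} \bigl\langle \beta_{ij}^{t_i, t_j},\, \psi(l_i - l_j) \bigr\rangle
```

where the first sum is the unary (appearance) term and the second is the pairwise (deformation) term over the edges of the graph.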
We will first discuss the front-end CNN of our framework. It is a fully convolutional network: given an input image, the output of the network is a set of score maps for the mixture types. Note that the front-end CNN does not take the global pose consistency into consideration, hence the unary term may contain many false positives.
The mathematical formulation of the unary term is written as this equation. F denotes the raw score of each mixture type predicted by the front-end CNN. The following softmax layer computes the normalized score of each mixture type, and the logarithm layer then transforms the normalized score into log space.
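The softmax-plus-logarithm step described above amounts to a log-softmax over the mixture-type channels. A minimal NumPy sketch (the array shape is an illustrative assumption):

```python
import numpy as np

def log_softmax(raw_scores):
    """Numerically stable log(softmax) over the class axis (axis 0).

    raw_scores: (num_classes, H, W) raw outputs F of the front-end
    CNN, one channel per mixture type plus background.
    Returns the log of the normalized score, i.e. the unary term phi.
    """
    shifted = raw_scores - raw_scores.max(axis=0, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
```

Subtracting the per-location maximum before exponentiating avoids overflow without changing the result.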
To make the training easier and faster, we first pretrain the front-end CNN with image patches. Suppose we have P parts, and each part is clustered into K mixture types. Then an arbitrary image patch either belongs to the background or to one of the P×K part classes, so given a training image patch, the network predicts a label out of P×K + 1 classes. As mentioned before, the mixtures are obtained by performing clustering on the relative locations of neighboring body parts.
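The clustering step can be sketched as a tiny K-means over relative offsets (illustrative only; the thesis's exact clustering setup may differ):

```python
import numpy as np

def cluster_mixture_types(rel_locs, K, iters=20):
    """Assign a mixture type to each training sample by K-means
    clustering of the relative location (dx, dy) of a part with
    respect to its neighbor.

    Deterministic init from the first K points, for simplicity;
    k-means++ would be a better choice in practice.
    Returns (labels, centers).
    """
    pts = np.asarray(rel_locs, dtype=float)
    centers = pts[:K].copy()
    for _ in range(iters):
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(axis=0)
    return labels, centers
```

Each cluster index then serves as one mixture type, and patches in the same cluster tend to share similar appearance, as noted earlier.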
The second term consists of a deformation model that evaluates the relative locations of pairs of parts. We write psi for the squared offset between two part locations, and we write beta for the parameters of a spring that favors certain offsets over others. Beta encodes both the rest position and rigidity of the spring. In a Gaussian model, this would be the mean and covariance.
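Concretely, the quadratic deformation feature described here is usually written as:

```latex
\psi(l_i - l_j) \;=\; \bigl[\,\Delta x,\; \Delta x^{2},\; \Delta y,\; \Delta y^{2}\,\bigr]^{\top},
\qquad \Delta x = x_i - x_j,\quad \Delta y = y_i - y_j
```

so the pairwise term is the inner product of the spring parameters beta with psi, which for appropriate beta behaves like a (negated) quadratic penalty around the rest offset.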
We employ the max-sum algorithm to infer the best configuration in the graphical model. Although max-sum is only an approximation and convergence cannot be guaranteed on loopy structures, it still provides excellent experimental results.
At each iteration, a vertex sends a message to its neighbors and receives messages from its neighbors. We denote by m_ij(l_j, t_j) the message sent from part i to part j, and by u_i(l_i, t_i) the belief of part i; the max-sum algorithm then updates the messages and beliefs by these two equations.
This process iterates several times until convergence, and then we obtain the max-sum assignment by computing the argmax of u_i.
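The two update equations are not reproduced in this transcript; in the standard max-sum form, consistent with the notation above, they would read:

```latex
m_{ij}(l_j, t_j) \;=\; \max_{l_i,\, t_i}
  \Bigl( \phi(l_i, t_i \mid I)
       + \bigl\langle \beta_{ij}^{t_i, t_j},\, \psi(l_i - l_j) \bigr\rangle
       + \sum_{k \in N(i) \setminus \{j\}} m_{ki}(l_i, t_i) \Bigr)

u_i(l_i, t_i) \;=\; \phi(l_i, t_i \mid I) \;+\; \sum_{k \in N(i)} m_{ki}(l_i, t_i)
```

The exact parameterization in the thesis may differ slightly, but each message passing layer computes one round of these updates in a forward pass.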
Here are two examples demonstrating the results produced by different message passing layers. We can see that the results get better as we increase the number of message passing layers. This phenomenon is not difficult to understand: intuitively, a part can receive messages from more distant parts as the number of message passing layers increases, which may result in better pose estimates.
We demonstrate the effectiveness of the proposed method on three widely used public datasets. The first one is the LSP dataset, namely the Leeds Sports Pose dataset. It consists of 1000 training images and 1000 testing images from sports activities with challenging articulations.
The second dataset is the Frames Labeled in Cinema dataset, namely the FLIC dataset. This dataset is collected from popular Hollywood movies with diverse appearances and poses. Each person is annotated with 10 upper-body joints. It consists of about 4000 training images and 1016 testing images.
The third dataset is the Image Parse dataset, which contains diverse activities. We did not train on this dataset; it is only used for cross-dataset validation, to evaluate the generalization ability of the proposed method.
We adopt two widely used evaluation metrics.
The first one is the Percentage of Correct Parts (PCP). It measures the rate of correctly detected limbs: a limb is considered correctly detected if the distances between the detected limb endpoints and the ground-truth limb endpoints are within half of the limb length.
However, this metric penalizes very short limbs. Hence we adopt the Percentage of Detected Joints (PDJ) as a complementary evaluation metric. It measures the rate of correctly localized joints and is invariant to scale; a curve is computed by varying the localization precision threshold.
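The PCP criterion above can be sketched as a small per-limb check (a minimal sketch; endpoint-matching conventions vary slightly across papers):

```python
import numpy as np

def pcp_correct(pred_a, pred_b, gt_a, gt_b, thresh=0.5):
    """PCP criterion for a single limb: both predicted endpoints
    must lie within thresh * (ground-truth limb length) of their
    ground-truth counterparts."""
    limb_len = np.linalg.norm(np.subtract(gt_a, gt_b))
    err_a = np.linalg.norm(np.subtract(pred_a, gt_a))
    err_b = np.linalg.norm(np.subtract(pred_b, gt_b))
    return bool(err_a <= thresh * limb_len and err_b <= thresh * limb_len)
```

Because the tolerance scales with the limb's own length, short limbs get a tight tolerance, which is exactly why PCP penalizes them.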
Some results on the LSP dataset are visualized in this slide. The proposed method is robust to highly articulated poses with varying orientation, foreshortening, cluttered backgrounds, occlusion, and overlapping people.
We report the PCP results on the LSP dataset for six limb types: torso, head, upper arms, lower arms, upper legs, and lower legs. The cyan bars denote our method. Our method achieves the highest PCP on average and on most of the limbs compared with previous methods. We can also see that the most difficult body parts are the lower arms, because they exhibit the largest articulations.
We also show the PDJ curves on the LSP dataset for four body joints: elbows, wrists, knees, and ankles. The red curves denote our method. Comparing the PDJ values at a threshold of 0.2, our method outperforms previous methods by a large margin on all body parts except the ankles.
In this slide, we show some sample results on the FLIC dataset. Compared with previous methods, ours is robust to large appearance variations and overlapping people. For example, existing methods have difficulty accurately locating the body parts of the man in the costume, whereas our method is able to handle this case.
The PDJ curves also show that our method improves on previous methods.
To demonstrate the generalization ability, we directly use the full-body model trained on the LSP dataset to predict the poses on the test images of the Image Parse dataset. The visualized results are quite satisfactory. The PCP results are also reported: the proposed method achieves better or comparable results relative to the state-of-the-art methods. Note that most of the previous methods used training data from the Image Parse dataset to train their models.
Some failure cases are shown. Our method may produce wrong estimates due to significant occlusions, ambiguous background, or heavily overlapping persons.
To evaluate the improvement brought by the spatial constraints and joint learning, we compare the unary term with the full model. We find that the spatial constraints and joint learning boost the performance by about 20 percent.
Our framework is flexible enough for both tree-structured models and loopy graph models. Following previous work, we add symmetry constraints between the left and right knees. We find that this constraint is very helpful for reducing the double-counting problem in the legs.
In future work, we plan to extend the proposed framework in two directions. First, we could use deeper and more powerful network architectures to boost the performance. Second, the graph structure is currently hand-crafted and may not be optimal for every image; we want to learn the graph structure.
The depth of networks has grown rapidly in recent years, and generally we find that the deeper the network, the better the performance. But is there a limit? Through experiments, people have found that a deeper network may produce higher training error than its shallower counterpart.
There are several reasons. The first is the notorious gradient vanishing or exploding problem. Moreover, current solvers such as stochastic gradient descent have difficulty finding the optimal mappings in very deep networks.
However, a deeper model should not have higher training error than its shallower counterpart. For example, if the stacked layers are identity mappings, the training error will not increase no matter how many layers are stacked. This is the basic idea of residual learning.
Let's call the conventional network the plain network, with H(x) as the underlying mapping. We hope to approximate the underlying mapping H(x) by stacking two layers, and we know this is difficult.
But how about learning the residual between H(x) and x? Finding the optimum around zero is much easier, so we can fit a residual mapping explicitly. One building block looks like this.
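A minimal NumPy sketch of this building block, H(x) = F(x) + x, with two weight layers and ReLUs (the exact placement of the final ReLU follows the common ResNet design, an assumption here):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """One residual building block: H(x) = F(x) + x, where
    F(x) = relu(x @ W1) @ W2 is the stacked pair of weight layers.
    If W1 and W2 are driven to zero, the block reduces to the
    identity mapping (for non-negative x, after the final ReLU),
    which is why depth does not have to hurt training error."""
    f = relu(x @ W1) @ W2
    return relu(f + x)
```

The identity shortcut is what makes "do nothing" the easy solution: the block only has to learn the deviation from identity.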
We stack many building blocks to build a very deep network for pose estimation, called ResNet. It achieves better results than the VGG network, and we will investigate more variants of ResNet to better fit the pose estimation problem.
In the literature, the graph structure for modeling the relationships among body parts is usually designed manually [60, 5]. However, no theoretical analysis shows how to build the connections among body parts, or which graph structure is optimal. Some efforts have been made on learning graph structures [55] from data. However, the graph structure is fixed once it has been learned, and it lacks the flexibility to handle large variations.
As mentioned before, previous work uses convolutional kernels to learn the geometric relationships between parts. This process can be formulated by this equation: it approximates message passing from one score map to another with a convolution layer, as illustrated in the figure.
In previous work, this kind of convolution layer is either fully connected or connected by hand-crafted graph structures, and it lacks the flexibility to handle large variations.
We propose to adjust the graph structure according to the image by incorporating gates to control the message passing.
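Since the proposal only sketches this idea, here is a purely illustrative toy of gated message passing: a scalar gate in [0, 1], which in the real model would be predicted from image features, scales the incoming message before it is added to the belief:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_update(belief, message, gate_logit):
    """Scale an incoming message by an image-dependent gate in [0, 1].

    gate_logit would come from a small network over image features;
    here it is just a number. A gate near 0 effectively cuts the
    edge from the graph, while a gate near 1 keeps it."""
    g = sigmoid(gate_logit)
    return belief + g * message
```

In effect, the gates turn the fixed graph structure into a soft, per-image structure, which is the flexibility the preceding paragraph asks for.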