Synthesizing pseudo-2.5D content from monocular videos for mixed reality (NAVER Engineering)
Free-viewpoint video (FVV) is an advanced medium that provides a more immersive user experience than traditional media. It lets users interact with content by viewing it from any desired viewpoint, and it is emerging as a next-generation medium.
In creating FVV content, existing systems require complex, specialized capturing equipment and offer low end-user usability, because considerable expertise is needed to operate them. This is an obstacle for individuals or small organizations who want to create content; it limits the end user's ability to create FVV-based user-generated content (UGC) and inhibits the creation and sharing of diverse content.
To tackle these problems, this work proposes ParaPara, an end-to-end system that uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos, unlike previously proposed systems. First, the system detects persons in the monocular video with a deep neural network, calculates a real-world homography matrix from minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The content synthesized by the proposed system is deployed on Microsoft HoloLens; the user can freely place the generated content in the real world and watch it from a free viewpoint.
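The position-estimation step above can be sketched in a few lines: given a homography that maps image coordinates to ground-plane coordinates (which the system derives from minimal user interaction), the bottom-center of a detected person's bounding box can be projected to a pseudo-3D world position. This is a minimal illustration under assumptions, not the paper's implementation; the 3x3 matrix layout and the foot-point convention are assumed here.

```python
def project_to_ground(H, u, v):
    """Map an image point (u, v) to ground-plane coordinates
    using a 3x3 homography H given as nested lists."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # divide out the homogeneous coordinate

def person_ground_position(H, bbox):
    """Project the bottom-center of a person's bounding box
    (x0, y0, x1, y1) -- a common stand-in for the foot point."""
    x0, y0, x1, y1 = bbox
    return project_to_ground(H, (x0 + x1) / 2.0, y1)
```

With the identity homography the image point maps to itself, which is a convenient sanity check before plugging in a calibrated matrix.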
Speaker: Kwang Moo Yi (Professor, University of Victoria)
Date: July 2018
Local features are one of the core building blocks of Computer Vision, used in various tasks such as Image Retrieval, Visual Tracking, Image Registration, and Image Matching. Especially for geometric applications, that is, finding the pose of the camera from images, they remain the state of the art, even in the era of Deep Learning.
In this talk, I will introduce our recent works on using local features, as well as a recent self-supervised pipeline that learns from scratch the keypoints used to match images. I will first talk about how to learn to find good correspondences, then about losses based on eigendecomposition, which is essential when retrieving the camera pose. I will also introduce our latest local feature pipeline, which is self-supervised and inspired by reinforcement learning.
Scaling up Deep Learning Based Super Resolution Algorithms (Xiaoyong Zhu)
Super-resolution is a process for obtaining one or more high-resolution images from one or more low-resolution observations. It has been used for many applications, including satellite and aerial imaging, medical image processing, ultrasound imaging, line fitting, automated mosaicking, infrared imaging, facial image improvement, text image improvement, compressed image and video enhancement, and fingerprint image enhancement. While research on super-resolution began in the 1970s, recently, with the power of deep learning, many notable new methods have been created, including SRCNN, SRResNet, and lately SRGANs, which use generative adversarial networks. However, since these approaches require a lot of images to train the deep learning network, they are extremely compute-intensive. Fortunately, with the power of the cloud, you can easily scale up compute resources as needed, making the algorithm converge faster.
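For context, the baseline that learned methods such as SRCNN improve on is plain interpolation; a minimal nearest-neighbor upscaler (a sketch, not any of the methods named above) shows what a deep SR network must outperform:

```python
def upscale_nearest(img, s):
    """Nearest-neighbor upscaling of a 2D image (list of rows)
    by an integer factor s -- the naive interpolation baseline
    that learned super-resolution methods aim to beat by
    hallucinating plausible high-frequency detail."""
    h, w = len(img), len(img[0])
    return [[img[i // s][j // s] for j in range(w * s)] for i in range(h * s)]
```

Deep SR methods replace this fixed rule with a network trained on pairs of low- and high-resolution images.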
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Segmentation
Active Contours
Split and Merge
Watershed
Region Splitting and Merging
Graph-based Segmentation
Mean Shift and Mode Finding
Normalized Cut
Image Completion using Planar Structure Guidance, SIGGRAPH 2014 (Jia-Bin Huang)
We propose a method for automatically guiding patch-based image completion using mid-level structural cues. Our method first estimates planar projection parameters, softly segments the known region into planes, and discovers translational regularity within these planes. This information is then converted into soft constraints for the low-level completion algorithm by defining prior probabilities for patch offsets and transformations. Our method handles multiple planes, and in the absence of any detected planes falls back to a baseline fronto-parallel image completion algorithm. We validate our technique through extensive comparisons with state-of-the-art algorithms on a variety of scenes.
Project page: https://sites.google.com/site/jbhuang0604/publications/struct_completion
This is an intensive meetup at Samsung Next IL covering the most interesting papers presented at CVPR 2017 last month. It is a good opportunity to get an overview of recent advancements in the field of Deep Learning with applications to Computer Vision.
The following topics are covered:
• Object detection
• Pose estimation
• Efficient networks
Slide for study session given by Dr. Enrico Rinaldi at Arithmer inc.
It is a summary of established methods for parametric modeling of the 3D human body, "SMPL", which has many possible applications in the apparel and health-care industries.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research of modern mathematics and AI systems has the capability of providing solutions when dealing with tough complex issues. At Arithmer we believe it is our job to realize the functions of AI through improving work efficiency and producing more useful results for society.
PR-240: Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers (Hyeongmin Lee)
This paper, Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers, introduces a method that adds controllable parameters so that a network trained for image processing can operate across multiple noise levels.
Video link: https://youtu.be/WXGqYbKQzWY
Action Genome: Actions as Composition of Spatio-Temporal Scene Graphs (Sangmin Woo)
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as composition of spatio-temporal scene graphs. arXiv preprint arXiv:1912.06992, 2019.
Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. The approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. It is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing human pose estimation in the same framework. It shows top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.
presentation: https://www.youtube.com/watch?v=FZePQKPEwoo (in Korean)
reference: He, Kaiming, et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
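The parallel mask and box branches described above must agree: any predicted instance mask implies a tight bounding box. A tiny helper (purely illustrative, not part of the paper's code) makes that relationship concrete:

```python
def mask_to_box(mask):
    """Return the tight bounding box (x0, y0, x1, y1) covering a
    binary mask (list of rows of 0/1), or None for an empty mask --
    the box a consistent box branch should roughly reproduce."""
    xs = [j for row in mask for j, v in enumerate(row) if v]
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)
```

In Mask R-CNN itself both outputs are predicted independently per RoI, which is why the mask branch adds so little overhead.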
PR-278: RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (Hyeongmin Lee)
This paper received the Best Paper award at ECCV 2020; unlike previous methods, it predicts optical flow through iterative updates and achieves remarkably high performance.
paper link: https://arxiv.org/pdf/2003.12039.pdf
video link: https://youtu.be/OnZIDatotZ4
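The iterative-update idea can be shown with a toy: rather than predicting the answer in one shot, a recurrent operator emits residual corrections that are accumulated. This is a 1D sketch only; RAFT's correlation volume and GRU update block are replaced here by an assumed stub residual function.

```python
def iterative_refine(f0, residual_fn, n_iters):
    """Accumulate residual updates f <- f + delta, mirroring the
    recurrent refinement scheme (residual_fn is a stub standing in
    for RAFT's learned update operator)."""
    f = f0
    history = [f]
    for _ in range(n_iters):
        f = f + residual_fn(f)
        history.append(f)
    return f, history

# Stub update: each iteration closes half the gap to a 'true flow' of 8.0.
final, hist = iterative_refine(0.0, lambda f: 0.5 * (8.0 - f), 10)
```

The estimate converges geometrically toward the target, which is the behavior the paper exploits: every extra iteration refines the flow field.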
Human action recognition with Kinect using a joint motion descriptor (Soma Boubou)
- We proposed a novel descriptor for the motion of skeleton joints.
- The proposed descriptor outperformed state-of-the-art descriptors such as HON4D and the one proposed by Chen et al. (2013).
- Our proposed approach proved effective for periodic actions (e.g., waving, walking, jogging, side-boxing, etc.).
- Grouping was effective for actions with unique joint trajectories (e.g., tennis serving, side kicking, etc.).
- Grouping joints into eight groups was consistently effective on actions of the MSR3D dataset.
Age Estimation and Gender Prediction Using Convolutional Neural Networks (Bulbul Agrawal)
Identifying human attributes such as age, gender, ethnicity, and emotion using computer vision has received increased attention in recent years. Such attributes can play an important role in many applications, including human-computer interaction, surveillance, search, biometrics, product sales, entertainment, and cosmetology. Generally, human life can be classified into one of four age groups: children, young, adult, and old. The image of a person's face exhibits many variations that may affect the ability of a computer vision system to recognize gender. In this dissertation, we evaluate a CNN architecture along with PCA to achieve good performance.
VIBE: Video Inference for Human Body Pose and Shape Estimation (Arithmer Inc.)
These slides were prepared for a study session given by Christian Saravia at Arithmer Inc.
It is a summary of recent methods for human pose/shape estimation from video.
Robust Human Tracking Method Based on Appearance and Geometrical Features in Non-overlapping Views (csandit)
This paper proposes a robust tracking method that concatenates appearance and geometrical features to re-identify humans in non-overlapping views. A uniform partitioning method is proposed to extract local HSV (hue, saturation, value) color features from the upper and lower portions of clothing. Then an adaptive principal-view selection algorithm is presented to locate the principal view, which contains the maximum number of appearance feature dimensions captured from different visual angles. For each appearance feature dimension in the principal view, all its inner frames are used to train a support vector machine (SVM). In the matching process, human candidates are first filtered with an integrated geometrical feature that connects a height estimate with a gait feature. The appearance features of the remaining human candidates are then tested by the SVMs to determine the object's presence in new cameras. Experimental results show the feasibility and effectiveness of this proposal and demonstrate real-time appearance feature extraction and robustness to illumination and visual-angle changes.
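The upper/lower clothing partition can be illustrated with stdlib tools: split a person crop into halves and histogram the hue channel of each. This is a sketch of the general idea only; the bin count, RGB input format, and use of hue alone are assumptions, not the paper's exact feature.

```python
import colorsys

def part_hue_histograms(rgb_img, bins=8):
    """Split an RGB person crop (list of rows of (r, g, b) in 0..255)
    into upper and lower halves and histogram the hue channel of
    each, roughly mirroring the upper/lower-clothing partition."""
    def hue_hist(rows):
        hist = [0] * bins
        for row in rows:
            for r, g, b in row:
                h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
                hist[min(int(h * bins), bins - 1)] += 1
        return hist
    mid = len(rgb_img) // 2
    return hue_hist(rgb_img[:mid]), hue_hist(rgb_img[mid:])
```

The two histograms together form a simple appearance signature: a red shirt and green trousers land in different bins of different halves.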
Wearable Accelerometer Optimal Positions for Human Motion Recognition, LifeTech 2020 (sugiuralab)
Wearable Accelerometer Optimal Positions for Human Motion Recognition. The 2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech 2020), March 10-11, 2020
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 4: Retinex-model-based low-light enhancement
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 3: prior-embedding deep super-resolution
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 2: text-centric image style transfer
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 1: prior-embedding deep rain removal
Wireless Communication System (JeyaPerumal1)
Wireless communication transmits information over a distance without the help of wires, cables, or any other form of electrical conductor. It is a broad term that covers all procedures and forms of connecting and communicating between two or more devices using a wireless signal, through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements and effective features.
The transmission distance can be anywhere from a few meters (for example, a television's remote control) to thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024 (APNIC)
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
Bridging the Digital Gap: Brad Spiegel Macon, GA Initiative (Brad Spiegel Macon GA)
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
Multi-cluster Kubernetes Networking: Patterns, Projects and Guidelines (Sanjeev Rampal)
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
Introduction
• Human pose estimation: single-person and multi-person
A typical 16-keypoint skeleton:
1. Right_Shoulder
2. Right_Elbow
3. Right_Wrist
4. Left_Shoulder
5. Left_Elbow
6. Left_Wrist
7. Right_Hip
8. Right_Knee
9. Right_Ankle
10. Left_Hip
11. Left_Knee
12. Left_Ankle
13. Head
14. Neck
15. Spine
16. Pelvis
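The 16-keypoint layout above maps naturally to index constants; a pose is then just a length-16 array of (x, y) coordinates. The names are copied from the list (any set of skeleton edges connecting them would be an additional assumption, so only the name-to-index mapping is shown):

```python
# The 16 keypoints listed above, in order (indices 0..15).
JOINTS = [
    "Right_Shoulder", "Right_Elbow", "Right_Wrist",
    "Left_Shoulder", "Left_Elbow", "Left_Wrist",
    "Right_Hip", "Right_Knee", "Right_Ankle",
    "Left_Hip", "Left_Knee", "Left_Ankle",
    "Head", "Neck", "Spine", "Pelvis",
]
# Name -> index lookup, so a pose is a length-16 list of (x, y).
JOINT_INDEX = {name: i for i, name in enumerate(JOINTS)}
```

Keeping a single canonical ordering like this is what lets heatmap channels, annotation files, and evaluation code agree on which joint is which.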
Applications
• Human action recognition
• Human-computer interaction
• Animation
• Intelligent retail, such as self-service supermarkets and intelligent warehouses
Challenges
• Various appearances and low-resolutions
• Diverse human poses and views
• Occluded or invisible key points
• Crowded background
Top-down Methods
[1] Stacked Hourglass Networks for Human Pose Estimation. [Newell, ECCV2016]
[2] Towards Accurate Multi-person Pose Estimation in the Wild. [Papandreou, CVPR2017]
[3] RMPE: Regional Multi-Person Pose Estimation. [Fang, ICCV2017]
[4] Simple Baselines for Human Pose Estimation and Tracking. [Xiao, ECCV2018]
[5] Cascaded Pyramid Network for Multi-Person Pose Estimation. [Chen, CVPR2018]
[6] HRNet: Deep High-Resolution Representation Learning for Human Pose Estimation. [Sun, CVPR2019]
Pipeline: human detection + single-person keypoint detection
Advantage: state-of-the-art accuracy
Problem: lower speed; accuracy depends on the human detector
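The two-stage pattern can be sketched with stub components; the detector and single-person estimator callables here are placeholders for real models, and the box/crop conventions are assumptions:

```python
def top_down_pose(image, detect_people, estimate_single_pose):
    """Top-down multi-person pose estimation: run a person detector,
    then a single-person keypoint estimator on each cropped box.
    Both callables are stubs standing in for trained networks."""
    poses = []
    for box in detect_people(image):
        x0, y0, x1, y1 = box
        crop = [row[x0:x1] for row in image[y0:y1]]
        local = estimate_single_pose(crop)
        # Translate keypoints from crop coordinates back to the image.
        poses.append([(x + x0, y + y0) for x, y in local])
    return poses
```

The structure also makes the listed drawbacks visible: runtime grows with the number of detected people, and a missed detection loses that person's pose entirely.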
Bottom-Up Methods
[1] Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. [Cao, CVPR2017]
[2] Associative Embedding: End-to-End Learning for Joint Detection and Grouping. [Newell, NeurIPS 2017]
[3] MultiPoseNet: Fast Multi-person Pose Estimation Using Pose Residual Network. [Kocabas, ECCV2018]
[4] PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. [Papandreou, ECCV2018]
[5] PifPaf: Composite Fields for Human Pose Estimation. [Kreiss, CVPR2019]
[6] Multi-person Articulated Tracking with Spatial and Temporal Embeddings. [CVPR2019]
Pipeline: detecting keypoints + grouping them into human bodies
Advantage: higher speed; does not rely on human detection
Problem: lower accuracy
Single Person
• Stacked hourglass: the basic network backbone [1]
• Each hourglass first subsamples the feature maps, then upsamples them, combining in higher-resolution features from the bottom layers.
• This bottom-up, top-down processing is repeated several times.
Alejandro Newell, Kaiyu Yang, Jia Deng: Stacked Hourglass Networks for Human Pose Estimation. ECCV (8) 2016: 483-499.
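The subsample/upsample-with-skip pattern can be illustrated recursively on a 1D signal. This is a toy under stated simplifications: average pooling for subsampling, nearest-neighbor repetition for upsampling, and elementwise addition in place of the real convolutional blocks.

```python
def hourglass_1d(x, depth):
    """One hourglass pass on a 1D signal of even length: subsample,
    recurse at the lower resolution, upsample, and add the
    higher-resolution skip branch -- mirroring the bottom-up,
    top-down processing of the hourglass module."""
    if depth == 0 or len(x) < 2:
        return x
    # Subsample by averaging adjacent pairs (stand-in for pooling).
    down = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]
    inner = hourglass_1d(down, depth - 1)
    # Upsample by repetition (stand-in for learned upsampling).
    up = [v for v in inner for _ in range(2)]
    # Combine with the skip branch at this resolution.
    return [a + b for a, b in zip(x, up)]
```

The output keeps the input resolution while mixing in coarser context from every level, which is exactly the property the hourglass design is after.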
Single Person
• Feature pyramid module [2]
• A feature pyramid representation provides sufficient context information, especially for occluded and invisible keypoints.
• The residual blocks are replaced by feature pyramid modules; each module consists of bottlenecks at different resolutions.
Learning Feature Pyramids for Human Pose Estimation. W. Yang, S. Li, W. Ouyang, et al. ICCV 2017.
https://github.com/bearpaw/PyraNet
Top-down Methods
• George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy: Towards Accurate Multi-person Pose Estimation in the Wild. CVPR 2017: 3711-3719
• Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, Jian Sun:
Cascaded Pyramid Network for Multi-Person Pose Estimation. CVPR 2018: 7103-7112
Top-down Methods
• This model applies pyramid features. In GlobalNet, features from different levels
are added together to give a rough prediction of key-point positions.
• RefineNet takes GlobalNet's output, upsamples the pyramid features, and uses
online hard keypoint mining to improve accuracy.
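The hard-keypoint-mining step can be illustrated with a NumPy sketch (the heatmap shapes and the top-M count below are assumptions; the idea, as in CPN's RefineNet, is to back-propagate only the M hardest of N keypoint heatmap losses):

```python
import numpy as np

def hard_keypoint_loss(pred, target, top_m=8):
    """Mean L2 heatmap loss over only the top_m hardest keypoints.
    pred, target: (num_keypoints, H, W) heatmaps."""
    per_kpt = ((pred - target) ** 2).mean(axis=(1, 2))  # loss per keypoint
    hardest = np.sort(per_kpt)[-top_m:]                 # keep the largest losses
    return hardest.mean()

pred = np.random.rand(17, 32, 32)
target = np.random.rand(17, 32, 32)
loss_all = ((pred - target) ** 2).mean()
loss_hard = hard_keypoint_loss(pred, target)
print(loss_hard >= loss_all)  # True: mining focuses training on the hardest joints
```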
15. 15
• Bin Xiao, Haiping Wu, Yichen Wei: Simple Baselines for Human Pose Estimation and
Tracking. ECCV (6) 2018: 472-487
Top-down Methods
https://github.com/leoxiaobin/pose.pytorch
How high-resolution feature maps are generated: this method combines the upsampling
and convolutional parameters into deconvolutional layers in a much simpler way,
without using skip-layer connections.
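A stride-2 transposed convolution of the kind this head stacks can be sketched in NumPy as zero-insertion followed by an ordinary convolution (a single-channel illustration with a fixed bilinear kernel; the real layers are learned):

```python
import numpy as np

def deconv2x(x, kernel):
    """Stride-2 transposed convolution on a single-channel map:
    insert zeros between pixels, then convolve with the kernel."""
    h, w = x.shape
    up = np.zeros((2 * h, 2 * w))
    up[::2, ::2] = x                      # zero-stuffing doubles the resolution
    kh, kw = kernel.shape
    pad = kh // 2
    padded = np.pad(up, pad)
    out = np.zeros_like(up)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

feat = np.random.rand(8, 8)
bilinear = np.outer([0.5, 1.0, 0.5], [0.5, 1.0, 0.5])  # fixed bilinear kernel
up = deconv2x(feat, bilinear)
print(up.shape)  # (16, 16)
```

Three such layers take a backbone's low-resolution features back up to heatmap resolution, which is the whole "simple baseline" head.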
16. 16
• Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang: Deep High-Resolution Representation
Learning for Human Pose Estimation. CVPR 2019
Top-down Methods
1. The proposed human pose estimation network maintains high-resolution representations through the whole
process;
2. it starts from a high-resolution subnetwork as the first stage, gradually adds high-to-low resolution
subnetworks one by one to form more stages, and connects the multi-resolution subnetworks in parallel;
3. repeated multi-scale fusions let each of the high-to-low resolution representations receive
information from the other parallel representations over and over, leading to rich high-resolution representations.
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch
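The repeated multi-scale fusion can be sketched with a two-branch NumPy toy (the real HRNet keeps up to four branches and exchanges information through learned convolutions, so this is only the connectivity pattern):

```python
import numpy as np

def pool2x(x):
    """Average-pool 2x: send the high-res branch down to low resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2x(x):
    """Nearest-neighbor 2x: send the low-res branch up to high resolution."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fuse(high, low):
    """One multi-scale fusion: each parallel branch receives
    information from the other resolution."""
    new_high = high + up2x(low)    # high-res branch gets upsampled low-res
    new_low = low + pool2x(high)   # low-res branch gets downsampled high-res
    return new_high, new_low

high = np.random.rand(32, 32)   # high-resolution branch, kept throughout
low = np.random.rand(16, 16)    # low-resolution branch
for _ in range(3):              # repeated fusions, as the slide describes
    high, low = fuse(high, low)
print(high.shape, low.shape)  # (32, 32) (16, 16)
```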
17. 17
Bottom-Up Methods
[1] Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. [Cao, CVPR2017]
[2] Associative Embedding : End-to-End Learning for Joint Detection and Grouping. [Newell
A, NeurIPS 2017]
[3] MultiPoseNet: Fast multi-person pose estimation using pose residual network. [Kocabas,
ECCV2018]
[4] PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-
Based, Geometric Embedding Model. [Papandreou, ECCV2018]
[5] PifPaf: Composite Fields for Human Pose Estimation. [Kreiss, CVPR2019]
[6] Multi-person Articulated Tracking with Spatial and Temporal Embeddings. [CVPR2019]
Detecting key points first, then assembling them into human bodies
Advantage: higher speed
Problem: lower accuracy
20. 20
• Associative Embedding: End-to-end Learning for Joint Detection and Grouping. Alejandro Newell, Zhiao Huang,
and Jia Deng. Neural Information Processing Systems (NIPS), 2017.
Bottom-Up Methods
https://github.com/princeton-vl/pose-ae-train
Detection + Grouping
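The grouping step can be sketched in plain Python: every joint detection carries a 1-D embedding tag, and joints whose tags are close are assigned to the same person. The tag values, threshold, and greedy policy below are illustrative assumptions, not the paper's exact clustering:

```python
def group_by_embedding(detections, threshold=0.5):
    """Greedily group keypoint detections by their embedding tags.
    detections: list of (joint_type, x, y, tag) tuples."""
    groups = []  # each group: {"tags": [...], "joints": [...]}
    for joint_type, x, y, tag in detections:
        placed = False
        for g in groups:
            mean_tag = sum(g["tags"]) / len(g["tags"])
            if abs(tag - mean_tag) < threshold:  # tag close to this person
                g["tags"].append(tag)
                g["joints"].append((joint_type, x, y))
                placed = True
                break
        if not placed:  # no existing person matches: start a new one
            groups.append({"tags": [tag], "joints": [(joint_type, x, y)]})
    return groups

# two people: tags cluster around 0.1 and 2.0
dets = [("nose", 10, 5, 0.10), ("nose", 40, 6, 2.05),
        ("wrist", 12, 30, 0.12), ("wrist", 43, 31, 1.98)]
people = group_by_embedding(dets)
print(len(people))  # 2
```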
21. 21
• Muhammed Kocabas, Salih Karagoz, Emre Akbas: MultiPoseNet: Fast Multi-Person Pose Estimation Using
Pose Residual Network. ECCV (11) 2018: 437-453
Bottom-Up Methods
https://github.com/mkocabas/pose-residual-network
MultiPoseNet can jointly handle person detection, keypoint detection, person
segmentation and pose estimation problems.
22. 22
• George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy:
PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric
Embedding Model. ECCV (14) 2018: 282-299
Bottom-Up Methods
• The PersonLab system consists of a CNN model that
predicts: (1) keypoint heatmaps, (2) short-range offsets,
(3) mid-range pairwise offsets, (4) person segmentation
maps, and (5) long-range offsets.
• The first three predictions are used by the Pose
Estimation Module in order to detect human poses.
• The latter two, along with the human pose detections,
are used by the Instance Segmentation Module in order
to predict person instance segmentation masks.
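How a keypoint heatmap and its short-range offsets combine can be sketched with NumPy: take the heatmap peak, then refine it with the offset vector predicted at that location. This is a simplified single-keypoint illustration, not the paper's full Hough-voting procedure:

```python
import numpy as np

def decode_keypoint(heatmap, offsets):
    """Refine a heatmap peak with its short-range offset vector.
    heatmap: (H, W); offsets: (H, W, 2) as (dy, dx) in pixels."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dy, dx = offsets[y, x]
    return y + dy, x + dx  # sub-pixel keypoint position

H, W = 16, 16
heatmap = np.zeros((H, W)); heatmap[7, 9] = 1.0
offsets = np.zeros((H, W, 2)); offsets[7, 9] = (0.4, -0.2)
ky, kx = decode_keypoint(heatmap, offsets)  # ~ (7.4, 8.8)
```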
23. 24
Pose Estimation Dataset
Dataset        Single person   Multi-person   Num of Kpts   Num of Persons
LSP            Y               N              14            ~2K
FLIC           Y               N              9             ~20K
MPII           Y               Y              16            ~25K
COCO           N               Y              17            ~100K
AI Challenger  N               Y              14            ~700K
PoseTrack      N               Y              15            ~160K
26. 27
Human Pose Estimation API @ Neuhub
(1)CVPR 2018 LIP Challenge Single Human Pose Estimation 1st place
(2)CVPR 2018 LIP Challenge Multi-Human Pose Estimation 1st place
27. 28
‘Finger Heart & 618’ Gesture for AR Scan
WeChat Mini Program for
Halloween
WeChat Mini Program for
POPMART
Human Pose Estimation API @ Neuhub
29. 30
PoseTrack
• Mykhaylo Andriluka, Google Research, Zürich, Switzerland
• Umar Iqbal, University of Bonn, Germany
• Anton Milan, Amazon
• Christoph Lassner, Amazon
• Eldar Insafutdinov, MPI for Informatics, Saarbrücken, Germany
• Leonid Pishchulin, MPI for Informatics, Saarbrücken, Germany
• Juergen Gall, University of Bonn, Germany
• Bernt Schiele, MPI for Informatics, Saarbrücken, Germany
PoseTrack is a joint project of
the Max Planck Institute for
Informatics, University of Bonn
and the PoseTrack team.
30. 31
PoseTrack
Key Figures
1356 video sequences
46K annotated video frames
276K body pose annotations
Two challenges:
Multi-Person Pose Estimation
Multi-Person Pose Tracking
31. 32
Challenges
• Large pose and scale variations
• Fast motions
• A varying number of persons
• Partially visible body parts due to occlusion or truncation
32. 33
Related Work
Bottom-up Methods
[1] Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint Multi-person Pose Estimation and Tracking. In CVPR 2017 & CVPR 2018.
[2] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack:
Articulated Multi-Person Tracking in the Wild. In CVPR 2017.
[3] Andreas Doering, Umar Iqbal, and Juergen Gall. JointFlow: Temporal Flow Fields for Multi Person Pose Tracking. In BMVC 2018.
[4] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to Detect and Track Visible
and Occluded Body Joints in a Virtual World. In ECCV 2018.
[5] Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial and Temporal Embeddings. CVPR 2019.
Top-down Methods
[1] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-Track: Efficient Pose Estimation in Videos. In
CVPR 2018.
[2] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient Online Pose Tracking. In BMVC 2018.
[3] Bin Xiao, Haiping Wu, and Yichen Wei. Simple Baselines for Human Pose Estimation and Tracking. In ECCV 2018.
33. 34
Top-down Methods
Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-
Track: Efficient Pose Estimation in Videos. In CVPR 2018.
https://github.com/facebookresearch/DetectAndTrack
They propose a two-stage approach to keypoint estimation and tracking in videos:
1) a novel video pose estimation formulation, 3D Mask R-CNN, that takes a short video clip
as input and produces a tubelet per person along with the keypoints inside it;
2) a lightweight optimization to link the detections over time.
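The linking stage can be approximated with a simple greedy IoU matcher over consecutive frames. This is an illustrative stand-in for the paper's optimization; the box format and threshold are assumptions:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link(prev_tracks, detections, thresh=0.3):
    """Greedily extend each track with the best-overlapping new detection."""
    assigned, links = set(), {}
    for tid, box in prev_tracks.items():
        best, best_iou = None, thresh
        for i, det in enumerate(detections):
            if i not in assigned and iou(box, det) > best_iou:
                best, best_iou = i, iou(box, det)
        if best is not None:
            links[tid] = best
            assigned.add(best)
    return links

tracks = {0: (10, 10, 50, 90), 1: (100, 12, 140, 95)}
dets = [(102, 10, 141, 96), (12, 11, 52, 92)]
print(link(tracks, dets))  # {0: 1, 1: 0}: each track follows its person
```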
34. 35
Top-down Methods
Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient Online
Pose Tracking. In BMVC 2018. https://github.com/YuliangXiu/PoseFlow
• Overall Pipeline: 1) Pose Estimator. 2) Pose
Flow Builder. 3) Pose Flow NMS.
• First, they estimate multi-person poses.
• Second, they build pose flows by maximizing the
overall confidence and purify them with Pose
Flow NMS.
• Finally, reasonable multi-pose trajectories
can be obtained.
35. 36
Top-down Methods
Bin Xiao, Haiping Wu, and Yichen Wei. Simple Baselines for Human Pose Estimation and Tracking.
In ECCV 2018
https://github.com/microsoft/human-pose-estimation.pytorch
37. 38
Bottom-up Methods
• Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint Multi-person Pose Estimation and
Tracking. In CVPR 2017.
• Mykhaylo Andriluka, Umar Iqbal, Anton Milan, Eldar Insafutdinov, Leonid Pishchulin, Juergen
Gall, and Bernt Schiele. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In
CVPR 2018.
OpenPose / DeepCut + Graph partition
38. 39
Bottom-up Methods
• Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern
Andres, and Bernt Schiele. ArtTrack: Articulated Multi-Person Tracking in the Wild. In CVPR
2017. https://github.com/eldar/pose-tensorflow
39. 40
Bottom-up Methods
• Andreas Doering, Umar Iqbal, Juergen Gall, and DE Bonn. JointFlow: Temporal Flow Fields for
Multi Person Pose Tracking. In BMVC 2018.
40. 41
Bottom-up Methods
Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita
Cucchiara. Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World. In
ECCV 2018.
43. 44
Bottom-up Methods
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial
and Temporal Embeddings. CVPR 2019
A unified framework for pose estimation and tracking: a bottom-up method with state-of-the-art results.
• Part-level grouping: part appearance + geometric information
• Temporal grouping: human embedding + temporal embedding
• Pose tracking: bipartite graph matching
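The ID-association step solves a bipartite matching between existing tracks and new detections. For a handful of people per frame it can be brute-forced in plain Python (real trackers use the Hungarian algorithm; the cost values below are made up):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost bipartite matching by brute force.
    cost[i][j]: embedding distance between track i and detection j."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm

# embedding distances between 3 existing tracks and 3 new detections
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.1]]
print(best_assignment(cost))  # (0, 1, 2): each track keeps its identity
```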
44. 45
Bottom-up Methods
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial
and Temporal Embeddings. CVPR 2019
Hourglass Model [20]
Human Embedding (HE): human-level representation
Temporal Instance Embedding (TIE): temporal representation for ID association
45. 46
PoseTrack in JD AI Research
1. An end-to-end network (POINet): feature extraction and identity association in a unified network.
2. A pose-guided feature extraction network: pose information + part-alignment attention in hierarchical
convolution features.
3. An ovonic insight network to learn identity matching and switching across frames.
[ACM MM 2019]
51. 52
Challenges of Human Parsing?
• Intrinsic
Varied Person Appearance
Ambiguity of Clothing
Complexity of Clothing
Low Efficiency
Small Targets
Imbalance of Data
• Extrinsic
Occlusion
Clutter
52. 53
Human Parsing History: from constrained to un-constrained scenes
• Pedestrian Parsing [Bo et al., CVPR11]
• Clothing / Fashion Parsing [Yamaguchi et al., CVPR12] [Liu et al., MM14, TMM14, MM15]
• Human & Object Parsing [Liang et al., ICCV15, TPAMI15, ECCV16]
53. 54
Related Work
• Single Human parsing [Bo et al., CVPR11 ]
• Unsupervised super-pixel
• Shape-based matching
• Spatial constraints
Conventional methods:
Yihang Bo, Charless C. Fowlkes: Shape-based pedestrian parsing. CVPR 2011: 2265-2272
54. 55
Related Work
• Single Human parsing
• Conventional methods:
• Yamaguchi, Kota, et al. "Parsing clothing in fashion photographs." CVPR, 2012.
• Yamaguchi, Kota, M. Hadi Kiapour, and Tamara L. Berg. "Paper doll parsing: Retrieving similar
styles to parse clothing items." ICCV, 2013.
• Dong, Jian, et al. "A deformable mixture parsing model with parselets." ICCV, 2013.
Pose Parsing
55. 56
Related Work
• Single Human parsing
• Conventional methods:
• Liu, Si, et al. "Fashion parsing with video context." MM2014, TMM2015.
• Liu, Si, et al. "Fashion parsing with weak color-category labels." TMM, 2014.
weak supervision
56. 57
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Luo, Ping, Xiaogang Wang, and Xiaoou Tang. "Pedestrian parsing via deep
decompositional network." ICCV, 2013.
Hog + DNN Deep Decompositional Network
57. 58
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Liu, Si, et al. "Matching-cnn meets knn: Quasi-parametric human parsing." CVPR. 2015.
• Liang, Xiaodan, et al. "Deep human parsing with active template regression." TPAMI, 2015
Parsing by Matching
58. 59
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Liang, Xiaodan, et al. "Human parsing with contextualized convolutional neural network."
ICCV2015, TPAMI2017.
Cues: Parsing + Image-level Label + Edge + Superpixel
59. 60
Related Work
• Single Human parsing
• Deep learning-based methods in 2017
• Gong, Ke, et al. "Look into Person: Self-Supervised Structure-Sensitive Learning and
a New Benchmark for Human Parsing." CVPR. 2017.
SSL: Self-supervised Structure-sensitive Learning
https://github.com/Engineering-Course/LIP_SSL
60. 61
Related Work
• Single Human parsing
• Deep learning-based methods in 2017
• Liang, Xiaodan, et al. "Look into Person: Joint Body Parsing & Pose Estimation
Network and A New Benchmark." TPAMI, 2018.
JPP-Net: Joint Body Parsing & Pose Estimation Network
Pose
Parsing
https://github.com/Engineering-Course/LIP_JPPNet
61. 62
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Luo, Yawei, et al. "Macro-micro adversarial network for human parsing." ECCV. 2018.
MMAN: Macro-Micro Adversarial Network
Parsing
GAN
https://github.com/RoyalVane/MMAN
62. 63
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Liu, Si, et al. "Cross-domain human parsing via adversarial feature and label
adaptation." AAAI, 2018.
Cross-domain Human Parsing
Parsing
GAN
https://github.com/mathfinder/Cross-domain-Human-Parsing-via-Adversarial-Feature-and-Label-Adaptation
63. 64
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Luo, Xianghui, et al. "Trusted Guidance Pyramid Network for Human Parsing."
ACMMM, 2018
TGPNet: Trusted Guidance Pyramid Network
64. 65
Related Work
• Multi Human parsing
• Li, Qizhu, Anurag Arnab, and Philip HS Torr. "Holistic, Instance-level Human Parsing."
BMVC, 2017.
Detector + FCN: "parsing-by-detection"
65. 67
Related Work
• Multi Human parsing
• Fang, Hao-Shu, et al. “Weakly and Semi Supervised Human Body Part Parsing via
Pose-Guided Knowledge Transfer.” CVPR, 2018.
Parsing Pose RefineNet
https://github.com/MVIG-SJTU/WSHP
66. 68
Related Work
• Multi Human parsing
• Gong, Ke, et al. "Instance-level human parsing via part grouping network." ECCV,
2018
Parsing Edge
https://github.com/Engineering-Course/CIHP_PGN
67. 69
Related Work
• Multi Human parsing
• Zhao, Jian, et al. "Understanding Humans in Crowded Scenes: Deep Nested Adversarial
Learning and A New Benchmark for Multi-Human Parsing." ACMMM, 2018, Best Student
Paper.
https://github.com/ZhaoJ9014/Multi-Human-Parsing
68. 70
Related Work
• Multi Human parsing
• Zhao, Jian, et al. "Understanding Humans in Crowded Scenes: Deep Nested Adversarial
Learning and A New Benchmark for Multi-Human Parsing." ACMMM, 2018, Best Student
Paper.
Parsing + GAN; pipeline: semantic saliency prediction → instance-agnostic parsing → instance-aware clustering
https://github.com/ZhaoJ9014/Multi-Human-Parsing
69. 71
Related Work
• Multi Human parsing
• Li, Jianshu, et al. "Multi-Human Parsing Machines." ACM MM, 2018.
GAN
Instance
Segmentation
Parsing
70. 72
Related Work
• Multi Human parsing
• Tao Ruan, Ting Liu, et al. "Devil in the details: Towards accurate single and
multiple human parsing." AAAI, 2019.
CE2P: Context Embedding with Edge Perceiving (PSPNet + U-Net + Edge-Net)
https://github.com/liutinglt/CE2P
71. 73
Related Work
• Multi Human parsing
• Liu, Ting, et al. "Devil in the details: Towards accurate single and multiple human parsing."
AAAI, 2019.
Parsing
Mask-RCNN
72. 74
Related Work
• Multi Human parsing
• Gong, Ke et al. "Graphonomy: Universal Human Parsing via Graph Transfer Learning."
CVPR, 2019.
Universal Human Parsing: One Model for Different Datasets (parsing + graph transfer learning)
https://github.com/Gaoyiminggithub/Graphonomy
73. 75
Related Work
• Multi Human parsing
• Yang, Lu et al. "Parsing R-CNN for Instance-Level Human Analysis." CVPR, 2019.
Parsing R-CNN: an end-to-end framework for multi-human parsing (FPN + RPN + Non-Local)
74. 76
Related Work
• Video Human parsing
• Zhou, Qixian, et al. “Adaptive Temporal Encoding Network for Video Instance-level
Human Parsing.” ACMMM, 2018. https://github.com/HCPLab-SYSU/ATEN
75. 77
Related Work
• Multi Human parsing
• Liu, Xinchen, et al. "Braiding Semantics and Details for Accurate
Human Parsing." ACM MM, 2019.
A Braiding Network with
two sub-nets:
• A deep-and-narrow net to
learn semantic knowledge;
• A shallow-but-wide net to
capture local structures.
A novel Braiding Module:
• Exchange information
between the two sub-nets
• Learn robust and effective
features for small targets.
Pairwise Hard Region
Embedding:
• Differentiate ambiguous
parsing targets through a
hard-aware regional metric
learning loss.
77. 79
Evaluation Metric
• Single Human Parsing
• Pixel accuracy
• Mean pixel accuracy
• Mean IoU
• Frequency weighted IoU
• F1-score: F1 = 2·P·R / (P + R)
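All of the region metrics above fall out of a single class confusion matrix; a NumPy sketch (a two-class toy example for brevity):

```python
import numpy as np

def parsing_metrics(conf):
    """Per-dataset parsing metrics from a class confusion matrix.
    conf[i, j]: number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    per_class_acc = tp / conf.sum(axis=1)                     # recall per class
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)     # per-class IoU
    freq = conf.sum(axis=1) / conf.sum()                      # class frequency
    return {
        "pixel_acc": tp.sum() / conf.sum(),
        "mean_acc": per_class_acc.mean(),
        "mean_iou": iou.mean(),
        "fw_iou": (freq * iou).sum(),
    }

conf = np.array([[50, 5],    # 55 pixels of class 0, 5 misclassified
                 [10, 35]])  # 45 pixels of class 1, 10 misclassified
m = parsing_metrics(conf)
print(round(m["pixel_acc"], 3))  # 0.85
```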
78. 80
Evaluation Metric
• Multi Human Parsing
• Mean IoU
• APr & mAP
• Percentage of Correctly Parsed (PCP)
• Video Human Parsing
• Similar to Single & Multi Human Parsing
• Additional: FPS
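A single-threshold sketch of AP^r matching in plain Python: predicted instance masks are matched to distinct ground-truth instances at an IoU threshold. The full metric averages precision over recall points and several IoU thresholds; masks are represented here as pixel-coordinate sets for brevity:

```python
def mask_iou(a, b):
    """IoU of two binary masks given as sets of pixel coordinates."""
    return len(a & b) / len(a | b)

def ap_r(pred_masks, gt_masks, thresh=0.5):
    """Fraction of predictions matched to a distinct ground-truth
    instance at IoU > thresh -- a single-threshold sketch of AP^r."""
    matched, tp = set(), 0
    for p in pred_masks:
        for i, g in enumerate(gt_masks):
            if i not in matched and mask_iou(p, g) > thresh:
                matched.add(i)
                tp += 1
                break
    return tp / len(pred_masks)

gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)},  # overlaps gt[0] with IoU 0.75: a hit
        {(9, 9)}]                  # overlaps nothing: a miss
print(ap_r(pred, gt))  # 0.5
```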