Synthesizing pseudo-2.5D content from monocular videos for mixed reality (NAVER Engineering)
Free-viewpoint video (FVV) is an advanced medium that provides a more immersive user experience than traditional media. It lets users interact with content by viewing it from any desired viewpoint, and it is emerging as a next-generation medium.
In creating FVV content, existing systems require complex, specialized capturing equipment and offer low end-user usability, because considerable expertise is needed to operate them. This is an obstacle for individuals or small organizations who want to create content; it limits the end user's ability to create FVV-based user-generated content (UGC) and inhibits the creation and sharing of diverse content.
To tackle these problems, this work proposes ParaPara, an end-to-end system that uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos, unlike previously proposed systems. First, the system detects persons in the monocular video with a deep neural network, calculates a real-world homography matrix from minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The content synthesized by the proposed system is deployed on Microsoft HoloLens; the user can freely place the generated content in the real world and watch it from a free viewpoint.
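The position-estimation step above can be sketched in a few lines: given a homography that maps image coordinates to ground-plane coordinates (which the system derives from minimal user interaction), the bottom-center of a detected person's bounding box can be projected to a pseudo-3D world position. This is a minimal illustration under assumptions, not the paper's implementation; the 3x3 matrix layout and the foot-point convention are assumed here.

```python
def project_to_ground(H, u, v):
    """Map an image point (u, v) to ground-plane coordinates
    using a 3x3 homography H given as nested lists."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # divide out the homogeneous coordinate

def person_ground_position(H, bbox):
    """Project the bottom-center of a person's bounding box
    (x0, y0, x1, y1) -- a common stand-in for the foot point."""
    x0, y0, x1, y1 = bbox
    return project_to_ground(H, (x0 + x1) / 2.0, y1)
```

With the identity homography the image point maps to itself, which is a convenient sanity check before plugging in a calibrated matrix.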
Speaker: Kwang Moo Yi (Professor, University of Victoria)
Date: July 2018
Local features are one of the core building blocks of Computer Vision, used in various tasks such as Image Retrieval, Visual Tracking, Image Registration, and Image Matching. Especially for geometric applications, that is, finding the pose of the camera from images, they remain the state of the art, even in the era of Deep Learning.
In this talk, I will introduce our recent works on using local features, as well as a recent self-supervised pipeline that learns from scratch the keypoints used to match images. I will first talk about how to learn to find good correspondences, then about losses based on eigendecomposition, which is essential when retrieving the camera pose. I will also introduce our latest local feature pipeline, which is self-supervised and inspired by reinforcement learning.
Scaling up Deep Learning Based Super Resolution Algorithms (Xiaoyong Zhu)
Super-resolution is a process for obtaining one or more high-resolution images from one or more low-resolution observations. It has been used for many applications, including satellite and aerial imaging, medical image processing, ultrasound imaging, line fitting, automated mosaicking, infrared imaging, facial image improvement, text image improvement, compressed image and video enhancement, and fingerprint image enhancement. While research on super-resolution began in the 1970s, recently, with the power of deep learning, many notable new methods have been created, including SRCNN, SRResNet, and lately SRGANs, which use generative adversarial networks. However, since these approaches require a lot of images to train the deep learning network, they are extremely compute-intensive. Fortunately, with the power of the cloud, you can easily scale up compute resources as needed, making the algorithm converge faster.
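For context, the baseline that learned methods such as SRCNN improve on is plain interpolation; a minimal nearest-neighbor upscaler (a sketch, not any of the methods named above) shows what a deep SR network must outperform:

```python
def upscale_nearest(img, s):
    """Nearest-neighbor upscaling of a 2D image (list of rows)
    by an integer factor s -- the naive interpolation baseline
    that learned super-resolution methods aim to beat by
    hallucinating plausible high-frequency detail."""
    h, w = len(img), len(img[0])
    return [[img[i // s][j // s] for j in range(w * s)] for i in range(h * s)]
```

Deep SR methods replace this fixed rule with a network trained on pairs of low- and high-resolution images.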
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Segmentation
Active Contours
Split and Merge
Watershed
Region Splitting and Merging
Graph-based Segmentation
Mean Shift and Mode Finding
Normalized Cut
Image Completion using Planar Structure Guidance, SIGGRAPH 2014 (Jia-Bin Huang)
We propose a method for automatically guiding patch-based image completion using mid-level structural cues. Our method first estimates planar projection parameters, softly segments the known region into planes, and discovers translational regularity within these planes. This information is then converted into soft constraints for the low-level completion algorithm by defining prior probabilities for patch offsets and transformations. Our method handles multiple planes, and in the absence of any detected planes falls back to a baseline fronto-parallel image completion algorithm. We validate our technique through extensive comparisons with state-of-the-art algorithms on a variety of scenes.
Project page: https://sites.google.com/site/jbhuang0604/publications/struct_completion
This is an intensive meetup at Samsung Next IL covering the most interesting papers presented at CVPR 2017 last month. It is a good opportunity to get an overview of recent advancements in the field of Deep Learning with applications to Computer Vision.
The following topics are covered:
• Object detection
• Pose estimation
• Efficient networks
Slide for study session given by Dr. Enrico Rinaldi at Arithmer inc.
It is a summary of established methods for parametric modeling of the 3D human body, "SMPL", which has many possible applications in the apparel and health-care industries.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research of modern mathematics and AI systems has the capability of providing solutions when dealing with tough complex issues. At Arithmer we believe it is our job to realize the functions of AI through improving work efficiency and producing more useful results for society.
PR-240: Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers (Hyeongmin Lee)
This paper, Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers, introduces a method that adds controllable parameters so that a network trained for image processing can operate across multiple noise levels.
Video link: https://youtu.be/WXGqYbKQzWY
Action Genome: Actions as Composition of Spatio-Temporal Scene Graphs (Sangmin Woo)
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as composition of spatio-temporal scene graphs. arXiv preprint arXiv:1912.06992, 2019.
Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. The approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. It is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing human pose estimation in the same framework. It shows top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.
presentation: https://www.youtube.com/watch?v=FZePQKPEwoo (in Korean)
reference: He, Kaiming, et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
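The parallel mask and box branches described above must agree: any predicted instance mask implies a tight bounding box. A tiny helper (purely illustrative, not part of the paper's code) makes that relationship concrete:

```python
def mask_to_box(mask):
    """Return the tight bounding box (x0, y0, x1, y1) covering a
    binary mask (list of rows of 0/1), or None for an empty mask --
    the box a consistent box branch should roughly reproduce."""
    xs = [j for row in mask for j, v in enumerate(row) if v]
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)
```

In Mask R-CNN itself both outputs are predicted independently per RoI, which is why the mask branch adds so little overhead.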
PR-278: RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (Hyeongmin Lee)
This paper received the Best Paper award at ECCV 2020; unlike previous methods, it predicts optical flow through iterative updates and achieves remarkably high performance.
paper link: https://arxiv.org/pdf/2003.12039.pdf
video link: https://youtu.be/OnZIDatotZ4
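The iterative-update idea can be shown with a toy: rather than predicting the answer in one shot, a recurrent operator emits residual corrections that are accumulated. This is a 1D sketch only; RAFT's correlation volume and GRU update block are replaced here by an assumed stub residual function.

```python
def iterative_refine(f0, residual_fn, n_iters):
    """Accumulate residual updates f <- f + delta, mirroring the
    recurrent refinement scheme (residual_fn is a stub standing in
    for RAFT's learned update operator)."""
    f = f0
    history = [f]
    for _ in range(n_iters):
        f = f + residual_fn(f)
        history.append(f)
    return f, history

# Stub update: each iteration closes half the gap to a 'true flow' of 8.0.
final, hist = iterative_refine(0.0, lambda f: 0.5 * (8.0 - f), 10)
```

The estimate converges geometrically toward the target, which is the behavior the paper exploits: every extra iteration refines the flow field.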
Human action recognition with Kinect using a joint motion descriptor (Soma Boubou)
- We proposed a novel descriptor for the motion of skeleton joints.
- The proposed descriptor outperformed state-of-the-art descriptors such as HON4D and the one proposed by Chen et al. (2013).
- Our proposed approach proved effective for periodic actions (e.g., waving, walking, jogging, side-boxing, etc.).
- Grouping was effective for actions with unique joint trajectories (e.g., tennis serving, side kicking, etc.).
- Grouping joints into eight groups was consistently effective on actions of the MSR3D dataset.
Age Estimation and Gender Prediction Using Convolutional Neural Networks (Bulbul Agrawal)
Identifying human attributes such as age, gender, ethnicity, and emotion using computer vision has received increased attention in recent years. Such attributes can play an important role in many applications, including human-computer interaction, surveillance, search, biometrics, product sales, entertainment, and cosmetology. Generally, human life can be classified into one of four age groups: children, young, adult, and old. The image of a person's face exhibits many variations that may affect the ability of a computer vision system to recognize gender. In this dissertation, we evaluate a CNN architecture along with PCA to achieve good performance.
VIBE: Video Inference for Human Body Pose and Shape Estimation (Arithmer Inc.)
These slides were prepared for a study session given by Christian Saravia at Arithmer Inc.
It is a summary of recent methods for human pose/shape estimation from video.
Robust Human Tracking Method Based on Appearance and Geometrical Features in Non-overlapping Views (csandit)
This paper proposes a robust tracking method that concatenates appearance and geometrical features to re-identify humans in non-overlapping views. A uniform partitioning method is proposed to extract local HSV (hue, saturation, value) color features from the upper and lower portions of clothing. Then an adaptive principal-view selection algorithm is presented to locate the principal view, which contains the maximum number of appearance feature dimensions captured from different visual angles. For each appearance feature dimension in the principal view, all its inner frames are used to train a support vector machine (SVM). In the matching process, human candidates are first filtered with an integrated geometrical feature that connects a height estimate with a gait feature. The appearance features of the remaining human candidates are then tested by the SVMs to determine the object's presence in new cameras. Experimental results show the feasibility and effectiveness of this proposal and demonstrate real-time appearance feature extraction and robustness to illumination and visual-angle changes.
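The upper/lower clothing partition can be illustrated with stdlib tools: split a person crop into halves and histogram the hue channel of each. This is a sketch of the general idea only; the bin count, RGB input format, and use of hue alone are assumptions, not the paper's exact feature.

```python
import colorsys

def part_hue_histograms(rgb_img, bins=8):
    """Split an RGB person crop (list of rows of (r, g, b) in 0..255)
    into upper and lower halves and histogram the hue channel of
    each, roughly mirroring the upper/lower-clothing partition."""
    def hue_hist(rows):
        hist = [0] * bins
        for row in rows:
            for r, g, b in row:
                h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
                hist[min(int(h * bins), bins - 1)] += 1
        return hist
    mid = len(rgb_img) // 2
    return hue_hist(rgb_img[:mid]), hue_hist(rgb_img[mid:])
```

The two histograms together form a simple appearance signature: a red shirt and green trousers land in different bins of different halves.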
Wearable Accelerometer Optimal Positions for Human Motion Recognition, LifeTech 2020 (sugiuralab)
Wearable Accelerometer Optimal Positions for Human Motion Recognition. The 2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech 2020), March 10-11, 2020
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 4: Retinex-model-based low-light enhancement
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 3: prior-embedding deep super-resolution
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 2: text-centric image style transfer
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning (Wanjin Yu)
ICME2019 Tutorial: Intelligent Image Enhancement and Restoration - From Prior Driven Model to Advanced Deep Learning. Part 1: prior-embedding deep rain removal
Wireless Communication System (JeyaPerumal1)
Wireless communication transmits information over a distance without the help of wires, cables, or any other form of electrical conductor. It is a broad term that covers all procedures and forms of connecting and communicating between two or more devices using a wireless signal, through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements and effective features.
The transmission distance can be anywhere from a few meters (for example, a television's remote control) to thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024 (APNIC)
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
Bridging the Digital Gap: Brad Spiegel Macon, GA Initiative (Brad Spiegel Macon GA)
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
Multi-cluster Kubernetes Networking: Patterns, Projects and Guidelines (Sanjeev Rampal)
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
Introduction
• Human pose estimation: single-person and multi-person
A typical 16-keypoint skeleton:
1. Right_Shoulder
2. Right_Elbow
3. Right_Wrist
4. Left_Shoulder
5. Left_Elbow
6. Left_Wrist
7. Right_Hip
8. Right_Knee
9. Right_Ankle
10. Left_Hip
11. Left_Knee
12. Left_Ankle
13. Head
14. Neck
15. Spine
16. Pelvis
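The 16-keypoint layout above maps naturally to index constants; a pose is then just a length-16 array of (x, y) coordinates. The names are copied from the list (any set of skeleton edges connecting them would be an additional assumption, so only the name-to-index mapping is shown):

```python
# The 16 keypoints listed above, in order (indices 0..15).
JOINTS = [
    "Right_Shoulder", "Right_Elbow", "Right_Wrist",
    "Left_Shoulder", "Left_Elbow", "Left_Wrist",
    "Right_Hip", "Right_Knee", "Right_Ankle",
    "Left_Hip", "Left_Knee", "Left_Ankle",
    "Head", "Neck", "Spine", "Pelvis",
]
# Name -> index lookup, so a pose is a length-16 list of (x, y).
JOINT_INDEX = {name: i for i, name in enumerate(JOINTS)}
```

Keeping a single canonical ordering like this is what lets heatmap channels, annotation files, and evaluation code agree on which joint is which.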
Applications
• Human action recognition
• Human-computer interaction
• Animation
• Intelligent retail, such as self-service supermarkets and intelligent warehouses
Challenges
• Various appearances and low-resolutions
• Diverse human poses and views
• Occluded or invisible key points
• Crowded background
Top-down Methods
[1] Stacked Hourglass Networks for Human Pose Estimation. [Newell, ECCV2016]
[2] Towards Accurate Multi-person Pose Estimation in the Wild. [Papandreou, CVPR2017]
[3] RMPE: Regional Multi-Person Pose Estimation. [Fang, ICCV2017]
[4] Simple Baselines for Human Pose Estimation and Tracking. [Xiao, ECCV2018]
[5] Cascaded Pyramid Network for Multi-Person Pose Estimation. [Chen, CVPR2018]
[6] HRNet: Deep High-Resolution Representation Learning for Human Pose Estimation. [Sun, CVPR2019]
Pipeline: human detection + single-person keypoint detection
Advantage: state-of-the-art accuracy
Problem: lower speed; accuracy depends on the human detector
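The two-stage pattern can be sketched with stub components; the detector and single-person estimator callables here are placeholders for real models, and the box/crop conventions are assumptions:

```python
def top_down_pose(image, detect_people, estimate_single_pose):
    """Top-down multi-person pose estimation: run a person detector,
    then a single-person keypoint estimator on each cropped box.
    Both callables are stubs standing in for trained networks."""
    poses = []
    for box in detect_people(image):
        x0, y0, x1, y1 = box
        crop = [row[x0:x1] for row in image[y0:y1]]
        local = estimate_single_pose(crop)
        # Translate keypoints from crop coordinates back to the image.
        poses.append([(x + x0, y + y0) for x, y in local])
    return poses
```

The structure also makes the listed drawbacks visible: runtime grows with the number of detected people, and a missed detection loses that person's pose entirely.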
Bottom-Up Methods
[1] Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. [Cao, CVPR2017]
[2] Associative Embedding: End-to-End Learning for Joint Detection and Grouping. [Newell, NeurIPS 2017]
[3] MultiPoseNet: Fast Multi-person Pose Estimation Using Pose Residual Network. [Kocabas, ECCV2018]
[4] PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. [Papandreou, ECCV2018]
[5] PifPaf: Composite Fields for Human Pose Estimation. [Kreiss, CVPR2019]
[6] Multi-person Articulated Tracking with Spatial and Temporal Embeddings. [CVPR2019]
Pipeline: detecting keypoints + grouping them into human bodies
Advantage: higher speed; does not rely on human detection
Problem: lower accuracy
Single Person
• Stacked hourglass: the basic network backbone [1]
• Each hourglass first subsamples the feature maps, then upsamples them, combining in higher-resolution features from the bottom layers.
• This bottom-up, top-down processing is repeated several times.
Alejandro Newell, Kaiyu Yang, Jia Deng: Stacked Hourglass Networks for Human Pose Estimation. ECCV (8) 2016: 483-499.
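The subsample/upsample-with-skip pattern can be illustrated recursively on a 1D signal. This is a toy under stated simplifications: average pooling for subsampling, nearest-neighbor repetition for upsampling, and elementwise addition in place of the real convolutional blocks.

```python
def hourglass_1d(x, depth):
    """One hourglass pass on a 1D signal of even length: subsample,
    recurse at the lower resolution, upsample, and add the
    higher-resolution skip branch -- mirroring the bottom-up,
    top-down processing of the hourglass module."""
    if depth == 0 or len(x) < 2:
        return x
    # Subsample by averaging adjacent pairs (stand-in for pooling).
    down = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]
    inner = hourglass_1d(down, depth - 1)
    # Upsample by repetition (stand-in for learned upsampling).
    up = [v for v in inner for _ in range(2)]
    # Combine with the skip branch at this resolution.
    return [a + b for a, b in zip(x, up)]
```

The output keeps the input resolution while mixing in coarser context from every level, which is exactly the property the hourglass design is after.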
Single Person
• Feature pyramid module [2]
• A feature pyramid representation provides sufficient context information, especially for occluded and invisible keypoints.
• The residual blocks are replaced by feature pyramid modules; each module consists of bottlenecks at different resolutions.
Learning Feature Pyramids for Human Pose Estimation. W. Yang, S. Li, W. Ouyang, et al. ICCV 2017.
https://github.com/bearpaw/PyraNet
Top-down Methods
• George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy: Towards Accurate Multi-person Pose Estimation in the Wild. CVPR 2017: 3711-3719
• Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, Jian Sun:
Cascaded Pyramid Network for Multi-Person Pose Estimation. CVPR 2018: 7103-7112
Top-down Methods
• This model applies pyramid features. In GlobalNet, features from different levels
are added together to give a rough prediction of key-point positions.
• RefineNet takes GlobalNet's output, upsamples the pyramid features, and uses
online hard keypoint mining to improve accuracy.
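The hard-keypoint-mining step can be illustrated with a NumPy sketch (the heatmap shapes and the top-M count below are assumptions; the idea, as in CPN's RefineNet, is to back-propagate only the M hardest of N keypoint heatmap losses):

```python
import numpy as np

def hard_keypoint_loss(pred, target, top_m=8):
    """Mean L2 heatmap loss over only the top_m hardest keypoints.
    pred, target: (num_keypoints, H, W) heatmaps."""
    per_kpt = ((pred - target) ** 2).mean(axis=(1, 2))  # loss per keypoint
    hardest = np.sort(per_kpt)[-top_m:]                 # keep the largest losses
    return hardest.mean()

pred = np.random.rand(17, 32, 32)
target = np.random.rand(17, 32, 32)
loss_all = ((pred - target) ** 2).mean()
loss_hard = hard_keypoint_loss(pred, target)
print(loss_hard >= loss_all)  # True: mining focuses training on the hardest joints
```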
15. 15
• Bin Xiao, Haiping Wu, Yichen Wei: Simple Baselines for Human Pose Estimation and
Tracking. ECCV (6) 2018: 472-487
Top-down Methods
https://github.com/leoxiaobin/pose.pytorch
How high-resolution feature maps are generated: this method combines the upsampling
and convolutional parameters into deconvolutional layers in a much simpler way,
without using skip-layer connections.
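A stride-2 transposed convolution of the kind this head stacks can be sketched in NumPy as zero-insertion followed by an ordinary convolution (a single-channel illustration with a fixed bilinear kernel; the real layers are learned):

```python
import numpy as np

def deconv2x(x, kernel):
    """Stride-2 transposed convolution on a single-channel map:
    insert zeros between pixels, then convolve with the kernel."""
    h, w = x.shape
    up = np.zeros((2 * h, 2 * w))
    up[::2, ::2] = x                      # zero-stuffing doubles the resolution
    kh, kw = kernel.shape
    pad = kh // 2
    padded = np.pad(up, pad)
    out = np.zeros_like(up)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

feat = np.random.rand(8, 8)
bilinear = np.outer([0.5, 1.0, 0.5], [0.5, 1.0, 0.5])  # fixed bilinear kernel
up = deconv2x(feat, bilinear)
print(up.shape)  # (16, 16)
```

Three such layers take a backbone's low-resolution features back up to heatmap resolution, which is the whole "simple baseline" head.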
16. 16
• Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang: Deep High-Resolution Representation
Learning for Human Pose Estimation. CVPR 2019
Top-down Methods
1. The proposed human pose estimation network maintains high-resolution representations through the whole
process;
2. it starts from a high-resolution subnetwork as the first stage, gradually adds high-to-low resolution
subnetworks one by one to form more stages, and connects the multi-resolution subnetworks in parallel;
3. repeated multi-scale fusions let each of the high-to-low resolution representations receive
information from the other parallel representations over and over, leading to rich high-resolution representations.
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch
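The repeated multi-scale fusion can be sketched with a two-branch NumPy toy (the real HRNet keeps up to four branches and exchanges information through learned convolutions, so this is only the connectivity pattern):

```python
import numpy as np

def pool2x(x):
    """Average-pool 2x: send the high-res branch down to low resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2x(x):
    """Nearest-neighbor 2x: send the low-res branch up to high resolution."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fuse(high, low):
    """One multi-scale fusion: each parallel branch receives
    information from the other resolution."""
    new_high = high + up2x(low)    # high-res branch gets upsampled low-res
    new_low = low + pool2x(high)   # low-res branch gets downsampled high-res
    return new_high, new_low

high = np.random.rand(32, 32)   # high-resolution branch, kept throughout
low = np.random.rand(16, 16)    # low-resolution branch
for _ in range(3):              # repeated fusions, as the slide describes
    high, low = fuse(high, low)
print(high.shape, low.shape)  # (32, 32) (16, 16)
```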
17. 17
Bottom-Up Methods
[1] Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. [Cao, CVPR2017]
[2] Associative Embedding : End-to-End Learning for Joint Detection and Grouping. [Newell
A, NeurIPS 2017]
[3] MultiPoseNet: Fast multi-person pose estimation using pose residual network. [Kocabas,
ECCV2018]
[4] PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-
Based, Geometric Embedding Model. [Papandreou, ECCV2018]
[5] PifPaf: Composite Fields for Human Pose Estimation. [Kreiss, CVPR2019]
[6] Multi-person Articulated Tracking with Spatial and Temporal Embeddings. [CVPR2019]
Detecting key points first, then assembling them into human bodies
Advantage: higher speed
Problem: lower accuracy
20. 20
• Associative Embedding: End-to-end Learning for Joint Detection and Grouping. Alejandro Newell, Zhiao Huang,
and Jia Deng. Neural Information Processing Systems (NIPS), 2017.
Bottom-Up Methods
https://github.com/princeton-vl/pose-ae-train
Detection + Grouping
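The grouping step can be sketched in plain Python: every joint detection carries a 1-D embedding tag, and joints whose tags are close are assigned to the same person. The tag values, threshold, and greedy policy below are illustrative assumptions, not the paper's exact clustering:

```python
def group_by_embedding(detections, threshold=0.5):
    """Greedily group keypoint detections by their embedding tags.
    detections: list of (joint_type, x, y, tag) tuples."""
    groups = []  # each group: {"tags": [...], "joints": [...]}
    for joint_type, x, y, tag in detections:
        placed = False
        for g in groups:
            mean_tag = sum(g["tags"]) / len(g["tags"])
            if abs(tag - mean_tag) < threshold:  # tag close to this person
                g["tags"].append(tag)
                g["joints"].append((joint_type, x, y))
                placed = True
                break
        if not placed:  # no existing person matches: start a new one
            groups.append({"tags": [tag], "joints": [(joint_type, x, y)]})
    return groups

# two people: tags cluster around 0.1 and 2.0
dets = [("nose", 10, 5, 0.10), ("nose", 40, 6, 2.05),
        ("wrist", 12, 30, 0.12), ("wrist", 43, 31, 1.98)]
people = group_by_embedding(dets)
print(len(people))  # 2
```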
21. 21
• Muhammed Kocabas, Salih Karagoz, Emre Akbas: MultiPoseNet: Fast Multi-Person Pose Estimation Using
Pose Residual Network. ECCV (11) 2018: 437-453
Bottom-Up Methods
https://github.com/mkocabas/pose-residual-network
MultiPoseNet can jointly handle person detection, keypoint detection, person
segmentation and pose estimation problems.
22. 22
• George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy:
PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric
Embedding Model. ECCV (14) 2018: 282-299
Bottom-Up Methods
• The PersonLab system consists of a CNN model that
predicts: (1) keypoint heatmaps, (2) short-range offsets,
(3) mid-range pairwise offsets, (4) person segmentation
maps, and (5) long-range offsets.
• The first three predictions are used by the Pose
Estimation Module in order to detect human poses.
• The latter two, along with the human pose detections,
are used by the Instance Segmentation Module in order
to predict person instance segmentation masks.
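How a keypoint heatmap and its short-range offsets combine can be sketched with NumPy: take the heatmap peak, then refine it with the offset vector predicted at that location. This is a simplified single-keypoint illustration, not the paper's full Hough-voting procedure:

```python
import numpy as np

def decode_keypoint(heatmap, offsets):
    """Refine a heatmap peak with its short-range offset vector.
    heatmap: (H, W); offsets: (H, W, 2) as (dy, dx) in pixels."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dy, dx = offsets[y, x]
    return y + dy, x + dx  # sub-pixel keypoint position

H, W = 16, 16
heatmap = np.zeros((H, W)); heatmap[7, 9] = 1.0
offsets = np.zeros((H, W, 2)); offsets[7, 9] = (0.4, -0.2)
ky, kx = decode_keypoint(heatmap, offsets)  # ~ (7.4, 8.8)
```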
23. 24
Pose Estimation Dataset
Dataset        Single person   Multi-person   Num of Kpts   Num of Persons
LSP            Y               N              14            ~2K
FLIC           Y               N              9             ~20K
MPII           Y               Y              16            ~25K
COCO           N               Y              17            ~100K
AI Challenger  N               Y              14            ~700K
PoseTrack      N               Y              15            ~160K
26. 27
Human Pose Estimation API @ Neuhub
(1)CVPR 2018 LIP Challenge Single Human Pose Estimation 1st place
(2)CVPR 2018 LIP Challenge Multi-Human Pose Estimation 1st place
27. 28
‘Finger Heart & 618’ Gesture for AR Scan
WeChat Mini Program for
Halloween
WeChat Mini Program for
POPMART
Human Pose Estimation API @ Neuhub
29. 30
PoseTrack
• Mykhaylo Andriluka, Google Research, Zürich, Switzerland
• Umar Iqbal, University of Bonn, Germany
• Anton Milan, Amazon
• Christoph Lassner, Amazon
• Eldar Insafutdinov, MPI for Informatics, Saarbrücken, Germany
• Leonid Pishchulin, MPI for Informatics, Saarbrücken, Germany
• Juergen Gall, University of Bonn, Germany
• Bernt Schiele, MPI for Informatics, Saarbrücken, Germany
PoseTrack is a joint project of
the Max Planck Institute for
Informatics, University of Bonn
and the PoseTrack team.
30. 31
PoseTrack
Key Figures
1356 video sequences
46K annotated video frames
276K body pose annotations
Two challenges:
Multi-Person Pose Estimation
Multi-Person Pose Tracking
31. 32
Challenges
• Large pose and scale variations
• Fast motions
• A varying number of persons
• Partially visible body parts due to occlusion or truncation
32. 33
Related Work
Bottom-up Methods
[1] Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint Multi-person Pose Estimation and Tracking. In CVPR 2017 & CVPR 2018.
[2] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack:
Articulated Multi-Person Tracking in the Wild. In CVPR 2017.
[3] Andreas Doering, Umar Iqbal, and Juergen Gall. JointFlow: Temporal Flow Fields for Multi Person Pose Tracking. In BMVC 2018.
[4] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to Detect and Track Visible
and Occluded Body Joints in a Virtual World. In ECCV 2018.
[5] Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial and Temporal Embeddings. CVPR 2019.
Top-down Methods
[1] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-Track: Efficient Pose Estimation in Videos. In
CVPR 2018.
[2] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient Online Pose Tracking. In BMVC 2018.
[3] Bin Xiao, Haiping Wu, and Yichen Wei. Simple Baselines for Human Pose Estimation and Tracking. In ECCV 2018.
33. 34
Top-down Methods
Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-
Track: Efficient Pose Estimation in Videos. In CVPR 2018.
https://github.com/facebookresearch/DetectAndTrack
They propose a two-stage approach to keypoint estimation and tracking in videos:
1) a novel video pose estimation formulation, 3D Mask R-CNN, that takes a short video clip
as input and produces a tubelet per person along with the keypoints inside it;
2) a lightweight optimization to link the detections over time.
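The linking stage can be approximated with a simple greedy IoU matcher over consecutive frames. This is an illustrative stand-in for the paper's optimization; the box format and threshold are assumptions:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link(prev_tracks, detections, thresh=0.3):
    """Greedily extend each track with the best-overlapping new detection."""
    assigned, links = set(), {}
    for tid, box in prev_tracks.items():
        best, best_iou = None, thresh
        for i, det in enumerate(detections):
            if i not in assigned and iou(box, det) > best_iou:
                best, best_iou = i, iou(box, det)
        if best is not None:
            links[tid] = best
            assigned.add(best)
    return links

tracks = {0: (10, 10, 50, 90), 1: (100, 12, 140, 95)}
dets = [(102, 10, 141, 96), (12, 11, 52, 92)]
print(link(tracks, dets))  # {0: 1, 1: 0}: each track follows its person
```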
34. 35
Top-down Methods
Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient Online
Pose Tracking. In BMVC 2018. https://github.com/YuliangXiu/PoseFlow
• Overall Pipeline: 1) Pose Estimator. 2) Pose
Flow Builder. 3) Pose Flow NMS.
• First, they estimate multi-person poses.
• Second, they build pose flows by maximizing the
overall confidence and purify them with Pose
Flow NMS.
• Finally, reasonable multi-pose trajectories
can be obtained.
35. 36
Top-down Methods
Bin Xiao, Haiping Wu, and Yichen Wei. Simple Baselines for Human Pose Estimation and Tracking.
In ECCV 2018
https://github.com/microsoft/human-pose-estimation.pytorch
37. 38
Bottom-up Methods
• Umar Iqbal, Anton Milan, and Juergen Gall. PoseTrack: Joint Multi-person Pose Estimation and
Tracking. In CVPR 2017.
• Mykhaylo Andriluka, Umar Iqbal, Anton Milan, Eldar Insafutdinov, Leonid Pishchulin, Juergen
Gall, and Bernt Schiele. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In
CVPR 2018.
OpenPose / DeepCut + Graph partition
38. 39
Bottom-up Methods
• Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern
Andres, and Bernt Schiele. ArtTrack: Articulated Multi-Person Tracking in the Wild. In CVPR
2017. https://github.com/eldar/pose-tensorflow
39. 40
Bottom-up Methods
• Andreas Doering, Umar Iqbal, Juergen Gall, and DE Bonn. JointFlow: Temporal Flow Fields for
Multi Person Pose Tracking. In BMVC 2018.
40. 41
Bottom-up Methods
Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita
Cucchiara. Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World. In
ECCV 2018.
43. 44
Bottom-up Methods
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial
and Temporal Embeddings. CVPR 2019
A unified framework for pose estimation and tracking: a bottom-up method with state-of-the-art results.
• Part-level grouping: part appearance + geometric information
• Temporal grouping: human embedding + temporal embedding
• Pose tracking: bipartite graph matching
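The ID-association step solves a bipartite matching between existing tracks and new detections. For a handful of people per frame it can be brute-forced in plain Python (real trackers use the Hungarian algorithm; the cost values below are made up):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost bipartite matching by brute force.
    cost[i][j]: embedding distance between track i and detection j."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm

# embedding distances between 3 existing tracks and 3 new detections
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.1]]
print(best_assignment(cost))  # (0, 1, 2): each track keeps its identity
```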
44. 45
Bottom-up Methods
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian: Multi-person Articulated Tracking with Spatial
and Temporal Embeddings. CVPR 2019
Hourglass Model [20]
Human Embedding (HE): human-level representation
Temporal Instance Embedding (TIE): temporal representation for ID association
45. 46
PoseTrack in JD AI Research
1. An end-to-end network (POINet): feature extraction and identity association in a unified network.
2. A pose-guided feature extraction network: pose information + part-alignment attention in hierarchical
convolution features.
3. An ovonic insight network to learn identity matching and switching across frames.
[ACM MM 2019]
51. 52
Challenges of Human Parsing?
• Intrinsic
Varied Person Appearance
Ambiguity of Clothing
Complexity of Clothing
Low Efficiency
Small Targets
Imbalance of Data
• Extrinsic
Occlusion
Clutter
52. 53
Human Parsing History: from constrained to un-constrained scenes
• Pedestrian Parsing [Bo et al., CVPR11]
• Clothing / Fashion Parsing [Yamaguchi et al., CVPR12] [Liu et al., MM14, TMM14, MM15]
• Human & Object Parsing [Liang et al., ICCV15, TPAMI15, ECCV16]
53. 54
Related Work
• Single Human parsing [Bo et al., CVPR11 ]
• Unsupervised super-pixel
• Shape-based matching
• Spatial constraints
Conventional methods:
Yihang Bo, Charless C. Fowlkes: Shape-based pedestrian parsing. CVPR 2011: 2265-2272
54. 55
Related Work
• Single Human parsing
• Conventional methods:
• Yamaguchi, Kota, et al. "Parsing clothing in fashion photographs." CVPR, 2012.
• Yamaguchi, Kota, M. Hadi Kiapour, and Tamara L. Berg. "Paper doll parsing: Retrieving similar
styles to parse clothing items." ICCV, 2013.
• Dong, Jian, et al. "A deformable mixture parsing model with parselets." ICCV, 2013.
Pose Parsing
55. 56
Related Work
• Single Human parsing
• Conventional methods:
• Liu, Si, et al. "Fashion parsing with video context." MM2014, TMM2015.
• Liu, Si, et al. "Fashion parsing with weak color-category labels." TMM, 2014.
weak supervision
56. 57
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Luo, Ping, Xiaogang Wang, and Xiaoou Tang. "Pedestrian parsing via deep
decompositional network." ICCV, 2013.
Hog + DNN Deep Decompositional Network
57. 58
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Liu, Si, et al. "Matching-cnn meets knn: Quasi-parametric human parsing." CVPR. 2015.
• Liang, Xiaodan, et al. "Deep human parsing with active template regression." TPAMI, 2015
Parsing by Matching
58. 59
Related Work
• Single Human parsing
• Deep learning-based methods before 2017:
• Liang, Xiaodan, et al. "Human parsing with contextualized convolutional neural network."
ICCV2015, TPAMI2017.
Cues: Parsing + Image-level Label + Edge + Superpixel
59. 60
Related Work
• Single Human parsing
• Deep learning-based methods in 2017
• Gong, Ke, et al. "Look into Person: Self-Supervised Structure-Sensitive Learning and
a New Benchmark for Human Parsing." CVPR. 2017.
SSL: Self-supervised Structure-sensitive Learning
https://github.com/Engineering-Course/LIP_SSL
60. 61
Related Work
• Single Human parsing
• Deep learning-based methods in 2017
• Liang, Xiaodan, et al. "Look into Person: Joint Body Parsing & Pose Estimation
Network and A New Benchmark." TPAMI, 2018.
JPP-Net: Joint Body Parsing & Pose Estimation Network
Pose
Parsing
https://github.com/Engineering-Course/LIP_JPPNet
61. 62
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Luo, Yawei, et al. "Macro-micro adversarial network for human parsing." ECCV. 2018.
MMAN: Macro-Micro Adversarial Network
Parsing
GAN
https://github.com/RoyalVane/MMAN
62. 63
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Liu, Si, et al. "Cross-domain human parsing via adversarial feature and label
adaptation." AAAI, 2018.
Cross-domain Human Parsing
Parsing
GAN
https://github.com/mathfinder/Cross-domain-Human-Parsing-via-Adversarial-Feature-and-Label-Adaptation
63. 64
Related Work
• Single Human parsing
• Deep learning-based methods in 2018
• Luo, Xianghui, et al. "Trusted Guidance Pyramid Network for Human Parsing."
ACMMM, 2018
TGPNet: Trusted Guidance Pyramid Network
64. 65
Related Work
• Multi Human parsing
• Li, Qizhu, Anurag Arnab, and Philip HS Torr. "Holistic, Instance-level Human Parsing."
BMVC, 2017.
Detector + FCN: "parsing-by-detection"
65. 67
Related Work
• Multi Human parsing
• Fang, Hao-Shu, et al. “Weakly and Semi Supervised Human Body Part Parsing via
Pose-Guided Knowledge Transfer.” CVPR, 2018.
Parsing Pose RefineNet
https://github.com/MVIG-SJTU/WSHP
66. 68
Related Work
• Multi Human parsing
• Gong, Ke, et al. "Instance-level human parsing via part grouping network." ECCV,
2018
Parsing Edge
https://github.com/Engineering-Course/CIHP_PGN
67. 69
Related Work
• Multi Human parsing
• Zhao, Jian, et al. "Understanding Humans in Crowded Scenes: Deep Nested Adversarial
Learning and A New Benchmark for Multi-Human Parsing." ACMMM, 2018, Best Student
Paper.
https://github.com/ZhaoJ9014/Multi-Human-Parsing
68. 70
Related Work
• Multi Human parsing
• Zhao, Jian, et al. "Understanding Humans in Crowded Scenes: Deep Nested Adversarial
Learning and A New Benchmark for Multi-Human Parsing." ACMMM, 2018, Best Student
Paper.
Parsing + GAN; pipeline: semantic saliency prediction → instance-agnostic parsing → instance-aware clustering
https://github.com/ZhaoJ9014/Multi-Human-Parsing
69. 71
Related Work
• Multi Human parsing
• Li, Jianshu, et al. "Multi-Human Parsing Machines." ACM MM, 2018.
GAN
Instance
Segmentation
Parsing
70. 72
Related Work
• Multi Human parsing
• Tao Ruan, Ting Liu, et al. "Devil in the details: Towards accurate single and
multiple human parsing." AAAI, 2019.
CE2P: Context Embedding with Edge Perceiving (PSPNet + U-Net + Edge-Net)
https://github.com/liutinglt/CE2P
71. 73
Related Work
• Multi Human parsing
• Liu, Ting, et al. "Devil in the details: Towards accurate single and multiple human parsing."
AAAI, 2019.
Parsing
Mask-RCNN
72. 74
Related Work
• Multi Human parsing
• Gong, Ke et al. "Graphonomy: Universal Human Parsing via Graph Transfer Learning."
CVPR, 2019.
Universal Human Parsing: One Model for Different Datasets (parsing + graph transfer learning)
https://github.com/Gaoyiminggithub/Graphonomy
73. 75
Related Work
• Multi Human parsing
• Yang, Lu et al. "Parsing R-CNN for Instance-Level Human Analysis." CVPR, 2019.
Parsing R-CNN: an end-to-end framework for multi-human parsing (FPN + RPN + Non-Local)
74. 76
Related Work
• Video Human parsing
• Zhou, Qixian, et al. “Adaptive Temporal Encoding Network for Video Instance-level
Human Parsing.” ACMMM, 2018. https://github.com/HCPLab-SYSU/ATEN
75. 77
Related Work
• Multi Human parsing
• Liu, Xinchen, et al. "Braiding Semantics and Details for Accurate
Human Parsing." ACM MM, 2019.
A Braiding Network with
two sub-nets:
• A deep-and-narrow net to
learn semantic knowledge;
• A shallow-but-wide net to
capture local structures.
A novel Braiding Module:
• Exchange information
between the two sub-nets
• Learn robust and effective
features for small targets.
Pairwise Hard Region
Embedding:
• Differentiate ambiguous
parsing targets through a
hard-aware regional metric
learning loss.
77. 79
Evaluation Metric
• Single Human Parsing
• Pixel accuracy
• Mean pixel accuracy
• Mean IoU
• Frequency weighted IoU
• F1-score: F1 = 2·P·R / (P + R)
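All of the region metrics above fall out of a single class confusion matrix; a NumPy sketch (a two-class toy example for brevity):

```python
import numpy as np

def parsing_metrics(conf):
    """Per-dataset parsing metrics from a class confusion matrix.
    conf[i, j]: number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    per_class_acc = tp / conf.sum(axis=1)                     # recall per class
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)     # per-class IoU
    freq = conf.sum(axis=1) / conf.sum()                      # class frequency
    return {
        "pixel_acc": tp.sum() / conf.sum(),
        "mean_acc": per_class_acc.mean(),
        "mean_iou": iou.mean(),
        "fw_iou": (freq * iou).sum(),
    }

conf = np.array([[50, 5],    # 55 pixels of class 0, 5 misclassified
                 [10, 35]])  # 45 pixels of class 1, 10 misclassified
m = parsing_metrics(conf)
print(round(m["pixel_acc"], 3))  # 0.85
```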
78. 80
Evaluation Metric
• Multi Human Parsing
• Mean IoU
• APr & mAP
• Percentage of Correctly Parsed (PCP)
• Video Human Parsing
• Similar to Single & Multi Human Parsing
• Additional: FPS
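A single-threshold sketch of AP^r matching in plain Python: predicted instance masks are matched to distinct ground-truth instances at an IoU threshold. The full metric averages precision over recall points and several IoU thresholds; masks are represented here as pixel-coordinate sets for brevity:

```python
def mask_iou(a, b):
    """IoU of two binary masks given as sets of pixel coordinates."""
    return len(a & b) / len(a | b)

def ap_r(pred_masks, gt_masks, thresh=0.5):
    """Fraction of predictions matched to a distinct ground-truth
    instance at IoU > thresh -- a single-threshold sketch of AP^r."""
    matched, tp = set(), 0
    for p in pred_masks:
        for i, g in enumerate(gt_masks):
            if i not in matched and mask_iou(p, g) > thresh:
                matched.add(i)
                tp += 1
                break
    return tp / len(pred_masks)

gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)},  # overlaps gt[0] with IoU 0.75: a hit
        {(9, 9)}]                  # overlaps nothing: a miss
print(ap_r(pred, gt))  # 0.5
```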