HRNet: Deep High-Resolution Representation Learning for Human Pose Estimation

taeseonryu 697 views 24 slides Mar 31, 2022

About This Presentation

Hello, this is the Deep Learning Paper Reading Group! The paper we are introducing today is titled Deep High-Resolution Representation Learning for Human Pose Estimation.

Today's paper is about pose estimation. Existing pose estimation models...


Slide Content

HRNet
Deep High-Resolution Representation Learning for Human Pose Estimation
CVPR 2019
Authors: Ke Sun1,2, Bin Xiao2, Dong Liu1, Jingdong Wang2
1University of Science and Technology of China, 2Microsoft Research Asia
Image Processing Team | Detection
Byunghyun Kim, Chanhyeok Lee, Eungi Hong, Jongsik Ahn,
Hyeonjin Kim, Jaewan Park, Chungchun Hyun, Seonok Kim (!)

Introduction

Pose Estimation
-A computer vision task that represents the orientation of a person in a graphical format.
-Widely applied to predict a person's body parts or joint positions.
[Figure: HPE taxonomy: 2D / 3D; Single Person / Multi Person; Direct Regression / HeatMap Regression]
Direct Regression
-Regresses the key body points directly from the feature maps.
HeatMap Regression
-Estimates the probability of the existence of a key point in each pixel of the image.
A Comprehensive Guide on Human Pose Estimation (Analytics Vidhya)
Human Pose Estimation (Open DMQA Seminar)
Human Pose Estimation with Deep Learning (Neuralet)
Deep Learning Based 2D Human Pose Estimation: A Survey (TUP, 2019)
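To make the heatmap-regression idea concrete, here is a minimal NumPy sketch of how keypoint coordinates could be read back from per-keypoint heatmaps by taking the highest-scoring pixel. The function name `decode_heatmaps` and the `(K, H, W)` array layout are assumptions made for illustration; the slides do not describe a specific decoding step.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Pick the most likely pixel for each keypoint.

    heatmaps: array of shape (K, H, W), one map per keypoint (assumed layout).
    Returns (coords, scores): coords has shape (K, 2) as (x, y) pixel indices,
    scores has shape (K,) and holds the heatmap value at that pixel.
    """
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    best = flat.argmax(axis=1)                       # flat index of the peak
    ys, xs = np.unravel_index(best, (height, width)) # back to (row, col)
    coords = np.stack([xs, ys], axis=1)
    scores = flat.max(axis=1)
    return coords, scores

# Toy example: two keypoints on a 4 x 5 grid.
toy = np.zeros((2, 4, 5))
toy[0, 1, 3] = 0.9   # keypoint 0 peaks at x=3, y=1
toy[1, 2, 0] = 0.7   # keypoint 1 peaks at x=0, y=2
coords, scores = decode_heatmaps(toy)
```

A direct-regression model would instead output the `(x, y)` pairs themselves, with no intermediate per-pixel map.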

Pose Estimation
[Figure: HPE taxonomy: 2D / 3D; Single Person / Multi Person; Top-down / Bottom-up]
Top-down
-Detects all individuals in a given image using a human detector module, then estimates the keypoints of each detected person.
Bottom-up
-Detects all keypoints (body parts) in an instance-agnostic manner, then associates the keypoints to build human instances.
Deep Learning Based 2D Human Pose Estimation: A Survey (2019)

Previous Methods
-Most existing methods pass the input through a network, typically consisting of high-to-low resolution sub-networks that are connected in series, and then raise the resolution.
Stacked hourglass
-Recovers the high resolution through a symmetric low-to-high process.
Simple Baseline
-Uses transposed convolutions for low-to-high processing.
Stacked Hourglass Networks for Human Pose Estimation (2016)
Simple Baselines for Human Pose Estimation and Tracking (ECCV 2018)

Approach

HRNet
-HRNet starts from a high-resolution subnetwork as the first stage, gradually adds high-to-low resolution subnetworks one by one to form more stages, and connects the multi-resolution subnetworks in parallel.
Previous Methods
-Typically consist of high-to-low resolution sub-networks that are connected in series, and then raise the resolution.
Proposed Method
-Consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion).
[Figure: HRNet architecture, Stage 1 to Stage 4]

Parallel Multi-resolution Subnetworks
Sequential Subnetworks
-Existing networks for pose estimation are built by connecting high-to-low resolution subnetworks in series.
Parallel Subnetworks
-Parallel subnetworks start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
-The resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage and an extra lower one.
-$\mathcal{N}_{sr}$: the subnetwork, $s$: the stage index ($s$th stage), $r$: the resolution index
High-Resolution Representations for Labeling Pixels and Regions (2019)

Repeated Multi-scale Fusion
-HRNet contains four stages with four parallel subnetworks, whose resolution is gradually halved while the width (the number of channels) is accordingly doubled.
-The resolution of the subnetwork with resolution index $r$ is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork.
The Stages of HRNet
-There are four stages. The 1st stage consists of high-resolution convolutions.
-The 2nd (3rd, 4th) stage repeats two-resolution (three-resolution, four-resolution) blocks.
[Figure: HRNet architecture, Stage 1 to Stage 4]
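The halving and doubling rule can be tabulated with a few lines of Python. This is only a sketch: the function name `hrnet_stage_layout` and its parameters are hypothetical, and it applies the rule uniformly to every branch, ignoring any stem or first-stage details of the real network.

```python
def hrnet_stage_layout(base_width, num_stages=4):
    """List the parallel branches of each stage as (relative_resolution, width).

    Stage s has s branches. Branch r (1-based) runs at 1 / 2**(r - 1) of the
    first branch's resolution and, under the doubling rule, at
    base_width * 2**(r - 1) channels. Uniform doubling is an assumption.
    """
    layout = []
    for stage in range(1, num_stages + 1):
        branches = []
        for r in range(1, stage + 1):
            relative_resolution = 1 / 2 ** (r - 1)
            width = base_width * 2 ** (r - 1)
            branches.append((relative_resolution, width))
        layout.append(branches)
    return layout

layout = hrnet_stage_layout(base_width=32)
for stage_index, branches in enumerate(layout, start=1):
    print(f"Stage {stage_index}: {branches}")
```

With `base_width=32`, the last stage lists four branches at full, 1/2, 1/4, and 1/8 resolution, with widths 32, 64, 128, and 256.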

Repeated Multi-scale Fusion
-Exchange units: each subnetwork repeatedly receives the information from other parallel subnetworks.
-Inputs: $\{X_1, X_2, \ldots, X_s\}$
-Outputs: $\{Y_1, Y_2, \ldots, Y_s\}$
-$s$: the stage index ($s$th stage)
-$a(X_i, k)$: up-sampling or down-sampling $X_i$ from resolution $i$ to resolution $k$
-If $i = k$, $a(X_i, k) = X_i$.
-Each output is an aggregation of the input maps: $Y_k = \sum_{i=1}^{s} a(X_i, k)$
-The exchange unit across stages has an extra output map: $Y_{s+1} = a(Y_s, s+1)$
[Figure: exchange unit in Stage 3; legend: down-sampling (stride = 2), up-sampling (nearest-neighbor)]
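The aggregation rule of the exchange unit can be sketched in NumPy as follows. This is a simplified illustration, not the network's actual layers: each map is a single-channel 2-D array, down-sampling is done with 2x2 average pooling at stride 2 in place of a learned strided convolution, and up-sampling is plain nearest-neighbor repetition. The names `resample` and `exchange_unit` are hypothetical.

```python
import numpy as np

def resample(x, i, k):
    """Stand-in for a(X_i, k): move a 2-D map from resolution index i to k."""
    if i == k:
        return x                                  # identity when i == k
    if i < k:
        # Down-sample: halve the resolution (k - i) times.
        # Assumes even height and width at every step.
        for _ in range(k - i):
            h, w = x.shape
            x = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        return x
    # Up-sample: nearest-neighbor repetition by a factor of 2**(i - k).
    factor = 2 ** (i - k)
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def exchange_unit(inputs):
    """Compute every output as the sum over all resampled inputs."""
    s = len(inputs)
    return [
        sum(resample(inputs[i - 1], i, k) for i in range(1, s + 1))
        for k in range(1, s + 1)
    ]

# Three parallel maps, as in Stage 3: 8x8, 4x4, and 2x2.
inputs = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
outputs = exchange_unit(inputs)
```

Each output keeps the resolution of the matching input, so the high-resolution branch receives information from the lower-resolution branches without ever being down-sampled itself.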

Repeated Multi-scale Fusion
-Exchange units: each subnetwork repeatedly receives the information from other parallel subnetworks.
-Each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows:
[Figure: block diagram; labels: exchange unit, convolution unit, stage, block index, resolution index]

Heatmap Estimation
-The ground-truth heatmaps are generated by applying a 2D Gaussian centered on the ground-truth location of each keypoint.
-The mean squared error is applied for regressing the heatmaps.
[Figure: Gaussian heatmap]
UniPose+: A unified framework for 2D and 3D human pose estimation in images and videos (TPAMI 2021)
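A minimal sketch of the two steps above, Gaussian target generation and the mean squared error, might look like this. The function names, the unnormalized peak value of 1, and the default `sigma=1.0` are assumptions for illustration; the slide does not state the standard deviation.

```python
import numpy as np

def gaussian_heatmap(height, width, center_x, center_y, sigma=1.0):
    """Ground-truth heatmap: a 2-D Gaussian centered on the keypoint.

    The peak is left at 1.0 (unnormalized); sigma is in pixels (assumed).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    squared_distance = (xs - center_x) ** 2 + (ys - center_y) ** 2
    return np.exp(-squared_distance / (2 * sigma ** 2))

def mse_loss(predicted, target):
    """Mean squared error between predicted and target heatmaps."""
    return float(np.mean((predicted - target) ** 2))

target = gaussian_heatmap(height=64, width=48, center_x=10, center_y=20)
prediction = np.zeros_like(target)      # a deliberately poor prediction
loss = mse_loss(prediction, target)
```

The loss is zero only when the predicted map matches the Gaussian target at every pixel, so the network is pushed to place its peak on the labeled keypoint.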
Network Instantiation
-The network is instantiated by following the design rule of ResNet.
-The resolution is gradually decreased to a half, and accordingly, the width (the number of channels) is increased to double.
The type of nets and the widths (C) of the high-resolution subnetworks in the last three stages
-HRNet-W32: C = 32; widths of the other three parallel subnetworks: 64, 128, 256
-HRNet-W48: C = 48; widths of the other three parallel subnetworks: 96, 192, 384

Question!

Experiments

Dataset
COCO Keypoint Dataset
-17 keypoints
-200K images and 250K person instances
-Train, validation, test set
-Metric: OKS, AP
MPII Human Pose Dataset
-16 keypoints
-25K images with 40K subjects
-Train, test set
-Metric: PCKh

Evaluation Metric: Object Keypoint Similarity (OKS)
$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
-$d_i$: Euclidean distance between each ground-truth keypoint and the corresponding detected keypoint
-$v_i$: visibility flag of the ground truth
 $v_i = 0$: not labeled
 $v_i = 1$: labeled but not visible
 $v_i = 2$: labeled and visible
-$s \cdot k_i$: scale * keypoint constant, the standard deviation of this Gaussian
-Perfect predictions will have OKS = 1.
-Predictions with wrong keypoints will have OKS ~ 0.
[Figure: constants for joints]
-$AP^{50}$: AP at OKS = 0.50
-$AP$ ($AP^{M}$, $AP^{L}$): the mean of AP scores at 10 positions* (for medium objects and large objects)
-$AR$: average recall at 10 positions*
*10 positions: OKS = 0.50, 0.55, ..., 0.90, 0.95
Towards Accurate Multi-person Pose Estimation in the Wild (CVPR 2017)
Pose Estimation. Metrics. (stasiuk.medium.com)
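The OKS definition on this slide can be sketched in NumPy for a single person. The function name `oks` and its argument layout are hypothetical, and the values used for the per-keypoint constants in the example are made up for illustration, not the official COCO constants. The scale `s` is assumed to be the square root of the object area.

```python
import numpy as np

def oks(predicted, ground_truth, visibility, scale, keypoint_constants):
    """Object Keypoint Similarity for one person.

    predicted, ground_truth: (N, 2) arrays of (x, y) coordinates.
    visibility: (N,) flags; 0 = not labeled, 1 = labeled but not visible,
                2 = labeled and visible.
    scale: object scale s (assumed: square root of the object area).
    keypoint_constants: (N,) per-keypoint constants k_i.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    visibility = np.asarray(visibility)
    k = np.asarray(keypoint_constants, dtype=float)

    labeled = visibility > 0                   # delta(v_i > 0)
    if not labeled.any():
        return 0.0                             # nothing to compare against
    squared_distance = np.sum((predicted - ground_truth) ** 2, axis=1)
    similarity = np.exp(-squared_distance / (2 * scale ** 2 * k ** 2))
    return float(similarity[labeled].mean())

gt = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
flags = np.array([2, 1, 0])                    # third keypoint is unlabeled
constants = np.array([0.1, 0.1, 0.1])          # illustrative values only
perfect = oks(gt, gt, flags, scale=100.0, keypoint_constants=constants)
wrong = oks(gt + 1000.0, gt, flags, scale=100.0, keypoint_constants=constants)
```

Here `perfect` evaluates to 1 and `wrong` is effectively 0, matching the two limiting cases listed on the slide. Unlabeled keypoints do not affect the score.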

Evaluation Metric: Head-normalized Probability of Correct Keypoint (PCKh)
-A detected joint is considered 'correct' if the distance between the predicted and the true joint is within a certain threshold.
-PCKh uses head size instead of bounding box size.
-PCKh@0.5 is when the threshold = 50% of the head bone link.
[Figure: PCK vs. PCKh]
Articulated Human Pose Estimation with Flexible Mixtures of Parts (CVPR 2011)
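As a rough illustration of the thresholding rule, PCKh for one person could be computed as below. The function name `pckh` is hypothetical, and `head_size` is assumed to be supplied in pixels; how the head size is measured on a given dataset is not covered here.

```python
import numpy as np

def pckh(predicted, ground_truth, head_size, threshold=0.5):
    """Fraction of joints whose error is within threshold * head_size.

    predicted, ground_truth: (N, 2) arrays of (x, y) joint positions.
    head_size: head size in pixels (assumed to be given).
    threshold: 0.5 corresponds to PCKh@0.5.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    distance = np.linalg.norm(predicted - ground_truth, axis=-1)
    correct = distance <= threshold * head_size
    return float(correct.mean())

gt_joints = np.array([[0, 0], [10, 0], [20, 0], [30, 0]])
pred_joints = np.array([[1, 0], [10, 4], [20, 6], [30, 0]])
# Head size 10 px gives a 5 px tolerance; joint errors are 1, 4, 6, and 0 px.
score = pckh(pred_joints, gt_joints, head_size=10.0)
```

Three of the four joints fall within the 5 px tolerance, so `score` is 0.75. Replacing `head_size` with a bounding-box-based size would turn the same routine into plain PCK.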

COCO Keypoint Detection
-HRNet is significantly better than bottom-up approaches.
-It outperforms all the other top-down approaches and is more efficient in terms of model size and computation complexity.

MPII Human Pose Estimation
-The result is the best one among the previously-published results on the leaderboard of Nov. 16th, 2018.
-HRNet-W32 achieves a 92.3 PCKh@0.5 score and outperforms the stacked hourglass approach and its extensions.

Ablation Study
-Figure 5 implies that the resolution does impact the keypoint prediction quality.
-Figure 6 implies that the improvement for the smaller input size is more significant than for the larger input size.

Conclusion

Conclusion
-The proposed method maintains the high resolution through the whole process without the need to recover the high resolution.
-It fuses multi-resolution representations repeatedly, rendering reliable high-resolution representations.
-HRNet performs well in several computer vision tasks, including human pose estimation, semantic segmentation, and object detection.

Question!

Sources
Presenter
-Seonok Kim ([email protected])
Paper
-Deep High-Resolution Representation Learning for Human Pose Estimation (CVPR 2019)
-High-Resolution Representations for Labeling Pixels and Regions (arXiv:1904.04514)
-Towards Accurate Multi-person Pose Estimation in the Wild (CVPR 2017)
-Articulated Human Pose Estimation with Flexible Mixtures of Parts (CVPR 2011)
-UniPose+: A unified framework for 2D and 3D human pose estimation in images and videos (TPAMI 2021)
YouTube
-[Paper Review] Deep High-Resolution Representation Learning for Human Pose Estimation (https://www.youtube.com/watch?v=w39bjQxm1eg)
Blog
-Pose Estimation. Metrics. (https://stasiuk.medium.com/pose-estimation-metrics-844c07ba0a78)