[NS][Lab_Seminar_240805]CheckerPose.pptx


About This Presentation

CheckerPose: Progressive Dense Keypoint Localization for Object Pose Estimation with Graph Neural Network


Slide Content

CheckerPose: Progressive Dense Keypoint Localization for Object Pose Estimation with Graph Neural Network
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/08/05
Ruyi Lian et al., ICCV 2023

Introduction
Object pose estimation from RGB images aims to estimate the rotation and translation of a given rigid object.
Most existing methods estimate an intermediate geometric representation (correspondences between 3D object keypoints and 2D image locations), then recover the object pose with a Perspective-n-Point (PnP) algorithm. However, sparse correspondences degrade easily under occlusion, background clutter, and lighting variation.
Other methods densely sample 2D image pixels and predict 3D object coordinates. However, these predictions consider only visible pixels and ignore the global relations between visible and occluded keypoints, making them unstable.
To overcome these issues, the paper proposes a 6D pose estimation algorithm that improves dense correspondences with three cooperative components: dense 3D keypoint sampling, progressive 2D localization via binary coding, and shape prior encoding with a GNN.
Figure 1: CheckerPose

Method: Problem formulation and method overview
Given an RGB image I and a rigid object O, the goal is to estimate the rotation R and translation t.
Adopt a two-stage pipeline for object pose estimation: predict the 2D projection p for each keypoint P, then regress the rotation and translation from the 3D-2D correspondences via a PnP solver.
Use an object detector to detect the object bounding box and extract the RoI I_O.
Process I_O with a backbone network to obtain the backbone feature F_I(0) and the keypoint embedding F_G(0) on a k-NN graph G.
Use graph network layers (EdgeConv) to localize the keypoints, i.e., to predict the codes (b_v, b_x, b_y).
Use a CNN decoder to transform F_I(0) into a series of image feature maps, and fuse these features into the GNN based on the currently predicted locations.
The CNN decoder also outputs an object segmentation mask M as an auxiliary learning task.
Convert the binary codes to 2D coordinates and use a PnP solver to recover the pose from the established correspondences, as sketched below.
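As a hedged illustration of the final step only, the sketch below recovers R and t from dense 3D-2D correspondences with OpenCV's generic RANSAC-based PnP solver; the function and variable names are ours, and the paper's own solver configuration may differ.

```python
import cv2
import numpy as np

def recover_pose(pts_3d, pts_2d, K):
    """Recover rotation R and translation t from 3D-2D correspondences.

    pts_3d: (N, 3) keypoint coordinates in the object frame
    pts_2d: (N, 2) predicted 2D projections in the full image
    K:      (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64),
        pts_2d.astype(np.float64),
        K.astype(np.float64),
        None)                        # no lens distortion assumed
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)       # convert axis-angle to a 3x3 rotation matrix
    return R, tvec
```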

Method: Problem formulation and method overview
Figure 2: Framework

Method: Hierarchical representation of 2D keypoints
Localize a dense set of predefined 3D keypoints P in the 2D image plane.
First predict whether each 2D projection appears in the RoI, then localize the keypoints inside I_O, denoted P_I.
Superpose a 2^d × 2^d grid S on I_O and predict which cell s contains the 2D projection p.

Method: Hierarchical representation of 2D keypoints
Iteration j (2 ≤ j ≤ d): refine the localization by one additional bit per axis, doubling the grid resolution at each step.
Use binary codes to represent the localization, following ZebraPose [56]: the k-th bit of b_x indicates which half of the current cell along the x-axis contains the projection (and analogously for b_y).
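Below is a minimal sketch of this hierarchical indexing, assuming a coordinate already normalized to [0, 1) within the RoI; the helper names are hypothetical and the paper's encoding details may differ.

```python
def encode_coordinate(x, d):
    """Hierarchically encode a coordinate x in [0, 1) into d bits.

    Bit j tells which half of the current cell contains x, so the first
    j bits identify one of 2**j cells along this axis (an illustration
    of the grid-indexing idea, not the paper's exact code).
    """
    bits = []
    lo, hi = 0.0, 1.0
    for _ in range(d):
        mid = 0.5 * (lo + hi)
        if x < mid:
            bits.append(0)
            hi = mid          # keep the lower half of the current cell
        else:
            bits.append(1)
            lo = mid          # keep the upper half of the current cell
    return bits

def decode_bits(bits):
    """Map d bits back to the center of the selected cell."""
    lo, hi = 0.0, 1.0
    for b in bits:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if b == 0 else (mid, hi)
    return 0.5 * (lo + hi)

# Example: d = 7 bits corresponds to a 128 x 128 grid over the RoI
bits = encode_coordinate(0.37, d=7)
print(bits, decode_bits(bits))
```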

Method: Dense keypoint localization via GNN
Modeling interactions among the keypoints P is crucial for predicting their 2D locations.
Utilize a GNN to process the features F of the N keypoints P: each keypoint P_i is a node, connected to its k nearest neighbors in 3D Euclidean space to generate edges.
Adopt EdgeConv [68] as the graph network layer. The m-th channel of the edge feature is $e_{ij}^{m} = \mathrm{ReLU}\big(\theta_m \cdot (f_j - f_i) + \phi_m \cdot f_i\big)$, where $\theta_m$ and $\phi_m$ are the weights of the m-th filter. The feature of P_i is then updated by aggregating the edge features over its neighbors, e.g., $f_i^{m} = \max_{j:(i,j)\in E} e_{ij}^{m}$.
By stacking multiple EdgeConv operations, the network can gradually learn non-local interactions among the dense keypoints in a computationally efficient way.
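A minimal PyTorch sketch of such an EdgeConv-style layer over the keypoint k-NN graph is shown below; tensor shapes and module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Minimal EdgeConv-style layer (after Wang et al. [68]); a sketch only."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # One linear filter per output channel over [f_i, f_j - f_i]
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, feats, knn_idx):
        """feats: (N, C) keypoint features; knn_idx: (N, k) neighbor indices
        from the k-NN graph built in 3D object space."""
        n, k = knn_idx.shape
        f_i = feats.unsqueeze(1).expand(n, k, feats.size(1))   # center features
        f_j = feats[knn_idx]                                   # neighbor features
        edge = self.mlp(torch.cat([f_i, f_j - f_i], dim=-1))   # (N, k, out_dim)
        return edge.max(dim=1).values                          # channel-wise max over neighbors
```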

Method: Training
Use the binary cross-entropy (BCE) loss.
For the d-bit index codes b_x and b_y, since we only localize the keypoints inside the RoI (i.e., b_v = 1), denoted P_I, we compute the binary cross-entropy loss over each bit of b_x (and analogously b_y) for the keypoints in P_I.
The overall loss function combines the visibility loss, the index-code losses for b_x and b_y, and the segmentation mask loss.
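A hedged sketch of the index-code term, assuming per-keypoint logits and a visibility mask as inputs; the tensor layout and weighting are our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def index_code_loss(pred_bx_logits, gt_bx, visible):
    """BCE over the d-bit x-index codes, counting only keypoints inside the RoI.

    pred_bx_logits: (N, d) raw logits for the x-index bits
    gt_bx:          (N, d) ground-truth bits in {0, 1}
    visible:        (N,)   1 if the keypoint projects inside the RoI (b_v = 1)
    """
    per_bit = F.binary_cross_entropy_with_logits(
        pred_bx_logits, gt_bx.float(), reduction="none")        # (N, d)
    per_bit = per_bit * visible.float().unsqueeze(-1)            # mask out keypoints outside the RoI
    return per_bit.sum() / visible.float().sum().clamp(min=1)
```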

Experiments: Datasets
Linemod (LM): 13 sequences of real images with ground-truth poses for a single object, with background clutter and mild occlusion; each sequence contains around 1,200 images.
Linemod-Occlusion (LM-O): 1,214 images from one sequence of LM, with ground-truth poses of 8 objects under partial occlusion.
YCB-V: more than 110,000 real images of 21 objects with severe occlusion and clutter.

Experiments: SOTA comparison
Table 2: Comparison with SOTA methods on the LM-O dataset