Language-Conditioned Affordance-Pose Detection in 3D Point Clouds
Toan Nguyen, Minh Nhat Vu, Baoru Huang, Tuan Van Vo, Vy Truong, Ngan Le, Thieu Vo, Bac Le, Anh Nguyen
Introduction
We address the task of language-conditioned affordance-pose joint learning in 3D point clouds. Given a 3D point cloud of an object and an open-vocabulary affordance text query, we detect the corresponding affordance region and generate appropriate 6-DoF poses.
Key Contributions
3DAPNet, a novel and effective method for affordance-pose joint learning.
3DAP, a high-quality dataset of 3D point cloud objects with affordance language labels and affordance-specific 6-DoF poses.
Our method is useful in several real-world robotic manipulation tasks.
3DAP Dataset
We collect affordance-annotated point clouds from 3D AffordanceNet [1]. Affordances are expressed as natural language texts. We use 6-DoF GraspNet [2] to generate pose candidates and then manually select the poses suitable for each affordance.
[1] Deng et al., 3D AffordanceNet: A benchmark for visual object affordance understanding. In CVPR 2021.
[2] Mousavian et al., 6-DOF GraspNet: Variational grasp generation for object manipulation. In ICCV 2019.
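As an illustration of the kind of annotation the pipeline above produces, the sketch below builds one toy sample pairing a point cloud with a language affordance label, a per-point affordance region, and affordance-specific 6-DoF poses. The field names and the translation-plus-quaternion pose encoding are assumptions for illustration, not the dataset's actual schema.

```python
import numpy as np

# Hypothetical 3DAP-style sample; all field names and shapes here are
# illustrative assumptions, not the released dataset format.
rng = np.random.default_rng(42)

sample = {
    "points": rng.standard_normal((2048, 3)),   # object point cloud (N, 3)
    "affordance_text": "pourable",              # open-vocabulary language label
    "affordance_mask": rng.random(2048) > 0.8,  # per-point affordance region
    "poses": rng.standard_normal((10, 7)),      # manually selected 6-DoF poses,
                                                # encoded as translation (3)
                                                # + quaternion (4)
}
```

In practice the pose candidates would come from 6-DoF GraspNet and the mask from 3D AffordanceNet's annotations; the random arrays here only stand in for their shapes.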
3DAPNet
Our 3DAPNet consists of two branches: 1. Open-vocabulary affordance detection branch: detects unrestricted affordances. 2. Language-conditioned pose generation branch: produces poses conditioned on both the point cloud and the input text. Our ContextNet fuses the two conditions for pose generation.
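The two-branch design can be sketched as follows. This is a minimal NumPy mock-up of the interfaces only: the encoder, branch, and fusion functions (including the names `context_net` and `pose_branch`) are stand-ins I introduce for illustration, not the paper's actual architecture or code.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_points(points):
    # Stand-in point-cloud encoder: maps (N, 3) points to (N, 16) features.
    return np.tanh(points @ rng.standard_normal((3, 16)))

def encode_text(text):
    # Stand-in open-vocabulary text encoder: one 16-d embedding per query.
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(16)

def affordance_branch(point_feats, text_emb):
    # Branch 1: per-point affordance score via similarity with the text.
    logits = point_feats @ text_emb
    return 1.0 / (1.0 + np.exp(-logits))          # (N,) scores in (0, 1)

def context_net(point_feats, text_emb):
    # ContextNet stand-in: fuse the point-cloud and text conditions
    # into a single context vector.
    return np.concatenate([point_feats.mean(axis=0), text_emb])

def pose_branch(context, num_poses=4):
    # Branch 2: generate 6-DoF pose parameters conditioned on the context.
    g = np.random.default_rng(int(abs(context.sum()) * 1e6) % (2**32))
    return g.standard_normal((num_poses, 6))

points = rng.standard_normal((128, 3))            # toy point cloud
feats = encode_points(points)
text_emb = encode_text("grasp")
affordance = affordance_branch(feats, text_emb)   # per-point affordance map
poses = pose_branch(context_net(feats, text_emb)) # (4, 6) pose parameters
```

The point is the data flow: both branches share the point features, and only the pose branch consumes the fused context, mirroring the description above.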