DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from One Demo

1IIIS, Tsinghua University 2Tepan Inc. 3Shanghai Qi Zhi Institute 4UC Berkeley 5Stanford University 6Shanghai AI Lab 7Shanghai Jiao Tong University

Abstract

Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object categories.

To this end, we present DenseMatcher, a method capable of computing 3D correspondences between in-the-wild objects that share similar structures. DenseMatcher first computes vertex features by projecting multiview 2D features onto meshes and refining them with a 3D network, and subsequently finds dense correspondences with the obtained features using functional maps. In addition, we craft the first 3D matching dataset that contains colored object meshes across diverse categories.

In our experiments, we show that DenseMatcher significantly outperforms prior 3D matching baselines by 43.5%. We demonstrate the downstream effectiveness of DenseMatcher in (i) robotic manipulation, where it achieves cross-instance and cross-category generalization on long-horizon complex manipulation tasks from observing only one demo; (ii) zero-shot color mapping between digital assets, where appearance can be transferred between different objects with relatable geometry.

Method

DenseMatcher computes dense correspondences between two colored objects via the following stages: (1) 2D feature extraction, (2) 3D feature refinement, and (3) dense correspondence computation.

In the first stage, SD-DINO is used to extract 2D feature maps from different rendered views. The feature for each vertex is then computed by averaging the features retrieved from the views in which it is visible, as shown below.
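This projection-and-averaging step can be sketched as follows. The sketch below is illustrative only: it uses a simple pinhole projection with a front-facing depth check (no occlusion test or bilinear sampling, which a real renderer-based implementation would use), and the function names are our own, not DenseMatcher's API.

```python
import numpy as np

def project_vertices(vertices, cam_pose, intrinsics):
    """Project 3D vertices (V, 3) into one camera view.
    cam_pose: 4x4 world-to-camera matrix; intrinsics: 3x3 matrix.
    Returns (V, 2) pixel coordinates and a (V,) front-facing mask."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    cam_pts = (cam_pose @ homo.T).T[:, :3]
    in_front = cam_pts[:, 2] > 0          # crude visibility: in front of camera
    uv = (intrinsics @ cam_pts.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    return uv, in_front

def multiview_vertex_features(vertices, feature_maps, cam_poses, intrinsics):
    """Average per-view 2D features onto mesh vertices.
    feature_maps: list of (H, W, C) arrays from a 2D backbone (e.g. SD-DINO)."""
    V, C = len(vertices), feature_maps[0].shape[-1]
    feat_sum = np.zeros((V, C))
    count = np.zeros((V, 1))
    for fmap, pose in zip(feature_maps, cam_poses):
        H, W, _ = fmap.shape
        uv, vis = project_vertices(vertices, pose, intrinsics)
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
        v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
        in_img = vis & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                     & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        feat_sum[in_img] += fmap[v[in_img], u[in_img]]
        count[in_img] += 1
    # Vertices seen from several views get the mean feature; unseen ones stay zero.
    return feat_sum / np.clip(count, 1, None)
```

In practice the number of views and the handling of occluded vertices matter; the paper's pipeline renders the mesh, so per-view visibility is exact rather than approximated as above.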

Noisy Multiview Features \( f_\text{multiview} \)

In the second stage, since the multiview feature is noisy and does not exploit geometric information, we concatenate it with the vertex positions and refine it with DiffusionNet, a 3D neural network architecture designed specifically for meshes.
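A minimal sketch of the refinement input/output interface is below. Note the `FeatureRefiner` class and its MLP body are a simplified stand-in of our own devising: the actual DiffusionNet refiner additionally consumes precomputed mesh operators (mass matrix, Laplacian eigenbasis, gradient operators) so that features diffuse along the surface rather than being processed per vertex in isolation.

```python
import torch
import torch.nn as nn

class FeatureRefiner(nn.Module):
    """Stand-in for the DiffusionNet-based refiner: maps concatenated
    (multiview feature, vertex position) inputs to refined per-vertex
    features. The real model also takes mesh spectral operators."""
    def __init__(self, feat_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden),  # +3 for (x, y, z) vertex positions
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, f_multiview, vertices):
        # f_multiview: (V, feat_dim) noisy projected features
        # vertices:    (V, 3) mesh vertex positions
        x = torch.cat([f_multiview, vertices], dim=-1)
        return self.net(x)                    # (V, out_dim) refined features
```

Concatenating positions gives the network an explicit geometric signal; the surface-aware diffusion layers of the real architecture are what let it correct view-projection noise using mesh connectivity.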

Refined Output Features \( f_\text{output} \)

Finally, using the refined features of the two objects, we compute dense correspondences by solving for a functional map between them (with our own novel optimization constraints).
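For intuition, here is the standard least-squares functional map pipeline (without the paper's additional optimization constraints): project per-vertex features into each shape's Laplacian eigenbasis, solve for the spectral map `C`, then recover a dense vertex-to-vertex map by nearest neighbors in the mapped basis. Function names are illustrative, not the released API.

```python
import numpy as np

def solve_functional_map(evecs1, evecs2, feats1, feats2):
    """Least-squares functional map C between two shapes.
    evecs*: (V, k) Laplacian eigenbases; feats*: (V, d) vertex features.
    Solves min_C ||C A - B||_F^2 where A, B are spectral feature coefficients."""
    A = np.linalg.pinv(evecs1) @ feats1   # (k, d) coefficients on shape 1
    B = np.linalg.pinv(evecs2) @ feats2   # (k, d) coefficients on shape 2
    C = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T
    return C                              # (k, k) spectral map

def pointwise_correspondence(evecs1, evecs2, C):
    """Convert the spectral map to dense correspondences: each shape-2
    vertex is matched to the nearest mapped shape-1 basis row."""
    mapped = evecs1 @ C.T                                         # (V1, k)
    d = ((evecs2[:, None, :] - mapped[None, :, :]) ** 2).sum(-1)  # (V2, V1)
    return d.argmin(axis=1)   # shape-1 vertex index for each shape-2 vertex
```

On identical shapes with identical features this recovers the identity map; DenseMatcher's added constraints are what make the solve robust when features are imperfect and shapes differ.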


Our robot deployment pipeline first finds manipulation keypoints on a template object with a hand-object detector, and then transfers them to the target object seen by the robot for downstream manipulation via our proposed DenseMatcher correspondence model.
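Once a dense vertex-to-vertex map is available, the keypoint transfer itself is a simple lookup. A sketch under our own naming (the detector and correspondence model are assumed to be provided externally):

```python
import numpy as np

def transfer_keypoints(template_kp_idx, corr_template_to_target, target_vertices):
    """Transfer manipulation keypoints from a template mesh to the target.
    template_kp_idx:         indices of keypoint vertices on the template,
                             e.g. from a hand-object detector on the demo video.
    corr_template_to_target: (V_template,) array mapping each template vertex
                             to its corresponding target vertex index.
    target_vertices:         (V_target, 3) target mesh vertex positions."""
    target_idx = corr_template_to_target[np.asarray(template_kp_idx)]
    return target_vertices[target_idx]   # 3D keypoint positions on the target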

Robotic Experiments

Our method requires only a single RGB-D human demo video for each task. The robot mimics the human action by transferring keypoints across different object instances and categories, maintaining spatial and semantic consistency throughout tasks that require multiple steps and grasps.

DenseCorr3D Dataset

We release our training and test dataset, DenseCorr3D, which contains 589 colored assets across 23 daily object categories. Since previous 3D matching datasets lack color and cover only a few categories, ours is the first to provide colored meshes across diverse categories.

Our dense correspondence annotations come in the form of semantic groups, which divide the vertices of each asset into corresponding groups that are consistent within each category. Below are some examples.
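To make the annotation scheme concrete, here is a hypothetical layout for one asset; the actual DenseCorr3D file format is not specified here, and the category, group names, and helper function are our own illustration. The key property is that group names are shared by every asset in a category, which is what makes cross-instance correspondence well defined.

```python
import numpy as np

# Hypothetical annotation for one asset: vertices are partitioned into named
# semantic groups, and the group names are consistent across the category.
annotation = {
    "category": "banana",
    "asset": "banana_001",
    "groups": {                 # group name -> vertex indices of this mesh
        "stem": [0, 1, 2, 3],
        "body": [4, 5, 6, 7, 8, 9],
        "tip":  [10, 11],
    },
}

def groups_to_labels(annotation, num_vertices):
    """Flatten the group annotation into a per-vertex integer label array,
    with -1 marking unannotated vertices. Group ids follow sorted names so
    that labels are comparable across assets of the same category."""
    labels = -np.ones(num_vertices, dtype=int)
    for gid, (name, idxs) in enumerate(sorted(annotation["groups"].items())):
        labels[idxs] = gid
    return labels
```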

Colored Objects

Annotations