¹Shanghai Qi Zhi Institute  ²IIIS, Tsinghua University  ³School of Software, Tsinghua University
⁴Shanghai Jiao Tong University  ⁵Shanghai AI Lab
Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in an understanding of semantic correspondence among objects, which lets us naturally transfer interaction experience from familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast number of human videos on the Internet can serve as a valuable resource: from them, we extract an affordance memory that includes the contact points.
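The affordance memory can be pictured as a store of (image embedding, contact point) pairs harvested from human videos, queried by visual similarity. The sketch below is purely illustrative; the class and method names are assumptions, not the paper's API, and the embeddings would in practice come from a pretrained visual encoder rather than toy vectors.

```python
import numpy as np

class AffordanceMemory:
    """Hypothetical sketch: pair each source image's embedding with the
    2-D contact point observed in a human video, and retrieve the
    nearest entry by cosine similarity."""

    def __init__(self):
        self.embeddings = []   # 1-D feature vectors, one per source image
        self.contacts = []     # (x, y) contact points in those images

    def add(self, embedding, contact_point):
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.contacts.append(tuple(contact_point))

    def retrieve(self, query_embedding):
        # Cosine similarity between the query and every stored embedding.
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = [float(e @ q / np.linalg.norm(e)) for e in self.embeddings]
        best = int(np.argmax(sims))
        return self.contacts[best], sims[best]

# Toy usage with 3-D stand-in "embeddings":
memory = AffordanceMemory()
memory.add([1.0, 0.0, 0.0], (120, 85))   # e.g. a mug grasped at the handle
memory.add([0.0, 1.0, 0.0], (40, 200))   # e.g. a drawer pulled at the knob
point, score = memory.retrieve([0.9, 0.1, 0.0])
print(point)  # (120, 85): the mug-like entry is the closest match
```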
Inspired by this natural way of human thinking, we propose Robo-ABC. Through our framework, robots can generalize to manipulate out-of-category objects in a zero-shot manner, without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC improves the accuracy of visual affordance retrieval by a large margin of 31.6% over state-of-the-art end-to-end affordance models. In real-world experiments on cross-category object-grasping tasks, Robo-ABC achieves a success rate of 85.7%, demonstrating its capacity for real-world manipulation.
We demonstrate the generalization capabilities of Robo-ABC across object categories and viewpoints in the real world. For deployment on real robots, we select the grasp pose from the candidate poses generated by AnyGrasp.
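One simple way to pick among the candidates, sketched below under assumptions: each candidate is reduced to a 3-D grasp center (AnyGrasp actually returns full 6-DoF poses with quality scores, which this toy omits), and we choose the candidate nearest to the 3-D point corresponding to the predicted affordance pixel.

```python
import numpy as np

def select_grasp(candidates, affordance_point):
    """Return the index of the candidate grasp center closest to the
    predicted affordance point (both in the same 3-D frame).
    Illustrative only; not the AnyGrasp API."""
    candidates = np.asarray(candidates, dtype=float)
    target = np.asarray(affordance_point, dtype=float)
    dists = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(dists))

# Toy candidates (meters) and a hypothetical affordance point:
grasps = [(0.30, 0.10, 0.05), (0.28, 0.12, 0.20), (0.60, -0.05, 0.10)]
idx = select_grasp(grasps, (0.29, 0.11, 0.18))
print(idx)  # 1: the second candidate lies nearest the affordance point
```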
We aim to showcase our method's ability to generalize the affordance of a small set of seen objects to diverse objects beyond their category. To this end, we fix one category of source images and provide the contact points derived from human videos. For each object from the other categories, we apply the same semantic correspondence procedure of Robo-ABC to obtain the target affordance.
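The transfer step can be sketched as dense feature matching: the feature at the source contact pixel is compared against every pixel of the target image's feature map, and the best match becomes the inferred affordance point. The feature extractor (e.g. a DINO- or diffusion-based backbone) is not shown, and the function name is an assumption for illustration.

```python
import numpy as np

def transfer_contact(src_feats, tgt_feats, src_xy):
    """Map a contact pixel from a source image to a target image via
    cosine similarity over per-pixel feature maps of shape (H, W, C)."""
    x, y = src_xy
    f = src_feats[y, x]
    f = f / np.linalg.norm(f)
    H, W, C = tgt_feats.shape
    flat = tgt_feats.reshape(-1, C)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    best = int(np.argmax(flat @ f))          # best-matching target pixel
    return (best % W, best // W)             # (x, y) in the target image

# Toy 2x2 feature maps where target pixel (1, 0) matches the source point.
src = np.zeros((2, 2, 3)); src[0, 0] = [1.0, 0.0, 0.0]
tgt = np.zeros((2, 2, 3)) + 1e-6             # small values avoid zero norms
tgt[0, 1] = [0.9, 0.1, 0.0]
print(transfer_contact(src, tgt, (0, 0)))    # (1, 0)
```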
In each group of figures, from left to right, the span of object categories gradually increases. One marker denotes the contact points extracted from human videos, while the other denotes the points inferred by Robo-ABC across object categories.
We report the performance of Robo-ABC and the baselines across object categories on the entire evaluation dataset. In the vast majority of cases, Robo-ABC exhibits superior zero-shot generalization.
If you have any questions, please feel free to contact us:
If you find this project helpful, please cite us: