Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

Yuanchen Ju1,2*  Kaizhe Hu1,2,5*  Guowei Zhang2,3  Gu Zhang1,4,2  Mingrun Jiang2  Huazhe Xu2,5,1

1Shanghai Qi Zhi Institute  2IIIS, Tsinghua University  3School of Software, Tsinghua University 

4Shanghai Jiao Tong University  5Shanghai AI Lab 

*Equal contribution


Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among objects, which naturally transfers the interaction experience of familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast availability of human videos on the Internet may serve as a valuable resource, from which we extract an affordance memory including the contact points.

Teaser Image

Inspired by the natural way humans think, we propose Robo-ABC. Through our framework, robots can generalize to manipulate out-of-category objects in a zero-shot manner without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC improves the accuracy of visual affordance retrieval by 31.6% compared to state-of-the-art end-to-end affordance models. We also conduct real-world experiments on cross-category object-grasping tasks, in which Robo-ABC achieves a success rate of 85.7%, demonstrating its capacity for real-world tasks.
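The retrieval of a stored interaction from the affordance memory can be illustrated with a minimal sketch. It assumes each memory entry has a fixed-length image embedding and a stored contact point, and that similarity is measured by cosine similarity. The embedding model, the memory layout, and all names below (retrieve_from_memory, memory, memory_embs) are hypothetical and are not taken from the Robo-ABC code.

```python
import numpy as np

def retrieve_from_memory(query_emb, memory_embs):
    """Return the index of the memory entry most similar to the query.

    query_emb:   (D,) embedding of the novel object image (assumed given).
    memory_embs: (N, D) embeddings of the objects stored in the memory.
    Similarity is cosine similarity; this choice is an assumption.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q  # (N,) cosine similarity to every memory entry
    return int(np.argmax(sims)), sims

# Hypothetical memory: labels and pixel contact points are made up.
memory = [
    {"label": "mug", "contact_point": (120, 64)},
    {"label": "kettle", "contact_point": (88, 140)},
    {"label": "scissors", "contact_point": (200, 95)},
]
# Toy embeddings standing in for the output of a real image encoder.
memory_embs = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
query_emb = np.array([0.1, 0.9, 0.2])

best, sims = retrieve_from_memory(query_emb, memory_embs)
entry = memory[best]  # stored object whose contact point will be transferred
```

In this toy setup the query is closest to the second entry, so its stored contact point would be the one passed on to the correspondence step.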

Real World Deployment Video

We demonstrate the generalization capabilities of Robo-ABC across object categories and different viewpoints in the real world. For deployment on real robots, we select the grasp pose from the candidate poses generated by AnyGrasp.
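One plausible way to pick a grasp from a set of candidates is sketched below. The selection criterion used here, choosing the candidate whose position is closest to the 3D contact point, is an assumption made for illustration; the text above does not state the criterion. The candidates are plain NumPy arrays rather than AnyGrasp objects, and select_grasp is a hypothetical helper.

```python
import numpy as np

def select_grasp(grasp_positions, contact_point_3d):
    """Pick the candidate grasp nearest to the contact point.

    grasp_positions:  (N, 3) candidate grasp centers in the camera frame
                      (assumed to come from a grasp generator).
    contact_point_3d: (3,) contact point lifted to 3D (assumed given).
    Nearest-distance selection is an assumed criterion.
    """
    dists = np.linalg.norm(grasp_positions - contact_point_3d, axis=1)
    i = int(np.argmin(dists))
    return i, float(dists[i])

# Made-up candidate positions in meters.
candidates = np.array([[0.0, 0.0, 0.0],
                       [0.1, 0.0, 0.5],
                       [0.3, 0.2, 0.4]])
contact = np.array([0.1, 0.05, 0.5])

idx, dist = select_grasp(candidates, contact)
```

A real deployment would likely also weigh grasp quality scores and reachability; those are left out to keep the sketch short.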

Our Pipeline

Affordance Generalization Beyond Categories Visualization

We aim to showcase our method's ability to generalize the affordance of a small group of seen objects to various objects beyond their category. To this end, we fix a category of source images and provide the contact points derived from human videos. For each object from another category, we apply the same semantic correspondence setting as in Robo-ABC to obtain the target affordance.

In each group of figures, from left to right, the span of object categories gradually increases. Two markers are used: one for the contact points extracted from human videos, and one for the points inferred by Robo-ABC across object categories.
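The correspondence step described in this section can be sketched as a nearest-neighbor search in feature space: take the feature vector at the source contact point and find the most similar location in the target image. The sketch assumes dense per-pixel feature maps are already available for both images; which feature extractor produces them is not specified here, and transfer_contact_point is a hypothetical name.

```python
import numpy as np

def transfer_contact_point(src_feat, tgt_feat, src_point):
    """Map a source contact point to the target via feature similarity.

    src_feat, tgt_feat: (H, W, C) dense feature maps (assumed given).
    src_point:          (row, col) contact point in the source image.
    Returns the best-matching (row, col) in the target and the
    (H, W) cosine-similarity map.
    """
    r, c = src_point
    f = src_feat[r, c]
    f = f / (np.linalg.norm(f) + 1e-8)
    t = tgt_feat / (np.linalg.norm(tgt_feat, axis=-1, keepdims=True) + 1e-8)
    sim = t @ f  # cosine similarity at every target location
    best = np.unravel_index(np.argmax(sim), sim.shape)
    return (int(best[0]), int(best[1])), sim

# Toy check: the target is the source shifted by (2, 3) pixels,
# so the contact point should shift by the same amount.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 8, 16))
tgt = np.roll(src, shift=(2, 3), axis=(0, 1))

tgt_point, sim_map = transfer_contact_point(src, tgt, (1, 1))
```

With real images the similarity map is rarely this clean, so taking a single argmax is the simplest possible choice rather than a recommendation.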

Zero-shot Affordance Generalization Visualization

We demonstrate the performance of Robo-ABC and other baselines across various object categories within the entire evaluation dataset. As can be seen, in the vast majority of cases, Robo-ABC exhibits superior zero-shot generalization capabilities.

Robot experiments under cluster setting and cross-view setting

We showcase the deployment pipeline of Robo-ABC in the real world.


If you find this project helpful, please cite us:

@article{ju2024robo,
  title={Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation},
  author={Ju, Yuanchen and Hu, Kaizhe and Zhang, Guowei and Zhang, Gu and Jiang, Mingrun and Xu, Huazhe},
  journal={arXiv preprint arXiv:2401.07487},
  year={2024}
}