Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation

Yu Qi²,¹* Yuanchen Ju¹,³* Tianming Wei¹,⁴ Chi Chu¹ Lawson L.S. Wong² Huazhe Xu¹,³,⁵

¹Shanghai Qi Zhi Institute ²Northeastern University ³IIIS, Tsinghua University

⁴Shanghai Jiao Tong University ⁵Shanghai AI Lab

*Equal contribution

🌷 CVPR 2025 🧩

Abstract

3D assembly tasks, such as furniture assembly and component fitting, play a crucial role in daily life and represent essential capabilities for future home robots. Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts, and fall short of capturing the complexity of everyday object interactions and assemblies. To bridge this gap, we present 2BY2, a large-scale annotated dataset for everyday pairwise object assembly, covering 18 fine-grained tasks that reflect real-life scenarios, such as inserting plugs into sockets, arranging flowers in vases, and inserting bread into toasters. The 2BY2 dataset includes 1,034 instances and 517 pairwise objects with pose and symmetry annotations, requiring approaches that align geometric shapes while accounting for the functional and spatial relationships between objects. Leveraging the 2BY2 dataset, we propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints. Compared to previous shape assembly methods, our approach achieves state-of-the-art performance across all 18 tasks in the 2BY2 dataset. Robot experiments further validate the reliability and generalization ability of our method on complex 3D assembly tasks.
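To make the terms "SE(3) pose" and "equivariant feature" concrete, here is a minimal sketch in plain Python. It is not the method from the paper: the helper names (`rot_z`, `apply_pose`, `centroid`, `compose`, `almost_equal`) are hypothetical, the point-cloud centroid stands in for a learned feature, and "two-step" here only means composing two rigid transforms. The sketch shows that the centroid transforms along with the input under a rigid motion, and that applying two poses in sequence is equivalent to applying one composed pose.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis (row-major nested lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(R, t, points):
    """Apply the rigid transform x -> R x + t to each 3D point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def centroid(points):
    """Mean of a point cloud; a toy stand-in for an equivariant feature."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def compose(R2, t2, R1, t1):
    """Single pose equivalent to applying (R1, t1) first, then (R2, t2)."""
    R = [[sum(R2[i][k] * R1[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = tuple(sum(R2[i][k] * t1[k] for k in range(3)) + t2[i] for i in range(3))
    return R, t

def almost_equal(a, b, tol=1e-9):
    """Component-wise comparison of two 3D tuples up to a tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

# Hypothetical toy point cloud and poses, for illustration only.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
R1, t1 = rot_z(math.pi / 2), (1.0, 2.0, 3.0)
R2, t2 = rot_z(math.pi / 6), (0.0, -1.0, 0.5)

# Equivariance: feature of the transformed cloud == transformed feature.
feature_after = centroid(apply_pose(R1, t1, points))
feature_moved = apply_pose(R1, t1, [centroid(points)])[0]
equivariant = almost_equal(feature_after, feature_moved)

# Two sequential poses collapse into one composed SE(3) pose.
Rc, tc = compose(R2, t2, R1, t1)
sequential = apply_pose(R2, t2, apply_pose(R1, t1, points))
composed = apply_pose(Rc, tc, points)
consistent = all(almost_equal(a, b) for a, b in zip(sequential, composed))
```

Both `equivariant` and `consistent` should evaluate to `True` for any choice of rotation and translation, since the centroid is an affine function of the points.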

Dataset Comparison

We compare the 2BY2 dataset with existing datasets and benchmarks. #OC stands for the number of object categories. #OS stands for the number of object shapes. Pair denotes whether the dataset is pairwise. Task Number refers to the number of distinct assembly tasks, with the assembly of fractured pieces considered as a single task. Task Hierarchy stands for the different categories of task, from coarse to fine. Everyday Scenario indicates whether the assembly task has practical significance in real-world human applications. Symmetry denotes whether the dataset contains part symmetry annotations.

Task Diversity Visualization

The image shows selected objects from four different tasks: USB, Bottle, Letter, and Plug in Socket. Objects from the training set are shown on the left, and objects from the testing set are shown on the right. As seen in the legend, object geometry varies in both the training and testing sets, with the testing set containing novel shapes not seen in the training set.

Our Pipeline

Results Visualization

We highlight Bottle, Plug, Bread, Letter, Childrentoy, Key, and Flower tasks to demonstrate our improved translation and rotation predictions compared to baseline methods.

Real Robot Experiments

We conduct real-world robot experiments on Cup, Flower, Bread and Plug tasks.

Cup

Flower

Bread

Plug

Contact

If you have any questions, please feel free to contact us:

Yu Qi: qi.yu2@northeastern.edu
Yuanchen Ju: juuycc0213@gmail.com

Citation

If you find this project helpful, please cite us:

@article{qi2025two,
  title={Two by two: Learning multi-task pairwise objects assembly for generalizable robot manipulation},
  author={Qi, Yu and Ju, Yuanchen and Wei, Tianming and Chu, Chi and Wong, Lawson LS and Xu, Huazhe},
  journal={CVPR 2025},
  year={2025}
}