XL-VLA: Cross-Hand Latent Representation for Vision-Language-Action Models
Guangqi Jiang*, Yutong Liang*, Jianglong Ye, Jia-Yang Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang†, Xueyan Zou†
Takeaway: An embodiment-invariant latent action space improves performance as demonstrations are scaled across different hand embodiments, much as performance scales with additional data from a single hand.
@misc{jiang2026crosshandlatentrepresentationvisionlanguageaction,
  title={Cross-Hand Latent Representation for Vision-Language-Action Models},
  author={Guangqi Jiang and Yutong Liang and Jianglong Ye and Jia-Yang Huang and Changwei Jing and Rocky Duan and Pieter Abbeel and Xiaolong Wang and Xueyan Zou},
  year={2026},
  eprint={2603.10158},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.10158},
}