XL-VLA: Cross-Hand Latent Representation for Vision-Language-Action Models
Guangqi Jiang*, Yutong Liang*, Jianglong Ye, Jia-Yang Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang†, Xueyan Zou†
Takeaway: An embodiment-invariant latent action space improves performance as demonstrations scale across different hand embodiments, much as performance scales with additional data from a single hand.
@article{jiang2026cross,
  title={Cross-Hand Latent Representation for Vision-Language-Action Models},
  author={Jiang, Guangqi and Liang, Yutong and Ye, Jianglong and Huang, Jia-Yang and Jing, Changwei and Duan, Rocky and Abbeel, Pieter and Wang, Xiaolong and Zou, Xueyan},
  journal={arXiv preprint},
  year={2026}
}

