VLA Model-Expert Collaboration for Bi-directional Manipulation Learning

Institute of Automation, Chinese Academy of Sciences

Abstract

The emergence of vision-language-action (VLA) models has given rise to foundation models for robot manipulation. Although these models have achieved significant improvements, their generalization in multi-task manipulation remains limited. This study proposes a VLA model-expert collaboration framework that leverages a limited number of expert actions to enhance VLA model performance. This approach reduces expert workload relative to manual operation while simultaneously improving the reliability and generalization of VLA models. Furthermore, the manipulation data collected during collaboration can further refine the VLA model, while human participants concurrently enhance their skills. This bi-directional learning loop boosts the overall performance of the collaboration system. Experimental results across various VLA models demonstrate the effectiveness of the proposed system in collaborative manipulation and learning, as evidenced by improved success rates across tasks. Additionally, validation with a brain-computer interface (BCI) shows that the collaboration system improves the efficiency of low-speed action systems by involving the VLA model during manipulation. These promising results pave the way for advancing human-robot interaction in the era of foundation models for robotics.


Figure 1. The proposed VLA model-expert collaboration system integrates a VLA model and expert interactions to enhance manipulation. The VLA model generates actions by processing task instructions as text tokens and environmental inputs as vision tokens. Meanwhile, the expert makes decisions at a lower frequency, assisting the VLA model. Expert-executed actions are collected to fine-tune the VLA model, improving system performance.


Figure 2. Collaboration pipeline between VLA model and expert for manipulation and learning.
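The pipeline in Figure 2 can be summarized as a simple control loop: the VLA model acts at every step, the expert intervenes at a lower frequency, and the expert-executed transitions are stored for later fine-tuning. The sketch below illustrates this loop in Python under assumed interfaces; `env`, `vla_policy`, `expert_policy`, and the fine-tuning helper are illustrative placeholders, not the paper's actual codebase.

```python
def collaborate(env, vla_policy, expert_policy, instruction,
                expert_period=10, max_steps=200):
    """Run one episode of VLA model-expert collaboration (sketch).

    The VLA model proposes actions from the observation and the task
    instruction at every step; the expert intervenes every
    `expert_period` steps. Expert-executed transitions are logged so
    they can later be used to fine-tune the VLA model.
    """
    obs = env.reset()
    expert_data = []  # (observation, instruction, expert action) tuples

    for t in range(max_steps):
        if t % expert_period == 0:
            # Low-frequency expert decision overrides the VLA proposal.
            action = expert_policy(obs)
            expert_data.append((obs, instruction, action))
        else:
            # VLA model maps vision tokens + text tokens to an action.
            action = vla_policy(obs, instruction)

        obs, reward, done, info = env.step(action)
        if done:
            break

    return expert_data


def finetune(vla_policy, expert_data, lr=1e-4):
    """Placeholder behavior-cloning update of the VLA model on the
    expert transitions collected during collaboration."""
    for obs, instruction, action in expert_data:
        vla_policy.update(obs, instruction, action, lr=lr)
```

Collected episodes of `expert_data` close the bi-directional loop: they refine the VLA model, which in turn reduces how often the expert needs to intervene in subsequent episodes.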


Figure 3. Success rates of collaboration between VLA models and a rule-based expert policy under different ratios (VLA/expert) in the MT10 and MT50 benchmarks.
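One simple way to read the VLA/expert ratio is as an interleaving schedule over control steps, e.g. a 4/1 ratio means four VLA steps for every expert step. The snippet below sketches such a schedule; this interpretation is an assumption for illustration, not necessarily the paper's exact protocol.

```python
def choose_actor(step, vla_ratio, expert_ratio):
    """Decide which policy acts at a given step for a fixed
    VLA/expert ratio (illustrative scheduling rule)."""
    cycle = vla_ratio + expert_ratio
    return "vla" if (step % cycle) < vla_ratio else "expert"

# Example: a 4/1 ratio yields the pattern V V V V E V V V V E ...
schedule = [choose_actor(t, 4, 1) for t in range(10)]
```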


Figure 4. Comparison of the baseline VLA model (Octo) and the VLA model after collaborative learning (fine-tuning). The success rates of the fine-tuned VLA model, with and without the rule-based expert policy (V vs. V-R), are presented at the task level (a) and at the average level (b) on the MT10 benchmark.


Figure 5. Application of the collaboration framework in an SSVEP-based BCI: a comparison between pure SSVEP-based control and collaboration between the VLA model and the BCI user. Although in some cases the human participant's policy outperforms the VLA model (steps: 77 vs. 32), the collaboration system substantially improves time efficiency for a given task (time: 15 s vs. 96 s).

BibTeX

@misc{xiang2025vlamodelexpertcollaborationbidirectional,
    title={VLA Model-Expert Collaboration for Bi-directional Manipulation Learning}, 
    author={Tian-Yu Xiang and Ao-Qun Jin and Xiao-Hu Zhou and Mei-Jiang Gui and Xiao-Liang Xie and Shi-Qi Liu and Shuang-Yi Wang and Sheng-Bin Duang and Si-Cheng Wang and Zheng Lei and Zeng-Guang Hou},
    year={2025},
    eprint={2503.04163},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2503.04163}, 
}