Current robot learning algorithms for acquiring novel skills often rely on demonstration datasets or environment interactions, which incur high labor costs and potential safety risks. To address these challenges, this study proposes a skill-learning framework that enables robots to acquire novel skills from natural language instructions. The proposed pipeline leverages vision-language models to generate demonstration videos of novel skills, which an inverse dynamics model processes to extract actions from these otherwise unlabeled demonstrations. The extracted actions are then mapped to environmental contexts via imitation learning, enabling robots to learn new skills effectively.
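The sketch below illustrates the three-stage pipeline described above (language-conditioned video generation, inverse-dynamics action labeling, and imitation learning) in simplified form. It is a minimal, hypothetical example only: the class names, interfaces, and the random stand-in for the video-generation stage are assumptions for illustration and do not reflect the authors' released code or model architectures.

```python
# Hypothetical sketch of the pipeline: language -> generated demo -> pseudo-actions -> policy.
# All interfaces here are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Predicts the action linking two consecutive frame features (assumed interface)."""
    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs_t: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs_t, obs_next], dim=-1))


class Policy(nn.Module):
    """Maps an observation to an action; trained by behavior cloning on pseudo-labels."""
    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def generate_demo_frames(instruction: str, num_frames: int, obs_dim: int) -> torch.Tensor:
    """Placeholder for the language-conditioned video-generation stage.

    In practice this would call a vision-language / text-to-video model; here it
    returns random frame features so the sketch runs end to end.
    """
    torch.manual_seed(abs(hash(instruction)) % (2 ** 31))
    return torch.randn(num_frames, obs_dim)


def learn_skill(instruction: str, obs_dim: int = 64, action_dim: int = 7) -> Policy:
    # 1) Language instruction -> generated demonstration (frame features).
    frames = generate_demo_frames(instruction, num_frames=100, obs_dim=obs_dim)

    # 2) Inverse dynamics model labels consecutive frame pairs with pseudo-actions.
    idm = InverseDynamicsModel(obs_dim, action_dim)  # assumed pretrained in practice
    with torch.no_grad():
        pseudo_actions = idm(frames[:-1], frames[1:])

    # 3) Imitation learning (behavior cloning) on (observation, pseudo-action) pairs.
    policy = Policy(obs_dim, action_dim)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(200):
        loss = nn.functional.mse_loss(policy(frames[:-1]), pseudo_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy


if __name__ == "__main__":
    skill_policy = learn_skill("open the drawer")
```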
@misc{jin2024learningnovelskillslanguagegenerated,
      title={Learning Novel Skills from Language-Generated Demonstrations},
      author={Ao-Qun Jin and Tian-Yu Xiang and Xiao-Hu Zhou and Mei-Jiang Gui and Xiao-Liang Xie and Shi-Qi Liu and Shuang-Yi Wang and Yue Cao and Sheng-Bin Duan and Fu-Chao Xie and Zeng-Guang Hou},
      year={2024},
      eprint={2412.09286},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.09286},
}