Introducing OneTwoVLA, a single unified model capable of both reasoning and acting that adaptively switches between the two modes. OneTwoVLA demonstrates superior performance in the following capabilities:
OneTwoVLA excels at handling long-horizon manipulation tasks. It consistently demonstrates the ability to understand the physical scene, generate correct plans, track task progress accurately, and produce precise actions. This allows OneTwoVLA to successfully complete challenging tasks such as hotpot cooking, tomato-egg scramble, and cocktail mixing.
Moreover, co-training with our synthetic embodied reasoning-centric vision-language data enables OneTwoVLA to generalize its planning capabilities to unseen tasks.
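For concreteness, below is a minimal Python sketch of what such an adaptive reason-or-act control loop could look like. The names used here (ModelOutput, the mode flag, the callables) are illustrative assumptions for the sketch, not the released OneTwoVLA interface.

# Minimal sketch of an adaptive reason-or-act control loop. All names below
# (ModelOutput, the `mode` flag, the callables) are illustrative assumptions,
# not the released OneTwoVLA implementation.

from dataclasses import dataclass

@dataclass
class ModelOutput:
    mode: str                      # "reason" or "act"
    reasoning: str | None = None   # updated plan / progress / error analysis
    actions: list | None = None    # chunk of low-level robot actions

def control_loop(model, get_observation, execute, instruction, max_steps=100):
    """Run one unified policy, letting it switch between reasoning and acting."""
    reasoning_context = ""
    for _ in range(max_steps):
        obs = get_observation()
        # The model conditions on the observation, the instruction, and its
        # most recent reasoning text, and first decides which mode to use.
        out: ModelOutput = model(obs, instruction, reasoning_context)
        if out.mode == "reason":
            # Mode 1: update the textual reasoning (scene understanding,
            # planning, progress tracking, error detection) and keep it
            # as context for subsequent steps.
            reasoning_context = out.reasoning
        else:
            # Mode 2: act, producing a short action chunk conditioned on
            # the latest reasoning, and execute it on the robot.
            execute(out.actions)
    return reasoning_context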
Recovering from mistakes is a critical capability for general-purpose robots. OneTwoVLA can detect errors in real time, rapidly reason about recovery strategies, and subsequently generate corrective actions.
To deploy robots in human-centric scenarios, the ability to interact naturally with humans is indispensable. Thanks to its adaptive nature and explicit reasoning process, OneTwoVLA engages with humans naturally, seamlessly handling human interventions and proactively seeking clarification when faced with ambiguity.
Co-training OneTwoVLA with our synthetic embodied reasoning-centric vision-language data endows it with open-world visual grounding capabilities, enabling it to effectively comprehend spatial relationships, object attributes, and semantic features, even for objects unseen during training (e.g., GoPro, Sprite, Starbucks Coffee). The following videos show our robot successfully reaching target objects based on language instructions.
To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable, automatic pipeline for synthesizing embodied reasoning-centric vision-language data without any manual intervention; this data is used for co-training with robot data. The task instructions for each synthetic image fall into two categories: visual grounding tasks and long-horizon planning tasks. We show some examples here:
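For readers curious how such a pipeline might be structured, here is a minimal Python sketch under stated assumptions: source images come from an arbitrary pool, and a vision-language annotator (the query_vlm callable) writes the instruction and reasoning text. The prompts, function names, and output schema are illustrative, not the authors' released code.

# Minimal sketch of an automatic vision-language data synthesis pipeline.
# The prompts, `query_vlm` interface, and output fields are assumptions made
# for illustration, not the authors' released pipeline.

import json
import random

GROUNDING_PROMPT = (
    "Write an instruction that refers to one object in this image by its "
    "spatial relation, attribute, or semantic role, and give the reasoning "
    "plus the target object's location."
)
PLANNING_PROMPT = (
    "Propose a long-horizon task that could start from this scene, and write "
    "step-by-step reasoning: scene description, plan, and next step."
)

def synthesize_example(image_path: str, query_vlm) -> dict:
    """Produce one embodied reasoning-centric vision-language example."""
    # Randomly pick one of the two task categories used for co-training.
    task_type = random.choice(["visual_grounding", "long_horizon_planning"])
    prompt = GROUNDING_PROMPT if task_type == "visual_grounding" else PLANNING_PROMPT

    # The annotator returns an instruction and the associated reasoning text.
    annotation = query_vlm(image=image_path, prompt=prompt)

    return {
        "image": image_path,
        "task_type": task_type,
        "instruction": annotation["instruction"],
        "reasoning": annotation["reasoning"],
    }

if __name__ == "__main__":
    # Usage sketch: iterate over an image pool and dump a JSONL co-training set.
    def fake_vlm(image, prompt):  # stand-in annotator for illustration only
        return {"instruction": "move to the red mug left of the laptop",
                "reasoning": "The red mug is to the left of the laptop..."}

    with open("vl_cotrain.jsonl", "w") as f:
        for path in ["scene_0001.jpg", "scene_0002.jpg"]:
            f.write(json.dumps(synthesize_example(path, fake_vlm)) + "\n")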
@misc{lin2025onetwovlaunifiedvisionlanguageactionmodel,
title={OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning},
author={Fanqi Lin and Ruiqian Nai and Yingdong Hu and Jiacheng You and Junming Zhao and Yang Gao},
year={2025},
eprint={2505.11917},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.11917},
}