OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Introducing OneTwoVLA, a single unified model capable of both reasoning and acting that adaptively switches between the two modes. OneTwoVLA demonstrates superior performance in the following capabilities:

Long-Horizon Task Planning: OneTwoVLA reasons to formulate, track, and dynamically adjust task plans based on execution feedback. Moreover, co-training with our synthetic embodied-reasoning-centric vision-language data enables generalizable planning on unseen tasks.
Error Detection and Recovery: OneTwoVLA detects execution errors in real time, reasons about corrective strategies, and performs agile recovery actions.
Natural Human-Robot Interaction: OneTwoVLA adjusts actions immediately upon human intervention and proactively seeks clarification when faced with ambiguity.
Generalizable Visual Grounding: OneTwoVLA has a superior understanding of spatial relationships, object attributes, and semantic features, generalizing to unseen objects.
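
To make the adaptive switching between reasoning and acting concrete, the sketch below shows one way such a control loop could look. It is an illustrative sketch only: the `onetwovla.predict` interface, the `mode` field, and the robot helpers are assumptions, not the released OneTwoVLA API.

```python
# Minimal sketch of an adaptive reasoning/acting control loop.
# NOTE: `onetwovla.predict`, the `mode` field, and the robot helpers are
# illustrative assumptions, not the released OneTwoVLA interface.
import time

def control_loop(onetwovla, robot, instruction, hz=10):
    reasoning = ""  # latest reasoning text, fed back to the model as context
    while not robot.task_done():
        obs = robot.get_observation()  # camera images + proprioceptive state
        out = onetwovla.predict(obs, instruction, reasoning)
        if out.mode == "reason":
            # "Thinking" mode: update the plan, flag an error, or ask the user
            reasoning = out.text
        else:
            # "Acting" mode: execute a short chunk of low-level actions
            for action in out.action_chunk:
                robot.apply_action(action)
                time.sleep(1.0 / hz)
```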


Long-Horizon Task Planning

OneTwoVLA excels at handling long-horizon manipulation tasks. It consistently demonstrates the ability to understand the physical scene, generate correct plans, track task progress accurately, and produce precise actions. This allows OneTwoVLA to successfully complete challenging tasks such as hotpot cooking, tomato-egg scramble, and cocktail mixing.

Demo videos: Hotpot Cooking, Tomato-Egg Scramble, Cocktail Mixing
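
As a rough illustration of what plan formulation and progress tracking involve, the snippet below sketches a possible reasoning state. The fields, methods, and example values are assumptions for clarity; OneTwoVLA expresses this as natural-language reasoning text rather than a fixed schema.

```python
# Illustrative sketch of the state a plan-tracking reasoning step maintains.
# The schema is an assumption; the model itself produces free-form reasoning text.
from dataclasses import dataclass

@dataclass
class ReasoningState:
    scene: str            # free-form description of the physical scene
    plan: list[str]       # ordered subtasks formulated by the model
    next_step: int = 0    # index of the subtask to execute next
    note: str = ""        # e.g. an adjustment triggered by execution feedback

    def advance(self) -> None:
        self.next_step += 1

    def current(self) -> str:
        return self.plan[self.next_step] if self.next_step < len(self.plan) else "done"

state = ReasoningState(
    scene="pot on the stove, plate of sliced beef, bowl of vegetables",
    plan=["Turn on the stove", "Add the beef to the pot", "Add the vegetables"],
)
print(state.current())  # -> "Turn on the stove"
```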

Moreover, co-training with our synthetic embodied-reasoning-centric vision-language data enables OneTwoVLA to demonstrate generalizable planning capabilities on unseen tasks.

Generalizable Planning Tasks


Error Detection and Recovery

Recovering from mistakes is a critical capability for general-purpose robots. OneTwoVLA can detect errors in real time, rapidly reason about recovery strategies, and then generate corrective actions.



Natural Human-Robot Interaction

To deploy robots in human-centric scenarios, the ability to interact naturally with humans is indispensable. Thanks to its adaptive nature and explicit reasoning process, OneTwoVLA engages with humans in a natural way, seamlessly handling human interventions and proactively seeking clarification when faced with ambiguity.



Open-World Visual Grounding

Co-training OneTwoVLA with our synthetic embodied-reasoning-centric vision-language data endows it with open-world visual grounding capabilities, enabling it to effectively comprehend spatial relationships, object attributes, and semantic features, even for objects unseen during training (e.g., GoPro, Sprite, Starbucks Coffee). The following videos show our robot successfully reaching for target objects based on language instructions.


Synthetic Vision-Language Data Examples

To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable, automatic pipeline for synthesizing embodied-reasoning-centric vision-language data without any manual intervention, which is used for co-training with robot data. The task instructions for each synthetic image fall into two types: visual grounding tasks and long-horizon planning tasks. We show some examples here; a sketch of how such examples might be stored follows them.

Visual Grounding

Visual Grounding Image
Spatial Instruction: Can you grab the item draped over the left edge of the table?
Reasoning: I need to pick up the brown knitted scarf which provides warmth.
Attribute Instruction: Hand me the glass sphere decoration.
Reasoning: I need to pick up the snow globe ornament containing a Christmas tree behind the book.
Semantic Instruction: I'd like something to read, please pass me that.
Reasoning: I need to pick up the book on the right side of the table.

Long-Horizon Planning

Long-Horizon Planning Image
Instruction: Prepare a fresh salad using the ingredients on the table.
Reasoning Plan:
1. Pour the cherry tomatoes into the large wooden bowl.
2. Pour the arugula into the large wooden bowl.
3. Add some sliced cucumbers to the large wooden bowl.
4. Take the croutons and sprinkle them evenly on top of the salad.
5. Pour olive oil over the salad.
6. Gently toss the ingredients together.
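
For concreteness, the snippet below sketches how one synthetic example of each type might be stored for co-training. The field names, task-type labels, and file names are assumptions for illustration; the instruction and reasoning text are taken from the examples above.

```python
# Hypothetical storage format for synthetic vision-language examples.
# Field names and image file names are illustrative assumptions.
synthetic_examples = [
    {
        "image": "tabletop_scene_0001.jpg",  # hypothetical file name
        "task_type": "grounding",
        "instruction": "Can you grab the item draped over the left edge of the table?",
        "reasoning": "I need to pick up the brown knitted scarf which provides warmth.",
    },
    {
        "image": "kitchen_scene_0042.jpg",   # hypothetical file name
        "task_type": "planning",
        "instruction": "Prepare a fresh salad using the ingredients on the table.",
        "reasoning": (
            "1. Pour the cherry tomatoes into the large wooden bowl. "
            "2. Pour the arugula into the large wooden bowl. "
            "3. Add some sliced cucumbers to the large wooden bowl. ..."
        ),
    },
]

# During co-training, such image-text records would be mixed with robot
# demonstrations so the model learns reasoning from the vision-language data
# and low-level control from the robot data.
```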

Hardware


BibTeX

@misc{lin2025onetwovlaunifiedvisionlanguageactionmodel,
  title={OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning},
  author={Fanqi Lin and Ruiqian Nai and Yingdong Hu and Jiacheng You and Junming Zhao and Yang Gao},
  year={2025},
  eprint={2505.11917},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.11917},
}