Introducing OneTwoVLA, a single unified model capable of both reasoning and acting that adaptively switches between the two modes. OneTwoVLA demonstrates superior performance in the following capabilities:
OneTwoVLA excels at handling long-horizon manipulation tasks. It consistently demonstrates the ability to understand the physical scene, generate correct plans, track task progress accurately, and produce precise actions. This allows OneTwoVLA to successfully complete challenging tasks such as hotpot cooking, tomato-egg scramble, and cocktail mixing.
Moreover, co-training with our synthetic embodied reasoning-centric vision-language data enables OneTwoVLA to generalize its planning capabilities to unseen tasks.
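For concreteness, below is a minimal Python sketch of what such an adaptive reason-or-act control loop could look like. The names used here (ModelOutput, the mode flag, the callables) are illustrative assumptions for the sketch, not the released OneTwoVLA interface.

# Minimal sketch of an adaptive reason-or-act control loop. All names below
# (ModelOutput, the `mode` flag, the callables) are illustrative assumptions,
# not the released OneTwoVLA implementation.

from dataclasses import dataclass

@dataclass
class ModelOutput:
    mode: str                      # "reason" or "act"
    reasoning: str | None = None   # updated plan / progress / error analysis
    actions: list | None = None    # chunk of low-level robot actions

def control_loop(model, get_observation, execute, instruction, max_steps=100):
    """Run one unified policy, letting it switch between reasoning and acting."""
    reasoning_context = ""
    for _ in range(max_steps):
        obs = get_observation()
        # The model conditions on the observation, the instruction, and its
        # most recent reasoning text, and first decides which mode to use.
        out: ModelOutput = model(obs, instruction, reasoning_context)
        if out.mode == "reason":
            # Mode 1: update the textual reasoning (scene understanding,
            # planning, progress tracking, error detection) and keep it
            # as context for subsequent steps.
            reasoning_context = out.reasoning
        else:
            # Mode 2: act, producing a short action chunk conditioned on
            # the latest reasoning, and execute it on the robot.
            execute(out.actions)
    return reasoning_context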
Recovering from mistakes is a critical capability for general-purpose robots. OneTwoVLA can detect errors in real time, rapidly reason about recovery strategies, and subsequently generate corrective actions.
To deploy robots in human-centric scenarios, the ability to interact naturally with humans is indispensable. Thanks to its adaptive nature and explicit reasoning process, OneTwoVLA engages with humans naturally, seamlessly handling human interventions and proactively seeking clarification when faced with ambiguity.
Co-training OneTwoVLA with our synthetic embodied reasoning-centric vision-language data endows it with open-world visual grounding capabilities, enabling it to effectively comprehend spatial relationships, object attributes, and semantic features, even for objects unseen during training (e.g., GoPro, Sprite, Starbucks Coffee). The following videos show our robot successfully reaching target objects based on language instructions.
To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable, automatic pipeline for synthesizing embodied reasoning-centric vision-language data without any manual intervention; this data is used for co-training with robot data. The task instructions for each synthetic image fall into two categories: visual grounding tasks and long-horizon planning tasks. We show some examples here:
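For readers curious how such a pipeline might be structured, here is a minimal Python sketch under stated assumptions: source images come from an arbitrary pool, and a vision-language annotator (the query_vlm callable) writes the instruction and reasoning text. The prompts, function names, and output schema are illustrative, not the authors' released code.

# Minimal sketch of an automatic vision-language data synthesis pipeline.
# The prompts, `query_vlm` interface, and output fields are assumptions made
# for illustration, not the authors' released pipeline.

import json
import random

GROUNDING_PROMPT = (
    "Write an instruction that refers to one object in this image by its "
    "spatial relation, attribute, or semantic role, and give the reasoning "
    "plus the target object's location."
)
PLANNING_PROMPT = (
    "Propose a long-horizon task that could start from this scene, and write "
    "step-by-step reasoning: scene description, plan, and next step."
)

def synthesize_example(image_path: str, query_vlm) -> dict:
    """Produce one embodied reasoning-centric vision-language example."""
    # Randomly pick one of the two task categories used for co-training.
    task_type = random.choice(["visual_grounding", "long_horizon_planning"])
    prompt = GROUNDING_PROMPT if task_type == "visual_grounding" else PLANNING_PROMPT

    # The annotator returns an instruction and the associated reasoning text.
    annotation = query_vlm(image=image_path, prompt=prompt)

    return {
        "image": image_path,
        "task_type": task_type,
        "instruction": annotation["instruction"],
        "reasoning": annotation["reasoning"],
    }

if __name__ == "__main__":
    # Usage sketch: iterate over an image pool and dump a JSONL co-training set.
    def fake_vlm(image, prompt):  # stand-in annotator for illustration only
        return {"instruction": "move to the red mug left of the laptop",
                "reasoning": "The red mug is to the left of the laptop..."}

    with open("vl_cotrain.jsonl", "w") as f:
        for path in ["scene_0001.jpg", "scene_0002.jpg"]:
            f.write(json.dumps(synthesize_example(path, fake_vlm)) + "\n")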
@misc{lin2025onetwovlaunifiedvisionlanguageactionmodel,
title={OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning},
author={Fanqi Lin and Ruiqian Nai and Yingdong Hu and Jiacheng You and Junming Zhao and Yang Gao},
year={2025},
eprint={2505.11917},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.11917},
}