Green-VLA

Staged Vision–Language–Action Model for Generalist Robots

Manipulation Team
Sber Robotics Center · February 2026

Abstract

We introduce Green-VLA, a staged Vision–Language–Action framework designed for real-world deployment on the humanoid Green robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) RL-based policy alignment. This progression builds semantic and physical priors, learns shared affordances, and aligns policies for long-horizon execution beyond behavior cloning. At its core is a unified data and control stack for robot fleets.

A scalable data-processing pipeline, including DataQA and temporal alignment, filters and synchronizes 3,000 hours of demonstrations; a unified, embodiment-aware action interface enables a single policy to control humanoids, mobile manipulators, and fixed-base arms; and the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and a joint-prediction-based guidance module that generalizes to unseen objects. Optimized for the Green humanoid, Green-VLA generalizes in a zero-shot manner to new embodiments and achieves state-of-the-art performance across bimanual systems and benchmarks, with RL alignment providing gains in success rate, robustness, and long-horizon efficiency.
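
To make the unified, embodiment-aware action interface concrete, the sketch below shows one simple way such an interface could map per-robot action spaces into a shared fixed-width action vector with a validity mask. The embodiment names, DoF counts, and padding scheme are illustrative assumptions, not the actual Green-VLA interface.

import numpy as np

# Per-embodiment action layouts (hypothetical DoF counts for illustration only).
EMBODIMENTS = {
    "green_humanoid":     26,  # e.g. two arms + hands + torso (assumed)
    "mobile_manipulator": 10,  # e.g. base (3) + arm (6) + gripper (1) (assumed)
    "fixed_arm":           7,  # e.g. 6-DoF arm + gripper (assumed)
}
MAX_DOF = max(EMBODIMENTS.values())

def to_shared_action(raw_action: np.ndarray, embodiment: str) -> np.ndarray:
    """Pad a robot-specific action into the shared fixed-width vector and append
    a validity mask so the policy knows which dimensions are real."""
    dof = EMBODIMENTS[embodiment]
    assert raw_action.shape == (dof,)
    padded = np.zeros(MAX_DOF, dtype=np.float32)
    mask = np.zeros(MAX_DOF, dtype=np.float32)
    padded[:dof] = raw_action
    mask[:dof] = 1.0
    return np.concatenate([padded, mask])

def from_shared_action(shared: np.ndarray, embodiment: str) -> np.ndarray:
    """Slice the robot-specific command back out of the shared vector."""
    return shared[:EMBODIMENTS[embodiment]]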

Method Overview

Green-VLA architecture

Green-VLA is a ~4B-parameter Vision–Language–Action model. We use PaliGemma (3B) as the vision–language backbone, augmented with a dedicated flow-matching action expert and lightweight auxiliary heads. The architecture features a unified, embodiment-aware action interface, episode-progress and out-of-distribution prediction heads, and a joint-prediction-based guidance module that generalizes to unseen objects.
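
As a rough, PyTorch-style sketch of how a flow-matching action expert and lightweight auxiliary heads could sit on top of pooled VLM features (module names, dimensions, and the rectified-flow objective shown here are assumptions for illustration, not the released implementation):

import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Predicts the flow-matching velocity field for a chunk of future actions,
    conditioned on fused vision-language features."""
    def __init__(self, feat_dim=2048, action_dim=32, horizon=50, hidden=1024):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim + horizon * action_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, vlm_feat, noisy_actions, t):
        # vlm_feat: (B, feat_dim); noisy_actions: (B, horizon, action_dim); t: (B, 1)
        x = torch.cat([vlm_feat, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

class AuxHeads(nn.Module):
    """Lightweight auxiliary heads: episode-progress regression and an OOD score."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.progress = nn.Linear(feat_dim, 1)  # fraction of episode completed
        self.ood = nn.Linear(feat_dim, 1)       # out-of-distribution logit

    def forward(self, vlm_feat):
        return torch.sigmoid(self.progress(vlm_feat)), self.ood(vlm_feat)

def flow_matching_loss(expert, vlm_feat, actions):
    """Rectified-flow objective: regress the straight-line velocity from noise
    to the ground-truth action chunk at a random interpolation time t."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t[..., None]) * noise + t[..., None] * actions  # linear interpolant
    v_pred = expert(vlm_feat, x_t, t)
    return ((v_pred - (actions - noise)) ** 2).mean()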

The training follows a staged curriculum: (L0) base VLM → (L1) web/multimodal pretraining for physical world understanding → (R0) general robotics pretraining on 3,000+ hours of demonstrations → (R1) embodiment-specific supervised fine-tuning → (R2) RL-based policy alignment. This progression builds semantic priors, learns shared affordances, and aligns policies for robust real-world execution.
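
For concreteness, the staged curriculum can be viewed as an ordered list of training phases, each warm-started from the previous checkpoint. The schedule below is a simplified, hypothetical rendering of that progression; the stage labels follow the paper, while the data mixtures and objectives are paraphrased rather than the exact recipe.

# Hypothetical rendering of the staged curriculum (labels from the paper;
# data mixtures and objectives are paraphrased, not the exact training recipe).
CURRICULUM = [
    {"stage": "L0", "data": "base VLM corpus",                    "objective": "vision-language pretraining"},
    {"stage": "L1", "data": "web/multimodal physical-world data", "objective": "multimodal grounding"},
    {"stage": "R0", "data": "3,000+ h multi-embodiment demos",    "objective": "general robotics pretraining"},
    {"stage": "R1", "data": "embodiment-specific demonstrations", "objective": "supervised fine-tuning"},
    {"stage": "R2", "data": "policy rollouts with rewards",       "objective": "RL-based policy alignment"},
]

def run_curriculum(train_stage, checkpoint=None):
    """Run the stages in order, initializing each from the previous checkpoint."""
    for cfg in CURRICULUM:
        checkpoint = train_stage(cfg, init_from=checkpoint)
    return checkpoint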

Results


Below are example results demonstrating our method on various scenes.

[Example rollouts: ALOHA inference · Humanoid inference]


Benchmark Results Summary (Models × Datasets)


SimplerEnv: Google Robot Tasks

SimplerEnv evaluation of different policies on Google Robot tasks, run for the default number of SimplerEnv episode steps. Values are success rates (%); VM = Visual Matching, VA = Variant Aggregation.
Model | VM Drawer | VM Move Near | VM Pick Coke | VM Avg | VA Drawer | VA Move Near | VA Pick Coke | VA Avg | Overall Avg
π0 (Fine-tune) | 38.3 | 65.3 | 72.7 | 58.8 | 25.6 | 63.7 | 75.2 | 54.8 | 56.8
OpenVLA | 35.6 | 46.2 | 16.3 | 27.7 | 17.7 | 47.7 | 54.5 | 39.8 | 33.8
RT-1-X | 59.7 | 31.7 | 56.7 | 53.4 | 49.0 | 32.3 | 29.7 | 39.6 | 46.5
Green-VLA (R0) | 62.9 | 61.2 | 90.4 | 71.4 | 33.5 | 38.1 | 75.5 | 49.1 | 60.2
Green-VLA (R1) | 47.0 | 58.7 | 95.0 | 66.9 | 34.1 | 42.9 | 92.1 | 56.3 | 61.3
Green-VLA (R2) | 61.0 | 50.8 | 98.1 | 69.9 | 51.6 | 71.2 | 98.2 | 73.7 | 71.8


SimplerEnv: WidowX Robot Tasks

SimplerEnv WidowX Evaluation: Pick Tasks & Success Rates (R0→R1→R2 progression)
Model | Grasp Spoon | Grasp Cubes | Grasp Eggplant | Grasp Carrot | Grasp Avg | Task Spoon | Task Cubes | Task Eggplant | Task Carrot | Task Avg
π0 (Fine-tune) | 45.8 | 50.0 | 91.6 | 25.0 | 53.1 | 29.1 | 16.7 | 62.5 | 0.0 | 27.1
DB-MemVLA | 91.7 | 83.3 | 79.2 | 100 | 88.6 | 85.1 | 57.6 | 100 | 50.0 | 73.2
Green-VLA (R0) | 66.7 | 91.7 | 91.7 | 50.0 | 75.0 | 33.3 | 33.3 | 88.5 | 25.0 | 45.0
Green-VLA (R1) | 75.0 | 91.7 | 87.5 | 50.0 | 76.1 | 66.7 | 37.5 | 79.2 | 37.5 | 55.2
Green-VLA (R2) | 87.5 | 95.8 | 91.7 | 91.6 | 91.7 | 90.1 | 52.6 | 84.8 | 89.0 | 79.1


CALVIN Benchmark: Average Chain Length

CALVIN ABC→D: Average Chain Length (ACL) Comparison
Metric | π0 R1 | Green-VLA R1 | Flower R1 | Green-VLA R2
ACL | 3.6 | 4.18 | 4.53 | 4.63


Humanoid (Green Robot): Task Success Rates

Humanoid instruction-conditioned manipulation success rates (%)
Task | Success (%)
Pick [item] | 98
Place in [basket] | 100
Pick [item] from [basket] | 77
Give [item] to user | 99
Hand over [item] | 84
Clean full [table] | 87
Average | 90

E-Commerce: RL Alignment (R1 → R2)

Success Rates for Challenging E-Commerce Items
Item | R1 Success | R2 Success | Gain
Cookies | 30% | 82% | +52%
Deodorant | 62% | 88% | +26%
Shampoo | 12% | 22% | +10%
Pet food | 25% | 68% | +43%


E-Commerce: JPM + Guidance Impact

E-Commerce Shelf Picking: Top-1 Success Rate (%)
Setting | Base | + Guidance | Gain
ID Coarse | 62 | 95 | +33
ID SKU | 36 | 93 | +57
OOD | 10 | 72 | +62

Citation

@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
      title={Green-VLA: Staged Vision-Language-Action Model for Generalist Robots}, 
      author={I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
      year={2026},
      eprint={2602.00919},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.00919}, 
}