We introduce Green-VLA, a staged Vision–Language–Action framework for real-world deployment on the humanoid Green robot, while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) RL-based policy alignment. Progression builds semantic and physical priors, learns shared affordances, and aligns policies for long-horizon execution beyond behavior cloning. At its core is a unified data and control stack for robot fleets.
A scalable data-processing pipeline, including DataQA and temporal-alignment filters, cleans and synchronizes 3,000 hours of demonstrations; a unified, embodiment-aware action interface lets a single policy control humanoids, mobile manipulators, and fixed-base arms; and the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and a joint-prediction-based guidance module that generalizes to unseen objects. Optimized for the Green humanoid, Green-VLA generalizes zero-shot to new embodiments and achieves state-of-the-art performance across bimanual systems and benchmarks, with RL alignment yielding gains in success rate, robustness, and long-horizon efficiency.
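To make the episode-progress and out-of-distribution ideas concrete, here is a minimal, illustrative sketch (not the paper's implementation) of how a scalar progress prediction could drive OOD/stuck detection at runtime: if predicted progress stalls or regresses for too many consecutive steps, the episode is flagged. The class name, thresholds, and flagging rule are all assumptions for illustration.

```python
# Hypothetical runtime monitor: flags an episode when the progress-prediction
# head stops advancing. Names and thresholds are illustrative, not Green-VLA's.

class ProgressMonitor:
    def __init__(self, stall_tol=0.005, patience=5):
        self.stall_tol = stall_tol  # minimum progress increase counted as advance
        self.patience = patience    # consecutive stalled steps before flagging
        self.best = 0.0             # best progress seen so far in the episode
        self.stalled = 0            # consecutive steps without advance

    def update(self, progress):
        """progress: scalar in [0, 1] from the episode-progress head."""
        if progress > self.best + self.stall_tol:
            self.best = progress
            self.stalled = 0
        else:
            self.stalled += 1
        # True => treat the episode as out-of-distribution / stuck
        return self.stalled >= self.patience

monitor = ProgressMonitor()
preds = [0.1, 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]  # progress plateaus at 0.3
flags = [monitor.update(p) for p in preds]
# Only the final step trips the flag, after `patience` stalled steps.
```

In practice such a signal could trigger a retry or a recovery behavior; the paper's actual OOD detector may use a different mechanism entirely.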
Green-VLA is a ~5B-parameter Vision-Language-Action model. We use Qwen3-VL (4B) as the vision–language backbone, augmented with a dedicated flow-matching action expert and lightweight auxiliary heads.
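At inference time, a flow-matching action expert generates actions by integrating a learned velocity field from noise toward an action sample. The sketch below shows the generic mechanism with fixed-step Euler integration; the toy linear velocity field stands in for the trained expert, and all names here are assumptions, not Green-VLA's API.

```python
# Generic flow-matching sampling sketch: integrate da/dt = v(a, t | obs)
# from t = 0 (noise) to t = 1 (action). The toy field below is a stand-in
# for the learned action expert.

def euler_sample(velocity_field, a0, steps=50):
    """Fixed-step Euler integration of the velocity field from t=0 to t=1."""
    a = list(a0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_field(a, t)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a

def toy_velocity_field(a, t, target=(0.5, -0.3)):
    # For the linear path a_t = (1 - t) * a_0 + t * a_1, the conditional
    # velocity is (a_1 - a_t) / (1 - t); clamp the denominator near t = 1.
    denom = max(1.0 - t, 1e-6)
    return [(g - ai) / denom for g, ai in zip(target, a)]

# Starting from the origin, the sample converges to the toy target action.
action = euler_sample(toy_velocity_field, a0=[0.0, 0.0])
```

A real action expert would condition the velocity field on visual and language features and emit an action chunk rather than a single 2-D point.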
The training follows a staged curriculum: (L0) base VLM → (L1) web/multimodal pretraining for physical world understanding → (R0) general robotics pretraining on 3,000+ hours of demonstrations → (R1) embodiment-specific supervised fine-tuning → (R2) RL-based policy alignment. This progression builds semantic priors, learns shared affordances, and aligns policies for robust real-world execution.
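The staged curriculum above is, operationally, an ordered pipeline in which each stage initializes from the previous stage's checkpoint. A minimal sketch of that control flow (stage names from the text; the trainer and checkpoint handling are placeholders):

```python
# The five training stages, in order, as described in the text.
STAGES = [
    ("L0", "base VLM"),
    ("L1", "web/multimodal pretraining"),
    ("R0", "general robotics pretraining"),
    ("R1", "embodiment-specific fine-tuning"),
    ("R2", "RL-based policy alignment"),
]

def run_curriculum(train_stage, stages=STAGES):
    """Run stages in order; each stage initializes from the previous checkpoint."""
    checkpoint = None
    history = []
    for name, _desc in stages:
        checkpoint = train_stage(name, init_from=checkpoint)
        history.append(name)
    return checkpoint, history

# Dummy trainer that just tags the checkpoint with the stage name.
ckpt, order = run_curriculum(lambda name, init_from: f"{name}-ckpt")
```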
Below are example results comparing Green-VLA with recent baselines across benchmarks and task suites.
(VM = Visual Matching; VA = Variant Aggregation)

| Model | Drawer (VM) | Move near (VM) | Pick Coke (VM) | Apple (VM) | AVG VM | Drawer (VA) | Move near (VA) | Pick Coke (VA) | Apple (VA) | AVG VA | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π0 (Fine-tune) | 38.3 | 65.3 | 72.7 | 0.0 | 44.1 | 25.6 | 63.7 | 75.2 | 0.0 | 41.1 | 42.6 |
| π0.5 (Fine-tune) | 57.9 | 72.5 | 86.7 | 0.0 | 54.3 | 50.5 | 73.5 | 87.4 | 0.0 | 52.8 | 53.6 |
| X-VLA | 64.4 | 84.6 | 93.7 | 18.5 | 65.3 | 43.7 | 78.8 | 96.1 | 30.7 | 62.3 | 63.8 |
| GR00T-N1.6 | 61.1 | 73.8 | 95.3 | 13.0 | 60.8 | 59.5 | 68.3 | 89.6 | 23.3 | 60.2 | 60.5 |
| Magma | 62.5 | 68.3 | 74.3 | 13.0 | 54.5 | 60.3 | 76.9 | 70.8 | 32.8 | 60.2 | 57.4 |
| EO-1 | 71.3 | 83.8 | 98.0 | 52.8 | 76.5 | 91.6 | 81.7 | 55.0 | 23.8 | 63.0 | 69.8 |
| OpenVLA | 35.6 | 46.2 | 16.3 | 0.0 | 24.5 | 17.7 | 47.7 | 54.5 | 0.0 | 30.0 | 27.2 |
| RT-1-X | 59.7 | 31.7 | 56.7 | 40.7 | 47.2 | 49.0 | 32.3 | 29.7 | 40.7 | 37.9 | 42.6 |
| Flower | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 42.4 |
| **Green-VLA (PaliGemma 3B)** | | | | | | | | | | | |
| Green-VLA(R0) | 62.9 | 61.2 | 90.4 | 0.0 | 53.6 | 33.5 | 38.1 | 75.5 | 0.0 | 36.7 | 45.1 |
| Green-VLA(R1) | 47.0 | 58.7 | 95.0 | 0.0 | 50.1 | 34.1 | 42.9 | 92.1 | 16.0 | 46.2 | 48.1 |
| Green-VLA(R2) | 61.0 | 50.8 | 98.1 | 0.0 | 52.4 | 51.6 | 71.2 | 98.2 | 28.0 | 62.3 | 57.3 |
| **Green-VLA (Qwen3-VL-4B-Instruct)** | | | | | | | | | | | |
| Green-VLA(R1) | 64.8 | 75.8 | 85.7 | 81.5 | 77.0 | 35.7 | 71.9 | 92.6 | 66.7 | 66.7 | 71.8 |
| Model | Spoon (Grasp) | Cubes (Grasp) | Eggplant (Grasp) | Carrot (Grasp) | AVG Grasp | Spoon (Success) | Cubes (Success) | Eggplant (Success) | Carrot (Success) | AVG Success |
|---|---|---|---|---|---|---|---|---|---|---|
| π0 (Fine-tune) | 45.8 | 50.0 | 91.6 | 25.0 | 53.1 | 29.1 | 16.7 | 62.5 | 0.0 | 27.1 |
| π0.5 (Fine-tune) | 66.7 | 16.7 | 50.0 | 50.0 | 45.9 | 29.2 | 0.0 | 41.7 | 41.7 | 28.2 |
| OpenVLA | 4.1 | 12.5 | 8.3 | 33.0 | 14.5 | 0.0 | 0.0 | 4.1 | 0.0 | 1.0 |
| RT-1-X | 16.7 | 8.3 | 0.0 | 20.8 | 11.5 | 0.0 | 0.0 | 0.0 | 4.2 | 1.1 |
| Flower | -- | -- | -- | -- | -- | 71.0 | 8.0 | 88.0 | 13.0 | 45.0 |
| DB-MemVLA | 91.7 | 83.3 | 79.2 | 100.0 | 88.6 | 85.1 | 57.6 | 100.0 | 50.0 | 73.2 |
| X-VLA | 95.8 | 79.2 | 62.5 | 75.0 | 78.1 | 91.7 | 37.5 | 62.5 | 70.8 | 65.6 |
| Magma | 70.8 | 75.0 | 91.7 | 37.5 | 68.8 | 54.2 | 29.2 | 83.3 | 33.3 | 50.0 |
| GR00T-N1.6 | 58.3 | 20.8 | 100.0 | 54.2 | 58.3 | 41.7 | 0.0 | 62.5 | 33.3 | 34.4 |
| EO-1 | -- | -- | -- | -- | -- | 63.6 | 81.8 | 90.9 | 54.5 | 72.7 |
| **Green-VLA (PaliGemma 3B)** | | | | | | | | | | |
| Green-VLA(R0) | 66.7 | 91.7 | 91.7 | 50.0 | 75.0 | 33.3 | 33.3 | 88.5 | 25.0 | 45.0 |
| Green-VLA(R1) | 75.0 | 91.7 | 87.5 | 50.0 | 76.1 | 66.7 | 37.5 | 79.2 | 37.5 | 55.2 |
| Green-VLA(R2) | 87.5 | 95.8 | 91.7 | 91.6 | 91.7 | 90.1 | 52.6 | 84.8 | 89.0 | 79.1 |
| **Green-VLA (Qwen3-VL-4B-Instruct)** | | | | | | | | | | |
| Green-VLA(R1) | 91.7 | 91.7 | 100.0 | 75.0 | 89.6 | 79.2 | 58.3 | 91.7 | 62.5 | 72.9 |
| Green-VLA(R2) | 90.6 | 99.0 | 99.0 | 89.6 | 94.6 | 80.2 | 70.8 | 94.8 | 76.1 | 80.5 |
| Metric | π0 R1 | Green-VLA R1 (PaliGemma 3B) | Green-VLA R1 (Qwen3-VL-4B) | Flower R1 | Green-VLA R2 (Qwen3-VL-4B) | Green-VLA R2 (PaliGemma 3B) |
|---|---|---|---|---|---|---|
| ACL | 3.6 | 4.18 | 4.27 | 4.53 | 4.57 | 4.63 |
| Pick [item] | Place in [basket] | Pick [item] from [basket] | Give [item] to user | Hand over [item] | Clean full [table] | Average |
|---|---|---|---|---|---|---|
| 98 | 100 | 77 | 99 | 84 | 87 | 90 |
| Item | R1 Success | R2 Success | Gain (pp) |
|---|---|---|---|
| Cookies | 30% | 82% | +52 |
| Deodorant | 62% | 88% | +26 |
| Shampoo | 12% | 22% | +10 |
| Pet food | 25% | 68% | +43 |
| Setting | Base | + Guidance | Gain (pp) |
|---|---|---|---|
| In-distribution (coarse) | 62 | 95 | +33 |
| In-distribution (SKU) | 36 | 93 | +57 |
| Out-of-distribution | 10 | 72 | +62 |
@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
title={Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
author={I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
year={2026},
eprint={2602.00919},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.00919},
}