We introduce Green-VLA, a staged Vision–Language–Action framework designed for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) RL-based policy alignment. This progression builds semantic and physical priors, learns shared affordances, and aligns policies for long-horizon execution beyond behavior cloning. At its core is a unified data and control stack for robot fleets.
A scalable data-processing pipeline, including DataQA and temporal-alignment filters, curates and synchronizes 3,000 hours of demonstrations; a unified, embodiment-aware action interface enables a single policy to control humanoids, mobile manipulators, and fixed-base arms; and the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and a joint-prediction-based guidance module that generalizes to unseen objects. Optimized for the Green humanoid, Green-VLA generalizes zero-shot to new embodiments and achieves state-of-the-art performance across bimanual systems and benchmarks, with RL alignment providing gains in success rate, robustness, and long-horizon efficiency.
Method Overview
Green-VLA is a ~4B-parameter Vision–Language–Action model. We use PaliGemma (3B) as the vision–language backbone, augmented with a dedicated flow-matching action expert and lightweight auxiliary heads. The architecture features:
Unified Action Space (64D): Semantic layout with embodiment masks → enables zero-shot cross-robot transfer without spurious gradients from padding (see the masking sketch after this list).
Task Planner: High-level VLM decomposes user goals into atomic subtasks → achieves long-horizon instruction following with adaptive replanning (see the planner-loop sketch after this list).
DataQA Pipeline: Auto-filters 3,000+ hours using jitter, sharpness, diversity, and state-variance metrics + optical-flow alignment → stable multi-embodiment training with improved data efficiency (see the filtering sketch after this list).
Episode Progress & OOD Detection: Real-time task-state monitoring → safe recovery from unfamiliar configurations and prevention of catastrophic failures (see the monitoring sketch after this list).
JPM + Guidance: Training-free 3D target prediction + flow-matching steering → SKU-level accuracy rises from 36% to 93% in-domain and from 10% to 72% on OOD items (see the guidance sketch after this list).
RL Alignment (R2): Trajectory optimization + source distribution tuning → +10-52% SR gain on challenging tasks, longer average chain length compared with behavior cloning.
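The sketches below illustrate several of these components. First, a minimal sketch of the embodiment-masked unified action space: a loss restricted to the dimensions a given embodiment actually controls, so padded dimensions produce no gradient. The mask layouts and the `EMBODIMENT_MASKS` / `masked_action_loss` names are illustrative assumptions, not Green-VLA's actual 64-D layout.

```python
import numpy as np

ACTION_DIM = 64  # shared semantic action layout across embodiments

# Hypothetical masks: 1 marks dimensions an embodiment actually controls.
# Green-VLA's real 64-D layout is not reproduced here.
EMBODIMENT_MASKS = {
    "humanoid":  np.r_[np.ones(40), np.zeros(24)],  # head + torso + dual arms + hands
    "fixed_arm": np.r_[np.ones(7), np.zeros(57)],   # 7-DoF arm; the rest is padding
}

def masked_action_loss(pred, target, embodiment):
    """MSE restricted to the dimensions the embodiment controls,
    so padded dimensions contribute no spurious gradient."""
    m = EMBODIMENT_MASKS[embodiment]
    return np.sum(m * (pred - target) ** 2) / max(m.sum(), 1.0)

print(masked_action_loss(np.random.randn(ACTION_DIM), np.zeros(ACTION_DIM), "fixed_arm"))
```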
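The planner loop below is a hedged sketch of subtask decomposition with adaptive replanning; `vlm` (a callable returning a list of subtask strings) and `execute_subtask` are hypothetical stand-ins for the high-level VLM and the VLA controller, not the paper's interfaces.

```python
def plan_and_execute(goal, vlm, execute_subtask, max_replans=3):
    """Decompose a goal into atomic subtasks and replan on failure.

    `vlm(prompt) -> list[str]` and `execute_subtask(task) -> bool` are
    hypothetical stand-ins for the high-level VLM and the VLA controller.
    """
    subtasks = vlm(f"Decompose into atomic robot subtasks: {goal}")
    replans = 0
    while subtasks:
        task = subtasks.pop(0)
        if execute_subtask(task):
            continue
        if replans >= max_replans:
            raise RuntimeError(f"Subtask failed after {replans} replans: {task}")
        replans += 1
        # Adaptive replanning: re-decompose from the current scene state.
        subtasks = vlm(f"Goal: {goal}. Failed at: {task!r}. Replan the remaining steps.")
    return True
```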
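A minimal sketch of episode-level quality filtering in the spirit of DataQA, using variance-of-Laplacian sharpness and second-difference jitter as proxies; the thresholds and function names are placeholders, not the values used in the paper.

```python
import cv2
import numpy as np

def sharpness(frame_gray):
    # Variance of the Laplacian: a standard blur/sharpness proxy.
    return cv2.Laplacian(frame_gray, cv2.CV_64F).var()

def jitter(joint_traj):
    # Mean second finite difference of joint positions:
    # large values indicate jerky or noisy teleoperation.
    return np.abs(np.diff(joint_traj, n=2, axis=0)).mean()

def keep_episode(frames_gray, joint_traj,
                 min_sharpness=100.0, max_jitter=0.05, min_state_var=1e-4):
    """Accept an episode only if it passes all quality gates.
    Thresholds here are placeholders, not the paper's values."""
    if np.mean([sharpness(f) for f in frames_gray]) < min_sharpness:
        return False  # blurry camera stream
    if jitter(joint_traj) > max_jitter:
        return False  # noisy demonstration
    if joint_traj.var(axis=0).mean() < min_state_var:
        return False  # near-static episode with little state diversity
    return True
```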
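A toy sketch of how episode-progress prediction and an OOD score might gate execution at run time; the `monitor_step` interface, the score semantics, and the thresholds are all assumptions for illustration.

```python
def monitor_step(progress_pred, ood_score, history,
                 ood_thresh=0.8, stall_window=50):
    """Return 'continue', 'replan', or 'recover' for the current step.

    `progress_pred` is the episode-progress head's output in [0, 1] and
    `ood_score` an out-of-distribution score; both interfaces and the
    thresholds are assumptions for illustration.
    """
    history.append(progress_pred)
    if ood_score > ood_thresh:
        return "recover"  # back off to a known-safe configuration
    if len(history) > stall_window and history[-1] <= history[-stall_window]:
        return "replan"   # no measurable progress; ask the planner again
    return "continue"
```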
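Finally, a sketch of training-free guidance on a flow-matching sampler: each Euler step adds the negative gradient of a quadratic cost that pulls the end-effector dimensions toward a JPM-style 3D target. `velocity_field` and the `ee_slice` layout are assumptions, not Green-VLA's actual interfaces.

```python
import numpy as np

def guided_flow_sample(velocity_field, target_xyz, steps=10, guidance=1.0,
                       action_dim=64, ee_slice=slice(0, 3)):
    """Euler integration of a flow-matching policy with training-free
    guidance toward a predicted 3D target.

    `velocity_field(a, t)` is the learned flow network and `ee_slice`
    marks the action dims assumed to encode end-effector position;
    both are illustrative, not Green-VLA's actual interfaces.
    """
    a = np.random.randn(action_dim)  # sample from the source distribution
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_field(a, k * dt)      # model velocity at (a, t)
        # Negative gradient of 0.5 * ||a_ee - target||^2 pulls the
        # end-effector dimensions toward the target during sampling.
        g = np.zeros(action_dim)
        g[ee_slice] = a[ee_slice] - target_xyz
        a = a + dt * (v - guidance * g)    # guided Euler step
    return a
```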
The training follows a staged curriculum: (L0) base VLM → (L1) web/multimodal pretraining for physical world understanding → (R0) general robotics pretraining on 3,000+ hours of demonstrations → (R1) embodiment-specific supervised fine-tuning → (R2) RL-based policy alignment. This progression builds semantic priors, learns shared affordances, and aligns policies for robust real-world execution.
Results
ALOHA table cleaning (R0, no extra SFT): Green-VLA achieves 83.1% success rate on tape, 52.1% on screwdrivers, 63.7% on pliers, and 69.5% first-item SR with an average completion time of 1m35s, outperforming π0 (2m59s), Agibot GO-1 (3m57s), and other baselines trained with additional finetuning.
SimplerEnv (Google Robot & WidowX): At R0, Green-VLA achieves 60.2% average success on Google Robot tasks (vs. 6.39% for π0 pretrain and 56.8% for π0 finetune) and 45.0% on WidowX pick tasks; R1 raises this to 55.2%, and R2 RL alignment reaches 79.1% overall success.
CALVIN: R1 Green-VLA reaches an average chain length (ACL) of 4.18 (vs. 3.6 for π0); R2 RL alignment pushes ACL to 4.62, demonstrating superior long-horizon task execution and error recovery.
Humanoid (Green Robot): Full upper-body control (head, torso, dual arms, dexterous hands) with instruction-conditioned pick, place, basket retrieval, hand-over, and fruit sorting. Achieves robust task following and multi-step coordination in both in-domain and OOD scene layouts.
E-Commerce (JPM+Guidance): SKU-level shelf picking improves from 36% (base) to 93% (with guidance) on in-domain exact-variant tasks, and from 10% to 72% on out-of-distribution unseen SKUs, demonstrating precise object targeting in dense visual environments.
Below are example results demonstrating our method in various scenes.
ALOHA Inference
Humanoid Inference
Benchmark Results Summary (Models × Datasets)
SimplerEnv: Google Robot Tasks
SimplerEnv evaluation of different policies on Google Robot tasks, using the default number of SimplerEnv episode steps.
Citation
@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
title={Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
author={I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
year={2026},
eprint={2602.00919},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.00919},
}