S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation

1School of Computer Science and Technology, East China Normal University
2School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
*Indicates Equal Contribution | Corresponding Author

Abstract

Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, but their performance degrades significantly in long-horizon tasks due to cumulative error propagation. This limitation largely arises from static feature fusion mechanisms that rely on fixed weights to combine visual, language, and action representations, preventing the model from adapting to different phases of task execution.
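The error-accumulation argument can be made concrete with a simplified sketch. It assumes each stage of a task succeeds independently with the same probability, which is an illustrative simplification and not a model taken from this work; the function name `task_success` is hypothetical.

```python
def task_success(per_stage_success: float, num_stages: int) -> float:
    """Probability that every stage of a multi-stage task succeeds.

    Assumes stage outcomes are independent and identically likely,
    a simplification made for illustration only.
    """
    if not 0.0 <= per_stage_success <= 1.0:
        raise ValueError("per_stage_success must lie in [0, 1]")
    if num_stages < 0:
        raise ValueError("num_stages must be non-negative")
    return per_stage_success ** num_stages


if __name__ == "__main__":
    # Show how a fixed per-stage success rate decays with task length.
    for n in (1, 5, 10, 20):
        print(f"{n:2d} stages at 95% each -> {task_success(0.95, n):.3f}")
```

Under this assumption, a 95% per-stage success rate falls to about 0.60 after 10 stages and about 0.36 after 20, which is why small per-step errors matter so much in long-horizon tasks.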

To address this limitation, we propose S²-VLA, a framework that introduces a State-Space Guided Adaptive Attention (SSGAA) mechanism. SSGAA maintains a belief state that tracks task progression and generates dynamic gating weights to adaptively fuse information from three complementary sources: visual features for spatial perception, task intents for high-level task planning, and temporal action sequences for execution consistency. Despite its compact 2B parameter size, S²-VLA consistently outperforms larger 7B-scale models and achieves state-of-the-art performance on long-horizon manipulation benchmarks.
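The idea of a belief state that drives dynamic gating weights over three feature sources can be sketched as follows. This is a generic recurrent-gate illustration under stated assumptions, not the authors' SSGAA implementation: the class `GatedFusion`, its weight names, the tanh recurrence, and the softmax gate are all hypothetical choices, and the weights are random placeholders rather than learned parameters.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GatedFusion:
    """Hypothetical belief-state gate over visual, intent, and action features."""

    def __init__(self, feat_dim: int, belief_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.feat_dim = feat_dim
        self.w_belief = random_matrix(belief_dim, belief_dim, rng)
        self.w_input = random_matrix(belief_dim, 3 * feat_dim, rng)
        self.w_gate = random_matrix(3, belief_dim, rng)
        self.belief: Vector = [0.0] * belief_dim

    def step(
        self, visual: Sequence[float], intent: Sequence[float], action: Sequence[float]
    ) -> Tuple[Vector, Vector]:
        """Update the belief state, then fuse the three sources with fresh gates."""
        if not len(visual) == len(intent) == len(action) == self.feat_dim:
            raise ValueError("all three feature vectors must have length feat_dim")
        observation = list(visual) + list(intent) + list(action)
        # Recurrent update: the new belief depends on the old belief and the inputs.
        recurrent = matvec(self.w_belief, self.belief)
        driven = matvec(self.w_input, observation)
        self.belief = [math.tanh(r + d) for r, d in zip(recurrent, driven)]
        # One gate per source, recomputed at every step from the belief state.
        gates = softmax(matvec(self.w_gate, self.belief))
        fused = [
            gates[0] * v + gates[1] * i + gates[2] * a
            for v, i, a in zip(visual, intent, action)
        ]
        return fused, gates


if __name__ == "__main__":
    fusion = GatedFusion(feat_dim=4, belief_dim=8)
    for t in range(3):
        fused, gates = fusion.step([0.1 * (t + 1)] * 4, [1.0] * 4, [-0.5] * 4)
        print(t, [round(g, 3) for g in gates])
```

The contrast with static fusion is that the three gate values here are recomputed at every step from a state that carries history, whereas a fixed-weight scheme would use the same three coefficients throughout the task.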

Real-World Evaluations on ALOHA Platform

We evaluated S²-VLA on a real-world ALOHA dual-arm mobile manipulation platform running a 30 Hz control loop. The system was tested on four long-horizon tasks with randomly initialized object positions. We attribute its real-world transfer and its robustness to error accumulation to the SSGAA mechanism.

1. Pick-and-place (Success Rate: 90%)

Procedure: Blue and yellow block positions are randomly initialized. The dual-arm system must accurately pick up the blue block and place it securely into a bowl.

2. Stacking (Success Rate: 80%)

Procedure: After random block position initialization, the model executes precise spatial alignment to stack the yellow block cleanly on top of the blue block.

3. Desktop Organization (Success Rate: 75%)

Procedure: A multi-step task in which the system must first place the yellow block into a bowl and then place the blue block into the same bowl.

4. Utensil Distribution (Success Rate: 60%)

Procedure: High-dexterity bimanual coordination task. The left arm picks up a pair of chopsticks, transfers them mid-air to the right arm, which then places them into a target bowl.

Real-World Success Rate Comparison

Compared with an imitation-learning baseline (ACT) and a tokenization-based model ($\pi_0$-FAST), S²-VLA achieves the highest success rate on all four tasks, with margins of 15 to 25 percentage points over $\pi_0$-FAST and 50 to 60 points over ACT:

| Task | ACT | $\pi_0$-FAST | S²-VLA (Ours) |
|---|---|---|---|
| Pick-and-place | 30% | 75% | 90% |
| Stacking | 20% | 55% | 80% |
| Desktop organization | 20% | 50% | 75% |
| Utensil distribution | 10% | 45% | 60% |
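The margins in the comparison above can be recomputed directly from the reported numbers. The sketch below is only arithmetic on the table values; the dictionary layout and the function names `mean_success` and `gain_over` are hypothetical, and the ASCII keys `pi0-FAST` and `S2-VLA` stand in for $\pi_0$-FAST and S²-VLA.

```python
from statistics import mean
from typing import Dict

# Success rates (%) copied from the real-world comparison table.
REAL_WORLD: Dict[str, Dict[str, int]] = {
    "Pick-and-place": {"ACT": 30, "pi0-FAST": 75, "S2-VLA": 90},
    "Stacking": {"ACT": 20, "pi0-FAST": 55, "S2-VLA": 80},
    "Desktop organization": {"ACT": 20, "pi0-FAST": 50, "S2-VLA": 75},
    "Utensil distribution": {"ACT": 10, "pi0-FAST": 45, "S2-VLA": 60},
}


def mean_success(method: str) -> float:
    """Unweighted mean success rate (%) of one method over the four tasks."""
    return mean(row[method] for row in REAL_WORLD.values())


def gain_over(baseline: str, method: str = "S2-VLA") -> Dict[str, int]:
    """Per-task gain of `method` over `baseline`, in percentage points."""
    return {task: row[method] - row[baseline] for task, row in REAL_WORLD.items()}


if __name__ == "__main__":
    for name in ("ACT", "pi0-FAST", "S2-VLA"):
        print(f"{name:9s} mean success: {mean_success(name):.2f}%")
    print("Gain over pi0-FAST:", gain_over("pi0-FAST"))
    print("Gain over ACT:", gain_over("ACT"))
```

Averaged over the four tasks, this gives 20% for ACT, 56.25% for $\pi_0$-FAST, and 76.25% for S²-VLA. The mean is unweighted because the number of trials per task is not stated here.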

Simulation Benchmarks

We evaluated S²-VLA on two simulation benchmarks: LIBERO (multi-stage, lifelong-learning task suites) and SimplerEnv (real-to-sim policy evaluation on a WidowX arm). S²-VLA achieves the highest average success rate on both, suggesting that adaptive state-guided attention can improve long-horizon execution more effectively than increasing parameter count alone.

LIBERO Benchmark Comparison

Across the four standard LIBERO task suites (Spatial, Object, Goal, and Long), the 2B-parameter S²-VLA matches or exceeds every listed method on each suite, including 7B and 8.5B models, and attains the highest average success rate (98.2%):

| Method | Scale (B) | Spatial (%) | Object (%) | Goal (%) | Long (%) | Avg. (%) |
|---|---|---|---|---|---|---|
| FlowVLA (2025, ArXiv) | 8.5 | 93.2 | 95.0 | 91.6 | 72.6 | 88.1 |
| UnifiedVLA (2025, ArXiv) | 8.5 | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA (2024, CoRL) | 7 | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA-OFT (2025, RSS) | 7 | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| UniVLA (2025, RSS) | 7 | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| MemoryVLA (2025, ArXiv) | 7 | 98.4 | 98.4 | 96.4 | 93.4 | 96.7 |
| CoT-VLA (2025, CVPR) | 7 | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| WorldVLA (2025, ArXiv) | 7 | 87.6 | 99.2 | 83.4 | 60.0 | 81.8 |
| CronusVLA (2025, AAAI) | 7 | 97.3 | 99.6 | 96.9 | 94.0 | 97.0 |
| TraceVLA (2025, ArXiv) | 7 | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 |
| MolmoAct (2025, ArXiv) | 7 | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| ThinkAct (2025, NeurIPS) | 7 | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| 4D-VLA (2025, ArXiv) | 4 | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SpatialVLA (2025, RSS) | 4 | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| $\pi_0$ (2025, RSS) | 3 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| $\pi_0$-FAST (2025, RSS) | 3 | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| SmolVLA (2025, ArXiv) | 2.2 | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GR00T N1 (2025, ArXiv) | 2 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| Seer (2025, ArXiv) | 0.57 | - | - | - | 78.7 | 78.7 |
| VLA-OS (2025, ArXiv) | 0.5 | 87.0 | 96.5 | 92.7 | 66.0 | 85.6 |
| S²-VLA (ours) | 2 | 98.4 | 99.6 | 98.4 | 96.4 | 98.2 |

SimplerEnv-Bridge (WidowX Robot)

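On the SimplerEnv-Bridge tasks with a WidowX arm, S²-VLA achieves the highest average success rate (78.1%) and the best or tied-best result on Carrot on Plate and Eggplant in Basket, while $\pi_0$-Beta* leads on Spoon on Towel and Stack Cube. All values are success rates in percent: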

| Method | Spoon on Towel (%) | Carrot on Plate (%) | Stack Cube (%) | Eggplant in Basket (%) | Avg. (%) |
|---|---|---|---|---|---|
| RT-1-X (2024, ICRA) | 0.0 | 4.2 | 0.0 | 0.0 | 1.1 |
| OpenVLA (2024, CoRL) | 4.2 | 0.0 | 0.0 | 12.5 | 4.2 |
| Octo-Base (2024, ArXiv) | 15.8 | 12.5 | 0.0 | 41.7 | 17.5 |
| RoboVLMs (2024, ArXiv) | 45.8 | 20.8 | 4.2 | 79.2 | 37.5 |
| CogACT-Base (2024, ArXiv) | 71.7 | 50.8 | 15.0 | 67.5 | 51.3 |
| CogACT-Large (2024, ArXiv) | 58.3 | 45.8 | 29.2 | 95.8 | 57.3 |
| $\pi_0$-Uniform* (2024, ArXiv) | 63.3 | 58.8 | 21.3 | 79.2 | 55.7 |
| $\pi_0$-Beta* (2024, ArXiv) | 84.6 | 55.8 | 47.9 | 85.4 | 68.4 |
| Magma (2025, PP) | 37.5 | 29.2 | 20.8 | 91.7 | 44.8 |
| TraceVLA (2025, ArXiv) | 12.5 | 16.6 | 16.6 | 65.0 | 27.7 |
| SpatialVLA (2025, RSS) | 16.7 | 25.0 | 29.2 | 100.0 | 42.7 |
| MemoryVLA (2025, ArXiv) | 75.0 | 75.0 | 37.5 | 100.0 | 71.9 |
| S²-VLA (ours) | 83.3 | 87.5 | 41.7 | 100.0 | 78.1 |
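As a quick consistency check on the table above, the sketch below recomputes the average column as the unweighted mean of the four task columns for the three strongest rows, and lists which of those three leads each task. The layout and the function names `recomputed_avg` and `task_leaders` are hypothetical; `pi0-Beta*` and `S2-VLA` are ASCII stand-ins for $\pi_0$-Beta* and S²-VLA. Leaders are determined among these three rows only, not the full table.

```python
from statistics import mean
from typing import Dict, List

TASKS = ["Spoon on Towel", "Carrot on Plate", "Stack Cube", "Eggplant in Basket"]

# Per-task success rates (%) copied from three rows of the table.
SCORES: Dict[str, List[float]] = {
    "pi0-Beta*": [84.6, 55.8, 47.9, 85.4],
    "MemoryVLA": [75.0, 75.0, 37.5, 100.0],
    "S2-VLA": [83.3, 87.5, 41.7, 100.0],
}

# Averages (%) as reported in the last column of the table.
REPORTED_AVG: Dict[str, float] = {
    "pi0-Beta*": 68.4,
    "MemoryVLA": 71.9,
    "S2-VLA": 78.1,
}


def recomputed_avg(method: str) -> float:
    """Unweighted mean over the four tasks, rounded to one decimal place."""
    return round(mean(SCORES[method]), 1)


def task_leaders() -> Dict[str, List[str]]:
    """For each task, the method or methods (ties kept) with the top score."""
    leaders: Dict[str, List[str]] = {}
    for idx, task in enumerate(TASKS):
        best = max(scores[idx] for scores in SCORES.values())
        leaders[task] = [m for m, scores in SCORES.items() if scores[idx] == best]
    return leaders


if __name__ == "__main__":
    for method, reported in REPORTED_AVG.items():
        print(f"{method:10s} reported {reported:.1f}, recomputed {recomputed_avg(method):.1f}")
    for task, names in task_leaders().items():
        print(f"{task:20s} -> {', '.join(names)}")
```

For these rows the recomputed averages agree with the reported ones, which indicates the average column is a plain mean over the four tasks rather than a trial-weighted figure.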

BibTeX

@article{xie2026s2vla,
  title={S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation},
  author={Zhipeng Xie and Zongyi Han and Xiangyi Wei and Shiliang Sun and Yang Li and Jing Zhao},
  journal={arXiv preprint arXiv:2606.27872},
  year={2026},
  url={https://arxiv.org/abs/2606.27872}
}