S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation

Xie, Zhipeng; Han, Zongyi; Wei, Xiangyi; Sun, Shiliang; Li, Yang; Zhao, Jing

S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation

Zhipeng Xie^1*, Zongyi Han^1*, Xiangyi Wei¹, Shiliang Sun², Yang Li¹, Jing Zhao^1†

¹School of Computer Science and Technology, East China Normal University
²School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
^*Indicates Equal Contribution | ^†Corresponding Author

Paper Code arXiv

Abstract

Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, but their performance degrades significantly in long-horizon tasks due to cumulative error propagation. This limitation largely arises from static feature fusion mechanisms that rely on fixed weights to combine visual, language, and action representations, preventing the model from adapting to different phases of task execution.

To address this limitation, we propose S²-VLA, a framework that introduces a State-Space Guided Adaptive Attention (SSGAA) mechanism. SSGAA maintains a belief state that tracks task progression and generates dynamic gating weights to adaptively fuse information from three complementary sources: visual features for spatial perception, task intents for high-level task planning, and temporal action sequences for execution consistency. Despite its compact 2B parameter size, S²-VLA consistently outperforms larger 7B-scale models and achieves state-of-the-art performance on long-horizon manipulation benchmarks.

Static vs. Dynamic Focus: Static fusion models easily get stuck due to error accumulation. S²-VLA dynamically toggles Visual and Intent Focus to correct deviations

Gating Weight Interpretability: During spatial tasks ("Aim target"), visual attention gating peaks ($g_{visual}$), while subtask execution switching triggers sharp intent peaks ($g_{intent}$).

Real-World Evaluations on ALOHA Platform

We evaluated S²-VLA on a real-world ALOHA dual-arm mobile manipulation platform operating at a 30 Hz control loop. The system was tested across four complex, long-horizon tasks under random position initializations. Thanks to the SSGAA mechanism, S²-VLA demonstrates superior real-world transfer capabilities and robustness against error accumulation.

1. Pick-and-place (Success Rate: 90%)

Procedure: Blue and yellow block positions are randomly initialized. The dual-arm system must accurately pick up the blue block and place it securely into a bowl.

2. Stacking (Success Rate: 80%)

Procedure: After random block position initialization, the model executes precise spatial alignment to stack the yellow block cleanly on top of the blue block.

3. Desktop Organization (Success Rate: 75%)

Procedure: A multi-step composite planning task where the system is required to first place the yellow block into a bowl, followed sequentially by placing the blue block into the same bowl.

4. Utensil Distribution (Success Rate: 60%)

Procedure: High-dexterity bimanual coordination task. The left arm picks up a pair of chopsticks, transfers them mid-air to the right arm, which then places them into a target bowl.

Real-World Success Rate Comparison

Compared to baseline imitation methods (ACT) and parallel tokenization models ($\pi_0$-FAST), S²-VLA achieves significantly higher success rates across all evaluated long-horizon routines:

Task	ACT	$\pi_0$-FAST	S²-VLA (Ours)
Pick-and-place	30%	75%	90%
Stacking	20%	55%	80%
Desktop organization	20%	50%	75%
Utensil distribution	10%	45%	60%

Real-World Bimanual Deployment

Simulation Benchmarks

We evaluated S²-VLA across two rigorous simulation ecosystems: LIBERO (for multi-stage lifelong task horizons) and SimplerEnv (evaluating real-to-sim policy alignment on a WidowX arm). S²-VLA sets a new state-of-the-art success baseline, demonstrating that adaptive state-guided attention scales execution fidelity more reliably than raw parameter scaling.

LIBERO Benchmark Comparison

Evaluated across the four standard task distributions (Spatial, Object, Goal, and Long-Horizon), S²-VLA consistently outperforms larger 7B and 8.5B scale static networks:

Method	Scale (B)	Spatial (%)	Object (%)	Goal (%)	Long (%)	Avg. (%)
FlowVLA (2025, ArXiv)	8.5	93.2	95.0	91.6	72.6	88.1
UnifiedVLA (2025, ArXiv)	8.5	95.4	98.8	93.6	94.0	95.5
OpenVLA (2024, CoRL)	7	84.7	88.4	79.2	53.7	76.5
OpenVLA-OFT (2025, RSS)	7	97.6	98.4	97.9	94.5	97.1
UniVLA (2025, RSS)	7	96.5	96.8	95.6	92.0	95.2
MemoryVLA (2025, ArXiv)	7	98.4	98.4	96.4	93.4	96.7
CoT-VLA (2025, CVPR)	7	87.5	91.6	87.6	69.0	81.1
WorldVLA (2025, ArXiv)	7	87.6	99.2	83.4	60.0	81.8
CronusVLA (2025, AAAI)	7	97.3	99.6	96.9	94.0	97.0
TraceVLA (2025, ArXiv)	7	84.6	85.2	75.1	54.1	74.8
MolmoAct (2025, ArXiv)	7	87.0	95.4	87.6	77.2	86.6
ThinkAct (2025, NeurIPS)	7	88.3	91.4	87.1	70.9	84.4
4D-VLA (2025, ArXiv)	4	88.9	95.2	90.9	79.1	88.6
SpatialVLA (2025, RSS)	4	88.2	89.9	78.6	55.5	78.1
$\pi_0$ (2025, RSS)	3	96.8	98.8	95.8	85.2	94.2
$\pi_0$-FAST (2025, RSS)	3	96.4	96.8	88.6	60.2	85.5
SmolVLA (2025, ArXiv)	2.2	93.0	94.0	91.0	77.0	88.8
GR00T N1 (2025, ArXiv)	2	94.4	97.6	93.0	90.6	93.9
Seer (2025, ArXiv)	0.57	-	-	-	78.7	78.7
VLA-OS (2025, ArXiv)	0.5	87.0	96.5	92.7	66.0	85.6
S²-VLA (ours)	2	98.4	99.6	98.4	96.4	98.2

SimplerEnv-Bridge (WidowX Robot)

Evaluated under realistic simulation tracking, our model recovers missing geometric details better than alternative policy designs:

Method	Spoon on Towel	Carrot on Plate	Stack Cube	Eggplant in Basket	Avg. Success
RT-1-X (2024, ICRA)	0.0	4.2	0.0	0.0	1.1
OpenVLA (2024, RSS)	4.2	0.0	0.0	12.5	4.2
Octo-Base (2024, ArXiv)	15.8	12.5	0.0	41.7	17.5
RoboVLMs (2024, ArXiv)	45.8	20.8	4.2	79.2	37.5
CogACT-Base (2024, ArXiv)	71.7	50.8	15.0	67.5	51.3
CogACT-Large (2024, ArXiv)	58.3	45.8	29.2	95.8	57.3
$\pi_0$-Uniform* (2024, ArXiv)	63.3	58.8	21.3	79.2	55.7
$\pi_0$-Beta* (2024, ArXiv)	84.6	55.8	47.9	85.4	68.4
Magma (2025, PP)	37.5	29.2	20.8	91.7	44.8
TraceVLA (2025, ArXiv)	12.5	16.6	16.6	65.0	27.7
SpatialVLA (2025, ArXiv)	16.7	25.0	29.2	100.0	42.7
MemoryVLA (2025, ArXiv)	75.0	75.0	37.5	100.0	71.9
S²-VLA (ours)	83.3	87.5	41.7	100.0	78.1

BibTeX

@article{xie2026s2vla,
  title={S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation},
  author={Zhipeng Xie, Zongyi Han, Xiangyi Wei, Shiliang Sun, Yang Li and Jing Zhao},
  journal={arXiv preprint arXiv:2606.27872},
  year={2026},
  url={https://arxiv.org/abs/2606.27872}
}