Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

arXiv:2512.20595v1
Abstract: We introduce Cube Bench, a Rubik’s-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one’s own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
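The abstract reports results "as a function of scramble depth". As a rough illustration of that kind of aggregation (not the authors' evaluation code), the sketch below groups per-trial outcomes by depth and computes accuracy for each group. The function name `accuracy_by_depth`, the `(depth, is_correct)` record layout, and the toy records are all assumptions made for this example; they are not taken from the paper and are not benchmark results.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    """Return {scramble_depth: accuracy} for (depth, is_correct) pairs.

    Hypothetical helper: the record layout is an assumption for this
    sketch, not the format used by Cube Bench.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, is_correct in records:
        totals[depth] += 1
        hits[depth] += int(bool(is_correct))
    # Sort by depth so the result reads as a curve over scramble depth.
    return {depth: hits[depth] / totals[depth] for depth in sorted(totals)}


if __name__ == "__main__":
    # Toy, made-up records purely to exercise the function.
    toy_records = [(1, True), (1, True), (2, True), (2, False), (3, False)]
    for depth, acc in accuracy_by_depth(toy_records).items():
        print(f"depth={depth} accuracy={acc:.2f}")
```

Keeping the aggregation separate from prompting and parsing mirrors the abstract's emphasis on identical prompts, parsers, and a shared metric across models: the same function could be applied to each model's trial log unchanged.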
