Efficient High-Resolution Visual Understanding for Vision-Language Models

[Submitted on 26 Sep 2025 (v1), last revised 17 Mar 2026 (this version, v2)]

Paper: ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models, by Jewon Lee and 7 other authors

Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of “thinking with images” models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage “coarse-to-fine” reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage, once the input image has been downsampled, because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas when answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competing methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: this https URL.
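The geometry behind the two-stage pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: in ERGO the region is chosen by a model trained with reinforcement learning, whereas here the coarse-stage box is a hard-coded, hypothetical value. The names `Box`, `to_full_resolution`, `expand`, and `token_estimate`, the fixed margin ratio, and the assumption of roughly one vision token per 28x28 pixel block are all illustrative choices, not details taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def to_full_resolution(box: Box, coarse_size: tuple, full_size: tuple) -> Box:
    """Map a box found on the downsampled image back to full-resolution pixels."""
    sx = full_size[0] / coarse_size[0]
    sy = full_size[1] / coarse_size[1]
    return Box(
        left=math.floor(box.left * sx),
        top=math.floor(box.top * sy),
        right=math.ceil(box.right * sx),
        bottom=math.ceil(box.bottom * sy),
    )


def expand(box: Box, margin_ratio: float, full_size: tuple) -> Box:
    """Grow the box on every side by a fraction of its size, clamped to the image.

    A stand-in for "expanding the cropped region to cover visually ambiguous
    areas"; a fixed ratio is an assumption, not the paper's learned behaviour.
    """
    dx = round(box.width * margin_ratio)
    dy = round(box.height * margin_ratio)
    return Box(
        left=max(0, box.left - dx),
        top=max(0, box.top - dy),
        right=min(full_size[0], box.right + dx),
        bottom=min(full_size[1], box.bottom + dy),
    )


def token_estimate(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Rough vision-token count, assuming one token per square pixel block."""
    p = pixels_per_token_side
    return math.ceil(width / p) * math.ceil(height / p)


# Hypothetical example: a 2240x1680 image downsampled 4x for the coarse stage.
full_size = (2240, 1680)
coarse_size = (560, 420)
coarse_box = Box(100, 50, 200, 150)  # pretend output of the coarse stage

full_box = to_full_resolution(coarse_box, coarse_size, full_size)
padded = expand(full_box, margin_ratio=0.25, full_size=full_size)

# Stage 1 sees the whole downsampled image; stage 2 sees only the padded crop.
two_stage_tokens = token_estimate(*coarse_size) + token_estimate(padded.width, padded.height)
single_pass_tokens = token_estimate(*full_size)
print(two_stage_tokens, single_pass_tokens)
```

The comparison printed at the end only shows why cropping can cut the token budget under these toy numbers; it is not an attempt to reproduce the 23% figure reported in the abstract.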

Submission history

From: Wooksu Shin
[v1] Fri, 26 Sep 2025 07:15:19 UTC (2,902 KB)
[v2] Tue, 17 Mar 2026 09:34:26 UTC (3,001 KB)
