[2512.15372] Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

[Submitted on 17 Dec 2025]

Authors: Mikel Williams-Lekuona and Georgina Cosma


Abstract: Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup whilst maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
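The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `route_and_embed`, `rank_texts`, and `mean_gflops`, the 0.5 threshold, and the callables standing in for the complexity scorer and the two encoder paths are all hypothetical. The sketch only assumes what the abstract states: a complexity estimate selects a reduced-compute or full-compute path, both paths yield embeddings in the same space as the text embeddings, and matching is a direct similarity ranking with no reranking stage.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def route_and_embed(
    image: object,
    complexity_fn: Callable[[object], float],    # hypothetical stand-in for a complexity scorer
    early_encoder: Callable[[object], Vector],   # hypothetical reduced-compute path
    full_encoder: Callable[[object], Vector],    # hypothetical full-depth path
    threshold: float = 0.5,                      # assumed cut-off, not taken from the paper
) -> Tuple[Vector, str]:
    """Pick an encoder path from the complexity score and return (embedding, path)."""
    if complexity_fn(image) < threshold:
        return l2_normalize(early_encoder(image)), "early"
    return l2_normalize(full_encoder(image)), "full"


def rank_texts(image_emb: Sequence[float], text_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return text indices ordered by descending cosine similarity to the image."""
    unit_image = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(unit_image, l2_normalize(t)))
        for t in text_embs
    ]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def mean_gflops(early_fraction: float, early_cost: float, full_cost: float = 175.33) -> float:
    """Average per-image encoder cost when a fraction of images exits early.

    full_cost defaults to the ViT-L/14 figure quoted in the abstract; early_cost
    is a caller-supplied assumption, and the cost of the complexity scorer
    itself is ignored here.
    """
    if not 0.0 <= early_fraction <= 1.0:
        raise ValueError("early_fraction must lie in [0, 1]")
    return early_fraction * early_cost + (1.0 - early_fraction) * full_cost
```

Because both paths return unit-length vectors in one space, `rank_texts` can be applied to either output unchanged, which is the property that makes a separate reranking stage unnecessary in this toy setting.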

Submission history

From: Mikel Williams-Lekuona
[v1] Wed, 17 Dec 2025 12:19:54 UTC (2,257 KB)
