Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

[Submitted on 30 May 2025 (v1), last revised 26 Mar 2026 (this version, v2)]

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition, by Yuwen Tan and 2 other authors

Abstract: This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world and are unaware of even well-established biological taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs’ hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM on our VQA tasks reaffirms the LLMs’ bottleneck effect, because the VQA tasks improve the LLMs’ hierarchical consistency more than the vision LLMs’. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess the corresponding taxonomy knowledge.
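To make the setup in the abstract more concrete, the following is a minimal sketch of how four-choice questions might be built from a taxonomy and how a hierarchical consistency score might be computed. Everything here is an assumption for illustration: the toy `PARENT` taxonomy, the distractor sampling rule, the helper names, and the definition of consistency (an image counts as consistent only if every level on its leaf-to-root path is answered correctly) are not taken from the paper, which may define these differently.

```python
import random

# Hypothetical toy taxonomy (child -> parent); not one of the paper's six taxonomies.
PARENT = {
    "anemone fish": "fish",
    "goldfish": "fish",
    "fish": "vertebrate",
    "bald eagle": "bird",
    "bird": "vertebrate",
    "monarch butterfly": "insect",
    "insect": "invertebrate",
    "garden spider": "arachnid",
    "arachnid": "invertebrate",
    "vertebrate": "animal",
    "invertebrate": "animal",
}
ALL_LABELS = set(PARENT) | set(PARENT.values())


def path_to_root(label):
    """Return the labels from `label` up to the taxonomy root, inclusive."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(label):
    """Number of edges between `label` and the root."""
    return len(path_to_root(label)) - 1


def make_questions(leaf, rng):
    """Build one four-choice question per non-root level on the leaf's path.

    Distractors are never ancestors of the leaf (those would also be correct);
    labels at the same depth as the answer are preferred (assumed rule).
    """
    path = path_to_root(leaf)
    off_path = sorted(ALL_LABELS - set(path))
    questions = []
    for answer in path[:-1]:  # skip the root, which is true for every image
        same_depth = [l for l in off_path if depth(l) == depth(answer)]
        others = [l for l in off_path if depth(l) != depth(answer)]
        rng.shuffle(same_depth)
        rng.shuffle(others)
        choices = (same_depth + others)[:3] + [answer]
        rng.shuffle(choices)
        questions.append({"answer": answer, "choices": choices})
    return questions


def hierarchical_consistency(results):
    """Fraction of images whose questions are all answered correctly.

    `results` holds one list of booleans per image, one boolean per level.
    """
    if not results:
        return 0.0
    return sum(all(levels) for levels in results) / len(results)


questions = make_questions("anemone fish", random.Random(0))
# One image fully correct, one that misses a coarser level -> 0.5
score = hierarchical_consistency([[True, True, True], [True, False, True]])
```

Under this assumed definition, a model that names the leaf ("anemone fish") but misses a coarser level ("vertebrate") counts as inconsistent for that image, which mirrors the failure case the abstract mentions.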

Submission history

From: Yuan Qing
[v1] Fri, 30 May 2025 17:40:46 UTC (4,834 KB)
[v2] Thu, 26 Mar 2026 17:38:41 UTC (4,807 KB)
