[2503.06884] Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

[Submitted on 10 Mar 2025 (v1), last revised 23 Jan 2026 (this version, v2)]

Authors: Xuyang Guo and 6 other authors

Abstract: Generative modeling is widely regarded as one of the most essential problems in today’s AI community, with text-to-image generation having had unprecedented real-world impact. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability.

Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
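As a rough illustration of the kind of measurement described above, the following sketch tallies exact-match counting accuracy per requested object count. It is not the paper's evaluation code: the function name `counting_accuracy`, the record layout, the exact-match criterion, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def counting_accuracy(records):
    """Return exact-match accuracy grouped by requested object count.

    `records` is an iterable of (requested_count, observed_count) pairs,
    where observed_count is assumed to come from a human annotator.
    This grouping is a hypothetical stand-in for difficulty levels.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for requested, observed in records:
        totals[requested] += 1
        if observed == requested:
            correct[requested] += 1
    # Accuracy per requested count, ordered from fewest to most objects.
    return {n: correct[n] / totals[n] for n in sorted(totals)}


# Made-up annotations, for illustration only (not data from the paper).
sample = [(2, 2), (2, 2), (2, 3), (5, 5), (5, 4), (5, 7), (10, 8), (10, 12)]
accuracy = counting_accuracy(sample)
for requested, acc in accuracy.items():
    print(f"requested={requested}: accuracy={acc:.2f}")
```

Grouping by the requested count makes it easy to see whether accuracy falls as the number of objects grows, which is the trend the abstract reports.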

Submission history

From: Xuyang Guo
[v1] Mon, 10 Mar 2025 03:28:18 UTC (19,443 KB)
[v2] Fri, 23 Jan 2026 07:51:15 UTC (19,701 KB)
