[2509.18234] The Illusion of Readiness in Health AI

[Submitted on 22 Sep 2025 (v1), last revised 11 Dec 2025 (this version, v3)]

Authors: Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel CF Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Hao Cheng, HoHin Lee, Praneeth Sanapathi, Sarah Hilado, Tristan Naumann, Javier Alvarez-Valle, Jiang Bian, Mu Wei, Khalil Malik, Lidong Zhou, Jianfeng Gao, Eric Horvitz, Matthew P. Lungren, Doug Burger, Eric Topol, Hoifung Poon, Paul Vozila


Abstract: Large language models have demonstrated remarkable performance in a wide range of medical benchmarks. Yet underneath the seemingly promising results lie salient growth areas, especially in cutting-edge frontiers such as multimodal reasoning. In this paper, we introduce a series of adversarial stress tests to systematically assess the robustness of flagship models and medical benchmarks. Our study reveals prevalent brittleness in the presence of simple adversarial transformations: leading systems can guess the right answer even with key inputs removed, yet may get confused by the slightest prompt alterations, while fabricating convincing yet flawed reasoning traces. Using clinician-guided rubrics, we demonstrate that popular medical benchmarks vary widely in what they truly measure. Our study reveals significant competency gaps of frontier AI in attaining real-world readiness for health applications. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold AI systems accountable to ensure robustness, sound reasoning, and alignment with real medical demands.

Submission history

From: Yu Gu
[v1] Mon, 22 Sep 2025 17:48:05 UTC (17,963 KB)
[v2] Wed, 1 Oct 2025 17:21:09 UTC (16,928 KB)
[v3] Thu, 11 Dec 2025 20:55:53 UTC (19,298 KB)
