Closing the Gap Between Text and Speech Understanding in LLMs
by Santiago Cuervo and 7 other authors
Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
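The abstract names cross-modal distillation as one of SALAD's two ingredients but gives no implementation detail. Below is a minimal PyTorch sketch of one common form of that idea: a temperature-scaled KL divergence pulling the speech-adapted student's next-token distributions toward those of the frozen text teacher at time-aligned positions. The function name, temperature value, and position-alignment assumption are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab) from the speech-adapted LLM
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab) from the frozen text LLM
    temperature: float = 2.0,      # illustrative value; not specified in the abstract
) -> torch.Tensor:
    """Temperature-scaled KL divergence between teacher and student
    next-token distributions, assuming positions are already aligned."""
    vocab = student_logits.size(-1)
    # Flatten to (batch * seq_len, vocab) so 'batchmean' averages per token.
    s = F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1)
    t = F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1)
    # Scale by T^2, the usual convention in distillation objectives.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```

In a setup like this, the distillation term would typically be added to the standard task loss with a mixing weight; because the teacher is the original text LLM, the same term would also push against forgetting of text capabilities during adaptation. SALAD's exact objective and alignment scheme are described in the paper itself, not here.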
Submission history
From: Zakaria Aldeneh
[v1] Wed, 15 Oct 2025 14:57:16 UTC (291 KB)
[v2] Mon, 23 Feb 2026 18:05:51 UTC (290 KB)

