Step-Audio-EditX Technical Report, by Chao Yan and 13 other authors
Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model that excels at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
Submission history
From: Boyong Wu
[v1] Wed, 5 Nov 2025 16:22:19 UTC (1,174 KB)
[v2] Wed, 19 Nov 2025 04:56:09 UTC (1,172 KB)

