
    [2510.10767] Understanding Sampler Stochasticity in Training Diffusion Models for RLHF

By Awais · December 17, 2025

    [Submitted on 12 Oct 2025 (v1), last revised 16 Dec 2025 (this version, v2)]

Authors: Jiayuan Sheng and 4 other authors

    Abstract: Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between the stochastic samplers used during training and the deterministic samplers used during inference. In practice, models are fine-tuned with stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected at inference time. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity while preserving the data marginals throughout. Empirically, large-scale experiments on text-to-image models with denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) confirm that the reward gap consistently narrows over training and that ODE sampling quality improves when models are updated with higher-stochasticity SDE training.
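    As a rough illustration of the mechanism the abstract describes (a sampler whose stochasticity can be dialed up or down without changing the data marginals), the sketch below uses the standard DDIM update with an eta parameter: eta = 0 gives the deterministic ODE-style sampler used at inference, and eta = 1 recovers an SDE-style stochastic sampler. Everything in it is an assumption made for illustration, not the paper's code: the toy 1-D Gaussian data distribution (whose optimal noise predictor has a closed form and stands in for a trained model), the noise schedule, and the ad-hoc reward used to estimate a reward gap.

    ```python
    # Minimal sketch (assumptions: toy 1-D Gaussian data, analytic eps-predictor,
    # hand-picked reward; not the paper's implementation or experimental setup).
    # A DDIM-style update with an eta knob interpolates between the deterministic
    # ODE sampler (eta = 0) and a stochastic SDE-like sampler (eta = 1) while
    # keeping the same data marginals, which is the setting in which the paper's
    # "reward gap" between training-time and inference-time samplers arises.
    import numpy as np

    rng = np.random.default_rng(0)

    # Variance-preserving (VP) noise schedule.
    T = 200
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)                     # \bar{alpha}_t

    # Toy data distribution: x0 ~ N(mu, s2). Gaussian data makes the optimal
    # noise predictor available in closed form (stand-in for a trained model).
    mu, s2 = 2.0, 0.25

    def eps_pred(x_t, t):
        """Optimal noise prediction E[eps | x_t] for Gaussian data."""
        a = abar[t]
        var_t = a * s2 + (1.0 - a)                # marginal variance of x_t
        x0_hat = mu + (np.sqrt(a) * s2 / var_t) * (x_t - np.sqrt(a) * mu)
        return (x_t - np.sqrt(a) * x0_hat) / np.sqrt(1.0 - a)

    def sample(eta, n=5000):
        """DDIM-style reverse sampling; eta=0 is the deterministic ODE path."""
        x = rng.standard_normal(n)                # x_T ~ N(0, 1)
        for t in range(T - 1, 0, -1):
            a_t, a_prev = abar[t], abar[t - 1]
            eps = eps_pred(x, t)
            x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
            # DDIM noise scale: eta interpolates ODE <-> SDE, same marginals.
            sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
            dir_coeff = np.sqrt(np.maximum(1.0 - a_prev - sigma**2, 0.0))
            x = np.sqrt(a_prev) * x0_hat + dir_coeff * eps + sigma * rng.standard_normal(n)
        return x

    def reward(x):
        """Illustrative reward: closeness to a target (not the paper's reward model)."""
        return -np.abs(x - 2.0)

    r_ode = reward(sample(eta=0.0)).mean()        # inference-style deterministic sampler
    r_sde = reward(sample(eta=1.0)).mean()        # training-style stochastic sampler
    print(f"mean reward, ODE sampler (eta=0): {r_ode:.4f}")
    print(f"mean reward, SDE sampler (eta=1): {r_sde:.4f}")
    print(f"empirical reward gap: {abs(r_ode - r_sde):.4f}")
    ```

    With an exact noise predictor the eta = 0 and eta = 1 marginals coincide, so the printed gap reflects only sampling noise; for an imperfect, fine-tuned model the two samplers generally disagree, and that discrepancy is the quantity the paper's bounds are meant to control.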

    Submission history

    From: Jiayuan Sheng
    [v1] Sun, 12 Oct 2025 19:08:38 UTC (29,945 KB)
    [v2] Tue, 16 Dec 2025 18:10:07 UTC (38,490 KB)
