Bootstrapping LLMs via Preference-Based Policy Optimization, by Chen Jia
Abstract: Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both the sequence-level RM and token-level RM settings, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.
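To make the abstract's structure concrete, the following is a minimal toy sketch of an iterative min-max loop of this general flavor: a softmax policy over a small discrete response set is trained against a linear reward model kept near a confidence set defined by the Bradley-Terry log-likelihood of preference data collected online from the current policy. Everything here (the finite response set, linear RM, penalty-based inner solver, and all hyperparameters) is an illustrative assumption for exposition, not the paper's actual algorithm or implementation.

```python
# Toy sketch of a preference-based policy-optimization (PbPO) style loop.
# Assumptions (not from the paper): finite response set, linear RM features,
# Bradley-Terry preference model, penalty method for the confidence set.
import numpy as np

rng = np.random.default_rng(0)

N_ARMS, DIM = 8, 4
PHI = rng.normal(size=(N_ARMS, DIM))          # response features
THETA_TRUE = rng.normal(size=DIM)             # latent "annotator" reward

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rm_rewards(theta):
    return PHI @ theta                        # reward of each response under the RM

def nll(theta, pairs):
    """Bradley-Terry negative log-likelihood of observed preference pairs."""
    a, b, y = (np.array(x) for x in zip(*pairs))
    p = sigmoid((PHI[a] - PHI[b]) @ theta)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def nll_grad(theta, pairs):
    a, b, y = (np.array(x) for x in zip(*pairs))
    diff = PHI[a] - PHI[b]
    p = sigmoid(diff @ theta)
    return (diff * (p - y)[:, None]).mean(axis=0)

def policy_probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def collect_preferences(logits, n_pairs):
    """Guided exploration: sample response pairs from the current policy and
    query a simulated annotator for a preference label."""
    probs = policy_probs(logits)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.choice(N_ARMS, size=2, replace=False, p=probs)
        y = int(rng.random() < sigmoid((PHI[a] - PHI[b]) @ THETA_TRUE))
        pairs.append((a, b, y))
    return pairs

def pessimistic_rm(theta_init, logits, pairs, beta, steps=200, lr=0.1, lam=10.0):
    """Inner 'min' player: drive the policy value down while paying a penalty
    for leaving { theta : NLL(theta) <= NLL(theta_MLE) + beta } (approximate)."""
    probs = policy_probs(logits)
    theta_mle = theta_init.copy()
    for _ in range(steps):                    # rough MLE by gradient descent
        theta_mle -= lr * nll_grad(theta_mle, pairs)
    level = nll(theta_mle, pairs) + beta
    theta = theta_mle.copy()
    for _ in range(steps):
        grad_value = PHI.T @ probs            # d E_pi[r_theta] / d theta
        penalty = lam * nll_grad(theta, pairs) if nll(theta, pairs) > level else 0.0
        theta -= lr * (grad_value + penalty)  # minimize value, stay near the set
    return theta

def train(rounds=20, pairs_per_round=50, beta=0.05, policy_lr=0.5):
    logits, theta, data = np.zeros(N_ARMS), np.zeros(DIM), []
    for t in range(rounds):
        data += collect_preferences(logits, pairs_per_round)   # online data collection
        theta = pessimistic_rm(theta, logits, data, beta)      # RM (min player)
        probs = policy_probs(logits)
        r = rm_rewards(theta)
        # Outer 'max' player: one policy-gradient step on E_pi[r_theta].
        logits += policy_lr * probs * (r - probs @ r)
        print(f"round {t:2d}  RM value {probs @ r: .3f}  "
              f"true value {probs @ rm_rewards(THETA_TRUE): .3f}")
    return logits

if __name__ == "__main__":
    train()
```

The sketch only mirrors the three moving parts named in the abstract, namely online preference collection under the current policy, an RM restricted by a likelihood-based confidence set, and alternating policy/RM updates; for the actual objective, regret bounds, and sequence-level versus token-level RM variants, see the paper itself.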
Submission history
From: Chen Jia
[v1] Mon, 17 Nov 2025 01:41:14 UTC (1,207 KB)
[v2] Wed, 24 Dec 2025 13:31:20 UTC (1,312 KB)

