Bootstrapping LLMs via Preference-Based Policy Optimization, by Chen Jia
Abstract: Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both the sequence-level RM and token-level RM settings, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.
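To make the abstract's structure concrete, the following is a minimal toy sketch of an iterative min-max loop of this general flavor: a softmax policy over a small discrete response set is trained against a linear reward model kept near a confidence set defined by the Bradley-Terry log-likelihood of preference data collected online from the current policy. Everything here (the finite response set, linear RM, penalty-based inner solver, and all hyperparameters) is an illustrative assumption for exposition, not the paper's actual algorithm or implementation.

```python
# Toy sketch of a preference-based policy-optimization (PbPO) style loop.
# Assumptions (not from the paper): finite response set, linear RM features,
# Bradley-Terry preference model, penalty method for the confidence set.
import numpy as np

rng = np.random.default_rng(0)

N_ARMS, DIM = 8, 4
PHI = rng.normal(size=(N_ARMS, DIM))          # response features
THETA_TRUE = rng.normal(size=DIM)             # latent "annotator" reward

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rm_rewards(theta):
    return PHI @ theta                        # reward of each response under the RM

def nll(theta, pairs):
    """Bradley-Terry negative log-likelihood of observed preference pairs."""
    a, b, y = (np.array(x) for x in zip(*pairs))
    p = sigmoid((PHI[a] - PHI[b]) @ theta)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def nll_grad(theta, pairs):
    a, b, y = (np.array(x) for x in zip(*pairs))
    diff = PHI[a] - PHI[b]
    p = sigmoid(diff @ theta)
    return (diff * (p - y)[:, None]).mean(axis=0)

def policy_probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def collect_preferences(logits, n_pairs):
    """Guided exploration: sample response pairs from the current policy and
    query a simulated annotator for a preference label."""
    probs = policy_probs(logits)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.choice(N_ARMS, size=2, replace=False, p=probs)
        y = int(rng.random() < sigmoid((PHI[a] - PHI[b]) @ THETA_TRUE))
        pairs.append((a, b, y))
    return pairs

def pessimistic_rm(theta_init, logits, pairs, beta, steps=200, lr=0.1, lam=10.0):
    """Inner 'min' player: drive the policy value down while paying a penalty
    for leaving { theta : NLL(theta) <= NLL(theta_MLE) + beta } (approximate)."""
    probs = policy_probs(logits)
    theta_mle = theta_init.copy()
    for _ in range(steps):                    # rough MLE by gradient descent
        theta_mle -= lr * nll_grad(theta_mle, pairs)
    level = nll(theta_mle, pairs) + beta
    theta = theta_mle.copy()
    for _ in range(steps):
        grad_value = PHI.T @ probs            # d E_pi[r_theta] / d theta
        penalty = lam * nll_grad(theta, pairs) if nll(theta, pairs) > level else 0.0
        theta -= lr * (grad_value + penalty)  # minimize value, stay near the set
    return theta

def train(rounds=20, pairs_per_round=50, beta=0.05, policy_lr=0.5):
    logits, theta, data = np.zeros(N_ARMS), np.zeros(DIM), []
    for t in range(rounds):
        data += collect_preferences(logits, pairs_per_round)   # online data collection
        theta = pessimistic_rm(theta, logits, data, beta)      # RM (min player)
        probs = policy_probs(logits)
        r = rm_rewards(theta)
        # Outer 'max' player: one policy-gradient step on E_pi[r_theta].
        logits += policy_lr * probs * (r - probs @ r)
        print(f"round {t:2d}  RM value {probs @ r: .3f}  "
              f"true value {probs @ rm_rewards(THETA_TRUE): .3f}")
    return logits

if __name__ == "__main__":
    train()
```

The sketch only mirrors the three moving parts named in the abstract, namely online preference collection under the current policy, an RM restricted by a likelihood-based confidence set, and alternating policy/RM updates; for the actual objective, regret bounds, and sequence-level versus token-level RM variants, see the paper itself.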
Submission history
From: Chen Jia
[v1] Mon, 17 Nov 2025 01:41:14 UTC (1,207 KB)
[v2] Wed, 24 Dec 2025 13:31:20 UTC (1,312 KB)

