    Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

By Awais · April 7, 2026

    [Submitted on 30 Mar 2026 (v1), last revised 4 Apr 2026 (this version, v2)]

    View a PDF of the paper titled OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training, by Haiyue Song and Masao Utiyama


Abstract: Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: the ratios must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data-mixture and model-averaging baselines at 15–35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data-mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
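The core mechanism the abstract describes — extracting each dataset's parameter shift, merging the shifts into the base model with per-dataset weights, and searching those weights post-hoc — can be sketched on toy parameter arrays. This is a minimal illustration, not the paper's implementation: all names are hypothetical, the "models" are flat parameter lists, and simple random search stands in for the Bayesian optimization the paper actually uses.

```python
import random

def distribution_vector(base, cpt):
    # Parameter shift induced by continual pre-training on one dataset.
    return [c - b for b, c in zip(base, cpt)]

def merge(base, vectors, weights):
    # merged = base + sum_i w_i * v_i, applied element-wise.
    merged = list(base)
    for w, v in zip(weights, vectors):
        merged = [m + w * d for m, d in zip(merged, v)]
    return merged

def search_weights(base, vectors, objective, trials=200, seed=0):
    # Post-hoc weight search over the fixed vector pool;
    # a random-search stand-in for Bayesian optimization.
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(trials):
        w = [rng.random() for _ in vectors]
        score = objective(merge(base, vectors, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy setup: a base "model" and two hypothetical CPT models.
base = [0.0, 0.0, 0.0]
cpt_ja = [1.0, 0.2, 0.0]    # hypothetical Japanese-CPT parameters
cpt_code = [0.0, 0.1, 1.0]  # hypothetical Code-CPT parameters
vectors = [distribution_vector(base, m) for m in (cpt_ja, cpt_code)]

# Hypothetical objective: negative squared distance to a target vector
# (in the paper this would be a downstream evaluation metric).
target = [0.5, 0.15, 0.5]
objective = lambda p: -sum((a - t) ** 2 for a, t in zip(p, target))

weights, score = search_weights(base, vectors, objective)
```

Note that re-running `search_weights` with a different `objective` reuses the same vector pool without retraining — which is the "target-tailored models on demand" property the abstract highlights.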

Submission history

From: Haiyue Song
[v1] Mon, 30 Mar 2026 18:00:02 UTC (1,202 KB)
[v2] Sat, 4 Apr 2026 01:16:42 UTC (1,210 KB)
