Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

arXiv:2511.17473v1 Announce Type: cross
Abstract: Test-time scaling has been shown to substantially improve large language models’ (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR’s scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT’s self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via “masked-then-fill” and “step reordering” to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR’s scalability and performance in only outcome-verifiable settings.

What's Hot

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

One Model to Rule Them All? SAP-RPT-1 and the Future of Tabular Foundation Models

Bridging Facts for Cross-Document Reasoning at Index Time

SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

How a Neural Network Learned Its Own Fraud Rules: A Neuro-Symbolic AI Experiment

Bridging Modality Gap with Temporal Evolution Semantic Space

How to Effectively Review Claude Code Output

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

One Model to Rule Them All? SAP-RPT-1 and the Future of Tabular Foundation Models

Why customer personas help you win earlier in AI search

Broccoli Confetti Rice Recipe | Epicurious

SEO Test Shows It’s Trivial To Rank Misinformation On Google

SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

How a Neural Network Learned Its Own Fraud Rules: A Neuro-Symbolic AI Experiment

Most Popular

13 Trending Songs on TikTok in Nov 2025 (+ How to Use Them)

How to watch the 2026 GRAMMY Awards online from anywhere

Corporate Reputation Management Strategies | Sprout Social

Our Picks

At Least 32 People Dead After a Mine Bridge Collapsed Due to Overcrowding

Here’s how I turned a Raspberry Pi into an in-car media server

Beloved SF cat’s death fuels Waymo criticism

Subscribe to Updates

What's Hot

Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Related Posts

Subscribe to Updates