RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
by Dongyub Jude Lee and 2 other authors
Abstract: Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes the reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded according to how closely it matches the teacher's refinement. Guided by two complementary signals, (i) negative edit distance, which promotes lexical and structural fidelity, and (ii) COMET score, which ensures semantic adequacy, the actor progressively learns to emulate the teacher, mirroring the incremental, iterative way humans learn. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
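The abstract describes a reward combining negative edit distance to the teacher's refinement with a semantic-adequacy score. The sketch below illustrates one plausible reading of that combination; the weights `alpha` and `beta`, the token-level normalization, and the `comet_score` argument (a stand-in for a real COMET model's output) are illustrative assumptions, not the paper's exact formulation.

```python
def levenshtein(a: str, b: str) -> int:
    """Token-level edit distance between two strings, via dynamic programming."""
    s, t = a.split(), b.split()
    prev = list(range(len(t) + 1))
    for i, sa in enumerate(s, 1):
        curr = [i]
        for j, tb in enumerate(t, 1):
            cost = 0 if sa == tb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def rlfr_reward(hypothesis: str, refinement: str, comet_score: float,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """Hypothetical combined reward: length-normalized negative edit distance
    to the teacher's refinement, plus a semantic-adequacy score supplied by
    an external metric (COMET in the paper; passed in as a float here)."""
    max_len = max(len(hypothesis.split()), len(refinement.split()), 1)
    neg_edit = -levenshtein(hypothesis, refinement) / max_len
    return alpha * neg_edit + beta * comet_score
```

A hypothesis identical to the teacher's refinement incurs zero edit penalty, so the reward reduces to the weighted adequacy term; each token-level divergence lowers the reward proportionally.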
Submission history
From: Jude Dongyub Lee
[v1] Tue, 29 Jul 2025 20:35:35 UTC (7,798 KB)
[v2] Wed, 17 Dec 2025 21:28:41 UTC (7,817 KB)


