RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
by Dongyub Jude Lee and 2 other authors
Abstract: Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes the reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded according to how closely it matches the teacher's refinement. Guided by two complementary signals, (i) negative edit distance, which promotes lexical and structural fidelity, and (ii) COMET score, which ensures semantic adequacy, the actor progressively learns to emulate the teacher, mirroring the incremental, iterative way humans learn. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
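The abstract describes a reward combining negative edit distance to the teacher's refinement with a semantic-adequacy score. The sketch below illustrates one plausible reading of that combination; the weights `alpha` and `beta`, the token-level normalization, and the `comet_score` argument (a stand-in for a real COMET model's output) are illustrative assumptions, not the paper's exact formulation.

```python
def levenshtein(a: str, b: str) -> int:
    """Token-level edit distance between two strings, via dynamic programming."""
    s, t = a.split(), b.split()
    prev = list(range(len(t) + 1))
    for i, sa in enumerate(s, 1):
        curr = [i]
        for j, tb in enumerate(t, 1):
            cost = 0 if sa == tb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def rlfr_reward(hypothesis: str, refinement: str, comet_score: float,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """Hypothetical combined reward: length-normalized negative edit distance
    to the teacher's refinement, plus a semantic-adequacy score supplied by
    an external metric (COMET in the paper; passed in as a float here)."""
    max_len = max(len(hypothesis.split()), len(refinement.split()), 1)
    neg_edit = -levenshtein(hypothesis, refinement) / max_len
    return alpha * neg_edit + beta * comet_score
```

A hypothesis identical to the teacher's refinement incurs zero edit penalty, so the reward reduces to the weighted adequacy term; each token-level divergence lowers the reward proportionally.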
Submission history
From: Jude Dongyub Lee
[v1] Tue, 29 Jul 2025 20:35:35 UTC (7,798 KB)
[v2] Wed, 17 Dec 2025 21:28:41 UTC (7,817 KB)


