Score: 6.5 · Source: cs.LG updates on arXiv.org · Published: 2026-04-17
Scoring rationale: Theoretical analysis framing SFT as RL with a sparse implicit reward; it explains entropy collapse and gradient explosion, and is important for understanding training dynamics.
arXiv:2604.14258v1 Announce Type: cross Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion.
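The identity the abstract alludes to is easy to check numerically: the SFT cross-entropy gradient at one token position equals a policy-gradient estimator whose implicit reward is r(a) = 1[a = a*] / π_θ(a* | s), zero off the demonstrated path and inverse-probability weighted on it. Below is a minimal PyTorch sketch of that standard derivation, not code from the paper; the 5-token vocabulary and the names `a_star`, `g_pg` are illustrative assumptions.

```python
# Numerical check: SFT gradient == policy gradient with the sparse
# implicit reward r(a) = 1[a == a*] / pi(a*). Illustrative sketch only.
import torch

torch.manual_seed(0)
vocab = 5
logits = torch.randn(vocab, requires_grad=True)
a_star = 2  # the single demonstrated token (hypothetical example)

log_probs = torch.log_softmax(logits, dim=-1)

# --- SFT: gradient of the cross-entropy loss -log pi(a*) ---
sft_loss = -log_probs[a_star]
(g_sft,) = torch.autograd.grad(sft_loss, logits, retain_graph=True)

# --- Policy gradient with the sparse implicit reward ---
# r(a) is zero everywhere except the demonstrated path, where it is
# 1 / pi(a*): the inverse-probability weight that blows up when the
# model assigns low probability to the demonstration.
probs = log_probs.exp().detach()
g_pg = torch.zeros_like(logits)
for a in range(vocab):
    reward = 1.0 / probs[a_star] if a == a_star else 0.0
    (g_logp,) = torch.autograd.grad(log_probs[a], logits, retain_graph=True)
    # exact expectation E_{a ~ pi}[ r(a) * grad log pi(a) ],
    # negated to compare against the loss gradient
    g_pg += probs[a] * reward * (-g_logp)

print(torch.allclose(g_sft, g_pg, atol=1e-6))  # True: the gradients coincide
```

Under this reading, the abstract's three failure modes fall out directly: the indicator makes the reward nonzero on a single path (single-path dependency), pushing all mass onto that path (entropy collapse), and the 1/π(a*) weight diverges as π(a*) → 0 (gradient explosion).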