GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Academic Frontier, score 6.5: Theoretical analysis interpreting SFT as RL with a sparse implicit reward; explains entropy collapse and gradient explosion; important for understanding training.
Original source: cs.LG updates on arXiv.org

Score 6.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17

arXiv:2604.14258v1 (announce type: cross)

Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion.
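
The abstract's reading of SFT as policy gradient can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's own formulation: it assumes a softmax policy over four discrete outcomes and uses the standard identity that the gradient of log pi(y*) equals the policy gradient E_{y~pi}[r(y) * grad log pi(y)] with the implicit reward r(y) = 1[y = y*] / pi(y*). The reward is nonzero only on the single demonstrated outcome (sparse), and its magnitude is the inverse probability of that outcome, which grows without bound as pi(y*) approaches zero. All function names here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, y):
    """Gradient of log pi(y) w.r.t. the logits of a softmax policy: onehot(y) - pi."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def sft_gradient(logits, target):
    """Ascent direction of the SFT objective log pi(target)."""
    return grad_log_prob(softmax(logits), target)

def implicit_reward_policy_gradient(logits, target):
    """Exact E_{y~pi}[r(y) * grad log pi(y)] with r(y) = 1[y == target] / pi(target).

    The expectation is computed by summing over all outcomes, so no sampling
    noise is involved. The reward definition is an assumption of this sketch.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p_y in enumerate(probs):
        # Sparse reward: zero everywhere except the demonstrated outcome.
        reward = (1.0 / probs[target]) if y == target else 0.0
        g = grad_log_prob(probs, y)
        for k in range(len(grad)):
            grad[k] += p_y * reward * g[k]
    return grad

def inverse_probability_weight(logits, target):
    """The 1 / pi(target) factor that scales the implicit reward."""
    return 1.0 / softmax(logits)[target]

# Toy policy that currently assigns low probability to the demonstrated outcome.
logits = [2.0, 0.5, -1.0, -3.0]
target = 3

sft = sft_gradient(logits, target)
pg = implicit_reward_policy_gradient(logits, target)
max_diff = max(abs(a - b) for a, b in zip(sft, pg))

print("max |SFT grad - policy grad|:", max_diff)
print("inverse-probability weight:", inverse_probability_weight(logits, target))
```

The two gradients agree up to floating-point rounding, while the inverse-probability weight is large whenever the demonstrated outcome is unlikely under the current policy. That is one plausible way to read the abstract's link between the weighting and unstable gradients.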