Score: 6.5 · Source: cs.LG updates on arXiv.org · Published: 2026-04-17
Scoring rationale: Theoretical analysis framing SFT as RL with a sparse implicit reward; it explains entropy collapse and gradient explosion, and is important for understanding training dynamics.
arXiv:2604.14258v1 Announce Type: cross Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion.
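The identity the abstract alludes to is easy to check numerically: the SFT cross-entropy gradient at one token position equals a policy-gradient estimator whose implicit reward is r(a) = 1[a = a*] / π_θ(a* | s), zero off the demonstrated path and inverse-probability weighted on it. Below is a minimal PyTorch sketch of that standard derivation, not code from the paper; the 5-token vocabulary and the names `a_star`, `g_pg` are illustrative assumptions.

```python
# Numerical check: SFT gradient == policy gradient with the sparse
# implicit reward r(a) = 1[a == a*] / pi(a*). Illustrative sketch only.
import torch

torch.manual_seed(0)
vocab = 5
logits = torch.randn(vocab, requires_grad=True)
a_star = 2  # the single demonstrated token (hypothetical example)

log_probs = torch.log_softmax(logits, dim=-1)

# --- SFT: gradient of the cross-entropy loss -log pi(a*) ---
sft_loss = -log_probs[a_star]
(g_sft,) = torch.autograd.grad(sft_loss, logits, retain_graph=True)

# --- Policy gradient with the sparse implicit reward ---
# r(a) is zero everywhere except the demonstrated path, where it is
# 1 / pi(a*): the inverse-probability weight that blows up when the
# model assigns low probability to the demonstration.
probs = log_probs.exp().detach()
g_pg = torch.zeros_like(logits)
for a in range(vocab):
    reward = 1.0 / probs[a_star] if a == a_star else 0.0
    (g_logp,) = torch.autograd.grad(log_probs[a], logits, retain_graph=True)
    # exact expectation E_{a ~ pi}[ r(a) * grad log pi(a) ],
    # negated to compare against the loss gradient
    g_pg += probs[a] * reward * (-g_logp)

print(torch.allclose(g_sft, g_pg, atol=1e-6))  # True: the gradients coincide
```

Under this reading, the abstract's three failure modes fall out directly: the indicator makes the reward nonzero on a single path (single-path dependency), pushing all mass onto that path (entropy collapse), and the 1/π(a*) weight diverges as π(a*) → 0 (gradient explosion).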