Rating 7 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Identifies a fundamental limitation in DPO/SimPO (the reward-generation gap), important for alignment research
arXiv:2506.09457v3 Announce Type: replace-cross Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap": a discrepancy between their training objectives and the dynamics of autoregressive decoding.
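For concreteness, here is a minimal PyTorch sketch of the two sequence-level objectives the abstract names; it follows the published DPO and SimPO formulations, but the function and variable names are illustrative, not from the paper. Each `*_logps` tensor holds the summed log-probability of a complete response, and the comments mark where the sequence-level objective diverges from token-by-token decoding, i.e., the gap the abstract describes.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a (batch,) tensor of summed log-probabilities of a
    complete response under the policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # The implicit reward scores the *whole* sequence; the loss never
    # directly constrains the per-token conditionals that drive
    # autoregressive decoding -- the reward-generation gap above.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_len, rejected_len, beta=2.0, gamma=0.5):
    """SimPO: reference-free, length-normalized margin with target gamma."""
    margin = beta * (policy_chosen_logps / chosen_len
                     - policy_rejected_logps / rejected_len) - gamma
    return -F.logsigmoid(margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4  # batch of preference pairs (chosen vs. rejected responses)
    print(dpo_loss(torch.randn(b), torch.randn(b),
                   torch.randn(b), torch.randn(b)))
    print(simpo_loss(torch.randn(b), torch.randn(b),
                     torch.full((b,), 32.0), torch.full((b,), 40.0)))
```

Note how both losses compare complete responses at the sequence level, while generation commits to tokens left to right; a favorable sequence-level reward margin at training time therefore does not by itself guarantee that greedy or sampled decoding will produce the preferred sequence.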