Rating 7 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Identifies a fundamental limitation in DPO/SimPO (the reward-generation gap), important for alignment research
arXiv:2506.09457v3 Announce Type: replace-cross Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap": a discrepancy between their training objectives and the dynamics of autoregressive decoding.
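For concreteness, here is a minimal PyTorch sketch of the two sequence-level objectives the abstract names; it follows the published DPO and SimPO formulations, but the function and variable names are illustrative, not from the paper. Each `*_logps` tensor holds the summed log-probability of a complete response, and the comments mark where the sequence-level objective diverges from token-by-token decoding, i.e., the gap the abstract describes.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a (batch,) tensor of summed log-probabilities of a
    complete response under the policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # The implicit reward scores the *whole* sequence; the loss never
    # directly constrains the per-token conditionals that drive
    # autoregressive decoding -- the reward-generation gap above.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_len, rejected_len, beta=2.0, gamma=0.5):
    """SimPO: reference-free, length-normalized margin with target gamma."""
    margin = beta * (policy_chosen_logps / chosen_len
                     - policy_rejected_logps / rejected_len) - gamma
    return -F.logsigmoid(margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4  # batch of preference pairs (chosen vs. rejected responses)
    print(dpo_loss(torch.randn(b), torch.randn(b),
                   torch.randn(b), torch.randn(b)))
    print(simpo_loss(torch.randn(b), torch.randn(b),
                     torch.full((b,), 32.0), torch.full((b,), 40.0)))
```

Note how both losses compare complete responses at the sequence level, while generation commits to tokens left to right; a favorable sequence-level reward margin at training time therefore does not by itself guarantee that greedy or sampled decoding will produce the preferred sequence.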