Score 3.2 · Source: cs.CL updates on arXiv.org · Published 2026-04-15
Scoring rationale: moderate AI relevance +novelty(1) +practical(1)
arXiv:2604.13016v1 Announce Type: cross
Abstract: On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores,…
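For context, below is a minimal sketch of one common OPD formulation: the student generates its own rollouts, and is then trained to match the teacher's token-level distribution on those rollouts via a reverse-KL objective. This is an illustrative assumption based on standard on-policy distillation setups, not the paper's exact method; the abstract excerpt does not specify the objective, and all names here (`opd_step`, `ToyLM`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, max_new_tokens=32):
    """One OPD step: sample a rollout from the *student*, then distill
    the teacher's next-token distribution on that rollout.
    Assumes both models map token ids [B, T] -> logits [B, T, V]."""
    student.eval()
    with torch.no_grad():
        # 1) On-policy rollout: the student generates the continuation.
        seq = prompt_ids
        for _ in range(max_new_tokens):
            logits = student(seq)[:, -1, :]               # next-token logits
            next_tok = torch.multinomial(F.softmax(logits, -1), 1)
            seq = torch.cat([seq, next_tok], dim=1)
        # 2) Teacher scores the student's own tokens (no teacher gradients).
        gen_slice = slice(prompt_ids.size(1) - 1, -1)     # positions predicting generated tokens
        teacher_logits = teacher(seq)[:, gen_slice, :]

    # 3) Re-run the student with gradients on its own rollout.
    student.train()
    student_logits = student(seq)[:, gen_slice, :]

    # 4) Reverse KL on the rollout: KL(student || teacher), summed over vocab,
    #    averaged over batch and generated positions.
    s_logp = F.log_softmax(student_logits, -1)
    t_logp = F.log_softmax(teacher_logits, -1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return seq, loss

if __name__ == "__main__":
    import torch.nn as nn

    class ToyLM(nn.Module):
        # Tiny stand-in LM (embedding -> linear head) so the sketch runs end to end.
        def __init__(self, vocab=100, dim=32):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.head = nn.Linear(dim, vocab)
        def forward(self, ids):
            return self.head(self.emb(ids))

    student, teacher = ToyLM(), ToyLM()
    prompt = torch.randint(0, 100, (2, 5))
    seq, loss = opd_step(student, teacher, prompt)
    loss.backward()  # gradients flow only into the student
    print(seq.shape, loss.item())
```

Because the samples come from the student's own distribution, the teacher's feedback lands exactly where the student actually goes at inference time; the paper's point about compatible "thinking patterns" speaks to when that teacher feedback is actually usable by the student.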