How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Published

2026-04-29

Collected 2026-04-29 06:31

Academic Frontier, score 6.5: the Tsallis q-logarithm interpolates between RLVR and density estimation; the theoretical framework is elegant

Score 6.5 · Source: arXiv cs.LG · Published 2026-04-29

Scoring rationale: the Tsallis q-logarithm interpolates between RLVR and density estimation; the theoretical framework is elegant

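When a reasoning model is post-trained with only output-level supervision, RLVR stalls if the initial success probability p0 is small. This paper uses the Tsallis q-logarithm to define a loss family J_Q that recovers RLVR at q=0 (the exploitation pole) and the marginal likelihood at q=1 (the density-estimation pole), interpolating between the two for intermediate q. The result is a theoretical framework for how fast a reasoning model should commit to supervision.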
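
To make the two poles concrete, here is a minimal Python sketch of the standard Tsallis q-logarithm, log_q(x) = (x^(1-q) - 1) / (1 - q), with ln(x) as the q -> 1 limit. The summary does not give the exact form of J_Q, so the loss below is an assumption: it simply takes the negative q-logarithm of a scalar success probability p. The function names `tsallis_log`, `tsallis_loss`, and `grad_scale` are hypothetical and not taken from the paper.

```python
import math


def tsallis_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x**(1-q) - 1) / (1 - q); natural log in the q -> 1 limit."""
    if x <= 0:
        raise ValueError("x must be positive")
    if math.isclose(q, 1.0):
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_loss(p: float, q: float) -> float:
    """Assumed loss form (not confirmed by the summary): negative q-log of success probability p."""
    return -tsallis_log(p, q)


def grad_scale(p: float, q: float) -> float:
    """d/dp log_q(p) = p**(-q): the factor multiplying the gradient of p at a given q."""
    return p ** (-q)
```

Under this assumed form, q=0 gives a loss of 1 - p, so minimizing it maximizes the success probability directly, with a gradient factor of 1 regardless of p. At q=1 the loss is -ln(p) and the factor is 1/p, which grows as p shrinks. Intermediate q scales the gradient by p^(-q), which is one way to read the "how fast to commit" question when p0 is small.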

Tags:

reasoning-models
rlvr
loss-functions
post-training

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective