V-GRPO: Online RL for Denoising Generative Models Is Easier than You Think

Published: April 28, 2026

Collected: April 28, 2026, 10:31

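Industry News, score 6.5: Provides a direct policy gradient online RL method for denoising generative models, bypassing the intractable likelihood problem and simplifying the alignment pipeline.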

Source: arxiv.org


