Skip to content
星际流动

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

发布
采集
学术前沿 5.0 分 — 中等质量:常规学术论文,有适度参考价值
原文: cs.AI updates on arXiv.org

评分 5.0 · 来源:cs.AI updates on arXiv.org · 发布于 2026-04-14

评分依据:中等质量:常规学术论文,有适度参考价值

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

arXiv:2504.13818v4 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples…