Scoring rationale: Creative use of pre-training text as the signal source for self-play. Relevant to reducing RL data requirements.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Industry news, score 6.0
Source: arXiv