
Influencing Humans to Conform to Preference Models for RLHF

Academic frontier · Rating 5.0 · Source: cs.AI updates on arXiv.org · Published 2026-04-14

Rating rationale: Medium quality; a standard academic paper of moderate reference value.


arXiv:2501.06416v3 Announce Type: replace-cross Abstract: Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human’s unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human’s reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of…
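The abstract hinges on which preference model the RLHF algorithm assumes. As a point of reference only (the excerpt does not say which model the paper studies), a minimal sketch of the Bradley–Terry partial-return model commonly assumed in RLHF work is shown below; the function name and example returns are illustrative.

```python
import math

def preference_probability(return_a: float, return_b: float) -> float:
    """Bradley-Terry style preference model often assumed in RLHF:
    the probability a human prefers segment A over segment B is a
    logistic function of the difference in the segments' returns."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))

# Example: a segment with return 2.0 is preferred over one with return 1.0
# with probability ~0.73 under this assumed model.
print(preference_probability(2.0, 1.0))
```

If humans do not actually generate preferences this way, the reward function inferred from their feedback can be a poor approximation, which is the risk the paper's human studies probe.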