星际流动

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Training in LLMs

Academic Frontier · Score 7.0: An important finding on the tension between ethical reasoning and safety training, with implications for alignment trade-offs.
Original source: cs.AI updates on arXiv.org

Score: 7 · Source: cs.AI updates on arXiv.org · Published: 2026-04-17


arXiv:2509.05367v4 · Announce Type: replace-cross

Abstract: Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacity to reason through moral trade-offs creates a distinct attack surface. We formalize this vulnerability through TRIAL, a multi-turn red-teaming methodology that embeds harmful requests within ethical framings.