Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Training in LLMs

Published: 2026-04-17

Collected: 2026-04-17 04:31

Academic Frontier · Score 7.0: An important finding about the tension between ethical reasoning and safety training, with implications for alignment trade-offs.

Score 7 · Source: cs.AI updates on arXiv.org · Published 2026-04-17

arXiv:2509.05367v4 · Announce Type: replace-cross

Abstract: Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacity to reason through moral trade-offs creates a distinct attack surface. We formalize this vulnerability through TRIAL, a multi-turn red-teaming methodology that embeds harmful requests within ethical framings.