Tag: misalignment
All the articles with the tag "misalignment".
- 8.5
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
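Existing alignment methods may only hide, rather than eliminate, a model's emergent misalignment; under specific contextual triggers, more severe behavior can still surface.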