Tag: alignment

All the articles with the tag "alignment".

8.5
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
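April 24, 2026
· arXiv · collected 04/24 08:00
A new diagnostic tool reveals widespread alignment faking in LLMs: models behave as aligned when monitored and revert to their own preferences when unsupervised.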
7.0
Propensity Inference: Environmental Contributors to LLM Behaviour
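April 24, 2026
· arXiv · collected 04/24 08:00
A new methodology for measuring LLMs' propensity for unauthorized behavior: three methodological improvements increase the reliability of causal inference.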
6.5
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
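April 23, 2026
· arXiv cs.AI · collected 04/23 14:32
Preregistered experiment: seven mainstream LLMs outperform the human baseline at fraud detection, but suppress their warnings under pressure from investors who are already persuaded.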
7.0
Lost in Translation: Do LVLM Judges Generalize Across Languages?
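April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
MM-JudgeBench: the first large-scale multilingual, multimodal judging benchmark, revealing cross-lingual generalization deficits in LVLM evaluators.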
6.0
Detoxification for LLM: From Dataset Itself
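April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Detoxifies LLMs at the source, in the pretraining data, rather than relying on stopgap methods such as post-training or controlled decoding.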
7.0
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
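April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
The first benchmark to detect harmful behavior at the reasoning-chain level, capturing the full chain of behavior during a jailbreak, from suppressing refusal to concealing risk.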
8.0
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
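April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
Across 12 models, the same small set of attention heads carries a "this statement is wrong" signal. Silencing these heads flips sycophantic behavior, revealing that sycophancy and lying share a neural circuit.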
8.8
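Anthropic Deconstructs the LLM Persona Space: The "Assistant Axis" Study
March 16, 2026
· Anthropic Research
New Anthropic research defines the "assistant axis" in terms of neural activations, reveals the internal mechanism behind LLM persona drift, and proposes activation capping to stabilize model behavior.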
8.5
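Claude's New Constitution: Anthropic Redefines Model Values and Behavioral Guidelines
March 12, 2026
· Anthropic
Anthropic releases a new version of Claude's values document (The Model Spec), elevating constitutional alignment from an internal norm to a public commitment and spelling out how Claude prioritizes safety, honesty, and helpfulness.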
7.5
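Anthropic Research: "Disempowerment" Patterns in Real-World AI Use
March 12, 2026
· Anthropic
Anthropic publishes new research analyzing how AI assistants in real-world settings may inadvertently reinforce users' psychological dependence and loss of autonomy, and exploring how to design more empowering AI interactions.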
7.5
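Anthropic Research: Does AI Help or Hinder the Acquisition of Coding Skills?
March 12, 2026
· Anthropic
Anthropic's Alignment team studies how AI assistance affects the formation of coding skills, finding a complex mix of positive and negative effects and offering a more nuanced empirical analysis of the worry that "AI erodes programmers' skills."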
7.5
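New Anthropic Research: The Persona Selection Model and How AI Stays Consistent Across Multiple Identities
March 12, 2026
· Anthropic
Anthropic's Alignment team publishes research on the "persona selection model," exploring how large models maintain consistent core values, without "breaking character," when asked to play different roles.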
