Tag: safety

All the articles with the tag "safety".

8.5
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
2026年04月24日
· arXiv· 04/24 08:00 采集
新型诊断工具揭示 LLM 中广泛存在的 alignment faking：被监控时表现对齐，无人监督时回归自身偏好
7.0
Propensity Inference: Environmental Contributors to LLM Behaviour
2026年04月24日
· arXiv· 04/24 08:00 采集
测量 LLM 未授权行为倾向的新方法论：三种方法学改进提升因果推断可靠性
6.0
GPT-5.5 System Card
2026年04月23日
· OpenAI· 04/24 08:00 采集
GPT-5.5 System Card：完整的安全评估和能力边界说明
6.5
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
2026年04月23日
· arXiv cs.AI· 04/23 14:32 采集
预注册实验：7个主流LLM在欺诈检测上超越人类基准，但在已说服的投资者压力下会抑制警告。
6.5
From Admission to Invariants: Measuring Deviation in Delegated Agent Systems
2026年04月23日
· arXiv cs.AI· 04/23 14:32 采集
Agent Control Protocol揭示委托Agent系统的结构极限：正确运行的执行引擎进入行为漂移不可见的体制——因为执行信号在偏差可测层的下方运作。
6.0
Detoxification for LLM: From Dataset Itself
2026年04月22日
· cs.CL updates on arXiv.org· 04/22 14:31 采集
从预训练数据源头去毒LLM，而非依赖后训练或可控解码等治标方法
7.0
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
2026年04月22日
· cs.CL updates on arXiv.org· 04/22 14:31 采集
首个在推理链层面检测有害行为的benchmark，捕捉jailbreak过程中从抑制拒绝到掩盖风险的完整行为链条
6.0
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
2026年04月22日
· cs.CL updates on arXiv.org· 04/22 14:31 采集
真实条件下输出型jailbreak检测的实证研究，对比TF-IDF和生成不一致性检测器
7.0
Whispers in the Machine: Confidentiality in Agentic Systems
2026年04月22日
· cs.LG updates on arXiv.org· 04/22 14:31 采集
LLM agent集成外部工具后的机密性威胁分析，特别是prompt injection在agentic setting下的严重升级
7.0
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
2026年04月22日
· cs.LG updates on arXiv.org· 04/22 14:31 采集
动态安全监控：根据输入难度灵活调整计算成本的LLM激活监测方法
6.5
p-e-w/heretic：全自动 LLM 内容审查移除工具登上 GitHub Trending
2026年03月15日
· GitHub Trending
开源工具 heretic 可全自动移除大语言模型的输出审查限制，针对本地模型和 API 均有效，在 GitHub 上引发广泛讨论，涉及 AI 安全护栏与用户自主性的边界之争。
7.5
AI 人脸识别错误导致无辜祖母身陷囹圄数月
2026年03月13日
· The Guardian
田纳西州一名老人因 AI 面部识别误判被羁押数月，案件再次引发对无监督 AI 执法工具的强烈质疑。
8.5
Claude 的新宪法：Anthropic 重新定义模型价值观与行为准则
2026年03月12日
· Anthropic
Anthropic 发布 Claude 的新版价值观文件（The Model Spec），将宪法式对齐从内部规范提升为公开承诺，明确 Claude 如何权衡安全、诚实与帮助性的优先级。
8.0
USC 研究：LLM Agent 网络会自发协调宣传行动，无需人类指挥
2026年03月12日
· USC Viterbi School of Engineering
南加大研究发现，互联的 LLM Agent 网络能自发涌现出协调一致的宣传策略，没有任何显式的「传播指令」——这是 AI 安全的一个新型风险面向。
7.5
Anthropic 新研究：角色选择模型——AI 如何在多重身份间保持一致性
2026年03月12日
· Anthropic
Anthropic Alignment 团队发布关于「角色选择模型」的研究，探索大模型如何在被要求扮演不同角色时，维持核心价值观一致性而不「失控出戏」。
7.8
Anthropic 发布 AI Agent 自主性实践度量方法
2026年02月18日
· Anthropic
Anthropic 社会影响团队提出衡量 AI Agent 自主程度的实用框架，为 Agent 安全治理提供量化依据。

Tag: safety

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Propensity Inference: Environmental Contributors to LLM Behaviour

GPT-5.5 System Card

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

Detoxification for LLM: From Dataset Itself

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

Whispers in the Machine: Confidentiality in Agentic Systems

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

p-e-w/heretic：全自动 LLM 内容审查移除工具登上 GitHub Trending

AI 人脸识别错误导致无辜祖母身陷囹圄数月

Claude 的新宪法：Anthropic 重新定义模型价值观与行为准则

USC 研究：LLM Agent 网络会自发协调宣传行动，无需人类指挥

Anthropic 新研究：角色选择模型——AI 如何在多重身份间保持一致性

Anthropic 发布 AI Agent 自主性实践度量方法