Tag: mechanistic-interpretability

All the articles with the tag "mechanistic-interpretability".

6.0
Cell-Based Representation of Relational Binding in Language Models
April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Finds that LLMs encode discourse-level relational binding in a low-dimensional linear subspace called the Cell-Based Representation.
8.0
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
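April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
Across 12 models, the same small set of attention heads carries a "this statement is wrong" signal. Silencing these heads flips sycophantic behavior, revealing that sycophancy and lying share a neural circuit.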
