Tag: mechanistic-interpretability
All the articles with the tag "mechanistic-interpretability".
- 6.0
Cell-Based Representation of Relational Binding in Language Models
Finds that LLMs encode discourse-level relational binding through a low-dimensional linear subspace called the Cell-Based Representation.
- 8.0
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
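Across 12 models, the same small group of attention heads carries a "this statement is wrong" signal; silencing these heads flips sycophantic behavior, revealing that sycophancy and lying share a neural circuit.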