Rating 6.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Striking result: a 786K-parameter adapter (0.02% of the base model) corrects political suppression in Qwen3 across 31 facts; important for alignment/fairness.
arXiv:2604.14174v1 Announce Type: cross Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B.
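The abstract describes a small adapter applied after the transformer, trained on frozen hidden states so that only the adapter's parameters move. A minimal NumPy sketch of that architecture, with toy dimensions and a hypothetical bottleneck-MLP design (the paper's actual adapter architecture, sizes, and initialization are not given here and are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_adapter, vocab = 64, 8, 100  # toy sizes, not the paper's

# Frozen base-model pieces: a final hidden state and the unembedding matrix.
W_unembed = rng.normal(size=(d_model, vocab)) * 0.02
h_frozen = rng.normal(size=(d_model,))  # hidden state from the frozen model

# Hypothetical post-transformer adapter: a bottleneck MLP with a residual
# connection, applied to the frozen hidden state before unembedding.
W_down = rng.normal(size=(d_model, d_adapter)) * 0.02
W_up = np.zeros((d_adapter, d_model))  # zero-init: adapter starts as a no-op

def adapter(h):
    # residual + ReLU bottleneck; only W_down/W_up would be trained
    return h + np.maximum(h @ W_down, 0.0) @ W_up

logits_base = h_frozen @ W_unembed
logits_adapted = adapter(h_frozen) @ W_unembed

# With the zero-initialized up-projection, base and adapted logits match,
# so training only ever moves log-probabilities away from the frozen model.
assert np.allclose(logits_base, logits_adapted)

n_params = W_down.size + W_up.size  # 2 * d_model * d_adapter
print(n_params)  # → 1024 for these toy sizes
```

With a realistic hidden size the same two-matrix layout reaches parameter counts on the order reported in the abstract; the zero-initialized up-projection is one common trick for making such adapters safe to bolt onto a frozen model, though whether the paper uses it is an assumption here.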