Rating 6.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Striking result: a 786K-parameter adapter (0.02% of the base model) corrects political suppression in Qwen3 across 31 facts; important for alignment/fairness.
arXiv:2604.14174v1 Announce Type: cross Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B.
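The abstract describes a small adapter applied after the transformer, trained on frozen hidden states so that only the adapter's parameters move. A minimal NumPy sketch of that architecture, with toy dimensions and a hypothetical bottleneck-MLP design (the paper's actual adapter architecture, sizes, and initialization are not given here and are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_adapter, vocab = 64, 8, 100  # toy sizes, not the paper's

# Frozen base-model pieces: a final hidden state and the unembedding matrix.
W_unembed = rng.normal(size=(d_model, vocab)) * 0.02
h_frozen = rng.normal(size=(d_model,))  # hidden state from the frozen model

# Hypothetical post-transformer adapter: a bottleneck MLP with a residual
# connection, applied to the frozen hidden state before unembedding.
W_down = rng.normal(size=(d_model, d_adapter)) * 0.02
W_up = np.zeros((d_adapter, d_model))  # zero-init: adapter starts as a no-op

def adapter(h):
    # residual + ReLU bottleneck; only W_down/W_up would be trained
    return h + np.maximum(h @ W_down, 0.0) @ W_up

logits_base = h_frozen @ W_unembed
logits_adapted = adapter(h_frozen) @ W_unembed

# With the zero-initialized up-projection, base and adapted logits match,
# so training only ever moves log-probabilities away from the frozen model.
assert np.allclose(logits_base, logits_adapted)

n_params = W_down.size + W_up.size  # 2 * d_model * d_adapter
print(n_params)  # → 1024 for these toy sizes
```

With a realistic hidden size the same two-matrix layout reaches parameter counts on the order reported in the abstract; the zero-initialized up-projection is one common trick for making such adapters safe to bolt onto a frozen model, though whether the paper uses it is an assumption here.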