Score 6 · Source: cs.LG updates on arXiv.org · Published 2026-04-20
Scoring basis: Pruning Unsafe Tickets: removes unsafe subnetworks by pruning rather than relying on alignment alone
Key points
arXiv:2604.15780v1 Announce Type: new Abstract: Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility.…
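The abstract is cut off before it describes how parameters are scored, so the following is only a minimal illustrative sketch of the general idea (drop the parameters most tied to unsafe behavior and least tied to utility), not the paper's method. The function name `prune_unsafe`, the per-weight `unsafe_scores` and `utility_scores`, and the "unsafe minus utility" ranking rule are all assumptions made for this example.

```python
def prune_unsafe(weights, unsafe_scores, utility_scores, fraction):
    """Zero out the `fraction` of weights whose (hypothetical) unsafe score
    most exceeds their (hypothetical) utility score.

    All inputs are flat lists of equal length. The scoring rule is an
    assumption for illustration; the source abstract does not specify one.
    """
    n = len(weights)
    if not (n == len(unsafe_scores) == len(utility_scores)):
        raise ValueError("weights and score lists must have equal length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")

    # Number of weights to remove.
    k = int(n * fraction)

    # Rank indices so that weights that look unsafe but matter little
    # for utility come first.
    ranked = sorted(
        range(n),
        key=lambda i: unsafe_scores[i] - utility_scores[i],
        reverse=True,
    )
    drop = set(ranked[:k])

    # Return a pruned copy; the input list is left unchanged.
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Toy example with made-up numbers.
weights = [0.5, -1.2, 0.8, 0.3]
unsafe_scores = [0.9, 0.1, 0.7, 0.2]
utility_scores = [0.1, 0.8, 0.2, 0.3]
pruned = prune_unsafe(weights, unsafe_scores, utility_scores, fraction=0.5)
print(pruned)
```

In a real model the scores would have to come from some attribution signal measured on unsafe and benign data respectively; which signal the authors use is not stated in the excerpt above.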
🤖 AI commentary
This paper provides important information for the AI field and merits the attention of industry practitioners.