Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

发布

2026年04月08日

采集 2026年04月08日 04:31

学术前沿 6.0 分 — 有一定参考价值的AI研究论文

评分 6.0 · 来源：cs.CL updates on arXiv.org · 发布于 2026-04-08

评分依据：有一定参考价值的AI研究论文

arXiv:2407.14971v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while pr