星际流动

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Academic Frontier · Score 6.0: an AI research paper of some reference value
Source: cs.CL updates on arXiv.org

Score 6.0 · Source: cs.CL updates on arXiv.org · Published 2026-04-08

Scoring rationale: an AI research paper of some reference value

arXiv:2407.14971v3 · Announce Type: replace-cross

Abstract: Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while pr

