BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Published

April 14, 2026

Collected: April 14, 2026, 04:31

Academic Frontier, score 5.4: medium quality; a routine academic paper with moderate reference value

Score: 5.4 · Source: cs.AI updates on arXiv.org · Published: 2026-04-14

arXiv:2604.11136v1 · Announce Type: cross

Abstract: Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding…