Score: 5.3 · Source: cs.AI updates on arXiv.org · Published: 2026-04-14
Scoring rationale: Medium quality: a conventional academic paper of moderate reference value
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
arXiv:2412.17574v3 Announce Type: replace-cross Abstract: Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a significant challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark…