Score 3.2 · Source: cs.LG updates on arXiv.org · Published 2026-04-15
Scoring basis: Moderate AI relevance +novelty(1) +practical(2)
arXiv:2604.12335v1 Announce Type: cross Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically…