All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Published

2026-04-15

Collected 2026-04-15 04:35

Academic Frontier · Score 3.2: Moderate AI relevance +novelty(1) +practical(2)

Score 3.2 · Source: cs.LG updates on arXiv.org · Published 2026-04-15

Scoring rationale: Moderate AI relevance +novelty(1) +practical(2)

arXiv:2604.12335v1 Announce Type: cross Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically…