HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Published 2026-04-17

Collected 2026-04-17 04:31

Academic Frontier · Score 6.0: a practical architecture for streaming video understanding with VLMs; the training-free approach is valuable

Score 6 · Source: cs.CL updates on arXiv.org · Published 2026-04-17

arXiv:2601.14724v3 Announce Type: replace-cross Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams.