HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Academic Frontier · 6.0 points: A practical architecture for streaming video understanding with VLMs; the training-free approach is valuable.
Original: cs.CL updates on arXiv.org

Score: 6 · Source: cs.CL updates on arXiv.org · Published: 2026-04-17

arXiv:2601.14724v3 Announce Type: replace-cross Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams.