Score: 6 · Source: cs.LG updates on arXiv.org · Published: 2026-04-20
Scoring rationale: Ragged Paged Attention: the first high-performance, flexible LLM inference kernel for TPUs
Highlights
arXiv:2604.15464v1 Announce Type: cross Abstract: Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google’s Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures—particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-per…
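The core idea named in the title, attention over a paged KV cache with ragged (variable-length) request batches, can be sketched generically. This is an illustration of the general paged-attention concept under assumed data layouts, not the paper's actual RPA kernel or TPU implementation; all names, shapes, and sizes here are hypothetical:

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV-cache page (illustrative)
HEAD_DIM = 8    # attention head dimension (illustrative)

def paged_attention(q, kv_pool, page_table, seq_len):
    """Single-head attention for one request whose K/V cache lives in
    non-contiguous fixed-size pages drawn from a shared pool.

    q          : (HEAD_DIM,) query vector
    kv_pool    : (num_pages, PAGE_SIZE, 2, HEAD_DIM) global page pool
    page_table : ordered list of physical page ids for this request
    seq_len    : number of valid tokens (the "ragged" length)
    """
    # Gather this request's pages, flatten, then trim padding tokens.
    kv = kv_pool[page_table].reshape(-1, 2, HEAD_DIM)[:seq_len]
    k, v = kv[:, 0], kv[:, 1]
    scores = k @ q / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(0)
pool = rng.standard_normal((16, PAGE_SIZE, 2, HEAD_DIM))
# Two requests with different (ragged) lengths sharing one page pool.
out_a = paged_attention(rng.standard_normal(HEAD_DIM), pool, [3, 7], 6)
out_b = paged_attention(rng.standard_normal(HEAD_DIM), pool, [1], 2)
print(out_a.shape, out_b.shape)
```

Pages let the serving system allocate KV memory in fixed-size blocks instead of one contiguous buffer per request, which is what makes dynamic batches with uneven sequence lengths cheap to manage; a production kernel would fuse the gather and attention math rather than materializing `kv` as above.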
🤖 AI Commentary
This paper provides important information for the AI field and is worth the attention of industry practitioners.