Score: 6 · Source: cs.LG updates on arXiv.org · Published: 2026-04-20
Scoring rationale: Ragged Paged Attention: the first high-performance, flexible LLM inference kernel for TPUs
Highlights
arXiv:2604.15464v1 Announce Type: cross Abstract: Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google’s Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures—particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-per…
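The core idea named in the title, attention over a paged KV cache with ragged (variable-length) request batches, can be sketched generically. This is an illustration of the general paged-attention concept under assumed data layouts, not the paper's actual RPA kernel or TPU implementation; all names, shapes, and sizes here are hypothetical:

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV-cache page (illustrative)
HEAD_DIM = 8    # attention head dimension (illustrative)

def paged_attention(q, kv_pool, page_table, seq_len):
    """Single-head attention for one request whose K/V cache lives in
    non-contiguous fixed-size pages drawn from a shared pool.

    q          : (HEAD_DIM,) query vector
    kv_pool    : (num_pages, PAGE_SIZE, 2, HEAD_DIM) global page pool
    page_table : ordered list of physical page ids for this request
    seq_len    : number of valid tokens (the "ragged" length)
    """
    # Gather this request's pages, flatten, then trim padding tokens.
    kv = kv_pool[page_table].reshape(-1, 2, HEAD_DIM)[:seq_len]
    k, v = kv[:, 0], kv[:, 1]
    scores = k @ q / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(0)
pool = rng.standard_normal((16, PAGE_SIZE, 2, HEAD_DIM))
# Two requests with different (ragged) lengths sharing one page pool.
out_a = paged_attention(rng.standard_normal(HEAD_DIM), pool, [3, 7], 6)
out_b = paged_attention(rng.standard_normal(HEAD_DIM), pool, [1], 2)
print(out_a.shape, out_b.shape)
```

Pages let the serving system allocate KV memory in fixed-size blocks instead of one contiguous buffer per request, which is what makes dynamic batches with uneven sequence lengths cheap to manage; a production kernel would fuse the gather and attention math rather than materializing `kv` as above.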
🤖 AI Commentary
This paper provides important information for the AI field and is worth the attention of industry practitioners.