Score 3.2 · Source: cs.LG updates on arXiv.org · Published 2026-04-15
Scoring rationale: moderate AI relevance + practical (3)
arXiv:2604.12171v1 Announce Type: cross Abstract: Pipeline parallelism (PP) is widely used to partition the layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying the service incurs prohibitive downtime, so reconfiguration must instead proceed live and in…
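To make the static-partitioning setup the abstract criticizes concrete, here is a minimal sketch of how a pipeline-parallel system might assign contiguous layer ranges to GPU stages. This is an illustrative assumption, not the paper's method: real systems balance stages by per-layer compute and memory cost, not by layer count, and the function name `partition_layers` is hypothetical.

```python
# Hypothetical sketch: split an LLM's layers into contiguous pipeline stages,
# one stage per GPU, as evenly as possible by layer count.
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Return a list of layer ranges, one per pipeline stage."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages absorb one leftover layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

# e.g. 32 transformer layers over 4 GPUs -> 8 layers per stage
print(partition_layers(32, 4))
```

Because the assignment is fixed at deploy time, changing the GPU count (e.g. on a serverless platform) forces a full repartition, which is exactly the reconfiguration problem the paper targets.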