
Attention to Mamba: A Recipe for Cross-Architecture Distillation

Category: Academic Frontier · Score: 5.5 · Source: cs.LG updates on arXiv.org · Published: 2026-04-17

Rating rationale: A practical distillation recipe from Transformer to SSM/Mamba that leverages existing Transformer knowledge.

arXiv:2604.14191v1 Announce Type: cross

Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available.
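Cross-architecture distillation of this kind generally trains the SSM student to reproduce the output distribution of a frozen Transformer teacher on the same token stream. The sketch below shows a generic token-level logit-distillation step in PyTorch; the teacher/student interfaces, the temperature, and the loss weighting are illustrative assumptions for a minimal example, not the specific recipe proposed in the paper.

```python
# Minimal sketch: logit distillation from a frozen Transformer teacher into a
# Mamba-style student. Model classes, temperature, and alpha are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with hard-label cross-entropy."""
    # Soft targets: match the student's distribution to the teacher's at temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard next-token cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

def distill_step(teacher, student, optimizer, input_ids, labels):
    # The frozen Transformer teacher only provides target logits; gradients
    # flow through the SSM/Mamba student alone.
    with torch.no_grad():
        teacher_logits = teacher(input_ids)   # assumed to return (B, T, vocab) logits
    student_logits = student(input_ids)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this kind of setup the student is typically optimized with a standard AdamW loop over the teacher's pretraining or fine-tuning data; how layers, hidden states, or initialization are transferred across architectures is exactly what the paper's recipe addresses and is not reproduced here.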