Tag: quantization

All the articles with the tag "quantization".

7.0
Highly Efficient and Effective LLMs with Multi-Boolean Architectures
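April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
A new binarization framework that represents LLMs with multi-kernel Boolean parameters, enabling efficient inference without full-precision latent weights.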
7.0
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
SAW-INT4: a 4-bit KV-cache quantization scheme that accounts for real-world LLM serving constraints (paged memory, regular memory access, fused attention).
7.7
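DyMoE: A Dynamic Mixed-Precision Framework for Edge Inference of MoE Models, with up to 22.7x TTFT Speedup
March 21, 2026
· arXiv · collected 03/21 14:45
Achieves real-time inference of MoE models on commercial edge hardware through importance-aware dynamic quantization and depth-adaptive scheduling.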
8.8
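Microsoft BitNet: A 100-Billion-Parameter 1-Bit Model That Can Run on a Local CPU
March 11, 2026
· Hacker News
Microsoft has open-sourced BitNet, a 100-billion-parameter 1-bit quantized model that runs efficiently on ordinary CPUs without a GPU.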
