Posts
All the articles I've posted.
- 7.0
Numerical Instability and Chaos: Quantifying the Unpredictability of LLMs
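A rigorous analysis of the numerical roots of LLM unpredictability: the chaotic behavior of finite-precision arithmetic, with fundamental implications for deployment reliability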
- 6.0
Quantifying and Understanding Uncertainty in Large Reasoning Models
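Applies conformal prediction to reasoning and answer generation in LRMs, providing uncertainty quantification with finite-sample statistical guarantees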
- 7.0
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
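Builds a rubric-augmented reward model from binary preferences alone, removing the rubric annotation cost bottleneck, and finds that low-quality rubrics can backfire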
- 7.0
Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
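CSD recovers valid tokens that standard verification discards; frequency-guided candidate selection substantially lowers the rejection rate, giving a training-free inference acceleration scheme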
- 7.5
The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
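Zero-overhead probes monitor LLM agents for reasoning degradation, detecting precursors of up to 30% of task failures; the lightweight parallel monitoring architecture has strong engineering value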
- 7.5
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study
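A systematic study of synthetic pretraining data spanning more than 1 trillion tokens: controlled experiments over rephrasing strategy × generator model × source data, finding that structured formats significantly improve results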
- 6.5
Adaptive Conformal Prediction for Improving Factuality of Generations by LLMs
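Adaptive conformal prediction improves the factuality of LLM generations; adapting to the prompt avoids both over-filtering and under-filtering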
- 7.0
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
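The first systematic study and formalization of users' "vibe-testing" behavior, bridging the gap between informal experience and reproducible evaluation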
- 6.0
RL-PLUS: Countering Capability Boundary Collapse of LLMs in RL with Hybrid-policy Optimization
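Addresses the capability boundary collapse of LLMs in RLVR; hybrid-policy optimization overcomes the limits imposed by on-policy training combined with a large action space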
- 6.5
Memp: Exploring Agent Procedural Memory
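Distills agent trajectories into procedural memory (fine-grained instructions plus high-level script abstractions), giving agents a learnable, lifelong procedural memory, with far-reaching implications for agent architecture design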
- 6.5
TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
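A routing method for multi-step reasoning that sends only the critical reasoning steps to the strong model, allocating the compute budget precisely; high practical value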
- 6.0
Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity
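Bi-predictability monitors LLM interaction integrity in real time, with no post-hoc judge, repeated sampling, or heavy computation: a monitoring signal with zero extra overhead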
- 7.0
Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic
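The Novel Operator Test neatly separates reasoning logic from the output answer, showing that an LLM can get every step right yet still give the wrong final answer, which reveals a reasoning-output dissociation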
- 6.0
EVE: A Domain-Specific LLM Framework for Earth Intelligence
EVE-Instruct 24B, an open-source domain-specific LLM framework for Earth observation; a demonstration of a new paradigm for domain adaptation
- 6.0
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
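A unified generation-time attribution framework for omni-modal LLMs, supporting attribution for autoregressive decoders with text, image, audio, and video inputs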
- 6.5
PersonaVLM: Long-Term Personalized Multimodal LLMs
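Long-term personalized multimodal LLMs that go beyond static single-turn personalization to capture users' evolving preferences and personality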
- 6.0
English is Not All You Need: Systematically Exploring Multilinguality in LLM Post-Training
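A large-scale controlled experiment of 220 SFT runs, systematically examining how training-language coverage, model scale, and task domain affect multilingual performance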
- 7.0
AgentSPEX: An Agent SPecification and EXecution Language
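A DSL dedicated to agents that explicitly defines control flow and intermediate state, addressing the implicitness of reactive prompting, and decouples workflow logic in contrast to LangGraph/DSPy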
- 6.0
Peer-Predictive Self-Training for Language Model Reasoning
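An LLM self-improvement framework with no external supervision: responses aggregated across models serve as the training signal, and PST achieves collaborative self-improvement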
- 6.5
CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding
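A multi-agent visual storyboard generation framework that tackles consistency in long video generation (continuity of characters, scenes, and transitions)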