Posts
All the articles I've posted.
- 6.0
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
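Beyond Rating: a holistic framework for evaluating the argumentative quality of AI-generated review text, moving beyond the scalar-rating paradigm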
- 7.0
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
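A large-scale study of LLM-guided evolutionary search: collects optimization trajectories from 15 LLMs on 8 tasks and reveals the mechanisms that drive optimization gains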
- 7.0
Lost in Translation: Do LVLM Judges Generalize Across Languages?
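MM-JudgeBench: the first large-scale multilingual, multimodal judge benchmark, revealing cross-lingual generalization deficiencies in LVLM evaluators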
- 6.0
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized LLMs in Medical Domain?
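Narrowing the performance gap between small specialized models and large general-purpose models in the medical domain through continual pre-training and model merging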
- 7.0
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
DASH-KV: a novel framework that accelerates long-context LLM inference through asymmetric KV cache hashing
- 7.0
Are Large Language Models Economically Viable for Industry Deployment?
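Examines the economic viability of LLMs under industrial constraints such as energy, latency, and hardware utilization, and critiques the accuracy-only evaluation paradigm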
- 6.0
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
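Quantifies the uncertainty of black-box LLMs via Soft-Hybrid alphabet estimation, using the number of semantic modes as a proxy for hallucination risk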
- 6.0
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
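ReflectMT: a two-stage reflection-internalization algorithm that internalizes the explicit reflection of reasoning models into an MT model to achieve both quality and efficiency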
- 6.0
Detoxification for LLM: From Dataset Itself
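Detoxifies LLMs at the source, in the pre-training data, rather than relying on symptomatic fixes such as post-training or controlled decoding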
- 6.0
Cell-Based Representation of Relational Binding in Language Models
Finds that LLMs encode discourse-level relational binding in a low-dimensional linear subspace called the Cell-Based Representation
- 6.0
Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
RADAR: a role-anchored multi-agent debate framework, a new fact-checking method targeting manipulation by omission (half-truths)
- 7.0
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
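The first benchmark for detecting harmful behavior at the reasoning-chain level, capturing the full behavioral chain in a jailbreak, from suppressing refusal to concealing risk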
- 6.0
Disparities In Negation Understanding Across Languages In Vision-Language Models
Reveals morphology-related differences in VLM affirmation bias across languages, and questions whether existing solutions generalize cross-lingually
- 6.0
LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval
LogosKG: a hardware-aligned, scalable, and interpretable framework for k-hop knowledge graph retrieval
- 6.0
Mango: Multi-Agent Web Navigation via Global-View Optimization
Mango: a multi-agent web navigation method that uses a website's global structure to dynamically determine the optimal navigation path
- 6.0
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
An empirical study of output-based jailbreak detection under realistic conditions, comparing TF-IDF and generation-inconsistency detectors
- 6.0
Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models
Identifies three structural failure modes of T2T editing in diffusion language models and proposes a Token-to-Mask refinement scheme
- 7.0
Highly Efficient and Effective LLMs with Multi-Boolean Architectures
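A new binarization framework that represents LLMs with multi-kernel Boolean parameters, enabling efficient inference without full-precision latent weights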
- 7.0
Whispers in the Machine: Confidentiality in Agentic Systems
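An analysis of confidentiality threats that arise when LLM agents integrate external tools, in particular the severe escalation of prompt injection in agentic settings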
- 6.0
The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
PROPER framework: a benchmark and method for modeling and navigating user knowledge gaps in proactive assistance systems