Posts

All the articles I've posted.

6.0
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
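April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Beyond Rating: a holistic framework for evaluating the argumentative quality of AI-generated review text, moving beyond the scalar-rating paradigm.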
7.0
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
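April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
A large-scale study of LLM-guided evolutionary search: collects optimization trajectories from 15 LLMs across 8 tasks and reveals the mechanisms that drive optimization gains.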
7.0
Lost in Translation: Do LVLM Judges Generalize Across Languages?
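April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
MM-JudgeBench: the first large-scale multilingual, multimodal judging benchmark, revealing cross-lingual generalization deficiencies in LVLM evaluators.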
6.0
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized LLMs in Medical Domain?
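April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Narrows the performance gap between small specialized models and large general-purpose models in the medical domain through continual pre-training and model merging.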
7.0
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
DASH-KV: an innovative framework that accelerates long-context LLM inference via asymmetric KV cache hashing.
7.0
Are Large Language Models Economically Viable for Industry Deployment?
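April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Examines the economic viability of LLMs under industrial constraints such as energy, latency, and hardware utilization, and critiques the accuracy-only evaluation paradigm.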
6.0
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
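April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Quantifies the uncertainty of black-box LLMs via Soft-Hybrid alphabet estimation, using the number of semantic modes as a proxy for hallucination risk.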
6.0
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
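April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
ReflectMT: a two-stage reflection-internalization algorithm that internalizes the explicit reflection of reasoning models into an MT model to balance quality and efficiency.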
6.0
Detoxification for LLM: From Dataset Itself
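April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Detoxifies LLMs at the source, in the pre-training data, rather than relying on symptomatic fixes such as post-training or controllable decoding.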
6.0
Cell-Based Representation of Relational Binding in Language Models
April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Finds that LLMs encode discourse-level relational binding through a low-dimensional linear subspace termed Cell-Based Representation.
6.0
Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
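April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
RADAR: a role-anchored multi-agent debate framework, a new fact-checking method targeting manipulation by omission (half-truths).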
7.0
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
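April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
The first benchmark for detecting harmful behavior at the reasoning-chain level, capturing the full behavioral chain during a jailbreak, from suppressing refusal to concealing risk.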
6.0
Disparities In Negation Understanding Across Languages In Vision-Language Models
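April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Reveals morphology-related differences in VLM affirmation bias across languages and questions the cross-lingual generality of existing solutions.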
6.0
LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval
April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
LogosKG: a hardware-aligned, scalable, and interpretable framework for k-hop knowledge graph retrieval.
6.0
Mango: Multi-Agent Web Navigation via Global-View Optimization
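April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Mango: a multi-agent web navigation method that uses a website's global structure to dynamically determine the optimal navigation path.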
6.0
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
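April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
An empirical study of output-based jailbreak detection under realistic conditions, comparing TF-IDF and generation-inconsistency detectors.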
6.0
Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models
April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Identifies three structural failure modes of T2T editing in diffusion language models and proposes a Token-to-Mask fix.
7.0
Highly Efficient and Effective LLMs with Multi-Boolean Architectures
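April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
A new binarization framework that represents LLMs with multi-kernel Boolean parameters, enabling efficient inference without full-precision latent weights.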
7.0
Whispers in the Machine: Confidentiality in Agentic Systems
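April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
An analysis of confidentiality threats that arise once LLM agents integrate external tools, in particular the severe escalation of prompt injection in agentic settings.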
6.0
The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
April 22, 2026
· cs.LG updates on arXiv.org · collected 04/22 14:31
PROPER framework: a benchmark and method for modeling and navigating user knowledge gaps in proactive assistance systems.
