Tag: 推理 (Reasoning)

All articles with the tag "推理" (Reasoning).

7.3
The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2603.22816v2 Announce Type: replace Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answeri...
6.7
URSA: The Universal Research and Scientific Agent
2026-04-08
· cs.AI updates on arXiv.org · 04/08 12:31 collected
arXiv:2506.22653v2 Announce Type: replace Abstract: Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now c...
6.7
Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2604.05549v1 Announce Type: new Abstract: With the widespread application of LLM-based agents across various domains, their complexity has in...
6.4
TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering
2026-04-08
· cs.AI updates on arXiv.org · 04/08 12:31 collected
arXiv:2510.07432v2 Announce Type: replace Abstract: Large language models (LLMs) exhibit strong symbolic and compositional reasoning, yet they stru...
6.4
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2604.05681v1 Announce Type: cross Abstract: We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic ...
6.4
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
2026-04-08
· cs.AI updates on arXiv.org · 04/08 12:31 collected
arXiv:2601.03054v4 Announce Type: replace-cross Abstract: Recent research on medical MLLMs has gradually shifted its focus from image-level underst...
6.4
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2604.06066v1 Announce Type: new Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning ...
6.0
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2511.04570v2 Announce Type: replace-cross Abstract: The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the r...
6.0
Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2604.05540v1 Announce Type: new Abstract: Large language models (LLMs) can effectively handle outdated information through knowledge editing....
6.0
Automatic Replication of LLM Mistakes in Medical Conversations
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2512.20983v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimens...
6.0
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
2026-04-08
· cs.CL updates on arXiv.org · 04/08 12:31 collected
arXiv:2604.05876v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of u...
8.3
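Single-Agent Systems Outperform Multi-Agent Systems in Multi-Hop Reasoning: An Information-Theoretic Argument Under a Fixed Token Budget
2026-04-06
· arXiv cs.CL · 04/06 12:33 collected
Under an equal thinking-token budget, single-agent systems are more information-efficient on multi-hop reasoning tasks; the performance gains of multi-agent systems come mainly from additional compute rather than from architectural advantages.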
6.5
Google Gemini API Introduces Flex and Priority Inference Tiers to Balance Cost and Reliability
2026-04-03
· Google Blog · 04/03 18:31 collected
Google introduces two inference modes, Flex and Priority, for the Gemini API.
6.8
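Former ByteDance Executive Founds 蓝芯算力: RISC-V AI Inference Chip Maker Raises Hundreds of Millions in Funding
2026-03-29
· 36氪 (36Kr) · 03/29 10:32 collected
Lu Shan (卢山), former head of chips at ByteDance, has founded 蓝芯算力, which focuses on AI inference chips built on the RISC-V architecture and has secured orders for 200,000 chips from Lenovo, Tencent Cloud, and others.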
8.0
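Lin Junyang (林俊旸) Speaks Out for the First Time After Leaving: Qwen's Reasoning-Chain Direction Contains a Fatal Technical Mistake
2026-03-27
· 全部-虎嗅网 (Huxiu, all articles) · 03/27 12:31 collected
In his first public reflection since leaving, Lin Junyang, a former member of Alibaba's Qwen team, argues that stacking reasoning chains is the wrong direction, exposing mistaken key technical decisions in large-model training.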
7.4
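GTC Summit Dialogue, Jeff Dean x Bill Dally: The Pretraining Paradigm Is Dead, and the Next Frontier Is Inference and Systems
2026-03-20
· 36氪 (36Kr) · 03/20 04:32 collected
In a conversation at GTC 2026, Google's Jeff Dean and NVIDIA's Bill Dally say that the focus of AI development is shifting from pretraining to inference optimization and systems engineering.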
7.3
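Efficient Reasoning on Edge Devices: LoRA + Reinforcement Learning Teach Small Models to Think
2026-03-19
· arXiv · 03/19 02:33 collected
Qualcomm proposes a lightweight approach that brings LLM reasoning capabilities to mobile devices.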
9.0
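Mistral Small 4 Released: 119B MoE Unifies Reasoning, Multimodal, and Coding Capabilities, Open-Sourced Under Apache 2
2026-03-16
· Simon Willison · 03/17 16:38 collected
Mistral releases the Small 4 model, a MoE architecture with 119B parameters (6B active) that for the first time unifies Magistral reasoning, Pixtral multimodal, and Devstral coding capabilities, under the Apache 2 license.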
7.5
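The Large-Model Paradox: Thinking Longer Makes Models More Honest, but at a High Cost
2026-03-16
· 36氪 (36Kr)
DeepMind research finds that deep thinking makes AI more honest, but this conflicts sharply with a commercial logic that prioritizes speed and cost.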
