Tag: Reinforcement Learning (强化学习)
All articles with the tag "Reinforcement Learning" (强化学习).
- 6.7
URSA: The Universal Research and Scientific Agent
arXiv:2506.22653v2 Announce Type: replace Abstract: Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now c...
- 6.4
Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
arXiv:2603.19282v2 Announce Type: replace Abstract: In many real-world applications, large language models (LLMs) operate as independent agents wit...
- 6.4
Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
arXiv:2601.09365v2 Announce Type: replace Abstract: Common ground plays a critical role in situated spoken dialogs, where interlocutors must establ...
- 6.4
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
arXiv:2603.21357v2 Announce Type: replace-cross Abstract: LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% ...
- 6.4
Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
arXiv:2604.05483v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse ran...
- 6.4
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
arXiv:2604.06066v1 Announce Type: new Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning ...
- 6.4
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
arXiv:2604.05808v1 Announce Type: new Abstract: Large language model (LLM) agents have demonstrated strong capabilities in complex interactive deci...
- 6.4
MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning
arXiv:2604.05943v1 Announce Type: new Abstract: Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous ...
- 6.4
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
arXiv:2604.06132v1 Announce Type: new Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows...
- 6.3
The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models
arXiv:2604.05995v1 Announce Type: cross Abstract: Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitabl...
- 6.0
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
arXiv:2604.05876v1 Announce Type: new Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of u...
- 8.3
GrandCode: Multi-Agent RL System Reaches Grandmaster Level in Competitive Programming
Through coordinated RL training of multiple agent modules for hypothesis proposal, test generation, and solution selection, GrandCode surpasses Gemini 3 Deep Think in competitive programming, reaching human Grandmaster level for the first time.
- 7.0
Multi-Turn Reinforcement Learning for Tool-Calling Agents: MT-GRPO and Iterative Reward Calibration
The first work to combine MT-GRPO with GTPO for training tool-calling agents; it finds that rule-based dense rewards are more stable than LLM judges and proposes an iterative reward-calibration method.
- 7.0
Reaching Beyond the Mode: Reinforcement Learning for Distributional Inference in Language Models
Uses RL to train language models to output a distribution over multiple answers rather than a single best answer, addressing current models' limitations in uncertainty-laden scenarios such as medical diagnosis.
- 7.7
PrismAudio: A 518M-Parameter Model Beats Multi-Billion-Parameter Models, Setting a New SOTA for Chinese-Developed Multimodal Audio Generation
Alibaba Tongyi, in collaboration with HKUST, releases PrismAudio, the first system to integrate RL and CoT planning into video dubbing generation
- 7.4
Scaling RL for Code Generation: Synthetic Data and Curriculum Learning in Practice
A teacher iteratively refines problems based on student performance, building a structured difficulty progression without any teacher fine-tuning.
- 7.4
Reward Is Enough: Reinforcement Learning Capabilities Emerge in LLMs at Inference Time
Reveals that RL-like behavior emerges naturally in LLMs at inference time, enabling self-improvement through multi-turn prompting alone.
- 6.7
3D-Layout-R1: Structured Spatial Layout Editing via Scene-Graph Reasoning
3D-Layout-R1 uses scene-graph reasoning and reinforcement learning for text-driven spatial layout editing, improving IoU by 15% over CoT-SFT methods and applying structured spatial reasoning to 3D scene editing for the first time.
- 6.7
OS-Themis: A Scalable Multi-Agent Judging Framework That Improves GUI Agent RL Training by 10.3%
Decomposes GUI agent trajectories into verifiable milestones and builds high-quality reward functions through a multi-agent review mechanism.
- 7.1
MiniMax M2.7 Released: The Model Participates in 30-50% of Its Own Training Pipeline
MiniMax introduces the self-evolving LLM M2.7, which autonomously handles R&D steps such as training debugging and metric analysis, achieving a 66.6% medal rate on MLE Bench Lite.