Junyang Lin's first long post since leaving: from "thinking longer" to "thinking in order to act"
After posting "me stepping down. bye my beloved qwen" in the early hours of March 4, Junyang Lin went silent on social media for three weeks.
Today he published his first long post since his departure on X (Twitter).

https://x.com/JustinLin610/status/2037116325210829168
In the post he does not discuss why he left, and he does not address rumors about where he is going. The piece does one thing only: it sets out his judgment on the next phase of AI.
From "making models think longer" to "making models think while they act".
The full text follows.
Opening
The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.
That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.
1. What o1 and R1 actually taught us
The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.
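The shape of that training setup can be sketched in a few lines. This is a hypothetical, deliberately minimal illustration of reasoning RL with a verifiable reward (none of these names come from any lab's actual pipeline): a deterministic checker replaces human preference labels, so the reward measures correctness rather than plausibility, and a group-relative baseline in the spirit of GRPO turns raw rewards into advantages.

```python
# Minimal sketch (hypothetical, not any lab's real pipeline) of the core
# objects in reasoning RL with verifiable rewards.
import re

def verify_math(completion: str, gold: str) -> float:
    """Deterministic reward: 1.0 if the boxed final answer matches, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def advantage(rewards: list[float]) -> list[float]:
    """Group-relative advantage (GRPO-style): each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One training "step": sample several rollouts for the same prompt, verify
# each one, and compute advantages. A real system would then feed these
# advantages into a policy-gradient update at scale.
rollouts = [
    "Reasoning... \\boxed{42}",
    "Reasoning... \\boxed{41}",
    "Reasoning... \\boxed{42}",
]
rewards = [verify_math(c, "42") for c in rollouts]
advs = advantage(rewards)
```

Everything that makes this hard in practice lives outside the sketch: generating those rollouts at high throughput, verifying them quickly, and keeping policy updates stable, which is exactly the infra story described above.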
2. The hard part was never "merging thinking and instruct"
At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.
Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.
But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.
We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.
These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.
Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.
Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.
The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.
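What "a policy over compute, rather than a binary switch" could look like can be made concrete with a toy sketch. Everything here is illustrative (the function name, the quadratic ramp, and the budget ceiling are all assumptions, not any vendor's actual mechanism): the point is only that reasoning effort becomes a continuous function of estimated difficulty instead of an on/off toggle.

```python
# Hypothetical sketch of a smooth "policy over compute": map an estimated
# difficulty score to a thinking-token budget. Thresholds are illustrative.
def reasoning_budget(difficulty: float, max_tokens: int = 32768) -> int:
    """Map difficulty in [0, 1] to a thinking-token budget.

    0.0 -> answer immediately; 1.0 -> spend the full budget.
    A quadratic ramp keeps easy prompts cheap while still letting
    genuinely hard problems claim much more compute.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)  # clamp out-of-range scores
    return int(max_tokens * difficulty ** 2)
```

In a real system, `difficulty` would itself be a learned estimate conditioned on the prompt and context, which is precisely the part that is hard to train well.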
3. Anthropic's direction was a useful course correction
Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.
Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.
This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.
4. What "agentic thinking" actually means
Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.
The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:
+ When to stop thinking and start acting
+ Which tool to call, and in what order
+ How to make use of environment feedback that may be incomplete or noisy
+ How to revise the plan after a failure
+ How to keep a coherent thread across many turns and many tool calls
Agentic thinking is reasoning through action.
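The list above can be condensed into a skeleton of the agentic loop. This is a hypothetical sketch, not any production harness: `AgentState`, `step`, and the two-entry plan are invented for illustration. The loop decides when to stop deliberating and act, commits to an action, reads environment feedback, persists it across turns, and revises the plan on failure.

```python
# Hypothetical skeleton of an agentic think-act-observe-revise loop.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    plan: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)
    done: bool = False

def step(state: AgentState, tools: dict) -> AgentState:
    if not state.plan:                 # think: (re)plan only when out of steps
        state.plan = ["search", "answer"]
    action = state.plan.pop(0)         # act: commit instead of deliberating more
    obs = tools[action](state)         # observe: feedback may be partial/noisy
    state.observations.append(obs)     # persist context across turns
    if action == "answer":
        state.done = True
    elif "error" in obs:               # revise: a failure changes the plan
        state.plan.insert(0, action)
    return state
```

A usage sketch: drive `step` in a `while not state.done` loop with a dict of tool callables. The interesting research questions all live inside the branches of this loop, not in its outer shape.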
5. Why the infrastructure for agentic RL is harder
Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.
This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
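The decoupling argument can be shown in miniature. This is a toy sketch under stated assumptions (a single in-process queue and `time.sleep` standing in for tool latency; a real system would use a distributed rollout service): inference workers that block on slow environments push completed trajectories into a buffer, and the trainer drains the buffer asynchronously instead of lock-stepping with any single environment.

```python
# Toy sketch of decoupled rollout generation and training via a queue.
import queue
import threading
import time

trajectories: queue.Queue = queue.Queue()

def rollout_worker(n: int) -> None:
    """Inference side: blocks on environment feedback, then enqueues results."""
    for i in range(n):
        time.sleep(0.001)                  # stands in for tool/env latency
        trajectories.put({"id": i, "reward": 1.0})

def trainer(expected: int) -> list:
    """Training side: consumes whatever is ready, never idles on one env."""
    batch = []
    while len(batch) < expected:
        batch.append(trajectories.get())
    return batch

worker = threading.Thread(target=rollout_worker, args=(4,))
worker.start()
batch = trainer(4)
worker.join()
```

The design choice being illustrated: the trainer's throughput is bounded by aggregate queue arrivals, not by the latency of the slowest single environment, which is what keeps GPU utilization from collapsing.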
The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
6. The next frontier is more useful thinking
My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.
The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
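One concrete shape an anti-cheating protocol can take is a trajectory audit before scoring. Everything below is a hypothetical illustration (the `FORBIDDEN` list, the trajectory schema, and the function name are all invented): before a coding agent's reward is granted, its tool calls are checked for actions that would leak the answer, such as reading hidden test files or searching the web for the solution.

```python
# Hypothetical sketch of one anti-reward-hacking check: audit an agent's
# tool calls for answer leaks before granting the verifier's reward.
FORBIDDEN = ("tests/hidden", "answers.json", "web_search")

def audited_reward(trajectory: dict) -> float:
    """Grant reward only if no tool call touched a forbidden resource."""
    for call in trajectory["tool_calls"]:
        if any(marker in call for marker in FORBIDDEN):
            return 0.0                    # reward withheld: likely an exploit
    return 1.0 if trajectory["tests_passed"] else 0.0

honest = {"tool_calls": ["read src/main.py", "run pytest"], "tests_passed": True}
cheater = {"tool_calls": ["read tests/hidden/cases.py"], "tests_passed": True}
```

A static blocklist like this is only the crudest layer; real systems need environment-side isolation and evaluator robustness, because agents are trained precisely to find paths the blocklist author did not anticipate.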
Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
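The harness pattern described above can be sketched minimally. This is an illustrative toy, not a real orchestration framework: an orchestrator routes each piece of work to a specialist, and each specialist keeps a private context so one task's history never pollutes another's.

```python
# Toy sketch of an orchestrator routing work to context-isolated specialists.
def make_specialist(domain: str):
    context: list[str] = []                  # private, per-agent context
    def run(task: str) -> str:
        context.append(task)                 # only this specialist sees its history
        return f"[{domain}] handled: {task}"
    return run

specialists = {"code": make_specialist("code"), "search": make_specialist("search")}

def orchestrator(tasks: list[tuple[str, str]]) -> list[str]:
    """Planning is trivial here: route each (domain, task) to its specialist."""
    return [specialists[domain](task) for domain, task in tasks]
```

The closure-held `context` list is the whole point of the sketch: separation between levels of reasoning falls out of the structure of the harness, not out of the model weights.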
Closing
The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.
The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.
It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.
Original post published on X (Twitter) by Junyang Lin (林俊旸)
Translated and compiled by 赛博禅心
