Qwen3.7-Max Conquers Coding's No.2 Spot JeariCk

On May 26, 2026, Code Arena — a globally recognized blind evaluation platform for AI coding ability — updated its leaderboard with a historic shift.

Alibaba’s Tongyi Qianwen flagship model, Qwen3.7-Max, scored 1541, ranking second globally. Only Anthropic’s Claude series performed better. Behind it: GPT-5.5, Gemini-3.5-Flash, DeepSeek-v4-pro.

This is the first time a Chinese LLM has broken into the global top two in the hardcore domain of software engineering.

What a Score of 1541 Actually Means

Code Arena isn’t a “write Hello World and you’re done” kind of test. It’s a blind benchmark under LMArena that covers real-world scenarios — front-end development, back-end logic, agentic coding. It measures how well a model actually writes code in complex engineering environments.

The rankings:
1. Claude series (v7-thinking) — 1667
2. Qwen3.7-Max — 1541
3. GLM-5.1 — 1633 (point ranking, behind Qwen in comprehensive blind evaluation)
4. Others

Drilling into specific metrics: SWE-bench Pro (software engineering) — 60.6%, SWE-Multilingual — 78.3%, SciCode — 53.5%. Individually these numbers might not jump off the page, but together they paint a clear picture: Qwen3.7-Max is the first Chinese model that can go toe-to-toe with Claude in real engineering scenarios.

35 Hours. It Wrote a Chip Kernel From Scratch.

The test that really shook the developer community wasn’t a benchmark score — it was an endurance trial revealed at the Alibaba Cloud Summit on May 20.

Engineers gave Qwen3.7-Max a problem it had never encountered: optimize the AI inference kernel on a brand-new, unreleased chip — the Pingtouge Zhenwu M890.

No reference performance data. No hardware documentation. No existing kernel examples to build on. Starting from an empty workspace, Qwen3.7-Max went to work:

– 432 kernel evaluations
– 1,158 tool calls
– 35 hours of continuous operation
– Fully autonomous — writing, compiling, analyzing performance, iterating, improving

The final result: the chip’s inference speed improved by 10x compared to the official baseline. The previous-generation model only managed a 1.1x speedup on the same task.

This wasn’t “AI helping you write code.” This was an AI that identified a problem, designed a solution, iterated through trial and error, and delivered results exceeding what a human engineer could achieve.

The Agent Ecosystem Hub

Qwen3.7-Max’s real positioning isn’t “a better chat model.” It’s an agent-native intelligence foundation designed from the ground up for autonomous task execution. Looking at it through the lens of the broader 2026 AI agent ecosystem makes this clearer.

Works Across Frameworks, No Lock-In

Qwen3.7-Max doesn’t try to build a walled garden. It was tested and proven stable across three different frameworks — Claude Code, OpenClaw, and Qwen Code. You can plug it into your existing agent stack without redesigning your architecture.

This matters in practice. Claude Code holds 54% of the AI coding tools market, and OpenClaw is the fastest-growing local agent framework. Building a model that performs well in both means the team invested real effort in tool-calling format adherence, instruction following, and multi-turn consistency.

Deep MCP Integration

MCP (Model Context Protocol) has become the de facto standard for LLM-to-tool communication. Qwen3.7-Max supports MCP natively and scored the highest among Chinese models on MCP-Atlas and MCP-Mark — two benchmarks that measure real-world agent capability. What this means for developers: you can point the model at databases, file systems, and third-party APIs without writing adapter layers.

Alibaba Cloud ships ready-to-use OpenClaw configuration for Qwen3.7-Max — three lines of JSON and it’s running as your agent inference engine:

```json
 "agents": {
 "defaults": {
 "model": {
 "primary": "modelstudio/qwen3.7-max"
 }
 }
 }
 ```

All-Field Thinking vs. The Competition’s Agent Modes

The top-tier models take notably different approaches to agent workloads:

– GPT-5.5 uses “adaptive reasoning” — fast for simple tasks, auto-allocates more compute for complex ones. Efficient but opaque.
– Claude Opus 4.7’s “thinking mode” offers more transparency in reasoning steps, letting users inspect the chain of thought — but text input only.
Qwen3.7-Max’s “All-Field Thinking” is the first to unify text, image, and code into a single reasoning chain. Think mode for deep reasoning, No-Think for fast responses, and users can toggle between them per scenario.

This distinction matters for agent development. Agent tasks frequently require mixing code context, UI screenshots, and documentation. All-Field Thinking lets the model process these mixed signals within one reasoning framework, eliminating the overhead of switching models or building bridges between them.

More Than a Model — Qianwen Cloud Is the Agent OS

Qwen3.7-Max didn’t launch in isolation. The summit also introduced Qianwen Cloud, designed around a fundamentally different philosophy.

Its homepage has no product lists, no nested navigation menus — just one line of code:

```bash
 npx skills add QianWen-AI/qianwen-ai
 ```

That’s code written for agents, not humans.

What Qianwen Cloud does: aggregates 150+ mainstream model APIs (including Qwen, GLM, Kimi), all packaged as Skills and CLI tools that agents can invoke directly. Developers don’t jump between dozens of product pages — they compare parameters, pricing, context windows in one place, validate with real tasks, and deploy without friction.

Paired with the “LangLong” (LangChain-compatible) toolchain on the Bailian MaaS platform, this makes for a ready-to-run agent development environment.

“We’re watching our cloud users transform from humans to agents,” said Liu Weiguang, Senior VP of Alibaba Cloud. “This changes everything about how we design cloud services.”

Built for Agent Concurrency, Not Just Inference

Agent scenarios differ from traditional inference in a critical way: agents are concurrent. Thousands of agents running simultaneously, talking to each other, create orders of magnitude more inter-chip communication pressure than standard inference workloads.

The Zhenwu M890 chip, paired with the in-house ICN Switch 1.0, pushes P2P communication latency under 150 nanoseconds across 128 fully interconnected cards. This is an architecture designed specifically for the massive concurrency demands of the agent era. With 560,000 Zhenwu chips already shipped across 400+ enterprise customers, this chip-to-model-to-cloud loop is producing real-world capacity, not just benchmarks.

What This Means for Developers

Qwen3.7-Max’s API is expected to launch on the Bailian platform in June. For full-stack developers, a few things to watch:

Another serious contender for coding assistance. In agentic coding scenarios, Qwen3.7-Max has proven it can handle real engineering problems, not just boilerplate generation.

Long-horizon agent tasks are no longer theoretical. The 35-hour autonomous run was extreme, but it validates a critical premise — models can maintain context coherence over extended periods. This is a bottleneck that’s held back many agent applications.

The barrier to building an agent is dropping. Qianwen Cloud + native MCP + cross-framework compatibility together mean you don’t need to be an LLM expert to build something useful.

Chinese models keep extending their cost advantage. After DeepSeek V3 set the example, Qwen3.7-Max now matches international top-tier coding performance. Developers can get elite-level agent capabilities at a fraction of the cost.

—

From the ChatGPT wave in 2023 to a Chinese LLM standing at #2 in global coding benchmarks — it took just three years. But the number worth paying attention to isn’t the ranking. It’s the agent-native stack — model, cloud platform, and silicon — that’s shifting from “chat better” to “actually get things done.”

📖 Recommended Reading

Take a look at these articles; you might find them interesting

Qwen3.7-Max Conquers Coding’s No.2 Spot