We ran 10 leading AI models through 500 atomic bash command generation tasks, then had GPT-4o judge each command for correctness. The results are sobering: even frontier models struggle with basic tool use.

The benchmark tested models across 12 categories: filesystem operations, text processing, network commands, version control, containers, security, and more. Each task asked the model to generate a single bash command achieving a specific goal. No chain-of-thought. No multi-turn refinement. Just raw command generation.
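As a concrete illustration, a single task and a candidate answer might look like this (the field names and the example are ours, not taken from the actual dataset):

```python
# Hypothetical shape of one benchmark task (illustrative; the real schema may differ).
task = {
    "category": "archiving",
    "goal": "Compress access.log with gzip at maximum compression",
}

# The model must answer with exactly one bash command; a plausible answer:
candidate = "gzip -9 access.log"
```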

The Models

We tested three tiers of models via OpenRouter's unified API:

Top Frontier: Claude Opus 4.5, GPT-4o, Gemini 2.5 Pro, Claude 3.5 Sonnet

Open Source: Llama 3.3 70B, DeepSeek V3, Mistral Large 2512

Chinese Frontier: GLM-4.7, Qwen 2.5 72B, DeepSeek R1

Quality Rankings (GPT-4o Judge)

Every generated command was evaluated by GPT-4o for correctness: does this command actually achieve the stated goal? The results reveal a stark gap between "producing output" and "producing correct output":

| Rank | Model | Correct | Rate | Avg Score |
|------|-------|---------|------|-----------|
| 1 | Claude Opus 4.5 | 58/455 | 12.7% | 1.28/10 |
| 2 | Qwen 2.5 72B | 59/500 | 11.8% | 1.23/10 |
| 3 | DeepSeek R1 | 48/416 | 11.5% | 1.18/10 |
| 4 | Mistral Large | 55/482 | 11.4% | 1.15/10 |
| 5 | Claude 3.5 Sonnet | 51/452 | 11.3% | 1.15/10 |
| 6 | DeepSeek V3 | 56/500 | 11.2% | 1.15/10 |
| 7 | GLM-4.7 | 45/407 | 11.1% | 1.12/10 |
| 8 | Llama 3.3 70B | 55/500 | 11.0% | 1.14/10 |
| 9 | GPT-4o | 52/500 | 10.4% | 1.06/10 |
| 10 | Gemini 2.5 Pro | 46/500 | 9.2% | 1.00/10 |

The Sobering Reality

All models landed in a narrow band between 9% and 13% correctness. This isn't a measurement error; it reflects how strict the bar for atomic, single-shot tool use really is.

A command like `gzip file` may look correct, but GPT-4o judges whether it actually compresses a file in a way that matches the operation's intent. Many "plausible" commands fail this bar.
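To make the distinction concrete, here is a hypothetical goal where a plausible command would fail the judge's bar (`gzip` removes its input by default; `-k` keeps it):

```python
# Hypothetical goal: "compress report.log but keep the original file".
plausible = "gzip report.log"   # compresses, but gzip deletes report.log by default
correct = "gzip -k report.log"  # -k / --keep retains report.log alongside report.log.gz
```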

Claude Opus 4.5 Takes the Lead

Claude Opus 4.5 leads the quality rankings at 12.7% correctness, edging out Qwen 2.5 (11.8%) and DeepSeek R1 (11.5%). Notably, GPT-4o ranks near the bottom at 10.4%—interesting given GPT-4o is also the judge.

This suggests the benchmark measures genuine command quality rather than stylistic similarity to the judge model. A biased judge would rank its own family higher.

Latency and Token Efficiency

Quality aside, models vary dramatically in cost and speed:

| Model | Avg Latency | Avg Tokens |
|-------|-------------|------------|
| Qwen 2.5 72B | 236ms | 91 |
| GPT-4o | 489ms | 93 |
| Mistral Large | 571ms | 98 |
| DeepSeek V3 | 666ms | 85 |
| Claude 3.5 Sonnet | 876ms | 94 |
| DeepSeek R1 | 1,408ms | 1,897 |
| Llama 3.3 70B | 1,606ms | 96 |
| Claude Opus 4.5 | 1,719ms | 94 |
| Gemini 2.5 Pro | 2,389ms | 555 |
| GLM-4.7 | 3,852ms | 843 |

Qwen 2.5 remains the speed champion at 236ms—more than twice as fast as any other model. Combined with competitive quality (11.8%, rank 2), it offers the best value for high-volume workloads where you can tolerate retry logic.
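A minimal sketch of that retry pattern, assuming a synchronous `generate(goal)` wrapper around the model call (the function names are ours, not from the benchmark code):

```python
import time

def generate_with_retry(generate, goal, max_attempts=3, backoff=0.25):
    """Retry a fast, cheap model a few times before giving up.

    `generate` is a caller-supplied function that returns a command string;
    `backoff` is the base delay for exponential backoff between attempts.
    """
    last_err = None
    for attempt in range(max_attempts):
        try:
            cmd = generate(goal)
            if cmd.strip():  # basic sanity check; real validation would go further
                return cmd
        except Exception as err:  # network errors, rate limits, etc.
            last_err = err
        time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"generation failed after {max_attempts} attempts: {last_err}")
```

At 236ms per attempt, even three retries of Qwen 2.5 finish faster than a single Gemini 2.5 Pro call.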

Reasoning Models: Still Expensive

DeepSeek R1 and GLM-4.7 emit "reasoning tokens": internal chain-of-thought that is billed as output. Even for simple commands, this inflates their token counts far beyond the 85-98 tokens the other models need.

Despite this overhead, DeepSeek R1 ranks 3rd in quality (11.5%). The reasoning helps—but at 20x the token cost. For atomic tool use, the economics rarely justify it.
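The 20x figure falls straight out of the token table above:

```python
r1_tokens = 1897     # DeepSeek R1 average output tokens per command
typical_tokens = 94  # non-reasoning models cluster around 85-98
print(f"~{r1_tokens / typical_tokens:.0f}x the token cost")  # ~20x the token cost
```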

Methodology

500 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text, network, process, security, containers, version control, data transformation, archiving, system administration, user management, and time operations.

Phase 1: Generation. Each model generates commands with temperature=0. Heavy parallelization (100 concurrent requests) to minimize wall-clock time. Results cached incrementally.
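The generation phase can be sketched with a semaphore-bounded `asyncio` fan-out. This is our illustration, not the benchmark's actual code; `call_model` stands in for an async wrapper around the OpenRouter chat endpoint with temperature=0:

```python
import asyncio

async def generate_all(tasks, call_model, limit=100):
    """Run `call_model(goal)` for every task, at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)  # cap concurrency at 100 requests

    async def one(task):
        async with sem:
            return task["id"], await call_model(task["goal"])

    # gather preserves input order regardless of completion order
    return dict(await asyncio.gather(*(one(t) for t in tasks)))
```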

Phase 2: Quality Judging. GPT-4o evaluates each (goal, command) pair for correctness and assigns a 0-10 quality score. 5,000 total judgments. The judge prompt asks: "Does this command correctly achieve the goal? Consider: correctness, safety, whether it would actually work."
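A judging pass of this shape typically asks the judge for a structured verdict and parses it back out. The sketch below uses the prompt wording quoted above; the JSON reply format and parsing are our assumptions about how such a harness might work:

```python
import json
import re

JUDGE_PROMPT = (
    "Does this command correctly achieve the goal? "
    "Consider: correctness, safety, whether it would actually work.\n"
    "Goal: {goal}\nCommand: {command}\n"
    'Reply as JSON: {{"correct": true/false, "score": 0-10}}'
)

def parse_judgment(reply):
    """Extract the judge's JSON verdict from a free-form model reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # grab the JSON object
    verdict = json.loads(match.group(0))
    return bool(verdict["correct"]), int(verdict["score"])
```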

The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.