We ran 10 leading AI models through 500 atomic bash command generation tasks, then had GPT-4o judge each command for correctness. The results are sobering: even frontier models struggle with basic tool use.
The benchmark tested models across 12 categories: filesystem operations, text processing, network commands, version control, containers, security, and more. Each task asked the model to generate a single bash command achieving a specific goal. No chain-of-thought. No multi-turn refinement. Just raw command generation.
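To make the setup concrete, here is a hypothetical sketch of one task and the kind of single-command prompt it implies. The actual task schema and prompt wording are not published in this post, so the field names and prompt text below are assumptions.

```python
# Hypothetical task record; the real benchmark's schema is not shown here.
task = {
    "category": "archiving",
    "goal": "Compress logs.txt with gzip while keeping the original file",
}

def build_prompt(task: dict) -> str:
    # One shot, no chain-of-thought: ask for a bare command and nothing else.
    return (
        "Generate a single bash command that achieves the following goal. "
        "Respond with the command only, no explanation.\n"
        f"Goal: {task['goal']}"
    )

print(build_prompt(task))
```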
The Models
We tested three tiers of models via OpenRouter's unified API:
Top Frontier: Claude Opus 4.5, GPT-4o, Gemini 2.5 Pro, Claude 3.5 Sonnet
Open Source: Llama 3.3 70B, DeepSeek V3, Mistral Large 2512
Chinese Frontier: GLM-4.7, Qwen 2.5 72B, DeepSeek R1
Quality Rankings (GPT-4o Judge)
Every generated command was evaluated by GPT-4o for correctness: does this command actually achieve the stated goal? The results reveal a stark gap between "producing output" and "producing correct output":
| Rank | Model | Correct | Rate | Avg Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 58/455 | 12.7% | 1.28/10 |
| 2 | Qwen 2.5 72B | 59/500 | 11.8% | 1.23/10 |
| 3 | DeepSeek R1 | 48/416 | 11.5% | 1.18/10 |
| 4 | Mistral Large | 55/482 | 11.4% | 1.15/10 |
| 5 | Claude 3.5 Sonnet | 51/452 | 11.3% | 1.15/10 |
| 6 | DeepSeek V3 | 56/500 | 11.2% | 1.15/10 |
| 7 | GLM-4.7 | 45/407 | 11.1% | 1.12/10 |
| 8 | Llama 3.3 70B | 55/500 | 11.0% | 1.14/10 |
| 9 | GPT-4o | 52/500 | 10.4% | 1.06/10 |
| 10 | Gemini 2.5 Pro | 46/500 | 9.2% | 1.00/10 |
The Sobering Reality
All models achieved correctness rates between 9% and 13%. This isn't a measurement error; it reflects how hard atomic tool use actually is. These tasks require:
- Exact flag syntax (one wrong flag = wrong command)
- Correct argument ordering
- Awareness of common defaults vs explicit requirements
- Understanding what "achieves the goal" actually means
A command like `gzip file` may look correct, but GPT-4o judges whether it actually matches the operation's intent. By default `gzip` replaces `file` with `file.gz`, so a goal that asks to keep the original requires `gzip -k`. Many "plausible" commands fail this bar.
Claude Opus 4.5 Takes the Lead
Claude Opus 4.5 leads the quality rankings at 12.7% correctness, edging out Qwen 2.5 72B (11.8%) and DeepSeek R1 (11.5%). Notably, GPT-4o ranks near the bottom at 10.4%, an interesting result given that GPT-4o is also the judge.
This suggests the benchmark measures genuine command quality rather than stylistic similarity to the judge model. A biased judge would rank its own family higher.
Latency and Token Efficiency
Quality aside, models vary dramatically in cost and speed:
| Model | Avg Latency | Avg Tokens |
|---|---|---|
| Qwen 2.5 72B | 236ms | 91 |
| GPT-4o | 489ms | 93 |
| Mistral Large | 571ms | 98 |
| DeepSeek V3 | 666ms | 85 |
| Claude 3.5 Sonnet | 876ms | 94 |
| DeepSeek R1 | 1,408ms | 1,897 |
| Llama 3.3 70B | 1,606ms | 96 |
| Claude Opus 4.5 | 1,719ms | 94 |
| Gemini 2.5 Pro | 2,389ms | 555 |
| GLM-4.7 | 3,852ms | 843 |
Qwen 2.5 remains the speed champion at 236ms—more than twice as fast as any other model. Combined with competitive quality (11.8%, rank 2), it offers the best value for high-volume workloads where you can tolerate retry logic.
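Since the numbers favor pairing a fast model with retries over waiting on a slow one-shot model, the retry logic can be as simple as an exponential-backoff wrapper. This is a minimal sketch; the function names and backoff parameters are illustrative, not taken from the benchmark code.

```python
import time

def with_retries(generate, max_attempts=3, base_delay=0.5):
    """Call a zero-argument generation function, retrying on failure."""
    for attempt in range(max_attempts):
        try:
            return generate()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # out of attempts: propagate
            time.sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...

# Usage with a stub that fails once, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "tar -czf logs.tar.gz logs/"

print(with_retries(flaky, base_delay=0.01))
```

At a ~12% correctness rate even a few retries leave no guarantee of a correct command, so the wrapper addresses transport failures and obvious misses, not the quality ceiling.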
Reasoning Models: Still Expensive
DeepSeek R1 and GLM-4.7 emit "reasoning tokens": internal chain-of-thought that is billed as output. For simple commands:
- DeepSeek R1: 1,897 avg tokens (20x typical) at 1,408ms
- GLM-4.7: 843 avg tokens (8x typical) at 3,852ms
Despite this overhead, DeepSeek R1 ranks 3rd in quality (11.5%). The reasoning helps, but at roughly 20x the token cost; for atomic tool use, the economics rarely justify it.
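A quick back-of-envelope check using the two tables above makes the point: dividing average output tokens by correctness rate gives a rough "tokens per correct command" figure.

```python
# (avg tokens, correctness rate) pairs taken from the two tables above.
models = {
    "Qwen 2.5 72B":    (91,   0.118),
    "Claude Opus 4.5": (94,   0.127),
    "DeepSeek R1":     (1897, 0.115),
}

def tokens_per_correct(tokens: int, rate: float) -> float:
    """Average output tokens spent per *correct* command."""
    return tokens / rate

for name, (tokens, rate) in models.items():
    print(f"{name}: ~{tokens_per_correct(tokens, rate):,.0f} tokens per correct command")
```

By this measure the non-reasoning models cluster under 1,000 tokens per correct command, while DeepSeek R1 spends over 16,000.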
Methodology
500 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text, network, process, security, containers, version control, data transformation, archiving, system administration, user management, and time operations.
Phase 1: Generation. Each model generates commands with temperature=0. Heavy parallelization (100 concurrent requests) to minimize wall-clock time. Results cached incrementally.
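The generation phase described above can be sketched as a semaphore-bounded gather. Here `call_model` stands in for the real OpenRouter request (sent with temperature=0) and is an assumption, not the benchmark's actual client code.

```python
import asyncio

async def generate_all(tasks, call_model, limit=100):
    """Run one generation per task, with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(task):
        async with sem:                 # blocks while `limit` calls are active
            return await call_model(task)

    # gather preserves task order regardless of completion order
    return await asyncio.gather(*(one(t) for t in tasks))

# Usage with a stubbed model call in place of the real API request.
async def fake_call(task):
    await asyncio.sleep(0)              # stand-in for network latency
    return f"echo {task}"

results = asyncio.run(generate_all(range(5), fake_call, limit=2))
print(results)
```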
Phase 2: Quality Judging. GPT-4o evaluates each (goal, command) pair for correctness and assigns a 0-10 quality score: up to 5,000 judgments in total (10 models × 500 tasks; the per-model denominators above show where fewer commands were scorable). The judge prompt asks: "Does this command correctly achieve the goal? Consider: correctness, safety, whether it would actually work."
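The judging step can be sketched as follows, assuming the judge replies in free text containing a 0-10 score. The prompt paraphrases the wording quoted above, and the parsing logic is illustrative rather than the benchmark's actual implementation.

```python
import re

def build_judge_prompt(goal: str, command: str) -> str:
    # Paraphrases the judge prompt quoted above; exact formatting is assumed.
    return (
        "Does this command correctly achieve the goal? Consider: correctness, "
        "safety, whether it would actually work. Answer with a score from 0 to 10.\n"
        f"Goal: {goal}\nCommand: {command}"
    )

def parse_score(reply: str) -> int:
    """Pull the first integer in 0-10 out of the judge's free-text reply."""
    m = re.search(r"\b(10|[0-9])\b", reply)
    if m is None:
        raise ValueError(f"no score in reply: {reply!r}")
    return int(m.group(1))
```

In practice a structured-output or JSON-mode response would be more robust than regex parsing, but the shape of the evaluation is the same: one prompt and one numeric score per (goal, command) pair.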
The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.