We ran 10 leading AI models through 500 atomic bash command generation tasks, then had GPT-4o judge each command for correctness. The results are sobering: even frontier models struggle with basic tool use.
The benchmark tested models across 12 categories: filesystem operations, text processing, network commands, version control, containers, security, and more. Each task asked the model to generate a single bash command achieving a specific goal. No chain-of-thought. No multi-turn refinement. Just raw command generation.
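To make the setup concrete, here is a hypothetical sketch of one task and the kind of single-command prompt it implies. The actual task schema and prompt wording are not published in this post, so the field names and prompt text below are assumptions.

```python
# Hypothetical task record; the real benchmark's schema is not shown here.
task = {
    "category": "archiving",
    "goal": "Compress logs.txt with gzip while keeping the original file",
}

def build_prompt(task: dict) -> str:
    # One shot, no chain-of-thought: ask for a bare command and nothing else.
    return (
        "Generate a single bash command that achieves the following goal. "
        "Respond with the command only, no explanation.\n"
        f"Goal: {task['goal']}"
    )

print(build_prompt(task))
```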
The Models
We tested three tiers of models via OpenRouter's unified API:
Top Frontier: Claude Opus 4.5, GPT-4o, Gemini 2.5 Pro, Claude 3.5 Sonnet
Open Source: Llama 3.3 70B, DeepSeek V3, Mistral Large 2512
Chinese Frontier: GLM-4.7, Qwen 2.5 72B, DeepSeek R1
Quality Rankings (GPT-4o Judge)
Every generated command was evaluated by GPT-4o for correctness: does this command actually achieve the stated goal? The results reveal a stark gap between "producing output" and "producing correct output":
| Rank | Model | Correct | Rate | Avg Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 58/455 | 12.7% | 1.28/10 |
| 2 | Qwen 2.5 72B | 59/500 | 11.8% | 1.23/10 |
| 3 | DeepSeek R1 | 48/416 | 11.5% | 1.18/10 |
| 4 | Mistral Large | 55/482 | 11.4% | 1.15/10 |
| 5 | Claude 3.5 Sonnet | 51/452 | 11.3% | 1.15/10 |
| 6 | DeepSeek V3 | 56/500 | 11.2% | 1.15/10 |
| 7 | GLM-4.7 | 45/407 | 11.1% | 1.12/10 |
| 8 | Llama 3.3 70B | 55/500 | 11.0% | 1.14/10 |
| 9 | GPT-4o | 52/500 | 10.4% | 1.06/10 |
| 10 | Gemini 2.5 Pro | 46/500 | 9.2% | 1.00/10 |
The Sobering Reality
All models achieved correctness rates between 9% and 13%. This isn't a measurement error; it reflects how hard atomic tool use actually is. These tasks require:
- Exact flag syntax (one wrong flag = wrong command)
- Correct argument ordering
- Awareness of common defaults vs explicit requirements
- Understanding what "achieves the goal" actually means
A command like `gzip file` may look correct, but GPT-4o judges whether it actually matches the operation's intent. By default `gzip` replaces `file` with `file.gz`, so a goal that asks to keep the original requires `gzip -k`. Many "plausible" commands fail this bar.
Claude Opus 4.5 Takes the Lead
Claude Opus 4.5 leads the quality rankings at 12.7% correctness, edging out Qwen 2.5 72B (11.8%) and DeepSeek R1 (11.5%). Notably, GPT-4o ranks near the bottom at 10.4%, an interesting result given that GPT-4o is also the judge.
This suggests the benchmark measures genuine command quality rather than stylistic similarity to the judge model. A biased judge would rank its own family higher.
Latency and Token Efficiency
Quality aside, models vary dramatically in cost and speed:
| Model | Avg Latency | Avg Tokens |
|---|---|---|
| Qwen 2.5 72B | 236ms | 91 |
| GPT-4o | 489ms | 93 |
| Mistral Large | 571ms | 98 |
| DeepSeek V3 | 666ms | 85 |
| Claude 3.5 Sonnet | 876ms | 94 |
| DeepSeek R1 | 1,408ms | 1,897 |
| Llama 3.3 70B | 1,606ms | 96 |
| Claude Opus 4.5 | 1,719ms | 94 |
| Gemini 2.5 Pro | 2,389ms | 555 |
| GLM-4.7 | 3,852ms | 843 |
Qwen 2.5 remains the speed champion at 236ms—more than twice as fast as any other model. Combined with competitive quality (11.8%, rank 2), it offers the best value for high-volume workloads where you can tolerate retry logic.
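Since the numbers favor pairing a fast model with retries over waiting on a slow one-shot model, the retry logic can be as simple as an exponential-backoff wrapper. This is a minimal sketch; the function names and backoff parameters are illustrative, not taken from the benchmark code.

```python
import time

def with_retries(generate, max_attempts=3, base_delay=0.5):
    """Call a zero-argument generation function, retrying on failure."""
    for attempt in range(max_attempts):
        try:
            return generate()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # out of attempts: propagate
            time.sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...

# Usage with a stub that fails once, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "tar -czf logs.tar.gz logs/"

print(with_retries(flaky, base_delay=0.01))
```

At a ~12% correctness rate even a few retries leave no guarantee of a correct command, so the wrapper addresses transport failures and obvious misses, not the quality ceiling.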
Reasoning Models: Still Expensive
DeepSeek R1 and GLM-4.7 emit "reasoning tokens": internal chain-of-thought that is billed as output. For simple commands:
- DeepSeek R1: 1,897 avg tokens (20x typical) at 1,408ms
- GLM-4.7: 843 avg tokens (8x typical) at 3,852ms
Despite this overhead, DeepSeek R1 ranks 3rd in quality (11.5%). The reasoning helps, but at roughly 20x the token cost; for atomic tool use, the economics rarely justify it.
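A quick back-of-envelope check using the two tables above makes the point: dividing average output tokens by correctness rate gives a rough "tokens per correct command" figure.

```python
# (avg tokens, correctness rate) pairs taken from the two tables above.
models = {
    "Qwen 2.5 72B":    (91,   0.118),
    "Claude Opus 4.5": (94,   0.127),
    "DeepSeek R1":     (1897, 0.115),
}

def tokens_per_correct(tokens: int, rate: float) -> float:
    """Average output tokens spent per *correct* command."""
    return tokens / rate

for name, (tokens, rate) in models.items():
    print(f"{name}: ~{tokens_per_correct(tokens, rate):,.0f} tokens per correct command")
```

By this measure the non-reasoning models cluster under 1,000 tokens per correct command, while DeepSeek R1 spends over 16,000.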
Methodology
500 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text, network, process, security, containers, version control, data transformation, archiving, system administration, user management, and time operations.
Phase 1: Generation. Each model generates commands with temperature=0. Heavy parallelization (100 concurrent requests) to minimize wall-clock time. Results cached incrementally.
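The generation phase described above can be sketched as a semaphore-bounded gather. Here `call_model` stands in for the real OpenRouter request (sent with temperature=0) and is an assumption, not the benchmark's actual client code.

```python
import asyncio

async def generate_all(tasks, call_model, limit=100):
    """Run one generation per task, with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(task):
        async with sem:                 # blocks while `limit` calls are active
            return await call_model(task)

    # gather preserves task order regardless of completion order
    return await asyncio.gather(*(one(t) for t in tasks))

# Usage with a stubbed model call in place of the real API request.
async def fake_call(task):
    await asyncio.sleep(0)              # stand-in for network latency
    return f"echo {task}"

results = asyncio.run(generate_all(range(5), fake_call, limit=2))
print(results)
```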
Phase 2: Quality Judging. GPT-4o evaluates each (goal, command) pair for correctness and assigns a 0-10 quality score: up to 5,000 judgments in total (10 models × 500 tasks; the per-model denominators above show where fewer commands were scorable). The judge prompt asks: "Does this command correctly achieve the goal? Consider: correctness, safety, whether it would actually work."
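The judging step can be sketched as follows, assuming the judge replies in free text containing a 0-10 score. The prompt paraphrases the wording quoted above, and the parsing logic is illustrative rather than the benchmark's actual implementation.

```python
import re

def build_judge_prompt(goal: str, command: str) -> str:
    # Paraphrases the judge prompt quoted above; exact formatting is assumed.
    return (
        "Does this command correctly achieve the goal? Consider: correctness, "
        "safety, whether it would actually work. Answer with a score from 0 to 10.\n"
        f"Goal: {goal}\nCommand: {command}"
    )

def parse_score(reply: str) -> int:
    """Pull the first integer in 0-10 out of the judge's free-text reply."""
    m = re.search(r"\b(10|[0-9])\b", reply)
    if m is None:
        raise ValueError(f"no score in reply: {reply!r}")
    return int(m.group(1))
```

In practice a structured-output or JSON-mode response would be more robust than regex parsing, but the shape of the evaluation is the same: one prompt and one numeric score per (goal, command) pair.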
The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.