May 2026

I Was Treating Claude Like a One-Person Shop

My Claude bill kept climbing while Codex and Ollama sat unused. I asked the obvious question — does delegating actually save anything? Here is what the numbers said.

AI EngineeringMulti-AgentClaude CodeCodexCost Optimization

The Bill That Kept Climbing

I had three AI subscriptions sitting in my workflow — Claude, Codex (the ChatGPT one), and Ollama running locally. For months I'd been using Claude for almost everything. Reading files, scanning repos, drafting code, writing reports, all of it. The other two were technically wired up. I rarely reached for them.

Then I checked the 30-day bill. The Claude line was bigger than felt comfortable, and most of it was output tokens — Claude writing analyses, drafts, summaries. Codex and Ollama? Effectively zero marginal cost (one's a flat subscription, the other's local). So I asked the obvious question I'd been avoiding: am I leaving real money on the table by not actually using them?

The A/B Test

Same audit task in both sessions. Same codebase. Fresh Claude Code chat for each.

Session A— strict delegation rules. Codex for repo-wide reads. Ollama for file triage. Claude as the orchestrator, not the doer.
Session B— default setup. Claude handles everything.

I pre-registered the scoring rubric before either session ran — that way I couldn't move goalposts after seeing the output. Token usage came from the session transcripts. An independent third-party model (also Codex, but in a separate role as judge) validated which findings in each report were real bugs versus false positives versus style preferences.

The Numbers

Billable tokens

A: 170KB: 235K

True positives caught

A: 32B: 16

False positive rate

A: 11%B: 11%

Recall vs full pool

A: 89%B: 44%

Severity calibration

A: inflated +0.80B: inflated +0.58

Judge's qualitative pick

A: 13/20B: 18/20

Session A used ~28% fewer tokens and caught roughly 2x as many real issues. Same precision on both. By the pre-registered decision rule, A wins decisively.

What Got Worse

The independent judge picked Session B's report as higher quality. Both sessions found real issues; B's list was tighter and easier to act on. A's list had a long tail of style preferences and inflated severities — 16 “Critical” findings when maybe two actually warranted that tier.

Codex doesn't reserve “Critical” the way a senior engineer would. It returned everything it found, faithfully. The orchestrator's job was to triage that down before the report shipped — and I hadn't told it to.

The Fix

I added four orchestrator-side rules to the global config:

Verify every cited file/line before passing a delegated finding through. Don't dump unedited sub-model output.
Cap audit reports at 25 findings. If more come back, demote the marginal ones to an appendix.
Reserve “Critical” for outage risk, data loss, compliance failure, duplicate-write risk. Everything else is High or below.
Demote long-function and mock-test findings to a Maintainability appendix unless they're tied to a known bug.

None of those say “delegation doesn't work.” They say “Claude didn't edit hard enough on the way out.”

What I Took Away

Frontier models are expensive. They're also good at being directors. The trick was getting Claude to act like a project manager — route the right work to the right tool, then edit what came back — instead of trying to do every task itself.

Same shape as hiring a team. If you don't delegate, you cap out at what one person can do. If you delegate without editing, you ship noise. The middle path is where the gains are — and the editing-discipline part is still on me.

If your AI bill is climbing and you can't tell which spend is signal and which is sprawl, I do free 30-minute data audits for SMBs and founders trying to get a handle on the same thing.

Get in Touch →