RoninForge / Tutorials / Minimize your Claude Code and API costs
Tutorial
The output rate is five times the input rate, and the tokens you never see bill at the output rate too. This tutorial lays out six confirmed levers to cut a Claude Code or API token bill, then shows how to measure the change with BudgetClaw so you are moving a real number, not guessing.
Problem
Output tokens cost 5x input
Time
7-minute read
Output
Six levers you can apply today
Token cost is not symmetric, and the most expensive part is the part you cannot see. Two facts drive almost every wasted dollar:
$5 per million tokens and output is $25. Anything that generates fewer output tokens is worth five input tokens saved.display setting can hide it from you, but hiding it changes nothing about the bill. So "make the answer shorter" trims the small visible tail while the large invisible part keeps costing.There are exactly four ways to pay less: generate fewer output tokens, reuse input tokens through caching, pay a lower rate per token, or feed less context to the most expensive model. The six levers below are concrete applications of those four. Everything here is mechanism, not folklore.
The single largest lever is the model string. The Claude family spans a 5x price range. Most tasks do not need the top model, and routing them down is a flat cut on every token they touch:
MODEL INPUT $/1M OUTPUT $/1M Claude Haiku 4.5 $1.00 $5.00 Claude Sonnet 4.6 $3.00 $15.00 Claude Opus 4.8 $5.00 $25.00 Claude Fable 5 $10.00 $50.00
Use Haiku for classification, extraction, and short mechanical tasks. Use Sonnet for high-volume production work. Reserve Opus for genuinely hard reasoning where a wrong answer costs more than the token difference. On the API, set model per call, not once per app, so cheap calls actually run on the cheap model.
Do this now: start your next mechanical task on Haiku instead of inheriting your default model.
Already in a session? Type /model and pick the cheaper tier before the next request.
On recent models, thinking depth is set by the effort parameter, and thinking tokens are billed as output. The default is high. Lower effort means fewer thinking tokens, less preamble, and terser tool confirmations, all of which are output you stop paying for:
Two things people get wrong here:
display: "omitted" only removes the summary from your view. The model still thinks, and you still pay the output rate for it. Use effort to actually reduce it.low and medium on your own prompts before assuming a task needs high.In Claude Code the harness runs hard coding work at high effort by design. The lever there is to not point an agentic, high-effort session at a task a single short prompt would have answered.
Do this now: drop effort one notch on your highest-volume API route, run the same workload, and compare the output-token count from response.usage.output_tokens before and after. If the answers hold up, keep the lower setting.
Every turn re-sends the whole conversation. A bloated context is an input-token tax you pay on every single message, not once. Two moves cut it:
/clear between unrelated tasks and /compact when a session gets long. One job per session keeps the context small and the answers sharper.Caching helps input only. It does nothing for output, so it stacks with Lever 1 and Lever 2 rather than replacing them. A single moving byte in the prefix (a datetime.now() in the system prompt, unsorted JSON keys) silently invalidates the whole cache and you pay full input price again. Confirm a cache is actually hitting by checking that response.usage.cache_read_input_tokens is non-zero on the second identical request.
Do this now: clear the context the moment you switch tasks instead of letting one session sprawl.
The levers above lower the average. Caps protect you from the runaway that is ten times the average. There are three, at increasing levels of bluntness:
max_tokens is a hard ceiling on a single response. The model is not aware of it, so hitting it truncates the answer mid-thought and you retry. It bounds the worst case but is not a tuning dial.task_budget (beta) tells the model how many tokens it has for a whole agentic loop. It sees a running countdown and wraps up gracefully instead of being cut off. Minimum is 20,000 tokens. This is the right cap for multi-step agents.SIGTERM to the Claude Code process on breach. The spend-cap tutorial walks the setup in five minutes.Do this now: set a hard daily cap on your current project so a stuck loop cannot run past it.
Swap myapp for your project name and the cap for your number. Use --action warn for a phone push without the kill.
If a job does not need an answer in the next second, send it through the Batches API. It runs the exact same requests at 50 percent of standard token prices, input and output alike. Most batches finish within an hour, and the cap is 24 hours.
Good candidates: evaluation runs, bulk extraction or classification over a dataset, document summarization, and overnight backfills. This is a price cut, not a token cut, so it composes with every lever above. Batching a Haiku job with a cached prefix is three discounts stacked on the same request.
Do this now: move one non-urgent loop off messages.create and onto the batches endpoint. Same requests, half the price.
Each request takes its own custom_id and a normal message params body. Poll the returned batch until processing_status is ended, then read the results.
The most capable models cost the most per token. Claude Fable 5 is $10 / $50 per million, twice Opus 4.8. The waste is not using the premium tier. The waste is feeding it a huge context when only a small slice of that context is the actual hard problem. You pay the top rate to re-read background it did not need.
The fix is an orchestrator split:
RoninForge packages this as the fable-orchestrator skill. It decomposes a large planning or research task into several focused streams, runs each on the premium model with a small context, then knits the results together in one synthesis pass, with human approval gates between phases and every expensive output saved as a research file. The orchestrator keeps the big context cheap; the specialist only ever sees what it needs.
Do this now: even without the skill, run the pattern by hand in Claude Code. Instead of dumping the whole repo into one expensive session, tell the coordinator to delegate each hard piece to a focused subagent that only sees what it needs, and to save the result:
The subagent burns its tokens on a small context, the conclusion lands in a file you can reuse, and the main thread never has to re-read the whole codebase to continue.
On Fable 5 specifically: it is not generally available to anyone at the time of writing, so you cannot run this exact split today. We describe it anyway because the pattern is model-agnostic. Whenever a premium tier and a cheaper large-context tier coexist, keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs. The shape holds no matter which model sits at the top of the price list next.
Move a real number, not a hunch
Every lever here is testable. Change one thing, run the same work, and read the spend back. BudgetClaw attributes cost per project and per branch from the logs Claude Code already writes to disk, so you can see a model switch or an effort drop land in the numbers:
PROJECT BRANCH TODAY WEEK MONTH MyClient main $18.40 $96.10 $402.55 RoninForge.org main $6.12 $58.30 $311.08 TOTAL $24.52 $154.40 $713.63
If you have not wired BudgetClaw into your status line yet, the ongoing-costs tutorial puts live per-project spend at the bottom of every Claude Code prompt. Not installed at all?
Three things are widely believed to save money and do not. Skip them so you spend your effort on the levers that move the bill:
display: "omitted" is visibility only. Billed identically.Yes, but modestly, and only on the visible text. On thinking-enabled models the larger cost is often the thinking the model does before it answers, which is billed at the output rate but never shown to you. Trimming the visible answer does not touch that. The bigger lever is the effort level, which controls how much the model thinks in the first place.
display: "omitted" save money?No. The display setting changes visibility only. Thinking happens and is billed identically whether you see a summary or an empty block. This is the most common false economy. To actually cut thinking tokens, lower the effort level or disable thinking for routine calls.
Neither. Streamed and non-streamed responses produce identical token counts and identical billing. Streaming only changes transport: it avoids HTTP timeouts on long outputs. Choose it for reliability, not cost.
No. Caching discounts input tokens only. A cache read costs roughly a tenth of the normal input price, but output is always billed in full at the output rate. Pair caching with the output-side levers; it is not a substitute for them.
Right-sizing the model, then the effort level. Moving a task from Opus to Haiku is a 5x cut on every token. Dropping effort one notch can roughly halve thinking tokens on hard prompts. Measure both with BudgetClaw so you can see the number move rather than guessing.
Not generally, to anyone, at the time of writing. We describe the orchestrator pattern around it because the pattern is model-agnostic. Whenever a premium model tier and a cheaper large-context tier exist side by side, the right move is to keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs. Today that split is Opus 4.8 as the coordinator and a premium model as the specialist; the shape does not change when the model names do.
BudgetClaw is Claude Code only. Goei applies the same budget caps and runaway-session kills across Codex CLI, Gemini CLI, and Copilot CLI, so the measurement and cap parts of this tutorial carry over to a mixed toolchain.