RoninForge / Tutorials / Minimize your Claude Code and API costs

~2k tokens | View as Markdown |

Tutorial

How to minimize your Claude Code and API costs

The output rate is five times the input rate, and the tokens you never see bill at the output rate too. This tutorial lays out six confirmed levers to cut a Claude Code or API token bill, then shows how to measure the change with BudgetClaw so you are moving a real number, not guessing.

Problem

Output tokens cost 5x input

Time

7-minute read

Output

Six levers you can apply today

The problem

Token cost is not symmetric, and the most expensive part is the part you cannot see. Two facts drive almost every wasted dollar:

  • Output is the expensive side. On Claude Opus 4.8 input is $5 per million tokens and output is $25. Anything that generates fewer output tokens is worth five input tokens saved.
  • The biggest line item is invisible. On thinking-enabled models, the reasoning the model does before it answers is billed at the output rate. The display setting can hide it from you, but hiding it changes nothing about the bill. So "make the answer shorter" trims the small visible tail while the large invisible part keeps costing.

There are exactly four ways to pay less: generate fewer output tokens, reuse input tokens through caching, pay a lower rate per token, or feed less context to the most expensive model. The six levers below are concrete applications of those four. Everything here is mechanism, not folklore.

Lever 1

Right-size the model

The single largest lever is the model string. The Claude family spans a 5x price range. Most tasks do not need the top model, and routing them down is a flat cut on every token they touch:

MODEL               INPUT $/1M   OUTPUT $/1M
Claude Haiku 4.5        $1.00         $5.00
Claude Sonnet 4.6       $3.00        $15.00
Claude Opus 4.8         $5.00        $25.00
Claude Fable 5         $10.00        $50.00

Use Haiku for classification, extraction, and short mechanical tasks. Use Sonnet for high-volume production work. Reserve Opus for genuinely hard reasoning where a wrong answer costs more than the token difference. On the API, set model per call, not once per app, so cheap calls actually run on the cheap model.

Do this now: start your next mechanical task on Haiku instead of inheriting your default model.

Already in a session? Type /model and pick the cheaper tier before the next request.

Lever 2

Control thinking, the invisible line item

On recent models, thinking depth is set by the effort parameter, and thinking tokens are billed as output. The default is high. Lower effort means fewer thinking tokens, less preamble, and terser tool confirmations, all of which are output you stop paying for:

Two things people get wrong here:

  • Hiding thinking saves nothing. Setting display: "omitted" only removes the summary from your view. The model still thinks, and you still pay the output rate for it. Use effort to actually reduce it.
  • Turn it off for routine calls. For classification and formatting, omit thinking entirely. Sweep low and medium on your own prompts before assuming a task needs high.

In Claude Code the harness runs hard coding work at high effort by design. The lever there is to not point an agentic, high-effort session at a task a single short prompt would have answered.

Do this now: drop effort one notch on your highest-volume API route, run the same workload, and compare the output-token count from response.usage.output_tokens before and after. If the answers hold up, keep the lower setting.

Lever 3

Keep context lean and cacheable

Every turn re-sends the whole conversation. A bloated context is an input-token tax you pay on every single message, not once. Two moves cut it:

  • Scope the session. In Claude Code, /clear between unrelated tasks and /compact when a session gets long. One job per session keeps the context small and the answers sharper.
  • Make the prefix cacheable. Prompt caching serves repeated context at roughly a tenth of the input price, but only if the prefix is byte-for-byte identical across requests. Put stable content first; put anything volatile (timestamps, request IDs, the changing question) at the very end.

Caching helps input only. It does nothing for output, so it stacks with Lever 1 and Lever 2 rather than replacing them. A single moving byte in the prefix (a datetime.now() in the system prompt, unsorted JSON keys) silently invalidates the whole cache and you pay full input price again. Confirm a cache is actually hitting by checking that response.usage.cache_read_input_tokens is non-zero on the second identical request.

Do this now: clear the context the moment you switch tasks instead of letting one session sprawl.

Lever 4

Cap the worst case

The levers above lower the average. Caps protect you from the runaway that is ten times the average. There are three, at increasing levels of bluntness:

  • max_tokens is a hard ceiling on a single response. The model is not aware of it, so hitting it truncates the answer mid-thought and you retry. It bounds the worst case but is not a tuning dial.
  • task_budget (beta) tells the model how many tokens it has for a whole agentic loop. It sees a running countdown and wraps up gracefully instead of being cut off. Minimum is 20,000 tokens. This is the right cap for multi-step agents.
  • A dollar cap is the backstop that does not care why spend happened. BudgetClaw enforces a per-project and per-branch daily cap and sends SIGTERM to the Claude Code process on breach. The spend-cap tutorial walks the setup in five minutes.

Do this now: set a hard daily cap on your current project so a stuck loop cannot run past it.

Swap myapp for your project name and the cap for your number. Use --action warn for a phone push without the kill.

Lever 5

Batch the non-interactive work

If a job does not need an answer in the next second, send it through the Batches API. It runs the exact same requests at 50 percent of standard token prices, input and output alike. Most batches finish within an hour, and the cap is 24 hours.

Good candidates: evaluation runs, bulk extraction or classification over a dataset, document summarization, and overnight backfills. This is a price cut, not a token cut, so it composes with every lever above. Batching a Haiku job with a cached prefix is three discounts stacked on the same request.

Do this now: move one non-urgent loop off messages.create and onto the batches endpoint. Same requests, half the price.

Each request takes its own custom_id and a normal message params body. Poll the returned batch until processing_status is ended, then read the results.

Lever 6

Keep the expensive model on a small context

The most capable models cost the most per token. Claude Fable 5 is $10 / $50 per million, twice Opus 4.8. The waste is not using the premium tier. The waste is feeding it a huge context when only a small slice of that context is the actual hard problem. You pay the top rate to re-read background it did not need.

The fix is an orchestrator split:

  • A cheaper large-context model (today Opus 4.8) is the orchestrator. It holds the full picture, does the cheap coordination and glue work, and decides what the hard sub-problems are.
  • The premium model is the specialist. It receives only a small, distilled, focused slice for each hard sub-problem, runs deep reasoning on that, and returns.
  • The expensive output is persisted to a durable file, so you never pay the premium rate twice to regenerate the same conclusion.

RoninForge packages this as the fable-orchestrator skill. It decomposes a large planning or research task into several focused streams, runs each on the premium model with a small context, then knits the results together in one synthesis pass, with human approval gates between phases and every expensive output saved as a research file. The orchestrator keeps the big context cheap; the specialist only ever sees what it needs.

Do this now: even without the skill, run the pattern by hand in Claude Code. Instead of dumping the whole repo into one expensive session, tell the coordinator to delegate each hard piece to a focused subagent that only sees what it needs, and to save the result:

The subagent burns its tokens on a small context, the conclusion lands in a file you can reuse, and the main thread never has to re-read the whole codebase to continue.

On Fable 5 specifically: it is not generally available to anyone at the time of writing, so you cannot run this exact split today. We describe it anyway because the pattern is model-agnostic. Whenever a premium tier and a cheaper large-context tier coexist, keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs. The shape holds no matter which model sits at the top of the price list next.

Measure it

Move a real number, not a hunch

Every lever here is testable. Change one thing, run the same work, and read the spend back. BudgetClaw attributes cost per project and per branch from the logs Claude Code already writes to disk, so you can see a model switch or an effort drop land in the numbers:

PROJECT          BRANCH                     TODAY    WEEK     MONTH
MyClient         main                       $18.40   $96.10   $402.55
RoninForge.org   main                       $6.12    $58.30   $311.08
TOTAL                                       $24.52   $154.40  $713.63

If you have not wired BudgetClaw into your status line yet, the ongoing-costs tutorial puts live per-project spend at the bottom of every Claude Code prompt. Not installed at all?

What does not help

Three things are widely believed to save money and do not. Skip them so you spend your effort on the levers that move the bill:

  • Hiding thinking with display: "omitted" is visibility only. Billed identically.
  • Streaming does not change billing at all. It only avoids HTTP timeouts on long outputs.
  • Prompt caching discounts input only. It will not lower an output-heavy bill on its own.

FAQ

Does adding "be concise" to the prompt actually lower the bill?

Yes, but modestly, and only on the visible text. On thinking-enabled models the larger cost is often the thinking the model does before it answers, which is billed at the output rate but never shown to you. Trimming the visible answer does not touch that. The bigger lever is the effort level, which controls how much the model thinks in the first place.

Does hiding thinking with display: "omitted" save money?

No. The display setting changes visibility only. Thinking happens and is billed identically whether you see a summary or an empty block. This is the most common false economy. To actually cut thinking tokens, lower the effort level or disable thinking for routine calls.

Does streaming cost more or less than a normal request?

Neither. Streamed and non-streamed responses produce identical token counts and identical billing. Streaming only changes transport: it avoids HTTP timeouts on long outputs. Choose it for reliability, not cost.

Does prompt caching reduce output tokens?

No. Caching discounts input tokens only. A cache read costs roughly a tenth of the normal input price, but output is always billed in full at the output rate. Pair caching with the output-side levers; it is not a substitute for them.

Which single lever matters most?

Right-sizing the model, then the effort level. Moving a task from Opus to Haiku is a 5x cut on every token. Dropping effort one notch can roughly halve thinking tokens on hard prompts. Measure both with BudgetClaw so you can see the number move rather than guessing.

Is Claude Fable 5 available to use right now?

Not generally, to anyone, at the time of writing. We describe the orchestrator pattern around it because the pattern is model-agnostic. Whenever a premium model tier and a cheaper large-context tier exist side by side, the right move is to keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs. Today that split is Opus 4.8 as the coordinator and a premium model as the specialist; the shape does not change when the model names do.

What about non-Claude providers?

BudgetClaw is Claude Code only. Goei applies the same budget caps and runaway-session kills across Codex CLI, Gemini CLI, and Copilot CLI, so the measurement and cap parts of this tutorial carry over to a mixed toolchain.

Source