# How to minimize your Claude Code and API costs

> The output rate is five times the input rate, and the tokens you never see (thinking) bill at the output rate too. This tutorial lays out six confirmed levers to cut a Claude Code or API token bill, then shows how to measure the change with BudgetClaw so you are moving a real number, not guessing.

- HTML version: https://roninforge.org/tutorials/how-to-minimize-claude-code-costs
- Tool used: BudgetClaw - https://roninforge.org/budgetclaw.md
- Audience: developers using Claude Code daily, and teams building on the Claude API
- Last updated: 2026-06-15

## Why this matters

Token cost is not symmetric, and the most expensive part is the part you cannot see. Two facts drive almost every wasted dollar:

1. **Output is the expensive side.** On Claude Opus 4.8 input is `$5` per million tokens and output is `$25`. Anything that generates fewer output tokens is worth five input tokens saved.
2. **The biggest line item is invisible.** On thinking-enabled models, the reasoning the model does before it answers is billed at the output rate. The `display` setting can hide it from you, but hiding it changes nothing about the bill. So "make the answer shorter" trims the small visible tail while the large invisible part keeps costing.

There are exactly four ways to pay less: generate fewer output tokens, reuse input tokens through caching, pay a lower rate per token, or feed less context to the most expensive model. The six levers below are concrete applications of those four. Everything here is mechanism, not folklore.

## Lever 1 - Right-size the model

The single largest lever is the model string. The Claude family spans a 5x price range. Most tasks do not need the top model, and routing them down is a flat cut on every token they touch.

    MODEL               INPUT $/1M   OUTPUT $/1M
    Claude Haiku 4.5        $1.00         $5.00
    Claude Sonnet 4.6       $3.00        $15.00
    Claude Opus 4.8         $5.00        $25.00
    Claude Fable 5         $10.00        $50.00

Use Haiku for classification, extraction, and short mechanical tasks. Use Sonnet for high-volume production work. Reserve Opus for genuinely hard reasoning where a wrong answer costs more than the token difference. On the API, set `model` per call, not once per app, so cheap calls actually run on the cheap model.

**Do this now:** start your next mechanical task on Haiku instead of inheriting your default model.

    claude --model claude-haiku-4-5

Already in a session? Type `/model` and pick the cheaper tier before the next request.

## Lever 2 - Control thinking, the invisible line item

On recent models, thinking depth is set by the `effort` parameter, and thinking tokens are billed as output. The default is `high`. Lower effort means fewer thinking tokens, less preamble, and terser tool confirmations, all of which are output you stop paying for.

    output_config: { effort: "medium" }, thinking: { type: "adaptive" }

Two things people get wrong here:

- **Hiding thinking saves nothing.** Setting `display: "omitted"` only removes the summary from your view. The model still thinks, and you still pay the output rate for it. Use `effort` to actually reduce it.
- **Turn it off for routine calls.** For classification and formatting, omit thinking entirely. Sweep `low` and `medium` on your own prompts before assuming a task needs `high`.

In Claude Code the harness runs hard coding work at high effort by design. The lever there is to not point an agentic, high-effort session at a task a single short prompt would have answered.

**Do this now:** drop `effort` one notch on your highest-volume API route, run the same workload, and compare `response.usage.output_tokens` before and after. If the answers hold up, keep the lower setting.

## Lever 3 - Keep context lean and cacheable

Every turn re-sends the whole conversation. A bloated context is an input-token tax you pay on every single message, not once. Two moves cut it:

- **Scope the session.** In Claude Code, `/clear` between unrelated tasks and `/compact` when a session gets long. One job per session keeps the context small and the answers sharper.
- **Make the prefix cacheable.** Prompt caching serves repeated context at roughly a tenth of the input price, but only if the prefix is byte-for-byte identical across requests. Put stable content first; put anything volatile (timestamps, request IDs, the changing question) at the very end.

    system: [{ type: "text", text: BIG_STABLE_CONTEXT, cache_control: { type: "ephemeral" } }]

Caching helps input only. It does nothing for output, so it stacks with Lever 1 and Lever 2 rather than replacing them. A single moving byte in the prefix (a `datetime.now()` in the system prompt, unsorted JSON keys) silently invalidates the whole cache and you pay full input price again. Confirm a cache is actually hitting by checking that `response.usage.cache_read_input_tokens` is non-zero on the second identical request.

**Do this now:** clear the context the moment you switch tasks instead of letting one session sprawl.

    /clear

## Lever 4 - Cap the worst case

The levers above lower the average. Caps protect you from the runaway that is ten times the average. There are three, at increasing levels of bluntness:

- **`max_tokens`** is a hard ceiling on a single response. The model is not aware of it, so hitting it truncates the answer mid-thought and you retry. It bounds the worst case but is not a tuning dial.
- **`task_budget`** (beta) tells the model how many tokens it has for a whole agentic loop. It sees a running countdown and wraps up gracefully instead of being cut off. Minimum is 20,000 tokens. This is the right cap for multi-step agents.
- **A dollar cap** is the backstop that does not care why spend happened. BudgetClaw enforces a per-project and per-branch daily cap and sends `SIGTERM` to the Claude Code process on breach. See the [spend-cap tutorial](https://roninforge.org/tutorials/how-to-set-a-hard-spend-cap-on-claude-code).

**Do this now:** set a hard daily cap on your current project so a stuck loop cannot run past it.

    budgetclaw limit set --project myapp --period daily --cap 5.00 --action kill

Swap `myapp` for your project name and the cap for your number. Use `--action warn` for a phone push without the kill.

## Lever 5 - Batch the non-interactive work

If a job does not need an answer in the next second, send it through the Batches API. It runs the exact same requests at **50 percent** of standard token prices, input and output alike. Most batches finish within an hour, and the cap is 24 hours.

Good candidates: evaluation runs, bulk extraction or classification over a dataset, document summarization, and overnight backfills. This is a price cut, not a token cut, so it composes with every lever above. Batching a Haiku job with a cached prefix is three discounts stacked on the same request.

**Do this now:** move one non-urgent loop off `messages.create` and onto the batches endpoint. Same requests, half the price.

    client.messages.batches.create(requests=[{ "custom_id": "1", "params": {...} }])

Each request takes its own `custom_id` and a normal message `params` body. Poll the returned batch until `processing_status` is `ended`, then read the results.

## Lever 6 - Keep the expensive model on a small context

The most capable models cost the most per token. Claude Fable 5 is `$10 / $50` per million, twice Opus 4.8. The waste is not using the premium tier. The waste is feeding it a huge context when only a small slice of that context is the actual hard problem. You pay the top rate to re-read background it did not need.

The fix is an orchestrator split:

- A **cheaper large-context model** (today Opus 4.8) is the orchestrator. It holds the full picture, does the cheap coordination and glue work, and decides what the hard sub-problems are.
- The **premium model** is the specialist. It receives only a small, distilled, focused slice for each hard sub-problem, runs deep reasoning on that, and returns.
- The expensive output is **persisted to a durable file**, so you never pay the premium rate twice to regenerate the same conclusion.

RoninForge packages this as the **fable-orchestrator** skill. It decomposes a large planning or research task into several focused streams, runs each on the premium model with a small context, then knits the results together in one synthesis pass, with human approval gates between phases and every expensive output saved as a research file. The orchestrator keeps the big context cheap; the specialist only ever sees what it needs.

**Do this now:** even without the skill, run the pattern by hand in Claude Code. Instead of dumping the whole repo into one expensive session, tell the coordinator to delegate each hard piece to a focused subagent that only sees what it needs, and to save the result:

    Spawn a subagent for the auth refactor. Give it only src/auth/ and the
    failing tests. Have it write its plan to notes/auth-plan.md, then
    summarize back in three lines.

The subagent burns its tokens on a small context, the conclusion lands in a file you can reuse, and the main thread never has to re-read the whole codebase to continue.

**On Fable 5 specifically:** it is not generally available to anyone at the time of writing, so you cannot run this exact split today. We describe it anyway because the pattern is model-agnostic. Whenever a premium tier and a cheaper large-context tier coexist, keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs. The shape holds no matter which model sits at the top of the price list next.

## Measure it

Every lever here is testable. Change one thing, run the same work, and read the spend back. BudgetClaw attributes cost per project and per branch from the logs Claude Code already writes to disk, so you can see a model switch or an effort drop land in the numbers:

    budgetclaw status

    PROJECT          BRANCH                     TODAY    WEEK     MONTH
    MyClient         main                       $18.40   $96.10   $402.55
    RoninForge.org   main                       $6.12    $58.30   $311.08
    TOTAL                                       $24.52   $154.40  $713.63

If you have not wired BudgetClaw into your status line yet, the [ongoing-costs tutorial](https://roninforge.org/tutorials/how-to-see-ongoing-project-costs) puts live per-project spend at the bottom of every Claude Code prompt. Not installed at all?

    curl -fsSL https://roninforge.org/budgetclaw/install.sh | sh

## What does not help

Three things are widely believed to save money and do not. Skip them so you spend your effort on the levers that move the bill:

- **Hiding thinking** with `display: "omitted"` is visibility only. Billed identically.
- **Streaming** does not change billing at all. It only avoids HTTP timeouts on long outputs.
- **Prompt caching** discounts input only. It will not lower an output-heavy bill on its own.

## FAQ

**Does adding "be concise" to the prompt actually lower the bill?** Yes, but modestly, and only on the visible text. On thinking-enabled models the larger cost is often the thinking the model does before it answers, which is billed at the output rate but never shown to you. The bigger lever is the `effort` level, which controls how much the model thinks in the first place.

**Does hiding thinking with `display: "omitted"` save money?** No. The `display` setting changes visibility only. Thinking happens and is billed identically whether you see a summary or an empty block. To actually cut thinking tokens, lower the `effort` level or disable thinking for routine calls.

**Does streaming cost more or less than a normal request?** Neither. Streamed and non-streamed responses produce identical token counts and identical billing. Streaming only avoids HTTP timeouts on long outputs.

**Does prompt caching reduce output tokens?** No. Caching discounts input tokens only. A cache read costs roughly a tenth of the normal input price, but output is always billed in full. Pair caching with the output-side levers.

**Which single lever matters most?** Right-sizing the model, then the `effort` level. Moving a task from Opus to Haiku is a 5x cut on every token. Dropping effort one notch can roughly halve thinking tokens on hard prompts. Measure both with BudgetClaw.

**Is Claude Fable 5 available to use right now?** Not generally, to anyone, at the time of writing. The orchestrator pattern is model-agnostic: whenever a premium tier and a cheaper large-context tier exist side by side, keep the bulk of the tokens on the cheaper tier and spend the premium tier only on small, distilled inputs.

**What about non-Claude providers?** BudgetClaw is Claude Code only. Goei applies the same budget caps and runaway-session kills across Codex CLI, Gemini CLI, and Copilot CLI.

## Source

- Anthropic pricing: https://platform.claude.com/docs/en/about-claude/pricing
- Prompt caching docs: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- BudgetClaw: https://github.com/RoninForge/budgetclaw
- This tutorial: https://roninforge.org/tutorials/how-to-minimize-claude-code-costs
