A practical, demo-driven session showing five platform patterns that meaningfully reduce cost and improve quality for Claude-powered agents: prompt caching, tool search, programmatic tool calling, context compaction, and the advisor strategy. Each pattern includes a before/after comparison and an explanation of when to apply it.
Prompt caching reuses computed key-value representations of tokens that appear at the start of a prompt. The key to maximizing cache hit rate is structure: put stable content (system prompt, tool definitions, long documents) at the front of every request. Variable content (user messages, task-specific context) goes at the end. In live demos, Anthropic shows cache hit rates moving from 20% to 85% with prompt reordering alone.
Best for: agents with large system prompts, RAG applications with shared document context, APIs called with the same configuration repeatedly.
Instead of loading 50 tools into every context, tool search dynamically retrieves the 5-10 tools most relevant to the current task. This is implemented via a vector index of tool definitions — at inference time, the model's task description is used to retrieve the most relevant tools. Token savings are significant; quality is equal or better because the model isn't distracted by irrelevant options.
Best for: agents with large tool libraries, MCP-equipped agents, domain-specific agents that need different tools in different modes.
Not all tool calls need to go through the model. When your code knows deterministically that a tool should be called (e.g., always retrieve the current date, always fetch the user's profile), call it directly and inject the result into context. Reserve model-driven tool selection for genuinely ambiguous cases. This cuts latency and token usage for the deterministic portion of your workflow.
Compaction is periodic context compression. Rather than letting context grow unboundedly over a long session, compaction summarizes prior turns into a structured representation — preserving key decisions, artifacts, and state while discarding verbose reasoning chains. Anthropic demonstrates compaction keeping a 10-hour agent session under 50K tokens at all times, with quality metrics unchanged vs. the uncompacted baseline.
The advisor strategy uses two models: a fast, inexpensive model handles most interactions, and a more capable model acts as an advisor — reviewing and refining the fast model's output for tasks that warrant it. The trigger for escalation can be confidence-based (low-confidence outputs route to the advisor), task-based (complex reasoning tasks always use the advisor), or sampling-based (periodic advisor review as a quality gate).
Cost reduction of 60-80% vs. always using the most capable model, with quality delta of less than 5% on representative evals.
The final demo combines all five patterns in a single long-running agent. The result: a 10x cost reduction vs. the naive implementation with no measurable quality degradation. The session closes with a pattern selection guide — which patterns apply to which use cases.
"The difference between an agent that costs $0.50 per run and $5 per run is usually not the model — it's these five patterns."
"Tool search sounds like a niche optimization. At scale, it's the difference between your agent working and your agent failing with context overflow."
| Pattern | Best For | Typical Savings |
|---|---|---|
| Prompt caching | Shared system prompts, RAG | 50–80% on repeated calls |
| Tool search | Large tool libraries | 20–40% context reduction |
| Programmatic tool calling | Deterministic tool use | 10–30% latency reduction |
| Compaction | Long-running sessions | Context stays bounded |
| Advisor strategy | Cost/quality balance | 60–80% cost reduction |