LLM providers

Cascade abstracts the LLM behind a single interface, so swapping providers is one config command. Each provider has different strengths and a different setup story.

At a glance

Provider	Default model	Auth	Best for	Cost per story (typical)
Anthropic	`claude-opus-4-7`	API key	Maximum reliability of structured output	$0.08 - $0.25
OpenAI	`gpt-4o`	API key	Teams already on OpenAI; Azure OpenAI	$0.05 - $0.20
Google Gemini	`gemini-2.0-flash`	API key	Cost-sensitive use cases	$0.01 - $0.05
Claude Code SDK	(uses your subscription)	None	Zero-setup adoption	(subscription)
Ollama / vLLM	`llama3.3:70b`	None	Air-gapped or local-only	(hardware)

Cost figures assume a typical multi-file story (~6,000 in / 1,500 out tokens). Your costs vary with story complexity and team-memory size.

Anthropic Claude

The default and the most reliable. Uses Anthropic’s native tool-use feature for structured output, which means the model returns JSON conforming to a schema with very high reliability (we rarely see parse failures).

cascade configure llm anthropic --key sk-ant-xxx --set-default

Setting	Value
Default model	`claude-opus-4-7`
Auth	API key (header)
Env var	`ANTHROPIC_API_KEY`
Structured output	Native tool use (very reliable)
Streaming	Supported
Best for	Production use, complex multi-file changes, anything where reliability matters

Choosing a Claude model

Model	Cost	When to use
`claude-opus-4-7`	Highest	Default. Best code quality.
`claude-sonnet-4-6`	Mid	When Opus is overkill. Smaller stories, refactors.
`claude-haiku-4-5-20251001`	Lowest	High-volume runs, experimentation. May struggle on complex stories.

Switch with --model:

cascade prompt "Refactor User class" --model claude-haiku-4-5-20251001

Or persistently:

cascade configure llm anthropic --model claude-sonnet-4-6

OpenAI

Uses OpenAI’s Structured Outputs (response_format with json_schema). Reliable for code generation. Works with any OpenAI-compatible endpoint (including Azure OpenAI deployments) via --base-url.

# OpenAI's own API
cascade configure llm openai --key sk-xxx --set-default

# Azure OpenAI
cascade configure llm openai \
  --key your-key \
  --base-url https://your-resource.openai.azure.com/openai/deployments/your-deployment

# OpenRouter (gateway to many models)
cascade configure llm openai \
  --key sk-or-xxx \
  --base-url https://openrouter.ai/api/v1

Setting	Value
Default model	`gpt-4o`
Auth	API key (header)
Env var	`OPENAI_API_KEY`
Structured output	Native Structured Outputs (reliable)
Streaming	Supported
Best for	Teams already on OpenAI; Azure deployments; OpenRouter as a multi-model gateway

Choosing an OpenAI model

Model	Cost	When to use
`gpt-4o`	Mid	Default. Good balance of quality and cost.
`o1-preview`	Highest	Hard reasoning-heavy stories. Slower.
`gpt-4o-mini`	Lowest	High-volume runs. Lower code quality on complex changes.

Google Gemini

Uses Gemini’s response_schema parameter with Pydantic models for structured output. Cheapest mainstream option per token; quality has caught up to GPT-4o for most code tasks.

cascade configure llm google --key gemini-xxx --set-default

Setting	Value
Default model	`gemini-2.0-flash`
Auth	API key (header)
Env vars	`GOOGLE_API_KEY` or `GEMINI_API_KEY`
Structured output	Native response_schema (reliable)
Streaming	Supported
Best for	Cost-sensitive teams; Google Cloud customers; high-volume workflows

Choosing a Gemini model

Model	Cost	When to use
`gemini-2.0-flash`	Low	Default. Fast and cheap. Good for most stories.
`gemini-2.0-pro`	Mid	When Flash fails on complex schemas.
`gemini-1.5-flash-8b`	Very low	Experimentation, throwaway runs.

Claude Code SDK

Uses your local Claude Code installation. No separate API key, no per-call billing. Your Claude subscription handles auth and quota.

pip install "cascade-agent[claude-code]"
cascade configure llm claude_code --set-default

Setting	Value
Default model	Whatever Claude Code is set to (typically Opus)
Auth	Your Claude Code installation
Env var	None
Structured output	Code-fence JSON parsing (slightly less reliable than direct API)
Streaming	Not exposed by the SDK
Best for	Zero-setup adoption; teams with existing Claude subscriptions; SaaS-restricted environments where Claude is approved

Full guide: run without an API key

Ollama / vLLM (local)

Any OpenAI-compatible local LLM server. Cascade ships with sensible Ollama defaults but the same code path works with vLLM, LM Studio, or any other local server speaking the OpenAI chat API.

# Ollama (default localhost:11434)
cascade configure llm ollama --model llama3.3:70b --set-default

# vLLM or remote server
cascade configure llm ollama \
  --model your-model \
  --base-url http://gpu-host:8000/v1 \
  --set-default

Setting	Value
Default model	`llama3.3:70b`
Default base URL	`http://localhost:11434/v1`
Auth	None
Env var	None
Structured output	Model-dependent. Use 70B+ for reliability.
Streaming	Supported
Best for	Air-gapped environments; data residency requirements; high-volume runs where per-token cost matters

See run without an API key for model-size guidance and hardware requirements.

Switching providers

Per command

cascade prompt "Add health endpoint" --model claude-opus-4-7

The --model flag picks the model from your already-configured provider. To switch providers per-call, set the provider in cascade.yaml for the repo.

Per project (via cascade.yaml)

agent:
  provider: openai
  model: gpt-4o

Globally (default)

cascade configure llm anthropic --set-default

Rate limits and retries

All providers can rate-limit you. Cascade handles this with exponential backoff:

Initial retry delay: 1 second
Max retries: 5
Backoff: 1s, 2s, 4s, 8s, 16s, then fail

If the LLM provider returns a 429 (rate limit), Cascade retries automatically. If it returns 500-class errors persistently, Cascade fails the stage and surfaces the upstream error message.

To work below rate limits in scripts, pace your calls:

for ticket in PROJ-101 PROJ-102 PROJ-103; do
  cascade ticket "jira:$ticket"
  sleep 10
done

Cost visibility

Every Cascade run prints the cost of its LLM calls:

  cost: $0.08 (4,820 in / 1,330 out tokens, anthropic/claude-opus-4-7)

For multi-story builds:

============================================================
  session: 4 stories built, 8 LLM calls, $0.42
============================================================

Set a hard ceiling with --max-cost:

cascade build stories/sprint.yaml --max-cost 5.00

Aborts between stories if cumulative cost would exceed the threshold. No effect on free providers (Claude Code SDK, Ollama).

Coming later

GitHub Copilot CLI as a transport (best-effort; Copilot’s API is not structured-output friendly today)
Per-stage routing: planner on Gemini Flash for cost, coder on Claude Opus for quality
Cost budgets per project in cascade.yaml

What is next

Run without an API key for Claude Code SDK and Ollama setup details
VCS providers and issue trackers for the other vendor integrations
Security model for what each provider sees