LLM providers
Cascade abstracts the LLM behind a single interface, so swapping providers is one config command. Each provider has different strengths and a different setup story.
At a glance
Section titled “At a glance”| Provider | Default model | Auth | Best for | Cost per story (typical) |
|---|---|---|---|---|
| Anthropic | claude-opus-4-7 | API key | Maximum reliability of structured output | $0.08 - $0.25 |
| OpenAI | gpt-4o | API key | Teams already on OpenAI; Azure OpenAI | $0.05 - $0.20 |
| Google Gemini | gemini-2.0-flash | API key | Cost-sensitive use cases | $0.01 - $0.05 |
| Claude Code SDK | (uses your subscription) | None | Zero-setup adoption | (subscription) |
| Ollama / vLLM | llama3.3:70b | None | Air-gapped or local-only | (hardware) |
Cost figures assume a typical multi-file story (~6,000 in / 1,500 out tokens). Your costs vary with story complexity and team-memory size.
Anthropic Claude
Section titled “Anthropic Claude”The default and the most reliable. Uses Anthropic’s native tool-use feature for structured output, which means the model returns JSON conforming to a schema with very high reliability (we rarely see parse failures).
cascade configure llm anthropic --key sk-ant-xxx --set-default| Setting | Value |
|---|---|
| Default model | claude-opus-4-7 |
| Auth | API key (header) |
| Env var | ANTHROPIC_API_KEY |
| Structured output | Native tool use (very reliable) |
| Streaming | Supported |
| Best for | Production use, complex multi-file changes, anything where reliability matters |
Choosing a Claude model
Section titled “Choosing a Claude model”| Model | Cost | When to use |
|---|---|---|
claude-opus-4-7 | Highest | Default. Best code quality. |
claude-sonnet-4-6 | Mid | When Opus is overkill. Smaller stories, refactors. |
claude-haiku-4-5-20251001 | Lowest | High-volume runs, experimentation. May struggle on complex stories. |
Switch with --model:
cascade prompt "Refactor User class" --model claude-haiku-4-5-20251001Or persistently:
cascade configure llm anthropic --model claude-sonnet-4-6OpenAI
Section titled “OpenAI”Uses OpenAI’s Structured Outputs (response_format with json_schema). Reliable for code generation. Works with any OpenAI-compatible endpoint (including Azure OpenAI deployments) via --base-url.
# OpenAI's own APIcascade configure llm openai --key sk-xxx --set-default
# Azure OpenAIcascade configure llm openai \ --key your-key \ --base-url https://your-resource.openai.azure.com/openai/deployments/your-deployment
# OpenRouter (gateway to many models)cascade configure llm openai \ --key sk-or-xxx \ --base-url https://openrouter.ai/api/v1| Setting | Value |
|---|---|
| Default model | gpt-4o |
| Auth | API key (header) |
| Env var | OPENAI_API_KEY |
| Structured output | Native Structured Outputs (reliable) |
| Streaming | Supported |
| Best for | Teams already on OpenAI; Azure deployments; OpenRouter as a multi-model gateway |
Choosing an OpenAI model
Section titled “Choosing an OpenAI model”| Model | Cost | When to use |
|---|---|---|
gpt-4o | Mid | Default. Good balance of quality and cost. |
o1-preview | Highest | Hard reasoning-heavy stories. Slower. |
gpt-4o-mini | Lowest | High-volume runs. Lower code quality on complex changes. |
Google Gemini
Section titled “Google Gemini”Uses Gemini’s response_schema parameter with Pydantic models for structured output. Cheapest mainstream option per token; quality has caught up to GPT-4o for most code tasks.
cascade configure llm google --key gemini-xxx --set-default| Setting | Value |
|---|---|
| Default model | gemini-2.0-flash |
| Auth | API key (header) |
| Env vars | GOOGLE_API_KEY or GEMINI_API_KEY |
| Structured output | Native response_schema (reliable) |
| Streaming | Supported |
| Best for | Cost-sensitive teams; Google Cloud customers; high-volume workflows |
Choosing a Gemini model
Section titled “Choosing a Gemini model”| Model | Cost | When to use |
|---|---|---|
gemini-2.0-flash | Low | Default. Fast and cheap. Good for most stories. |
gemini-2.0-pro | Mid | When Flash fails on complex schemas. |
gemini-1.5-flash-8b | Very low | Experimentation, throwaway runs. |
Claude Code SDK
Section titled “Claude Code SDK”Uses your local Claude Code installation. No separate API key, no per-call billing. Your Claude subscription handles auth and quota.
pip install "cascade-agent[claude-code]"cascade configure llm claude_code --set-default| Setting | Value |
|---|---|
| Default model | Whatever Claude Code is set to (typically Opus) |
| Auth | Your Claude Code installation |
| Env var | None |
| Structured output | Code-fence JSON parsing (slightly less reliable than direct API) |
| Streaming | Not exposed by the SDK |
| Best for | Zero-setup adoption; teams with existing Claude subscriptions; SaaS-restricted environments where Claude is approved |
Full guide: run without an API key
Ollama / vLLM (local)
Section titled “Ollama / vLLM (local)”Any OpenAI-compatible local LLM server. Cascade ships with sensible Ollama defaults but the same code path works with vLLM, LM Studio, or any other local server speaking the OpenAI chat API.
# Ollama (default localhost:11434)cascade configure llm ollama --model llama3.3:70b --set-default
# vLLM or remote servercascade configure llm ollama \ --model your-model \ --base-url http://gpu-host:8000/v1 \ --set-default| Setting | Value |
|---|---|
| Default model | llama3.3:70b |
| Default base URL | http://localhost:11434/v1 |
| Auth | None |
| Env var | None |
| Structured output | Model-dependent. Use 70B+ for reliability. |
| Streaming | Supported |
| Best for | Air-gapped environments; data residency requirements; high-volume runs where per-token cost matters |
See run without an API key for model-size guidance and hardware requirements.
Switching providers
Section titled “Switching providers”Per command
Section titled “Per command”cascade prompt "Add health endpoint" --model claude-opus-4-7The --model flag picks the model from your already-configured provider. To switch providers per-call, set the provider in cascade.yaml for the repo.
Per project (via cascade.yaml)
Section titled “Per project (via cascade.yaml)”agent: provider: openai model: gpt-4oGlobally (default)
Section titled “Globally (default)”cascade configure llm anthropic --set-defaultRate limits and retries
Section titled “Rate limits and retries”All providers can rate-limit you. Cascade handles this with exponential backoff:
- Initial retry delay: 1 second
- Max retries: 5
- Backoff: 1s, 2s, 4s, 8s, 16s, then fail
If the LLM provider returns a 429 (rate limit), Cascade retries automatically. If it returns 500-class errors persistently, Cascade fails the stage and surfaces the upstream error message.
To work below rate limits in scripts, pace your calls:
for ticket in PROJ-101 PROJ-102 PROJ-103; do cascade ticket "jira:$ticket" sleep 10doneCost visibility
Section titled “Cost visibility”Every Cascade run prints the cost of its LLM calls:
cost: $0.08 (4,820 in / 1,330 out tokens, anthropic/claude-opus-4-7)For multi-story builds:
============================================================ session: 4 stories built, 8 LLM calls, $0.42============================================================Set a hard ceiling with --max-cost:
cascade build stories/sprint.yaml --max-cost 5.00Aborts between stories if cumulative cost would exceed the threshold. No effect on free providers (Claude Code SDK, Ollama).
Coming later
Section titled “Coming later”- GitHub Copilot CLI as a transport (best-effort; Copilot’s API is not structured-output friendly today)
- Per-stage routing: planner on Gemini Flash for cost, coder on Claude Opus for quality
- Cost budgets per project in
cascade.yaml
What is next
Section titled “What is next”- Run without an API key for Claude Code SDK and Ollama setup details
- VCS providers and issue trackers for the other vendor integrations
- Security model for what each provider sees