Skip to content

LLM providers

Cascade abstracts the LLM behind a single interface, so swapping providers is one config command. Each provider has different strengths and a different setup story.

ProviderDefault modelAuthBest forCost per story (typical)
Anthropicclaude-opus-4-7API keyMaximum reliability of structured output$0.08 - $0.25
OpenAIgpt-4oAPI keyTeams already on OpenAI; Azure OpenAI$0.05 - $0.20
Google Geminigemini-2.0-flashAPI keyCost-sensitive use cases$0.01 - $0.05
Claude Code SDK(uses your subscription)NoneZero-setup adoption(subscription)
Ollama / vLLMllama3.3:70bNoneAir-gapped or local-only(hardware)

Cost figures assume a typical multi-file story (~6,000 in / 1,500 out tokens). Your costs vary with story complexity and team-memory size.

The default and the most reliable. Uses Anthropic’s native tool-use feature for structured output, which means the model returns JSON conforming to a schema with very high reliability (we rarely see parse failures).

Terminal window
cascade configure llm anthropic --key sk-ant-xxx --set-default
SettingValue
Default modelclaude-opus-4-7
AuthAPI key (header)
Env varANTHROPIC_API_KEY
Structured outputNative tool use (very reliable)
StreamingSupported
Best forProduction use, complex multi-file changes, anything where reliability matters
ModelCostWhen to use
claude-opus-4-7HighestDefault. Best code quality.
claude-sonnet-4-6MidWhen Opus is overkill. Smaller stories, refactors.
claude-haiku-4-5-20251001LowestHigh-volume runs, experimentation. May struggle on complex stories.

Switch with --model:

Terminal window
cascade prompt "Refactor User class" --model claude-haiku-4-5-20251001

Or persistently:

Terminal window
cascade configure llm anthropic --model claude-sonnet-4-6

Uses OpenAI’s Structured Outputs (response_format with json_schema). Reliable for code generation. Works with any OpenAI-compatible endpoint (including Azure OpenAI deployments) via --base-url.

Terminal window
# OpenAI's own API
cascade configure llm openai --key sk-xxx --set-default
# Azure OpenAI
cascade configure llm openai \
--key your-key \
--base-url https://your-resource.openai.azure.com/openai/deployments/your-deployment
# OpenRouter (gateway to many models)
cascade configure llm openai \
--key sk-or-xxx \
--base-url https://openrouter.ai/api/v1
SettingValue
Default modelgpt-4o
AuthAPI key (header)
Env varOPENAI_API_KEY
Structured outputNative Structured Outputs (reliable)
StreamingSupported
Best forTeams already on OpenAI; Azure deployments; OpenRouter as a multi-model gateway
ModelCostWhen to use
gpt-4oMidDefault. Good balance of quality and cost.
o1-previewHighestHard reasoning-heavy stories. Slower.
gpt-4o-miniLowestHigh-volume runs. Lower code quality on complex changes.

Uses Gemini’s response_schema parameter with Pydantic models for structured output. Cheapest mainstream option per token; quality has caught up to GPT-4o for most code tasks.

Terminal window
cascade configure llm google --key gemini-xxx --set-default
SettingValue
Default modelgemini-2.0-flash
AuthAPI key (header)
Env varsGOOGLE_API_KEY or GEMINI_API_KEY
Structured outputNative response_schema (reliable)
StreamingSupported
Best forCost-sensitive teams; Google Cloud customers; high-volume workflows
ModelCostWhen to use
gemini-2.0-flashLowDefault. Fast and cheap. Good for most stories.
gemini-2.0-proMidWhen Flash fails on complex schemas.
gemini-1.5-flash-8bVery lowExperimentation, throwaway runs.

Uses your local Claude Code installation. No separate API key, no per-call billing. Your Claude subscription handles auth and quota.

Terminal window
pip install "cascade-agent[claude-code]"
cascade configure llm claude_code --set-default
SettingValue
Default modelWhatever Claude Code is set to (typically Opus)
AuthYour Claude Code installation
Env varNone
Structured outputCode-fence JSON parsing (slightly less reliable than direct API)
StreamingNot exposed by the SDK
Best forZero-setup adoption; teams with existing Claude subscriptions; SaaS-restricted environments where Claude is approved

Full guide: run without an API key

Any OpenAI-compatible local LLM server. Cascade ships with sensible Ollama defaults but the same code path works with vLLM, LM Studio, or any other local server speaking the OpenAI chat API.

Terminal window
# Ollama (default localhost:11434)
cascade configure llm ollama --model llama3.3:70b --set-default
# vLLM or remote server
cascade configure llm ollama \
--model your-model \
--base-url http://gpu-host:8000/v1 \
--set-default
SettingValue
Default modelllama3.3:70b
Default base URLhttp://localhost:11434/v1
AuthNone
Env varNone
Structured outputModel-dependent. Use 70B+ for reliability.
StreamingSupported
Best forAir-gapped environments; data residency requirements; high-volume runs where per-token cost matters

See run without an API key for model-size guidance and hardware requirements.

Terminal window
cascade prompt "Add health endpoint" --model claude-opus-4-7

The --model flag picks the model from your already-configured provider. To switch providers per-call, set the provider in cascade.yaml for the repo.

agent:
provider: openai
model: gpt-4o
Terminal window
cascade configure llm anthropic --set-default

All providers can rate-limit you. Cascade handles this with exponential backoff:

  • Initial retry delay: 1 second
  • Max retries: 5
  • Backoff: 1s, 2s, 4s, 8s, 16s, then fail

If the LLM provider returns a 429 (rate limit), Cascade retries automatically. If it returns 500-class errors persistently, Cascade fails the stage and surfaces the upstream error message.

To work below rate limits in scripts, pace your calls:

Terminal window
for ticket in PROJ-101 PROJ-102 PROJ-103; do
cascade ticket "jira:$ticket"
sleep 10
done

Every Cascade run prints the cost of its LLM calls:

cost: $0.08 (4,820 in / 1,330 out tokens, anthropic/claude-opus-4-7)

For multi-story builds:

============================================================
session: 4 stories built, 8 LLM calls, $0.42
============================================================

Set a hard ceiling with --max-cost:

Terminal window
cascade build stories/sprint.yaml --max-cost 5.00

Aborts between stories if cumulative cost would exceed the threshold. No effect on free providers (Claude Code SDK, Ollama).

  • GitHub Copilot CLI as a transport (best-effort; Copilot’s API is not structured-output friendly today)
  • Per-stage routing: planner on Gemini Flash for cost, coder on Claude Opus for quality
  • Cost budgets per project in cascade.yaml