Run without an API key
Most AI dev tools require you to provision a new API key, get budget approval, and renew the key forever. Cascade has two paths that skip that entirely.
Option 1: Claude Code SDK
Section titled “Option 1: Claude Code SDK”If you already have Claude Code installed, Cascade can use it as the LLM transport. Your existing Claude subscription pays for the calls. No additional API key. No new vendor.
# Step 1: install Claude Code if you have not already# Follow the official Claude Code installation guide first.
# Step 2: install the Cascade extra for Claude Code supportpip install "cascade-agent[claude-code]"
# Step 3: configure Cascade to use itcascade configure llm claude_code --set-defaultThat is it. No API key flag, no token, no provider account.
Verify
Section titled “Verify”cascade configure showYou should see:
LLM providers: claude_code: key=(not set) model=(default) base_url=(default)The (not set) for the key is expected. Claude Code handles auth internally.
Then run the end-to-end smoke test:
cascade tryIf it passes, your Claude Code transport is working.
Caveats
Section titled “Caveats”- Slightly less reliable structured output than the direct Anthropic API. Cascade asks Claude to return JSON in a code fence and parses it; the direct API uses native tool use, which is stricter. Most of the time this difference does not matter. For very long or complex schemas you may see occasional parse errors. Cascade retries automatically.
- Subject to your Claude Code subscription’s usage limits. If you hit your quota, Cascade calls will fail until your quota resets.
- No streaming. The Claude Code SDK does not currently expose streaming token output, so the per-stage progress spinners spin for longer than they would with the direct API.
When this path is right
Section titled “When this path is right”- You already pay for Claude (Pro or Team) and want to amortize that cost
- Your company has approved Claude but blocks other AI APIs
- You want to onboard a teammate without provisioning a new vendor
- You are evaluating Cascade and do not want to set up billing yet
Option 2: Ollama or vLLM (fully local)
Section titled “Option 2: Ollama or vLLM (fully local)”If you want zero external dependencies (no SaaS API, no subscription, no per-token cost), run an LLM locally. Cascade talks to local servers via their OpenAI-compatible API.
Setup with Ollama
Section titled “Setup with Ollama”# Step 1: install Ollama from https://ollama.ai
# Step 2: start the Ollama server (usually starts automatically on install)ollama serve
# Step 3: pull a modelollama pull llama3.3:70b
# Step 4: configure Cascadecascade configure llm ollama --model llama3.3:70b --set-defaultSetup with vLLM, LM Studio, or other OpenAI-compatible local servers
Section titled “Setup with vLLM, LM Studio, or other OpenAI-compatible local servers”Any local LLM server that speaks OpenAI’s chat completion API will work. Point Cascade at the right URL:
cascade configure llm ollama \ --model your-served-model-name \ --base-url http://localhost:8000/v1 \ --set-default
# LM Studio at http://localhost:1234/v1cascade configure llm ollama \ --model your-loaded-model \ --base-url http://localhost:1234/v1 \ --set-default
# Remote GPU box on your networkcascade configure llm ollama \ --model llama3.3:70b \ --base-url http://gpu-host:11434/v1 \ --set-defaultThe ollama provider in Cascade is just an OpenAI-compatible client with sensible Ollama defaults. The name is historical; it works with anything that speaks the OpenAI chat API.
Model size guide
Section titled “Model size guide”Cascade asks the LLM to produce structured output (JSON matching a strict Pydantic schema) for the plan and code stages. This is hard for small models. A rough quality / hardware guide:
| Model class | Hardware needed | Cascade reliability | Notes |
|---|---|---|---|
| 1B-3B (tiny) | CPU only | Poor | Cannot produce reliable structured output. Avoid. |
| 7B-8B | 16 GB RAM CPU, or 8 GB VRAM | Usable for simple prompts | Works for cascade try and small cascade prompt calls. Struggles on multi-file plans. |
| 13B-30B | 32 GB RAM CPU, or 16-24 GB VRAM | Good | Reliable for most stories. Slow on CPU; comfortable on a single 24GB GPU. |
| 70B+ | 64+ GB RAM CPU, or 48 GB+ VRAM (or quantized to fit on 24) | Production-grade | Comparable to mid-tier SaaS models. Recommended if you have the hardware. |
Models with strong JSON-mode support are noticeably more reliable. As of this writing, Llama 3.3 (70B), Qwen 2.5 (32B+), and DeepSeek-Coder-V2 are the best open choices for code work.
Performance expectations
Section titled “Performance expectations”A single Cascade story makes two LLM calls (plan, then code), each typically 2,000 to 5,000 tokens out. On consumer hardware:
- CPU-only, 7B model: 60-180 seconds per call. Expect 3-6 minutes per story.
- Single 24GB GPU, 30B model: 10-30 seconds per call. Expect 30-90 seconds per story.
- Dual GPU or H100, 70B model: 5-15 seconds per call. Comparable to SaaS APIs.
For comparison, an Anthropic Claude Opus call typically completes in 5-15 seconds.
Caveats
Section titled “Caveats”- Quality scales with model size. Small local models (under 7B params) produce code Cascade cannot reliably ship. Use the largest model your hardware can run.
- Structured output reliability varies by model. Models trained with strong JSON-mode support (Llama 3.1 8B+, Llama 3.3, Qwen 2.5 7B+, DeepSeek Coder) work well. Older or smaller models may fail Cascade’s schema validation.
- Speed. Slower than the SaaS API providers, especially on CPU. Plan for 2-10x the latency of an Anthropic or OpenAI call.
- No automatic model updates. You are responsible for pulling new model versions as they release.
When this path is right
Section titled “When this path is right”- Your company prohibits sending source code to any SaaS API
- You operate in an air-gapped or restricted-network environment
- You are doing high-volume runs and the per-token cost of SaaS APIs adds up
- You want full reproducibility (same model weights, same output forever)
- You are experimenting with open models and want to benchmark them on real code work
Choosing between them
Section titled “Choosing between them”| Situation | Best choice |
|---|---|
| You have Claude Code installed; you are an individual developer | Claude Code SDK |
| Your company blocks SaaS APIs but you have GPUs in-house | Ollama or vLLM with a 70B-class model |
| You want maximum reliability and do not mind paying | Direct Anthropic / OpenAI / Google API |
| You are evaluating Cascade with no commitment | Claude Code SDK first; fall back to Ollama if needed |
| You need strict data residency (e.g., EU-only) | Local Ollama; verify your model weights are stored locally |
Switching providers per call
Section titled “Switching providers per call”Even with a default set, you can override per command:
cascade prompt "Add health endpoint" --model claude-opus-4-7Or change the default any time:
cascade configure llm anthropic --key sk-ant-xxx --set-defaultThe previous configuration stays in ~/.config/cascade/config.yaml. Switching back later is one command, no re-config.
What is next
Section titled “What is next”- LLM providers reference for full details on every provider
- Security model for what your code provider can and cannot see
- The pipeline for what each LLM call actually does