Run without an API key

Most AI dev tools require you to provision a new API key, get budget approval, and renew the key forever. Cascade has two paths that skip that entirely.

Option 1: Claude Code SDK

If you already have Claude Code installed, Cascade can use it as the LLM transport. Your existing Claude subscription pays for the calls. No additional API key. No new vendor.

Setup

# Step 1: install Claude Code if you have not already
# Follow the official Claude Code installation guide first.

# Step 2: install the Cascade extra for Claude Code support
pip install "cascade-agent[claude-code]"

# Step 3: configure Cascade to use it
cascade configure llm claude_code --set-default

That is it. No API key flag, no token, no provider account.

Verify

cascade configure show

You should see:

LLM providers:
  claude_code: key=(not set) model=(default) base_url=(default)

The (not set) for the key is expected. Claude Code handles auth internally.

Then run the end-to-end smoke test:

cascade try

If it passes, your Claude Code transport is working.

Caveats

Slightly less reliable structured output than the direct Anthropic API. Cascade asks Claude to return JSON in a code fence and parses it; the direct API uses native tool use, which is stricter. Most of the time this difference does not matter. For very long or complex schemas you may see occasional parse errors. Cascade retries automatically.
Subject to your Claude Code subscription’s usage limits. If you hit your quota, Cascade calls will fail until your quota resets.
No streaming. The Claude Code SDK does not currently expose streaming token output, so the per-stage progress spinners spin for longer than they would with the direct API.

When this path is right

You already pay for Claude (Pro or Team) and want to amortize that cost
Your company has approved Claude but blocks other AI APIs
You want to onboard a teammate without provisioning a new vendor
You are evaluating Cascade and do not want to set up billing yet

Option 2: Ollama or vLLM (fully local)

If you want zero external dependencies (no SaaS API, no subscription, no per-token cost), run an LLM locally. Cascade talks to local servers via their OpenAI-compatible API.

Setup with Ollama

# Step 1: install Ollama from https://ollama.ai

# Step 2: start the Ollama server (usually starts automatically on install)
ollama serve

# Step 3: pull a model
ollama pull llama3.3:70b

# Step 4: configure Cascade
cascade configure llm ollama --model llama3.3:70b --set-default

Setup with vLLM, LM Studio, or other OpenAI-compatible local servers

Any local LLM server that speaks OpenAI’s chat completion API will work. Point Cascade at the right URL:

cascade configure llm ollama \
  --model your-served-model-name \
  --base-url http://localhost:8000/v1 \
  --set-default

# LM Studio at http://localhost:1234/v1
cascade configure llm ollama \
  --model your-loaded-model \
  --base-url http://localhost:1234/v1 \
  --set-default

# Remote GPU box on your network
cascade configure llm ollama \
  --model llama3.3:70b \
  --base-url http://gpu-host:11434/v1 \
  --set-default

The ollama provider in Cascade is just an OpenAI-compatible client with sensible Ollama defaults. The name is historical; it works with anything that speaks the OpenAI chat API.

Model size guide

Cascade asks the LLM to produce structured output (JSON matching a strict Pydantic schema) for the plan and code stages. This is hard for small models. A rough quality / hardware guide:

Model class	Hardware needed	Cascade reliability	Notes
1B-3B (tiny)	CPU only	Poor	Cannot produce reliable structured output. Avoid.
7B-8B	16 GB RAM CPU, or 8 GB VRAM	Usable for simple prompts	Works for `cascade try` and small `cascade prompt` calls. Struggles on multi-file plans.
13B-30B	32 GB RAM CPU, or 16-24 GB VRAM	Good	Reliable for most stories. Slow on CPU; comfortable on a single 24GB GPU.
70B+	64+ GB RAM CPU, or 48 GB+ VRAM (or quantized to fit on 24)	Production-grade	Comparable to mid-tier SaaS models. Recommended if you have the hardware.

Models with strong JSON-mode support are noticeably more reliable. As of this writing, Llama 3.3 (70B), Qwen 2.5 (32B+), and DeepSeek-Coder-V2 are the best open choices for code work.

Performance expectations

A single Cascade story makes two LLM calls (plan, then code), each typically 2,000 to 5,000 tokens out. On consumer hardware:

CPU-only, 7B model: 60-180 seconds per call. Expect 3-6 minutes per story.
Single 24GB GPU, 30B model: 10-30 seconds per call. Expect 30-90 seconds per story.
Dual GPU or H100, 70B model: 5-15 seconds per call. Comparable to SaaS APIs.

For comparison, an Anthropic Claude Opus call typically completes in 5-15 seconds.

Caveats

Quality scales with model size. Small local models (under 7B params) produce code Cascade cannot reliably ship. Use the largest model your hardware can run.
Structured output reliability varies by model. Models trained with strong JSON-mode support (Llama 3.1 8B+, Llama 3.3, Qwen 2.5 7B+, DeepSeek Coder) work well. Older or smaller models may fail Cascade’s schema validation.
Speed. Slower than the SaaS API providers, especially on CPU. Plan for 2-10x the latency of an Anthropic or OpenAI call.
No automatic model updates. You are responsible for pulling new model versions as they release.

When this path is right

Your company prohibits sending source code to any SaaS API
You operate in an air-gapped or restricted-network environment
You are doing high-volume runs and the per-token cost of SaaS APIs adds up
You want full reproducibility (same model weights, same output forever)
You are experimenting with open models and want to benchmark them on real code work

Choosing between them

Situation	Best choice
You have Claude Code installed; you are an individual developer	Claude Code SDK
Your company blocks SaaS APIs but you have GPUs in-house	Ollama or vLLM with a 70B-class model
You want maximum reliability and do not mind paying	Direct Anthropic / OpenAI / Google API
You are evaluating Cascade with no commitment	Claude Code SDK first; fall back to Ollama if needed
You need strict data residency (e.g., EU-only)	Local Ollama; verify your model weights are stored locally

Switching providers per call

Even with a default set, you can override per command:

cascade prompt "Add health endpoint" --model claude-opus-4-7

Or change the default any time:

cascade configure llm anthropic --key sk-ant-xxx --set-default

The previous configuration stays in ~/.config/cascade/config.yaml. Switching back later is one command, no re-config.

What is next

LLM providers reference for full details on every provider
Security model for what your code provider can and cannot see
The pipeline for what each LLM call actually does