Skip to content

Run without an API key

Most AI dev tools require you to provision a new API key, get budget approval, and renew the key forever. Cascade has two paths that skip that entirely.

If you already have Claude Code installed, Cascade can use it as the LLM transport. Your existing Claude subscription pays for the calls. No additional API key. No new vendor.

Terminal window
# Step 1: install Claude Code if you have not already
# Follow the official Claude Code installation guide first.
# Step 2: install the Cascade extra for Claude Code support
pip install "cascade-agent[claude-code]"
# Step 3: configure Cascade to use it
cascade configure llm claude_code --set-default

That is it. No API key flag, no token, no provider account.

Terminal window
cascade configure show

You should see:

LLM providers:
claude_code: key=(not set) model=(default) base_url=(default)

The (not set) for the key is expected. Claude Code handles auth internally.

Then run the end-to-end smoke test:

Terminal window
cascade try

If it passes, your Claude Code transport is working.

  • Slightly less reliable structured output than the direct Anthropic API. Cascade asks Claude to return JSON in a code fence and parses it; the direct API uses native tool use, which is stricter. Most of the time this difference does not matter. For very long or complex schemas you may see occasional parse errors. Cascade retries automatically.
  • Subject to your Claude Code subscription’s usage limits. If you hit your quota, Cascade calls will fail until your quota resets.
  • No streaming. The Claude Code SDK does not currently expose streaming token output, so the per-stage progress spinners spin for longer than they would with the direct API.
  • You already pay for Claude (Pro or Team) and want to amortize that cost
  • Your company has approved Claude but blocks other AI APIs
  • You want to onboard a teammate without provisioning a new vendor
  • You are evaluating Cascade and do not want to set up billing yet

If you want zero external dependencies (no SaaS API, no subscription, no per-token cost), run an LLM locally. Cascade talks to local servers via their OpenAI-compatible API.

Terminal window
# Step 1: install Ollama from https://ollama.ai
# Step 2: start the Ollama server (usually starts automatically on install)
ollama serve
# Step 3: pull a model
ollama pull llama3.3:70b
# Step 4: configure Cascade
cascade configure llm ollama --model llama3.3:70b --set-default

Setup with vLLM, LM Studio, or other OpenAI-compatible local servers

Section titled “Setup with vLLM, LM Studio, or other OpenAI-compatible local servers”

Any local LLM server that speaks OpenAI’s chat completion API will work. Point Cascade at the right URL:

8000/v1
cascade configure llm ollama \
--model your-served-model-name \
--base-url http://localhost:8000/v1 \
--set-default
# LM Studio at http://localhost:1234/v1
cascade configure llm ollama \
--model your-loaded-model \
--base-url http://localhost:1234/v1 \
--set-default
# Remote GPU box on your network
cascade configure llm ollama \
--model llama3.3:70b \
--base-url http://gpu-host:11434/v1 \
--set-default

The ollama provider in Cascade is just an OpenAI-compatible client with sensible Ollama defaults. The name is historical; it works with anything that speaks the OpenAI chat API.

Cascade asks the LLM to produce structured output (JSON matching a strict Pydantic schema) for the plan and code stages. This is hard for small models. A rough quality / hardware guide:

Model classHardware neededCascade reliabilityNotes
1B-3B (tiny)CPU onlyPoorCannot produce reliable structured output. Avoid.
7B-8B16 GB RAM CPU, or 8 GB VRAMUsable for simple promptsWorks for cascade try and small cascade prompt calls. Struggles on multi-file plans.
13B-30B32 GB RAM CPU, or 16-24 GB VRAMGoodReliable for most stories. Slow on CPU; comfortable on a single 24GB GPU.
70B+64+ GB RAM CPU, or 48 GB+ VRAM (or quantized to fit on 24)Production-gradeComparable to mid-tier SaaS models. Recommended if you have the hardware.

Models with strong JSON-mode support are noticeably more reliable. As of this writing, Llama 3.3 (70B), Qwen 2.5 (32B+), and DeepSeek-Coder-V2 are the best open choices for code work.

A single Cascade story makes two LLM calls (plan, then code), each typically 2,000 to 5,000 tokens out. On consumer hardware:

  • CPU-only, 7B model: 60-180 seconds per call. Expect 3-6 minutes per story.
  • Single 24GB GPU, 30B model: 10-30 seconds per call. Expect 30-90 seconds per story.
  • Dual GPU or H100, 70B model: 5-15 seconds per call. Comparable to SaaS APIs.

For comparison, an Anthropic Claude Opus call typically completes in 5-15 seconds.

  • Quality scales with model size. Small local models (under 7B params) produce code Cascade cannot reliably ship. Use the largest model your hardware can run.
  • Structured output reliability varies by model. Models trained with strong JSON-mode support (Llama 3.1 8B+, Llama 3.3, Qwen 2.5 7B+, DeepSeek Coder) work well. Older or smaller models may fail Cascade’s schema validation.
  • Speed. Slower than the SaaS API providers, especially on CPU. Plan for 2-10x the latency of an Anthropic or OpenAI call.
  • No automatic model updates. You are responsible for pulling new model versions as they release.
  • Your company prohibits sending source code to any SaaS API
  • You operate in an air-gapped or restricted-network environment
  • You are doing high-volume runs and the per-token cost of SaaS APIs adds up
  • You want full reproducibility (same model weights, same output forever)
  • You are experimenting with open models and want to benchmark them on real code work
SituationBest choice
You have Claude Code installed; you are an individual developerClaude Code SDK
Your company blocks SaaS APIs but you have GPUs in-houseOllama or vLLM with a 70B-class model
You want maximum reliability and do not mind payingDirect Anthropic / OpenAI / Google API
You are evaluating Cascade with no commitmentClaude Code SDK first; fall back to Ollama if needed
You need strict data residency (e.g., EU-only)Local Ollama; verify your model weights are stored locally

Even with a default set, you can override per command:

Terminal window
cascade prompt "Add health endpoint" --model claude-opus-4-7

Or change the default any time:

Terminal window
cascade configure llm anthropic --key sk-ant-xxx --set-default

The previous configuration stays in ~/.config/cascade/config.yaml. Switching back later is one command, no re-config.