~/home~/résumé~/blog~/contact
Share
  1. Home
  2. /
  3. Blog
  4. /
  5. Guides
  6. /
  7. How to Run Claude Code With Local Models

How to Run Claude Code With Local Models

AI CodingClaude CodeDeveloper ToolsCLI

April 1, 2026

  • ›The Three Paths to Local
  • ›Ollama Setup in 5 Minutes
  • ›The Environment Variables That Matter
  • ›When Things Break
  • ›Pick the Right Model

Three environment variables connect claude code local models to the CLI you already know. Ollama, LM Studio, and vLLM all work. Set ANTHROPIC_BASE_URL, point it at your local server, and Claude Code runs against your own hardware with no API costs.

The Three Paths to Local#

You have three realistic options for running a local LLM coding assistant with Claude Code. Each fits a different use case.

Ollama is the fastest path. One binary, one command, native Anthropic API compatibility since v0.14. If you want to test this in five minutes, start here.

LM Studio gives you a GUI for model management and a built-in inference server. Good if you prefer clicking over typing. The tradeoff is less control over serving parameters.

vLLM + LiteLLM is the production stack. vLLM serves the model on your GPU, LiteLLM proxies requests and translates between API formats. More moving parts, more control. Claude Code Local Model Architecture

Ollama Setup in 5 Minutes#

Pull a model with tool-calling support and start the server with enough context:

ollama pull qwen3-coder
# Claude Code needs at minimum 32K context. 64K is better.
OLLAMA_NUM_CTX=65536 ollama serve

In a second terminal, set three variables and launch:

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model qwen3-coder

That's it. Running ollama claude code takes one binary and three env vars. Claude Code now sends all requests to your local instance. I tested this on a 32GB machine with a 3090. Response times averaged 45 seconds for simple edits, 2-3 minutes for multi-file changes. Usable, not fast.

Note: You need at least 16GB of RAM for a 7B model and 32GB for anything larger. VRAM matters more than CPU. A dedicated GPU with 8GB+ VRAM is the practical minimum for a tolerable experience. For LM Studio, the setup is similar but with a GUI for browsing and downloading models:

lms server start --port 1234
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
# LM Studio models need the openai/ prefix
claude --model openai/qwen2.5-coder-14b

The Environment Variables That Matter#

Three variables control everything. Get one wrong and Claude Code either hits the cloud API or fails silently.

  • ANTHROPIC_BASE_URL sets the endpoint. Point this at your local server's address and port. No trailing slash.
  • ANTHROPIC_AUTH_TOKEN satisfies Claude Code's auth check. The value doesn't matter for local servers, but it can't be empty. Use ollama or local or whatever.
  • CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC stops Claude Code from phoning home for telemetry and update checks. Set it to 1.

Here is a gotcha that tripped me up: if you leave ANTHROPIC_API_KEY set in your shell profile, Claude Code may still try to reach the Anthropic API for certain operations. Unset it explicitly or the ANTHROPIC_BASE_URL override gets ignored for some requests. You can also set these permanently in ~/.claude/settings.json under the env key so you do not need to export them every session. This is the cleanest approach for daily use with claude code local models.

# Shell profile snippet for local-only mode
unset ANTHROPIC_API_KEY
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

When Things Break#

For teams needing a production-grade vLLM LiteLLM proxy setup, the DEV Community guide covers the full architecture. It breaks in specific ways, and I hit every one of these errors.

Tool calls silently fail (vLLM)

vLLM needs two flags: `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder`. Without these, the model generates tool call text but vLLM does not parse it into structured calls. Claude Code sees no tool output and loops.
Your model lacks tool-calling capability. Qwen3-Coder, GLM-4.7, and gpt-oss:20b all support tools. Most base Llama models do not. Switch models.
The /init command hardcodes a model name internally. Skip /init entirely and create your CLAUDE.md manually. See the guide on Claude Code skills for what to put in it.
Default Ollama context is 2048 tokens. Claude Code needs 32K minimum. Start Ollama with `OLLAMA_NUM_CTX=65536 ollama serve`. For vLLM, set `--max-model-len` to 65536.
LiteLLM tries to pass Anthropic-only params to your local model. Add `drop_params: true` and `modify_params: true` to your LiteLLM model config. Without this, you get 400 errors.

Here is the LiteLLM config that actually works:

model_list:
  - model_name: qwen3-coder
    litellm_params:
      model: openai/qwen3-coder
      api_base: http://localhost:8000/v1
      # Without these, Anthropic-specific params cause 400 errors
      drop_params: true
      modify_params: true

Pick the Right Model#

Not every local model works with Claude Code's agentic patterns. Tool calling is the hard requirement. Without it, Claude Code can't read files, run commands, or edit code. Models I've tested that work:

  • Qwen3-Coder: Best overall. Strong tool calling, good at following complex instructions. The community consensus on GitHub #7178.
  • GLM-4.7: Solid alternative. Slightly worse at multi-step reasoning but faster inference.
  • gpt-oss:20b: Works but needs more VRAM. Good code quality when it doesn't time out.

Several popular models fail here: most Llama variants struggle with Claude Code's tool-calling format. DeepSeek Coder generates good code but chokes on structured tool responses. On dual MI60 GPUs, I measured ~25-30 tokens/sec with ~175ms time to first token. Workable for targeted edits. For sustained back-and-forth (debugging sessions, large refactors), expect 1-3 minutes per response on a 7B model with a mid-range GPU. Cloud Claude handles these in 5-20 seconds. Cycling between local and cloud works best. Use local models for routine tasks that don't need peak intelligence. Switch to the API when you hit rate limits or need complex multi-file reasoning. A split terminal showing Claude Code connected to a local Ollama instance on the left, with GPU utilization metrics on the right "Free" local inference costs 32GB of RAM and your patience. But for many tasks, that tradeoff is worth it. Set the three env vars, pull a model with tool support, and start coding.

Share