How to Use OpenClaw with Ollama: Run AI Locally Without API Costs

Cloud AI APIs are powerful, but they come with costs, rate limits, and privacy concerns. What if you could run your AI assistant entirely on your own hardware — no API keys, no monthly bills, complete data sovereignty?

That’s exactly what OpenClaw + Ollama delivers. In this guide, we’ll set up a fully local AI assistant stack.

What is Ollama?

Ollama is a lightweight runtime that lets you run open-source large language models (LLMs) on your local machine. It handles model downloading, quantization, GPU acceleration, and serving — all through a simple CLI.

Combined with OpenClaw, Ollama becomes the brain behind your self-hosted AI assistant.

Prerequisites

OpenClaw installed and running (installation guide)
Hardware requirements (varies by model):
- Minimum: 16GB RAM, modern CPU
- Recommended: 32GB RAM, NVIDIA GPU with 8GB+ VRAM
- Ideal: 64GB RAM, NVIDIA RTX 4090 or Apple M3/M4 with 32GB+ unified memory

Step 1: Install Ollama

macOS

brew install ollama

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows (WSL2)

# Inside WSL2
curl -fsSL https://ollama.ai/install.sh | sh

Verify installation:

ollama --version

Step 2: Download Your First Model

Ollama has a library of optimized models. Here are the best choices for OpenClaw:

# Best all-around (requires ~40GB RAM or 24GB VRAM)
ollama pull llama3.2:70b

# Great balance of speed and quality (requires ~16GB RAM or 8GB VRAM)
ollama pull llama3.2:8b

# Code-specialized model
ollama pull codellama:34b

# Fast and lightweight (runs on almost anything)
ollama pull phi3:mini

Model Size Reference

Model	Parameters	RAM Needed	VRAM Needed	Quality
`phi3:mini`	3.8B	8GB	4GB	★★★☆☆
`llama3.2:8b`	8B	16GB	8GB	★★★★☆
`codellama:34b`	34B	32GB	16GB	★★★★☆
`llama3.2:70b`	70B	48GB	24GB	★★★★★
`mixtral:8x7b`	46.7B	32GB	16GB	★★★★☆

Step 3: Start the Ollama Server

# Start in the background
ollama serve &

# Or as a system service (Linux)
sudo systemctl enable ollama
sudo systemctl start ollama

Test that it’s running:

curl http://localhost:11434/api/tags

You should see a JSON response listing your downloaded models.

Step 4: Connect OpenClaw to Ollama

Configure OpenClaw to use Ollama as its AI provider:

# ~/.openclaw/config.yaml
ai:
  provider: "ollama"
  
  providers:
    ollama:
      host: "http://localhost:11434"
      models:
        - name: "llama3.2:8b"
          max_tokens: 4096
          temperature: 0.7
          # Context window (depends on model)
          context_length: 8192

Restart the gateway:

openclaw gateway restart

Step 5: Verify the Connection

openclaw diagnostics --providers

┌──────────┬──────────┬──────────────────┬──────────┐
│ Provider │ Status   │ Model            │ Latency  │
├──────────┼──────────┼──────────────────┼──────────┤
│ Ollama   │ ● Online │ llama3.2:8b      │ 1.8s     │
└──────────┴──────────┴──────────────────┴──────────┘

Send a test message through any connected channel:

What's 2 + 2? Explain the mathematical reasoning.

Step 6: GPU Acceleration

NVIDIA GPUs (CUDA)

Ollama automatically detects NVIDIA GPUs. Verify:

# Check GPU detection
ollama ps

If your GPU isn’t detected, install the NVIDIA Container Toolkit:

# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Apple Silicon (M1/M2/M3/M4)

Ollama uses Metal GPU acceleration automatically on Apple Silicon. No extra setup needed — just ensure you have enough unified memory.

Verify GPU Usage

# NVIDIA
nvidia-smi

# Watch GPU usage in real-time
watch -n 1 nvidia-smi

Step 7: Performance Tuning

Optimize for Speed

# ~/.openclaw/config.yaml
ai:
  providers:
    ollama:
      host: "http://localhost:11434"
      models:
        - name: "llama3.2:8b"
          # Reduce for faster responses
          max_tokens: 2048
          # Lower temperature = more deterministic (slightly faster)
          temperature: 0.3
          # Number of GPU layers (increase to offload more to GPU)
          num_gpu: 99  # Use all available GPU layers
          # Keep model in memory between requests
          keep_alive: "30m"
          # Batch size for prompt processing
          num_batch: 512

Memory Management

If you’re running on limited RAM, manage model loading:

# Unload a model from memory
ollama stop llama3.2:70b

# List loaded models
ollama ps

Configure OpenClaw to manage Ollama memory:

ai:
  providers:
    ollama:
      memory_management:
        # Unload model after inactivity
        unload_after: "15m"
        # Max models loaded simultaneously
        max_loaded_models: 1

Step 8: Multiple Local Models

Run different models for different tasks (pairs well with multi-agent routing):

ai:
  providers:
    ollama:
      host: "http://localhost:11434"
      models:
        - name: "llama3.2:8b"     # General conversation
        - name: "codellama:34b"    # Code generation
        - name: "phi3:mini"        # Quick responses

routing:
  rules:
    - name: "coding"
      match:
        keywords: ["code", "function", "debug", "implement"]
      route:
        provider: "ollama"
        model: "codellama:34b"
    
    - name: "quick"
      match:
        max_input_tokens: 50
      route:
        provider: "ollama"
        model: "phi3:mini"
  
  # Default to general model
  default:
    provider: "ollama"
    model: "llama3.2:8b"

Step 9: Hybrid Mode (Local + Cloud)

The best of both worlds — use local models for most tasks, cloud for complex ones:

ai:
  default_provider: "ollama"
  default_model: "llama3.2:8b"

  providers:
    ollama:
      host: "http://localhost:11434"
      models:
        - name: "llama3.2:8b"
    
    anthropic:
      api_key: "${ANTHROPIC_API_KEY}"
      models:
        - name: "claude-3.5-sonnet"

routing:
  rules:
    # Complex tasks go to Claude
    - name: "complex"
      match:
        keywords: ["analyze", "refactor", "architect", "review complex"]
      route:
        provider: "anthropic"
        model: "claude-3.5-sonnet"
    
    # Privacy-sensitive stays local
    - name: "private"
      match:
        keywords: ["private", "confidential", "personal"]
      route:
        provider: "ollama"
        model: "llama3.2:8b"
  
  # Everything else: local
  default:
    provider: "ollama"
    model: "llama3.2:8b"

This way, 90% of your queries cost $0 while still having cloud models available for heavy lifting.

Benchmarking Your Setup

Test your local model performance:

openclaw benchmark --provider ollama --model llama3.2:8b

┌─────────────────────────┬──────────────┐
│ Metric                  │ Result       │
├─────────────────────────┼──────────────┤
│ Time to First Token     │ 0.8s         │
│ Tokens/Second           │ 42 tok/s     │
│ Average Response Time   │ 3.2s         │
│ Memory Usage            │ 5.4GB        │
│ GPU Utilization         │ 87%          │
│ Quality Score (MMLU)    │ 72.3%        │
└─────────────────────────┴──────────────┘

Troubleshooting

”Model not found”

# List available models
ollama list

# Pull the missing model
ollama pull llama3.2:8b

Slow Response Times

Use a smaller model or lower quantization
Increase num_gpu to offload more to GPU
Reduce max_tokens in the config
Close other GPU-intensive applications

Out of Memory (OOM)

# Check memory usage
free -h

# Use a smaller model
ollama pull llama3.2:8b  # Instead of 70b

# Or use quantized versions
ollama pull llama3.2:8b-q4_0  # 4-bit quantization

Conclusion

Running OpenClaw with Ollama gives you the ultimate self-hosted AI stack — zero API costs, complete privacy, and full control over your data. While local models aren’t quite as powerful as the latest cloud offerings, they’re more than capable for most daily tasks, and they’re getting better every month.

Want help optimizing your local AI infrastructure? Our team designs high-performance self-hosted AI deployments for businesses and development teams.