Running a Local Coding Agent on an M2 Max (32GB) with Ollama + opencode

I came across Alex Ewerlöf's excellent write-up, Local LLMs for Agentic Coding, and wanted to reproduce his setup on my own machine. His guide leans on LM Studio driving VS Code Copilot / Pi, with Gemma 4 as the model. My stack is a little different — I run Ollama and opencode — and my hardware is more modest: an Apple M2 Max with 32GB of unified memory.

I genuinely expected nothing from this exercise except learning how to set everything up. It turns out it's actually ... usable?

The model

Alex recommends the Gemma 4 family, and the sweet spot for agentic work is Gemma 4 26B-A4B — a mixture-of-experts model with 26B total but only 4B active parameters, so it's fast while staying capable. It's on Ollama:

bash

ollama pull gemma4:26b-a4b-it-q4_K_M   # ~17–18GB at 4-bit

On 32GB, stick to the 4-bit quant. The 8-bit (q8_0, ~28GB) leaves no room for the KV cache, the OS, or anything else you have open.

Tuning for 32GB

The article uses a 150k context window on a much bigger machine. On 32GB you have to be more conservative. Ollama also defaults every model to a tiny 4,096-token context, which is useless for an agent — opencode's system prompt and tool definitions alone eat 20–40k tokens.
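As a rough sanity check on those numbers, the sketch below subtracts the agent's fixed overhead from a context window. The 20k and 40k figures are the range quoted above; the function name is made up for illustration.

```python
def remaining_context(num_ctx: int, overhead_tokens: int) -> int:
    """Tokens left for code and conversation after the agent's fixed overhead."""
    return num_ctx - overhead_tokens


# Compare Ollama's 4,096-token default with a 48k window (49152 tokens).
for num_ctx in (4096, 49152):
    for overhead in (20_000, 40_000):
        left = remaining_context(num_ctx, overhead)
        print(f"num_ctx={num_ctx:>6}  overhead={overhead:>6}  left={left:>7}")
```

A negative result means the system prompt alone overflows the window, which is the case for the 4,096 default at either end of the range. At 48k the worst case still leaves roughly 9k tokens of working room.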

I baked a sensible context window into a custom variant with a Modelfile:

dockerfile

FROM gemma4:26b-a4b-it-q4_K_M
# 48k context: plenty for agentic work, fits in 32GB
PARAMETER num_ctx 49152
PARAMETER temperature 1.0
PARAMETER top_p 0.95

bash

ollama create gemma4-coding -f Modelfile

To stretch memory further I adapted Alex's KV-cache-quantization trick. He set the K cache to Q8_0 and the V cache to Q4_0 in LM Studio; Ollama exposes only a single global cache type, and it takes effect only when flash attention is enabled:

bash

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0   # aggressive savings for a tight memory budget
export OLLAMA_KEEP_ALIVE=-1        # keep the model resident; reloads are slow at this size
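For a feel of what the cache type buys, the usual back-of-the-envelope estimate is 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The sketch below applies it at the 48k window. The layer count, head count, and head dimension are placeholders, not Gemma 4's real architecture, and the estimate treats every layer as full attention, so it overstates a model with sliding-window layers. The bytes-per-element figures follow ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 in 18).

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, cache_type: str) -> float:
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# HYPOTHETICAL architecture values, chosen only to illustrate the ratio.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                        n_ctx=49152, cache_type=cache_type)
    print(f"{cache_type:>5}: {size:.2f} GiB")
```

Whatever the real architecture numbers are, the ratio holds: q4_0 needs a little over a quarter of the f16 cache, and about half of q8_0.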

Gemma 4's interleaved local/global attention keeps the KV cache smaller than a typical dense model, so q4_0 at 48k is comfortable. I also raised the Metal VRAM cap so the GPU could hold the whole model:

bash

sudo sysctl iogpu.wired_limit_mb=28000   # leaves ~4GB for the system
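The arithmetic behind that number is simple enough to script if you want to try other limits. This is only a sketch: it treats 1GB as 1024MB, uses the ~18GB and ~28GB weight sizes quoted earlier, and the function names are invented. Keep in mind that a value set with sysctl at runtime does not survive a reboot.

```python
def system_reserve_mb(total_gb: int, wired_limit_mb: int) -> int:
    """Memory left outside the GPU wired limit, in MB (1GB taken as 1024MB)."""
    return total_gb * 1024 - wired_limit_mb


def gpu_headroom_gb(wired_limit_mb: int, weights_gb: float) -> float:
    """Room under the wired limit after the weights, for KV cache and buffers."""
    return wired_limit_mb / 1024 - weights_gb


print(system_reserve_mb(32, 28000))  # MB left for macOS and apps
print(gpu_headroom_gb(28000, 18))    # 4-bit weights: positive headroom
print(gpu_headroom_gb(28000, 28))    # 8-bit weights: negative, does not fit
```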

Wiring it into opencode

opencode talks to Ollama's OpenAI-compatible endpoint. The provider goes in ~/.config/opencode/opencode.json:

json

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": { ... }
  }
}

The model key (gemma4-coding) must exactly match the name from ollama list. opencode also wants an auth entry in ~/.local/share/opencode/auth.json, even though Ollama ignores the key:

json

{
  "ollama": {
    "type": "api",
    "key": "ollama"
  }
}
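For reference, opencode's documented pattern for an OpenAI-compatible provider has, as far as I know, four entries under the provider name: npm, name, options, and models. The sketch below generates an opencode.json in that shape. Treat the values as assumptions rather than the exact file used here: the display names are invented, and the base URL is Ollama's default local address.

```python
import json


def ollama_provider_config(model_name: str,
                           base_url: str = "http://localhost:11434/v1") -> dict:
    """Build an opencode config dict for a local Ollama model (assumed layout)."""
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "ollama": {
                # Assumed: the generic OpenAI-compatible adapter package.
                "npm": "@ai-sdk/openai-compatible",
                "name": "Ollama (local)",  # invented display name
                "options": {"baseURL": base_url},
                # The key must match the name shown by `ollama list`.
                "models": {model_name: {"name": "Gemma 4 26B-A4B (48k)"}},
            }
        },
    }


print(json.dumps(ollama_provider_config("gemma4-coding"), indent=2))
```

Redirecting the output to ~/.config/opencode/opencode.json gives a starting point to edit by hand.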

How it runs

With everything in place, ollama ps shows the model resident at ~20GB, 100% GPU, with a 49152-token context. Performance is genuinely usable:

  • ~15–30 tokens/sec — comfortably above the ~10 tok/s the article calls the floor for coding work.
  • Cold start of 1–3 minutes on the first big agentic prompt (loading the model plus processing opencode's huge system prompt), then snappy after that.
  • The 4B-active MoE design is the whole reason this works on 32GB — a dense 31B would be slower and wouldn't fit nearly as comfortably.
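To put your own numbers on throughput and cold start, Ollama's /api/generate response reports eval_count and eval_duration alongside load_duration and prompt_eval_duration, with durations in nanoseconds. A minimal sketch, assuming a local server on the default port; the helper names are made up for illustration.

```python
import json
import urllib.request


def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to Ollama and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(response: dict) -> float:
    """Generation speed: output tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def cold_start_seconds(response: dict) -> float:
    """Time spent loading the model plus processing the prompt, in seconds."""
    return (response["load_duration"] + response["prompt_eval_duration"]) / 1e9
```

Passing the result of generate("gemma4-coding", "Explain a mutex in two sentences.") to both helpers gives figures comparable to the ones above, though a short prompt will understate the cold start of a real agentic session.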

If you've got bigger hardware, follow Alex's original guide and crank the context. But if you're on a 32GB Apple Silicon laptop, a 4-bit MoE model with a 48k window, quantized KV cache, and a resident model gets you a real, private, local coding agent — no tokens leaving your machine.