Running a Local Coding Agent on an M2 Max (32GB) with Ollama + opencode

I came across Alex Ewerlöf's excellent write-up, Local LLMs for Agentic Coding, and wanted to reproduce his setup on my own machine. His guide leans on LM Studio driving VS Code Copilot / Pi, with Gemma 4 as the model. My stack is a little different — I run Ollama and opencode — and my hardware is more modest: an Apple M2 Max with 32GB of unified memory.

I genuinely expected nothing from this exercise except learning how to set everything up. It turns out it's actually ... usable?

The model

Alex recommends the Gemma 4 family, and the sweet spot for agentic work is Gemma 4 26B-A4B — a mixture-of-experts model with 26B total but only 4B active parameters, so it's fast while staying capable. It's on Ollama:

bash

ollama pull gemma4:26b-a4b-it-q4_K_M   # ~17–18GB at 4-bit

On 32GB, stick to the 4-bit quant. The 8-bit (q8_0, ~28GB) leaves no room for the KV cache, the OS, or anything else you have open.

Tuning for 32GB

The article uses a 150k context window on a much bigger machine. On 32GB you have to be more conservative. Ollama also defaults every model to a tiny 4,096-token context, which is useless for an agent — opencode's system prompt and tool definitions alone eat 20–40k tokens.

I baked a sensible context window into a custom variant with a Modelfile:

dockerfile

FROM gemma4:26b-a4b-it-q4_K_M
# 48k context: plenty for agentic work, fits in 32GB
PARAMETER num_ctx 49152
PARAMETER temperature 1.0
PARAMETER top_p 0.95

bash

ollama create gemma4-coding -f Modelfile
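To confirm the variant actually picked up the larger window, one option is to read its parameters back. This is a minimal sketch, assuming `ollama show --parameters` prints one whitespace-separated `name value` pair per line; the helper name `get_num_ctx` is hypothetical.

```shell
# Hypothetical helper: read a parameter listing on stdin and print the num_ctx value.
# Assumes one "name value" pair per line, separated by whitespace.
get_num_ctx() {
  awk '$1 == "num_ctx" { print $2 }'
}

# Usage (needs a running Ollama and the model created above):
#   ollama show gemma4-coding --parameters | get_num_ctx
```

If this prints nothing, the variant has no `num_ctx` override and would fall back to Ollama's default context.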

To stretch memory further I adapted Alex's KV-cache-quantization trick. He set the K cache to Q8_0 and the V cache to Q4_0 in LM Studio; Ollama exposes a single global setting that covers both, and it only takes effect with flash attention enabled:

bash

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0   # aggressive savings for a tight memory budget
export OLLAMA_KEEP_ALIVE=-1        # keep the model resident; reloads are slow at this size
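One caveat: `export` only reaches an Ollama server started from that same shell (for example with `ollama serve`). If Ollama runs as the macOS menu-bar app, its documentation says to set the variables with `launchctl setenv` and restart the app. The sketch below only prints those commands so they can be reviewed first; the helper name `ollama_launchctl_cmds` is hypothetical.

```shell
# Hypothetical helper: print one `launchctl setenv` command per setting.
# Nothing is executed here; pipe the output to sh once it looks right.
ollama_launchctl_cmds() {
  for kv in OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_KEEP_ALIVE=-1; do
    # ${kv%%=*} is the variable name, ${kv#*=} is its value
    printf 'launchctl setenv %s %s\n' "${kv%%=*}" "${kv#*=}"
  done
}

# Usage on macOS, followed by a restart of the Ollama app:
#   ollama_launchctl_cmds | sh
```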

Gemma 4's interleaved local/global attention keeps the KV cache smaller than that of a model with full attention in every layer, so q4_0 at 48k is comfortable. I also raised the Metal VRAM cap so the GPU could hold the whole model:

bash

sudo sysctl iogpu.wired_limit_mb=28000   # leaves ~4GB for the system
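Rather than hard-coding the cap, it can be derived from the machine's physical memory. This is a rough sketch, assuming you want to reserve a fixed number of megabytes for the system; the helper name `wired_limit_mb` and the 4096MB reserve are illustrative choices, not values from Alex's guide.

```shell
# Hypothetical helper: GPU wired limit in MB.
#   $1 = total physical memory in bytes
#   $2 = megabytes to leave for the OS and other apps
wired_limit_mb() {
  echo $(( $1 / 1024 / 1024 - $2 ))
}

# Usage on macOS (hw.memsize reports physical memory in bytes):
#   sudo sysctl iogpu.wired_limit_mb="$(wired_limit_mb "$(sysctl -n hw.memsize)" 4096)"
```

A 32GB machine reports 34359738368 bytes, so a 4096MB reserve works out to 28672, close to the 28000 used above. The sysctl value does not survive a reboot, so it has to be reapplied after each restart.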

Wiring it into opencode

opencode talks to Ollama's OpenAI-compatible endpoint. The provider goes in ~/.config/opencode/opencode.json:

json

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": { … }
  }
}

The model key (gemma4-coding) must exactly match the name from ollama list. opencode also wants an auth entry in ~/.local/share/opencode/auth.json, even though Ollama ignores the key:

json

{
  "ollama": {
    "type": "api",
    "key": "ollama"
  }
}
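Since a mismatched model name is an easy mistake to make, a quick check against `ollama list` can save some head-scratching. This is a minimal sketch, assuming the listing has a header row and puts the model name in the first column; the helper name `model_listed` is hypothetical, and it accepts the name with or without the implicit `:latest` tag.

```shell
# Hypothetical helper: exit 0 if the model named in $1 appears in an
# `ollama list` style listing on stdin, otherwise exit 1.
model_listed() {
  awk -v m="$1" '
    NR > 1 && ($1 == m || $1 == m ":latest") { found = 1 }  # NR > 1 skips the header row
    END { exit !found }
  '
}

# Usage:
#   ollama list | model_listed gemma4-coding && echo "model name matches"
```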

How it runs

With everything in place, ollama ps shows the model resident at ~20GB, 100% GPU, 49152 context. Performance is genuinely usable:

~15–30 tokens/sec — comfortably above the ~10 tok/s the article calls the floor for coding work.
Cold start of 1–3 minutes on the first big agentic prompt (loading the model plus processing opencode's huge system prompt), then snappy after that.
The 4B-active MoE design is the whole reason this works on 32GB — a dense 31B would be slower and wouldn't fit nearly as comfortably.
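To measure throughput on your own machine, `ollama run --verbose` prints an eval rate after each reply. The same figure can be computed from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` endpoint returns. The helper name `tok_per_sec` below is hypothetical, and the numbers in the usage line are made up for illustration, not measurements from this setup.

```shell
# Hypothetical helper: tokens per second from Ollama's response fields.
#   $1 = eval_count     (tokens generated)
#   $2 = eval_duration  (generation time in nanoseconds)
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1e9 }'
}

# Usage with illustrative values (300 tokens in 15 seconds):
#   tok_per_sec 300 15000000000
```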

If you've got bigger hardware, follow Alex's original guide and crank the context. But if you're on a 32GB Apple Silicon laptop, a 4-bit MoE model with a 48k window, quantized KV cache, and a resident model gets you a real, private, local coding agent — no tokens leaving your machine.