Running a Local Coding Agent on an M2 Max (32GB) with Ollama + opencode
I came across Alex Ewerlöf's excellent write-up, Local LLMs for Agentic Coding, and wanted to reproduce his setup on my own machine. His guide leans on LM Studio driving VS Code Copilot / Pi, with Gemma 4 as the model. My stack is a little different — I run Ollama and opencode — and my hardware is more modest: an Apple M2 Max with 32GB of unified memory.
I genuinely expected nothing from this exercise except learning how to set everything up. It turns out it's actually ... usable?
The model
Alex recommends the Gemma 4 family, and the sweet spot for agentic work is Gemma 4 26B-A4B — a mixture-of-experts model with 26B total but only 4B active parameters, so it's fast while staying capable. It's on Ollama:
bash
ollama pull gemma4:26b-a4b-it-q4_K_M # ~17–18GB at 4-bit
On 32GB, stick to the 4-bit quant. The 8-bit (q8_0, ~28GB) leaves no room for the KV cache, the OS, or anything else you have open.
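The 4-bit versus 8-bit choice is simple arithmetic on the sizes quoted above. A minimal sketch, using only this post's round figures (32GB of unified memory, ~18GB for q4_K_M, ~28GB for q8_0); `headroom_gb` is a hypothetical helper name, not part of any tool:

```bash
#!/usr/bin/env bash
# headroom_gb TOTAL_GB MODEL_GB
# Hypothetical helper: memory left for the KV cache, the OS, and other apps
# once the model weights are loaded. Figures are the post's approximations.
headroom_gb() {
  echo $(( $1 - $2 ))
}

echo "q4_K_M: $(headroom_gb 32 18) GB left"
echo "q8_0:   $(headroom_gb 32 28) GB left"
```

With q8_0, roughly 4GB would have to cover the OS, the editor, and the KV cache all at once, which is why the 4-bit quant is the practical choice here.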
Tuning for 32GB
The article uses a 150k context window on a much bigger machine. On 32GB you have to be more conservative. Ollama also defaults every model to a tiny 4,096-token context, which is useless for an agent — opencode's system prompt and tool definitions alone eat 20–40k tokens.
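To see why the default window cannot work, subtract opencode's fixed overhead from the window size. A rough sketch using the 20–40k range quoted above; `ctx_left` is a hypothetical helper, and 49152 (48k) is just one candidate window that fits the budget:

```bash
#!/usr/bin/env bash
# ctx_left NUM_CTX OVERHEAD_TOKENS
# Hypothetical helper: tokens left for the actual conversation after the
# agent's system prompt and tool definitions are counted.
ctx_left() {
  echo $(( $1 - $2 ))
}

for overhead in 20000 40000; do
  echo "4096-token default, overhead $overhead: $(ctx_left 4096 "$overhead") left"
  echo "49152-token window, overhead $overhead: $(ctx_left 49152 "$overhead") left"
done
```

A negative result means the agent's own prompt does not fit, before any of your code is read. At 48k, even the worst case leaves about 9k tokens for the conversation.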
I baked a sensible context window into a custom variant with a Modelfile:
dockerfile
FROM gemma4:26b-a4b-it-q4_K_M
# 48k context: plenty for agentic work, fits in 32GB
PARAMETER num_ctx 49152
PARAMETER temperature 1.0
PARAMETER top_p 0.95
bash
ollama create gemma4-coding -f Modelfile
To stretch memory further I matched Alex's KV-cache-quantization trick. He set the K cache to Q8_0 and V cache to Q4_0 in LM Studio; Ollama exposes a single global cache type for both, and it only takes effect with flash attention enabled:
bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0 # aggressive savings for a tight memory budget
export OLLAMA_KEEP_ALIVE=-1 # keep the model resident; reloads are slow at this size
Gemma 4's interleaved local/global attention keeps the KV cache smaller than a typical dense model, so q4_0 at 48k is comfortable. I also raised the Metal VRAM cap so the GPU could hold the whole model:
bash
sudo sysctl iogpu.wired_limit_mb=28000 # leaves ~4GB for the system
Wiring it into opencode
opencode talks to Ollama's OpenAI-compatible endpoint. The provider goes in ~/.config/opencode/opencode.json:
json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": { … }
  }
}
The model key (gemma4-coding) must exactly match the name from ollama list. opencode also wants an auth entry in ~/.local/share/opencode/auth.json, even though Ollama ignores the key:
json
{
  "ollama": {
    "type": "api",
    "key": "ollama"
  }
}
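For reference, here is a sketch of what a complete `ollama` provider entry can look like. The four fields follow the shape opencode documents for OpenAI-compatible custom providers and are an assumption, not necessarily the exact contents used in this setup. The display names are placeholders, and the base URL assumes Ollama's default port. The script writes to a scratch path so it cannot clobber a real config:

```bash
#!/usr/bin/env bash
# Sketch only: writes an example config to a scratch file,
# not to ~/.config/opencode/opencode.json.
sketch="${TMPDIR:-/tmp}/opencode.sketch.json"

cat > "$sketch" <<'JSON'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "gemma4-coding": {
          "name": "Gemma 4 26B-A4B (48k)"
        }
      }
    }
  }
}
JSON

echo "wrote $sketch"
```

The key under `models` is the part that has to match the Ollama model name. Check the other fields against the current opencode documentation before copying the file into place.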
How it runs
With everything in place, ollama ps shows the model resident at ~20GB, 100% GPU, with a 49152-token context. Performance is genuinely usable:
- ~15–30 tokens/sec — comfortably above the ~10 tok/s the article calls the floor for coding work.
- Cold start of 1–3 minutes on the first big agentic prompt (loading the model plus processing opencode's huge system prompt), then snappy after that.
- The 4B-active MoE design is the whole reason this works on 32GB — a dense 31B would be slower and wouldn't fit nearly as comfortably.
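To check the throughput figure on your own machine, note that Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A small sketch of the conversion; `tok_per_sec` is a hypothetical helper, and the sample numbers are made up for illustration, not measurements:

```bash
#!/usr/bin/env bash
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS
# Hypothetical helper: converts Ollama's eval_count / eval_duration pair
# (duration is in nanoseconds) into tokens per second.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative input: 300 tokens generated in 15 seconds.
tok_per_sec 300 15000000000
```

For a quick check without the API, `ollama run gemma4-coding --verbose` prints an eval rate after each response.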
If you've got bigger hardware, follow Alex's original guide and crank the context. But if you're on a 32GB Apple Silicon laptop, a 4-bit MoE model with a 48k window, quantized KV cache, and a resident model gets you a real, private, local coding agent — no tokens leaving your machine.