Running a Local Coding Agent on an M2 Max (32GB) with Ollama + opencode

I came across Alex Ewerlöf's excellent write-up, Local LLMs for Agentic Coding, and wanted to reproduce his setup on my own machine. His guide leans on LM Studio driving VS Code Copilot / Pi, with Gemma 4 as the model. My stack is a little different — I run Ollama and opencode — and my hardware is more modest: an Apple M2 Max with 32GB of unified memory.

I genuinely expected nothing from this exercise except learning how to set everything up. It turns out it's actually ... usable?

The model

Alex recommends the Gemma 4 family, and the sweet spot for agentic work is Gemma 4 26B-A4B — a mixture-of-experts model with 26B total but only 4B active parameters, so it's fast while staying capable. It's on Ollama:

bash

ollama pull gemma4:26b-a4b-it-q4_K_M   # ~17–18GB at 4-bit

On 32GB, stick to the 4-bit quant. The 8-bit (q8_0, ~28GB) leaves no room for the KV cache, the OS, or anything else you have open.

Tuning for 32GB

The article uses a 150k context window on a much bigger machine. On 32GB you have to be more conservative. Ollama also defaults every model to a tiny 4,096-token context, which is useless for an agent — opencode's system prompt and tool definitions alone eat 20–40k tokens.

I baked a sensible context window into a custom variant with a Modelfile:

dockerfile

FROM gemma4:26b-a4b-it-q4_K_M
PARAMETER num_ctx 49152      # 48k — plenty for agentic work, fits in 32GB
PARAMETER temperature 1.0
PARAMETER top_p 0.95

bash

ollama create gemma4-coding -f Modelfile

To stretch memory further I matched Alex's KV-cache-quantization trick. He set the K cache to Q8_0 and V cache to Q4_0 in LM Studio; Ollama exposes a single global setting plus flash attention:

bash

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0   # aggressive savings for a tight memory budget
export OLLAMA_KEEP_ALIVE=-1        # keep the model resident; reloads are slow at this size

Gemma 4's interleaved local/global attention keeps the KV cache smaller than a typical dense model, so q4_0 at 48k is comfortable. I also raised the Metal VRAM cap so the GPU could hold the whole model:

bash

sudo sysctl iogpu.wired_limit_mb=28000   # leaves ~4GB for the system

Wiring it into opencode

opencode talks to Ollama's OpenAI-compatible endpoint. The provider goes in ~/.config/opencode/opencode.json:

json{2 items"$schema":"https://opencode.ai/config.json""provider":{1 item"ollama":{}4 items}}

The model key (gemma4-codingmust exactly match the name from ollama list. opencode also wants an auth entry in ~/.local/share/opencode/auth.json, even though Ollama ignores the key:

json{1 item"ollama":{2 items"type":"api""key":"ollama"}}

How it runs

With everything fixed, ollama ps shows the model resident at ~20GB, 100% GPU, 49152 context. Performance is genuinely usable:

  • ~15–30 tokens/sec — comfortably above the ~10 tok/s the article calls the floor for coding work.
  • Cold start of 1–3 minutes on the first big agentic prompt (loading the model plus processing opencode's huge system prompt), then snappy after that.
  • The 4B-active MoE design is the whole reason this works on 32GB — a dense 31B would be slower and wouldn't fit nearly as comfortably.

If you've got bigger hardware, follow Alex's original guide and crank the context. But if you're on a 32GB Apple Silicon laptop, a 4-bit MoE model with a 48k window, quantized KV cache, and a resident model gets you a real, private, local coding agent — no tokens leaving your machine.

Goodreads Microblog


I read a lot. I also run a Ghost blog at partiallypeaceful.com. For years, I'd finish a book, rate it on Goodreads, and that was it. Maybe I'd remember to post something on the blog. Usually I wouldn't.

I wanted every book I read to show up on partiallypeaceful.com. Not as a big review post. Just a short microblog entry: cover image, star rating, a few thoughts if I wrote a review, and a link to buy it somewhere that isn't Amazon.

So I built a cron job. It runs once a day, checks my Goodreads "read" shelf, and publishes anything new to Ghost. The whole thing is about 150 lines of Python. Here's how it works.

The pipeline

Every morning at 8 AM, a cron job on my home server fires `goodreads-to-ghost`. It pulls my Goodreads RSS feed (every shelf gets one at `https://www.goodreads.com/review/list_rss/<user-id>?shelf=read`), parses the XML, and walks through each book in reverse chronological order.

For each book, it checks Ghost for an internal tag called `#goodreads-book-{id}`. If the tag is there, skip. If not, this book is new.

For new books, three things happen. One: the cover image gets downloaded from Goodreads and re-uploaded to Ghost's image library via the Admin API. Goodreads puts weird size suffixes in their cover URLs. The parser strips those out to get the full-resolution original. Two: the post HTML gets assembled. Ghost-hosted cover linked to Goodreads, a "Read [title] by [author]" line, unicode star rating, review text in a blockquote if I wrote one, and a Bookshop.org link. Three: the post gets created through Ghost's Admin API with public tags `microblog` and `books`, plus that invisible `#goodreads-book-{id}` tag.

Once the post is live, Ghost fires a `post.published` webhook. A separate process, `ghost-webhook-forwarder`, sits on port 9001 waiting for it. When the webhook hits, it sends a `repository_dispatch` event to GitHub. That kicks off a GitHub Actions workflow that rebuilds the Astro site. The new book post shows up on partiallypeaceful.com within a couple minutes.

Why RSS instead of the Goodreads API

Goodreads stopped issuing API keys in 2020. Their RSS feeds are still maintained though, and they include everything I need: book ID, title, author, cover URLs, rating, review text, publication date. No authentication required.

The parsing was the only challenge. Goodreads has two different RSS item formats. The standard shelf feed uses custom namespace fields (`book_id`, `book_title`, `author_name`). The updates feed buries the data in HTML descriptions with CSS classes like `bookTitle` and `authorName`. I wrote the parser to handle both. The updates feed parser uses Python's `HTMLParser` from the standard library to pull structured fields out of description blobs. Not elegant, but solid.

Idempotency

Cron jobs fail. Network blips, Ghost doing maintenance, whatever. If the job runs twice, duplicate posts are a bad look.

Internal tags solve this. Before creating a post, the job asks Ghost: "any posts with tag `hash-goodreads-book-{id}`?" If yes, skip. The tag is internal so it doesn't show up in the blog's tag cloud. It's purely operational plumbing.

Dry-run mode is the other safety net. `goodreads-to-ghost --dry-run` prints exactly what it would publish without touching Ghost. I test every change this way before letting cron run unattended.

Zero runtime dependencies

I'm stubborn about dependencies on small tools. Every library is a future maintenance headache. This project has none. No `requests`, no `httpx`, no Ghost SDK. All stdlib: `urllib` for HTTP, `xml.etree.ElementTree` for RSS, `hmac` and `hashlib` for Ghost's JWT-based Admin API auth, `html.parser` for the Goodreads updates feed.

The Ghost Admin API auth was the most satisfying piece. Split the API key into key ID and hex secret, build a JWT with HMAC-SHA256, and send it as a bearer token. About 15 lines of code.

The rebuild pipeline

My blog is a static Astro site on Cloudflare Pages. When Ghost gets a new post, the site needs to be rebuilt to pick it up.

The webhook forwarder bridges that gap. It's a tiny HTTP server on my home server behind Tailscale. Ghost sends `post.published` webhooks to it. The forwarder calls GitHub's `repository_dispatch` API, GitHub Actions runs `npm run build`, and Cloudflare gets the new static files.

It's intentionally bare. Parses the JSON, logs the post title, and fires the dispatch. No queue, no retries, no state. If it fails, it fails. The next webhook catches whatever was missed.

Is this overengineered?

Yeah, probably. I could paste my Goodreads reviews into the Ghost admin manually. 30 seconds per book. But I'd forget. I'd skip weeks. And the point of having a personal blog is that it reflects what I'm actually reading, not what I remembered to post about.

The automation removes the friction completely. Finish a book, rate it on Goodreads, write a review if I have thoughts. It appears on the blog the next morning. I don't think about it.