The case for small models on your own machine

After six months running everything locally, here's what I learned about latency, privacy, and the surprising creativity of constrained systems.

Yavari

April 30, 2026 · 8 min read

For the last six months I've run my entire writing pipeline on local models. No API keys, no per-token billing, no quiet panic when an outage hits mid-draft. The trade-offs were not what I expected. I went in thinking I was giving up capability to buy privacy. What I found was stranger: the constraints were doing some of the work I'd been paying a much larger model to do badly.

Latency is a feature

When the model is on the same machine, the round trip becomes invisible. You stop thinking of it as a service and start thinking of it as a tool. Tools that are always available become part of how you think.

ollama run llama3.2 "summarize this paragraph in one sentence"

That command takes 240ms on my laptop. It's faster than opening a browser tab. That changes what I'm willing to use the model for. A cloud call has a cost you feel: the half-second of latency, the token meter ticking, the small friction of having shipped your sentence to someone else's computer. You ration it. A local call has none of that, so you stop rationing, and a tool you reach for fifty times a day is a different tool than one you reach for five.
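A number like that is easy to check on your own machine. Here is a minimal sketch, assuming an Ollama server is running on its default local port and that you query its HTTP API rather than the CLI. The helper name total_ms is hypothetical. It reads the total_duration field that the generate endpoint reports in nanoseconds and prints whole milliseconds. This is server-side time, so it will not match a stopwatch on the full command exactly, and a cold first call also includes model load time.

```shell
# total_ms: read an Ollama /api/generate JSON response on stdin and print
# its total_duration field (reported in nanoseconds) as whole milliseconds.
# Hypothetical helper; uses only grep, sed, and awk, so no jq is required.
total_ms() {
  grep -o '"total_duration": *[0-9]*' \
    | head -n 1 \
    | sed 's/.*: *//' \
    | awk '{ printf "%d\n", $1 / 1000000 }'
}

# Usage (assumes an Ollama server listening on localhost:11434):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"llama3.2","prompt":"summarize this paragraph in one sentence","stream":false}' \
#     | total_ms
```

Run it a few times in a row: the first call pays for loading the model, and the later ones show the latency you actually live with.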

What actually runs

Specifics, because this is where credibility lives and vagueness is where these pieces go to die. The setup is deliberately boring:

# the daily driver
ollama run qwen3:8b

# the heavier lift, when I want it
ollama run qwen3:14b

Everything runs through Ollama on a Mac with 32GB of unified memory. The 8B model in a Q4_K_M quantization is what handles the constant small jobs: summarize this, tighten that, name this function, is this sentence doing two things at once. It loads in under a second and answers in the time it takes me to move my hands back to the keyboard. The 14B comes out for anything that needs a longer leash, and it's slower in a way I notice but rarely mind.

Two details matter more than the model choice. The first is quantization. Q4_K_M is the four-bit format most people land on, and the newer quantization-aware-trained variants now hold accuracy at low bit-width far better than the naive post-training quants did a year ago, which is most of why an 8B model in 2026 feels nothing like an 8B model felt in 2024. The second is that none of this is exotic anymore. A $600 Mac Mini runs 14B models comfortably. The hardware bar for "good enough to live on" quietly dropped below the price of a phone.
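The arithmetic behind that hardware bar is short enough to sketch. Weight memory is roughly parameter count times bits per weight, divided by eight. The helper below, weights_gb, is a hypothetical name, and the estimate is deliberately rough: it counts weights only, ignores the context cache and runtime overhead, and treats Q4_K_M as a flat four bits, when real files run somewhat larger because the format stores scaling data alongside the weights.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough size of a model's weights in GB (10^9 bytes):
#   billions of parameters * bits per weight / 8 bits per byte.
# Assumption: weights only; no context cache, no runtime overhead.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Usage:
#   weights_gb 8 16    # an 8B model at 16 bits per weight
#   weights_gb 8 4     # the same model at a flat 4 bits
#   weights_gb 14 4    # the 14B model at a flat 4 bits
```

By this estimate an 8B model drops from about 16 GB at 16-bit precision to about 4 GB at four bits, and a 14B model needs about 7 GB, which is the difference between needing a workstation and fitting on an ordinary desktop.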

The generalist's tax

Here's the thing nobody says about the big frontier models: most of what we call prompt engineering is a tax we pay for their bigness.

A model trained on everything knows a little about everything, which means that out of the box it answers everything in the blandest, most median way it can. Ask a 400-billion-parameter generalist to review a function and you get the average of every code review on the internet. So we learned to compensate. We write the long preamble, the "you are a senior Rust engineer who values," the three examples, the "do not be verbose," the careful fencing of the question. Prompt engineering is, mostly, the work of dragging a generalist back toward the specific thing you actually wanted. The model is general; your need is not; you pay the difference by hand, every time.

A specialist is the other way around. It already lives where you live. When the model only knows a few things, but knows them at the level you work at, the preamble evaporates. You stop instructing it on who to be because it was only ever one thing.

This stopped being a hobbyist's hunch in 2025. NVIDIA researchers put it plainly in a paper titled, with no hedging, "Small Language Models are the Future of Agentic AI." Their argument is that the systems we're actually building, the agents that do a small number of specialized tasks over and over, are exactly the wrong place for a generalist. Small models, they write, are "sufficiently powerful, inherently more suitable, and necessarily more economical" for many of the calls these systems make. The frontier model should be the exception you escalate to, not the default you start from. A separate study found a specialist small model reaching break-even with a much larger general one on text classification after roughly a hundred labelled examples. A hundred. That's an afternoon of work to match a model a thousand times its size at the one job you care about.

The day-to-day reality of being a developer or a researcher is not open-ended. It is a few specific things, done many times, that you want done with precision. That is the exact shape a small specialist fits.

Constraint keeps it honest

The deeper surprise was about hallucination, and it took me a while to trust it.

A constrained model drifts less. When a model knows less, it has fewer plausible-but-wrong directions to wander off in. The giant generalist's failure mode is confident invention: asked something just outside what it knows, it fills the gap with the most statistically agreeable fiction, because somewhere in its training there's always a pattern that fits the shape of the question. A smaller model that lives closer to a single domain has less room to confabulate. It stays nearer to what was actually asked, partly because it can't reach as far.

I don't have a clean benchmark for this and I'm wary of dressing a suspicion up as a finding. But it rhymes with something humans have always known about themselves. We have spent centuries inventing constraints to think better inside of. The writer who locks the door and unplugs the router. The mathematician who takes the same walk every day so the only variable left is the problem. The poets who courted isolation and even melancholy, not because suffering is productive but because a narrowed world is a more honest one. Remove the infinite menu of things you could be attending to and what's left is the thing in front of you.

Constraint isn't the price of focus. Constraint is the mechanism.

A model with the whole internet in its head is, in a sense, never locked in the room. It can always reach for one more association. The small model on my laptop is locked in the room with my actual question, and most days that's exactly where I want it.

This is also where privacy stops being a paranoid afterthought and becomes a property of the system. Nothing I draft leaves the machine. There's no setting to misconfigure, no retention policy to read, no quiet clause about training on my inputs. That changes what I'm willing to write with the model in the loop: the half-formed idea, the thing about a person, the draft I'd never paste into a box owned by someone else. The privacy isn't a feature I turned on. It's just what's true when the computer doing the work is the one on your desk.

What it costs

None of this is free, and a piece that only lists upsides is marketing.

The frontier model is genuinely better at the open-ended things: the cold-start research question into a field I know nothing about, the long document that needs the whole context held at once, the leap that requires having read more than any specialist ever will. When I hand a small model a problem that's actually broad, it doesn't gracefully admit the limit. It does the confident-invention thing, same as the big one, and I have to catch it. The constraint that keeps it honest on narrow work becomes a blindfold on wide work.

So the setup isn't local-or-cloud. It's local-by-default, cloud-on-escalation, which is exactly the heterogeneous shape the NVIDIA paper describes. The small model handles the constant stream of specific, bounded jobs, and a few times a week I reach past it to something larger for the thing that genuinely needs a generalist. The skill that's replaced prompt engineering, for me, is knowing which is which: feeling the moment a task stops being narrow.
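A minimal sketch of that shape, assuming the two model tags from the setup above. The function names pick_model and ask are hypothetical, and so is the idea of naming the escalation level on the command line. The cloud level deliberately calls nothing: the sketch only refuses and tells you to escalate by hand, because deciding when a task has stopped being narrow is the part that stays with you.

```shell
# pick_model LEVEL: map an escalation level to a model tag.
# Defaults to the small local model when no level is given.
pick_model() {
  case "${1:-small}" in
    small) echo "qwen3:8b" ;;
    big)   echo "qwen3:14b" ;;
    cloud) echo "cloud" ;;
    *)     echo "unknown level: $1" >&2; return 1 ;;
  esac
}

# ask LEVEL PROMPT...: run the prompt on the local model for that level.
# Returns 2 without calling anything when the level is "cloud".
ask() {
  ask_level="$1"
  shift
  ask_model=$(pick_model "$ask_level") || return 1
  if [ "$ask_model" = "cloud" ]; then
    echo "not a local job, escalate by hand: $*" >&2
    return 2
  fi
  ollama run "$ask_model" "$*"
}

# Usage:
#   ask small "name this function: it retries a request with backoff"
#   ask big "tighten this paragraph without changing its claims: ..."
```

The point of the design is that the cheap path is the one with the least typing, so escalation is always a choice you make on purpose.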

Six months in, the trade I thought I was making (capability for privacy) turned out to be the wrong frame entirely. I wasn't trading down. I was trading a tool that does everything adequately for one that does my few things well, and discovering that for most of a working day, that's not a compromise. It's the better tool.

Sources

Small Language Models are the Future of Agentic AI. Peter Belcak, Greg Heinrich, Shizhe Diao et al., NVIDIA Research, June 2025. arxiv.org/abs/2506.02153
Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance. arXiv preprint, 2024. arxiv.org/pdf/2402.12819
Best Local LLM Models 2026 (hardware bar; $600 Mac Mini running 14B). SitePoint, 2026. sitepoint.com
Gemma 3 quantization-aware-trained variants. Ollama model library. ollama.com/library/gemma3
Ollama, local model runtime. ollama.com

Thanks for reading. If this resonated, the easiest way to support the project is to forward it to one person who'd like it, or to subscribe to the letter.

