The Local-vs-API Decision · Lucas Herrero

The most important architectural choice in AI engineering today isn't whether you use open-source or closed-source models. It's where the computation happens.

Most commentary on AI infrastructure treats this as the open-vs-closed debate in different clothes. It's not. Open vs closed is a question about who owns the weights. Local vs API is a question about where intelligence runs - on your hardware or on someone else's, billed by the call or paid for once. Those are different decisions with different consequences, and not being able to differentiate them is how most builders end up with infrastructure that doesn't scale economically.

The right answer to local vs API is almost always all three tiers, routed by job. The skill that matters in 2026 isn't building agents. It's knowing how to route intelligence to the right tier.

The three tiers

There are three meaningful model categories in 2026, and most commentary collapses them into two.

The first category is local models - open-weight models small enough to run on your own hardware. Gemma variants, the smaller Llamas, the smaller Mistrals. These run on consumer or near-consumer hardware. You own the weights and the inference both.

The second category is open-weight models served via API - models whose weights are public and self-hostable in principle, but which most people access through a hosted API because the model is too large to run locally. Today, Kimi-class and DeepSeek-class models sit here: open-weight or open-leaning economics, hosted on infrastructure someone else operates, often at meaningfully lower cost than the closed-source frontier. The names will change. The tier is what matters.

The third category is closed-source frontier models served via API - the best current models from Anthropic, OpenAI, Google, and whoever else is at the frontier this quarter. The weights aren't public; you call them through a vendor and pay per token. These are still meaningfully better at certain hard tasks, and their pricing reflects that.

Most discourse treats AI as a binary - open vs closed, or local vs API. The picture is three-tiered. The middle category - open-weight via API - is the one that gets miscategorized most often. It's the category that's grown the fastest and is changing the economics most. A reader who only thinks in two categories is missing half the landscape where pricing is actually moving.

Why the local tier matters for high-volume routine work

I run my personal infrastructure 24/7. The whole stack - agents, orchestration, scheduled work, market scanning, communication routing - costs me five to ten dollars a month to operate. If I routed every query to a frontier closed-source model, the same throughput would cost me thousands a month. The architecture that makes the difference is tiered routing: routine queries hit a local small model; long-context work auto-escalates to a cheaper hosted open-weight tier; multi-strategy synthesis and genuinely hard reasoning run on frontier APIs on demand.

Yes, I'm running Gemma 2 7B and not the latest Gemma 4. I'll upgrade when I have a reason to. The local tier doesn't need to be the newest model - it needs to be reliable and good enough for the work it does. Gemma 2 7B has been good enough for two years. The architecture is what's permanent; the specific model occupying each tier is not. Chasing the frontier on the local tier means re-tuning the system every few months for marginal gains. The cost of constant churn outweighs the marginal quality improvement on the routine work this tier handles.

The cost math is the most obvious argument for going local on this tier, but it's not the only one. Local inference has no rate limits. No vendor dependence. No data leaving my hardware. No quiet model swap from the provider that changes how my system behaves overnight. The latency is whatever my hardware is willing to give me, not whatever the API queue happens to be. For high-volume routine work - the unsexy 90% of any real system - local wins on every dimension that matters except absolute output quality, and the gap on output quality is small enough to not matter for the queries that hit this tier.

Why the open-weight API tier matters for the middle of the distribution

The middle category is the most underappreciated. Open-weight models served via API let you run inference at a scale your hardware can't support, on models good enough to handle real work, at prices that aren't frontier-API prices. Today, Kimi-class models hold their own against frontier models on some coding workloads at a fraction of the cost. DeepSeek-class models land in similar territory at even lower prices on certain workloads.

This tier is where most of the long-context work in my system lives. Anything that needs more than ~8,000 tokens of context - multi-document synthesis, large codebase navigation, anything that needs the model to hold a lot of relevant material - auto-escalates here. The local tier can't fit it. The frontier tier could handle it but would cost ten times more for output that's qualitatively similar on this kind of task.

The strategic point: this tier exists because the open-source/closed-source debate isn't actually about pricing or quality anymore. It's about whether the weights are public. Open-weight models served via API give you most of the benefits of the closed-source frontier (someone else handles the GPU costs, you don't maintain the infrastructure) at most of the price advantages of going local (much cheaper per token, no vendor lock-in to a specific company's roadmap). The middle category is doing more work in real systems than either pure open-source advocates or pure closed-source customers want to admit.

Why the closed-source API tier matters for hard low-volume work

The frontier-tier closed-source models still earn their price, but only for the work that actually needs them. Long-context synthesis where the system has to weigh six signals across five markets and produce structured reasoning over them. Novel multi-step problems where the smaller model gets stuck in a loop and the larger model just solves it. The kind of judgment work where the difference between Kimi's output and Opus's output is the difference between a draft that needs rewriting and a draft that ships.

The economic argument for the frontier tier is straightforward: when the work is rare and the stakes are high, paying for the best model is cheap relative to the cost of the worse output. A trade decision that depends on synthesizing complex signals against historical base rates is worth running through Opus, not Kimi, and not Gemma. The model cost is rounding error against the trade size.

The mistake isn't using frontier API models. The mistake is using them for everything.

Why routing - not picking a side - is the actual answer

Most AI engineering debates frame the question as a binary: open or closed, local or API, big model or small model. The framing is wrong. Real systems use all of them, routed by what the work actually requires. The discipline that matters is matching model to task - knowing which queries need frontier reasoning, which queries need long context but standard reasoning, and which queries are throwaway, then architecting the routing layer that decides.

This is treating models as a tier system, not a choice. Default to the cheapest tier that produces acceptable output. Escalate when the work demands it. Auto-fall when rate limits or cost ceilings hit. The architecture is what's permanent; the specific models occupying each tier are not. When something better drops on the open-weight side, the relevant tier upgrades. When the frontier moves, the synthesis tier upgrades. The router stays.

The compounding effect is what makes this work. Every dollar saved on routine work funds the API tier for the work that actually needs it. A system that routes intelligently can spend ten dollars a month on routine throughput, fifty dollars on the open-weight API tier for long-context work, and a hundred dollars on the closed-source frontier for the hard queries that justify the spend. A system that doesn't route - that pipes everything through the frontier API - spends thousands and produces the same output for the routine work, just slower and more expensively.

The counter-arguments

Three serious counters deserve response.

The reliability counter. Frontier models are better at hard reasoning. True - but most queries don't need frontier intelligence. The long tail of routine work is where local output is indistinguishable from frontier output. The gap shows on hard queries, and those are exactly what the routing layer escalates.

The hardware counter. Most people don't have local inference hardware. True - but the bar is dropping fast. A Mac mini handles the smaller open-weight models; the hardware investment pays for itself against API costs in months, not years, for any serious 24/7 workload.

The economic counter. API costs are dropping faster than hardware costs. True for low-volume work. Not true for sustained 24/7 inference, where the same workload on local hardware costs electricity and the breakeven is fast.

None of these counters argues against routing. Each argues for it - routing is what puts each tier where its strengths matter.

What this means for the engineering muscle

The crucial skill in AI engineering right now isn't building agents. Anyone can build an agent. The crucial skill is routing intelligence to the right model - knowing what each tier is good at, what each tier costs, and how to architect the system so the right query hits the right model every time.

This is unglamorous work. It's not the part of AI engineering that gets written about. It's the part that determines whether the system you build can scale economically or quietly bankrupts itself the first month it goes to production. The next generation of real AI infrastructure will be built by people who internalize this and shipped by people who route well. The people who default to frontier API for everything will keep shipping demos that don't survive contact with real workloads.

The local-vs-API split isn't the open-source/closed-source debate in different clothes. It's a deeper architectural question: where does intelligence live, and at what cost? The engineers who get this right will build systems that scale. The ones who don't will spend their way out of profitability.

Routing is the skill.