Inference

Inference is where the model runs. The architecture decision that matters most in 2026 isn't open vs closed; it's where the inference happens - local on my hardware, on someone else's hardware via an open-weight API, or on a closed-source frontier provider's API. Each has different economics, latency, rate limits, and vendor risk.

The local tier handles the routine stuff. I run small open-weight models on hardware I control for the unsexy 90% of work - formatting, summarizing, simple reasoning, extraction. Running it costs electricity. No rate limits. No data leaving the box. No vendor swap changing my system overnight.

The middle tier - hosted open-weight or open-leaning models - handles long-context work. Things that don't fit on local hardware but don't justify frontier-API prices. Today that means Kimi-class and DeepSeek-class models. The exact names will change faster than the architecture does.

The frontier tier handles the genuinely hard work. Multi-strategy synthesis, novel reasoning, the queries where the gap between a good model and the best available model is the difference between a draft that ships and one that needs rewriting.

The skill that matters most in AI engineering right now isn't agent-building - it's routing intelligence to the right inference tier. Knowing what each tier costs, what each is good at, and making sure the right query hits the right model.

Default to the cheapest tier that produces acceptable output. Escalate when the work demands it. Fall back automatically when rate limits or cost ceilings hit.

Used in

Digby

Atlas

Polymarket

Kairos