Most AI buying decisions are framed as a feature comparison. They shouldn't be. The real question is: what's the smallest model that does the job, at the latency you need, on the budget you have, without becoming a compliance problem?
Here's the framework we use to answer it.
The short version
- Use a large language model (LLM) when the task benefits from broad world knowledge, long context windows, and strong reasoning — and you can absorb the cost and latency.
- Use a small language model (SLM) when the task is narrow, the data is private, the latency has to be tight, or the cost per call has to be near zero.
- Use a hybrid — route easy traffic to an SLM, escalate hard traffic to an LLM — when the distribution of work is lopsided. Most production workloads are.
What actually changes between them
Cost
An LLM call typically costs two to three orders of magnitude more than an SLM call. On a support bot that handles 100,000 conversations a month, the difference between a 7B-parameter SLM and a frontier LLM is the difference between a $300/month bill and a $30,000 one. Unit economics matter.
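That arithmetic is worth making explicit. A minimal sketch — the per-conversation prices below are illustrative assumptions matching the figures above, not vendor quotes:

```python
# Illustrative unit economics for a support bot.
# Prices per conversation are ASSUMPTIONS for the example,
# not quotes from any specific provider.
CONVERSATIONS_PER_MONTH = 100_000
SLM_COST_PER_CONV = 0.003  # e.g. a self-hosted 7B model, amortized GPU cost
LLM_COST_PER_CONV = 0.30   # e.g. a hosted frontier API, multi-turn conversation

def monthly_cost(per_conversation: float,
                 volume: int = CONVERSATIONS_PER_MONTH) -> float:
    """Monthly bill at a given per-conversation cost and call volume."""
    return per_conversation * volume

slm_bill = monthly_cost(SLM_COST_PER_CONV)  # ≈ $300/month
llm_bill = monthly_cost(LLM_COST_PER_CONV)  # ≈ $30,000/month
```

At these (assumed) prices the gap is a flat 100×, which is why the decision is driven by volume, not by a per-call price that looks small in isolation.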
Latency
LLMs on hosted APIs run 600–1,500ms per generation. An SLM on a GPU you own runs in 30–90ms. If the model sits inside a user-facing workflow — chat, autocomplete, voice — that delta changes the product.
Context and reasoning
Frontier LLMs reason better at the edge of a task. If the job involves weighing evidence across pages of text, following multi-step instructions, or being robust to adversarial prompts, the LLM earns its cost. Most internal tools don't need this.
Privacy and control
SLMs can run on your hardware, in your VPC, behind your auth. For healthcare, legal, finance and government, this alone often decides the choice.
The decision matrix
| If your problem is… | Start with… |
| ------------------------------------ | ----------------- |
| Open-ended reasoning, long context | LLM |
| Classification, extraction, routing | SLM |
| On-device / offline | SLM |
| High-volume (>1M calls/month) | SLM or hybrid |
| Multilingual, low-resource languages | LLM |
| Regulated data, strict residency | SLM in-VPC |
| Creative, long-form generation | LLM |
What we recommend, in practice
- Prototype the happy path on a frontier LLM. Prove the task is possible.
- Measure. What fraction of queries actually need the big model? On most business problems, it's less than 20%.
- Distill. Fine-tune a 7B-or-smaller model on the LLM's outputs for the 80% case.
- Route. Keep the LLM as fallback for the hard 20%.
- Monitor. Re-check the split every quarter — SLMs are catching up fast.
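The route step above reduces to a few lines. A minimal sketch, assuming your SLM client can return a calibrated confidence score alongside its answer — the `Answer` type, `slm_generate`, `llm_generate`, and the 0.8 threshold are placeholders, not a specific API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    confidence: float  # calibrated probability that the SLM's answer is correct

def route(query: str,
          slm_generate: Callable[[str], Answer],
          llm_generate: Callable[[str], str],
          threshold: float = 0.8) -> tuple[str, str]:
    """Try the small model first; escalate to the LLM below the threshold.

    Returns (answer_text, model_used) so the split can be monitored.
    """
    draft = slm_generate(query)
    if draft.confidence >= threshold:
        return draft.text, "slm"
    return llm_generate(query), "llm"
```

Logging which branch each query took is what makes the quarterly re-check possible: if the LLM share drifts below your estimate, lower the threshold or retrain the SLM on the escalated traffic.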
This is how you get a production system that costs what it should, runs fast, and still handles the hard cases with grace.
When to call us
If you're evaluating a vendor pitch, comparing quotes, or trying to figure out whether your current stack is over-engineered, get in touch. We'll look at the workload, the data and the economics, and tell you the honest answer — which is sometimes that you don't need us.