pick the constraint first
Teams waste a shocking amount of time debating model hosting before they’ve named the actual constraint, which is usually one of four things: p95 latency, cost per 1M tokens, data handling requirements, or plain old team capacity. If you don’t pin that down first, you’ll end up with the classic startup mistake: buying a GPU box because it feels strategic, then discovering three weeks later that your traffic pattern is spiky, your prompts are tiny, and OpenAI or Anthropic would have been cheaper for the next twelve months.
The decision gets simpler once you stop treating inference as a branding exercise and start treating it like a queueing problem with ugly economics. A support copilot doing 0.2 requests/sec during the day and basically nothing at night has completely different physics than a document extraction pipeline chewing through 80,000 PDFs in a batch window. Interactive SaaS products care about p95 and p99, offline jobs care about throughput and token cost, regulated systems care about data residency and vendor contracts, and every single one of those pushes you toward a different answer.
At Steezr we’ve seen all three paths make sense: hosted APIs for early AI salesperson workflows where product teams need to test prompts fast, local GPU serving for document processing where token volume is predictable and privacy matters, and CPU-first pipelines for internal ERP features where nobody needs a reply in 700 ms and cost discipline matters more than benchmark screenshots. The wrong move isn’t choosing API, GPU, or CPU. The wrong move is choosing one before you’ve written down your expected requests per second, average input tokens, average output tokens, p95 target, and who’s getting paged when vLLM starts throwing OOMs at 02:13.
Write the matrix first. Then spend money.
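The "write it down" step can be as small as a dataclass. This is a sketch, not a prescribed schema: the fields mirror the numbers named above, and the example workload figures (0.2 req/s, 400/150 tokens) are assumptions standing in for the support copilot mentioned earlier.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    """The numbers to pin down before debating hosting.
    Every field is an estimate you commit to in writing."""
    sustained_rps: float   # steady-state requests per second
    burst_rps: float       # worst expected spike
    avg_input_tokens: int
    avg_output_tokens: int
    p95_target_ms: int     # latency budget for interactive paths
    interactive: bool      # False for batch/async pipelines
    owner: str             # who gets paged at 02:13

    def tokens_per_day(self) -> float:
        # Sustained rate dominates daily volume; bursts matter
        # for capacity sizing, not for the monthly bill.
        per_request = self.avg_input_tokens + self.avg_output_tokens
        return self.sustained_rps * 86_400 * per_request

# A copilot-style workload with assumed numbers: 0.2 req/s, small prompts.
copilot = WorkloadSpec(0.2, 3.0, 400, 150, 2_000, True, "nobody yet")
print(f"{copilot.tokens_per_day():,.0f} tokens/day")  # → 9,504,000 tokens/day
```

Filling in seven fields takes ten minutes and ends most hosting debates before they start.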
where hosted APIs win
Hosted APIs win hard at low and moderate volume, especially during product discovery, because they collapse a pile of operational work into a bill and an SDK call. That sounds obvious, yet people still underestimate how much infrastructure work hides behind a simple client.responses.create(...). Retries, autoscaling, model updates, tokenizer quirks, speculative decoding tricks, capacity planning, rollout safety, abuse filtering, observability, vendor peering, all of that becomes somebody else’s problem, which is exactly what you want while you’re still figuring out whether users even like the feature.
A decent rule of thumb: if your product is below roughly 1 request/sec sustained, or below 5 to 10 million total tokens per day, hosted APIs are usually the cheapest full-stack decision once you include engineer time. That threshold moves depending on model class and prompt shape, though it’s a useful place to start. If your average request is 1,500 input tokens and 300 output tokens, and you’re doing 100,000 requests per month, the direct model bill may sting, yet it still won’t sting as much as one senior engineer burning a month on self-hosting, eval drift, load tests, and waking up to "CUDA out of memory. Tried to allocate 1.53 GiB."
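The arithmetic behind that claim fits in a few lines. The per-token prices and the loaded engineer-month cost below are illustrative placeholders, not quotes from any provider or payroll; only the request and token counts come from the paragraph above.

```python
# Back-of-envelope API bill vs. one engineer-month of self-hosting work.
requests_per_month = 100_000
input_tokens, output_tokens = 1_500, 300

price_in_per_1m = 3.00    # $/1M input tokens, assumed
price_out_per_1m = 15.00  # $/1M output tokens, assumed

api_bill = (
    requests_per_month * input_tokens / 1e6 * price_in_per_1m
    + requests_per_month * output_tokens / 1e6 * price_out_per_1m
)
print(f"API bill: ${api_bill:,.0f}/month")  # → API bill: $900/month

# One senior engineer-month, fully loaded (assumed figure).
engineer_month = 15_000
print(f"Months of API spend per engineer-month: {engineer_month / api_bill:.1f}")
```

At these assumed prices, one engineer-month buys well over a year of the API bill, which is the whole point.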
Latency is another place where APIs often beat self-hosted setups early on. The major providers have optimized token streaming aggressively, and for chat-style UX that matters more than raw tokens/sec in a benchmark sheet. You can get first-token latency that feels good enough without babysitting vLLM 0.5.x, TensorRT-LLM, or a homegrown autoscaler. Contractually, if you need customer-facing uptime commitments before you’ve built an SRE habit, a provider SLA plus graceful fallbacks is usually safer than promising your own 99.9% on a single A100 sitting in one rack.
The trap is assuming API economics stay acceptable forever. They don’t. At sustained traffic, token bills turn into rent.
when GPUs earn their keep
A dedicated GPU stack starts making financial sense once traffic is steady enough that you can keep the card busy for long stretches, and once your prompts are large enough that per-token API pricing is doing real damage. For many startups, the practical breakpoint lands around 30 to 50 million tokens per day for one core workload, sometimes earlier if prompts are long and predictable, sometimes later if traffic is bursty and your idle time is awful. Below that, the spreadsheet usually lies to you because it ignores ops cost. Above that, API margin starts looking absurd.
Use a concrete example. Say you rent an NVIDIA L4 instance for around $0.70 to $1.10 per hour depending on region and committed usage, or an A10G around $1.20 to $1.80, or an H100 north of sanity. Run a 7B or 8B instruct model quantized to 4-bit through vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1 --gpu-memory-utilization 0.92, and you might get somewhere between 80 and 180 output tokens/sec depending on batch shape, context length, and how badly your workload fragments KV cache. If that card is busy for most of the day, your effective cost per 1M tokens can beat hosted APIs by a wide margin.
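The "busy card beats the API" claim is easy to check with the numbers above. The hourly rate and decode throughput below are picked from the middle of the ranges just quoted, and the utilization figure is an assumption; the result swings hard with all three.

```python
# Effective cost per 1M output tokens for a rented GPU, under assumed inputs.
gpu_hourly = 0.90     # $/hr, mid-range L4 rate from the text
decode_tps = 120.0    # output tokens/sec, mid-range of 80-180
utilization = 0.70    # fraction of the hour the card is actually decoding

tokens_per_hour = decode_tps * 3_600 * utilization
cost_per_1m = gpu_hourly / tokens_per_hour * 1e6
print(f"${cost_per_1m:.2f} per 1M output tokens")  # → $2.98 per 1M output tokens

# Halve utilization and the cost doubles. Idle time is the hidden variable,
# which is why bursty traffic quietly ruins the self-hosting spreadsheet.
```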
There’s a giant caveat: your tail latency gets ugly the moment you over-batch or let long prompts mix freely with short ones. Anyone who has watched p95 spike because one 12k-token summarization request got shoved next to fifteen tiny chat requests knows this pain. You need admission control, queue isolation by prompt size, and real monitoring. Basic dashboards should include queue depth, GPU memory pressure, time-to-first-token, decode tokens/sec, p95 by route, and failure counts split by timeout, OOM, and upstream cancellation. If you can’t instrument that, don’t self-host yet.
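Admission control and queue isolation sound heavyweight, but the core idea is small. This is a minimal sketch with illustrative thresholds, not a production scheduler: long prompts get their own queue so a 12k-token job can never wedge in front of interactive traffic, and a full queue sheds load instead of letting p95 explode.

```python
from collections import deque

# Thresholds are illustrative; tune them against your own prompt distribution.
LONG_PROMPT_TOKENS = 2_048
MAX_QUEUE_DEPTH = 64

short_queue: deque = deque()
long_queue: deque = deque()

def admit(request_id: str, prompt_tokens: int) -> str:
    """Route by prompt size, reject when the target queue is full."""
    queue = long_queue if prompt_tokens > LONG_PROMPT_TOKENS else short_queue
    if len(queue) >= MAX_QUEUE_DEPTH:
        return "rejected"   # shedding load beats silently destroying p95
    queue.append(request_id)
    return "long" if queue is long_queue else "short"

print(admit("chat-1", 350))          # → short
print(admit("summarize-1", 12_000))  # → long
```

Each queue then feeds its own batcher, so batch shape stays homogeneous and decode time stays predictable.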
GPU hosting also becomes the obvious move when data handling kills API usage. Some clients simply won’t sign off on third-party inference for contracts, healthcare notes, financial records, or internal legal documents. Then the argument changes. Cost matters, latency matters, compliance wins.
the underrated CPU middle
CPU batching is the option people ignore because it isn’t glamorous, and that’s exactly why it’s often the best startup choice. If the workload is asynchronous, if responses can take a few seconds, if you can group jobs by prompt template, and if a smaller quantized model does the job, CPUs can carry far more product surface area than the average founder expects. A lot of internal AI features don’t need frontier models. They need decent extraction, classification, reranking, templated drafting, or structured JSON output with low drama.
Take llama.cpp, Ollama, or text-generation-inference on a beefy EPYC machine with AVX-512 and plenty of RAM. A quantized 3B to 8B model in GGUF, often Q4_K_M or Q5_K_M, can handle batch document tasks, ticket triage, lead enrichment, or ERP note summarization at a very sane cost profile. You won’t get dazzling interactive latency for long generations. You will get infrastructure that doesn’t melt the moment a driver update lands. You also avoid the weird class of failures that haunt CUDA stacks, libcuda.so mismatches, NCCL WARN, kernel pinning, and hosts that reboot into the wrong driver version because somebody clicked upgrade in the cloud console.
The breakpoint where CPU batching wins is mostly about latency tolerance. If users can wait 2 to 10 seconds, or if the work runs fully async behind a job queue, CPU becomes pragmatic around a few hundred thousand to a couple million tokens per day, especially if the alternative is paying API rates for tasks where quality differences barely matter. We’ve used this approach for document processing pipelines where HTMX frontends just poll job status and nobody cares whether the backend used an L4 or a 32-core EPYC, they care that the invoice fields are extracted correctly and the monthly bill doesn’t become a board meeting topic.
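Whether one CPU box covers a batch workload is a one-minute calculation. The throughput figure below is an assumed number for a 4-bit-quantized small model on a many-core machine, not a benchmark result; plug in what your own hardware actually does.

```python
# Sanity check: can a single CPU worker cover the day's batch volume?
cpu_decode_tps = 25.0   # output tokens/sec, assumed for a Q4 7B on a big EPYC
batch_hours = 20        # hours/day the async worker is allowed to run

capacity = cpu_decode_tps * 3_600 * batch_hours
daily_need = 1_200_000  # tokens/day the pipeline must produce, assumed

print(f"capacity {capacity:,.0f} tokens/day, need {daily_need:,}")
print("fits" if capacity >= daily_need else "add a worker or rethink")
# → capacity 1,800,000 tokens/day, need 1,200,000
# → fits
```

If the numbers don’t fit, horizontal scaling on CPU is often still cheaper than the first GPU, because the workers are stateless and interchangeable.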
This path falls apart for real-time chat at scale. Keep it in its lane.
a matrix that holds up
Use a simple decision matrix, not a ten-tab financial model that pretends your assumptions are precise. Start with sustained requests/sec and burst requests/sec, because average traffic hides disaster. Add average input and output tokens, then classify the workload as interactive or async. Add p95 target. Add monthly engineering hours you’re honestly willing to spend on inference ops. That last one matters more than most CTOs want to admit.
A practical version looks like this in plain English. Under 1 sustained request/sec, bursty traffic, p95 under 2 seconds, uncertain product fit, choose hosted APIs. Between 1 and 10 sustained requests/sec with moderate token counts, and strict privacy or vendor constraints, evaluate one GPU service with vLLM and a fallback API path. Above 10 sustained requests/sec on a narrow workload with stable prompts, GPUs usually deserve serious analysis because utilization can stay high enough to crush API cost. Async pipelines with loose latency and predictable templates should trigger a CPU experiment first, especially if a 3B or 7B model meets quality bars after task-specific prompting or light fine-tuning.
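The plain-English rules above can be encoded directly, which also forces you to notice where they’re ambiguous. The thresholds come straight from the text; everything about your workload is an input, and the string outputs are just labels, not a framework.

```python
def recommend(sustained_rps: float, interactive: bool,
              strict_privacy: bool, stable_prompts: bool) -> str:
    """The decision matrix from the text, rules checked in order."""
    if not interactive and stable_prompts:
        return "run a CPU experiment first"
    if sustained_rps < 1:
        return "hosted API"
    if sustained_rps <= 10 and strict_privacy:
        return "one GPU service with vLLM plus a fallback API path"
    if sustained_rps > 10 and stable_prompts:
        return "serious GPU analysis"
    return "hosted API until a rule above fires"

print(recommend(0.3, True, False, False))  # → hosted API
print(recommend(40, True, False, True))    # → serious GPU analysis
```

The point isn’t the function, it’s that every input is a number or a yes/no you must actually write down.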
For SLAs, don’t promise a number you can’t observe. If your contract says 99.9% monthly availability and sub-3s p95 for inference-backed features, you need synthetic checks, multi-AZ or multi-provider failover, budget alerts, and a degraded mode. A degraded mode can be embarrassingly simple, switch to a smaller hosted model, disable streaming, cap max output tokens at 256, or queue noncritical requests. Customers prefer a slower system to a dead one.
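A degraded mode really can be this small. The sketch below shows the shape, nothing more: the model name is a placeholder and the parameter dict is a generic request, not any provider’s actual SDK schema.

```python
def apply_degraded_mode(params: dict, degraded: bool) -> dict:
    """Cap a generic inference request before it reaches any provider."""
    if not degraded:
        return params
    capped = dict(params)
    capped["model"] = "small-fallback-model"  # placeholder name, not a real model
    capped["stream"] = False                  # disable streaming under load
    capped["max_output_tokens"] = min(capped.get("max_output_tokens", 256), 256)
    return capped

normal = {"model": "big-model", "stream": True, "max_output_tokens": 1024}
print(apply_degraded_mode(normal, degraded=True))
# → {'model': 'small-fallback-model', 'stream': False, 'max_output_tokens': 256}
```

Wire the `degraded` flag to a budget alert or a synthetic-check failure, and the switch happens before a human is even awake.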
Monitoring should be boring and ruthless. Prometheus, Grafana, Sentry, structured logs with request IDs, and traces that cross the app boundary into the inference service. Record token counts on every request. If you don’t, your cost model is fiction. Sample log lines should tell you route, model, input tokens, output tokens, queue wait, first-token latency, total latency, and result status. One line, one request, no excuses.
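"One line, one request" is easiest to enforce with a single helper that every route must go through. A minimal sketch, with field names taken from the list above and example values that are made up:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_request(**fields) -> str:
    """Emit exactly one structured JSON line per inference request."""
    line = json.dumps(fields, separators=(",", ":"))
    logging.info(line)
    return line

# Illustrative values; in practice these come from your inference adapter.
log_request(
    request_id="req-8c1f", route="/extract", model="llama-3.1-8b",
    input_tokens=1480, output_tokens=312, queue_wait_ms=40,
    first_token_ms=210, total_ms=2350, status="ok",
)
```

Because token counts are in every line, the cost model becomes a log query instead of a guess.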
my default recommendation
Most startups should start with APIs, design the product so the inference provider is swappable, then move only the hot path to self-hosted GPU or CPU once real traffic justifies it. That means building a thin inference adapter from day one, with request and response schemas you control, plus hard limits on prompt size, output size, timeout, and retries. Don’t smear provider-specific options across your codebase. That decision will punish you later.
A clean interface in a Next.js or Django app is boring to write and saves months. One service boundary, one metrics wrapper, one place to enforce guardrails. If you need OpenAI now and vLLM later, the app shouldn’t care. If you need Anthropic for one route and local Llama 3.1 for another, same story. Teams that skip this end up doing a frantic rewrite right when traffic grows and nobody has time.
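The adapter boundary can be sketched in a few dozen lines. Everything here is an assumption about shape, not a library: the schemas, the limits, and the `EchoBackend` stand-in exist only to show where provider-specific code stops and your code starts.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class InferenceRequest:
    prompt: str
    max_output_tokens: int = 512
    timeout_s: float = 30.0

@dataclass
class InferenceResponse:
    text: str
    input_tokens: int
    output_tokens: int

class InferenceBackend(Protocol):
    """Anything that can serve a request: OpenAI, Anthropic, vLLM, llama.cpp."""
    def generate(self, req: InferenceRequest) -> InferenceResponse: ...

MAX_PROMPT_CHARS = 32_000  # illustrative limit

def run(backend: InferenceBackend, req: InferenceRequest) -> InferenceResponse:
    # Hard limits live here, once, instead of scattered across call sites.
    if len(req.prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too large")
    req.max_output_tokens = min(req.max_output_tokens, 1024)
    return backend.generate(req)

class EchoBackend:  # stand-in; swap for a real provider client behind the Protocol
    def generate(self, req: InferenceRequest) -> InferenceResponse:
        return InferenceResponse(req.prompt.upper(), len(req.prompt) // 4, 8)

print(run(EchoBackend(), InferenceRequest("draft a follow-up email")).text)
```

Swapping providers then means writing one new class, not grepping the codebase for SDK calls.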
My bias is simple. Use hosted APIs for uncertain products and customer-facing chat where iteration speed matters. Use GPU serving for stable, heavy workloads where token spend dominates and you can keep utilization high. Use CPU batching for async features that need acceptable quality at a sane price. Avoid ideology. Avoid benchmark worship. Avoid buying hardware because a competitor mentioned H100s on a podcast.
You need a cost model, a latency target, and an honest answer to one uncomfortable question: who is going to own this system at 3 a.m.? If the answer is vague, buy the API. If the answer is a named engineer with observability, runbooks, and a fallback path, then self-hosting can be a very good deal.
That’s the whole matrix.
