request pricing is lazy
Per-request pricing survives because it's easy to explain, easy to meter, and easy to drop into Stripe, not because it's a sane model once customers start building serious automation. The first hundred requests look harmless. Then somebody wires your API into a queue consumer, or their ERP sync job, or an LLM agent loop that retries on every malformed response, and suddenly your invoice turns into a support ticket with screenshots and legal cc'd in. I've seen this pattern in document processing systems and AI workflows: the unit of customer value had almost nothing to do with request count, yet request count was the thing getting billed, because the team shipped billing last and picked the fastest metric.
That choice creates a bad product. Customers start optimizing for fewer calls instead of better outcomes, which pushes them toward giant batch requests, oversized payloads, ugly client-side caching, and fragile orchestration code written purely to dodge billing. Meanwhile you get punished for efficiency work. If you cut latency in half, reduce CPU time by 40%, or move an endpoint off a GPU box onto a cheap CPU pool, your costs go down and revenue doesn't track with it. That's backward.
Compute is the cleaner unit. If a job burns 2.4 CPU-seconds, charge for 2.4 CPU-seconds. If an embedding run uses 180k input tokens and 12k output tokens, meter that directly. If an OCR pipeline grabs a GPU for 900 milliseconds and then spends 4 seconds on a worker parsing tables, split the meter into GPU-seconds and worker-seconds. Customers can still forecast, you can still price simply, and the bill maps to actual resource consumption instead of an arbitrary HTTP event.
This matters even more for AI products, because one request can mean wildly different things. A short prompt against gpt-4o-mini and a 200-page PDF extraction against a vision model are both technically one request. Treating them as equal is nonsense.
pick meters that match cost
You don't need a PhD in cloud economics to do this well; you need one rule: the billable unit should correlate tightly with your own infrastructure cost and be legible enough that customers can reason about it. For a Django app pushing jobs through Celery on Kubernetes, that usually means CPU-seconds, memory-heavy worker-seconds if RAM is the constraint, GPU-seconds where applicable, storage in GB-months, and tokens if you're brokering third-party model usage. Keep the number of billable dimensions low, because every extra meter becomes a support burden.
A practical scheme for an AI document pipeline could look like this. API ingress stays free. Queueing stays free. Each job is billed for worker runtime rounded to 100 ms, GPU runtime rounded to 100 ms, model tokens at actual counted usage, and persisted document storage after a 30-day included window. A cheap PDF text extraction might consume 0.8 worker-seconds and no GPU. A scanned invoice with table detection might hit 1.2 GPU-seconds, 3.6 worker-seconds, plus 24k model tokens for classification and post-processing. That maps directly to cost.
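To make that mapping concrete, here's a small sketch that rates the scanned-invoice job above. The per-unit rates are assumptions for illustration (they match the example rates quoted later in this piece), not a recommendation:

```python
from decimal import Decimal

# Illustrative rates, assumed for this sketch: EUR per unit.
RATES = {
    "worker_second": Decimal("0.0020"),
    "gpu_second": Decimal("0.0180"),
    "token_1m": Decimal("1.80"),  # per million tokens
}

def job_cost(usage: dict) -> Decimal:
    """Sum metered quantities times their unit rates."""
    return sum((RATES[meter] * qty for meter, qty in usage.items()), Decimal("0"))

# The scanned invoice from the text: 1.2 GPU-seconds, 3.6 worker-seconds, 24k tokens.
scanned_invoice = {
    "gpu_second": Decimal("1.2"),
    "worker_second": Decimal("3.6"),
    "token_1m": Decimal("0.024"),  # 24k tokens expressed in millions
}
print(job_cost(scanned_invoice))  # roughly EUR 0.072
```

The cheap text extraction from the same paragraph would rate out at under a cent, which is exactly the spread a single "per request" price cannot express.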
Pricing tiers then become straightforward. Starter: 3,600 worker-seconds and 250k tokens per month included. Growth: 36,000 worker-seconds, 5M tokens, lower overage rate. Scale: committed spend, custom concurrency caps, reserved GPU pool. The important part is that quotas reflect resource budgets, not arbitrary call counts. Customers who write efficient prompts, de-duplicate work, and stop retry storms get rewarded. They should.
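A minimal sketch of how included quotas plus linear overages might settle at month close. The plan definitions and overage rates here are hypothetical:

```python
from decimal import Decimal

# Hypothetical plans: included quantities per meter, plus linear overage rates.
PLANS = {
    "starter": {
        "included": {"worker_second": Decimal("3600"), "token_1m": Decimal("0.25")},
        "overage": {"worker_second": Decimal("0.0030"), "token_1m": Decimal("2.40")},
    },
    "growth": {
        "included": {"worker_second": Decimal("36000"), "token_1m": Decimal("5")},
        "overage": {"worker_second": Decimal("0.0020"), "token_1m": Decimal("1.80")},
    },
}

def monthly_overage(plan_code: str, used: dict) -> Decimal:
    """Charge only usage beyond the included quota, linearly per meter."""
    plan = PLANS[plan_code]
    total = Decimal("0")
    for meter, qty in used.items():
        excess = max(qty - plan["included"].get(meter, Decimal("0")), Decimal("0"))
        total += excess * plan["overage"][meter]
    return total

# Starter account burning 5,000 worker-seconds and 200k tokens:
# 1,400 excess worker-seconds, tokens within quota.
print(monthly_overage("starter", {"worker_second": Decimal("5000"), "token_1m": Decimal("0.2")}))
```

Because every meter settles independently and linearly, a customer who fixes a retry storm sees the number drop the same month, with no cliff arithmetic to explain.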
I've got no patience for vendors who hide ugly cost structure behind credits with no fixed conversion. Publish the unit conversion. Say 1 worker-hour equals 3,600 credits if you're addicted to credits, or better, skip credits and show raw units on every invoice. People can handle numbers.
metering without guesswork
Instrumentation is where teams usually make this harder than necessary. You do not need a bespoke event pipeline on day one. Start with OpenTelemetry for traces and metrics, Prometheus for scraping, and one append-only usage table in PostgreSQL that records finalized billable events. Traces help you attribute cost across a request chain, Prometheus helps you watch aggregate behavior, and the usage table becomes the billing source of truth because invoices should not depend on a 15-second scrape interval.
For Python services, opentelemetry-sdk==1.25.0, opentelemetry-instrumentation-django==0.46b0, opentelemetry-instrumentation-celery==0.46b0, and prometheus-client==0.20.0 are perfectly fine starting points. Add a middleware that stamps account_id, job_id, and plan_code into the active span. In Celery, capture task_prerun and task_postrun, compute wall time, then persist billable worker-seconds after task completion so retries don't accidentally double count unfinished work.
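A sketch of what those signal handlers can look like. The record_usage placeholder and the exact kwargs are simplified; in a real app these functions would be connected via task_prerun.connect and task_postrun.connect from celery.signals, and the start times would live somewhere that survives worker restarts:

```python
import math
import time

_start_times = {}  # task_id -> monotonic start time (in-process for this sketch)

def record_usage(meter_code, quantity, idempotency_key):
    """Placeholder: in production this inserts a finalized usage event row."""
    print(f"usage {meter_code}={quantity} key={idempotency_key}")

def on_task_prerun(task_id=None, **kwargs):
    _start_times[task_id] = time.monotonic()

def on_task_postrun(task_id=None, state=None, **kwargs):
    started = _start_times.pop(task_id, None)
    if started is None or state != "SUCCESS":
        return None  # don't bill work that never finished cleanly
    elapsed = time.monotonic() - started
    billable = math.ceil(elapsed * 10) / 10  # round up to the 100 ms increment
    # Celery reuses the task id across retries, so this key bills each task once.
    record_usage("worker_second", billable, f"worker:{task_id}")
    return billable
```

Failed tasks deserve an explicit, documented policy rather than the silent non-billing shown here; the point of the sketch is that billable time is finalized once, at completion.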
A very normal schema looks like this:
create table usage_events (
    id bigserial primary key,
    account_id bigint not null,
    job_id uuid,
    meter_code text not null,
    quantity numeric(18,6) not null,
    unit text not null,
    occurred_at timestamptz not null,
    idempotency_key text not null unique,
    metadata jsonb not null default '{}'::jsonb
);
create index idx_usage_events_account_time on usage_events(account_id, occurred_at);

And a matching Prometheus metric set can be painfully boring, which is exactly what you want:
from prometheus_client import Counter, Histogram

worker_seconds = Counter(
    "billable_worker_seconds_total",
    "Finalized billable worker runtime",
    ["account_id", "plan_code"],
)
gpu_seconds = Counter(
    "billable_gpu_seconds_total",
    "Finalized billable GPU runtime",
    ["account_id", "model"],
)
task_duration = Histogram(
    "task_duration_seconds",
    "Task runtime",
    ["task_name"],
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60),
)

Two hard rules. First, every usage event needs an idempotency key, because billing bugs caused by duplicate consumers are embarrassing and expensive. Second, finalize usage at the boundary where work is complete and attributable. If you try to reconstruct invoices later from logs, you're going to end up explaining a missing line item with the sentence "retention expired in Loki," which is a sentence no customer should ever hear.
pricing people can predict
Founders often hear "compute-based pricing" and immediately create a pricing page only a procurement analyst could love, with six meters, twenty footnotes, and a calculator that needs a calculator. Don't do that. Internally, you can meter with precision. Externally, package that precision into two or three concepts customers already understand.
One model I like is included quotas plus explicit overages. For example, if you're selling an AI backoffice tool built on Next.js, Django, PostgreSQL, and a queue-backed worker fleet, your public plan could say: 10,000 documents/month included, up to 50 worker-hours and 2M model tokens, overages at €0.12 per worker-minute and €1.80 per million tokens. Under the hood, documents are just a reporting lens, billing still settles against worker time and tokens. That gives PMs a value metric to talk about and gives finance a cost metric that won't drift.
Another model works well for API-heavy products: a monthly platform fee plus usage buckets. Say €499 includes 20,000 worker-seconds and 1M tokens, then overages are linear. Linear matters. Customers hate cliffs. Volume discounts are fine at the account level or through committed use, though the discount should come from the predictability you gain in capacity planning, not from magical pricing theater.
Show usage in near real time. Every serious product should have a billing page with a chart broken down by meter, daily totals, month-to-date projected bill, and top cost drivers. If one tenant is spending half their monthly budget because a background sync loop is hammering your endpoint every 3 seconds, the dashboard should make that obvious before your support inbox does. We've built portals like this for clients at Steezr, usually with a boring stack on purpose, PostgreSQL materialized views for daily rollups, a Django admin action for adjustments, HTMX for quick filtering, no giant billing microservice circus. Boring systems survive billing month close.
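The month-to-date projection on that billing page can start as plain linear extrapolation, which is usually good enough as a first pass; a sketch:

```python
import calendar
from datetime import date

def projected_bill(mtd_spend: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to a full-month estimate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month

# EUR 120 spent by April 10 projects to EUR 360 for the month.
print(projected_bill(120.0, date(2025, 4, 10)))  # 360.0
```

A smarter projection would weight recent run rate, but linear is honest, cheap to compute from the daily rollups, and easy to explain when a customer asks how the number was produced.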
how to migrate existing customers
Migrating from per-request pricing needs political care, because customers hear "pricing change" and assume you found a new way to charge more. Some of them will be right if you handle it badly. The clean path is dual metering for at least one full billing cycle, ideally two. Keep charging on the old model while you record compute usage in parallel, then send every customer a shadow invoice that compares the old bill against the new bill, with a plain-English explanation of the difference.
You want sentences like, "Your March usage was 1.8M requests, which included a retry loop from one integration job. Under compute billing you would have paid for 14.2 worker-hours and 320k tokens, total €184 instead of €690." That wins trust because the customer can see the old model was dumb. Some accounts will pay more under compute billing, especially the ones sending tiny high-frequency requests that are cheap for you to serve. Give them a transition credit or a 90-day rate cap. The point is to align incentives without setting fire to revenue.
Operationally, add a versioned pricing engine. Do not overwrite historical rating logic. pricing_version=v1_request and pricing_version=v2_compute should both exist in code and in invoice records. If finance asks why invoice INV-2025-03-104 differs from INV-2025-04-021 for the same customer behavior, you need an answer better than "we changed it." A small Python rating function is enough:
from decimal import Decimal

def rate_event(event, pricing_version):
    if pricing_version == "v1_request":
        if event.meter_code == "api_request":
            return Decimal("0.0004") * event.quantity
        return Decimal("0")
    if pricing_version == "v2_compute":
        rates = {
            "worker_second": Decimal("0.0020"),
            "gpu_second": Decimal("0.0180"),
            "token_1m": Decimal("1.80"),  # quantity measured in millions of tokens
        }
        return rates[event.meter_code] * event.quantity
    raise ValueError(f"unknown pricing version: {pricing_version}")

One more thing: communicate the ugly details before they become rumors. State the rounding policy, minimum billable increment, treatment of retries, treatment of failed jobs, and what happens during provider outages. If a failed OpenAI call returns 429 Rate limit reached for gpt-4o-mini and you retry three times, decide whether those tokens are pass-through billable, then document it. Ambiguity is where billing distrust grows.
where this breaks
Compute billing isn't magic, and there are cases where per-request still makes sense. If your endpoint cost distribution is tight, requests are homogeneous, and customers mostly care about transaction counts because that's how they measure business volume, charging per request can be perfectly fine. A webhook relay with tiny payloads and stable processing time doesn't need GPU-seconds and token ledgers. Use judgment.
Problems show up when teams pick fake precision. Billing exact CPU cycles inside a noisy multi-tenant container setup sounds clever right up until Linux scheduling jitter and background contention make the numbers impossible to explain. Wall-clock worker-seconds are easier to reason about. The same goes for memory. Don't invent "GB-milliseconds" unless memory is truly your main cost driver and you can defend the measurement. Most teams should keep the model blunt and accurate enough, not academically pure.
Abuse prevention also changes. Under request pricing, spam is expensive for the customer. Under compute pricing, spam might be cheap if the request fails fast. You'll need proper rate limits, queue quotas, and concurrency caps. In NGINX that might be as plain as limit_req_zone $binary_remote_addr zone=api:10m rate=20r/s;, then at the app layer enforce tenant-level concurrent job caps based on plan. Billing and admission control work together, one doesn't replace the other.
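The tenant-level cap can start embarrassingly simple. A hypothetical in-process sketch follows; a real worker fleet would enforce this in Redis or in the queue layer, and the caps per plan are assumptions:

```python
PLAN_CONCURRENCY = {"starter": 2, "growth": 10, "scale": 50}  # assumed caps
_running = {}  # account_id -> currently running job count

def try_admit(account_id: int, plan_code: str) -> bool:
    """Admit a job only if the tenant is under its plan's concurrency cap."""
    if _running.get(account_id, 0) >= PLAN_CONCURRENCY[plan_code]:
        return False
    _running[account_id] = _running.get(account_id, 0) + 1
    return True

def release(account_id: int) -> None:
    _running[account_id] = max(_running.get(account_id, 1) - 1, 0)

# A starter tenant gets two concurrent jobs; the third is rejected at admission.
assert try_admit(1, "starter") and try_admit(1, "starter")
print(try_admit(1, "starter"))  # False
```

Rejected jobs should land back in the queue or return a clear 429 with the cap in the response body, so the customer sees a plan limit, not a mystery failure.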
The bigger point is simple: pricing should reward efficient software on both sides of the API. Per-request billing does the opposite once customers automate heavily. Compute-based billing fixes that: your costs track usage, customers stop building bizarre workarounds to reduce call count, and product decisions get easier because the pricing unit finally matches the machine time you're actually paying for.
