agents fail where invoices exist
I don’t trust agentic tool-chains in SMB products. Not because LLMs are useless, quite the opposite, they’re extremely useful, but because once money changes hands you need behavior you can explain, replay, cap, and support at 2 a.m., without reading a speculative planner transcript that wandered through five tools, retried two of them, and finally returned a half-right answer with no clear ownership of the failure.
Most agent setups collapse in the same places. The planner invents a sequence that technically satisfies the prompt and violates a business rule nobody encoded. A retrieval step times out, the model keeps going anyway, then synthesizes a confident answer from stale context. A tool returns malformed JSON once every few hundred calls, the parser swallows it, your app records success, support gets the ticket three days later. You can patch each issue individually, and people do, then six weeks later the system has prompt glue, retry glue, tracing glue, policy glue, and a giant hidden state machine nobody admits exists.
SMBs don’t need that mess. They need "extract invoice line items", "draft a response using these approved facts", "classify this ticket into one queue", "fill this CRM field if confidence exceeds 0.9". These are bounded operations with real side effects, and bounded operations deserve deterministic software around the model call. Fancy agent loops are a false economy because they reduce initial coding and then hand the cost back as operations, support load, compliance pain, and debugging time.
We’ve built AI-heavy internal systems and document pipelines at steezr for teams that don’t have the luxury of an ML platform squad. The pattern that survives contact with production is boring in the best possible way: typed function calls from the model, a thin tool-proxy service that owns auth and validation, explicit state transitions, and a fallback path a human can understand in one screenful of logs.
function calls beat planners
If your model supports structured outputs or function calling, use that first and keep the surface area tiny. Give the model a narrow menu of actions with typed arguments, describe each action in plain language, require strict JSON schema validation, then refuse anything outside contract. No reflection loop, no self-critique phase, no recursive planner deciding it probably needs to hit Slack, HubSpot, and your billing API because the user asked a vague question.
A simple example, customer support triage. The model gets the email body, account metadata, and a tool list with maybe three functions: classify_ticket, draft_reply, request_human_review. classify_ticket accepts an enum for category and a bounded confidence float. draft_reply accepts a constrained markdown string plus a list of cited source IDs. request_human_review accepts one reason code. That’s it. You’ve turned a fuzzy prompt into a typed interface.
In practice, that means a schema like this:
```json
{
  "name": "classify_ticket",
  "strict": true,
  "parameters": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "category": {
        "type": "string",
        "enum": ["billing_dispute", "refund", "bug_report", "sales", "other"]
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      }
    },
    "required": ["category", "confidence"]
  }
}
```
Then your app does the adult work. If confidence < 0.86, route to review. If account status is delinquent, forbid refund promises. If the customer is in the EU, redact fields before any draft is generated. None of this belongs in a planner prompt. None of it should depend on whether the model woke up in a poetic mood.
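Sketching that app-side gating as code makes the point concrete. Everything here is illustrative, the names, the 0.86 threshold, the account shape, none of it is a real interface:

```typescript
// Validated output of classify_ticket, already past JSON Schema validation.
type Classification = {
  category: "billing_dispute" | "refund" | "bug_report" | "sales" | "other";
  confidence: number;
};

type Account = { delinquent: boolean; region: "EU" | "US" | "OTHER" };

type Routing =
  | { action: "human_review"; reason: string }
  | { action: "auto_handle"; category: Classification["category"]; redactPii: boolean };

// Hypothetical threshold; a real system would tune this per category.
const REVIEW_THRESHOLD = 0.86;

function routeTicket(c: Classification, account: Account): Routing {
  // Low confidence is a product state, not an error.
  if (c.confidence < REVIEW_THRESHOLD) {
    return { action: "human_review", reason: "low_confidence" };
  }
  // Business rules live here, deterministically, not in the prompt.
  if (c.category === "refund" && account.delinquent) {
    return { action: "human_review", reason: "delinquent_account" };
  }
  return {
    action: "auto_handle",
    category: c.category,
    redactPii: account.region === "EU", // redact before any draft is generated
  };
}
```

The model picks a category; everything with consequences is plain code you can unit-test.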
People complain that typed calls feel restrictive. Good. Restriction is exactly what turns an LLM feature into product behavior you can charge for.
build a tool proxy
Direct model-to-tool access is where a lot of AI products quietly become unmaintainable. The model shouldn’t know how to talk to Stripe, Jira, Salesforce, your Postgres read replica, or the random SOAP endpoint some distributor still uses. Put a deterministic proxy layer in the middle and make that layer painfully boring.
That proxy should own five jobs. Authentication, authorization, input validation, rate limiting, and normalization of upstream errors into a small internal vocabulary. One endpoint per tool operation, strong schemas on both request and response, idempotency keys for side-effecting calls, and logs that show request_id, customer_id, model_decision, tool_name, validated_args, upstream_status, and final_outcome. Once you have that, audits stop being interpretive art.
A Go service works well here because it’s easy to keep it strict and fast. An HTTP handler with generated JSON Schema validation, a context deadline of, say, 1500 ms for read operations, and a hard allowlist of outbound targets already eliminates half the nonsense people blame on models. A sketch:
```go
type CreateRefundRequest struct {
	InvoiceID string  `json:"invoice_id" validate:"required,uuid4"`
	Amount    float64 `json:"amount" validate:"gt=0"`
	Reason    string  `json:"reason" validate:"oneof=duplicate chargeback goodwill"`
}

func (h *Handler) CreateRefund(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	var req CreateRefundRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeErr(w, 400, "invalid_json")
		return
	}
	if err := h.validate.Struct(req); err != nil {
		writeErr(w, 422, "schema_validation_failed")
		return
	}
	// Use the deadline-bound ctx, so the policy check respects the timeout too.
	if !h.policy.CanRefund(ctx, req.InvoiceID) {
		writeErr(w, 403, "policy_denied")
		return
	}

	res, err := h.billing.CreateRefund(ctx, req)
	if err != nil {
		writeMappedUpstreamErr(w, err)
		return
	}
	writeJSON(w, 200, res)
}
```
That proxy becomes your control plane. You can add per-customer rate caps, freeze dangerous tools for one tenant, shadow new behavior, and record exactly which validator rejected which payload. If you’ve ever had to explain a wrong side effect to an SMB owner who can name all twelve employees personally, you’ll appreciate how much this matters.
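A sketch of what "control plane" means in practice: a per-tenant policy check that runs before any tool dispatch. Types like `TenantPolicy` and `checkToolCall` are hypothetical, your proxy will grow its own shapes:

```typescript
// Per-tenant knobs the proxy consults before dispatching any tool call.
type TenantPolicy = {
  frozenTools: Set<string>;   // tools frozen for this tenant, e.g. after an incident
  rateLimitPerMinute: number; // per-tenant cap across all tools
};

type Decision = { allowed: true } | { allowed: false; code: string };

// Deterministic gate: the model proposed the call, the proxy decides its fate.
function checkToolCall(
  policy: TenantPolicy,
  tool: string,
  callsThisMinute: number,
): Decision {
  if (policy.frozenTools.has(tool)) {
    return { allowed: false, code: "tool_frozen_for_tenant" };
  }
  if (callsThisMinute >= policy.rateLimitPerMinute) {
    return { allowed: false, code: "rate_capped" };
  }
  return { allowed: true };
}
```

The denial codes land in the same logs as everything else, so "why didn't this refund go through" has a one-line answer.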
contracts before prompts
The order matters. Start with the contract, then write the prompt to satisfy the contract, not the other way around. Teams often prototype with a giant system prompt, watch it sort of work, then keep layering instructions until the prompt becomes an undocumented policy engine. That’s how you get brittle behavior and zero confidence during upgrades.
Schema-level contracts force useful decisions early. Which fields are optional. Which enums represent real business states. What maximum length is acceptable for free text. Whether citations are mandatory. What confidence threshold gates automation. Once those are explicit, model swaps are less dramatic because the contract absorbs most of the churn. You can move from one provider to another, or one model family to another, and your app logic still consumes the same validated shape.
For Python services, Pydantic is still the obvious hammer. For TypeScript in a Next.js app, Zod works fine, then export JSON Schema if your provider wants it. A thin example in TypeScript:
```typescript
import { z } from "zod";

export const DraftReply = z.object({
  tone: z.enum(["neutral", "warm", "firm"]),
  body_md: z.string().min(40).max(4000),
  citations: z.array(z.string().uuid()).min(1),
  safe_to_send: z.boolean(),
});

export type DraftReply = z.infer<typeof DraftReply>;
```
Then fail closed:
```typescript
const parsed = DraftReply.safeParse(modelOutput);
if (!parsed.success) {
  logger.warn({ err: parsed.error }, "draft_reply_schema_failed");
  return { action: "human_review", reason: "invalid_model_output" };
}
```
This style also makes regression testing possible. Save real inputs, expected schema outputs, and policy decisions. Run them in CI against model changes. If output quality shifts, you’ll see it before a customer does. Prompt-only systems almost never get this discipline because there’s no stable seam to test against.
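A minimal version of that regression seam: saved fixtures replayed against the fail-closed decision path. The validator below is a hand-rolled stand-in for the Zod schema, purely to keep the sketch self-contained, and the fixture shape is an assumption:

```typescript
// A saved fixture: a real model output plus the decision the app must make.
type Fixture = {
  name: string;
  modelOutput: unknown;
  expected: "send" | "human_review";
};

// Shape-only stand-in for the real schema check (Zod in production).
function isValidDraftReply(o: unknown): boolean {
  if (typeof o !== "object" || o === null) return false;
  const d = o as Record<string, unknown>;
  return (
    typeof d.body_md === "string" &&
    d.body_md.length >= 40 &&
    Array.isArray(d.citations) &&
    d.citations.length >= 1 &&
    typeof d.safe_to_send === "boolean"
  );
}

// The decision under test: invalid output always fails closed to review.
function decide(modelOutput: unknown): "send" | "human_review" {
  if (!isValidDraftReply(modelOutput)) return "human_review";
  const d = modelOutput as { safe_to_send: boolean };
  return d.safe_to_send ? "send" : "human_review";
}

// Replay all fixtures, return the names that now decide differently.
// In CI this runs against every model or prompt change.
function replay(fixtures: Fixture[]): string[] {
  return fixtures
    .filter((f) => decide(f.modelOutput) !== f.expected)
    .map((f) => f.name);
}
```

A non-empty return from `replay` is your "quality shifted" signal, with names attached.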
fallbacks are product behavior
A fallback isn’t a sad path. It’s core product behavior, and teams that treat it as an afterthought end up with AI features that feel random. Every operation should have a deterministic fallback defined before release. If extraction confidence is low, queue manual review. If the upstream CRM is rate-limited, save the suggestion and notify the user that sync is pending. If a generated answer lacks citations, don’t send it. If tool validation fails, return a recoverable state your UI understands.
Notice the pattern, the fallback is explicit and finite. Not "try a different chain" or "ask the model to reason harder". Retrying a model call can make sense for transient transport failures. Retrying semantics because the model returned mush is usually self-deception with extra tokens attached.
You also want user-facing states that map to backend states one-to-one. ready, needs_review, deferred, blocked_by_policy, upstream_unavailable. Real names. Support can read them. Product can design around them. Analytics can count them. Compare that with a free-form agent trace where the model decided step 7 was optional and step 8 contradicted step 3.
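Keeping those states one-to-one is easiest when they live in a single shared type. A TypeScript sketch using the state names from the list, with the user-facing copy as an assumption:

```typescript
// One vocabulary, shared by backend, UI, and analytics.
type OpState =
  | "ready"
  | "needs_review"
  | "deferred"
  | "blocked_by_policy"
  | "upstream_unavailable";

// Exhaustive mapping: adding a state without copy is a compile error.
function userMessage(s: OpState): string {
  switch (s) {
    case "ready":
      return "Done.";
    case "needs_review":
      return "Queued for review.";
    case "deferred":
      return "Saved. Sync is pending.";
    case "blocked_by_policy":
      return "This action needs approval.";
    case "upstream_unavailable":
      return "A connected service is unavailable. We saved your work.";
    default: {
      const _never: never = s; // compile-time exhaustiveness check
      return _never;
    }
  }
}
```

The `never` trick is the point: nobody can invent a sixth state in the backend without the frontend build breaking.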
One practical trick, persist the model input bundle, the validated tool arguments, the policy snapshot, and the chosen fallback in a single immutable event row. PostgreSQL handles this perfectly well for most SMB volumes. Add a JSONB column for raw provider response if you need forensics, partition if volumes get high, and you’ve got replayable execution without buying some expensive observability story you’ll barely use.
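As a sketch, that event row as a single append-only record, with field names illustrative rather than a real schema:

```typescript
// One immutable row per execution; replaying a run means re-reading this record.
type ExecutionEvent = {
  request_id: string;
  customer_id: string;
  occurred_at: string;             // ISO timestamp
  model_input_bundle: unknown;     // everything the model saw, post policy filtering
  validated_args: unknown;         // tool arguments after schema validation
  policy_snapshot: unknown;        // the rules in force at decision time
  fallback: string | null;         // which fallback fired, if any
  raw_provider_response?: unknown; // the optional JSONB forensics column
};

// Freeze before persisting so nothing mutates the record in flight.
function recordEvent(e: ExecutionEvent): Readonly<ExecutionEvent> {
  return Object.freeze({ ...e });
}
```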
Predictability wins twice, once for engineering, once for customer trust. Customers forgive "queued for review". They do not forgive silent wrong actions.
the stack I’d actually ship
For a greenfield SMB SaaS feature in 2026, I’d keep the architecture aggressively small. Next.js for the app surface if you’re already there, a Python or Go service for AI orchestration if the workflows have any complexity, PostgreSQL for events and audit state, Redis only if you truly need short-lived queues or rate counters, and one provider abstraction that exposes structured output plus function calling. Keep HTMX in play for internal backoffice tools where speed matters more than frontend ceremony. None of this needs a cathedral.
The request flow is straightforward. UI sends an action request. Backend assembles a bounded context packet, customer data already filtered by policy. Model is asked for one typed decision or one typed content object. Output is schema-validated. If the output requests a tool, that request goes through the proxy. Proxy applies policy, auth, rate caps, idempotency, and upstream error mapping. Final state is persisted with enough metadata to replay. UI renders a finite state, not a vibe.
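Strung together, the whole flow is a short deterministic function around one model call. Every callback here is a hypothetical stand-in for the pieces described above, not a framework:

```typescript
// Finite outcomes, matching the state vocabulary the UI renders.
type Outcome =
  | { state: "ready"; result: unknown }
  | { state: "needs_review"; reason: string }
  | { state: "upstream_unavailable" };

async function runAction(
  buildContext: () => Promise<unknown>,          // policy-filtered context packet
  callModel: (ctx: unknown) => Promise<unknown>, // one typed decision from the model
  validate: (out: unknown) => boolean,           // schema validation
  callProxy: (out: unknown) => Promise<unknown>, // tool proxy: auth, policy, caps
): Promise<Outcome> {
  const ctx = await buildContext();
  const out = await callModel(ctx);
  if (!validate(out)) {
    // Fail closed: bad model output is a product state, never a retry loop.
    return { state: "needs_review", reason: "invalid_model_output" };
  }
  try {
    const result = await callProxy(out);
    return { state: "ready", result };
  } catch {
    // Deterministic fallback; the event row records which branch fired.
    return { state: "upstream_unavailable" };
  }
}
```

One model call, three exits, all of them nameable in a log line.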
You should also log the ugly stuff directly. Real status codes, real provider errors, token counts, latency percentiles by tool, schema failure rates by model version. I want dashboards that say draft_reply_schema_failed=1.7% after a model rollout, not abstract "agent quality dipped" nonsense. I want alerts on policy_denied spikes because they often reveal prompt drift or a tenant configuration bug. Concrete metrics force concrete fixes.
One more opinion, don’t let vendors bully you into their full agent framework unless you’ve already proven you need it. A lot of these stacks are wrappers around wrappers, with tracing formats nobody standardizes, execution semantics nobody can explain clearly, and enough magic to make incident response miserable. Plain function calling plus deterministic software around it gets you surprisingly far, and for SMB products, far is usually far enough.
