the gap nobody talks about
Every AI framework tutorial ends the same way. The LLM returns a JSON blob with tool_use or function_call or whatever the flavor of the month is, the tutorial author writes execute_tool(tool_call) in the next line, and the demo works. What the tutorial never shows you is what happens when that tool call is delete_invoice(invoice_id=9821) and invoice 9821 belongs to your biggest customer and the model hallucinated the ID because the user's prompt was ambiguous.
We've seen this exact failure mode surface repeatedly when building AI automation features into customer-facing products. A fintech client wanted an AI assistant that could adjust subscription billing records. A logistics company wanted automated order amendments. In both cases, the initial prototype worked fine in staging, where the data was fake and nobody cared. The moment we started talking about production, the conversation changed immediately, because the blast radius of a bad write against real customer data is not something an SMB can absorb.
The standard advice you'll get is 'add a human in the loop.' That's correct but useless as stated. It tells you nothing about the data model, the queue architecture, the UI surface the human actually uses, or how you handle the case where the human approves something that has since become stale. This post is the thing I wish I had read before building this pattern twice.
why tool calls are proposals, not commands
When OpenAI introduced function calling back in 2023 and then Anthropic formalized tool_use in Claude's API, the framing was always about capability: now the model can take actions. That framing is subtly wrong in a way that causes real engineering problems. The model doesn't take actions, it returns structured data that describes an intended action. The distinction matters enormously for how you architect the system.
A tool_call response from any major model (OpenAI, Anthropic, Google Gemini) is just JSON. It has a name and a set of arguments. Nothing has happened yet. The model has expressed an intent, in the same way a SQL statement expresses an intent before you hand it to a connection pool. Nobody ships raw user-provided SQL directly to a database without validation and probably a prepared statement. The equivalent caution should apply to tool calls, but somehow the AI demo culture has normalized writing the execute step as if it's obvious.
The correct mental model is that a tool call is a proposal. It needs to be validated against your business rules, checked against current state, and in many cases, confirmed by a human before it becomes a mutation. Once you internalize that, the architecture basically writes itself: you need a place to park proposals, a way to review them, and a way to execute or discard them with a full audit trail.
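Concretely, a proposal is just a small piece of structured data you can persist and pass around. A minimal Go sketch of what that record might look like; the field names are illustrative and mirror the queue entry and audit schema later in the post:

import (
	"encoding/json"
	"time"
)

// Status values a proposal moves through; "failed" is distinct from "rejected"
// because an approved call can still blow up at execution time.
type Status string

const (
	StatusPending  Status = "pending"
	StatusApproved Status = "approved"
	StatusRejected Status = "rejected"
	StatusExecuted Status = "executed"
	StatusExpired  Status = "expired"
	StatusFailed   Status = "failed"
)

// Proposal is one possible shape for a parked tool call: everything needed to
// review it, execute it later, and audit what happened.
type Proposal struct {
	ID             string          `json:"id"`
	TenantID       string          `json:"tenant_id"`
	ToolName       string          `json:"tool_name"`
	ToolArgs       json.RawMessage `json:"tool_args"`       // raw arguments exactly as the model produced them
	TriggeredBy    string          `json:"triggered_by"`    // user whose prompt produced the call
	ConversationID string          `json:"conversation_id"` // enough context to retrieve the conversation
	Status         Status          `json:"status"`
	ValidUntil     time.Time       `json:"valid_until"` // proposal is dead after this
	CreatedAt      time.Time       `json:"created_at"`
}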
the queue architecture
Redis Streams are the right primitive here, not a job queue like Celery or BullMQ, because you want consumer group semantics and you want the stream to be replayable. Your approval queue is a stream where each entry is a pending tool call waiting for human review.
The flow looks like this: your AI handler receives the tool_call payload from the model, deserializes it, runs it through a rule engine (more on that in a second), and if the call requires human approval, it writes a new entry to a Redis Stream called pending_tool_calls with XADD. The entry contains the raw tool call JSON, the user ID who triggered it, a timestamp, the conversation context (at minimum a conversation ID you can use to retrieve context), and a TTL value indicating how long this proposal is valid. You do not call the actual tool. You return a response to the user like "I've prepared an update to invoice 9821, pending your confirmation."
The entry structure in Redis looks roughly like:
XADD pending_tool_calls * \
  tool_name delete_invoice \
  tool_args '{"invoice_id": 9821}' \
  triggered_by user:448 \
  conversation_id conv:882af3 \
  valid_until 1746700000 \
  status pending
You also write a mirror record into Postgres immediately. Redis is your queue, Postgres is your audit log, and the two should stay in sync. The Postgres record is what you query when you're building the approval UI, because you want SQL-level filtering, full-text search across tool arguments, and foreign key relationships to your user and tenant tables. Redis is where the actual pending state lives; Postgres is where the history lives forever.
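A sketch of that write path, assuming go-redis v9 and database/sql. The function name enqueueProposal is mine, the Proposal type comes from the earlier sketch, and I've added a proposal_id field to the stream entry so it can be tied back to the audit row:

import (
	"context"
	"database/sql"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// enqueueProposal parks a pending tool call in both stores: Postgres for the
// permanent audit record, Redis for the live queue the workers watch.
func enqueueProposal(ctx context.Context, rdb *redis.Client, db *sql.DB, p Proposal) error {
	// Audit record first; an orphaned queue entry is worse than an audit row
	// that never gets a matching stream entry.
	_, err := db.ExecContext(ctx,
		`INSERT INTO tool_call_proposals
		   (id, tenant_id, triggered_by, conversation_id, tool_name, tool_args, status, valid_until)
		 VALUES ($1, $2, $3, $4, $5, $6, 'pending', $7)`,
		p.ID, p.TenantID, p.TriggeredBy, p.ConversationID, p.ToolName, string(p.ToolArgs), p.ValidUntil)
	if err != nil {
		return fmt.Errorf("audit insert: %w", err)
	}

	// Then the stream entry that actually sits in the pending queue.
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "pending_tool_calls",
		Values: map[string]interface{}{
			"proposal_id":     p.ID,
			"tool_name":       p.ToolName,
			"tool_args":       string(p.ToolArgs),
			"triggered_by":    p.TriggeredBy,
			"conversation_id": p.ConversationID,
			"valid_until":     p.ValidUntil.Unix(),
			"status":          "pending",
		},
	}).Err()
}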
The rule engine I mentioned is important. Not every tool call needs human approval, and if you require approval for everything you'll train your users to rubber-stamp it. Simple reads, idempotent lookups, and low-risk writes below certain value thresholds can go straight to execution with logging only. Destructive operations, writes above a financial threshold, and anything touching billing or permissions go to the approval queue. You define the rules in code, not in the model prompt, because prompts drift and get edited, and the rules need to be enforced at the infrastructure layer.
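A sketch of what that looks like in code. The specific tool names and the 500-unit refund threshold are invented for illustration; the shape is the point:

import "encoding/json"

// Tools that always need a human, regardless of arguments.
var alwaysApprove = map[string]bool{
	"delete_invoice":      true,
	"cancel_subscription": true,
	"update_permissions":  true,
}

// requiresApproval is the rule engine's core decision: enforced in code at the
// infrastructure layer, never in the prompt.
func requiresApproval(toolName string, rawArgs json.RawMessage) bool {
	if alwaysApprove[toolName] {
		return true
	}
	// Value-threshold rule: small refunds execute directly (with logging),
	// anything at or above the threshold waits for a human.
	if toolName == "issue_refund" {
		var args struct {
			Amount float64 `json:"amount"`
		}
		if err := json.Unmarshal(rawArgs, &args); err != nil {
			return true // arguments we can't parse are treated as high risk
		}
		return args.Amount >= 500
	}
	return false // reads, lookups, and low-risk writes skip the queue
}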
the audit log schema
The Postgres table is simple on purpose. Over-engineered audit schemas have a habit of never getting queried because nobody understands them.
CREATE TABLE tool_call_proposals (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  triggered_by UUID NOT NULL REFERENCES users(id),
  conversation_id TEXT NOT NULL,
  tool_name TEXT NOT NULL,
  tool_args JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  reviewed_by UUID REFERENCES users(id),
  reviewed_at TIMESTAMPTZ,
  executed_at TIMESTAMPTZ,
  execution_result JSONB,
  valid_until TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON tool_call_proposals (tenant_id, status);
CREATE INDEX ON tool_call_proposals (triggered_by);
CREATE INDEX ON tool_call_proposals (tool_name, created_at DESC);
The status column moves through pending, approved, rejected, executed, expired, and failed. That last one matters: an approved call can still fail at execution time, and you need to distinguish that from a rejection. The execution_result JSONB column stores whatever your tool handler returns, including error details if it fails, so you have a complete picture of what happened.
The valid_until field is where most implementations cut corners and regret it. A proposal to transfer funds that was approved two hours after it was created is potentially operating on stale state. Your executor needs to check this timestamp before running the call, re-validate the current state against the original arguments, and reject execution if things have changed materially. What counts as 'changed materially' is domain-specific, but you need to define it explicitly and enforce it in the executor, not trust that approvals happen fast.
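In the executor, that check might look something like this. canExecute and the revalidate hook are hypothetical names; the domain-specific comparison lives behind the hook:

import (
	"context"
	"fmt"
	"time"
)

// canExecute is the staleness guard that runs after approval, immediately
// before the tool handler fires.
func canExecute(ctx context.Context, p Proposal, revalidate func(context.Context, Proposal) error) error {
	// Hard stop: approvals that arrive after the window are never executed.
	if time.Now().After(p.ValidUntil) {
		return fmt.Errorf("proposal %s expired at %s", p.ID, p.ValidUntil)
	}
	// Domain hook: is the record still in the state the arguments assume?
	// What counts as "changed materially" is defined here, per tool.
	if err := revalidate(ctx, p); err != nil {
		return fmt.Errorf("state changed since proposal was created: %w", err)
	}
	return nil
}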
the approval UI surface
This is the part that gets the least attention in engineering writeups and causes the most usability failures. If the approval interface is buried in an admin panel that's hard to find, or if it shows the raw JSON arguments without any human-readable summary, your users will either ignore it or approve everything without reading it, which defeats the purpose entirely.
The approval surface should live in the main application flow, not in a separate admin area. For a customer portal built in Next.js, we put a persistent banner at the top of the page when there are pending approvals for the current tenant, with a count. Clicking it opens a slide-over panel that shows each pending proposal with a natural language summary generated by the model at the time the tool call was produced (you ask the model to produce a plain-English description alongside the tool call in the same response), the arguments in a readable format, the user who triggered it, and how long until it expires.
The approve and reject buttons post to a simple API route that updates both the Redis Stream entry and the Postgres record atomically (use a Lua script in Redis or a MULTI/EXEC block to update the stream entry status, and wrap the Postgres update in the same request handler). Approval doesn't execute the call immediately in the same HTTP request, it marks the proposal as approved and a background worker picks it up. This is important because tool execution might take seconds, might need retries, and you don't want the UI to hang waiting for it.
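The Postgres side of that handler is a guarded status transition, so a proposal can only leave pending once no matter how many times someone clicks approve. A sketch with database/sql; the Redis update (the Lua script or MULTI/EXEC block) is omitted here:

import (
	"context"
	"database/sql"
	"errors"
)

// markApproved flips the audit record from pending to approved, recording who
// reviewed it and when. The WHERE clause rejects double approvals and
// proposals that have already expired.
func markApproved(ctx context.Context, db *sql.DB, proposalID, reviewerID string) error {
	res, err := db.ExecContext(ctx,
		`UPDATE tool_call_proposals
		    SET status = 'approved', reviewed_by = $2, reviewed_at = now()
		  WHERE id = $1 AND status = 'pending' AND valid_until > now()`,
		proposalID, reviewerID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return errors.New("proposal is no longer pending or has expired")
	}
	return nil
}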
For the background worker in a Go service, we use a goroutine pool consuming approved entries from the Redis Stream's consumer group. Each worker re-validates the valid_until timestamp, calls the actual tool handler, and updates the Postgres record with the result. If execution fails, it marks the record as failed and optionally notifies the user. The separation between approval and execution means your approval UI is always fast and your execution logic can be as complex as it needs to be.
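A stripped-down version of one worker in that pool, assuming go-redis v9. Stream and group names are parameters because how approved entries reach the worker depends on how the approval step updates the stream; retries, dead-lettering, and the Postgres result write are elided:

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// runWorker blocks on the consumer group and hands each entry's fields to the
// execution callback, acking only on success so failed entries can be reclaimed.
func runWorker(ctx context.Context, rdb *redis.Client, stream, group, consumer string,
	execute func(context.Context, map[string]interface{}) error) {
	for {
		res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    group,
			Consumer: consumer,
			Streams:  []string{stream, ">"},
			Count:    10,
			Block:    5 * time.Second,
		}).Result()
		if ctx.Err() != nil {
			return // shutting down
		}
		if err == redis.Nil {
			continue // block window elapsed with nothing new
		}
		if err != nil {
			time.Sleep(time.Second) // transient Redis error, back off and retry
			continue
		}
		for _, s := range res {
			for _, msg := range s.Messages {
				if execErr := execute(ctx, msg.Values); execErr != nil {
					continue // leave unacked so it can be claimed and retried
				}
				rdb.XAck(ctx, stream, group, msg.ID)
			}
		}
	}
}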
what this looks like end to end
A user types into the AI assistant: "cancel the renewal for Acme Corp's annual plan." The model interprets this, constructs a tool call for cancel_subscription(customer_id='cust_acme', plan_id='annual_pro', reason='user_request'), and returns it alongside a description like "Cancel the annual Pro subscription for Acme Corp. This will take effect at the end of the current billing period."
Your handler receives the tool call, runs it through the rule engine, sees that cancel_subscription is a flagged operation type, and writes the proposal to both Redis and Postgres with status pending. The response sent back to the user is: "I've prepared a cancellation request for Acme Corp's annual plan. You'll see it in your pending actions above."
The user (or an admin with appropriate permissions) opens the pending actions panel, reads the natural language summary, checks the arguments, and clicks approve. The API handler marks the proposal approved in both stores. Within a few seconds, the background worker picks it up, validates that the subscription is still active and the valid_until timestamp hasn't passed, calls cancel_subscription, gets a success response, updates the Postgres record to executed with the execution result, and the user sees the proposal disappear from their pending queue and appear in the history tab.
If the user clicks reject, the proposal moves to rejected status and nothing touches the subscription data. If nobody touches it and valid_until passes, a separate cleanup worker marks it expired. Either way, the audit log has a complete record of what was proposed, who reviewed it, what decision was made, and what happened during execution.
The total implementation is maybe 600 lines of Go for the worker and rule engine, a Next.js API route and component for the UI surface, and the Postgres migration above. It's not complicated. The complicated part is deciding where your trust boundaries are and encoding them explicitly in the rule engine instead of hoping the model gets it right.
the parts that will bite you
A few things we learned the hard way. First, multi-step tool call chains are where this pattern gets complicated fast. If the model returns three sequential tool calls as part of a single user request, you need to decide whether each one is an individual proposal requiring individual approval or whether the chain is approved as a unit. Our current answer is that chains are approved as a unit, with the approval UI showing all steps in order, and execution is all-or-nothing, transactional where the underlying tools support it. This is the right UX, but it requires more work in the executor.
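The data-model consequence is small but worth stating: the unit of approval becomes the chain, not the call. An illustrative shape only, reusing the Proposal sketch from earlier:

// ChainProposal groups the steps of a multi-call request under one review
// decision: the reviewer sees every step in order, one status covers them all,
// and the executor runs the list all-or-nothing where the tools allow it.
type ChainProposal struct {
	ID     string
	Steps  []Proposal // ordered; rendered to the reviewer in execution order
	Status Status     // a single status for the whole chain
}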
Second, the stale state problem is worse than you think. We had a case where a user approved a proposal to update a pricing record, but between proposal creation and execution, another user had manually edited that same record through a different UI. The executor ran, overwrote the manual change, and nobody knew because the audit log recorded a success. You need optimistic locking or a version check at execution time, which means your tool handlers need to accept and enforce a version or ETag parameter. This is an interface contract you should design from the start, not retrofit.
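Concretely, the contract means the tool handler takes the version the proposal was created against and refuses to write over anything newer. A sketch against a hypothetical pricing_records table with an integer version column:

import (
	"context"
	"database/sql"
	"errors"
)

// updatePricingRecord applies the proposed change only if the record is still
// at the version the model (and the reviewer) saw. Zero rows affected means
// someone else got there first, so the execution is marked failed instead of
// silently overwriting their change.
func updatePricingRecord(ctx context.Context, db *sql.DB, recordID string, expectedVersion int, newPrice float64) error {
	res, err := db.ExecContext(ctx,
		`UPDATE pricing_records
		    SET price = $3, version = version + 1, updated_at = now()
		  WHERE id = $1 AND version = $2`,
		recordID, expectedVersion, newPrice)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return errors.New("version conflict: record changed after the proposal was created")
	}
	return nil
}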
Third, notification infrastructure matters. An approval queue that nobody checks is just a delayed execution queue with extra steps. You need to notify the right people when proposals are created, remind them before expiry, and confirm when execution happens. We use a simple webhook dispatch in the background worker to post to Slack for internal tools, and in-app notifications stored in Postgres for customer-facing products. The notification layer is boring but it's what makes the human-in-the-loop actually function.
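The Slack half of that is nothing more than a webhook POST from the worker; the URL and message text are placeholders:

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// notifySlack posts a plain-text message to a Slack incoming webhook. Called
// from the background worker on proposal creation, before expiry, and after
// execution.
func notifySlack(ctx context.Context, webhookURL, text string) error {
	body, err := json.Marshal(map[string]string{"text": text})
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, webhookURL, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("slack webhook returned %s", resp.Status)
	}
	return nil
}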
Fourth, permission scoping. Not every user should be able to approve every proposal. The approval check should verify that the reviewer has the appropriate role for the specific tool being approved, not just that they're authenticated. A billing admin should approve billing tool calls. An inventory manager shouldn't be able to approve finance operations just because they're also logged in. This sounds obvious but it's easy to skip when you're moving fast, and it creates a real access control gap.
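The check itself is a few lines once you decide to map each flagged tool to the role allowed to approve it; the role and tool names here are made up:

// approverRoleFor maps each flagged tool to the role that may approve it.
var approverRoleFor = map[string]string{
	"cancel_subscription": "billing_admin",
	"delete_invoice":      "billing_admin",
	"update_permissions":  "org_admin",
	"adjust_inventory":    "inventory_manager",
}

// canApprove runs in the approval API route, after authentication: the
// reviewer needs the specific role for this tool, not just a valid session.
func canApprove(toolName string, reviewerRoles []string) bool {
	required, ok := approverRoleFor[toolName]
	if !ok {
		return false // tools with no mapping are never approvable
	}
	for _, role := range reviewerRoles {
		if role == required {
			return true
		}
	}
	return false
}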
