Kill the GraphQL gateway at 12 services

federation ages badly

GraphQL federation feels smart at service three, maybe four. The demo is compelling: every team gets autonomy, product gets one endpoint, frontend engineers can ask for exactly the fields they want, and everyone nods. Six months later you’ve built a coordination machine disguised as architecture. I’ve seen this pattern enough times now that I don’t buy the sales pitch anymore, at least not for mid-stage startups running ten-ish services with one platform team and a couple of product squads trying to ship real features under real deadlines.

The pain isn’t theoretical. The pain is Tuesday afternoon: someone adds a nullable field to a subgraph, a resolver still throws on a missing join key, the gateway composes happily, and staging passes because the happy-path query worked. Then production starts returning partial data and a pile of errors: [{ message: "Cannot return null for non-nullable field" }] that the client ignores because the HTTP status is 200. That’s bad system design. You’ve moved failure discovery from compile time to a distributed runtime, then wrapped it in a protocol that encourages consumers to believe they’re safe because they have types generated from the supergraph.

Schema governance gets ugly fast. Every cross-team change turns into a tiny standards committee: arguing over names, debating ownership of entities, deciding where a field should live, waiting for composition checks, replaying queries, and trying to reason about N+1 behavior that’s now split across multiple resolver chains. The gateway becomes an integration forest, a place where auth, caching, fanout, field-level joins, shape translation, and tribal knowledge accumulate until nobody wants to touch it on Friday.

A lot of teams confuse centralization of query shape with simplification of the system. Those are different things. A single graph can still hide a mess, and in practice it usually does.

the contract-first replacement

The replacement I recommend is boring on purpose: OpenAPI 3.1 specs owned per domain, machine-generated clients checked into consumer repos, consumer-driven contract tests, CI codegen, and a very thin edge layer that handles auth, rate limits, and request routing without pretending to be an orchestration brain. This setup has fewer moving parts, better failure modes, and a much cleaner ownership model.

OpenAPI 3.1 finally fixed most of the historical reasons people dismissed it. JSON Schema alignment matters, oneOf and discriminators are usable, nullable semantics stopped being weird, examples and webhooks are first-class, and the tooling around TypeScript is miles better than it was a few years ago. On a Next.js frontend, or a React Native app, we generate clients with openapi-typescript and a tiny fetch wrapper, then run tsc --noEmit in CI so consumer breakage shows up before merge, not after deployment.

A minimal setup looks like this:

```yaml
openapi: 3.1.0
info:
  title: billing-api
  version: 1.4.0
paths:
  /customers/{customerId}/invoices:
    get:
      operationId: listCustomerInvoices
      parameters:
        - name: customerId
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/InvoiceList'
components:
  schemas:
    InvoiceList:
      type: object
      required: [items]
      properties:
        items:
          type: array
          items:
            $ref: '#/components/schemas/Invoice'
    Invoice:
      type: object
      required: [id, status, totalCents]
      properties:
        id: { type: string, format: uuid }
        status: { type: string, enum: [draft, open, paid, void] }
        totalCents: { type: integer }
```

Then in CI:

```json
{
  "scripts": {
    "codegen": "openapi-typescript ./openapi/billing.yaml -o ./src/gen/billing.ts",
    "typecheck": "tsc --noEmit",
    "contracts": "pnpm pact:verify"
  }
}
```

This gives consumers static guarantees against the producer contract, and the producer still owns implementation details. No gateway composition ceremony required.
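To make the static guarantee concrete, here is a minimal sketch of the "tiny fetch wrapper" idea. The `Invoice` and `InvoiceList` types are hand-written stand-ins for what codegen would emit from the spec above, and `FetchLike`, `HttpError`, `invoicesPath`, and the wrapper's signature are assumptions made up for illustration, not output of any real tool.

```typescript
// Hand-written stand-ins for generated types (assumption: real code would
// import these from the generated ./src/gen/billing.ts instead).
type Invoice = {
  id: string;
  status: "draft" | "open" | "paid" | "void";
  totalCents: number;
};
type InvoiceList = { items: Invoice[] };

// Minimal structural type for fetch, so the wrapper can be exercised with a fake.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Thrown for any non-2xx response so failures are loud instead of silent.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number, url: string) {
    super(`HTTP ${status} for ${url}`);
    this.status = status;
  }
}

// Builds the path for GET /customers/{customerId}/invoices.
function invoicesPath(customerId: string): string {
  return `/customers/${encodeURIComponent(customerId)}/invoices`;
}

// Hypothetical wrapper for the listCustomerInvoices operation.
async function listCustomerInvoices(
  fetchImpl: FetchLike,
  baseUrl: string,
  customerId: string,
): Promise<InvoiceList> {
  const url = baseUrl + invoicesPath(customerId);
  const res = await fetchImpl(url, { headers: { accept: "application/json" } });
  if (!res.ok) throw new HttpError(res.status, url);
  // The cast is where the contract is trusted; codegen plus tsc keep it honest.
  return (await res.json()) as InvoiceList;
}
```

In application code you would pass the platform's `fetch` as `fetchImpl`. If the spec renames `totalCents`, the regenerated type changes and every `invoice.totalCents` access in the consumer becomes a compile error under `tsc --noEmit`.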

runtime failures stop hiding

The strongest argument for this approach has nothing to do with taste; it’s about where errors show up. Federation hides too many failures behind successful transport. Your graph can return 200 with half the page missing, a resolver timeout can degrade one branch of a query while the rest succeeds, and the client has to inspect errors[] and data together to understand what happened. Most teams don’t model that carefully. They pattern-match on generated hooks or SDK methods and assume a resolved promise means valid data.
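For a sense of what "inspect errors[] and data together" demands from every client, here is a small sketch. The envelope shape (`data` alongside `errors`, each error carrying `message` and an optional `path`) follows the GraphQL response format; the `Outcome` type and `classify` function are names invented for this example.

```typescript
// GraphQL response envelope: data and errors can both be present.
type GraphQLError = { message: string; path?: (string | number)[] };
type GraphQLResponse<T> = { data?: T | null; errors?: GraphQLError[] };

type Outcome<T> =
  | { kind: "ok"; data: T }
  | { kind: "partial"; data: T; errors: GraphQLError[] }
  | { kind: "failed"; errors: GraphQLError[] };

// Classifies a response body that arrived with HTTP 200.
function classify<T>(body: GraphQLResponse<T>): Outcome<T> {
  const errors = body.errors ?? [];
  if (body.data == null) {
    return { kind: "failed", errors };
  }
  if (errors.length > 0) {
    // Some branch of the query failed; the rest of the data is still there.
    return { kind: "partial", data: body.data, errors };
  }
  return { kind: "ok", data: body.data };
}

// A 200 response where one field blew up inside a resolver.
const result = classify<{ invoices: unknown[] | null }>({
  data: { invoices: null },
  errors: [
    {
      message: "Cannot return null for non-nullable field",
      path: ["invoices", 0, "totalCents"],
    },
  ],
});
console.log(result.kind); // "partial"
```

A client that only checks the HTTP status, or only reads `data`, never reaches the `partial` branch, which is the failure mode described above.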

REST with explicit contracts is much less magical, which is exactly why it’s easier to operate. If GET /customers/:id/invoices breaks its response shape, generated TypeScript explodes in consumer CI. If the producer changes behavior, the Pact verification fails before deploy. If a route is down, you get a 5xx where observability tools already know what to do. Error budgets, alerting, retries, CDN behavior, cache keys, auth policies, all of that machinery already speaks HTTP fluently.

Consumer-driven contract tests matter because producers are terrible at predicting what consumers actually depend on. A backend team will say, truthfully, that removing invoice.totalCents is safe because the web app can compute display values from line items, then mobile reminds everyone they use that field in an offline view and the support dashboard has a CSV export depending on the exact integer. Pact catches that. We’ve had great results storing pacts per consumer, verifying them in the provider pipeline, then blocking release if verification drops. Simple rule, no drama.
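The mechanism is easy to see in miniature. The sketch below is not Pact's API; it is a hand-rolled illustration of the consumer-driven idea, where each consumer declares the fields it actually reads and the provider pipeline checks a sample response against all of them. The consumer names and the `verifyProvider` helper are hypothetical.

```typescript
// NOT Pact's API: a toy model of consumer-driven contract verification.
type FieldType = "string" | "number" | "boolean";
type ConsumerExpectation = {
  consumer: string; // hypothetical consumer name
  fields: Record<string, FieldType>; // fields this consumer actually reads
};

// Returns one message per unmet expectation; an empty array means safe to release.
function verifyProvider(
  sample: Record<string, unknown>,
  expectations: ConsumerExpectation[],
): string[] {
  const failures: string[] = [];
  for (const { consumer, fields } of expectations) {
    for (const [field, expected] of Object.entries(fields)) {
      const actual = typeof sample[field];
      if (actual !== expected) {
        failures.push(`${consumer}: expected ${field} to be ${expected}, got ${actual}`);
      }
    }
  }
  return failures;
}

const expectations: ConsumerExpectation[] = [
  { consumer: "web", fields: { id: "string", status: "string" } },
  { consumer: "mobile", fields: { id: "string", totalCents: "number" } },
  { consumer: "support-dashboard", fields: { totalCents: "number" } },
];

// The provider proposes dropping totalCents because web can compute it.
const proposed = { id: "inv_1", status: "paid" };
const failures = verifyProvider(proposed, expectations);
console.log(failures.length); // 2: mobile and support-dashboard both object
```

Web's expectations pass, which is exactly why the backend team believed the change was safe. The other two consumers are what block the release.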

One concrete pattern that works well is to version for breaking semantic changes in the path only when needed, keep additive changes in-place, and deprecate with dates in the spec. Humans can read it. Tooling can enforce it. Nobody has to mentally simulate a graph planner.
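As a sketch of "tooling can enforce it": OpenAPI has a standard `deprecated` flag on operations but no standard field for a removal date, so the `x-sunset` extension name below is an assumption, as are the `checkDeprecations` helper and its messages. The check flags deprecated operations that have no date and ones already past it.

```typescript
// Minimal slice of an OpenAPI operation. `deprecated` is standard;
// `x-sunset` is a made-up extension name holding the removal date.
type Operation = {
  operationId: string;
  deprecated?: boolean;
  "x-sunset"?: string; // ISO date, e.g. "2026-03-01"
};

type Finding = { operationId: string; problem: string };

// Flags deprecated operations with no date and ones already past their date.
function checkDeprecations(ops: Operation[], today: Date): Finding[] {
  const findings: Finding[] = [];
  for (const op of ops) {
    if (!op.deprecated) continue;
    const sunset = op["x-sunset"];
    if (sunset === undefined) {
      findings.push({ operationId: op.operationId, problem: "deprecated without a sunset date" });
      continue;
    }
    const sunsetDate = new Date(sunset);
    if (Number.isNaN(sunsetDate.getTime())) {
      findings.push({ operationId: op.operationId, problem: `unparseable sunset date: ${sunset}` });
    } else if (sunsetDate.getTime() <= today.getTime()) {
      findings.push({ operationId: op.operationId, problem: `past sunset date ${sunset}` });
    }
  }
  return findings;
}
```

A CI step could run this over every operation in a spec and fail the build when the returned list is non-empty, which turns a deprecation date from a comment into a deadline.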

how we cut change time

On one migration at steezr, a 12-service stack had drifted into the usual shape: a GraphQL gateway on the front, Django and Go services behind it, a couple of internal Node services doing document processing, PostgreSQL everywhere, Redis in the middle, and a frontend team that had learned to fear schema changes because every non-trivial feature crossed three team boundaries. Mean time to change for anything touching customer accounts was roughly eight working days, sometimes more, mostly waiting on coordination, not coding.

We didn’t try a dramatic rewrite. That would’ve been irresponsible. We started by inventorying actual consumer traffic from gateway logs, grouped operations by domain, and wrote OpenAPI 3.1 specs for the top fifteen workflows that represented around 80% of user-facing traffic. Then we put a thin HTTP edge in front of the existing services, Envoy for routing and auth context propagation, no fancy BFF logic, no aggregation except a handful of endpoints where aggregation was genuinely stable and domain-owned.
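The inventory step is mostly arithmetic. Here is a minimal sketch, assuming the gateway logs have already been aggregated into one row per operation; the `OperationCount` shape and both function names are invented for illustration, and the real analysis may well have been done differently.

```typescript
// One row per gateway operation, e.g. aggregated from access logs.
type OperationCount = { operation: string; domain: string; requests: number };

// Returns the busiest operations whose combined traffic share first reaches `target`.
function topOperations(counts: OperationCount[], target: number): OperationCount[] {
  const total = counts.reduce((sum, c) => sum + c.requests, 0);
  if (total === 0) return [];
  const sorted = [...counts].sort((a, b) => b.requests - a.requests);
  const picked: OperationCount[] = [];
  let covered = 0;
  for (const c of sorted) {
    if (covered / total >= target) break;
    picked.push(c);
    covered += c.requests;
  }
  return picked;
}

// Groups the selected operations by owning domain: one spec per domain.
function groupByDomain(ops: OperationCount[]): Map<string, string[]> {
  const byDomain = new Map<string, string[]>();
  for (const op of ops) {
    const list = byDomain.get(op.domain) ?? [];
    list.push(op.operation);
    byDomain.set(op.domain, list);
  }
  return byDomain;
}
```

Calling `groupByDomain(topOperations(counts, 0.8))` yields the shortlist of workflows to write specs for first, already split by the team that will own each contract.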

Clients were generated per consumer. Web got plain TypeScript types and a tiny wrapper around fetch; native got the same contract package through a shared workspace; internal tools using HTMX just called endpoints directly because they didn’t need generated SDKs. Provider repos added contract verification jobs. Consumer repos pinned spec versions, regenerated in CI, and failed fast on breaking changes. A simplified GitHub Actions job looked like this:

```yaml
name: contract-check
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # pnpm is installed before setup-node so that `cache: pnpm` can find it
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm codegen    # regenerate types from the pinned spec
      - run: pnpm typecheck  # tsc --noEmit fails on contract drift
      - run: pnpm contracts  # contract verification
```

The result showed up fast. Mean time to change dropped from roughly eight working days to about four within two months, mostly because frontend and backend could move independently again. Failures got louder. Ownership got clearer. Gateway work stopped being a specialist bottleneck.

the objections don’t hold

The standard pushback is frontend flexibility: teams want ad hoc query shapes, a graph gives them freedom, REST creates endpoint sprawl. That objection usually comes from teams who haven’t looked honestly at their traffic. Most production apps do not have infinite query variety. They have a small set of stable screens and workflows, and those workflows map cleanly to resource- or task-oriented endpoints. If you really need custom projections for a reporting surface, add sparse fieldsets or purpose-built read endpoints. Don’t build a distributed query planner for the entire company because one dashboard wants five joins.
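Sparse fieldsets are a small amount of code. The sketch below assumes a comma-separated `fields` query parameter, loosely modeled on JSON:API; the parameter name and both helpers are assumptions for illustration, not a prescribed design.

```typescript
// Parses a `fields` query value like "id,status" into a set of names.
// Returns null when the caller did not ask for a projection.
function parseFields(raw: string | undefined): Set<string> | null {
  if (raw === undefined || raw.trim() === "") return null;
  return new Set(
    raw
      .split(",")
      .map((f) => f.trim())
      .filter((f) => f.length > 0),
  );
}

// Returns a copy of `resource` containing only the requested fields.
// Unknown field names are ignored rather than treated as errors.
function project(
  resource: Record<string, unknown>,
  fields: Set<string> | null,
): Record<string, unknown> {
  if (fields === null) return { ...resource };
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(resource)) {
    if (fields.has(key)) out[key] = value;
  }
  return out;
}

const invoice = { id: "inv_1", status: "paid", totalCents: 1200 };
console.log(project(invoice, parseFields("id, totalCents")));
// { id: 'inv_1', totalCents: 1200 }
```

Because the full response shape stays fixed in the spec and the projection only removes keys, cache keys and contract tests keep working; the trade-off is that projected fields have to be optional in the generated client types.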

Another objection is over-fetching. Fine, over-fetching exists, and almost every team dramatically overstates its cost compared to the operational cost of federation. Sending an extra 1.5 KB of JSON over HTTP/2 or HTTP/3 is rarely your bottleneck. Unbounded fanout across subgraphs, hidden resolver waterfalls, and cache-hostile query variability, those are real bottlenecks. I’d take a slightly chunky response over a graph that requires Apollo Router experts to explain why p95 went from 180 ms to 900 ms after a harmless-looking schema update.

Then there’s tooling maturity. People assume GraphQL tooling is nicer because introspection and codegen feel slick. OpenAPI tooling in 2026 is fine, more than fine really, especially if your consumers are TypeScript-heavy. openapi-typescript, Redocly CLI, Stoplight Spectral, Pact, Dredd if you still want it, Prism for mocking, all of this works. You can lint specs, diff them in CI, generate SDKs, publish docs, and enforce style rules. None of that requires a central platform priesthood.

Use federation if you truly have many independent domains, many independent teams, and a strong platform group that treats graph governance as a product. Most startups don’t. They have six to fifteen services, one or two shared databases they’re trying to unwind, a handful of engineers wearing multiple hats, and no spare headcount for architectural vanity.

a saner default

My default stack for this class of company is pretty straightforward: domain HTTP APIs described in OpenAPI 3.1, generated TypeScript clients consumed by Next.js and React Native apps, provider verification with Pact, spec linting with Spectral, codegen and tsc --noEmit in every pull request, then Envoy or NGINX at the edge doing routing, auth, and maybe some very light response shaping where it’s justified. That setup matches how teams actually work. It also matches how systems fail in production, visibly and debuggably.

A practical repo layout helps a lot:

```text
/contracts
  /billing/openapi.yaml
  /accounts/openapi.yaml
/services
  /billing-api
  /accounts-api
/apps
  /web
  /mobile
/packages
  /api-clients
```

Keep contracts close to code, publish generated clients as workspace packages, and make breaking changes expensive on purpose. A good CI gate is boring: redocly lint, openapi-diff, openapi-typescript, tsc, Pact verification, done. If someone changes a response from totalCents: integer to total: string, they should get punched in the face by automation within minutes.
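To show what the diff step in that gate is looking for, here is a toy classifier for response-schema changes. It is not how openapi-diff or any other real tool works internally; the `ObjectSchema` slice, the `diffResponseSchema` name, and the rule set are simplifying assumptions.

```typescript
// Tiny slice of a JSON Schema object, just enough to illustrate the idea.
type ObjectSchema = {
  required: string[];
  properties: Record<string, { type: string }>;
};

type Change = { kind: "breaking" | "additive"; detail: string };

// Compares a response schema before and after a change.
// Assumption: for responses, removing, retyping, or un-requiring a property
// breaks consumers, while adding a property is safe.
function diffResponseSchema(before: ObjectSchema, after: ObjectSchema): Change[] {
  const changes: Change[] = [];
  for (const [field, prop] of Object.entries(before.properties)) {
    const next = after.properties[field];
    if (next === undefined) {
      changes.push({ kind: "breaking", detail: `removed ${field}` });
    } else if (next.type !== prop.type) {
      changes.push({ kind: "breaking", detail: `${field}: ${prop.type} -> ${next.type}` });
    } else if (before.required.includes(field) && !after.required.includes(field)) {
      changes.push({ kind: "breaking", detail: `${field} is no longer required` });
    }
  }
  for (const field of Object.keys(after.properties)) {
    if (!(field in before.properties)) {
      changes.push({ kind: "additive", detail: `added ${field}` });
    }
  }
  return changes;
}

const v1: ObjectSchema = {
  required: ["id", "status", "totalCents"],
  properties: { id: { type: "string" }, status: { type: "string" }, totalCents: { type: "integer" } },
};
const v2: ObjectSchema = {
  required: ["id", "status", "total"],
  properties: { id: { type: "string" }, status: { type: "string" }, total: { type: "string" } },
};
console.log(diffResponseSchema(v1, v2));
// one breaking change ("removed totalCents") and one additive ("added total")
```

The rename in the example registers as a removal plus an addition, so a gate that fails on any `breaking` entry stops it even though the new field looks harmless on its own.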

This is one of those architectural calls where boring wins. Mid-stage startups need change velocity, predictable failures, and systems that a senior engineer can understand after reading the repo for an hour. Federation usually pushes in the opposite direction. Typed OpenAPI plus consumer-driven contracts gets you back to a system shaped around delivery instead of ceremony, which is where most teams should’ve stayed in the first place.
