11 min read · Johnny Unar

Edge observability is mostly a telemetry problem

Next.js 15 edge functions and Go 1.22 services need different observability than long-lived servers. Trace cold starts, collect sampled flamegraphs, and tail-sample hard.

visibility breaks first

Edge and serverless get blamed for problems they didn't create. Latency spikes, random timeouts, p95 drift that only shows up in one region, cache misses that vanish before anyone can reproduce them: all of it gets filed under architecture risk, even though the actual failure is that most teams ship these runtimes with observability designed for a Kubernetes pod that lives for three days.

That model falls apart fast. A Next.js 15 route handler running in an edge runtime has a short life, constrained APIs, weird execution boundaries, and far less room for the kind of invasive profiling people got used to on a warm Node process. A Go 1.22 backend behind it has the opposite shape: steady process lifetime, better profiling hooks, easier context propagation, and enough CPU time to actually emit useful telemetry. If you instrument both layers the same way, you get the worst of each: an inflated bill, missing traces, and a dashboard full of averages that politely hide the bad cases.

The pattern that has held up best for us at steezr, across customer portals, AI-backed document pipelines, and a few internal systems with edge auth and regional routing, is pretty simple. Measure cold starts explicitly, collect profiles with aggressive sampling instead of permanent profiling everywhere, and tail-sample distributed traces based on latency, status code, and route importance instead of head-sampling everything at 10% and praying the interesting failure made the cut.

You also need to accept that edge observability is lossy by design. That's fine. Good systems work with useful loss. Bad systems pretend they can keep every event forever, then the first invoice lands and observability gets gutted by finance. The job is to keep the signals that answer production questions quickly, not to preserve every span generated by a bot hammering your preview deployment.

trace cold starts directly

Cold starts deserve first-class telemetry because they contaminate everything around them. If your edge function spends 180 ms booting, the downstream fetch to your Go API gets blamed, your synthetic checks point at the wrong service, and your incident review becomes a guessing contest.

For Next.js 15, instrument this in instrumentation.ts and in the route or middleware entrypoint where execution actually begins. The key is a process-local marker that only exists on the first invocation of that isolate, then a span attribute you can query later. Edge runtimes don't always give you rich process primitives, though globalThis is usually enough.

```ts
// instrumentation.ts
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('next-edge')

// First evaluation of this isolate: record boot time and mark it cold.
if (!(globalThis as any).__boot_ts) {
  ;(globalThis as any).__boot_ts = Date.now()
  ;(globalThis as any).__cold = true
}

// Next.js calls register() once per runtime boot; nothing extra needed here.
export async function register() {}

export async function markInvocation<T>(name: string, fn: () => Promise<T>) {
  return tracer.startActiveSpan(name, async span => {
    const cold = !!(globalThis as any).__cold
    const bootTs = (globalThis as any).__boot_ts as number
    span.setAttribute('runtime.name', 'nextjs-edge')
    span.setAttribute('deployment.environment', process.env.NODE_ENV || 'unknown')
    span.setAttribute('serverless.cold_start', cold)
    if (cold) {
      span.setAttribute('serverless.init_duration_ms', Date.now() - bootTs)
      ;(globalThis as any).__cold = false
    }
    try {
      return await fn()
    } catch (err: any) {
      span.recordException(err)
      span.setStatus({ code: SpanStatusCode.ERROR, message: err?.message })
      throw err
    } finally {
      span.end()
    }
  })
}
```

Then inside a route handler:

```ts
import { markInvocation } from '@/instrumentation'

export const runtime = 'edge'

export async function GET(req: Request) {
  return markInvocation('edge.products.GET', async () => {
    const t0 = performance.now()
    const res = await fetch(`${process.env.API_BASE_URL}/products`, {
      headers: { 'x-request-id': crypto.randomUUID() }
    })
    const body = await res.text()
    const duration = performance.now() - t0
    return new Response(body, {
      status: res.status,
      headers: {
        'content-type': res.headers.get('content-type') || 'application/json',
        'x-upstream-duration-ms': duration.toFixed(1)
      }
    })
  })
}
```

That serverless.cold_start=true attribute matters more than people think. Put it on the root span, break latency charts by it, and half the mystery disappears. On the Go side, stamp the same trace with container start age or process uptime so you can separate edge cold start from backend churn. If you don't do that, you'll spend weeks blaming Postgres for what was actually a fresh isolate booting in Frankfurt.

tail sample or drown

Head sampling is fine for low-volume systems and mediocre for everything else. Critical edge paths need tail sampling because you only know a trace is interesting after you've seen the whole thing: status code, retries, total duration, whether the cache missed, whether the user hit checkout instead of some harmless marketing route.

The OpenTelemetry Collector gives you enough control to be useful without turning into a science project. Keep the pipeline boring. Receive OTLP over gRPC and HTTP, batch, enrich a bit, then tail sample with explicit policies. This is a collector config we've used as a starting point:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  batch:
    send_batch_size: 2048
    timeout: 2s

  attributes/edge:
    actions:
      - key: service.namespace
        value: edge-platform
        action: upsert

  tail_sampling:
    decision_wait: 8s
    num_traces: 100000
    expected_new_traces_per_sec: 2500
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 750
      - name: cold-starts
        type: string_attribute
        string_attribute:
          key: serverless.cold_start
          values: ["true"]
      - name: checkout-routes
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/checkout", "/api/login"]
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 2

exporters:
  otlp/lightstep:
    endpoint: ingest.lightstep.com:443
    headers:
      lightstep-access-token: ${LIGHTSTEP_ACCESS_TOKEN}
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/edge, tail_sampling, batch]
      exporters: [otlp/lightstep, debug]
```

A few practical notes. decision_wait: 8s is long enough for most request chains that bounce from edge to API to Redis to Postgres, short enough that the collector doesn't hoard memory forever. The cold-starts rule saves traces you'd definitely miss with head sampling. The baseline 2% keeps enough normal traffic for comparison. Raise that on low-traffic services, lower it on noisy anonymous routes.
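Per-route baseline rates are possible too. Recent collector-contrib builds ship an `and` policy for the tail sampler that combines sub-policies, which lets you cap one noisy route at its own probability. The route and percentage below are invented examples, and the exact policy syntax varies by collector version, so check the tail sampling processor docs for your release:

```yaml
      - name: noisy-search-capped
        type: and
        and:
          and_sub_policy:
            - name: search-route
              type: string_attribute
              string_attribute:
                key: http.route
                values: ["/api/search"]
            - name: low-rate
              type: probabilistic
              probabilistic:
                sampling_percentage: 0.5
```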

If you're using Sentry for tracing on the frontend and edge layer, keep client-side head sampling low and route-specific. Something like this works without setting cash on fire:

```ts
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampler: samplingContext => {
    const name = samplingContext.name || ''
    const attrs = samplingContext.attributes || {}
    if (attrs['serverless.cold_start'] === true) return 1.0
    if (name.includes('/api/checkout')) return 0.5
    if (name.includes('/api/login')) return 0.3
    return 0.02
  },
  profilesSampleRate: 0.0
})
```

No, 100% on cold starts isn't expensive if your cold starts are rare. If they aren't rare, you have a deployment and traffic-shaping problem, and the traces will help prove it.

profiles need discipline

Flamegraphs are still the fastest way to kill bad assumptions, especially on Go services where people swear the bottleneck is network latency and pprof shows 37% CPU under JSON encoding or a regex hiding inside request validation. Always profile the Go tier. Profile the edge tier carefully, and only if your platform supports it without hacks.

Go 1.22 already gives you the basics with almost no drama. Expose net/http/pprof on an internal port, lock it down, scrape short CPU profiles during incidents, and use continuous profiling where available, Pyroscope, Parca, Grafana Cloud Profiles, take your pick. Sample-based profiling overhead is usually acceptable if you keep collection intervals sane.

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on DefaultServeMux
)

func startDebugServer() {
	go func() {
		mux := http.NewServeMux()
		mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ok"))
		})
		mux.Handle("/debug/pprof/", http.DefaultServeMux)
		// Loopback only; reach it via port-forward, never the public network.
		if err := http.ListenAndServe("127.0.0.1:6060", mux); err != nil {
			log.Printf("pprof server failed: %v", err)
		}
	}()
}
```

Then capture a 30-second profile from a pod or VM:

```bash
go tool pprof -http=:0 "http://127.0.0.1:6060/debug/pprof/profile?seconds=30"
```

If the profile shows runtime.mallocgc and encoding/json dominating, don't add more collectors or dashboards; fix allocation churn, or switch hot paths to jsoniter or precomputed response buffers where it actually matters. If database/sql wait time is huge, the flamegraph won't solve it, though it will stop the wrong argument in Slack.

Next.js edge profiling is rougher because you're inside a hosted isolate with limited introspection. Vercel's built-in observability has improved, though for code-level CPU analysis you still usually learn more by reproducing equivalent logic in the Node runtime or pushing expensive work into Go where pprof exists and tells the truth. That isn't a purity argument. It's a tooling argument. Spend your observability budget where the instrumentation surface is real.

One thing I would avoid is always-on high-rate profiling across every backend service. Teams love the idea for about two weeks, right until retention costs spike and nobody can explain why they need profiles for a cron worker that runs six times a day.

propagate context cleanly

Distributed tracing dies the moment context propagation gets sloppy, and edge systems get sloppy fast because the request may cross browser code, middleware, route handlers, fetch boundaries, an API gateway, then a Go service that spawns goroutines for parallel lookups. One missing header hop and your trace turns into decorative confetti.

Stick to W3C Trace Context, traceparent and tracestate, unless you have a really good reason not to. In Next.js 15 edge handlers, forward those headers explicitly on every internal fetch. Don't assume your platform does it consistently.

```ts
const traceparent = req.headers.get('traceparent')
const tracestate = req.headers.get('tracestate')

const upstream = await fetch(`${process.env.API_BASE_URL}/v1/session`, {
  headers: {
    'traceparent': traceparent || '',
    'tracestate': tracestate || '',
    'x-request-id': req.headers.get('x-request-id') || crypto.randomUUID()
  }
})
```

On Go, use the OTel HTTP middleware and stop hand-rolling half-baked span setup unless you enjoy debugging missing parent IDs at 2 a.m.

```go
import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// otelhttp extracts traceparent/tracestate from incoming requests,
// so edge spans become the parents of every server span here.
handler := otelhttp.NewHandler(apiMux, "go-api")

srv := &http.Server{
	Addr:    ":8080",
	Handler: handler,
}
```

Then add attributes that actually help during triage: region, deployment ID, cache hit, tenant ID if your privacy model allows it, edge colo, upstream timeout bucket. Skip vanity attributes nobody queries. I keep seeing spans stuffed with fifty tags because someone read a vendor blog post and got excited. That just bloats storage and slows queries.

The best trace attribute is the one you'll filter on during an incident. For edge systems, that's usually serverless.cold_start, cloud.region, http.route, http.status_code, cache.hit, and a deploy marker such as service.version or your Git SHA. Everything else needs to justify its existence.

retention without regret

Observability budgeting needs actual numbers, otherwise teams either under-collect and fly blind or keep everything for 30 days and act shocked when telemetry becomes one of the top five infra costs.

A sane starting budget for a mid-sized product, say 50 million requests per month, a Next.js edge frontend, a few Go APIs, PostgreSQL, Redis, looks like this. Keep metrics at 30 days for high-cardinality service dashboards only if your backend handles it well, otherwise downsample after 7 days. Keep traces at full fidelity for sampled important traffic for 7 days, then index-only or aggregate views for 30. Keep profiles for 3 to 7 days unless you're actively doing performance work. Keep logs shorter than everybody wants, usually 7 days for app logs, longer only for audit events and security-relevant streams.

A rough trace budget. Assume tail sampling preserves 100% of errors, 100% of cold starts, 100% of checkout and auth routes, and 2% of the rest. Many teams land around 3% to 6% effective retention once all rules combine. If your average stored trace is 12 KB compressed and you preserve 2.5 million traces per month, that's about 30 GB raw trace payload before indexing overhead. The bill comes from the indexing and query engine, not the wire bytes. Plan around vendor pricing for indexed spans, not just storage on disk.

For Sentry, keep transaction sampling aggressive on hot business paths and near-zero elsewhere. For Lightstep or another OTLP backend, push retention of expensive raw traces down and export long-lived RED metrics out of the collector if your vendor supports span-to-metrics. If self-hosting, ClickHouse-based stacks can get cost-efficient fast, though they absolutely demand someone on the team who understands merge behavior, TTLs, and why a naive schema will wreck query performance.
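If your collector build includes the spanmetrics connector, the span-to-metrics idea looks roughly like the fragment below. Treat it as a shape, not a drop-in config; the connector name, bucket boundaries, and exporter wiring depend on your collector-contrib version and backend:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [50ms, 250ms, 750ms, 2s]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, otlp/lightstep]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlp/lightstep]
```

The point is that RED metrics survive long after the raw traces age out, so you can cut trace retention without losing the trend lines.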

A practical rule I trust more than any pricing calculator, if nobody has looked at a telemetry class in two weeks, cut retention or cut sample rate. Observability data should earn its keep.

That's also where senior judgment matters. Critical auth failures deserve generous retention. A noisy public search endpoint hit by crawlers doesn't. Spend the money on traces that explain user pain and revenue risk, then keep the rest just long enough to spot trends.

Written by Johnny Unar
