Johnny Unar · 9 min read

Stop trusting Lighthouse for product decisions

Lighthouse is fine for smoke tests. Revenue decisions need RUM, user journeys, and performance data tied to conversion.

the score trap

Lighthouse is useful, I run it all the time, and I still think teams trust it far too much. A clean 95 on a MacBook Pro over a fast local network tells you almost nothing about the customer on a mid-range Samsung phone, on patchy 4G, opening your pricing page inside an in-app browser with six other tabs fighting for memory. That customer is the one who bounces, not the one sitting inside your synthetic report.

The deeper problem is that Lighthouse collapses performance into a tidy score, which makes it irresistible to product people because a single number fits neatly into a sprint goal, a dashboard, or a board slide. Single numbers are comforting and usually misleading. I’ve seen teams spend two weeks shaving 200 ms off Total Blocking Time on a marketing page while their actual checkout abandonment was caused by a React hydration stall on the shipping step that only appeared on lower-end Android devices after a third-party fraud script initialized.

Synthetic testing has a place. CI should fail if someone ships a 4.2 MB JavaScript bundle to the homepage, or if Largest Contentful Paint regresses by 40 percent in a controlled environment. That’s basic hygiene. The mistake starts when hygiene gets treated as strategy. Lighthouse won’t tell you that users from paid acquisition are hitting a consent banner race condition, that your CRM widget adds 1.8 seconds before the first interaction on account pages, or that one customer segment tolerates slower dashboards while another drops off hard if the signup form takes longer than three seconds to become interactive.

Teams at Steezr usually hit this wall after a few months of scale. The app feels fast enough in staging, the score looks respectable, support tickets mention slowness in vague terms, and revenue still behaves strangely. You can’t fix that with prettier lab numbers. You need telemetry tied to real behavior, in production, across actual user journeys.

measure real users

Real User Monitoring should be the center of your performance strategy, not an afterthought bolted onto whatever your frontend framework already logs. Start with the Web Vitals, sure: LCP, INP, and CLS, plus TTFB if you care about backend latency on server-rendered routes. Then add custom spans around the work your product actually does. A dashboard route that fetches seven widgets, decrypts tenant settings, bootstraps a feature flag client, and hydrates a giant table needs more than a generic page-load metric.
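A minimal sketch of what "custom spans around the work your product does" can look like: a timing wrapper for each bootstrap phase. The `report` callback is a stand-in for whatever telemetry sink you use (a Sentry span, OpenTelemetry, a beacon); the names here are illustrative, not from any library.

```ts
// Hedged sketch: time an async bootstrap phase and hand the duration to a
// telemetry sink. `report` defaults to a no-op so the wrapper is safe anywhere.
async function timed<T>(
  name: string,
  fn: () => Promise<T>,
  report: (name: string, ms: number) => void = () => {},
): Promise<T> {
  const t0 = performance.now();
  try {
    return await fn();
  } finally {
    // Always report, even when the phase throws; failures are still latency.
    report(name, performance.now() - t0);
  }
}

// Usage on a dashboard route, with boring and consistent span names:
// const widgets = await timed('dashboard.widgets.fetch', fetchWidgets, sendSpan);
// const flags = await timed('dashboard.flags.bootstrap', initFlags, sendSpan);
```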

For a Next.js 14 app, we usually wire web-vitals on the client, ship the events to Sentry or OpenTelemetry, and tag them aggressively. Device class, effective connection type, route template, experiment variant, logged-in state, customer segment, sometimes even plan tier. Raw vitals without context are barely better than Lighthouse. Context is what lets you answer painful questions.

A simple example in app/instrumentation-client.ts works fine:

```ts
import { onLCP, onINP, onCLS, type Metric } from 'web-vitals';

function send(metric: Metric) {
  // sendBeacon survives page unload, unlike a plain fetch.
  navigator.sendBeacon('/api/vitals', JSON.stringify({
    name: metric.name,
    value: metric.value,
    id: metric.id,
    rating: metric.rating,
    path: window.location.pathname,
    // Context tags: effective connection type and device memory.
    // Both are non-standard, so the casts stay explicit.
    conn: (navigator as any).connection?.effectiveType,
    mem: (navigator as any).deviceMemory,
  }));
}

onLCP(send);
onINP(send);
onCLS(send);
```

That’s the easy part. The useful part is custom timing around specific UX boundaries. Measure time to first rendered price on a pricing configurator. Measure time from route transition start to the first enabled state of the primary CTA. Measure time until checkout step validation becomes responsive after the user types. If your backend is Django, add server spans with request IDs and propagate them through the client so one ugly trace can explain a bad conversion session instead of forcing everybody to guess.
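Those UX boundaries map cleanly onto the User Timing API. A sketch, with illustrative moment names (`pricing:first-price` is not a standard, just a convention worth keeping boring):

```ts
// Sketch of custom UX-boundary timings using performance.mark/measure.
function markStart(moment: string): void {
  performance.mark(`${moment}:start`);
}

function markEnd(moment: string): number {
  performance.mark(`${moment}:end`);
  performance.measure(moment, `${moment}:start`, `${moment}:end`);
  const entries = performance.getEntriesByName(moment, 'measure');
  const duration = entries[entries.length - 1].duration;
  // In production, beacon `duration` with route, segment, and device tags.
  return duration;
}

// markStart('pricing:first-price') when the route transition begins,
// markEnd('pricing:first-price') when the first price actually renders.
```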

OpenTelemetry is good at this. Use it. Emit spans from the browser and the server, keep the naming boring and consistent, and resist the urge to track fifty things. Track the moments users feel.

map performance to funnels

Most teams collect performance data and business data in separate silos, then wonder why prioritization turns into politics. Engineering shows a graph of p75 LCP. Product shows a graph of checkout conversion. Nobody can prove causation, everyone argues from instinct, and the loudest person gets the sprint allocation.

Tie the metrics together at the funnel level. That means defining a few core journeys (signup, checkout, lead submission, dashboard activation, document upload, whatever actually drives the business) and attaching performance measurements to each step. Not pageviews, steps. A route can contain multiple moments that matter, and one page can sit inside several different journeys depending on intent.

Take a simple SaaS signup flow. Step one is landing page view. Step two is form start. Step three is form submit. Step four is email verification. Step five is first successful session. Now add performance fields to every event: p75 INP at form start, time to interactive on the signup route, API latency for account creation, hydration delay before the submit button becomes stable, third-party script cost before verification UI appears. Once those dimensions live in the same event stream, you can ask useful questions, like whether users with INP above 350 ms convert materially worse, or whether mobile users on low-memory devices abandon after your onboarding checklist paints late.
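Once vitals and funnel events share a stream, the "does INP above 350 ms convert worse" question becomes a few lines. A minimal sketch, with illustrative field names rather than any standard schema:

```ts
// One row per session, carrying both a performance dimension and an outcome.
interface SessionRecord {
  inpMs: number;
  converted: boolean;
}

// Split sessions at an INP threshold and compare conversion on each side.
function conversionByInp(sessions: SessionRecord[], thresholdMs: number) {
  const rate = (group: SessionRecord[]) =>
    group.length === 0 ? 0 : group.filter((s) => s.converted).length / group.length;
  return {
    fastRate: rate(sessions.filter((s) => s.inpMs <= thresholdMs)),
    slowRate: rate(sessions.filter((s) => s.inpMs > thresholdMs)),
  };
}
```

A meaningful gap between `fastRate` and `slowRate` on a single step is the kind of evidence that survives a prioritization meeting.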

Mixpanel, PostHog, Amplitude, BigQuery, any of them can handle this if your event model is sane. The event model is usually not sane. Teams spray analytics events around the app with names like button_clicked_final_v2 and then act surprised when analysis becomes impossible. Spend a day fixing the taxonomy. It pays for itself immediately.

One of the least glamorous and most effective pieces of work we do on ERP and portal projects is building this mapping layer properly. You end up with a table that says, in plain terms, if route X crosses this threshold on these devices, step Y loses Z percent of users. That’s what prioritization should rest on.

run ugly experiments

Correlation is a start, not the finish. If you want budget for performance work, especially at the CTO or product level, run experiments that isolate impact. Keep them ugly and practical. Fancy methodology is less useful than a blunt test that changes one thing and moves money.

Throttle a non-critical third-party script for 20 percent of traffic and compare conversion. Delay chat widget boot until idle on half of sessions. Ship a lighter hero image variant to mobile users. Replace a client-rendered pricing calculator with a server-rendered shell plus deferred enhancements. Then watch both performance and business metrics together. If p75 LCP improves by 600 ms and nothing changes in the funnel, great, you just learned that route wasn’t the problem. Move on. If a 150 ms improvement in input responsiveness on checkout step two produces a measurable lift in completion, that becomes an easy roadmap decision.
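The "delay chat widget boot for half of sessions" experiment needs nothing fancier than deterministic bucketing. A sketch, with an illustrative hash (any stable hash works; the point is that the same session always lands in the same bucket):

```ts
// Deterministic 50/50 split by session id. The hash here is a simple
// illustrative rolling hash, not a recommendation for anything else.
function variantFor(sessionId: string): 'control' | 'deferred' {
  let h = 0;
  for (const c of sessionId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 2 === 0 ? 'control' : 'deferred';
}

// In the browser, gate the widget on the variant:
// if (variantFor(sessionId) === 'deferred') requestIdleCallback(bootChatWidget);
// else bootChatWidget();
```

Tag every funnel event with the variant and the comparison falls out of the same queries you already run.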

The point is to stop treating performance as a moral virtue and start treating it as an economic variable. I’ve watched teams burn quarters on vague “make the app faster” initiatives because nobody forced the question: faster for which journey, on which devices, at which step, with what expected payoff. That question cuts through a lot of nonsense.

You don’t need a full data science team for this. A PostgreSQL table with session IDs, vitals, funnel events, and experiment flags gets you surprisingly far. A query like this already beats most dashboards:

```sql
select
  experiment_variant,
  percentile_cont(0.75) within group (order by inp_ms) as p75_inp,
  avg(case when completed_checkout then 1 else 0 end) as checkout_rate
from checkout_sessions
where device_type = 'mobile'
group by experiment_variant;
```

This kind of analysis is boring, mechanical, and brutally effective. Good. Boring systems win.

what to do monday

Start small, and start where revenue is already sensitive. Pick one journey. Usually checkout, signup, or lead capture. Instrument Web Vitals on that journey, add two or three custom timings around the moments a user actually experiences, and get those events into the same warehouse or analytics tool that holds conversion data. If your stack is Next.js and Django, this is very achievable in a week without derailing feature work.

Next, set thresholds that matter to users, not vanity targets copied from a blog post. Maybe your internal rule becomes: mobile checkout step response must stay under 200 ms p75 INP, first usable pricing state must appear under 2.5 seconds on 4G, dashboard shell must render under 1.8 seconds for logged-in users in your primary market. Those targets should come from observed drop-off curves, not generic best practices.
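Thresholds like these are worth encoding rather than leaving in a wiki. A sketch of a per-route budget table plus a check you can run in CI or an alerting job; the routes, metrics, and numbers are illustrative:

```ts
// Route-level budgets derived from observed drop-off curves, not generic
// best-practice numbers. Values here are examples, not recommendations.
const budgets: Record<string, { metric: 'INP' | 'LCP'; p75MaxMs: number }> = {
  '/checkout': { metric: 'INP', p75MaxMs: 200 },
  '/pricing': { metric: 'LCP', p75MaxMs: 2500 },
};

// True when a measured p75 breaks the budget for that route and metric.
function violatesBudget(route: string, metric: string, p75Ms: number): boolean {
  const budget = budgets[route];
  return !!budget && budget.metric === metric && p75Ms > budget.p75MaxMs;
}
```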

After that, kill a few illusions. Stop celebrating aggregate Lighthouse improvements that don’t touch key funnels. Stop adding third-party tools without a performance budget. Stop reviewing frontend pull requests without bundle diff checks. Put @next/bundle-analyzer in the repo, inspect the chunks, and make people justify large dependencies. If someone imports a date library that adds 90 KB gzip to a critical route, block it. If a Tailwind-heavy page hides a hydration problem because it looks complete before it works, measure the gap explicitly.
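Wiring up @next/bundle-analyzer is a one-file change; this is the standard setup from the package’s documentation, gated behind an environment variable so it only runs on demand:

```ts
// next.config.mjs — run `ANALYZE=true next build` to get the chunk report.
import bundleAnalyzer from '@next/bundle-analyzer';

const withBundleAnalyzer = bundleAnalyzer({
  enabled: process.env.ANALYZE === 'true',
});

export default withBundleAnalyzer({
  // ...your existing Next.js config
});
```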

Finally, make performance ownership cross-functional. Product has to care because prioritization sits there. Engineering has to care because implementation sits there. Leadership has to care because budget sits there. A score in Lighthouse can still live in CI, that’s fine, just keep it in its place, as a guardrail, not a compass. Compasses need to point toward user behavior and business outcomes, otherwise you’re optimizing a lab experiment while customers quietly leave.
