At Rocket, we obsess over time-to-interaction. This is the story of how our infrastructure team rethought warm capacity, predictive scaling, and baseline metrics to make slowness a problem we stopped tolerating.
| Metric | Result |
|---|---|
| P99 Latency Reduction | 62% |
| Concurrent Load Handled | 4× |
Latency is a tax users never agreed to pay. Every extra millisecond erodes trust, compounds frustration, and eventually becomes a churn signal you can't ignore. At Rocket, as our user base scaled through this past year, we started noticing a pattern that our dashboards were quietly normalizing: users hitting our platform at peak hours were experiencing session-opening delays that were simply unacceptable for a product at our tier.
The culprit, as it often is in SaaS infrastructure, wasn't a single bad query or an unoptimized endpoint. It was something more systemic — the gap between when demand arrived and when our infrastructure was ready to serve it.
Cold starts aren't new. The community has fought them for years, and the challenge looks familiar even outside of function-as-a-service architectures. At Rocket, our services needed initialization time — environment setup, dependency hydration, connection pool warming — before they could serve traffic at full capacity. Under typical daytime loads, this was masked by the fact that instances were already warm and running.
But at scale inflection points — Monday morning spikes, post-feature-launch surges, seasonal load events — autoscaling would kick in and spin up new capacity. Those new instances arrived cold. And users landed on them.
The real cost of cold starts: A cold instance doesn't just add latency to a single request — it creates a cascade. Connection pools aren't established, caches are empty, and the instance may shed or queue subsequent requests while bootstrapping. A single cold instance under load can visibly degrade dozens of concurrent users.
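To make the bootstrapping cost concrete, here is a minimal sketch of the kind of initialization an instance has to finish before it can serve at full capacity. The step names (hydrate_dependencies, warm_connection_pool, prime_local_cache) and the durations are illustrative assumptions for the example, not our actual service code.

```python
import time

# Illustrative bootstrap steps; names and durations are assumptions, not Rocket's code.
def hydrate_dependencies() -> None:
    time.sleep(0.3)  # stand-in for loading config, secrets, and feature flags

def warm_connection_pool(size: int = 20) -> list:
    # stand-in for opening downstream/database connections ahead of traffic
    return [f"conn-{i}" for i in range(size)]

def prime_local_cache() -> dict:
    return {"hot_keys": []}  # stand-in for pre-filling frequently read entries

class Instance:
    def __init__(self) -> None:
        self.ready = False

    def bootstrap(self) -> None:
        # Until every step completes, any request routed here pays the cold-start cost.
        hydrate_dependencies()
        self.pool = warm_connection_pool()
        self.cache = prime_local_cache()
        self.ready = True

if __name__ == "__main__":
    instance = Instance()
    start = time.monotonic()
    instance.bootstrap()
    print(f"ready after {time.monotonic() - start:.2f}s with {len(instance.pool)} warm connections")
```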
Our first instinct — shared by most teams that hit this problem — was to keep a fixed floor of pre-warmed instances running at all times. It worked, but it was blunt. We were over-provisioning during off-peak hours and under-provisioning during the moments that mattered. The economics and the experience were both suboptimal.
We needed warm capacity that was dynamic and anticipatory, not just static. The insight was straightforward once we said it out loud: if we can predict concurrent user arrival rate with reasonable accuracy, we can stage warm instances ahead of demand rather than in reaction to it.
Usage signals → Forecast model → Warm pool target → Capacity provisioner
Prediction pipeline — runs continuously, feeds the warm pool ahead of demand
We built a forecasting pipeline that ingests historical concurrency patterns, time-of-day and day-of-week signals, recent trend deltas, and product-level event data — scheduled communications, active campaigns, release windows. The output isn't a capacity number. It's a warm pool target that updates continuously, so our provisioner always knows how many instances should be sitting ready before traffic actually lands.
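As a rough illustration of how a forecast turns into a warm pool target, the sketch below folds the signal categories described above into a single concurrency estimate and divides by per-instance capacity. The field names, headroom factor, and per-instance capacity are assumptions for the example, not our production model.

```python
import math
from dataclasses import dataclass

@dataclass
class ForecastInputs:
    # Illustrative signal set; field names are assumptions, not our actual schema.
    historical_concurrency: float  # typical concurrency for this time-of-day / day-of-week slot
    recent_trend_delta: float      # short-horizon growth versus the same slot in recent weeks
    event_uplift: float            # expected extra concurrency from campaigns or release windows

def warm_pool_target(inputs: ForecastInputs,
                     users_per_instance: float = 250.0,
                     headroom: float = 0.15,
                     floor: int = 2) -> int:
    """Translate a concurrency forecast into a count of pre-warmed instances."""
    forecast = (inputs.historical_concurrency
                + inputs.recent_trend_delta
                + inputs.event_uplift)
    # Headroom absorbs forecast error; the floor keeps a minimal warm pool at all times.
    return max(floor, math.ceil(forecast * (1 + headroom) / users_per_instance))

if __name__ == "__main__":
    monday_9am = ForecastInputs(historical_concurrency=4_200,
                                recent_trend_delta=300,
                                event_uplift=500)
    print(warm_pool_target(monday_9am))  # instances to stage before the ramp lands
```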
The provisioner treats warm capacity as a first-class resource, separate from reactive autoscaling. When the forecast indicates a concurrency ramp approaching, instances are initialized and fully bootstrapped before users arrive. When demand subsides, warm pool size steps down gracefully.
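One way to picture the provisioner's behavior is as a reconciliation tick that jumps up to the target immediately but releases capacity gradually. The step-down limit and the cadence implied here are illustrative assumptions, not the actual policy.

```python
def reconcile_warm_pool(current_warm: int, target: int, max_step_down: int = 2) -> int:
    """One reconciliation tick: scale up immediately, release warm capacity gradually."""
    if target > current_warm:
        # A ramp is approaching: bootstrap the full difference now so instances
        # are warm before users arrive, not after.
        return target
    # Demand is subsiding: step down a few instances at a time to avoid
    # thrashing if the forecast briefly dips.
    return max(target, current_warm - max_step_down)

if __name__ == "__main__":
    pool = 4
    for target in [4, 9, 9, 6, 3, 3]:  # successive warm pool targets from the forecast
        pool = reconcile_warm_pool(pool, target)
        print(f"target={target} warm_pool={pool}")
```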
Solving warm capacity was step one. But we quickly realized we had a second problem: the thresholds we used to define "healthy" capacity were static. They'd been set during an earlier, lower-load era of the product and never revisited in a systematic way.
A static baseline treats a Tuesday at 2am and a Monday at 9am the same. That's not a scaling strategy — it's an assumption. We needed our baseline metrics to be load-aware, shifting to reflect the operational context of the moment.
Concurrency-relative thresholds
Scaling triggers are now expressed as ratios of current concurrency, not fixed absolute values. A threshold that makes sense at 500 concurrent users automatically recalibrates when you're at 5,000.
Rolling baseline windows
Instead of a single baseline computed from all-time averages, we maintain rolling windows at multiple granularities. The system compares current behavior against recent norms, not historical ones.
Continuous baseline recalibration
As load profiles evolve — new features ship, usage patterns shift — baselines update automatically. We no longer treat baseline-setting as a manual, quarterly exercise.
Alerting that accounts for context
Anomaly detection now fires relative to load-adjusted expectations. A spike that's normal under peak concurrency no longer produces the false positives that had been training our on-call team to tune out alerts.
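A minimal sketch of the four ideas above, under assumed numbers: triggers expressed as ratios of live concurrency, rolling windows at two granularities, recalibration that happens simply because the windows roll forward, and an anomaly check made against recent rather than all-time norms. The window sizes, scale-out ratio, and tolerance factor are illustrative, not our production configuration.

```python
from collections import deque
from statistics import mean

class LoadAwareBaseline:
    """Sketch of ratio-based triggers, rolling baselines, and context-aware alerting.
    All constants here are illustrative assumptions."""

    def __init__(self, window_minutes=(15, 360), scale_out_ratio=0.02, tolerance=1.5):
        # Rolling windows at multiple granularities (e.g. 15 minutes and 6 hours
        # of one-minute samples) instead of a single all-time average.
        self.windows = {m: deque(maxlen=m) for m in window_minutes}
        self.scale_out_ratio = scale_out_ratio
        self.tolerance = tolerance

    def observe(self, p99_ms: float) -> None:
        # Baselines recalibrate continuously because the windows roll forward.
        for window in self.windows.values():
            window.append(p99_ms)

    def scale_out_threshold(self, current_concurrency: float) -> float:
        # Trigger expressed as a ratio of live concurrency: the same relative
        # pressure whether you're at 500 users or 5,000.
        return current_concurrency * self.scale_out_ratio

    def is_anomalous(self, observed_p99_ms: float) -> bool:
        # Alert against recent norms rather than historical ones.
        baselines = [mean(w) for w in self.windows.values() if w]
        if not baselines:
            return False
        return observed_p99_ms > max(baselines) * self.tolerance

if __name__ == "__main__":
    monitor = LoadAwareBaseline()
    for sample in [420, 450, 430, 440, 460]:   # one-minute P99 samples (ms)
        monitor.observe(sample)
    print(monitor.scale_out_threshold(5_000))  # queue-depth trigger at 5,000 concurrent users
    print(monitor.is_anomalous(900))           # well above recent norms -> True
    print(monitor.is_anomalous(500))           # within load-adjusted expectations -> False
```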
P99 response time · Session open · Before vs. after
| Period | P99 (ms) |
|---|---|
| Before | 2,340 |
| After | 890 |
Cold start incidents per week · 30-day rolling
| Period | Incidents |
|---|---|
| Week –4 | 147 |
| Week –3 | 134 |
| Week –2 | 52 |
| Week –1 | 17 |
| This week | 4 |
The tail latency reduction was the number we cared most about. P50 and P75 were already reasonable — the pain lived in the P99, and that's where users were forming their worst-case impressions of Rocket. Bringing P99 session-open time from ~2.3 seconds to under 900ms was a meaningful jump, and it correlated directly with improvements in our session continuation and feature engagement metrics.
Cold start incidents — defined as a user-visible initialization delay attributable to an un-warmed instance — dropped from triple digits per week to single digits.
Reactive autoscaling is a latency floor, not a ceiling. Spinning up new capacity in response to demand is table stakes. If that new capacity isn't warm, you've added capacity without adding performance for the users who triggered the scale event.
Static baselines age poorly. Any threshold that was "set and forgotten" is probably wrong today. The right question isn't "is this threshold too high or too low?" — it's "does this threshold still reflect what normal looks like at current scale?"
Prediction doesn't need to be perfect to be useful. Our forecast isn't clairvoyant. It doesn't need to be. A warm pool sized for anticipated concurrency 80% of the time eliminates 80% of your cold start surface area. The marginal cost of over-provisioning in that window is small relative to the experience improvement.
Latency has a compounding cost at the user level. A 2-second delay at session open doesn't just feel slow — it trains users to expect slowness. Recovering that trust, even after the performance is fixed, takes longer than fixing the problem did.
We're now looking at extending the same predictive approach to database connection pool sizing and edge cache warming. The primitive is the same — anticipate demand contours, provision ahead of arrival — and there are meaningful latency gains still on the table in both areas.