Stop Staring at CPU: Capacity Models That Predict Incidents Before They Happen
Leading indicators, not vanity graphs. Tie telemetry to triage and rollout automation so you scale before users feel pain.
“Capacity planning that predicts events beats scaling that reacts to incidents.”
The 2 a.m. page you could’ve avoided
I’ve watched teams at unicorns and banks get paged at 2 a.m. with dashboards full of green CPU graphs while users time out. Classic symptom: thread pools are pinned, Kafka lag is growing, GC is thrashing, but average CPU is fine. At one fintech, HPA scaled on 80% CPU and missed a queue depth spike during a promo push. We didn’t need more cores; we needed more consumers.
The fix wasn’t another tooltip in Grafana. It was building a capacity model around leading indicators and wiring that into autoscaling and rollout gates. When we did, incidents dropped, and the only green thing we cared about was the error budget line.
Measure what predicts incidents, not what looks pretty
If your model starts with CPU averages, it ends with a pager. Here’s what actually predicts pain:
- Saturation: thread pool utilization, connection pool in_use/max, disk iowait, Envoy active connections.
- Concurrency: requests in flight, per-pod inflight gauge, worker goroutine count.
- Queueing: Kafka consumer lag growth rate, in-memory queue depth (and drain rate), Redis stream lag.
- Tail latency: p95/p99 under load, not just median.
- Garbage collection: GC pause ratio, young/old gen promotion rates.
- Retry storms: increases in 5xx + 429 + grpc_retryable_errors.
- Burn rate against SLO: error budget consumed per hour.
Prometheus queries that have actually saved weekends:
# Kafka lag growth rate (msgs/sec)
rate(kafka_consumer_group_lag{group="payments"}[5m])
# Thread pool saturation (Java)
jvm_threads_daemon{service="checkout"} / jvm_threads_current{service="checkout"}
# GC pause ratio (Go)
rate(go_gc_duration_seconds_sum[5m]) / rate(process_cpu_seconds_total[5m])
# Inflight requests per pod (Envoy)
sum by (pod)(envoy_http_downstream_cx_active{app="api"})
# p99 latency
histogram_quantile(0.99, sum by (le)(rate(http_server_duration_seconds_bucket{route="/charge"}[5m])))
# SLO burn rate (requests)
(
sum(rate(http_requests_total{job="api",code=~"5..|429"}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
) / (1 - 0.995)
Build a capacity model you can explain in a postmortem
You don’t need PhD math; you need a defensible model you can automate.
- Instrument: expose custom metrics for concurrency, queue depth, and service rate.
- Service curve: run a load test (Locust, k6, Vegeta) to map RPS to p99 and error rate. Identify the knee point.
- Forecast demand: Holt-Winters/Prophet for seasonality; overlay events (marketing, backfills).
- Translate to capacity: use required_pods = ceil(forecasted_concurrency / sustainable_concurrency_per_pod × (1 + buffer)).
- Automate policy: push min/max replicas and autoscaling targets via GitOps.
Quick-and-dirty forecasting that’s worked in anger:
# python3 capacity.py
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# qps_ts: time series of concurrent requests (or RPS) per minute
qps_ts = pd.read_csv('qps.csv', parse_dates=['ts'], index_col='ts').qps
model = ExponentialSmoothing(qps_ts, trend='add', seasonal='add', seasonal_periods=1440)
fit = model.fit(optimized=True)
forecast = fit.forecast(240) # next 4 hours
sustainable_conc_per_pod = 120 # from load test at p99<200ms
buffer = 0.3 # 30% headroom for variance
required_pods = ((forecast / sustainable_conc_per_pod) * (1 + buffer)).apply(lambda x: int(max(3, round(x))))
required_pods.to_csv('desired_replicas.csv')
print(required_pods.tail())
Pair that with load-test notes like: “api v1.12 on c6i.large sustains 120 inflight @ p99 180ms; errors rise >1% at 160.” Now you have a model you can defend.
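Extracting that “last good point” from a stepped load test is mechanical: take the highest tested concurrency where both p99 and error rate still meet the SLO. A minimal sketch, using hypothetical load-test numbers (the real rows come from your own Locust/k6/Vegeta output):

```python
# Each row: (inflight, p99_seconds, error_rate) from a stepped load test.
# Hypothetical numbers for illustration; substitute your own results.
steps = [
    (40, 0.090, 0.001),
    (80, 0.120, 0.002),
    (120, 0.180, 0.004),
    (160, 0.350, 0.015),  # p99 and errors blow past the SLO here
]

SLO_P99 = 0.300   # 300ms
SLO_ERR = 0.005   # 0.5% errors

def sustainable_concurrency(steps, slo_p99, slo_err):
    """Return the highest tested concurrency that still meets the SLO."""
    good = [c for c, p99, err in steps if p99 < slo_p99 and err < slo_err]
    if not good:
        raise ValueError("no load-test step met the SLO; retest at lower concurrency")
    return max(good)

print(sustainable_concurrency(steps, SLO_P99, SLO_ERR))  # -> 120
```

Commit the output next to the load-test notes so the autoscaling target has a paper trail.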
Drive autoscaling from the right signals (HPA, KEDA, ASG)
K8s HPA on CPU is table stakes. Use custom metrics and queues.
- HPA v2 on concurrency:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_in_flight
      target:
        type: AverageValue
        averageValue: "120" # sustainable_conc_per_pod
Expose requests_in_flight via OpenTelemetry or a simple gauge in your handler middleware.
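The gauge itself needs no framework magic. A stdlib-only sketch of the middleware idea, counting requests in flight under a lock (InFlightGauge and InFlightMiddleware are illustrative names; in production you would export the value through prometheus_client or an OpenTelemetry instrument):

```python
import threading

class InFlightGauge:
    """Thread-safe counter; export .value as your requests_in_flight metric."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def inc(self):
        with self._lock:
            self.value += 1

    def dec(self):
        with self._lock:
            self.value -= 1

INFLIGHT = InFlightGauge()

class InFlightMiddleware:
    """WSGI middleware: bump the gauge for the lifetime of each request.
    Note: for streaming responses you'd decrement after the body iterator
    is exhausted; this sketch keeps it simple."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        INFLIGHT.inc()
        try:
            return self.app(environ, start_response)
        finally:
            INFLIGHT.dec()
```

Scrape INFLIGHT.value per pod and you have the exact metric the HPA above targets.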
- KEDA on Kafka lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-consumer
spec:
  scaleTargetRef:
    name: payments-consumer
  minReplicaCount: 2
  maxReplicaCount: 60
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: payments
      topic: payments
      lagThreshold: "5000" # target lag per replica
- AWS Auto Scaling with request rate target tracking:
resource "aws_autoscaling_policy" "alb_rps_target" {
  name                   = "api-alb-rps"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"
  target_tracking_configuration {
    customized_metric_specification {
      metric_dimension {
        name  = "TargetGroup"
        value = aws_lb_target_group.api.arn_suffix
      }
      metric_name = "RequestCountPerTarget"
      namespace   = "AWS/ApplicationELB"
      statistic   = "Average"
      unit        = "Count"
    }
    target_value = 120 # RPS per instance from service curve
  }
}
Also consider VPA for memory-bound pods and Istio/Envoy circuit breakers (maxConnections, maxPendingRequests) to stop bad neighbors from melting you.
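Those circuit breakers are just hard caps on concurrency and queueing: admit up to maxConnections concurrent calls, queue up to maxPendingRequests beyond that, and shed the rest. An in-process sketch of the same semantics (a toy analogue for intuition, not Envoy's actual implementation):

```python
import threading

class CircuitLimits:
    """Toy analogue of Envoy's maxConnections/maxPendingRequests caps."""
    def __init__(self, max_connections, max_pending):
        self.max_connections = max_connections
        self.max_pending = max_pending
        self._lock = threading.Lock()
        self.active = 0
        self.pending = 0

    def try_admit(self):
        """Classify an arriving request as 'admit', 'queue', or 'shed'."""
        with self._lock:
            if self.active < self.max_connections:
                self.active += 1
                return "admit"
            if self.pending < self.max_pending:
                self.pending += 1
                return "queue"
            return "shed"  # fail fast instead of melting the upstream

    def release(self):
        with self._lock:
            self.active -= 1

limits = CircuitLimits(max_connections=2, max_pending=1)
print([limits.try_admit() for _ in range(4)])  # -> ['admit', 'admit', 'queue', 'shed']
```

Shedding early keeps tail latency bounded for the requests you do accept, which is the whole point.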
Predict and prevent with SLO burn-rate and triage links
Early-warning alerts must be actionable and wired to runbooks.
- SLOs: “99.5% of requests under 300ms and <0.5% errors per 30 days.”
- Burn-rate alerts: catch incidents before the budget is gone.
# PrometheusRule snippet
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo
spec:
  groups:
  - name: slo-burn
    rules:
    - alert: APISLOBurnFast
      expr: |
        (
          sum(rate(http_requests_total{job="api",code=~"5..|429"}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
        ) / (1 - 0.995) > 14 # 14x burn ≈ 2% of a 30-day budget per hour
      for: 10m
      labels:
        severity: page
      annotations:
        summary: "Fast burn on api SLO"
        runbook_url: "https://runbooks.company.com/api/slo-burn"
        triage_link: "https://grafana.company.com/d/api-overview"
- Triage: surface queue depth, inflight requests, and GC pause right next to error rate in Grafana. Make the first graph the one that predicts the incident, not the one that looks good in a QBR.
- Runbooks: first steps should be “check KEDA scaling events,” “verify HPA target,” “look for circuit breaker trips,” not “scale CPU by 20%.”
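The burn-rate threshold is plain arithmetic: burn rate = observed error ratio ÷ error budget, and hours of budget remaining = budget window ÷ burn rate. A quick sanity-check sketch for a 99.5% SLO over 30 days:

```python
SLO = 0.995
BUDGET = 1 - SLO          # 0.5% error budget
WINDOW_HOURS = 30 * 24    # 30-day budget window

def burn_rate(error_ratio, budget=BUDGET):
    """How many times faster than 'sustainable' we are burning budget."""
    return error_ratio / budget

def hours_to_exhaustion(error_ratio):
    """At the current error ratio, when does the 30-day budget run out?"""
    return WINDOW_HOURS / burn_rate(error_ratio)

# A 7% error ratio against a 0.5% budget is a 14x burn:
print(round(burn_rate(0.07), 6))         # -> 14.0 (the alert threshold)
print(round(hours_to_exhaustion(0.07)))  # -> 51 (hours until the budget is gone)
```

Roughly two days of runway is why a 14x burn pages immediately instead of waiting for a ticket queue.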
Gate rollouts with metrics so bad versions don’t become incidents
Every ugly regression I’ve seen in the last two years showed up in tail latency or retry storms within minutes, provided the pipeline was actually looking.
- Argo Rollouts AnalysisTemplate against Prometheus:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-canary-checks
spec:
  metrics:
  - name: error-rate
    interval: 1m
    successCondition: result < 0.005
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{job="api",version="canary",code=~"5..|429"}[1m]))
          /
          sum(rate(http_requests_total{job="api",version="canary"}[1m]))
  - name: p99-latency
    interval: 1m
    successCondition: result < 0.3 # 300ms
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum by (le)(rate(http_server_duration_seconds_bucket{job="api",version="canary"}[1m])))
- Rollout ties analysis to steps:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 120}
      - analysis: {templates: [{templateName: api-canary-checks}]}
      - setWeight: 25
      - pause: {duration: 180}
      - analysis: {templates: [{templateName: api-canary-checks}]}
      - setWeight: 50
      - pause: {}
Flagger or Kayenta can do similar. The point: if p99 or error-rate trends go the wrong way, the rollout pauses or rolls back automatically. You just saved your error budget without waking anyone.
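Under the hood, the gate logic is simple: sample the query each interval, check it against successCondition, and fail the analysis once too many measurements miss. A hedged approximation of what these controllers do (not Argo's or Flagger's actual code; evaluate_metric is an illustrative name):

```python
def evaluate_metric(samples, success_condition, failure_limit):
    """Toy canary gate: count failed measurements and roll back
    once more than failure_limit of them have failed."""
    failures = 0
    for value in samples:
        if not success_condition(value):
            failures += 1
            if failures > failure_limit:
                return "rollback"
    return "promote"

# Canary error-rate samples, one per minute, against a 0.5% threshold:
samples = [0.001, 0.002, 0.011, 0.004, 0.013, 0.019]
print(evaluate_metric(samples, lambda v: v < 0.005, failure_limit=2))  # -> 'rollback'
```

Three bad minutes out of six trips the gate; the canary never reaches 50% of traffic.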
Prove it worked: the numbers that matter
After we switched a retail client from CPU-based HPA to concurrency + Kafka lag with KEDA and gated rollouts:
- Pages dropped 48% quarter over quarter.
- MTTR fell from 42m to 18m because triage started with the right signals.
- Change failure rate improved from 22% to 8% thanks to automatic canary halts.
- Infra spend down ~12% because we scaled to demand instead of running hot idle buffers.
No magic. Just the right leading indicators, a simple forecasting model, and automation wired to metrics.
Common traps I’ve seen (and how to dodge them)
- HPA on CPU with a single pod at 99% and nine at 10%—fix skew with Pod Topology Spread and per-pod metrics.
- Kafka lag target per cluster instead of per-consumer—scale the right thing.
- GC pauses hidden by coarse scrape intervals—tighten scrape_interval for JVM services to 15s and use rate() windows >1m.
- p99 measured on aggregated route buckets—break down by endpoint; one slow route can poison the whole histogram.
- AI-generated dashboards with 50 panels and zero signal—cut to six graphs that predict and confirm.
What I’d do tomorrow if I were you
- Add requests_in_flight, queue depth, and pool utilization to your telemetry.
- Run a 60-minute load test to find your sustainable concurrency per pod.
- Wire HPA/KEDA to those metrics; set sane min/max based on a 30% buffer.
- Define one SLO and two burn-rate alerts; link to runbooks.
- Gate canaries with p99/error-rate analysis; auto-pause on fail.
- Put the capacity calc in Git and update via GitOps when forecasts shift.
If your dashboards feel like a vibes-based art project, GitPlumbers can turn it into a system that pages you only when it should.
Key takeaways
- CPU averages are vanity metrics; lead with saturation, concurrency, and queue depth.
- Forecast demand and pair it with service capacity curves to know how many pods/instances you need—before an incident.
- Use Prometheus/OpenTelemetry to export custom metrics that drive HPA v2/KEDA, not just CPU.
- Automate rollout gates with Argo Rollouts or Flagger using SLO burn-rate and tail latency checks.
- Close the loop: alerts feed triage and autoscaling policies; changes are applied via GitOps for safety.
Implementation checklist
- Instrument concurrency, queue depth, GC pause time, connection pool saturation, and tail latency with OpenTelemetry/Prometheus.
- Define SLOs and burn-rate alerts that trip early, not after Twitter notices.
- Load test to produce a service capacity curve (RPS vs latency/errors) and capture it in code.
- Forecast demand (daily/weekly seasonality + events) and compute required capacity with a buffer.
- Drive autoscaling from custom metrics (HPA v2/KEDA) and set min/max bounds per forecast.
- Gate canaries with analysis templates; auto-pause on bad trends and rollback via GitOps.
- Track MTTR, change failure rate, and error budget consumption to prove the model works.
- Review and recalibrate quarterly or after major code or dependency changes.
Questions we hear from teams
- Why not just over-provision by 2x and call it done?
- You’ll pay a 50–100% tax forever, and you’ll still get paged when contention, GC pauses, or retries explode. Over-provisioning hides problems; it doesn’t eliminate tail latency, queue growth, or error budget burn during deploys.
- Is CPU ever a good autoscaling signal?
- It’s fine as a backstop and for CPU-bound batch work. For request/streaming services, concurrency, queue lag, and tail latency correlate much more strongly with incident onset.
- How do we get sustainable concurrency per pod?
- Run a load test that steps RPS until p99 or errors violate your SLO. Record the last good point as sustainable. Repeat on code/infra changes. Store the number in Git so policies can use it.
- Do we need service mesh to do this?
- No, but Envoy/Istio/Linkerd make it easier to get per-pod inflight and connection metrics and to enforce circuit breakers. Start with app-level metrics; upgrade later if you need L7 controls.
- Our dashboards were generated by AI and look impressive. Should we keep them?
- If they don’t predict or confirm incidents, they’re vibe art. Keep six panels: error rate, p95/p99, inflight, queue depth/lag, GC pause ratio, and burn rate. Delete the rest. GitPlumbers can help with vibe code cleanup and AI code refactoring in the telemetry path.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
