Your Cache Isn’t a Performance Trick — It’s a Reliability System (If You Design It Right)

A pragmatic, layered approach to caching that cuts latency *and* reduces blast radius when downstreams misbehave.

If your cache can’t keep serving during an upstream incident, it’s not a cache — it’s a new dependency with better marketing.

The outage you don’t see until it’s 2 a.m.

I’ve watched teams ship “a cache” to fix p95 latency, then get blindsided a month later when Redis hiccups and the database turns into a bonfire. The root problem is treating caching like a performance hack instead of a load-shedding and failure-management system.

A good caching strategy does two things at once:

  • Performance: fewer round-trips, lower p95/p99, fewer bytes over the wire
  • Reliability: smaller blast radius, graceful degradation, stable MTTR when a dependency is sick

If you design for both up front, caches become the thing that keeps you online when the DB is melting or an upstream is rate-limiting you.

Start with proof: pick metrics and establish a baseline

Before you add a single SET call, decide how you’ll prove this worked. Otherwise you’ll be arguing about “it feels faster” in a retro.

Track these cache + system metrics per endpoint (or use-case) you plan to cache:

  • Latency: p50, p95, p99 (and tail behavior during deploys)
  • Cache hit ratio: hits / (hits + misses) — but segment by keyspace (don’t average everything)
  • Backend QPS reduction: DB/Upstream requests per second attributable to the endpoint
  • Error rate: overall and by dependency (DB timeouts vs cache timeouts)
  • Stale-served rate: how often you serve stale due to refresh or upstream failure
  • Eviction rate / memory pressure: evictions per minute, used memory, fragmentation
  • Stampede signals: lock contention, concurrent regenerations, sudden miss spikes
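To make the "segment by keyspace" point concrete, here's a minimal sketch (the event shape is hypothetical; in production these tallies would come from your metrics system) of computing hit ratio per keyspace instead of one global average:

```typescript
// Sketch: per-keyspace hit-ratio aggregation. A global average can hide
// a miss storm on the one keyspace that actually hits the database.
type CacheEvent = { keyspace: string; result: "hit" | "miss" };

function hitRatioByKeyspace(events: CacheEvent[]): Record<string, number> {
  const tallies = new Map<string, { hits: number; total: number }>();
  for (const { keyspace, result } of events) {
    const t = tallies.get(keyspace) ?? { hits: 0, total: 0 };
    t.total += 1;
    if (result === "hit") t.hits += 1;
    tallies.set(keyspace, t);
  }
  const out: Record<string, number> = {};
  for (const [ks, t] of tallies) out[ks] = t.hits / t.total;
  return out;
}
```

A dashboard built on this shape will show you "product is at 92% but search is at 15%" instead of a comfortable-looking 90% blended number.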

Tooling that’s actually useful in the real world:

  • OpenTelemetry tracing to see where time is spent and whether cache is in the critical path
  • Prometheus + Grafana dashboards for hit/miss, latency, backend QPS
  • k6 or wrk2 for controlled load tests (include cold-cache and warm-cache runs)

Checkpoint:

  • You can answer: “If we ship caching, which metric should move, by how much, and what’s the rollback trigger?”

Pick the right layers: don’t cram everything into Redis

Most teams jump straight to a shared Redis and call it a day. That’s how you end up paying the Redis tax for things the browser, CDN, or an in-process LRU could have done cheaper and with fewer failure modes.

A pragmatic layered model:

  • Client/browser cache for static-ish assets and safe GETs (when allowed)
  • CDN cache (Cloudflare/Fastly/Akamai) for public or semi-public content
  • Edge/proxy cache (NGINX, Varnish, Envoy) for shared HTTP caching inside your perimeter
  • In-process cache (LRU/TTL) for hot keys; fastest and simplest; per-pod
  • Distributed cache (Redis, Memcached) when you need cross-pod sharing
  • Database-side caches (e.g., DynamoDB DAX) only when it’s truly the right fit

Rules of thumb I’ve seen hold up:

  • If the data is public and cacheable, push it outward (CDN/edge). That’s free reliability.
  • If you need sub-millisecond hot-path reads and can tolerate per-pod variance, use in-process LRU.
  • Use Redis when you need shared state, high hit rates across pods, or coordinated stampede control.
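To show what a bounded in-process LRU looks like, here's a hand-rolled sketch (for illustration only; in production you'd normally reach for a maintained library like lru-cache rather than rolling your own):

```typescript
// Illustrative bounded LRU with TTL, built on Map's insertion-order
// guarantee. The bound is the point: an unbounded in-process cache
// is a slow-motion memory leak.
class TtlLru<V> {
  private map = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxEntries: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.map.delete(key);
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.map.size > this.maxEntries) {
      // Evict least recently used (first key in insertion order).
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}
```

Size the bound from measured memory per entry, not from optimism, and expect per-pod hit-ratio variance during scale events.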

Checkpoint:

  • You’ve written down which layers are in play for this endpoint and what each layer’s failure policy is.

Cache keys and TTLs: where good strategies go to die

Cache design fails most often in two places: key design and TTL selection.

Key design that survives reality

Your cache key must encode every dimension that can change the response:

  • Tenant/account (multi-tenant systems get this wrong constantly)
  • Auth scope (anonymous vs user vs role)
  • Locale/currency/timezone (yes, really)
  • Schema/version (v1, v2) so you can change formats without mass invalidation

Example key format:

product:v3:tenant=acme:locale=en-US:currency=USD:id=12345

If you can’t confidently list the dimensions, don’t cache the response yet.
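One way to force that confidence: a key builder where every dimension is a required field (a sketch; the field names are illustrative), so forgetting a dimension is a compile error rather than a cross-tenant data leak.

```typescript
// Sketch: cache-key builder with every dimension required up front.
interface KeyDims {
  entity: string;   // e.g. "product"
  version: string;  // schema version, e.g. "v3"
  tenant: string;
  locale: string;
  currency: string;
  id: string;
}

function cacheKey(d: KeyDims): string {
  return (
    `${d.entity}:${d.version}:tenant=${d.tenant}` +
    `:locale=${d.locale}:currency=${d.currency}:id=${d.id}`
  );
}
```

Bumping `version` when the response format changes gives you cheap, lazy invalidation: old entries simply stop being read and age out.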

TTLs: pick a staleness budget, not a vibe

Set TTL based on how wrong you’re allowed to be, not based on “seems fine.” For many systems:

  • Pricing/inventory: seconds to a minute (or don’t cache without SWR)
  • Product catalog/details: minutes to hours
  • Reference data (currencies, feature configs): hours to a day

Add TTL jitter to avoid synchronized expiry (classic stampede trigger):

  • Base TTL 300s
  • Jitter ±10–20%

Checkpoint:

  • You have an explicit staleness budget approved by the business owner (even if informal).

Choose patterns intentionally: read-through, write-through, SWR, and friends

I’ve seen teams mix patterns accidentally, then wonder why consistency is weird and load spikes happen.

Use these patterns deliberately:

  1. Read-through cache (most common)

    • App checks cache → on miss, fetch from source → populate cache
    • Great for read-heavy workloads
  2. Write-through cache

    • Write to cache and source in the same request path
    • Better consistency; slower writes; simpler reads
  3. Write-back / async write-behind

    • Writes land in cache, flush to source later
    • High throughput, higher risk; needs durable queues and careful ops
  4. Stale-while-revalidate (SWR)

    • Serve slightly stale data while refreshing in background
    • This is the “reliability pattern” that keeps systems stable under upstream pain

Concrete HTTP caching example (CDN/edge-friendly)

If you can use HTTP caching, do it. It’s the most cost-effective cache you’ll ever deploy.

curl -I https://api.example.com/catalog/12345

Cache-Control: public, max-age=60, stale-while-revalidate=300, stale-if-error=600
ETag: "catalog-12345-v3-9c2a"
Vary: Accept-Encoding

  • max-age=60: fresh for 60s
  • stale-while-revalidate=300: serve stale up to 5m while refresh happens
  • stale-if-error=600: keep serving stale for 10m if origin is failing

That stale-if-error directive has saved more launches than a dozen war rooms.
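If you emit these headers from application code, a tiny helper keeps the directives decided in one place instead of scattered across handlers (a sketch; the values mirror the example above):

```typescript
// Sketch: build a Cache-Control value from an explicit staleness budget.
function cacheControl(opts: {
  maxAge: number;                // seconds of freshness
  staleWhileRevalidate?: number; // seconds of serve-stale-while-refreshing
  staleIfError?: number;         // seconds of serve-stale-during-origin-failure
}): string {
  const parts = ["public", `max-age=${opts.maxAge}`];
  if (opts.staleWhileRevalidate !== undefined)
    parts.push(`stale-while-revalidate=${opts.staleWhileRevalidate}`);
  if (opts.staleIfError !== undefined)
    parts.push(`stale-if-error=${opts.staleIfError}`);
  return parts.join(", ");
}
```

The point of the helper is that the staleness budget becomes a reviewed, typed argument rather than a string someone tweaks in a hotfix.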

Application SWR example (Node/TypeScript + Redis)

import Redis from "ioredis";
import crypto from "crypto";

const redis = new Redis(process.env.REDIS_URL!);

function jitter(ttlSeconds: number, pct = 0.15) {
  const delta = Math.floor(ttlSeconds * pct);
  return ttlSeconds - delta + Math.floor(Math.random() * (2 * delta + 1));
}

async function getWithSWR<T>(
  key: string,
  freshTtl: number,
  staleTtl: number,
  loader: () => Promise<T>
): Promise<{ value: T; stale: boolean }> {
  const cached = await redis.get(key);
  if (cached) return { value: JSON.parse(cached), stale: false };

  const staleKey = `${key}:stale`;
  const staleCached = await redis.get(staleKey);

  // Simple request coalescing using a short-lived lock
  const lockKey = `${key}:lock`;
  const lockVal = crypto.randomUUID();
  const gotLock = await redis.set(lockKey, lockVal, "EX", 10, "NX");

  if (!gotLock && staleCached) {
    // Someone else is refreshing; serve stale
    return { value: JSON.parse(staleCached), stale: true };
  }

  try {
    const value = await loader();
    await redis.set(key, JSON.stringify(value), "EX", jitter(freshTtl));
    await redis.set(staleKey, JSON.stringify(value), "EX", jitter(staleTtl));
    return { value, stale: false };
  } catch (e) {
    if (staleCached) return { value: JSON.parse(staleCached), stale: true };
    throw e;
  } finally {
    // Best-effort unlock (check-then-delete is not atomic; use a Lua
    // script if you need a strictly safe release)
    const cur = await redis.get(lockKey);
    if (cur === lockVal) await redis.del(lockKey);
  }
}

What this buys you:

  • A miss doesn’t automatically mean “hammer the DB”
  • During upstream errors, you still serve something
  • TTL jitter reduces herd behavior

Checkpoint:

  • You can name the pattern you’re using per endpoint and explain the failure behavior in one sentence.

Stampedes and dependency failures: design for the ugly path

Most cache incidents aren’t because caching is hard. They’re because the system’s fallback behavior was never specified.

Common failure modes I’ve personally seen in production:

  • Cache stampede/dogpile: synchronized expirations cause a thundering herd to the DB
  • Cache outage: Redis cluster down or network partition → every request becomes a miss
  • Hot key: one key gets 80% of traffic and all refreshers fight over it
  • Negative caching bug: caching “not found” too long, making data appear missing

Mitigations that actually work:

  • Request coalescing (singleflight) per key
  • TTL jitter to desynchronize expirations
  • Stale-if-error and stale-while-revalidate so you don’t amplify upstream incidents
  • Circuit breaker + timeout on cache calls (yes, the cache can be the dependency that fails)
  • Bulkheads: separate Redis DBs / clusters or at least separate keyspaces for critical vs non-critical
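The first mitigation, request coalescing, can be sketched in-process with a map of in-flight promises (a minimal singleflight; note this only coalesces within one process, so cross-pod coalescing still needs a shared lock like the Redis one in the SWR example):

```typescript
// Minimal in-process singleflight: concurrent callers for the same key
// share one loader invocation instead of each hitting the backend.
const inflight = new Map<string, Promise<unknown>>();

function singleflight<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>;
  // Clean up on settle (success or failure) so a failed load can retry.
  const p = loader().finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}
```

Ten concurrent misses on the same key become one backend call and nine shared results, which is exactly the shape you want during an expiry spike.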

Operational guardrails:

  • Treat cache timeouts as non-fatal for many endpoints (fallback to source if safe)
  • Cap regeneration concurrency (global and per-key)
  • Add rate limiting on regenerations if the DB is already struggling
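Capping regeneration concurrency can be as simple as a counting semaphore around the loader (a sketch; the limit values are illustrative and should come from what your DB can actually absorb):

```typescript
// Sketch: counting semaphore to cap concurrent cache regenerations.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((res) => this.waiters.push(res));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to a waiter
    else this.permits++;
  }
}

// Run a regeneration under the cap.
async function withLimit<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await fn();
  } finally {
    sem.release();
  }
}
```

Use one semaphore globally and, if hot keys are a problem, a second per-key one; together with singleflight this bounds the worst case a miss storm can do to the backend.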

Checkpoint:

  • You have a written “if cache is down” policy: fail open, fail closed, or serve stale — per endpoint.

Make it observable: dashboards and alerts that don’t lie

If you don’t instrument caching, you’ll end up with the classic executive summary: “Redis was involved.”

Minimal Prometheus metrics to expose

  • cache_requests_total{result="hit|miss|stale|error"}
  • cache_latency_seconds_bucket{operation="get|set"} (histogram)
  • backend_requests_total{dependency="postgres|upstream"}
  • cache_evictions_total (if available)

Example Prometheus alert rules:

groups:
  - name: cache-alerts
    rules:
      - alert: CacheHitRatioDropped
        expr: |
          (sum(rate(cache_requests_total{result="hit"}[5m]))
          / sum(rate(cache_requests_total[5m]))) < 0.70
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Cache hit ratio dropped below 70%"
          description: "Investigate key churn, TTLs, invalidation storms, or cache capacity."

      - alert: BackendQPSSpike
        expr: |
          sum(rate(backend_requests_total{dependency="postgres"}[5m]))
          > 2 * sum(rate(backend_requests_total{dependency="postgres"}[1h]))
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "DB QPS spiking (possible cache miss storm)"

      - alert: CacheErrorRateHigh
        expr: sum(rate(cache_requests_total{result="error"}[5m])) / sum(rate(cache_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Cache error rate > 1%"

Dashboards that make debugging faster:

  • Hit ratio over time by keyspace
  • Backend QPS and latency aligned with cache misses
  • Stale-served rate (if SWR) — spikes during incidents are expected; sustained spikes are not

Checkpoint:

  • You can correlate: “hit ratio dropped → backend QPS rose → p95 rose” in one Grafana view.

Rollout like a grown-up: flags, canaries, and safe rollback

Caching changes production behavior in non-obvious ways. Roll it out like you would a database migration.

  1. Ship instrumentation first (no behavior change)
  2. Enable caching behind a feature flag (LaunchDarkly, Unleash, or your in-house system)
  3. Canary: 1% → 10% → 50% → 100%, watching:
    • p95/p99 latency
    • backend QPS
    • error rate
    • stale-served rate
  4. Load test cold vs warm cache (people forget cold-cache behavior and then get wrecked on deploy)
  5. Plan invalidation events (deploys, backfills, re-indexes). “We flushed Redis” is not a strategy.
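A sketch of the flag gate in step 3: deterministic bucketing by user, so the same user stays in or out of the canary as the percentage ramps (the hash choice here is illustrative; a flag service like LaunchDarkly or Unleash does this for you):

```typescript
// Sketch: deterministic percentage rollout. The same userId always maps
// to the same bucket, so ramping 1% -> 10% -> 50% only adds users.
function bucketOf(userId: string): number {
  // FNV-1a hash; any stable hash works for bucketing.
  let h = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100; // bucket in [0, 100)
}

function cacheEnabled(userId: string, rolloutPct: number): boolean {
  return bucketOf(userId) < rolloutPct;
}
```

Determinism matters for debugging: when a canary user reports weirdness, you can reproduce their exact cache behavior instead of chasing a coin flip.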

If you’re dealing with AI-assisted “vibe coded” caching (I’ve seen plenty), do an explicit pass for:

  • Missing key dimensions (auth/tenant) → data leaks
  • Unbounded caches in-process → OOMKills in Kubernetes
  • No timeouts on Redis calls → threadpool starvation
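For that last item, a Promise.race-based timeout wrapper is a cheap fix (a sketch; whether the fallback goes to origin or fails the request is your per-endpoint failure policy, not the wrapper's):

```typescript
// Sketch: bound a cache call with a timeout and fall back to a supplied
// loader, so a slow cache degrades to "slower" rather than "stuck".
async function withTimeout<T>(
  op: Promise<T>,
  ms: number,
  fallback: () => Promise<T>
): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, rej) => {
    timer = setTimeout(() => rej(new Error("cache timeout")), ms);
  });
  try {
    return await Promise.race([op, timeout]);
  } catch {
    return fallback(); // e.g. read from the source of record
  } finally {
    clearTimeout(timer!);
  }
}
```

Pair this with a circuit breaker so that when the cache is consistently slow, you stop paying the timeout on every request and go straight to the fallback.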

This is the kind of cleanup GitPlumbers gets called in for: the cache exists, but it’s actively making outages worse.

Checkpoint:

  • Rollback is a flag flip, not a redeploy, and you’ve tested it.


Key takeaways

  • Treat caching as a reliability primitive: it should *degrade gracefully* when dependencies fail.
  • Design a layered cache: browser/CDN → edge/proxy → service/in-process → distributed cache → database.
  • Pick patterns intentionally (read-through vs write-through vs async refresh) and be explicit about staleness.
  • Prevent stampedes with request coalescing, TTL jitter, and stale-while-revalidate.
  • Instrument caches like production systems: hit ratio, p95/p99, backend QPS, evictions, stale-served, and error rates.
  • Roll out caching behind flags with guardrails; verify impact with load tests and production canaries.

Implementation checklist

  • Define an SLO for the endpoint(s) you’re caching and the acceptable staleness window.
  • Instrument baseline latency (`p50/p95/p99`), backend QPS, and error rate before adding caching.
  • Choose cache layers and ownership (CDN vs service vs shared Redis).
  • Design cache keys and versioning (include tenant, locale, auth scope, and schema version).
  • Select caching pattern per data shape (read-through, write-through, refresh-ahead, SWR).
  • Add stampede protection (singleflight/coalescing, TTL jitter, locks) and failure policy (stale-if-error).
  • Add cache observability: hit/miss, evictions, memory, stale-served, lock contention.
  • Roll out via feature flag/canary; validate business KPIs and rollback plan.

Questions we hear from teams

What’s a good cache hit ratio to aim for?
It depends on the workload and keyspace. For a hot, read-heavy endpoint, I typically expect **70–95%** once warmed. More important: segment hit ratio by keyspace/endpoint and correlate it to backend QPS and tail latency. A global “90% hit ratio” can hide a miss storm on the one endpoint that matters.
Should we fail open (bypass cache) or fail closed when Redis is down?
Per endpoint. For user-facing reads where stale is acceptable, prefer **serve stale** (SWR) or **fail open** to origin with strict rate limiting. For authorization/session-like data, you may need **fail closed** to avoid security issues. Write the policy down and test it.
Is in-process caching safe in Kubernetes with autoscaling?
Yes, if you bound it. Use a real LRU/TTL cache with size limits and measure memory. Expect lower hit ratios during scale events and deploys. In-process caches are great for hot keys but shouldn’t be your only layer if you need stability across pods.
How do we avoid data leaks with caching in multi-tenant systems?
Treat cache keys like you treat database queries: include **tenant + auth scope** (and any personalization dimension). Add versioning, and use automated tests that confirm tenant isolation. I’ve seen ‘just cache the response’ turn into a P0 incident because the key forgot one dimension.
When should we use Redis vs a CDN?
If the response is cacheable via HTTP and doesn’t require per-user personalization, start with **CDN/edge caching**—it’s cheaper and reduces load closer to the client. Use Redis for shared, authenticated, or non-HTTP caching needs, and for coordination (locks/coalescing) when stampedes matter.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about hardening your caching and reliability path.
See how we approach production-grade code rescue.
