Your Cache Isn’t a Performance Trick — It’s a Reliability System (If You Design It Right)
A pragmatic, layered approach to caching that cuts latency *and* reduces blast radius when downstreams misbehave.
If your cache can’t keep serving during an upstream incident, it’s not a cache — it’s a new dependency with better marketing.
The outage you don’t see until it’s 2 a.m.
I’ve watched teams ship “a cache” to fix p95 latency, then get blindsided a month later when Redis hiccups and the database turns into a bonfire. The root problem is treating caching like a performance hack instead of a load-shedding and failure-management system.
A good caching strategy does two things at once:
- Performance: fewer round-trips, lower p95/p99, fewer bytes over the wire
- Reliability: smaller blast radius, graceful degradation, stable MTTR when a dependency is sick
If you design for both up front, caches become the thing that keeps you online when the DB is melting or an upstream is rate-limiting you.
Start with proof: pick metrics and establish a baseline
Before you add a single SET call, decide how you’ll prove this worked. Otherwise you’ll be arguing about “it feels faster” in a retro.
Track these cache + system metrics per endpoint (or use-case) you plan to cache:
- Latency: p50, p95, p99 (and tail behavior during deploys)
- Cache hit ratio: hits / (hits + misses) — but segment by keyspace (don’t average everything)
- Backend QPS reduction: DB/Upstream requests per second attributable to the endpoint
- Error rate: overall and by dependency (DB timeouts vs cache timeouts)
- Stale-served rate: how often you serve stale due to refresh or upstream failure
- Eviction rate / memory pressure: evictions per minute, used memory, fragmentation
- Stampede signals: lock contention, concurrent regenerations, sudden miss spikes
Tooling that’s actually useful in the real world:
- OpenTelemetry tracing to see where time is spent and whether cache is in the critical path
- Prometheus + Grafana dashboards for hit/miss, latency, backend QPS
- k6 or wrk2 for controlled load tests (include cold-cache and warm-cache runs)
Checkpoint:
- You can answer: “If we ship caching, which metric should move, by how much, and what’s the rollback trigger?”
Pick the right layers: don’t cram everything into Redis
Most teams jump straight to a shared Redis and call it a day. That’s how you end up paying Redis tax for things the browser, CDN, or an in-process LRU could have done cheaper and with fewer failure modes.
A pragmatic layered model:
- Client/browser cache for static-ish assets and safe GETs (when allowed)
- CDN cache (Cloudflare/Fastly/Akamai) for public or semi-public content
- Edge/proxy cache (NGINX, Varnish, Envoy) for shared HTTP caching inside your perimeter
- In-process cache (LRU/TTL) for hot keys; fastest and simplest; per-pod
- Distributed cache (Redis, Memcached) when you need cross-pod sharing
- Database-side caches (e.g., DynamoDB DAX) only when it’s truly the right fit
Rules of thumb I’ve seen hold up:
- If the data is public and cacheable, push it outward (CDN/edge). That’s free reliability.
- If you need sub-millisecond hot-path reads and can tolerate per-pod variance, use in-process LRU.
- Use Redis when you need shared state, high hit rates across pods, or coordinated stampede control.
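To make the in-process option concrete, here’s a minimal sketch of a bounded LRU with per-entry TTL — the class name and API are illustrative, not a library; in production you’d likely reach for something like `lru-cache`, but the bounding behavior is the point:

```typescript
// Minimal bounded LRU with per-entry TTL. A Map preserves insertion
// order, so the first key is always the least recently used.
class LruTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    // Re-insert to mark this key as most recently used
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, ttlMs: number): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    // Evict the least recently used entry once over capacity
    if (this.entries.size > this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

The `maxSize` bound is non-negotiable: an unbounded in-process cache is how pods get OOMKilled.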
Checkpoint:
- You’ve written down which layers are in play for this endpoint and what each layer’s failure policy is.
Cache keys and TTLs: where good strategies go to die
Cache design fails most often in two places: key design and TTL selection.
Key design that survives reality
Your cache key must encode every dimension that can change the response:
- Tenant/account (multi-tenant systems get this wrong constantly)
- Auth scope (anonymous vs user vs role)
- Locale/currency/timezone (yes, really)
- Schema/version (v1, v2) so you can change formats without mass invalidation
Example key format:
product:v3:tenant=acme:locale=en-US:currency=USD:id=12345
If you can’t confidently list the dimensions, don’t cache the response yet.
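One way to enforce that discipline is a key builder that refuses to produce a key when a dimension is missing. This is a sketch — `CacheKeyDims` and `buildKey` are illustrative names, not a library API:

```typescript
// Sketch: a cache-key builder that forces every response-changing
// dimension to be supplied explicitly.
interface CacheKeyDims {
  version: string; // schema version, e.g. "v3"
  tenant: string;
  locale: string;
  currency: string;
  id: string;
}

function buildKey(prefix: string, d: CacheKeyDims): string {
  // Fail loudly instead of silently caching under an ambiguous key
  for (const [name, value] of Object.entries(d)) {
    if (!value) throw new Error(`missing cache key dimension: ${name}`);
  }
  return `${prefix}:${d.version}:tenant=${d.tenant}:locale=${d.locale}:currency=${d.currency}:id=${d.id}`;
}
```

A forgotten tenant dimension becomes a thrown error in tests, not a cross-tenant data leak in production.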
TTLs: pick a staleness budget, not a vibe
Set TTL based on how wrong you’re allowed to be, not based on “seems fine.” For many systems:
- Pricing/inventory: seconds to a minute (or don’t cache without SWR)
- Product catalog/details: minutes to hours
- Reference data (currencies, feature configs): hours to a day
Add TTL jitter to avoid synchronized expiry (classic stampede trigger):
- Base TTL 300s
- Jitter ±10–20%
Checkpoint:
- You have an explicit staleness budget approved by the business owner (even if informal).
Choose patterns intentionally: read-through, write-through, SWR, and friends
I’ve seen teams mix patterns accidentally, then wonder why consistency is weird and load spikes happen.
Use these patterns deliberately:
Read-through cache (most common)
- App checks cache → on miss, fetch from source → populate cache
- Great for read-heavy workloads
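The read-through shape is small enough to show inline. This sketch uses an in-memory Map with expiry standing in for the cache store; swap in Redis calls for the real thing:

```typescript
// Read-through: check cache, on miss load from the source of truth
// and populate. A Map with expiry stands in for the cache store.
type Entry<T> = { value: T; expiresAt: number };
const store = new Map<string, Entry<unknown>>();

async function readThrough<T>(
  key: string,
  ttlMs: number,
  loadFromSource: () => Promise<T>
): Promise<T> {
  const hit = store.get(key) as Entry<T> | undefined;
  if (hit && Date.now() < hit.expiresAt) return hit.value; // cache hit
  const value = await loadFromSource(); // miss: go to the source
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}
```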
Write-through cache
- Write to cache and source in the same request path
- Better consistency; slower writes; simpler reads
Write-back / async write-behind
- Writes land in cache, flush to source later
- High throughput, higher risk; needs durable queues and careful ops
Stale-while-revalidate (SWR)
- Serve slightly stale data while refreshing in background
- This is the “reliability pattern” that keeps systems stable under upstream pain
Concrete HTTP caching example (CDN/edge-friendly)
If you can use HTTP caching, do it. It’s the most cost-effective cache you’ll ever deploy.
curl -I https://api.example.com/catalog/12345

Cache-Control: public, max-age=60, stale-while-revalidate=300, stale-if-error=600
ETag: "catalog-12345-v3-9c2a"
Vary: Accept-Encoding

- max-age=60: fresh for 60s
- stale-while-revalidate=300: serve stale up to 5m while refresh happens
- stale-if-error=600: keep serving stale for 10m if origin is failing
That stale-if-error directive has saved more launches than a dozen war rooms.
Application SWR example (Node/TypeScript + Redis)
import Redis from "ioredis";
import crypto from "crypto";
const redis = new Redis(process.env.REDIS_URL!);
function jitter(ttlSeconds: number, pct = 0.15) {
const delta = Math.floor(ttlSeconds * pct);
return ttlSeconds - delta + Math.floor(Math.random() * (2 * delta + 1));
}
async function getWithSWR<T>(
key: string,
freshTtl: number,
staleTtl: number,
loader: () => Promise<T>
): Promise<{ value: T; stale: boolean }> {
const cached = await redis.get(key);
if (cached) return { value: JSON.parse(cached), stale: false };
const staleKey = `${key}:stale`;
const staleCached = await redis.get(staleKey);
// Simple request coalescing using a short-lived lock
const lockKey = `${key}:lock`;
const lockVal = crypto.randomUUID();
const gotLock = await redis.set(lockKey, lockVal, "EX", 10, "NX");
if (!gotLock && staleCached) {
// Someone else is refreshing; serve stale
return { value: JSON.parse(staleCached), stale: true };
}
try {
const value = await loader();
await redis.set(key, JSON.stringify(value), "EX", jitter(freshTtl));
await redis.set(staleKey, JSON.stringify(value), "EX", jitter(staleTtl));
return { value, stale: false };
} catch (e) {
if (staleCached) return { value: JSON.parse(staleCached), stale: true };
throw e;
} finally {
// Best-effort unlock (a Lua script would make check-and-delete atomic)
const cur = await redis.get(lockKey);
if (cur === lockVal) await redis.del(lockKey);
}
}
What this buys you:
- A miss doesn’t automatically mean “hammer the DB”
- During upstream errors, you still serve something
- TTL jitter reduces herd behavior
Checkpoint:
- You can name the pattern you’re using per endpoint and explain the failure behavior in one sentence.
Stampedes and dependency failures: design for the ugly path
Most cache incidents aren’t because caching is hard. They’re because the system’s fallback behavior was never specified.
Common failure modes I’ve personally seen in production:
- Cache stampede/dogpile: synchronized expirations cause a thundering herd to the DB
- Cache outage: Redis cluster down or network partition → every request becomes a miss
- Hot key: one key gets 80% of traffic and all refreshers fight over it
- Negative caching bug: caching “not found” too long, making data appear missing
Mitigations that actually work:
- Request coalescing (singleflight) per key
- TTL jitter to desynchronize expirations
- Stale-if-error and stale-while-revalidate so you don’t amplify upstream incidents
- Circuit breaker + timeout on cache calls (yes, the cache can be the dependency that fails)
- Bulkheads: separate Redis DBs / clusters or at least separate keyspaces for critical vs non-critical
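Request coalescing is the cheapest of these to add in-process. A sketch of the singleflight idea — concurrent callers for the same key share one in-flight load instead of each hitting the backend (`singleflight` is an illustrative name, borrowed from the Go library of the same name):

```typescript
// In-process request coalescing ("singleflight"): concurrent callers
// for the same key await a single shared load.
const inFlight = new Map<string, Promise<unknown>>();

function singleflight<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key) as Promise<T> | undefined;
  if (existing) return existing; // join the in-flight load
  const p = load().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

This only coalesces within one pod; for cross-pod coordination you still need the Redis lock shown earlier.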
Operational guardrails:
- Treat cache timeouts as non-fatal for many endpoints (fallback to source if safe)
- Cap regeneration concurrency (global and per-key)
- Add rate limiting on regenerations if the DB is already struggling
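The “cache timeouts are non-fatal” guardrail can be a small wrapper: race the cache read against a timeout and fall back to the source, so a slow cache degrades to a miss instead of a stalled request. Function names here are illustrative:

```typescript
// Treat a slow cache as a miss: race the cache read against a
// timeout, then fall back to the source of truth.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => reject(new Error("cache timeout")), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); }
    );
  });
}

async function getOrFallback<T>(
  readCache: () => Promise<T | null>,
  readSource: () => Promise<T>,
  timeoutMs = 50
): Promise<T> {
  try {
    const cached = await withTimeout(readCache(), timeoutMs);
    if (cached !== null) return cached;
  } catch {
    // Cache slow or down: non-fatal here, fall through to the source
  }
  return readSource();
}
```

Pair this with rate limiting on the source path, or a cache outage still becomes a miss storm.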
Checkpoint:
- You have a written “if cache is down” policy: fail open, fail closed, or serve stale — per endpoint.
Make it observable: dashboards and alerts that don’t lie
If you don’t instrument caching, you’ll end up with the classic executive summary: “Redis was involved.”
Minimal Prometheus metrics to expose
- cache_requests_total{result="hit|miss|stale|error"}
- cache_latency_seconds_bucket{operation="get|set"} (histogram)
- backend_requests_total{dependency="postgres|upstream"}
- cache_evictions_total (if available)
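The accounting behind `cache_requests_total` is trivial to wire in. This is an in-memory sketch of the counter and the hit-ratio computation the alert below uses; a real service would use prom-client’s `Counter` with a `result` label, but the math is the same:

```typescript
// In-memory sketch of cache_requests_total{result=...} and the
// hit-ratio the alert rule computes: hits / all requests.
type CacheResult = "hit" | "miss" | "stale" | "error";
const cacheRequestsTotal = new Map<CacheResult, number>();

function recordCacheResult(result: CacheResult): void {
  cacheRequestsTotal.set(result, (cacheRequestsTotal.get(result) ?? 0) + 1);
}

function hitRatio(): number {
  let hits = 0;
  let total = 0;
  for (const [result, n] of cacheRequestsTotal) {
    total += n;
    if (result === "hit") hits = n;
  }
  return total === 0 ? 0 : hits / total;
}
```

Remember to segment by keyspace in the real metric labels; a single global ratio hides per-endpoint miss storms.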
Example Prometheus alert rules:
groups:
- name: cache-alerts
rules:
- alert: CacheHitRatioDropped
expr: |
(sum(rate(cache_requests_total{result="hit"}[5m]))
/ sum(rate(cache_requests_total[5m]))) < 0.70
for: 10m
labels:
severity: page
annotations:
summary: "Cache hit ratio dropped below 70%"
description: "Investigate key churn, TTLs, invalidation storms, or cache capacity."
- alert: BackendQPSSpike
expr: sum(rate(backend_requests_total{dependency="postgres"}[5m])) > 2 * sum(rate(backend_requests_total{dependency="postgres"}[1h]))
for: 5m
labels:
severity: page
annotations:
summary: "DB QPS spiking (possible cache miss storm)"
- alert: CacheErrorRateHigh
expr: sum(rate(cache_requests_total{result="error"}[5m])) / sum(rate(cache_requests_total[5m])) > 0.01
for: 10m
labels:
severity: page
annotations:
summary: "Cache error rate > 1%"
Dashboards that make debugging faster:
- Hit ratio over time by keyspace
- Backend QPS and latency aligned with cache misses
- Stale-served rate (if SWR) — spikes during incidents are expected; sustained spikes are not
Checkpoint:
- You can correlate: “hit ratio dropped → backend QPS rose → p95 rose” in one Grafana view.
Rollout like a grown-up: flags, canaries, and safe rollback
Caching changes production behavior in non-obvious ways. Roll it out like you would a database migration.
- Ship instrumentation first (no behavior change)
- Enable caching behind a feature flag (LaunchDarkly, Unleash, or your in-house system)
- Canary: 1% → 10% → 50% → 100%, watching:
- p95/p99 latency
- backend QPS
- error rate
- stale-served rate
- Load test cold vs warm cache (people forget cold-cache behavior and then get wrecked on deploy)
- Plan invalidation events (deploys, backfills, re-indexes). “We flushed Redis” is not a strategy.
If you’re dealing with AI-assisted “vibe coded” caching (I’ve seen plenty), do an explicit pass for:
- Missing key dimensions (auth/tenant) → data leaks
- Unbounded caches in-process → OOMKills in Kubernetes
- No timeouts on Redis calls → threadpool starvation
This is the kind of cleanup GitPlumbers gets called in for: the cache exists, but it’s actively making outages worse.
Checkpoint:
- Rollback is a flag flip, not a redeploy, and you’ve tested it.
Key takeaways
- Treat caching as a reliability primitive: it should *degrade gracefully* when dependencies fail.
- Design a layered cache: browser/CDN → edge/proxy → service/in-process → distributed cache → database.
- Pick patterns intentionally (read-through vs write-through vs async refresh) and be explicit about staleness.
- Prevent stampedes with request coalescing, TTL jitter, and stale-while-revalidate.
- Instrument caches like production systems: hit ratio, p95/p99, backend QPS, evictions, stale-served, and error rates.
- Roll out caching behind flags with guardrails; verify impact with load tests and production canaries.
Implementation checklist
- Define an SLO for the endpoint(s) you’re caching and the acceptable staleness window.
- Instrument baseline latency (`p50/p95/p99`), backend QPS, and error rate before adding caching.
- Choose cache layers and ownership (CDN vs service vs shared Redis).
- Design cache keys and versioning (include tenant, locale, auth scope, and schema version).
- Select caching pattern per data shape (read-through, write-through, refresh-ahead, SWR).
- Add stampede protection (singleflight/coalescing, TTL jitter, locks) and failure policy (stale-if-error).
- Add cache observability: hit/miss, evictions, memory, stale-served, lock contention.
- Roll out via feature flag/canary; validate business KPIs and rollback plan.
Questions we hear from teams
- What’s a good cache hit ratio to aim for?
- It depends on the workload and keyspace. For a hot, read-heavy endpoint, I typically expect **70–95%** once warmed. More important: segment hit ratio by keyspace/endpoint and correlate it to backend QPS and tail latency. A global “90% hit ratio” can hide a miss storm on the one endpoint that matters.
- Should we fail open (bypass cache) or fail closed when Redis is down?
- Per endpoint. For user-facing reads where stale is acceptable, prefer **serve stale** (SWR) or **fail open** to origin with strict rate limiting. For authorization/session-like data, you may need **fail closed** to avoid security issues. Write the policy down and test it.
- Is in-process caching safe in Kubernetes with autoscaling?
- Yes, if you bound it. Use a real LRU/TTL cache with size limits and measure memory. Expect lower hit ratios during scale events and deploys. In-process caches are great for hot keys but shouldn’t be your only layer if you need stability across pods.
- How do we avoid data leaks with caching in multi-tenant systems?
- Treat cache keys like you treat database queries: include **tenant + auth scope** (and any personalization dimension). Add versioning, and use automated tests that confirm tenant isolation. I’ve seen ‘just cache the response’ turn into a P0 incident because the key forgot one dimension.
- When should we use Redis vs a CDN?
- If the response is cacheable via HTTP and doesn’t require per-user personalization, start with **CDN/edge caching**—it’s cheaper and reduces load closer to the client. Use Redis for shared, authenticated, or non-HTTP caching needs, and for coordination (locks/coalescing) when stampedes matter.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
