Your Cache Isn’t a Performance Trick — It’s a Reliability System (If You Design It Right)
A pragmatic, layered approach to caching that cuts latency *and* reduces blast radius when downstreams misbehave.
If your cache can’t keep serving during an upstream incident, it’s not a cache — it’s a new dependency with better marketing.
The outage you don’t see until it’s 2 a.m.
I’ve watched teams ship “a cache” to fix p95 latency, then get blindsided a month later when Redis hiccups and the database turns into a bonfire. The root problem is treating caching like a performance hack instead of a load-shedding and failure-management system.
A good caching strategy does two things at once:
- Performance: fewer round-trips, lower p95/p99, fewer bytes over the wire
- Reliability: smaller blast radius, graceful degradation, stable MTTR when a dependency is sick
If you design for both up front, caches become the thing that keeps you online when the DB is melting or an upstream is rate-limiting you.
Start with proof: pick metrics and establish a baseline
Before you add a single SET call, decide how you’ll prove this worked. Otherwise you’ll be arguing about “it feels faster” in a retro.
Track these cache + system metrics per endpoint (or use-case) you plan to cache:
- Latency: p50, p95, p99 (and tail behavior during deploys)
- Cache hit ratio: hits / (hits + misses) — but segment by keyspace (don’t average everything)
- Backend QPS reduction: DB/Upstream requests per second attributable to the endpoint
- Error rate: overall and by dependency (DB timeouts vs cache timeouts)
- Stale-served rate: how often you serve stale due to refresh or upstream failure
- Eviction rate / memory pressure: evictions per minute, used memory, fragmentation
- Stampede signals: lock contention, concurrent regenerations, sudden miss spikes
Tooling that’s actually useful in the real world:
- OpenTelemetry tracing to see where time is spent and whether cache is in the critical path
- Prometheus + Grafana dashboards for hit/miss, latency, backend QPS
- k6 or wrk2 for controlled load tests (include cold-cache and warm-cache runs)
Checkpoint:
- You can answer: “If we ship caching, which metric should move, by how much, and what’s the rollback trigger?”
Pick the right layers: don’t cram everything into Redis
Most teams jump straight to a shared Redis and call it a day. That’s how you end up paying Redis tax for things the browser, CDN, or an in-process LRU could have done cheaper and with fewer failure modes.
A pragmatic layered model:
- Client/browser cache for static-ish assets and safe GETs (when allowed)
- CDN cache (Cloudflare/Fastly/Akamai) for public or semi-public content
- Edge/proxy cache (NGINX, Varnish, Envoy) for shared HTTP caching inside your perimeter
- In-process cache (LRU/TTL) for hot keys; fastest and simplest; per-pod
- Distributed cache (Redis, Memcached) when you need cross-pod sharing
- Database-side caches (e.g., DynamoDB DAX) only when it’s truly the right fit
Rules of thumb I’ve seen hold up:
- If the data is public and cacheable, push it outward (CDN/edge). That’s free reliability.
- If you need sub-millisecond hot-path reads and can tolerate per-pod variance, use in-process LRU.
- Use Redis when you need shared state, high hit rates across pods, or coordinated stampede control.
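To make the in-process option concrete, here’s a minimal sketch of a bounded LRU with per-entry TTL — the class name and API are illustrative, not a library; in production you’d likely reach for something like `lru-cache`, but the bounding behavior is the point:

```typescript
// Minimal bounded LRU with per-entry TTL. A Map preserves insertion
// order, so the first key is always the least recently used.
class LruTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    // Re-insert to mark this key as most recently used
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, ttlMs: number): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    // Evict the least recently used entry once over capacity
    if (this.entries.size > this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

The `maxSize` bound is non-negotiable: an unbounded in-process cache is how pods get OOMKilled.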
Checkpoint:
- You’ve written down which layers are in play for this endpoint and what each layer’s failure policy is.
Cache keys and TTLs: where good strategies go to die
Cache design fails most often in two places: key design and TTL selection.
Key design that survives reality
Your cache key must encode every dimension that can change the response:
- Tenant/account (multi-tenant systems get this wrong constantly)
- Auth scope (anonymous vs user vs role)
- Locale/currency/timezone (yes, really)
- Schema/version (v1, v2) so you can change formats without mass invalidation
Example key format:
product:v3:tenant=acme:locale=en-US:currency=USD:id=12345
If you can’t confidently list the dimensions, don’t cache the response yet.
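One way to enforce that discipline is a key builder that refuses to produce a key when a dimension is missing. This is a sketch — `CacheKeyDims` and `buildKey` are illustrative names, not a library API:

```typescript
// Sketch: a cache-key builder that forces every response-changing
// dimension to be supplied explicitly.
interface CacheKeyDims {
  version: string; // schema version, e.g. "v3"
  tenant: string;
  locale: string;
  currency: string;
  id: string;
}

function buildKey(prefix: string, d: CacheKeyDims): string {
  // Fail loudly instead of silently caching under an ambiguous key
  for (const [name, value] of Object.entries(d)) {
    if (!value) throw new Error(`missing cache key dimension: ${name}`);
  }
  return `${prefix}:${d.version}:tenant=${d.tenant}:locale=${d.locale}:currency=${d.currency}:id=${d.id}`;
}
```

A forgotten tenant dimension becomes a thrown error in tests, not a cross-tenant data leak in production.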
TTLs: pick a staleness budget, not a vibe
Set TTL based on how wrong you’re allowed to be, not based on “seems fine.” For many systems:
- Pricing/inventory: seconds to a minute (or don’t cache without SWR)
- Product catalog/details: minutes to hours
- Reference data (currencies, feature configs): hours to a day
Add TTL jitter to avoid synchronized expiry (classic stampede trigger):
- Base TTL 300s
- Jitter ±10–20%
Checkpoint:
- You have an explicit staleness budget approved by the business owner (even if informal).
Choose patterns intentionally: read-through, write-through, SWR, and friends
I’ve seen teams mix patterns accidentally, then wonder why consistency is weird and load spikes happen.
Use these patterns deliberately:
Read-through cache (most common)
- App checks cache → on miss, fetch from source → populate cache
- Great for read-heavy workloads
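The read-through shape is small enough to show inline. This sketch uses an in-memory Map with expiry standing in for the cache store; swap in Redis calls for the real thing:

```typescript
// Read-through: check cache, on miss load from the source of truth
// and populate. A Map with expiry stands in for the cache store.
type Entry<T> = { value: T; expiresAt: number };
const store = new Map<string, Entry<unknown>>();

async function readThrough<T>(
  key: string,
  ttlMs: number,
  loadFromSource: () => Promise<T>
): Promise<T> {
  const hit = store.get(key) as Entry<T> | undefined;
  if (hit && Date.now() < hit.expiresAt) return hit.value; // cache hit
  const value = await loadFromSource(); // miss: go to the source
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}
```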
Write-through cache
- Write to cache and source in the same request path
- Better consistency; slower writes; simpler reads
Write-back / async write-behind
- Writes land in cache, flush to source later
- High throughput, higher risk; needs durable queues and careful ops
Stale-while-revalidate (SWR)
- Serve slightly stale data while refreshing in background
- This is the “reliability pattern” that keeps systems stable under upstream pain
Concrete HTTP caching example (CDN/edge-friendly)
If you can use HTTP caching, do it. It’s the most cost-effective cache you’ll ever deploy.
curl -I https://api.example.com/catalog/12345

Cache-Control: public, max-age=60, stale-while-revalidate=300, stale-if-error=600
ETag: "catalog-12345-v3-9c2a"
Vary: Accept-Encoding

- max-age=60: fresh for 60s
- stale-while-revalidate=300: serve stale up to 5m while refresh happens
- stale-if-error=600: keep serving stale for 10m if origin is failing
That stale-if-error directive has saved more launches than a dozen war rooms.
Application SWR example (Node/TypeScript + Redis)
import Redis from "ioredis";
import crypto from "crypto";
const redis = new Redis(process.env.REDIS_URL!);
function jitter(ttlSeconds: number, pct = 0.15) {
const delta = Math.floor(ttlSeconds * pct);
return ttlSeconds - delta + Math.floor(Math.random() * (2 * delta + 1));
}
async function getWithSWR<T>(
key: string,
freshTtl: number,
staleTtl: number,
loader: () => Promise<T>
): Promise<{ value: T; stale: boolean }> {
const cached = await redis.get(key);
if (cached) return { value: JSON.parse(cached), stale: false };
const staleKey = `${key}:stale`;
const staleCached = await redis.get(staleKey);
// Simple request coalescing using a short-lived lock
const lockKey = `${key}:lock`;
const lockVal = crypto.randomUUID();
const gotLock = await redis.set(lockKey, lockVal, "EX", 10, "NX");
if (!gotLock && staleCached) {
// Someone else is refreshing; serve stale
return { value: JSON.parse(staleCached), stale: true };
}
try {
const value = await loader();
await redis.set(key, JSON.stringify(value), "EX", jitter(freshTtl));
await redis.set(staleKey, JSON.stringify(value), "EX", jitter(staleTtl));
return { value, stale: false };
} catch (e) {
if (staleCached) return { value: JSON.parse(staleCached), stale: true };
throw e;
} finally {
// Best-effort unlock (a Lua script would make check-and-delete atomic)
const cur = await redis.get(lockKey);
if (cur === lockVal) await redis.del(lockKey);
}
}
What this buys you:
- A miss doesn’t automatically mean “hammer the DB”
- During upstream errors, you still serve something
- TTL jitter reduces herd behavior
Checkpoint:
- You can name the pattern you’re using per endpoint and explain the failure behavior in one sentence.
Stampedes and dependency failures: design for the ugly path
Most cache incidents aren’t because caching is hard. They’re because the system’s fallback behavior was never specified.
Common failure modes I’ve personally seen in production:
- Cache stampede/dogpile: synchronized expirations cause a thundering herd to the DB
- Cache outage: Redis cluster down or network partition → every request becomes a miss
- Hot key: one key gets 80% of traffic and all refreshers fight over it
- Negative caching bug: caching “not found” too long, making data appear missing
Mitigations that actually work:
- Request coalescing (singleflight) per key
- TTL jitter to desynchronize expirations
- Stale-if-error and stale-while-revalidate so you don’t amplify upstream incidents
- Circuit breaker + timeout on cache calls (yes, the cache can be the dependency that fails)
- Bulkheads: separate Redis DBs / clusters or at least separate keyspaces for critical vs non-critical
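Request coalescing is the cheapest of these to add in-process. A sketch of the singleflight idea — concurrent callers for the same key share one in-flight load instead of each hitting the backend (`singleflight` is an illustrative name, borrowed from the Go library of the same name):

```typescript
// In-process request coalescing ("singleflight"): concurrent callers
// for the same key await a single shared load.
const inFlight = new Map<string, Promise<unknown>>();

function singleflight<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key) as Promise<T> | undefined;
  if (existing) return existing; // join the in-flight load
  const p = load().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

This only coalesces within one pod; for cross-pod coordination you still need the Redis lock shown earlier.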
Operational guardrails:
- Treat cache timeouts as non-fatal for many endpoints (fallback to source if safe)
- Cap regeneration concurrency (global and per-key)
- Add rate limiting on regenerations if the DB is already struggling
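The “cache timeouts are non-fatal” guardrail can be a small wrapper: race the cache read against a timeout and fall back to the source, so a slow cache degrades to a miss instead of a stalled request. Function names here are illustrative:

```typescript
// Treat a slow cache as a miss: race the cache read against a
// timeout, then fall back to the source of truth.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => reject(new Error("cache timeout")), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); }
    );
  });
}

async function getOrFallback<T>(
  readCache: () => Promise<T | null>,
  readSource: () => Promise<T>,
  timeoutMs = 50
): Promise<T> {
  try {
    const cached = await withTimeout(readCache(), timeoutMs);
    if (cached !== null) return cached;
  } catch {
    // Cache slow or down: non-fatal here, fall through to the source
  }
  return readSource();
}
```

Pair this with rate limiting on the source path, or a cache outage still becomes a miss storm.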
Checkpoint:
- You have a written “if cache is down” policy: fail open, fail closed, or serve stale — per endpoint.
Make it observable: dashboards and alerts that don’t lie
If you don’t instrument caching, you’ll end up with the classic executive summary: “Redis was involved.”
Minimal Prometheus metrics to expose
- cache_requests_total{result="hit|miss|stale|error"}
- cache_latency_seconds_bucket{operation="get|set"} (histogram)
- backend_requests_total{dependency="postgres|upstream"}
- cache_evictions_total (if available)
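The accounting behind `cache_requests_total` is trivial to wire in. This is an in-memory sketch of the counter and the hit-ratio computation the alert below uses; a real service would use prom-client’s `Counter` with a `result` label, but the math is the same:

```typescript
// In-memory sketch of cache_requests_total{result=...} and the
// hit-ratio the alert rule computes: hits / all requests.
type CacheResult = "hit" | "miss" | "stale" | "error";
const cacheRequestsTotal = new Map<CacheResult, number>();

function recordCacheResult(result: CacheResult): void {
  cacheRequestsTotal.set(result, (cacheRequestsTotal.get(result) ?? 0) + 1);
}

function hitRatio(): number {
  let hits = 0;
  let total = 0;
  for (const [result, n] of cacheRequestsTotal) {
    total += n;
    if (result === "hit") hits = n;
  }
  return total === 0 ? 0 : hits / total;
}
```

Remember to segment by keyspace in the real metric labels; a single global ratio hides per-endpoint miss storms.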
Example Prometheus alert rules:
groups:
- name: cache-alerts
rules:
- alert: CacheHitRatioDropped
expr: |
(sum(rate(cache_requests_total{result="hit"}[5m]))
/ sum(rate(cache_requests_total[5m]))) < 0.70
for: 10m
labels:
severity: page
annotations:
summary: "Cache hit ratio dropped below 70%"
description: "Investigate key churn, TTLs, invalidation storms, or cache capacity."
- alert: BackendQPSSpike
expr: sum(rate(backend_requests_total{dependency="postgres"}[5m])) > 2 * sum(rate(backend_requests_total{dependency="postgres"}[1h]))
for: 5m
labels:
severity: page
annotations:
summary: "DB QPS spiking (possible cache miss storm)"
- alert: CacheErrorRateHigh
expr: sum(rate(cache_requests_total{result="error"}[5m])) / sum(rate(cache_requests_total[5m])) > 0.01
for: 10m
labels:
severity: page
annotations:
summary: "Cache error rate > 1%"
Dashboards that make debugging faster:
- Hit ratio over time by keyspace
- Backend QPS and latency aligned with cache misses
- Stale-served rate (if SWR) — spikes during incidents are expected; sustained spikes are not
Checkpoint:
- You can correlate: “hit ratio dropped → backend QPS rose → p95 rose” in one Grafana view.
Rollout like a grown-up: flags, canaries, and safe rollback
Caching changes production behavior in non-obvious ways. Roll it out like you would a database migration.
- Ship instrumentation first (no behavior change)
- Enable caching behind a feature flag (LaunchDarkly, Unleash, or your in-house system)
- Canary: 1% → 10% → 50% → 100%, watching:
- p95/p99 latency
- backend QPS
- error rate
- stale-served rate
- Load test cold vs warm cache (people forget cold-cache behavior and then get wrecked on deploy)
- Plan invalidation events (deploys, backfills, re-indexes). “We flushed Redis” is not a strategy.
If you’re dealing with AI-assisted “vibe coded” caching (I’ve seen plenty), do an explicit pass for:
- Missing key dimensions (auth/tenant) → data leaks
- Unbounded caches in-process → OOMKills in Kubernetes
- No timeouts on Redis calls → threadpool starvation
This is the kind of cleanup GitPlumbers gets called in for: the cache exists, but it’s actively making outages worse.
Checkpoint:
- Rollback is a flag flip, not a redeploy, and you’ve tested it.
Key takeaways
- Treat caching as a reliability primitive: it should *degrade gracefully* when dependencies fail.
- Design a layered cache: browser/CDN → edge/proxy → service/in-process → distributed cache → database.
- Pick patterns intentionally (read-through vs write-through vs async refresh) and be explicit about staleness.
- Prevent stampedes with request coalescing, TTL jitter, and stale-while-revalidate.
- Instrument caches like production systems: hit ratio, p95/p99, backend QPS, evictions, stale-served, and error rates.
- Roll out caching behind flags with guardrails; verify impact with load tests and production canaries.
Implementation checklist
- Define an SLO for the endpoint(s) you’re caching and the acceptable staleness window.
- Instrument baseline latency (`p50/p95/p99`), backend QPS, and error rate before adding caching.
- Choose cache layers and ownership (CDN vs service vs shared Redis).
- Design cache keys and versioning (include tenant, locale, auth scope, and schema version).
- Select caching pattern per data shape (read-through, write-through, refresh-ahead, SWR).
- Add stampede protection (singleflight/coalescing, TTL jitter, locks) and failure policy (stale-if-error).
- Add cache observability: hit/miss, evictions, memory, stale-served, lock contention.
- Roll out via feature flag/canary; validate business KPIs and rollback plan.
Questions we hear from teams
- What’s a good cache hit ratio to aim for?
- It depends on the workload and keyspace. For a hot, read-heavy endpoint, I typically expect **70–95%** once warmed. More important: segment hit ratio by keyspace/endpoint and correlate it to backend QPS and tail latency. A global “90% hit ratio” can hide a miss storm on the one endpoint that matters.
- Should we fail open (bypass cache) or fail closed when Redis is down?
- Per endpoint. For user-facing reads where stale is acceptable, prefer **serve stale** (SWR) or **fail open** to origin with strict rate limiting. For authorization/session-like data, you may need **fail closed** to avoid security issues. Write the policy down and test it.
- Is in-process caching safe in Kubernetes with autoscaling?
- Yes, if you bound it. Use a real LRU/TTL cache with size limits and measure memory. Expect lower hit ratios during scale events and deploys. In-process caches are great for hot keys but shouldn’t be your only layer if you need stability across pods.
- How do we avoid data leaks with caching in multi-tenant systems?
- Treat cache keys like you treat database queries: include **tenant + auth scope** (and any personalization dimension). Add versioning, and use automated tests that confirm tenant isolation. I’ve seen ‘just cache the response’ turn into a P0 incident because the key forgot one dimension.
- When should we use Redis vs a CDN?
- If the response is cacheable via HTTP and doesn’t require per-user personalization, start with **CDN/edge caching**—it’s cheaper and reduces load closer to the client. Use Redis for shared, authenticated, or non-HTTP caching needs, and for coordination (locks/coalescing) when stampedes matter.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
