The Staff Engineer Quit and Took the Map With Them: Building Knowledge Systems That Survive Turnover

Institutional expertise isn’t a wiki problem — it’s a workflow problem. Here’s the paved-road setup that keeps tribal knowledge from walking out the door.

If knowledge capture isn’t wired into PRs, incidents, and releases, you’re one resignation away from running production on Slack archaeology.

The day the “human load balancer” leaves

I’ve watched this movie too many times: the senior/staff engineer who actually knows how things work leaves, and suddenly your org is running on vibes and Slack archaeology. The new on-call asks, “Where’s the runbook for the payments queue backlog?” and someone replies with a screenshot of a 2021 thread and a half-remembered kubectl incantation.

That’s not a documentation failure. That’s a knowledge capture system failure.

If your “institutional knowledge” lives in people’s heads, private gists, and tribal Slack channels, you don’t have a resilience problem — you have a bus factor of 1 and an expensive surprise queued up.

What breaks when you don’t preserve institutional expertise

This stuff shows up as real business pain, not philosophical angst:

  • On-call becomes a lottery
    • MTTR spikes because responders are re-discovering the system during an outage.
    • Escalations increase because only two people know where the bodies are buried.
  • Delivery slows down
    • Engineers stop touching “scary” services (hello, unplanned monolith preservation).
    • PRs get bigger because folks batch changes to reduce “unknown unknowns.”
  • Security and compliance become theater
    • Audit questions (“who can access prod?”) turn into spreadsheet cosplay.
  • AI-assisted coding makes it worse
    • AI-generated changes often look plausible while subtly breaking invariants.
    • Without crisp runbooks/ADRs, reviewers can’t tell what’s “normal” vs “dangerous.”

Here’s the trap: teams respond by buying a fancy portal, spinning up a bespoke “knowledge base,” and then… nobody uses it because it’s not part of the work.

The paved-road default that actually sticks: docs-as-code + minimal primitives

If you want knowledge to persist, tie it to the same place you already enforce quality: the repo.

The simplest system I’ve seen work across startups and bigcos is:

  • Docs-as-code in a docs/ folder
  • Runbooks per service (required for anything that pages)
  • ADRs for decisions that change constraints
  • Ownership metadata (CODEOWNERS + service owner field)

That’s it. You can layer in a catalog later.

A basic repo layout:

.
├── services/
│   ├── payments-api/
│   │   ├── README.md
│   │   ├── runbook.md
│   │   └── service.yaml
│   └── billing-worker/
│       ├── README.md
│       ├── runbook.md
│       └── service.yaml
├── docs/
│   ├── index.md
│   ├── onboarding.md
│   ├── incident-response.md
│   └── adr/
│       ├── 0001-template.md
│       └── 0007-why-we-chose-sqs-over-kafka.md
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       └── docs.yml
└── CODEOWNERS

Why this works:

  • It uses paved-road defaults: GitHub/GitLab, Markdown, PR review.
  • Docs change when code changes (because they ride the same workflow).
  • Ownership is explicit, so docs don’t become communal junk drawers.

MkDocs: the “boring” win

MkDocs is a good default because it’s low-ceremony and fast.

# mkdocs.yml
site_name: Platform Knowledge Base
repo_url: https://github.com/acme/platform
theme:
  name: material
nav:
  - Home: index.md
  - Onboarding: onboarding.md
  - Incident Response: incident-response.md
  - ADRs:
      - "ADR Template": adr/0001-template.md

Local preview:

python -m venv .venv
source .venv/bin/activate
pip install mkdocs-material
mkdocs serve

The goal isn’t to win a docs beauty contest. It’s to make knowledge close to the code and easy to maintain.

Make knowledge capture automatic: PRs, incidents, and releases are your hooks

Most orgs fail because they rely on someone remembering to document it. Nobody remembers, especially during an incident.

Wire documentation into the places work already happens.

1) PR template that forces the question

<!-- .github/pull_request_template.md -->

## What changed
- 

## Ops impact
- [ ] No ops impact
- [ ] New alert / changed thresholds
- [ ] Behavior change that affects on-call or support

## Docs/runbook updates
- [ ] Not needed
- [ ] Updated `README.md` / `docs/`
- [ ] Updated `runbook.md`

## Rollout
- [ ] Safe to deploy normally
- [ ] Needs canary deployment
- [ ] Needs feature flag

This doesn’t guarantee quality, but it reliably catches “oh right, the alert name changed” before it wakes someone up at 3am.

2) Lightweight ADRs (keep them small, keep them honest)

Use ADRs when a decision changes the constraints of the system: datastore choice, queue semantics, auth strategy, tenancy model, retry policy.

# docs/adr/0001-template.md

# Title

## Status
Proposed | Accepted | Superseded

## Context
What problem are we solving? What constraints matter (latency, compliance, cost, ops)?

## Decision
What are we doing?

## Consequences
- Positive:
- Negative:
- Operational notes:

## Links
- RFC/PR:
- Incident/Postmortem:

The payoff is during change review and incident response: you can answer “why is it like this?” without summoning a specific human.

3) Runbooks that match reality (not aspirational diagrams)

If a service can page someone, it needs a runbook. Period.

# services/payments-api/runbook.md

## Service overview
- Repo path: `services/payments-api`
- Owner: `@payments-platform`
- SLO: 99.9% availability / 300ms p95

## Dashboards
- Grafana: https://grafana.example.com/d/abc123/payments

## Common alerts
### `PaymentsQueueBacklogHigh`
**Meaning**: SQS backlog > 50k for 10m

**First checks**
1. Confirm consumer lag in Grafana
2. Check recent deploys in ArgoCD
3. Look for downstream dependency failures (Stripe, DB)

**Mitigations**
- Scale workers: `kubectl -n payments scale deploy/payments-worker --replicas=12`
- Temporarily disable expensive feature flag `payments_enrichment`

**Escalate when**
- Backlog > 200k for 15m
- Error rate > 2% for 5m

Notice what’s not here: a 40-page narrative. It’s a playbook.

Discoverability beats completeness: make the “map” impossible to miss

Even good docs fail if nobody can find them. The trick is to make docs discoverable from the places engineers naturally start.

Add ownership and entrypoints

CODEOWNERS creates accountability and keeps docs from going stale.

# CODEOWNERS
/docs/ @platform-eng
/services/payments-api/ @payments-platform
/services/payments-api/runbook.md @payments-platform

Add a tiny service.yaml that tools (or humans) can read later:

# services/payments-api/service.yaml
name: payments-api
tier: 1
owner: payments-platform
oncall: pagerduty:payments
runbook: ./runbook.md
slack: "#payments-oncall"
dependencies:
  - postgres:payments
  - sqs:payments-events

If you want a catalog UI eventually, this file becomes your source of truth. If you don’t, it’s still useful.
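Metadata that nobody validates rots just like prose, so it pays to check these files in CI. A minimal Python sketch of that check; the required-field set and tier range are assumptions to adjust, and YAML parsing (e.g. via PyYAML) is left out so the logic stands alone:

```python
# Validate service metadata before it feeds a catalog or paging config.
# Field names mirror the service.yaml above; REQUIRED and the tier
# range are assumptions -- adjust to what your org actually mandates.
REQUIRED = {"name", "tier", "owner", "oncall", "runbook"}

def validate_service(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - meta.keys())]
    tier = meta.get("tier")
    if tier is not None and tier not in (1, 2, 3):
        problems.append(f"tier must be 1-3, got {tier!r}")
    return problems
```

Feed it the parsed YAML for each service and fail the build when any list comes back non-empty.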

One rule that changes everything

  • No orphan services: if it deploys, it has README.md, runbook.md, and an owner.

That one rule prevents the zombie-service problem that kills productivity in mature orgs.

Guardrails in CI: stop doc rot before it merges

The fastest way to get compliance is to let CI be the bad guy.

A simple GitHub Actions workflow:

# .github/workflows/docs.yml
name: docs
on:
  pull_request:
    paths:
      - 'docs/**'
      - 'services/**/runbook.md'
      - 'mkdocs.yml'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install mkdocs-material
      - run: mkdocs build --strict

--strict is the quiet hero here: it turns warnings (broken internal links, pages missing from the nav) into build failures.

If you want to go one notch further without building a Rube Goldberg machine:

  • Add a script that checks every services/*/ directory has a runbook.md.
#!/usr/bin/env bash
set -euo pipefail
missing=0
for d in services/*; do
  [ -d "$d" ] || continue
  if [ ! -f "$d/runbook.md" ]; then
    echo "Missing runbook: $d/runbook.md"
    missing=1
  fi
done
exit $missing
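If you adopt the full no-orphan rule (README, runbook, and owner metadata), the same check extends naturally. A hedged Python variant, with file names taken from the repo layout above:

```python
from pathlib import Path

# The no-orphan trio; file names match the repo layout above.
REQUIRED = ("README.md", "runbook.md", "service.yaml")

def find_orphans(root: str = "services") -> list[str]:
    """Return required files missing under root/<service>/."""
    base = Path(root)
    if not base.is_dir():
        return []  # nothing to check, e.g. running outside the repo
    return [
        f"{svc}/{name}"
        for svc in sorted(base.iterdir()) if svc.is_dir()
        for name in REQUIRED if not (svc / name).is_file()
    ]

# In CI: print the missing paths and exit non-zero if the list is non-empty.
```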

That’s “paved road.” No bespoke portal required.

Cost/benefit trade-offs (with a real before/after)

Here’s what I’ve seen in the wild when teams adopt the repo-first system.

Before

  • Onboarding: 3–6 weeks to ship a meaningful change (lots of pairing, tribal knowledge).
  • Incidents: MTTR is “depends who’s awake.” Often 2–5 hours for recurring failure modes.
  • Changes: high fear factor; senior engineers become bottlenecks.
  • Docs: a Confluence space with hundreds of pages, most outdated.

After (typically 4–8 weeks of disciplined rollout)

  • Onboarding: 7–10 days to first meaningful PR, because the “where is everything?” questions have deterministic answers.
  • Incidents: recurring alerts drop to 30–60 minutes MTTR because the first responder has steps, links, and escalation criteria.
  • Reviews: PRs move faster because ADRs clarify the “why,” and runbooks clarify ops impact.

The real cost

  • You’ll spend a few engineer-days setting up MkDocs + CI + templates.
  • You’ll spend ~15–30 minutes per meaningful change keeping docs/runbooks accurate.

And yes, people will complain. Usually the same people who complain about being paged for mysteries.

At GitPlumbers, when we’re called in to stabilize teams drowning in legacy modernization or AI-generated code gone sideways, this is one of the first productivity moves we make. It’s not glamorous, but it’s the difference between “we can safely ship” and “we’re one resignation away from chaos.”

What I’d do next Monday (without boiling the ocean)

  1. Pick one repo (or one product area) and implement docs/ + MkDocs.
  2. Add a runbook.md requirement for Tier-1 services only.
  3. Add ADRs for only these decisions:
    • datastore changes
    • queue/eventing changes
    • auth/permissions model changes
    • major dependency changes
  4. Add CODEOWNERS and enforce review on docs/ and runbooks.
  5. Track outcomes for 30 days:
    • onboarding time-to-first-PR
    • MTTR on top 3 alerts
    • % Tier-1 services with current runbooks

If you can’t measure improvement, you’re doing documentation cosplay.
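The MTTR arithmetic is simple enough to script. A sketch assuming you can export incident start/resolve timestamps from your alerting tool; the record fields here are hypothetical:

```python
from datetime import datetime
from statistics import mean

# Hypothetical export: one record per resolved incident for a given alert.
incidents = [
    {"alert": "PaymentsQueueBacklogHigh",
     "started": "2024-03-01T02:10:00", "resolved": "2024-03-01T02:55:00"},
    {"alert": "PaymentsQueueBacklogHigh",
     "started": "2024-03-09T14:00:00", "resolved": "2024-03-09T14:35:00"},
]

def mttr_minutes(records: list[dict]) -> float:
    """Mean time to resolve across the records, in minutes."""
    return mean(
        (datetime.fromisoformat(r["resolved"])
         - datetime.fromisoformat(r["started"])).total_seconds() / 60
        for r in records
    )

print(mttr_minutes(incidents))  # 40.0 for the two sample incidents
```

Run it per alert, before and after the rollout, and you have a number instead of a vibe.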

The goal isn’t to “document everything.” The goal is to make the system legible to the next engineer at 2am.

If you want a second set of eyes, GitPlumbers can help you set this up without turning it into a six-month platform rebuild. We’re biased toward boring defaults, tight feedback loops, and systems that survive contact with reality.


Key takeaways

  • If knowledge capture isn’t wired into PRs, incidents, and releases, it won’t happen consistently.
  • Start with a **single paved-road**: `docs/` in the main repo + Markdown + CI checks. Avoid bespoke portals early.
  • Use **ADRs + runbooks + ownership metadata** as the minimum viable “map” of the system.
  • Make docs discoverable by linking them from code (`README`, `service.yaml`, `CODEOWNERS`) and enforcing a “no orphan services” rule.
  • Measure outcomes: onboarding time, incident MTTR, change failure rate, and doc freshness — not page counts.

Implementation checklist

  • Create a `docs/` directory with MkDocs or Docusaurus and ship it via CI.
  • Add an ADR template and require ADRs for specific decision types (datastore, queue, auth, major deps).
  • Create a runbook template and require it for any service that pages humans.
  • Add `CODEOWNERS` for `docs/` and per-service docs to establish accountability.
  • Add PR templates that ask for doc/runbook updates when behavior or ops changes.
  • Add CI to block merges on broken links and missing required doc fields.
  • Stand up a lightweight service catalog (even a YAML file) before buying a platform.
  • Track 3 metrics: onboarding time-to-first-PR, MTTR, and % services with current runbooks.

Questions we hear from teams

Should we use Confluence instead of docs-as-code?
If Confluence is already working for you and people keep it current, fine. The failure mode I’ve seen is “docs live somewhere separate from code,” so they drift. Docs-as-code wins because it shares the PR workflow, review gates, and ownership model. You can still publish the rendered site for non-engineers.
When do we need a service catalog like Backstage?
When you have enough services that humans can’t keep the map in their heads (usually dozens+) and you need standardized metadata (ownership, tiers, dependencies). Start by creating `service.yaml` files and enforcing the “no orphan services” rule. You can adopt Backstage later and ingest the same metadata.
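If you do adopt Backstage later, the service.yaml fields map onto its catalog descriptor format. A sketch for illustration; annotation keys vary by plugin, so treat these as placeholders:

```yaml
# catalog-info.yaml -- roughly the same metadata in Backstage's format
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
spec:
  type: service
  lifecycle: production
  owner: payments-platform
```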
How do we stop runbooks from becoming outdated?
Tie them to change workflows: PR templates that ask about ops impact, `CODEOWNERS` so the right team reviews updates, and CI checks (broken links, required files present). Also, review runbooks during postmortems: every incident should produce at least one concrete runbook improvement.
