The 84-Service Migration That Finally Stopped Waking Up the On‑Call

A real microservices “cleanup” that reduced operational overhead by standardizing deployment, observability, and service boundaries—without pretending we could rewrite the business in a quarter.

If your microservices migration doesn’t reduce the operational tax, it’s just infrastructure cosplay.

The week the on‑call rotation became a retention problem

I’ve seen a lot of “microservices transformations” that look great in a slide deck and terrible in PagerDuty. This one started with the usual symptoms:

  • 84 services (Node.js + Java + a few Go “one-offs”) spread across ECS, EC2, and a couple of “don’t touch it” boxes
  • Every team had its own deployment script (some bash, some GitHub Actions, some Jenkins jobs nobody owned)
  • Logs in three places, metrics in two, tracing “coming soon”
  • Incidents that read like: “Service A timed out calling Service B which was retrying Service C which was rate-limited by Service D”

The CTO didn’t ask for “Kubernetes” or “service mesh.” They asked a better question: “Can we stop paying an operational tax on every single change?”

GitPlumbers was brought in because the team had been burned by consultants before—big rewrite plans, vague promises, and zero reduction in day-to-day pain.

What we inherited: microservices without the safety rails

This org wasn’t incompetent. They were moving fast in a regulated fintech environment, and the architecture reflected years of pressure.

Constraints we had to respect:

  • Regulatory auditability: change history, approvals, and reproducibility mattered
  • No big bang: revenue-critical payment flows could not freeze for a quarter
  • Multi-tenant SLAs: a noisy tenant could not take down the fleet
  • Vendor limitations: the security team required AWS primitives, and data residency ruled out some managed services

Baseline metrics (4-week average before intervention):

  • Deploy frequency: ~2.1 deployments/day across all services
  • Change failure rate: ~18% (rollback or hotfix within 24h)
  • MTTR: 92 minutes
  • PagerDuty pages: 310/month (and the “real” number was higher—people muted alerts)
  • Ops load: ~22 engineer-hours/week spent on “why didn’t this deploy” and “where are the logs”

The kicker: half the services were “micro” in the worst way—tiny codebases, big operational footprint. Lots of duplicated patterns:

  • Slightly different retry logic everywhere
  • Inconsistent timeouts (30s in one service, 120s in another)
  • Ad-hoc auth middleware
  • A few AI-generated helpers (“vibe coded” during an outage) that nobody fully trusted

The approach: paved road first, migration second

I’ve seen migrations fail when teams start by moving workloads before they’ve defined how workloads should behave. We flipped it:

  1. Define the paved road (deployment, observability, config, runtime defaults)
  2. Move the highest-pain services first (not necessarily the highest-traffic)
  3. Reduce service count selectively where it removed coordination and on-call load

We agreed on a 12-week plan with clear exit criteria:

  • Any service moved must have standard dashboards, structured logs, and tracing
  • Every migrated service must ship via GitOps (ArgoCD) and include a runbook
  • No “platform team fantasy”: teams had to be able to operate their own services with the new defaults

Intervention #1: GitOps + templates to kill snowflake deployments

The first operational win came from deleting variance.

We standardized on:

  • EKS for runtime (yes, Kubernetes has sharp edges—so do bespoke scripts at 2am)
  • Terraform for infra provisioning
  • ArgoCD for deployment (pull-based, auditable, consistent)
  • Helm charts generated from a minimal service template

Here’s a simplified ArgoCD ApplicationSet we used to onboard services with the same rules. This wasn’t “platform engineering theater”—it was the smallest thing that made deployments boring:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: services
spec:
  generators:
    - git:
        repoURL: https://github.com/acme-fintech/platform
        revision: main
        directories:
          - path: services/*
  template:
    metadata:
      name: '{{path.basename}}'
      labels:
        app.kubernetes.io/managed-by: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/acme-fintech/platform
        targetRevision: main
        path: '{{path}}/chart'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
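The git directories generator above assumes a monorepo layout where each service directory carries its own chart (paths illustrative, not the actual repo structure):

```text
platform/
  services/
    payments/
      chart/          # Helm chart generated from the service template
        Chart.yaml
        values.yaml
    identity/
      chart/
        Chart.yaml
        values.yaml
```

Adding a new service to the fleet meant adding a directory, nothing more.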

What changed immediately:

  • No more “works in Jenkins but not in Actions” nonsense
  • Rollbacks became deterministic (git revert + auto-sync)
  • Audit trails stopped being tribal knowledge

We also implemented progressive delivery with canaries for the riskiest services (payments and identity). We didn’t go full Istio—I’ve watched service meshes become expensive hobbies. Instead, we used ingress-weighted canaries and tight health checks.
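For context, an ingress-weighted canary can be as small as a second Ingress carrying a traffic weight. This sketch uses the NGINX ingress controller's canary annotations; the host, service names, and 10% weight are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # route 10% of traffic to the canary
spec:
  ingressClassName: nginx
  rules:
    - host: payments.internal.example.com   # must match the primary Ingress host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-canary       # the new revision's Service
                port:
                  number: 8080
```

If health checks degrade, deleting the canary Ingress (via git revert) restores 100% of traffic to the stable revision.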

Intervention #2: Observability that engineers actually use

If you can’t see it, you can’t operate it. The prior setup had “monitoring,” but not debuggability.

We standardized instrumentation using OpenTelemetry and shipped everything to the same place:

  • Metrics: Prometheus
  • Dashboards: Grafana
  • Traces: Tempo (or a vendor equivalent)
  • Logs: structured JSON into a single pipeline
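Structured JSON logs only pay off if every service emits the same shape. A minimal sketch of the shared log helper (field names are illustrative; in practice this lived in the service template):

```typescript
// Minimal structured logger: one JSON object per line, stable field names.
type LogLevel = 'info' | 'warn' | 'error';

interface LogRecord {
  ts: string;          // ISO-8601 timestamp
  level: LogLevel;
  service: string;
  msg: string;
  trace_id?: string;   // propagated from the OpenTelemetry context
  [key: string]: unknown;
}

const makeLogger = (service: string) =>
  (level: LogLevel, msg: string, fields: Record<string, unknown> = {}): LogRecord => {
    const record: LogRecord = {
      ts: new Date().toISOString(),
      level,
      service,
      msg,
      ...fields,
    };
    // One line per record; the log pipeline parses it without custom grok rules.
    console.log(JSON.stringify(record));
    return record;
  };

// usage
const log = makeLogger('payments');
log('error', 'fraud provider timeout', { trace_id: 'abc123', upstream: 'fraud' });
```

Stable field names (`service`, `trace_id`) are what let an on-call engineer pivot from a trace to its logs in one query.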

A minimal OpenTelemetry Collector config (trimmed) looked like this:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
  otlp/tempo:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]

We paired this with SRE-lite hygiene:

  • SLOs for the top 10 user flows (not for every endpoint)
  • Alerts on symptoms (latency, error rates) instead of “CPU is 70%”
  • Runbooks that answered: “What do I check first?”
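As an illustration, a symptom-based alert on the paved road looked roughly like this Prometheus rule (service name, metric names, and the 2% threshold are hypothetical):

```yaml
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutHighErrorRate
        # Alert on the symptom users feel (error rate), not on CPU or memory.
        expr: |
          sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="checkout"}[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate above 2% for 10 minutes"
          runbook: "What to check first: upstream payment provider, then recent deploys"
```

The `for: 10m` clause alone killed a large share of the flapping alerts people had been muting.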

This is where the operational overhead dropped the fastest: fewer false alarms, faster root cause, less Slack archaeology.

Intervention #3: Service boundary fixes (and yes, we removed some services)

The dirty secret: the company didn’t have “too many microservices.” They had too many microservices that behaved differently.

Still, we did reduce service count—but surgically.

We found three categories worth changing:

  • Zombie services: no clear owner, minimal traffic, high alert noise
  • Utility services: tiny wrappers around databases or queues (often better as libraries)
  • Chatty pairs: services that always deployed together and failed together

We consolidated 12 services into 4 by:

  • Folding utility services into the nearest domain service
  • Replacing synchronous call chains with async events in two hotspots
  • Adding timeouts + circuit breakers where sync was unavoidable

A concrete example: a “RiskScore” service was doing 6 downstream calls with inconsistent timeouts. We standardized in code (Node.js example):

import CircuitBreaker from 'opossum';
import fetch from 'node-fetch';

const callFraudProvider = async (payload: unknown) => {
  const res = await fetch(process.env.FRAUD_URL!, {
    method: 'POST',
    body: JSON.stringify(payload),
    headers: { 'content-type': 'application/json' },
    // hard timeout at the HTTP layer
    signal: AbortSignal.timeout(1500),
  });
  if (!res.ok) throw new Error(`fraud provider ${res.status}`);
  return res.json();
};

export const fraudBreaker = new CircuitBreaker(callFraudProvider, {
  timeout: 2000, // breaker timeout sits just above the 1500ms HTTP abort
  errorThresholdPercentage: 50, // open the circuit once half of recent calls fail
  resetTimeout: 10_000, // probe the provider again after 10s in half-open state
});

fraudBreaker.fallback(() => ({ score: 0, reason: 'fallback' }));

That one change cut cascading failures during provider brownouts.

Data migration reality: We used the Strangler Fig pattern for two legacy stores, and for one Postgres schema we relied on CDC (AWS DMS) so new services could read the new shape while old ones still ran.

API safety: We added consumer-driven contract tests for the most brittle internal APIs. This prevented “migration by outage,” where you only find incompatible changes after traffic shifts.
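We used Pact-style consumer-driven contracts; stripped of tooling, the core idea is that the consumer pins exactly the response shape it reads, and the provider's CI verifies candidate responses against it before traffic shifts. A hand-rolled sketch (contract and field names hypothetical):

```typescript
// The consumer declares only the fields it actually reads, nothing more.
interface RiskScoreContract {
  score: number;
  reason: string;
}

// Provider-side verification: does a candidate response satisfy the contract?
const satisfiesContract = (body: unknown): body is RiskScoreContract => {
  if (typeof body !== 'object' || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.score === 'number' && typeof b.reason === 'string';
};

// Providers may add fields freely (expand-only evolution), but removing or
// retyping a consumed field fails CI before any traffic moves.
const candidateResponse = { score: 0.87, reason: 'velocity', extra: 'safe to add' };
if (!satisfiesContract(candidateResponse)) {
  throw new Error('provider response breaks the RiskScore consumer contract');
}
```

In practice a tool like Pact manages these contracts across repos, but even this level of checking catches the break before the outage does.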

Results after 12 weeks: less overhead, more shipping

This is the part leaders actually care about.

Measured outcomes (first 4 weeks post-migration vs baseline):

  • PagerDuty pages: 310/month → 118/month (-62%)
  • MTTR: 92 min → 48 min (-48%)
  • Change failure rate: 18% → 7%
  • Deploy frequency: 2.1/day → 6.4/day (because deploys got boring)
  • “Where are the logs?” time: estimated -70% (from incident review sampling)
  • Service count: 84 → 76 (not dramatic—and that’s the point)
  • Infra spend: -23% (mostly by rightsizing and killing idle ECS capacity)

The unsexy win: engineering stopped treating releases like live-fire exercises.

A detail I always look for: did the on-call rotation stop being a resignation trigger? In this case, yes—after two months, the team reported fewer escalations to staff/principal engineers and fewer “all-hands incident” Slack storms.

What actually worked (and what we’d never do again)

If you’re a senior leader staring at a microservices mess, here’s the distilled guidance.

What worked:

  • Standardize before you migrate: templates, defaults, and guardrails beat heroics
  • GitOps everywhere: ArgoCD reduced variance and improved auditability overnight
  • Telemetry as a migration gate: if a service can’t be observed, it can’t be “done”
  • Contract-first on the critical paths: payments, identity, billing—don’t wing it
  • Selective consolidation: remove services only when it removes coordination and pager load

What we’d never do again:

  • Moving 80+ services to Kubernetes without a paved road (I’ve seen that movie; the sequel is always worse)
  • Rolling out a full service mesh to “fix reliability” before teams have basic timeouts, retries, and tracing
  • Letting AI-generated code slip into shared libraries without ownership—vibe code cleanup is real work, and it belongs on the plan

If your “microservices migration” doesn’t reduce the operational tax, it’s just infrastructure cosplay.

GitPlumbers’ role here was part platform, part incident archaeology, part pragmatism referee. We didn’t sell magic; we shipped boring, repeatable operations.

If you’re in the same place—lots of services, lots of noise, and a team that’s tired—we can help you get to a paved road without stopping the business.


Key takeaways

  • Operational overhead in microservices usually comes from inconsistency, not scale—standardize the “paved road” before touching service count.
  • GitOps (ArgoCD) plus opinionated templates removed most manual deployment variance and cut incident blast radius.
  • We reduced services where it made sense, but the biggest win came from shared observability, SLOs, and sane defaults.
  • Contract-first APIs and consumer-driven contract tests prevented “migration by outage.”
  • Measure outcomes that engineering leaders actually care about: pager volume, MTTR, deploy frequency, change failure rate, and cost per service.

Implementation checklist

  • Inventory services and classify them (core, edge, utility, zombie) with owners and SLOs
  • Define a golden path: CI, CD, secrets, logging, tracing, runbooks, alerts
  • Adopt GitOps (e.g., ArgoCD) and stop ad-hoc deploy scripts
  • Implement baseline telemetry with OpenTelemetry + Prometheus + Grafana
  • Add contract tests for critical APIs before moving traffic
  • Use canary releases + progressive delivery to reduce change failure rate
  • Consolidate only the services that are operationally expensive and low-value

Questions we hear from teams

Did you have to migrate everything to Kubernetes to get the overhead reduction?
No. The biggest reduction came from standardizing deployment and observability. Kubernetes (EKS) helped by enforcing consistent runtime patterns, but the real win was eliminating snowflake pipelines and making services debuggable by default.
Why not use a service mesh like Istio to solve reliability problems?
I’ve seen Istio become a second platform teams have to operate. We deferred a mesh until basic hygiene was consistent: timeouts, retries, circuit breakers, tracing, and sane SLO-based alerting. Only then does a mesh become additive instead of overhead.
How did you decide which services to consolidate?
We consolidated services that were ownerless, low-traffic but high-noise, or tightly coupled (always deployed/failing together). We did not consolidate core domain services just to reduce the count—service count is a vanity metric if operability is still inconsistent.
What metrics should we track to prove overhead is going down?
Pager volume, MTTR, change failure rate, deploy frequency, and engineer-hours spent on deploy/debug toil. Cost per service can be useful too, but only if you separate runtime cost from incident and coordination cost.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about reducing microservices operational overhead
See how we rescue brittle systems (without a rewrite fantasy)
