> 2026-04-28 · standing · 3 min read

Capacity planning fails the same way at any scale

  • leadership
  • capacity
  • lessons

Customer Engineering Manager at spaceOS, post-acquisition, three timezones, a relocation to Japan. I said yes to too many roles, one at a time. The architectural rule for systems applies to people; I just hadn’t applied it.

You don’t run a service at 100% utilisation. You leave headroom for spikes, retries, garbage collection, the unknown. Run any system at saturation and the next thing that goes wrong takes the whole thing down.

I knew this rule cold. I applied it to every architecture I ever shipped. I didn’t apply it to myself, and the result was exactly what the rule predicts.

Context

I was Customer Engineering Manager at spaceOS, post-Equiem acquisition. The team needed coverage across PL, USA, and APAC. I moved to Japan to embed with the client there.

The roles I picked up, in order, each one reasonable on its own:

  • Architecture decisions on the localisation module and payment systems (because I’d been Tech Lead before)
  • Operational ownership of the support → customer-engineering transformation
  • Cross-timezone bridging across PL, USA, and AU/Asia
  • Mentoring frontend hires (because I’d onboarded them)
  • Direct client-side embedding in Japan (because I was already there)
  • Hiring and calibration for the chapter

Every “yes” looked small. The sum was not.

The error, in engineering terms

Every load test passed in isolation. None of us load-tested the combination. That’s the entire bug.

The system’s failure mode looks normal in the steady state and catastrophic on any spike. Move-to-Japan jet lag, a client-side incident, an architecture review with a hard deadline, and a hiring round: none of them was the problem individually. They co-occurred, and nobody had budgeted for the overlap.
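
A minimal sketch of that bug, in Python. Every number here is invented for illustration, and the role names are just labels for the list above; nothing below is a measurement. The point is only the shape: each role fits on its own, even when it spikes, while the combination does not.

```python
# Hypothetical weekly load per role, in percent of one person's capacity.
# Invented numbers, for illustration only.
ROLE_LOAD = {
    "architecture": 25,
    "ops_transformation": 20,
    "timezone_bridging": 15,
    "mentoring": 10,
    "client_embedding": 15,
    "hiring": 10,
}

# Hypothetical extra load when a role spikes (deadline, incident, hiring round).
SPIKE_EXTRA = {
    "architecture": 15,
    "client_embedding": 20,
    "hiring": 10,
}

CAPACITY = 100


def passes_in_isolation(role):
    """The test that was run: does this one role fit, even while spiking?"""
    return ROLE_LOAD[role] + SPIKE_EXTRA.get(role, 0) <= CAPACITY


def combined_load(spiking=()):
    """The test that was not run: total load with some roles spiking together."""
    return sum(ROLE_LOAD.values()) + sum(SPIKE_EXTRA[r] for r in spiking)


every_role_passes = all(passes_in_isolation(r) for r in ROLE_LOAD)
steady_state = combined_load()                  # 95: looks fine
worst_week = combined_load(spiking=SPIKE_EXTRA)  # 140: does not fit

print(every_role_passes, steady_state, worst_week)
```

With these made-up figures, the steady state sits at 95% and looks survivable, which is exactly why it went unnoticed. The week where three spikes overlap lands at 140%.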

What I’d architect differently now

The same rules I’d apply to any service:

  • Headroom is a feature, not laziness. Plan for utilisation around 70%, not 100%. Anything above 80% sustained means the next thing that goes wrong has nowhere to land.
  • Named owner per recurring responsibility. “I’ll keep an eye on it” is not an owner. The on-call schedule needs to apply to mentoring, customer engagement, hiring — not just incidents.
  • No shared on-call surfaces. Mentoring on one continent and customer escalations on another can’t both route through the same person. They will both fire on the same day, and one of them will not be answered well.
  • Deliberate slack. Block calendar time the way you reserve memory. Calendar pressure tells you the same thing memory pressure does — you’re operating at saturation, and the next page request is going to hurt.

Most of these are obvious in retrospect. None of them were obvious to me when each “yes” arrived in isolation.
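
The headroom rule is the easiest of these to write down. A rough sketch, assuming utilisation is tracked as a weekly percentage and that "sustained" means four weeks in a row; both the function name and that window are my assumptions, not an established convention.

```python
TARGET_PCT = 70         # planning target, from the list above
SUSTAINED_MAX_PCT = 80  # sustained load above this leaves no room for a spike


def headroom_status(weekly_pct, window=4):
    """Classify recent utilisation (percent of capacity, one value per week).

    `window` is an assumed definition of "sustained": this many weeks in a row.
    """
    if not weekly_pct:
        raise ValueError("need at least one week of data")
    recent = weekly_pct[-window:]
    # Saturated only if every week in a full window is above the ceiling.
    if len(recent) == window and min(recent) > SUSTAINED_MAX_PCT:
        return "saturated"
    # Otherwise compare the recent average against the planning target.
    if sum(recent) / len(recent) > TARGET_PCT:
        return "over target"
    return "ok"


print(headroom_status([60, 65, 70, 68]))  # ok
print(headroom_status([75, 78, 72, 79]))  # over target
print(headroom_status([85, 90, 82, 95]))  # saturated
```

The thresholds are the ones from the list; the useful part is that "saturated" is a state you can detect before the spike arrives, not after.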

The honest part

There’s no villain in this story. The acquisition wasn’t bad, the team wasn’t unreasonable, the client wasn’t demanding. The decision-making error was mine, and it was a category error: treating a personal capacity question as a willpower question instead of an architecture question.

Willpower scales linearly with effort and runs out. Capacity is a fixed budget. You don’t try harder with a system running at 100% — you shed load or you watch it fall over.

What stuck

I’m now suspicious of every “small yes.” Not because the small yeses are wrong, but because I know my measurement of “small” is the part that fails first under load.

Two questions I ask before adding any recurring responsibility:

  1. What does the budget look like at the next predictable spike? Not the steady-state — the spike. Releases, kids falling sick, a parent needing surgery, a country move.
  2. Who is the named owner if I’m fully unavailable for a week? If the answer is “we’ll figure it out,” the role isn’t mine to take yet.

Both are operational questions, not personal ones. Both apply to a service the same way they apply to a person. I just hadn’t generalised the rule.
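
Written as an admission check, the two questions might look like the sketch below. The `Responsibility` fields and the `can_accept` helper are hypothetical names for illustration, and the loads are again percentages of one person's capacity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Responsibility:
    name: str
    spike_load: int              # percent of capacity at the next predictable spike
    backup_owner: Optional[str]  # who covers if I'm fully unavailable for a week


def can_accept(new, current_spike_total, capacity=100):
    """Return (ok, reasons). Budgets against the spike, not the steady state."""
    reasons = []
    # Question 1: does the budget still hold at the next predictable spike?
    if current_spike_total + new.spike_load > capacity:
        reasons.append("no budget at the next spike")
    # Question 2: is there a named owner if I'm out for a week?
    if not new.backup_owner:
        reasons.append("no named owner for a week of absence")
    return (not reasons, reasons)


mentoring = Responsibility("mentoring", spike_load=15, backup_owner=None)
ok, reasons = can_accept(mentoring, current_spike_total=90)
print(ok, reasons)
```

Here the hypothetical mentoring role fails both checks: 90 + 15 exceeds the budget, and nobody is named as backup.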

The diamond-shoes caveat

This is, I realise, the loudest privileged complaint in the world. Senior leadership role at a global SaaS, multi-continent team, the company paid for me to live in Japan. Diamond shoes pinching, the lot.

That part’s fair. I won’t pretend the constraint was anything other than the kind a lot of people would happily take on.

What’s also fair: spending eighteen months saying yes to roles I was qualified for is how I figured out I wasn’t picking for the work I actually loved. I was picking for the next title, the next “harder problem” — and the harder problems were starting to mean more meetings, more ops, more timezones. Less code.

When the system unwound, the answer turned out to be embarrassingly simple. I’m a nerd. I like building things. I like the part of architecture that is “stare at the problem until it cracks”, not the part that is “build a coverage matrix for the support team’s shift rota”. This is not a curable condition.

I came out of it as a frontend architect again. Not a manager, not a director, not anything ending in Engineering. Just an architect, with significantly better taste in which yes-questions to take seriously, and a clearer sense of what I want to chase next, rather than what everyone else assumed I would chase.

Diamond shoes, mostly healed. Career, calibrated. Recommended only if you can’t find a less expensive way to learn the same lesson.