
Designing for Failure: Why Retry Logic Often Makes Systems Worse

Syslog Metadata · 2025-06-22
Context: Backend Systems · Env: Production · Note: Field notes from active development.

Most systems don’t fail because of a single catastrophic bug. They fail because of small, well-intentioned decisions that compound under pressure.

One of the most common examples is retry logic.

At first glance, retries feel like the responsible thing to do. A request fails? Try again. Network hiccup? One more attempt won’t hurt.

In production, this mindset often backfires.

The Illusion of Safety

Retries create a comforting illusion: that failures are temporary and systems are patient.

They aren’t.

In a healthy system, retries are invisible. In a degraded system, retries multiply load at the worst possible moment.

What starts as a minor slowdown turns into:

  • Request amplification
  • Thread pool exhaustion
  • Cascading failures across services

The system doesn’t recover. It drowns.
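To see how fast this compounds, here is a hypothetical back-of-the-envelope sketch; the layer and attempt counts are made up, not measurements from any real system:

```python
# Hypothetical amplification math: if every layer in a call chain retries a
# failing dependency, the attempts multiply at each hop.
attempts_per_layer = 3   # each caller tries up to 3 times
layers = 3               # e.g. gateway -> service -> backend
worst_case_attempts = attempts_per_layer ** layers
print(worst_case_attempts)  # 27 backend attempts for a single user request
```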

When Retries Become an Attack

I’ve seen systems where a 5% failure rate triggered a 200% traffic spike.

Not because users refreshed the page. Because the system attacked itself.

Every failed request retried:

  • Synchronously
  • Immediately
  • Without backoff
  • Without a global view

Each retry increased pressure on the very component that was already struggling.

The retry logic was correct in isolation. It was disastrous in reality.
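In code, that anti-pattern looks roughly like this. A minimal sketch, not the actual system; it assumes an HTTP call via the requests library, and the names are mine:

```python
import requests  # third-party HTTP client, used here purely for illustration

def fetch_with_naive_retry(url: str, attempts: int = 3):
    """The anti-pattern: synchronous, immediate retries with no backoff.

    Each failure triggers another request right away, so a struggling
    dependency receives more traffic at exactly the wrong moment.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return requests.get(url, timeout=2)
        except requests.RequestException as exc:
            last_error = exc  # no sleep, no jitter, no budget shared with other callers
    raise last_error
```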

A More Honest Approach to Failure

Failure isn’t an exception in distributed systems. It’s a state.

Designing for failure means accepting three uncomfortable truths:

  1. Not every request deserves to succeed
  2. Silence is sometimes safer than persistence
  3. Fast failure is often more reliable than slow hope

This shifts the question from “Can we retry?” to “Should we?”

Practical Constraints I Now Design Around

In production systems, I now treat retries as a scarce resource, not a default behavior.

My current rules of thumb:

  • Retries must be bounded
  • Backoff must be exponential and jittered
  • Retries must be visible in metrics
  • Some failures should propagate immediately

If the system can’t survive a failure without retries, retries won’t save it.
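A minimal sketch of those rules, assuming a generic callable and treating only timeouts and connection errors as retryable; the names, limits, and delays are illustrative, not recommendations:

```python
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)  # everything else propagates immediately

def call_with_retry_budget(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                           on_retry=lambda attempt, exc: None):
    """Bounded retries with exponential backoff and full jitter.

    `on_retry` is a hook for emitting a metric, so every retry stays visible.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RETRYABLE as exc:
            if attempt == max_attempts:
                raise  # budget exhausted: fail now rather than hope slowly
            on_retry(attempt, exc)
            # full jitter: wait a random amount up to the exponential cap
            time.sleep(random.uniform(0, min(max_delay, base_delay * (2 ** attempt))))
```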

Observability Changes the Conversation

The most dangerous retries are the invisible ones.

If you can’t answer:

  • How many retries are happening
  • Where they are happening
  • And why they are happening

then retries are already working against you.

Observability doesn’t prevent failure. It prevents self-deception.
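As a sketch of what "visible" can mean, assuming a Prometheus-style counter via the prometheus_client library (the metric and label names are mine), something like this could plug into the retry hook above:

```python
from prometheus_client import Counter

# One counter answers all three questions: how many (the count),
# where (the `target` label), and why (the `reason` label).
RETRY_ATTEMPTS = Counter(
    "client_retry_attempts_total",
    "Retry attempts, by downstream target and failure reason",
    ["target", "reason"],
)

def record_retry(target: str, exc: Exception) -> None:
    RETRY_ATTEMPTS.labels(target=target, reason=type(exc).__name__).inc()
```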

The Counterintuitive Outcome

The most reliable systems I’ve worked on did less, not more.

  • Fewer retries.
  • Stricter timeouts.
  • Clearer failure boundaries.

They failed faster, but recovered cleaner.

Afterthought

Retries feel like optimism encoded in code. Production systems don’t need optimism.

They need clear limits, honest signals, and the courage to say no.