ITNEXT

ITNEXT is a platform for IT developers & software engineers to share knowledge, connect, collaborate, learn and experience next-gen technologies.


Resiliency in a nutshell. Part 3: Traditional Engineering — Programming patterns

Alexander Wichmann Carlsen
Published in ITNEXT
8 min read, Apr 24, 2023


All of my stories are free. If you aren’t a member, you can read it here.

Part 1: Introduction
Part 2: Traditional Engineering — Architectural patterns
Part 3: Traditional Engineering — Programming patterns (this)
Part 4: Traditional Engineering — Testing strategies and coding standards

Today’s cloud-based, microservice-based or internet-of-things applications often depend on communicating with other systems across an unreliable network.
Such systems can be unavailable or unreachable due to transient faults such as network problems and timeouts, or subsystems being offline, under load or otherwise non-responsive.

Handling transient faults with retry and circuit-breaker

Transient faults are errors whose cause is expected to be a temporary condition such as temporary service unavailability or network connectivity issues.

Retry

Retry allows callers to repeat an operation in the expectation that many faults are transient and may self-correct: the operation may succeed if retried, possibly after a short delay.
Waiting between retries gives faults time to self-correct. Practices such as exponential backoff and jitter refine this by spacing retries out, so that the retries themselves do not become a source of further load or traffic spikes.
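
To make this concrete, here is a minimal sketch of retry with capped exponential backoff and jitter in Python. It is an illustration under stated assumptions, not a reference implementation: `TransientError`, `backoff_delays`, `retry` and all default values are hypothetical names and numbers chosen for this example, and the jitter variant (a random delay between zero and the exponential ceiling, often called "full jitter") is only one of several in common use.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def backoff_delays(max_retries, base_delay=0.1, max_delay=5.0, rng=random.random):
    """Yield one delay per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        # The ceiling doubles on every attempt but never exceeds max_delay.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        # Jitter: pick a random point below the ceiling so that many callers
        # failing at the same moment do not all retry at the same moment.
        yield rng() * ceiling


def retry(operation, max_retries=3, sleep=time.sleep, **backoff_options):
    """Call operation; on TransientError wait, then try again (max_retries times)."""
    delays = backoff_delays(max_retries, **backoff_options)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last fault to the caller
            sleep(delay)
```

A caller would wrap the remote call, for example `retry(lambda: fetch_order(42), max_retries=5)`, where `fetch_order` stands in for any operation that raises `TransientError` on a temporary failure. Only faults marked as transient are retried; any other exception propagates immediately.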

Circuit breaker

A circuit breaker detects the level of faults in calls placed through it, and prevents calls when a configurable fault threshold is exceeded.
While retrying plays for success, some faults arise where retries are unlikely to succeed or may even be counter-productive, for example where a subsystem is completely offline or struggling under load.
In such cases additional retries may be inappropriate, either because they have no chance of succeeding, or because they may just place additional load on the called system.
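
The idea can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, the fault level is measured as a count of consecutive failures (real libraries often use failure rates over a time window), and the reset timeout with a single trial call afterwards is an assumption of this sketch rather than something described above. The sketch is also not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive faults before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: the downstream system gets no additional load.
                raise CircuitOpenError("circuit open; call not attempted")
            # Timeout elapsed: let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: reset the fault count and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

With `breaker = CircuitBreaker(failure_threshold=3)`, every downstream call goes through `breaker.call(...)`. Once three calls in a row have failed, further calls raise `CircuitOpenError` immediately until the timeout has passed, so the caller can fall back or report the outage instead of waiting on a system that is not responding.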

A further ramification is that if a caller is unable to detect that a downstream system is unavailable, it may itself queue up large numbers of pending requests and retries.
Resources in the caller may then become exhausted or excessively blocked, waiting for replies which will never come.
In the worst…
