There’s no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again. The reasons it works so well are varied, but one is especially important for the developers and operators of distributed systems: metastability.
I’ll let the authors of “Metastable Failures in Distributed Systems” define what that means:
Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
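To make the definition concrete, here is a minimal toy simulation. It is not taken from the paper or the linked post, and every name and number in it (`simulate`, `arrivals`, `capacity`, `timeout`, `outage`, `restart_at`) is a made-up assumption. It models a FIFO server whose clients give up after `timeout` ticks and retry, while the server still spends capacity on the abandoned attempts. A short outage plays the role of the trigger.

```python
from collections import deque

def simulate(ticks=40, arrivals=8, capacity=10, timeout=3,
             outage=range(10, 15), restart_at=None):
    """Return per-tick goodput for a toy FIFO server with client retries."""
    queue = deque()  # enqueue tick of each pending attempt, oldest first
    goodput = []
    for t in range(ticks):
        if t == restart_at:
            # "Off and on again": assume a restart drops all queued work.
            queue.clear()
        # Attempts that just reached the client timeout are retried. The
        # stale attempt stays queued and still costs the server work.
        retries = sum(1 for enq in queue if t - enq == timeout)
        queue.extend([t] * (arrivals + retries))
        # The trigger: the server handles nothing during the outage.
        served = 0 if t in outage else capacity
        good = 0
        for _ in range(min(served, len(queue))):
            # Only answers that arrive before the client timed out count.
            if t - queue.popleft() < timeout:
                good += 1
        goodput.append(good)
    return goodput

baseline = simulate(outage=range(0))   # same load, no trigger
stuck = simulate()                     # trigger at ticks 10-14, then removed
restarted = simulate(restart_at=25)    # same trigger, queue dropped at tick 25
```

Under these assumptions, `baseline` serves all 8 arrivals every tick, since capacity exceeds load. In `stuck`, goodput falls to 0 during the outage and stays at 0 after capacity returns, because the server only ever works on attempts its clients have already abandoned and retried. In `restarted`, dropping the queue at tick 25 brings goodput straight back to 8 per tick, which is the toy version of turning it off and on again.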
Read in full here:
https://brooker.co.za/blog/2021/05/24/metastable.html