Blog post that uses TLA+ in a debate about properties of system designs
A colleague pointed out this encouraging development in the world of engineering blogs:
There's a grand tradition of blogs about Eric Brewer's "CAP theorem". Many of those blogs use basic terms in ambiguous ways -- even the core terms: 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs also depart significantly from the system model in the Gilbert & Lynch paper.
It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.
The ensuing debate could then be much more precise and enlightening: refutable analyses of the properties of algorithms, and precise proposals for different models of the network and of failures. Arguments could be settled by running the model checker, or, if people care enough, with machine-checked TLAPS proofs.
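As a minimal, hypothetical sketch of what such a blog post could publish (the module and all names below are illustrative, not from any of the posts being discussed): two replicas of a single register, a network that may partition, and an explicit consistency invariant that TLC can check and, here, refute with a concrete counterexample trace.

```tla
---- MODULE TwoReplicas ----
\* Illustrative sketch only: a write to replica r1 is visible at r2
\* only after a Sync step, and Sync is blocked by a partition.
EXTENDS Naturals

VARIABLES r1, r2, partitioned

Init == r1 = 0 /\ r2 = 0 /\ partitioned = FALSE

\* The network may partition at any time and never heals in this model.
Partition == partitioned' = TRUE /\ UNCHANGED <<r1, r2>>

\* A client writes some value to replica 1.
WriteR1 == \E v \in 1..3 : r1' = v /\ UNCHANGED <<r2, partitioned>>

\* Replication succeeds only while the network is whole.
Sync == ~partitioned /\ r2' = r1 /\ UNCHANGED <<r1, partitioned>>

Next == Partition \/ WriteR1 \/ Sync

\* One precise (and deliberately strong) notion of "consistency":
\* the replicas always agree. TLC refutes this invariant, and the
\* counterexample trace makes the trade-off concrete and debatable.
Consistent == r1 = r2
====
```

With the model and the claimed property both written down, a disagreement becomes a question about which line of the spec to change, rather than a dispute over what 'consistency' was supposed to mean.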
This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems. (Such arguments are usually based on the author's personal experience and intuition about the probabilities of various failure modes, without any actual supporting data. See  for why human intuition about the probability of failure in big systems is very likely to be wrong.)
But it would still be a big improvement.
Seth Gilbert & Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002)
Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems. The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems.
NASA’s C. Michael Holloway (after some studies of accidents):
“To a first approximation, we can say that
accidents are almost always the result of incorrect estimates
of the likelihood of one or more things.”
– We fatally underestimate the probabilities of what can happen.
– Designers will say a failure mode is “not credible” when it actually can happen with probability far greater than requirements allow.