[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*From*: Chris Newcombe <chris.n...@xxxxxxxxx>*Date*: Sun, 26 May 2013 08:15:10 -0700 (PDT)

A colleague pointed out this encouraging development in the world of engineering blogs:

There's a grand tradition of blogs[1] about Eric Brewer's "CAP theorem". Many of those blogs use basic terms in ambiguous ways -- even the core terms; 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs significantly depart from the system-model in the paper by Gilbert & Lynch[2].

It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.

The ensuing debate could then be much more precise and enlightening; refutable analysis of properties of algorithms; precise proposals for different models of network & failure. Arguments could be settled by using the model checker, or, if people care enough, with machine-checked TLAPS proofs.

This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems. (Such arguments are usually based on the author's personal experience & intuition about the probabilities of various failure modes, without any actual supporting data. See [3] for why human intuition about probabilities of failure in big systems are very likely to be wrong.)

But it would still be a big improvement.

Chris

[1] https://www.google.com/search?q=blog+CAP+theorem&oq=blog+CAP+theorem (not exhaustive, obviously)

[2] "Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services (2002)" [Seth Gilbert, Nancy Lynch]

Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems. The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems.

One interesting quote:

NASA’s C. Michael Holloway (after some studies of accidents):

“To a first approximation, we can say that

accidents are almost always the result of incorrect estimates

of the likelihood of one or more things.”

– We fatally underestimate the probabilities of what can happen.

– Designers will say a failure mode is “not credible”, when it actually can happen with probabilities far greater than requirements allow.

**Follow-Ups**:

- Prev by Date:
**hyperbook release** - Next by Date:
**"Raft" : A new consensus algorithm with a TLA+ spec** - Previous by thread:
**Re: hyperbook release** - Next by thread:
**Re: [tlaplus] Blog post that uses TLA+ in a debate about properties of system designs** - Index(es):