Re: [tlaplus] Blog post that uses TLA+ in a debate about properties of system designs

On Sun, May 26, 2013 at 7:15 AM, Chris Newcombe <chris.n...@xxxxxxxxx> wrote:

A colleague pointed out this encouraging development in the world of engineering blogs:

http://aphyr.com/posts/287-asynchronous-replication-with-failover

There's a grand tradition of blogs[1] about Eric Brewer's "CAP theorem". Many of those blogs use basic terms in ambiguous ways -- even the core terms; 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs significantly depart from the system-model in the paper by Gilbert & Lynch[2].

It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.

The ensuing debate could then be much more precise and enlightening; refutable analysis of properties of algorithms; precise proposals for different models of network & failure. Arguments could be settled by using the model checker, or, if people care enough, with machine-checked TLAPS proofs.

This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems. (Such arguments are usually based on the author's personal experience & intuition about the probabilities of various failure modes, without any actual supporting data. See [3] for why human intuition about probabilities of failure in big systems are very likely to be wrong.)

But it would still be a big improvement.

Chris

[1] https://www.google.com/search?q=blog+CAP+theorem&oq=blog+CAP+theorem (not exhaustive, obviously)

[2] "Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services (2002)" [Seth Gilbert, Nancy Lynch]

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495

[3] "Murphy was an Optimist" : http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf

Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems. The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems.

One interesting quote:

NASA’s C. Michael Holloway (after some studies of accidents):
“To a first approximation, we can say that
accidents are almost always the result of incorrect estimates
of the likelihood of one or more things.”
– We fatally underestimate the probabilities of what can happen.

– Designers will say a failure mode is “not credible”, when it actually can happen with probabilities far greater than requirements allow.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To post to this group, send email to tla...@xxxxxxxxxxxxxxxx.
Visit this group at http://groups.google.com/group/tlaplus?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.