Blog post that uses TLA+ in a debate about properties of system designs
A colleague pointed out this encouraging development in the world of engineering blogs:
There's a grand tradition of blogs about Eric Brewer's "CAP theorem". Many of those blogs use basic terms in ambiguous ways -- even the core terms: 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs also depart significantly from the system model in the Gilbert & Lynch paper.
It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.
The ensuing debate could then be much more precise and enlightening: refutable analyses of the properties of algorithms, and precise proposals for different models of the network and of failures. Arguments could be settled by running the model checker, or, if people care enough, with machine-checked TLAPS proofs.
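As a minimal, hypothetical sketch of what such a blog post could publish (the module and all names below are illustrative, not from any of the posts being discussed): two replicas of a single register, a network that may partition, and an explicit consistency invariant that TLC can check and, here, refute with a concrete counterexample trace.

```tla
---- MODULE TwoReplicas ----
\* Illustrative sketch only: a write to replica r1 is visible at r2
\* only after a Sync step, and Sync is blocked by a partition.
EXTENDS Naturals

VARIABLES r1, r2, partitioned

Init == r1 = 0 /\ r2 = 0 /\ partitioned = FALSE

\* The network may partition at any time and never heals in this model.
Partition == partitioned' = TRUE /\ UNCHANGED <<r1, r2>>

\* A client writes some value to replica 1.
WriteR1 == \E v \in 1..3 : r1' = v /\ UNCHANGED <<r2, partitioned>>

\* Replication succeeds only while the network is whole.
Sync == ~partitioned /\ r2' = r1 /\ UNCHANGED <<r1, partitioned>>

Next == Partition \/ WriteR1 \/ Sync

\* One precise (and deliberately strong) notion of "consistency":
\* the replicas always agree. TLC refutes this invariant, and the
\* counterexample trace makes the trade-off concrete and debatable.
Consistent == r1 = r2
====
```

With the model and the claimed property both written down, a disagreement becomes a question about which line of the spec to change, rather than a dispute over what 'consistency' was supposed to mean.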
This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems. (Such arguments are usually based on the author's personal experience and intuition about the probabilities of various failure modes, without any actual supporting data. See  for why human intuition about the probability of failure in big systems is very likely to be wrong.)
But it would still be a big improvement.
Seth Gilbert & Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002)
Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems. The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems.
NASA’s C. Michael Holloway (after some studies of accidents):
“To a first approximation, we can say that
accidents are almost always the result of incorrect estimates
of the likelihood of one or more things.”
– We fatally underestimate the probabilities of what can happen.
– Designers will say a failure mode is “not credible” when it actually can happen with probability far greater than requirements allow.