[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Blog post that uses TLA+ in a debate about properties of system designs



Hi Chris,
 
Thanks for the posting.  I particularly liked this comment by Metaxis on a page linked to from the page you cited:

I think your attempt to differentiate formal correctness and real world operations is deeply flawed and amounts to asserting anecdote - that what you have observed to be common makes for good design assumptions and better trade off decisions.

Allow me to counter: Real world operations will inevitably tend to approach formal correctness in terms of observed failure modes. In other words, over time, you are more and more likely to see edge cases and freak occurrences that are predicted in theory but happen rarely in practice.

And thus will totally bite you in the butt unless you design for them.

A real danger with anecdotal reasoning is post hoc rationalization - if an event happens that causes an unacceptable loss it is either explained away as a "freak occurrence" or added into the anecdotal body as a new "likely" occurrence that must be designed around, one real-world operational failure at a time.

This is exactly what formal modeling avoids.

TL;DR: Unless you are describing what is acceptible and comparing that to what is formally possible in your design, you are on borrowed time.


Leslie
 
On Sun, May 26, 2013 at 7:15 AM, Chris Newcombe <chris.n...@xxxxxxxxx> wrote:
A colleague pointed out this encouraging development in the world of engineering blogs:

    http://aphyr.com/posts/287-asynchronous-replication-with-failover

There's a grand tradition of blogs[1] about Eric Brewer's "CAP theorem".  Many of those blogs use basic terms in ambiguous ways -- even the core terms; 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs significantly depart from the system-model in the paper by Gilbert & Lynch[2]. 

It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.
The ensuing debate could then be much more precise and enlightening; refutable analysis of properties of algorithms; precise proposals for different models of network & failure.  Arguments could be settled by using the model checker, or, if people care enough, with machine-checked TLAPS proofs.

This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems.  (Such arguments are usually based on the author's personal experience & intuition about the probabilities of various failure modes, without any actual supporting data.  See [3] for why human intuition about probabilities of failure in big systems are very likely to be wrong.)
But it would still be a big improvement.

Chris

[1] https://www.google.com/search?q=blog+CAP+theorem&oq=blog+CAP+theorem   (not exhaustive, obviously)

[2] "Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services (2002)" [Seth Gilbert, Nancy Lynch]
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495

[3] "Murphy was an Optimist" : http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf

Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems.  The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems. 

One interesting quote:

     NASA’s C. Michael Holloway (after some studies of accidents): 

           “To a first approximation, we can say that

            accidents are almost always the result of incorrect estimates

            of the likelihood of one or more things.” 

      – We fatally underestimate the probabilities of what can happen.

      – Designers will say a failure mode is “not credible”, when it actually can happen with probabilities far greater than requirements allow.


--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To post to this group, send email to tla...@xxxxxxxxxxxxxxxx.
Visit this group at http://groups.google.com/group/tlaplus?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.