Blog post that uses TLA+ in a debate about properties of system designs
A colleague pointed out this encouraging development in the world of engineering blogs:
There's a grand tradition of blogs about Eric Brewer's "CAP theorem". Many of those blogs use basic terms in ambiguous ways -- even the core terms: 'consistency', 'availability', and 'partition-tolerance'. Many of the blogs also depart significantly from the system model in the Gilbert & Lynch paper.
It would help avoid a lot of confusion and noise if blog-authors used languages like TLA+ and PlusCal to precisely state the algorithms, network/failure model, and other assumptions that they are discussing.
The ensuing debate could then be much more precise and enlightening: refutable analyses of the properties of algorithms, and precise proposals for different models of the network and of failures. Arguments could be settled by running the model checker, or, if people care enough, with machine-checked TLAPS proofs.
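As a minimal, hypothetical sketch of what such a blog post could publish (the module and all names below are illustrative, not from any of the posts being discussed): two replicas of a single register, a network that may partition, and an explicit consistency invariant that TLC can check and, here, refute with a concrete counterexample trace.

```tla
---- MODULE TwoReplicas ----
\* Illustrative sketch only: a write to replica r1 is visible at r2
\* only after a Sync step, and Sync is blocked by a partition.
EXTENDS Naturals

VARIABLES r1, r2, partitioned

Init == r1 = 0 /\ r2 = 0 /\ partitioned = FALSE

\* The network may partition at any time and never heals in this model.
Partition == partitioned' = TRUE /\ UNCHANGED <<r1, r2>>

\* A client writes some value to replica 1.
WriteR1 == \E v \in 1..3 : r1' = v /\ UNCHANGED <<r2, partitioned>>

\* Replication succeeds only while the network is whole.
Sync == ~partitioned /\ r2' = r1 /\ UNCHANGED <<r1, partitioned>>

Next == Partition \/ WriteR1 \/ Sync

\* One precise (and deliberately strong) notion of "consistency":
\* the replicas always agree. TLC refutes this invariant, and the
\* counterexample trace makes the trade-off concrete and debatable.
Consistent == r1 = r2
====
```

With the model and the claimed property both written down, a disagreement becomes a question about which line of the spec to change, rather than a dispute over what 'consistency' was supposed to mean.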
This wouldn't stop the subset of blogs that make hand-wavy claims about the 'real-world benefits' of various existing systems. (Such arguments are usually based on the author's personal experience and intuition about the probabilities of various failure modes, without any actual supporting data. See  for why human intuition about the probability of failure in big systems is very likely to be wrong.)
But it would still be a big improvement.
Seth Gilbert & Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002)
Slide #3 points out that human working-lifetimes are too short to give us correct intuition about the probability of subtle bugs being triggered in massive-scale production systems. The rest of the talk is about various nasty examples of byzantine faults. Most of the examples are in hardware systems.
NASA’s C. Michael Holloway (after some studies of accidents):
“To a first approximation, we can say that
accidents are almost always the result of incorrect estimates
of the likelihood of one or more things.”
– We fatally underestimate the probabilities of what can happen.
– Designers will say a failure mode is “not credible” when it actually can happen with probability far greater than requirements allow.