[tlaplus] Re: Ideas for model-checking probabilistic consensus systems?

Resurrecting this thread, because it continues to recur in my thoughts.

Newcomers to TLA+ are often confused that actions such as a network message getting lost are given the same "weight" as happy path actions like receiving & processing that message. They intuitively think of network failures as a lower-probability event, and want some way to express this. It takes some cajoling to convince them this equal weighting of the unhappy path is actually a good thing - TLA+ requires you to be very explicit about what assumptions are required for your safety or liveness invariants to hold.

But what if we want to address the probabilistic nature of reality head-on? What if we had some way to express safety invariants which only hold probabilistically? How would that work, and what would it look like?

Consider an issue known to any developer in a large company: the flaky unit test. Say you have N unit tests, each with uniform independent probability of failure; if any one of them fails, the whole test run fails. Given probability P of each individual unit test failing, will the test run pass with probability at least 0.5? Let's write this in a spec with some hypothetical extension to the TLA+ language:

--------------------------- MODULE FlakyUnitTests ---------------------------

EXTENDS
    Reals

CONSTANTS
    Test,
    TestFailureProbability

ASSUME
    /\ TestFailureProbability \in {r \in Real : 0 <= r /\ r <= 1}

VARIABLES
    executed,
    succeeded

PTestOutcome[outcome \in BOOLEAN] ==
    IF outcome
    THEN 1 - TestFailureProbability
    ELSE TestFailureProbability

Init ==
    /\ executed = [t \in Test |-> FALSE]
    /\ succeeded = [t \in Test |-> FALSE]

TypeInvariant ==
    /\ executed \in [Test -> BOOLEAN]
    /\ succeeded \in [Test -> BOOLEAN]

SafetyInvariant ==
    /\ (\A t \in Test : executed[t]) =>
        /\ (\A t \in Test : succeeded[t]) WITH P(0.5, 1)

RunTest(t) ==
    /\ ~executed[t]
    /\ \E outcome \in BOOLEAN WITH PTestOutcome:
        /\ succeeded' = [succeeded EXCEPT ![t] = outcome]
    /\ executed' = [executed EXCEPT ![t] = TRUE]

Next ==
    \/ \E t \in Test : RunTest(t)

Spec ==
    /\ Init
    /\ [][Next]_<<executed, succeeded>>

THEOREM Spec => TypeInvariant /\ SafetyInvariant

=============================================================================

Here we use the keyword WITH in two ways:

1. Expressing the probability of taking each branch in a nondeterministic step

2. Expressing the probability of a predicate holding true for a state; the P(low, high) function specifies the range in which this probability must lie

If no WITH is specified for a state transition, that state transition is given a 100% probability of occurring.

I want to be clear this isn't a serious language design proposal, but more a sketch of what this capability could look like.

TLC would keep track of two additional float variables for each distinct state: lowest probability, and highest probability. The first is the lowest recorded probability for a sequence of steps to arrive in that state, and the second is the highest recorded probability for a sequence of steps to arrive in that state. These values are updated as additional paths (with corresponding probabilities) are found to each state. The end result is we are able to express probabilistic invariants.

I haven't put a particularly large amount of thought into this, but it does seem like a very neat capability which would enable TLA+ to specify things like the Slush/Snowflake/Snowball/Avalanche protocols, which come with probabilistic guarantees of success analyzed via something called continuous-time Markov chains. So in addition to easily transcribing the pseudocode from the paper to PlusCal, we could also easily transcribe & check the bounds:

Andrew

On Monday, December 10, 2018 at 3:07:41 PM UTC-8, Leslie Lamport wrote:

Hi Andrew,

If you're trying to check that the probability of an error is
sufficiently low, you probably want to execute a large number of
simulations and see what percentage of them produce an error. For
that, see the RandomElement, TLCGet and TLCSet operators defined in
the TLC module that are described here:

   https://lamport.azurewebsites.net/tla/current-tools.pdf

The operators defined in the Randomization module might also be
useful; they're described here:

   https://lamport.azurewebsites.net/tla/inductive-invariant.pdf

I'm not sure if that module is in the current release; you might have
to download the nightly build to get it.

Leslie

On Monday, December 10, 2018 at 1:36:16 PM UTC-8, Andrew Helwer wrote:
I've been toying with spec'ing the snowflake/snowball/avalanche byzantine consensus protocols in TLA+: https://muratbuffalo.blogspot.com/2018/06/snowflake-to-avalanche-novel-metastable.html

These protocols are probabilistic; quoting from Murat's post:

To perform a query, a node picks a small, constant-sized (k) sample of the network uniformly at random, and sends a query message. Upon receiving a query, an uncolored node adopts the color in the query, responds with that color, and initiates its own query, whereas a colored node simply responds with its current color.

The naive way to spec this in TLA+ is as follows:

QuerySample(k) ==
    /\ \E sample \in SUBSET Node :
        /\ Cardinality(sample) = k
        /\ ...

This obviously leads to a HUGE explosion in number of states and all but guarantees infeasibility of model-checking. What other approaches exist for modeling probabilistic properties such as this in TLA+?

Andrew