[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Canonical TLA+ ... how to avoid "imperative brainwash" on KV store



Guys, thank you. The light bulb went on. Please see https://github.com/rodgarrison/tla_note1/blob/main/README.md where I explain all this in my own words. If there are errors here, submit a Github issue. I'll fix it. Shane

On Wednesday, November 25, 2020 at 10:32:27 PM UTC-5 Calvin Loncaric wrote:
> the one obvious question is how TLA ever escapes the while [true] loop to explore more states or terminate

Yep, the model checker "escapes" the loop, even though your program cannot "escape" the loop. Crazy, right? But: When the model checker visits a state it has seen before, it can safely abandon its current line of execution, even if the program being checked would continue running!

To understand the model checker better, think deeply about this very simple non-terminating program:

        variable x = 1;
loop:
        while (TRUE) {
            x := IF x = 1 THEN 2 ELSE 1;
        };
after:
        x := -1;

It runs forever, but it only has two states:

    /-----------\       /-----------\
    | x  = 1    | ----> | x  = 2    |
    | pc = loop | <---- | pc = loop |
    \-----------/       \-----------/
   (initial state)

When the model checker revisits the state (x=1 /\ pc=loop), it can stop because the only possible outcome is an infinite loop. It has already seen every possible state of the program. The program never actually "escapes" the loop: there are no reachable states where pc=after or x=-1. However, the model checker doesn't need to wait for the program to terminate before it can stop and tell you that it has seen all of the reachable states.

You might be concerned that stopping at duplicate states means the model checker has missed some possible behaviors. Don't worry, it hasn't missed anything. I'll try to respond to some concerns:

*** The example program runs forever. Therefore, isn't it an infinite-state program?

No. It can take arbitrarily many steps, but along the way it only visits a finite number of states. If you replace the body of the loop in my example program with x := x + 1; then the program becomes an infinite-state program, and the model checker will not terminate. Note Stephen's comment: "The power of model checking is that it lets you explore even infinite executions over finite state spaces."

*** If my program is nondeterministic or multi-threaded, isn't it wrong to stop at an already-visited state? After all, a nondeterministic program might do something different on the next iteration.

Great question. Don't worry: the model checker still won't miss any states. While stepping through the loop body, the model checker queues all nondeterministic behaviors to "visit later". Stephan's code shows how this queue works. The model checker will "abandon this line of execution"... but then it goes back to its queue to look at other possible executions.

If you like thinking in terms of invariants, the model checker's invariant is that all reachable states of the program are either
  • in the visited set -OR-
  • queued to be visited later -OR-
  • reachable from some queued state.
Therefore, when the queue is empty, all reachable states have been visited. :)

*** I still think the model checker missed something, because my program has billions of possible behaviors but TLC reported only "278 states generated, 178 distinct states found".

It's possible that there is a bug in your program, and that some states you thought were reachable... aren't. If you can describe an interesting state, you can check an invariant like "~ Interesting" to see if the model checker can find such a state. A violation of "~ Interesting" corresponds to a trace where an interesting state was reached.

For what it's worth: I think 178 is a reasonable number of distinct states for your program, even if there are indeed billions of ways to reach those states. Think deeply about the difference between the number of reachable states and the number of paths through the state graph (i.e. behaviors).

More concretely, it can be hard to predict the effect of "diamonds" in the state graph. For example, the programs [put(1, 100); put(1, 200);] and [put(1, 101); put(1, 200);] arrive at the same final state:

        (empty)
         /   \
        /     \
  {1:100}     {1:101}
        \     /
         \   /
        {1:200}


As a result, with one single-threaded client I expect the number of states in your spec would be roughly (# states of the key-value store) * (# of labels in the client program). With 2 keys and 2 values, I think (# states of the key-value store)=9, so 178 states is pretty reasonable.

> you write: "TLC will explore all possible sequences of operation calls...": Are we really sure? Really? If yes, how ... because it's impressive if true.

Stephan simplified reality slightly, I think. TLC will explore all possible states, and it constructs a state graph that captures all possible sequences of operation calls. It does this without actually "exploring" all possible sequences of operation calls (most of the time).

Fortunately, that state graph is all it needs to check correctness properties.

You should actually be more impressed now! TLC verifies your program over all possible sequences of operation calls without actually "exploring" all possible sequences of operation calls. :)

> I'm finding I need a TLA explain so I can assuredly resolve my anxiety that no errors is really no errors.

Section 14.5.3 of Specifying Systems agrees with you: "It’s very easy to satisfy a safety property by simply doing nothing." That section has a few tips about how to gain confidence that "no errors is really no errors".

I'm also really interested in what you would want from an "explain" mechanism. If you can mock up what you want, there's a chance someone can build it or point you at an existing one. The dot-visualization of the state graph is supposed to be an "explain" mechanism, but it is quite unwieldy for large specifications.


On Wed, Nov 25, 2020 at 12:17 PM recepient recepient <nets...@xxxxxxxxx> wrote:
From an earlier email: TLC will explore all possible sequences of operation calls, for all possible combinations of arguments. 
This, as I understand it, is overly broad and not in fact true. This isn't meant to shoot down the whole approach --- clearly not ---
but it emphasize that without greater specificity that the following fragment is impenetrable to me at least. Look, the one obvious
question is how TLA ever escapes the while loop to explore more states or terminate:

-------------
while (TRUE) {
  with k \in HashKey, v \in HashValue, op \in ClientOps {
    call PerformOp(k,v,op)
  }
}
-----------

Here is where where I argue line programmers get fed up with model checking black magic. At least in C/C++/GO I know what CPU
will do. I can always ask the DB to EXPLAIN itself. From the previous email:

---------
TLC explores all successors of every state it analyzes. The algorithm for checking invariants can be given as follows:
queue = [s : s is an initial state of the spec]  \* queue of states to explore
visited = []   \* set of states already visited
while queue != [] :
 . . .
-------

And thus the counter example I posed in previous email is demonstrated. Once TLA leaves step 0 and adds it to its visited queue,
it will explore no more paths from state 0 meaning a whole bunch of what I thought TLA would check or wanted it to check ... it
doesn't. I will simulate your TLA state-exploration code in python to double check my facts.

On Wednesday, November 25, 2020 at 1:23:41 PM UTC-5 Stephan Merz wrote:
TLC explores all successors of every state it analyzes. The algorithm for checking invariants can be given as follows:

queue = [s : s is an initial state of the spec]  \* queue of states to explore
visited = []   \* set of states already visited
while queue != [] :
   s = head(queue)
   queue = tail(queue)
   if ~(s in visited) :
      visited += [s]
      if ~(s satisfies invariant) :
        return false   \* compute and display counter-example
      succs = [t : t is a successor of s w.r.t. the next-state relation]
      queue += succs
return true

You can visualize the state graph as a .dot file, this works for small state spaces (it might still be OK for your example). In any case, it is a good idea to check properties that you expect to be false and inspect the counter-example to check that it corresponds to the computation that you expect. For example, you could check properties such as

hash[10] ~= 101   (invariant stating that one can not assign value 101 to key 10)

[][hash[10] = 101 => hash'[10] = 101]_vars   (temporal property asserting that once key 10 has that value, it will never be assigned a different value)

etc. I am sure that more could be done for visualizing and exploring state graphs.

Stephan

On 25 Nov 2020, at 18:53, recipient <nets...@xxxxxxxxx> wrote:



Stephen, again thank you for taking time to engage questions. 

Continuing the sql analogy where one describes what is to be done ... the sql developer always has the option of running EXPLAIN throwing open the curtain to reveal how things work.

I'm finding I need a TLA explain so I can assuredly resolve my anxiety that no errors is really no errors. Otherwise it's black magic. Maybe I need to read up on the tla model checking algo. This would help programmers like me who are too versed in imperative orientation rather then specification orientation.

In the running simplified example the initial state is an empty hash map to which the the following client operations are permitted:

Step 0 = initialize has empty
Step 1 = Op get key 10 
Step 2 = Op put key 10 Val 100
Step 3 = Op put key 10 Val 101

So tla starts in the initial state, empty hash. The state paths available to it depending on fairness are:

(0) stop
(0, 1) stop \* 0 \X perms of {1, 2, 3} take 1
(0, 2) stop
(0, 3) stop 
0 \X perms of {1, 2, 3} take 2
+ 0 \X  permutations of {1, 2, 3} take 3

Here's the critical question: if tla starts in step 0 then transitions to step 1 ... Then so far so good. 

It sees step 2 is enabled. 

But if it determines step 0 was already visited and simply transitions into step 2 making path (0, 1, 2) --- ultimately one of the paths we want --- it'll never explore (0, 2) and by extension it'll never try any later path starting with 0 and therefore NOT explore all possible paths. The blanket statement then it tries only previously unvisited but enabled steps is incomplete.

So what does tla do? And what tools are available to have tla explain what it does? I find myself in need of understanding this aspect of tla behavior to assure myself it's doing what I think it ought to do.

Regards





On Wed, Nov 25, 2020, 2:42 AM Stephan Merz <stepha...@xxxxxxxxx> wrote:
Think of it the other way round: the model that you described in your previous message has two keys and two values (plus a null value when a key is not assigned), so your store has 9 possible values. I'm assuming that the operation that was performed last is not stored in the state (so it doesn't contribute to the number of states) but you may have one or two more variables, for exemple for displaying the result of an operation or for storing the point of control, and these contribute more states. Let's assume that the result variable has 4 possible values (the two values in your store, a "success" for a PUT and an "error" for a GET or DELETE on an unassigned key), then we get 9*4 = 36 possible states. Your spec may be different, but a state graph of 178 distinct states does not sound wrong to me.

TLC cannot and does not generate executions one by one: as you say, every execution is of infinite size because of the infinite loop, and there are infinitely many of them because operation calls can be interleaved in arbitrary orders. Instead, it computes the graph of reachable states and explores successors only once per state. As you say there are 12 possible successors for each state, but many of them will loop back to states that TLC has already seen before (e.g., a GET or a DELETE for an unassigned key don't change the state of the store, two PUT operations for two different keys executed in one order or the other lead to the same state etc). So there is no need to actually explore all permutations of all subsets of operations, the only thing that matters is that no reachable state is missed.

Regards,
Stephan


On 25 Nov 2020, at 00:54, recepient recepient <nets...@xxxxxxxxx> wrote:

Stephan, Thank you - know how on display.

How does TLA ever escape from 'while (TRUE)'? It does in fact. But any reasonable interpretation would argue TRUE is a constant true condition.

Second, you write: "TLC will explore all possible sequences of operation calls...": Are we really sure? Really? If yes, how ... because it's impressive if true.

The cartesian product k \in HashKey, v \in HashValue, op \in ClientOps is 12 elements big so 2^12 elements in its power set. And for each element in the power set it would have to try all permutations contributing n! calls per each. Altogether that's over 1 billion calls into PerformOp. On the other hand TLA reports:

>278 states generated, 178 distinct states found, 0 states left on queue.

Adding a print in PerformOp (I know, I know!) only shows ~100 calls. I'm only seeing <200 nodes and <300 edges in the .dot graph albeit there are cycles all over the place. Even then I can only find ~100  unique paths from the initial node to final node without revisiting nodes.

But there's even more. If this is to be done right TLA will have to reset the hash to its initial state at the right to time.

Example, suppose we only cared about exploring all possible sequences of operation calls on just two tuples. I'd expect TLA to do the following. 

Case 1:
altHash = [ altHashKey \in HashKey |-> Nil ],  
call PerformOP( tuple1 );

Case 2:
altHash = [ altHashKey \in HashKey |-> Nil ],  
call PerformOP( tuple2 );

Case 3:
altHash = [ altHashKey \in HashKey |-> Nil ],  
call PerformOP( tuple1 );
call PerformOP( tuple2 );

Case 4:
altHash = [ altHashKey \in HashKey |-> Nil ],  
call PerformOP( tuple2 );
call PerformOP( tuple1 );

Again, if this fragment: 

while (TRUE) {
  with k \in HashKey, v \in HashValue, op \in ClientOps {
    call PerformOp(k,v,op)
  }
}

is really running this kind of state exploration ... mark me down as extremely impressed
On Tuesday, November 24, 2020 at 2:43:10 AM UTC-5 Stephan Merz wrote:
Why don't you just write a loop of the form

while (TRUE) {
  with k \in HashKey, v \in HashValue, op \in ClientOps {
    call PerformOp(k,v,op)
  }
}

TLC will explore all possible sequences of operation calls, for all possible combinations of arguments. Whenever it runs into a state that it has seen before (because the store has the same value for every key), it will stop exploring that branch. The power of model checking is that it lets you explore even infinite executions over finite state spaces. There's no need to materialize the search that TLC will perform in the data structures of your specification.

Regards,
Stephan

On 23 Nov 2020, at 18:31, recepient recepient <nets...@xxxxxxxxx> wrote:

Hi -

I'm working on a distributed KV store in Pluscal. Using TLA+ our desire is to ``SQLtise" our Pluscal code everywhere possible: describe what is to be done & checked without proscribing the steps ala imperative code to TLA+. Granted, we can fly closer to that mentality by avoiding Pluscal altogether ... but that's out of scope for the moment.

I am stuck on one basic aspect of this work which seems to exist right at the boundary between allowing TLA+ to explore reachable states, and me telling it what I think reachable states are... 

Assume:
key \in {0,1}, value \in {100,101},   {OP_GET, OP_PUT, OP_DELETE}  for  {0,1} \X {100,101} \X {OP_GET_PUT,OP_DELETE} or 12 unique possible client requests. (Note: get/delete operations never use the value part).

Goal
:
Have TLA+ check correctness properties (not shown) by exploring all possible ways a client can send KV requests. Imperatively all possible ways are given by:

For each subset s in  {0,1} \X {100,101} \X {OP_GET_PUT,OP_DELETE}:
    For each permutation p in s:
         (re)initialize model
         model check operations in p by running p[i] where i \in 1..Len(p) 

Candidate Process 1 (elided): BAD: Doesn't do all permutations or subsets
    with v \in HashKey, v \in HashValue, op \in ClientOps  
    do
       call PerformOp(k,v.op);
    end with;
Enabling DOT visualization & writing some python to print all paths from initial to final states using the DOT data, we can show that TLA+ tries the first rpc tuple, then tries rpc tuples 1 and 2, then 1 and 2 and 3 ... and on up to 1..12. But that's it!

Candidate Process 2 (elided): BAD: Doesn't do permutations or subsets:
    
\* Alternative formulation of process 1. Choose approach much the same ...
    with i \in HashKey \X HashValue \X ClientOps
    do
       call PerformOp(i[1],i[2],i[3]);
    end with;

At this point TLA+ can't read our minds: it needs to be told to try all permutations of all subsets. But if one writes writes something like the following, JAVA runs out of stack:

Candidate Process 3 (elided): BAD: Won't work; runs out of stack
    variables rpcs=Permutations(SUBSET(HashKey, v \in HashValue, op \in ClientOps))
    with i \in rpcs
    do
       // now try to dig out individual RPCs and run them...
       call PerformOp(....);
    end with;

Moreover, even if this compiled and ran it's not clear if this achieves the goal. Did it check all permutations on subsets of size 5 starting from the initial state or did it run that on top of permutations on subsets size 4?   

What does one do here?


--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/e3a99dad-9df9-48a9-b0ad-c9ba95a25d7dn%40googlegroups.com.


--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/ce825f7b-4614-4031-8395-91765f479fe9n%40googlegroups.com.


--
You received this message because you are subscribed to a topic in the Google Groups "tlaplus" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/tlaplus/XakPxMJczb8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/E92DE534-170B-4D79-B587-6A21FD6D11C5%40gmail.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/f39f6ab7-8ec1-4e81-9fe6-f4465dc17e71n%40googlegroups.com.