[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] understanding temporal properties



Hello,

for invariants, TLC (essentially) guarantees shortest counter-examples [1], but this is not true for liveness properties. Moreover, liveness checking requires exploring loops in the state space, and finding them requires generating many more states than just reachability of some state violating an invariant. You should not draw conclusions from the numbers of states that TLC explored about the probabilities of errors occurring during system executions. TLA+ is meant to verify if a property is true of all executions and does not attempt to provide quantitative information. If you choose to accept that some problematic executions exist and want to find out how frequent they are, you may want to read about statistical model checking, which relies on Monte Carlo simulations. For this to be meaningful, you should provide probabilities for transitions, message delays etc.

Or maybe it fails on the first possible scenario and the recovery doesn't work at all?

As pointed out by Ron, this is a question that TLC can answer: check for the property

[](SplitBrain => []SplitBrain)

A counter-example to this property is an execution where recovery works.

Is there a way to ask the model checker to give all paths in which the invariable is falsifiable instead of stopping at the first failure? 

Markus already indicated the "-continue" command line option, but let me point out that even a finite-state instance usually generates an infinite number of executions (e.g., by going k times through some loop, for any number k, before branching to a different part of the state space), so it is futile to try and generate all violating paths.

Regards,
Stephan

[1] TLC generates states using breadth-first search, and will therefore signal an invariant violation for some state at minimal depth in the state space. Technically, this is not entirely true because TLC runs multiple threads, so some thread might be exploring states deeper down in the state space and find a violation, while some other thread still works on "shallower"states. But it is a good approximation.

On 5 Oct 2018, at 16:04, Bekir Oguz <beki...@xxxxxxxxx> wrote:

Dear all,
I am writing a specification for an algorithm which runs on individuals nodes of a cluster and downs/removes itself of other nodes if detects a network partition, thus recovers from a Split Brain situation.

To be able to prove this, I defined some variables and processes (representing each node) which are able to receive one of the following messages about another node: "unreachable", "reachable again", "left cluster", "joined cluster", "became the leader", etc... Also, each node, at any time can take a decision to either "do nothing" or "take an action" like removing some nodes from its membership set.

In the end, have invariants called "NoSplitBrain" and "ConsistentLeader", which indicate the cluster is split into two or maybe more groups. 

If I add the "NoSplitBrain" and "ConsistentLeader" invariants to the TLA toolbox invariants section, TLA perfectly reaches to a state which these invariants can be falsified. Which is normal, because this split will happen, but after taking a decision it should be resolved by the algorithm. So, I have written a temporal property like the following:


(* There is only one leader *)
ConsistentLeader == ...


(* At any moment this split brain state may happen which should be resolved at later steps. see SplitBrainRecovery temporal formula *)
SplitBrain == ...


(* No splits detected in the cluster *)                      
NoSplitBrain == ~SplitBrain

                      


(* Temporal formula which gives the guarantee of the eventual resolution of the SplitBrain situation *)
SplitBrainRecovery == SplitBrain ~> (NoSplitBrain /\ ConsistentLeader)


When I run model checking with the NoSplitBrain invariant, the toolbox finds a state in which this a split happens in 17.702 states.
When I run model checking with the temporal property "SplitBrainRecovery", the toolbox finds a state in which the liveness check fails after reaching 979.494 distinct states. 

My confusion and also question is: Can I rely on this temporal property? May I assume that the toolbox finds many split brain situations which the algorithm recovers from, and then fails at a very rare state which we cannot recover? Or maybe it fails on the first possible scenario and the recovery doesn't work at all?

I mean, with the NoSplitBrain invariant I can prove that this happens in 17.702 states, but with the "SplitBrainRecovery" temporal property, can I prove that the split happens and eventually we recover from that?

If you would like to look at the spec, it is here: https://github.com/ing-bank/baker/blob/split-brain-resolver/tla/ClusterV6.tla

Thanks in advance,
Bekir Oguz

-- 
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To post to this group, send email to tla...@xxxxxxxxxxxxxxxx.
Visit this group at https://groups.google.com/group/tlaplus.
For more options, visit https://groups.google.com/d/optout.