I am writing a specification for an algorithm which runs on individual nodes of a cluster and downs/removes itself of other nodes if detects a network partition, thus recovers from a Split Brain situation.
To be able to prove this, I defined some variables and processes (representing each node) which are able to receive one of the following messages about another node: "unreachable", "reachable again", "left cluster", "joined cluster", "became the leader", etc... Also, each node, at any time can take a decision to either "do nothing" or "take an action" like removing some nodes from its membership set.
In the end, have invariants called "NoSplitBrain" and "ConsistentLeader", which indicate the cluster is split into two or maybe more groups.
If I add the "NoSplitBrain" and "ConsistentLeader" invariants to the TLA toolbox invariants section, TLA perfectly reaches to a state which these invariants can be falsified. Which is normal, because this split will happen, but after taking a decision it should be resolved by the algorithm. So, I have written a temporal property like the following:
(* There is only one leader *)
ConsistentLeader == ...
(* At any moment this split brain state may happen which should be resolved at later steps. see SplitBrainRecovery temporal formula *)
SplitBrain == ...
(* No splits detected in the cluster *)
NoSplitBrain == ~SplitBrain
(* Temporal formula which gives the guarantee of the eventual resolution of the SplitBrain situation *)
SplitBrainRecovery == SplitBrain ~> (NoSplitBrain /\ ConsistentLeader)
When I run model checking with the NoSplitBrain invariant, the toolbox finds a state in which this a split happens in 17.702 states.
When I run model checking with the temporal property "SplitBrainRecovery", the toolbox finds a state in which the liveness check fails after reaching 979.494 distinct states.
My confusion and also question is: Can I rely on this temporal property? May I assume that the toolbox finds many split brain situations which the algorithm recovers from, and then fails at a very rare state which we cannot recover? Or maybe it fails on the first possible scenario and the recovery doesn't work at all?
I mean, with the NoSplitBrain invariant I can prove that this happens in 17.702 states, but with the "SplitBrainRecovery" temporal property, can I prove that the split happens and eventually we recover from that?
Thanks in advance,