# Re: [tlaplus] State-constraints and liveness

Hi Chris,

Stephan seems to be answering your questions about what TLC should do.  However, you should be aware that TLC's liveness checking is known to be buggy.  So far, all the bugs I've seen have been ones in which TLC finds a genuine error in a liveness property, but it reports a bogus error trace.  I've never seen the kind of exception you found, but it might be the same class of bug--except that the trace TLC was trying to report was so bogus that it went off into hyperspace.  Yuan has refused to look at liveness-checking bugs, saying that the implementation is such a can of worms that it needs to be rewritten from scratch.  I haven't found anyone willing to do that.  (It would make a nice intern project, but my schedule makes it impossible for me to supervise a normal summer intern.)  So, I'm afraid we have to do the best we can with the current buggy implementation.

Leslie

On Sun, Oct 21, 2012 at 9:19 PM, Chris Newcombe wrote:
Hi Stephan,

Many thanks for your help -- it is hugely useful.

>>Perhaps a more appropriate specification of the fairness conditions would be something like
>>  WF_AllVars(Constraints /\ InternalActions)

I intend to try this, hopefully later this week.   I did manage to find a way forward for my current spec, but it's not general, and I'd like a general method.
My way forward was to stick with the scheme I mentioned in my last post,  i.e.
- no state constraint
- finite sets of "identifiers" for operations/resources, to limit the size of the state/behavior space.  These are defined as CONSTANTs, i.e. no definition overrides for these.
- assumption of weak fairness on internal actions that drive in-progress operations forward as much as possible
- no fairness assumption on 'voluntary actions by the application' that would start new operations, or on 'environment' actions such as failure/recovery
- liveness properties in the form
Liveness == \A id \in Ids : MyProperty(id) ~> (\/ SomeOtherProperty(id)
\/ LegitimateTermination)

- (new) weakening my LegitimateTermination condition to precisely take into account the liveness assumption.
See more detailed remarks on this at end of this post.

This seems to work (as far as I can tell from my position of being very new to writing and checking liveness properties).
- Model-checking some known-to-be-invalid liveness properties does now find the expected counter-examples.
- Model-checking of properties that I'm pretty sure are valid does eventually complete without finding any violations.
- Injecting "liveness bugs" into the spec does cause TLC reporting either expected deadlock or an expected violation of my formerly valid liveness properties.

The only downside of this approach is that because model-checking liveness is expensive and (I believe) single-threaded, and also limited by size of RAM, it is likely to be critical to constrain the size of the state-space.  For some specs, it might not be sufficient to limit only the 'number of new operations' that may be started -- they may need a non-trivial state constraint.  So I really do want to find (and understand) a general method of checking liveness with state-constraints.

I'll post to this thread when I've tried the method you suggest.

thanks again,

Chris

I can't share my spec, but I here are some comments I wrote that helped me analyze the ways in which I needed to weaken my LegitimateTermination condition to account for my liveness assumption.  For my spec, the legitimate termination condition required to address these principles turned out to be quite complicated.

This probably reveals how fuzzy my thinking still is about liveness, but so far I haven't found anything wrong with the approach.

(* Our 'legitimate termination' condition. *)

\* Legitimate termination is subtle because it is strongly related to the
\* particular LivenessAssumption that we have chosen for the specification,
\*
\* We call a state a 'termination state' of some behavior if it is
\* the 'last' state of that behavior, at which 'infinite stuttering' occurs,
\* More precisely, there is some natural number n such that for all m >= n,
\* the m'th suffix of the behavior begins with the 'termination state'.
\*
\* N.B.: in the most general sense (i.e. considering _any_ possible system),
\* 'termination state' is only defined for a subset of possible behaviors.
\* Some behaviors might end in repeating cycles, and other behaviors might
\* have an infinite state space and behaviors that never end in infinite stuttering
\* e.g. the spec that just increments a natural number without bound:
\*     x = 0  /\  [][x' = x + 1]_<<x>>  /\  WF_<<x>>(x' = x + 1)
\*
\* There are several interesting kinds of 'termination state':
\*
\*      The Next-state action is disabled in circumstances in which we would want it to be enabled,
\*      We choose define our specs so that if Next is disabled, that means deadlock --
\*      -- i.e. a bug in the TLA+ code or the system design.  TLC checks for deadlock by default.
\*
\*    - 'lifetime termination' due to reaching the legtimate finite bounds of the model.
\*      e.g. In this spec, we exhaust the space of identifiers, and all possible internal
\*      actions have run to completion.  Such a state represents the assumption
\*      that the system cannot be used (or even perhaps experience any more failures)
\*      after its natural lifetime is exhausted.
\*      Next is disabled, but in circumstances that we accept as legal.
\*      We want to allow behaviors to end in such a state, because we don't
\*      want false-positives.
\*
\*    - 'constrained termination' due to a TLC state constraint or action constraint.
\*      We clearly don't want to forbid behaviors just because they violate such a constraint
\*      (or there would be no point in ever using the constraints).
\*      However, Next is still enabled in such a state.
\*      The constraint simply stops TLC examining future states (although TLC does
\*      check invariants for the 'first' such state it finds in each behavior).
\*      This is the most subtle case of all because the constraint can be entirely
\*      arbitrary, and it interacts with the liveness assumption in complex ways.
\*      See section 14.3.4 of Specifying Systems and this discussion:
\*
\*    - 'optional termination due to waiting for voluntary stimulus' in which any
\*      future progress depends on actions that do not have liveness assumptions.
\*      Examples of such actions are voluntary actions by an application using
\*      the system (e.g. to start new operations), and actions by the 'environment'
\*      such as partial failure.  We don't have liveness assumptions on those actions because
\*      we don't want to _require_ them to happen in every behavior.
\*      In such a state, Next is *enabled*, but the only sub-actions that are enabled
\*      are those that don't have liveness assumptions.
\*      Important: the behavior _may_ continue past such a state (if Next happens to be enabled
\*      in the state).  Indeed, later states may or may not be legitimate termination states.
\*      For this reason, many system states (perhaps most) may be legitimate termination states.
\*      e.g. For the current spec, even the initial state is a legitimate termination state
\*      because any progress requires the application to perform a voluntary action
\*      which does not have a liveness assumption.
\*
\* We currently choose to not use state constraints or action constraints with this specification
\* because of the subtle interaction with liveness assumptions.
\* So our definition of a legitimate termination state is:
\*
\*    or optional termination due to waiting for stimulus
\*
\* We only assume fairness of 'internal' actions that drive existing in-progress operations forward.
\* We don't require that the application ever do anything. Similarly we don't require failure to happen.
\*
\* In summary, we consider a state to be a legitimate terminating state iff
\*
\*  - internal operations are all in their 'terminal' states, and the only enabled actions
\*    are voluntary actions by the application that we don't want to 'require'
\*    (i.e. to start new operations), or are actions by the environment (failure)
\*    that we also don't want to require.
\* or
\*  - any internal operations that are not yet in one of their 'terminal' states
\*    are currently blocked, because they (the in-progress internal actions)
\*    currently depend on the application performing some voluntary action
\*    which is currently disabled, e.g. due to exhaustion of identifiers.
\*    i.e. The design of the system assumes certain finite bounds on usage and
\*    failure-rate and those bounds have now been reached.
\* or
\*  - any internal operations that are not yet in one of their 'terminal' states
\*    but are currently blocked, because they (the in-progress internal actions)
\*    currently depend on the application performing some voluntary action for which
\*    we don't have a fairness assumption (as we don't want to require the application to
\*    keep using the system).  i.e. It's not the system's fault that the system may be stuck
\*    in the current state indefinitely.

On Monday, October 15, 2012 5:47:54 AM UTC-7, Stephan Merz wrote:
Hi Chris,

>> checking liveness properties in the presence of state constraints can be really tricky because the specification may not assert what you think it does. In particular, see section 14.3.5 of the TLA+ book.

Thanks for pointing this out.  I believe that I've dodged the issue described in that section of the book, by ensuring that my fairness assumption cannot not be made false by the state-constraint.  My fairness assumption is weak-fairness on just the 'internal' actions of the system -- i.e. the actions that drive any 'currently in-progress' operation forward to completion.  But my state-constraint only constrains the number of 'calls' by a user of the system that may start new operations. The state-constraint does also constrain the number of actions by the 'physical environment', such as process failure.

But as this is a subtle issue, I could easily have made an error.  I plan to test liveness checking as Leslie suggests, by ensuring that it can find violations of intentionally incorrect liveness properties.  (See below for how doing so found an apparent problem).

>>I'd suggest that you rewrite your property so that TLC will accept it. For example, try
Spec /\ []<><TRUE>_vars => Liveness
[…]
I tried entering just the antecedent
Spec /\ []<><<TRUE>>_AllVars
... as the "Temporal formula" in the "What is the behavior spec?" section.   i.e. Instead of just Spec.
And I entered the consequent
Liveness
... as the property to check.

Yes, that's exactly what I meant.

TLC accepts this.  However, I first tested this configuration by checking several leads-to properties that I know to be invalid, and it failed to find any of them.

So either I'm using your suggestion incorrectly, or I'm still running into the kind of subtle interaction between liveness and state-constraints that you pointed out.

I'm probably asking the obvious, but does your constrained specification allow for any behaviors where variables keep changing eventually? E.g., if you have some counter that can only increase but for which the constraints impose an upper bound then at some point the model can no longer continue and has to stutter.

In such a case, augmenting the specification by

[]<><<TRUE>>_AllVars

effectively rules out all behaviors, so any property will hold vacuously. (TLC will still report failures of invariants since these are checked as it generates states.)

Perhaps a more appropriate specification of the fairness conditions would be something like

WF_AllVars(Constraints /\ InternalActions)

where InternalActions is the disjunction of all internal actions of your system and Constraints describes the state constraints. This condition requires that internal actions should eventually occur from states that satisfy your state constraints but requires nothing of states that do not satisfy the state constraints. Thus, the formula is not vacuously satisfied over the constrained runs, unlike

WF_AllVars(InternalActions)

which is false of the constrained runs in case the internal actions remain enabled in the states that do not satisfy the constraints.

(Of course, depending on your spec you may want to impose several constraints of the form WF_AllVars(Constraints /\ A) instead of a single constraint for the disjunction of the internal actions.)

Now, you may see counter-examples to your liveness property that are due to some operation that has been started but couldn't finish within the bounds of the constraints. In order to rule out these, you may have to weaken your liveness property to something like

Liveness == \A id \in Ids : MyProperty(id) ~> (\/ SomeOtherProperty(id)
\/ LegitimateTermination)

where LegitimateTermination could be just "~Constraints" or a suitable refinement that expresses that the handling of request "id" couldn't finish within the bounds. (Otherwise, if all your runs end up violating the constraints eventually, you haven't really verified anything.)

Again, understanding what is actually verified by the model checker in such situations can be really subtle.

At this point I decide to change my specification to remove the need for a state-constraint.  I changed the definition of several sets from Nat to become CONSTANTS that are (small) finite sets, and added enabling conditions to the 'public API' and 'environment' actions to only enable them if they stay in these limits.  This change isn't really polluting the conceptual purity of the specification, as in the real system these sets would be finite -- just very, very large.  So this change can be interpreted as an assumption that the system will no longer be used if the sets are exhausted.

Alternatively, you could override Nat in the model that you pass to TLC by specifying something like

Nat <- 0 .. 9

in the "Definition Override" tab in the advanced options. (If you use any operations on naturals, you'll probably have to override these as well.) I agree, though, that the change that you made is nicer.

And I went back to using "Spec" as the 'Temporal formula' in the 'What is the behavior spec?' section.

Again I checked the several known-to-be-incorrect liveness properties.
This time I ran into a new TLC bug, which I think is real -- i.e. not just a user-error entering unacceptable temporal formulas into the wrong part of the toolbox.

The error occurred when TLC was checking liveness.
The exception was a java.io.IOException: Negative seek
offsetjava.io.IOException: Negative seek offset
    at java.io.RandomAccessFile.seek(Native Method)
    at tlc2.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:212)
    at tlc2.tool.liveness.DiskGraph.getNode(DiskGraph.java:145)
    at tlc2.tool.liveness.DiskGraph.getPath(DiskGraph.java:309)
    at tlc2.tool.liveness.LiveWorker.printTrace(LiveWorker.java:521)
    at tlc2.tool.liveness.LiveWorker.checkComponent(LiveWorker.java:290)
    at tlc2.tool.liveness.LiveWorker.checkSccs(LiveWorker.java:119)
    at tlc2.tool.liveness.LiveWorker.run(LiveWorker.java:613)

Ouch. Looks like TLC can't find its states anymore on the disk. I'm afraid I can't help you here.

Stephan

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To post to this group, send email to tla...@xxxxxxxxxxxxxxxx.
To unsubscribe from this group, send email to tlaplus+u...@xxxxxxxxxxxxxxxx.