The use for which TLA+ was primarily designed is to avoid basic
concurrency errors, not to catch coding errors. Ideally, you would
write a spec in which accessing two pieces of data is an atomic action
in the spec if and only if the accesses need to be performed
atomically. For example, this code
(1)  w : while (n < 27)
       { x := x+1 ;
         n := n+1 }
repeatedly performs the atomic action w that accesses both n and x.
To indicate that those accesses don't need to be performed in the
same action, we can write
(2)  t : while (n < 27)
       { u : x := x+1 ;
         v : n := n+1 }
to indicate that each iteration of the loop body comprises
execution of the three separate actions t, u, and v. However,
doing this complicates the spec, producing more reachable
states and making it harder to model check. It's most likely
that the variable n is local to the process and is not accessed by
any other process. In that case, we can write (1) in the spec,
perhaps adding a comment that only the update of x needs to
be done atomically.
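In code, that advice corresponds to protecting only the shared update with a lock. Here is a rough sketch in Python (my own illustration, not anything from the spec; the names worker and x_lock are hypothetical), assuming x is shared between processes while n is process-local:

```python
import threading

x = 0                      # shared, like x in (1)
x_lock = threading.Lock()  # protects the one access that must appear atomic

def worker():
    global x
    n = 0                  # local to this thread, like n when no other process reads it
    while n < 27:
        with x_lock:       # only the update of x is performed atomically
            x += 1
        n += 1             # local update; no synchronization needed

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)  # 2 threads * 27 increments = 54
```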
The basic argument that (1) and (2) are equivalent goes as follows.
Consider any execution of the system with (2). An execution of one
iteration of the loop in (2) will appear in a behavior of the
spec as
(3) ... -> t -> A -> u -> B -> v -> ...
where A and B are sequences of actions of other processes. Since A
and B can't access n, (2) will also allow this behavior
(4) ... -> A -> t -> u -> v -> B -> ...
which is equivalent to the behavior
(5) ... -> A -> w -> B -> ...
of the spec containing (1). Thus, if (5) satisfies the desired correctness
condition of the spec, then so will (3). Hence, it suffices to show
correctness of the spec containing (1) to deduce correctness of the
more accurate spec containing (2).
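The commuting at the heart of this argument can be checked concretely on a toy model. In the Python sketch below (my own modeling, only an analogy to the actual TLA+ semantics), t, u, and v are the three actions of (2), and a stands for an action of another process that touches neither n nor the control state pc (here it just increments its own variable y):

```python
# Each action is a function from state to state; states are dicts.
def t(s): return {**s, "pc": "u"}                # guard n < 27, assumed true here
def u(s): return {**s, "x": s["x"] + 1, "pc": "v"}
def v(s): return {**s, "n": s["n"] + 1, "pc": "t"}
def a(s): return {**s, "y": s["y"] + 1}          # other process: no access to n or pc

def run(s, actions):
    for f in actions:
        s = f(s)
    return s

s0 = {"x": 0, "n": 0, "y": 0, "pc": "t"}

# Fine-grained interleaving, as in (3): ... t -> A -> u -> B -> v ...
fine = run(s0, [t, a, u, a, v])
# Commuted form, as in (4)/(5): ... A -> t -> u -> v -> B ...
coarse = run(s0, [a, t, u, v, a])
print(fine == coarse)  # True: a commutes with t and v, so both orders agree
```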
There are a lot of assumptions hidden in this argument. For example,
all bets are off if A and B could access the control state of the
process p executing the while loop (represented by the PlusCal
translation as pc[p]). Also, the argument depends on the kind of
correctness property we expect the spec to satisfy. For example, it
doesn't work if the property depends on the number of (non-stuttering)
steps performed.
The idea of deducing properties of a spec by reasoning about a
coarser-grained spec (one with fewer atomic actions) is called
reduction. The most general rigorous statement of reduction
that I know of is in this paper:
http://lamport.azurewebsites.net/pubs/cohen-tlareduction.pdf
However, that paper is pretty heavy reading. My advice is to write
your spec so that each atomic action performs operations that must be
implemented to appear atomic (e.g., by using locking) plus any
operations that are completely local to the process. You should study
the argument in the example I gave above to understand why that works,
so you'll recognize any unusual cases in which it doesn't work.
There's an additional problem that can arise in going from the spec to
the code: Even though operations on two pieces of data--for example
the values A[1] and A[27] of an array A--may be logically independent
of one another, they might be implemented in such a way that they
interfere with one another. For example, if space is allocated as
needed, the operation of assigning a value to A[27] might require
increasing the size of the array, which could screw up a concurrent
operation on A[1]. This is a coding problem, not a problem that you
should expect to be solved by writing a TLA+ spec. There are coding
tools to find this kind of error--in particular, tools that warn about
concurrent accesses to shared data not protected by locks.
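The array example can be sketched as follows. This is a hypothetical growable array in Python (the class and its single lock are my own illustration, not from any real library); it shows why a per-element locking scheme would be unsafe when an assignment can reallocate the backing store:

```python
import threading

class GrowableArray:
    """Array whose backing store is reallocated when an index exceeds capacity."""
    def __init__(self):
        self._store = [None] * 4
        self._lock = threading.Lock()  # one lock for the whole structure

    def set(self, i, value):
        # Assigning to A[27] may grow the array.  Locking only element 27
        # would not stop the reallocation from racing with a concurrent
        # operation on A[1], so the whole store is locked instead.
        with self._lock:
            if i >= len(self._store):
                new_store = [None] * (2 * (i + 1))
                new_store[: len(self._store)] = self._store
                self._store = new_store
            self._store[i] = value

    def get(self, i):
        with self._lock:
            return self._store[i] if i < len(self._store) else None

A = GrowableArray()
A.set(1, "a")
A.set(27, "b")               # triggers a reallocation; A[1] must survive it
print(A.get(1), A.get(27))   # a b
```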
Leslie
On Tuesday, February 12, 2019 at 4:36:16 PM UTC-8, Oliver Yang wrote:
Hi All,
Actions of a TLA+ spec are executed atomically: a legal behavior specified by a TLA+ spec consists of a sequence of atomic actions. When translating a TLA+ spec to real code, we still need to figure out how to provide that atomicity. Let's take the Write-Through Cache in Chapter 5 of the book Specifying Systems as an example.
Actions DoRd(p1) and DoWr(p2) are concurrent operations. They may touch the same cache line of processor p2 if p1 is writing the same cached address that p2 is reading from. This means we may need a mutex for each cache line.
In addition, the DoWr(p) action requires updating the cache and updating memQ to be a single atomic step, but it doesn't necessarily require updating the cache and updating buf/ctrl to be an atomic step.
Figuring out an optimal locking/mutual-exclusion strategy when converting a spec to code can still be challenging and error-prone when implementing a large concurrent system. It's very easy for engineers to let the implementation diverge from the specification in the process of designing the locking strategy.
Here are my questions:
(1) IMO, the locking strategy is one important aspect of designing concurrent programs that TLA+ seems not to cover. Is this a correct understanding of TLA+ specs?
(2) If it indeed requires figuring out an optimal locking strategy (which TLA+ doesn't capture), has anyone figured out a set of good principles/disciplines for doing that? Or do we have to develop an ad-hoc locking strategy every time we convert a TLA+ spec to an implementation?
Thanks,
Oliver