[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Some distributed TLC errors not displayed in TLA+ Toolbox



Hi Markus,

On 01.10.2015 11:34, Markus Alexander Kuppe wrote:
> Even with distributed fingerprint set, the master still maintains the
> queue of unexplored/unseen states ("StateQueue") and a trace. The
> StateQueue  - depending on your spec/model - can be quite huge. That is
> why only portions of it are kept in memory while the rest is swapped to
> disk. How many workers (slaves) do you use though? I can see that a
> large amount of workers can exhaust 8gb of master memory if they explore
> states faster than the master swaps to disk.

Thank you very much for this insight! I'm using a local cluster where
I'm currently running about 120 worker threads on several servers with
enough memory. The VM running the TLA+ Toolkit only has 8GB, and such
memory usage of the TLC master process suprised me. The "Running TLC in
Distributed Mode" web page seems to have no information about this.

> Generally, how large is your spec/model exactly? Can you share it with
> me (privately)?

Unfortunately I can't share the details of the spec/model, a work in
progress. After some optimizations to the specification consists of
roughly 500-600 lines of TLA+ code, and less than 400000 distinct
states. At this point the queue size stays below 32000 according to the
statistics and I'm able to complete the model check successfully as far
as I can tell (from the workers exiting without errors, the GUI showing
"No errors", the TLC master process exiting etc). If I remember
correctly, before the spec optimizations the distinct state set exploded
over 21M and the queue size was above 600000 before the TLC master
process ran out of memory. I've used -Xmx options to make JVM use all
available memory (and swap).

> "Errors detected: No errors" sounds as if the Toolbox parser/error
> detection is broken. What number does it show for unexplored states
> ("Queue size") then?

It was something above 600000. I guess next time I should check that its
zero. The problem is that from the perspective of a TLA+ Toolbox user
the "No errors" message makes it is too easy to incorrectly conclude
that the model-checking completed successfully when in fact it was
incomplete. I feel this warrants a better notification than a non-zero
value in a cell in the statistics table.


Best regards,
Jaak