[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Some distributed TLC errors not displayed in TLA+ Toolbox


On 01.10.2015 11:34, Markus Alexander Kuppe wrote:
> On 01.10.2015 09:39, Jaak Ristioja wrote:
>> I'm running TLC (2.07) in distributed ad hoc mode from the TLA+ Toolbox
>> (1.5.1) and I encountered the following issue. In some cases the TLC
>> master process run from the TLA+ Toolbox allocates a lot of memory and
>> starts swapping. I run the FPSets on slave nodes (using
>> TLCWorkerAndFPSet) so I have no idea why the master sometimes uses more
>> than 8G of memory. Creating and model-checking a new identical model
>> might be a workaround (tried once).
>> Anyway when the master starts swapping then some slave threads time out
>> during reads in "JRMP connection establishment" with
>> java.net.SocketTimeoutException. However, the TLA+ Toolbox eventually
>> displays "Errors detected: No errors" even though model checking has not
>> successfully been completed. Sometimes one can see TLC printouts on the
>> slave terminals AFTER the TLA+ Toolbox has displayed a finished/success
>> status to the user.
> Even with distributed fingerprint set, the master still maintains the
> queue of unexplored/unseen states ("StateQueue") and a trace. The
> StateQueue  - depending on your spec/model - can be quite huge. That is
> why only portions of it are kept in memory while the rest is swapped to
> disk. How many workers (slaves) do you use though? I can see that a
> large amount of workers can exhaust 8gb of master memory if they explore
> states faster than the master swaps to disk.

The master seems to be the bottleneck. My 3 servers running the slaves
are almost idle, using less than 15% of their allocated CPU capacity and
only about 30% of memory reserved for them (32G out of 106G). The master
uses up all of its 6 CPUs and about 60G of RAM.

> Btw. you can always restart a master from a checkpoint. However, for
> distributed fingerprint sets, the logic to recover a checkpoint has not
> been implemented. This is why TLC would also re-explore all (previously
> seen) states again that are reachable from the current StateQueue.

Was this already implemented in git? Using a toolbox nightly from last
week, it seems that one can restart from a checkpoint in ad-hoc mode. At
least it seems to be model checking again and showing a diameter greater
than where it previously crashed due to running out of memory.

I didn't add any memory to the master, but when I restarted it from the
checkpoint, it didn't immediately use up all memory. Perhaps there is a
memory leak or caching which could be improved?

Best regards,