[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Some distributed TLC errors not displayed in TLA+ Toolbox

On 01.10.2015 09:39, Jaak Ristioja wrote:
> I'm running TLC (2.07) in distributed ad hoc mode from the TLA+ Toolbox
> (1.5.1) and I encountered the following issue. In some cases the TLC
> master process run from the TLA+ Toolbox allocates a lot of memory and
> starts swapping. I run the FPSets on slave nodes (using
> TLCWorkerAndFPSet) so I have no idea why the master sometimes uses more
> than 8G of memory. Creating and model-checking a new identical model
> might be a workaround (tried once).
> Anyway when the master starts swapping then some slave threads time out
> during reads in "JRMP connection establishment" with
> java.net.SocketTimeoutException. However, the TLA+ Toolbox eventually
> displays "Errors detected: No errors" even though model checking has not
> successfully been completed. Sometimes one can see TLC printouts on the
> slave terminals AFTER the TLA+ Toolbox has displayed a finished/success
> status to the user.

Hi Jaak,

thanks for reporting the problem.

Even with distributed fingerprint set, the master still maintains the
queue of unexplored/unseen states ("StateQueue") and a trace. The
StateQueue  - depending on your spec/model - can be quite huge. That is
why only portions of it are kept in memory while the rest is swapped to
disk. How many workers (slaves) do you use though? I can see that a
large amount of workers can exhaust 8gb of master memory if they explore
states faster than the master swaps to disk.

Generally, how large is your spec/model exactly? Can you share it with
me (privately)?

Btw. you can always restart a master from a checkpoint. However, for
distributed fingerprint sets, the logic to recover a checkpoint has not
been implemented. This is why TLC would also re-explore all (previously
seen) states again that are reachable from the current StateQueue.

"Errors detected: No errors" sounds as if the Toolbox parser/error
detection is broken. What number does it show for unexplored states
("Queue size") then?