[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] TLC: NoSuchElementException on checkpoint restore



On 09.12.2015 14:35, Markus Alexander Kuppe wrote:
> On 09.12.2015 13:13, Jaak Ristioja wrote:
>> After switching from distributed ad-hoc mode to non-distributed mode and
>> recovering from checkpoint, I get the following exception:
>>
>> @!@!@STARTMSG 2197:0 @!@!@
>> Starting recovery from checkpoint 15-12-02-13-53-09/
>> @!@!@ENDMSG 2197 @!@!@
>> @!@!@STARTMSG 1000:1 @!@!@
>> TLC threw an unexpected exception.
>> This was probably caused by an error in the spec or model.
>> See the User Output or TLC Console for clues to what happened.
>> The exception was a java.util.NoSuchElementException
>> java.util.NoSuchElementException
>>         at
>> tlc2.tool.fp.MSBDiskFPSet$TLCIterator.getLast(MSBDiskFPSet.java:321)
>>         at
>> tlc2.tool.fp.MSBDiskFPSet$MSBFlusher.mergeNewEntries(MSBDiskFPSet.java:108)
>>         at
>> tlc2.tool.fp.DiskFPSet$Flusher.mergeNewEntries(DiskFPSet.java:1298)
>>         at tlc2.tool.fp.DiskFPSet.recoverFP(DiskFPSet.java:896)
>>         at tlc2.tool.fp.DiskFPSet.recover(DiskFPSet.java:939)
>>         at tlc2.tool.ModelChecker.recover(ModelChecker.java:778)
>>         at tlc2.tool.ModelChecker.modelCheck(ModelChecker.java:114)
>>         at tlc2.TLC.process(TLC.java:773)
>>         at tlc2.TLC.main(TLC.java:195)
>>
>> @!@!@ENDMSG 1000 @!@!@
>> @!@!@STARTMSG 2186:0 @!@!@
>> Finished in 04s at (2015-12-09 14:06:40)
>> @!@!@ENDMSG 2186 @!@!@
>>
>> Recovering from the same checkpoint when switching back to distributed
>> ad-hoc mode works. Looks like this might be caused by the FPSet not
>> being present? Is there a workaround or an easy fix for this?
> 
> Hi Jaak,
> 
> what number of distributed fingerprint sets have you been using? Zero
> for the built-in set or >0 distributed sets.

3 sets, one for each slave.

> You are leaving the beaten path when you try to recovery from
> checkpoints created by a different TLC mode. Why do want to do this anyway?

In my distributed setup, the slaves seems to be rather idle, If I get
access to a more powerful machine I want to try whether doing the
computation locally on that machine would be any faster. I suspect that
besides the master CPU usage, networking between the master and slaves
might also be a bottleneck.

I recorded some of data on the network interface of the master and took
a look at it using Wireshark. I saw uncompressed RMI data containing
quite a lot of TLA model identifier strings being passed. I guess that
networking could be well optimized.

Jaak