[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Distributed TLC: Failed to checkpoint the fingerprint server



Hmm... I discovered that this might have been caused by the TLC slave
running out of disk space in /tmp. It seems that TLC doesn't clean up
the /tmp/FPSet* directories after exit. :/

Is there a way to make TLC use some other directory instead of /tmp ?

Thanks!
Jaak

On 09.12.2015 10:29, Jaak Ristioja wrote:
> Hello!
> 
> Running a nightly from last week, I found that the distributed (ad-hoc)
> model check I left running for the night has hanged. The toolbox is
> still in "Computing reachable states" mode, the servers seem to be up
> and have output no errors. Servers still output progress messages about
> the number of fingerprints stored, which do not increase. The TLC java
> process is still present on the toolbox machine. The slave servers as
> well as the toolbox machine seem to be idle (no CPU, disk or network
> usage) and just sit on the heap of memory allocated for them. The MC.out
> file contains:
> 
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:23:05: 2754111084 states generated
> (1,440,989 s/min), 452951326 distinct states found (215,859 ds/min),
> 87756011 states left on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2195:0 @!@!@
> Checkpointing of run -- Checkpointing of run 15-12-02-13-53-09/ compl
> @!@!@ENDMSG 2195 @!@!@
> Error: Failed to checkpoint the fingerprint server at s3. This server
> might be down.
> @!@!@STARTMSG 2196:0 @!@!@
> Checkpointing completed at (2015-12-09 06:26:04)
> @!@!@ENDMSG 2196 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:26:04: 2758088637 states generated
> (3,977,553 s/min), 453573098 distinct states found (621,772 ds/min),
> 88271243 states left on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:27:04: 2760215909 states generated
> (2,127,272 s/min), 453682938 distinct states found (109,840 ds/min),
> 87607691 states left on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:28:04: 2761764541 states generated
> (1,548,632 s/min), 453682938 distinct states found (0 ds/min), 87313205
> states left on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:29:04: 2761764541 states generated (0
> s/min), 453682938 distinct states found (0 ds/min), 87313205 states left
> on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:30:04: 2761764541 states generated (0
> s/min), 453682938 distinct states found (0 ds/min), 87313205 states left
> on queue.
> @!@!@ENDMSG 2200 @!@!@
> @!@!@STARTMSG 2200:0 @!@!@
> Progress(90) at 2015-12-09 06:31:04: 2761764541 states generated (0
> s/min), 453682938 distinct states found (0 ds/min), 87313205 states left
> on queue.
> @!@!@ENDMSG 2200 @!@!@
> 
> ...and so on.
> 
> I stopped TLC from the toolbox and hit CTRL+C for the servers. I hope
> that doesn't invalidate the model checking results?
> 
> Best regards,
> Jaak
>