[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[tlaplus] Re: Scaling up model checking

I see. Should I expect better scalability with distributed TLC? Your presentation here indicates that distributed mode may be able to scale linearly in some cases. My goal is to try running multiple TLC worker processes on the same EC2 machine. Do you have reason to believe that will improve performance versus running all threads inside the same JVM process? I am also curious if using multiple fingerprint set servers is recommended for scaling to this level (i.e. 96 machine cores). I am just curious if it's worth throwing many cores/machines at the problem. If the improvements are only marginal, it may just be simpler for me to run non-distributed TLC with suitably many workers.


On Tuesday, August 27, 2019 at 12:38:31 PM UTC-4, dun...@xxxxxxxxx wrote:
So I got my model to a point where it catches a lot of the things I want to catch.  Modeling a single "slot" and two processes (never mind what that means) takes about 1 minute 20 seconds.  Making it 2 slots and 3 processes has so far taken my 8-thread workstation about 12 hours; and to be sure the model is race-free, I really want to run 3 slots and 3 processes.

So I started looking into how to increase throughput.  Among the bits of kit we have lying around for testing is a slightly older 48-thread box, and a more recent 400+ thread box (but Linux only knows how to use 64 of them).

Unfortunately, I'm having trouble getting tlc to scale up to using these systems.  Basically, if I go past 8 worker threads, the number of states generated actually goes down.  And since the maximum clock rate of my desktop is actually higher than the maximum clock rate of the two massive-cpu boxen, the total throughput on the big boxes is lower than on my workstation.

I've also tried running in -simluation mode.  Allegedly this has been implemented (https://github.com/tlaplus/tlaplus/issues/147); but running with 24 worker threads, the java process never takes more than 120% (i.e., no more than one worker is actually active).

It's more or less the same for two different JVMs: CentOS's OpenJDK 11.0.4, and Debian's OpenJDK 1.8.0_171 (from jessie-backports).

Any suggestisons? Am I missing something obvious?


You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/76a2b5ba-302a-443d-8237-0254d907a62e%40googlegroups.com.