So I got my model to a point where it catches a lot of the things I want to catch. Modeling a single "slot" and two processes (never mind what that means) takes about 1 minute 20 seconds. Making it 2 slots and 3 processes has so far taken my 8-thread workstation about 12 hours; and to be sure the model is race-free, I really want to run 3 slots and 3 processes.
So I started looking into how to increase throughput. Among the bits of kit we have lying around for testing is a slightly older 48-thread box, and a more recent 400+ thread box (but Linux only knows how to use 64 of them).
Unfortunately, I'm having trouble getting tlc to scale up to using these systems. Basically, if I go past 8 worker threads, the number of states generated actually goes down. And since the maximum clock rate of my desktop is actually higher than the maximum clock rate of the two massive-cpu boxen, the total throughput on the big boxes is lower than on my workstation.
I've also tried running in -simluation mode. Allegedly this has been implemented (https://github.com/tlaplus/tlaplus/issues/147); but running with 24 worker threads, the java process never takes more than 120% (i.e., no more than one worker is actually active).
It's more or less the same for two different JVMs: CentOS's OpenJDK
11.0.4, and Debian's OpenJDK 1.8.0_171 (from jessie-backports).
Any suggestisons? Am I missing something obvious?