Hi Everyone,
I'm currently teaching a class of people how to use TLA+. One thing we ran into is that TLC has poor performance on some machines. Running
VARIABLE x
Init == x = 1
Next == UNCHANGED x
Spec == Init /\ [][Next]_x
will take either 0 seconds or around 10 seconds, depending on the computer. Times are consistent: if a computer takes 0 seconds, it takes 0 seconds on every run. I had everybody send me TLC logs, and the only obvious difference I noticed is that the fast machines all had 12 cores while the slow machines all had 8 cores. However, setting the number of workers to 1 on the 12-core machines does not slow them down, so I don't think core count is the cause, and I've hit a dead end.
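In case anyone wants to reproduce this outside the Toolbox, here is the snippet wrapped up as a complete module plus a minimal config file. The module and file names are just placeholders I picked; the command-line invocations are sketched after my notes below.

\* Minimal.tla -- the snippet above as a standalone module
---- MODULE Minimal ----
VARIABLE x
Init == x = 1
Next == UNCHANGED x
Spec == Init /\ [][Next]_x
====

\* Minimal.cfg -- tells TLC which formula to check
SPECIFICATION Spec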
I don't think I have enough data to diagnose the likely issue or file an informative bug report, so my goal here is to figure out what else I should be looking for. Since these are other people's computers, I can't do anything complicated or test anything for more than about five minutes at a time. The workshop runs the whole week, though, so I can try a few different short experiments. How can I get more data on what could be wrong here?
Some notes I have so far:
- Doesn't seem OS-specific: the two fast runs were on Linux and Mac, and the slow runs were on Mac or Windows.
- Doesn't seem to be JVM startup related. TLC tells me the slow machines are evaluating fewer states per second.
- I got all the logs by clearing the TLA+ Toolbox console and running TLC once.
- Switching from breadth-first to depth-first search seems to fix the slowdown (a command-line sketch for toggling this is after these notes).
- The slowdown seems to be a constant offset: I tried a few other (small) specs and the difference was always around 10 seconds. So if the fast computers finished a spec in about 2 seconds, the slow computers would take about 12 seconds.
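One five-minute experiment I'm considering having people run is invoking TLC directly from the command line, so we can compare total wall-clock time against the duration TLC itself reports (to separate JVM/Toolbox overhead from state-space exploration) and flip between breadth-first and depth-first without touching the Toolbox. This is just a sketch; the path to tla2tools.jar and the depth bound are placeholders:

    # Default breadth-first search, single worker
    time java -cp tla2tools.jar tlc2.TLC -workers 1 -config Minimal.cfg Minimal.tla

    # Depth-first iterative deepening (up to depth 10) instead of BFS
    time java -cp tla2tools.jar tlc2.TLC -workers 1 -dfid 10 -config Minimal.cfg Minimal.tla

If the slow machines still show the ~10 second gap when run this way, that would at least rule out the Toolbox itself.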
H