
[tlaplus] Re: Scaling up model checking



To follow up on this, I am observing some apparent scalability issues with TLC when running on a single machine. I am using an EC2 c5d.24xlarge instance type, which has 96 cores and 192GB of memory. When I run a particular model with 64 workers, I see the following:

ubuntu@ip-172-31-1-228:/data/tla$ java -XX:MaxDirectMemorySize=120000m -Xmx120185m -Dtlc2.tool.fp.FPSet.impl=tlc2.tool.fp.OffHeapDiskFPSet -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -workers 64 -config mc.cfg mc.tla
TLC2 Version 2.14 of 10 July 2019 (rev: 0cae24f)
Running breadth-first search Model-Checking with fp 117 and seed 1389448420162930971 with 64 workers on 96 cores with 106832MB heap and 120000MB offheap memory [pid: 7730] (Linux 4.15.0-1051-aws amd64, Ubuntu 11.0.4 x86_64, OffHeapDiskFPSet, DiskStateQueue).
Parsing file /data/tla/mc.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/Integers.tla
Parsing file /tmp/FiniteSets.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/TLC.tla
Semantic processing of module Naturals
Semantic processing of module Integers
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module mc
Starting... (2019-11-14 01:05:32)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2019-11-14 01:05:43.
Progress(6) at 2019-11-14 01:05:46: 147,528 states generated (147,528 s/min), 9,917 distinct states found (9,917 ds/min), 8,452 states left on queue.
Progress(9) at 2019-11-14 01:06:46: 4,363,650 states generated (4,216,122 s/min), 382,868 distinct states found (372,951 ds/min), 290,240 states left on queue.
Progress(9) at 2019-11-14 01:07:46: 8,448,010 states generated (4,084,360 s/min), 731,818 distinct states found (348,950 ds/min), 533,823 states left on queue.
Progress(10) at 2019-11-14 01:08:46: 12,452,538 states generated (4,004,528 s/min), 1,093,250 distinct states found (361,432 ds/min), 790,535 states left on queue.
Progress(10) at 2019-11-14 01:09:46: 16,336,391 states generated (3,883,853 s/min), 1,436,130 distinct states found (342,880 ds/min), 1,044,868 states left on queue.
Progress(10) at 2019-11-14 01:10:46: 20,274,326 states generated (3,937,935 s/min), 1,787,099 distinct states found (350,969 ds/min), 1,295,292 states left on queue.
Progress(10) at 2019-11-14 01:11:46: 24,372,652 states generated (4,098,326 s/min), 2,177,834 distinct states found (390,735 ds/min), 1,556,071 states left on queue.
Progress(10) at 2019-11-14 01:12:46: 28,343,121 states generated (3,970,469 s/min), 2,552,098 distinct states found (374,264 ds/min), 1,815,312 states left on queue.
Progress(10) at 2019-11-14 01:13:46: 32,267,652 states generated (3,924,531 s/min), 2,902,935 distinct states found (350,837 ds/min), 2,057,542 states left on queue.
Progress(10) at 2019-11-14 01:14:46: 36,234,857 states generated (3,967,205 s/min), 3,282,234 distinct states found (379,299 ds/min), 2,314,932 states left on queue.
Progress(11) at 2019-11-14 01:15:46: 40,186,365 states generated (3,951,508 s/min), 3,679,592 distinct states found (397,358 ds/min), 2,582,686 states left on queue.
Progress(11) at 2019-11-14 01:16:46: 44,118,806 states generated (3,932,441 s/min), 4,034,988 distinct states found (355,396 ds/min), 2,825,454 states left on queue.

The throughput (s/min) appears to top out at around 3.9 million states/minute. These are the reported numbers when I run with just 8 workers, on the same machine:

ubuntu@ip-172-31-1-228:/data/tla$ java -XX:MaxDirectMemorySize=120000m -Xmx120185m -Dtlc2.tool.fp.FPSet.impl=tlc2.tool.fp.OffHeapDiskFPSet -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -workers 8 -config mc.cfg mc.tla
TLC2 Version 2.14 of 10 July 2019 (rev: 0cae24f)
Running breadth-first search Model-Checking with fp 43 and seed 6871599717738624516 with 8 workers on 96 cores with 106832MB heap and 120000MB offheap memory [pid: 8128] (Linux 4.15.0-1051-aws amd64, Ubuntu 11.0.4 x86_64, OffHeapDiskFPSet, DiskStateQueue).
Parsing file /data/tla/mc.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/Integers.tla
Parsing file /tmp/FiniteSets.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/TLC.tla
Semantic processing of module Naturals
Semantic processing of module Integers
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module mc
Starting... (2019-11-14 01:17:58)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2019-11-14 01:18:07.
Progress(6) at 2019-11-14 01:18:10: 136,991 states generated (136,991 s/min), 8,542 distinct states found (8,542 ds/min), 7,332 states left on queue.
Progress(8) at 2019-11-14 01:19:10: 3,344,141 states generated (3,207,150 s/min), 295,218 distinct states found (286,676 ds/min), 229,350 states left on queue.
Progress(9) at 2019-11-14 01:20:10: 6,591,286 states generated (3,247,145 s/min), 572,490 distinct states found (277,272 ds/min), 430,834 states left on queue.
Progress(9) at 2019-11-14 01:21:10: 9,852,114 states generated (3,260,828 s/min), 846,787 distinct states found (274,297 ds/min), 621,471 states left on queue.
Progress(9) at 2019-11-14 01:22:10: 12,984,599 states generated (3,132,485 s/min), 1,124,563 distinct states found (277,776 ds/min), 827,090 states left on queue.
Progress(9) at 2019-11-14 01:23:10: 16,140,694 states generated (3,156,095 s/min), 1,398,002 distinct states found (273,439 ds/min), 1,030,079 states left on queue.
Progress(10) at 2019-11-14 01:24:10: 19,356,420 states generated (3,215,726 s/min), 1,688,862 distinct states found (290,860 ds/min), 1,228,268 states left on queue.
Progress(10) at 2019-11-14 01:25:10: 22,544,906 states generated (3,188,486 s/min), 1,981,062 distinct states found (292,200 ds/min), 1,428,619 states left on queue.
Progress(10) at 2019-11-14 01:26:10: 25,777,526 states generated (3,232,620 s/min), 2,284,525 distinct states found (303,463 ds/min), 1,633,791 states left on queue.
Progress(10) at 2019-11-14 01:27:10: 28,980,735 states generated (3,203,209 s/min), 2,577,732 distinct states found (293,207 ds/min), 1,831,318 states left on queue.
Progress(10) at 2019-11-14 01:28:10: 32,199,852 states generated (3,219,117 s/min), 2,872,788 distinct states found (295,056 ds/min), 2,033,065 states left on queue.

which shows a maximum throughput of around 3.2 million states/minute. This would imply that TLC is scaling very badly: a scale factor of only about 1.2x (3.9/3.2) despite using 8 times as many workers. I don't think TLC should be disk-bound yet, since it has a ton of memory (120GB) allocated to it, and I can observe nearly maximal core utilization while executing these runs. Is there something obvious I may be overlooking here?
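For concreteness, a quick back-of-envelope of the speedup and parallel efficiency, just plugging in the rounded peak s/min figures from the two runs above:

    # Back-of-envelope speedup/efficiency from the peak throughput figures above
    # (approximate maxima reported by TLC, in states generated per minute).
    throughput_8_workers = 3.2e6    # ~3.2 million s/min with 8 workers
    throughput_64_workers = 3.9e6   # ~3.9 million s/min with 64 workers

    speedup = throughput_64_workers / throughput_8_workers
    ideal_speedup = 64 / 8          # 8x as many workers

    print(f"observed speedup:    {speedup:.2f}x")                 # ~1.22x
    print(f"ideal speedup:       {ideal_speedup:.0f}x")           # 8x
    print(f"parallel efficiency: {speedup / ideal_speedup:.0%}")  # ~15%

So the 64-worker run only realizes about 15% of the ideal 8x speedup over the 8-worker run.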




On Tuesday, August 27, 2019 at 12:38:31 PM UTC-4, dun...@xxxxxxxxx wrote:
So I got my model to a point where it catches a lot of the things I want to catch.  Modeling a single "slot" and two processes (never mind what that means) takes about 1 minute 20 seconds.  Making it 2 slots and 3 processes has so far taken my 8-thread workstation about 12 hours; and to be sure the model is race-free, I really want to run 3 slots and 3 processes.
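To put a rough number on that blow-up (just a back-of-envelope, and assuming the ~12 hours is close to the full run time rather than merely progress so far):

    # Rough blow-up in wall-clock time between the two configurations
    # (assumes the ~12 hours is close to the full run, which may be optimistic).
    small_config  = 80          # 1 slot, 2 processes: ~1 min 20 s (in seconds)
    larger_config = 12 * 3600   # 2 slots, 3 processes: ~12 hours so far (in seconds)

    print(f"blow-up factor: at least ~{larger_config / small_config:.0f}x")  # ~540x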

So I started looking into how to increase throughput.  Among the bits of kit we have lying around for testing are a slightly older 48-thread box and a more recent 400+ thread box (but Linux only knows how to use 64 of them).

Unfortunately, I'm having trouble getting TLC to scale up on these systems.  Basically, if I go past 8 worker threads, the number of states generated actually goes down.  And since the maximum clock rate of my desktop is actually higher than the maximum clock rate of the two massive-CPU boxen, the total throughput on the big boxes is lower than on my workstation.

I've also tried running in -simulation mode.  Allegedly parallel simulation has been implemented (https://github.com/tlaplus/tlaplus/issues/147), but running with 24 worker threads, the Java process never uses more than about 120% CPU (i.e., no more than one worker is actually active).

It's more or less the same for two different JVMs: CentOS's OpenJDK 11.0.4, and Debian's OpenJDK 1.8.0_171 (from jessie-backports).

Any suggestions? Am I missing something obvious?

Thanks,
 -George
