[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Help with running TLC in distributed mode (ad hoc)



On 19.01.19 18:28, ahmed...@xxxxxxxxx wrote:
> I am trying to run TLC in ad hoc mode to check a complex model and I face some problems with that. I set the number of fingerprint servers to 0 in the toolbox, so only the master server stores fingerprints. First I run the master server from the toolbox and its state changes to  "Master waiting for workers". Then, I start slave workers using the command line and its output on the shell shows that its state is "ready". However, the master server state changes to "0 workers registered" and the toolbox progress output shows that it connected to workers and disconnected immediately as shown in the following logs:
> 
> SANY finished.
> Computing initial states...
> Finished computing initial states: 1 distinct state(s) generated.
> TLC server at ahmad-ThinkPad-T470s is ready (2019-01-19 21:19:40)
> Registration for worker at rmi://ahmad-ThinkPad-T470s:35985/3 completed (2019-01-19 21:20:17)
> Registration for worker at rmi://ahmad-ThinkPad-T470s:35985/1 completed (2019-01-19 21:20:17)
> Registration for worker at rmi://ahmad-ThinkPad-T470s:35985/0 completed (2019-01-19 21:20:17)
> Registration for worker at rmi://ahmad-ThinkPad-T470s:35985/2 completed (2019-01-19 21:20:17)
> TLC worker connection lost rmi://ahmad-ThinkPad-T470s:35985/3 (2019-01-19 21:20:17)
> TLC worker rmi://ahmad-ThinkPad-T470s:35985/3 disconnected (2019-01-19 21:20:17)
> TLC worker connection lost rmi://ahmad-ThinkPad-T470s:35985/2 (2019-01-19 21:20:17)
> TLC worker rmi://ahmad-ThinkPad-T470s:35985/2 disconnected (2019-01-19 21:20:17)
> TLC worker connection lost rmi://ahmad-ThinkPad-T470s:35985/1 (2019-01-19 21:20:17)
> TLC worker rmi://ahmad-ThinkPad-T470s:35985/1 disconnected (2019-01-19 21:20:17)
> TLC worker connection lost rmi://ahmad-ThinkPad-T470s:35985/0 (2019-01-19 21:20:17)
> TLC worker rmi://ahmad-ThinkPad-T470s:35985/0 disconnected (2019-01-19 21:20:17)
> 
> When I run slaves, I only specify -Djava.rmi.server.hostname argument, should I specify anything else?
> 
> Can you help me resolving the issue?

Hi,

this behavior is usually caused by broken name resolution. Please try to
set "-Djava.rmi.server.hostname" on the master computer and similarly on
each worker.  Distributed TLC requires opening connections from the
worker computers to the master but also from the master to each worker.
With distributed fingerprint sets communication is even fully meshed.

For me it usually works best to set "java.rmi.server.hostname" to the
computer's (externally visible) IP address. Lets assume the master has
IP 1.2.3.4 and the worker has 5.6.7.8:

---

master: java -Djava.rmi.server.hostname=1.2.3.4 -cp tla2tools.jar
tlc2.tool.distributed.TLCServer MC

worker: java -Djava.rmi.server.hostname=5.6.7.8 -cp tla2tools.jar
tlc2.tool.distributed.TLCWorker 1.2.3.4

---

For the master running from inside the Toolbox, you can pass the
"java.rmi.server.hostname" JVM argument on the "Advanced Options" tab of
the Model Editor.

On the worker computer, does the command line show anything of interest?

Hope this helps,
Markus