[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Developing a TLA+ Model for the KafkaRoller in the Strimzi Project



So after a few iterations of the model...I was able to run verification on safety properties 
States
Time	Diameter	Found	Distinct	Queue
00:00:00	0	2	2	2
00:00:03	12	373 773	80 532	39 897
00:01:03	17	11 409 991	1 730 911	650 391
00:02:03	19	22 763 536	3 230 184	1 123 685
00:03:03	20	34 124 771	4 653 357	1 528 520
00:04:03	20	45 515 221	6 030 076	1 914 174
00:05:03	21	56 993 967	7 381 016	2 261 254
00:06:03	21	68 303 213	8 686 644	2 578 860
00:07:03	22	79 637 963	9 979 068	2 898 506
00:08:03	22	90 914 422	11 224 929	3 143 886
00:09:03	22	102 262 975	12 496 690	3 455 732
00:10:03	23	113 723 997	13 740 892	3 727 154
00:11:03	23	125 147 018	14 962 231	3 937 627
00:12:03	23	136 675 178	16 200 961	4 210 156
00:13:03	24	148 115 992	17 420 927	4 427 002
00:14:03	24	159 595 885	18 620 234	4 683 289
00:15:03	24	170 985 420	19 792 862	4 850 702
00:16:03	24	182 526 842	20 995 286	5 107 492
00:17:03	24	193 700 817	22 153 555	5 298 514
00:18:03	25	204 984 084	23 298 651	5 497 201
00:19:03	25	216 009 397	24 400 438	5 656 809
00:20:03	25	227 106 833	25 515 676	5 812 490
00:21:03	25	238 359 175	26 644 930	6 013 833
00:22:03	25	249 569 969	27 766 444	6 186 311
00:23:03	25	260 734 338	28 874 577	6 290 077
00:24:03	26	272 001 237	29 985 680	6 516 761
00:25:03	26	283 153 368	31 061 666	6 629 888
00:26:03	26	294 410 128	32 160 080	6 766 310
00:27:03	26	305 672 779	33 253 834	6 924 320
00:28:03	26	316 922 323	34 347 480	7 087 609
00:29:03	26	328 063 646	35 430 080	7 206 904
00:30:03	27	339 293 311	36 501 951	7 316 872
00:31:03	27	350 641 518	37 573 848	7 462 407
00:32:03	27	361 762 024	38 622 782	7 556 111
00:33:03	27	372 957 115	39 681 670	7 664 655
00:34:03	27	384 230 947	40 739 338	7 778 788
00:35:03	27	395 493 403	41 801 105	7 921 614
00:36:03	27	406 745 835	42 866 056	8 021 215
00:37:03	27	417 922 316	43 914 700	8 086 017
00:38:03	28	429 227 513	44 952 149	8 217 793
00:39:03	28	440 408 053	45 990 326	8 321 943
00:40:03	28	451 599 765	47 007 672	8 372 374
00:41:03	28	462 824 574	48 036 880	8 443 463
00:42:03	28	473 936 478	49 042 645	8 517 575
00:43:03	28	484 604 347	50 031 004	8 646 419
00:44:03	28	495 524 420	51 029 013	8 725 648
... 

Coverage
Module	Action	Total	Distinct
KafkaRoller	Init	2	2
KafkaRoller	ObserveNode	206 922 950	15 340 974
KafkaRoller	ReconfigureNode	35 128 411	8 280
KafkaRoller	NodeServing	87 000 108	8 282
KafkaRoller	RetryNode	155 553 002	34 673 480

Seems it's working somehow :D. Very excited with first real model...[1]. But it's funny how like 20 minutes could generate 20GB of data and it's increasing very quickly.

[1] - https://github.com/see-quick/verification/blob/main/formal_verification/TLA%2B/strimzi/kafka_roller/KafkaRoller.tla

št 9. 5. 2024 o 13:55 Maroš Orsák <maros.orsak159@xxxxxxxxx> napísal(a):
Hi Paul,

Thanks, for your answer. I think that's a solid piece of work, which could be used as a baseline. I appreciate that. The possible state changes are described in [1] under states. i.e., 
image.png
and thus I have refined the spec you have provided a little bit based on these rules and here is the implementation with detailed changes [2]. But this implementation reached a deadlock so I had to change a bit ObserveNode(n) action. And I am now in a problem where Invariant ActiveControllerNotRestarted is violated.

Here is a trace:
```
Invariant ActiveControllerNotRestarted is violated.
The behavior up to this point is:
1: <Initial predicate>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {}
2: <ObserveNode line 60, col 3 to line 62, col 72 of module KafkaRoller>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "NOT_RUNNING",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {}
3: <RestartNode line 66, col 3 to line 72, col 55 of module KafkaRoller>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "RESTARTED",
roles |-> {"controller"},
restartAttempts |-> 1,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {"controller-0"}



[1] - https://github.com/strimzi/proposals/blob/9a777c320c56e3edd473f088a89864fa643ac63a/06x-new-kafka-roller.md#proposal
[2] - https://github.com/see-quick/verification/pull/3 

pi 26. 4. 2024 o 20:19 Paul Eipper <lkraider@xxxxxxxxx> napísal(a):
Hello,

Maybe this could be a first start on the definition rules:

There is missing logic on the possible state changes and running this model will fail as there are no restraint rules on them, as I would need to understand the possible state changes and model them accordingly.

So for example running this model:

SPECIFICATION Spec

INVARIANTS
  NodesWithinRestartLimit
  NodesWithinRetryLimit
  ActiveControllerNotRestarted
  ControllerQuorumMaintained

PROPERTIES
  AllNodesServing
  NodesNotStuckRecovering

CONSTANTS
  ControllerNodes = {"controller-0", "controller-1"}
  BrokerNodes = {"broker-2", "broker-3", "broker-4"}
  CombinedNodes = {}
  MaxRestartAttempts = 3
  MaxRetries = 5

Will hit a failed state fairly quickly because there is no rule that a RESTART can't happen on initialization:

/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
      [ state |-> "RESTARTED",
        roles |-> {"controller"},
        restartAttempts |-> 1,
        retries |-> 0 ] @@
  "controller-1" :>
      [ state |-> "UNKNOWN",
        roles |-> {"controller"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-2" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-3" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-4" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {"controller-0"}

My suggestion on the first step on evolving this spec would be to fix the possible state transitions.
Hope this helps.

att,
--
Paul Eipper


On Fri, Apr 26, 2024 at 7:34 AM Maroš Orsák <maros.orsak159@xxxxxxxxx> wrote:
Hello, 

I am writing to seek assistance with developing a model for the KafkaRoller within the Strimzi open-source project, which I maintain. For context, Strimzi is a Cloud Native Computing Foundation (CNCF) project that simplifies the complexities of managing Apache Kafka, a distributed system. The project provides a suite of operators that facilitate various tasks, including updating Kafka clusters, scaling nodes, and integrating multiple components such as tracing and OAuth2, among others.

A critical aspect of managing Kafka clusters is determining when a Kafka node should be rolled. To address this, we have developed a KafkaRoller class. Recently, we proposed a new design featuring an advanced algorithm intended to enhance this functionality. To ensure the safety and correctness of this algorithm, I aim to create a model in TLA+ that will allow us to formally verify it using safety and liveness properties. These properties include but are not limited to, ensuring that no node exceeds maximum retries (safety), preventing the simultaneous restart of all controller nodes (safety), and guaranteeing that all nodes will eventually be in a SERVING state (liveness).

However, I am unfamiliar with how to properly construct such a model and specify these properties in TLA+. Moreover, I lack time to do such a thing...so any guidance or support you could provide would be greatly appreciated. Thank you.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAMd-2QuocAToKY7HoorcvM8szpMSUu81UzgjRRs9gxROVkUhSg%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAO8R-xguX_TmFr2zQ%2BMvzkqUys-r5pv08SL%3DVXGT%3D-m5bnkYeA%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAMd-2QsN_NBUgVQ87tLt_Du_uR7UPvLAfEBePXx0TojTZ8fOyQ%40mail.gmail.com.