[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Developing a TLA+ Model for the KafkaRoller in the Strimzi Project



Hi Paul,

Thanks, for your answer. I think that's a solid piece of work, which could be used as a baseline. I appreciate that. The possible state changes are described in [1] under states. i.e., 
image.png
and thus I have refined the spec you have provided a little bit based on these rules and here is the implementation with detailed changes [2]. But this implementation reached a deadlock so I had to change a bit ObserveNode(n) action. And I am now in a problem where Invariant ActiveControllerNotRestarted is violated.

Here is a trace:
```
Invariant ActiveControllerNotRestarted is violated.
The behavior up to this point is:
1: <Initial predicate>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {}
2: <ObserveNode line 60, col 3 to line 62, col 72 of module KafkaRoller>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "NOT_RUNNING",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {}
3: <RestartNode line 66, col 3 to line 72, col 55 of module KafkaRoller>
/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
[ state |-> "RESTARTED",
roles |-> {"controller"},
restartAttempts |-> 1,
retries |-> 0 ] @@
"controller-1" :>
[ state |-> "UNKNOWN",
roles |-> {"controller"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-2" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-3" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] @@
"broker-4" :>
[ state |-> "UNKNOWN",
roles |-> {"broker"},
restartAttempts |-> 0,
retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {"controller-0"}



[1] - https://github.com/strimzi/proposals/blob/9a777c320c56e3edd473f088a89864fa643ac63a/06x-new-kafka-roller.md#proposal
[2] - https://github.com/see-quick/verification/pull/3 

pi 26. 4. 2024 o 20:19 Paul Eipper <lkraider@xxxxxxxxx> napísal(a):
Hello,

Maybe this could be a first start on the definition rules:
https://gist.github.com/lkraider/07cef395cc2a648cbaac747fe33d8bf7

There is missing logic on the possible state changes and running this model will fail as there are no restraint rules on them, as I would need to understand the possible state changes and model them accordingly.

So for example running this model:

SPECIFICATION Spec

INVARIANTS
  NodesWithinRestartLimit
  NodesWithinRetryLimit
  ActiveControllerNotRestarted
  ControllerQuorumMaintained

PROPERTIES
  AllNodesServing
  NodesNotStuckRecovering

CONSTANTS
  ControllerNodes = {"controller-0", "controller-1"}
  BrokerNodes = {"broker-2", "broker-3", "broker-4"}
  CombinedNodes = {}
  MaxRestartAttempts = 3
  MaxRetries = 5

Will hit a failed state fairly quickly because there is no rule that a RESTART can't happen on initialization:

/\ activeController = "controller-0"
/\ nodeState = ( "controller-0" :>
      [ state |-> "RESTARTED",
        roles |-> {"controller"},
        restartAttempts |-> 1,
        retries |-> 0 ] @@
  "controller-1" :>
      [ state |-> "UNKNOWN",
        roles |-> {"controller"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-2" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-3" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] @@
  "broker-4" :>
      [ state |-> "UNKNOWN",
        roles |-> {"broker"},
        restartAttempts |-> 0,
        retries |-> 0 ] )
/\ reconfiguringNodes = {}
/\ restartingNodes = {"controller-0"}

My suggestion on the first step on evolving this spec would be to fix the possible state transitions.
Hope this helps.

att,
--
Paul Eipper


On Fri, Apr 26, 2024 at 7:34 AM Maroš Orsák <maros.orsak159@xxxxxxxxx> wrote:
Hello, 

I am writing to seek assistance with developing a model for the KafkaRoller within the Strimzi open-source project, which I maintain. For context, Strimzi is a Cloud Native Computing Foundation (CNCF) project that simplifies the complexities of managing Apache Kafka, a distributed system. The project provides a suite of operators that facilitate various tasks, including updating Kafka clusters, scaling nodes, and integrating multiple components such as tracing and OAuth2, among others.

A critical aspect of managing Kafka clusters is determining when a Kafka node should be rolled. To address this, we have developed a KafkaRoller class. Recently, we proposed a new design featuring an advanced algorithm intended to enhance this functionality. To ensure the safety and correctness of this algorithm, I aim to create a model in TLA+ that will allow us to formally verify it using safety and liveness properties. These properties include but are not limited to, ensuring that no node exceeds maximum retries (safety), preventing the simultaneous restart of all controller nodes (safety), and guaranteeing that all nodes will eventually be in a SERVING state (liveness).

However, I am unfamiliar with how to properly construct such a model and specify these properties in TLA+. Moreover, I lack time to do such a thing...so any guidance or support you could provide would be greatly appreciated. Thank you.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAMd-2QuocAToKY7HoorcvM8szpMSUu81UzgjRRs9gxROVkUhSg%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAO8R-xguX_TmFr2zQ%2BMvzkqUys-r5pv08SL%3DVXGT%3D-m5bnkYeA%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAMd-2QufszteStvt-2dFTTa%2B2EVjT8Mb5hRNWJZP1tM0ctbHFA%40mail.gmail.com.