Hello,
I am writing to seek assistance with developing a model for the KafkaRoller within the Strimzi open-source project, which I maintain. For context, Strimzi is a Cloud Native Computing Foundation (CNCF) project that simplifies the complexities of managing Apache Kafka, a distributed system. The project provides a suite of operators that facilitate various tasks, including updating Kafka clusters, scaling nodes, and integrating multiple components such as tracing and OAuth2, among others.
A critical aspect of managing Kafka clusters is determining when a Kafka node should be rolled. To address this, we have developed a KafkaRoller class. Recently, we proposed a new design featuring an advanced algorithm intended to enhance this functionality. To ensure the safety and correctness of this algorithm, I aim to create a model in TLA+ that will allow us to formally verify it using safety and liveness properties. These properties include but are not limited to, ensuring that no node exceeds maximum retries (safety), preventing the simultaneous restart of all controller nodes (safety), and guaranteeing that all nodes will eventually be in a SERVING state (liveness).
However, I am unfamiliar with how to properly construct such a model and specify these properties in TLA+. Moreover, I lack time to do such a thing...so any guidance or support you could provide would be greatly appreciated. Thank you.