Currently I'm trying to build a layer of distributed state machine with Raft following this paper: http://nil.csail.mit.edu/6.5840/2025/papers/raft-extended.pdf
Course: https://pdos.csail.mit.edu/6.824/
The problem appeared when I introduced the persistence layer (disk) to my log replication logic. There are situations where agreement on a specific entry cannot be reached because of time limit exceeded (a time-frame the entry should be replicated inside of).
Mainly, when there is a heavy randomized leader failure, the followers don't get the chance to fully back up their log. This problem didn't appear previously without persistence in the picture.
My question is: does the TLA+ tool help me identify the true problem in case I'm already identifying it wrong?
Any thoughts or recommendations will be helpful, thanks.
--