[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Re: Strategies for scalable modeling of append-only logs



Andrew, real systems have finite-length logs. They either have a real-life mechanism that limits lag (as you suggest) or more often a node can just lag too far behind and "fall off the log": entries are GC'ed before the laggy node has replicated them, so the node must initiate some recovery process to get back on track. For example, MongoDB. Modeling all this might be complicated, but it seems like the other state-limiting methods are both complicated and diverge from how real systems work.

Even with finite logs, there are monotonically increasing numbers (e.g. ballot ids) that would benefit from the "Real Time Is Really Simple" treatment, but I haven't done this myself.

On Tue, Oct 24, 2023 at 11:54 AM Andrew Helwer <andrew.helwer@xxxxxxxxx> wrote:
Thanks Jones! Great to have data for testing this.

My thinking has evolved a bit since my last post. Instead of thinking about this in terms of logs that are purged to a snapshot, it really is just directly analogous to Leslie's approach in Real Time is Really Simple. The absolute state of the log & each nodes' execution position do not matter, much like how the absolute value of a global clock does not matter. All that matters is the relative divergence of execution between the different nodes!

The basic idea is very simple. Here it is: once all nodes have executed a transaction in the log, remove that transaction from the start of the log.

The key here is we need to bound how far ahead one node can be in executing log transactions relative to others. This is where all the interesting behavior manifests, when the nodes are in different states or have different knowledge of the global state. So there's a tradeoff (like always) between how much interesting behavior we can explore and what is tractable to model check. I think a 3 transaction divergence limit seems appropriate for most applications.

But how do we model this? Similar to the Real Time approach where the Tick action can't move time forward if there's an upper-bound timer that would reach zero, any action that would add additional transactions to the log (past 3) is not enabled as long as some node has not executed the first transaction in the log. Once that node has executed the transaction, the first transaction is dropped to open up additional space in the log for another transaction to be added to the end.

This approach can also be used to model any system that increases monotonically, say with epoch or sequence numbers. Just limit how far the nodes can diverge, and keep reducing the monotonically-increasing state (in the case of epoch/sequence numbers, keep reducing them so they stay in the 0-3 range for instance) once all nodes have progressed past it.

One possibly annoying thing is the system might want to reference specific transaction numbers or epoch/sequence numbers in other parts of the state, and these references will also have to be updated as earlier state is dropped. This wasn't an issue in Real Time since it is always possible to write things in terms of upper- or lower-bound timers that are held in a centralized variable, and it seems possible to do the same thing here. Instead of holding log indices or epoch/sequence numbers in other variables directly, hold a reference to an entry in some other centralized variable that is easy to update as state is dropped.

I hope all of this made sense. I think I will try to update my Checkpoint Coordination spec to use this approach and see how it works; currently it uses a state constraint to keep the log from growing infinitely, which both massively increases the state space and rules out liveness checking: https://github.com/tlaplus/Examples/tree/master/specifications/CheckpointCoordination

Andrew Helwer

On Monday, October 23, 2023 at 4:12:43 PM UTC-4 Jones Martins wrote:
(My previous reply was completely broken by Markdown to HTML conversion. Lesson learned. I'm resending this as a fix.)

I decided to model check the extremely contrived example in my previous message and compare the state space and model check time compared to no log compression.

In case anyone’s interested, here are the results:

No log compression:

Len(log) < ? represents the state constraint.

| Len(log) < ? | Distinct States | Time (4 workers & 10 runs) |
|--------------|-----------------|----------------------------|
|           11 |          19,233 |          1.305 s ± 0.222 s |
|          
12 |          48,526 |          1.793 s ± 0.319 s |
|
           13 |         121,217 |          2.516 s ± 0.382 s |
|
           14 |         305,152 |          2.830 s ± 0.658 s |
|
           15 |         763,353 |          4.365 s ± 0.772 s |
|
           16 |       1,918,966 |          7.745 s ± 0.664 s |
|
           17 |       4,804,797 |         17.935 s ± 1.278 s |
|
           18 |      12,067,896 |         44.357 s ± 3.356 s |
|
           19 |      30,233,757 |        114.005 s ± 7.446 s |

Compressing the log by removing the longest “redundant” log suffix:

| Len(log) < ? | Distinct States | Time (4 workers & 10 runs) |
|--------------|-----------------|-----------------------------|
|           11 |          16,319 |           2.877 s ± 0.425 s |
|
           12 |          38,356 |           3.571 s ± 0.571 s |
|
           13 |          89,795 |           4.971 s ± 0.837 s |
|
           14 |         209,188 |           9.074 s ± 1.174 s |
|
           15 |         487,065 |          17.080 s ± 0.918 s |
|
           16 |       1,132,250 |          35.743 s ± 1.882 s |
|
           17 |       2,633,623 |         63.383 s ± 32.838 s |
|
           18 |       6,122,816 |    144s (4 workers & 1 run) |
|           19 |      14,234,901 |   6m46s (4 workers & 1 run)
|

Differences by compressing

Len(log) < ? | State space difference* | Time difference** |
|--------------|-------------------------|-------------------|
|           11 |           15.2% smaller |     2.205x slower |
|
           12 |           21.0% smaller |     1.992x slower |
|           13 |           25.9% smaller |     1.976x slower |
|           14 |           31.4% smaller |     3.206x slower |
|           15 |           36.2% smaller |     3.913x slower |
|
           16 |           41.0% smaller |     4.615x slower |
|
           17 |           45.2% smaller |     3.535x slower |
|
           18 |           49.3% smaller |     3.246x slower |
|          
19 |           52.9% smaller |     3.561x slower  |

*: 1 - (states_compressed/states_not_compressed)
**: time_compressed_avg/time_not_compressed_avg

This very contrived example with compression removed, on average, 35.3% of the state space, but it took, on average, 3.13 times longer to model check.

Removing log from the view and model checking with Constraint == Len(log) < 19 took less than a second, which is over 114x faster than including a log (maybe 1140x faster?)

(I didn’t test adding snapshots, though… I guess I’d need a different contrived example for that, hahah)

Best,
Jones



On Monday, 23 October 2023 at 16:23:04 UTC-3 Jones Martins wrote:

I decided to model check the extremely contrived example in my previous message and compare the state space and model check time compared to no log compression.

In case anyone’s interested, here are the results:

No log compression

Len(log) < ? represents the state constraint.

Len(log) < ? Distinct States Time (4 workers & 10 runs) 11 19,233 1.305 s ± 0.222 s 12 48,526 1.793 s ± 0.319 s 13 121,217 2.516 s ± 0.382 s 14 305,152 2.830 s ± 0.658 s 15 763,353 4.365 s ± 0.772 s 16 1,918,966 7.745 s ± 0.664 s 17 4,804,797 17.935 s ± 1.278 s 18 12,067,896 44.357 s ± 3.356 s 19 30,233,757 114.005 s ± 7.446 s Compressing the log by removing the longest “redundant” log suffix: Len(log) < ? Distinct States Time (4 workers & 10 runs) 11 16,319 2.877 s ± 0.425 s 12 38,356 3.571 s ± 0.571 s 13 89,795 4.971 s ± 0.837 s 14 209,188 9.074 s ± 1.174 s 15 487,065 17.080 s ± 0.918 s 16 1,132,250 35.743 s ± 1.882 s 17 2,633,623 63.383 s ± 32.838 s 18 6,122,816 144s (4 workers & 1 run) 19 14,234,901 6m46s (4 workers & 1 run) Differences by compressing Len(log) < ? State space difference (1 - (states_compressed/states_not_compressed)) Time difference (time_compressed_avg/time_not_compressed_avg) 11 15.2% smaller 2.205x slower 12 21% smaller 1.992x slower 13 25.9% smaller 1.976x slower 14 31.4% smaller 3.206x slower 15 36.2% smaller 3.913x slower 16 41.0% smaller 4.615x slower 17 45.2% smaller 3.535x slower 18 49.3% smaller 3.246x slower 19 52.9% smaller 3.561x slower

This very contrived example with compression removed, on average, 35.3% of the state space, but it took, on average, 3.13 times longer to model check.

Removing log from the view and model checking with Constraint == Len(log) < 19 took less than a second, which is over 114x faster than including a log (maybe 1140x faster?)

(I didn’t test adding snapshots, though… I guess I’d need a different contrived example for that, hahah)

Best,
Jones

On Saturday, 21 October 2023 at 10:41:13 UTC-3 Andrew Helwer wrote:
Actually this comment inspired me to think of another possible solution! Which also matches how most of these systems are implemented. We want two things:
  1. Avoiding factorial state space blowup due to path dependence
  2. Finite non-restricted state space, so liveness checking is possible
A naive implementation of a replicated log has infinite state space, or at least requires state restrictions that rule out liveness checking since replicated logs grow indefinitely. But usually replicated logs are implemented in the real world by a combination of logs and snapshots. So every N additions to the log, a snapshot is taken of the outward system state. This means when a new node comes online it can rehydrate from the latest snapshot, and then replay the last < N log elements to catch up to the frontier; much faster than replaying every transaction from the beginning of time!

This suggests a possible spec design: if the replicated log reaches size N, instead of adding a state restriction, the outward system state gets dumped to a snapshot variable and the log is emptied. This is the same as if the Init formula specifies an empty log with a snapshot variable as an arbitrary valid system state. Thus the state space becomes finite and, with a small enough N, the path dependence becomes less of an issue.

I would say the main drawback of this design is that it limits how much different nodes can lag in terms of what log elements they know about. Often this is where a lot of interesting behavior manifests itself, when some node knows about log element N but another node has only reached N - 5 or something. However, perhaps the rule of three will save us: if this interesting behavior doesn't manifest with a log of size 3 (or at least a fairly small log size), it is unlikely to manifest in a larger log size. Maybe. Depending on the system.

Using VIEW to exclude the log seems possible but I worry it adds nondeterminism to the model check process. In the A/B vs B/A example in my original post, only one of those paths (whichever is encountered first, which is nondeterministic) would be added to the next state queue with log intact; the other would collide with its state hash and be discarded. So when that next state is evaluated, if the spec logic somehow depends on the history of the log, then there will be some model check executions where perhaps some states are not explored. I can't think of a real example off the top of my head but it seems possible.

Andrew

On Friday, October 20, 2023 at 4:54:58 PM UTC-4 Jones Martins wrote:

Hi, Andrew

I have no experience modeling append-only logs, but this scaling problem comes up when adding history variables to a spec, right?

Even though you’re modeling an append-only log, the solution for scalability, while retaining some log in the state space, might be removing event sequences that leave the state unchanged from the log. One could achieve this “solution” by optimizing the state space inside the spec itself.

Since we’re dealing with a sequence of events, symmetry sets wouldn’t really help, but memoization and data compression might (?).

The “solution” (in many quotes) from a spec:

Some spec has a log variable that represents the append-only log. Every action that updates log would check whether there’s an equivalent, but smaller, event sequence, thus compressing the log and the state space. This optimization would reduce the state space, reduce overall memory consumption but, even in parallel (thanks to TLC), might add substantial processing cost. It’s not a lossless compression, either.

Let’s say we had states <<s, log1>>, <<s, log2>>, <<s, log3>>. We know log1, log2 and log3 are equivalent to some log log4, and log4 is shorter than the other 3.
A spec that optimizes for log size would only have one state: <<s, log4>>.
A spec that applies a view <<s>> would also have only one state <<s>>, but with complete information loss.

One way to achieve this compression is by checking equivalent suffixes.
Let’s say some spec has variables log and x (an integer). Possible events are {+1, -1, div2}:

  • Event +1 adds 1 to x;
  • Event -1 subtracts 1 from x;
  • Event div2 divides x by two;

We know a few suffixes leave the system as if nothing had happened before (except time passing): { <<+1, -1>>, <<-1, +1>>, <<+1, +1, div2, -1>> }.

Suppose we’re in a state where log = <<+1>>. Event -1 happens, so we add it to the log: <<+1, -1>>. But this suffix is marked as redundant: the log would be transformed into <<>>.
Suppose we’re in a state where log = <<-1, -1, -1>>'. Event+1’ happens, so we add it to the log: <<-1, -1, -1, +1>>. But this suffix is marked as redundant: the log would be transformed into <<-1, -1>>.

The community module SequencesExt contains a IsSuffix(_) operator. Every action that changes the log would be written like AddOne:

EquivalentSuffixes == { <<"+1", "-1">>, <<"-1", "+1">>, <<"+1", "-1", "div2", "-1">> } AppendToLog(log, event) == LET newLog == log \o <<event>> IN IF \E suffix \in EquivalentSuffixes: IsSuffix(suffix, newLog) THEN log ELSE newLog AddOne == /\ x' = x + 1 /\ log' = AppendToLog(log, "+1")

By compressing log, the state space would contain a subset of possible logs, which might be a problem…

I haven’t tested any of this, by the way, but I’m curious to know if there are any other solutions.

Best,
Jones

On Thursday, 19 October 2023 at 10:42:42 UTC-3 Andrew Helwer wrote:
Since TLA+ sees use in distributed systems, it often has to be used to model append-only logs. However, logs of this sort place a path dependence on states; if event A happens before event B, and it leads to the same outward system state as if event B happened before event A, TLC will nonetheless have these as two separate states because their order was recorded in the log. This means the state tree scales in size factorially and greatly limits the viability of modeling systems beyond two, perhaps three nodes and a small depth.

I've run into this problem a few times and have never come up with a decent way of handling it. Perhaps it's possible to write your spec so the append-only log is analogous to the clock in real-time specs (ignored & excluded from the model checker view). What strategies have others developed?

Andrew

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/82b7aa83-6aee-4125-a1bd-da0b4513c3d5n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/CAFRUCtZ%2BdZHvV1RCUL44LApVhW_%3DHePUYS5ntHTSAFSrSmiMpA%40mail.gmail.com.