[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Simple Concurrent/Distributed example with an implementation

On Wednesday, 11 July 2018 03:04:10 UTC-5, Bruce Trask wrote:
Hi Hillel,

Thanks very much.

Yes. You understood my question.

I will check out those links. They look very helpful. I also got a response from Leslie that echoed some of your points.

Looks like ElasticSearch just encountered a perfect example of this:
  1. ES has been running in production for a long time.
  2. In March, David Turner wrote a TLA+ model and discovered a bug with ES 6.2 that nobody's yet run into
  3. In April, they preemptively fix the bug and update the version to 6.3
  4. Two days ago somebody on 6.2.4 reports a bizarre concurrency bug
  5. After investigation, the source of the bug exactly matches the discovered TLA+ edge case.
I have some follow up questions for you both but want to digest your post and his response a bit before I post again.

Appreciate the help.


On Wednesday, July 11, 2018 at 12:23:45 AM UTC+2, Hillel Wayne wrote:

I was looking for what would amount to the world's simplest concurrent/distributed (i.e. multithreaded/multiprocess/network) design/architecture for which there was an implementation that satisfied the design but for which TLA+ is used to find a problem in the design. So it would be an end-to-end sort of look at a system that included TLA+ as a key artifact.

To my understanding, you're looking for a real-life example with specs and source code, where the bug was discovered via TLA+, right? There are a few things that make those finding resources tricky:

* Most people currently using TLA+ are using it for proprietary projects, so it's hard to find the spec or source code.
* Most people use TLA+ to either build new projects or changed known buggy ones, not necessarily to write specs looking for bugs in existing, "working" systems.
* Most TLA+ specs are for complicated projects.

Here's what I came up with. This is an example of sample code where a bug was really hard to find with testing and really easy to find in TLA+, but it was a toy example for class. That said, someone did run into an extremely similar bug in practice and provided code. Here is a case of someone specifically using TLA+ to find a bug in an existing system, but he doesn't provide the full code. Mongo wrote a simplified model of part of their database and found a bug that none of them knew about before; the code is available but very complex.

Does that help?