Re: Simple Concurrent/Distributed example with an implementation

On Wednesday, 11 July 2018 03:04:10 UTC-5, Bruce Trask wrote:

Hi Hillel,

Thanks very much.

Yes. You understood my question.

I will check out those links. They look very helpful. I also got a response from Leslie that echoed some of your points.

Looks like ElasticSearch just encountered a perfect example of this:

ES has been running in production for a long time.
In March, David Turner wrote a TLA+ model and discovered a bug with ES 6.2 that nobody's yet run into
In April, they preemptively fix the bug and update the version to 6.3
Two days ago somebody on 6.2.4 reports a bizarre concurrency bug
After investigation, the source of the bug exactly matches the discovered TLA+ edge case.

I have some follow up questions for you both but want to digest your post and his response a bit before I post again.

Appreciate the help.

Regards,
Bruce

On Wednesday, July 11, 2018 at 12:23:45 AM UTC+2, Hillel Wayne wrote:

I was looking for what would amount to the world's simplest concurrent/distributed (i.e. multithreaded/multiprocess/network) design/architecture for which there was an implementation that satisfied the design but for which TLA+ is used to find a problem in the design. So it would be an end-to-end sort of look at a system that included TLA+ as a key artifact.

To my understanding, you're looking for a real-life example with specs and source code, where the bug was discovered via TLA+, right? There are a few things that make those finding resources tricky:

* Most people currently using TLA+ are using it for proprietary projects, so it's hard to find the spec or source code.
* Most people use TLA+ to either build new projects or changed known buggy ones, not necessarily to write specs looking for bugs in existing, "working" systems.
* Most TLA+ specs are for complicated projects.

Here's what I came up with. This is an example of sample code where a bug was really hard to find with testing and really easy to find in TLA+, but it was a toy example for class. That said, someone did run into an extremely similar bug in practice and provided code. Here is a case of someone specifically using TLA+ to find a bug in an existing system, but he doesn't provide the full code. Mongo wrote a simplified model of part of their database and found a bug that none of them knew about before; the code is available but very complex.

Does that help?

H