Experimenting with PlusCal / TLA+ at Dropbox

Hi all,

I work at Dropbox, and we recently had a "Hack Week" (i.e. a week to work on anything we want); I chose to spend the week experimenting with TLA+ / PlusCal for specifying components of our production systems.

Anyway, I just wanted to share the results of my week with this list. I ended up using PlusCal to specify one of our protocols for maintaining strong consistency between a cache and the underlying database. I've attached a recent version of the spec; here are a few additional remarks:

I actually spent the majority of my time understanding the protocol (i.e. tracking down and talking to the authors, reading design docs, reading code); the process of writing PlusCal went really quickly and smoothly.
The model checker did find an issue, but it was already a known and tracked issue (something can go wrong if a writer takes a really long time to commit a database transaction while the cache is being restarted).
I had to keep the model's parameters really small, or face a tremendous combinatorial explosion of states (though I'm not ruling out the possibility that I could've formulated the spec in a different way that's more conducive to model checking).

Overall, I thought this was a really interesting and worthwhile exercise. I definitely plan to write a PlusCal or TLA+ spec next time I get the opportunity to design a complicated component involving lots of concurrency or subtle state transitions. I also think I've managed to generate some interest among my coworkers; if I'm lucky, I'll be able to get some of them to add these great tools to their toolbox as well.

Thanks for reading; I'd love to hear from you if you have any questions or suggestions!

Regards,

Elliott