[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlaplus] Spec Review Request



Hi Jamie,

    >>we get some guarantees of atomicity from the underlying tools (Phoenix/Elixir/Erlang) but questioning those assumptions (and our understanding of those assumptions) would be a good exercise.

Explicitly capturing the assumptions is a very important part of the process.   For example, I too built several systems using Erlang (at Amazon from 2006 to 2010; I can't reveal which systems), so I have some (albeit dated) knowledge of incorrect assumptions about that environment.  For example, if you are using the send/receive operators of Distributed Erlang (not just local/send receive), then it's worth reading this paper:
   A More Accurate Semantics for Distributed Erlang [2007]

The paper starts by honestly admitting that "Since ... Erlang starts in industry, and not in a university, Erlang is very much defined in terms of its implementation".  It then describes the drawbacks of lack of a formal specification for Erlang (quote: "doing verification without formal semantics is very hard"), and some of the consequences.  

The consequences boil down to application authors making incorrect assumptions.  As an example, the authors cite two other papers that show how incorrect assumptions about handling of network partitions were the cause of 2 serious bugs in the original leader-election algorithm in Erlang's standard library.  (Sidebar: IIRC, that algorithm was written by members of the Erlang engineering team or researchers with close links to the team -- i.e. people with the best knowledge of the implementation of Erlang.  However, it's been 10+ years since I read those papers, so I might be misremembering.)

The bulk of the above paper is a definition of formal semantics for Distributed Erlang.  The definition is in the form of "small-step operational semantics and transition rules".  I haven't studied operational semantics, but I suspect that most engineers would struggle with this notation, and so find it very difficult to fully understand and internalize the semantics.  My suspicion seems to be supported by the fact that the paper says this is the third attempt to define the semantics of Erlang (hence the name of the paper 'A more accurate semantics ...').  The authors mention that their own earlier attempt was 'incomplete'.  I put 'incomplete' in quotes as I think it simply means incorrect, because the authors say, "With incomplete we mean that there are possible behaviors in the Erlang run-time system, which can not be described by the semantics." 

I suspect that the semantics would be much more understandable (and so likely more accurate), and more useful to application engineers if it was given in the form of a TLA+ specification, but I've never had time or inclination to do the translation. 

Finally, I'll point out that even the above semantics is not complete.  E.g. I once ran into an issue with a production Erlang system that was caused by the fact that in Distributed Erlang, process IDs contain a hidden 'creation count', which is like a generation number of the node.  But the generation number only has range 1..3, so can cycle very quickly if nodes are spin-crashing or the epmd process (Erlang Port Mapper Daemon) is crashing[1].  So in some circumstances, process ids in Erlang are much more likely to be re-used than application authors realize (back then most Erlang programmers assume that the chance of pid re-use was negligible.)  This is another common bad assumption about "the environment", and could be caught with a specification at an appropriate level of abstraction.

I also hit various other surprising semantic issues when implementing the Erlang port driver for Berkeley DB[2], but I've thankfully forgotten all of the details.

It's that kind of craziness with concurrent and distributed systems --when the consequences can really matter to customers-- that led me to using TLA+.

cheers,
Chris

[1] https://stackoverflow.com/questions/243363/can-someone-explain-the-structure-of-a-pid-in-erlang

Also: a quote from an Erlang VM engineer:     http://erlang.2086793.n4.nabble.com/PID-recycling-td2108252.html#a2108255

The incarnation aka "Creation" is stored in EPMD, the Erlang 
Port Mapper Daemon, that is external to all nodes. 
It cycles 1,2,3 then round to 1 again. You can not change it. 
It is platform independent. An undistributed node has creation 0. 
Creation 0 is also used for dirty tricks having the meaning 
of the current creation. 

We have had plans to change it from 2 bits to 32 but it would 
hit hard on erl_interface and applications using it. 
 --  Raimo Niskanen, Erlang/OTP, Ericsson AB 
_


[2] http://erlang.org/pipermail/erlang-questions/2006-November/024030.html

cheers,
Chris






On Tue, Feb 13, 2018 at 3:13 PM, Jamie White <ja...@xxxxxxxxxxxxx> wrote:
Hi Chris

Thank you so much for your message. It’s a treasure trove of knowledge and valuable advice.

Your assumptions are pretty much spot on, but allow me to clarify the intent a little further:

The spec in its current state is meant to capture the essence of the system, in order to then demonstrate the process of specifying safety and liveness properties, and finding bugs with TLC.

My idea is that I can sit down with other Herokai who’ve not yet seen TLA+/TLC in action, and work through adding invariants such as “the playhead should never jump backwards” or “a race of skip messages should not cause the server to miss out tracks”. No doubt there are other invariants I’ve yet to think of that my colleagues will suggest right away.

So it’s a sort of “jumping off point” for demonstrating the practical applications of TLA+ to the folks here.

Modelling duplicated and reordered messages would indeed be an interesting extension. In its current environment, we get some guarantees of atomicity from the underlying tools (Phoenix/Elixir/Erlang) but questioning those assumptions (and our understanding of those assumptions) would be a good exercise.

Finally, I’ll confess that though this is the first spec I’ve written, I have read through a good chunk of Specifying Systems and avidly followed the video course… so i can’t claim to have learnt TLA+ in the space of a week :)

Thanks again. I’ll keep y’all updated on Heroku’s usage of TLA+.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@googlegroups.com.
To post to this group, send email to tla...@xxxxxxxxxxxxxxxx.
Visit this group at https://groups.google.com/group/tlaplus.
For more options, visit https://groups.google.com/d/optout.