Re: [tlaplus] Unicode

On Wednesday, March 30, 2016 at 8:19:23 PM UTC+3, Leslie Lamport wrote:

Ron,

What you are proposing to do involves modifying the core of SANY and
TLC.

That's only required for precise error reporting, and I believe this not to be too hard anyway (e.g. SANY recognizes the \cup and \union tokens as a union; would it be hard to add the token ∪, too?). In any event, I can abort the minute it becomes non-trivial. Whatever the outcome, I think any project could only benefit from another set of eyes once in a while.

expect the advantages of ASCII (for example, the ability to do serious
editing in Emacs) mean that I wouldn't use unicode symbols in my
specs.

Emacs (and vim) work very well with unicode. In fact, Emacs has modes that replace TeX escapes with their unicode equivalents as you type. Agda works like this. All of the popular editors nowadays support unicode, and Java has been unicode-only since 1996.

So, I think you should find out how many users would find
your proposal a significant improvement.

Yes. In any event, I can play around a bit to get a feel for just how hard this should be; you are always free to reject my code.

Now to get to your proposal. There seem to be two viable approaches:

This means that there is no need for the ASCII to be easy to read.

Right.

For example, it would be quite
all right for the spec displayed as follows (where & stands for the
\in symbol)

   pred(S) == \A value & S:   \/ x == 1
   (*NoteThisNiceCommentHere*)\/ x == 2

to produce this in the ASCII file

   pred(S) == \A value \in S:   \/ x == 1
   (*NoteThisNiceComment Here*)\/ x == 2

Right. But the program should still support conversion in the opposite directions (without any comment "prgamas") for people who want to migrate, and that should be easy to read. In that case, I see no other way other than to move the left comment over to the end of the column, and then let the user decide what they want to do with it.

Note that the ASCII file could have arbitrary information required to
display it encoded in special comments at the end of lines.

That's a good idea, but I don't know if I'll need it before I start studying the issue more in depth.

This would require almost no changes to the existing tools. The
Toolbox could translate locations in the ASCII file to locations in
the displayed text for displaying locations of errors reported by TLC
and of proof obligations reported by TLAPS.

Oh, right. I guess that if that turns out to be much easier, then that's the way to go.

2. Put the Unicode in the .tla files. This potentially requires
changes to code scattered throughout SANY and TLC, making it hard to
avoid introducing bugs.

Right, but it all depends on how pervasive the changes need to be. I'm optimistic, but it's an optimism based largely on ignorance of the code and familiarity with Java, which is very unicode-friendly.

One problem is in the communication between
the Toolbox and the tools. This is all done with stdio and stderr.
Do those interfaces support Unicode?

Absolutely. They just transport bytes, and need to be given an encoding (a Charset in the Java parlance) to/from strings (which in Java are always unicode). They could either be told to work with UTF-8 by passing this charset as an extra parameter, or UTF-8 could be set as the default encoding for all streams at JVM startup. The UTF-8 encoding preserves the ASCII-ness of ASCII data, and only does new things for character codes beyond 255. I.e. it is completely backwards compatible with ASCII.

In any case, a lot of Toolbox code needs to be rewritten. You will
need to find someone to do that, because neither Markus nor I will
have time to do more than answer questions.

OK.

P.S. The alignment algorithm of the pretty-printer does not guarantee
correctness.

Hmm, that's good to know.

Ron