Re: [tlaplus] UTF-8

On Sunday, 18 October 2020 at 09:18:01 UTC+2 Stephan Merz wrote:

Hello,

I understand the idea but your definitions do not express what you think they do. (Did you at all play with the Toolbox, which lets you evaluate constant expressions, to test your spec?)

You intend to express characters as sequences of bits, and sequences are indeed functions. However, in TLA+, functions are not represented as sets of pairs, so you cannot combine functions using set union or set enumeration. I recommend writing, for example,

UTF8_1B == { s \in [ 0 .. 7 -> BIT] : s[0] = 0 }

You may also use the :> and @@ operators defined in the TLC standard module for constructing elementary functions and for combining functions, as in

UTF8_B234 == { (0 :>1) @@ (1 :> 0) @@ s : s \in [ 2..7 -> BIT ] }

Follow up question is there better way to express this repeating pattern:
{[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 1], [4 -> 0]}

See above how to write what you mean using :> and @@. Alternatively, you could write

[ i \in 0..4 |-> IF i=4 THEN 0 ELSE 1 ]

And indeed, if you decided to index your bit sequences from 1 instead of from 0, you could use standard TLA+ sequences and write <<1,1,1,0>>. The definition above could similarly be written as

UTF8_B234 == { <<1,0>> @ s : s \in [ 1 .. 6 -> BIT ] }

Hope this helps,
Stephan

On 18 Oct 2020, at 02:20, Igor Kim <bios...@xxxxxxxxx> wrote:

Hi,

I'm playing with TLA+ language so not a real application.
Can I think if UTF-8 specification (ignoring endianness) as this?

LOCAL INSTANCE Naturals
NotNegNat == {i \in Nat: i >= 0} \* array index

\* BIT
BIT == {0, 1}

\* ASCII == 1 byte UTF-8 == 0XXX_XXXX
UTF8_1B == [0 -> 0] \cup [1 .. 7 -> BIT]
ASCII == UTF8_1B

\* bytes 2, 3, 4 for 2,3,4 byte UTF8 == 10XX_XXXX
UTF8_B234 == {[0 -> 1], [1 -> 0], [2 .. 7 -> BIT]}

\* 2 byte UTF8 == 110X_XXXX, 10XX_XXXX
UTF8_2B ==
[1 -> {[0 -> 1], [1 -> 1], [2 -> 0], [3 .. 7 -> BIT]}]
\cup
[2 -> UTF8_B234]

\* 3 byte UTF8 == 1110_XXXX, 10XX_XXXX, 10XX_XXXX
UTF8_3B ==
[1 -> {[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 0], [4 .. 7 -> BIT]}]
\cup
[2 .. 3 -> UTF8_B234]

\* 4 byte UTF8 == 1111_0XXX, 10XX_XXXX, 10XX_XXXX, 10XX_XXXX
UTF8_4B ==
[1 -> {[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 1], [4 -> 0], [5 .. 7 -> BIT]}]
\cup
[2 .. 4 -> UTF8_B234]

UTF8_CHAR == UTF8_1B \/ UTF8_2B \/ UTF8_3B \/ UTF8_4B

\* Sequence of UTF8 Characters
UTF8_SEQ == [NotNegNat -> UTF8_CHAR]

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/1ed97684-f9d4-4e62-bcce-8571345d0809n%40googlegroups.com.