[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[tlaplus] Re: UTF-8



Follow up question is there better way to express this repeating pattern:

{[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 1], [4 -> 0]}

On Sunday, 18 October 2020 at 02:20:58 UTC+2 Igor Kim wrote:
Hi,

I'm playing with TLA+ language so not a real application. 
Can I think if UTF-8 specification (ignoring endianness) as this?

LOCAL INSTANCE Naturals
NotNegNat == {i \in Nat: i >= 0} \* array index

\* BIT
BIT == {0, 1}

\* ASCII == 1 byte UTF-8 == 0XXX_XXXX
UTF8_1B == [0 -> 0] \cup [1 .. 7 -> BIT]
ASCII == UTF8_1B 

\* bytes 2, 3, 4 for 2,3,4 byte UTF8 == 10XX_XXXX
UTF8_B234 == {[0 -> 1], [1 -> 0], [2 .. 7 -> BIT]}
 
\* 2 byte UTF8 == 110X_XXXX, 10XX_XXXX
UTF8_2B == 
    [1 -> {[0 -> 1], [1 -> 1], [2 -> 0], [3 .. 7 -> BIT]}] 
    \cup
    [2 -> UTF8_B234]
       
\* 3 byte UTF8 == 1110_XXXX, 10XX_XXXX, 10XX_XXXX
UTF8_3B == 
    [1 -> {[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 0], [4 .. 7 -> BIT]}] 
    \cup
    [2 .. 3 -> UTF8_B234]
       
\* 4 byte UTF8 == 1111_0XXX, 10XX_XXXX, 10XX_XXXX, 10XX_XXXX
UTF8_4B == 
    [1 -> {[0 -> 1], [1 -> 1], [2 -> 1], [3 -> 1], [4 -> 0], [5 .. 7 -> BIT]}] 
    \cup
    [2 .. 4 -> UTF8_B234]
    
UTF8_CHAR == UTF8_1B \/ UTF8_2B \/ UTF8_3B \/ UTF8_4B

\* Sequence of UTF8 Characters
UTF8_SEQ == [NotNegNat -> UTF8_CHAR]

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion on the web visit https://groups.google.com/d/msgid/tlaplus/96be84c4-b1ce-450d-ae72-e8e2c15269b0n%40googlegroups.com.