[squeak-dev] Question about a possible Utf8String (and Utf8Symbol)

Jakob Reschke jakres+squeak at gmail.com
Wed Mar 9 21:41:33 UTC 2022


Hi,

I think the byte-wise collation algorithm implemented in primitive 158
and primitiveCompareString cannot fully work with multi-byte
encodings:

a \x61
b \x62
c \x63
ß \xc3\x9f
ä \xc3\xa4

ä and ß both start with byte \xc3 when UTF-8-encoded.
Let the collation rules be a = ä < b < c < ß.
With the algorithm you cannot have ä < b and b < ß at the same time
because the \xc3 can only be before \x62 or after \x62, but not both.
(Setting (order at: \xc3) = (order at: \x62) is also not an option
because you could not satisfy all of b < c, ä < c and c < ß.)
Anyway, comparing ßc with ac would not work correctly because the
second byte of the ß gets compared with the letter c of the other
string.
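
To make the argument concrete, here is a minimal Python sketch of a
byte-wise comparison through a 256-entry order table. It is only a
model of the idea, not the actual primitive: the names compare_bytes,
utf8, identity and lead_as_a are made up for this illustration, and
the -1/0/1 result is a simplification.

```python
def compare_bytes(s1: bytes, s2: bytes, order: list) -> int:
    """Compare two byte strings position by position via an order table.

    Returns -1, 0 or 1. This is a simplified model for illustration,
    not the return convention of the real primitive.
    """
    for b1, b2 in zip(s1, s2):
        k1, k2 = order[b1], order[b2]
        if k1 != k2:
            return -1 if k1 < k2 else 1
    # All shared positions tie: the shorter string sorts first.
    return (len(s1) > len(s2)) - (len(s1) < len(s2))


def utf8(text: str) -> bytes:
    return text.encode("utf-8")


# Option 1: identity table, so 0xC3 ranks after 0x62 ('b').
identity = list(range(256))
print(compare_bytes(utf8("b"), utf8("ß"), identity))  # -1: b < ß, as wanted
print(compare_bytes(utf8("ä"), utf8("b"), identity))  # 1: ä > b, but we wanted ä < b

# Option 2: rank the lead byte 0xC3 like 'a' (0x61), so it sorts before 'b'.
lead_as_a = list(identity)
lead_as_a[0xC3] = identity[0x61]
print(compare_bytes(utf8("ä"), utf8("b"), lead_as_a))  # -1: ä < b, as wanted
print(compare_bytes(utf8("c"), utf8("ß"), lead_as_a))  # 1: c > ß, but we wanted c < ß

# ßc vs. ac: the lead byte ties with 'a', so the continuation byte 0x9F
# of ß is compared against the letter 'c' (0x63) and decides the result.
print(compare_bytes(utf8("ßc"), utf8("ac"), lead_as_a))  # 1
```

Whichever rank 0xC3 gets, one of the two desired relations breaks,
because ä and ß share that lead byte.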

Unrelated to the encoding, the algorithm also cannot account for
collation rules involving more than one character, like ß = ss.
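
As a rough illustration of why a one-entry-per-element table is not
enough here, the following Python sketch expands characters into key
strings before comparing. The EXPANSIONS table and the function names
are hypothetical toy rules for this example only; this is not the
Unicode Collation Algorithm and not anything that exists in Squeak.

```python
# Hypothetical toy rules: one character may expand to several key characters.
EXPANSIONS = {"ß": "ss", "ä": "a"}


def collation_key(text: str) -> str:
    """Build a comparison key by expanding each character (toy rules only)."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


def collated_equal(a: str, b: str) -> bool:
    """Two strings collate as equal when their expanded keys match."""
    return collation_key(a) == collation_key(b)


print(collation_key("Straße"))        # Strasse
print(collated_equal("Maß", "Mass"))  # True
print(collated_equal("Maß", "Mast"))  # False
```

The key for ß is two characters long, which a table that maps each
byte (or each character) to a single rank cannot express.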

Kind regards,
Jakob

On Wed, Mar 9, 2022 at 17:27, Marcel Taeumel
<marcel.taeumel at hpi.de> wrote:
>
> Hi there --
>
> Has anybody already put some thought into the pros and cons of having a Utf8String (and Utf8Symbol) regarding:
>
> - #compareWith:collated: (primitive 158)
> - #compare:with:collated: (#primitiveCompareString in #MiscPrimitivePlugin)
> - #at:(put:) (primitives 63 and 64)
>
> Our current distinction between ByteString and WideString allows for fast #at:(put:) access but might be slow when comparing ByteString with WideString.
>
> Some external libraries (e.g. via FFI) might even expect UTF-8 encoded strings as input. At the moment, we would have to put Utf8TextConverter to work explicitly.
>
> No call for action here. Just a survey. :-)
>
> Best,
> Marcel
>

