[squeak-dev] Question about a possible Utf8String (and Utf8Symbol)

Marcel Taeumel marcel.taeumel at hpi.de
Thu Mar 10 09:06:57 UTC 2022


Hi all --

Interesting recent share via the Pharo mailing list:
https://github.com/svenvc/UTF8String


It's good to see that one can implement the concept in Smalltalk without needing extra VM support. Then again, our UTF8TextConverter does this as well. :-)

I think an implementation of Utf8String should subclass String (instead of Object), as it would be a third encoding alongside the two existing ones, namely ByteString and WideString.

Here are some thoughts on having Utf8String in general:

Pro:
+ The concept of "strings are collections of characters" can still hold
+ A special FFI type for such UTF-8 strings can help write more robust code

Con:
- User data beyond the 21 bits used for Unicode code points would not be possible (e.g. #leadingChar)
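
To illustrate the "Con" item, here is a minimal Python sketch (not Squeak code). It assumes that extra user data is packed into bits above the code point; TAG_SHIFT and the tag value 1 are hypothetical and not taken from the Squeak sources. Such a tagged value is no longer a valid Unicode code point, so a UTF-8 encoder has nowhere to put the extra bits.

```python
# Sketch only: TAG_SHIFT and the tag value are illustrative assumptions,
# not taken from the Squeak sources.
TAG_SHIFT = 22
MAX_CODE_POINT = 0x10FFFF  # largest Unicode code point; it fits in 21 bits


def can_encode_utf8(value: int) -> bool:
    """Return True if value is something a UTF-8 encoder can represent."""
    try:
        chr(value).encode("utf-8")
        return True
    except (ValueError, OverflowError, UnicodeEncodeError):
        return False


plain = ord("ä")                    # an ordinary code point
tagged = (1 << TAG_SHIFT) | plain   # same code point plus hypothetical user data

print(can_encode_utf8(plain))
print(can_encode_utf8(tagged))
```

The tagged value would have to be stripped of its user data before encoding, which is the loss described above.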

Best,

Marcel

On 09.03.2022 22:41:54, Jakob Reschke <jakres+squeak at gmail.com> wrote:
Hi,

I think the byte-wise collation algorithm implemented in primitive 158
and primitiveCompareString cannot fully work with multi-byte
encodings:

a \x61
b \x62
c \x63
ß \xc3\x9f
ä \xc3\xa4

ä and ß both start with byte \xc3 when UTF-8-encoded.
Let the collation rules be a = ä < b < c < ß.
With the algorithm you cannot have ä < b and b < ß at the same time
because the \xc3 can only be before \x62 or after \x62, but not both.
(Setting (order at: \xc3) = (order at: \x62) is also not an option
because you could not satisfy all of b < c, ä < c and c < ß.)
Anyway, comparing ßc with ac would not work correctly because the
second byte of the ß gets compared with the letter c of the other
string.
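
The argument about \xc3 can be checked with a small Python sketch. The function compare_collated is a hypothetical stand-in for a byte-wise comparison through a 256-entry order table; it is only assumed to resemble what the primitives do and is not a port of them. It tries the three possible positions of \xc3 relative to \x62 and records how ä and ß each compare with b.

```python
def compare_collated(a: bytes, b: bytes, order: list[int]) -> int:
    """Compare byte by byte through an order table; return -1, 0 or 1."""
    for x, y in zip(a, b):
        if order[x] != order[y]:
            return -1 if order[x] < order[y] else 1
    # Common prefix is equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


A_UMLAUT = "ä".encode("utf-8")  # b"\xc3\xa4"
SHARP_S = "ß".encode("utf-8")   # b"\xc3\x9f"
LETTER_B = b"b"                 # b"\x62"


def try_rank(rank_of_c3: int) -> tuple[int, int]:
    """Give \\xc3 the given rank and compare ä and ß against b."""
    order = list(range(256))
    order[0xC3] = rank_of_c3
    return (
        compare_collated(A_UMLAUT, LETTER_B, order),
        compare_collated(SHARP_S, LETTER_B, order),
    )


results = {
    "before": try_rank(0x61),  # \xc3 ranked below \x62
    "equal": try_rank(0x62),   # \xc3 ranked the same as \x62
    "after": try_rank(0x63),   # \xc3 ranked above \x62
}

# The wanted outcome would be (-1, 1): ä < b and b < ß.
wanted_reachable = any(r == (-1, 1) for r in results.values())
print(results, wanted_reachable)
```

In all three cases both comparisons come out with the same sign, because the decision is already made at the shared lead byte (or, in the "equal" case, by the length difference).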

Unrelated to the encoding, the algorithm also cannot account for
collation rules involving more than one character, like ß = ss.
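
A rule like ß = ss needs a step that maps one character to several before comparing. The following Python sketch shows the idea with a one-entry expansion table; EXPANSIONS and collation_key are hypothetical names made up for this example, and real collation data is far larger.

```python
# Hypothetical one-entry expansion table, for illustration only.
EXPANSIONS = {"ß": "ss"}


def collation_key(text: str) -> str:
    """Expand each character before comparison (one character may become several)."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


def per_character_equal(a: str, b: str) -> bool:
    """Compare strictly one character against one character."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


print(per_character_equal("straße", "strasse"))
print(collation_key("straße") == collation_key("strasse"))
```

A table indexed by single characters (or single bytes) has no way to express this, whatever the encoding.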

Kind regards,
Jakob

On Wed, 9 March 2022 at 17:27, Marcel Taeumel wrote:
>
> Hi there --
>
> Did somebody already put some thought into the pros and cons of having a Utf8String (and Utf8Symbol) regarding:
>
> - #compareWith:collated: (primitive 158)
> - #compare:with:collated: (#primitiveCompareString in #MiscPrimitivePlugin)
> - #at:(put:) (primitives 63 and 64)
>
> Our current distinction between ByteString and WideString allows for fast #at:(put:) access but might be slow when comparing ByteString with WideString.
>
> Some external libraries (e.g. via FFI) might even expect UTF-8-encoded strings as input. At the moment, we would have to put Utf8TextConverter to work explicitly.
>
> No call for action here. Just a survey. :-)
>
> Best,
> Marcel
>
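
On the #at:(put:) point in the quoted message: with a variable-width encoding, finding the n-th character means scanning from the start, whereas ByteString and WideString can index directly. A minimal Python sketch of such a scan follows. The name utf8_char_at is made up for this example, the index is 0-based (unlike Smalltalk), and validation of continuation bytes is left out.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character (0-based) of UTF-8 data by linear scan."""
    if index < 0:
        raise IndexError(index)
    pos = 0   # byte offset of the current character
    seen = 0  # number of characters skipped so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells how many bytes the character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError(index)


text = "aäßb€"
data = text.encode("utf-8")
print([utf8_char_at(data, i) for i in range(len(text))])
```

The cost grows with the index, so an implementation would probably want some kind of index cache to keep repeated access cheap; that part is not sketched here.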
