[squeak-dev] The Inbox: Collections-ct.851.mcz

Levente Uzonyi leves at caesar.elte.hu
Fri Aug 16 11:41:22 UTC 2019

On Fri, 16 Aug 2019, Jakob Reschke wrote:

> On Fri, 16 Aug 2019 at 01:24, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>       On Thu, 15 Aug 2019, Thiede, Christoph wrote:
>       > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.
>       IMO, it opens a can of worms:
>       - WideStrings use 4x as much memory as ByteStrings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them.
>       - WideStrings spread like plague:
>               - Wrote a WideString into a stream? your stream's buffer is now a WideString.
>               - Did some operation with a WideString, e.g. #,? The result is very likely a WideString.
>       - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine".
> Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-)

I guess you missed my point. You do not want to match Unicode digits when
you write [0-9], but the Unicode character may visually appear as a
regular digit, making it harder to debug your code.
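
A minimal sketch of the mismatch described above. It uses Python rather than Squeak, purely as an assumption for illustration, and the helper names has_ascii_digit and has_unicode_digit are hypothetical. It shows a character class [0-9] failing on a string that visibly contains a nine, while a Unicode-aware digit test accepts the same string.

```python
import re
import unicodedata

# Assumption: Python stands in for Squeak here; all names are illustrative.
ASCII_NINE = "9"
SANS_NINE = unicodedata.lookup("MATHEMATICAL SANS-SERIF DIGIT NINE")


def has_ascii_digit(text):
    """True if text contains one of the ten ASCII digits 0-9."""
    return re.search(r"[0-9]", text) is not None


def has_unicode_digit(text):
    """True if text contains any Unicode decimal digit (\\d on a str pattern)."""
    return re.search(r"\d", text) is not None


plain = "room " + ASCII_NINE
fancy = "room " + SANS_NINE  # renders much like "room 9" in many fonts

print(has_ascii_digit(plain), has_ascii_digit(fancy))
print(has_unicode_digit(plain), has_unicode_digit(fancy))
```

The two strings look alike on screen, yet only the Unicode-aware check treats both as containing a digit, which is the debugging trap in question.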

> In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it?

It depends on how you process those numbers.

> The question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by the first formatting argument, or do they not expect it?
> Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must now also support such Unicode cases.

I personally see little value in having 63 ways to write a single 
digit in my Smalltalk method.
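
The placeholder question quoted above can be sketched as follows. This is Python, not Squeak, and format_ascii and format_unicode are hypothetical helpers, not the code under review. The sketch assumes 1-based {n} placeholders and compares a formatter that accepts only ASCII indices with one that accepts any Unicode decimal digit.

```python
import re
import unicodedata

# Assumption: a toy formatter with 1-based {n} placeholders; not Squeak's code.
SANS_ONE = unicodedata.lookup("MATHEMATICAL SANS-SERIF DIGIT ONE")


def _substitute(pattern, template, args):
    # int() accepts any Unicode decimal digit, so it handles both patterns.
    return re.sub(pattern, lambda m: str(args[int(m.group(1)) - 1]), template)


def format_ascii(template, args):
    """Replace {n} only when n is written with ASCII digits."""
    return _substitute(r"\{([0-9]+)\}", template, args)


def format_unicode(template, args):
    """Replace {n} when n is written with any Unicode decimal digits."""
    return _substitute(r"\{(\d+)\}", template, args)


template = "value: {" + SANS_ONE + "}"
print(format_ascii(template, ["x"]))
print(format_unicode(template, ["x"]))
```

With the ASCII-only variant the braces survive untouched; with the Unicode-aware variant the same text becomes a live placeholder, so any sanitizer that escapes only ASCII-digit placeholders would miss it.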


