Fwd: [Pharo-dev] [squeak-dev] Unicode Support

Sun Dec 6 17:04:47 UTC 2015

---------- Forwarded message ----------
From: Todd Blanchard <tblanchard at mac.com>
Date: Sun, 06 Dec 2015 08:37:12 -0800
Subject: Re: [Pharo-dev] [squeak-dev] Unicode Support
To: Pharo Development List <pharo-dev at lists.pharo.org>

(Resent because of a bounce notification (email handling in OS X is
really beginning to annoy me).  Sorry if it's a dup.)

I used to worry a lot about strings being indexable.  And then I
eventually let go of that and realized that it isn't a particularly
important property for them to have.

I think you will find that UTF-8 is generally the most convenient for a
lot of things, but it's a bit like light in that you treat it
alternately as a wave or a particle depending on what you are trying to
do.

So it goes with strings - they can be treated alternately as streams or
byte arrays (not character arrays - stop thinking in characters).  In
practice, this tends not to be a problem, since much of the time when
you want to replace a character or pick out the nth one you are doing
something very computerish and the characters you are working with are
the single-byte (ASCII legacy) variety.  UTF-8 never reuses those byte
values inside a multi-byte sequence, so working on them at the byte
level is safe.  You generally know when you can get away with that and
when you can't.
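
As a rough illustration of the "computerish" case, here is a minimal
Python sketch (Python bytes stand in for a UTF-8 string; the sample text
and variable names are made up for the example).  It splits and replaces
on an ASCII delimiter purely at the byte level:

```python
# Assumption: Python 3 bytes stand in for a UTF-8 encoded string.
text = "naïve café,日本語,end".encode("utf-8")

# Split on an ASCII delimiter at the byte level.  Every byte of a
# multi-byte UTF-8 sequence is >= 0x80, so the byte 0x2C (",") can only
# ever be a real comma, never part of another character.
fields = text.split(b",")

# Replace one ASCII byte with another; the byte length does not change.
dashed = text.replace(b",", b";")

# Decode only at the edges, when actual characters are needed.
decoded_fields = [f.decode("utf-8") for f in fields]
```

Nothing here needed to know where the nth character starts; the
non-ASCII parts pass through untouched.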

Otherwise you are most likely doing things that are best dealt with in
a streaming paradigm.  For most computation, you come to realize that
you generally don't care how many characters you have but how much
space (bytes) you need to store your chunk of text.  Collation is
tricky and complicated in Unicode in general, but it isn't any worse in
UTF-8 than in any other encoding.  You are still going to scan each
sortable item from front to back to determine its order, regardless.
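
A small Python sketch of the points above, under the assumption that
Python's standard codecs module is an acceptable stand-in for whatever
stream classes you actually use.  It shows character count versus byte
count, decoding a stream in arbitrary chunks, and a front-to-back byte
comparison.  Note that the sort shown is plain code point order, not
real Unicode collation:

```python
import codecs

text = "Zürich, 東京, 🙂"
encoded = text.encode("utf-8")

char_count = len(text)      # number of code points
byte_count = len(encoded)   # storage actually needed

# Streaming: feed the bytes in arbitrary 4-byte chunks.  The incremental
# decoder holds back incomplete multi-byte sequences until the rest
# arrives, so chunk boundaries need not align with characters.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]
streamed = "".join(decoder.decode(chunk) for chunk in chunks)
streamed += decoder.decode(b"", final=True)

# Ordering: comparing UTF-8 bytes front to back gives the same order as
# comparing code points.  (Neither is locale-aware collation.)
words = ["zebra", "Ärger", "apple", "東京"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_points = sorted(words)
```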

Most of the outside world has settled on UTF-8, and any ASCII file is
already valid UTF-8 - which is why it ends up being so convenient.  Most
of our old text handling infrastructure can still handle UTF-8, while
it tends to choke on wider encodings.
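
To make that concrete, a minimal Python sketch (the sample string is
arbitrary, and UTF-16 is picked here only as one example of a wider
encoding).  ASCII text encodes to the same bytes in UTF-8, while the
UTF-16 form is full of zero bytes, which is what tends to trip up old
byte-oriented tools that treat NUL as end-of-string:

```python
sample = "plain old text\n"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")   # little-endian, no BOM

# ASCII and UTF-8 produce identical bytes for ASCII-only text.
same_bytes = ascii_bytes == utf8_bytes

# UTF-8 output contains no NUL bytes here; UTF-16 pads every ASCII
# character with one.
nul_in_utf8 = b"\x00" in utf8_bytes
nul_in_utf16 = b"\x00" in utf16_bytes
```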

-Todd Blanchard

> On Dec 6, 2015, at 07:23, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>
>> We do the same thing, but that doesn't mean it's a good idea to create a
>> new String-like class having its content encoded in UTF-8, because
>> UTF-8-encoded strings can't be modified like regular strings. While it
>> would be possible to implement all operations, such implementation would
>> become the next SortedCollection (bad performance due to misuse).