[Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Wed Dec 16 01:35:38 UTC 2015

Hi Todd,

On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <tblanchard at mac.com> wrote:

> Hi Eliot,
>
> On Dec 15, 2015, at 13:46, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> Just so you know, I will dig my heels in as deeply as I am able to prevent
> the use of C++ libraries in the VM.  It destroys the simulator, which is
> the most important thing we have for VM development productivity.  As far
> as I'm concerned any use of external libraries to implement core
> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are
> built upon.
>
>
> OK, I defer to you because you certainly know more about the VM internals
> and what does and doesn't work well than anyone else.
>
> So I guess I would like to know your recommendation for 1) how best to
> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
> get to worry about endian).
>

Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
should start as an 8-bit-per-Unicode-character string.  Attempts to store
Character values that won't fit cause the String to become a String whose
element size is large enough to accommodate the character.  In Spur,
become: is cheap so this growth pays only for the reallocation and copying
of the data, not for an expensive heap scan necessary to do the become:.
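
To make the widening scheme concrete, here is a minimal sketch in Python
rather than Smalltalk.  It is not the Squeak or Pharo implementation; the
class name WideningString and its methods are hypothetical, and the
reallocate-and-copy step stands in for what become: would do in the image.

```python
class WideningString:
    """Fixed-width code point storage that widens on demand (hypothetical sketch)."""

    def __init__(self, text=""):
        self.width = 1              # bytes per element: 1, 2 or 4
        self._data = bytearray()
        for ch in text:
            self.append(ch)

    @staticmethod
    def _width_for(code_point):
        # Smallest element size that can hold this code point.
        if code_point <= 0xFF:
            return 1
        if code_point <= 0xFFFF:
            return 2
        return 4

    def _get(self, index):
        start = index * self.width
        return int.from_bytes(self._data[start:start + self.width], "little")

    def _widen(self, new_width):
        # Reallocate and copy every element at the new width.  In Spur the
        # new object would then replace the old one via become:.
        old = [self._get(i) for i in range(len(self))]
        self.width = new_width
        self._data = bytearray()
        for cp in old:
            self._data += cp.to_bytes(new_width, "little")

    def __len__(self):
        return len(self._data) // self.width

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return chr(self._get(index))        # O(1): fixed-width elements

    def __setitem__(self, index, ch):
        if not 0 <= index < len(self):
            raise IndexError(index)
        cp = ord(ch)
        if self._width_for(cp) > self.width:
            self._widen(self._width_for(cp))
        start = index * self.width
        self._data[start:start + self.width] = cp.to_bytes(self.width, "little")

    def append(self, ch):
        cp = ord(ch)
        if self._width_for(cp) > self.width:
            self._widen(self._width_for(cp))
        self._data += cp.to_bytes(self.width, "little")
```

A string built from "abc" stays at one byte per element; storing a euro sign
widens it to two bytes, and appending a character above FFFF widens it to
four.  Indexing remains O(1) throughout because every element has the same
size.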

> Bearing in mind that both representations are variable length and so while
> accessing the n'th byte/word is O(1), accessing the n'th character is
> necessarily O(n) unless you know you have no surrogates in your string.
>

Right, so UTF-8 and UTF-16 are not convenient internal representations and
should be provided only for interchange.
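
The O(n) cost is easy to see in a small Python sketch.  The function name
nth_code_point_utf8 is made up for illustration, and the sketch assumes
well-formed UTF-8 input; finding the n'th character means walking every
lead byte before it.

```python
def nth_code_point_utf8(data, n):
    """Return the n-th code point (0-based) of well-formed UTF-8 bytes by scanning."""
    if n < 0:
        raise IndexError(n)
    seen = 0
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        elif lead >> 3 == 0b11110:
            size = 4
        else:
            raise ValueError("unexpected continuation or invalid lead byte")
        if seen == n:
            return data[i:i + size].decode("utf-8")
        seen += 1
        i += size
    raise IndexError(n)
```

There is no way to jump straight to the byte offset of character n without
either this scan or a side index, which is why a fixed-width internal form
is more convenient.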

>
> Also...since NSString has been mentioned...it is worth noting that
> NSString is built atop CFString (source code here:
> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
> does a fair job of optimizing memory by using bytes where it can and shorts
> where it cannot.  It is also worth noting that characterAt: actually does
> the wrong thing, since it assumes characters are no bigger than FFFF rather
> than 10FFFF.
>
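
The code-unit versus code-point distinction can be sketched in Python.  This
is not CFString's code; utf16_code_units and code_point_at are hypothetical
helpers that show what a 16-bit accessor returns for a character above FFFF
and how a surrogate pair combines into one code point.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def code_point_at(units, index):
    """Decode the code point starting at a code-unit index, joining a surrogate pair."""
    unit = units[index]
    if 0xD800 <= unit <= 0xDBFF and index + 1 < len(units):
        low = units[index + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit

units = utf16_code_units("a\U0001F600")
# units[1] alone is a high surrogate, not a character;
# code_point_at(units, 1) recovers the full code point.
```

An accessor that returns units[1] directly hands back half of a pair, which
is the kind of wrong answer described above.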

Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
has become: it is very easy to manage.  Now with Spur not only do we have
become:, we have a fairly fast become:.

Does this make sense?

> Also...I'll just toss in this very nice article on unicode and how
> NSString deals with it.
> https://www.objc.io/issues/9-strings/unicode/
>
> -Todd Blanchard
>

_,,,^..^,,,_
best, Eliot