[Vm-dev] Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Wed Dec 16 10:22:37 UTC 2015

On 12/16/15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Todd,
>
> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <tblanchard at mac.com> wrote:
>
>> Hi Eliot,
>>
>> On Dec 15, 2015, at 13:46, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>
>> Just so you know, I will dig my heels in as deeply as I am able to
>> prevent the use of C++ libraries in the VM.  It destroys the simulator,
>> which is the most important thing we have for VM development
>> productivity.  As far as I'm concerned any use of external libraries to
>> implement core functionality kills the VM-in-Smalltalk concept that
>> Squeak (and Pharo) are built upon.
>>
>>
>> OK, I defer to you because you certainly know more about the VM internals
>> and what does and doesn't work well than anyone else.
>>
>> So I guess I would like to know your recommendation for 1) how best to
>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>> get to worry about endian).
>>
>
> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
> should start as an 8-bit-per-Unicode-character string.  Attempts to store
> Character values that won't fit cause the String to become a String whose
> element size is large enough to accommodate the character.

This is the case: see tests here http://wiki.squeak.org/squeak/6316

> In Spur,
> become: is cheap so this growth pays only for the reallocation and copying
> of the data, not for an expensive heap scan necessary to do the become:.

Could you elaborate on this please?

>
>
>> Bearing in mind that both representations are variable length and so
>> while accessing the n'th byte/word is O(1), accessing the n'th character
>> is necessarily O(n) unless you know you have no surrogates in your
>> string.
>>
>
> Right, so UTF-8 and UTF-16 are not convenient representations and to be
> provided only for interchange.

+1
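
To make the O(n) point concrete, here is a minimal sketch in Python (not
Squeak code). The helper names utf16_units and char_at are made up for
illustration. It shows that indexing the 16-bit units is direct, while
finding the n'th character requires a scan once surrogate pairs are
possible.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def char_at(units, n):
    """Return the n'th code point (0-based) by scanning from the start: O(n)."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        if count == n:
            return code_point
        count += 1
        i += width
    raise IndexError(n)


# 'a', U+1D11E (outside the BMP, needs a surrogate pair), 'b'
units = utf16_units("a\U0001D11Eb")
second_unit = units[1]            # a high surrogate, not a character
second_char = char_at(units, 1)   # the full code point
```

Here units has four elements for three characters, so units[1] is only
half of the second character. That is the same trap as a characterAt:
that assumes nothing is bigger than FFFF.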

Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
There are converters for UTF-8 and UTF-16.
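
As a rough illustration of the "start narrow, widen on store" idea Eliot
describes above, here is a small Python sketch. It is only a model: the
class WideningString and its methods are hypothetical, it follows the
8/16/32-bit steps from Eliot's description, and it swaps a buffer inside
a wrapper object because Python has no become:. In Squeak the object
itself would change representation.

```python
class WideningString:
    """Hypothetical model: fixed-width code point storage that widens on demand."""

    def __init__(self, length):
        self.width = 1                      # bytes per element: 1, 2 or 4
        self.data = bytearray(length)       # starts as 8 bits per character

    def __len__(self):
        return len(self.data) // self.width

    def _check(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)

    def at(self, index):
        """O(1) access: every element has the same width."""
        self._check(index)
        start = index * self.width
        return int.from_bytes(self.data[start:start + self.width], "little")

    def put(self, index, code_point):
        """Store a code point, widening the whole buffer first if it does not fit."""
        self._check(index)
        needed = 1 if code_point <= 0xFF else 2 if code_point <= 0xFFFF else 4
        if needed > self.width:
            self._widen(needed)
        start = index * self.width
        self.data[start:start + self.width] = code_point.to_bytes(self.width, "little")

    def _widen(self, new_width):
        # Reallocate and copy; this stands in for the become: step.
        old = [self.at(i) for i in range(len(self))]
        self.width = new_width
        self.data = bytearray(b"".join(cp.to_bytes(new_width, "little") for cp in old))


s = WideningString(3)
s.put(0, ord("a"))        # fits in 8 bits, no widening
s.put(1, 0x20AC)          # EURO SIGN forces 16-bit elements
s.put(2, 0x1D11E)         # outside the BMP, forces 32-bit elements
```

The cost of each widening in this model is one reallocation plus a copy
of the existing elements; indexing stays O(1) throughout because the
element width is uniform.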

>
>>
>> Also...since NSString has been mentioned...it is worth noting that
>> NSString is built atop CFString (source code here:
>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>> does a fair job of optimizing memory by using bytes where it can and
>> shorts where it cannot.  It is also worth noting that characterAt:
>> actually does the wrong thing, since it assumes characters are no bigger
>> than FFFF rather than 10FFFF.
>>
>
> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
> has become: it is very easy to manage.  Now with Spur not only do we have
> become:, we have a fairly fast become:.
>
> Does this make sense?
>
>
>> Also...I'll just toss in this very nice article on Unicode and how
>> NSString deals with it.
>> https://www.objc.io/issues/9-strings/unicode/
>>
>> -Todd Blanchard
>>
>
> _,,,^..^,,,_
> best, Eliot
>