[Vm-dev] Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Wed Dec 16 12:18:20 UTC 2015

On Wed, Dec 16, 2015 at 6:22 PM, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>
> On 12/16/15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Todd,
>>
>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <tblanchard at mac.com> wrote:
>>
>>> Hi Eliot,
>>>
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>>
>>> Just so you know, I will dig my heels in as deeply as I am able to prevent
>>> the use of C++ libraries in the VM.  It destroys the simulator, which is
>>> the most important thing we have for VM development productivity.  As far
>>> as I'm concerned any use of external libraries to implement core
>>> functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are
>>> built upon.
>>>
>>>
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>>
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>>>
>>
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
>
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
>
>> In Spur,
>> become: is cheap so this growth pays only for the reallocation and copying
>> of the data, not for an expensive heap scan necessary to do the become:.
>
> Could you elaborate on this please?


https://hal.inria.fr/hal-01152610/file/partialReadBarrier.pdf
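
To make the widening idea quoted above concrete, here is a minimal sketch in
Python rather than Smalltalk. It is not Squeak/Pharo code: the class name
WideningString and its methods are hypothetical, and the "reallocate and copy"
step only stands in for what the image would do with become:. It shows the
cost Eliot describes, which is one new allocation plus a copy when a
character does not fit.

```python
from array import array

# (nominal bits per element, array typecode, largest code point that fits)
# "L" is guaranteed to hold at least 32 bits; it may be wider on some platforms.
_WIDTHS = ((8, "B", 0xFF), (16, "H", 0xFFFF), (32, "L", 0x10FFFF))


class WideningString:
    """Hypothetical fixed-width code point store that widens on demand."""

    def __init__(self, text=""):
        self.bits, typecode, _ = _WIDTHS[0]   # always start at 8 bits
        self._data = array(typecode)
        for ch in text:
            self.append(ch)

    def _ensure_fits(self, code_point):
        # Pick the narrowest width that holds code_point and is not
        # narrower than the current one (the string never shrinks here).
        for bits, typecode, limit in _WIDTHS:
            if code_point <= limit and bits >= self.bits:
                if bits > self.bits:
                    # Reallocate and copy: the only cost of growing.
                    self._data = array(typecode, self._data.tolist())
                    self.bits = bits
                return
        raise ValueError("not a Unicode code point")

    def append(self, ch):
        self._ensure_fits(ord(ch))
        self._data.append(ord(ch))

    def at_put(self, index, ch):
        self._ensure_fits(ord(ch))
        self._data[index] = ord(ch)

    def at(self, index):
        # O(1): every element is one whole code point.
        return chr(self._data[index])

    def text(self):
        return "".join(chr(cp) for cp in self._data)


s = WideningString("abc")
s.at_put(1, "\u03bb")       # needs 16 bits
s.at_put(2, "\U0001F600")   # needs 32 bits
```

After the two stores the string has been reallocated twice, yet indexing
stays constant time because every element is a full code point.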

cheers -ben


>>
>>
>>> Bearing in mind that both representations are variable length and so while
>>> accessing the n'th byte/word is O(1), accessing the n'th character is
>>> necessarily O(n) unless you know you have no surrogates in your string.
>>>
>>
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
>
> +1
>
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
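
Todd's O(n) point about surrogates can be sketched as follows, again in Python
and only as an illustration. The helper names utf16_units and nth_code_point
are made up for this example, and the code assumes well-formed UTF-16 and a
valid index. To find the n'th character you have to walk from the start,
because any earlier unit may be the first half of a surrogate pair.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def nth_code_point(units, n):
    """Return the n'th (0-based) code point; O(n) because of the scan."""
    i = 0
    for _ in range(n):
        # A high surrogate means this character occupies two units.
        i += 2 if is_high_surrogate(units[i]) else 1
    unit = units[i]
    if is_high_surrogate(unit):
        low = units[i + 1]
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


units = utf16_units("a\U0001F600b")   # 4 code units for 3 characters
```

Here units[1] is only half of the second character, whereas
nth_code_point(units, 1) returns the whole code point U+1F600.
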
>
>
>
>>
>>>
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and shorts
>>> where it cannot.  It is also worth noting that characterAt: actually does
>>> the wrong thing, since it assumes characters are no bigger than FFFF rather
>>> than 10FFFF.
>>>
>>
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>>
>> Does this make sense?
>>
>>
>>> Also...I'll just toss in this very nice article on Unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>>
>>> -Todd Blanchard
>>>
>>
>> _,,,^..^,,,_
>> best, Eliot
>>