[Vm-dev] Re: [Pharo-dev] [squeak-dev] Re: [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: Unicode Support))

Wed Dec 16 14:46:22 UTC 2015

Hi Hannes,

> On Dec 16, 2015, at 2:22 AM, H. Hirzel <hannes.hirzel at gmail.com> wrote:
> 
> 
>> On 12/16/15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>> Hi Todd,
>> 
>>> On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard <tblanchard at mac.com> wrote:
>>> 
>>> Hi Eliot,
>>> 
>>> On Dec 15, 2015, at 13:46, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>> 
>>> Just so you know, I will dig my heels in as deeply as I am able to
>>> prevent the use of C++ libraries in the VM.  It destroys the simulator,
>>> which is the most important thing we have for VM development
>>> productivity.  As far as I'm concerned any use of external libraries to
>>> implement core functionality kills the VM-in-Smalltalk concept that
>>> Squeak (and Pharo) are built upon.
>>> 
>>> 
>>> OK, I defer to you because you certainly know more about the VM internals
>>> and what does and doesn't work well than anyone else.
>>> 
>>> So I guess I would like to know your recommendation for 1) how best to
>>> store strings - byte arrays (UTF8), - 2-byte word arrays (UTF16 - now we
>>> get to worry about endian).
>> 
>> Raw Unicode, either as 8-bit, 16-bit or 32-bit.  When creating a String it
>> should start as an 8-bit-per-Unicode-character string.  Attempts to store
>> Character values that won't fit cause the String to become a String whose
>> element size is large enough to accommodate the character.
> 
> This is the case: see tests here http://wiki.squeak.org/squeak/6316
> 
>> In Spur, become: is cheap, so this growth pays only for the reallocation
>> and copying of the data, not for an expensive heap scan necessary to do
>> the become:.
> 
> Could you elaborate on this please?

Well, there are two presentations online, one at ESUG 2014, and one at Splash 2015, a paper at ISMM 2015, and a blog post if you want a full account, but...

Spur supports lazy become via forwarders.  When an object is becommed, it is turned into a forwarder to the object it becomes, and when a pair of objects are becommed each is copied and the two originals turned into forwarders to the opposite copy.  Immediately after the forwarding the receiver in each stack frame in the stack zone is scanned to follow forwarders so that no read barrier is needed when accessing inst vars.

Forwarders have a unique class index so any message send to a forwarder will fail. The check for sends to forwarders needs to be done only just before a full lookup.  Forwarders also look different from other objects (they have a unique format field in their object header) so any primitive that encounters a forwarder among its operands will fail (since primitives validate their arguments).  So on primitive failure the VM checks for forwarders among the operands and, if any are found, follows them and retries the primitive.
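
To make the forwarding and retry-on-failure steps concrete, here is a minimal sketch in Python rather than VM code.  Everything in it is a toy model: the Obj class, the FORWARDER_CLASS_INDEX value and the function names are hypothetical and are not taken from the Spur sources, and the stack zone scan is left out entirely.

```python
class Obj:
    """A toy heap object: a class index plus a list of slots (inst vars)."""

    # Hypothetical value; all that matters is that no real class uses it.
    FORWARDER_CLASS_INDEX = 8

    def __init__(self, class_index, slots=()):
        self.class_index = class_index
        self.slots = list(slots)

    def is_forwarder(self):
        return self.class_index == Obj.FORWARDER_CLASS_INDEX


def follow(value):
    """Chase forwarders until a non-forwarder is reached."""
    while isinstance(value, Obj) and value.is_forwarder():
        value = value.slots[0]
    return value


def become_forward(old, new):
    """One-way become: turn `old` into a forwarder to `new`."""
    old.class_index = Obj.FORWARDER_CLASS_INDEX
    old.slots = [new]


def become(a, b):
    """Two-way become: copy both, then forward each original to the
    opposite copy."""
    copy_a = Obj(a.class_index, a.slots)
    copy_b = Obj(b.class_index, b.slots)
    become_forward(a, copy_b)
    become_forward(b, copy_a)


class PrimitiveFailed(Exception):
    pass


def prim_slot_at(receiver, index):
    """A toy primitive.  It validates its operands, so a forwarder
    receiver makes it fail without any explicit read barrier."""
    if receiver.is_forwarder() or not 0 <= index < len(receiver.slots):
        raise PrimitiveFailed()
    return receiver.slots[index]


def call_primitive(prim, *operands):
    """On failure, follow any forwarders among the operands and retry once."""
    try:
        return prim(*operands)
    except PrimitiveFailed:
        followed = [follow(o) for o in operands]
        if all(f is o for f, o in zip(followed, operands)):
            raise  # no forwarders involved: a genuine failure
        return prim(*followed)
```

The point of the sketch is that the forwarder check lives only on the failure path, so the common case pays nothing for it.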

There are read barriers in some operations, hence I call the scheme a partial read barrier.  The main cost is the stack zone scan, which must be performed for pointer objects (since only these can have inst vars).  This takes very little time on current machinery.  Becoming bit objects like strings is faster because no scan is needed.
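
To tie this back to the string case earlier in the thread, the sketch below shows a string that starts at one byte per character and widens itself when a larger character is stored.  It is written in Python purely for illustration; the WideningString class and its method names are hypothetical, and swapping in the wider buffer stands in for what become: would do in the image.

```python
class WideningString:
    """Fixed-width code point storage that starts at 1 byte per element
    and grows to 2 or 4 bytes per element when a store requires it."""

    def __init__(self, text=""):
        self.width = 1
        self.data = bytearray(len(text))
        for i, ch in enumerate(text):
            self.put(i, ch)

    def __len__(self):
        return len(self.data) // self.width

    def at(self, index):
        """O(1) access to the n'th character: every element has the same width."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.width
        raw = self.data[start:start + self.width]
        return chr(int.from_bytes(raw, "little"))

    def put(self, index, ch):
        if not 0 <= index < len(self):
            raise IndexError(index)
        code = ord(ch)
        needed = 1 if code <= 0xFF else 2 if code <= 0xFFFF else 4
        if needed > self.width:
            self._widen(needed)
        start = index * self.width
        self.data[start:start + self.width] = code.to_bytes(self.width, "little")

    def _widen(self, new_width):
        """Reallocate and copy.  Replacing self.data plays the role of become:."""
        count = len(self)
        wider = bytearray(count * new_width)
        for i in range(count):
            old = self.data[i * self.width:(i + 1) * self.width]
            # Little-endian layout (an arbitrary choice for this sketch):
            # the extra high bytes stay zero.
            wider[i * new_width:i * new_width + len(old)] = old
        self.data = wider
        self.width = new_width
```

For example, WideningString("abc") stays at one byte per element until something like put(2, "€") forces two bytes per element; the only cost of the growth is the reallocation and copy in _widen.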

>>> Bearing in mind that both representations are variable length and so
>>> while accessing the n'th byte/word is O(1), accessing the n'th character
>>> is necessarily O(n) unless you know you have no surrogates in your string.
>> 
>> Right, so UTF-8 and UTF-16 are not convenient representations and to be
>> provided only for interchange.
> 
> +1
> 
> Squeak/Pharo uses 8bit/32bit (UTF-32) internally and UTF-8 externally.
> There are converters for UTF-8 and UTF-16.
> 
>>> Also...since NSString has been mentioned...it is worth noting that
>>> NSString is built atop CFString (source code here:
>>> https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which
>>> does a fair job of optimizing memory by using bytes where it can and
>>> shorts where it cannot.  It is also worth noting that characterAt:
>>> actually does the wrong thing, since it assumes characters are no bigger
>>> than FFFF rather than 10FFFF.
>> 
>> Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages.  If one
>> has become: it is very easy to manage.  Now with Spur not only do we have
>> become:, we have a fairly fast become:.
>> 
>> Does this make sense?
>> 
>> 
>>> Also...I'll just toss in this very nice article on unicode and how
>>> NSString deals with it.
>>> https://www.objc.io/issues/9-strings/unicode/
>>> 
>>> -Todd Blanchard
>> 
>> _,,,^..^,,,_
>> best, Eliot