Hi Todd,
On Tue, Dec 15, 2015 at 3:46 PM, Todd Blanchard tblanchard@mac.com wrote:
Hi Eliot,
On Dec 15, 2015, at 13:46, Eliot Miranda eliot.miranda@gmail.com wrote:
Just so you know, I will dig my heels in as deeply as I am able to prevent the use of C++ libraries in the VM. It destroys the simulator, which is the most important thing we have for VM development productivity. As far as I'm concerned any use of external libraries to implement core functionality kills the VM-in-Smalltalk concept that Squeak (and Pharo) are built upon.
OK, I defer to you because you certainly know more about the VM internals and what does and doesn't work well than anyone else.
So I guess I would like to know your recommendation for 1) how best to store strings: byte arrays (UTF-8), or 2-byte word arrays (UTF-16, where we now get to worry about endianness).
Raw Unicode, either as 8-bit, 16-bit or 32-bit. When creating a String it should start as an 8-bit-per-Unicode-character string. Attempts to store Character values that won't fit cause the String to become a String whose element size is large enough to accommodate the character. In Spur, become: is cheap, so this growth pays only for the reallocation and copying of the data, not for the expensive heap scan otherwise necessary to do the become:.
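To make the widening scheme concrete, here is a rough sketch in Python rather than Smalltalk. It is only an illustration: the class and method names (AdaptiveString, at, at_put) are made up for this example, and since Python has no become:, swapping the private buffer stands in for it. In the VM the wider String would take over the identity of the narrower one.

```python
from array import array

# Hypothetical illustration: element widths follow the 8/16/32-bit progression.
# 'B' and 'H' are unsigned 8- and 16-bit; 'L' is at least 32 bits wide.
_TYPECODES = ("B", "H", "L")
_BITS = (8, 16, 32)


class AdaptiveString:
    """Fixed-length string that starts with 8-bit elements and widens on demand."""

    def __init__(self, size):
        self._level = 0
        self._data = array(_TYPECODES[0], bytes(size))

    @staticmethod
    def _level_for(code_point):
        # Smallest width level able to hold this code point.
        if code_point <= 0xFF:
            return 0
        if code_point <= 0xFFFF:
            return 1
        return 2

    @property
    def bits_per_char(self):
        return _BITS[self._level]

    def at(self, index):
        return chr(self._data[index])

    def at_put(self, index, char):
        code_point = ord(char)
        needed = self._level_for(code_point)
        if needed > self._level:
            # Stand-in for become: reallocate with wider elements and copy.
            self._data = array(_TYPECODES[needed], list(self._data))
            self._level = needed
        self._data[index] = code_point

    def __str__(self):
        return "".join(chr(cp) for cp in self._data)
```

Storing 'a' leaves the string at 8 bits per character, storing a Greek letter widens it to 16, and storing an emoji widens it to 32. Indexing stays O(1) throughout because every element has the same width.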
Bearing in mind that both representations are variable length and so while accessing the n'th byte/word is O(1), accessing the n'th character is necessarily O(n) unless you know you have no surrogates in your string.
Right, so UTF-8 and UTF-16 are not convenient representations and to be provided only for interchange.
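For illustration, a small Python sketch of why character indexing into UTF-8 is O(n). The function names are invented for this example and it assumes well-formed input. Finding the n'th character means walking every lead byte before it, because each character occupies between one and four bytes.

```python
def _sequence_length(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % lead_byte)


def utf8_char_at(data, n):
    """Return the n'th character (0-based) of well-formed UTF-8 bytes.

    The loop skips n whole sequences, so the cost grows with n.
    """
    if n < 0:
        raise IndexError("negative index")
    offset = 0
    for _ in range(n):
        if offset >= len(data):
            raise IndexError("character index out of range")
        offset += _sequence_length(data[offset])
    if offset >= len(data):
        raise IndexError("character index out of range")
    end = offset + _sequence_length(data[offset])
    return data[offset:end].decode("utf-8")
```

By contrast, data[n] is O(1) but yields the n'th byte, which may be the middle of a multi-byte character.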
Also...since NSString has been mentioned...it is worth noting that NSString is built atop CFString (source code here: https://www.opensource.apple.com/source/CF/CF-855.11/CFString.c) which does a fair job of optimizing memory by using bytes where it can and shorts where it cannot. It is also worth noting that characterAtIndex: actually does the wrong thing, since it assumes characters are no bigger than U+FFFF rather than U+10FFFF.
Yes, and Squeak (and AFAIA, Pharo) has been doing this for ages. If one has become: it is very easy to manage. Now with Spur not only do we have become:, we have a fairly fast become:.
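The point above about characters larger than FFFF can be seen with a short Python sketch. The helper names are made up for this example. A code point above U+FFFF is stored in UTF-16 as a surrogate pair, so indexing by 16-bit unit returns half a character unless the caller recombines the pair.

```python
import struct


def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (little-endian, no BOM)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def code_point_at(units, index):
    """Return the code point starting at units[index], joining a surrogate pair."""
    unit = units[index]
    if 0xD800 <= unit <= 0xDBFF and index + 1 < len(units):
        low = units[index + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


units = utf16_units("a\U0001F600")
# units[1] alone is a high surrogate, not a character;
# code_point_at(units, 1) recombines it with units[2].
```

With fixed-width 32-bit elements none of this is needed, which is the appeal of widening the whole string instead.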
Does this make sense?
Also...I'll just toss in this very nice article on Unicode and how NSString deals with it. https://www.objc.io/issues/9-strings/unicode/
-Todd Blanchard
_,,,^..^,,,_ best, Eliot