At Thu, 7 May 2009 11:37:10 -0700, Eliot Miranda wrote:
Yes, among these choices, my vote would be for UTF-32 (for the 21-bit code space). But variable-length-ness doesn't really go away even when using UTF-32, as there are composition characters. Alternatively, we could go all UTF-8 for the in-image representation of Strings (as a data buffer), and when you need a Character, create an instance, or return the one in a table, that is in UTF-32. And on the image side, a displayable "String" should (almost) always be accompanied by attributes, like Text.
I'm a bit out of my depth here. I would have thought that you would want the basic string types to be fixed width for fast access, simply because variable length doesn't scale to e.g. indexing 1 megabyte strings, but that for the platform interface one would want efficient conversion to/from fixed and variable length encodings. But that's just my gut. I expect I'll implement whatever y'all say makes sense.
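To make the indexing concern concrete, here is a minimal Python sketch (not Squeak code; the function names are hypothetical and the buffers are assumed to hold well-formed data). With UTF-32 the n-th code point sits at a computed offset, while with UTF-8 the buffer has to be scanned from the start:

```python
def utf32_code_point_at(buf: bytes, index: int) -> int:
    """Fixed width: the code point is at byte offset 4 * index (little-endian assumed)."""
    return int.from_bytes(buf[4 * index:4 * index + 4], "little")


def _utf8_sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_code_point_at(buf: bytes, index: int) -> int:
    """Variable width: skip over `index` sequences first, so cost grows with index."""
    pos = 0
    for _ in range(index):
        pos += _utf8_sequence_length(buf[pos])
    n = _utf8_sequence_length(buf[pos])
    return ord(buf[pos:pos + n].decode("utf-8"))


text = "a\u00e9\u65e5\U0001d11ez"   # sequences of 1, 2, 3, 4 and 1 bytes in UTF-8
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")
```

Both functions return the same code point for the same index; only the cost of getting there differs.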
Basically, I think UTF-32 is ok for the time being and requires very little change to the code.
With composition characters present, the situation where you randomly access an element and expect it to be a meaningful value by itself is rarer.
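A small Python illustration of this, offered as a sketch and not as anything Squeak-specific: the letter "e" followed by COMBINING ACUTE ACCENT (U+0301) takes two UTF-32 elements, and picking out the first element yields a bare "e", not the accented letter a reader sees.

```python
import unicodedata

decomposed = "e\u0301"                       # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # precomposed U+00E9

# Number of 32-bit elements needed for the decomposed form.
utf32_elements = len(decomposed.encode("utf-32-le")) // 4

# Random access to element 0 gives only the base letter.
first_element = decomposed[0]

# A non-zero combining class marks the second element as a combining mark.
second_is_combining = unicodedata.combining(decomposed[1]) != 0
```

So even with fixed-width elements, one user-visible character can span several of them.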
My proposal is that for a String (as data), we would rather avoid random access anyway and always access it via a Stream. Then the actual representation can differ without affecting clients.
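As a rough sketch of that idea in Python (the class and method names are hypothetical, loosely modelled on a Smalltalk ReadStream, and the buffer is assumed to be valid UTF-8), a stream can hand out one character at a time while the storage stays a plain UTF-8 byte buffer:

```python
class Utf8ReadStream:
    """Sequential reader over a UTF-8 byte buffer (illustrative only)."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0          # always kept on a sequence boundary

    def at_end(self) -> bool:
        return self._pos >= len(self._data)

    def next(self) -> str:
        """Decode and return the next code point, advancing past it."""
        lead = self._data[self._pos]
        # Sequence length from the lead byte; valid UTF-8 is assumed.
        n = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        ch = self._data[self._pos:self._pos + n].decode("utf-8")
        self._pos += n
        return ch


stream = Utf8ReadStream("a\u00e9\u65e5\U0001d11e".encode("utf-8"))
chars = []
while not stream.at_end():
    chars.append(stream.next())
```

Clients that only call `next()` and `at_end()` never see the byte layout, so the buffer could be switched to UTF-32 or another encoding without changing them.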
-- Yoshiki