[Vm-dev] mantis http://bugs.squeak.org/view.php?id=7349

Yoshiki Ohshima yoshiki at vpri.org
Thu May 7 21:41:36 UTC 2009


At Thu, 7 May 2009 11:37:10 -0700,
Eliot Miranda wrote:
> 
>     Yes, among these choices, my vote would be for UTF-32 (for 21-bit
>     space). But variable-length-ness doesn't really go away even when
>     using UTF-32, as there are composition characters.
>    
>     Alternatively, we could go for all UTF-8 in image representation for
>     Strings (as a data buffer) and when you need a Character, create an
>     instance, or return the one in a table, that is in UTF-32. And in the
>     image side, displayable "String" should (almost) always accompany the
>     attributes like Text.
> 
> I'm a bit out of my depth here. I would have thought that you would want the basic string types to be fixed width for
> fast accessing, simply because variable length doesn't scale to e.g. indexing 1 megabyte strings. But that for the
> platform interface one would want efficient conversion to/from fixed and variable length encodings. But that's just my
> gut. I expect I'll implement whatever y'all say makes sense.

  Basically, I think UTF-32 is ok for the time being and requires very
little change to the code.

  With composition (combining) characters present, the situation where
you randomly access an element and expect it to be a meaningful value
by itself is rarer.
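
  As a rough illustration of that point (a Python sketch, not Squeak
code; the helper name combining_at is made up for this example), a
fixed-width sequence of code points can still hold an element that only
makes sense together with its neighbour:

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT (U+0301): one visible
# character, but two code points even in a fixed-width encoding.
decomposed = "e\u0301"

# The precomposed form of the same text is a single code point.
composed = unicodedata.normalize("NFC", decomposed)

# In UTF-32 every code point takes 4 bytes, so the decomposed form
# occupies 8 bytes for what a reader sees as one character.
utf32_bytes = decomposed.encode("utf-32-le")


def combining_at(text, index):
    """Return True if the code point at index is a combining mark.

    Hypothetical helper for illustration: indexing gives the element
    in constant time, but the element may not stand alone.
    """
    return unicodedata.combining(text[index]) != 0
```

  Here decomposed[1] is the bare accent, so indexing is fast but the
element it returns is not meaningful by itself.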

  My proposal is that for a String (as data), we should avoid random
access anyway and always access it via a Stream.  Then the actual
representation can vary.
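
  A minimal sketch of what stream-only access could look like, again in
Python rather than Squeak, and assuming a UTF-8 byte buffer as the
storage.  The class Utf8ReadStream and its method names are
hypothetical; they only loosely mimic a ReadStream and are not an
existing API:

```python
import io


class Utf8ReadStream:
    """Hypothetical read stream over a UTF-8 encoded byte buffer."""

    def __init__(self, data):
        # newline="" disables newline translation, so the characters
        # come out exactly as they were encoded.
        self._reader = io.TextIOWrapper(
            io.BytesIO(data), encoding="utf-8", newline=""
        )

    def next(self):
        """Return the next character, or None at the end of the data."""
        ch = self._reader.read(1)
        return ch if ch else None

    def __iter__(self):
        # Yield characters one at a time until the buffer is exhausted.
        while True:
            ch = self.next()
            if ch is None:
                return
            yield ch


# The buffer is variable-width: 3 bytes per kanji, 1 per ASCII char.
buffer = "日本語 text".encode("utf-8")
characters = list(Utf8ReadStream(buffer))
```

  The caller never indexes into the buffer, so the storage could be
switched to UTF-32 or something else without changing the client code.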

-- Yoshiki

