[Vm-dev] Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

KenD Ken.Dickey at Whidbey.com
Fri Dec 11 17:14:15 UTC 2015


> > On Dec 10, 2015, at 6:43 PM, EuanM <euanmee at gmail.com> wrote:
>...
> > One thing I don't understand....  why does the fact the composed
> > abstract character (aka grapheme cluster) is a sequence mean that an
> > array cannot be used to hold the sequence?


Sorry, I missed the start of this discussion, so I may be _way_ off base here, but I wanted to suggest an alternative representation.

A plain byte array could hold the Unicode data.  Since it contains no object pointers, no GC scans are needed.  An intermediate map (easily compacted) could record where each grapheme cluster starts, so that one would get O(1) access to Unicode characters.  In the trivial case, the map is direct.

This would allow UTF-8, UTF-16, UTF-32, whatever, at the binary level, and it would handle grapheme clusters when accessing a composed "character".
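
To make the idea concrete, here is a minimal sketch in Python (chosen only for brevity; nothing here is Squeak or Pharo code).  The class name ClusterString, its at method, and the starts table are all made up for illustration.  It also assumes a much simpler clustering rule than the real Unicode segmentation rules (UAX #29): a combining mark is simply attached to the code point before it.

```python
import unicodedata
from array import array


class ClusterString:
    """Hypothetical sketch: immutable UTF-8 bytes plus a table of cluster offsets."""

    def __init__(self, text):
        # Flat bytes: no object pointers inside, so nothing for a GC to trace.
        self.data = text.encode("utf-8")
        # starts[i] is the byte offset at which cluster i begins.
        self.starts = array("L")
        offset = 0
        for ch in text:
            # ASSUMPTION: simplified clustering. A combining mark joins the
            # preceding cluster; everything else starts a new one.
            if offset == 0 or unicodedata.combining(ch) == 0:
                self.starts.append(offset)
            offset += len(ch.encode("utf-8"))
        # Sentinel entry marking the end of the last cluster.
        self.starts.append(offset)

    def __len__(self):
        return len(self.starts) - 1

    def at(self, index):
        """Return cluster `index` (0-based): two table lookups and one slice."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        lo, hi = self.starts[index], self.starts[index + 1]
        return self.data[lo:hi].decode("utf-8")
```

With "e" followed by U+0301 (combining acute) and then "a", the sketch reports two clusters, and at(0) returns both code points of the first one.  For pure ASCII input the table is just 0, 1, 2, ..., which is the "direct map" case and could be elided entirely.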

This does not solve the "wide char replacing narrow char" problem, but a rope-like solution would work here.  I.e., the byte vector is immutable and at:put: just adds the new chars to the map layer.  "Copying the string" could then yield a new byte vector with the characters inserted.
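
A rough sketch of that overlay idea, again in Python and again with made-up names (PatchedString, at_put, compacted).  To keep it short it assumes one code point per element rather than full grapheme clusters; the point is only that a write touches the map layer and never the bytes.

```python
class PatchedString:
    """Hypothetical sketch: immutable UTF-8 base plus an overlay of replacements."""

    def __init__(self, text):
        # The base bytes are never modified after construction.
        self._base = text.encode("utf-8")
        # ASSUMPTION: one element per code point (no grapheme clustering here).
        self._starts = [0]
        for ch in text:
            self._starts.append(self._starts[-1] + len(ch.encode("utf-8")))
        # Map layer: element index -> replacement text.
        self._patches = {}

    def __len__(self):
        return len(self._starts) - 1

    def _check(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)

    def at(self, index):
        """Return the element at `index` (0-based), preferring the overlay."""
        self._check(index)
        if index in self._patches:
            return self._patches[index]
        lo, hi = self._starts[index], self._starts[index + 1]
        return self._base[lo:hi].decode("utf-8")

    def at_put(self, index, replacement):
        """Record a replacement in the overlay; the base bytes stay untouched."""
        self._check(index)
        self._patches[index] = replacement

    def compacted(self):
        """'Copy the string': build fresh bytes with the replacements folded in."""
        return PatchedString("".join(self.at(i) for i in range(len(self))))
```

For example, replacing the final "e" of "cafe" with U+00E9 (two bytes in UTF-8) leaves the original four bytes as they were; only compacted() produces a new, five-byte vector with an empty overlay.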

$0.02,
-KenD

