[Vm-dev] Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Eliot Miranda eliot.miranda at gmail.com
Sat Dec 12 02:31:59 UTC 2015


Hi Euan,

On Fri, Dec 11, 2015 at 5:45 PM, EuanM <euanmee at gmail.com> wrote:

> Eliot, what's your take on having heterogeneous collections for the
> composed Unicode?
>

I'm not sure I'm understanding the question, but... I'm told by someone in
the know that string concatenation is a big deal in certain applications,
so providing tree-like representations for strings can be a win since
concatenation is O(1) (allocate a new root and assign the two subtrees).
It seems reasonable to have a rich library with several representations
available with different trade-offs.  But I'd let requirements drive
design, not feature dreams.
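
To make that concrete, here is a toy rope sketch.  It's in Python
rather than Smalltalk purely for brevity, and every name in it is made
up - nothing here is an existing Squeak or Pharo API:

    # Toy "rope": a string represented as a binary tree of substrings.
    # Concatenation is O(1) because it only allocates a new root node;
    # flattening back into a flat string is deferred until it's needed.
    class Rope:
        def __init__(self, left, right=None):
            self.left = left       # a plain str (leaf) or another Rope
            self.right = right     # None marks a leaf

        def __add__(self, other):
            return Rope(self, other)   # O(1): no characters are copied

        def __str__(self):
            if self.right is None:     # leaf: just the stored string
                return self.left
            return str(self.left) + str(self.right)   # O(n) flatten

    r = Rope('Hello, ') + Rope('world')   # allocates a single node
    print(r)                              # Hello, world

A real rope would also cache subtree lengths so that indexed access
stays O(log n); the O(1) concatenation is the point here.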


> i.e. collections with one element for each character, with some
> characters being themselves a collection of characters
>
> (A simple character like "a" is one char, and a character which is a
> collection of characters is the fully composed version of Ǖ (01d5): a
> U (0055) with a diaeresis ¨ (00a8, aka 0308 in combining form) on top,
> forming the precomposed character Ü (00dc), which then gets a macron
>  ̄ (0304) on top of that.)
>
> so
> a  #(0061)
>
> Ǖ  #(01d5) = #( 00dc 0304) = #( 0055 0308 0304)
>
> i.e. a string which alternates those two characters
>
> 'aǕaǕaǕaǕ'
>
> would be represented by something equivalent to:
>
> #( 0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304)
>    0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304) )
>
> as opposed to a string of precomposed characters:
> #( 0061 01d5  0061 01d5 0061 01d5 0061 01d5)
>
> Does alternating the type used for characters in a string have a
> significant effect on speed?
>

I honestly don't know.  You've just gone well beyond my familiarity with
the issues :-).  I'm just a VM guy :-). But I will say that in cases like
this, real applications and the profiler are your friends.  Be guided by
what you need now, not by what you think you'll need further down the road.
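
For what it's worth, the equivalences you list are exactly what Unicode
normalization produces, and they're easy to inspect from any language
that has the normalization tables built in - Python, say (the behaviour
is defined by Unicode itself, not by the language):

    import unicodedata

    s = 'a\u01d5'                      # 'aǕ' using precomposed U+01D5
    nfd = unicodedata.normalize('NFD', s)

    print([hex(ord(c)) for c in s])    # ['0x61', '0x1d5']
    print([hex(ord(c)) for c in nfd])  # ['0x61', '0x55', '0x308', '0x304']
    print(unicodedata.normalize('NFC', nfd) == s)   # True: same text

Whether the alternation costs anything is then purely a property of the
chosen in-memory representation; the abstract content is identical
either way.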


> On 11 December 2015 at 23:08, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
> > Hi Todd,
> >
> > On Dec 11, 2015, at 12:57 PM, Todd Blanchard <tblanchard at mac.com> wrote:
> >
> >
> > On Dec 11, 2015, at 12:19, EuanM <euanmee at gmail.com> wrote:
> >
> > "If it hasn't already been said, please do not conflate Unicode and
> > UTF-8. I think that would be a recipe for
> > a high P.I.T.A. factor."  --Richard Sargent
> >
> >
> > Well, yes. But  I think you guys are making this way too hard.
> >
> > A Unicode character is an abstract idea - for instance the letter 'a'.
> > The letter 'a' has a code point - it's the number 97.  How the number 97
> > is represented in the computer is irrelevant.
> >
> > Now we get to transfer encodings.  These are UTF-8, UTF-16, etc.  A
> > transfer encoding specifies the binary representation of the sequence of
> > code points.
> >
> > UTF-8 is a variable-length byte encoding.  You read it one byte at a
> > time, aggregating byte sequences into code points.  ByteArray would be
> > an excellent choice as a superclass, but it must be understood that #at:
> > or #at:put: refers to a byte, not a character.  If you want characters,
> > you have to start at the beginning and process it sequentially, like a
> > stream (if you're working purely in the ASCII domain you can generally
> > 'cheat' a bit).  A C representation would be char utf8[].
> >
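
To make the byte/character distinction concrete (Python here, but any
UTF-8 view of memory behaves the same way):

    s = 'a\u01d5'              # 'aǕ': one 1-byte and one 2-byte character
    b = s.encode('utf-8')

    print(list(b))             # [97, 199, 149]: 3 bytes for 2 characters
    print(b[1])                # 199, a UTF-8 lead byte, not a character
    print(b.decode('utf-8'))   # decoding must scan from the start: aǕ
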
> > UTF-16 is also a variable-length encoding, of two-byte quantities - what
> > C used to call a 'short int'.  You process it in two-byte chunks instead
> > of one-byte chunks.  Like UTF-8, you must read it sequentially to
> > interpret the characters.  #at: and #at:put: would necessarily refer to
> > byte pairs and not characters.  A C representation would be short
> > utf16[].  It would also be 50% space-inefficient for ASCII - which is
> > normally the bulk of your text.
> >
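
The same demonstration for UTF-16, including the 50% overhead on ASCII:

    s = 'ab\U0001d11e'          # two ASCII chars plus one beyond the BMP
    print(len(s.encode('utf-16-le')))   # 8: 2+2 bytes, then a 4-byte
                                        # surrogate pair for U+1D11E
    print(len('ascii text'.encode('utf-16-le')))  # 20 bytes vs. 10 in UTF-8
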
> > Realistically, you need exactly one in-memory format and stream
> > readers/writers that can convert (these are typically table-driven state
> > machines).  My choice would be UTF-8 for the internal memory format,
> > plus the ability to convert to and from UTF-16 at the boundaries.
> >
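
A sketch of such a converter, using Python's incremental codecs as a
stand-in for the table-driven state machine (the chunk boundaries are
deliberately hostile - the surrogate pair is split across two reads):

    import codecs

    # Convert UTF-16LE input to UTF-8 output chunk by chunk; the decoder
    # carries its state across chunks, so the whole text never needs to
    # be in memory at once.
    decoder = codecs.getincrementaldecoder('utf-16-le')()
    out = bytearray()
    for chunk in (b'h\x00i\x00 \x00\x34\xd8', b'\x1e\xdd'):
        out += decoder.decode(chunk).encode('utf-8')
    print(out.decode('utf-8'))   # hi 𝄞
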
> > But I stress again: strings don't really need indexability as much as
> > you think, and neither UTF-8 nor UTF-16 provides this property anyhow,
> > as they are variable-length encodings.  I don't see any sensible reason
> > to have more than one in-memory binary format in the image.
> >
> >
> > The only reasons are space and time.  If a string only contains code
> > points in the range 0-255, there's no point in squandering 4 bytes per
> > code point (the same goes for 0-65535).  Further, if in some application
> > interchange is more important than random access, it may make sense on
> > performance grounds to use UTF-8 directly.
> >
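
CPython, incidentally, does exactly this internally (PEP 393): a single
string class, with 1-, 2-, or 4-byte code units chosen per string from
the widest code point the string actually contains.  Roughly:

    import sys

    print(sys.getsizeof('a' * 1000))           # ~1 byte per code point
    print(sys.getsizeof('\u01d5' * 1000))      # ~2 bytes per code point
    print(sys.getsizeof('\U0001d11e' * 1000))  # ~4 bytes per code point

That keeps #at:/#at:put: O(1) while paying 4 bytes per code point only
for the strings that actually need it.
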
> > Again, Smalltalk's dynamic typing makes it easy to have one's cake and
> > eat it too.
> >
> > My $0.02c
> >
> >
> > _,,,^..^,,,_ (phone)
> >
> >
> > I agree. :-)
> >
> > Regarding UTF-16, I just want to be able to export to, and receive
> > from, Windows (and any other platforms using UTF-16 as their native
> > character representation).
> >
> > Windows will always be able to accept UTF-16.  All Windows apps *might
> > well* export UTF-16.  There may be other platforms which use UTF-16 as
> > their native format.  I'd just like to be able to cope with those
> > situations.  Nothing more.
> >
> > All this requires is a Utf16String class that has an asUtf8String
> > method (and any other required conversion methods).
> >
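
In encoding terms the conversion is a single re-encode; a hypothetical
Utf16String>>asUtf8String would amount to this (Python used purely to
show the byte-level effect):

    utf16_bytes = '\u00dcnicode'.encode('utf-16')   # as received, with BOM
    utf8_bytes = utf16_bytes.decode('utf-16').encode('utf-8')
    print(utf8_bytes)                               # b'\xc3\x9cnicode'
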
> >
>
>


-- 
_,,,^..^,,,_
best, Eliot