[squeak-dev] Re: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Sat Dec 12 01:45:54 UTC 2015


Eliot, what's your take on having heterogeneous collections for
composed Unicode?

i.e. collections with one element for each character, with some
characters being themselves a collection of characters

(A simple character like "a" is one element, while a character which is
itself a collection of characters is, for example, the fully composed
version of Ǖ (01D5): a U (0055) with a diaeresis ¨ (00A8, aka 0308 in
combining form) on top, forming the precomposed character Ü (00DC),
which then gets a macron ̄ (0304) on top of that.)

so
a  #(0061)

Ǖ  #(01D5) = #(00DC 0304) = #(0055 0308 0304)

i.e. a string which alternates those two characters

'aǕaǕaǕaǕ'

would be represented by something equivalent to:

#( 0061 #(0055 0308 0304)  0061 #(0055 0308 0304)
   0061 #(0055 0308 0304)  0061 #(0055 0308 0304) )

as opposed to a string of precomposed characters:
#( 0061 01D5  0061 01D5  0061 01D5  0061 01D5 )
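
A minimal sketch of that heterogeneous representation in plain
Smalltalk (brace arrays and Character value: are standard Squeak/Pharo;
the result is just an Array, not an existing string class):

| a uDiaeresisMacron composed |
a := Character value: 16r0061.
"a composed character as a collection: base U plus two combining marks"
uDiaeresisMacron := { Character value: 16r0055.
                      Character value: 16r0308.
                      Character value: 16r0304 }.
"one element per character; some elements are themselves collections"
composed := { a. uDiaeresisMacron. a. uDiaeresisMacron.
              a. uDiaeresisMacron. a. uDiaeresisMacron }.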

Does alternating the type used for characters in a string have a
significant effect on speed?

On 11 December 2015 at 23:08, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Todd,
>
> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <tblanchard at mac.com> wrote:
>
>
> On Dec 11, 2015, at 12:19, EuanM <euanmee at gmail.com> wrote:
>
> "If it hasn't already been said, please do not conflate Unicode and
> UTF-8. I think that would be a recipe for
> a high P.I.T.A. factor."  --Richard Sargent
>
>
> Well, yes. But  I think you guys are making this way too hard.
>
> A Unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97.  How the number 97 is
> represented in the computer is irrelevant.
>
> Now we get to transfer encodings.  These are UTF8, UTF16, etc....  A
> transfer encoding specifies the binary representation of the sequence of
> code points.
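>
> A quick illustration of the distinction ($a asInteger is standard
> Smalltalk; #utf8Encoded exists in Pharo, while Squeak spells the
> conversion differently):
>
> $a asInteger.       "97 - the code point, just a number"
> 'aǕ' utf8Encoded.   "#[97 199 149] - the UTF-8 transfer encoding"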
>
> UTF8 is a variable length byte encoding.  You read it one byte at a time,
> aggregating byte sequences to 'code points'.  ByteArray would be an
> excellent choice as a superclass but it must be understood that #at: or
> #at:put: refers to a byte, not a character.  If you want characters, you have
> to start at the beginning and process it sequentially, like a stream
> (though in the pure ASCII range you can generally 'cheat' this a bit).  A C
> representation would be char utf8[]
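>
> A minimal sketch of that sequential aggregation, decoding a ByteArray
> byte by byte into code points (plain Smalltalk; no validation of
> ill-formed sequences, and not an existing image API):
>
> | bytes stream codePoints byte codePoint extra |
> bytes := #[97 199 149].  "'aǕ' as UTF-8"
> stream := ReadStream on: bytes.
> codePoints := OrderedCollection new.
> [stream atEnd] whileFalse: [
>     byte := stream next.
>     byte < 16r80
>         ifTrue: [codePoint := byte. extra := 0]
>         ifFalse: [byte < 16rE0
>             ifTrue: [codePoint := byte bitAnd: 16r1F. extra := 1]
>             ifFalse: [byte < 16rF0
>                 ifTrue: [codePoint := byte bitAnd: 16r0F. extra := 2]
>                 ifFalse: [codePoint := byte bitAnd: 16r07. extra := 3]]].
>     extra timesRepeat: [
>         codePoint := (codePoint bitShift: 6) bitOr: (stream next bitAnd: 16r3F)].
>     codePoints add: codePoint].
> codePoints  "an OrderedCollection(97 469), i.e. $a and Ǖ (16r01D5)"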
>
> UTF16 is also a variable length encoding of two byte quantities - what C
> used to call a 'short int'.  You process it in two byte chunks instead of
> one byte chunks.  Like UTF8, you must read it sequentially to interpret the
> characters.  #at: and #at:put: would necessarily refer to byte pairs and not
> characters.  A C representation would be short utf16[].  It would also be
> 50% space inefficient for ASCII - which is normally the bulk of your text.
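>
> The variable length shows up outside the Basic Multilingual Plane: a
> code point above 16rFFFF needs a surrogate pair, i.e. two 16-bit
> units.  A sketch of the arithmetic for U+1F600 (plain Smalltalk, not
> image API):
>
> | cp high low |
> cp := 16r1F600 - 16r10000.
> high := 16rD800 + (cp bitShift: -10).  "16rD83D, the high surrogate"
> low := 16rDC00 + (cp bitAnd: 16r3FF).  "16rDE00, the low surrogate"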
>
> Realistically, you need exactly one in-memory format and stream
> readers/writers that can convert (these are typically table-driven state
> machines).  My choice would be UTF8 for the internal memory format and the
> ability to convert between UTF8 and UTF16 when reading and writing.
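>
> In Pharo, for instance, Zinc's encoders already provide such
> converters (ZnUTF8Encoder and ZnUTF16Encoder are real Pharo classes;
> availability in Squeak is not assumed here):
>
> | utf16Bytes roundTripped |
> utf16Bytes := ZnUTF16Encoder new encodeString: 'aǕ'.
> roundTripped := ZnUTF8Encoder new encodeString:
>                     (ZnUTF16Encoder new decodeBytes: utf16Bytes).
> roundTripped  "#[97 199 149]"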
>
> But I stress again...strings don't really need indexability as much as you
> think and neither UTF8 nor UTF16 provides this property anyhow as they are
> variable length encodings.  I don't see any sensible reason to have more
> than one in-memory binary format in the image.
>
>
> The only reasons are space and time.  If a string only contains code points
> in the range 0-255 there's no point in squandering 4 bytes per code point
> (same goes for 0-65535).  Further, if in some application interchange is
> more important than random access it may make sense on performance grounds
> to use utf-8 directly.
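>
> Squeak and Pharo already do exactly this (ByteString and WideString
> are existing classes; the literals below pick the class automatically):
>
> 'abc' class.   "ByteString - one byte per code point"
> 'aǕ' class.    "WideString - four bytes per code point"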
>
> Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat
> it too.
>
> My $0.02c
>
>
> _,,,^..^,,,_ (phone)
>
>
> I agree. :-)
>
> Regarding UTF-16, I just want to be able to export to, and receive
> from, Windows (and any other platforms using UTF-16 as their native
> character representation).
>
> Windows will always be able to accept UTF-16.  All Windows apps *might
> well* export UTF-16.  There may be other platforms which use UTF-16 as
> their native format.  I'd just like to be able to cope with those
> situations.  Nothing more.
>
> All this requires is a Utf16String class that has an asUtf8String
> method (and any other required conversion methods).
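>
> A minimal sketch of that method, assuming a hypothetical Utf16String
> which stores its raw code units in a ByteArray answered by a
> hypothetical #codeUnitBytes accessor (ZnUTF16Encoder and #utf8Encoded
> are existing Pharo API):
>
> Utf16String >> asUtf8String
>     "Decode our UTF-16 code units to characters, then emit UTF-8 bytes."
>     ^ (ZnUTF16Encoder new decodeBytes: self codeUnitBytes) utf8Encoded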
>
>

