Unicode support

Wed Sep 15 01:39:07 UTC 1999

UTF-8 is what is known as a transformation format (the "TF" in the
name). It is used to put whatever is in memory into some other form
that is more readily digestible by software and devices that expect
7-bit ASCII and its associated zero-null-byte convention. The UTF-8
format specifies ways of storing characters up to 4 bytes wide (UCS-4)
as byte sequences that contain no null bytes.

UTF-8 is also more compact for the European languages, but it is
lengthy for traditional Chinese, as nearly every character requires 3
bytes in UTF-8 where UCS-2 needs only 2.
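
The size trade-off is easy to check for yourself. Here is a minimal
sketch in Python (used purely for illustration; the sample strings and
the helper name encoded_sizes are arbitrary choices of mine, and
UTF-16 big-endian stands in for UCS-2, which it matches for characters
that fit in 2 bytes):

```python
# Compare UTF-8 size with a fixed 2-byte (UCS-2 style) size.
# The sample strings below are arbitrary illustrations.
samples = {
    "ascii": "hello",
    "german": "Gr\u00fc\u00dfe",    # "Gruesse" written with u-umlaut and sharp s
    "chinese": "\u6f22\u5b57",      # two CJK ideographs
}


def encoded_sizes(text):
    """Return (utf8_byte_count, two_byte_count) for text."""
    utf8 = text.encode("utf-8")
    # UTF-16 big-endian without a BOM equals UCS-2 for 2-byte characters.
    two_byte = text.encode("utf-16-be")
    return len(utf8), len(two_byte)


for name, text in samples.items():
    utf8_len, two_byte_len = encoded_sizes(text)
    print(name, "chars:", len(text), "utf-8:", utf8_len, "2-byte:", two_byte_len)

# UTF-8 only emits a zero byte for the NUL character itself, which is
# what keeps it safe for zero-terminated strings.
assert 0 not in "Gr\u00fc\u00dfe \u6f22\u5b57".encode("utf-8")
```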

The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to scan linearly from the start.
That makes string indexing O(n) instead of O(1), which is
unacceptable. If you were to sort the characters of a UTF-8 string for
some reason, a bubble sort, which only ever touches neighbouring
characters, could actually come out ahead of a quicksort, which
depends on random access.
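
To make the indexing cost concrete, here is a rough Python sketch of
what finding the nth character in UTF-8 data involves. The function
names utf8_char_offset and utf8_char_at are hypothetical, indexes are
0-based, and the input is assumed to be well-formed UTF-8:

```python
def utf8_char_offset(data, n):
    """Return the byte offset of the n-th character (0-based) in UTF-8 data.

    The scan always starts at byte 0, so the cost grows with n.
    """
    if n < 0:
        raise IndexError("character index out of range")
    count = -1
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return offset
    raise IndexError("character index out of range")


def utf8_char_at(data, n):
    """Return the n-th character (0-based) of UTF-8 data as a string."""
    start = utf8_char_offset(data, n)
    end = start + 1
    # Extend over the continuation bytes that belong to this character.
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[start:end].decode("utf-8")


data = "a\u00fc\u6f22b".encode("utf-8")   # 4 characters stored in 7 bytes
print(utf8_char_offset(data, 3), utf8_char_at(data, 2) == "\u6f22")
```

With a fixed-width format the same lookup is a single multiplication.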

So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.

So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.

For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will also have to provide alternate input/output
routines for files, because you shouldn't store text files in UCS-2.
There are many considerations, and I recommend that you read the
standard in full before going ahead and doing it.
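
As a sketch of the split I have in mind (fixed-width in memory, UTF-8
at the file boundary), here is a small Python illustration. The class
name TwoByteString and its methods are hypothetical, it uses 0-based
indexes rather than Smalltalk's 1-based #at:, and it assumes every
character fits in 2 bytes, which is exactly the UCS-2 limitation:

```python
class TwoByteString:
    """Hypothetical fixed-width string: 2 bytes per character, big-endian,
    layered on a plain byte array."""

    def __init__(self, text=""):
        self._bytes = bytearray()
        for ch in text:
            code = ord(ch)
            # Reject anything that is not a single 2-byte character.
            if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("character does not fit in two bytes")
            self._bytes += bytes((code >> 8, code & 0xFF))

    def __len__(self):
        return len(self._bytes) // 2

    def at(self, index):
        """Return the character at a 0-based index in constant time."""
        if not 0 <= index < len(self):
            raise IndexError("character index out of range")
        hi = self._bytes[2 * index]
        lo = self._bytes[2 * index + 1]
        return chr((hi << 8) | lo)

    def to_utf8(self):
        """Transform to UTF-8 bytes, e.g. before writing to a file."""
        text = "".join(self.at(i) for i in range(len(self)))
        return text.encode("utf-8")

    @classmethod
    def from_utf8(cls, data):
        """Build a string from UTF-8 bytes, e.g. after reading a file."""
        return cls(data.decode("utf-8"))


s = TwoByteString("a\u00fc\u6f22")
on_disk = s.to_utf8()                      # 6 bytes for 3 characters
back = TwoByteString.from_utf8(on_disk)
print(len(s), len(on_disk), back.at(2) == s.at(2))
```

This leaves out everything the standard says about byte order marks,
combining characters and the like, which is the part that takes the
real work.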

-John

----- Original Message -----
From: Duane Maxwell <dmaxwell at entrypoint.com>
To: <squeak at cs.uiuc.edu>
Sent: Tuesday, September 14, 1999 3:24 PM
Subject: Re: Unicode support

> I would suggest instead looking to implement one of the useful
> transformations of Unicode, such as UTF-8.  It's a variable-length
> encoding which could still use the current ByteArray character string
> representation, still be able to encode the entire Unicode space if
> necessary, as well as be efficient for the extremely common 7-bit
> ASCII case.  The Unicode specification describes various algorithms
> for conversion and manipulation of the various transformations, as
> well as mappings to platform specific extended character sets.
>
> Both XML and BeOS use UTF-8 as their default encoding.
>
> Bert Freudenberg writes:
> >On Tue, 14 Sep 1999, Todd Blanchard wrote:
> >
> >> > On Mon, 13 Sep 1999, Todd Blanchard wrote:
> >> >
> >> > > I'm wanting to implement some unicode support.  Who can tell
> >> > > me - how big is a word?
> >> > > Is it two bytes?
> >> >
> >> > No, it's four bytes. There is no two-byte primitive supported
> >> > array in Squeak (yet).
> >>
> >> So what's it going to take to get one? Is this something that
> >> could be put together by an experienced C programmer with some
> >> high-level Smalltalk experience by cloning the variableByteArray
> >> class and adjusting the data sizes?
> >
> >Currently there are only 1-byte arrays (ByteArray) and 4-byte
> >arrays (object pointers and words). You would have to find all
> >places that access the class format and change them to recognize
> >the new 2-byte format. There are a lot of them. Look, for example,
> >into Interpreter>>primitiveStringReplace which you certainly would
> >want to use for fast Unicode string manipulations.
> >
> >But basically you could just start using the byte-wise stuff and
> >adjusting all sizes by a factor of 2. In #at: you would construct a
> >Unicode character from 2 bytes etc. I'd think this would be not even
> >that slow, and you could still switch to primitives later.
> >
> >> Can you point me to info on low-level data formats in squeak?
> >
> >No ... except for that's all in the image ;-)
> >
> >I'll copy this back to the list, maybe someone else knows better.
> >
> >  /bert
>