Unicode support

agree at carltonfields.com agree at carltonfields.com
Tue Sep 14 20:01:21 UTC 1999


Right.  UTF-8 is upward compatible from ASCII.
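
To make the compatibility claim concrete, here is a small Python sketch (Python only for illustration, it is not from this thread). It checks that pure ASCII text produces identical bytes under ASCII and UTF-8, and that other characters take more bytes. The helper name utf8_length is made up for this example, and the byte counts assume the modern code point range up to U+10FFFF.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    if code_point < 0x80:        # 7-bit ASCII: one byte, same value as ASCII
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4                     # assumes code points up to U+10FFFF


ascii_text = "Squeak"
# Pure ASCII text is byte-for-byte identical in both encodings.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

mixed = "na\u00efve \u20ac"      # contains U+00EF and U+20AC
encoded = mixed.encode("utf-8")
# The encoded length is the sum of the per-character lengths.
assert len(encoded) == sum(utf8_length(ord(ch)) for ch in mixed)
```

So an existing byte-oriented string holding only ASCII is already valid UTF-8; only the non-ASCII characters need the longer sequences.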

> -----Original Message-----
> From: dmaxwell at entrypoint.com
> Sent: Tuesday, September 14, 1999 3:32 PM
> To: squeak at cs.uiuc.edu
> Subject: Re: Unicode support
>
> I would suggest instead looking to implement one of the useful
> transformations of Unicode, such as UTF-8.  It's a variable-length encoding
> which could still use the current ByteArray character string
> representation, still be able to encode the entire Unicode space if
> necessary, as well as be efficient for the extremely common 7-bit ASCII
> case.  The Unicode specification describes various algorithms for
> conversion and manipulation of the various transformations, as well as
> mappings to platform specific extended character sets.
>
> Both XML and BeOS use UTF-8 as their default encoding.
>
> Bert Freudenberg writes:
> >On Tue, 14 Sep 1999, Todd Blanchard wrote:
> >
> >> > On Mon, 13 Sep 1999, Todd Blanchard wrote:
> >> >
> >> > > I'm wanting to implement some Unicode support.  Who can tell me -
> >> > > how big is a word?
> >> > > Is it two bytes?
> >> >
> >> > No, it's four bytes. There is no two-byte primitive supported array in
> >> > Squeak (yet).
> >>
> >> So what's it going to take to get one? Is this something that could
> >> be put together by an experienced C programmer with some high-level
> >> Smalltalk experience by cloning the variableByteArray class and
> >> adjusting the data sizes?
> >
> >Currently there are only 1-byte arrays (ByteArray) and 4-byte arrays
> >(object pointers and words). You would have to find all places that
> >access the class format and change them to recognize the new 2-byte
> >format. There are a lot of them. Look, for example, at
> >Interpreter>>primitiveStringReplace, which you certainly would want to use
> >for fast Unicode string manipulations.
> >
> >But basically you could just start using the byte-wise stuff and adjusting
> >all sizes by a factor of 2. In #at: you would construct a Unicode
> >character from 2 bytes etc. I'd think this would not even be that slow,
> >and you could still switch to primitives later.
> >
> >> Can you point me to info on low-level data formats in squeak?
> >
> >No ... except that it's all in the image ;-)
> >
> >I'll copy this back to the list, maybe someone else knows better.
> >
> >  /bert
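
The byte-wise approach Bert describes (store two bytes per character in an ordinary byte array, scale all sizes by 2, and build the character inside #at:) could be sketched roughly as follows. This is Python for illustration only, not Squeak code from the thread. The class name TwoByteString, the method names, and the big-endian byte order are all assumptions made for this example.

```python
class TwoByteString:
    """Hypothetical wrapper: 16-bit characters kept in a plain byte array."""

    def __init__(self, text=""):
        self._bytes = bytearray()
        for ch in text:
            self.append(ch)

    def __len__(self):
        # Size in characters is the byte size divided by 2.
        return len(self._bytes) // 2

    def append(self, ch):
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("this sketch only handles 16-bit values")
        # Assumed byte order: high byte first (big-endian).
        self._bytes += bytes((code >> 8, code & 0xFF))

    def at(self, index):
        """1-based access, mirroring Smalltalk's #at:."""
        if not 1 <= index <= len(self):
            raise IndexError(index)
        offset = (index - 1) * 2
        # Construct the character from its two bytes.
        return chr((self._bytes[offset] << 8) | self._bytes[offset + 1])


s = TwoByteString("a\u20ac")
print(len(s), s.at(1) == "a", s.at(2) == "\u20ac")
```

All the work here happens in ordinary code on top of the byte array, which matches the suggestion that primitives could be added later for speed.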