UTF-8 (was Re: Celeste encoding (was: Duplicate messages in Celeste))

Bijan Parsia bparsia at email.unc.edu
Thu Mar 16 20:08:14 UTC 2000


On Thu, 16 Mar 2000, Lex Spoon wrote:

> Bijan Parsia <bparsia at email.unc.edu> wrote:
> 
> > 
> > Is there a plan for dealing with Unicode, at least externally? I'm really
> > just talking about enough to satisfy a parser...I'm happy to punt on
> > display and font issues. Though I would like ASCIIish text to be
> > intelligible ;)
> > 
> 
> It's been batted around before, and I thought a project group had
> started up to discuss it further.

To do something that's just enough to support an XML parser?

[snip lots of hard stuff]

Er... I just want enough to support an XML parser for the kind of
documents I, personally, am likely to see ;) I would like to do so in a
manner that doesn't cause major pain once the larger issues have been
dealt with.
 
> Furthermore, there is the issue of displaying your nice Unicode stuff. 
> I suppose people in practice will load into an image fonts for the
> portions of Unicode they see a lot, and other characters will show up as
> little squares or something.

Hey, that's A-OK for my purposes, eh? The point is to be able to process
XML documents. If some are second class wrt display, then that's a
different issue! I just don't want them *all* to be yucky (specifically, I
want *mine* to look cool! ;))

> Then there is string comparison.  Probably the letter A appears in
> oodles of places throughout Unicode, and all of them should be
> considered equal in a case-insensitive comparison.

Er..will this affect me XMLically parserwise? Ack, I'm degenerating into
icky neologism!
 
> I dunno, these are just some of the issues that spring to mind.  It
> would be awesome to do the switch while Squeak is still relatively
> young, but it ain't trivial.

But clearly these aren't all needed to support an XML parser. I'd settle
for UTF-8 to start, so long as I have a known mechanism for switching
encoders as new ones crop up.
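
For illustration only, here is a rough sketch of what "UTF-8 to start, plus
a known mechanism for switching encoders" could look like. It is written in
Python rather than Smalltalk, and every name in it (DECODERS,
register_decoder, decode, decode_utf8) is hypothetical, not anything that
exists in Squeak. It only turns bytes into code points, which is about all
a parser needs; display and fonts are not addressed.

```python
# Hypothetical registry: encoding name -> function from bytes to code points.
DECODERS = {}

def register_decoder(name, func):
    """Register a decoder under a case-insensitive encoding name."""
    DECODERS[name.lower()] = func

def decode(data, encoding="utf-8"):
    """Look up the decoder registered for `encoding` and apply it."""
    try:
        decoder = DECODERS[encoding.lower()]
    except KeyError:
        raise LookupError("no decoder registered for %r" % encoding)
    return decoder(data)

# Smallest code point that may legally use 1, 2, 3 or 4 bytes.
_MIN_FOR_LENGTH = [0x0, 0x80, 0x800, 0x10000]

def decode_utf8(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif 0xC2 <= lead <= 0xDF:
            extra, cp = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            extra, cp = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            extra, cp = 3, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + extra >= len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for j in range(1, extra + 1):
            cont = data[i + j]
            if cont & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % (i + j))
            cp = (cp << 6) | (cont & 0x3F)
        # Reject overlong forms, surrogates and out-of-range values.
        if cp < _MIN_FOR_LENGTH[extra] or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("invalid code point at offset %d" % i)
        points.append(cp)
        i += extra + 1
    return points

register_decoder("utf-8", decode_utf8)
```

A new encoder is then one more register_decoder call, e.g.
register_decoder("latin-1", lambda data: list(data)), and the parser only
ever calls decode(bytes, name).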

If folks are still working on that project, I'd be interested in what you
think of how VisualWorks handled it. It seems enormously complex, in that
VisualWorks way. Dolphin 2.x (the free version) just has a UnicodeString
class as a subclass of String. I didn't see much to it, and I don't know
if this suffices for *any* purpose ;) I haven't checked Smalltalk/X, but
since it has a JVM built in, I'd imagine it can just delegate Unicode
manipulations to that.

IIRC, XML parsers are supposed to support UTF-8 and -16, yes? If so, I'll
settle for starting with UTF-8. I'm just now wondering if we have a
framework idea for this stuff, or whether I should just be ad hoc about it
:)
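
(XML 1.0 does require every processor to read both UTF-8 and UTF-16.) As a
hedged sketch of the ad hoc end of the spectrum, the function below guesses
between the two from the first few bytes, loosely following the
autodetection notes in the XML spec's non-normative appendix. The names
sniff_xml_encoding and read_xml_text are made up for this example. It
ignores UCS-4/UTF-32 and the encoding="..." pseudo-attribute, so a UTF-32
little-endian file would be misreported as UTF-16.

```python
def sniff_xml_encoding(data):
    """Return (encoding name, number of BOM bytes to skip) for raw XML bytes."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3          # UTF-8 byte order mark
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2      # UTF-16 BOM, big-endian
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2      # UTF-16 BOM, little-endian
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be", 0      # "<?" in UTF-16BE without a BOM
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le", 0      # "<?" in UTF-16LE without a BOM
    return "utf-8", 0              # XML's default when nothing else is known

def read_xml_text(data):
    """Decode raw XML bytes to text using the sniffed encoding."""
    encoding, skip = sniff_xml_encoding(data)
    return data[skip:].decode(encoding)
```

For example, read_xml_text(b"\xff\xfe" + "<a/>".encode("utf-16-le")) gives
back the string "<a/>", as does read_xml_text(b"<a/>").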

Cheers,
Bijan.




