Unicode support

agree at carltonfields.com
Tue Sep 21 19:59:30 UTC 1999


> The space consumption argument against an object oriented String
> implementation is that each character object takes more space. Yes, a
> character object based string would use 4 bytes per string as opposed to 1
> byte per string. This is a definite concern and needs some study to find
> out how much of a concern it is.

Am I misunderstanding Peter's suggestion, or did he mean to say 4 bytes per CHARACTER in a string, as opposed to 1 byte per character?  As I understand it, all strings are themselves objects: it is only their contents that are byte data.

But I'm not sure why he thinks that an appropriate GenericString class must suffer the slings and arrows of outrageous costs.  When I draw a character from a String object presently, via:

	aString at: anIndex

the result is a CHARACTER OBJECT, whose pointer (as with each and every other oop) takes up an entire word, and which itself takes up some number of words with an object header and so forth.  The CONTAINER, on the other hand, is implemented at bottom as a ByteVariable object, which means that its contents are stored compactly as data (rather than as objects).  This can be understood by considering how String>>at: is implemented (comment elided).

at: index 

	<primitive: 63>
	^Character value: (super at: index)

so that a query on the string invokes a primitive, or upon failure looks up the byte value (converted now to a SmallInteger) in a table.  An excerpt from the corresponding primitive confirms this is what happens "under the table" as well.

	successFlag _ true.
	result _ self stObject: rcvr at: index.
	successFlag ifTrue:
		[stringy ifTrue: [result _ self characterForAscii: (self integerValueOf: result)].
		^ self pop: 2 thenPush: result]
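
To make the storage point concrete, here is a small workspace sketch (mine, not taken from the image sources).  It assumes a Squeak image in which String is byte-indexable and in which basicAt: answers the raw element; the comments give what I would expect, not verified output.

```smalltalk
"Workspace sketch; evaluate each statement with 'print it'."
| s |
s := String new: 3.
s at: 1 put: $a; at: 2 put: $b; at: 3 put: $c.

s class isBytes.                      "expected true: the contents are raw bytes, not oops"
s basicAt: 1.                         "expected 97: the stored byte, answered as a SmallInteger"
(s at: 1) class.                      "Character: the byte is converted on the way out"
(s at: 1) == (Character value: 97).   "true: byte-range Characters are shared, not copied"
```

The last line is the interesting one: asking twice for the same byte value answers the identical Character, so the string itself never pays a per-character object cost.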

Of course, none of this contradicts Peter's plan, just his analysis.  A generalized character class is no more inherently expensive to maintain than the present Character class, except that the optimization of a 256-entry table would no longer be available.  Indeed, a word-indexable String is a possible implementation for Unicode (and would take up the big space), but nothing would REQUIRE that this be applied to every string.  Just as we presently do with SmallInteger when it overflows, we can "kick up the ante" whenever "more" is needed, while the general case (optimized with a tight, 256-entry "cache" table) is handled separately.
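
A rough workspace sketch of what "kicking up the ante" might look like follows.  This is hypothetical, not existing Squeak code: the names contents, wide and putCode are mine, and a ByteArray and a WordArray stand in for the narrow and wide string storage.

```smalltalk
"Hypothetical sketch: store a character code, widening from byte storage
 to word storage only when a code no longer fits in a byte."
| contents wide putCode |
contents := ByteArray new: 4.
putCode := [:index :code |
	(code > 255 and: [contents class == ByteArray]) ifTrue:
		["Copy the existing bytes into word-sized slots, then switch over."
		wide := WordArray new: contents size.
		1 to: contents size do: [:i | wide at: i put: (contents at: i)].
		contents := wide].
	contents at: index put: code].

putCode value: 1 value: 97.         "fits in a byte; contents stays a ByteArray"
putCode value: 2 value: 16r3042.    "HIRAGANA LETTER A; contents becomes a WordArray"
contents class.                     "WordArray"

"The analogy in the text: integers already widen on demand."
(SmallInteger maxVal + 1) class.    "LargePositiveInteger"
```

A real GenericString would presumably hide this switch behind at:put:, so that only strings which actually contain wide characters pay for them.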

> Speed issues may be a concern for "complicated byte encodings" like
> "unicode" due to the "magic" that must be done to meet the encoding
> specification which uses special "encodings". No wonder that OpenStep's text
> system implementation slows down when dealing with large strings. And their
> system is largely written in C for speed with lots of Objective-C object
> wrappers. In many cases it is actually faster to process 32bits instead of
> bytes.

Except for block moves (probably the most expensive operation for strings presently, by the way), I agree that byte access is already more expensive under the status quo than a word-based operation.  However, the savings in creation and copying that come with the compact byte form probably outweigh this substantially.

But we don't know until we know.  Has anyone actually studied this question?

> Regardless of these reasons not to have an "object oriented string class"
> it still makes sense to have one. A byte oriented string class is just too
> limited. I can't put any characters I want into it. A pure object oriented
> string class enables ALL characters from ALL languages to exist in the same
> object string. This simplifies displaying characters of ANY character set
> or font on the screen in a single unit. It also doesn't require the user to
> be setting which character encoding to use: Japanese this or that (they
> have at least three separate encodings excluding UNICODE), Unicode, ASCII,
> UTF-8, etc... Since they can mix any characters they wish in to a string
> when typing into an edit box or into document text.

OK, OK.  So what are the essential elements of a Generalized String?  Can you propose a straightforward hierarchy of the principal actions we will permit, and not permit?  Once we have a proposal, we can kick it around.  Until then, we're all tilting at straw men and women.

*snip*

> I am simply talking about creating the
> most general string implementation possible.

We already have the most general string implementation possible.  It's called Array, and it's very, very fast and well-implemented.  What is it that makes a String different from an Array?  Let's define the protocol and a class hierarchy, and see where we should be going, if anywhere.
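
As a starting point for that comparison, a workspace sketch: an Array of Characters already behaves as a general sequence, and the probes at the end are one way to list which string messages it lacks.  The selectors probed are only examples I picked, and the answers will depend on the image.

```smalltalk
"Workspace sketch: an Array holding Characters, next to a String."
| a s |
a := Array with: $a with: $b with: (Character value: 233).
s := 'abc'.

a size.                                "3"
a collect: [:c | c asUppercase].       "sequence protocol works on the Array"
a , a.                                 "so does concatenation"

"Probe for string-specific protocol; print each line and compare."
s respondsTo: #asSymbol.
a respondsTo: #asSymbol.
s respondsTo: #<.
a respondsTo: #<.
```

Whatever answers differ between s and a is a first cut at the protocol a Generalized String would have to add on top of Array.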

> Using a "generic object oriented character based
> string class" is the most general string object design possible and has
> many benefits.

Once again, how is it different from class Array?  What do we need to add to make it happen?

At the end of the day, I don't think anyone is really disagreeing with Peter on the points to which he has addressed himself.  There is no reason one can't go ahead and define or code these protocols -- it just hasn't been done yet.  Once we see what it needs to do, and only then, we can meaningfully analyze its merits and weaknesses, if any.




