[squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

Colin Putney cputney at wiresong.ca
Fri Aug 28 13:28:20 UTC 2009


On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:

> Wouldn't ByteArrays be a better way to efficiently store arrays of  
> bytes? Strings are conceptually made of Characters, and there are  
> more than 256 of them. E.g. a la Python 3:

So you're proposing that WideString, once it no longer has language  
tags, use its 4 bytes per character to point to Character objects  
rather than encoding the string at all? That would certainly be an  
interesting implementation. It would trade space for speed (of certain  
operations) in the case of CJK and other writing systems that involve  
large numbers of characters, since you'd have a bunch of Character  
objects persisting in the image rather than existing only ephemerally.  
For some applications, that's exactly the right design choice, no doubt.
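
To make the trade-off concrete, here's a quick workspace sketch (stock
classes only; 16r4E2D is just an arbitrary CJK code point):

    "Today: WideString packs raw 32-bit code points; no Character
    objects are kept around."
    encoded := WideString with: (Character value: 16r4E2D).
    encoded class isBytes.      "false - it's a word-indexed class"
    encoded class isPointers.   "false - the slots hold raw values"

    "Your proposal, as I read it: each slot is a reference to a
    Character object, which then persists as long as the string does."
    pointers := Array with: (Character value: 16r4E2D).
    pointers class isPointers.  "true"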

On the other hand, EncodedString (and subclasses like Utf8String or  
Latin1String) would make a different trade-off: speed (of certain  
operations) for space. Any #variableByteSubclass can efficiently  
store bytes. The reason to use, say, Utf8String rather than a ByteArray  
is precisely *because* Strings are conceptually made of Characters.  
Encapsulation and all that.
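
For what it's worth, the shape I'm imagining is roughly this (none of
these classes exist yet, so treat it purely as a sketch):

    String variableByteSubclass: #Latin1String
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Collections-EncodedStrings'.

    Latin1String >> at: index
        "Each byte already is a Latin-1 code point, so decoding is
        trivial. A Utf8String would answer a Character the same way,
        just after walking the multi-byte sequence for that index."
        ^ Character value: (self basicAt: index)

Clients only ever see Characters; the bytes stay an implementation detail.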

> A Text defines attributes for Character runs in a String. Instead of  
> storing the tag in each Character, it could be stored in an  
> attribute of the Text. Instead of passing around bare Strings you  
> would pass around Text objects (if you need to preserve language  
> tags).

Sounds good.
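
Something like this, say (TextLanguage and its #code: constructor are
made up here, but #addAttribute:from:to: is the existing Text protocol):

    TextAttribute subclass: #TextLanguage
        instanceVariableNames: 'languageCode'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Collections-Text'.

    | text |
    text := 'Grüße from Tokyo' asText.
    "Tag the first five characters (the German run) with a language code."
    text addAttribute: (TextLanguage code: #de) from: 1 to: 5.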

Colin


