[squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

Fri Aug 28 13:49:32 UTC 2009

> At Thu, 27 Aug 2009 22:19:49 -0700,
> Andreas Raab wrote:
>>
>> Yoshiki Ohshima wrote:
>>>  One question is the roadmap; I would think ByteStrings will be
>>> retained for a while (or forever) but may also be phased out.  And
>>> it would also be nice to tag ByteStrings.  The natural order may be
>>> to try to move on to a text attribute approach earlier so that the
>>> bare representation doesn't matter much.  What do you think about
>>> these things?
>>
>> Interesting questions. I'm not sure what you mean by "tagging
>> ByteStrings" - generally my opinion is that String/ByteString/WideString
>> have the same relationship that Integer/SmallInteger/LargeInteger have.
>
>  With characters in the 0..255 range, somebody may want to define
> language tags and attach them.  It would be nice if we could do that
> transparently.
>
> -- Yoshiki

On 28.08.2009, at 15:28, Colin Putney wrote:

> On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:
>
>> Wouldn't ByteArrays be a better way to efficiently store arrays of  
>> bytes? Strings are conceptually made of Characters, and there are  
>> more than 256 of them. E.g. a la Python 3:
>
> So you're proposing that WideString, once it no longer has language  
> tags, use its 4 bytes per character to point to Character objects  
> rather than encoding the string at all? That would certainly be an  
> interesting implementation. It would trade space for speed (of  
> certain operations) in the case of CJK and other writing systems  
> that involve large numbers of characters, as you'd have a bunch of  
> Character objects persisting in the image, rather than just  
> ephemerally. For some applications, that's exactly the right design  
> choice, no doubt.

I'm not really proposing anything at this point, just widening the  
discussion Yoshiki started (cited above for reference).

> On the other hand EncodedString (and subclasses like Utf8String or  
> Latin1String) would make a different trade-off, speed (of certain  
> operations) for space.  Any #variableByteSubclass can efficiently  
> store bytes. The reason to use say, Utf8String rather than ByteArray  
> is precisely *because* Strings are conceptually made of Characters.  
> Encapsulation and all that.

I guess having encoded strings would be nice. OTOH I value simplicity.  
Does anybody have experience with the tradeoffs?

- Bert -