Unicode patch

Thu Jun 14 10:32:31 UTC 2007

Just from glancing at the code this cannot possibly be right.

Like, in many places the isWideString test is simply replaced with  
isFourByteString. But the distinction we need to make is wether we  
have character values below 256 or above (for example to choose  
between the old and the MultiByteScanner). So #isWideString needs to  
be preserved and answer true for all Strings that have character  
values >= 256.

As for the internal representation of TwoByteStrings; I'm not sure  
using big endian on all platforms is a good idea. Should certainly be  
discussed - like, it might be valuable to hand that string to a  
primitive and then platform order would be better.

Also, the renaming of WideString without providing proper conversion  
methods will most certainly break existing projects.

Then there are a lot of nits to pick - like the class comments are  
wrong, ByteString>>replaceFrom:... only creates 32 bit strings,  
bitShift is used all over the place when Smalltalk code traditionally  
uses * and //, what is TwoByteString>>printString good for, why does  
TwoByteString>>asByteString do an unnecessary copy etc.

Before inclusion this still needs a lot of work and testing.

- Bert -

On Jun 14, 2007, at 3:18 , Damien Cassou wrote:

> Hi Janko,
>
> did you try to load your changeset in a Squeak 3.10 image? What is the
> status of the tests?
>
> If your changeset is good enough and if you write unit tests, it may
> be interesting to put your changeset into 3.10.
>
> Bye
>
> 2007/6/12, Janko Mivšek <janko.mivsek at eranova.si>:
>> Dear Squeakers,
>>
>> Please find attached an Unicode patch, which deals with  
>> improvements of
>> internal representation of Unicode characters. It:
>>
>> 1. introduce new class TwoByteString
>> 2. change at:put: on ByteString and other such methods to "scale"  
>> string
>>     to TwoByteString or FourByteString, depending on width of a  
>> character
>> 3. rename WideString to FourByteString for consistency, also
>>     rename all related methods
>> 2. add category CollectionTests-Unicode with tests
>> 3. add class UnicodeBenchmarking for measuring speed of
>>     Unicode handling like at:put speed and UTF8 conversions on  
>> included
>>     English, French, Slovenian, Russian and Chinese text.
>>
>> ByteString and TwoByteString also include UTF8 conversion methods,  
>> which
>> will probably be moved to UTF8TextConverter later.
>>
>> I hope this patch will help improving Squeak Unicode support a bit.
>>
>> Best regards
>> Janko
>>
>>
>> --
>> Janko Mivšek
>> AIDA/Web
>> Smalltalk Web Application Server
>> http://www.aidaweb.si
>>
>>
>>
>>
>>
>
>
> -- 
> Damien Cassou
>