Unicode strings, benchmarks

Mon Jun 11 22:28:24 UTC 2007

Hi Yoshiki,

Yoshiki Ohshima wrote:
> 
>> I also wrote a somewhat better UTF8 conversion, but it is only 25-80%
>> faster than the existing one in UTF8TextConverter.
> 
>   Good!
> 
>> Here are results in VW, Squeak with old UTF8 converter and a new one:
>>
>>             VW   Squeak old   Squeak new
>> english     30          313          248   ByteString,    pure ASCII
>> french      32          323          251   ByteString,    ISO8859-1 (Latin 1)
>> slovenian   48          578          480   TwoByteString, Latin 2
>> russian    112         1306          720   TwoByteString, Cyrillic
>> chinese    107         1544         3825   TwoByteString
>>
>> Notice the exceptional 10x VW performance compared to Squeak, and they
>> do all encodings in plain Smalltalk! No primitives! So how come
>> Squeak is so slow here?
> 
>   Is it true that you traded away performance for
> Chinese to gain it for the other languages?

Definitely not, and I just don't understand why Chinese is so slow. I
hope you'll be able to look at that code to see what's wrong. And
Chinese is close to Japanese, right? I learned a bit of Chinese 20 years
ago, but that was not much help - I forgot too much :)

I'll prepare and publish code and benchmark tomorrow.
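
For readers following the table above, here is a minimal sketch in
Python of what a per-character UTF8 encoding loop has to do. It is not
the code from UTF8TextConverter or from VW, and the function name
utf8_encode is made up for illustration. It only shows the standard
UTF-8 rule that ASCII takes one byte, accented Latin and Cyrillic
letters take two, and Chinese characters in the Basic Multilingual
Plane take three, so each row of the table exercises a different branch
of the loop.

```python
def utf8_encode(text: str) -> bytes:
    """Encode text to UTF-8 by hand, one code point at a time.

    Illustrative sketch only; a hypothetical helper, not taken from
    any Squeak or VisualWorks converter.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # 1 byte: pure ASCII (the "english" row)
            out.append(cp)
        elif cp < 0x800:
            # 2 bytes: Latin 1, Latin 2, Cyrillic
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            if 0xD800 <= cp <= 0xDFFF:
                # Surrogate code points are not valid in UTF-8
                raise ValueError("cannot encode surrogate U+%04X" % cp)
            # 3 bytes: rest of the Basic Multilingual Plane, incl. Chinese
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            # 4 bytes: supplementary planes
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

Calling len(utf8_encode(s)) on a one-character string shows which
branch that character takes. Whether this branch difference has
anything to do with the slow Chinese result is not established here;
the sketch only makes the per-character work visible.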

>   BTW, I can't see the difference between this and your "With
> corrected table of results:".

The "corrected" should have read "with corrected layout", nothing more.
Sorry for the ambiguity.

> 
>   - UTF8TextConverter wasn't written with performance in mind (as you
>     can tell^^;)
>   - This kind of tight loop gives a factor of 3-5 performance difference
>     between VW and Squeak, plus,
>   - immediate representation for characters must be helping a lot.
> 
>   For the OLPC, I think I will end up writing primitives for
> Squeak.  One could say that I should link the iconv library, but I'm
> not sure if that is a good idea or not...
> 
> -- Yoshiki
> 
> 

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si