Unicode strings, benchmarks

Mon Jun 11 21:41:07 UTC 2007

Hi Squeakers,

I have already extended String with TwoByteString and implemented
"scaling": a string is automatically converted to a wider one when a
wider character is put into it. So far so good, and this already works
in Aida/Web.
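
To illustrate the "scaling" idea outside Smalltalk, here is a minimal Python sketch. It is not the actual TwoByteString code: the class name ScalingString and all of its methods are hypothetical, and it assumes only two storage widths (one and two bytes per character), so anything above U+FFFF is rejected.

```python
from array import array

# Storage widths, narrowest first: (array typecode, largest code point that fits).
# "B" is one unsigned byte per character, "H" is an unsigned 16-bit value.
WIDTHS = [("B", 0xFF), ("H", 0xFFFF)]


class ScalingString:
    """Hypothetical auto-widening string; an illustration only."""

    def __init__(self, text=""):
        self._cells = array("B")  # start with the narrowest storage
        for ch in text:
            self.append(ch)

    def _ensure_fits(self, code_point):
        # Find the narrowest storage that can hold this code point.
        for typecode, limit in WIDTHS:
            if code_point <= limit:
                break
        else:
            raise ValueError("code point too wide for this sketch")
        current_limit = dict(WIDTHS)[self._cells.typecode]
        if limit > current_limit:
            # Copy the existing characters into wider cells.
            self._cells = array(typecode, self._cells.tolist())

    def append(self, ch):
        self._ensure_fits(ord(ch))
        self._cells.append(ord(ch))

    def __setitem__(self, index, ch):
        # Putting a wider character into the string triggers the conversion.
        self._ensure_fits(ord(ch))
        self._cells[index] = ord(ch)

    def __str__(self):
        return "".join(map(chr, self._cells))

    @property
    def bytes_per_char(self):
        return 1 if self._cells.typecode == "B" else 2
```

A string built from "abc" stays at one byte per character; assigning "š" to one of its positions widens the whole string to two bytes per character. A further width could be added as one more entry in WIDTHS.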

I also wrote a somewhat better UTF8 conversion, but it is only 25-80%
faster than the existing one in UTF8TextConverter. To prepare for even
better results, I made a benchmark which measures the conversion time
for English, French, Slovenian, Russian and Chinese texts, each 2500
characters long. It measures 100 conversions per text, which adds up to
250K characters of text.
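
The shape of such a benchmark can be sketched in Python as below. This is not the Squeak benchmark itself: the sample phrases, the function names, and the choice to time encoding to UTF-8 (rather than decoding) are all assumptions made for illustration.

```python
import time

# Hypothetical sample phrases, repeated to build each 2500-character text.
SAMPLES = {
    "english": "The quick brown fox jumps over the lazy dog. ",
    "french": "Été, forêt, garçon, déjà vu. ",
    "slovenian": "Češnje, žaba in šola. ",
    "russian": "Съешь ещё этих мягких французских булок. ",
    "chinese": "你好，世界。",
}


def make_text(sample, length=2500):
    """Repeat a short sample until it is exactly `length` characters long."""
    repeats = length // len(sample) + 1
    return (sample * repeats)[:length]


def bench_utf8(text, runs=100):
    """Return (elapsed milliseconds, characters converted) for `runs` encodings."""
    start = time.perf_counter()
    for _ in range(runs):
        text.encode("utf-8")
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, runs * len(text)


if __name__ == "__main__":
    for name, sample in SAMPLES.items():
        ms, chars = bench_utf8(make_text(sample))
        print(f"{name:10s} {ms:8.2f} ms for {chars} characters")
```

With the defaults, each text contributes 100 x 2500 = 250,000 converted characters, matching the 250K figure above.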

Here are the results in VW, in Squeak with the old UTF8 converter, and
in Squeak with the new one:

           VW    old    new  string class,  character set
english    30    313    248  ByteString,    pure ASCII
french     32    323    251  ByteString,    ISO8859-1 (Latin 1)
slovenian  48    578    480  TwoByteString, Latin 2
russian   112   1306    720  TwoByteString, Cyrillic
chinese   107   1544   3825  TwoByteString

Notice VW's exceptional performance, about 10x faster than Squeak, and
it does all encodings in plain Smalltalk! No primitives! So why is
Squeak so slow here?
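
The "10x" figure can be checked directly from the table. The short Python sketch below only divides the numbers already listed; the variable and function names are mine.

```python
# Timings copied from the table above: (VW, Squeak old, Squeak new).
RESULTS = {
    "english": (30, 313, 248),
    "french": (32, 323, 251),
    "slovenian": (48, 578, 480),
    "russian": (112, 1306, 720),
    "chinese": (107, 1544, 3825),
}


def slowdown_vs_vw(results):
    """Map each text to (old / VW, new / VW) time ratios."""
    return {
        name: (old / vw, new / vw)
        for name, (vw, old, new) in results.items()
    }


if __name__ == "__main__":
    for name, (old_ratio, new_ratio) in slowdown_vs_vw(RESULTS).items():
        print(f"{name:10s} old: {old_ratio:5.1f}x  new: {new_ratio:5.1f}x")
```

For the old converter the ratio comes out between roughly 10 and 14 for every text.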

The above benchmark was run on Squeak 3.9 on SUSE Linux 10.1, P3.2GHz.

Best regards
Janko

-- 
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si