Latin-1 to UTF-8 speedups

Andreas Raab andreas.raab at gmx.de
Sat Jul 14 22:19:30 UTC 2007


Hi -

John asked me for the UTF-8 changes that I had done for our own use and 
since other people may be interested as well, here are the changes. 
Keep in mind that the speedup is aimed at situations where your input 
is basically Latin-1; it won't have any effect if you are actually 
using anything beyond Latin-1.

The main goal of these changes is to make the overhead of adding UTF-8 
conversions "just in case" vanishingly small. For example, converting 
ASCII text with no extended characters at all is effectively free 
(timings are in milliseconds, as answered by timeToRun):

"Convert 1 million ASCII characters (100 runs over 10000 characters)"
string := (String new: 10000 withAll: $a).

"The current converter"
Transcript cr; show: [1 to: 100 do:[:i|
   string convertToWithConverter: UTF8TextConverter new
]] timeToRun.

=> 1809

"The fast path"
Transcript cr; show: [1 to: 100 do:[:i|
   string squeakToUtf8
]] timeToRun.

=> 4

Even when using the full Latin-1 range, there is still a goodly bit of 
speedup:

"Convert 1 million extended Latin-1 characters (100 runs over 10000 characters)"
string := (String new: 10000 withAll: $ß).

"The current converter"
Transcript cr; show: [1 to: 100 do:[:i|
   string convertToWithConverter: UTF8TextConverter new
]] timeToRun.

=> 5193

"The fast path"
Transcript cr; show: [1 to: 100 do:[:i|
   string squeakToUtf8
]] timeToRun.

=> 1816

Depending on your concrete usage, the result will be somewhere in 
between these extremes. For our use we found it to be close to the 
optimal case, but if you use a lot of extended Latin-1 your results 
will be closer to the latter. In any case, it should be a nice little 
speedup so enjoy the ride.
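
To illustrate why the two cases above differ so much, here is a minimal 
workspace sketch of the general idea. It is not the code from the 
attached change set; the block name latin1ToUtf8 is hypothetical, and it 
assumes the input is a byte string whose characters all have values 
below 256. Pure ASCII input is already valid UTF-8 and can be answered 
unchanged, whereas every character from 128 to 255 has to be expanded 
into two bytes, which forces a copy.

```smalltalk
"Hypothetical sketch only; not the attached SqueakToUtf8.cs.
 Assumes every character in aString has a value below 256."
| latin1ToUtf8 |
latin1ToUtf8 := [:aString | | out |
    (aString allSatisfy: [:c | c asInteger < 128])
        ifTrue: [aString]  "pure ASCII: already valid UTF-8, no copy needed"
        ifFalse: [
            out := WriteStream on: (String new: aString size * 2).
            aString do: [:c | | code |
                code := c asInteger.
                code < 128
                    ifTrue: [out nextPut: c]
                    ifFalse: [
                        "Two-byte UTF-8 form: 110000xx 10xxxxxx"
                        out nextPut: (Character value: 16rC0 + (code bitShift: -6)).
                        out nextPut: (Character value: 16r80 + (code bitAnd: 16r3F))]].
            out contents]].
```

For example, `latin1ToUtf8 value: (String new: 3 withAll: (Character 
value: 223))` answers a 6-character string, since each ß (value 223) 
becomes the two bytes 16rC3 16r9F.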

Cheers,
   - Andreas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: SqueakToUtf8.cs
Type: text/x-csharp
Size: 4317 bytes
Desc: not available
Url : http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20070714/b89d7856/SqueakToUtf8.bin