UTF8 Squeak
Janko Mivšek
janko.mivsek at eranova.si
Mon Jun 11 11:04:53 UTC 2007
Hi Colin,
Colin Putney wrote:
>
> On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:
>
>> I think that this way we can achieve the most efficient yet fast support
>> for all languages in the world. Because of their fixed length, those
>> strings are also easy to manipulate, contrary to variable-length UTF-8 ones.
>
> "Most efficient yet fast" is a matter of perspective. For the apps I
> work on, UTF-8 is better than your scheme because space efficiency is
> more important than random access, and time spent encoding and decoding
> UTF-8 would dwarf time spent scanning for random access.
Anyone can definitely stay with UTF-8 encoded strings in a plain ByteString,
or subclass it to a UTF8String themselves. But I don't see why we need to
have UTF8String as part of the string framework. Just because of meaning?
Then we would also need to introduce an ASCIIString :)
> As soon as you try to support more than 256 characters, there are
> trade-offs to be made. The "ideal" solution depends on your application.
> How important is memory efficiency vs. space efficiency? How about
> stream processing vs random access? What format is your input and
> output? Which characters do you need to support, and how many of them
> are there?
> A good string library will be flexible enough to allow its users to
> make those trade-offs according to the needs of the application.
I think that preserving simplicity is also an important goal. We need to
find a general yet simple solution for Unicode strings, one that will be
good enough for most uses, as is the case for numbers, for instance. We
can deal with the more special cases separately. I claim that pure Unicode
strings in a ByteString, TwoByteString or FourByteString are such general
support. UTF8String is already a specific one.
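
To make the fixed-width idea concrete, here is a rough sketch in Python rather than Smalltalk. It is only an illustration under stated assumptions: the names `narrowest_width`, `pack_fixed_width` and `char_at` are hypothetical, and this is not how Squeak's string classes are implemented. It picks 1, 2 or 4 bytes per character from the highest code point, and shows that random access is then a plain offset computation.

```python
import struct

def narrowest_width(text):
    """Return 1, 2 or 4: bytes per character needed for fixed-width storage."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1   # would fit a ByteString
    if highest <= 0xFFFF:
        return 2   # would fit a TwoByteString
    return 4       # needs a FourByteString

def pack_fixed_width(text):
    """Pack the code points little-endian at the narrowest fixed width."""
    code = {1: "B", 2: "H", 4: "I"}[narrowest_width(text)]
    return struct.pack("<%d%s" % (len(text), code), *map(ord, text))

def char_at(packed, width, index):
    """Random access: the character starts at index * width, no scanning."""
    start = index * width
    return chr(int.from_bytes(packed[start:start + width], "little"))
```

For example, `narrowest_width("Mivšek")` selects the two-byte form, because "š" is U+0161 and does not fit in one byte.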
>
>> Conversion to/from UTF-8 could probably also be simpler with the help of
>> bit arithmetic algorithms, which would be tailored differently for
>> each of the three proposed string subclasses above.
>
> Yes, a couple of well designed primitives would help quite a bit.
I have studied UTF-8 conversion, and it is designed to be efficient, almost
as fast as a plain copy. I have already written those conversion methods and
am now preparing benchmarks. If conversion is really as fast as a copy, then
there are really not many arguments left against always converting to
internal Unicode by default?
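
As a rough illustration of the bit arithmetic involved, here is a Python sketch. It is not the Squeak conversion methods mentioned above. The names `utf8_encode` and `utf8_decode` are hypothetical, and the code assumes valid code points and well-formed UTF-8 input, so it does no error checking.

```python
def utf8_encode(code_points):
    """Encode an iterable of Unicode code points as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x80:                       # 1 byte: 0xxxxxxx
            out.append(cp)
        elif cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:                               # 4 bytes: 11110xxx + three 10xxxxxx
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)

def utf8_decode(data):
    """Decode well-formed UTF-8 bytes into a list of code points."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, extra = lead, 0             # ASCII byte
        elif lead < 0xE0:
            cp, extra = lead & 0x1F, 1      # lead byte of a 2-byte sequence
        elif lead < 0xF0:
            cp, extra = lead & 0x0F, 2      # lead byte of a 3-byte sequence
        else:
            cp, extra = lead & 0x07, 3      # lead byte of a 4-byte sequence
        for cont in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)  # append 6 payload bits per byte
        code_points.append(cp)
        i += 1 + extra
    return code_points
```

Each character costs only a few shifts and masks, which is why conversion can plausibly approach the cost of a copy. Whether it really does is what the benchmarks would have to show.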
Best regards
Janko
--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si