[Seaside] 3.9 and encoding
Todd Blanchard
tblanchard at mac.com
Wed Feb 28 04:30:05 UTC 2007
I took a quick look at the request processing and I don't see where
utf-8 gets decoded. AFAICS, it just doesn't do it, so every byte is
turned into one character, but maybe I'm missing something.
I have done a LOT of this stuff (formerly chief architect at a web
I18N company). There are a few things that are not so intuitive when
dealing with encodings and HTTP requests.
Escape sequences escape bytes, not characters.
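A small illustration of that point, as a sketch in Python rather than Seaside code (the sample string is made up for the example). The character é is two bytes in utf-8, so it arrives as two escape sequences, and treating each unescaped byte as a character gives the wrong text:

```python
from urllib.parse import unquote_to_bytes

escaped = "caf%C3%A9"              # "café" as a utf-8 client would escape it
raw = unquote_to_bytes(escaped)    # each %XX becomes exactly one byte

wrong = raw.decode("latin-1")      # one byte -> one character
right = raw.decode("utf-8")        # bytes composed into code points
```

Here `raw` holds five bytes, `wrong` is a five-character string ending in two junk characters, and `right` is the intended four-character string.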
On pass 1, you assume you have latin-1, parse the header and get the
content-type and associated charset. Remember this for later
translation.
Build a byte array from the string by putting ascii characters in as
bytes. Decode escape sequences into single bytes as you go.
Convert the byte array to a string by reading bytes and composing
them into code points according to the encoding specified as the
charset in the content-type. For utf-8 this means reading a byte,
checking the high order bits to find out the length of the byte
sequence, then reading the rest of the sequence, composing the code
point, etc...
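The utf-8 case can be sketched by hand as follows. This is a minimal Python illustration of the lead-byte check described above, not what any Seaside or Kom class does; the function name `decode_utf8` is invented for the example, and it deliberately skips the stricter checks a production decoder needs (overlong forms, surrogates).

```python
def decode_utf8(data: bytes) -> str:
    """Compose utf-8 byte sequences into code points by hand."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The high order bits of the lead byte give the sequence length.
        if lead < 0x80:               # 0xxxxxxx: one byte (ascii)
            length, code = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: two bytes
            length, code = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: three bytes
            length, code = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: four bytes
            length, code = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if b >> 6 != 0b10:        # continuation bytes are 10xxxxxx
                raise ValueError(f"bad continuation byte after offset {i}")
            code = (code << 6) | (b & 0x3F)   # append six payload bits
        chars.append(chr(code))
        i += length
    return "".join(chars)
```

In real code the platform's own decoder (here `bytes.decode("utf-8")`) is the better choice; the loop is only meant to make the byte-to-code-point step visible.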
Now you have text - start over and parse as normal.
Some of these steps can be folded but conceptually, this is how it
works.
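Put together, the steps might look like the sketch below. It is written in Python for illustration only; the function names are hypothetical, the latin-1 default mirrors the "assume latin-1" first pass, and it handles a single already-isolated value (one way of folding the steps). It does not treat "+" as a space, which form-encoded bodies would also need.

```python
import re

def charset_from_content_type(value: str, default: str = "latin-1") -> str:
    """Pass 1: pull the charset parameter out of a Content-Type value."""
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', value, re.IGNORECASE)
    return match.group(1) if match else default

def unescape_to_bytes(text: str) -> bytes:
    """Ascii characters go in as bytes; each %XX becomes one byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and re.fullmatch(r"[0-9A-Fa-f]{2}", pair):
            out.append(int(pair, 16))
            i += 3
        else:
            # Assumes the undecoded input is ascii/latin-1 (code point < 256).
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

def decode_field(escaped: str, content_type: str) -> str:
    charset = charset_from_content_type(content_type)  # remember the charset
    raw = unescape_to_bytes(escaped)                    # escapes -> bytes
    return raw.decode(charset)                          # bytes -> code points

# Usage with made-up request data:
name = decode_field("caf%C3%A9",
                    "application/x-www-form-urlencoded; charset=utf-8")
```

The final decode uses Python's built-in codecs instead of a hand-written loop, so the same code path works for whatever charset the client declares.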
So, AFAICS, WAKomEncoded39 is not doing the right thing wrt request
processing.
-Todd Blanchard
On Feb 27, 2007, at 3:26 PM, Philippe Marschall wrote:
> If you run WAKomEncoded39 on Squeak 3.9 you will have strings with
> (new) Squeak encoding in your image which is basically non-unified
> unicode. For latin-1 characters this will be indistinguishable from
> latin-1.