[Seaside] 3.9 and encoding

Wed Feb 28 04:30:05 UTC 2007

I took a quick look at the request processing and I don't see where
UTF-8 gets decoded.  AFAICS, it just doesn't do it, so each byte is
turned into one character, but maybe I'm missing something.

I have done a LOT of this stuff (formerly chief architect at a web  
I18N company).  There are a few things that are not so intuitive when  
dealing with encodings and http requests.

Escape sequences escape bytes, not characters.
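
To make that concrete, here is a small sketch in Python (my choice for illustration, not what Seaside uses). It relies only on the standard library's urllib.parse.unquote_to_bytes, and the sample value is made up. The escaped form of a single non-ASCII character is two escape sequences, i.e. two bytes, and what those bytes mean depends entirely on the charset you decode them with:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sample: the percent-encoded UTF-8 form of "é".
escaped = "%C3%A9"

# Each %XX escape yields exactly one byte, not one character.
raw = unquote_to_bytes(escaped)      # b'\xc3\xa9' (two bytes)

# Only the charset decides how many characters those bytes become.
as_utf8 = raw.decode("utf-8")        # one character: é
as_latin1 = raw.decode("latin-1")    # two characters: Ã©
```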

On pass 1, you assume you have latin-1, parse the header and get the  
content-type and associated charset.  Remember this for later  
translation.
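
A rough sketch of this first pass, again in Python purely for illustration. The helper name charset_from_content_type and the latin-1 fallback are my assumptions; the parsing itself leans on the standard library's email.message.Message:

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset(default)

# Pass 1: treat the raw header bytes as latin-1, which maps every byte
# to exactly one character and therefore never fails.
raw_header = b"application/x-www-form-urlencoded; charset=UTF-8"
header_text = raw_header.decode("latin-1")

# Remember this for the later translation of the request data.
charset = charset_from_content_type(header_text)   # "utf-8"
```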

Build a byte array from the string by putting ascii characters in as  
bytes.  Decode escape sequences into single bytes as you go.
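
This step could be sketched as follows (Python, for illustration only; the function name is hypothetical). It assumes the input string came out of the latin-1 first pass, so every character fits in a single byte. It leaves malformed escapes untouched and does not handle the "+" convention of form encoding:

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def percent_decode_to_bytes(s: str) -> bytes:
    """Turn a latin-1 string with %XX escapes into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(s):
        ch = s[i]
        # A valid escape is '%' followed by exactly two hex digits.
        if (ch == "%" and i + 2 < len(s)
                and s[i + 1] in HEX_DIGITS and s[i + 2] in HEX_DIGITS):
            out.append(int(s[i + 1:i + 3], 16))   # one escape -> one byte
            i += 3
        else:
            # Plain characters go in as bytes; raises if the input
            # is not latin-1 (code point above 255).
            out.extend(ch.encode("latin-1"))
            i += 1
    return bytes(out)
```

For example, percent_decode_to_bytes("caf%C3%A9") gives b'caf\xc3\xa9': five bytes, none of them interpreted as characters yet.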

Convert the byte array to a string by reading bytes and composing  
them into code points according to the encoding specified as the  
charset in the content-type.  For utf-8 this means reading a byte,  
checking the high order bits to find out the length of the byte  
sequence, then reading the rest of the sequence, composing the code  
point, etc...
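
For utf-8, the loop looks roughly like this. It is a simplified Python sketch with a made-up function name: it checks lead and continuation bytes but does not reject overlong forms or surrogates, which a production decoder should. In practice you would just call bytes.decode("utf-8"); this only spells out the mechanics:

```python
def decode_utf8(data: bytes) -> str:
    """Compose UTF-8 byte sequences into code points (simplified sketch)."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The high-order bits of the lead byte give the sequence length.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for cont in data[i + 1:i + length]:
            if cont >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (cont & 0x3F)
        chars.append(chr(cp))
        i += length
    return "".join(chars)
```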

Now you have text - start over and parse as normal.

Some of these steps can be folded but conceptually, this is how it  
works.
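
As an example of the folded form, Python's standard urllib.parse.parse_qs splits the query, unescapes to bytes, and decodes with a given charset in one call. This is only an illustration of the idea, not a claim about how Seaside or Kom should do it, and the query string is made up:

```python
from urllib.parse import parse_qs

# Hypothetical query string: UTF-8 bytes, percent-escaped.
query = "name=Andr%C3%A9&city=Z%C3%BCrich"

# The charset should be the one remembered from the Content-Type header.
right = parse_qs(query, encoding="utf-8")
# {'name': ['André'], 'city': ['Zürich']}

# Decoding one byte to one character gives the wrong text.
wrong = parse_qs(query, encoding="latin-1")
# {'name': ['AndrÃ©'], 'city': ['ZÃ¼rich']}
```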

So I don't think WAKomEncoded39 is doing the right thing wrt
request processing, AFAICS.

-Todd Blanchard

On Feb 27, 2007, at 3:26 PM, Philippe Marschall wrote:

> If you run WAKomEncoded39 on Squeak 3.9 you will have strings with
> (new) Squeak encoding in your image which is basically non-unified
> unicode. For latin-1 characters this will be indistinguishable from
> latin-1.
