[Seaside] 3.9 and encoding

Wed Feb 28 09:55:05 UTC 2007

On Wed, 2007-02-28 at 10:03 +0100, Philippe Marschall wrote:
> 2007/2/28, Norbert Hartl <norbert at hartl.name>:
> > On Wed, 2007-02-28 at 00:26 +0100, Philippe Marschall wrote:
> > > 2007/2/28, Norbert Hartl <norbert at hartl.name>:
> > > > Hi,
> > > >
> > > > I ran into an encoding problem. I'm using Seaside together
> > > > with Glorp. For the web server I use WAKomEncoded39.
> > > > WAKomEncoded39 converts the output to the browser to utf-8.
> > > > But on incoming requests the url escaped characters are
> > > > translated to something different. For me it appears to
> > > > be latin-1 but I've no clue why it should be that way.
> > > > I detected it because my postgresql session has client
> > > > encoding utf-8 turned on and I get an error trying to
> > > > store strings containing characters like ö.
> > >
> > > If you run WAKomEncoded39 on Squeak 3.9 you will have strings with
> > > (new) Squeak encoding in your image which is basically non-unified
> > > unicode. For latin-1 characters this will be indistinguishable from
> > > latin-1. If your database is utf-8 you need to encode your strings to
> > > utf-8 when writing them to your database and decode your strings from
> > > utf-8 when reading from the database (only to convert it back to utf-8
> > > when generating html). You can configure the PostgreSQL database driver
> > > to do this automatically for you.
> > >
> > Oh, this seems quite easy. But I didn't find anything to configure
> > in the Postgres driver. Do you have any hint?
> 
> PGConnection class >> #buildDefaultFieldConverters
> TestPGConnection >> #testFieldConverter
> 
> You need to register a field converter for your string types that does
> #convertFromEncoding: #utf8
> 
This way it is working already. I think as long as no one is touching
the string it comes as utf-8 from the database and gets encoded a
second time by WAKomEncoded39 which has no effect.

> Sorry, that only does the decoding and not the encoding. I guess in
> your case Glorp does the encoding. I don't know how you can customize
> the SQL generation there but if everything else fails you can change
> PGConnection >> #execute (yes, this is a hack)
> 
I don't think Glorp does encoding and I think it shouldn't. 
Glorp should be happy with strings. If there is conversion happening
it should happen in the postgres driver (it is the only one who
could know which encoding is needed for the database). 

My strings are carried by ByteString. It seems that a ByteString (got
from WAKomEncoded39) contains a bunch of bytes that could be in any
encoding (ok, it is the non-unified unicode, you said, and I don't know
what that means :) ).
I can convert it with convertToEncoding: to another encoding still
using ByteString. But there is no information about encoding in the
object. I think this is really dangerous. I have to look at WideString.
I'm curious how those deal with encodings they are created from. 
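
To illustrate the point about ByteString, here is a minimal workspace
sketch. It assumes a Squeak 3.9 image in which String>>convertToEncoding:
behaves as described in this thread; the variable names are made up, and
the encoding is named with the string 'utf-8' where the thread writes
#utf8.

```smalltalk
"Workspace sketch, assuming a Squeak 3.9 image. Variable names are illustrative."
| internal utf8 |
internal := String with: (Character value: 246).   "ö as a single Latin-1 code point"
utf8 := internal convertToEncoding: 'utf-8'.        "the same text, re-encoded as UTF-8 bytes"

internal class.          "ByteString"
utf8 class.              "ByteString as well; nothing marks it as UTF-8"
internal asByteArray.    "one byte: 246"
utf8 asByteArray.        "two bytes: 195 182"
```

Both objects answer the same class and carry no encoding tag, so code
receiving one of them cannot tell which interpretation applies.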

I think there are only two possibilities. Handle it like Java, Lisp
and convert every encoding to the internal (UCS-2) on string creation.
The other option which would be easier (I think) is to add the 
character encoding information into the string.

What do you think?

> sql := sqlString.
> to
> sql := sqlString convertToEncoding: #utf8.
> 
The hack is actually adding the conversion to
SqueakDatabaseAccessor>>basicExecuteSQLString:
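
A sketch of what the conversion at that boundary amounts to, assuming
String>>convertToEncoding: and String>>convertFromEncoding: as used in
this thread. The table name person and the variable names are made up,
and the actual driver call is left out; the encoded bytes simply stand in
for a value coming back from the database.

```smalltalk
"Sketch only: no database is contacted. Table and variable names are hypothetical."
| sqlString wire decoded |

"Image side: the SQL text as Squeak holds it internally."
sqlString := 'INSERT INTO person (name) VALUES (''' , (String with: (Character value: 246)) , ''')'.

"Outbound: encode once, right before the text is handed to the driver."
wire := sqlString convertToEncoding: 'utf-8'.

"Inbound: decode once, right after text comes back from the driver.
 Here the outbound bytes stand in for a returned field value."
decoded := wire convertFromEncoding: 'utf-8'.

decoded = sqlString.   "true when the round trip is lossless"
```

Keeping both conversions at the driver boundary means everything else in
the image only ever sees strings in the internal encoding.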

I understand a lot more now. Thanks very much.

Norbert
> P.S.:
> PGConnection class >> #buildDefaultFieldConverters
> has given us a lot of pain because Squeak doesn't have full block closures
> 
Oh, wow, another day hearing a lot of basic things I don't have any idea
about :) What are "full" block closures?