[Seaside] UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

Philippe Marschall philippe.marschall at gmail.com
Fri Oct 14 17:53:50 UTC 2011


2011/10/13 Marten Feldtmann <itlists at schrievkrom.de>:
> Hello,
>
> I have a question about what all these classes do (in a big picture) and how
> they work together and when they are actually called. I looked into the
> source code, but I am still having problems of fully understanding.
>
> When I have an Adapter with GRNullCodec I assume, that all (?) traffic,
> content (?) goes through the GRNullCodec, but due to the fact, that
> GRNullCodec does nothing, the traffic/content is not changed.
>
> What exactly goes through these codec ?
>
> If I use an adapter with GRPharoUtf8Codec is then the content converted
> to/from UTF8 ????
>
> What does this mean to strings (in my application) I render on my pages like
> in the following command:
>
> html text: stringInSomeCodePage
>
> in both cases GRNullCodec and GRPharoUtf8Codec and with texts in specific
> code pages like Utf8, Latin1 and "true" Unicode (Utf32).
>
> In my firsts demos I held all my strings in UTF8 and used the GRNullCodec
> (and everything is ok in the browser side).
>
> Then I changed to GRPharoUtf8Codec and it seems to me, that I got now an
> additional UTF8 conversion. Then I switched my application strings back to
> Latin1 and it was ok again.
>
> How does this all work with Unicode characters with code points > 255 (and
> usage of GRPharoUtf8Codec) (in Squeak: WideString)?
>
> When is a GRPharoUtf8Codec really needed ??
>
> Perhaps this is a stupid question .... but then I would like to know it :-))

So, …

A codec is an object that handles string encodings. It has two main
responsibilities [1]:
 encodedString - #decode: -> decodedString
 decodedString - #encode: -> encodedString

For example:

 utf-8  - #decode: -> "native" string
 "native" string  - #encode: -> utf-8

"Native" means whatever the semantics of a string are in the
current dialect. That's ByteString/WideString in Pharo,
ByteString/TwoByteString/FourByteString in VW and so on. ä is 'ä', €
is '€' and ☃ is '☃'.
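The two responsibilities can be sketched outside Smalltalk as well. The
following is a minimal model in Python, not Seaside code: Python's str
stands in for the "native" string and bytes for the encoded form, and the
names samples, encoded and decoded are made up for the illustration.

```python
# Model of #encode: / #decode: using Python's built-in utf-8 codec.
# str plays the role of the "native" string, bytes the encoded form.
samples = ["ä", "€", "☃"]

# "native" string - encode -> utf-8
encoded = [s.encode("utf-8") for s in samples]

# utf-8 - decode -> "native" string
decoded = [b.decode("utf-8") for b in encoded]

for native, raw in zip(samples, encoded):
    # One native character can become several encoded bytes.
    print(native, list(raw))
```

Each sample is a single character in its native form but two or three
bytes once encoded, and decoding gives back the original characters.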

A codec stream is a stream that encodes the elements you write on it
and passes them to an underlying stream. So for example you can write
"native" strings to a codec stream and it encodes them and passes them
to an underlying stream. You can ask a codec for a stream.
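The idea of a codec stream can be illustrated with a rough Python
analogue, assuming the standard codecs module as a stand-in for the
codec. This is not how GRPharoUtf8CodecStream is implemented; it only
shows the "encode on write, pass to the underlying stream" behaviour.

```python
import codecs
import io

# The underlying stream only knows about bytes.
underlying = io.BytesIO()

# Asking the codec for a stream: the writer encodes everything written
# to it and passes the result on to the underlying stream.
stream = codecs.getwriter("utf-8")(underlying)

stream.write("ä")   # native string in
stream.write("€")   # native string in

result = underlying.getvalue()  # utf-8 bytes out
print(list(result))
```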

GRPharoUtf8Codec and GRPharoUtf8CodecStream are the Pharo-specific
implementation classes for utf-8. They contain some fast paths that
only work in Pharo. They are part of Seaside. UTF8TextConverter is
part of Pharo and implements utf-8 decoding and encoding. We use it
for cases that are not covered by the fast paths.

GRNullCodec implements an identity transformation, #encode: and
#decode: always answer the exact same string you passed them. It's
there for historic reasons and because some non-single byte string
classes have (had) quite severe bugs and performance regressions.

So how does all this fit together?

The codec on the server adapter decodes the request (#decode:) and
encodes the response (#encode:) [2].

So when you set the codec on the server adapter to utf-8 the following
is supposed to happen:
 request (utf-8)  - #decode: -> "native" string
 response ("native" string) - #encode: -> utf-8

That means the strings you #render: or #text: have to be "native".
They must not be in any encoding other than "Smalltalk". It also means
the encoding the application reports to the browser has to be utf-8
(happens automatically unless you override it).
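The request/response flow with a utf-8 codec can be modelled as follows.
This is a sketch in Python under stated assumptions: Utf8Codec, handle
and echo_app are hypothetical names for the illustration and do not
correspond to Seaside or Grease classes.

```python
class Utf8Codec:
    """Hypothetical stand-in for a utf-8 codec (not the Grease class)."""

    def decode(self, raw: bytes) -> str:
        return raw.decode("utf-8")

    def encode(self, text: str) -> bytes:
        return text.encode("utf-8")


def handle(codec, raw_request: bytes, application) -> bytes:
    """Decode the request, run the application, encode the response."""
    native_request = codec.decode(raw_request)
    native_response = application(native_request)
    return codec.encode(native_response)


def echo_app(text: str) -> str:
    # The application only ever sees and produces "native" strings.
    return "You sent: " + text + " (" + str(len(text)) + " character)"


# The browser sends an ä as two utf-8 bytes.
wire_response = handle(Utf8Codec(), b"\xc3\xa4", echo_app)
print(wire_response)
```

The application sees one character, although two bytes arrived on the
wire, and the response is turned back into utf-8 without the application
doing any encoding itself.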

So what if you set the codec to a GRNullCodec? Well, nothing happens.

 request (whatever encoding)  - #decode: -> whatever encoding
 response (whatever encoding) - #encode: -> whatever encoding

You get strings where each character is a byte as sent by the browser
and you're supposed to deliver strings where each character is a byte in
the same encoding.

So in the case of utf-8 you would get:

 request (utf-8)  - #decode: -> utf-8
 response (utf-8) - #encode: -> utf-8

So instead of 'ä' you would get (String with: (Character value: 195)
with: (Character value: 164)) (an ä encoded as utf-8). The same is
true for the strings you #render: and #text:, they have to be utf-8
encoded already as well. It also means the encoding the application
reports to the browser has to be utf-8.
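The null codec case, and the "additional UTF8 conversion" described in
the question, can be reproduced in a small Python model. It assumes that
"one character per byte" can be modelled with Latin-1, which maps byte
value n to code point n; the variable names are made up for the
illustration.

```python
# What the browser sends for an ä in utf-8.
raw = "ä".encode("utf-8")

# With a null codec the application sees one character per byte.
as_seen = raw.decode("latin-1")
codes = [ord(c) for c in as_seen]

# Null codec on the way out: the bytes pass through unchanged,
# so the browser receives valid utf-8 again.
passthrough = as_seen.encode("latin-1")

# Handing the same already-encoded string to a utf-8 codec instead
# encodes it a second time.
double = as_seen.encode("utf-8")

print(codes, list(passthrough), list(double))
```

The double-encoded bytes are what a browser would display as two
characters instead of one ä, which matches the symptom of utf-8 strings
combined with a utf-8 codec.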

I hope this makes what's supposed to happen a bit more clear.

 [1] Yes, it's actually wrong that we have encoded strings; we should
have byte arrays instead.
 [2] Actually it's a bit more involved. It also has an #url codec that
is responsible for the url encoding when rendering URLs. URLs do not
necessarily have the same encoding as the HTML page on which they are
rendered (yes).

Cheers
Philippe

