[squeak-dev] Re: UTF8 in JSON (was: Re: [ANN] WebClient and WebServer 1.0 for Squeak)

Levente Uzonyi leves at elte.hu
Tue May 11 23:47:43 UTC 2010


On Tue, 11 May 2010, Hannes Hirzel wrote:

> 1) UTF8 conversion
> 2) Change to JSON package of Tony Garnock-Jones
> 3) My updated Test case
> 4) Conclusion
>
>
> 1) UTF8 conversion
>
> My question was:
>    How do I convert a WideString to UTF8?
>
>
> Levente answered:
>
> There are various possibilities:
> 'äbc' squeakToUtf8.
> 'äbc' convertToEncoding: 'utf-8'.
> 'äbc' convertToWithConverter: UTF8TextConverter new.
> UTF8TextConverter new encodeString: 'äbc'.
>
>
>
> 2) Change to JSON package of Tony Garnock-Jones
>
> As CouchDB stores UTF8 values I did not want to escape them with
> \uNNNN as the forked JSON package in SCouchDB does. But instead I
> wanted to keep UTF8 in the db. As Rado pointed out the UTF8 conversion
> is not correct in the original JSON package.
>
> So I did the following correction.
>
> In the class
>  String  - category *JSON-writing
>  (from package http://www.squeaksource.com/JSON)
> I replaced
>
>  jsonWriteOn: aStream
> 	| replacement |
> 	aStream nextPut: $".
> 	self do: [ :ch |
> 		(replacement := Json escapeForCharacter: ch)    "***"
> 			ifNil: [ aStream nextPut: ch ]
> 			ifNotNil: [ aStream nextPutAll: replacement ] ].
> 	aStream nextPut: $".
>
>
> WITH
>
>  jsonWriteOn: aStream
> 	aStream nextPut: $".
> 	aStream nextPutAll:  (UTF8TextConverter new encodeString: self).
> 	aStream nextPut: $".

This is just wrong. According to http://json.org a string can contain any
Unicode character except for \ " and control characters, which must be
escaped. So there should be no UTF-8 conversion here.

The only reason you need to convert the characters to UTF-8 is that you're
sending them over the network to a server, and Unicode characters have to
be converted to bytes somehow. So the JSON printer shouldn't do any
conversion by default except for escaping. The only problem is that
escaping is not done as the spec requires, but that's easy to fix.
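
A minimal workspace sketch of that separation, not the actual fix to the
package. The escape block and the hex table are hypothetical names made up
for illustration; squeakToUtf8 is the String method listed among the
conversion options quoted above. The idea is that the printer only escapes
\ " and control characters, and the UTF-8 conversion happens once, right
before the bytes are sent:

```smalltalk
"Sketch only: 'escape' and 'hex' are illustrative names, not part of the JSON package."
| hex escape json bytes |
hex := '0123456789abcdef'.
escape := [ :aString |
	String streamContents: [ :out |
		out nextPut: $".
		aString do: [ :ch | | code |
			code := ch asInteger.
			(ch = $" or: [ ch = $\ ])
				ifTrue: [
					"Quote and backslash get a backslash in front."
					out nextPut: $\; nextPut: ch ]
				ifFalse: [
					code < 32
						ifTrue: [
							"Control characters become \uNNNN, always four hex digits."
							out nextPutAll: '\u00'.
							out nextPut: (hex at: code // 16 + 1).
							out nextPut: (hex at: code \\ 16 + 1) ]
						ifFalse: [
							"Everything else, including non-ASCII, is written unchanged."
							out nextPut: ch ] ] ].
		out nextPut: $" ] ].

"Still a Squeak string of Unicode characters; only the cr has been escaped."
json := escape value: 'ä', (String with: (Character value: 8220)), (String with: Character cr), 'b'.

"Convert to UTF-8 only at the network boundary, for example for the
 content: argument of WebClient httpPut:content:type:."
bytes := json squeakToUtf8.
```

With this split the JSON text stays valid for any consumer inside the
image, and the choice of byte encoding remains a transport decision.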


Levente

>
>
> "*** NOTE: escapeForCharacter is incorrectly implemented in
> http://www.squeaksource.com/JSON
> and is corrected by Rado in the SCouchDB fork of the package JSON
> http://www.squeaksource.com/SCouchDB/SCouchDB-Core-rh.8.mcz"
>
>
>
> 3) My updated Test case
>
> myWideString := ('ä', 8220 asCharacter asString, Character cr, 'b').
> d := Dictionary new.
> d at: 'title' put: 'aTitle'.
> d at: 'body' put: myWideString.
> r := WriteStream on: String new.
> (JsonObject newFrom: d) jsonWriteOn: r.
> WebClient httpPut: host, '/notes/test24' content: r contents type: 'text/plain'.
>
> RESULT: OK.
>
>
>
> 4) Conclusion
>
> With the change to the JSON package I can now use WebClient
> to store objects in CouchDB.
>
> However I did not commit my change to
>  http://www.squeaksource.com/JSON
> as I do not (yet) understand the full impact of it.
>
>
> Thank you Andreas Raab, Levente Uzonyi and Rado Hodnicak for your help
>
> --Hannes
>
> On 5/11/10, Igor Stasenko <siguctua at gmail.com> wrote:
>> On 11 May 2010 17:44, Hannes Hirzel <hannes.hirzel at gmail.com> wrote:
>>> On 5/10/10, radoslav hodnicak <rh at 4096.sk> wrote:
>>>>
>>>> Which JSON package/version are you using? I fixed a bug in the one
>>>> distributed with SCouchDB a few weeks ago, where it didn't encode utf8
>>>> characters properly. The correct escaped form is \uNNNN, always padded
>>>> to 4 hex digits. That's why you get that warning: yours has only 2-3.
>>>>
>>>> rado
>>>
>>> I have been using
>>> http://www.squeaksource.com/JSON (over 7000 downloads)
>>> in combination with WebClient.
>>>
>>> Thank you Rado, I found
>>> http://www.squeaksource.com/SCouchDB/SCouchDB-Core-rh.8.mcz
>>> and will have a look at it.
>>> (Your comment: added handling of utf8 encoded input data - this is
>>> necessary for couchdb-lucene which sends results directly in utf8 and
>>> not \uNNNN encoded)
>>>
>> SCouchDB uses a forked version of the JSON package, which you can find
>> in the SCouchDB repository
>> http://www.squeaksource.com/SCouchDB/JSON-Igor.Stasenko.34.mcz
>>
>> If you are looking for that method, it can be found in Json>>unescapeUnicode
>>
>>
>>> --Hannes
>>>
>>>
>>>> On Mon, 10 May 2010, Hannes Hirzel wrote:
>>>>
>>>>> The test case made simpler
>>>>>
>>>>> WebClient httpPut: host, '/notes/test7' content:
>>>>> '{"content":"\uC3\uA4s"}' type: 'text/plain'.
>>>>>
>>>>> gives back as answer: '{"error":"bad_request","reason":"invalid UTF-8
>>>>> JSON"}
>>>>> '
>>>>>
>>>>> whereas
>>>>>
>>>>> WebClient httpPut: host, '/notes/test8' content: '{"content":"abc"}'
>>>>> type: 'text/plain'.
>>>>>
>>>>> gives back
>>>>> '{"ok":true,"id":"test8","rev":"1-f40e52919735ae6775af3d388361b3da"}
>>>>> '
>>>>>
>>>>> --Hannes
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Best regards,
>> Igor Stasenko AKA sig.
>>
>>
>
>

