[squeak-dev] The Trunk: WebClient-Core-topa.112.mcz

Thu Nov 2 14:41:51 UTC 2017

> Sent: Tuesday, October 31, 2017 at 2:23 PM
> From: "Tobias Pape" <Das.Linux at gmx.de>
> To: "The general-purpose Squeak developers list" <squeak-dev at lists.squeakfoundation.org>
> Subject: Re: [squeak-dev] The Trunk: WebClient-Core-topa.112.mcz
>
> Followup 2:
> 
> > On 31.10.2017, at 18:03, Tobias Pape <Das.Linux at gmx.de> wrote:
> > 
> > Followup
> >> On 31.10.2017, at 17:54, Tobias Pape <Das.Linux at gmx.de> wrote:
> >> 
> >> 
> >>> On 31.10.2017, at 07:51, monty <monty2 at programmer.net> wrote:
> >>> 
> >>>> Sent: Friday, October 27, 2017 at 1:33 PM
> >>>> From: "Tobias Pape" <Das.Linux at gmx.de>
> >>>> To: "The general-purpose Squeak developers list" <squeak-dev at lists.squeakfoundation.org>
> >>>> Subject: Re: [squeak-dev] The Trunk: WebClient-Core-topa.112.mcz
> >>>> 
> >>>> 
> >>>>> On 27.10.2017, at 18:40, monty <monty2 at programmer.net> wrote:
> >>>>> 
> >>>>> This series of content decoding changes is not backwards compatible. Existing code that does its own response decoding with the assumption that WebClient won't do it will now double-decode the response--corrupting it. But breaking existing code is probably worth it here because of the gains in intuitiveness and correctness.
> >>>>> 
> >>>>> Is there any way to disable the automatic response decoding?
> >>>> 
> >>>> That's problematic.
> >>>> Some conversation was always done, you never ended up with mere Strings/ByteArrays.
> >>>> 
> >>>> This is the old version, that opportunistically applied inflate/deflate:
> >>> 
> >>> Which is part of the HTTP protocol itself and is meant to be applied before transit and removed after in a way that is entirely transparent to the application. By decoding the response content from UTF-8 or some other character encoding, you're directly modifying the content before giving it to the application, and in a way that is permanent and maybe incorrect.
> >>> 
> >> 
> >> Yes, when the http-header says so?
> >> 
> >> (see https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html)
> >> 
> >> I don't see why handling the 'content-encoding' (section 3.5, eg, deflate), the 'transfer-encoding' (section 3.6, eg, chunked), and the 'content-type' (section 3.7, with 'charset') headers differently.

Transfer-Encoding/Content-Encoding don't change the underlying message body the application is given. The same byte sequence the server started with, for example, a UTF-16 encoded text file read from disk, is available to the application after the HTTP client reverses the GZip compression applied by the server before transit. But decoding from a particular character encoding permanently transforms the message body before it's given to the application.

> >> Actually, it probably should also handle the 'charset' (section 3.4) header, too.

There is no charset header, only a charset parameter for certain headers like Content-Type.

> > Also, the protocol says:
> > 
> > "HTTP/1.1 recipients MUST respect the charset label provided by the sender"
> 
> 
> For html (and xml) it is stated that the header takes precedence regardless:
> 
> https://www.w3.org/International/questions/qa-html-encoding-declarations.en
> "The HTTP header information has the highest priority when it conflicts with in-document declarations […]"

For HTML5, you're wrong. BOMs in the content take precedence over the Content-Type charset parameter:

"That this step happens before the next one honoring the HTTP Content-Type header is a willful violation of the HTTP specification, motivated by a desire to be maximally compatible with legacy content. [HTTP]"

https://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding

And even if .114.mcz is rejected, .113.mcz is still worth integrating.

> > 
> > 
> >> 
> >> Also, HTTP explicitly says that anything not tagged is to be handled as iso-8859-1, so anything that does _not_ tag in the header but only in the content (say, xml whith utf8) will be problematic.
> >> 
> >> OTOH, you still are able to retrieve the information from the message.
> >> 
> >> So, say the XML doc says it's utf-8 encoded, you can still ask the message whether the content-type;charset or the charset header already where sent as utf-8, and if not, recode from iso-8859-1.
> >> 
> >> 
> >> 
> >>> Please review and merge WebClient-Core-monty.114.mcz.
> >> 
> >> I'll review.
> >> 
> >> 
> >> Best regards
> >> 	-Tobias
> >> 
> >> 
> >> 
> >>> 
> >>>> ================
> >>>> getContentWithProgress: progressBlockOrNil
> >>>> 	"Reads available content and returns it."
> >>>> 
> >>>> 	| length result |
> >>>> 	length := self contentLength.
> >>>> 	result := (stream isBinary ifTrue:[ ByteArray ] ifFalse: [ ByteString ]) 
> >>>> 		new: (length ifNil: [ 1000 ])
> >>>> 		streamContents: [ :outputStream | 
> >>>> 			self 
> >>>> 				streamFrom: stream
> >>>> 				to: outputStream
> >>>> 				size: length
> >>>> 				progress: progressBlockOrNil ].
> >>>> 	(self headerAt: 'content-encoding') = 'gzip' ifFalse: [ ^result ].
> >>>> 	^(GZipReadStream on: result) upToEnd
> >>>> =================
> >>>> 
> >>>> 
> >>>> Maybe we need a plain "get me the bytes" variant.
> >>>> Oh there is:
> >>>> 
> >>>> WebMessage>>streamFrom:to:size:progress: 
> >>>> 
> >>>> That's somewhat inelegant but should work…
> >>>> 
> >>>> 
> >>>> 
> >>>> Best regards
> >>>> 	-Tobias
> >>>> 
> >>>>> 
> >>>>>> Sent: Wednesday, September 20, 2017 at 11:29 AM
> >>>>>> From: commits at source.squeak.org
> >>>>>> To: squeak-dev at lists.squeakfoundation.org, packages at lists.squeakfoundation.org
> >>>>>> Subject: [squeak-dev] The Trunk: WebClient-Core-topa.112.mcz
> >>>>>> 
> >>>>>> Tobias Pape uploaded a new version of WebClient-Core to project The Trunk:
> >>>>>> http://source.squeak.org/trunk/WebClient-Core-topa.112.mcz
> >>>>>> 
> >>>>>> ==================== Summary ====================
> >>>>>> 
> >>>>>> Name: WebClient-Core-topa.112
> >>>>>> Author: topa
> >>>>>> Time: 20 September 2017, 5:29:06.096983 pm
> >>>>>> UUID: 60b494fc-0652-4a28-be5a-1578963e5aed
> >>>>>> Ancestors: WebClient-Core-topa.111
> >>>>>> 
> >>>>>> Abide Postel's law for text conversion.
> >>>>>> 
> >>>>>> Be conservative in what you do, be liberal in what you accept from others.
> >>>>>> 
> >>>>>> =============== Diff against WebClient-Core-topa.111 ===============
> >>>>>> 
> >>>>>> Item was changed:
> >>>>>> ----- Method: WebMessage>>getContentWithProgress: (in category 'private') -----
> >>>>>> getContentWithProgress: progressBlockOrNil
> >>>>>> 	"Reads available content and returns it."
> >>>>>> 
> >>>>>> 	| length result |
> >>>>>> 	length := self contentLength.
> >>>>>> 	result := (stream isBinary ifTrue:[ ByteArray ] ifFalse: [ ByteString ]) 
> >>>>>> 		new: (length ifNil: [ 1000 ])
> >>>>>> 		streamContents: [ :outputStream | 
> >>>>>> 			self 
> >>>>>> 				streamFrom: stream
> >>>>>> 				to: outputStream
> >>>>>> 				size: length
> >>>>>> 				progress: progressBlockOrNil ].
> >>>>>> 	self decoderForContentEncoding ifNotNil: [:decoder |
> >>>>>> 		result := (decoder on: result) upToEnd].
> >>>>>> 	self textConverterForContentType ifNotNil: [:converter |
> >>>>>> + 		[result := result convertFromWithConverter: converter]
> >>>>>> + 		on: InvalidUTF8 "some servers lie"
> >>>>>> + 		do: [^ result]].
> >>>>>> - 		result := result convertFromWithConverter: converter].
> >>>>>> 	^ result
> >>>>>> !
> >>>>>> 
> >>>>>> Item was changed:
> >>>>>> ----- Method: WebMessage>>textConverterForContentType (in category 'accessing') -----
> >>>>>> textConverterForContentType
> >>>>>> 
> >>>>>> 	| index contentType |
> >>>>>> 	contentType := self contentType.
> >>>>>> 	contentType size < 8 ifTrue: [ ^nil ].
> >>>>>> - 	(contentType beginsWithAnyOf: #('application/' 'image/' 'video/' 'audio/')) ifTrue: [^nil].
> >>>>>> 	index := contentType findString: 'charset=' startingAt: 1 caseSensitive: false.
> >>>>>> 	index = 0 ifTrue: [ ^nil ].
> >>>>>> 	^TextConverter newForEncoding: (contentType allButFirst: index + 7) "'charset=' size - 1"!
> 
> 
>