[squeak-dev] Re: SocketStream: Switching ascii/binary modes

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Tue Mar 16 09:51:30 UTC 2010


2010/3/16 Igor Stasenko <siguctua at gmail.com>:
> On 16 March 2010 10:57, Nicolas Cellier
> <nicolas.cellier.aka.nice at gmail.com> wrote:
>> 2010/3/16 Igor Stasenko <siguctua at gmail.com>:
>>> On 16 March 2010 05:51, Andreas Raab <andreas.raab at gmx.de> wrote:
>>>> On 3/15/2010 8:14 PM, Igor Stasenko wrote:
>>>>>
>>>>> There could be an alternative approach:
>>>>> - keep the buffers in a single (binary) format and convert the output
>>>>> depending on the mode.
>>>>>
>>>>> The choice is when you pay the conversion price:
>>>>> - each time you read something
>>>>> - each time you switch the mode
>>>>>
>>>>> If the input is a mix of ascii/binary content, it will be very
>>>>> inefficient to convert the cache on every mode switch.
>>>>> For example - HTTP 'transfer-encoding: chunked'.
>>>>> The content may be binary data, but if it is chunked, the input
>>>>> becomes a mix of binary data, hexadecimal ascii size values,
>>>>> and crlf's.
>>>>>
>>>>> So, it requires deeper analysis than just saying 'convert it' :)
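For reference, the mixed framing described above is HTTP/1.1 chunked transfer coding: ASCII hexadecimal size lines and CRLFs interleaved with raw (possibly binary) chunk data. A minimal decoder for a complete body, sketched in Python purely for illustration (not Squeak code; chunk extensions and trailers are ignored):

```python
def decode_chunked(data: bytes) -> bytes:
    """Decode a complete HTTP/1.1 chunked-encoded body.

    Each chunk is: <hex size>CRLF <size bytes of data> CRLF,
    terminated by a zero-size chunk.
    """
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol], 16)    # ASCII hex size line
        pos = eol + 2
        if size == 0:
            break                        # terminating zero-size chunk
        out += data[pos:pos + size]      # raw (possibly binary) chunk data
        pos += size + 2                  # skip chunk data plus trailing CRLF
    return bytes(out)

body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
# decode_chunked(body) == b"Wikipedia"
```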
>>>>
>>>> I don't think it's all that complicated :-)
>>>>
>>>> First, you'd slow down all current use cases and introduce a lot of
>>>> potential bugs if you added conversion upon access. You would also break any
>>>> extension methods (the next:into: methods were originally extensions on
>>>> SocketStream before I added them to trunk). Given all of that, changing
>>>> SocketStream in that way seems highly questionable.
>>>>
>>>> The specific use case of chunked encoding is interesting too, since the
>>>> motivation of adding the next:into: family of methods came from reading
>>>> chunked encoding :-) As a consequence, the fastest way to read chunked
>>>> encoding in Squeak today is the following:
>>>>
>>>> buffer := ByteArray new. "or: ByteString new"
>>>> [firstLine := socketStream nextLine.
>>>> chunkSize := ('16r',firstLine asUppercase) asNumber. "icky but works"
>>>> chunkSize = 0] whileFalse:[
>>>>  buffer size < chunkSize
>>>>    ifTrue:[buffer := buffer class new: chunkSize]. "grow buffer if too small"
>>>>  buffer := socketStream next: chunkSize into: buffer startingAt: 1.
>>>>  outStream next: chunkSize putAll: buffer.
>>>>  socketStream skip: 2. "CRLF"
>>>> ].
>>>> socketStream skip: 2. "CRLF"
>>>>
>>>> There is no conversion needed between ascii/binary since the next:into: code
>>>> accepts both strings and byte arrays. At the end of the day, switching
>>>> between ascii and binary is a bit of a convenience function, which means that
>>>> you probably shouldn't be writing high-performance code that depends on
>>>> constantly switching between the two (I think that's a fair tradeoff). The
>>>> next:into: family was specifically provided for high-performance situations
>>>> by providing a pre-allocated buffer and avoiding the allocation overhead.
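The buffer-reuse idea behind #next:into: exists in other languages as well; as an illustration only (not part of the Squeak API under discussion), here is the same pattern in Python, using a pre-allocated bytearray and readinto, with an in-memory stream standing in for the socket:

```python
import io

# Read fixed-size records into a reusable, pre-allocated buffer
# (the same idea as SocketStream>>next:into:): no fresh buffer is
# allocated per read. `stream` stands in for a socket stream here.
stream = io.BytesIO(b"abcdefghij" * 3)    # stand-in for a socket
buf = bytearray(10)                       # pre-allocated, reused buffer
records = []
while True:
    n = stream.readinto(buf)              # fills buf in place
    if not n:
        break
    records.append(bytes(buf[:n]))        # copy out only when needed
# records == [b"abcdefghij"] * 3
```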
>>>>
>>>
>>> Yes, #next:into: is convenient if you know the content size from the
>>> start, or if you want to read everything into memory at once.
>>> But the conversion strategy you show above doesn't work well in
>>> all cases.
>>> For persistent streams, used for exchanging data between peers,
>>> there is no notion of 'read everything up to the end';
>>> it is usually 'read what is currently available', because the peers
>>> exchange data in real time and you can't predict what will follow
>>> the last input.
>>>
>>> My current intent is to make a fast reader which uses a socket as a
>>> backend, and which:
>>>
>>> - reads/parses http headers
>>> - handles chunked content transfer encoding
>>> - handles utf8 content encoding
>>> - and only then feeds a consumer, which is a JSON parser parsing the
>>> input character by character,
>>> and which, like many other parsers, obviously has no use for
>>> #next:into:, but uses #peek and #next all the way.
>>>
>>> The idea is to parse the data once it becomes available, instead of
>>> reading everything up to the end and only then starting to parse.
>>> You could ask why this is more effective - because of network latency.
>>> A client, instead of simply waiting for the next data packet to
>>> arrive, can spend this time more productively by parsing the
>>> available input (besides, it will spend this time anyway, so why
>>> waste it?).
>>> This means that the results of parsing will be available earlier,
>>> compared to a scheme where you start parsing only when all the data
>>> has arrived.
>>> Also, the bigger the content, the more efficient this is, not only
>>> in speed but also in memory consumption.
>>>
>>> So, i tend to look for designs where the socket stream is focused on
>>> streaming the data and doesn't assume that its consumer has any
>>> preference for the buffered approach (#next:into:) over the
>>> non-buffered one (#next).
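The streaming consumer described above can be modeled as a push-style decoder: bytes are fed in as they arrive, and decoded chunk data is handed onward immediately rather than after the whole message is buffered. A hypothetical Python sketch (the ChunkedFeeder name and API are invented for illustration, not taken from any Squeak class):

```python
class ChunkedFeeder:
    """Push-style incremental decoder for HTTP/1.1 chunked bodies.

    feed() is called with whatever bytes have arrived from the socket;
    decoded chunk data is passed to `consumer` immediately, so the
    consumer (e.g. a character-by-character parser) can work while the
    rest of the message is still in flight.
    """

    def __init__(self, consumer):
        self.consumer = consumer
        self.buf = bytearray()
        self.remaining = 0   # data bytes still owed in the current chunk
        self.trailer = 0     # CRLF bytes to skip after the chunk data
        self.done = False

    def feed(self, data):
        self.buf += data
        while not self.done:
            if self.trailer:                      # skip CRLF after chunk
                if len(self.buf) < self.trailer:
                    return                        # need more input
                del self.buf[:self.trailer]
                self.trailer = 0
            elif self.remaining:                  # inside a chunk's data
                if not self.buf:
                    return                        # need more input
                take = bytes(self.buf[:self.remaining])
                del self.buf[:len(take)]
                self.remaining -= len(take)
                if self.remaining == 0:
                    self.trailer = 2              # CRLF follows the data
                self.consumer(take)               # deliver immediately
            else:                                 # expecting a size line
                eol = self.buf.find(b"\r\n")
                if eol < 0:
                    return                        # size line incomplete
                size = int(self.buf[:eol], 16)    # ASCII hex chunk size
                del self.buf[:eol + 2]
                if size == 0:
                    self.done = True              # terminating chunk
                else:
                    self.remaining = size
```

Fed the same body split arbitrarily across packets, the consumer sees the decoded data as soon as each fragment arrives.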
>>>
>>
>> Then you should really consider looking at VW-XTream transforming: stuff.
>> The idea is to have parallel processing (pipelines).
>
> Err. Pipes are not parallel processing. They are sequential - the output
> of one pipe is the input of another.

Well, Mr Ford understood that before us ;) - having only one object
working while the others are resting is not the most efficient way to
process a sequential stream.
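One way to picture such a pipeline, sketched in Python with generators (purely illustrative; this is not VW-XTream's actual transforming: API): each stage lazily consumes the previous stage's output, so work proceeds piece by piece as data flows through, and a multi-byte UTF-8 sequence split across packets is still decoded correctly by the stateful incremental decoder.

```python
import codecs

def utf8_stage(byte_packets):
    """Pipeline stage: decode UTF-8 incrementally. Multi-byte sequences
    may be split across packets, so a stateful incremental decoder is
    required rather than bytes.decode() per packet."""
    dec = codecs.getincrementaldecoder("utf-8")()
    for pkt in byte_packets:
        text = dec.decode(pkt)
        if text:
            yield text
    tail = dec.decode(b"", final=True)    # flush any buffered state
    if tail:
        yield tail

def upper_stage(texts):
    """Toy consumer stage: transforms each piece as soon as it arrives."""
    for t in texts:
        yield t.upper()

# 'é' (0xC3 0xA9) is deliberately split across two packets.
packets = [b"caf", b"\xc3", b"\xa9 au lait"]
result = "".join(upper_stage(utf8_stage(packets)))
# result == "CAFÉ AU LAIT"
```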

Nicolas

> And sure thing, this is how i think good streams should work.
> Too bad i have to use what we have in Squeak/Pharo.. or do everything
> from scratch.
>
>> Of course, we cannot have true parallelism yet in Smalltalk, but at
>> least the first level can work with a non-blocking Squeak socket.
>>
>> Nicolas
>>
>>>> Cheers,
>>>>  - Andreas
>>>
>>>
>>> --
>>> Best regards,
>>> Igor Stasenko AKA sig.
>>>
>>>
>>
>>
>
>
>
> --
> Best regards,
> Igor Stasenko AKA sig.
>
>


