[squeak-dev] Re: SocketSteam: Switching ascii/binary modes

Igor Stasenko siguctua at gmail.com
Tue Mar 16 09:28:56 UTC 2010


On 16 March 2010 10:57, Nicolas Cellier
<nicolas.cellier.aka.nice at gmail.com> wrote:
> 2010/3/16 Igor Stasenko <siguctua at gmail.com>:
>> On 16 March 2010 05:51, Andreas Raab <andreas.raab at gmx.de> wrote:
>>> On 3/15/2010 8:14 PM, Igor Stasenko wrote:
>>>>
>>>> There could be an alternative approach:
>>>> - keep buffers in a single (binary) format and convert the output
>>>> depending on the mode.
>>>>
>>>> The choice is when you pay the conversion price:
>>>> - each time you read something
>>>> - each time you switch modes
>>>>
>>>> If the input is a mix of ascii/binary content, converting the cache
>>>> on every mode switch will be very inefficient.
>>>> For example - HTTP 'transfer-encoding: chunked'.
>>>> The content may be binary data, but once it is chunked, the input
>>>> becomes a mix of
>>>> binary data, hexadecimal ascii values, and crlf's.
>>>>
>>>> So, it requires more thorough analysis than just saying 'convert it' :)
>>>
>>> I don't think it's all that complicated :-)
>>>
>>> First, you'd slow down all current use cases and introduce a lot of
>>> potential bugs if you added conversion upon access. You would also break any
>>> extension methods (the next:into: methods were originally extensions on
>>> SocketStream before I added them to trunk). Given all of that, changing
>>> SocketStream in that way seems highly questionable.
>>>
>>> The specific use case of chunked encoding is interesting too, since the
>>> motivation of adding the next:into: family of methods came from reading
>>> chunked encoding :-) As a consequence, the fastest way to read chunked
>>> encoding in Squeak today is the following:
>>>
>>> buffer := ByteArray new. "or: ByteString new"
>>> [firstLine := socketStream nextLine.
>>> chunkSize := ('16r', firstLine asUppercase) asNumber. "icky but works"
>>> chunkSize = 0] whileFalse: [
>>>  buffer size < chunkSize
>>>    ifTrue: [buffer := buffer class new: chunkSize]. "grow the buffer when it is too small"
>>>  buffer := socketStream next: chunkSize into: buffer startingAt: 1.
>>>  outStream next: chunkSize putAll: buffer.
>>>  socketStream skip: 2. "CRLF after chunk body"
>>> ].
>>> socketStream skip: 2. "CRLF after terminating 0-chunk"
>>>
>>> There is no conversion needed between ascii/binary since the next:into: code
>>> accepts both strings and byte arrays. At the end of the day, switching
>>> between ascii and binary is a bit of a convenience function, which means that
>>> you probably shouldn't be writing high-performance code that depends on
>>> constantly switching between the two (I think that's a fair tradeoff). The
>>> next:into: family was specifically provided for high-performance situations
>>> by providing a pre-allocated buffer and avoiding the allocation overhead.
>>>
>>
>> Yes, #next:into: is convenient if you know the content size from the start,
>> or if you want to read everything into memory at once.
>> But the strategy you show above doesn't work well in all cases.
>> For persistent streams, used for exchanging data between peers,
>> there is no notion of 'read everything up to the end',
>> but rather 'read what is currently available', because the peers
>> exchange data in real time and you can't predict what will follow
>> the last input.
>>
>> My current intent is to make a fast reader which uses a socket as a
>> backend, and which:
>>
>> - reads/parses http headers
>> - handles chunked content transfer encoding
>> - handles utf8 content encoding
>> - and only then hands off to a consumer - a JSON parser parsing the
>> input character by character - which, like many other parsers,
>> obviously has no use for #next:into:, but
>> uses #peek and #next all the way.
>>
>> The idea is to parse the data once it becomes available, instead of
>> reading everything up to the end and only then starting to parse.
>> You could ask why this is more effective - because of network latency.
>> A client, instead of simply waiting for the next data packet to arrive,
>> can spend this time more productively by parsing the available input
>> (besides, it will spend this time anyway, so why waste it?).
>> This means that the results of parsing will be available earlier,
>> compared to a scheme where you start parsing only when all the data
>> has arrived.
>> Also, the bigger the content, the more efficient this is, not only
>> in speed but also in memory consumption.
>>
>> So, I tend to look for ways where the socket stream design is focused on
>> streaming the data and doesn't assume that its consumer prefers the
>> buffered approach (#next:into:) over the non-buffered
>> one (#next).
>>
>
> Then you should really consider looking at VW-XTream transforming: stuff.
> The idea is to have parallel processing (pipelines).

Err. Pipes are not parallel processing. They are sequential - the output of
one pipe is the input of another.
And sure, this is how I think good streams should work.
Too bad I have to use what we have in Squeak/Pharo.. or do everything
from scratch.
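To make that concrete, here is a minimal sketch of the pipeline idea - a decorator stream that decodes chunked transfer encoding lazily on top of a SocketStream, so a character-level consumer can drive it with plain #next. The class ChunkedReadStream and its protocol are hypothetical (not an existing Squeak class); error handling and trailer headers are omitted.

```
"Hypothetical sketch - decodes 'transfer-encoding: chunked' lazily,
 exposing #next / #atEnd to a character-by-character consumer."
Object subclass: #ChunkedReadStream
	instanceVariableNames: 'source remaining done'

ChunkedReadStream >> on: aSocketStream
	source := aSocketStream.
	remaining := 0.
	done := false

ChunkedReadStream >> next
	"Answer the next decoded element, reading a new chunk header
	 when the current chunk is exhausted; nil after the 0-chunk."
	| element |
	done ifTrue: [^nil].
	remaining = 0 ifTrue: [
		remaining := ('16r', source nextLine asUppercase) asNumber.
		remaining = 0 ifTrue: [
			source skip: 2. "trailing CRLF"
			done := true.
			^nil]].
	element := source next.
	remaining := remaining - 1.
	remaining = 0 ifTrue: [source skip: 2]. "CRLF after chunk body"
	^element

ChunkedReadStream >> atEnd
	^done
```

A consumer then composes streams the way pipes compose, e.g. `parser on: (ChunkedReadStream on: socketStream)` - the output of one stream is the input of the next, and parsing proceeds as packets arrive instead of waiting for the whole body.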

> Of course, we cannot have true parallelism yet in Smalltalk, but at
> least the first level can work with a non-blocking Squeak socket.
>
> Nicolas
>
>>> Cheers,
>>>  - Andreas
>>
>>
>> --
>> Best regards,
>> Igor Stasenko AKA sig.
>>
>>
>
>



-- 
Best regards,
Igor Stasenko AKA sig.


