[squeak-dev] Re: SocketSteam: Switching ascii/binary modes

Tue Mar 16 04:36:46 UTC 2010

On 16 March 2010 05:51, Andreas Raab <andreas.raab at gmx.de> wrote:
> On 3/15/2010 8:14 PM, Igor Stasenko wrote:
>>
>> There could be an alternative approach:
>> - keep buffers in a single (binary) format and convert the output
>> depending on mode.
>>
>> The choice is when you pay the conversion price:
>> - each time you read something
>> - each time you switch the mode
>>
>> If the input is a mix of ascii/binary content, converting the cache on
>> each mode switch will be very inefficient.
>> For example - HTTP 'transfer-encoding: chunked'.
>> Content may be binary data, but it could be chunked; then the input
>> becomes a mix of
>> binary data, hexadecimal ascii values, and crlf's.
>>
>> So, it requires deeper analysis than just saying 'convert it' :)
>
> I don't think it's all that complicated :-)
>
> First, you'd slow down all current use cases and introduce a lot of
> potential bugs if you added conversion upon access. You would also break any
> extension methods (the next:into: methods were originally extensions on
> SocketStream before I added them to trunk). Given all of that changing
> SocketStream in that way seems highly questionable.
>
> The specific use case of chunked encoding is interesting too, since the
> motivation of adding the next:into: family of methods came from reading
> chunked encoding :-) As a consequence, the fastest way to read chunked
> encoding in Squeak today is the following:
>
> buffer := ByteArray new. "or: ByteString new"
> [firstLine := socketStream nextLine.
> chunkSize := ('16r',firstLine asUppercase) asNumber. "icky but works"
> chunkSize = 0] whileFalse:[
>  buffer size < chunkSize
>    ifTrue:[buffer := buffer class new: chunkSize]. "grow only when too small"
>  buffer := socketStream next: chunkSize into: buffer startingAt: 1.
>  outStream next: chunkSize putAll: buffer.
>  socketStream skip: 2. "CRLF"
> ].
> socketStream skip: 2. "CRLF"
>
> There is no conversion needed between ascii/binary since the next:into: code
> accepts both strings and byte arrays. At the end of the day, switching
> between ascii and binary is a bit of a convenience function which means that
> you probably shouldn't be writing high-performance code that depends on
> constantly switching between the two (I think that's a fair tradeoff). The
> next:into: family was specifically provided for high-performance situations
> by providing a pre-allocated buffer and avoiding the allocation overhead.
>

Yes, #next:into: is convenient if you know the content size from the start
or if you want to read everything into memory at once.
But the approach you show above doesn't work well for
all cases.
For persistent streams, used for exchanging data between peers,
there is no notion of 'read everything up to the end',
only 'read what is currently available', because peers
exchange data in real time and you can't predict what will follow
the last input.

My current intent is to make a fast reader which uses a socket as its
backend, and which:

- reads and parses http headers
- handles chunked transfer encoding
- handles utf8 content encoding
- and only then feeds a consumer: a JSON parser which parses the
input character by character
and, like many other parsers, obviously has no use for #next:into: but
uses #peek and #next all the way.
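
A minimal workspace sketch of what the chunked-decoding layer of such a
reader could look like when the consumer pulls one character at a time.
It is only an illustration under stated assumptions: a ReadStream on a
literal string stands in for the socket stream, the names in, remaining
and nextChar are hypothetical, and chunk extensions, trailers and
blocking on missing data are ignored. The chunk size is parsed with the
same '16r' trick as in the quoted code above.

```smalltalk
| crlf in remaining nextChar out char |
crlf := String with: Character cr with: Character lf.
"Assumption: a ReadStream stands in for the socket stream.
 The input holds two chunks ('Wiki', 'pedia') and the final 0 chunk."
in := ReadStream on: '4', crlf, 'Wiki', crlf, '5', crlf, 'pedia', crlf, '0', crlf, crlf.
remaining := 0. "characters left in the current chunk"
nextChar := [
	remaining = 0 ifTrue: [
		"At a chunk boundary: read the hexadecimal size line."
		remaining := ('16r', (in upTo: Character cr) asUppercase) asNumber.
		in skip: 1]. "LF"
	remaining = 0
		ifTrue: [nil] "zero-sized chunk: end of body"
		ifFalse: [
			char := in next.
			remaining := remaining - 1.
			remaining = 0 ifTrue: [in skip: 2]. "CRLF after the chunk data"
			char]].
"The consumer pulls characters one by one, like a parser calling #next."
out := WriteStream on: String new.
[(char := nextChar value) notNil] whileTrue: [out nextPut: char].
out contents
```

Evaluating this with "print it" should answer 'Wikipedia'. The consumer
never sees chunk sizes or CRLFs and never needs the whole body in memory,
which is the property a character-by-character parser relies on.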

The idea is to parse data once it becomes available, instead of
reading everything up to the end and only then starting to parse.
You could ask why this is more efficient. Because of network latency.
A client, instead of simply waiting for the next data packet to arrive,
can spend this time more productively by parsing the available input
(besides, it will spend this time on parsing anyway, so why waste it?).
This means that the results of parsing will be available earlier
compared to the scheme where you start parsing only when all the data
has arrived.
Also, the bigger the content size, the more efficient it will be, not only
in speed but also in memory consumption.

So I tend to look for ways in which the socket stream design is focused on
streaming the data and doesn't assume that its consumer has any
preference for the buffered approach (#next:into:) over the non-buffered
one (#next).

> Cheers,
>  - Andreas

-- 
Best regards,
Igor Stasenko AKA sig.